For many years, traditional Extract, Transform and Load (ETL) was the only way to extract and transform data from operational systems and load it into a data warehouse for analysis by a business intelligence tool like Business Objects, MicroStrategy or OBIEE. Traditional ETL works by extracting data from one or more transactional systems, merging and reshaping it using staging tables, and populating dimension and fact tables in a data warehouse. These data warehouses were usually built on Oracle, SQL Server or Teradata databases. To load the data warehouse using the traditional ETL method, the following steps need to be accomplished:
- Understand the business requirements – you must know all the ways the users will slice and dice the data and the grain of the data (is it at the customer level or the order line level, etc.)
- Data model the star and snowflake schemas, including all facts and dimensions, their keys and how they join
- Figure out where the data to populate the facts and dimensions comes from in the source systems, and then write the SQL that joins the various source tables to populate the individual facts and dimensions with the appropriate keys
- Figure out the load order and write a workflow to populate the dimensions and facts in the appropriate order. This matters because the dimensions need to be loaded before the fact tables so that the dimensional keys can be looked up correctly when the facts are loaded.
- Figure out which tables will be populated with incremental loads and which with full loads
Let’s go over each of these steps and see what is involved:
Business Requirements
This is probably one of the most difficult parts of any traditional BI project. Net new BI and data warehousing projects are usually started when new operational systems are put in place; for example, a company might implement a new ERP, CRM or cloud HR system. To give accurate business requirements for a traditional ETL/BI project, the business users need to come up to speed on the new source system and its data structures, then try to anticipate all the potential reports and ways to slice the data in the future, and define a set of requirements that IT developers can understand.
The probability of getting the requirements right is very low. But the expectation of IT and business leadership is that proper requirements, written in stone, be provided up front before implementation of the project begins. This is because the ETL developers need to know what data to pull from the source to populate the data warehouse, whose data model has in turn been designed from the business requirements.
Data Modeling
Data modeling entails working backwards from the questions the business users will ask and designing the schema to answer those questions. The metrics or measures go into fact tables, and the attributes by which those metrics can be sliced and diced are modeled as dimensions around the fact tables. Each fact table then carries keys to each of its dimensions.
With this method of building a warehouse, each new set of questions needs a new fact table. For example, if the business users want to analyze opportunities, revenue, orders, COGS or GL data, a separate fact table and associated dimensions must be designed for each of them. Detailed analysis is needed up front to understand the grain (aggregation level) of each fact table, and the dimensions are then designed to match that grain.
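As a concrete sketch, a simple order-analysis star schema might look like the DDL below. The table and column names are illustrative only and are not taken from any particular source system.

```sql
-- Minimal star schema sketch for order analysis (illustrative names).
CREATE TABLE dim_customer (
    customer_key     INTEGER PRIMARY KEY,  -- surrogate key generated by the warehouse
    customer_id      VARCHAR(30),          -- natural key from the source system
    customer_name    VARCHAR(100),
    customer_segment VARCHAR(50)
);

CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY,    -- e.g. 20240131
    calendar_date  DATE,
    fiscal_quarter VARCHAR(10)
);

-- The grain of this fact table is one row per order line.
CREATE TABLE fact_order_line (
    order_line_id  VARCHAR(30),
    customer_key   INTEGER REFERENCES dim_customer (customer_key),
    order_date_key INTEGER REFERENCES dim_date (date_key),
    quantity       INTEGER,
    revenue_amount DECIMAL(18,2)
);
```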
Build source SQL or API data pull
Once the data model of the data warehouse is prepared, it is handed over to the ETL developers. The ETL developers need to understand where the data resides in the source system and write the SQL queries, API calls (for cloud systems) or flat-file extracts that pull the data into the dimensions and facts of the data warehouse. Staging tables are usually used to hold the data in an intermediate form before it is loaded into the target tables. Incremental and full loads also need to be designed.
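For illustration, a hypothetical extract from a transactional order system into a staging table might look like this; the source table and column names are assumptions, not references to any real ERP schema.

```sql
-- Hypothetical extract from a transactional order system into a staging table.
INSERT INTO stg_order_line (order_line_id, customer_id, order_date, quantity, revenue_amount)
SELECT ol.order_line_id,
       o.customer_id,
       o.order_date,
       ol.quantity,
       ol.quantity * ol.unit_price
FROM   src_order_lines ol
JOIN   src_orders      o ON o.order_id = ol.order_id
WHERE  o.order_status = 'BOOKED';
```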
Here are the challenges with writing SQL, flat-file or API extracts from the source system:
- ETL developers rarely understand the source system well. For example, ETL developers usually have very little knowledge of SAP or Oracle ERP tables. They must rely on source system experts, who normally don’t have much time for them.
- Once they write the SQL and populate the staging and target tables, they must wait for the BI team to complete the semantic data model and build out the reports before any business user can test the results. By the time all this happens the project is almost at its end, and the data usually turns out to be very different from what was expected. The whole BI/ETL project team must go back to the drawing board several times to fix the issues, which causes a lot of stress and delays in the project.
- The SQL that is written is often very expensive in terms of source database resources and must be rewritten for performance reasons. This sometimes requires significant logic changes, which in turn lead to data errors.
Load Order
The ETL developers also need to figure out the correct order in which to load the dimensions and facts. This can get very tricky when there are many interdependent tables being pulled, especially if the data is coming from multiple source systems. In addition, the right set of dimensional keys needs to be populated in all the fact tables with the right lookup logic.
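A typical fact load, run only after the dimensions have been populated, looks roughly like the sketch below. It reuses the illustrative staging and dimension tables from the earlier sketches; in practice, rows that fail the dimension lookups are usually mapped to an 'Unknown' dimension row rather than left with NULL keys.

```sql
-- Load the fact table only after the dimensions are loaded,
-- so the surrogate keys can be looked up from the staging rows.
INSERT INTO fact_order_line (order_line_id, customer_key, order_date_key, quantity, revenue_amount)
SELECT s.order_line_id,
       c.customer_key,
       d.date_key,
       s.quantity,
       s.revenue_amount
FROM   stg_order_line s
LEFT JOIN dim_customer c ON c.customer_id   = s.customer_id
LEFT JOIN dim_date     d ON d.calendar_date = s.order_date;  -- unmatched rows end up with NULL keys
```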
Incremental and Full Loads
Incremental loads are needed in many cases where a large number of rows are updated on the source system. But getting the logic right as to which rows have changed can get very complicated when the source data does not have proper last-update-date columns. Sometimes complicated OR conditions need to be written when a dimension or fact is populated from multiple source tables.
In addition, when a new column is added to a target dimension or fact, a full load needs to be triggered, or an update SQL needs to be written to populate the column for older rows. Deleted rows in the source system also need to be taken care of, as there might not be a clear flag or trigger in the source to indicate which rows were deleted.
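As a rough illustration, an Oracle-style incremental load for a dimension might use a MERGE keyed on a last-update timestamp. The table names, the bind variable and the surrogate-key sequence are assumptions for the sketch, and the approach only works when the source really does maintain a reliable last_update_date.

```sql
-- Incremental dimension load sketch: pick up only rows changed since the last run.
MERGE INTO dim_customer tgt
USING (
    SELECT customer_id, customer_name, customer_segment
    FROM   src_customers
    WHERE  last_update_date > :last_etl_run_timestamp  -- bind variable supplied by the ETL workflow
) src
ON (tgt.customer_id = src.customer_id)
WHEN MATCHED THEN UPDATE SET
    tgt.customer_name    = src.customer_name,
    tgt.customer_segment = src.customer_segment
WHEN NOT MATCHED THEN INSERT (customer_key, customer_id, customer_name, customer_segment)
    VALUES (customer_key_seq.NEXTVAL, src.customer_id, src.customer_name, src.customer_segment);
```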
Speed of Change
Given all the above complexities, even a change such as adding a new dimension, a new fact or a new column becomes very difficult to make when traditional ETL is used. The skill level required of the ETL developers to perform all the above steps is very high, and such developers are very difficult to find. In addition, testing a change using the traditional ETL approach gets quite complicated because the data is reshaped during the load from an OLTP data model to an OLAP data model. It can take weeks to add a new column, or months to add a new dimension or fact table, test it and move the change to production.
Can the business really wait that long for a simple change? The answer is clear: most business users expect changes to their BI systems to be made in a matter of days, if not less.
So, is traditional ETL the best way moving forward for new BI projects?
The answer is no. With today’s technology, including cloud data warehouses, replication methods and powerful self-service visualization tools, there is very little reason to go with the traditional ETL and legacy BI tools approach for any new BI project.
There are several newer techniques for solving the problem of getting the data out of the source system and presenting it in a meaningful manner to the end user:
- Replication – replicate the source data to a target database (increasingly, cloud databases like Snowflake or Amazon Redshift are becoming attractive targets) and build the business logic post replication (see the sketch after this list)
- Data Virtualization with a semantic layer and caching in a fast database
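To give a rough idea of the replication approach, the business logic can be layered on top of the replicated tables as views inside the cloud warehouse instead of ETL mappings. The schema, table and column names below are purely illustrative.

```sql
-- Business logic built post replication as a view in the cloud warehouse (illustrative names).
CREATE OR REPLACE VIEW rpt_revenue_by_segment AS
SELECT c.customer_segment,
       DATE_TRUNC('month', o.order_date) AS order_month,
       SUM(ol.quantity * ol.unit_price)  AS revenue
FROM   replicated.orders      o
JOIN   replicated.order_lines ol ON ol.order_id   = o.order_id
JOIN   replicated.customers   c  ON c.customer_id = o.customer_id
GROUP BY c.customer_segment, DATE_TRUNC('month', o.order_date);
```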
I will cover more about these two approaches in future blogs.
What about existing ETL driven data warehouses and BI solutions?
If the existing ETL+BI solution does not require frequent changes and the cost of maintenance is low, then my suggestion is to let the users keep using the system until it loses its business value. If there is a significant change in the source system, like a move to the cloud or an upgrade that requires a lot of ETL rework, that might be the time to replace the legacy solution with the new architecture.
You don’t have to replace the whole solution with the new architecture at once. You can start with the business-critical reports and then consider stopping the software maintenance support for the legacy BI and ETL tools. Then you can take your time migrating the other valuable content over to the new platforms and shut down the old ETL/BI solution in a phased manner.