How to work on a BI Modernization strategy for your company — Part 2

In the Part 1 of this article I had written about what should be the goals of a BI Modernization strategy and how grouping your existing BI use cases by data latency requirements can help clarify your BI Strategy. The three groupings by data latency are Real-Time, Operational and Analytical Reporting. In this article I will cover possible architectures for each of these use models.

Design Pattern

The key design assumption for these architectures are the following –

1. Companies will have multiple visualization tools

2. A single architecture and set of tools might not be able to satisfy all the three data latency use cases.

Real-Time Reporting

The characteristics of real time reporting are that users need the data “as it is any point” in the source system when they run the report. The users need data in real-time from one data source like Oracle EBS, SAP, or Salesforce, and the volume of data required for each report is relative small — few tens of thousands at most and the UI needs of the reports are simple, which means mostly they need a table like report. Given the fact that there are multiple source systems that need real-time reporting each might have a different way of accessing the data. Some might have JDBC/ODBC interfaces where you can run SQL queries while others might have an API interface. The other consideration is that there might be multiple visualization tools which can be used for real-time reporting. So, given the fact that there are multiple back end systems and multiple visualization platforms and there is no need to join one system to another (no need for query federation) a central data modeling tool would be ideal. This will avoid creating data models in different data visualization tools and give the capability to restrict the data volume and query performance on the source systems, so the reporting platforms do not impact the source system resources. A leading data virtualization tool like Denodo will be a good solution for this use case and I have written about it here in detail. The data virtualization tool will provide the semantic layer and it will enable the various BI tools to connect to it so if a new BI tool is introduced there is no need to redo the data model.

Operational Reporting

The data needed for operational reporting are usually in multiple systems and the reporting solution should be able to join these data together. Each of these systems can have data in a different shape and format and hence it is important to bring the required data into a central database. If you have a lot of data sources and millions of rows that need to be brought in, then query federation is usually not an option as there will be significant performance differences between the different sources and hence user experience might not be optimal. So, you will need to decide on an ELT/ETL tool to move the data and a database to host the data. Once the data is brought into the central database then the data modeling exercise begins where the data is reshaped and joined appropriately based on the reporting requirements. After the data modeling is done then business friendly labeling can be applied and calculated columns like currency translations, gross margin can be added. Once this is complete, then the reporting and data visualization activity can begin.

There are two approaches companies can make –

  1. Choose one tool which does all the tasks — ETL/ELT, Database, Data Modeling, and Data Visualization. There is a new industry term for these kinds of tools called Unified Data and Analytics Platforms (UDAP). These UDAP tools don’t have best of breed features in any of these four domains but the advantage is that they have a single stack which leads to well-integrated set of capabilities which should lead to significant time to value. Examples of these tools include Incorta, Birst, Domo, Pentaho and others. I specifically have good experience with Incorta which has been very good for operational reporting from large complicated on-premise and cloud data sources and joining them at scale. I have written about the Incorta architecture and capabilities here.

2. Use best of breed software to create operational reporting stack

A best of breed approach can also be taken to create the operational reporting stack. In this case, you will have to choose an ETL/ELT including CDC mechanism which can get near real-time data into the target database quickly and without much transformation in the process. I have talked about how to enable real-time analytics to Snowflake here in detail. The transformation will happen after it is loaded into the target database. In this case, an ETL/ELT tool like Matillion, Stitch, FiveTran and others can be used to move the data into the target database. In today’s world the target database totally makes sense to be one of the major cloud platforms like Snowflake, Redshift, or Google Big Query. If you are using a tool like Stitch or FiveTran which are primarily data ingestion tools then using dbt to transform the data will be very useful. Matillion does both ingestion and transformation as well as orchestration.

This stack works best if all the major data sources are on the cloud. If there are large volumes of on-premise data and data is needed to be loaded several times a day, I don’t think this will be a very robust data pipeline. The biggest advantage is that you will have your BI tool of choice for operational, analytical and real-time reporting. You will have to evaluate your data sources, volume and frequency of loads to make sure this is the best architecture for your operational reporting.

Analytical Reporting

Analytical reporting has the following characteristics –

1. Analyze data trends over months, quarters or years.

2. Frequency of data loads can be daily, weekly or monthly from various sources. No need for loading several times a day.

3. The data is usually stored for the long term in the form of snapshots.

4. Data is summarized to a more aggregate form.

5. Data can come from multiple sources.

6. Analysts and users prefer to use their favorite BI tools. The end users skew more towards senior management.

The proposed architecture for analytical reporting is not too different from the best of breed tools for operational reporting. This architecture should address all the six major characteristics of analytical reporting.

The important distinction between operational reporting and analytical reporting are the reduced frequency of data loads, use of snapshots and aggregations to store data for long term, and the use the best self-service data visualization tool that your organization has access to. The modern cloud data warehouses like Snowflake, Redshift and Google Big Query are well suited for this purpose as they perform fast, provide cheap storage for snapshots and data retention and can be accessed by a wide variety of BI tools for visualization.


When you look to modernizing the data architecture for your company, it might be helpful to classify your current BI landscape into real-time, operational and analytical reporting buckets. This will give clarity on how significant each of these use models are for your company. Once that is done you can design an architecture for each of these and then build your existing reports and analysis into one of these three architectures and then continue using these platforms for your future requirements.

I would love feedback on whether thinking about your BI landscape in this way using data latency requirements is useful to you and your organization. I look forward to your comments.

How to enable real-time analytics in Snowflake with data from On-Premise databases

Snowflake is rapidly becoming the leading cloud data warehouse platform for all kinds of different analytics applications. I have written about why snowflake is a great option for moving on-premise data warehouses to the cloud. While moving to Snowflake has big advantages like no need for a server footprint, DBA, tuning, backup and recovery, it still needs a well thought out data pipeline from your source systems.

A chance to simplify and reduce latency of your data pipelines

If your company has been using a central data warehouse for several years, chances are your BI team uses a traditional tool like Informatica to extract, transform and load (ETL) data from one or several on-premise databases into the data warehouse built on Oracle, SQL Server or Teradata. These ETL processes would have been built and enhanced over many years and are probably very complex to maintain or change now. They were optimized for performance on traditional databases, the core of whose technology were written decades ago. The on-premise data warehouse will get loaded once or twice a day which run for several hours each and usually lead to downtime during loading for the business users. This method of ETL is completely outdated for today’s global 24×7 operations and real time data requirements. Now that you have decided to move to Snowflake as your data warehouse you should rethink your data pipeline completely.

CDC based real-time data integration into Snowflake

Change Data Capture + Streaming as the data pipeline strategy

In the last two years, there has been tremendous development of the partner and technology ecosystem around Snowflake. Now there are more than a dozen tools which enable real time or batch replication to Snowflake from on-premise databases. The tools have become sophisticated enough to be able to replicate only some of the tables and even in these tables they can replicate only some of the columns. This way you can replicate only the data needed for reporting and not the entire on-premise database to Snowflake.

The replication tools look at the logs generated by the on-premise databases which contain an encoded entry for each insert, update or delete that occurs in any table in the source database. For example, if your source database is Oracle then these tools can continuously replicate one or many Oracle tables to Snowflake. As a first step the tables to be replicated needs to be created Snowflake. Then the existing data needs to be loaded into the corresponding tables in Snowflake. Then the tool can read new data using LogMiner and continuously replicate it to Snowflake.

Advantages of using the CDC approach of loading data into Snowflake are:

1. Real time continuous data load into Snowflake without the need for specific time based data loads which used to run for hours

2. What you see in the source database is what you get in Snowflake. It is easy to verify the data quality and accuracy compared to traditional ETL

3. Uses the power of Snowflake performance to do the complex transformations needed and not in the ETL tool. Data can be processed many times faster than before.

4. Can capture deletes in the source and not require a full load to clean out the deletes.

What are the tools available in the market?

There are plenty of tools now available to replicate on-premise data to Snowflake and more are getting added as Snowflake is getting very popular in enterprises large and small. It is important for companies to evaluate these tools on various vectors including capability, performance, pricing, support, data transformations, roadmap, long term viability of the company among other factors. You need to look at all the capabilities of the tool holistically so that your company can standardize on one tool instead of supporting multiple tools for different needs.

Here are some leading tools that are available in the market which can replicate on-premise databases to Snowflake using CDC –

1. Striim

2. Attunity (bought by Qlik) — now called Qlik Replicate

3. Stitch (bought by Talend)

4. FiveTran

5. Matillion

6. Streamsets

The Matillion approach seems a lot more complicated than the others but the product has a lot of traction and completely cloud based with attractive usage basedpricing. So, I think it is worth a look.

There are other tools which are coming up based on a simple Google search — CdataBryteflowHVR SoftwareETLWorksDiyotta. I’m sure more will be there with a deeper search.

In summary, the full power of Snowflake can be unlocked by enabling a real-time data replication from on-premise databases to Snowflake. There are a lot of highly capable tools available in the market. With careful evaluation and a solid POC, you can enable your business users to gain real time insights into their operations with the power of Snowflake and in the process become the data hero of your company!

How Incorta works and how it can fit into your enterprise BI strategy

Incorta is the new kid on the block in a very competitive enterprise business intelligence tools space. The company started around 6 years back and we have been using it for the last 2.5 years successfully for operational reporting.

This blog was written in October 2019 and represents Incorta features as of their 4.3.x version. Please keep in mind Incorta builds significant new capabilities in each of their quarterly releases and the product changes over time. I will try to add new articles to go over significant new capabilities in the platform.

Incorta has taken a very different approach to solve BI and analytics problems than any other vendor in the market. They are essentially four different integrated BI tools in one.

These are the –

1. Data Extraction Layer

2. Database layer

3. Data Modeling Layer

4. Data Visualization Layer

In addition to the above Incorta has a very close integration with Spark and comes with the same installation package. Spark integration provides 2 functions for Incorta –

1. Advanced calculations and joins

2. SQL Interface (SQLi) or Datahub for external BI tools to connect with Incorta

How Incorta Works?

In a traditional BI environment, you have at least three tools (if not four if you have a separate semantic layer tool) to do the same set of tasks. You will have a ETL tool like Informatica or SSIS to load into a database like Oracle or SQL Server and then data model and visualize the data in a BI tool like MicroStrategy, OBIEE or Tableau. You must keep in mind that these are best of breed tools which have been around for a long time with tons of features.

Incorta does a lot of these tasks very well but it cannot match the same set of features as these best of breed solutions in the various categories. As a combined package it delivers a lot of capabilities for BI developers and users. But first let’s take a look at the various components of the Incorta platform.

Components of Incorta platform

Here is a very brief and simplified description of the various components of the Incorta platform and what is special or unique about them –

1. Data Extraction Layer

The first layer of Incorta platform is the Data Extraction layer. Here they have built a technology called Direct Data Mapping where you can extract data table by table from the source system both as a full load or as an incremental load. The data then goes into a parquet file in the Incorta platform and then gets loaded into memory. Each table can have a have a full load and incremental load SQL

Full Load SQL can look as simple as the following –

Select COL_A, COL_B, COL_C from TABLE_1 where CREATED_DATE>=’01–01–2016’

Incremental Load SQL can look like the following –

Select COL_A, COL_B, COL_C from TABLE_1 where LAST_UPDATE_DATE>?

When the load is kicked off the table gets populated for a full load and then in each incremental run it will get the incremental data from the source. Incorta can pull data from almost all traditional databases like Oracle, SQL Server, Cloud applications like Salesforce or from files stored on on-premise or cloud storage. The number of source applications supported increases in each release of the Incorta platform.

After loading data from source systems into parquet files, materialized views (MV) can be created using the Incorta UI and can be written in SQL or PySpark. During run time Incorta sends the request to Spark to read the data from Parquet files and compute the data and reload into a separate parquet file which then gets loaded into memory during the loading stage. The Spark component comes with the Incorta package but does involve some set up to be done for it to work properly.

The advantage of pulling table by table is that the performance of the loads is very fast as there is no need to perform the joins in the source system and the second advantage is that it becomes very easy to add new columns into Incorta as all you need to do is add a new column in the extract SQL and run a full load. Hence an addition of a new column in Incorta can be done in a few minutes compared to days or months using traditional ETL methods.

2. Database Layer

Once the data is extracted into parquet files it is then loaded into the in-memory database engine and then the joins defined in the schema are precomputed at the end of the load. This precomputation of the joins leads to the revolutionary performance of the reports in Incorta. Even if the report built in Incorta has 30 or 40 joins across large tables containing millions of rows the data still comes back in a few seconds. This kind of performance is unheard off in a traditional database.

3. Data Modeling Layer

Incorta data modeling layer comes in two forms — physical schemas and business schemas.

In the physical schema you can create aliases to base tables and materialized views and create joins between various tables and MVs. Business schemas can be created to present a flat, user friendly representation of the physical schema to the user building the reports. Columns can be brought in from one or more tables in different physical schemas and renamed. Formula columns can be added either in physical or business schemas.

4. Visualization Layer

Incorta comes with its own visualization layer. Insights (Reports) and Dashboards can be created on top of physical or business schemas. Analyzer users can create insights or reports with in-built filters, prompts and bookmarks that then end users can consume. There are multiple types of visualizations like charts, pivot tables as well as drill downs are available.

Incorta visualization is good enough for most purposes but when compared to a more best of breed product like Tableau, Power BI or MicroStrategy it does fall short a bit. The Incorta development are rapidly adding features in various releases and they should be able catch up to the other tools in terms of UI capabilities sooner than later.

In addition, Incorta provides a SQL Interface using which other BI tools like Tableau or Power BI can connect to the business schema, physical schema or the Parquet layer.

Where does Incorta fit in your BI architecture

Incorta comes with several distinct advantages compared to traditional BI tools –

– revolutionary performance of reports

– no tuning or DB maintenance needed

– can hold massive amounts of data (billions of rows of data) in memory

– complex data modeling is possible

– simple to develop and make changes as reshaping of data is not needed

– easy to learn and get started for any developer

– a platform where all the four layers of BI are available

– cross data source joins is very easy to achieve

There are three kinds of reporting in any company — Operational Reporting, Real-Time reporting and Analytical Reporting (snapshots, period over period analysis). Incorta’s sweet spot is operational reporting requirements where data can be pulled from various transaction systems and joined together to provide lighting fast reports. You can load data into Incorta several times a day from your source systems and provide operational insights that the users need to make daily decisions to run their business.

In summary, if you are looking for a solution to get lighting quick operational reporting with data joined from multiple complex source systems then Incorta is the one of the best options that is out there today.

What is Data Virtualization and how it can unlock real-time insights directly from source systems

In today’s fast-moving business environment there are a lot of reporting requirements which cannot wait for a batch process to load data from operational systems. The business users need the data as it is currently in their on-premise ERP, CRM or any of the cloud systems. For example, towards the end of a fiscal quarter, sales and finance leaders want to know the order bookings in real-time so that they have a good idea if they are going to “make” the quarter or not in terms of revenue. They cannot wait 4–8 hours of delay for a ETL job to finish.

Data virtualization (DV) is a method of building a “logical” warehouse by connecting “virtually” to various source systems and combining or joining them “virtually” to provide a unified data layer for BI systems or enterprise applications to query from.

Fig 1: High Level Architecture of Data Virtualization

The alternate to the data virtualization approach of providing a unified layer is the traditional ETL approach of moving data from the various source systems at periodic intervals and loading them into a physical data warehouse. The data in the warehouse is then combined by loading them into a target star schema or snowflake data model. ETL loads are usually a batch process and are loaded once a day or a few times a day depending on the time it takes to load and the data latency requirements.

How DV Works?

At the core of it DV is a SQL generation tool with an easy to use UI to build a semantic layer across various data sources. Instead of loading data into a physical warehouse by extracting, transforming and loading (ETL) from various source systems, with DV the data will primarily remain in the source systems and the DV tool will load the metadata of the tables (or other objects like views) from source systems into the DV tool. We can call these virtual tables as “Physical Views”. Once the metadata of the tables are loaded, the DV tool will allow the developer to perform various manipulations to these physical views including joins, and formula columns to enrich them with required business logic. We can name these as “Logical Views”. Once the joins and formula columns are added the developer can then build out the “BI Datasets” using these various logical views to create datasets which are flattened views of the data for a specific business purpose. These datasets can include fixed filters if they are always needed.

These flattened datasets will be then be imported in a BI tool like MicroStrategy, Power BI or Tableau. In these BI tools, users can add further business formula columns, filters or even join two “BI Datasets” from the DV tool. The DV tool will take the input call from the BI tools and then translate the SQL which is appropriate to the specific backend data source, be it a database like Oracle, SQL Server, Teradata or a cloud system using their API like Salesforce.

Fig 2: How Data Virtualization Works

The best DV tools perform SQL tuning before sending queries to the backend data sources. Instead of just sending the base SQL to the backend the DV tool will introspect the incoming request, optimize the query for the specific backend data source and then send the SQL or API call to that data source.

In addition, the leading DV tools can cache any of physical views, logical views or BI Datasets for better performance if the source is slow, either in a native database or a commercial database like Oracle or Snowflake.

Advantages of DV

Data virtualization comes with a lot of benefits –

1. Simplified view of complex source system schemas

If the semantic layer is built in the DV tool and the “BI Datasets” are exposed as a flat table/view to be imported in a BI tool like Tableau or Power BI, the complexity of the source schema like Oracle EBS or SAP will be hidden to the BI developers. BI developers are usually not very familiar with the source system tables and struggle to build a semantic layer and instead depend on a database view in the source system. Instead, building the semantic layer in the DV tool not only simplifies a complex schema, it also avoids maintaining a complex and hard to change physical views in source databases.

2. Speed of development

Since DV just imports the metadata of tables and creates virtual tables which mirrors the source objects, complex ETL to achieve real time reporting can be avoided. Adding new fields or tables in ETL could take several weeks to months as star schemas have to be designed and built, but in DV the shape of the data can be preserved to be the same as the source and hence adding new tables or fields will be much easier and quicker.

3. Real time reporting need

Since DV points to the source system directly, either through ODBC/JDBC to the database or through an API to cloud systems the latency of the data is now zero to a minimum. The data available to the user is real-time and for use cases where users need data immediately as it changes DV really comes through as a great option.

4. Source system changes can be hidden from BI tools

A big benefit of DV is that when a source system is upgraded or even if the company replaces a source system with another or is planning to just move databases from let’s say Oracle to Snowflake, virtualizing the semantic layer will help hiding these complex changes in the environment from the various BI tools. To accommodate changes which occur in source systems due to upgrades or database changes, in the DV tool the data source connection needs to be re-pointed to the new database if the schema is the same or some changes in relevant physical views will be needed if there are schema changes like column additions etc. This way the reports and visualizations built on the various BI tools do not have to change and users are not impacted.

5. Performance optimizations and caching

Most leading DV tools provide the ability to cache certain tables from the source database or cache datasets as well. Caching will significantly improve the performance of the reports in the BI tools. The best DV tools tunes the queries they receive from the BI tools and then pushes them down to the source databases. They also find ways to extract and move minimum amount of data from different data sources before joining them in the DV tool and sending to the BI tools.

Data Virtualization Tools

There is a lot of confusion in the market as to what Data Virtualization is. For Data Virtualization as described here, these are tools that companies can evaluate for their needs –

Commercial Products

1. Denodo

2. Tibco Data Virtualization

3. DataVirtuality

4. AtScale

5. Azure Analysis Service

Open Source Alternatives

1. Dremio

2. Teiid

We personally selected Denodo and we really like it for working with on-premise databases, cloud data sources, merging data from multiple sources as well as presenting a simplified view of the data to multiple BI tools. I will write a more detailed article on Denodo and how it works on a future article.

Why a user-friendly workflow automation and scheduling tool is critical for successful business intelligence and data engineering operations

The nature of today’s enterprise applications environment is that the critical data needed for operational and analytical reporting reside in multiple cloud and on-premise applications. A lot of companies have multiple ERPs, multiple CRMs and multiple other cloud applications that they have implemented over time or inherited them through acquisitions.

Figure 1: Typical ETL and BI Architecture

Similar, to their transaction systems, companies will also have a mixture of modern and legacy BI and ETL technologies. The need to bring the data from all these source applications together into one or more data warehouses (ideally one), whether on-premise or on the cloud, is more important than ever. Business performance analysis can be achieved by having all the required data together in a user-friendly manner for the data scientists and analysts to do their job.

Complicated scheduling architecture leads to operational issues

Each of these source systems have a different way of being accessed by the ETL/ELT tools. Some have API, others expose a ODBC/JDBC interface and others output the data in csv, JSON or other formats. As companies build the data pipelines from these sources over the years each of these are run at different schedules and are triggered in multiple different ways. Some get triggered by CRON Jobs, others by Windows Scheduler, others by in-application schedulers. In addition, most BI tools like MicroStrategy, OBIEE, Business Objects, or Tableau come with their own set of scheduling tools to trigger report extracts, loading cubes or for sending dashboards as emails to users.

Figure 2: Multiple dependent event driven schedules in an enterprise BI architecture

The result of running these various ETL/ELT and BI tasks with native schedulers in each of these tools is that there will multiple different kinds of code bases and mechanisms to trigger the tasks and more importantly, there will be not be one place to check the success of failure of these tasks.

In case a business user says that they didn’t get the latest data on their dashboard or email then the operations team needs to check all these different schedules in the BI tools, ETL tools, database triggers, OS jobs or even the source systems to find out why the data was not updated. Most probably there will be different teams supporting each of these tools. Troubleshooting any small issue will turn out to be a nightmare without knowing where the failure is in the data flow. If the BI and data engineering teams are always troubleshooting operational issues new capabilities and solutions will take a backseat.

To avoid this, companies must invest in and centralize all BI and data engineering related schedules and tasks into a central workflow automation and scheduling architecture

How centralized workflow automation will help

The best way to get the BI and data engineering architecture under control is to run all the operations using a central workflow automation and scheduling tool. For this to happen you will need to figure out a way to trigger the various ETL/ELT and BI tool schedules using their respective APIs. It will take a bit of technical investment to figure out how to work with each of these APIs. In general, most modern tools have simple REST APIs but some of the older legacy tools need some research to figure them out.

Once you have the APIs understood, the next step is to write the code to call the APIs for the various tasks that need to be performed. For example, you will need an API to trigger a download of employee data from your HR cloud platform like SuccessFactors or an API to trigger an Informatica workflow. After the APIs are understood you can standardize on a simple language like Python to write the code to call these APIs. There will be a lot of sequencing of API calls that needs to be performed by the Workflow Automation tool — kick off Task 2 after Task 1 is complete etc.

A good workflow automation tool that is needed for this needs to be able to

  1. call the necessary APIs in different languages
  2. to accept triggers in different formats
  3. to have a visual workflow development environment
  4. to send appropriate notifications to users and support team on success of failure of a certain task
  5. it needs to have an advanced scheduling engine and monitoring capabilities.
  6. last but not least is that it needs to be easy to use — easy to build and maintain.

Is there a tool in the market like that?

Figure 3: BI and Data Engineering operations driven by a central Workflow Automation tool

JAMS Scheduler as workflow automation tool for BI/Data Engineering operations

After a detailed search of both commercial and open source products like Airflow, I landed on a product called JAMS Scheduler by MVP Systems Software. With JAMS we were able build out a complete workflow automation solution which was able to call various ETL/ELT jobs using their respective APIs. We were able to chain various processes so dependent tasks kick off when others complete. BI tool tasks like kicking off an .hyper file creation in Tableau or emailing dashboards can be run after the ETL tasks are done.

In addition, JAMS has a great UI for monitoring tasks as well as a detailed notification and alerting mechanism in case of success or failure of tasks. In BI, one of the critical piece is the scheduling and triggering capability. For example, we had requirements to trigger jobs at 5.30 PM Malaysia time on Workday 1, 3 and 5 of every month. With JAMS, these kinds of triggers are very easy to setup without writing any code. JAMS has an intuitive interface to build workflows with variables and parameters. You can call python code, power shell, database procedures, SQL statements all in the same workflow. Once you have setup the job to run it runs very reliably. You have the capability of putting jobs on hold during maintenance windows, set priorities for jobs and re-trigger them on failures.

One of the most important aspects of JAMS is that I can download a free evaluation copy and test out all the features for 30 days. During this time, I got great support from their sales and technical folks, so I can complete the POC and show it to various stakeholders for buy in.

Overall it seems to be a well thought out, stable and easy to use product which we have gotten to rely on for automating a lot of manual tasks in our BI and data engineering operations.

Here are a few resources that might be useful to check on JAMS –

1. Product Videos

2. JAMS features overview

Is Traditional ETL dead?

For many years traditional Extract, Transform and Load (ETL) was the only way to extract and transform data out of operational systems and load into a data warehouse for analysis by a business intelligence tool like Business Objects, MicroStrategy or OBIEE. Traditional ETL works by extracting data out of one or more transactional systems, merging and reshaping the data using staging tables and populating dimensions and fact target tables in a data warehouse. These data warehouses were usually built on Oracle, SQL Server or Teradata databases.  To load the data warehouse using the traditional ETL method the following steps need to be accomplished –

  1. Understand the business requirements – you must know all the ways the users will slice and dice the data, what is the grain of the data (is it at customer level or order line level etc.)
  2. Data model the star and snowflake schemas including all facts and dimensions, their keys and how they all join
  3. Figure out where the data in the source systems are coming from to populate the facts and dimensions and then write the SQLs which join the various tables in the source system to populate these individual facts and dimensions with appropriate keys
  4. Figure out the load order and write a workflow to populate the dimensions and facts in the appropriate order. This is important because the dimensions need to loaded first before the fact tables for the dimensional keys to get correctly looked up when the facts get loaded.
  5. Figure out the tables will be populated with Incremental and full loads

Let’s go over each of these steps and see what is involved –

Business Requirements

This is probably one of the most difficult part of any traditional BI project. Net new BI and data warehousing projects are usually started when new operational systems are put in place. For example, the company can implement a new ERP or CRM or a cloud HR system. To give accurate business requirements for a traditional ETL/BI project the business users need to come up to speed on the new source system and its data structure and then try to guess all potential reports and ways to slice the data in the future and define a set of requirements which IT developers can understand.

The probability of getting the requirements right are very low. But the expectation of the IT and business leadership is that proper requirements written in stone needs to be provided upfront before the implementation of the project begins. This is because the ETL developers need to know what data to pull from the source to populate the data warehouse whose data model in turn has been designed based on the business requirements.

Data Modeling

Data Modeling entails working backwards from what questions the business users will ask and then designed to answer those questions. The metrics or measures goes into the fact tables and the attributes by which these metrics can be sliced and diced are modeled as dimensions around these fact tables. The fact tables then will have keys to each of these dimensions.

With this method of building a warehouse each set of question will need a new fact table. For example, if the business users want to analyze opportunities, revenue, orders, COGS or GL data they will need to design a different fact and associated dimensions for each of them. Detailed analysis needs to be done upfront to understand the grain (aggregation level) of each of the fact tables and the dimensions are then designed to match this grain.

Build source SQL or API data pull

Once the data model of the data warehouse is prepared its handed over to the ETL developers. The ETL developers need to understand where the data resides in the source system and write a SQL or API from cloud systems or from flat files to pull the data into the dimensions and facts contained in the data warehouse. Staging tables are usually there for intermediate hosting of the data which are then populated into the target tables. Incremental and full loads also need to be designed.

Here are the challenges with writing SQL/flat files/API from the source system –

  1. ETL developers rarely understand the source system well. For example, ETL developers usually have very little knowledge of SAP or Oracle ERP tables. They must rely on source system experts who normally don’t have too much time for them.
  2. Once they write the SQL and populate the staging and target tables they must rely on the BI teams to have completed the semantic data model and build out the reports before any business user can test the results. By the time all this happens the project is almost at its end the data is usually very different from what was expected. The whole BI/ETL project team must go back to the drawing board several times to fix the issues. All this causes a lot of stress and delays in the project.
  3. The SQL that is written is usually very expensive in terms of system resources in the source database and must be rewritten due to performance reasons. This also sometimes needs significant logic changes leading to data errors etc.

Load Order

The ETL developers also need to figure out which is the correct order to load the dimension and facts. This can get very tricky sometimes when there are a lot of interdependent tables that are getting pulled and especially if the data is coming from multiple source systems.  Also, the right set of dimensional keys need to be populated in the all fact tables with right look up logic.

Incremental and Full Loads

Incremental loads are needed in a lot of cases when there a lot of rows that get updated on the source system. But getting the logic right as to which rows are getting updated can get very complicated when there are not proper Last Update Date columns in the source data. Sometimes complicated OR statements need to be written if a dimension or fact gets populated from multiple source tables.

In addition, when a new column is added to the target dimension or fact a full load needs to be triggered or an update SQL needs to be written to populate the column for older rows. Also, deleted rows in the source system need to be taken care as there might not be a clear flag or trigger in the source to indicate which rows are deleted.

Speed of Change

Given all the above complexities getting a change done – for example adding new dimension, fact or even new columns starts to get very difficult to do when traditional ETL is used. The skill level of the ETL developers to perform all the above steps need to be very high and they are very difficult to find. In addition, testing a change using the traditional ETL approach gets quite complicated as the data gets reshaped during the load from an OLTP data model to a OLAP data model. It could take weeks to add a new column or months to add a new dimension or fact table, test and move the change to production.

Can the business really wait that long for a simple change? The answer is clear – most business users are expecting changes to their BI systems to be done in a matter of days if not less.

So, is traditional ETL the best way moving forward for new BI projects?

The answer is NO. With today’s technology including cloud data warehouses, replication methods, and powerful self-service visualization tools there is very little reason to go with the traditional ETL and legacy BI tools approach for any new BI project.

There are several other new techniques to solve the problem of getting the data out of the source system and presenting the data in a meaningful manner to the end user.

  1. Replication – Replicate the source data to the target database (increasingly cloud data bases like Snowflake or Amazon Redshift are becoming attractive target databases) and build the business logic post replication
  2. Data Virtualization with a semantic layer and caching in a fast database

I will cover more about these two approaches in future blogs.

What about existing ETL driven data warehouses and BI solutions?

If the existing ETL+BI solution does not require frequent changes and there is a low cost of maintenance, then my suggestion is let the users make use of the system till it loses its business value. If there is a significant change in the source system like move to cloud or an upgrade which needs a lot of change to the ETL, that might be the time to replace the legacy solution to the new architecture.

You don’t have to replace the whole solution to the new architecture at one. You can start with the business-critical reports and then consider stopping the software maintenance support for the legacy BI and ETL tools. Then you can take your time to migrate the other valuable content over to the new platforms and shut down the old ETL/BI solution in a phased manner.