Is Traditional ETL dead?

For many years traditional Extract, Transform and Load (ETL) was the only way to extract and transform data out of operational systems and load it into a data warehouse for analysis by a business intelligence tool like Business Objects, MicroStrategy or OBIEE. Traditional ETL works by extracting data out of one or more transactional systems, merging and reshaping the data using staging tables, and populating the dimension and fact target tables in a data warehouse. These data warehouses were usually built on Oracle, SQL Server or Teradata databases. To load the data warehouse using the traditional ETL method, the following steps need to be accomplished –

  1. Understand the business requirements – you must know all the ways the users will slice and dice the data and the grain of the data (is it at the customer level, the order line level, etc.)
  2. Data model the star and snowflake schemas, including all facts and dimensions, their keys and how they join
  3. Figure out where the data to populate the facts and dimensions lives in the source systems, and then write the SQL that joins the various source tables to populate the individual facts and dimensions with the appropriate keys
  4. Figure out the load order and write a workflow to populate the dimensions and facts in the appropriate order. This is important because the dimensions need to be loaded before the fact tables so that the dimensional keys can be looked up correctly when the facts are loaded.
  5. Figure out which tables will be populated with incremental loads and which with full loads

Let’s go over each of these steps and see what is involved –

Business Requirements

This is probably one of the most difficult parts of any traditional BI project. Net new BI and data warehousing projects are usually started when new operational systems are put in place. For example, the company may implement a new ERP, CRM or cloud HR system. To give accurate business requirements for a traditional ETL/BI project, the business users need to come up to speed on the new source system and its data structures, then try to guess all the potential reports and ways they will want to slice the data in the future, and finally define a set of requirements which IT developers can understand.

The probability of getting the requirements right is very low. But the expectation of IT and business leadership is that proper requirements, written in stone, need to be provided upfront before the implementation of the project begins. This is because the ETL developers need to know what data to pull from the source to populate the data warehouse, whose data model in turn has been designed based on the business requirements.

Data Modeling

Data modeling entails working backwards from the questions the business users will ask and designing the schema to answer those questions. The metrics or measures go into the fact tables, and the attributes by which these metrics can be sliced and diced are modeled as dimensions around the fact tables. The fact tables then carry keys to each of these dimensions.

With this method of building a warehouse, each set of questions needs a new fact table. For example, if the business users want to analyze opportunities, revenue, orders, COGS or GL data, a different fact table and its associated dimensions have to be designed for each of them. Detailed analysis needs to be done upfront to understand the grain (aggregation level) of each fact table, and the dimensions are then designed to match that grain.
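To make this concrete, here is a minimal sketch of what an order-line grain star schema might look like. All table and column names are illustrative only, and SQLite stands in for whatever warehouse database is actually used.

```python
import sqlite3

# Hypothetical order-line grain star schema; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,  -- surrogate key assigned by the ETL
    customer_id   TEXT,                 -- natural key from the source system
    customer_name TEXT,
    region        TEXT
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_id   TEXT,
    product_name TEXT,
    category     TEXT
);

CREATE TABLE dim_date (
    date_key       INTEGER PRIMARY KEY, -- e.g. 20240131
    calendar_date  TEXT,
    fiscal_quarter TEXT
);

-- The fact table is at order-line grain: one row per order line,
-- carrying the measures plus one key per dimension.
CREATE TABLE fact_order_line (
    order_line_id  TEXT,
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    product_key    INTEGER REFERENCES dim_product(product_key),
    order_date_key INTEGER REFERENCES dim_date(date_key),
    quantity       INTEGER,
    revenue        REAL
);
""")
conn.commit()
```

A different business question (say, GL analysis) would typically need its own fact table at its own grain, reusing conformed dimensions such as dim_date where possible.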

Build source SQL or API data pull

Once the data model of the data warehouse is prepared, it is handed over to the ETL developers. The ETL developers need to understand where the data resides in the source system and write SQL against the source database, call APIs for cloud systems, or process flat files to pull the data into the dimensions and facts in the data warehouse. Staging tables are usually used to host the data in an intermediate form before it is loaded into the target tables. Incremental and full loads also need to be designed.
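As a rough illustration, a source-to-staging pull might look something like the sketch below. SQLite stands in for the source database and all table names are hypothetical; a real pull would run over a database link, an extract API or a flat-file drop.

```python
import sqlite3

# Hypothetical source tables plus a staging table that flattens their join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders      (order_id TEXT, customer_id TEXT, order_date TEXT);
CREATE TABLE order_lines (order_id TEXT, product_id TEXT, quantity INTEGER, amount REAL);
CREATE TABLE stg_order_line (
    order_id TEXT, order_date TEXT, customer_id TEXT,
    product_id TEXT, quantity INTEGER, amount REAL
);
""")

# A little sample data so the extract has something to copy.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("O1", "C100", "2024-01-15"), ("O2", "C200", "2024-01-16")])
conn.executemany("INSERT INTO order_lines VALUES (?, ?, ?, ?)",
                 [("O1", "P1", 2, 40.0), ("O2", "P2", 1, 99.0)])

# The "extract" step: join the source tables and land the result in staging,
# which the dimension and fact loads then read from.
conn.execute("""
    INSERT INTO stg_order_line
    SELECT o.order_id, o.order_date, o.customer_id,
           l.product_id, l.quantity, l.amount
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
""")
print(conn.execute("SELECT COUNT(*) FROM stg_order_line").fetchone()[0])  # 2 rows staged
```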

Here are the challenges with writing the SQL, flat-file or API extracts from the source system –

  1. ETL developers rarely understand the source system well. For example, ETL developers usually have very little knowledge of SAP or Oracle ERP tables. They must rely on source system experts, who normally don't have much time for them.
  2. Once they write the SQL and populate the staging and target tables, they must rely on the BI team to complete the semantic data model and build out the reports before any business user can test the results. By the time all this happens the project is almost at its end, and the data is usually very different from what was expected. The whole BI/ETL project team must go back to the drawing board several times to fix the issues. All this causes a lot of stress and delays in the project.
  3. The SQL that is written is usually very expensive in terms of system resources on the source database and often must be rewritten for performance reasons. This sometimes requires significant logic changes, which in turn lead to data errors.

Load Order

The ETL developers also need to figure out the correct order in which to load the dimensions and facts. This can get very tricky when there are a lot of interdependent tables being pulled, especially if the data is coming from multiple source systems. Also, the right set of dimensional keys needs to be populated in all the fact tables with the right lookup logic.
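Here is a minimal sketch, using pandas and hypothetical names, of why that order matters: the fact load has to look up the surrogate keys that the dimension load has already created, and anything that doesn't match falls back to an "unknown" member.

```python
import pandas as pd

# Dimension already loaded, with surrogate keys assigned (hypothetical data).
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_id":  ["C100", "C200"],   # natural key from the source system
})

# Staged fact rows still carry only the natural key.
stg_fact = pd.DataFrame({
    "customer_id": ["C100", "C200", "C999"],   # C999 has no dimension row yet
    "revenue":     [40.0, 99.0, 10.0],
})

# The fact load looks up the surrogate key from the dimension; this is why
# the dimension must be loaded first. Unmatched rows get a -1 "unknown" key.
fact = stg_fact.merge(dim_customer, on="customer_id", how="left")
fact["customer_key"] = fact["customer_key"].fillna(-1).astype(int)
print(fact[["customer_key", "revenue"]])
```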

Incremental and Full Loads

Incremental loads are needed in a lot of cases where many rows get updated on the source system. But getting the logic right as to which rows have been updated can get very complicated when there are no proper last-update-date columns in the source data. Sometimes complicated OR conditions need to be written if a dimension or fact gets populated from multiple source tables.

In addition, when a new column is added to a target dimension or fact, a full load needs to be triggered or an update SQL needs to be written to populate the column for older rows. Also, rows deleted in the source system need to be taken care of, as there might not be a clear flag or trigger in the source to indicate which rows were deleted.
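As a rough sketch of these two concerns, the incremental filter and a simple delete check might look like the snippet below. The table names, the last_update_date columns and the control timestamp are all hypothetical; real systems usually keep the last-run timestamp in an ETL control table.

```python
# Hypothetical incremental pull: only rows touched since the last successful run.
# When one target table is fed by several source tables, the change filter has
# to OR together each source table's last-update column.
last_run = "2024-01-31 00:00:00"   # would be bound to :last_run at extract time

INCREMENTAL_EXTRACT_SQL = """
    SELECT o.order_id, o.customer_id, l.product_id, l.quantity, l.amount
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
    WHERE o.last_update_date >= :last_run
       OR l.last_update_date >= :last_run
"""

def detect_deletes(source_keys, warehouse_keys):
    """Keys present in the warehouse but no longer in the source were deleted
    upstream; flag them (e.g. soft delete) rather than silently keeping them."""
    return set(warehouse_keys) - set(source_keys)

print(detect_deletes({"O1", "O2"}, {"O1", "O2", "O3"}))   # -> {'O3'}
```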

Speed of Change

Given all the above complexities, getting a change done – for example, adding a new dimension, fact or even a new column – becomes very difficult when traditional ETL is used. The skill level required of the ETL developers to perform all the above steps is very high, and such developers are hard to find. In addition, testing a change using the traditional ETL approach gets quite complicated because the data gets reshaped during the load from an OLTP data model to an OLAP data model. It can take weeks to add a new column, or months to add a new dimension or fact table, test it and move the change to production.

Can the business really wait that long for a simple change? The answer is clear – most business users expect changes to their BI systems to be done in a matter of days, if not less.

So, is traditional ETL the best way moving forward for new BI projects?

The answer is NO. With today’s technology including cloud data warehouses, replication methods, and powerful self-service visualization tools there is very little reason to go with the traditional ETL and legacy BI tools approach for any new BI project.

There are several newer techniques for getting the data out of the source system and presenting it in a meaningful manner to the end user.

  1. Replication – Replicate the source data to the target database (increasingly, cloud databases like Snowflake or Amazon Redshift are becoming attractive targets) and build the business logic post replication (see the sketch after this list)
  2. Data Virtualization with a semantic layer and caching in a fast database
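As a rough illustration of the first approach, here is a minimal sketch of "replicate first, transform in the target": the source tables are copied 1:1 into the warehouse and the business logic lives in a view built on top of them. SQLite stands in for a cloud warehouse such as Snowflake or Redshift, and the table names are hypothetical.

```python
import sqlite3

# The replicated tables land as-is (raw_*); no reshaping happens on the way in.
wh = sqlite3.connect(":memory:")
wh.executescript("""
CREATE TABLE raw_orders      (order_id TEXT, customer_id TEXT, order_date TEXT);
CREATE TABLE raw_order_lines (order_id TEXT, product_id TEXT, quantity INTEGER, amount REAL);

-- The business logic is applied post-replication, inside the warehouse,
-- as a view (or a scheduled CREATE TABLE AS job).
CREATE VIEW rpt_revenue AS
SELECT o.order_date,
       o.customer_id,
       l.product_id,
       SUM(l.amount) AS revenue
FROM raw_orders o
JOIN raw_order_lines l ON l.order_id = o.order_id
GROUP BY o.order_date, o.customer_id, l.product_id;
""")
```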

I will cover more about these two approaches in future blogs.

What about existing ETL driven data warehouses and BI solutions?

If the existing ETL+BI solution does not require frequent changes and has a low cost of maintenance, then my suggestion is to let the users keep using the system until it loses its business value. If there is a significant change in the source system, like a move to the cloud or an upgrade that requires a lot of ETL changes, that might be the time to replace the legacy solution with the new architecture.

You don't have to replace the whole solution with the new architecture at once. You can start with the business-critical reports and then consider stopping the software maintenance support for the legacy BI and ETL tools. Then you can take your time migrating the other valuable content over to the new platforms and shut down the old ETL/BI solution in a phased manner.

Why MicroStrategy is a great BI tool for governed self-service

In the previous blog post I explained what governed self-service is and how to set it up in your company. To be successful, governed self-service needs software that supports different sets of features and capabilities for different levels of users.

In a governed self-service model, IT's role is to build the data pipelines and ETL from the various source systems to the data warehouse and to build out the semantic model with the help of the super users.

The super users will primarily build the master datasets for consumption by the reports and dashboards. The master dataset hides the complexity of the various joins and where the data came from. It has all the attributes and metrics for a specific business need. The master dataset is validated for data accuracy by the super users and is deemed a trusted dataset for the power users. It can be cached in-memory or in a cube format for high performance.

The power users use the master dataset to analyze, slice and dice the data the way they want. They can build story boards, pivot the data or mash it up with other local data sources to provide the analysis that their end users need. The end user is the consumer of the data and makes decisions based on the analysis provided to them.

What is also critical is that the BI software in the environment needs to support the governed self-service model. While governed self-service can be achieved to some extent in different tools, MicroStrategy seems to be the tool best built from the ground up to support this model.

The various components in MicroStrategy that make up governed self-service are as follows –

IT ownership

Data warehouse connection – Connecting MicroStrategy Developer to the database and importing the relevant tables

Logical Tables – This is where the developer defines the underlying tables or views used by the schema

Attributes, Facts and other schema objects – The next step in the development process where the attributes, facts, joins and other data model components are built by the IT developer.

Super User Ownership

Once the attributes and metrics are built, the super user can build the prompts, filters, hierarchies and finally the grid report. The super user can include all the necessary attributes, metrics and filters in this master grid report, which can be used to support one or multiple dossiers. The idea is that the super user uses her knowledge of the data and the business need to create a master report that has all the information needed for analysis, and the users of this report are confident that the data is accurate and in a useful format.

The grid report in MicroStrategy is the basic component on which the Master Dossiers can be built.

In MicroStrategy Web, the super user can right-click the grid report and create a master dossier. Once the master dossier is created, the grid report can be hidden from the power and end users.

The master dossier can then be copied by the power users for further analysis.

Power User/End User Ownership

The power users can copy the master dossier into their folders and further create formula columns, add or remove columns, pivot the data, and create visualizations as needed for their end users. They can be confident that the master report has the data they need and that it is accurate.

The biggest advantage of this model of operation is that if a new field, prompt or any other change is made to the master grid report by the super user, it automatically gets propagated to all the dossiers that are built on top of it. This way the "master dataset" on which all the analyses are built is owned by the super user, who has the best knowledge of the data and the closest relationship with the IT developers.

If the BI structure (both technical and organizational) is set up in the above fashion, then the power of governed self-service can be achieved in your company using MicroStrategy. This is a great way to utilize the skills and expertise of the various teams while getting accurate data to the analysts and end users.

What is Governed Self Service and how to set it up in your company?

After working on many different BI projects, tools and teams, I have come to strongly believe in the concept of a "Governed Self Service" model. In this article I will explain how to set this up technically and how to structure your BI teams (both IT and business users) to make this model successful. Before we go into what governed self-service is, let's look at the two extreme models for making BI operational in your company – 100% Self Service and 100% Governed. By analyzing the strengths and weaknesses of these models, the need for the governed self-service model becomes clear.

In any business intelligence/analytics model there are several different applications and people. I want to make sure the definitions are clear before I delve into the various operating models for BI. These are the most commonly occurring roles; I'm sure there are minor variations from company to company.

Systems

  1. Source Systems – These are the source systems that essentially create the data to be analyzed – ERP (Oracle/SAP/Netsuite), CRM (Salesforce, Siebel), Google Analytics, Marketing (Eloqua, Marketo and others), supply chain and plenty of other custom operational and transactional systems.
  2. Data Warehouse – This is the curated, predefined schema designed, loaded and maintained by IT developers in a database of choice – Oracle, SQL Server, Teradata etc.

People

  1. IT Developer – ETL/BI tool – This is the traditional IT role: developers have database logins to the source systems and use them to load the warehouse periodically.
  2. Super User – These users reside in the business and understand the data very well and are technical. They interact with IT and work with power users and end users.
  3. Power User/data scientists – These users work very closely with the end users (CFO, VP of Sales) and use the reports provided by the super users to further analyze and create stories out of the data.
  4. End User/Consumer – These are the end customers of the data and analysis; they and their teams take business actions based on it.

Now, let’s get back to the 100% self service and 100% governed models.

100% Self Service Model

In simplistic terms this model operates like this – business users get direct database access to on-premise source systems like the ERP database, or get "extract all" access to cloud systems like Salesforce. Then they take this data, create a data mart on a small server, and mash up the data. These department users have typically bought Tableau or Power BI licenses, and they wrangle the data and create dashboards and insights out of this data mart.

The advantages of this approach are –

  1. Nimble – This is very nimble as it is done by the super users and they know the data and requirements well and can create a quick and dirty solution fast
  2. Cheap – They can circumvent the enterprise IT processes and rigor which tend to add cost to the implementation.

The disadvantages of this approaches are plenty –

  1. Very people dependent – These solutions tend to stick around for a long time, and the person who built the solution is stuck maintaining it for longer than they want. If they leave, the department struggles to maintain the solution and goes back to IT for help maintaining and enhancing it.
  2. Security – Direct database logins to back-end transaction databases lead to big security risks. In addition, these small business-owned databases and systems don't get patched regularly and pose security risks for the company.
  3. Data Discrepancies – Since there can be different interpretations of the same data set, it's very common to see two different answers to the same question when the analysis is built on different platforms.

Self-service systems are very useful when the data is new and not yet understood. It takes further data discovery and analysis by business users to determine which parts of the data are useful for the organization. Once the useful data is identified and it is determined that the analysis of this data is needed for the long term, a new long-term model for this data is needed. It is too dangerous to leave this in the hands of a business super user, with the data sitting on a server under their desk.

100% Governed Model

This is the opposite of the 100% self-service model. Here all the requirements are fed to a central IT team, which builds the ETL and data pipelines from the data sources into a data model in a central data warehouse. IT then creates a semantic layer in a BI system like OBIEE or Business Objects and builds reports and dashboards on top of it.

In this model, the business team is primarily responsible for providing requirements, testing and consuming the reports and dashboards that IT produces.

The advantages of this approach are –

  1. Very tight governance and security
  2. Centrally built and delivered data – everybody gets the same trusted data

The disadvantages of this approach are –

  1. Slow delivery cycles – IT controls everything and they always have competing priorities. Any new project must go through the rigor of development, test and release cycles. Even small changes might take months to implement. Business cannot wait for data that long anymore.
  2. Knowledge/Know how is underutilized – There are a lot of business people with decent technical knowledge and very good knowledge of the data. Their capabilities are not used in this model.

This kind of 100% governed model is on the decline because of the volume and velocity of data these days. Decision makers cannot wait months for data to do their analysis.

Hence, most companies are going down the 100% self-service path, bypassing IT. There is a better way to do this – Governed Self-Service.

Governed Self-Service

Pic 1: Steps involved in the Governed Self-Service model

In the Governed Self Service model, there are distinct parts of the analytics workflow and distinct roles for each of the user groups in the process.

IT’s Role

In this model, based on business needs, the IT team builds the data pipelines and ETL from the various source systems to the data warehouse or data lake. On top of this data warehouse, they build the semantic data model in the business intelligence system with the help of the business super users. To build the joins, dimensions and metrics in the semantic layer, IT developers need the help of the super users, who understand the data and the business requirements.

Super User’s role and the concept of the “Master Dataset”

In addition to helping IT build the semantic model, the super user will primarily build the master datasets for consumption by the reports and dashboards. The master dataset is a crucial component of the governed self-service model. This dataset hides the complexity of the various joins and where the data came from. It has all the attributes and metrics for a specific business need. For example, the master dataset will have all the attributes, metrics and calculated formula fields necessary for revenue analysis in a company. For this analysis to be complete, the dataset needs to include time, product, customer, order, pricing, quantity, currency and various other information. The super user's role is to build this master dataset, which makes it easy for the power users to further analyze the data for their end users. The master dataset is validated for data accuracy by the super users and is deemed a trusted dataset for the power users. It can be cached in-memory or in a cube format for high performance.
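To make the idea concrete, here is a minimal sketch of what a master dataset for revenue analysis could look like if it were expressed as a single wide view over a star schema. All table and column names are hypothetical; in practice this could equally be an in-memory cube or a cached dataset inside the BI tool.

```python
# One wide, validated view that hides the joins so power users never have to
# touch the underlying schema. Names are illustrative only.
MASTER_REVENUE_DATASET_SQL = """
CREATE VIEW master_revenue AS
SELECT d.calendar_date,
       d.fiscal_quarter,
       c.customer_name,
       c.region,
       p.product_name,
       p.category,
       f.quantity,
       f.revenue,
       f.revenue / NULLIF(f.quantity, 0) AS avg_selling_price  -- calculated field
FROM fact_order_line f
JOIN dim_date     d ON d.date_key     = f.order_date_key
JOIN dim_customer c ON c.customer_key = f.customer_key
JOIN dim_product  p ON p.product_key  = f.product_key;
"""
```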

Power User’s Role

The power user is less technical than the super user and normally doesn't understand the intricacies of the data model behind the master dataset. They are the analysts that the executives and decision makers go to for their data. They could also be data scientists who need the data prepared and ready for analysis. The power users go to the super users to get their master dataset ready for analysis.

Once they have the master dataset, the power users are empowered to analyze, slice and dice the data the way they want. They can do this in Excel or in the BI tool of their choice, like Tableau, Qlik or Power BI. They can build storyboards, pivot the data or mash it up with other local data sources to provide the analysis that their end users need.
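For instance, slicing a (hypothetical) extract of the master dataset by quarter and region is a one-liner in pandas, and the equivalent of a drag-and-drop operation in Tableau or Power BI.

```python
import pandas as pd

# A tiny, hypothetical slice of the validated master dataset.
master = pd.DataFrame({
    "fiscal_quarter": ["Q1", "Q1", "Q2", "Q2"],
    "region":         ["EMEA", "APAC", "EMEA", "APAC"],
    "revenue":        [120.0, 80.0, 150.0, 95.0],
})

# Pivot revenue by quarter and region - the kind of slicing a power user does
# without ever touching the source schema or the joins behind the dataset.
pivot = master.pivot_table(index="fiscal_quarter", columns="region",
                           values="revenue", aggfunc="sum")
print(pivot)
```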

End User’s Role

The end user is the consumer of the data and makes decisions based on the analysis provided to them. They could also learn the tools and, with the help of the power users, do a lot of their own slicing and dicing of the master data. Their primary role in the governed self-service model is to provide accurate and timely requirements to the power users as to what data and analysis they need to do their job. Power users in turn work with the super users, who prepare the master dataset if the data is already available in the semantic model; if not, they work with IT to extract and prepare the data as needed.

It's clear that for the governed self-service model to work, the right people need to be engaged in IT and the various business groups. Different departments within the company need to identify who their super and power users are. A BI Center of Excellence needs to be established where all these users can work together along with IT to get the data that their end users need. The governed self-service model will work only with the right sponsorship at the executive level of the company. It will be a long road to establish this in your company, but in the end the reward is that the right data and analysis will be delivered to the decision makers at the right time!

Oracle’s Business Intelligence road map and how it might impact your company’s BI strategy

If you have been working with OBIEE for many years, you probably have a very stable, working OBIEE 11g environment that your company has invested in over the last six years. OBIEE 11g was released in July 2010 – a lifetime ago in terms of BI technology – and for years it struggled with a multitude of bugs. The upgrade from OBIEE 10g was a nightmarish experience for most customers, but most of them have gotten through it. With the release of OBIEE 11.1.1.7 in 2013, most customers finally had a stable environment on which they could build out their data model, reports and dashboards.

In addition, over the years most business users have been frustrated with OBIEE's shortcomings in speed of change, visualization, blending and ease of use, and have purchased modern BI tools, especially Tableau. In a lot of companies, Tableau entered the organization with a few business users purchasing the desktop product; it then proliferated enough that IT got involved and purchased and implemented the enterprise desktop and server products company-wide.

In today's world there is a need for parallel development cycles for both modern data visualization/discovery tools like Tableau and legacy BI tools like OBIEE. The question in the minds of business and IT users is which product to use for what purpose. What about the new products Oracle has been releasing – should they be considered, and how do they fit with the current dual Tableau (or similar tools) + OBIEE approach? Here is my view on the recent developments from Oracle.

If your company has been using OBIEE, you are most probably aware that Oracle has introduced a slew of new BI products over the past two years.

Here is a sample –

OBIEE 12c

Oracle Business Intelligence Cloud Service (BICS)

Oracle Data Visualization Desktop (DVD)

Oracle Data Visualization Cloud Service (DVCS)

Oracle Analytics Cloud (OAC)

In 2016, Oracle introduced a new version, OBIEE 12c, which takes features from BICS, DVD and DVCS and includes them in the base product. With this release they have attempted to catch up with what Tableau, Qlik and Spotfire offer in terms of data visualization, data blending and wrangling.

Even though this looks like a great step for existing OBIEE 11g (or even 10g) customers who want to use these new features, there are a few roadblocks to getting there –

  1. Complexity of upgrade – Upgrading all your OBIEE environments to the latest OBIEE 12c is a massive project for most organizations, as multiple divisions of the company use the same OBIEE 11g platform and it's a huge challenge to find the time and resources from the various business teams to test the upgraded system. BI analysts and business users are some of the busiest people in the company, constantly juggling operational review reporting and multiple other technology initiatives. Getting your company to OBIEE 12c is at least a several-month to year-long project.
  2. Which tools to use? – None of the new tools are part of the upgrade from OBIEE 11g. Everything needs to be bought separately as an add-on module, though a transfer of license might be available. Understanding and working with Oracle on buying the right set of products for your company is very tricky, as you are not sure which products your business needs – OBIEE 12c Visual Analyzer, DVD, DVCS, BICS, OAC etc. Each of these has its own benefits and trade-offs, and analyzing them and working with each stakeholder is a challenge.
  3. Roadmap – The number of tools Oracle is releasing and the push to the cloud create a lot of uncertainty and confusion among IT and business users. There is constant churn in the BI product strategy and roadmap, and it's very unclear what the next year is going to bring. Given that it takes several months to a year to upgrade from an existing OBIEE 11g environment, it's very complicated for customers to keep up with Oracle's roadmap.
  4. In-memory capability and performance issues – Even with the large set of on-premise and cloud tools that have come out of Oracle, it is still unclear what the in-memory strategy for Oracle BI is. Is it Exalytics or the Oracle Database 12c in-memory option? Is cloud performance going to be good for large data volumes? In any case, there are additional hardware and software components that customers need to evaluate and purchase.

The release of all these new products from Oracle has added plenty of new capabilities and features for customers to explore, but the biggest challenge is that customers do not have an OBIEE-only BI environment in their company. They have multiple modern BI tools like Tableau in addition to OBIEE. Their business users are already very familiar with Tableau-like tools and have invested heavily in them in terms of knowledge and content. Oracle, with its latest set of tools, has just about caught up with Tableau in terms of visualization and self-service features – that is, if they work as advertised – but this comes with the set of questions listed above that each company needs to evaluate. In addition, IT and business users have been left even more confused as to which tools to use for what purpose.

When to upgrade to OBIEE 12c or move to OAC?

Companies that have invested in Tableau, Qlik or similar tools, which already provide their users with great visualization, blending and in-memory capabilities, can take their time and upgrade from OBIEE 11g to OBIEE 12c, or move to OAC or BICS, only when they are ready.

This would be a great time to think about your BI tools strategy. I would suggest companies fully understand whether Oracle's BI strategy is something they really want to continue investing in, given the complexity of Oracle's roadmap. All of Oracle's investments are going to the cloud, so whether BI in the cloud is in your roadmap needs to be considered. This is especially true if you have Oracle Business Intelligence Applications (OBIA), which Oracle is sunsetting. (More on this topic in a later blog.)

Consider whether you are ready to move your OBIEE stack to the cloud, what the cost implications of doing so are, and how it will access your data if that data continues to reside on-premise. Or is this a good time to move off OBIEE and adopt a different tool?

I'm sure it won't be an easy discussion to have in your company, but it needs to happen before you make the leap to Oracle Analytics Cloud – the latest in Oracle's multitude of BI offerings. Things to consider include how complex your OBIEE implementation is, whether you have Oracle ERP, Siebel CRM and OBIA, and where the data sources that feed your warehouse are located (in the cloud or on-premise).

I would highly recommend that companies think about where their future data sources are going to be. There is a very high likelihood that your applications are going to move to multiple cloud platforms. If you are moving, for example, to Workday for HR, Salesforce for sales, Eloqua for marketing automation, Apttus for CPQ and Coupa for purchasing, then is OBIEE 12c or OAC the most effective solution in both cost and features? The answer is most probably not. If you are in this situation, then it's time to move to a more modern platform. But what should it be? I will cover these and other topics in future blog articles.