How to work on a BI Modernization strategy for your company — Part 2

In Part 1 of this article I wrote about the goals of a BI Modernization strategy and how grouping your existing BI use cases by data latency requirements can help clarify your BI strategy. The three groupings by data latency are Real-Time, Operational, and Analytical Reporting. In this article I will cover possible architectures for each of these use models.

Design Pattern

The key design assumptions for these architectures are the following –

1. Companies will have multiple visualization tools

2. A single architecture and set of tools might not be able to satisfy all three data latency use cases.

Real-Time Reporting

The defining characteristic of real-time reporting is that users need the data “as it is at any point in time” in the source system when they run the report. The users need data in real-time from one data source like Oracle EBS, SAP, or Salesforce, the volume of data required for each report is relatively small (a few tens of thousands of rows at most), and the UI needs of the reports are simple, mostly table-like reports. Each of the source systems that needs real-time reporting might have a different way of accessing the data: some have JDBC/ODBC interfaces where you can run SQL queries, while others have an API interface. The other consideration is that there might be multiple visualization tools used for real-time reporting. So, given that there are multiple back-end systems and multiple visualization platforms, and there is no need to join one system to another (no need for query federation), a central data modeling tool is ideal. This avoids creating data models in different data visualization tools and gives you the ability to restrict data volumes and manage query performance on the source systems, so the reporting platforms do not impact source system resources. A leading data virtualization tool like Denodo is a good solution for this use case, and I have written about it here in detail. The data virtualization tool provides the semantic layer and lets the various BI tools connect to it, so if a new BI tool is introduced there is no need to redo the data model.
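To make this concrete, here is a minimal sketch of how a report or script might query a published virtual view through the virtualization layer's ODBC interface instead of hitting Oracle EBS or Salesforce directly. The DSN name, view name and columns are hypothetical placeholders, not anything prescribed by a specific tool.

```python
# Minimal sketch: querying a published virtual view via ODBC instead of the
# source system directly. The DSN "dv_prod", the view "open_sales_orders" and
# the column names are hypothetical placeholders.
import pyodbc

def fetch_open_orders(org_id: int, limit: int = 10_000):
    # Connect to the data virtualization layer, not to Oracle EBS / SAP /
    # Salesforce; the DV tool translates this SQL into source-specific
    # SQL or API calls.
    conn = pyodbc.connect("DSN=dv_prod;UID=report_user;PWD=secret")
    cursor = conn.cursor()
    cursor.execute(
        "SELECT order_number, customer_name, order_amount, status "
        "FROM open_sales_orders "
        "WHERE org_id = ? "
        "ORDER BY order_date DESC",
        org_id,
    )
    # Cap the rows pulled back so the report stays in the "tens of thousands
    # of rows at most" range and does not strain the source system.
    rows = cursor.fetchmany(limit)
    conn.close()
    return rows
```

Because every BI tool or script points at the same published views, introducing a new visualization tool does not require rebuilding the data model.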

Operational Reporting

The data needed for operational reporting is usually spread across multiple systems, and the reporting solution should be able to join this data together. Each of these systems can have data in a different shape and format, and hence it is important to bring the required data into a central database. If you have a lot of data sources and millions of rows that need to be brought in, then query federation is usually not an option, as there will be significant performance differences between the different sources and the user experience might not be optimal. So, you will need to decide on an ELT/ETL tool to move the data and a database to host the data. Once the data is in the central database, the data modeling exercise begins, where the data is reshaped and joined appropriately based on the reporting requirements. After the data modeling is done, business-friendly labeling can be applied and calculated columns like currency translations and gross margin can be added. Once this is complete, the reporting and data visualization activity can begin.
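As an illustration of the modeling and enrichment step, the sketch below joins replicated order lines to currency rates, adds calculated columns, and applies business-friendly labels. The table and column names are hypothetical, and a real implementation would more likely do this in SQL inside the central database.

```python
# Minimal sketch of the post-load modeling step: after raw tables have been
# replicated into the central database, reshape and enrich them for reporting.
# Table and column names (order_lines, fx_rates, etc.) are hypothetical.
import pandas as pd

def build_sales_reporting_dataset(order_lines: pd.DataFrame,
                                  fx_rates: pd.DataFrame) -> pd.DataFrame:
    # Join order lines to daily FX rates so every amount can be translated to USD.
    df = order_lines.merge(fx_rates, on=["currency_code", "order_date"], how="left")

    # Calculated columns: currency translation and gross margin.
    df["revenue_usd"] = df["line_amount"] * df["usd_conversion_rate"]
    df["cost_usd"] = df["line_cost"] * df["usd_conversion_rate"]
    df["gross_margin_pct"] = (df["revenue_usd"] - df["cost_usd"]) / df["revenue_usd"]

    # Business-friendly labels for the BI layer.
    return df.rename(columns={
        "ordered_item": "Product",
        "sold_to_org": "Customer",
        "revenue_usd": "Revenue (USD)",
        "gross_margin_pct": "Gross Margin %",
    })
```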

There are two approaches companies can take –

1. Choose one tool that does all the tasks: ETL/ELT, database, data modeling, and data visualization. There is a new industry term for these kinds of tools: Unified Data and Analytics Platforms (UDAP). These UDAP tools don’t have best-of-breed features in any of these four domains, but the advantage is that they offer a single stack with a well-integrated set of capabilities, which should lead to significantly faster time to value. Examples of these tools include Incorta, Birst, Domo, Pentaho and others. I have particularly good experience with Incorta, which has worked very well for operational reporting from large, complicated on-premise and cloud data sources and for joining them at scale. I have written about the Incorta architecture and capabilities here.

2. Use best-of-breed software to create the operational reporting stack

A best-of-breed approach can also be taken to create the operational reporting stack. In this case, you will have to choose an ETL/ELT tool, including a CDC mechanism, that can get near-real-time data into the target database quickly and without much transformation along the way. I have talked about how to enable real-time analytics on Snowflake here in detail. The transformation happens after the data is loaded into the target database. An ETL/ELT tool like Matillion, Stitch, Fivetran and others can be used to move the data into the target database. In today’s world it makes sense for the target database to be one of the major cloud platforms like Snowflake, Redshift, or Google BigQuery. If you are using a tool like Stitch or Fivetran, which are primarily data ingestion tools, then using dbt to transform the data will be very useful. Matillion does both ingestion and transformation as well as orchestration.

This stack works best if all the major data sources are in the cloud. If there are large volumes of on-premise data and the data needs to be loaded several times a day, I don’t think this will be a very robust data pipeline. The biggest advantage is that you can use your BI tool of choice for operational, analytical and real-time reporting. You will have to evaluate your data sources, volumes and frequency of loads to make sure this is the best architecture for your operational reporting.
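A rough sketch of this load-then-transform (ELT) pattern is below: an ingestion tool such as Stitch or Fivetran has already landed raw tables in Snowflake, and a transformation is then run inside the warehouse. In practice the SELECT would usually live in a dbt model; the account, credentials, schemas and table names here are placeholders.

```python
# Rough sketch of ELT: the ingestion tool has already landed RAW.ORDERS and
# RAW.CUSTOMERS in the warehouse; the transformation runs inside Snowflake.
# Account, credentials, schemas and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder
    user="etl_user",             # placeholder
    password="secret",           # placeholder
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
)

conn.cursor().execute("""
    CREATE OR REPLACE TABLE ANALYTICS.REPORTING.FCT_ORDERS AS
    SELECT o.order_id,
           o.order_date,
           c.customer_name,
           o.amount * o.usd_conversion_rate AS revenue_usd
    FROM ANALYTICS.RAW.ORDERS o
    LEFT JOIN ANALYTICS.RAW.CUSTOMERS c
           ON o.customer_id = c.customer_id
""")
conn.close()
```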

Analytical Reporting

Analytical reporting has the following characteristics –

1. Analyze data trends over months, quarters or years.

2. Frequency of data loads can be daily, weekly or monthly from various sources. No need for loading several times a day.

3. The data is usually stored for the long term in the form of snapshots.

4. Data is summarized to a more aggregate form.

5. Data can come from multiple sources.

6. Analysts and users prefer to use their favorite BI tools. The end users skew more towards senior management.

The proposed architecture for analytical reporting is not too different from the best-of-breed stack for operational reporting. This architecture should address all six major characteristics of analytical reporting.

The important distinctions between operational reporting and analytical reporting are the reduced frequency of data loads, the use of snapshots and aggregations to store data for the long term, and the use of the best self-service data visualization tool that your organization has access to. The modern cloud data warehouses like Snowflake, Redshift and Google BigQuery are well suited for this purpose as they perform well, provide cheap storage for snapshots and long-term data retention, and can be accessed by a wide variety of BI tools for visualization.
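As a sketch of the snapshot-and-aggregate pattern, the job below appends a dated, summarized snapshot of the detail data to a history table so trends can be reported over months or years. The connection parameters and table names are placeholders, and the same idea applies to Redshift or BigQuery.

```python
# Minimal sketch of an analytical snapshot job: once a day/week/month, aggregate
# the detail data and append it to a history table stamped with the snapshot
# date. Connection parameters and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="secret",
    warehouse="TRANSFORM_WH", database="ANALYTICS",
)

conn.cursor().execute("""
    INSERT INTO ANALYTICS.REPORTING.REVENUE_SNAPSHOTS
        (snapshot_date, region, industry, revenue_usd, gross_margin_usd)
    SELECT CURRENT_DATE, region, industry,
           SUM(revenue_usd), SUM(revenue_usd - cost_usd)
    FROM ANALYTICS.REPORTING.FCT_ORDERS
    GROUP BY region, industry
""")
conn.close()
```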

Summary

When you look to modernize the data architecture for your company, it might be helpful to classify your current BI landscape into real-time, operational and analytical reporting buckets. This will give clarity on how significant each of these use models is for your company. Once that is done, you can design an architecture for each of them, migrate your existing reports and analyses into one of these three architectures, and then continue using these platforms for your future requirements.

I would love feedback on whether thinking about your BI landscape in this way using data latency requirements is useful to you and your organization. I look forward to your comments.

How to create a BI Modernization strategy for your company Part 1

How grouping your existing BI use cases by data latency requirements can help clarify your BI Strategy

A lot of companies are burdened by legacy BI platforms and are not able to deliver the right data to their users at the right time to run their business today. They know that they need to modernize their BI platforms but do not know how to go about it. First let’s look at what should be the goals of a BI Modernization strategy.

Here are the goals from my view for any BI Modernization Strategy –

1. Speed of change — Probably the most important goal. The new platforms should be able to add new fields or add new data sources much faster than legacy platforms.

2. Combine data from multiple sources — This one was always difficult to achieve in legacy tools but is possible with modern tools.

3. Superior performance — Report performance and data load timings should be significantly improved.

4. Self Service — Enable business users and power users to build their own reports and dashboards without relying on IT and super users

5. Support different data latency requirements — Real Time, Operational and Analytical Reporting are the three ways to look at data latency requirements

6. Reduce overall cost — BI software licensing models have changed and the costs of compute and storage have come down enormously, leading to a significant reduction in costs compared to legacy BI if implemented correctly

Grouping BI Use Models by Data Latency

In most companies BI has evolved in a somewhat organic manner over many years or decades. Many different BI tools would have been used at different times, either brought in by acquisitions or chosen as the best tool available at the time. When companies try to replace their legacy tools, they often try to replace one tool with another tool. For example, their project looks like “we should move from OBIEE or Cognos to Tableau or Power BI” and/or “let’s move our data warehouse to Snowflake”. They then end up customizing or building solutions into Tableau/Power BI that those tools were not meant to handle, and the project runs into issues.

A better strategy would be to consider the use models of your current BI solution. One way to look at your current use model is by data latency requirements.

There are three ways to classify use models by data latency –

1. Real-Time Data

When users want to see the data as it is “right now” in the operational system, that is a real-time requirement. This kind of data need is common in the supply chain, finance, sales and call center areas. Supply chain users want to see which orders they need to ship right now, finance wants to see the GL balance at 12 AM at the end of the quarter, sales wants to see quota versus orders at frequent intervals at the end of a quarter, and the call center wants to see call volume data in real-time. None of these use cases can wait for the data refresh to the warehouse to happen. The best place to do this reporting is embedded analytics inside the ERP or CRM software, but that embedded reporting is often slow or does not offer the capabilities of a BI tool. Hence these requirements fall into the BI arena. The primary users of real-time reporting are operational line-level employees on the manufacturing floor, in shipping or in finance operations who use the data to perform day-to-day tasks. The BI user interface can be simple but effective for these needs, as the user is just trying to get the data they need at any given time.

2. Operational Reporting

Operational reporting is needed when the data being analyzed can be a few hours or a few days old. This kind of reporting is useful to analyze order or call volumes over the last few days, or how many marketing leads have been generated by various campaigns this month. The data needed for operational reporting is usually in multiple systems, and the reporting solution should be able to join this data together. For example, the campaign data could be in Eloqua or Marketo, the leads data might be in Salesforce and the order data might be in Oracle or SAP. The operational reporting solution should be able to bring in the data from these three sources and tie it together to get a full picture of campaign effectiveness. In most companies, the bulk of BI requirements fall into this category, which is essential for running the business. The primary users of operational reporting are managers who are analyzing data to make decisions about their operations. They make decisions like how many trucks they need for today’s shipments, or which campaigns to run today, based on the data and analysis they were able to obtain from their operational reporting.
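As a small illustration of tying these sources together, the sketch below joins campaign, lead and order data (once all three have been replicated into a central database) to measure campaign effectiveness. The column names are hypothetical.

```python
# Minimal sketch of joining campaign (Eloqua/Marketo), lead (Salesforce) and
# order (ERP) data after replication into a central store. Column names are
# hypothetical.
import pandas as pd

def campaign_effectiveness(campaigns: pd.DataFrame,
                           leads: pd.DataFrame,
                           orders: pd.DataFrame) -> pd.DataFrame:
    # Campaigns -> leads -> orders, following the marketing funnel.
    leads_per_campaign = campaigns.merge(leads, on="campaign_id", how="left")
    funnel = leads_per_campaign.merge(orders, on="lead_id", how="left")
    return (funnel.groupby("campaign_name")
                  .agg(leads=("lead_id", "nunique"),
                       orders=("order_id", "nunique"),
                       revenue=("order_amount", "sum"))
                  .reset_index())
```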

3. Analytical Reporting

This kind of reporting is useful to analyze data trends over months, quarters or years. The data is usually stored for the long term, and frequently data is also stored as snapshots to capture the view of the data on a specific day of a month. The data is then usually summarized to give a long-term view of the various metrics. Some common examples are revenue by industry or region over many years, or how gross margin for a product has changed from last year to this year. In this type of reporting the frequency of data refreshes can be once a day, week or month, and data is usually summarized and stored for trend reporting. The user base here consists of analysts who are answering questions based on trends in the data, and executives and senior managers of various departments doing long-range planning of their business based on historical trends. The BI tool used here should be very user friendly and will need to enable self-service for the analysts.

Now if you look at your traditional BI environments, all three of these use models will be present across all the BI tools. For example, OBIEE might be performing real-time, operational and analytical reporting, but so might Business Objects, Cognos or other tools; they will just be doing so for different departments or different source applications in your company. When you are trying to modernize the BI architecture, you can look across all your current BI applications and group the reports or dashboards serving these three use cases. Once they are grouped, you have a good understanding of how much your company uses each of these use models. With this analysis in hand you can proceed to work out the architecture for these three use models.

I will cover possible architectures for each of these use models in the following articles. Please stay tuned and let me know your comments on whether classifying your BI landscape into these use models makes sense for your company.

What is Data Virtualization and how it can unlock real-time insights directly from source systems

In today’s fast-moving business environment there are a lot of reporting requirements which cannot wait for a batch process to load data from operational systems. The business users need the data as it currently is in their on-premise ERP, CRM or any of the cloud systems. For example, towards the end of a fiscal quarter, sales and finance leaders want to know the order bookings in real-time so that they have a good idea of whether or not they are going to “make” the quarter in terms of revenue. They cannot wait 4–8 hours for an ETL job to finish.

Data virtualization (DV) is a method of building a “logical” warehouse by connecting “virtually” to various source systems and combining or joining them “virtually” to provide a unified data layer for BI systems or enterprise applications to query from.

Fig 1: High Level Architecture of Data Virtualization

The alternative to the data virtualization approach of providing a unified layer is the traditional ETL approach of moving data from the various source systems at periodic intervals and loading it into a physical data warehouse. The data in the warehouse is then combined by loading it into a target star schema or snowflake data model. ETL loads usually run as a batch process once a day or a few times a day, depending on the time it takes to load and the data latency requirements.

How DV Works

At its core, DV is a SQL generation tool with an easy-to-use UI for building a semantic layer across various data sources. Instead of loading data into a physical warehouse by extracting, transforming and loading (ETL) from various source systems, with DV the data primarily remains in the source systems and the DV tool loads only the metadata of the tables (or other objects like views) from the source systems. We can call these virtual tables “Physical Views”. Once the metadata of the tables is loaded, the DV tool allows the developer to perform various manipulations on these physical views, including joins and formula columns, to enrich them with the required business logic. We can call these “Logical Views”. Once the joins and formula columns are added, the developer can then build “BI Datasets” from these various logical views: flattened views of the data for a specific business purpose. These datasets can include fixed filters if they are always needed.

These flattened datasets are then imported into a BI tool like MicroStrategy, Power BI or Tableau. In these BI tools, users can add further business formula columns and filters, or even join two “BI Datasets” from the DV tool. The DV tool takes the incoming call from the BI tools and translates it into SQL appropriate for the specific backend data source, be it a database like Oracle, SQL Server or Teradata, or a cloud system accessed through its API, like Salesforce.
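The layering described above can be illustrated with ordinary SQL views. The sketch below simulates it with an in-memory SQLite database purely for illustration; a real DV tool defines these layers against remote source systems rather than local tables, and the table names and formula column are made up.

```python
# Conceptual sketch of the DV layering (physical views -> logical views ->
# BI datasets), simulated here with SQLite purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-ins for the "physical views" that mirror source tables.
    CREATE TABLE src_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE src_customers (customer_id INTEGER, customer_name TEXT, region TEXT);
    INSERT INTO src_customers VALUES (1, 'Acme Corp', 'Americas');
    INSERT INTO src_orders VALUES (100, 1, 250.0);

    -- "Logical view": join plus a formula column adding business logic.
    CREATE VIEW lv_orders_enriched AS
    SELECT o.order_id, c.customer_name, c.region,
           o.amount * 1.1 AS amount_with_tax      -- hypothetical formula column
    FROM src_orders o JOIN src_customers c ON o.customer_id = c.customer_id;

    -- "BI dataset": flattened view with a fixed filter, exposed to BI tools.
    CREATE VIEW ds_orders_americas AS
    SELECT * FROM lv_orders_enriched WHERE region = 'Americas';
""")

for row in conn.execute("SELECT * FROM ds_orders_americas"):
    print(row)
```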

Fig 2: How Data Virtualization Works

The best DV tools perform SQL tuning before sending queries to the backend data sources. Instead of just sending the base SQL to the backend, the DV tool will introspect the incoming request, optimize the query for the specific backend data source and then send the SQL or API call to that data source.
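The difference is easiest to see as pseudocode. In the sketch below, run_on_source is a hypothetical stand-in for the DV tool's connector to one backend database; nothing here is a real DV tool API. The optimized plan pushes filtering and aggregation down so that far fewer rows leave the source.

```python
# Conceptual sketch of query pushdown. run_on_source() is a hypothetical
# callable that runs SQL on one backend and yields rows as dictionaries.
from collections import defaultdict

def naive_plan(run_on_source):
    """Pull the whole table, then filter and aggregate in the middle tier."""
    totals = defaultdict(float)
    for row in run_on_source("SELECT region, year, amount FROM sales"):
        if row["year"] == 2023:            # filtering after moving all the rows
            totals[row["region"]] += row["amount"]
    return dict(totals)

def pushed_down_plan(run_on_source):
    """Rewrite the request so the source does the filtering and aggregation."""
    sql = ("SELECT region, SUM(amount) AS total FROM sales "
           "WHERE year = 2023 GROUP BY region")
    return {row["region"]: row["total"] for row in run_on_source(sql)}
```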

In addition, the leading DV tools can cache any of the physical views, logical views or BI Datasets for better performance if the source is slow, either in a native database or in a commercial database like Oracle or Snowflake.

Advantages of DV

Data virtualization comes with a lot of benefits –

1. Simplified view of complex source system schemas

If the semantic layer is built in the DV tool and the “BI Datasets” are exposed as flat tables/views to be imported into a BI tool like Tableau or Power BI, the complexity of a source schema like Oracle EBS or SAP is hidden from the BI developers. BI developers are usually not very familiar with the source system tables and struggle to build a semantic layer, so they often depend on database views in the source system instead. Building the semantic layer in the DV tool not only simplifies a complex schema, it also avoids maintaining complex and hard-to-change physical views in the source databases.

2. Speed of development

Since DV just imports the metadata of tables and creates virtual tables that mirror the source objects, complex ETL to achieve real-time reporting can be avoided. Adding new fields or tables in ETL can take several weeks to months, as star schemas have to be designed and built, but with DV the shape of the data can be preserved to match the source, and hence adding new tables or fields is much easier and quicker.

3. Real-time reporting needs

Since DV points to the source system directly, either through ODBC/JDBC to the database or through an API to cloud systems, the latency of the data is reduced to zero or close to it. The data available to the user is real-time, and for use cases where users need data immediately as it changes, DV really comes through as a great option.

4. Source system changes can be hidden from BI tools

A big benefit of DV is that when a source system is upgraded, replaced with another system, or simply moved to a different database, say from Oracle to Snowflake, virtualizing the semantic layer helps hide these complex changes in the environment from the various BI tools. To accommodate such changes, the data source connection in the DV tool needs to be re-pointed to the new database if the schema is the same, or the relevant physical views need changes if there are schema changes like column additions. This way the reports and visualizations built in the various BI tools do not have to change and users are not impacted.

5. Performance optimizations and caching

Most leading DV tools provide the ability to cache certain tables from the source database, or to cache datasets as well. Caching can significantly improve the performance of the reports in the BI tools. The best DV tools tune the queries they receive from the BI tools and then push them down to the source databases. They also find ways to extract and move the minimum amount of data from the different data sources before joining it in the DV tool and sending the results to the BI tools.

Data Virtualization Tools

There is a lot of confusion in the market as to what Data Virtualization is. For Data Virtualization as described here, these are tools that companies can evaluate for their needs –

Commercial Products

1. Denodo

2. Tibco Data Virtualization

3. DataVirtuality

4. AtScale

5. Azure Analysis Services

Open Source Alternatives

1. Dremio

2. Teiid

We personally selected Denodo, and we really like it for working with on-premise databases and cloud data sources, merging data from multiple sources, and presenting a simplified view of the data to multiple BI tools. I will write a more detailed article on Denodo and how it works in a future article.

How Incorta enables Operational Analytics without cumbersome ETL or star schema

Link to the demo and to the full webinar

The key benefits we realized with Incorta for operational reporting are –

  • Delivered real-time operational reports against data aggregated from complex data models of Oracle EBS, Oracle Siebel, and Oracle Agile
  • Empowered business users to self-serve even the most complex reports and analytics
  • Redeployed 13 of 15 BI support resources to work on more innovative projects
  • Reduced new report delivery time from months to mere hours
  • Joined data from multiple sources like Oracle and Siebel, which was very difficult to do in the past other than through a slow dblink between two Oracle databases
  • Reduced the running time of complex 40-table reports from 6 hours to seconds