How to work on a BI Modernization strategy for your company — Part 2

In Part 1 of this article I wrote about the goals of a BI Modernization strategy and how grouping your existing BI use cases by data latency requirements can help clarify your BI strategy. The three groupings by data latency are Real-Time, Operational and Analytical Reporting. In this article I will cover possible architectures for each of these use models.

Design Pattern

The key design assumptions for these architectures are the following –

1. Companies will have multiple visualization tools

2. A single architecture and set of tools might not be able to satisfy all three data latency use cases.

Real-Time Reporting

The defining characteristic of real-time reporting is that users need the data “as it is at that point in time” in the source system when they run the report. The users need data in real time from one data source like Oracle EBS, SAP, or Salesforce; the volume of data required for each report is relatively small, a few tens of thousands of rows at most; and the UI needs of the reports are simple, which mostly means a table-like report. Given that multiple source systems need real-time reporting, each might have a different way of accessing the data. Some might have JDBC/ODBC interfaces where you can run SQL queries while others might have an API interface. The other consideration is that there might be multiple visualization tools which can be used for real-time reporting.

So, given that there are multiple back-end systems and multiple visualization platforms, and there is no need to join one system to another (no need for query federation), a central data modeling tool is ideal. This avoids creating data models in different data visualization tools and gives you the ability to restrict the data volume and query load on the source systems, so the reporting platforms do not impact source system resources. A leading data virtualization tool like Denodo is a good solution for this use case, and I have written about it here in detail. The data virtualization tool provides the semantic layer and enables the various BI tools to connect to it, so if a new BI tool is introduced there is no need to redo the data model.
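As a minimal sketch of the idea, with hypothetical view and source names (Denodo actually uses its own VQL, so the SQL below is only illustrative): the semantic layer exposes one governed view per report, the virtualization tool pushes the filter down to the source, and every BI tool issues the same simple query.

    -- Hypothetical semantic-layer view defined once in the data virtualization tool
    CREATE VIEW rt_sales_orders AS
    SELECT order_number,
           order_date,
           customer_name,
           order_status,
           order_amount
    FROM   ebs_source.order_headers          -- base view over the Oracle EBS JDBC connection
    WHERE  order_date >= CURRENT_DATE - 7;   -- restrict the volume pulled from the source

    -- Every BI tool (Tableau, Power BI, etc.) runs the same query against the semantic layer
    SELECT * FROM rt_sales_orders WHERE order_status = 'BOOKED';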

Operational Reporting

The data needed for operational reporting is usually spread across multiple systems, and the reporting solution should be able to join it together. Each of these systems can have data in a different shape and format, and hence it is important to bring the required data into a central database. If you have a lot of data sources and millions of rows that need to be brought in, then query federation is usually not an option, as there will be significant performance differences between the sources and the user experience might not be optimal. So, you will need to decide on an ELT/ETL tool to move the data and a database to host it. Once the data is in the central database, the data modeling exercise begins, where the data is reshaped and joined appropriately based on the reporting requirements. After the data modeling is done, business-friendly labeling can be applied and calculated columns like currency translations and gross margin can be added. Once this is complete, the reporting and data visualization activity can begin.
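Once the data is in the central database, the modeling and business-friendly labeling can be expressed as plain SQL views. A minimal sketch with hypothetical table and column names, showing a currency translation and a gross margin calculation:

    -- Hypothetical reporting view in the central database
    CREATE OR REPLACE VIEW rpt.order_margin AS
    SELECT o.order_number                                     AS "Order Number",
           o.order_date                                       AS "Order Date",
           o.amount_local * fx.rate_to_usd                    AS "Revenue (USD)",
           o.cost_local   * fx.rate_to_usd                    AS "Cost (USD)",
           (o.amount_local - o.cost_local) * fx.rate_to_usd   AS "Gross Margin (USD)"
    FROM   staging.order_lines     o
    JOIN   staging.currency_rates  fx
      ON   fx.from_currency = o.currency_code
     AND   fx.rate_date     = o.order_date;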

There are two approaches companies can take –

1. Choose one tool which does all the tasks: ETL/ELT, database, data modeling, and data visualization. There is a new industry term for these kinds of tools, Unified Data and Analytics Platforms (UDAP). These UDAP tools don’t have best-of-breed features in any of these four domains, but the advantage is that they are a single stack, which gives a well-integrated set of capabilities and should lead to significantly faster time to value. Examples of these tools include Incorta, Birst, Domo, Pentaho and others. I have had particularly good experience with Incorta, which has been very good at operational reporting from large, complicated on-premise and cloud data sources and joining them at scale. I have written about the Incorta architecture and capabilities here.

2. Use best-of-breed software to create the operational reporting stack

A best-of-breed approach can also be taken to create the operational reporting stack. In this case, you will have to choose an ETL/ELT tool, including a CDC mechanism, which can get near real-time data into the target database quickly and without much transformation along the way. I have talked about how to enable real-time analytics in Snowflake here in detail. The transformation happens after the data is loaded into the target database. An ETL/ELT tool like Matillion, Stitch, FiveTran and others can be used to move the data into the target database. In today’s world it makes sense for the target database to be one of the major cloud platforms like Snowflake, Redshift, or Google BigQuery. If you are using a tool like Stitch or FiveTran, which are primarily data ingestion tools, then using dbt to transform the data will be very useful. Matillion does both ingestion and transformation as well as orchestration.
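As an illustration of the ELT pattern, once Stitch or FiveTran has landed raw tables in Snowflake, a dbt model is just a SELECT statement that dbt materializes as a view or table. A minimal sketch with hypothetical source and column names:

    -- models/operational/orders_enriched.sql (hypothetical dbt model)
    -- dbt materializes this SELECT as a view or table in the target schema
    select
        o.order_id,
        o.order_date,
        c.customer_name,
        c.customer_segment,
        o.order_amount * fx.rate_to_usd as order_amount_usd
    from {{ source('erp_raw', 'orders') }}       o
    join {{ source('crm_raw', 'customers') }}    c  on c.customer_id = o.customer_id
    join {{ source('finance_raw', 'fx_rates') }} fx on fx.currency  = o.currency
                                                   and fx.rate_date = o.order_date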

This stack works best if all the major data sources are in the cloud. If there are large volumes of on-premise data that need to be loaded several times a day, I don’t think this will be a very robust data pipeline. The biggest advantage is that you can use your BI tool of choice for operational, analytical and real-time reporting. You will have to evaluate your data sources, volumes and frequency of loads to make sure this is the best architecture for your operational reporting.

Analytical Reporting

Analytical reporting has the following characteristics –

1. Analyze data trends over months, quarters or years.

2. Frequency of data loads can be daily, weekly or monthly from various sources. No need for loading several times a day.

3. The data is usually stored for the long term in the form of snapshots.

4. Data is summarized to a more aggregate form.

5. Data can come from multiple sources.

6. Analysts and users prefer to use their favorite BI tools. The end users skew more towards senior management.

The proposed architecture for analytical reporting is not too different from the best-of-breed stack for operational reporting. This architecture should address all six major characteristics of analytical reporting.

The important distinctions between operational reporting and analytical reporting are the reduced frequency of data loads, the use of snapshots and aggregations to store data for the long term, and the use of the best self-service data visualization tool that your organization has access to. The modern cloud data warehouses like Snowflake, Redshift and Google BigQuery are well suited for this purpose as they perform fast, provide cheap storage for snapshots and data retention, and can be accessed by a wide variety of BI tools for visualization.
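As a minimal sketch with hypothetical table names, a monthly snapshot job in a cloud warehouse like Snowflake can simply append an aggregated slice of the operational data tagged with a snapshot date:

    -- Append a month-end snapshot of summarized bookings (hypothetical tables)
    INSERT INTO analytics.bookings_monthly_snapshot
    SELECT DATE_TRUNC('month', CURRENT_DATE) AS snapshot_month,
           region,
           product_line,
           SUM(order_amount_usd)             AS total_bookings_usd,
           COUNT(DISTINCT order_id)          AS order_count
    FROM   operational.orders_enriched
    GROUP BY 1, 2, 3;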

Summary

When you look to modernize the data architecture for your company, it might be helpful to classify your current BI landscape into real-time, operational and analytical reporting buckets. This will give clarity on how significant each of these use models is for your company. Once that is done you can design an architecture for each of them, migrate your existing reports and analyses onto one of these three architectures, and then continue using these platforms for your future requirements.

I would love feedback on whether thinking about your BI landscape in this way using data latency requirements is useful to you and your organization. I look forward to your comments.

How to enable real-time analytics in Snowflake with data from On-Premise databases

Snowflake is rapidly becoming the leading cloud data warehouse platform for all kinds of analytics applications. I have written about why Snowflake is a great option for moving on-premise data warehouses to the cloud. While moving to Snowflake has big advantages, like no need for a server footprint, DBAs, tuning, backup and recovery, it still needs a well-thought-out data pipeline from your source systems.

A chance to simplify and reduce latency of your data pipelines

If your company has been using a central data warehouse for several years, chances are your BI team uses a traditional tool like Informatica to extract, transform and load (ETL) data from one or several on-premise databases into a data warehouse built on Oracle, SQL Server or Teradata. These ETL processes have been built and enhanced over many years and are probably very complex to maintain or change now. They were optimized for performance on traditional databases whose core technology was written decades ago. The on-premise data warehouse typically gets loaded once or twice a day, with loads that run for several hours each and usually cause downtime for business users. This method of ETL is completely outdated for today’s global 24×7 operations and real-time data requirements. Now that you have decided to move to Snowflake as your data warehouse, you should rethink your data pipeline completely.

CDC based real-time data integration into Snowflake

Change Data Capture + Streaming as the data pipeline strategy

In the last two years, there has been tremendous development of the partner and technology ecosystem around Snowflake. There are now more than a dozen tools which enable real-time or batch replication to Snowflake from on-premise databases. The tools have become sophisticated enough to replicate only selected tables, and within those tables, only selected columns. This way you can replicate only the data needed for reporting and not the entire on-premise database to Snowflake.

The replication tools read the logs generated by the on-premise databases, which contain an encoded entry for each insert, update or delete that occurs in any table in the source database. For example, if your source database is Oracle, then these tools can continuously replicate one or many Oracle tables to Snowflake. As a first step, the tables to be replicated need to be created in Snowflake. Then the existing data needs to be loaded into the corresponding tables in Snowflake. From that point on, the tool can read new changes using LogMiner and continuously replicate them to Snowflake.
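Conceptually, each tool applies the captured changes with logic similar to a Snowflake MERGE. A simplified sketch, assuming the tool stages the change rows with an operation flag (all object and column names here are hypothetical):

    -- Apply a batch of captured changes to the replicated table (simplified sketch)
    MERGE INTO replicated.orders t
    USING staging.orders_changes s                 -- rows captured from the Oracle redo logs
       ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED AND s.op IN ('I', 'U') THEN UPDATE SET
         t.order_status = s.order_status,
         t.order_amount = s.order_amount,
         t.last_updated = s.change_ts
    WHEN NOT MATCHED AND s.op IN ('I', 'U') THEN INSERT
         (order_id, order_status, order_amount, last_updated)
         VALUES (s.order_id, s.order_status, s.order_amount, s.change_ts);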

Advantages of using the CDC approach of loading data into Snowflake are:

1. Real-time continuous data load into Snowflake, without the need for specific time-based data loads which used to run for hours

2. What you see in the source database is what you get in Snowflake. It is easy to verify the data quality and accuracy compared to traditional ETL

3. Uses the power of Snowflake to do the complex transformations rather than the ETL tool. Data can be processed many times faster than before (see the sketch after this list).

4. Can capture deletes in the source and not require a full load to clean out the deletes.
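For point 3 above, Snowflake’s own streams and tasks can pick up newly replicated rows and transform them inside Snowflake on a schedule. A minimal sketch with hypothetical object names:

    -- Track new rows arriving in the replicated table
    CREATE OR REPLACE STREAM replicated.orders_stream ON TABLE replicated.orders;

    -- Every 5 minutes, push any new rows through the reporting transformation
    CREATE OR REPLACE TASK transform_orders
      WAREHOUSE = transform_wh
      SCHEDULE  = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('replicated.orders_stream')
    AS
      INSERT INTO rpt.orders_enriched
      SELECT order_id, order_status, order_amount, last_updated
      FROM   replicated.orders_stream
      WHERE  METADATA$ACTION = 'INSERT';

    ALTER TASK transform_orders RESUME;   -- tasks are created in a suspended state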

What are the tools available in the market?

There are plenty of tools now available to replicate on-premise data to Snowflake, and more are being added as Snowflake becomes very popular in enterprises large and small. It is important for companies to evaluate these tools on various vectors, including capability, performance, pricing, support, data transformation features, roadmap, and the long-term viability of the vendor, among other factors. You need to look at all the capabilities of a tool holistically so that your company can standardize on one tool instead of supporting multiple tools for different needs.

Here are some leading tools that are available in the market which can replicate on-premise databases to Snowflake using CDC –

1. Striim

2. Attunity (bought by Qlik) — now called Qlik Replicate

3. Stitch (bought by Talend)

4. FiveTran

5. Matillion

6. Streamsets

The Matillion approach seems a lot more complicated than the others, but the product has a lot of traction and is completely cloud based with attractive usage-based pricing. So, I think it is worth a look.

There are other tools which come up with a simple Google search – CData, Bryteflow, HVR Software, ETLWorks, Diyotta. I’m sure a deeper search will turn up more.

In summary, the full power of Snowflake can be unlocked by enabling real-time data replication from on-premise databases to Snowflake. There are a lot of highly capable tools available in the market. With careful evaluation and a solid POC, you can enable your business users to gain real-time insights into their operations with the power of Snowflake, and in the process become the data hero of your company!

How to set up your Snowflake environment when moving on-premise databases to the cloud

In on-premise database environments like Oracle and SQL Server there will usually be multiple physical servers and in each of them there will be multiple databases. For example, in a typical on-premise Oracle/SQL Server data warehouse environment, companies will have 3 separate sets of physical servers — one each for development, test and production.

In each of these physical servers, multiple databases will be created. Each of these databases will be built for a specific purpose: for example, one could be a financial systems warehouse (FIN_DW) with data pulled from the ERP system, another could be an HR warehouse (HR_DW) with data pulled from HR systems, and a third could hold data from CRM systems like Salesforce, which could be called CRM_DW. In each of the databases there can be multiple schemas, and in each schema there can be multiple tables, views and other objects.

So, in total you could have 3 databases per server looking like this in your on-premise environment –

DEV/TEST/PROD Physical Server

In an on-premise environment the following picture depicts a typical hierarchy of objects

How Snowflake works

When a company signs up with Snowflake they are given a URL like

https://companyname.snowflakecomputing.com/

In Snowflake, a database is the highest-level object; inside a database there can be multiple schemas, and inside a schema there can be multiple tables and views. In other words, Snowflake does not have the concept of separate Dev, Test or Production physical servers.

In Snowflake, the following picture depicts a typical hierarchy of objects
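In SQL terms, that hierarchy is created directly, for example (names are illustrative):

    CREATE DATABASE fin_dw;                     -- top-level object in Snowflake
    CREATE SCHEMA   fin_dw.gl;                  -- schema inside the database
    CREATE TABLE    fin_dw.gl.journal_lines (   -- table inside the schema
        journal_id   NUMBER,
        account_code VARCHAR,
        amount_usd   NUMBER(18,2),
        posted_date  DATE
    );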

How to organize your on-premise databases in Snowflake

Given that the Snowflake environment is “one level lower” in terms of objects (there is no concept of dev, test and production physical servers), how do you organize Snowflake to match your on-premise setup? There are two ways to go about this –

1. Keep your top-level object as an individual database when you migrate

In this method, you will create as many databases in Snowflake as the number of on-premise physical servers multiplied by the number of databases in each of them.

In the above example you will create 9 databases in Snowflake –

CRM_DW_DEV

HR_DW_DEV

FIN_DW_DEV

CRM_DW_TEST

HR_DW_TEST

FIN_DW_TEST

CRM_DW_PROD

HR_DW_PROD

FIN_DW_PROD
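Creating these in Snowflake is just nine CREATE DATABASE statements, for example:

    CREATE DATABASE fin_dw_dev;
    CREATE DATABASE fin_dw_test;
    CREATE DATABASE fin_dw_prod;
    -- ...and likewise for HR_DW and CRM_DW in each environment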

This method will work fine if you have a small number of on-premise servers with a small number of databases in each of them. But your company could have 4–5 physical servers (Sandbox, Dev, Test, Production, etc.) with 10–20 databases in each of them. You can imagine how the number of databases can proliferate in Snowflake; in this example you’re looking at anywhere between 40 and 100 databases.

You will have to maintain all these databases within Snowflake and assign security and roles to each of them. In addition, in my opinion, maintaining that many databases over the long term leads to a very confusing and cluttered environment.

One of the big issues I see is that production servers normally have a higher degree of security and access control than dev or test servers. In the on-premise world, the server and the databases in the production environment are audited and are under SOX control. In Snowflake, if you end up having 10–20 production databases without an umbrella physical server, it will become difficult to report the internal controls to the audit team.

2. Create as many “dummy databases” as there are on-premise physical servers

In this method you create 3 databases in Snowflake at the top level –

1. Development

2. Test

3. Production

This will represent the 3 physical servers in your on-premise environment. Then you can create the 3 on-premise databases (CRM_DW, HR_DW, FIN_DW) as schemas inside these 3 databases. If an on-premise database has multiple schemas, you can create multiple schemas inside these databases. For example, if CRM_DW has 2 schemas called Marketing_Schema and Sales_Schema, you can create these as 2 separate schemas, CRM_DW_Marketing_Schema and CRM_DW_Sales_Schema, under the Development, Test and Production databases. The respective tables and views can then be created under each of these schemas.
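As a sketch, option 2 looks like this in SQL (schema names are illustrative):

    CREATE DATABASE production;
    CREATE SCHEMA   production.crm_dw_marketing_schema;
    CREATE SCHEMA   production.crm_dw_sales_schema;
    CREATE SCHEMA   production.hr_dw;
    CREATE SCHEMA   production.fin_dw;
    -- repeat the same schemas under the DEVELOPMENT and TEST databases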

The advantage I see in this method is that you have a more structured way of looking at your Snowflake environment. You will have a Development, Test and Production database, and all the schemas and tables that belong to each will sit inside these databases. You can put a greater level of security control on the Production database and will be able to prove to your auditors that you have controls similar to those on the production on-premise server.

The only downside I see to this approach is the case where you have a lot of schemas under a database in your on-premise environment. In this case you will just have to rename your schemas, for example by prefixing them with the database name, to distinguish them.

Summary

Before moving your on-premise data warehouses to Snowflake, it is necessary to put some thought into how you want to organize your Snowflake environment. Since you don’t have the concept of physical development, test or production servers, you can try to mimic it by using option 2 above. Option 2 will work well if you have a lot of databases inside each physical server and fewer schemas in each database. If you have a lot of schemas in each database and fewer databases in each physical server, then option 1 might be better suited for your case.

Why Snowflake is a great option to move your on-premise Oracle, SQL Server and Teradata data warehouses to the cloud

As your key transactional systems like Salesforce, Workday and NetSuite move to the cloud, it makes sense to move your data warehouse to the cloud as well. One of the fastest growing databases in the cloud is Snowflake, and it is rapidly gaining customers who want to move their analytic workloads from on-premise Oracle, SQL Server or Teradata data warehouses to the cloud.

Snowflake is a serverless, cloud-only database, which means you can use as little or as much as you need of various resources like CPU, memory and storage. It is not limited by the capacity of the server it is on. We have recently started several projects to move on-premise Oracle and SQL Server data warehouses to Snowflake. I will write about our experience in this journey in future articles.

To begin with, I would like to go over why I think Snowflake is a viable cloud database to replace legacy databases which have been in use for years. There have been a lot of articles written on the technical capabilities of Snowflake. You can read them here – Article 1, Article 2 – and find many more if you search the web or read the Snowflake blog.

What I want to emphasize below is how it will help the technical and functional BI and analytics practitioners.

1. Freedom from DBAs – You can create and destroy databases as and when you please. This is very important as in the analytics world you end up starting a lot of projects without knowing how large the data volume will be and how many users will end up using your solution.

The provisioning of on-premise databases is totally dependent on CPU, memory and storage. So, before you talk to the DBA team you need to estimate exactly how much of each is needed. Then you must wait for a DBA to create the database and give it the right resources, after you justify why you need that much CPU, memory and storage. In the meantime, you just want to load your data and start modeling and analyzing it. Snowflake helps you be your own DBA. Imagine that! It also simplifies backup and recovery, as well as cloning your databases to create new environments. It does need some budget to get started, but it’s not much, and you have to deal with far less red tape and politics.

2. It’s “big data” without the complexity – For the last several years there has been so much talk about Hadoop, Hive, HBase, Spark and a whole bunch of Apache projects which promised a lot of performance and cost improvements over traditional databases. The problem was that they moved the discussion of BI and analytics to an extremely technical level, losing sight of the whole purpose of the effort, which is to let end users analyze the data and gain insights. It turned out that for a lot of companies the traditional Hadoop based “big data” was overkill. They took on something which was very complex to install, develop and run, and the business users lost patience and moved on by building their own custom solutions.

A lot of companies used big data platforms to store CSV, XML and JSON files in addition to relational data. All of this can be easily accomplished with Snowflake without any of the complexity of maintaining a big data platform, writing MapReduce, etc. (see the sketch after this list).

3. SQL Compatibility – There have been a lot of databases that came to market promising very high performance at low cost. The problem with most of these is that they don’t provide full relational SQL capabilities like sub-queries, union queries, inline SQL, etc. When you are trying to replace a traditional Oracle or SQL Server database there will be a lot of usage of these SQL features. You would end up rewriting this logic in some other way, leading to a tremendous increase in the complexity of your migration effort. Snowflake from the beginning was built to support all the capabilities of relational SQL. I have found this to be mostly true, with some gaps which Snowflake is actively closing with every release.
4. BI Tools support – Snowflake has partnerships with almost all major BI tool vendors like Tableau, Qlik, MicroStrategy, Looker, Spotfire and many others. All these tools have native connectors to Snowflake, which gives immediate connectivity to the data that is loaded into Snowflake.
5. Cloud Connectors – Snowflake is built for Extract, Load and Transform (ELT) data integrations. The best way to load data into Snowflake is to load it in raw form from the source system and then transform it within Snowflake, whether the source is files, cloud systems like Salesforce or on-premise applications. There is a rapidly developing ecosystem of partners like FiveTran, Stitch and Matillion that are building native connectors to popular cloud applications like Salesforce. These tools sync the data from Salesforce and other systems and then allow developers to transform it the way the business needs require.
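As an example of the semi-structured data point above, Snowflake can land raw JSON into a VARIANT column and then query and transform it with plain SQL, with no Hadoop cluster involved. A minimal sketch, with hypothetical stage, table and field names:

    -- Land raw JSON files from a stage into a single VARIANT column
    CREATE TABLE raw.salesforce_events (payload VARIANT);

    COPY INTO raw.salesforce_events
      FROM @raw.salesforce_stage
      FILE_FORMAT = (TYPE = 'JSON');

    -- Transform inside Snowflake: flatten the JSON into a relational view
    CREATE OR REPLACE VIEW rpt.salesforce_opportunities AS
    SELECT payload:Id::STRING        AS opportunity_id,
           payload:Amount::NUMBER    AS amount,
           payload:StageName::STRING AS stage_name,
           payload:CloseDate::DATE   AS close_date
    FROM   raw.salesforce_events;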

In summary, if you are looking to move your data warehouse to the cloud Snowflake is one of the best options out there due to its serverless architecture, full SQL support, ease of maintenance and strong partnerships with BI and ETL tools.