How to enable real-time analytics in Snowflake with data from On-Premise databases

Snowflake is rapidly becoming the leading cloud data warehouse platform for a wide range of analytics applications. I have written about why Snowflake is a great option for moving on-premise data warehouses to the cloud. While moving to Snowflake brings big advantages, such as eliminating the need for a server footprint, DBAs, tuning, and backup and recovery, it still needs a well-thought-out data pipeline from your source systems.

A chance to simplify and reduce latency of your data pipelines

If your company has been using a central data warehouse for several years, chances are your BI team uses a traditional tool like Informatica to extract, transform and load (ETL) data from one or more on-premise databases into a data warehouse built on Oracle, SQL Server or Teradata. These ETL processes have been built and enhanced over many years and are probably very complex to maintain or change now. They were optimized for performance on traditional databases whose core technology was written decades ago. The on-premise data warehouse typically gets loaded once or twice a day, in batch jobs that run for several hours each and usually cause downtime for business users while the load runs. This method of ETL is completely outdated for today’s global 24×7 operations and real-time data requirements. Now that you have decided to move to Snowflake as your data warehouse, you should rethink your data pipeline completely.

CDC based real-time data integration into Snowflake

Change Data Capture + Streaming as the data pipeline strategy

In the last two years, there has been tremendous development of the partner and technology ecosystem around Snowflake. There are now more than a dozen tools that enable real-time or batch replication to Snowflake from on-premise databases. These tools have become sophisticated enough to replicate only selected tables, and within those tables, only selected columns. This way you can replicate only the data needed for reporting rather than the entire on-premise database.

The replication tools read the transaction logs generated by the on-premise database, which contain an encoded entry for every insert, update or delete that occurs in any table in the source database. For example, if your source database is Oracle, these tools can continuously replicate one or many Oracle tables to Snowflake. As a first step, the tables to be replicated need to be created in Snowflake. Then the existing data needs to be loaded into the corresponding Snowflake tables. From that point on, the tool can read new changes using LogMiner and continuously replicate them to Snowflake.
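
To make the mechanics concrete, here is a minimal Python sketch of how a log-based CDC pipeline applies change events to a replica after the initial load. The event and table shapes here are hypothetical illustrations, not any vendor's actual API.

```python
# Minimal sketch of log-based CDC: after an initial load, each insert/update/
# delete event from the source's log is replayed in order against the replica.
# The dict-backed "table" and event format are made up for illustration.

def apply_change(table, event):
    """Apply one change event (keyed by primary key) to a dict-backed replica."""
    op, pk, row = event["op"], event["pk"], event.get("row")
    if op == "insert":
        table[pk] = row
    elif op == "update":
        table[pk] = {**table.get(pk, {}), **row}  # merge changed columns only
    elif op == "delete":
        table.pop(pk, None)  # deletes are captured, no full reload needed
    return table

# Initial load: copy existing rows, then replay the change stream in order.
replica = {1: {"name": "Acme", "region": "EU"}}
events = [
    {"op": "insert", "pk": 2, "row": {"name": "Globex", "region": "US"}},
    {"op": "update", "pk": 1, "row": {"region": "APAC"}},
    {"op": "delete", "pk": 2},
]
for e in events:
    apply_change(replica, e)

print(replica)  # {1: {'name': 'Acme', 'region': 'APAC'}}
```

In a real pipeline the same replay logic is typically expressed as a `MERGE` statement against the Snowflake target, but the ordering guarantee is the key idea.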

Advantages of using the CDC approach of loading data into Snowflake are:

1. Real-time, continuous data loading into Snowflake, eliminating the scheduled batch loads that used to run for hours

2. What you see in the source database is what you get in Snowflake. It is easy to verify data quality and accuracy compared to traditional ETL

3. Complex transformations run on Snowflake's compute rather than in the ETL tool, so data can be processed many times faster than before.

4. Deletes in the source are captured as they happen, with no need for a full reload to clean them out.

What are the tools available in the market?

There are plenty of tools now available to replicate on-premise data to Snowflake, and more are being added as Snowflake gains popularity in enterprises large and small. It is important to evaluate these tools on dimensions such as capability, performance, pricing, support, data transformations, roadmap, and the vendor's long-term viability. Look at all the capabilities of each tool holistically so that your company can standardize on one tool instead of supporting multiple tools for different needs.

Here are some leading tools that are available in the market which can replicate on-premise databases to Snowflake using CDC –

1. Striim

2. Attunity (bought by Qlik) — now called Qlik Replicate

3. Stitch (bought by Talend)

4. FiveTran

5. Matillion

6. Streamsets

The Matillion approach seems a lot more complicated than the others, but the product has a lot of traction and is completely cloud-based with attractive usage-based pricing, so I think it is worth a look.

There are other tools that come up in a simple Google search: CData, BryteFlow, HVR Software, ETLWorks, Diyotta. I’m sure a deeper search will surface more.

In summary, the full power of Snowflake can be unlocked by enabling real-time data replication from on-premise databases to Snowflake. There are a lot of highly capable tools available in the market. With careful evaluation and a solid POC, you can enable your business users to gain real-time insights into their operations with the power of Snowflake, and in the process become the data hero of your company!

How to set up your Snowflake environment when moving on-premise databases to the cloud

In on-premise database environments like Oracle and SQL Server, there will usually be multiple physical servers, each hosting multiple databases. For example, in a typical on-premise Oracle/SQL Server data warehouse environment, companies will have 3 separate sets of physical servers, one each for development, test and production.

In each of these physical servers, multiple databases will be created, each built for a specific purpose. For example, one could be a financial data warehouse (FIN_DW) with data pulled from the ERP system, another an HR warehouse (HR_DW) with data pulled from HR systems, and a third, CRM_DW, with data from CRM systems like Salesforce. In each database there can be multiple schemas, and in each schema there can be multiple tables, views and other objects.

So, in total you could have 3 databases per server looking like this in your on-premise environment –

DEV/TEST/PROD Physical Server

In an on-premise environment the following picture depicts a typical hierarchy of objects

How Snowflake works

When a company signs up with Snowflake, they are given a URL like

https://companyname.snowflakecomputing.com/

In Snowflake, a database is the highest level: inside a database there can be multiple schemas, and inside a schema there can be multiple tables and views. In other words, Snowflake does not have a physical-server level like the Dev, Test or Production servers on-premise.
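
As a small illustration of that three-level hierarchy, here is a hypothetical Python helper that builds the fully qualified database.schema.table name used to address an object in Snowflake; the object names are made up for the example.

```python
# Snowflake's object hierarchy below the account URL is three levels:
# database -> schema -> table/view. A hypothetical helper to build the
# fully qualified name; the example names are illustrative.

def qualified_name(database: str, schema: str, table: str) -> str:
    """Build the fully qualified name used to address an object in Snowflake."""
    return f"{database}.{schema}.{table}"

print(qualified_name("FIN_DW", "PUBLIC", "GL_BALANCES"))
# FIN_DW.PUBLIC.GL_BALANCES
```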

In Snowflake, the following picture depicts a typical hierarchy of objects

How to organize your on-premise databases in Snowflake

Given that the Snowflake environment is “one level lower” in terms of objects (there is no concept of dev, test and production physical servers), how do you organize the Snowflake system to match your on-premise setup? There are 2 ways to go about this –

1. Keep your top-level object as an individual database when you migrate

In this method, you will create as many databases in Snowflake as the number of on-premise physical servers multiplied by the number of databases in each of them.

In the above example you will create 9 databases in Snowflake –

CRM_DW_DEV

HR_DW_DEV

FIN_DW_DEV

CRM_DW_TEST

HR_DW_TEST

FIN_DW_TEST

CRM_DW_PROD

HR_DW_PROD

FIN_DW_PROD

This method will work fine if you have a small number of on-premise servers with a small number of databases in each of them. But your company could have 4–5 physical servers (Sandbox, Dev, Test, Production, etc.) with 10–20 databases in each. You can imagine how the number of databases can proliferate in Snowflake: in this example you’re looking at anywhere between 40 and 100 databases.
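
The multiplication is easy to see in a short Python sketch; the names mirror the example above and the counts are illustrative.

```python
from itertools import product

# Option 1 sketch: one Snowflake database per (on-prem database, environment)
# pair, so the count grows multiplicatively with servers and databases.
onprem_databases = ["CRM_DW", "HR_DW", "FIN_DW"]
environments = ["DEV", "TEST", "PROD"]

snowflake_dbs = [f"{db}_{env}" for db, env in product(onprem_databases, environments)]
print(len(snowflake_dbs))          # 9 databases for 3 servers x 3 databases each
print("CRM_DW_DEV" in snowflake_dbs)  # True
```

Swap in 5 environments and 20 databases and the same one-liner yields 100 Snowflake databases to secure and maintain.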

You will have to maintain all these databases within Snowflake and assign security and roles to each of them. In addition, in my opinion, maintaining that many databases will leave you with a very confusing and cluttered environment over the long term.

One of the big issues I see is that production servers normally have a higher degree of security and access control than dev or test servers. In the on-premise world, the server and the databases in the production environment are audited and are under SOX control. In Snowflake, if you end up with 10–20 production databases without an umbrella physical server, it will become difficult to report the internal controls to the audit team.

2. Create as many “dummy databases” as there are on-premise physical servers

In this method you create 3 databases in Snowflake at the top level –

1. Development

2. Test

3. Production

These will represent the 3 physical servers in your on-premise environment. Then you create the 3 on-premise databases (CRM_DW, HR_DW, FIN_DW) as schemas inside these 3 databases. If an on-premise database has multiple schemas, create one Snowflake schema for each, prefixed with the database name. For example, if CRM_DW has 2 schemas called Marketing_Schema and Sales_Schema, you can create them as CRM_DW_Marketing_Schema and CRM_DW_Sales_Schema under each of the Development, Test and Production databases. The respective tables and views can then be created under each of these schemas.
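
A quick Python sketch of this layout, using the example names above; the single PUBLIC schema assumed for HR_DW and FIN_DW is an illustration, not part of the original example.

```python
# Option 2 sketch: one Snowflake database per environment; each on-prem
# database becomes a schema-name prefix inside it. CRM_DW's two schemas
# follow the example in the text; PUBLIC for HR_DW/FIN_DW is assumed.
environments = ["DEVELOPMENT", "TEST", "PRODUCTION"]
onprem_schemas = {
    "CRM_DW": ["MARKETING_SCHEMA", "SALES_SCHEMA"],
    "HR_DW": ["PUBLIC"],
    "FIN_DW": ["PUBLIC"],
}

# Every environment database gets the same prefixed schema list.
layout = {
    env: [f"{db}_{schema}" for db, schemas in onprem_schemas.items() for schema in schemas]
    for env in environments
}
print(layout["PRODUCTION"])
```

Note the fixed top level: three databases regardless of how many on-premise databases exist, which is what makes environment-wide security grants simple.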

The advantage I see in this method is that you have a more structured way of looking at your Snowflake environment. You will have Development, Test and Production databases, and all the schemas and tables that belong to each environment will sit inside them. You can put a greater level of security control on the Production database and will be able to prove to your auditors that you have controls similar to those on the production on-premise server.

The only downside I see to this approach is the case where you have a lot of schemas under a database in your on-premise environment. In that case you will just have to prefix your schema names with the database name to distinguish them.

Summary

Before moving your on-premise data warehouses to Snowflake, it is necessary to put some thought into how you want to organize your Snowflake environment. Since you don’t have the concept of physical development, test or production servers, you can mimic it using option 2 above. Option 2 works well if you have a lot of databases inside each physical server and fewer schemas in each database. If you have a lot of schemas in each database and fewer databases in each physical server, then option 1 might be better suited for your case.

Why Snowflake is a great option to move your on-premise Oracle, SQL Server and Teradata data warehouses to the cloud

As your key transactional systems (Salesforce, Workday, NetSuite, etc.) move to the cloud, it makes sense to move your data warehouse there as well. One of the fastest-growing databases in the cloud is Snowflake, and it is rapidly gaining customers who want to move their analytics workloads from on-premise Oracle, SQL Server or Teradata data warehouses.

Snowflake is a serverless, cloud-only database, which means you can use as little or as much as you need of resources like CPU, memory and storage; it is not limited by the capacity of the server it runs on. We have recently started several projects to move on-premise Oracle and SQL Server data warehouses to Snowflake, and I will write about our experience in this journey in future articles.

To begin with, I would like to go over why I think Snowflake is a viable cloud database to replace legacy databases that have been in use for years. There have been a lot of articles written on the technical capabilities of Snowflake. You can read them here – Article 1, Article 2 – and find many more if you search the web or read the Snowflake blog.

What I want to emphasize below is how it will help the technical and functional BI and analytics practitioners.

1. Freedom from DBAs – You can create and destroy databases as and when you please. This is very important because in the analytics world you end up starting a lot of projects without knowing how large the data volume will be or how many users will end up using your solution.

The provisioning of on-premise databases is totally dependent on CPU, memory and storage. So, before you talk to the DBA team, you need to estimate exactly how much of each is needed. Then you must justify why you need that much CPU, memory and storage, and wait for a DBA to create the database and allocate the right resources. In the meantime, you just want to load your data and start modeling and analyzing it. Snowflake helps you be your own DBA. Imagine that! It also simplifies backup and recovery, as well as cloning your databases to create new environments. It does need some budget to get started, but it’s not much, and you deal with far less red tape and politics.

2. It’s “big data” without the complexity – For the last several years there has been so much talk about Hadoop, Hive, HBase, Spark and a whole bunch of Apache projects that promised big performance and cost improvements over traditional databases. The problem was that this moved the discussion of BI and analytics to an extremely technical level, losing sight of the whole purpose of the effort: letting end users analyze the data and gain insights. It turned out that for a lot of companies, traditional Hadoop-based “big data” was overkill. They took on something very complex to install, develop and run, and the business users lost patience and moved on by building their own custom solutions.

A lot of companies used big data platforms to store CSV, XML and JSON files in addition to relational data. All of this can be easily accomplished with Snowflake without any of the complexity of maintaining a big data platform, writing MapReduce jobs, etc.

3. SQL compatibility – There have been a lot of databases that came to the market promising very high performance at low cost. The problem with most of them is that they don’t provide full relational SQL capabilities like sub-queries, union queries, inline SQL, etc. When you are replacing a traditional Oracle or SQL Server database, there will be heavy usage of these SQL features, and you would have to rewrite that logic some other way, tremendously increasing the complexity of your migration effort. Snowflake was built from the beginning to support all the capabilities of relational SQL. I have found this to be mostly true, with some gaps that Snowflake is actively closing with every release.
4. BI tools support – Snowflake has partnerships with almost all major BI tool vendors, like Tableau, Qlik, MicroStrategy, Looker, Spotfire and many others. All these tools have native connectors to Snowflake, which give immediate connectivity to the data loaded into Snowflake.
5. Cloud connectors – Snowflake is built for Extract, Load and Transform (ELT) data integration. The best way to load data into Snowflake is to load it in raw format from the source system and then transform it within Snowflake, whether the source is files, cloud systems like Salesforce, or on-premise applications. A rapidly developing ecosystem of partners like FiveTran, Stitch and Matillion is building native connectors to popular cloud applications like Salesforce. These tools sync the data from Salesforce and other systems and then let developers transform it to meet business needs.
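
The ELT pattern itself can be sketched in a few lines of Python: land the raw records unchanged, then transform only what reporting needs afterwards. The field names and source shape below are assumptions for illustration, not any connector's real output.

```python
import json

# ELT sketch: the "load" step lands raw JSON as-is (what Snowflake's VARIANT
# column is typically used for), and the "transform" step reshapes it later
# with set-based logic. Record shapes are hypothetical.

raw_landing = [
    json.dumps({"Id": "001", "Name": "Acme", "AnnualRevenue": 1200000}),
    json.dumps({"Id": "002", "Name": "Globex", "AnnualRevenue": None}),
]

# Transform step: parse and keep only what reporting needs, applying
# business rules (here, treating a missing revenue as zero).
accounts = [
    {
        "account_id": r["Id"],
        "name": r["Name"],
        "revenue_musd": (r["AnnualRevenue"] or 0) / 1e6,
    }
    for r in map(json.loads, raw_landing)
]
print(accounts[0]["revenue_musd"])  # 1.2
```

The point is that no information is lost at load time; if the business rule changes, you re-run the transform against the raw landing data instead of re-extracting from the source.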

In summary, if you are looking to move your data warehouse to the cloud Snowflake is one of the best options out there due to its serverless architecture, full SQL support, ease of maintenance and strong partnerships with BI and ETL tools.