How to enable real-time analytics in Snowflake with data from on-premise databases

Snowflake is rapidly becoming the leading cloud data warehouse platform for all kinds of different analytics applications. I have written about why Snowflake is a great option for moving on-premise data warehouses to the cloud. While moving to Snowflake has big advantages, like eliminating the need for a server footprint, DBAs, tuning, and backup and recovery, it still needs a well-thought-out data pipeline from your source systems.

A chance to simplify and reduce latency of your data pipelines

If your company has been using a central data warehouse for several years, chances are your BI team uses a traditional tool like Informatica to extract, transform and load (ETL) data from one or several on-premise databases into a data warehouse built on Oracle, SQL Server or Teradata. These ETL processes have been built and enhanced over many years and are probably very complex to maintain or change now. They were optimized for performance on traditional databases, the core of whose technology was written decades ago. The on-premise data warehouse typically gets loaded once or twice a day, with each load running for several hours and usually causing downtime for business users. This method of ETL is completely outdated for today’s global 24×7 operations and real-time data requirements. Now that you have decided to move to Snowflake as your data warehouse, you should rethink your data pipeline completely.

CDC based real-time data integration into Snowflake

Change Data Capture + Streaming as the data pipeline strategy

In the last two years, there has been tremendous development of the partner and technology ecosystem around Snowflake. There are now more than a dozen tools which enable real-time or batch replication to Snowflake from on-premise databases. The tools have become sophisticated enough to replicate only selected tables, and within those tables, only selected columns. This way you can replicate only the data needed for reporting rather than the entire on-premise database to Snowflake.

The replication tools read the logs generated by the on-premise databases, which contain an encoded entry for each insert, update or delete that occurs in any table in the source database. For example, if your source database is Oracle, then these tools can continuously replicate one or many Oracle tables to Snowflake. As a first step, the tables to be replicated need to be created in Snowflake. Then the existing data needs to be loaded into the corresponding tables in Snowflake. From that point on, the tool can read new changes using LogMiner and continuously replicate them to Snowflake.
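
To make the mechanics concrete, below is a minimal sketch of what such a tool does under the hood, written in Python with the snowflake-connector-python package. The ORDERS table, the connection details and the change-event format are made up for the example; the commercial tools listed later handle all of this (plus ordering, transactions and error handling) for you.

```
# Minimal sketch of a CDC apply step into Snowflake. Hypothetical ORDERS table;
# real replication tools do this (and much more) internally.
import snowflake.connector

conn = snowflake.connector.connect(
    account="companyname", user="etl_user", password="...",
    warehouse="LOAD_WH", database="FIN_DW_DEV", schema="STAGING",
)
cur = conn.cursor()

# Step 1: create the target table in Snowflake, mirroring the source structure.
cur.execute("""
    CREATE TABLE IF NOT EXISTS ORDERS (
        ORDER_ID NUMBER, CUSTOMER_ID NUMBER,
        AMOUNT NUMBER(12,2), UPDATED_AT TIMESTAMP_NTZ
    )
""")

# Step 2: initial load of existing data (in practice a bulk COPY INTO from staged files).

# Step 3: apply each change read from the source database logs (e.g. via LogMiner).
def apply_change(op, row):
    """op is 'INSERT', 'UPDATE' or 'DELETE'; row is a dict of column values."""
    if op == "DELETE":
        cur.execute("DELETE FROM ORDERS WHERE ORDER_ID = %s", (row["ORDER_ID"],))
    else:
        # MERGE keeps inserts and updates idempotent if the same change is replayed.
        cur.execute("""
            MERGE INTO ORDERS t
            USING (SELECT %s AS ORDER_ID, %s AS CUSTOMER_ID,
                          %s AS AMOUNT, %s AS UPDATED_AT) s
            ON t.ORDER_ID = s.ORDER_ID
            WHEN MATCHED THEN UPDATE SET CUSTOMER_ID = s.CUSTOMER_ID,
                AMOUNT = s.AMOUNT, UPDATED_AT = s.UPDATED_AT
            WHEN NOT MATCHED THEN INSERT (ORDER_ID, CUSTOMER_ID, AMOUNT, UPDATED_AT)
                VALUES (s.ORDER_ID, s.CUSTOMER_ID, s.AMOUNT, s.UPDATED_AT)
        """, (row["ORDER_ID"], row["CUSTOMER_ID"], row["AMOUNT"], row["UPDATED_AT"]))
```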

Advantages of using the CDC approach of loading data into Snowflake are:

1. Real-time, continuous data loading into Snowflake without the need for scheduled batch loads that used to run for hours

2. What you see in the source database is what you get in Snowflake. Compared to traditional ETL, it is easy to verify data quality and accuracy

3. Complex transformations run inside Snowflake, using its processing power rather than the ETL tool's, so data can be processed many times faster than before (see the sketch after this list)

4. Deletes in the source are captured as well, so a full reload is not needed to clean them out
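
To illustrate point 3, here is a rough sketch of pushing a transformation down into Snowflake after the raw tables have landed. The schema, table and column names are made up for the example, and the connection details are placeholders.

```
# ELT-style transformation running inside Snowflake: raw replicated tables are
# aggregated into a reporting table using Snowflake's own compute rather than an
# ETL server. All object names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="companyname", user="etl_user", password="...",
    warehouse="TRANSFORM_WH", database="FIN_DW_DEV",
)
conn.cursor().execute("""
    CREATE OR REPLACE TABLE REPORTING.DAILY_SALES AS
    SELECT o.ORDER_DATE,
           c.REGION,
           SUM(o.AMOUNT) AS TOTAL_AMOUNT,
           COUNT(*)      AS ORDER_COUNT
    FROM STAGING.ORDERS o
    JOIN STAGING.CUSTOMERS c ON c.CUSTOMER_ID = o.CUSTOMER_ID
    GROUP BY o.ORDER_DATE, c.REGION
""")
```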

What are the tools available in the market?

There are plenty of tools now available to replicate on-premise data to Snowflake, and more are being added as Snowflake grows in popularity among enterprises large and small. It is important for companies to evaluate these tools on various vectors, including capability, performance, pricing, support, data transformations, roadmap and the long-term viability of the vendor, among other factors. You need to look at all the capabilities of a tool holistically so that your company can standardize on one tool instead of supporting multiple tools for different needs.

Here are some leading tools that are available in the market which can replicate on-premise databases to Snowflake using CDC –

1. Striim

2. Attunity (bought by Qlik) — now called Qlik Replicate

3. Stitch (bought by Talend)

4. Fivetran

5. Matillion

6. Streamsets

The Matillion approach seems a lot more complicated than the others, but the product has a lot of traction and is completely cloud based with attractive usage-based pricing. So, I think it is worth a look.

There are other tools which come up with a simple Google search — CData, BryteFlow, HVR Software, Etlworks, Diyotta. I’m sure a deeper search will turn up more.

In summary, the full power of Snowflake can be unlocked by enabling real-time data replication from on-premise databases to Snowflake. There are a lot of highly capable tools available in the market. With careful evaluation and a solid POC, you can enable your business users to gain real-time insights into their operations with the power of Snowflake, and in the process become the data hero of your company!

How to set up your Snowflake environment when moving on-premise databases to the cloud

In on-premise database environments like Oracle and SQL Server there will usually be multiple physical servers and in each of them there will be multiple databases. For example, in a typical on-premise Oracle/SQL Server data warehouse environment, companies will have 3 separate sets of physical servers — one each for development, test and production.

In each of these physical servers, multiple databases will be created. Each of these databases will be built for a specific purpose: for example, one could be a financial systems warehouse (FIN_DW) with data pulled from the ERP system, another could be an HR warehouse (HR_DW) with data pulled from HR systems, and a third could hold data from CRM systems like Salesforce and be called CRM_DW. In each of the databases there can be multiple schemas, and in each schema there can be multiple tables, views and other objects.

So, in total you could have 3 databases per server looking like this in your on-premise environment –

DEV/TEST/PROD physical server → FIN_DW, HR_DW and CRM_DW databases

In an on-premise environment, the typical hierarchy of objects is: physical server → database → schema → tables, views and other objects.

How Snowflake works

When a company signs up with Snowflake, it is given a URL like

https://companyname.snowflakecomputing.com/

In Snowflake, a database is the highest-level object; inside a database there can be multiple schemas, and inside a schema there can be multiple tables and views. In other words, Snowflake does not have a server concept like separate Dev, Test or Production physical servers.

In Snowflake, the typical hierarchy of objects is: database → schema → tables, views and other objects.
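
As a quick illustration of that hierarchy, the sketch below creates a database, a schema inside it and a table inside the schema. The object and connection names are examples only.

```
# Walking down the Snowflake object hierarchy: database -> schema -> table.
# Object and connection names are examples only.
import snowflake.connector

conn = snowflake.connector.connect(account="companyname", user="admin_user", password="...")
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS FIN_DW")
cur.execute("CREATE SCHEMA IF NOT EXISTS FIN_DW.GL")
cur.execute("""
    CREATE TABLE IF NOT EXISTS FIN_DW.GL.JOURNAL_ENTRIES (
        ENTRY_ID NUMBER, ACCOUNT VARCHAR, AMOUNT NUMBER(12,2), POSTED_AT TIMESTAMP_NTZ
    )
""")
```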

How to organize your on-premise databases in Snowflake

Given that the Snowflake environment is “one level lower” in terms of objects (there is no concept of dev, test and production physical servers), how do you organize your Snowflake account to match your on-premise setup? There are 2 ways to go about this –

1. Keep your top-level object as an individual database when you migrate

In this method, you will create as many databases in Snowflake as the number of on-premise physical servers multiplied by the number of databases in each of them.

In the above example (3 servers × 3 databases), you will create 9 databases in Snowflake –

CRM_DW_DEV

HR_DW_DEV

FIN_DW_DEV

CRM_DW_TEST

HR_DW_TEST

FIN_DW_TEST

CRM_DW_PROD

HR_DW_PROD

FIN_DW_PROD
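
Scripting these is straightforward. Here is a small sketch that generates the nine CREATE DATABASE statements from the list above; run the output through the Snowflake connector or paste it into a worksheet.

```
# Option 1: one Snowflake database per (environment, on-premise database) pair.
environments = ["DEV", "TEST", "PROD"]
onprem_databases = ["CRM_DW", "HR_DW", "FIN_DW"]

for env in environments:
    for db in onprem_databases:
        # 3 servers x 3 databases = 9 databases in total
        print(f"CREATE DATABASE IF NOT EXISTS {db}_{env};")
```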

This method will work fine if you have a small number of on-premise servers with a small number of databases in each of them. But your company could have 4–5 physical servers (Sandbox, Dev, Test, Production, etc.) with 10–20 databases in each of them. You can imagine how the number of databases can proliferate in Snowflake. In this example you’re looking at anywhere between 40 and 100 databases.

You will have to maintain all these databases within Snowflake and assign security and roles to each of them. In addition, in my opinion, maintaining this many databases will leave you with a very confusing and cluttered environment over the longer term.

One of the big issues I see is that production servers normally have a higher degree of security and access control than dev or test servers. In the on-premise world, the server and the databases in the production environment are audited and are under SOX control. In Snowflake, if you end up having 10–20 production databases without an umbrella physical server, it will become difficult to report the internal controls to the audit team.

2. Create as many “dummy databases” as there are on-premise physical servers

In this method you create 3 databases in Snowflake at the top level –

1. Development

2. Test

3. Production

These will represent the 3 physical servers in your on-premise environments. You can then create the 3 on-premise databases (CRM_DW, HR_DW, FIN_DW) as schemas inside each of these 3 databases. If an on-premise database has multiple schemas, you can create a schema for each of them, prefixed with the database name. For example, if CRM_DW has 2 schemas called Marketing_Schema and Sales_Schema, you can create them as CRM_DW_Marketing_Schema and CRM_DW_Sales_Schema under the Development, Test and Production databases. The respective tables and views can then be created under each of these schemas.
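
Here is a sketch of what that layout looks like as generated DDL, using the schema names from the example above; the single schema assumed for HR_DW and FIN_DW is a placeholder.

```
# Option 2: one Snowflake database per on-premise environment, with each on-premise
# database/schema pair flattened into a prefixed schema name.
environments = ["DEVELOPMENT", "TEST", "PRODUCTION"]
onprem_schemas = {
    "CRM_DW": ["MARKETING_SCHEMA", "SALES_SCHEMA"],
    "HR_DW":  ["HR_SCHEMA"],   # assumed single schema
    "FIN_DW": ["FIN_SCHEMA"],  # assumed single schema
}

for env in environments:
    print(f"CREATE DATABASE IF NOT EXISTS {env};")
    for db, schemas in onprem_schemas.items():
        for schema in schemas:
            print(f"CREATE SCHEMA IF NOT EXISTS {env}.{db}_{schema};")
```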

The advantage I see in this method is that you have a more structured way of looking at your Snowflake environment. You will have a Development, Test and Production database, and all the schemas and tables that belong to each will sit inside these databases. You can put a greater level of security control on the Production database and will be able to prove to your auditors that you have controls similar to those on the production on-premise server.
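
For example, a tighter, read-only grant structure on the Production database might look roughly like this; the role name is an assumption, and your auditors’ requirements will dictate the details.

```
# Sketch of read-only access to the Production database for business users.
# Role and database names are assumptions for the example.
grants = [
    "CREATE ROLE IF NOT EXISTS PROD_READONLY;",
    "GRANT USAGE ON DATABASE PRODUCTION TO ROLE PROD_READONLY;",
    "GRANT USAGE ON ALL SCHEMAS IN DATABASE PRODUCTION TO ROLE PROD_READONLY;",
    "GRANT SELECT ON ALL TABLES IN DATABASE PRODUCTION TO ROLE PROD_READONLY;",
    "GRANT SELECT ON FUTURE TABLES IN DATABASE PRODUCTION TO ROLE PROD_READONLY;",
]
for stmt in grants:
    print(stmt)
```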

The only downside I see to this approach is the case where you have a lot of schemas under a database in your on-premise environment. In this case you will just have to rename your schemas with the database name in front, as in the example above, to distinguish them.

Summary

Before moving your on-premise data warehouses to Snowflake, it is necessary to put some thought into how you want to organize your Snowflake environment. Since you don’t have the concept of physical development, test or production servers, you can mimic it by using option 2 above. Option 2 will work well if you have a lot of databases inside each physical server and fewer schemas in each database. If you have a lot of schemas in each database and fewer databases in each physical server, then option 1 might be better suited for your case.