Snowflake is rapidly becoming the leading cloud data warehouse platform for all kinds of different analytics applications. I have written about why snowflake is a great option for moving on-premise data warehouses to the cloud. While moving to Snowflake has big advantages like no need for a server footprint, DBA, tuning, backup and recovery, it still needs a well thought out data pipeline from your source systems.
A chance to simplify and reduce latency of your data pipelines
If your company has been using a central data warehouse for several years, chances are your BI team uses a traditional tool like Informatica to extract, transform and load (ETL) data from one or several on-premise databases into the data warehouse built on Oracle, SQL Server or Teradata. These ETL processes would have been built and enhanced over many years and are probably very complex to maintain or change now. They were optimized for performance on traditional databases, the core of whose technology were written decades ago. The on-premise data warehouse will get loaded once or twice a day which run for several hours each and usually lead to downtime during loading for the business users. This method of ETL is completely outdated for today’s global 24×7 operations and real time data requirements. Now that you have decided to move to Snowflake as your data warehouse you should rethink your data pipeline completely.
Change Data Capture + Streaming as the data pipeline strategy
In the last two years, there has been tremendous development of the partner and technology ecosystem around Snowflake. Now there are more than a dozen tools which enable real time or batch replication to Snowflake from on-premise databases. The tools have become sophisticated enough to be able to replicate only some of the tables and even in these tables they can replicate only some of the columns. This way you can replicate only the data needed for reporting and not the entire on-premise database to Snowflake.
The replication tools look at the logs generated by the on-premise databases which contain an encoded entry for each insert, update or delete that occurs in any table in the source database. For example, if your source database is Oracle then these tools can continuously replicate one or many Oracle tables to Snowflake. As a first step the tables to be replicated needs to be created Snowflake. Then the existing data needs to be loaded into the corresponding tables in Snowflake. Then the tool can read new data using LogMiner and continuously replicate it to Snowflake.
Advantages of using the CDC approach of loading data into Snowflake are:
1. Real time continuous data load into Snowflake without the need for specific time based data loads which used to run for hours
2. What you see in the source database is what you get in Snowflake. It is easy to verify the data quality and accuracy compared to traditional ETL
3. Uses the power of Snowflake performance to do the complex transformations needed and not in the ETL tool. Data can be processed many times faster than before.
4. Can capture deletes in the source and not require a full load to clean out the deletes.
What are the tools available in the market?
There are plenty of tools now available to replicate on-premise data to Snowflake and more are getting added as Snowflake is getting very popular in enterprises large and small. It is important for companies to evaluate these tools on various vectors including capability, performance, pricing, support, data transformations, roadmap, long term viability of the company among other factors. You need to look at all the capabilities of the tool holistically so that your company can standardize on one tool instead of supporting multiple tools for different needs.
Here are some leading tools that are available in the market which can replicate on-premise databases to Snowflake using CDC –
2. Attunity (bought by Qlik) — now called Qlik Replicate
3. Stitch (bought by Talend)
The Matillion approach seems a lot more complicated than the others but the product has a lot of traction and completely cloud based with attractive usage basedpricing. So, I think it is worth a look.
There are other tools which are coming up based on a simple Google search — Cdata, Bryteflow, HVR Software, ETLWorks. Diyotta. I’m sure more will be there with a deeper search.
In summary, the full power of Snowflake can be unlocked by enabling a real-time data replication from on-premise databases to Snowflake. There are a lot of highly capable tools available in the market. With careful evaluation and a solid POC, you can enable your business users to gain real time insights into their operations with the power of Snowflake and in the process become the data hero of your company!