How to create a BI Modernization strategy for your company, Part 1

How grouping your existing BI use cases by data latency requirements can help clarify your BI Strategy

Many companies are burdened by legacy BI platforms and are unable to deliver the right data to their users at the right time to run their business today. They know they need to modernize their BI platforms but do not know how to go about it. First, let’s look at what the goals of a BI Modernization strategy should be.

Here are the goals, in my view, for any BI Modernization strategy –

1. Speed of change — Probably the most important goal. The new platforms should be able to add new fields or add new data sources much faster than legacy platforms.

2. Combine data from multiple sources — This one was always difficult to achieve in legacy tools but is possible with modern tools.

3. Superior performance — Report performance and data load timings should be significantly improved.

4. Self Service — Enable business users and power users to build their own reports and dashboards without relying on IT and super users.

5. Support different data latency requirements — Real Time, Operational and Analytical Reporting are the three ways to look at data latency requirements

6. Reduce overall cost — BI software licensing models and the costs of compute and storage have come down enormously, leading to a significant reduction in costs compared to legacy BI if implemented correctly.

Grouping BI Use Models by Data Latency

In most companies, BI has evolved somewhat organically over many years or decades. Many different BI tools would have been used at different times, either brought in by acquisitions or chosen as the best tool available at the time. When companies try to replace their legacy tools, they often try to replace one tool with another. For example, their project would look like “we should move from OBIEE or Cognos to Tableau or Power BI” and/or “let’s move our data warehouse to Snowflake”. They then end up customizing or building solutions in Tableau/Power BI that those tools were not meant to handle, and the project runs into issues.

A better strategy would be to consider the use models of your current BI solution. One way to look at your current use model is by data latency requirements.

There are three ways to classify use models by data latency –

1. Real-Time Data

When users want to see the data as it is “right now” in the operational system, that is a real-time requirement. This kind of data need is common in the supply chain, finance, sales and call center areas. Supply chain users want to see which orders they need to ship right now; finance wants to see the GL balance at 12 AM at the end of the quarter; sales wants to see quota versus orders at frequent intervals at the end of a quarter; and call centers want to see call volume data in real time. None of these use cases can wait for the data warehouse refresh to happen. The best place to do this reporting is embedded analytics inside the ERP or CRM software, but the embedded reporting is often slow or does not offer the capabilities of a BI tool. Hence these requirements fall into the BI arena. The primary users of real-time reporting are operational line-level employees on the manufacturing floor, in shipping or in finance operations who use the data to perform day-to-day tasks. The BI user interface can be simple but effective for these needs, as the user is just trying to get the data they need at any given time.

2. Operational Reporting

Operational reporting is needed when the data being analyzed can be a few hours or a few days old. This kind of reporting is useful for analyzing order or call volume counts over the last few days, or how many marketing leads have been generated by various campaigns this month. The data needed for operational reporting is usually spread across multiple systems, and the reporting solution should be able to join it together. For example, the campaign data could be in Eloqua or Marketo, leads data might be in Salesforce and the order data might be in Oracle or SAP. The operational reporting solution should be able to bring the data from these three sources together to get a full picture of campaign effectiveness. In most companies, the bulk of BI requirements fall into this category, which is essential for running the business. The primary users of operational reporting are managers who analyze data to make decisions about their operations, such as how many trucks they need for today’s shipments or which campaigns to run today, based on the data and analysis obtained from their operational reporting.
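
The cross-source join described above can be sketched with an in-memory SQLite database standing in for data landed from the three systems; the table and column names (campaigns, leads, orders) are invented for illustration:

```python
import sqlite3

# Hypothetical, simplified copies of data landed from three sources:
# campaigns (Marketo), leads (Salesforce), orders (Oracle).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE campaigns (campaign_id INTEGER, name TEXT);
CREATE TABLE leads     (lead_id INTEGER, campaign_id INTEGER);
CREATE TABLE orders    (order_id INTEGER, lead_id INTEGER, amount REAL);

INSERT INTO campaigns VALUES (1, 'Spring Webinar'), (2, 'Trade Show');
INSERT INTO leads     VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO orders    VALUES (100, 10, 5000.0), (101, 12, 7500.0);
""")

# Tie the three sources together for a campaign-effectiveness view.
rows = conn.execute("""
SELECT c.name,
       COUNT(DISTINCT l.lead_id)  AS leads,
       COUNT(DISTINCT o.order_id) AS orders,
       COALESCE(SUM(o.amount), 0) AS revenue
FROM campaigns c
LEFT JOIN leads  l ON l.campaign_id = c.campaign_id
LEFT JOIN orders o ON o.lead_id     = l.lead_id
GROUP BY c.name
ORDER BY c.name
""").fetchall()

for name, leads, orders, revenue in rows:
    print(name, leads, orders, revenue)
```

In a real deployment the three tables would sit in different systems, which is exactly why the reporting solution has to do the joining.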

3. Analytical Reporting

This kind of reporting is useful for analyzing data trends over months, quarters or years. The data is usually stored for the long term, and frequently it is also stored as snapshots to capture the view of the data on a specific day of a month. The data is then usually summarized to give a long-term view of the various metrics. Some common examples are revenue by industry or region over many years, or how gross margin for a product has changed from last year to this year. In this type of reporting the frequency of data refreshes can be once a day, week or month, and the data is usually summarized and stored for trend reporting. The users here are analysts answering questions based on trends in the data, and executives and senior managers of various departments doing long-range planning of their business based on historical trends. The BI tool used here should be very user friendly and will need to enable self-service for the analysts.

Now if you look at your traditional BI environment, all three of these use models will be present in all of your BI tools. For example, OBIEE will be performing real-time, operational and analytical reporting, but so will Business Objects, Cognos and other tools, just for different departments or different source applications in your company. When you are trying to modernize the BI architecture, you can look across all your current BI applications and group the reports or dashboards serving these three use cases. Once they are grouped, you have a good understanding of how much your company uses each of these use models. With this analysis in hand, you can proceed to work out the architecture for each of the three use models.
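
The grouping exercise can be sketched as a simple classification by maximum tolerable data age; the thresholds (5 minutes, 3 days) and the report names are illustrative assumptions, not fixed rules:

```python
from datetime import timedelta

# Hypothetical inventory of existing reports with their maximum
# tolerable data age, gathered from the current BI tools.
reports = [
    ("Orders to ship now",       timedelta(0)),
    ("Call volume monitor",      timedelta(minutes=1)),
    ("Campaign effectiveness",   timedelta(hours=6)),
    ("Daily truck planning",     timedelta(days=1)),
    ("Revenue by region, 5 yrs", timedelta(days=30)),
]

def use_model(max_age):
    """Bucket a report into one of the three latency use models."""
    if max_age <= timedelta(minutes=5):
        return "Real-Time"
    if max_age <= timedelta(days=3):
        return "Operational"
    return "Analytical"

# Group the inventory so you can see how heavily each use model is used.
groups = {}
for name, max_age in reports:
    groups.setdefault(use_model(max_age), []).append(name)

for model, names in groups.items():
    print(model, len(names))
```

The output of such an exercise, run over every report in every legacy tool, is the input to the architecture decisions for the three use models.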

I will cover possible architecture for each of these use models in the following articles. Please stay tuned and let me know your comments on whether classifying your BI landscape into these use models makes sense for your company.

How to enable real-time analytics in Snowflake with data from On-Premise databases

Snowflake is rapidly becoming the leading cloud data warehouse platform for all kinds of analytics applications. I have written about why Snowflake is a great option for moving on-premise data warehouses to the cloud. While moving to Snowflake has big advantages, like no need for a server footprint, DBAs, tuning, or backup and recovery, it still needs a well-thought-out data pipeline from your source systems.

A chance to simplify and reduce latency of your data pipelines

If your company has been using a central data warehouse for several years, chances are your BI team uses a traditional tool like Informatica to extract, transform and load (ETL) data from one or several on-premise databases into a data warehouse built on Oracle, SQL Server or Teradata. These ETL processes have been built and enhanced over many years and are probably very complex to maintain or change now. They were optimized for performance on traditional databases whose core technology was written decades ago. The on-premise data warehouse gets loaded once or twice a day in loads that run for several hours each and usually cause downtime for business users. This method of ETL is completely outdated for today’s global 24×7 operations and real-time data requirements. Now that you have decided to move to Snowflake as your data warehouse, you should rethink your data pipeline completely.

CDC based real-time data integration into Snowflake

Change Data Capture + Streaming as the data pipeline strategy

In the last two years, there has been tremendous development of the partner and technology ecosystem around Snowflake. There are now more than a dozen tools that enable real-time or batch replication to Snowflake from on-premise databases. The tools have become sophisticated enough to replicate only some of the tables, and even within those tables, only some of the columns. This way you can replicate only the data needed for reporting rather than the entire on-premise database to Snowflake.

The replication tools read the logs generated by the on-premise databases, which contain an encoded entry for each insert, update or delete that occurs in any table in the source database. For example, if your source database is Oracle, these tools can continuously replicate one or many Oracle tables to Snowflake. As a first step, the tables to be replicated need to be created in Snowflake. Then the existing data needs to be loaded into the corresponding tables in Snowflake. After that, the tool can read new changes using LogMiner and continuously replicate them to Snowflake.
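
The replay of log entries can be sketched as a toy apply loop; the event format here is invented for illustration, and real replication tools emit their own formats:

```python
# A toy CDC "apply" loop: events as a replication tool might emit them
# after reading the source database's log. Table and column names are
# made up for illustration.
target = {}  # primary key -> row dict, standing in for the target table

def apply_event(event):
    op, key, row = event["op"], event["pk"], event.get("row")
    if op == "insert":
        target[key] = row
    elif op == "update":
        target[key].update(row)       # merge only the changed columns
    elif op == "delete":
        target.pop(key, None)         # CDC captures deletes too

# Initial rows, then a stream of changes replayed in log order.
events = [
    {"op": "insert", "pk": 1, "row": {"status": "BOOKED", "qty": 5}},
    {"op": "insert", "pk": 2, "row": {"status": "BOOKED", "qty": 3}},
    {"op": "update", "pk": 1, "row": {"status": "SHIPPED"}},
    {"op": "delete", "pk": 2},
]
for e in events:
    apply_event(e)

print(target)   # {1: {'status': 'SHIPPED', 'qty': 5}}
```

Because deletes arrive as events, the target stays in sync without periodic full reloads, which is advantage 4 below.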

Advantages of using the CDC approach of loading data into Snowflake are:

1. Real-time continuous data load into Snowflake, without the need for specific time-based data loads that used to run for hours

2. What you see in the source database is what you get in Snowflake. It is easy to verify the data quality and accuracy compared to traditional ETL

3. Uses the power of Snowflake to perform the complex transformations needed, rather than doing them in the ETL tool. Data can be processed many times faster than before.

4. Can capture deletes in the source without requiring a full load to clean out the deleted rows.

What are the tools available in the market?

There are plenty of tools now available to replicate on-premise data to Snowflake, and more are being added as Snowflake becomes popular in enterprises large and small. It is important for companies to evaluate these tools on various dimensions, including capability, performance, pricing, support, data transformations, roadmap, and the long-term viability of the vendor. Look at each tool’s capabilities holistically so that your company can standardize on one tool instead of supporting multiple tools for different needs.

Here are some leading tools that are available in the market which can replicate on-premise databases to Snowflake using CDC –

1. Striim

2. Attunity (bought by Qlik) — now called Qlik Replicate

3. Stitch (bought by Talend)

4. FiveTran

5. Matillion

6. Streamsets

The Matillion approach seems more complicated than the others, but the product has a lot of traction and is completely cloud based with attractive usage-based pricing. So, I think it is worth a look.

There are other tools which come up with a simple Google search — CData, BryteFlow, HVR Software, Etlworks and Diyotta. I’m sure a deeper search will turn up more.

In summary, the full power of Snowflake can be unlocked by enabling a real-time data replication from on-premise databases to Snowflake. There are a lot of highly capable tools available in the market. With careful evaluation and a solid POC, you can enable your business users to gain real time insights into their operations with the power of Snowflake and in the process become the data hero of your company!

How Incorta works and how it can fit into your enterprise BI strategy

Incorta is the new kid on the block in a very competitive enterprise business intelligence tools space. The company started around six years ago, and we have been using it successfully for operational reporting for the last 2.5 years.

This blog was written in October 2019 and represents Incorta features as of their 4.3.x version. Please keep in mind Incorta builds significant new capabilities in each of their quarterly releases and the product changes over time. I will try to add new articles to go over significant new capabilities in the platform.

Incorta has taken a very different approach to solve BI and analytics problems than any other vendor in the market. They are essentially four different integrated BI tools in one.

These are the –

1. Data Extraction Layer

2. Database layer

3. Data Modeling Layer

4. Data Visualization Layer

In addition to the above, Incorta has a very close integration with Spark, which ships in the same installation package. Spark integration provides two functions for Incorta –

1. Advanced calculations and joins

2. SQL Interface (SQLi) or Datahub for external BI tools to connect with Incorta

How Incorta Works

In a traditional BI environment, you have at least three tools (four if you have a separate semantic layer tool) to do the same set of tasks. You will have an ETL tool like Informatica or SSIS to load data into a database like Oracle or SQL Server, and then model and visualize the data in a BI tool like MicroStrategy, OBIEE or Tableau. Keep in mind that these are best-of-breed tools that have been around for a long time with tons of features.

Incorta does a lot of these tasks very well, but it cannot match the full feature set of these best-of-breed solutions in each category. As a combined package, though, it delivers a lot of capability for BI developers and users. But first let’s take a look at the various components of the Incorta platform.

Components of Incorta platform

Here is a very brief and simplified description of the various components of the Incorta platform and what is special or unique about them –

1. Data Extraction Layer

The first layer of the Incorta platform is the Data Extraction layer. Here they have built a technology called Direct Data Mapping, where you extract data table by table from the source system, either as a full load or as an incremental load. The data then goes into a Parquet file in the Incorta platform and then gets loaded into memory. Each table can have a full load and an incremental load SQL.

Full Load SQL can look as simple as the following –

SELECT COL_A, COL_B, COL_C FROM TABLE_1 WHERE CREATED_DATE >= '01-01-2016'

Incremental Load SQL can look like the following –

SELECT COL_A, COL_B, COL_C FROM TABLE_1 WHERE LAST_UPDATE_DATE > ?

When the load is kicked off, the table is populated by a full load, and each subsequent incremental run pulls just the changed data from the source. Incorta can pull data from almost all traditional databases like Oracle and SQL Server, cloud applications like Salesforce, and files stored on-premise or in cloud storage. The number of supported source applications increases with each release of the Incorta platform.
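
The full-load/incremental-load pattern above can be sketched with a watermark that decides which SQL the next run uses; the watermark handling here is a simplified assumption of how such tools track progress:

```python
from datetime import datetime

# Sketch of the full-load / incremental-load pattern described above.
# The table, columns, and watermark storage are illustrative only.
FULL_SQL = "SELECT COL_A, COL_B, COL_C FROM TABLE_1 WHERE CREATED_DATE >= '01-01-2016'"
INCR_SQL = "SELECT COL_A, COL_B, COL_C FROM TABLE_1 WHERE LAST_UPDATE_DATE > ?"

watermark = None  # highest LAST_UPDATE_DATE loaded so far

def next_extract():
    """Return the SQL (and bind value) for the next load run."""
    if watermark is None:
        return FULL_SQL, ()            # first run: full load
    return INCR_SQL, (watermark,)      # later runs: only changed rows

sql, params = next_extract()
print(sql, params)                     # full load on the first run

# After a successful run, advance the watermark so the next
# incremental load only picks up newer changes.
watermark = datetime(2019, 10, 1, 6, 0, 0)
sql, params = next_extract()
print(sql, params)                     # incremental load afterwards
```

Since the extract is one SQL per table, adding a column means editing that SQL and rerunning a full load, which is the speed-of-change advantage described below.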

After loading data from source systems into Parquet files, materialized views (MVs) can be created using the Incorta UI and written in SQL or PySpark. At run time Incorta sends the request to Spark to read the data from the Parquet files, compute the results and write them to a separate Parquet file, which then gets loaded into memory during the loading stage. The Spark component comes with the Incorta package but does require some setup to work properly.

The advantage of pulling table by table is that load performance is very fast, as there is no need to perform joins in the source system. A second advantage is that it becomes very easy to add new columns to Incorta: all you need to do is add the column to the extract SQL and run a full load. Hence a new column can be added in Incorta in a few minutes, compared to days or months using traditional ETL methods.

2. Database Layer

Once the data is extracted into Parquet files, it is loaded into the in-memory database engine, and the joins defined in the schema are precomputed at the end of the load. This precomputation of joins is what produces the revolutionary report performance in Incorta. Even if a report has 30 or 40 joins across large tables containing millions of rows, the data still comes back in a few seconds. This kind of performance is unheard of in a traditional database.
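
Why precomputing joins at load time makes query time fast can be illustrated with a toy hash index built once during loading; this is a conceptual sketch, not Incorta’s actual engine:

```python
# Toy illustration of precomputed joins: the expensive matching work
# happens once at load time, not on every report run.
orders    = [(1, "ACME"), (2, "GLOBEX"), (3, "ACME")]   # (order_id, customer)
shipments = [(10, 1), (11, 1), (12, 3)]                 # (ship_id, order_id)

# Load stage: precompute the join as a hash index, order_id -> ship_ids.
join_index = {}
for ship_id, order_id in shipments:
    join_index.setdefault(order_id, []).append(ship_id)

# Query stage: each joined row is now a cheap dictionary probe,
# so even a many-join report stays fast.
result = [(oid, cust, join_index.get(oid, [])) for oid, cust in orders]
print(result)
```

A report with 30 or 40 joins pays this lookup cost per join per row, which is why precomputed joins over in-memory data return in seconds.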

3. Data Modeling Layer

Incorta data modeling layer comes in two forms — physical schemas and business schemas.

In the physical schema you can create aliases to base tables and materialized views and create joins between various tables and MVs. Business schemas can be created to present a flat, user friendly representation of the physical schema to the user building the reports. Columns can be brought in from one or more tables in different physical schemas and renamed. Formula columns can be added either in physical or business schemas.

4. Visualization Layer

Incorta comes with its own visualization layer. Insights (reports) and dashboards can be created on top of physical or business schemas. Analyzer users can create insights with built-in filters, prompts and bookmarks that end users can then consume. Multiple types of visualizations are available, including charts, pivot tables and drill-downs.

Incorta visualization is good enough for most purposes, but compared to best-of-breed products like Tableau, Power BI or MicroStrategy it does fall a bit short. The Incorta development team is rapidly adding features in each release, and they should be able to catch up to the other tools in terms of UI capabilities sooner rather than later.

In addition, Incorta provides a SQL Interface through which other BI tools like Tableau or Power BI can connect to the business schemas, physical schemas or the Parquet layer.

Where does Incorta fit in your BI architecture

Incorta comes with several distinct advantages compared to traditional BI tools –

– revolutionary performance of reports

– no tuning or DB maintenance needed

– can hold massive amounts of data (billions of rows of data) in memory

– complex data modeling is possible

– simple to develop and make changes as reshaping of data is not needed

– easy to learn and get started for any developer

– a platform where all the four layers of BI are available

– cross-data-source joins are very easy to achieve

There are three kinds of reporting in any company — operational reporting, real-time reporting and analytical reporting (snapshots, period-over-period analysis). Incorta’s sweet spot is operational reporting, where data can be pulled from various transaction systems and joined together to provide lightning-fast reports. You can load data into Incorta several times a day from your source systems and provide the operational insights that users need to make daily decisions to run their business.

In summary, if you are looking for a solution that delivers lightning-quick operational reporting with data joined from multiple complex source systems, then Incorta is one of the best options out there today.

What to expect from the consolidation in the Business Intelligence Market

Why the Looker acquisition by Google makes sense and the Tableau acquisition by Salesforce does not

There were two very high-profile acquisitions in the month of June 2019 in the BI tools space. One was the acquisition of Looker by Google ($2.6 Billion) and another one is the acquisition of Tableau by Salesforce ($16 Billion).

Out of the two major acquisitions, one makes a lot of sense (Looker by Google) and the other one not so much (Tableau by Salesforce). Here is my view of these two acquisitions –

Looker acquisition by Google Cloud

Google has been investing heavily in the enterprise cloud area for the last several years. The scale and scope of that investment was expected to increase drastically when they hired away Thomas Kurian from Oracle. Google Cloud Platform (GCP) entered the enterprise BI market via Google BigQuery. BigQuery started as a purely cloud-based, serverless columnar database with a lot of features for the big data developer community. It was not really considered a true enterprise data warehouse supporting the full SQL syntax of Oracle, SQL Server and Teradata. Google probably realized over time how big the enterprise data warehouse market truly was by watching the success of Snowflake, and has since caught up in terms of full SQL support. BigQuery is now considered a fully functional cloud data warehouse to which companies can move their traditional data warehouse workloads.

In addition, Google was also building its own BI tool on GCP — Google Data Studio. Data Studio is a simple BI tool that primarily connects to BigQuery and other Google properties and provides reporting and visualizations. It was never a serious contender as an enterprise BI tool against the likes of Tableau and Qlik. GCP also announced several data integration tools this year — Cloud Dataflow and Cloud Data Fusion. Although these tools look very powerful, with lots of features, they do not look like a good fit for traditional ETL developers familiar with Informatica or SSIS. They are aimed more at modern data engineers in startups, which is a whole different, and less lucrative, market than traditional BI.

Not surprisingly, a significant development earlier this year, after Thomas Kurian joined, was the acquisition of Alooma. Alooma specializes in moving data from cloud application sources like Salesforce or Workday to cloud data warehouses like Snowflake and BigQuery, as well as syncing on-premise data sources to the cloud. Clearly the future focus for Alooma will be syncing more and more data sources to BigQuery. Google also announced an in-memory caching layer on top of BigQuery called BI Engine. BI Engine can cache data sets from BigQuery for extremely fast results and can compete with Tableau’s Hyper files or Qlik’s in-memory engine in terms of performance.

The missing piece in GCP’s data platform was an enterprise-level BI tool. Google Cloud was an early partner with Looker and would have had a ringside view of its popularity and growth. Looker’s big advantages were its all-cloud architecture, its semantic layer built with LookML, and the integration of LookML code with Git for versioning. Looker is very popular with startups and is starting to penetrate enterprise companies that have code-savvy data analysts.

With the acquisitions of Alooma and Looker, plus its own BI Engine and BigQuery, GCP now has a complete stack for enterprise BI. GCP can now provide a one-stop solution for replacing traditional enterprise BI built on Oracle/Teradata warehouses, Informatica for ETL and OBIEE/BO/Cognos/MicroStrategy as BI tools.

If they make these tools easy to purchase and easy to build end-to-end BI solutions with, then they really have a gold mine on their hands. It is going to be critical for Google to create a seamless experience across all four tools for users and developers. They must keep in mind that this customer base is SQL savvy, not code savvy. The tools need to work together without much configuration or coding to truly go after the enterprise BI market.

Google Cloud BI Architecture post Looker and Alooma acquisitions

It remains to be seen if Google can keep its focus for the long haul and really invest in these four capabilities (database, ETL, caching layer and visualization) rather than building more and more tools which distract them and confuse customers as to what their road map is.

Google clearly can provide excellent advanced analytics, ML and AI capabilities on top of this core set of capabilities. That market can be captured if GCP starts hosting the data and the visualizations. It is well known that without the data in proper form, advanced capabilities like ML and AI are hard to achieve.

Tableau acquisition by Salesforce

While Google’s acquisition of Looker was all cash and much smaller, the Tableau acquisition by Salesforce was a massive all-stock $16 billion deal. Tableau’s acquisition by Salesforce is a real head-scratcher, not only because of the price paid but also because of significant architecture incompatibilities and Salesforce’s relatively poor post-acquisition track record with the companies it has purchased so far.

Salesforce needs to continue acquiring to have an end-to-end BI solution

The motivation for entering the BI space makes sense for Salesforce as they can use their much bigger sales channel and customer base to expand into new markets. The question is why Tableau?

Customers love Tableau because of its easy-to-use and powerful desktop product. Tableau does have server and cloud products, but they were built more as a sharing mechanism for dashboards and workbooks than as the primary development platform. Tableau also relies completely on data extracts in .hyper form to perform reasonably. This adds a whole layer of complexity to the environment, as it leads to two-stage ETL and data sitting on various desktops, creating security issues. Salesforce’s whole premise was to get rid of on-premise software, especially desktop-based applications.

They might have been better off buying a cloud-based BI tool like Domo rather than Tableau. With Tableau, in the ultra-competitive BI market, they now must explain to the customers who love the desktop tool what the future architecture of the product is. Microsoft, whose Power BI tool set includes a near-free desktop product, was already giving Tableau a run for its money; now its sales team can create fear and uncertainty in the market about Tableau’s future architecture.

One explanation for the acquisition of Tableau is that Salesforce may not be done yet and is planning to add a cloud data warehouse as well as an ETL tool to its product set. This might lead them to buy Alteryx or Talend for ETL and possibly Snowflake as the data warehouse. But that would be an extremely costly route to take, as each of these companies is already very big. Salesforce’s shareholders will demand that the Tableau acquisition deliver significant growth acceleration for both products before the next big acquisition in the analytics space.

What is Data Virtualization and how it can unlock real-time insights directly from source systems

In today’s fast-moving business environment there are a lot of reporting requirements that cannot wait for a batch process to load data from operational systems. Business users need the data as it currently is in their on-premise ERP, CRM or any of the cloud systems. For example, towards the end of a fiscal quarter, sales and finance leaders want to know the order bookings in real time so that they have a good idea whether they are going to “make” the quarter in terms of revenue. They cannot wait 4–8 hours for an ETL job to finish.

Data virtualization (DV) is a method of building a “logical” warehouse by connecting “virtually” to various source systems and combining or joining them “virtually” to provide a unified data layer for BI systems or enterprise applications to query from.

Fig 1: High Level Architecture of Data Virtualization

The alternative to the data virtualization approach of providing a unified layer is the traditional ETL approach: moving data from the various source systems at periodic intervals and loading it into a physical data warehouse. The data in the warehouse is then combined by loading it into a target star or snowflake schema. ETL loads are usually batch processes, run once a day or a few times a day depending on the time they take and the data latency requirements.

How DV Works

At its core, DV is a SQL generation tool with an easy-to-use UI for building a semantic layer across various data sources. Instead of loading data into a physical warehouse by extracting, transforming and loading (ETL) from various source systems, with DV the data primarily remains in the source systems and the DV tool loads only the metadata of the tables (or other objects like views) from the source systems. We can call these virtual tables “Physical Views”. Once the metadata is loaded, the DV tool allows the developer to manipulate these physical views, adding joins and formula columns to enrich them with the required business logic. We can call these “Logical Views”. The developer can then use these logical views to build “BI Datasets”: flattened views of the data for a specific business purpose. These datasets can include fixed filters if they are always needed.
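
The layering above can be sketched as a tiny SQL generator; the physical-view names, the Oracle EBS tables and the join below are all invented for illustration:

```python
# Minimal sketch of the DV layering: physical views mirror source
# tables, a logical view adds a join and a formula column, and a
# "BI Dataset" is the logical view plus a fixed filter.
physical_views = {
    "orders":    ("oracle_ebs", "ONT.OE_ORDER_HEADERS_ALL"),
    "customers": ("oracle_ebs", "AR.HZ_PARTIES"),
}

logical_view = {
    "name": "order_facts",
    "select": ["o.ORDER_NUMBER", "c.PARTY_NAME",
               "o.QTY * o.UNIT_PRICE AS AMOUNT"],   # formula column
    "from": "orders o JOIN customers c ON o.CUSTOMER_ID = c.PARTY_ID",
}

def generate_sql(view, filters=()):
    """Expand physical-view names into source tables and emit SQL."""
    from_clause = view["from"]
    for alias, (_, table) in physical_views.items():
        from_clause = from_clause.replace(alias + " ", table + " ", 1)
    sql = "SELECT {} FROM {}".format(", ".join(view["select"]), from_clause)
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    return sql

# A "BI Dataset": the logical view with a fixed filter applied.
print(generate_sql(logical_view, ["o.STATUS = 'BOOKED'"]))
```

A real DV tool does far more (dialect translation, optimization, caching), but the essential job is this: turn a request against a friendly logical layer into SQL against the real source schemas.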

These flattened datasets are then imported into a BI tool like MicroStrategy, Power BI or Tableau. In these BI tools, users can add further business formula columns and filters, or even join two “BI Datasets” from the DV tool. The DV tool takes the incoming call from the BI tool and translates it into SQL appropriate for the specific backend data source, whether that is a database like Oracle, SQL Server or Teradata, or a cloud system accessed through its API, like Salesforce.

Fig 2: How Data Virtualization Works

The best DV tools perform SQL tuning before sending queries to the backend data sources. Instead of just passing the base SQL through, the DV tool introspects the incoming request, optimizes the query for the specific backend data source and then sends the SQL or API call to that data source.

In addition, the leading DV tools can cache any of the physical views, logical views or BI Datasets for better performance when the source is slow, either in a native database or in a commercial database like Oracle or Snowflake.

Advantages of DV

Data virtualization comes with a lot of benefits –

1. Simplified view of complex source system schemas

If the semantic layer is built in the DV tool and the “BI Datasets” are exposed as flat tables/views to be imported into a BI tool like Tableau or Power BI, the complexity of the source schema, such as Oracle EBS or SAP, is hidden from the BI developers. BI developers are usually not very familiar with the source system tables; they struggle to build a semantic layer and instead depend on database views in the source system. Building the semantic layer in the DV tool not only simplifies a complex schema, it also avoids maintaining complex and hard-to-change physical views in the source databases.

2. Speed of development

Since DV just imports the metadata of tables and creates virtual tables that mirror the source objects, complex ETL can be avoided for real-time reporting. Adding new fields or tables through ETL could take several weeks to months, as star schemas have to be designed and built, but in DV the shape of the data can be kept the same as the source, so adding new tables or fields is much easier and quicker.

3. Real time reporting need

Since DV points to the source system directly, either through ODBC/JDBC to the database or through an API to cloud systems, the latency of the data is reduced to zero or near zero. The data available to the user is real-time, and for use cases where users need the data immediately as it changes, DV really comes through as a great option.

4. Source system changes can be hidden from BI tools

A big benefit of DV is that when a source system is upgraded, replaced with another system, or simply moved from one database to another, say from Oracle to Snowflake, virtualizing the semantic layer helps hide these complex changes from the various BI tools. To accommodate such changes, the data source connection in the DV tool needs to be re-pointed to the new database if the schema is the same, or the relevant physical views need some changes if there are schema changes like column additions. This way the reports and visualizations built in the various BI tools do not have to change, and users are not impacted.

5. Performance optimizations and caching

Most leading DV tools provide the ability to cache certain tables from the source database, or to cache whole datasets as well. Caching can significantly improve the performance of the reports in the BI tools. The best DV tools tune the queries they receive from the BI tools and push them down to the source databases. They also find ways to extract and move the minimum amount of data from the different data sources before joining it in the DV tool and sending it on to the BI tools.

Data Virtualization Tools

There is a lot of confusion in the market as to what Data Virtualization is. For Data Virtualization as described here, these are tools that companies can evaluate for their needs –

Commercial Products

1. Denodo

2. Tibco Data Virtualization

3. DataVirtuality

4. AtScale

5. Azure Analysis Services

Open Source Alternatives

1. Dremio

2. Teiid

We personally selected Denodo and we really like it for working with on-premise databases and cloud data sources, merging data from multiple sources, and presenting a simplified view of the data to multiple BI tools. I will write a more detailed article on Denodo and how it works in a future article.

How Incorta enables Operational Analytics without cumbersome ETL or star schema

Link to the demo and to the full webinar

The key benefits we realized with Incorta for operational reporting are –

  • Delivered real-time operational reports against data aggregated from complex data models of Oracle EBS, Oracle Siebel, and Oracle Agile
  • Empowered business users to self-serve even the most complex reports and analytics
  • Redeployed 13 of 15 BI support resources to work on more innovative projects
  • Reduced new report delivery time from months to mere hours
  • Joined data from multiple sources like Oracle EBS and Siebel, which in the past was very difficult to do other than through a slow dblink between two Oracle databases
  • Reduced a complex 40-table report’s running time from 6 hours to seconds

Why a user-friendly workflow automation and scheduling tool is critical for successful business intelligence and data engineering operations

The nature of today’s enterprise applications environment is that the critical data needed for operational and analytical reporting resides in multiple cloud and on-premise applications. A lot of companies have multiple ERPs, multiple CRMs and multiple other cloud applications that they have implemented over time or inherited through acquisitions.

Figure 1: Typical ETL and BI Architecture

Similar to their transaction systems, companies will also have a mixture of modern and legacy BI and ETL technologies. The need to bring the data from all these source applications together into one or more data warehouses (ideally one), whether on-premise or in the cloud, is more important than ever. Business performance analysis depends on having all the required data together in a user-friendly form so that data scientists and analysts can do their jobs.

Complicated scheduling architecture leads to operational issues

Each of these source systems has a different way of being accessed by the ETL/ELT tools. Some have APIs, others expose an ODBC/JDBC interface, and others output the data in csv, JSON or other formats. As companies build data pipelines from these sources over the years, each pipeline ends up running on a different schedule and being triggered in a different way. Some get triggered by cron jobs, others by Windows Task Scheduler, others by in-application schedulers. In addition, most BI tools like MicroStrategy, OBIEE, Business Objects, or Tableau come with their own set of scheduling tools to trigger report extracts, load cubes or send dashboards as emails to users.

Figure 2: Multiple dependent event driven schedules in an enterprise BI architecture

The result of running these various ETL/ELT and BI tasks with native schedulers in each of these tools is that there will be multiple different code bases and mechanisms to trigger the tasks and, more importantly, there will not be one place to check the success or failure of these tasks.

If a business user says that they didn’t get the latest data on their dashboard or email, the operations team needs to check all these different schedules in the BI tools, ETL tools, database triggers, OS jobs or even the source systems to find out why the data was not updated. Most likely there will be different teams supporting each of these tools. Troubleshooting any small issue turns into a nightmare without knowing where in the data flow the failure is. If the BI and data engineering teams are always troubleshooting operational issues, new capabilities and solutions will take a backseat.

To avoid this, companies must invest in centralizing all BI and data engineering related schedules and tasks into a central workflow automation and scheduling architecture.

How centralized workflow automation will help

The best way to get the BI and data engineering architecture under control is to run all the operations using a central workflow automation and scheduling tool. For this to happen you will need to figure out a way to trigger the various ETL/ELT and BI tool schedules using their respective APIs. It will take a bit of technical investment to figure out how to work with each of these APIs. In general, most modern tools have simple REST APIs but some of the older legacy tools need some research to figure them out.

Once you have the APIs understood, the next step is to write the code to call the APIs for the various tasks that need to be performed. For example, you will need an API to trigger a download of employee data from your HR cloud platform like SuccessFactors or an API to trigger an Informatica workflow. After the APIs are understood you can standardize on a simple language like Python to write the code to call these APIs. There will be a lot of sequencing of API calls that needs to be performed by the Workflow Automation tool — kick off Task 2 after Task 1 is complete etc.
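The sequencing described above can be sketched in Python. The task names and bodies below are hypothetical stand-ins for the real API calls (for example, triggering a SuccessFactors extract or an Informatica workflow); a real implementation would make REST calls where the lambdas are.

```python
# Sketch of the sequencing a central workflow tool performs:
# run tasks in order, stop at the first failure so downstream
# tasks that depend on it never fire. Task names are invented.

def run_workflow(tasks):
    """Run (name, callable) pairs in order; record success/failure."""
    results = {}
    for name, task in tasks:
        try:
            results[name] = ("success", task())
        except Exception as exc:
            results[name] = ("failure", str(exc))
            break  # downstream tasks depend on this one
    return results

tasks = [
    ("extract_successfactors", lambda: "employee data downloaded"),
    ("run_informatica_workflow", lambda: "warehouse loaded"),
    ("refresh_tableau_extract", lambda: "dashboard updated"),
]
status = run_workflow(tasks)

# A simulated mid-chain failure: the third task never runs
failed = run_workflow([
    ("task_a", lambda: "ok"),
    ("task_b", lambda: 1 / 0),  # stand-in for a failed API call
    ("task_c", lambda: "never runs"),
])
```

The single `results` dictionary is the point: one place to check success or failure, instead of five native schedulers.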

A good workflow automation tool for this needs to be able to

  1. call the necessary APIs in different languages
  2. accept triggers in different formats
  3. provide a visual workflow development environment
  4. send appropriate notifications to users and the support team on success or failure of a task
  5. offer an advanced scheduling engine and monitoring capabilities
  6. and, last but not least, be easy to use: easy to build with and easy to maintain

Is there a tool in the market like that?

Figure 3: BI and Data Engineering operations driven by a central Workflow Automation tool

JAMS Scheduler as workflow automation tool for BI/Data Engineering operations

After a detailed search of both commercial and open source products like Airflow, I landed on a product called JAMS Scheduler by MVP Systems Software. With JAMS we were able to build out a complete workflow automation solution that calls the various ETL/ELT jobs using their respective APIs. We were able to chain processes so that dependent tasks kick off when others complete. BI tool tasks, like kicking off a .hyper file creation in Tableau or emailing dashboards, can be run after the ETL tasks are done.

In addition, JAMS has a great UI for monitoring tasks as well as a detailed notification and alerting mechanism for task success or failure. In BI, one of the critical pieces is the scheduling and triggering capability. For example, we had requirements to trigger jobs at 5.30 PM Malaysia time on Workday 1, 3 and 5 of every month. With JAMS, these kinds of triggers are very easy to set up without writing any code. JAMS has an intuitive interface to build workflows with variables and parameters. You can call Python code, PowerShell, database procedures and SQL statements all in the same workflow. Once you have set up a job, it runs very reliably. You have the ability to put jobs on hold during maintenance windows, set priorities for jobs and re-trigger them on failures.
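To see what a “Workday N” trigger implies, here is a rough Python sketch of the underlying calculation. This is not how JAMS implements it; it is a simplified illustration that counts only weekdays and ignores company holidays, which a real fiscal calendar would exclude.

```python
from datetime import date, timedelta

def nth_workday(year, month, n):
    """Return the date of the n-th weekday (Mon-Fri) of a month.
    Simplified stand-in for a scheduler's 'Workday N' trigger;
    a real calendar would also exclude company holidays."""
    d = date(year, month, 1)
    count = 0
    while True:
        if d.weekday() < 5:  # Mon=0 .. Fri=4
            count += 1
            if count == n:
                return d
        d += timedelta(days=1)

# Which dates should a "Workday 1, 3 and 5" job fire on in July 2024?
fire_days = {nth_workday(2024, 7, n) for n in (1, 3, 5)}
```

A scheduler like JAMS gives you this logic, plus time zones and holiday calendars, without writing any code.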

One of the most important aspects of JAMS is that I could download a free evaluation copy and test out all the features for 30 days. During this time, I got great support from their sales and technical folks, so I could complete the POC and show it to various stakeholders for buy-in.

Overall it seems to be a well thought out, stable and easy to use product which we have gotten to rely on for automating a lot of manual tasks in our BI and data engineering operations.

Here are a few resources that might be useful to check on JAMS –

1. Product Videos

2. JAMS features overview

How to setup your Snowflake environment when moving on-premise databases to the cloud

In on-premise database environments like Oracle and SQL Server there will usually be multiple physical servers and in each of them there will be multiple databases. For example, in a typical on-premise Oracle/SQL Server data warehouse environment, companies will have 3 separate sets of physical servers — one each for development, test and production.

In each of these physical servers, multiple databases will be created. Each of these databases will be built for a specific purpose: for example, one could be a financial systems warehouse (FIN_DW) with data pulled from the ERP system, another could be an HR warehouse (HR_DW) with data pulled from HR systems, and a third could hold data from CRM systems like Salesforce and be called CRM_DW. In each of the databases there can be multiple schemas, and in each schema there can be multiple tables, views and other objects.

So, in total you could have 3 databases per server looking like this in your on-premise environment –

DEV/TEST/PROD Physical Server

In an on-premise environment the following picture depicts a typical hierarchy of objects

How Snowflake works

When a company signs up with Snowflake they are given a URL like

https://companyname.snowflakecomputing.com/

In Snowflake, a database is the highest level and inside a database there can be multiple schemas and inside a schema there can be multiple tables and views. So, in other words Snowflake does not have a server concept like Dev, Test or Production physical servers.

In Snowflake, the following picture depicts a typical hierarchy of objects

How to organize your on-premise databases in Snowflake

Given that the Snowflake environment is “one level lower” in terms of objects (there is no concept of dev, test and production physical servers), how do you organize the Snowflake system to match your on-premise setup? There are 2 ways to go about this –

1. Keep your top-level object as an individual database when you migrate

In this method, you will create as many databases in Snowflake as the number of on-premise physical servers multiplied by the number of databases in each of them.

In the above example you will create 9 databases in Snowflake –

CRM_DW_DEV

HR_DW_DEV

FIN_DW_DEV

CRM_DW_TEST

HR_DW_TEST

FIN_DW_TEST

CRM_DW_PROD

HR_DW_PROD

FIN_DW_PROD

This method will work fine if you have a small number of on-premise servers with a small number of databases in each of them. But your company could have 4–5 physical servers (Sandbox, Dev, Test, Production, etc.) with 10–20 databases in each of them. You can imagine how the number of databases can proliferate in Snowflake. In this example you’re looking at anywhere between 40 and 100 databases.
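The proliferation is easy to see with a small Python sketch that generates the option 1 database names from the example above:

```python
# Sketch of how database names multiply under option 1:
# one Snowflake database per (server, database) pair.
servers = ["DEV", "TEST", "PROD"]
databases = ["CRM_DW", "HR_DW", "FIN_DW"]

snowflake_databases = [f"{db}_{srv}" for srv in servers for db in databases]
# 3 servers x 3 databases = 9 Snowflake databases to secure and maintain
```

With 5 servers and 20 databases each, the same cross product yields 100 databases.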

You will have to maintain all these databases within Snowflake and assign security and roles to each of them. In addition, in my opinion you will have a very confusing and cluttered environment to maintain over the longer term.

One of the big issues I see is that production servers normally have a higher degree of security and access control than dev or test servers. In the on-premise world, the server and the databases in the production environment are audited and are under SOX control. In Snowflake, if you end up having 10–20 production databases without an umbrella physical server, it will become difficult to report the internal controls out to the audit team.

2. Create as many “dummy databases” as there are on-premise physical servers

In this method you create 3 databases in Snowflake at the top level –

1. Development

2. Test

3. Production

These will represent the 3 physical servers in your on-premise environments. Then you can create the 3 on-premise databases (CRM_DW, HR_DW, FIN_DW) as schemas inside these 3 databases. If an on-premise database has multiple schemas, you can create them as separate schemas prefixed with the database name. For example, if CRM_DW has 2 schemas called Marketing_Schema and Sales_Schema, you can create these as CRM_DW_Marketing_Schema and CRM_DW_Sales_Schema under the Development, Test and Production databases. The respective tables and views can then be created under each of these schemas.

The advantage I see in this method is that you have a more structured way of looking at your Snowflake environment. You will have a Development, Test and Production database, and all the schemas and tables that belong to each will sit inside these databases. You can put a greater level of security control on the Production database and will be able to prove to your auditors that you have controls similar to those on the production on-premise server.

The only downside I see to this approach is the case where you have a lot of schemas under a database in your on-premise environment. In that case you will just have to prefix your schema names with the database name to distinguish them.
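A small Python sketch can illustrate the option 2 naming convention. The environment and schema names follow the example above; the helper function itself is hypothetical:

```python
# Sketch of option 2: one Snowflake database per on-premise server,
# with source databases folded into prefixed schema names.
on_prem = {
    "CRM_DW": ["MARKETING_SCHEMA", "SALES_SCHEMA"],
    "HR_DW": ["HR_SCHEMA"],
}

def option2_layout(environments, on_prem):
    """Map each environment to its fully qualified Snowflake schemas."""
    return {
        env: [
            f"{env}.{db}_{schema}"
            for db, schemas in on_prem.items()
            for schema in schemas
        ]
        for env in environments
    }

layout = option2_layout(["DEVELOPMENT", "TEST", "PRODUCTION"], on_prem)
```

Only three databases exist at the top level, so production-wide security and SOX controls can be applied to PRODUCTION alone.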

Summary

Before moving your on-premise data warehouses to Snowflake, it is necessary to put some thought into how you want to organize your Snowflake environment. Since you don’t have the concept of physical development, test or production servers, you can mimic it by using option 2 above. Option 2 will work well if you have a lot of databases inside each physical server and fewer schemas in each database. If you have a lot of schemas in each database and fewer databases in each physical server, then option 1 might be better suited for your case.

Why Snowflake is a great option to move your on-premise Oracle, SQL Server and Teradata data warehouses to the cloud

As your key transactional systems like Salesforce, Workday and Netsuite move to the cloud, it makes sense to move your data warehouse to the cloud also. One of the fastest growing databases in the cloud is Snowflake, and it is rapidly gaining customers who want to move their analytic workloads to the cloud from on-premise Oracle, SQL Server or Teradata based data warehouses.

Snowflake is a serverless cloud only database which means you can use as little or as much as you need of various resources like CPU, memory and storage. It is not limited by the capacity of the server it is on. We have recently started several projects to move on-premise Oracle and SQL Server data warehouses to Snowflake. I will write about our experience in this journey in future articles.

To begin with, I would like to go over why I think Snowflake is a viable cloud database to replace legacy databases that have been in use for years. There have been a lot of articles written on the technical capabilities of Snowflake. You can read them here – Article 1, Article 2 – and find many more if you search the web or read the Snowflake blog.

What I want to emphasize below is how it will help the technical and functional BI and analytics practitioners.

  1. Freedom from DBAs – You can create and destroy databases as and when you please. This is very important as in the analytics world you end up starting a lot of projects without knowing how large the data volume will be and how many users will end up using your solution.

The provisioning of on-premise databases is totally dependent on CPU, memory and storage. So, before you talk to the DBA team you need to estimate exactly how much of each is needed. Then you must wait for a DBA to create the database and give it the right resources, after you justify why you need that much CPU, memory and storage. In the meantime, you just want to load your data and start modeling and analyzing it. Snowflake helps you be your own DBA. Imagine that! It also simplifies backup and recovery as well as cloning your databases to create new environments. It does need some budget to get started, but it’s not much, and you have to deal with far less red tape and politics.

  2. It’s “big data” without the complexity – For the last several years there has been so much talk about Hadoop, Hive, HBase, Spark and a whole bunch of Apache projects which promised big performance and cost improvements over traditional databases. The problem was that it moved the discussion of BI and analytics to an extremely technical level, losing sight of the whole purpose of the effort, which is to let the end users analyze the data and gain insights. It turned out that for a lot of companies the traditional Hadoop based “big data” was overkill. They took on something which was very complex to install, develop and run, and the business users lost patience and moved on by building their own custom solutions.

A lot of companies used big data platforms to store csv, XML and JSON files in addition to relational data. All of this can be easily accomplished with Snowflake without any of the complexity of maintaining a big data platform, writing MapReduce jobs, etc.

  3. SQL Compatibility – A lot of databases have come to the market promising very high performance at low cost. The problem with most of these is that they don’t provide full relational SQL capabilities like sub-queries, union queries, inline SQL etc. When you are trying to replace a traditional Oracle or SQL Server database there will be a lot of usage of these SQL features. You would have to rewrite this logic in some other way, leading to a tremendous increase in the complexity of your migration effort. Snowflake from the beginning was built to support all the capabilities of relational SQL. I have found this to be mostly true, with some gaps which Snowflake is actively closing with every release.
  4. BI Tools support – Snowflake has partnerships with almost all major BI tool vendors like Tableau, Qlik, MicroStrategy, Looker, Spotfire and many others. All these tools have native connectors to Snowflake, which gives immediate connectivity to the data that is loaded into Snowflake.
  5. Cloud Connectors – Snowflake is built for Extract, Load and Transform (ELT) data integrations. The best way to load data into Snowflake is to load it in its raw format from the source system and then transform it within Snowflake, whether the source is files, cloud systems like Salesforce, or on-premise applications. There is a rapidly developing ecosystem of partners like Fivetran, Stitch and Matillion who are building native connectors to popular cloud applications like Salesforce. These tools sync the data from Salesforce and other systems and then allow developers to transform it the way the business needs require.

In summary, if you are looking to move your data warehouse to the cloud Snowflake is one of the best options out there due to its serverless architecture, full SQL support, ease of maintenance and strong partnerships with BI and ETL tools.

Is Traditional ETL dead?

For many years traditional Extract, Transform and Load (ETL) was the only way to extract and transform data out of operational systems and load into a data warehouse for analysis by a business intelligence tool like Business Objects, MicroStrategy or OBIEE. Traditional ETL works by extracting data out of one or more transactional systems, merging and reshaping the data using staging tables and populating dimensions and fact target tables in a data warehouse. These data warehouses were usually built on Oracle, SQL Server or Teradata databases.  To load the data warehouse using the traditional ETL method the following steps need to be accomplished –

  1. Understand the business requirements – you must know all the ways the users will slice and dice the data and what the grain of the data is (is it at customer level or order line level, etc.)
  2. Data model the star and snowflake schemas, including all facts and dimensions, their keys and how they all join
  3. Figure out where the data in the source systems is coming from to populate the facts and dimensions, and then write the SQL that joins the various tables in the source system to populate these individual facts and dimensions with the appropriate keys
  4. Figure out the load order and write a workflow to populate the dimensions and facts in the appropriate order. This is important because the dimensions need to be loaded before the fact tables so that the dimensional keys get correctly looked up when the facts are loaded.
  5. Figure out which tables will be populated with incremental loads and which with full loads
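The dimensions-before-facts rule in step 4 can be sketched in Python. The table shapes and names below are invented for illustration; real ETL tools do the same surrogate-key lookup at far larger scale:

```python
# Minimal sketch of loading a dimension first so fact rows can
# look up surrogate keys. Table and column names are hypothetical.

def load_dimension(rows, natural_key):
    """Assign surrogate keys; return natural_key -> surrogate_key."""
    return {row[natural_key]: i + 1 for i, row in enumerate(rows)}

customers = [{"cust_code": "C100"}, {"cust_code": "C200"}]
cust_dim = load_dimension(customers, "cust_code")

source_orders = [
    {"order_id": 1, "cust_code": "C100", "amount": 99.0},
    {"order_id": 2, "cust_code": "C200", "amount": 45.0},
]
# The fact load depends on the dimension having been loaded first
fact_orders = [
    {"order_id": o["order_id"],
     "cust_key": cust_dim[o["cust_code"]],  # surrogate key lookup
     "amount": o["amount"]}
    for o in source_orders
]
```

If the fact load ran first, the `cust_dim` lookup would fail, which is exactly why load order matters.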

Let’s go over each of these steps and see what is involved –

Business Requirements

This is probably one of the most difficult parts of any traditional BI project. Net new BI and data warehousing projects are usually started when new operational systems are put in place. For example, the company may implement a new ERP, CRM or cloud HR system. To give accurate business requirements for a traditional ETL/BI project, the business users need to come up to speed on the new source system and its data structure, then try to guess all potential reports and ways to slice the data in the future, and define a set of requirements that IT developers can understand.

The probability of getting the requirements right is very low. But the expectation of IT and business leadership is that proper requirements, written in stone, need to be provided upfront before the implementation of the project begins. This is because the ETL developers need to know what data to pull from the source to populate the data warehouse, whose data model in turn has been designed based on the business requirements.

Data Modeling

Data modeling entails working backwards from the questions the business users will ask and then designing the schema to answer those questions. The metrics or measures go into the fact tables, and the attributes by which these metrics can be sliced and diced are modeled as dimensions around the fact tables. The fact tables then carry keys to each of these dimensions.

With this method of building a warehouse, each set of questions will need a new fact table. For example, if the business users want to analyze opportunities, revenue, orders, COGS or GL data, they will need to design a different fact and associated dimensions for each of them. Detailed analysis needs to be done upfront to understand the grain (aggregation level) of each of the fact tables, and the dimensions are then designed to match this grain.

Build source SQL or API data pull

Once the data model of the data warehouse is prepared, it is handed over to the ETL developers. The ETL developers need to understand where the data resides in the source system and write SQL against the source database, API calls for cloud systems, or flat file extracts to pull the data into the dimensions and facts in the data warehouse. Staging tables are usually used to hold the data in an intermediate form before it is populated into the target tables. Incremental and full loads also need to be designed.

Here are the challenges with writing SQL/flat files/API from the source system –

  1. ETL developers rarely understand the source system well. For example, ETL developers usually have very little knowledge of SAP or Oracle ERP tables. They must rely on source system experts who normally don’t have much time for them.
  2. Once they write the SQL and populate the staging and target tables, they must rely on the BI teams to complete the semantic data model and build out the reports before any business user can test the results. By the time all this happens the project is almost at its end, and the data is usually very different from what was expected. The whole BI/ETL project team must go back to the drawing board several times to fix the issues. All this causes a lot of stress and delays in the project.
  3. The SQL that is written is usually very expensive in terms of system resources on the source database and must be rewritten for performance reasons. This sometimes requires significant logic changes, leading to data errors, etc.

Load Order

The ETL developers also need to figure out the correct order in which to load the dimensions and facts. This can get very tricky when there are a lot of interdependent tables being pulled, especially if the data is coming from multiple source systems. Also, the right set of dimensional keys needs to be populated in all the fact tables with the right lookup logic.
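One common way to derive a safe load order from table dependencies is a topological sort; here is a sketch using Python’s standard library. The table names are invented for illustration:

```python
from graphlib import TopologicalSorter

# Sketch of deriving a safe load order: each fact table lists the
# dimensions it looks up as predecessors. Names are hypothetical.
depends_on = {
    "FACT_ORDERS": {"DIM_CUSTOMER", "DIM_PRODUCT", "DIM_DATE"},
    "FACT_REVENUE": {"DIM_CUSTOMER", "DIM_DATE"},
    "DIM_CUSTOMER": set(),
    "DIM_PRODUCT": set(),
    "DIM_DATE": set(),
}

# static_order() guarantees every table appears after its predecessors,
# so all dimensions load before any fact that references them.
load_order = list(TopologicalSorter(depends_on).static_order())
```

The same idea scales to interdependent tables across multiple source systems, where working out the order by hand gets error-prone.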

Incremental and Full Loads

Incremental loads are needed in a lot of cases when there are a lot of rows that get updated in the source system. But getting the logic right as to which rows have been updated can get very complicated when there are no proper Last Update Date columns in the source data. Sometimes complicated OR conditions need to be written if a dimension or fact gets populated from multiple source tables.

In addition, when a new column is added to a target dimension or fact, a full load needs to be triggered or an update SQL needs to be written to populate the column for older rows. Also, deleted rows in the source system need to be taken care of, as there might not be a clear flag or trigger in the source to indicate which rows were deleted.
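A minimal sketch of a watermark-based incremental load, assuming the source exposes a reliable Last Update Date column (which, as noted above, is often the hard part in practice):

```python
from datetime import datetime

# Hypothetical sketch: keep a watermark (the high-water mark of the
# last successful load) and pull only rows changed since then.

def incremental_rows(source_rows, watermark):
    """Return only rows changed since the last successful load."""
    return [r for r in source_rows if r["last_update_date"] > watermark]

rows = [
    {"id": 1, "last_update_date": datetime(2024, 1, 1)},
    {"id": 2, "last_update_date": datetime(2024, 3, 15)},
]
changed = incremental_rows(rows, watermark=datetime(2024, 2, 1))
# After loading, the next run's watermark becomes the max
# last_update_date among the rows just loaded.
```

Note this approach cannot see hard-deleted source rows at all, which is why deletions need separate handling.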

Speed of Change

Given all the above complexities, getting a change done – for example, adding a new dimension, fact or even a new column – gets very difficult when traditional ETL is used. The skill level required of the ETL developers to perform all the above steps is very high, and such developers are very difficult to find. In addition, testing a change using the traditional ETL approach gets quite complicated as the data gets reshaped during the load from an OLTP data model to an OLAP data model. It could take weeks to add a new column, or months to add a new dimension or fact table, test it and move the change to production.

Can the business really wait that long for a simple change? The answer is clear – most business users are expecting changes to their BI systems to be done in a matter of days if not less.

So, is traditional ETL the best way moving forward for new BI projects?

The answer is NO. With today’s technology including cloud data warehouses, replication methods, and powerful self-service visualization tools there is very little reason to go with the traditional ETL and legacy BI tools approach for any new BI project.

There are several other new techniques to solve the problem of getting the data out of the source system and presenting the data in a meaningful manner to the end user.

  1. Replication – Replicate the source data to the target database (increasingly, cloud databases like Snowflake or Amazon Redshift are becoming attractive targets) and build the business logic post replication
  2. Data Virtualization with a semantic layer and caching in a fast database

I will cover more about these two approaches in future blogs.

What about existing ETL driven data warehouses and BI solutions?

If the existing ETL+BI solution does not require frequent changes and has a low cost of maintenance, then my suggestion is to let the users make use of the system till it loses its business value. If there is a significant change in the source system, like a move to the cloud or an upgrade which needs a lot of changes to the ETL, that might be the time to replace the legacy solution with the new architecture.

You don’t have to replace the whole solution with the new architecture at once. You can start with the business-critical reports and then consider stopping the software maintenance support for the legacy BI and ETL tools. Then you can take your time to migrate the other valuable content over to the new platforms and shut down the old ETL/BI solution in a phased manner.