What is Data Virtualization and how it can unlock real-time insights directly from source systems

In today’s fast-moving business environment there are a lot of reporting requirements which cannot wait for a batch process to load data from operational systems. The business users need the data as it is currently in their on-premise ERP, CRM or any of the cloud systems. For example, towards the end of a fiscal quarter, sales and finance leaders want to know the order bookings in real-time so that they have a good idea if they are going to “make” the quarter or not in terms of revenue. They cannot wait 4–8 hours of delay for a ETL job to finish.

Data virtualization (DV) is a method of building a “logical” warehouse by connecting “virtually” to various source systems and combining or joining them “virtually” to provide a unified data layer for BI systems or enterprise applications to query from.

Fig 1: High Level Architecture of Data Virtualization

The alternate to the data virtualization approach of providing a unified layer is the traditional ETL approach of moving data from the various source systems at periodic intervals and loading them into a physical data warehouse. The data in the warehouse is then combined by loading them into a target star schema or snowflake data model. ETL loads are usually a batch process and are loaded once a day or a few times a day depending on the time it takes to load and the data latency requirements.

How DV Works?

At the core of it DV is a SQL generation tool with an easy to use UI to build a semantic layer across various data sources. Instead of loading data into a physical warehouse by extracting, transforming and loading (ETL) from various source systems, with DV the data will primarily remain in the source systems and the DV tool will load the metadata of the tables (or other objects like views) from source systems into the DV tool. We can call these virtual tables as “Physical Views”. Once the metadata of the tables are loaded, the DV tool will allow the developer to perform various manipulations to these physical views including joins, and formula columns to enrich them with required business logic. We can name these as “Logical Views”. Once the joins and formula columns are added the developer can then build out the “BI Datasets” using these various logical views to create datasets which are flattened views of the data for a specific business purpose. These datasets can include fixed filters if they are always needed.

These flattened datasets will be then be imported in a BI tool like MicroStrategy, Power BI or Tableau. In these BI tools, users can add further business formula columns, filters or even join two “BI Datasets” from the DV tool. The DV tool will take the input call from the BI tools and then translate the SQL which is appropriate to the specific backend data source, be it a database like Oracle, SQL Server, Teradata or a cloud system using their API like Salesforce.

Fig 2: How Data Virtualization Works

The best DV tools perform SQL tuning before sending queries to the backend data sources. Instead of just sending the base SQL to the backend the DV tool will introspect the incoming request, optimize the query for the specific backend data source and then send the SQL or API call to that data source.

In addition, the leading DV tools can cache any of physical views, logical views or BI Datasets for better performance if the source is slow, either in a native database or a commercial database like Oracle or Snowflake.

Advantages of DV

Data virtualization comes with a lot of benefits –

1. Simplified view of complex source system schemas

If the semantic layer is built in the DV tool and the “BI Datasets” are exposed as a flat table/view to be imported in a BI tool like Tableau or Power BI, the complexity of the source schema like Oracle EBS or SAP will be hidden to the BI developers. BI developers are usually not very familiar with the source system tables and struggle to build a semantic layer and instead depend on a database view in the source system. Instead, building the semantic layer in the DV tool not only simplifies a complex schema, it also avoids maintaining a complex and hard to change physical views in source databases.

2. Speed of development

Since DV just imports the metadata of tables and creates virtual tables which mirrors the source objects, complex ETL to achieve real time reporting can be avoided. Adding new fields or tables in ETL could take several weeks to months as star schemas have to be designed and built, but in DV the shape of the data can be preserved to be the same as the source and hence adding new tables or fields will be much easier and quicker.

3. Real time reporting need

Since DV points to the source system directly, either through ODBC/JDBC to the database or through an API to cloud systems the latency of the data is now zero to a minimum. The data available to the user is real-time and for use cases where users need data immediately as it changes DV really comes through as a great option.

4. Source system changes can be hidden from BI tools

A big benefit of DV is that when a source system is upgraded or even if the company replaces a source system with another or is planning to just move databases from let’s say Oracle to Snowflake, virtualizing the semantic layer will help hiding these complex changes in the environment from the various BI tools. To accommodate changes which occur in source systems due to upgrades or database changes, in the DV tool the data source connection needs to be re-pointed to the new database if the schema is the same or some changes in relevant physical views will be needed if there are schema changes like column additions etc. This way the reports and visualizations built on the various BI tools do not have to change and users are not impacted.

5. Performance optimizations and caching

Most leading DV tools provide the ability to cache certain tables from the source database or cache datasets as well. Caching will significantly improve the performance of the reports in the BI tools. The best DV tools tunes the queries they receive from the BI tools and then pushes them down to the source databases. They also find ways to extract and move minimum amount of data from different data sources before joining them in the DV tool and sending to the BI tools.

Data Virtualization Tools

There is a lot of confusion in the market as to what Data Virtualization is. For Data Virtualization as described here, these are tools that companies can evaluate for their needs –

Commercial Products

1. Denodo

2. Tibco Data Virtualization

3. DataVirtuality

4. AtScale

5. Azure Analysis Service

Open Source Alternatives

1. Dremio

2. Teiid

We personally selected Denodo and we really like it for working with on-premise databases, cloud data sources, merging data from multiple sources as well as presenting a simplified view of the data to multiple BI tools. I will write a more detailed article on Denodo and how it works on a future article.