The nature of today’s enterprise applications environment is that the critical data needed for operational and analytical reporting reside in multiple cloud and on-premise applications. A lot of companies have multiple ERPs, multiple CRMs and multiple other cloud applications that they have implemented over time or inherited them through acquisitions.
Similar, to their transaction systems, companies will also have a mixture of modern and legacy BI and ETL technologies. The need to bring the data from all these source applications together into one or more data warehouses (ideally one), whether on-premise or on the cloud, is more important than ever. Business performance analysis can be achieved by having all the required data together in a user-friendly manner for the data scientists and analysts to do their job.
Complicated scheduling architecture leads to operational issues
Each of these source systems have a different way of being accessed by the ETL/ELT tools. Some have API, others expose a ODBC/JDBC interface and others output the data in csv, JSON or other formats. As companies build the data pipelines from these sources over the years each of these are run at different schedules and are triggered in multiple different ways. Some get triggered by CRON Jobs, others by Windows Scheduler, others by in-application schedulers. In addition, most BI tools like MicroStrategy, OBIEE, Business Objects, or Tableau come with their own set of scheduling tools to trigger report extracts, loading cubes or for sending dashboards as emails to users.
The result of running these various ETL/ELT and BI tasks with native schedulers in each of these tools is that there will multiple different kinds of code bases and mechanisms to trigger the tasks and more importantly, there will be not be one place to check the success of failure of these tasks.
In case a business user says that they didn’t get the latest data on their dashboard or email then the operations team needs to check all these different schedules in the BI tools, ETL tools, database triggers, OS jobs or even the source systems to find out why the data was not updated. Most probably there will be different teams supporting each of these tools. Troubleshooting any small issue will turn out to be a nightmare without knowing where the failure is in the data flow. If the BI and data engineering teams are always troubleshooting operational issues new capabilities and solutions will take a backseat.
To avoid this, companies must invest in and centralize all BI and data engineering related schedules and tasks into a central workflow automation and scheduling architecture
How centralized workflow automation will help
The best way to get the BI and data engineering architecture under control is to run all the operations using a central workflow automation and scheduling tool. For this to happen you will need to figure out a way to trigger the various ETL/ELT and BI tool schedules using their respective APIs. It will take a bit of technical investment to figure out how to work with each of these APIs. In general, most modern tools have simple REST APIs but some of the older legacy tools need some research to figure them out.
Once you have the APIs understood, the next step is to write the code to call the APIs for the various tasks that need to be performed. For example, you will need an API to trigger a download of employee data from your HR cloud platform like SuccessFactors or an API to trigger an Informatica workflow. After the APIs are understood you can standardize on a simple language like Python to write the code to call these APIs. There will be a lot of sequencing of API calls that needs to be performed by the Workflow Automation tool — kick off Task 2 after Task 1 is complete etc.
A good workflow automation tool that is needed for this needs to be able to
- call the necessary APIs in different languages
- to accept triggers in different formats
- to have a visual workflow development environment
- to send appropriate notifications to users and support team on success of failure of a certain task
- it needs to have an advanced scheduling engine and monitoring capabilities.
- last but not least is that it needs to be easy to use — easy to build and maintain.
Is there a tool in the market like that?
JAMS Scheduler as workflow automation tool for BI/Data Engineering operations
After a detailed search of both commercial and open source products like Airflow, I landed on a product called JAMS Scheduler by MVP Systems Software. With JAMS we were able build out a complete workflow automation solution which was able to call various ETL/ELT jobs using their respective APIs. We were able to chain various processes so dependent tasks kick off when others complete. BI tool tasks like kicking off an .hyper file creation in Tableau or emailing dashboards can be run after the ETL tasks are done.
In addition, JAMS has a great UI for monitoring tasks as well as a detailed notification and alerting mechanism in case of success or failure of tasks. In BI, one of the critical piece is the scheduling and triggering capability. For example, we had requirements to trigger jobs at 5.30 PM Malaysia time on Workday 1, 3 and 5 of every month. With JAMS, these kinds of triggers are very easy to setup without writing any code. JAMS has an intuitive interface to build workflows with variables and parameters. You can call python code, power shell, database procedures, SQL statements all in the same workflow. Once you have setup the job to run it runs very reliably. You have the capability of putting jobs on hold during maintenance windows, set priorities for jobs and re-trigger them on failures.
One of the most important aspects of JAMS is that I can download a free evaluation copy and test out all the features for 30 days. During this time, I got great support from their sales and technical folks, so I can complete the POC and show it to various stakeholders for buy in.
Overall it seems to be a well thought out, stable and easy to use product which we have gotten to rely on for automating a lot of manual tasks in our BI and data engineering operations.
Here are a few resources that might be useful to check on JAMS –