QuickBooks Engineering

Cool stuff we are doing and thinking about in engineering at Intuit


Tracking Complex Data Pipeline Lineage

Sandeep Uttamchandani
4 min read · Dec 2, 2018


Data pipelines in production are complex, involving hundreds of tables, transformation jobs, and scripts. Before consuming business metrics, Analysts and Data Scientists often need to understand the ETL logic and the sources used to generate a given metric. In production, debugging anomalies in metric values is a nightmare today: pinpointing the root cause can take days, weeks, or even months! And when writing new ETLs, it is critical to correctly specify the job dependencies in the scheduler.

The complexity of data pipeline lineage directly impacts the productivity of Data Analysts and Data Scientists. The key productivity metric we track is Time-to-Iterate, which represents the ability to understand, monitor, and debug existing pipelines and to create new ones. An ideal tool should automatically extract lineage by parsing the pipeline's ETL scripts, which are written in heterogeneous languages such as Python, SQL, and Hive.

At Intuit, we have developed SuperGlue, a tool that seamlessly tracks the lineage of complex production pipelines and makes it self-serve for Analysts, Data Scientists, and Engineers to interpret, debug, and iterate on data pipelines. Users start off by logging into the SuperGlue portal, where they can search for any job, table, or QlikView report.

A type-ahead feature lets users search for the tables, jobs, and reports they want to visualize and debug

As output, SuperGlue provides a single-pane, holistic view that combines the pipeline lineage with runtime execution stats, including scheduler timings, data quality, and change tracking in the scripts. Given a job, table, or report name, SuperGlue provides the following views:

  • Lineage View: Shows backward lineage of the specified table, job or report.
Showing lineage of a Data Pipeline Job
Showing lineage of a Table
Showing lineage of a QlikView report
  • Execution View: Shows the runtime details associated with jobs and tables. Users can highlight any element in the interactive lineage view to see its execution, data quality, and change tracking details.
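
To make "backward lineage" concrete, here is a minimal sketch of how such a view could be computed. It is not SuperGlue's implementation. It assumes lineage is already available as (job, input tables, output tables) records; the function name `backward_lineage` and all job and table names are hypothetical.

```python
from collections import deque


def backward_lineage(target_table, triplets):
    """Walk upstream from target_table and collect the jobs and tables it depends on.

    triplets is an iterable of (job_name, input_tables, output_tables).
    Returns (upstream_jobs, upstream_tables) in breadth-first order.
    """
    # Index each table by the jobs that write it.
    producers = {}
    for job, inputs, outputs in triplets:
        for table in outputs:
            producers.setdefault(table, []).append((job, inputs))

    upstream_jobs, upstream_tables = [], []
    seen_jobs, seen_tables = set(), {target_table}
    queue = deque([target_table])
    while queue:
        table = queue.popleft()
        for job, inputs in producers.get(table, []):
            if job not in seen_jobs:
                seen_jobs.add(job)
                upstream_jobs.append(job)
            for source in inputs:
                # The seen set also protects against cycles in the lineage.
                if source not in seen_tables:
                    seen_tables.add(source)
                    upstream_tables.append(source)
                    queue.append(source)
    return upstream_jobs, upstream_tables


# Hypothetical pipeline used only for illustration.
example_triplets = [
    ("ingest_orders", ["src.orders"], ["raw.orders"]),
    ("clean_orders", ["raw.orders", "raw.customers"], ["staging.orders_clean"]),
    ("daily_revenue", ["staging.orders_clean"], ["mart.daily_revenue"]),
    ("click_stats", ["raw.clicks"], ["mart.clicks"]),
]
jobs, tables = backward_lineage("mart.daily_revenue", example_triplets)
```

In this example the traversal reaches `daily_revenue`, `clean_orders`, and `ingest_orders`, and never visits `click_stats`, because nothing it writes feeds the target table. The same walk, started from the tables a report reads, would give the backward lineage of a report.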

Under the covers, SuperGlue tracks data lineage by analyzing the jobs associated with the pipeline. Specifically, we define a pipeline to be composed of jobs; each job is composed of one or more scripts; and each script consists of one or more queries, written mostly in SQL with some Hive. Each query is analyzed for its input and output tables. The lineage of a pipeline is defined as an array of triplets <Job Name, Input Tables, Output Tables>. Jobs are glued together wherever the output of one job becomes the input of the next. This analysis is not a one-time activity, because the lineage is continuously evolving.
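
The triplet structure can be illustrated with a small sketch. This is not SuperGlue's parser: it uses two simplified regular expressions in place of real SQL/Hive parsing, and the names `JobLineage`, `extract_statement_tables`, and `extract_job_lineage`, as well as the sample job and tables, are assumptions made for the example.

```python
import re
from typing import NamedTuple


class JobLineage(NamedTuple):
    job_name: str
    input_tables: frozenset
    output_tables: frozenset


# Simplified patterns. A production tool needs a real SQL/Hive parser to
# handle CTEs, quoted identifiers, comments, EXTRACT(... FROM ...), etc.
_OUTPUT_RE = re.compile(
    r"\b(?:insert\s+(?:into|overwrite)\s+(?:table\s+)?"
    r"|create\s+table\s+(?:if\s+not\s+exists\s+)?)([\w.]+)",
    re.IGNORECASE,
)
_INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_statement_tables(statement):
    """Return (input_tables, output_tables) for one SQL statement."""
    outputs = {t.lower() for t in _OUTPUT_RE.findall(statement)}
    inputs = {t.lower() for t in _INPUT_RE.findall(statement)} - outputs
    return inputs, outputs


def extract_job_lineage(job_name, scripts):
    """Fold all statements of all scripts in a job into one lineage triplet."""
    inputs, outputs = set(), set()
    for script in scripts:
        # Naive split: breaks on semicolons inside string literals.
        for statement in script.split(";"):
            if not statement.strip():
                continue
            stmt_inputs, stmt_outputs = extract_statement_tables(statement)
            inputs |= stmt_inputs
            outputs |= stmt_outputs
    # Tables written and read inside the same job are intermediates,
    # so they are not reported as inputs of the job.
    return JobLineage(job_name, frozenset(inputs - outputs), frozenset(outputs))


# Hypothetical script used only for illustration.
example_script = """
CREATE TABLE staging.orders_clean AS
SELECT o.* FROM raw.orders o JOIN raw.customers c ON o.customer_id = c.id;

INSERT OVERWRITE TABLE mart.daily_revenue
SELECT order_day, SUM(amount) FROM staging.orders_clean GROUP BY order_day;
"""
example_lineage = extract_job_lineage("build_daily_revenue", [example_script])
```

Here `staging.orders_clean` is both written and read by the job, so it appears only among the outputs; the job-level inputs are `raw.orders` and `raw.customers`. Re-running the extraction whenever scripts change is one way to keep the triplets current.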

After extracting lineage, SuperGlue joins the dependencies with execution profiling. It integrates two categories of profiling:

  • Operational Profiling: The focus is on job health and Data Fabric health. Job health involves tracking execution-related stats such as start time and completion time. Data Fabric health focuses on tracking events and stats from system components, namely source databases, ingestion tools, scheduling frameworks, analytical engines, serving databases, and publishing frameworks (such as Tableau, QlikView, and SageMaker).
  • Data Profiling: The focus here is on analyzing data-related patterns. This is a fairly broad subject and a topic for a future blog post.
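
As a rough illustration of joining lineage with operational profiling, the sketch below attaches the most recent run of each job to its lineage triplet and derives a simple health status. The `JobRun` record, the status labels, and the `max_duration` threshold are assumptions made for the example, not SuperGlue's actual schema or rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRun:
    job_name: str
    start_time: datetime
    completion_time: datetime
    succeeded: bool


def annotate_lineage(triplets, runs, max_duration):
    """Join (job, inputs, outputs) triplets with each job's latest run."""
    # Keep only the most recent run per job, judged by start time.
    latest = {}
    for run in runs:
        previous = latest.get(run.job_name)
        if previous is None or run.start_time > previous.start_time:
            latest[run.job_name] = run

    report = []
    for job, inputs, outputs in triplets:
        run = latest.get(job)
        if run is None:
            status = "no_run_recorded"
        elif not run.succeeded:
            status = "failed"
        elif run.completion_time - run.start_time > max_duration:
            status = "slow"
        else:
            status = "ok"
        report.append(
            {"job": job, "inputs": list(inputs), "outputs": list(outputs), "status": status}
        )
    return report


# Hypothetical lineage and run history used only for illustration.
t0 = datetime(2018, 12, 1, 2, 0)
example_report = annotate_lineage(
    triplets=[
        ("ingest_orders", ["src.orders"], ["raw.orders"]),
        ("clean_orders", ["raw.orders"], ["staging.orders_clean"]),
        ("daily_revenue", ["staging.orders_clean"], ["mart.daily_revenue"]),
        ("click_stats", ["raw.clicks"], ["mart.clicks"]),
    ],
    runs=[
        JobRun("ingest_orders", t0, t0 + timedelta(minutes=5), False),
        JobRun("clean_orders", t0, t0 + timedelta(minutes=10), True),
        JobRun("daily_revenue", t0 + timedelta(hours=1), t0 + timedelta(hours=3), True),
    ],
    max_duration=timedelta(hours=1),
)
```

With a report like this, an anomaly in a downstream table can be checked against the status of every upstream job in one place rather than by inspecting scheduler logs job by job.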

SuperGlue has been significantly helpful in improving the productivity of our data platform. It has been in internal beta, and we are now releasing it broadly within Intuit.

This is a team initiative led by Sunil Goplani, with technical leadership from Anand Elluru and Shradha Ambekar. The team included Shrushti Patel, Shikha Singhal, and Sooji Son.



Written by Sandeep Uttamchandani

Sharing 20+ years of real-world exec experience leading Data, Analytics, AI & SW Products. O’Reilly book author. Founder AIForEveryone.org. #Mentor #Advise
