Lineage in MySQL

Lineage is a common area of study in the data modeling space. You often see graphs like this:

Screen Shot 2022-04-13 at 7.15.50 PM.png

This is a way of organizing tables into common groups for reusability with ad hoc reporting queries. These tables often start at source tables (your raw input data), and are layered with views for the specific query type you want to solve. In many cases there are multiple layers of views, and thus you end up with the lineage graph above.

Once you have your business modeled appropriately you may run into data issues where numbers are off or simply don't make sense. This is a challenging debugging situation as you need to understand where the pipeline broke. With very deep layers it's especially frustrating.

Screen Shot 2022-04-13 at 7.18.31 PM.png

There are often specialized tools to do these analysis in data warehouses like Bigquery or Snowflake. Datafold, Monte Carlo, Secoda are a few of the options. However you can also do this in MySQL.

Let's use the classical Jaffle Shop example: github.com/dbt-labs/jaffle_shop and build it in MySQL with lineage. Follow along with your own local mysql and the tables described here: gist.github.com/avaitla/dd7d20228c9db5d15c8..

SET FOREIGN_KEY_CHECKS=0;

CREATE DATABASE IF NOT EXISTS jaffle_shop;
USE jaffle_shop;

DROP TABLE raw_customers;
CREATE TABLE `raw_customers` (
  `id` int NOT NULL AUTO_INCREMENT,
  `first_name` varchar(255) DEFAULT NULL,
  `last_name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

DROP TABLE raw_orders;
CREATE TABLE `raw_orders` (
  `id` int NOT NULL AUTO_INCREMENT,
  `user_id` int DEFAULT NULL,
  `status` enum('completed','returned','return_pending','shipped','placed') DEFAULT NULL,
  `order_date` date DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `user_id` (`user_id`,`status`),
  CONSTRAINT `raw_orders_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `raw_customers` (`id`)
);

DROP TABLE raw_payments;
CREATE TABLE `raw_payments` (
  `id` int NOT NULL AUTO_INCREMENT,
  `order_id` int DEFAULT NULL,
  `payment_method` enum('credit_card','coupon','bank_transfer','gift_card') DEFAULT NULL,
  `amount` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `order_id` (`order_id`,`payment_method`),
  CONSTRAINT `raw_payments_ibfk_1` FOREIGN KEY (`order_id`) REFERENCES `raw_orders` (`id`)
);



CREATE DATABASE IF NOT EXISTS dbt_jaffle_shop;
USE dbt_jaffle_shop;

CREATE OR REPLACE VIEW `stg_customers` AS 

    WITH `source` AS (
       SELECT  `jaffle_shop`.`raw_customers`.`id` AS `id`,
               `jaffle_shop`.`raw_customers`.`first_name` AS `first_name`,
               `jaffle_shop`.`raw_customers`.`last_name` AS `last_name` 
       FROM `jaffle_shop`.`raw_customers`

    ), `renamed` as (
        SELECT `source`.`id` AS `customer_id`,
               `source`.`first_name` AS `first_name`,
               `source`.`last_name` AS `last_name` 
        FROM `source`
    ) 

    SELECT `renamed`.`customer_id` AS `customer_id`,
           `renamed`.`first_name` AS `first_name`,
           `renamed`.`last_name` AS `last_name` 
    FROM `renamed`;


CREATE OR REPLACE VIEW `stg_orders` AS 

    WITH `source` AS (
       SELECT `jaffle_shop`.`raw_orders`.`id` AS `id`,
              `jaffle_shop`.`raw_orders`.`user_id` AS `user_id`,
              `jaffle_shop`.`raw_orders`.`status` AS `status`,
              `jaffle_shop`.`raw_orders`.`order_date` AS `order_date`
        FROM `jaffle_shop`.`raw_orders`

    ), `renamed` AS (
       SELECT `source`.`id` AS `order_id`,
              `source`.`user_id` AS `customer_id`,
              `source`.`order_date` AS `order_date`,
              `source`.`status` AS `status`
       FROM `source`
    )

    SELECT `renamed`.`order_id` AS `order_id`,
           `renamed`.`customer_id` AS `customer_id`,
           `renamed`.`order_date` AS `order_date`,
           `renamed`.`status` AS `status`

    FROM `renamed`;


CREATE OR REPLACE VIEW `stg_payments` AS 

    WITH `source` AS (
       SELECT `jaffle_shop`.`raw_payments`.`id` AS `id`,
              `jaffle_shop`.`raw_payments`.`order_id` AS `order_id`,
              `jaffle_shop`.`raw_payments`.`payment_method` AS `payment_method`,
              `jaffle_shop`.`raw_payments`.`amount` AS `amount` 
       FROM `jaffle_shop`.`raw_payments`

    ), `renamed` AS (

       SELECT `source`.`id` AS `payment_id`,
              `source`.`order_id` AS `order_id`,
              `source`.`payment_method` AS `payment_method`,
              (`source`.`amount` / 100) AS `amount` 

       FROM `source`
    )

    SELECT `renamed`.`payment_id` AS `payment_id`,
           `renamed`.`order_id` AS `order_id`,
           `renamed`.`payment_method` AS `payment_method`,
           `renamed`.`amount` AS `amount`
    FROM `renamed`;


CREATE OR REPLACE VIEW customers AS 

with `customers` as (select `stg_customers`.`customer_id` AS `customer_id`,`stg_customers`.`first_name` AS `first_name`,`stg_customers`.`last_name` AS `last_name` from `stg_customers`), `orders` as (select `stg_orders`.`order_id` AS `order_id`,`stg_orders`.`customer_id` AS `customer_id`,`stg_orders`.`order_date` AS `order_date`,`stg_orders`.`status` AS `status` from `stg_orders`), `payments` as (select `stg_payments`.`payment_id` AS `payment_id`,`stg_payments`.`order_id` AS `order_id`,`stg_payments`.`payment_method` AS `payment_method`,`stg_payments`.`amount` AS `amount` from `stg_payments`), `customer_orders` as (select `orders`.`customer_id` AS `customer_id`,min(`orders`.`order_date`) AS `first_order`,max(`orders`.`order_date`) AS `most_recent_order`,count(`orders`.`order_id`) AS `number_of_orders` from `orders` group by `orders`.`customer_id`), `customer_payments` as (select `orders`.`customer_id` AS `customer_id`,sum(`payments`.`amount`) AS `total_amount` from (`payments` left join `orders` on((`payments`.`order_id` = `orders`.`order_id`))) group by `orders`.`customer_id`), `final` as (select `customers`.`customer_id` AS `customer_id`,`customers`.`first_name` AS `first_name`,`customers`.`last_name` AS `last_name`,`customer_orders`.`first_order` AS `first_order`,`customer_orders`.`most_recent_order` AS `most_recent_order`,`customer_orders`.`number_of_orders` AS `number_of_orders`,`customer_payments`.`total_amount` AS `customer_lifetime_value` from ((`customers` left join `customer_orders` on((`customers`.`customer_id` = `customer_orders`.`customer_id`))) left join `customer_payments` on((`customers`.`customer_id` = `customer_payments`.`customer_id`)))) select `final`.`customer_id` AS `customer_id`,`final`.`first_name` AS `first_name`,`final`.`last_name` AS `last_name`,`final`.`first_order` AS `first_order`,`final`.`most_recent_order` AS `most_recent_order`,`final`.`number_of_orders` AS `number_of_orders`,`final`.`customer_lifetime_value` AS `customer_lifetime_value` from `final`;


CREATE OR REPLACE VIEW orders AS

with `orders` as (select `stg_orders`.`order_id` AS `order_id`,`stg_orders`.`customer_id` AS `customer_id`,`stg_orders`.`order_date` AS `order_date`,`stg_orders`.`status` AS `status` from `stg_orders`), `payments` as (select `stg_payments`.`payment_id` AS `payment_id`,`stg_payments`.`order_id` AS `order_id`,`stg_payments`.`payment_method` AS `payment_method`,`stg_payments`.`amount` AS `amount` from `stg_payments`), `order_payments` as (select `payments`.`order_id` AS `order_id`,sum((case when (`payments`.`payment_method` = 'credit_card') then `payments`.`amount` else 0 end)) AS `credit_card_amount`,sum((case when (`payments`.`payment_method` = 'coupon') then `payments`.`amount` else 0 end)) AS `coupon_amount`,sum((case when (`payments`.`payment_method` = 'bank_transfer') then `payments`.`amount` else 0 end)) AS `bank_transfer_amount`,sum((case when (`payments`.`payment_method` = 'gift_card') then `payments`.`amount` else 0 end)) AS `gift_card_amount`,sum(`payments`.`amount`) AS `total_amount` from `payments` group by `payments`.`order_id`), `final` as (select `orders`.`order_id` AS `order_id`,`orders`.`customer_id` AS `customer_id`,`orders`.`order_date` AS `order_date`,`orders`.`status` AS `status`,`order_payments`.`credit_card_amount` AS `credit_card_amount`,`order_payments`.`coupon_amount` AS `coupon_amount`,`order_payments`.`bank_transfer_amount` AS `bank_transfer_amount`,`order_payments`.`gift_card_amount` AS `gift_card_amount`,`order_payments`.`total_amount` AS `amount` from (`orders` left join `order_payments` on((`orders`.`order_id` = `order_payments`.`order_id`)))) select `final`.`order_id` AS `order_id`,`final`.`customer_id` AS `customer_id`,`final`.`order_date` AS `order_date`,`final`.`status` AS `status`,`final`.`credit_card_amount` AS `credit_card_amount`,`final`.`coupon_amount` AS `coupon_amount`,`final`.`bank_transfer_amount` AS `bank_transfer_amount`,`final`.`gift_card_amount` AS `gift_card_amount`,`final`.`amount` AS `amount` from `final`;

SET FOREIGN_KEY_CHECKS=1;

MySQL will store views and their ancestor views in the information_schema.view_table_usage table:

Screen Shot 2022-04-13 at 7.21.43 PM.png

Using a recursive CTE you can indeed make a full lineage model of this where you "walk" the ancestor relationship until you ultimately arrive at an actual table instead of a view itself.

We build a view called lineage_view as follows:

CREATE OR REPLACE VIEW lineage_view AS

WITH RECURSIVE cte (depth, is_view, root_schema, root_name, root_concat, VIEW_SCHEMA, VIEW_NAME, TABLE_SCHEMA, TABLE_NAME, table_concat) AS (

    SELECT 1 as depth, IF(v2.VIEW_NAME IS NULL, False, True) as is_view, 

           v1.VIEW_SCHEMA as root_schema, v1.VIEW_NAME as root_name, 
           CONCAT(v1.VIEW_SCHEMA, ".", v1.VIEW_NAME) as root_concat,

           v1.VIEW_SCHEMA, v1.VIEW_NAME, v1.TABLE_SCHEMA, v1.TABLE_NAME,
           CONCAT(v1.TABLE_SCHEMA, ".", v1.TABLE_NAME) as table_concat

    FROM information_schema.view_table_usage v1
    LEFT JOIN information_schema.view_table_usage v2 
      ON v1.TABLE_SCHEMA = v2.VIEW_SCHEMA AND v1.TABLE_NAME = v2.VIEW_NAME

    UNION ALL

    SELECT depth+1 as depth, IF(v4.VIEW_NAME IS NULL, False, True) as is_view, 

           cte.VIEW_SCHEMA as root_schema, cte.VIEW_NAME as root_name, 
           CONCAT(cte.VIEW_SCHEMA, ".", cte.VIEW_NAME) as root_concat, 

           v3.VIEW_SCHEMA, v3.VIEW_NAME,

           v3.TABLE_SCHEMA, v3.TABLE_NAME,
           CONCAT(v3.TABLE_SCHEMA, ".", v3.TABLE_NAME) as table_concat


    FROM information_schema.view_table_usage v3
    LEFT JOIN information_schema.view_table_usage v4 
      ON v3.TABLE_SCHEMA = v4.VIEW_SCHEMA AND v3.TABLE_NAME = v4.VIEW_NAME

    INNER JOIN cte ON cte.TABLE_SCHEMA = v3.VIEW_SCHEMA AND cte.TABLE_NAME = v3.VIEW_NAME

)

SELECT * FROM cte;

Now to find the tables that comprise the "dbt_jaffle_shop.orders" view we simply do:

SELECT root_concat, depth, table_concat, is_view FROM lineage_view
WHERE root_schema = "dbt_jaffle_shop"
  AND root_name   = "orders";

Screen Shot 2022-04-13 at 7.24.57 PM.png

Enjoy!