Extracting Lineage from Stored Procedures Using SQL Queries Module in DataHub

Original Slack Thread

Hi guis!
How I can use SQL Queries module to extract lineage from stored procedure, for example?
Is there some guide or examples?
I tried to use parse_sql_lineage method of DataHubGraph to test it, but I received error.
<@UV14447EU> <@U01GZEETMEZ> Any ideas?

'CREATE PROCEDURE [schema_with_lineage].[procedure_number_three] AS
            BEGIN
            	IF' contains unsupported syntax. Falling back to parsing as a 'Command'.```![attachment](https://files.slack.com/files-pri/TUMKD5EGJ-F07KEUMTTN3/______________________.png)

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: _Please continue to tag RunLLM (e.g <@U06TM7M40RK>) to have it continue generating responses!

To extract lineage from stored procedures using the SQL Queries module in DataHub, you can follow these steps:

  1. Ingest SQL Queries: Use the SQL queries connector to ingest your SQL queries into DataHub. This connector generates column-level lineage and detailed table usage statistics from the query log.

  2. Enable Lineage Extraction: Ensure that lineage extraction is enabled in your ingestion recipe. For SQL-based sources, you need to enable the appropriate lineage extraction settings.

  3. Use DataHub SDK: If you need to manually emit lineage, you can use the DataHub SDK to define lineage relationships between datasets programmatically.

Here is an example of how you can use the DataHub Python SDK to emit lineage:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Initialize the DataHub emitter
emitter = DatahubRestEmitter("http://localhost:8080")

# Define the source and target datasets
source_dataset_urn = make_dataset_urn(platform="mysql", name="database_name.table_b", env="PROD")
target_dataset_urn = make_dataset_urn(platform="mysql", name="database_name.table_a", env="PROD")

# Create the lineage relationship
upstream = UpstreamClass(
    dataset=source_dataset_urn,
    type=DatasetLineageTypeClass.TRANSFORMED,
)

lineage = UpstreamLineageClass(
    upstreams=[upstream],
)

# Emit the lineage to DataHub
emitter.emit_mce(
    {
        "entityType": "dataset",
        "entityUrn": target_dataset_urn,
        "aspectName": "upstreamLineage",
        "aspect": lineage,
    }
)

Example Ingestion Recipe:

source:
  type: sql-queries
  config:
    platform: mysql
    query_log_path: /path/to/your/query/log
    include_table_lineage: true
    include_column_lineage: true
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

Running the Ingestion:

datahub ingest -c /path/to/your/ingestion_recipe.yml

Viewing Lineage in DataHub:
After ingestion, you can view the lineage information in the DataHub UI. Navigate to the datasets involved in the SQL queries, and you should see the lineage relationships visualized.

Handling Stored Procedures:
As per <Slack Slack Conversation>, parsing stored procedures in MS SQL for lineage is not yet supported directly. One solution is to convert the stored procedure definition to DML statements and use the SQL parser to parse DML and emit lineage with the DataHub SDK.

Would you like more detailed steps on any specific part of this process?

Sources:

  • <Slack Slack Conversation> :+1: 0 button :-1: 0 button :bulb: Hint: Mention @runllm in the thread for followups.