Troubleshooting Lineage Extraction for SQL Queries in DataHub

Original Slack Thread

Hello everyone, I am new to datahub using v0. 13.3.2 I am trying to create lineage from SQL Queries(MYSQL source) .I have installed python dependencies and ingested them from CLI and it was a success (Parsing queries from json file) Query: Insert into A(col1) Select abc from B; but after ingestion if I go to A Or B table I don’t see lineage in UI at all. Can someone help me here

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

Thanks for reaching out! To help you more effectively, could you please provide the following information?

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (You mentioned v0.13.3.2, please confirm if that’s correct.)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery, Redshift, MySQL, etc.)

Additionally, it would be helpful to know more about the steps you’ve taken so far and any specific error messages or logs you encountered. This information will help us pinpoint the issue more accurately.

Based on your initial description, it sounds like you successfully parsed the SQL queries and ingested them, but the lineage is not appearing in the UI. Here are a few things to check:

  1. Ensure Lineage Extraction is Enabled: Make sure that lineage extraction is enabled in your ingestion recipe. For example, for SQL-based sources like Redshift, you need to enable include_table_lineage and include_view_lineage in the configuration.

  2. Check the Ingestion Logs: Run the ingestion in debug mode to see detailed logs. This can help identify if there were any issues during the lineage extraction process.

  3. Validate the Ingested Data: Ensure that the tables and views referenced in your SQL queries are already ingested and exist in DataHub.

  4. Review the Lineage Configuration: Verify that your ingestion recipe correctly specifies the lineage extraction settings. For example, in Redshift, you might need to configure the sql_based or mixed lineage collector.

If you can provide more details about your setup and the exact steps you’ve taken, we can give you more specific guidance.

Sources:

Does Sql queries support mysql?

<@U06TM7M40RK>

DataHub does support SQL queries for lineage extraction, but the automatic lineage detection is not available for MySQL. According to the <Airflow Integration | DataHub documentation>, automatic lineage detection is supported for certain data warehouses like BigQuery and Snowflake that provide structured data for lineage construction. However, for MySQL, this feature is not available.

You can still manually create lineage for MySQL using the DataHub API or the DataHub UI. Here are the steps you can follow:

  1. Manual Lineage Annotation: You can manually annotate lineage by setting inlets and outlets on your Airflow operators. This is useful if you’re using an operator that doesn’t support automatic lineage extraction or if you want to override the automatic lineage extraction. More details can be found in the <https://github.com/datahub-project/datahub/blob/master/docs/lineage/airflow.md|DataHub Airflow documentation>.

  2. Using the DataHub API: You can emit lineage manually using the DataHub Python SDK or GraphQL API. This allows you to define lineage relationships between datasets programmatically. More information can be found in the <https://datahubproject.io/docs/lineage/lineage-feature-guide/#managing-lineage-via-api|Lineage Feature Guide>.

  3. Using the DataHub UI: You can also add lineage through the DataHub UI by manually linking datasets.

Here is an example of how you can use the DataHub Python SDK to emit lineage:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Initialize the DataHub emitter
emitter = DatahubRestEmitter("http://localhost:8080")

# Define the source and target datasets
source_dataset_urn = make_dataset_urn(platform="mysql", name="database_name.table_b", env="PROD")
target_dataset_urn = make_dataset_urn(platform="mysql", name="database_name.table_a", env="PROD")

# Create the lineage relationship
upstream = UpstreamClass(
    dataset=source_dataset_urn,
    type=DatasetLineageTypeClass.TRANSFORMED,
)

lineage = UpstreamLineageClass(
    upstreams=[upstream],
)

# Emit the lineage to DataHub
emitter.emit_mce(
    {
        "entityType": "dataset",
        "entityUrn": target_dataset_urn,
        "aspectName": "upstreamLineage",
        "aspect": lineage,
    }
)

Would you like more details on any specific part of this process?

Sources:

Check sql-lineage says query_type is unknown

The datahub check sql-lineage command is used to parse SQL queries and generate lineage information. If the query_type is reported as “unknown,” it typically indicates that the SQL parser was unable to recognize the type of SQL statement provided.

Here are a few steps to troubleshoot this issue:

  1. Check SQL Query Syntax: Ensure that your SQL query is syntactically correct and adheres to the SQL dialect of the platform you are targeting (e.g., MySQL, BigQuery, Snowflake).

  2. Specify SQL Dialect: When using the datahub check sql-lineage command, make sure to specify the correct SQL dialect using the --platform option. For example, if you are using MySQL, you should specify --platform mysql.

  3. Use Supported SQL Statements: Ensure that the SQL statement you are using is supported by the SQL parser. Commonly supported statements include SELECT, INSERT, UPDATE, and DELETE.

  4. Run in Debug Mode: Run the command in debug mode to get more detailed logs, which can help identify the issue. You can add the --verbose flag to get more information.

Here is an example of how to use the datahub check sql-lineage command with the correct options:

datahub check sql-lineage --sql "INSERT INTO A(col1) SELECT abc FROM B;" --platform mysql --default-db my_database --default-schema public

If the issue persists, it may be due to limitations or bugs in the SQL parser. In such cases, you can manually construct the lineage information using the DataHub API or the DataHub UI, as described in the previous response.

Would you like more detailed steps on any specific part of this process?

Sources: