Hello everyone, I am new to datahub using v0. 13.3.2 I am trying to create lineage from SQL Queries(MYSQL source) .I have installed python dependencies and ingested them from CLI and it was a success (Parsing queries from json file) Query: Insert into A(col1) Select abc from B; but after ingestion if I go to A Or B table I don’t see lineage in UI at all. Can someone help me here
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (e.g. 0.12.0)
- What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a or so other folks in the Community know if you’re still looking for help on the topic!
Thanks for reaching out! To help you more effectively, could you please provide the following information?
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (You mentioned v0.13.3.2, please confirm if that’s correct.)
- What data source(s) are you integrating with DataHub? (e.g., BigQuery, Redshift, MySQL, etc.)
Additionally, it would be helpful to know more about the steps you’ve taken so far and any specific error messages or logs you encountered. This information will help us pinpoint the issue more accurately.
Based on your initial description, it sounds like you successfully parsed the SQL queries and ingested them, but the lineage is not appearing in the UI. Here are a few things to check:
-
Ensure Lineage Extraction is Enabled: Make sure that lineage extraction is enabled in your ingestion recipe. For example, for SQL-based sources like Redshift, you need to enable
include_table_lineage
andinclude_view_lineage
in the configuration. -
Check the Ingestion Logs: Run the ingestion in debug mode to see detailed logs. This can help identify if there were any issues during the lineage extraction process.
-
Validate the Ingested Data: Ensure that the tables and views referenced in your SQL queries are already ingested and exist in DataHub.
-
Review the Lineage Configuration: Verify that your ingestion recipe correctly specifies the lineage extraction settings. For example, in Redshift, you might need to configure the
sql_based
ormixed
lineage collector.
If you can provide more details about your setup and the exact steps you’ve taken, we can give you more specific guidance.
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/redshift.py|Lineage Extraction in Redshift>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py|SQL Parsing and Lineage> 0 button 0 button Hint: Mention @runllm in the thread for followups.
Does Sql queries support mysql?
<@U06TM7M40RK>
DataHub does support SQL queries for lineage extraction, but the automatic lineage detection is not available for MySQL. According to the <Airflow Integration | DataHub documentation>, automatic lineage detection is supported for certain data warehouses like BigQuery and Snowflake that provide structured data for lineage construction. However, for MySQL, this feature is not available.
You can still manually create lineage for MySQL using the DataHub API or the DataHub UI. Here are the steps you can follow:
-
Manual Lineage Annotation: You can manually annotate lineage by setting
inlets
andoutlets
on your Airflow operators. This is useful if you’re using an operator that doesn’t support automatic lineage extraction or if you want to override the automatic lineage extraction. More details can be found in the <https://github.com/datahub-project/datahub/blob/master/docs/lineage/airflow.md|DataHub Airflow documentation>. -
Using the DataHub API: You can emit lineage manually using the DataHub Python SDK or GraphQL API. This allows you to define lineage relationships between datasets programmatically. More information can be found in the <https://datahubproject.io/docs/lineage/lineage-feature-guide/#managing-lineage-via-api|Lineage Feature Guide>.
-
Using the DataHub UI: You can also add lineage through the DataHub UI by manually linking datasets.
Here is an example of how you can use the DataHub Python SDK to emit lineage:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
DatasetLineageTypeClass,
UpstreamClass,
UpstreamLineageClass,
)
# Initialize the DataHub emitter
emitter = DatahubRestEmitter("http://localhost:8080")
# Define the source and target datasets
source_dataset_urn = make_dataset_urn(platform="mysql", name="database_name.table_b", env="PROD")
target_dataset_urn = make_dataset_urn(platform="mysql", name="database_name.table_a", env="PROD")
# Create the lineage relationship
upstream = UpstreamClass(
dataset=source_dataset_urn,
type=DatasetLineageTypeClass.TRANSFORMED,
)
lineage = UpstreamLineageClass(
upstreams=[upstream],
)
# Emit the lineage to DataHub
emitter.emit_mce(
{
"entityType": "dataset",
"entityUrn": target_dataset_urn,
"aspectName": "upstreamLineage",
"aspect": lineage,
}
)
Would you like more details on any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/lineage/airflow.md|DataHub Airflow Documentation>
- <https://datahubproject.io/docs/lineage/lineage-feature-guide/#managing-lineage-via-api|Lineage Feature Guide>
- <Slack Slack Conversation>
- <Airflow Integration | DataHub documentation> 0 button 0 button Hint: Mention @runllm in the thread for followups.
Check sql-lineage says query_type is unknown
The datahub check sql-lineage
command is used to parse SQL queries and generate lineage information. If the query_type
is reported as “unknown,” it typically indicates that the SQL parser was unable to recognize the type of SQL statement provided.
Here are a few steps to troubleshoot this issue:
-
Check SQL Query Syntax: Ensure that your SQL query is syntactically correct and adheres to the SQL dialect of the platform you are targeting (e.g., MySQL, BigQuery, Snowflake).
-
Specify SQL Dialect: When using the
datahub check sql-lineage
command, make sure to specify the correct SQL dialect using the--platform
option. For example, if you are using MySQL, you should specify--platform mysql
. -
Use Supported SQL Statements: Ensure that the SQL statement you are using is supported by the SQL parser. Commonly supported statements include
SELECT
,INSERT
,UPDATE
, andDELETE
. -
Run in Debug Mode: Run the command in debug mode to get more detailed logs, which can help identify the issue. You can add the
--verbose
flag to get more information.
Here is an example of how to use the datahub check sql-lineage
command with the correct options:
datahub check sql-lineage --sql "INSERT INTO A(col1) SELECT abc FROM B;" --platform mysql --default-db my_database --default-schema public
If the issue persists, it may be due to limitations or bugs in the SQL parser. In such cases, you can manually construct the lineage information using the DataHub API or the DataHub UI, as described in the previous response.
Would you like more detailed steps on any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/cli/check_cli.py|DataHub SQL Lineage Command>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py|DataHub SQL Parsing and Lineage>
- <Slack Slack Conversation> 0 button 0 button Hint: Mention @runllm in the thread for followups.