Troubleshooting Common Table Expressions (CTEs) Lineage in DataHub 0.14.1 and Handling KeyError in `sqlglot_lineage.py`

Original Slack Thread

<@U06TM7M40RK> 0.14.1 datahub에서 CTE에 대한 리니지처리에 대해서

To handle lineage for Common Table Expressions (CTEs) in DataHub version 0.14.1, you can utilize the SQL Queries ingestion source. This source is capable of parsing SQL queries, including those with CTEs, to generate both column-level and table-level lineage.

Here’s a brief overview of how you can achieve this:

  1. Prepare the Query File: Create a newline-delimited JSON file containing your SQL queries. Each line should be a JSON object with fields such as query, timestamp, user, downstream_tables, and upstream_tables. This file will be used to parse the SQL queries and extract lineage information.

    Example JSON object for a query with a CTE:

    {
      "query": "WITH cte AS (SELECT * FROM my_table) SELECT * FROM cte",
      "timestamp": 1689232738.051,
      "user": "user_a",
      "downstream_tables": [],
      "upstream_tables": ["my_database.my_schema.my_table"]
    }
    
  2. Configure the Ingestion: Set up the ingestion configuration to use the sql-queries source. Specify the path to your query file and other necessary configurations.

    Example configuration:

    datahub_api:
      server: http://localhost:8080
      timeout_sec: 60
    source:
      type: sql-queries
      config:
        platform: "your_platform"
        default_db: "your_default_db"
        query_file: "./queries.json"
    
  3. Run the Ingestion: Use the DataHub CLI to run the ingestion process. This will parse the SQL queries and ingest the lineage information into DataHub.

    Command to run the ingestion:

    datahub ingest -c /path/to/your/ingestion-config.yml
    
  4. Visualize the Lineage: Once the ingestion is complete, you can visualize the lineage in the DataHub UI. Navigate to the dataset page and click on the “Lineage” tab to see the upstream and downstream dependencies, including those involving CTEs.

For more detailed information, you can refer to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/sql-queries|SQL Queries documentation>.

Would you like more detailed steps or examples on any part of this process?

Sources:

<@U06TM7M40RK> sqlglot_lineage.py에서 cte오류 발생. keyError, table_mapping

It seems likely that you’re encountering a KeyError related to CTEs (Common Table Expressions) in the sqlglot_lineage.py script, specifically with the table_mapping. This can happen if the script is unable to resolve the table names or mappings correctly when processing CTEs.

Based on the information from the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/sql_parsing/sqlglot_lineage.py|sqlglot_lineage.py file>, here are some potential steps to troubleshoot and resolve the issue:

  1. Check Table Name Mapping: Ensure that the table names used in your CTEs are correctly mapped in the table_name_urn_mapping. The script attempts to convert table names to URNs, and a KeyError can occur if a table name is not found in the mapping.

  2. Review Schema Information: The script uses schema information to resolve table names. Make sure that the schema information for all tables involved in the CTEs is available and correctly configured. This includes ensuring that the schema_resolver is providing the necessary schema details.

  3. Debugging Information: The script logs debug information about the tables and schemas it resolves. Check the debug logs to see if there are any messages indicating missing schema information or unresolved table names.

  4. Handling Unsupported Syntax: If the CTEs contain complex or unsupported SQL syntax, the script might not be able to parse them correctly. Consider simplifying the SQL or breaking it down into smaller parts to identify the problematic section.

  5. Update to Latest Version: Ensure you are using the latest version of the DataHub ingestion framework, as there might be bug fixes or improvements related to SQL parsing and lineage extraction.

If you need more detailed guidance or specific examples, please let me know!

Sources:

Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @ RunLLM if you still need further assistance !

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><