Troubleshooting Redshift Metadata Ingestion for Lineage Extraction

Original Slack Thread

Hi everyone,

I’m trying to ingest Redshift table/view metadata and limit lineage extraction to views only by setting include_view_lineage = true and include_table_lineage = false. However, the logs show that table lineage is still being extracted, so setting include_table_lineage to false does not appear to fully skip lineage extraction for tables.

Here is the recipe I am using:

source:
  type: redshift
  config:
    host_port: "${REDSHIFT_HOST}:${REDSHIFT_PORT}"
    database: "${REDSHIFT_DB}"
    username: "${REDSHIFT_USER}"
    password: "${REDSHIFT_PASSWORD}"

    use_lineage_v2: true
    include_tables: true
    include_views: true
    include_view_lineage: true
    skip_external_tables: false
    include_table_lineage: false
    include_table_rename_lineage: false
    table_lineage_mode: "sql_based" # I would like to set this to null, but get an error
    include_table_location_lineage: false
    include_usage_statistics: false

    include_top_n_queries: false
    top_n_queries: 1
    lineage_v2_generate_queries: false

    include_unload_lineage: false
    extract_column_level_lineage: false
    include_operational_stats: false
    include_copy_lineage: false
    include_view_column_lineage: false
    incremental_lineage: false

    resolve_temp_table_in_lineage: false

    profiling:
        enabled: false
    stateful_ingestion:
        enabled: true
        remove_stale_metadata: true

pipeline_name: "redshift_lineage_only_ingest"

Has anyone encountered this issue or found a way to configure ingestion to exclude table lineage completely? Any guidance would be greatly appreciated!

Thanks!

It seems likely that setting include_table_lineage to false is not fully preventing table lineage extraction in your run. According to the "Redshift | DataHub" documentation, the include_table_lineage parameter controls whether table lineage is ingested. However, other configuration options or defaults might still cause table lineage to be extracted.

One potential area to investigate is the table_lineage_mode parameter. You mentioned wanting to set it to null but encountering errors. This parameter controls the mode of table lineage collection, and setting it to a specific mode like sql_based might still trigger some lineage extraction processes. Unfortunately, the documentation does not specify a null option for this parameter, which might be why you’re seeing errors.
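As a rough illustration of why a null value is rejected, the sketch below models table_lineage_mode as a closed set of allowed strings. The LineageMode class and parse_lineage_mode helper are hypothetical stand-ins, not DataHub's actual code, and the three mode names are assumed from the Redshift connector documentation, so check them against your installed version.

```python
from enum import Enum


class LineageMode(str, Enum):
    # Assumed mode names; verify against the connector docs for your version.
    SQL_BASED = "sql_based"
    STL_SCAN_BASED = "stl_scan_based"
    MIXED = "mixed"


def parse_lineage_mode(value):
    """Return the matching mode, or raise ValueError listing the allowed values."""
    try:
        return LineageMode(value)
    except ValueError:
        allowed = ", ".join(mode.value for mode in LineageMode)
        raise ValueError(
            f"table_lineage_mode must be one of: {allowed}; got {value!r}"
        ) from None
```

Under this model, a YAML null arrives as None, which matches no member of the set, so validation fails before ingestion starts. If that is what happens in the real connector, the mode cannot be used as an off switch, and only the include_* flags can turn table lineage off.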

Additionally, the source code (https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/lineage_v2.py) suggests that even with include_table_lineage set to false, other configurations or default behaviors might still lead to some lineage extraction, especially if other related parameters are not correctly set to prevent it.

To address this, you might want to ensure that all related lineage parameters are set to false, as you have done, and verify if there are any other configurations that might be influencing the behavior. It could also be helpful to check if there are any updates or known issues in the DataHub community or GitHub repository that might address this behavior.
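One way to double-check the recipe before running it is a small audit script. The sketch below is a hypothetical helper, not part of DataHub. It takes the recipe as a plain Python dict and reports which table-lineage-related flags (names copied from the recipe in this thread) are explicitly true and which are left unset. Which of these flags the connector actually consults is an assumption here, and unset flags fall back to connector defaults that this script does not know.

```python
# Flag names taken from the recipe in this thread; the grouping is an assumption.
TABLE_LINEAGE_FLAGS = (
    "include_table_lineage",
    "include_table_rename_lineage",
    "include_table_location_lineage",
    "include_copy_lineage",
    "include_unload_lineage",
    "resolve_temp_table_in_lineage",
    "incremental_lineage",
)


def audit_table_lineage_flags(recipe):
    """Return (enabled, unset) lists of table-lineage flag names.

    enabled: flags explicitly set to true in source.config
    unset:   flags missing from source.config (the connector default applies)
    """
    config = recipe.get("source", {}).get("config", {})
    enabled = [name for name in TABLE_LINEAGE_FLAGS if config.get(name) is True]
    unset = [name for name in TABLE_LINEAGE_FLAGS if name not in config]
    return enabled, unset


if __name__ == "__main__":
    example = {
        "source": {
            "type": "redshift",
            "config": {
                "include_view_lineage": True,
                "include_table_lineage": False,
                "include_copy_lineage": False,
            },
        }
    }
    enabled, unset = audit_table_lineage_flags(example)
    print("explicitly enabled:", enabled)
    print("unset (default applies):", unset)
```

If both lists come back empty and table lineage still appears in the logs, the recipe is probably not the cause, and the behavior is worth raising as a GitHub issue.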

Would you like more detailed guidance on any specific part of the configuration or further exploration of the source code?
