Analyzing Column-Level Data Lineage with the DataHub 0.13.3 CLI for Snowflake, Oracle, and PostgreSQL

Original Slack Thread

Use the DataHub version 0.13.3 CLI for Snowflake, Oracle, and PostgreSQL. Specify the sink as a JSON file. Explain two methods for analyzing data lineage at the column level:

1. Parsing SQL files directly.
2. Parsing SQL queries stored in the PostgreSQL table column TB_SQL.sql_src.
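
For the first method, DataHub ships a `sql-queries` ingestion source that reads a file of SQL statements and runs them through its SQL parser to produce table- and column-level lineage; combined with the `file` sink, the output lands in a JSON file as requested. A minimal recipe sketch (paths and names are placeholders, and the exact query-file line format should be checked against the source's documentation):

    source:
      type: sql-queries
      config:
        query_file: /path/to/queries.json  # newline-delimited JSON, one {"query": "..."} object per line
        platform: postgres                 # SQL dialect to parse with
        default_db: your_database
        env: PROD
    sink:
      type: file
      config:
        filename: /path/to/lineage-output.json

Column-level results are more accurate when the referenced tables have already been ingested, since the parser resolves column names against schemas known to DataHub.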

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

I’m not able to find any data sources to answer your question. Could you provide more details about the specific issue or question you have regarding DataHub? If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> DataHub version 0.13.3 CLI for Snowflake, Oracle, and PostgreSQL.

Great, thanks for the details! Given that you’re using DataHub version 0.13.3 and the CLI for ingestion with Snowflake, Oracle, and PostgreSQL, I can provide guidance on setting up ingestion and lineage for these data sources.

Steps to Set Up Ingestion

  1. Install Required Packages:
    Ensure you have the necessary packages installed for Snowflake, Oracle, and PostgreSQL ingestion.

    pip install 'acryl-datahub[snowflake,oracle,postgres]'
    
  2. Create Ingestion Configuration Files:
    Create YAML configuration files for each data source.

    Snowflake Configuration:

    source:
      type: snowflake
      config:
        account_id: your_snowflake_account_id
        warehouse: your_warehouse
        username: your_username
        password: your_password
        role: your_role
        database_pattern:
          allow:
            - your_database
        include_views: true
        include_table_lineage: true
        include_column_lineage: true
        profiling:
          enabled: true
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    

    Oracle Configuration:

    source:
      type: oracle
      config:
        host_port: your_oracle_host:1521
        service_name: your_service_name
        username: your_username
        password: your_password
        platform_instance: your_service_name
        include_views: true
        include_tables: true
        include_view_lineage: true
        include_view_column_lineage: true
        include_table_location_lineage: true
        data_dictionary_mode: ALL
        env: PROD
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    

    PostgreSQL Configuration:

    source:
      type: postgres
      config:
        host_port: your_postgres_host:5432
        database: your_database
        username: your_username
        password: your_password
        platform_instance: your_instance
        include_views: true
        include_tables: true
        include_view_lineage: true
        include_view_column_lineage: true
        include_table_location_lineage: true
        env: PROD
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    
  3. Run the Ingestion:
    Use the DataHub CLI to run the ingestion for each data source.

    datahub ingest -c /path/to/snowflake-ingestion.yml
    datahub ingest -c /path/to/oracle-ingestion.yml
    datahub ingest -c /path/to/postgres-ingestion.yml
    

Setting Up Lineage

To set up lineage between datasets from different platforms, you can use file-based lineage.

  1. Define Lineage in a YAML File:
    Create a lineage file (e.g. lineage.yaml) listing each dataset and its upstreams:

    version: 1
    lineage:
      - entity:
          name: db2.table2
          type: dataset
          platform: oracle
          env: PROD
        upstream:
          - entity:
              name: db1.table1
              type: dataset
              platform: snowflake
              env: PROD
      - entity:
          name: db4.table4
          type: dataset
          platform: snowflake
          env: PROD
        upstream:
          - entity:
              name: db3.table3
              type: dataset
              platform: postgres
              env: PROD

    Then point an ingestion recipe at that file:

    source:
      type: datahub-lineage-file
      config:
        file: /path/to/lineage.yaml

  2. Ingest Lineage Information:
    Use the CLI to ingest the lineage information.

    datahub ingest -c /path/to/lineage-ingestion-recipe.yaml


Visualizing Lineage in DataHub UI

After running the ingestion and lineage setup, you can visualize the lineage in the DataHub UI:

  1. Navigate to the Dataset Page: Go to the dataset page in the DataHub UI.
  2. View Lineage Tab: Click on the "Lineage" tab to see the upstream and downstream dependencies.

Would you like more detailed instructions or examples for any specific part of this process?
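
For the second method from the original question (queries stored in TB_SQL.sql_src), one approach is to export those rows into the newline-delimited JSON query file that the `sql-queries` source reads, then ingest that file. A hedged stdlib sketch: the psycopg2 fetch is shown only as a comment so the script stays self-contained, and the sample statement stands in for real TB_SQL rows.

```python
import json

# In practice, fetch the queries from PostgreSQL, e.g. with psycopg2:
#   import psycopg2
#   with psycopg2.connect("dbname=your_database") as conn, conn.cursor() as cur:
#       cur.execute("SELECT sql_src FROM TB_SQL")
#       queries = [row[0] for row in cur.fetchall()]
# Stand-in sample row so the sketch runs without a database:
queries = [
    "INSERT INTO sales.daily (day, total) "
    "SELECT order_day, SUM(amount) FROM sales.orders GROUP BY order_day",
]

# Write one JSON object per line; the "query" key is what the
# sql-queries source reads from its query_file.
with open("queries.json", "w") as f:
    for q in queries:
        f.write(json.dumps({"query": q}) + "\n")
```

Running the sql-queries recipe against the resulting queries.json then produces column-level lineage for the statements pulled from TB_SQL.sql_src.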

Sources:
- [Oracle Ingestion Configuration](<https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/oracle#config-details>)
- [SchemaResolver Initialization](<https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/sql_parsing/schema_resolver.py>)
- [SQL Parsing Example](<https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/sql_parsing/sqlglot_lineage.py>)
- [File-based Lineage Documentation](<https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/>)
- [DataHub CLI Documentation](<https://github.com/datahub-project/datahub/blob/master/docs/cli.md>)