Analyzing Column-Level Data Lineage with the DataHub 0.13.3 CLI for Snowflake, Oracle, and PostgreSQL

Original Slack Thread

Use the DataHub version 0.13.3 CLI for Snowflake, Oracle, and PostgreSQL. Specify the sink as a JSON file. Explain two methods for analyzing data lineage at the column level:

1. Parsing SQL files directly.
2. Parsing SQL queries stored in the PostgreSQL table column TB_SQL.sql_src.
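
For the first method, DataHub ships a `sql-queries` ingestion source that reads a file of SQL statements and runs them through its SQL parser to produce table- and column-level lineage; combined with the `file` sink, the output lands in a JSON file as requested. A minimal recipe sketch (paths and names are placeholders, and the exact query-file line format should be checked against the source's documentation):

    source:
      type: sql-queries
      config:
        query_file: /path/to/queries.json  # newline-delimited JSON, one {"query": "..."} object per line
        platform: postgres                 # SQL dialect to parse with
        default_db: your_database
        env: PROD
    sink:
      type: file
      config:
        filename: /path/to/lineage-output.json

Column-level results are more accurate when the referenced tables have already been ingested, since the parser resolves column names against schemas known to DataHub.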

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

I’m not able to find any data sources to answer your question. Could you provide more details about the specific issue or question you have regarding DataHub? If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> DataHub version 0.13.3 CLI for Snowflake, Oracle, and PostgreSQL.

Great, thanks for the details! Given that you’re using DataHub version 0.13.3 and the CLI for ingestion with Snowflake, Oracle, and PostgreSQL, I can provide guidance on setting up ingestion and lineage for these data sources.

Steps to Set Up Ingestion

  1. Install Required Packages:
    Ensure you have the necessary packages installed for Snowflake, Oracle, and PostgreSQL ingestion.

    pip install 'acryl-datahub[snowflake,oracle,postgres]'
    
  2. Create Ingestion Configuration Files:
    Create YAML configuration files for each data source.

    Snowflake Configuration:

    source:
      type: snowflake
      config:
        account_id: your_snowflake_account_id
        warehouse: your_warehouse
        username: your_username
        password: your_password
        role: your_role
        database_pattern:
          allow:
            - your_database
        include_views: true
        include_table_lineage: true
        include_column_lineage: true
        profiling:
          enabled: true
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    

    Oracle Configuration:

    source:
      type: oracle
      config:
        host_port: your_oracle_host:1521
        service_name: your_service_name
        username: your_username
        password: your_password
        platform_instance: your_service_name
        include_views: true
        include_tables: true
        include_view_lineage: true
        include_view_column_lineage: true
        include_table_location_lineage: true
        data_dictionary_mode: ALL
        env: PROD
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    

    PostgreSQL Configuration:

    source:
      type: postgres
      config:
        host_port: your_postgres_host:5432
        database: your_database
        username: your_username
        password: your_password
        platform_instance: your_instance
        include_views: true
        include_tables: true
        include_view_lineage: true
        include_view_column_lineage: true
        include_table_location_lineage: true
        env: PROD
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    
  3. Run the Ingestion:
    Use the DataHub CLI to run the ingestion for each data source.

    datahub ingest -c /path/to/snowflake-ingestion.yml
    datahub ingest -c /path/to/oracle-ingestion.yml
    datahub ingest -c /path/to/postgres-ingestion.yml
    

Setting Up Lineage

To set up lineage between datasets from different platforms, you can use file-based lineage.

  1. Define Lineage in a YAML File:
    Create a lineage file (e.g. lineage.yaml) listing each dataset and its upstreams:

    version: 1
    lineage:
      - entity:
          name: db2.table2
          type: dataset
          platform: oracle
          env: PROD
        upstream:
          - entity:
              name: db1.table1
              type: dataset
              platform: snowflake
              env: PROD
      - entity:
          name: db4.table4
          type: dataset
          platform: snowflake
          env: PROD
        upstream:
          - entity:
              name: db3.table3
              type: dataset
              platform: postgres
              env: PROD

    Then point an ingestion recipe at that file:

    source:
      type: datahub-lineage-file
      config:
        file: /path/to/lineage.yaml

  2. Ingest Lineage Information:
    Use the CLI to ingest the lineage information.

    datahub ingest -c /path/to/lineage-ingestion-recipe.yaml


Visualizing Lineage in DataHub UI

After running the ingestion and lineage setup, you can visualize the lineage in the DataHub UI:

  1. Navigate to the Dataset Page: Go to the dataset page in the DataHub UI.
  2. View Lineage Tab: Click on the "Lineage" tab to see the upstream and downstream dependencies.

Would you like more detailed instructions or examples for any specific part of this process?
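
For the second method from the original question (queries stored in TB_SQL.sql_src), one approach is to export those rows into the newline-delimited JSON query file that the `sql-queries` source reads, then ingest that file. A hedged stdlib sketch: the psycopg2 fetch is shown only as a comment so the script stays self-contained, and the sample statement stands in for real TB_SQL rows.

```python
import json

# In practice, fetch the queries from PostgreSQL, e.g. with psycopg2:
#   import psycopg2
#   with psycopg2.connect("dbname=your_database") as conn, conn.cursor() as cur:
#       cur.execute("SELECT sql_src FROM TB_SQL")
#       queries = [row[0] for row in cur.fetchall()]
# Stand-in sample row so the sketch runs without a database:
queries = [
    "INSERT INTO sales.daily (day, total) "
    "SELECT order_day, SUM(amount) FROM sales.orders GROUP BY order_day",
]

# Write one JSON object per line; the "query" key is what the
# sql-queries source reads from its query_file.
with open("queries.json", "w") as f:
    for q in queries:
        f.write(json.dumps({"query": q}) + "\n")
```

Running the sql-queries recipe against the resulting queries.json then produces column-level lineage for the statements pulled from TB_SQL.sql_src.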

Sources:
- [Oracle Ingestion Configuration](<https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/oracle#config-details>)
- [SchemaResolver Initialization](<https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/sql_parsing/schema_resolver.py>)
- [SQL Parsing Example](<https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/sql_parsing/sqlglot_lineage.py>)
- [File-based Lineage Documentation](<https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/>)
- [DataHub CLI Documentation](<https://github.com/datahub-project/datahub/blob/master/docs/cli.md>)