How to Ingest Manually Stored Spark Lineage Information into DataHub

Original Slack Thread

Hi Team,
Could anyone help me understand how we can ingest lineage information into DataHub? We manually store lineage information from Spark to a file, and that file needs to be loaded into DataHub.

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g <@U06TM7M40RK>) to have it continue generating responses!

To ingest lineage information from a file into DataHub, you can use the file-based lineage ingestion method. This involves creating a lineage file in a specific format and then using DataHub’s ingestion capabilities to load this file. Here’s a brief overview of the process:

  1. Lineage File Format: The lineage source file should be a .yml file with specific top-level keys. According to the <File Based Lineage | DataHub documentation>, the file should include:

    • version: The version of the lineage file config. Currently, only version 1 is supported.
    • lineage: A top-level key containing a list of EntityNodeConfig objects.

    Each EntityNodeConfig object should have:

    • entity: An EntityConfig object with details like name, type (only dataset is supported), env, and platform.
    • upstream: (Optional) A list of child EntityNodeConfig objects.
    • fineGrainedLineages: (Optional) A list of FineGrainedLineageConfig objects for column-level lineage.
  2. Example Configuration: Here is an example of how the configuration might look:

    version: 1
    lineage:
      - entity:
          name: userdb.public.customer_table
          type: dataset
          env: DEV
          platform: redshift
        upstream:
          - entity:
              name: userdb.public.orders_table
              type: dataset
              env: DEV
              platform: redshift
        fineGrainedLineages:
          - upstreamType: FIELD_SET
            upstreams: [urn:li:schemaField:(urn:li:dataPlatform:redshift,userdb.public.orders_table,order_id)]
            downstreamType: FIELD_SET
            downstreams: [urn:li:schemaField:(urn:li:dataPlatform:redshift,userdb.public.customer_table,customer_id)]
            transformOperation: "join"
            confidenceScore: 0.9
    
  3. Ingesting the File: Once your .yml file is ready, use DataHub’s ingestion framework to load it. This typically means writing a small ingestion recipe that points at the file and running it with the DataHub command-line interface (CLI).
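As a sketch of that last step, a minimal ingestion recipe might look like the following (the `datahub-lineage-file` source type and `file` config key are taken from the File Based Lineage documentation; the file path and server URL are placeholders you would replace with your own):

    # lineage_recipe.yml
    source:
      type: datahub-lineage-file
      config:
        # Path to the lineage .yml file described above
        file: ./spark_lineage.yml
        # Whether to merge with any lineage already in DataHub
        preserve_upstream: false

    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080

You would then run it with the DataHub CLI:

    datahub ingest -c lineage_recipe.yml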

Would you like more detailed steps on how to set up the ingestion process or any specific part of the configuration?
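Since your Spark job already writes lineage to a file, one practical approach is a small conversion script that turns that file into the YAML format shown above. The sketch below is hypothetical: it assumes the Spark job logged JSON records with `downstream` and `upstreams` fields (your actual format will differ), and it emits the lineage YAML using plain string templating so it has no dependencies.

```python
import json


def to_lineage_yaml(records, env="DEV", platform="redshift"):
    """Convert a list of {"downstream": ..., "upstreams": [...]} records
    (a hypothetical format a Spark job might have written) into the
    version-1 file-based-lineage YAML that DataHub expects."""
    lines = ["version: 1", "lineage:"]
    for rec in records:
        # Downstream entity (the table being written)
        lines.append("  - entity:")
        lines.append("      name: " + rec["downstream"])
        lines.append("      type: dataset")
        lines.append("      env: " + env)
        lines.append("      platform: " + platform)
        # Upstream entities (the tables it was derived from)
        lines.append("    upstream:")
        for up in rec["upstreams"]:
            lines.append("      - entity:")
            lines.append("          name: " + up)
            lines.append("          type: dataset")
            lines.append("          env: " + env)
            lines.append("          platform: " + platform)
    return "\n".join(lines) + "\n"


# Example: one record as the Spark job might have logged it (JSON lines)
raw = '{"downstream": "userdb.public.customer_table", "upstreams": ["userdb.public.orders_table"]}'
records = [json.loads(raw)]
print(to_lineage_yaml(records))
```

The resulting file can then be fed to the `datahub-lineage-file` ingestion source. If you also need column-level lineage, you would extend the script to emit `fineGrainedLineages` entries in the same way.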
