How to Ingest Manually Stored Spark Lineage Information into DataHub

Original Slack Thread

Hi Team,
Could anyone help me understand how we can ingest lineage information into DataHub? We manually store lineage information from Spark to a file, and that file needs to be loaded into DataHub.

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g <@U06TM7M40RK>) to have it continue generating responses!

To ingest lineage information from a file into DataHub, you can use the file-based lineage ingestion method. This involves creating a lineage file in a specific format and then using DataHub’s ingestion capabilities to load this file. Here’s a brief overview of the process:

  1. Lineage File Format: The lineage source file should be a .yml file with specific top-level keys. According to the <File Based Lineage | DataHub documentation>, the file should include:

    • version: The version of the lineage file config. Currently, only version 1 is supported.
    • lineage: A top-level key containing a list of EntityNodeConfig objects.

    Each EntityNodeConfig object should have:

    • entity: An EntityConfig object with details like name, type (only dataset is supported), env, and platform.
    • upstream: (Optional) A list of child EntityNodeConfig objects.
    • fineGrainedLineages: (Optional) A list of FineGrainedLineageConfig objects for column-level lineage.
  2. Example Configuration: Here is an example of how the configuration might look:

    version: 1
    lineage:
      - entity:
          name: userdb.public.customer_table
          type: dataset
          env: DEV
          platform: redshift
        upstream:
          - entity:
              name: userdb.public.orders_table
              type: dataset
              env: DEV
              platform: redshift
        fineGrainedLineages:
          - upstreamType: FIELD_SET
            upstreams: [urn:li:schemaField:(urn:li:dataPlatform:redshift,userdb.public.orders_table,order_id)]
            downstreamType: FIELD_SET
            downstreams: [urn:li:schemaField:(urn:li:dataPlatform:redshift,userdb.public.customer_table,customer_id)]
            transformOperation: "join"
            confidenceScore: 0.9
    
  3. Ingesting the File: Once your .yml file is ready, use DataHub’s ingestion framework to load it. This typically means writing a small ingestion recipe that points at the file and running it with the DataHub command-line interface (CLI).
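As a sketch of that last step, a minimal ingestion recipe might look like the following (the `datahub-lineage-file` source type and `file` config key are taken from the File Based Lineage documentation; the file path and server URL are placeholders you would replace with your own):

    # lineage_recipe.yml
    source:
      type: datahub-lineage-file
      config:
        # Path to the lineage .yml file described above
        file: ./spark_lineage.yml
        # Whether to merge with any lineage already in DataHub
        preserve_upstream: false

    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080

You would then run it with the DataHub CLI:

    datahub ingest -c lineage_recipe.yml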

Would you like more detailed steps on how to set up the ingestion process or any specific part of the configuration?
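Since your Spark job already writes lineage to a file, one practical approach is a small conversion script that turns that file into the YAML format shown above. The sketch below is hypothetical: it assumes the Spark job logged JSON records with `downstream` and `upstreams` fields (your actual format will differ), and it emits the lineage YAML using plain string templating so it has no dependencies.

```python
import json


def to_lineage_yaml(records, env="DEV", platform="redshift"):
    """Convert a list of {"downstream": ..., "upstreams": [...]} records
    (a hypothetical format a Spark job might have written) into the
    version-1 file-based-lineage YAML that DataHub expects."""
    lines = ["version: 1", "lineage:"]
    for rec in records:
        # Downstream entity (the table being written)
        lines.append("  - entity:")
        lines.append("      name: " + rec["downstream"])
        lines.append("      type: dataset")
        lines.append("      env: " + env)
        lines.append("      platform: " + platform)
        # Upstream entities (the tables it was derived from)
        lines.append("    upstream:")
        for up in rec["upstreams"]:
            lines.append("      - entity:")
            lines.append("          name: " + up)
            lines.append("          type: dataset")
            lines.append("          env: " + env)
            lines.append("          platform: " + platform)
    return "\n".join(lines) + "\n"


# Example: one record as the Spark job might have logged it (JSON lines)
raw = '{"downstream": "userdb.public.customer_table", "upstreams": ["userdb.public.orders_table"]}'
records = [json.loads(raw)]
print(to_lineage_yaml(records))
```

The resulting file can then be fed to the `datahub-lineage-file` ingestion source. If you also need column-level lineage, you would extend the script to emit `fineGrainedLineages` entries in the same way.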
