Hi team,
Could anyone help me understand how to ingest lineage information into DataHub? We manually store lineage information from Spark in a file, and that file needs to be loaded into DataHub.
Hey there!
<@U06TM7M40RK> Can you help with the questions in this thread?
To ingest lineage information from a file into DataHub, you can use the file-based lineage ingestion method. This involves creating a lineage file in a specific format and then using DataHub’s ingestion capabilities to load this file. Here’s a brief overview of the process:
- Lineage File Format: The lineage source file should be a `.yml` file with specific top-level keys. According to the <File Based Lineage | DataHub documentation>, the file should include:
  - `version`: The version of the lineage file config. Currently, only version 1 is supported.
  - `lineage`: A top-level key containing a list of `EntityNodeConfig` objects.

  Each `EntityNodeConfig` object should have:
  - `entity`: An `EntityConfig` object with details like `name`, `type` (only `dataset` is supported), `env`, and `platform`.
  - `upstream`: (Optional) A list of child `EntityNodeConfig` objects.
  - `fineGrainedLineages`: (Optional) A list of `FineGrainedLineageConfig` objects for column-level lineage.
- Example Configuration: Here is an example of how the configuration might look:

    version: 1
    lineage:
      - entity:
          name: userdb.public.customer_table
          type: dataset
          env: DEV
          platform: redshift
        upstream:
          - entity:
              name: userdb.public.orders_table
              type: dataset
              env: DEV
              platform: redshift
        fineGrainedLineages:
          - upstreamType: FIELD_SET
            upstreams:
              - "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.orders_table,DEV),order_id)"
            downstreamType: FIELD_SET
            downstreams:
              - "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:redshift,userdb.public.customer_table,DEV),customer_id)"
            transformOperation: "join"
            confidenceScore: 0.9
- Ingesting the File: Once your `.yml` file is ready, you can use DataHub’s ingestion framework to load the file. This typically involves using a command-line interface (CLI) or a script to run the ingestion process.
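The steps above can be sketched in Python for the case where lineage is collected by hand from Spark jobs. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names (`render_lineage_yaml`, `build_recipe`, `run_recipe`), the sample edge list, the file name `file_lineage.yml`, and the server address `http://localhost:8080` are all hypothetical. It assumes dataset names are plain strings with no YAML special characters, and that the file is loaded through the `datahub-lineage-file` source into a `datahub-rest` sink. Running the recipe in-process needs the `acryl-datahub` package; saving the same recipe as YAML and running `datahub ingest -c recipe.yml` should be equivalent.

```python
from collections import defaultdict


def render_lineage_yaml(edges, platform, env="DEV"):
    """Render (upstream, downstream) dataset-name pairs as a lineage file body.

    Assumes names contain no YAML special characters, so they are left unquoted.
    """
    # Group upstream names under each downstream dataset, dropping duplicates.
    upstreams = defaultdict(list)
    for upstream, downstream in edges:
        if upstream not in upstreams[downstream]:
            upstreams[downstream].append(upstream)

    def entity(name, indent):
        pad = " " * indent
        return [
            f"{pad}name: {name}",
            f"{pad}type: dataset",
            f"{pad}env: {env}",
            f"{pad}platform: {platform}",
        ]

    lines = ["version: 1", "lineage:"]
    for downstream in sorted(upstreams):
        lines.append("  - entity:")
        lines += entity(downstream, 6)
        lines.append("    upstream:")
        for upstream in upstreams[downstream]:
            lines.append("      - entity:")
            lines += entity(upstream, 10)
    return "\n".join(lines) + "\n"


def build_recipe(lineage_path, server="http://localhost:8080"):
    """Build an ingestion recipe dict; the server URL is a placeholder."""
    return {
        "source": {
            "type": "datahub-lineage-file",
            "config": {"file": lineage_path},
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_recipe(recipe):
    """Run the recipe in-process; requires the acryl-datahub package."""
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


if __name__ == "__main__":
    # Hypothetical lineage pairs recorded from a Spark job.
    sample_edges = [("userdb.public.orders_table", "userdb.public.customer_table")]
    print(render_lineage_yaml(sample_edges, platform="redshift"))
```

Writing the returned text to `file_lineage.yml` and then calling `run_recipe(build_recipe("file_lineage.yml"))` would load it, assuming a DataHub instance is reachable at the configured server.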
Would you like more detailed steps on how to set up the ingestion process or any specific part of the configuration?
Sources:
- <File Based Lineage | DataHub File Format>