Obtaining Column-Level Lineage Information with Spark Push-Based Agent

Original Slack Thread

Hi, can anyone who has used the Spark push-based agent help me understand whether it's possible to get column-level lineage information? I can't find anything about it in the documentation.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in this channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To help you better, could you please provide the following information:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)

Regarding your question: the documentation does not explicitly mention column-level lineage support in the Spark push-based agent. However, there are some relevant details that might help:

  1. Column-Level Lineage in General: According to the DataHub Lineage API documentation, column-level lineage can be added programmatically using the API. This involves creating fine-grained lineage objects and emitting them to the DataHub REST API.

  2. Spark Lineage: There have been discussions in the community about capturing column-level lineage in Spark jobs. For instance, a Slack thread mentions that a newer version of the Spark Lineage library, which captures column-level lineage, is available. The setup instructions for this new plugin are here: https://github.com/datahub-project/datahub/tree/master/metadata-integration/java/spark-lineage-beta
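To make the first point concrete, here is a minimal sketch of what a fine-grained (column-level) lineage payload can look like. It uses only the Python standard library and builds the structure as plain dictionaries; the helper names (`dataset_urn`, `field_urn`, `upstream_lineage`) and the dataset and column names are hypothetical, made up for illustration. The field names follow my understanding of DataHub's `upstreamLineage` aspect, so treat them as assumptions and check them against the Lineage API documentation. In practice you would probably build and emit this with the DataHub Python SDK rather than by hand.

```python
import json


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # DataHub dataset URNs are assumed to follow this pattern.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def field_urn(dataset: str, field_path: str) -> str:
    # A schema field URN wraps the dataset URN plus the column path.
    return f"urn:li:schemaField:({dataset},{field_path})"


def upstream_lineage(upstream: str, downstream: str, column_map: dict) -> dict:
    """Build an upstreamLineage-style aspect for the downstream dataset.

    column_map maps each downstream column to the list of upstream
    columns it is derived from.
    """
    fine_grained = [
        {
            "upstreamType": "FIELD_SET",
            "upstreams": [field_urn(upstream, col) for col in up_cols],
            "downstreamType": "FIELD",
            "downstreams": [field_urn(downstream, down_col)],
        }
        for down_col, up_cols in column_map.items()
    ]
    return {
        # Table-level edge: downstream depends on upstream.
        "upstreams": [{"dataset": upstream, "type": "TRANSFORMED"}],
        # Column-level edges.
        "fineGrainedLineages": fine_grained,
    }


# Hypothetical datasets and columns, for illustration only.
src = dataset_urn("hive", "db.orders_raw")
dst = dataset_urn("hive", "db.orders_clean")
aspect = upstream_lineage(src, dst, {"total": ["price", "quantity"]})
print(json.dumps(aspect, indent=2))
```

The sketch only shows the shape of the data: one table-level upstream plus one fine-grained entry per downstream column. It does not send anything to DataHub.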

Would you like more detailed steps on how to set up and use the new Spark Lineage library to capture column-level lineage?
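As a rough idea of what that setup involves, the sketch below assembles the Spark properties that I believe the plugin expects and prints them in `spark-defaults.conf` style (one `key value` pair per line). The helper names are hypothetical. The Maven coordinates `io.acryl:acryl-spark-lineage`, the listener class `datahub.spark.DatahubSparkListener`, and the `spark.datahub.rest.*` keys are assumptions based on my reading of the plugin README, so please verify them there. The version is left as a placeholder on purpose, and Databricks clusters may need the extra steps described in the README.

```python
from typing import Optional


def datahub_spark_conf(gms_url: str, plugin_version: str,
                       token: Optional[str] = None) -> dict:
    """Collect the Spark properties for the DataHub lineage listener.

    All keys and coordinates are assumptions to be checked against
    the plugin README linked in the thread.
    """
    conf = {
        # Pulls the lineage plugin jar from Maven (coordinates assumed).
        "spark.jars.packages": f"io.acryl:acryl-spark-lineage:{plugin_version}",
        # Registers the listener that pushes lineage events.
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        # Where the DataHub REST (GMS) endpoint lives.
        "spark.datahub.rest.server": gms_url,
    }
    if token:
        # Only needed when the DataHub instance requires authentication.
        conf["spark.datahub.rest.token"] = token
    return conf


def to_spark_defaults(conf: dict) -> str:
    # spark-defaults.conf uses whitespace-separated "key value" lines.
    return "\n".join(f"{key} {value}" for key, value in conf.items())


# "<version>" is a placeholder: pick the release listed in the README.
conf = datahub_spark_conf("http://localhost:8080", "<version>")
print(to_spark_defaults(conf))
```

The same properties could also be passed with `--conf key=value` on `spark-submit` or set on a `SparkSession` builder; the dictionary form just keeps them in one place.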

  1. CLI, push-based ingestion
  2. 0.13.1
  3. Data source: Spark jobs on Databricks