Seeking Guidance on Achieving Column-Level Lineage in Pyspark Script Using Datahub and Spark

Original Slack Thread

Hi Team,

I want to achieve column-level lineage for a PySpark script in DataHub. I created a PySpark script that reads data from a MySQL server (the DataHub Docker image) using the JDBC driver, performs some transformations, and loads the result back into the same location. I also ingested the MySQL server into DataHub (Ref: Add Column-level Lineage). My task shows up correctly under the Spark platform in DataHub, but I am unable to get column-level lineage. Can someone please guide me on this?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

In the near future we are going to release a newer version of our Spark lineage library, which will capture column-level lineage of a Spark job.
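For reference, wiring the existing DataHub Spark lineage listener into a job looks roughly like the following spark-submit configuration. The package version, script name, and GMS address below are placeholders; check the DataHub docs for the current coordinates:

```shell
spark-submit \
  --packages io.acryl:acryl-spark-lineage:0.2.17 \
  --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
  --conf spark.datahub.rest.server=http://localhost:8080 \
  my_pyspark_job.py
```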

Can we achieve PySpark column-level lineage in Airflow, or will any older version help to achieve the same?

The Airflow plugin can only capture table/column-level lineage from SQL queries (where it supports the operator). For PySpark, you will need the new Spark plugin to capture it, since the query execution context from Spark is needed to get column-level lineage.
If you use the Spark SQL operator with pure SQL, I can see there is a chance it works.

Thanks for the info, <@UV14447EU>. I really want to know: when is the next version getting released?