Seeking Guidance on Achieving Column-Level Lineage in Pyspark Script Using Datahub and Spark

Original Slack Thread

Hi Team,

I want to achieve column-level lineage for a PySpark script in DataHub. I created a PySpark script that reads data from a MySQL server (the DataHub Docker image) using the JDBC driver, performs some transformations, and loads the result back into the same location. I also ingested the MySQL server into DataHub (Ref: Add Column-level Lineage). My task shows up correctly under the Spark platform in DataHub, but I am unable to get column-level lineage. Can someone please guide me on this?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

In the near future we are going to release a newer version of our Spark lineage library, which will capture column-level lineage of a Spark job.
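For reference, wiring the existing DataHub Spark lineage listener into a job looks roughly like the following spark-submit configuration. The package version, script name, and GMS address below are placeholders; check the DataHub docs for the current coordinates:

```shell
spark-submit \
  --packages io.acryl:acryl-spark-lineage:0.2.17 \
  --conf spark.extraListeners=datahub.spark.DatahubSparkListener \
  --conf spark.datahub.rest.server=http://localhost:8080 \
  my_pyspark_job.py
```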

Can we achieve PySpark column-level lineage in Airflow, or will any older version help to achieve the same?

The Airflow plugin can only capture table/column-level lineage from SQL queries (where it supports the operator). For PySpark, you will need the new Spark plugin to capture it, since the query execution context from Spark is needed to get column-level lineage.
If you use the Spark SQL operator with pure SQL, I can see there is a chance it works.

Thanks for the info, <@UV14447EU>. I really want to know: when is the next version getting released?