Exploring DataHub Spark Agent for Lineage Tracking in Spark Jobs

user-3 · December 16, 2024, 12:05am

Has anyone had success with the DataHub Spark agent? All I get are the attributes of the Spark job itself, such as spark.app.name. My Spark job processes several Iceberg tables, but none of them show up.

datahub_team · December 16, 2024, 12:05am

Hey there!

<@U06TM7M40RK> Can you help with the questions in this thread?

Please continue to tag RunLLM (e.g., <@U06TM7M40RK>) to have it continue generating responses!

user-1 · December 16, 2024, 12:05am

To better assist you with your issue regarding the DataHub Spark agent, could you provide more details on the following:

Configuration: How have you configured the Spark agent? Are you using a config file or setting configurations directly in your Spark session? For example, are you using the spark.jars.packages and spark.extraListeners configurations as described in the <https://datahubproject.io/docs/0.14.0/metadata-integration/java/acryl-spark-lineage|DataHub Spark Lineage documentation>?
Version Information: What versions of DataHub, Spark, and the Spark Lineage jar are you using? It’s important to ensure compatibility between these components.
Job Details: Could you describe the specific operations your Spark job is performing on the Iceberg tables? Are you using SQL operations, DataFrame API, or a combination of both?
Logs and Errors: Are there any specific logs or error messages that you see when running your Spark job? This could provide clues as to why the lineage information is not being captured.
Output Format: Are you writing the output in a specific format (e.g., Parquet, Delta)? As noted in a GitHub issue in the datahub-project/datahub repository, certain formats might affect the visibility of lineage information.

Providing this information will help in diagnosing the issue more effectively. Let me know if you need further assistance!
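For reference, the configuration question above can be made concrete with a minimal sketch. It assumes the setting names from the linked Spark Lineage documentation (spark.jars.packages, spark.extraListeners, spark.datahub.rest.server). The helper function names, the GMS URL, and the agent version are placeholders invented for this example, and the Maven coordinates may differ between agent releases, so check them against the documentation for your version.

```python
# Sketch only: builds the Spark settings that attach the DataHub lineage listener.
# The function names, URL, and version are illustrative assumptions, not part of DataHub.

def datahub_lineage_conf(gms_url: str, agent_version: str) -> dict[str, str]:
    """Return the Spark conf entries that enable the DataHub Spark agent."""
    return {
        # Pulls the agent jar from Maven (coordinates assumed; verify for your release).
        "spark.jars.packages": f"io.acryl:acryl-spark-lineage:{agent_version}",
        # Registers the listener that observes the job and emits lineage.
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        # Where the listener sends metadata (the DataHub GMS endpoint).
        "spark.datahub.rest.server": gms_url,
    }


def to_spark_submit_args(conf: dict[str, str]) -> list[str]:
    """Render the settings as repeated `--conf key=value` arguments."""
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder values; substitute your own GMS URL and agent version.
    conf = datahub_lineage_conf("http://localhost:8080", "0.2.16")
    print("spark-submit " + " ".join(to_spark_submit_args(conf)) + " my_job.py")
```

The same dictionary can be applied inside a job by calling .config(key, value) on the SparkSession builder for each entry. If the listener is missing from spark.extraListeners, the job still runs normally but no lineage is reported.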

Sources:

<https://datahubproject.io/docs/0.14.0/metadata-integration/java/acryl-spark-lineage|DataHub Spark Lineage documentation>
GitHub issue on column lineage with Spark (datahub-project/datahub Issues)

user-3 · December 16, 2024, 12:05am

If I find the spark agent insufficient, is it recommended to directly send data lineage from inside the spark job itself?

datahub_team · December 16, 2024, 12:05am

Hi! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look.

Hope this helps!

user-3 · December 16, 2024, 12:05am

<@U07FV3UHE4D> I was just wondering: for those who got the Spark agent to work, what did you see in DataHub?

datahub_team · December 16, 2024, 12:05am

Thank you for letting us know you need a person to take a look at this thread.

Either a community member or one of our team members will try to assist when they have a moment. Unfortunately, responses might not be immediate. We hope this helps speed things along.
