Exploring DataHub Spark Agent for Lineage Tracking in Spark Jobs

Original Slack Thread

Has anyone had success with the DataHub Spark agent? All I get is the various attributes of the Spark job, like spark.app.name. My Spark job processes various tables in Iceberg, but I see no lineage for them.

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

To better assist you with your issue regarding the DataHub Spark agent, could you provide more details on the following:

  1. Configuration: How have you configured the Spark agent? Are you using a config file or setting configurations directly in your Spark session? For example, are you using the spark.jars.packages and spark.extraListeners configurations as described in the <https://datahubproject.io/docs/0.14.0/metadata-integration/java/acryl-spark-lineage|DataHub Spark Lineage documentation>?

  2. Version Information: What versions of DataHub, Spark, and the Spark Lineage jar are you using? It’s important to ensure compatibility between these components.

  3. Job Details: Could you describe the specific operations your Spark job is performing on the Iceberg tables? Are you using SQL operations, DataFrame API, or a combination of both?

  4. Logs and Errors: Are there any specific logs or error messages that you see when running your Spark job? This could provide clues as to why the lineage information is not being captured.

  5. Output Format: Are you writing the output in a specific format (e.g., Parquet, Delta)? As noted in a GitHub issue on datahub-project/datahub, certain formats might affect the visibility of lineage information.

Providing this information will help in diagnosing the issue more effectively. Let me know if you need further assistance!
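
As a rough illustration of the configuration asked about in item 1, the sketch below assembles the two settings named above plus a server URL. The artifact coordinates, the listener class name, and the `spark.datahub.rest.server` key are written from memory of the linked documentation and should be checked against it (newer releases may use a Scala-suffixed artifact name). The version string is a hypothetical placeholder, and `build_lineage_conf` and `to_spark_submit_args` are illustrative helper names, not part of any DataHub API.

```python
# Hypothetical placeholder: substitute a release that matches your
# DataHub and Spark versions (see the linked documentation).
AGENT_VERSION = "<agent-version>"


def build_lineage_conf(gms_url, agent_version=AGENT_VERSION):
    """Return the Spark settings that load and point the lineage listener."""
    return {
        # Assumed Maven coordinates of the lineage agent jar.
        "spark.jars.packages": f"io.acryl:acryl-spark-lineage:{agent_version}",
        # Assumed listener class registered with Spark.
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        # Assumed key for the DataHub GMS endpoint the listener reports to.
        "spark.datahub.rest.server": gms_url,
    }


def to_spark_submit_args(conf):
    """Flatten a config dict into `--conf key=value` pairs for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = build_lineage_conf("http://localhost:8080")
    print(" ".join(to_spark_submit_args(conf)))
```

In a PySpark session the same dictionary could be applied by calling `SparkSession.builder.config(key, value)` for each entry before `getOrCreate()`.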

If I find the Spark agent insufficient, is it recommended to send data lineage directly from inside the Spark job itself?
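
The thread does not settle whether this is recommended. For reference, a minimal sketch of what emitting lineage from inside a job could look like with the DataHub Python SDK (the `acryl-datahub` package) is shown below. The table names, the `iceberg` platform id, and the server URL are hypothetical, and `dataset_urn` and `emit_table_lineage` are illustrative helper names.

```python
def dataset_urn(platform, name, env="PROD"):
    """Build a DataHub dataset URN string.

    The SDK's datahub.emitter.mce_builder.make_dataset_urn does the same job;
    this local helper just keeps the sketch usable without the SDK installed.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def emit_table_lineage(gms_url, upstream_urns, downstream_urn):
    """Send an upstreamLineage aspect for downstream_urn to DataHub."""
    # Imported lazily so the URN helper above works without the SDK.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstreams = [
        UpstreamClass(dataset=urn, type=DatasetLineageTypeClass.TRANSFORMED)
        for urn in upstream_urns
    ]
    mcp = MetadataChangeProposalWrapper(
        entityUrn=downstream_urn,
        aspect=UpstreamLineageClass(upstreams=upstreams),
    )
    DatahubRestEmitter(gms_server=gms_url).emit(mcp)


if __name__ == "__main__":
    # Hypothetical Iceberg tables read and written by the job.
    source = dataset_urn("iceberg", "warehouse.raw_orders")
    target = dataset_urn("iceberg", "warehouse.clean_orders")
    print(source, "->", target)
    # Uncomment once the SDK is installed and a DataHub server is reachable:
    # emit_table_lineage("http://localhost:8080", [source], target)
```

Writing the aspect this way replaces the dataset's existing upstream lineage rather than appending to it, so the job should pass the full list of upstream tables in each call.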

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps!

<@U07FV3UHE4D> I was just wondering: for those who got the Spark agent to work, what did you see in DataHub?

Thank you for letting us know you need a person to take a look at this thread.

Either a community member or one of our team members will try to assist when they have a moment. Unfortunately, responses might not be immediate. We hope this helps speed things along :slight_smile: