Has anyone had success with the DataHub Spark agent? All I get is the various attributes of the Spark job, like spark.app.name. My Spark job processes various Iceberg tables, but I see no lineage for them.
Hey there!
<@U06TM7M40RK> Can you help with the questions in this thread?
Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses!
To better assist you with your issue regarding the DataHub Spark agent, could you provide more details on the following:
- Configuration: How have you configured the Spark agent? Are you using a config file or setting configurations directly in your Spark session? For example, are you using the spark.jars.packages and spark.extraListeners configurations as described in the <https://datahubproject.io/docs/0.14.0/metadata-integration/java/acryl-spark-lineage|DataHub Spark Lineage documentation>? (A minimal configuration sketch is shown below.)
- Version Information: What versions of DataHub, Spark, and the Spark Lineage jar are you using? It's important to ensure compatibility between these components.
- Job Details: Could you describe the specific operations your Spark job is performing on the Iceberg tables? Are you using SQL operations, the DataFrame API, or a combination of both?
- Logs and Errors: Are there any specific logs or error messages that you see when running your Spark job? These could provide clues as to why the lineage information is not being captured.
- Output Format: Are you writing the output in a specific format (e.g., Parquet, Delta)? As noted in a <Issues · datahub-project/datahub · GitHub issue>, certain formats might affect the visibility of lineage information.
Providing this information will help in diagnosing the issue more effectively. Let me know if you need further assistance!
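For reference, a minimal PySpark session configuration for the agent might look like the sketch below, following the linked Spark Lineage documentation. The jar version, app name, and DataHub GMS URL are placeholders here, not values confirmed for your environment, so double-check them against the docs and your deployment.
```python
from pyspark.sql import SparkSession

# Sketch only: the jar version and GMS URL below are placeholders --
# check the DataHub Spark Lineage docs for the version that matches your release.
spark = (
    SparkSession.builder
    .appName("iceberg-lineage-job")
    # Pull in the acryl-spark-lineage agent jar.
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:0.2.16")
    # Register the DataHub listener so lineage is captured as actions run.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Point the listener at your DataHub GMS endpoint.
    .config("spark.datahub.rest.server", "http://localhost:8080")
    .getOrCreate()
)
```
If the listener is registered this way, lineage should be emitted as the job runs its actions, without changes to the job's own transformation code.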
Sources:
- <https://datahubproject.io/docs/0.14.0/metadata-integration/java/acryl-spark-lineage|DataHub Spark Lineage documentation>
- <Issues · datahub-project/datahub · GitHub issue on column lineage with Spark>
Hint: Mention <@U06TM7M40RK> in the thread for followups.
If I find the Spark agent insufficient, is it recommended to directly send data lineage from inside the Spark job itself?
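If you do go down that route, one option is the DataHub Python emitter. The sketch below shows roughly what emitting a simple upstream/downstream relationship could look like; the server URL, platform, and table names are made-up placeholders, not anything specific to your setup.
```python
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

# Placeholder GMS endpoint -- point this at your DataHub instance.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Declare a hypothetical source Iceberg table as an upstream.
upstream = UpstreamClass(
    dataset=make_dataset_urn(platform="iceberg", name="db.source_table", env="PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)

# Attach the upstream lineage aspect to the hypothetical output table.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="iceberg", name="db.target_table", env="PROD"),
    aspect=UpstreamLineageClass(upstreams=[upstream]),
)
emitter.emit(mcp)
```
This is independent of the Spark listener, so it can be called at the end of the job (or from a separate process) once the input and output tables are known.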
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><
<@U07FV3UHE4D> I was just wondering, for those who got the Spark agent to work, what did you see in DataHub?
Thank you for letting us know you need a person to take a look at this thread.
Either a community member or one of our team members will try to assist when they have a moment; responses unfortunately might not be immediate. We hope this helps speed things along.