How to Enable Lineage Metadata Emission in Spark Functions Using Datahub Integration

user-3 · March 4, 2024, 4:33pm

Hello,
I’m wondering which Spark functions I should expect lineage metadata to be emitted for. I have the following snippet to test lineage; it uses the AWS Glue Catalog as its Hive Metastore (running on EMR Serverless 6.7.0).

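```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestDatahubJob").enableHiveSupport().getOrCreate()
# Read from a Glue-cataloged table, then write the result out as a new table.
df = spark.sql("select * from database.kpi_metrics limit 10")
df.write.saveAsTable("tmp_drew.table_datahub_test")
```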
This code runs successfully. The table is also created in the Glue `tmp_drew` database and written to S3. However, no lineage metadata shows up in the UI. Do I need to use a different function for DataHub to catch the lineage connection?
---
Datahub version `0.12.0`

user-3 · March 4, 2024, 4:33pm

I’m fairly certain I’m using a supported command type, since the queryPlan property on my task is:
CreateHiveTableAsSelectCommand [Database: tmp_drew, TableName: table_datahub_4, InsertIntoHiveTable]
+- GlobalLimit 10
   +- LocalLimit 10
      +- Relation database.kpi_metrics[,... 572 more fields] parquet
Both CreateHiveTableAsSelectCommand and InsertIntoHiveTable are supposed to be supported for lineage, according to the "Spark | DataHub" docs page.

user-1 · March 4, 2024, 4:33pm

Just to verify: to provide metadata to Datahub you’re using the spark jar, and NOT the glue integration? You provided the agent JAR to EMR? Routing between Datahub and EMR is working, and you have other push-based integrations working?
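For anyone checking the same prerequisites, here is a minimal sketch of the Spark confs that attach the agent JAR to a job. The listener class and the `spark.datahub.rest.server` key come from the DataHub Spark lineage docs; the helper names, the GMS URL, and the assumption that the agent version should match the `0.12.0` server are illustrative only and should be checked against your deployment.

```python
from typing import Dict


def datahub_spark_confs(gms_url: str, agent_version: str = "0.12.0") -> Dict[str, str]:
    """Hypothetical helper: collect the confs that enable the DataHub listener."""
    return {
        # Pull the lineage agent JAR from Maven at submit time
        # (the version here is an assumption, not a recommendation).
        "spark.jars.packages": f"io.acryl:datahub-spark-lineage:{agent_version}",
        # Register the listener that inspects query plans and emits lineage.
        "spark.extraListeners": "datahub.spark.DatahubSparkListener",
        # DataHub GMS REST endpoint the listener pushes metadata to.
        "spark.datahub.rest.server": gms_url,
    }


def to_spark_submit_params(confs: Dict[str, str]) -> str:
    """Render the confs as a single `--conf k=v ...` string."""
    return " ".join(f"--conf {key}={value}" for key, value in sorted(confs.items()))


if __name__ == "__main__":
    # "http://datahub-gms:8080" is a placeholder URL.
    print(to_spark_submit_params(datahub_spark_confs("http://datahub-gms:8080")))
```

The rendered string is the kind of value that would go into the Spark submit parameters of the job; if the listener conf is missing there, no lineage is pushed no matter which write function the job uses.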

If that is all in place then I’d expect lineage to populate for you based on your write example. If you are instead extracting from glue then I wouldn’t expect lineage information. Glue doesn’t have a true table DDL stored when you write to it via df.write, hence needing to pull lineage off of the query plan with the Datahub Agent JAR.
I don’t use serverless EMR, but same idea I imagine.
Example:


```
CREATE TABLE spark_catalog.schema.table (
  company_id INT,
  dimension STRING NOT NULL,
  id INT,
  code STRING,
  name STRING,
  parent STRING,
  parent_id INT)
USING delta
LOCATION 's3://BUCKET/gold/schema/table'
TBLPROPERTIES (
  'delta.enableChangeDataFeed' = 'true',
  'delta.minReaderVersion' = '1',
  'delta.minWriterVersion' = '4')
-- useless for lineage :)
```

user-3 · March 4, 2024, 4:33pm

Yup! Your understanding is correct.

user-3 · March 4, 2024, 4:33pm

As an update, I got the downstream lineage working. It appears that since 0.10.?, nodes that are in the metadata graph don’t appear unless they have also been ingested from Glue.
So after re-ingesting from Glue, the lineage from Spark Job → Glue Table was captured.

user-3 · March 4, 2024, 4:33pm

My remaining problem is that the Spark extraction treats the upstream reference as coming from S3, so I get
tmp_drew.test_table -> DOWNSTREAM_OF -> s3://path/to/data
instead of
tmp_drew.test_table -> DOWNSTREAM_OF -> database.kpi_metrics
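To make the mismatch concrete, here is a small sketch of the dataset URNs involved. The `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)` layout is DataHub's standard dataset URN format; the `glue` and `s3` platform names, the `PROD` environment, and the exact dataset names are assumptions based on the placeholders in this thread, and `make_dataset_urn` is a local helper written for illustration.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Local helper: build a DataHub dataset URN from its three parts."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# The table written by the Spark job.
downstream = make_dataset_urn("glue", "tmp_drew.test_table")

# What the Spark extraction reports as upstream: the S3 location behind the table.
observed_upstream = make_dataset_urn("s3", "path/to/data")

# What was expected: the Glue table that the query actually selected from.
expected_upstream = make_dataset_urn("glue", "database.kpi_metrics")

if __name__ == "__main__":
    print(downstream, "-> DOWNSTREAM_OF ->", observed_upstream)
    print(downstream, "-> DOWNSTREAM_OF ->", expected_upstream)
```

Because the two upstream URNs are different entities, the edge lands on the S3 node, and the Glue table shows no lineage unless something else links the S3 dataset to it.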

user-1 · March 4, 2024, 4:33pm

Interesting, are there any logs on the spark side listing what URNs each run is trying to change metadata on? And is glue data already loaded in?

user-3 · March 4, 2024, 4:33pm

For Glue, the datasets are loaded in both as Glue entities and as S3 entities (I turned on the emit_s3_lineage option for the Glue connector). When I inspect the Neo4j DB, I see both the S3 nodes and the Glue nodes. Curiously, these are also not showing lineage edges.
On Spark, I’ve had trouble setting the logs to debug mode since I’m submitting a Python file to EMR. Is there a way to set logging to debug via the Spark confs? I don’t see much from the non-debug logging other than the query plan.

user-3 · March 4, 2024, 4:33pm

Here’s the non-debug log for basically the same test job, just with different table names (attachment).
