Hi Team
ingestion : spark jar CLI
DataHub version: 0.12.1
source: spark
I run datahub-spark-lineage:0.12.1-1 with spark-submit.
The Spark pipeline and Spark task show up successfully in the DataHub UI.
But no lineage info is shown in the UI, although the lineage info is already in MySQL and in the Kafka MetadataChangeLog_Versioned_v1 topic.
The lineage info in the topic has values such as:
dataJoburn:li:dataJob:(urn:li:dataFlow:(spark,enterprise_green_certification_dws_test_spark_lineage,yarn),QueryExecId_2)$dataJobInputOutput{"inputDatasets":["urn:li:dataset:(urn:li:dataPlatform:hdfs,<hdfs://intsig-bigdata-nameservice/user/hive/warehouse/staging_edw_company.db/s_db_qualification_certificate_t_certificate_main/dt_batch=202401230000,PROD>)","urn:li:dataset:(urn:li:dataPlatform:hdfs,<hdfs://intsig-bigdata-nameservice/user/hive/warehouse/staging_edw_company.db/s_db_qualification_certificate_t_management_system/dt_batch=202401230000,PROD>)"],"outputDatasets":["urn:li:dataset:(urn:li:dataPlatform:hive,test.edw_company_dws_dg_hq_enterprise_green_certification_1d_df,PROD)"]} application/jsonc$no-run-id-provided$no-run-id-providedc@urn:li:corpuser:__datahub_system
What should I check? I would really appreciate your response!
The issue is most probably that the upstream/downstream datasets don't exist in DataHub, and we don't show lineage to non-existent datasets.
We are going to release a new Spark plugin soon that will provide a way to materialise datasets in DataHub.
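To follow up on the explanation above, a rough way to find out which datasets to look for is to pull the URNs out of the dataJobInputOutput JSON and check each one in DataHub (for example by searching for it in the UI). This is only a minimal sketch using the Python standard library: the function name `lineage_urns` is made up for illustration, and the HDFS path in the sample is a shortened placeholder, not the real path from the message.

```python
import json


def lineage_urns(aspect_json: str) -> dict:
    """Return the input and output dataset URNs from a dataJobInputOutput aspect."""
    aspect = json.loads(aspect_json)
    return {
        "inputs": aspect.get("inputDatasets", []),
        "outputs": aspect.get("outputDatasets", []),
    }


# Shortened sample modelled on the topic message above.
# The HDFS path is a placeholder (assumption); the Hive URN is copied from the message.
sample = json.dumps({
    "inputDatasets": [
        "urn:li:dataset:(urn:li:dataPlatform:hdfs,/example/placeholder/path,PROD)"
    ],
    "outputDatasets": [
        "urn:li:dataset:(urn:li:dataPlatform:hive,"
        "test.edw_company_dws_dg_hq_enterprise_green_certification_1d_df,PROD)"
    ],
})

urns = lineage_urns(sample)
for side, values in urns.items():
    for urn in values:
        # Each printed URN is a dataset that needs to exist in DataHub
        # before the lineage edge can be displayed.
        print(side, urn)
```

If any of the printed URNs cannot be found in DataHub, that would match the explanation above: the lineage aspect is stored, but there is no dataset entity on the other end to draw the edge to.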