Using DataHub for Locating Input for Data Pipelines Based on Metadata

Original Slack Thread

Hello, I was wondering how to use DataHub to locate the input for data pipelines based on metadata. I have in mind a Spark job on Databricks, where we obviously want to register the outputs in DataHub, but we would also like to query DataHub to find the input. The person who submits the data pipeline passes only a logical identifier of the dataset, and the right dataset is then looked up in DataHub.

Hey Edmondo! I’m not sure I fully understand your question, but curious if you’ve seen our Spark/Databricks configuration docs? https://datahubproject.io/docs/metadata-integration/java/spark-lineage/#configuration-instructions--databricks
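
For reference, the Databricks path in those docs comes down to attaching the DataHub Spark lineage listener via a few Spark properties. Here is a minimal PySpark sketch; the package version, server URL, and app name are placeholders, and on Databricks these properties are normally set in the cluster's Spark config rather than in code:

```python
# Minimal sketch of wiring the DataHub Spark agent into a job.
# Placeholder values below -- check the linked configuration docs
# for the coordinates and version that match your setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pipeline-a")
    # Pull in the DataHub lineage agent (version is a placeholder).
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:<version>")
    # Register the listener that emits lineage as the job runs.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Point the listener at your DataHub GMS endpoint.
    .config("spark.datahub.rest.server", "http://<your-datahub-host>:8080")
    .getOrCreate()
)
```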

That would register derived datasets in DataHub

Let’s say I have Pipeline A. Pipeline A writes data and registers it in DataHub with a certain “client id”
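
The Spark agent captures lineage automatically, but attaching an arbitrary logical identifier like a client id is not something it does out of the box. One way is for Pipeline A to emit it explicitly as a custom property using the DataHub Python SDK. A minimal sketch, where the platform, table name, and the `clientId` property key are assumptions:

```python
# Sketch of what Pipeline A could do after writing its output: register
# the physical dataset in DataHub with the logical "client id" stored as
# a custom property. CLIENT_ID, the server URL, and the table name are
# illustrative.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

CLIENT_ID = "client-42"  # the logical identifier passed to the pipeline

emitter = DatahubRestEmitter(gms_server="http://<your-datahub-host>:8080")

dataset_urn = make_dataset_urn(
    platform="databricks",
    name="prod.analytics.pipeline_a_output",
    env="PROD",
)

# Attach the logical identifier as a searchable custom property.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(
            customProperties={"clientId": CLIENT_ID},
        ),
    )
)
```

Note that emitting DatasetProperties this way upserts the whole aspect, so in practice you would merge the new property with whatever properties the dataset already has.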

Now a user submits Pipeline B with a single parameter, the client id. Pipeline B would need to talk to DataHub to find out where the output of Pipeline A is located
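
Pipeline B could then resolve the physical dataset at submission time by searching DataHub for that property. Below is a sketch using the Python SDK's GraphQL client; the `customProperties` search filter (DataHub indexes custom properties as `key=value` strings) should be verified against your server version:

```python
# Sketch of how Pipeline B might resolve the physical dataset from the
# logical client id at submission time, by searching DataHub's GraphQL
# API for datasets whose customProperties contain "clientId=<id>".
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://<your-datahub-host>:8080"))

LOOKUP = """
query findByClientId($filterValue: String!) {
  search(
    input: {
      type: DATASET
      query: "*"
      start: 0
      count: 10
      orFilters: [
        {and: [{field: "customProperties", values: [$filterValue], condition: EQUAL}]}
      ]
    }
  ) {
    searchResults {
      entity {
        urn
      }
    }
  }
}
"""

def find_output_of_pipeline_a(client_id: str) -> list[str]:
    """Return the URNs of datasets registered with the given client id."""
    result = graph.execute_graphql(
        LOOKUP, variables={"filterValue": f"clientId={client_id}"}
    )
    return [hit["entity"]["urn"] for hit in result["search"]["searchResults"]]

# Pipeline B's lookup step: turn the logical id into a physical dataset URN.
print(find_output_of_pipeline_a("client-42"))
```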