Using DataHub for Locating Input for Data Pipelines Based on Metadata

Original Slack Thread

Hello, I was wondering how to use DataHub to locate the input for data pipelines based on metadata. I have in mind a Spark job on Databricks, where we obviously want to register the outputs in DataHub, but we would also like to query DataHub to find the input. The person who submits the data pipeline passes only a logical identifier of the dataset, and the right dataset is then looked up in DataHub.

Hey Edmondo! I’m not sure I fully understand your question, but curious if you’ve seen our Spark/Databricks configuration docs? https://datahubproject.io/docs/metadata-integration/java/spark-lineage/#configuration-instructions--databricks
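
For reference, the Databricks path in those docs comes down to attaching the DataHub Spark lineage listener via a few Spark properties. Here is a minimal PySpark sketch; the package version, server URL, and app name are placeholders, and on Databricks these properties are normally set in the cluster's Spark config rather than in code:

```python
# Minimal sketch of wiring the DataHub Spark agent into a job.
# Placeholder values below -- check the linked configuration docs
# for the coordinates and version that match your setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pipeline-a")
    # Pull in the DataHub lineage agent (version is a placeholder).
    .config("spark.jars.packages", "io.acryl:acryl-spark-lineage:<version>")
    # Register the listener that emits lineage as the job runs.
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    # Point the listener at your DataHub GMS endpoint.
    .config("spark.datahub.rest.server", "http://<your-datahub-host>:8080")
    .getOrCreate()
)
```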

That would register derived datasets in DataHub

Let’s say I have Pipeline A. Pipeline A writes data and registers it in DataHub with a certain “client id”
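
The Spark agent captures lineage automatically, but attaching an arbitrary logical identifier like a client id is not something it does out of the box. One way is for Pipeline A to emit it explicitly as a custom property using the DataHub Python SDK. A minimal sketch, where the platform, table name, and the `clientId` property key are assumptions:

```python
# Sketch of what Pipeline A could do after writing its output: register
# the physical dataset in DataHub with the logical "client id" stored as
# a custom property. CLIENT_ID, the server URL, and the table name are
# illustrative.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

CLIENT_ID = "client-42"  # the logical identifier passed to the pipeline

emitter = DatahubRestEmitter(gms_server="http://<your-datahub-host>:8080")

dataset_urn = make_dataset_urn(
    platform="databricks",
    name="prod.analytics.pipeline_a_output",
    env="PROD",
)

# Attach the logical identifier as a searchable custom property.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=DatasetPropertiesClass(
            customProperties={"clientId": CLIENT_ID},
        ),
    )
)
```

Note that emitting DatasetProperties this way upserts the whole aspect, so in practice you would merge the new property with whatever properties the dataset already has.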

Now a user submits Pipeline B with a single parameter, the client id. Pipeline B would need to talk to DataHub to find out where the output of Pipeline A is located
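
Pipeline B could then resolve the physical dataset at submission time by searching DataHub for that property. Below is a sketch using the Python SDK's GraphQL client; the `customProperties` search filter (DataHub indexes custom properties as `key=value` strings) should be verified against your server version:

```python
# Sketch of how Pipeline B might resolve the physical dataset from the
# logical client id at submission time, by searching DataHub's GraphQL
# API for datasets whose customProperties contain "clientId=<id>".
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://<your-datahub-host>:8080"))

LOOKUP = """
query findByClientId($filterValue: String!) {
  search(
    input: {
      type: DATASET
      query: "*"
      start: 0
      count: 10
      orFilters: [
        {and: [{field: "customProperties", values: [$filterValue], condition: EQUAL}]}
      ]
    }
  ) {
    searchResults {
      entity {
        urn
      }
    }
  }
}
"""

def find_output_of_pipeline_a(client_id: str) -> list[str]:
    """Return the URNs of datasets registered with the given client id."""
    result = graph.execute_graphql(
        LOOKUP, variables={"filterValue": f"clientId={client_id}"}
    )
    return [hit["entity"]["urn"] for hit in result["search"]["searchResults"]]

# Pipeline B's lookup step: turn the logical id into a physical dataset URN.
print(find_output_of_pipeline_a("client-42"))
```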