Hello, I was wondering how to use DataHub to locate the inputs for data pipelines based on metadata. I have in mind a Spark job on Databricks, where we obviously want to register the outputs in DataHub, but we would also like to query DataHub to find the inputs. The person who submits the data pipeline only passes a logical identifier of the dataset, and the right dataset is then looked up in DataHub.
Hey Edmondo! I’m not sure I fully understand your question, but curious if you’ve seen our Spark/Databricks configuration docs? https://datahubproject.io/docs/metadata-integration/java/spark-lineage/#configuration-instructions--databricks
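For reference, the Databricks setup in those docs boils down to attaching the DataHub Spark listener to the cluster. Roughly like this (the listener class and config keys are from the spark-lineage docs; the server URL and token here are placeholders):
```python
# Sketch of the DataHub Spark lineage setup from the docs linked above.
# Requires the datahub-spark-lineage jar on the cluster classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-job")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://<datahub-gms-host>:8080")
    .config("spark.datahub.rest.token", "<token>")  # only if auth is enabled
    .getOrCreate()
)
```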
That would register the derived datasets in DataHub
Let’s say I have pipeline A. Pipeline A writes data, and registers the data in DataHub with a certain “client id”
Now a user submits Pipeline B, using a single parameter, the client id. Pipeline B would need to talk to DataHub to know where the output of pipeline A is located
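One pattern that could work, just a sketch rather than a documented recipe: Pipeline A emits a `DatasetProperties` aspect that carries the client id as a custom property, and Pipeline B resolves the physical dataset through DataHub's GraphQL search. Everything below assumes the `acryl-datahub` Python SDK; the `client_id` property name, the S3 platform, and the server address are my own placeholders:
```python
# Sketch: resolve a dataset's physical location from a logical "client id".
# Assumptions (not from this thread): acryl-datahub SDK installed, DataHub GMS
# at http://localhost:8080, and a customProperties key named "client_id".

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DatasetPropertiesClass

GMS = "http://localhost:8080"  # placeholder DataHub GMS address


# --- Pipeline A: register the output and tag it with the client id ---
def register_output(client_id: str, physical_path: str) -> str:
    urn = make_dataset_urn(platform="s3", name=physical_path, env="PROD")
    props = DatasetPropertiesClass(customProperties={"client_id": client_id})
    DatahubRestEmitter(gms_server=GMS).emit(
        MetadataChangeProposalWrapper(entityUrn=urn, aspect=props)
    )
    return urn


# --- Pipeline B: look up the dataset URN for a given client id ---
SEARCH_QUERY = """
query findByClientId($query: String!) {
  search(input: { type: DATASET, query: $query, start: 0, count: 10 }) {
    searchResults { entity { urn } }
  }
}
"""


def resolve_input(client_id: str) -> str:
    graph = DataHubGraph(DatahubClientConfig(server=GMS))
    # Custom properties are indexed for full-text search; for strict
    # exact-match semantics you'd likely want a search filter instead.
    result = graph.execute_graphql(SEARCH_QUERY, variables={"query": client_id})
    hits = result["search"]["searchResults"]
    if not hits:
        raise ValueError(f"No dataset registered for client id {client_id!r}")
    return hits[0]["entity"]["urn"]  # naive: take the first match
```
Pipeline B would then parse the returned URN (or fetch its `DatasetProperties` aspect) to get the physical path to read from.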