Ingesting Lineage from Files on GCS using Python Emitter in DataHub

Original Slack Thread

Hi folks, I have a quick question about creating and ingesting lineage from files on GCS. Suppose I have a set of notebooks or local scripts producing some datasets that get saved to GCS. These scripts just run raw Python (and maybe some Pandas), but don’t run on one of DataHub’s supported integrations.

Is this a use case for the [Python emitter](https://datahubproject.io/docs/metadata-ingestion/as-a-library/)? Assuming I can come up with some logic to extract dataset metadata from my local Python scripts, is the emitter the correct tool to write metadata/lineage to the DataHub metadata store?

Yes, with the Python API you can basically do whatever you want :slightly_smiling_face:
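
For reference, emitting lineage for GCS-backed datasets along these lines might look like the sketch below. It uses DataHub's REST emitter and the `UpstreamLineage` aspect; the server URL, bucket paths, and the choice of `TRANSFORMED` as the lineage type are illustrative assumptions, and the `make_gcs_dataset_urn` helper is a hand-rolled stand-in included only to show the URN shape.

```python
def make_gcs_dataset_urn(path: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN for a GCS object/prefix by hand.

    This mirrors what datahub.emitter.mce_builder.make_dataset_urn produces
    and is shown only to illustrate the URN format.
    """
    return f"urn:li:dataset:(urn:li:dataPlatform:gcs,{path},{env})"


def emit_gcs_lineage(upstream_paths, downstream_path,
                     gms_server="http://localhost:8080"):
    """Emit an UpstreamLineage aspect linking GCS inputs to one output.

    Requires `pip install acryl-datahub` and a reachable DataHub GMS;
    imports are deferred so the URN helper above works without the SDK.
    """
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    # One Upstream entry per input dataset the script reads from GCS.
    upstreams = [
        UpstreamClass(
            dataset=make_dataset_urn(platform="gcs", name=p, env="PROD"),
            type=DatasetLineageTypeClass.TRANSFORMED,
        )
        for p in upstream_paths
    ]

    # Attach the lineage aspect to the downstream (produced) dataset.
    mcp = MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="gcs", name=downstream_path,
                                   env="PROD"),
        aspect=UpstreamLineageClass(upstreams=upstreams),
    )
    DatahubRestEmitter(gms_server=gms_server).emit(mcp)
```

A script could call `emit_gcs_lineage(["my-bucket/raw/events.parquet"], "my-bucket/derived/daily_summary.parquet")` after writing its output, so lineage is recorded in the same run that produces the data.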