Hi team!
I’m exploring DataHub and so far it looks like a great tool, though it’s more complex than I expected at first sight, and I was hoping to get some insight on documenting sources.
> NOTE: I’m self-starting with this project, so apologies in advance for any gaps in my understanding and any inaccurate terms I may use.
To provide further context: I’m running DataHub on a Kubernetes cluster in GKE. For now, our main data source is a PostgreSQL/PostGIS database that lives in the same cluster (leveraging CloudNativePG). Our focus is geospatial data, with both vector and raster tables; the rasters are “out-db” and reference files (COGs) in Google Cloud Storage buckets.
While most of the data sources will be generated by ETL tasks (we use Prefect for orchestration), some catalog users will be “Data owners” who need to be able to edit data source metadata (tags, documentation, etc.), but only for a certain set of Datasets.
Ideally, users will be able to add Dataset metadata via the UI, but we also need a way to back up all this metadata and store it elsewhere in case we ever need to recover it. If that weren’t possible, we’d take a different approach and generate the metadata programmatically, storing it in GitHub (maybe through the Python SDK?).
I was thinking:
- Creating an ingestion source that periodically pulls information from the PostgreSQL database (actually a few of them, so rules and tags can be set automatically depending on the schema, etc.) — first sketch below
- Adding metadata to specific Datasets programmatically, preferably through the Python SDK, so this is handled directly as part of the Prefect pipeline itself (second sketch below)
- Allowing metadata collaboration directly via the UI, since some Data owners are not technical, ideally defining granular access for these users so they cannot mess up other Datasets (third sketch below)
- Backing up the metadata periodically so we’re able to restore it if needed (we don’t want to lose the information added by users via the UI in worst-case scenarios) — fourth sketch below
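To make the first point more concrete, here’s a minimal sketch of how I imagine running the Postgres ingestion programmatically from a Prefect task, using the acryl-datahub package with the postgres plugin. The host names, database, credentials, schema pattern, and GMS address are all placeholders I made up:

```python
import os

from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical recipe: pull metadata from our in-cluster Postgres/PostGIS
# database and push it to the DataHub GMS REST endpoint.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "my-postgres.my-namespace.svc:5432",  # placeholder
                "database": "geodata",                             # placeholder
                "username": "datahub_reader",                      # placeholder
                "password": os.environ.get("POSTGRES_PASSWORD", ""),
                # Only ingest the schemas we care about.
                "schema_pattern": {"allow": ["public", "rasters"]},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```

The idea would be to wrap this in a Prefect task and run one pipeline per database, applying different tags and rules per schema.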
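For the second point, here’s a sketch of attaching a tag to a Dataset from inside a pipeline, using the low-level emitter from the Python SDK. The dataset URN, tag name, and GMS address are made up for illustration:

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")  # placeholder

# Placeholder URN for one of our PostGIS tables.
dataset_urn = make_dataset_urn(platform="postgres", name="geodata.public.parcels", env="PROD")

# Attach a "geospatial" tag. Note this writes the whole globalTags aspect,
# so in a real pipeline we'd read existing tags first and merge.
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("geospatial"))])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags))
```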
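For the granular access in the third point, my understanding is that DataHub Policies are the mechanism, and that the same createPolicy GraphQL mutation the UI uses can scope edit privileges to specific Datasets. A hedged sketch of what I have in mind; the privilege names and the input shape are from my reading of the docs and may need adjusting for the DataHub version:

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))  # placeholder

# Hypothetical policy: let one Data owner edit tags/docs on a single Dataset.
mutation = """
mutation createPolicy($input: PolicyUpdateInput!) {
  createPolicy(input: $input)
}
"""
graph.execute_graphql(
    mutation,
    variables={
        "input": {
            "type": "METADATA",
            "name": "Parcels owners can edit metadata",  # placeholder
            "state": "ACTIVE",
            "description": "Scoped edit access for the parcels Dataset",
            "privileges": ["EDIT_ENTITY_TAGS", "EDIT_ENTITY_DOCS"],  # names worth double-checking
            "actors": {
                "users": ["urn:li:corpuser:jane.doe"],  # placeholder
                "resourceOwners": False,
                "allUsers": False,
                "allGroups": False,
            },
            "resources": {
                "type": "dataset",
                "resources": [
                    "urn:li:dataset:(urn:li:dataPlatform:postgres,geodata.public.parcels,PROD)"
                ],
            },
        }
    },
)
```

An alternative that might be cleaner: make each Data owner an actual owner of their Datasets and use a single policy with resourceOwners set to true.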
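And for the last point, my rough idea for backing up the UI-entered metadata: as far as I understand, UI documentation edits land in the editableDatasetProperties aspect and UI tags in globalTags, so a periodic job could dump those to JSON for versioning in GitHub. A sketch (the aspect selection and GMS address are assumptions on my part):

```python
import json

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import EditableDatasetPropertiesClass, GlobalTagsClass

graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))  # placeholder

backup = {}
for urn in graph.get_urns_by_filter(entity_types=["dataset"]):
    aspects = {}
    # Documentation edited in the UI lands in editableDatasetProperties.
    props = graph.get_aspect(urn, EditableDatasetPropertiesClass)
    if props:
        aspects["editableDatasetProperties"] = props.to_obj()
    # Tags added in the UI land in globalTags.
    tags = graph.get_aspect(urn, GlobalTagsClass)
    if tags:
        aspects["globalTags"] = tags.to_obj()
    if aspects:
        backup[urn] = aspects

with open("datahub_ui_metadata_backup.json", "w") as f:
    json.dump(backup, f, indent=2)
```

I’m aware the actual source of truth is the GMS database, so snapshotting the storage behind GMS may be the more robust restore path; I’d love to hear which approach is recommended.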
Does this make sense? Do you have any advice on how to approach this use case?
Thanks a lot!!