Hi team!
I’m exploring DataHub and so far it looks like a great tool, though it’s more complex than I expected at first sight, and I was hoping to get some insight on documenting sources.
> NOTE: I’m self-starting with this project, so apologies in advance for any gaps in my understanding and any inaccurate terms I may use.
To provide further context: I’m running DataHub on a Kubernetes cluster in GKE. For now, our main data source is a PostgreSQL/PostGIS database that lives in the same cluster (leveraging CloudNativePG). Our focus is geospatial data, with both vector and raster tables; the rasters are “out-db” and reference files (COGs) in Google Cloud Storage buckets.
While most of the data sources will be generated by ETL tasks (we use Prefect for orchestration), some catalog users will be “Data owners” who need to be able to edit data source metadata (tags, documentation, etc.), but only for a certain set of Datasets.
Ideally, users will be able to add Dataset metadata via the UI, but we also need a way to back up all this metadata and store it elsewhere in case we ever need to recover it. If that weren’t possible, we’d take a different approach and generate the metadata programmatically, storing it in GitHub (maybe through the Python SDK?).
I was thinking:
- Creating an ingestion source that periodically pulls information from the PostgreSQL database (actually a few of them, so rules and tags can be set automatically depending on the schema, etc.) — first sketch below
- Adding metadata to specific Datasets programmatically, preferably through the Python SDK, so this is handled directly as part of the Prefect pipeline itself (second sketch below)
- Allowing metadata collaboration directly via the UI, since some Data owners are not technical, ideally defining granular access for these users so they cannot mess up other Datasets (third sketch below)
- Backing up the metadata periodically so we’re able to restore it if needed (we don’t want to lose the information added by users via the UI in worst-case scenarios) — fourth sketch below
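To make the first point more concrete, here’s a minimal sketch of how I imagine running the Postgres ingestion programmatically from a Prefect task, using the acryl-datahub package with the postgres plugin. The host names, database, credentials, schema pattern, and GMS address are all placeholders I made up:

```python
import os

from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical recipe: pull metadata from our in-cluster Postgres/PostGIS
# database and push it to the DataHub GMS REST endpoint.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "my-postgres.my-namespace.svc:5432",  # placeholder
                "database": "geodata",                             # placeholder
                "username": "datahub_reader",                      # placeholder
                "password": os.environ.get("POSTGRES_PASSWORD", ""),
                # Only ingest the schemas we care about.
                "schema_pattern": {"allow": ["public", "rasters"]},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://datahub-gms:8080"},  # placeholder
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```

The idea would be to wrap this in a Prefect task and run one pipeline per database, applying different tags and rules per schema.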
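For the second point, here’s a sketch of attaching a tag to a Dataset from inside a pipeline, using the low-level emitter from the Python SDK. The dataset URN, tag name, and GMS address are made up for illustration:

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

emitter = DatahubRestEmitter(gms_server="http://datahub-gms:8080")  # placeholder

# Placeholder URN for one of our PostGIS tables.
dataset_urn = make_dataset_urn(platform="postgres", name="geodata.public.parcels", env="PROD")

# Attach a "geospatial" tag. Note this writes the whole globalTags aspect,
# so in a real pipeline we'd read existing tags first and merge.
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("geospatial"))])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags))
```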
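For the granular access in the third point, my understanding is that DataHub Policies are the mechanism, and that the same createPolicy GraphQL mutation the UI uses can scope edit privileges to specific Datasets. A hedged sketch of what I have in mind; the privilege names and the input shape are from my reading of the docs and may need adjusting for the DataHub version:

```python
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))  # placeholder

# Hypothetical policy: let one Data owner edit tags/docs on a single Dataset.
mutation = """
mutation createPolicy($input: PolicyUpdateInput!) {
  createPolicy(input: $input)
}
"""
graph.execute_graphql(
    mutation,
    variables={
        "input": {
            "type": "METADATA",
            "name": "Parcels owners can edit metadata",  # placeholder
            "state": "ACTIVE",
            "description": "Scoped edit access for the parcels Dataset",
            "privileges": ["EDIT_ENTITY_TAGS", "EDIT_ENTITY_DOCS"],  # names worth double-checking
            "actors": {
                "users": ["urn:li:corpuser:jane.doe"],  # placeholder
                "resourceOwners": False,
                "allUsers": False,
                "allGroups": False,
            },
            "resources": {
                "type": "dataset",
                "resources": [
                    "urn:li:dataset:(urn:li:dataPlatform:postgres,geodata.public.parcels,PROD)"
                ],
            },
        }
    },
)
```

An alternative that might be cleaner: make each Data owner an actual owner of their Datasets and use a single policy with resourceOwners set to true.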
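And for the last point, my rough idea for backing up the UI-entered metadata: as far as I understand, UI documentation edits land in the editableDatasetProperties aspect and UI tags in globalTags, so a periodic job could dump those to JSON for versioning in GitHub. A sketch (the aspect selection and GMS address are assumptions on my part):

```python
import json

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import EditableDatasetPropertiesClass, GlobalTagsClass

graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))  # placeholder

backup = {}
for urn in graph.get_urns_by_filter(entity_types=["dataset"]):
    aspects = {}
    # Documentation edited in the UI lands in editableDatasetProperties.
    props = graph.get_aspect(urn, EditableDatasetPropertiesClass)
    if props:
        aspects["editableDatasetProperties"] = props.to_obj()
    # Tags added in the UI land in globalTags.
    tags = graph.get_aspect(urn, GlobalTagsClass)
    if tags:
        aspects["globalTags"] = tags.to_obj()
    if aspects:
        backup[urn] = aspects

with open("datahub_ui_metadata_backup.json", "w") as f:
    json.dump(backup, f, indent=2)
```

I’m aware the actual source of truth is the GMS database, so snapshotting the storage behind GMS may be the more robust restore path; I’d love to hear which approach is recommended.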
Does this make sense? Do you have any advice on how to approach this use case?
Thanks a lot!!