Using DataHub as a Master Metadata System for a Hadoop Warehouse

Original Slack Thread

Hi!
I’m a newcomer to DataHub. Could you please help me with a simple question: is it possible to create new fields on a Dataset, or to create a Dataset itself? Or can I modify Datasets and other entities only via the API?

You cannot edit a schema or create a dataset in the UI. You need to use the API or the SDK.
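
For illustration, here is a minimal sketch of creating a dataset programmatically with the Python SDK’s REST emitter; the server URL, platform, and dataset name are placeholders, not values from the thread. Schema fields are emitted the same way through the SchemaMetadata aspect (see the later sketches).

```python
# Sketch: create a dataset entity programmatically (assumes a local DataHub
# instance at http://localhost:8080; platform and dataset name are illustrative).
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# The URN identifies the dataset; emitting an aspect for a new URN creates it.
dataset_urn = make_dataset_urn(platform="hive", name="dds.customer", env="PROD")

properties = DatasetPropertiesClass(
    name="customer",
    description="Customer dimension in the DDS layer (illustrative).",
)

emitter.emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=properties)
)
```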

Good! By the way, is that by design? It looks like DataHub is trying to protect the schema from being modified via the UI.

<@U05T6M1BEUS> - DataHub’s Dataset / Schema pages are primarily meant to capture data assets that already exist somewhere else (e.g. a table in a database, a schema for a Kafka topic, etc.). Curious to understand your objective for wanting edits.
Were you thinking about edits in the UI as a way to correct incorrectly observed facts, or to create new things? E.g. change the schema in the UI and then have that change be reflected in the underlying database table?

<@UV0M2EB8Q> Thank you for the detailed response. My case is a little bit different. We use DataHub as the master metadata system for a Hadoop warehouse: the metadata schemas for the datasets are designed by the data modeling team and documented by them in DataHub. Then, based on these published metadata schemas, dedicated CI/CD scripts automatically create the physical datasets using Spark transformations. So the warehouse metadata is designed and stored in DataHub first.
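
The thread doesn’t show those CI/CD scripts, but the pattern could look roughly like the sketch below: fetch the published SchemaMetadata aspect with the Python SDK’s graph client and turn it into a Spark DDL statement. The server URL, dataset URN, table name, and the assumption that nativeDataType holds Spark-compatible type strings are all illustrative.

```python
# Sketch: a CI/CD step that reads a schema from DataHub and creates the
# physical table with Spark. URLs, the URN, and the type mapping are assumptions.
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import SchemaMetadataClass
from pyspark.sql import SparkSession

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,dds.customer,PROD)"
schema = graph.get_aspect(entity_urn=dataset_urn, aspect_type=SchemaMetadataClass)
if schema is None:
    raise SystemExit(f"No schema published in DataHub for {dataset_urn}")

# Assume nativeDataType holds a Spark/Hive-compatible type string.
columns = ", ".join(f"{f.fieldPath} {f.nativeDataType}" for f in schema.fields)
ddl = f"CREATE TABLE IF NOT EXISTS dds.customer ({columns}) STORED AS PARQUET"

spark = SparkSession.builder.appName("provision-dds").enableHiveSupport().getOrCreate()
spark.sql(ddl)
```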

Dear colleagues and <@UV0M2EB8Q>, I’m still not sure whether my team is using DataHub correctly.
What we have:
• A number of databases, usually Postgres, which are used by business microservices
• Metadata is pulled by DataHub from these Postgres databases (see the ingestion sketch after this list)
• A data analyst designs the metadata schema of the Data Lake’s DDS and CDM layers in Excel, based on the metadata from the source databases
• Using the DataHub API we push this Excel into DataHub, so the Data Lake schema is published in DataHub (see the push sketch after this list)
• Then we implement the ETL that actually fills the Data Lake layers (DDS, CDM) according to the metadata published in DataHub
So now that you see our approach, could someone comment on it? Are we on the right track at all? )) Thank you
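
For the “pull metadata from Postgres” step, a programmatic ingestion pipeline could look roughly like this sketch (host, database, credentials, and the DataHub server URL are placeholders; an equivalent YAML recipe run with the DataHub CLI works the same way):

```python
# Sketch: ingest Postgres metadata into DataHub programmatically.
# All connection details below are illustrative placeholders.
import os

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "orders-db.internal:5432",
                "database": "orders",
                "username": "datahub_reader",
                "password": os.environ.get("POSTGRES_PASSWORD", ""),
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail the job if ingestion reported errors
```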
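
And for the “push the Excel into DataHub” step, a minimal sketch of turning spreadsheet rows into a SchemaMetadata aspect; the file name, the expected columns (field_name, data_type, description), the platform, and the dataset name are assumptions for illustration.

```python
# Sketch: publish a Data Lake schema designed in Excel as a DataHub dataset schema.
# File name, column names, platform, and dataset name are illustrative.
import pandas as pd  # reading .xlsx also requires openpyxl

from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    OtherSchemaClass,
    SchemaFieldClass,
    SchemaFieldDataTypeClass,
    SchemaMetadataClass,
    StringTypeClass,
)

df = pd.read_excel("dds_customer.xlsx")  # columns: field_name, data_type, description

fields = [
    SchemaFieldClass(
        fieldPath=str(row.field_name),
        # A real script would map data_type to the matching DataHub type class;
        # StringTypeClass is used here only to keep the sketch short.
        type=SchemaFieldDataTypeClass(type=StringTypeClass()),
        nativeDataType=str(row.data_type),
        description=str(row.description),
    )
    for row in df.itertuples(index=False)
]

schema = SchemaMetadataClass(
    schemaName="dds.customer",
    platform=make_data_platform_urn("hive"),
    version=0,
    hash="",
    platformSchema=OtherSchemaClass(rawSchema=""),
    fields=fields,
)

dataset_urn = make_dataset_urn(platform="hive", name="dds.customer", env="PROD")
DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=schema)
)
```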

<@U05T6M1BEUS>: that’s very cool and an excellent way to use DataHub -> to drive data provisioning rather than just viewing data as it exists

Hello <@UV0M2EB8Q>! Thank you for the response. Glad to hear that our approach is sound.