Adding Column Tags for Partitioning in DataHub with Minimal Effort

Original Slack Thread

Hey :wave: we use a custom pipeline to ingest our dbt project into DataHub and already use `meta_mapping` for model tags and such. We would like to add column tags for the partition columns, ideally with minimal manual effort. Our `model.yml` looks something like this:

```
  - name: model_name
    description: |
      This model summarises ...
    config:
      materialized: incremental
      on_schema_change: append_new_columns
      properties:
        partitioning: "ARRAY['date']"
      unique_key: ['date']
      meta:
        datahub_technical_owner: owner@email.com
      tags:
        - 'model_tag'
    columns:
      - name: id
        description: '{{ doc("id") }}'
```
Q: Is there a way to parse the `partitioning` property to add the column tag to the partition key(s)? Or is that not possible, and would we need to add a new property (e.g. `is_partition_key: True`) to the `columns:` section?

I couldn’t really figure out from the docs if that is possible.

Hey there! :wave:


It seems likely that you can achieve your goal with the `column_meta_mapping` feature of DataHub’s dbt ingestion, which lets you define mapping rules that are evaluated against the `meta` properties of dbt columns. However, based on the provided data sources, there is no mention of parsing the model-level `partitioning` property to tag partition columns automatically, so the partition key most likely has to be flagged in the column metadata itself.

Here are a few steps you might consider:

  1. Use `column_meta_mapping`: Define a `column_meta_mapping` in your ingestion recipe to add tags to columns based on their `meta` properties. This requires an identifiable property in the column metadata that marks the column as a partition key.

  2. Add a custom property: If the `partitioning` property cannot be used for tagging directly, add a custom property to each partition key column. For example, add `is_partition_key: True` to the `meta` of the relevant columns in your `model.yml`.

  3. Automate with a script: If you have many models and columns, consider writing a script that parses your `model.yml` files, reads the `partitioning` property, and adds `is_partition_key` to the matching columns. The script can update your dbt project files before ingestion.

  4. Update the ingestion recipe: Add a `column_meta_mapping` rule that matches the `is_partition_key` property and applies the desired tag during ingestion.
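As a rough illustration of steps 3 and 4, here is a minimal Python sketch. It makes several assumptions: the `partitioning` value is a string with single-quoted column names such as `"ARRAY['date']"`, the model has already been loaded from `model.yml` into a plain dict, and partition columns missing from `columns:` should get a stub entry. The function names `parse_partition_columns` and `mark_partition_columns` and the tag name `partition_key` are hypothetical. The `COLUMN_META_MAPPING` dict mirrors the `match` / `operation` / `config` shape shown in the dbt meta automated mappings documentation, but verify it against your DataHub version.

```python
import re
from copy import deepcopy


def parse_partition_columns(partitioning: str) -> list:
    """Return the single-quoted names in a string such as "ARRAY['date']".

    Assumption: entries are plain column names. Partition transforms
    (for example 'day(ts)') are returned verbatim and will not match
    a real column name.
    """
    return re.findall(r"'([^']+)'", partitioning)


def mark_partition_columns(model: dict, meta_key: str = "is_partition_key") -> dict:
    """Return a copy of a dbt model dict with partition columns flagged in meta."""
    result = deepcopy(model)  # leave the caller's dict untouched
    properties = (result.get("config") or {}).get("properties") or {}
    partition_cols = parse_partition_columns(properties.get("partitioning", ""))

    columns = result.setdefault("columns", [])
    by_name = {col["name"]: col for col in columns}
    for name in partition_cols:
        col = by_name.get(name)
        if col is None:
            # Partition column is not documented yet: add a stub entry.
            col = {"name": name}
            columns.append(col)
            by_name[name] = col
        col.setdefault("meta", {})[meta_key] = True
    return result


# Hypothetical recipe fragment for the dbt source config, written as a
# Python dict. The tag name "partition_key" is an assumption.
COLUMN_META_MAPPING = {
    "is_partition_key": {
        "match": True,
        "operation": "add_tag",
        "config": {"tag": "partition_key"},
    }
}


model = {
    "name": "model_name",
    "config": {"properties": {"partitioning": "ARRAY['date']"}},
    "columns": [{"name": "id"}],
}
updated = mark_partition_columns(model)
```

In practice you would load each entry under `models:` with a YAML parser, pass it through `mark_partition_columns`, and write the result back before running ingestion. A YAML library that preserves comments and formatting on round-trip may be preferable so the dbt files stay readable.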

For more detailed guidance, you might want to refer to the dbt meta automated mappings documentation (https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/dbt#dbt-meta-automated-mappings) and the DataHub documentation on adding tags to columns (https://datahubproject.io/docs/0.14.0/generated/metamodel/entities/dataset#adding-tags-or-glossary-terms-to-columns--fields-of-a-dataset).

Would you like more detailed instructions on any of these steps?
