Automating Data Ingestion and Tagging with DataHub's Transformer Feature

Original Slack Thread

New to DataHub - is the GMS able to infer tags on ingestion or do I need to preload tags?

(coworker of Christian’s here) Also, is there any capability to automate data ingest/parse/cleansing along with tagging? I think the ‘transformer’ feature is what we might need for auto-tagging, but our boss seems convinced DataHub can also handle ingest/parse/cleansing and routing to the final repo destination, which I am not seeing.

Hi <@U059X6T0CSV> & <@U05N53BMLBT>! Great to have you with us in the Community :teamwork:

Here are the options we currently support for auto-ingesting tags:
• Extract from source during ingestion - for many of our ingestion sources (Airflow, Snowflake, dbt, etc.), we will automatically extract existing tags from the source system & apply them in DataHub. Our Source docs will have details about what is available for each source; for example, the dbt source docs describe how we extract tags/owners/etc. from dbt’s meta block
• <https://datahubproject.io/docs/metadata-ingestion/docs/transformer/intro/|Transformers> - <@U05N53BMLBT> you’re exactly right - transformers are a way to auto-apply tags/terms/owners/etc. during ingestion if they don’t exist in the source system
• Actions Framework - this is the most dynamic & customizable way for you to apply tags as your sources evolve; check out <https://datahubproject.io/docs/actions/guides/developing-an-action|this guide> and <https://www.youtube.com/watch?v=lrx8LFbe7w0|Hyejin’s demo> of what you can do with it!
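
To make the Transformers option above concrete, here is a minimal sketch of an ingestion recipe built as a Python dict, using the `simple_add_dataset_tags` transformer to attach a fixed tag to every dataset the source emits. Everything specific in it is an assumption for illustration and not from the thread: the dbt file paths, the `postgres` target platform, the `NeedsDocumentation` tag name, and the `http://localhost:8080` server address. The helper names `build_recipe` and `run_recipe` are hypothetical.

```python
from typing import Any, Dict, List


def build_recipe(tag_names: List[str]) -> Dict[str, Any]:
    """Build a recipe dict that adds the given tags to every ingested dataset."""
    return {
        # Assumed source: swap in your own source type and config.
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": "./target/manifest.json",  # hypothetical path
                "catalog_path": "./target/catalog.json",    # hypothetical path
                "target_platform": "postgres",              # assumed platform
            },
        },
        # The transformer runs between source and sink and adds the tags
        # to each dataset as it passes through.
        "transformers": [
            {
                "type": "simple_add_dataset_tags",
                "config": {
                    "tag_urns": [f"urn:li:tag:{name}" for name in tag_names],
                },
            }
        ],
        # Assumed sink: a DataHub GMS reachable at a local address.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe; requires the acryl-datahub package to be installed."""
    # Imported lazily so the recipe can be built and inspected without DataHub.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_recipe(["NeedsDocumentation"])
```

Calling `run_recipe(recipe)` would execute the ingestion against a running DataHub instance. For rule-based tagging, the transformer docs linked above also describe pattern-based and callable-based variants. Note that a transformer only modifies metadata on its way into DataHub; it does not parse, cleanse, or route the underlying data.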