Managing and Editing Tags Assigned During Data Ingestion in DataHub

Original Slack Thread

Can we edit tags that are assigned during ingestion (through the recipe) in DataHub? If not, is it possible to make them editable after they have been assigned through ingestion?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


To provide you with the most accurate and helpful response, could you please share the following details?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)

This information will help tailor the response to your specific setup.

<@U06TM7M40RK> We are using the UI for ingestion and adding the tags through the transformers section of the recipe, for every source type.
We are on version 0.13.3.

Thank you for the details. Based on your setup, here is how you can manage tags assigned during ingestion and make them editable afterward:

Editing Tags Assigned During Ingestion

By default, tags assigned during ingestion can overwrite existing tags. However, you can configure your ingestion recipe to ensure that tags added through the UI are not overwritten by subsequent ingestions. This can be achieved using the simple_add_dataset_tags transformer with the semantics: PATCH configuration.

Configuration Example

Here’s how you can configure your ingestion recipe to add tags while preserving existing tags:

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
      semantics: PATCH  # This ensures existing tags are preserved
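To make the difference between the two semantics concrete, here is a minimal Python sketch that models them as list merges. It is an illustration only, not DataHub's implementation: the helper `apply_transformer` and the `urn:li:tag:PII` tag are hypothetical, and the model ignores any tags emitted by the source itself.

```python
def apply_transformer(server_tags, tag_urns, semantics="OVERWRITE"):
    """Model the tags a dataset ends up with after one ingestion run.

    Assumption: PATCH keeps what is already on the server and adds the
    transformer's tags; OVERWRITE keeps only the transformer's tags.
    """
    if semantics == "PATCH":
        merged = [*server_tags, *tag_urns]
    elif semantics == "OVERWRITE":
        merged = list(tag_urns)
    else:
        raise ValueError(f"unknown semantics: {semantics!r}")
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(merged))


existing = ["urn:li:tag:PII"]  # hypothetical tag added by hand in the UI
recipe_tags = ["urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"]

patched = apply_transformer(existing, recipe_tags, "PATCH")
overwritten = apply_transformer(existing, recipe_tags, "OVERWRITE")
```

Under this model, `patched` still contains the hand-added tag, while `overwritten` contains only the two tags from the recipe.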

Steps to Implement

  1. Modify Your Ingestion Recipe: Update your ingestion recipe to include the simple_add_dataset_tags transformer with the semantics: PATCH configuration.
  2. Run the Ingestion: Execute the ingestion process through the UI as usual.
  3. Edit Tags in the UI: After ingestion, you can manually add or edit tags through the DataHub UI. The PATCH semantics ensure that these manual changes are not overwritten by subsequent ingestions.

Example Recipe

Here’s a complete example of an ingestion recipe that pulls metadata from a source and applies tags while preserving existing ones:

source:
  type: your_source_type
  config:
    # Your source configuration here

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
      semantics: PATCH  # Preserve existing tags

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Additional Information

  • Behavior of UPSERT: According to a Slack discussion, the default behavior of ingestion is to overwrite existing tags. Using the PATCH semantics in your transformer configuration will prevent this.
  • Handling Multiple Ingestions: If you have multiple ingestion jobs, ensure that all of them use the PATCH semantics to maintain consistency across your DataHub instance.
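One way to check several recipes for consistency is a small script along these lines. It is a sketch under assumptions: the recipes are taken to be already loaded as Python dictionaries, the job names `job_a` and `job_b` are made up, and `transformers_missing_patch` is a hypothetical helper, not part of DataHub.

```python
def transformers_missing_patch(recipe):
    """Return the types of transformers whose semantics is not PATCH."""
    missing = []
    for transformer in recipe.get("transformers") or []:
        config = transformer.get("config") or {}
        # A missing "semantics" key counts as not PATCH
        if str(config.get("semantics", "")).upper() != "PATCH":
            missing.append(transformer.get("type", "<unknown>"))
    return missing


# Hypothetical recipes, shaped like the YAML above once parsed
recipes = {
    "job_a": {
        "transformers": [
            {
                "type": "simple_add_dataset_tags",
                "config": {"tag_urns": ["urn:li:tag:Legacy"], "semantics": "PATCH"},
            }
        ]
    },
    "job_b": {
        "transformers": [
            {
                "type": "simple_add_dataset_tags",
                "config": {"tag_urns": ["urn:li:tag:Legacy"]},
            }
        ]
    },
}

report = {name: transformers_missing_patch(r) for name, r in recipes.items()}
```

Here `report` flags `job_b`, whose transformer omits `semantics`, and leaves `job_a` with an empty list.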

Would you like more detailed instructions or have any other specific questions?

<@U06TM7M40RK> We are not able to delete, from the UI, tags on a column or dataset that were assigned through transformers during ingestion. How can we make tags assigned through transformers editable (for example, removable) from the UI?

To make tags assigned through transformers editable from the UI, including the ability to remove them, subsequent ingestions must not overwrite the changes you make there. This can be achieved by using the simple_add_dataset_tags transformer with the semantics: PATCH configuration, which preserves tags added through the UI across ingestion runs.

Configuration Example

Here’s how you can configure your ingestion recipe to add tags while preserving existing tags:

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
      semantics: PATCH  # This ensures existing tags are preserved

Steps to Implement

  1. Modify Your Ingestion Recipe: Update your ingestion recipe to include the simple_add_dataset_tags transformer with the semantics: PATCH configuration.
  2. Run the Ingestion: Execute the ingestion process through the UI as usual.
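  3. Edit Tags in the UI: After ingestion, you can manually add, edit, or remove tags through the DataHub UI. The PATCH semantics ensure that tags added manually are not overwritten by subsequent ingestions. A tag removed in the UI is likely to reappear on the next run if it is still listed under tag_urns in the recipe, so remove it there as well.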
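The steps above can be walked through with a small Python simulation of removing a tag in the UI and then re-running ingestion. It assumes PATCH behaves like a union of the tags on the server and the tags in `tag_urns`; `ingest_with_patch` is a hypothetical stand-in for an ingestion run, not a DataHub function.

```python
RECIPE_TAGS = {"urn:li:tag:NeedsDocumentation", "urn:li:tag:Legacy"}


def ingest_with_patch(server_tags, tag_urns):
    """Assumed PATCH behaviour: union of server tags and recipe tags."""
    return set(server_tags) | set(tag_urns)


# First run: both recipe tags land on the dataset
after_first_run = ingest_with_patch(set(), RECIPE_TAGS)

# A user removes one tag through the UI
after_ui_removal = after_first_run - {"urn:li:tag:Legacy"}

# Second run with the unchanged recipe: the removed tag is added again
after_second_run = ingest_with_patch(after_ui_removal, RECIPE_TAGS)

# Second run after also dropping the tag from the recipe: it stays removed
trimmed_recipe = RECIPE_TAGS - {"urn:li:tag:Legacy"}
after_trimmed_run = ingest_with_patch(after_ui_removal, trimmed_recipe)
```

If the assumption holds, a removal made only in the UI does not survive the next run; the tag also has to be taken out of `tag_urns` in the recipe.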

Example Recipe

Here’s a complete example of an ingestion recipe that pulls metadata from a source and applies tags while preserving existing ones:

source:
  type: your_source_type
  config:
    # Your source configuration here

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
      semantics: PATCH  # Preserve existing tags

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Additional Information

  • Behavior of UPSERT: According to a Slack discussion, the default behavior of ingestion is to overwrite existing tags. Using the PATCH semantics in your transformer configuration will prevent this.
  • Handling Multiple Ingestions: If you have multiple ingestion jobs, ensure that all of them use the PATCH semantics to maintain consistency across your DataHub instance.

Would you like more detailed instructions or have any other specific questions?
