Managing Ownership Types in DataHub Metadata Transformation

Original Slack Thread

Hi <@U06TM7M40RK>, if you have two transformers of the same type in the same ingestion script, will the last transformer overwrite the first one even though the ownership type is different?

  - type: pattern_add_dataset_ownership
    config:
      semantics: OVERWRITE
      replace_existing: true
      owner_pattern:
        rules:
          'urn:li:dataset:example1': ['urn:li:corpGroup:group1']
      ownership_type: BUSINESS_OWNER

  - type: pattern_add_dataset_ownership
    config:
      semantics: OVERWRITE
      replace_existing: true
      owner_pattern:
        rules:
          'urn:li:dataset:example1': ['urn:li:corpGroup:group2']
      ownership_type: DATA_OWNER

In DataHub, when you have multiple transformers of the same type in an ingestion script, the behavior depends on the configuration of each transformer, particularly the semantics and replace_existing settings.

According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/transformer/dataset_transformer.md|DataHub documentation>, the semantics and replace_existing parameters control how ownership is applied:

  1. semantics: OVERWRITE: This setting means that the ownership information provided by the transformer will overwrite any existing ownership information for the dataset on the DataHub GMS server.

  2. replace_existing: true: This setting indicates that the existing ownership information from the ingestion source should be removed before applying the new ownership information specified in the transformer.

Given your configuration, both transformers are set to OVERWRITE and replace_existing: true. This means that each transformer will remove any existing ownership information and apply the new ownership information specified in its configuration.

Therefore, the second transformer will overwrite the ownership set by the first transformer, even though the ownership types are different. The final ownership for urn:li:dataset:example1 will be set to DATA_OWNER with urn:li:corpGroup:group2 as the owner, as specified by the second transformer.

Would you like more detailed information on how to configure these transformers to achieve a different result, such as retaining both ownership types?

<@U06TM7M40RK> yes how can i retain both ownership types

To retain both ownership types for the same dataset, you can adjust the configuration of your transformers to use the PATCH semantics instead of OVERWRITE. This will allow you to add new ownership information without removing the existing ownership data.

Here’s how you can configure your transformers to retain both ownership types:

transformers:
  - type: pattern_add_dataset_ownership
    config:
      semantics: PATCH
      replace_existing: false
      owner_pattern:
        rules:
          'urn:li:dataset:example1': ['urn:li:corpGroup:group1']
      ownership_type: BUSINESS_OWNER

  - type: pattern_add_dataset_ownership
    config:
      semantics: PATCH
      replace_existing: false
      owner_pattern:
        rules:
          'urn:li:dataset:example1': ['urn:li:corpGroup:group2']
      ownership_type: DATA_OWNER

Explanation:

  1. semantics: PATCH: This setting ensures that the ownership information is added to the existing ownership data rather than overwriting it. This allows multiple ownership types to coexist for the same dataset.

  2. replace_existing: false: This setting ensures that the existing ownership information from the ingestion source is not removed before applying the new ownership information.

By using these settings, both ownership types (BUSINESS_OWNER and DATA_OWNER) will be retained for urn:li:dataset:example1, with urn:li:corpGroup:group1 and urn:li:corpGroup:group2 as the respective owners.

Would you like more detailed information on any of these configurations?

<@U06TM7M40RK> what are the enums for ownership type

In DataHub, the ownership types are defined as enums, which categorize the relationship between a person or group and a metadata entity. According to the <https://datahubproject.io/docs/0.14.0/generated/metamodel/entities/dataset#ownership-1|DataHub documentation>, the available ownership types include:

  1. CUSTOM: Used when the ownership type is unknown or when a new ownership type is specified that does not have a predefined enum value.
  2. TECHNICAL_OWNER: A person or group responsible for the technical aspects of the asset.
  3. BUSINESS_OWNER: A person or group responsible for the logical or business-related aspects of the asset.
  4. DATA_STEWARD: A steward, expert, or delegate responsible for the asset.
  5. NONE: No specific type associated with the owner.
  6. DEVELOPER: Deprecated. Previously used for a person or group in charge of developing the code.
  7. DATAOWNER: Deprecated. Previously used for a person or group owning the data.
  8. DELEGATE: Deprecated. Previously used for a person or group overseeing the operation, e.g., a DBA or SRE.
  9. PRODUCER: Deprecated. Previously used for a person, group, or service that produces or generates the data.
  10. CONSUMER: Deprecated. Previously used for a person, group, or service that consumes the data.
  11. STAKEHOLDER: Deprecated. Previously used for a person or group with a direct business interest.

The documentation recommends replacing the deprecated types with more specific ones such as TECHNICAL_OWNER, BUSINESS_OWNER, or DATA_STEWARD.
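
As a rough illustration of using a value from this list in a transformer config: DATA_OWNER, which appears in the first configuration in this thread, is not among the values listed above (the closest listed name is the deprecated DATAOWNER), so it may be rejected or handled differently than intended. The sketch below swaps in TECHNICAL_OWNER purely as an example; which non-deprecated type actually fits group2 is an assumption made here for illustration.

```yaml
transformers:
  - type: pattern_add_dataset_ownership
    config:
      semantics: PATCH
      replace_existing: false
      owner_pattern:
        rules:
          'urn:li:dataset:example1': ['urn:li:corpGroup:group2']
      # Assumption: TECHNICAL_OWNER stands in for the unlisted DATA_OWNER value.
      ownership_type: TECHNICAL_OWNER
```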

Would you like more detailed information on how to use these enums in a specific context?

<@U06TM7M40RK> how do i remove all owners from all glue dataset and containers

To remove all owners from all Glue datasets and their containers in DataHub, you can use the simple_remove_dataset_ownership transformer. This transformer is designed to clear all ownership information from datasets during the ingestion process.

Here’s how you can configure your ingestion recipe to achieve this:

transformers:
  - type: "simple_remove_dataset_ownership"
    config: {}

Explanation:

  • simple_remove_dataset_ownership: This transformer removes all owners from the datasets it processes. It is useful when you want to clear incorrect or outdated ownership information.

Steps:

  1. Add the Transformer to Your Ingestion Recipe: Include the simple_remove_dataset_ownership transformer in your ingestion recipe YAML file.
  2. Run the Ingestion: Execute the ingestion process with this configuration. This will remove all ownership information from the Glue datasets and their containers.

This approach will clear all ownership data, so ensure that this is the desired action before proceeding.
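
For context, a minimal sketch of where the transformer sits in a full recipe might look like the following. The aws_region value, the choice of the datahub-rest sink, and the server address are assumptions for illustration and do not come from the thread.

```yaml
source:
  type: glue
  config:
    aws_region: "us-east-1"  # assumption: replace with your own region

transformers:
  # Clears ownership on the datasets this pipeline processes.
  - type: "simple_remove_dataset_ownership"
    config: {}

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # assumption: a locally running GMS
```

A recipe like this is typically run with `datahub ingest -c <recipe file>`.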

Would you like more detailed information on how to set up and run this ingestion process?

<@U06TM7M40RK> can it be run with other transformers

Yes, the simple_remove_dataset_ownership transformer can be run alongside other transformers in the same ingestion recipe. This allows you to clear existing ownership information and then apply new transformations, such as adding new owners, tags, or other metadata modifications.

Here’s an example of how you can configure multiple transformers in a single ingestion recipe:

transformers:
  - type: "simple_remove_dataset_ownership"
    config: {}

  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpuser:username1"
        - "urn:li:corpuser:username2"
      ownership_type: "TECHNICAL_OWNER"

  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:Important"

Explanation:

  1. simple_remove_dataset_ownership: This transformer will first clear all existing ownership information from the datasets.
  2. simple_add_dataset_ownership: After clearing ownership, this transformer adds new owners to the datasets.
  3. simple_add_dataset_tags: This transformer adds tags to the datasets.

Steps:

  • Order of Execution: Transformers are executed in the order they are listed in the YAML file. This means that the simple_remove_dataset_ownership transformer will run first, clearing all ownership data, followed by the addition of new owners and tags.

This configuration allows you to manage dataset metadata comprehensively within a single ingestion process.

Would you like more detailed information on any of these transformers or how to configure them?

<@U06TM7M40RK> what should i look out for in the logs to prove that the simple_remove_dataset_ownership is working

The DataHub documentation I have ingested does not mention specific log entries or indicators that confirm the simple_remove_dataset_ownership transformer is working. To verify its operation, look for log entries that indicate the removal of ownership metadata, or for confirmation messages that the transformer was applied. Checking the resulting dataset metadata in the DataHub UI to confirm that ownership has been cleared can also serve as proof. If this doesn’t help, please reach out to the DataHub support team for further assistance.
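
One way to check the transformer's effect without relying on log output is to point the same recipe at a file sink and inspect what the pipeline would have emitted. The sketch below assumes a Glue source with a placeholder region and a hypothetical output filename. What the ownership entries look like in that file is not confirmed anywhere in this thread, so treat any expectation about empty owner lists as something to verify rather than a documented behavior.

```yaml
source:
  type: glue
  config:
    aws_region: "us-east-1"  # assumption: replace with your own region

transformers:
  - type: "simple_remove_dataset_ownership"
    config: {}

sink:
  # Writes the emitted metadata to a local file instead of sending it to DataHub.
  type: file
  config:
    filename: "./glue_ownership_check.json"  # hypothetical output path
```

Searching the resulting file for ownership aspects shows whether this run emitted any ownership changes at all. If none appear, that would suggest the transformer had nothing to act on in this pipeline, which is worth ruling out before looking for a server-side cause.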

<@U07FV3UHE4D> I have added the simple_remove_dataset_ownership transformer to a Glue ingestion script (DataHub v0.15.0.1). It seems to have no impact. Is there an alternative way to remove all ownership from Glue dataset and container entities?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps!

<@U07FV3UHE4D> I could not get the transformer to work, and no errors were reported.