Hi <@U06TM7M40RK>, if you have two transformers of the same type in the same ingestion script, will the last transformer overwrite the first one even though the ownership type is different?
In DataHub, when you have multiple transformers of the same type in an ingestion script, the behavior depends on the configuration of each transformer, particularly the semantics and replace_existing settings.
semantics: OVERWRITE: This setting means that the ownership information provided by the transformer will overwrite any existing ownership information for the dataset on the DataHub GMS server.
replace_existing: true: This setting indicates that the existing ownership information from the ingestion source should be removed before applying the new ownership information specified in the transformer.
Given your configuration, both transformers are set to OVERWRITE and replace_existing: true. This means that each transformer will remove any existing ownership information and apply the new ownership information specified in its configuration.
Therefore, the second transformer will overwrite the ownership set by the first transformer, even though the ownership types are different. The final ownership for urn:li:dataset:example1 will be set to DATA_OWNER with urn:li:corpGroup:group2 as the owner, as specified by the second transformer.
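For reference, a configuration matching this description might look like the sketch below. The recipe under discussion is not preserved in this thread, so the choice of simple_add_dataset_ownership and the owner_urns values are assumptions for illustration. Note that DataHub's ownership type enum spells the value DATAOWNER, which the thread writes as DATA_OWNER.

```yaml
# Sketch only: the original recipe was not preserved in the thread.
# Assumes two simple_add_dataset_ownership transformers; the group URNs are
# the ones named in the discussion.
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      semantics: OVERWRITE
      replace_existing: true
      owner_urns:
        - "urn:li:corpGroup:group1"
      ownership_type: "BUSINESS_OWNER"
  # Runs second; with replace_existing: true it discards the owners set above.
  - type: "simple_add_dataset_ownership"
    config:
      semantics: OVERWRITE
      replace_existing: true
      owner_urns:
        - "urn:li:corpGroup:group2"
      ownership_type: "DATAOWNER"
```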
Would you like more detailed information on how to configure these transformers to achieve a different result, such as retaining both ownership types?
To retain both ownership types for the same dataset, you can adjust the configuration of your transformers to use the PATCH semantics instead of OVERWRITE. This will allow you to add new ownership information without removing the existing ownership data.
Here’s how you can configure your transformers to retain both ownership types:
semantics: PATCH: This setting ensures that the ownership information is added to the existing ownership data rather than overwriting it. This allows multiple ownership types to coexist for the same dataset.
replace_existing: false: This setting ensures that the existing ownership information from the ingestion source is not removed before applying the new ownership information.
By using these settings, both ownership types (BUSINESS_OWNER and DATA_OWNER) will be retained for urn:li:dataset:example1, with urn:li:corpGroup:group1 and urn:li:corpGroup:group2 as the respective owners.
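The PATCH settings described above could be sketched as follows. As before, the transformer type and owner_urns values are assumptions chosen for illustration, not the configuration from the original reply.

```yaml
# Sketch only: same assumed transformers as in the question, switched to
# PATCH semantics so that neither one removes the other's owners.
transformers:
  - type: "simple_add_dataset_ownership"
    config:
      semantics: PATCH
      replace_existing: false
      owner_urns:
        - "urn:li:corpGroup:group1"
      ownership_type: "BUSINESS_OWNER"
  - type: "simple_add_dataset_ownership"
    config:
      semantics: PATCH
      replace_existing: false
      owner_urns:
        - "urn:li:corpGroup:group2"
      ownership_type: "DATAOWNER"
```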
Would you like more detailed information on any of these configurations?
To remove owners from Glue datasets in DataHub, you can use the simple_remove_dataset_ownership transformer. This transformer clears the owners that the ingestion source sends for each dataset during the ingestion process. As its name suggests, it targets dataset entities, so ownership on containers may need to be handled separately.
Here’s how you can configure your ingestion recipe to achieve this:
simple_remove_dataset_ownership: This transformer removes the owners sent by the ingestion source for the datasets it processes. It is useful when you want to clear incorrect or outdated ownership information.
Steps:
Add the Transformer to Your Ingestion Recipe: Include the simple_remove_dataset_ownership transformer in your ingestion recipe YAML file.
Run the Ingestion: Execute the ingestion process with this configuration. The ownership emitted by the Glue source for each dataset is cleared before it reaches DataHub.
This approach will clear all ownership data, so ensure that this is the desired action before proceeding.
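A minimal recipe following the steps above might look like this sketch. The region and server address are placeholders, and the rest of the Glue source configuration is omitted.

```yaml
# Sketch only: aws_region and server are placeholder values.
source:
  type: glue
  config:
    aws_region: "us-east-1"

transformers:
  # Takes no options, so the config block is empty.
  - type: "simple_remove_dataset_ownership"
    config: {}

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```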
Would you like more detailed information on how to set up and run this ingestion process?
Yes, the simple_remove_dataset_ownership transformer can be run alongside other transformers in the same ingestion recipe. This allows you to clear existing ownership information and then apply new transformations, such as adding new owners, tags, or other metadata modifications.
Here’s an example of how you can configure multiple transformers in a single ingestion recipe:
simple_remove_dataset_ownership: This transformer runs first and clears the ownership information that the source sends for the datasets.
simple_add_dataset_ownership: After clearing ownership, this transformer adds new owners to the datasets.
simple_add_dataset_tags: This transformer adds tags to the datasets.
Steps:
Order of Execution: Transformers are executed in the order they are listed in the YAML file. This means that the simple_remove_dataset_ownership transformer will run first, clearing all ownership data, followed by the addition of new owners and tags.
This configuration allows you to manage dataset metadata comprehensively within a single ingestion process.
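The three transformers described above could be chained as in the sketch below. The owner and tag URNs are hypothetical values for illustration only.

```yaml
# Sketch only: the owner and tag URNs are hypothetical.
# Transformers run top to bottom, in the order listed.
transformers:
  - type: "simple_remove_dataset_ownership"
    config: {}
  - type: "simple_add_dataset_ownership"
    config:
      owner_urns:
        - "urn:li:corpGroup:group1"
      ownership_type: "BUSINESS_OWNER"
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:example-tag"
```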
Would you like more detailed information on any of these transformers or how to configure them?
The DataHub documentation I have ingested does not mention specific log entries or indicators that confirm the simple_remove_dataset_ownership transformer is working. To verify its operation, you can look for log entries that indicate the removal of ownership metadata or the successful application of the transformer. You can also check the resulting dataset metadata in the DataHub UI to confirm that ownership information has been cleared. If this doesn’t help, please reach out to the DataHub support team for further assistance.
<@U07FV3UHE4D> I have added the simple_remove_dataset_ownership transformer to a Glue ingestion script (DataHub v0.15.0.1), and it seems to have no impact. Is there an alternative way to remove all ownership from Glue dataset and container entities?