<@U06TM7M40RK> how to undelete a soft-deleted entity via python sdk or graph client?
To undelete a soft-deleted entity in DataHub, you can update the “Status” aspect of the entity to set the removed field to false. This can be done using the DataHub Python SDK or the GraphQL API.
Using the Python SDK
You can use the DatahubRestEmitter from the Python SDK to update the status aspect of the entity. Here’s an example:
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.specific.common import StatusPatchBuilder
# Initialize the emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
# Create a StatusPatchBuilder instance
status_patch = StatusPatchBuilder(urn="urn:li:your-entity-type:your-entity-id")
# Set the removed field to false
status_patch.set_removed(False)
# Emit the changes
emitter.emit_mcp(status_patch.build())
Using the GraphQL API
If you prefer using the GraphQL API, you can construct a mutation to update the status aspect. Here’s a general approach:
- Collect the URN of the entity you want to undelete.
- Use a GraphQL mutation to update the removed field in the Status aspect to false.
Unfortunately, the exact GraphQL mutation syntax is not provided in the data sources, but you can refer to the <Rest.li API | DataHub API documentation> for more details on how to construct such a request.
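If you specifically want the Python graph client rather than a raw GraphQL call, a minimal sketch using DataHubGraph might look like this (it assumes a GMS reachable at http://localhost:8080 and uses a dataset URN as the example entity; the same StatusClass aspect drives both soft delete and undelete):
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import StatusClass
# Connect to the DataHub GMS over REST (adjust server/token for your deployment)
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
# Setting removed=False on the status aspect re-activates ("undeletes") the entity
entity_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
graph.emit(
    MetadataChangeProposalWrapper(
        entityUrn=entity_urn,
        aspect=StatusClass(removed=False),
    )
)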
Would you like more detailed information on any of these steps?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/how/delete-metadata.md|DataHub Delete Metadata Documentation>
- <Rest.li API | DataHub Deleting an Entity>
<@U06TM7M40RK> which package to install datahub.specific.common
I’m not able to find any data sources to answer your question. You might want to specify the context or version of DataHub you are using. If this doesn’t help, please reach out to the DataHub support team for further assistance.
<@U06TM7M40RK> how do you update entity dataset status via python sdk ?
To update the status of a dataset entity using the DataHub Python SDK, you can use the MetadataChangeProposalWrapper to modify the “Status” aspect of the dataset. Here’s a step-by-step guide on how to do this:
- Install the DataHub Python SDK: Ensure you have the DataHub Python SDK installed. You can install it using pip if you haven’t already:
pip install acryl-datahub
- Use the Python SDK to Update the Status: You can use the DatahubRestEmitter to send a metadata change proposal that updates the status aspect of the dataset.
Here’s an example code snippet to update the status of a dataset:
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import StatusClass, ChangeTypeClass
# Initialize the emitter
emitter = DatahubRestEmitter(gms_server="<http://localhost:8080>")
# Define the URN of the dataset you want to update
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Create a MetadataChangeProposalWrapper to update the status
mcp = MetadataChangeProposalWrapper(
entityType="dataset",
entityUrn=dataset_urn,
aspectName="status",
aspect=StatusClass(removed=False), # Set to True to soft delete, False to undelete
changeType=ChangeTypeClass.UPSERT,
)
# Emit the MCP
emitter.emit(mcp)
This code snippet sets the removed field to False, which means the dataset will be marked as active. If you want to soft delete the dataset, you can set removed to True.
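As a quick sanity check, you could read the aspect back with the graph client. This is a hedged sketch that assumes your SDK version exposes DataHubGraph.get_aspect with these parameter names:
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import StatusClass
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Read back the status aspect; expect removed=False after the update above
status = graph.get_aspect(entity_urn=dataset_urn, aspect_type=StatusClass)
print(status)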
Would you like more detailed information on any of these steps?
Sources:
- <metadata-service | DataHub Deleting an Entity>
<@U06TM7M40RK> how to remove upstream lineage of dataset entity via sdk?
To remove upstream lineage from a dataset entity using the DataHub Python SDK, you can use the MetadataChangeProposalWrapper to modify the “UpstreamLineage” aspect of the dataset. Here’s how you can do it:
- Install the DataHub Python SDK: Ensure you have the DataHub Python SDK installed. You can install it using pip if you haven’t already:
pip install acryl-datahub
- Use the Python SDK to Update the UpstreamLineage: You can use the DatahubRestEmitter to send a metadata change proposal that updates the upstream lineage aspect of the dataset.
Here’s an example code snippet to remove upstream lineage:
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import UpstreamLineageClass, ChangeTypeClass
# Initialize the emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
# Define the URN of the dataset you want to update
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Create an empty UpstreamLineageClass to remove all upstreams
upstream_lineage = UpstreamLineageClass(upstreams=[])
# Create a MetadataChangeProposalWrapper to update the upstream lineage
mcp = MetadataChangeProposalWrapper(
entityType="dataset",
entityUrn=dataset_urn,
aspectName="upstreamLineage",
aspect=upstream_lineage,
changeType=ChangeTypeClass.UPSERT,
)
# Emit the MCP
emitter.emit(mcp)
This code snippet sets the upstreams field to an empty list, effectively removing all upstream lineage connections for the specified dataset.
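If you only want to detach a single upstream rather than clearing all of them, a read-modify-write sketch along the same lines could work (again assuming DataHubGraph.get_aspect is available in your SDK version; the upstream URN here is hypothetical):
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import UpstreamLineageClass
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Hypothetical URN of the one upstream you want to detach
upstream_to_remove = "urn:li:dataset:(urn:li:dataPlatform:hive,some_upstream_table,PROD)"
# Read the current upstream lineage, drop the unwanted edge, and write the aspect back
current = graph.get_aspect(entity_urn=dataset_urn, aspect_type=UpstreamLineageClass)
if current is not None:
    current.upstreams = [u for u in current.upstreams if u.dataset != upstream_to_remove]
    graph.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=current))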
Would you like more detailed information on any of these steps?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/advanced/patch.md|Remove Upstream Lineage for Dataset>
<@U06TM7M40RK> how about removing downstream dataset lineage?
To remove downstream lineage for a dataset entity in DataHub, you need to understand that downstream lineage is typically inferred from upstream relationships. Therefore, removing downstream lineage directly is not a straightforward operation like removing upstream lineage. However, you can manage downstream lineage by adjusting the upstream relationships of the entities that depend on the dataset.
Here’s a general approach using the DataHub Python SDK:
- Identify the Downstream Entities: First, identify which entities have the dataset as an upstream dependency. This can be done by querying the lineage graph to find all entities that list the dataset as an upstream.
- Remove the Upstream Relationship: For each of these downstream entities, you would need to update their upstream lineage to remove the reference to the dataset in question.
Unfortunately, the data sources do not provide a direct method to remove downstream lineage using the SDK or API. The process involves managing the upstream relationships of dependent entities.
If you need to perform this operation, you might need to:
- Query the lineage graph to find all downstream entities (see the sketch after this list).
- For each downstream entity, update its upstream lineage to exclude the dataset.
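As a rough sketch of the first step, the graph client can list the entities that declare this dataset as an upstream. The get_related_entities call, the DownstreamOf relationship name, and the INCOMING direction are assumptions based on current SDK behavior, so verify them against your version; each returned URN can then have its upstream lineage rewritten as shown in the previous answer:
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig, RelationshipDirection
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Entities whose DownstreamOf edge points at this dataset are its downstreams,
# so we follow the relationship in the incoming direction
downstream_urns = [
    related.urn
    for related in graph.get_related_entities(
        entity_urn=dataset_urn,
        relationship_types=["DownstreamOf"],
        direction=RelationshipDirection.INCOMING,
    )
]
print(downstream_urns)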
Would you like more detailed information on how to query the lineage graph or manage upstream relationships?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/lineage/lineage-feature-guide|DataHub Lineage Feature Guide>
<@U06TM7M40RK> when ingesting redshift spectrum dataset and athena dataset, are lineage automatically created?
When ingesting datasets from Redshift Spectrum and Athena into DataHub, the automatic creation of lineage depends on the capabilities of the ingestion source and the configuration used.
Redshift Spectrum
For Redshift, DataHub supports automatic lineage extraction, but there are some limitations. According to the <https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/redshift/overview.md|Redshift Ingestion Guide>, table-level lineage can be extracted automatically. However, as per the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/redshift.py|Redshift source code>, the stl_scan_based lineage collector does not work with Spectrum/external tables because those scans do not show up in the stl_scan table. Therefore, lineage for Redshift Spectrum datasets may not be automatically created.
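Table-level lineage for Redshift is controlled by the source configuration. Here is a hedged, programmatic recipe sketch; the option names include_table_lineage and table_lineage_mode (and the sql_based mode) are assumptions to verify against your connector version, and the cluster endpoint is hypothetical:
from datahub.ingestion.run.pipeline import Pipeline
# Programmatic equivalent of a YAML ingestion recipe for Redshift
pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "my-cluster.example.redshift.amazonaws.com:5439",  # hypothetical endpoint
                "database": "dev",
                "username": "datahub_user",
                "password": "your_password",
                "include_table_lineage": True,
                # "sql_based" parses query text and may catch edges that the
                # stl_scan-based collector misses (e.g. Spectrum/external tables)
                "table_lineage_mode": "sql_based",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()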
Athena
For Athena, automatic lineage is not supported out of the box. According to a <Slack conversation>, Athena does not support lineage at the moment. However, if Athena is orchestrated with Airflow, you might be able to see lineage by using the v2 Airflow plugin, which can manually annotate dependencies or potentially parse SQL queries if the operator is supported.
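If you do go through Airflow, the v2 plugin lets you annotate lineage on tasks by hand. In this sketch the datahub_airflow_plugin.entities import path and the Dataset(platform, name) signature are assumptions based on the plugin’s documentation, and the Athena table names are made up:
# pip install acryl-datahub-airflow-plugin
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_airflow_plugin.entities import Dataset
with DAG(dag_id="athena_lineage_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Manually declare that this task reads one Athena table and writes another
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'run athena query here'",
        inlets=[Dataset("athena", "source_db.source_table")],
        outlets=[Dataset("athena", "target_db.target_table")],
    )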
In summary, automatic lineage creation for Redshift Spectrum is limited and may not work for external tables, while Athena does not support automatic lineage unless integrated with Airflow and configured appropriately.
Would you like more detailed information on configuring these ingestion sources or using Airflow for lineage?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/redshift/overview.md|Redshift Ingestion Guide>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/redshift.py|Redshift Source Code>
- <Slack Conversation on Athena Lineage>
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><