<@U06TM7M40RK> how to undelete a soft-deleted entity via python sdk or graph client?
To undelete a soft-deleted entity in DataHub, you can update the “Status” aspect of the entity to set the removed field to false. This can be done using the DataHub Python SDK or the GraphQL API.
Using the Python SDK
You can use the DatahubRestEmitter from the Python SDK to update the status aspect of the entity. Here’s an example:
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.specific.common import StatusPatchBuilder
# Initialize the emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
# Create a StatusPatchBuilder instance
status_patch = StatusPatchBuilder(urn="urn:li:your-entity-type:your-entity-id")
# Set the removed field to false
status_patch.set_removed(False)
# Emit the changes
emitter.emit_mcp(status_patch.build())
Using the GraphQL API
If you prefer using the GraphQL API, you can construct a mutation to update the status aspect. Here’s a general approach:
- Collect the URN of the entity you want to undelete.
- Use a GraphQL mutation to update the removed field in the Status aspect to false.
Unfortunately, the exact GraphQL mutation syntax is not provided in the data sources, but you can refer to the <Rest.li API | DataHub API documentation> for more details on how to construct such a request.
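If you specifically want the Python graph client rather than a raw GraphQL call, a minimal sketch using DataHubGraph might look like this (it assumes a GMS reachable at http://localhost:8080 and uses a dataset URN as the example entity; the same StatusClass aspect drives both soft delete and undelete):
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import StatusClass
# Connect to the DataHub GMS over REST (adjust server/token for your deployment)
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
# Setting removed=False on the status aspect re-activates ("undeletes") the entity
entity_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
graph.emit(
    MetadataChangeProposalWrapper(
        entityUrn=entity_urn,
        aspect=StatusClass(removed=False),
    )
)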
Would you like more detailed information on any of these steps?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/how/delete-metadata.md|DataHub Delete Metadata Documentation>
- <Rest.li API | DataHub Deleting an Entity>
<@U06TM7M40RK> which package to install datahub.specific.common
I’m not able to find any data sources to answer your question. You might want to specify the context or version of DataHub you are using. If this doesn’t help, please reach out to the DataHub support team for further assistance.
<@U06TM7M40RK> how do you update entity dataset status via python sdk ?
To update the status of a dataset entity using the DataHub Python SDK, you can use the MetadataChangeProposalWrapper to modify the “Status” aspect of the dataset. Here’s a step-by-step guide on how to do this:
- Install the DataHub Python SDK: Ensure you have the DataHub Python SDK installed. You can install it using pip if you haven’t already:
pip install acryl-datahub
- Use the Python SDK to Update the Status: You can use the DatahubRestEmitter to send a metadata change proposal that updates the status aspect of the dataset.
Here’s an example code snippet to update the status of a dataset:
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import StatusClass, ChangeTypeClass
# Initialize the emitter
emitter = DatahubRestEmitter(gms_server="<http://localhost:8080>")
# Define the URN of the dataset you want to update
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Create a MetadataChangeProposalWrapper to update the status
mcp = MetadataChangeProposalWrapper(
entityType="dataset",
entityUrn=dataset_urn,
aspectName="status",
aspect=StatusClass(removed=False), # Set to True to soft delete, False to undelete
changeType=ChangeTypeClass.UPSERT,
)
# Emit the MCP
emitter.emit(mcp)
This code snippet sets the removed field to False, which means the dataset will be marked as active. If you want to soft delete the dataset, you can set removed to True.
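As a quick sanity check, you could read the aspect back with the graph client. This is a hedged sketch that assumes your SDK version exposes DataHubGraph.get_aspect with these parameter names:
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import StatusClass
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Read back the status aspect; expect removed=False after the update above
status = graph.get_aspect(entity_urn=dataset_urn, aspect_type=StatusClass)
print(status)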
Would you like more detailed information on any of these steps?
Sources:
- <metadata-service | DataHub Deleting an Entity>
<@U06TM7M40RK> how to remove upstream lineage of dataset entity via sdk?
To remove upstream lineage from a dataset entity using the DataHub Python SDK, you can use the MetadataChangeProposalWrapper to modify the “UpstreamLineage” aspect of the dataset. Here’s how you can do it:
- Install the DataHub Python SDK: Ensure you have the DataHub Python SDK installed. You can install it using pip if you haven’t already:
pip install acryl-datahub
- Use the Python SDK to Update the UpstreamLineage: You can use the DatahubRestEmitter to send a metadata change proposal that updates the upstream lineage aspect of the dataset.
Here’s an example code snippet to remove upstream lineage:
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import UpstreamLineageClass, ChangeTypeClass
# Initialize the emitter
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
# Define the URN of the dataset you want to update
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Create an empty UpstreamLineageClass to remove all upstreams
upstream_lineage = UpstreamLineageClass(upstreams=[])
# Create a MetadataChangeProposalWrapper to update the upstream lineage
mcp = MetadataChangeProposalWrapper(
entityType="dataset",
entityUrn=dataset_urn,
aspectName="upstreamLineage",
aspect=upstream_lineage,
changeType=ChangeTypeClass.UPSERT,
)
# Emit the MCP
emitter.emit(mcp)
This code snippet sets the upstreams field to an empty list, effectively removing all upstream lineage connections for the specified dataset.
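If you only want to detach a single upstream rather than clearing all of them, a read-modify-write sketch along the same lines could work (again assuming DataHubGraph.get_aspect is available in your SDK version; the upstream URN here is hypothetical):
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import UpstreamLineageClass
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Hypothetical URN of the one upstream you want to detach
upstream_to_remove = "urn:li:dataset:(urn:li:dataPlatform:hive,some_upstream_table,PROD)"
# Read the current upstream lineage, drop the unwanted edge, and write the aspect back
current = graph.get_aspect(entity_urn=dataset_urn, aspect_type=UpstreamLineageClass)
if current is not None:
    current.upstreams = [u for u in current.upstreams if u.dataset != upstream_to_remove]
    graph.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=current))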
Would you like more detailed information on any of these steps?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/advanced/patch.md|Remove Upstream Lineage for Dataset>
<@U06TM7M40RK> how about removing downstream dataset lineage?
To remove downstream lineage for a dataset entity in DataHub, you need to understand that downstream lineage is typically inferred from upstream relationships. Therefore, removing downstream lineage directly is not a straightforward operation like removing upstream lineage. However, you can manage downstream lineage by adjusting the upstream relationships of the entities that depend on the dataset.
Here’s a general approach using the DataHub Python SDK:
- Identify the Downstream Entities: First, identify which entities have the dataset as an upstream dependency. This can be done by querying the lineage graph to find all entities that list the dataset as an upstream.
- Remove the Upstream Relationship: For each of these downstream entities, you would need to update their upstream lineage to remove the reference to the dataset in question.
Unfortunately, the data sources do not provide a direct method to remove downstream lineage using the SDK or API. The process involves managing the upstream relationships of dependent entities.
If you need to perform this operation, you might need to:
- Query the lineage graph to find all downstream entities (see the sketch after this list).
- For each downstream entity, update its upstream lineage to exclude the dataset.
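As a rough sketch of the first step, the graph client can list the entities that declare this dataset as an upstream. The get_related_entities call, the DownstreamOf relationship name, and the INCOMING direction are assumptions based on current SDK behavior, so verify them against your version; each returned URN can then have its upstream lineage rewritten as shown in the previous answer:
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig, RelationshipDirection
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your_dataset_name,PROD)"
# Entities whose DownstreamOf edge points at this dataset are its downstreams,
# so we follow the relationship in the incoming direction
downstream_urns = [
    related.urn
    for related in graph.get_related_entities(
        entity_urn=dataset_urn,
        relationship_types=["DownstreamOf"],
        direction=RelationshipDirection.INCOMING,
    )
]
print(downstream_urns)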
Would you like more detailed information on how to query the lineage graph or manage upstream relationships?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/lineage/lineage-feature-guide|DataHub Lineage Feature Guide>
<@U06TM7M40RK> when ingesting redshift spectrum dataset and athena dataset, are lineage automatically created?
When ingesting datasets from Redshift Spectrum and Athena into DataHub, the automatic creation of lineage depends on the capabilities of the ingestion source and the configuration used.
Redshift Spectrum
For Redshift, DataHub supports automatic lineage extraction, but there are some limitations. According to the <https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/redshift/overview.md|Redshift Ingestion Guide>, table-level lineage can be extracted automatically. However, as per the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/redshift.py|Redshift source code>, the stl_scan_based lineage collector does not work with Spectrum/external tables because those scans do not show up in the stl_scan table. Therefore, lineage for Redshift Spectrum datasets may not be automatically created.
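Table-level lineage for Redshift is controlled by the source configuration. Here is a hedged, programmatic recipe sketch; the option names include_table_lineage and table_lineage_mode (and the sql_based mode) are assumptions to verify against your connector version, and the cluster endpoint is hypothetical:
from datahub.ingestion.run.pipeline import Pipeline
# Programmatic equivalent of a YAML ingestion recipe for Redshift
pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "my-cluster.example.redshift.amazonaws.com:5439",  # hypothetical endpoint
                "database": "dev",
                "username": "datahub_user",
                "password": "your_password",
                "include_table_lineage": True,
                # "sql_based" parses query text and may catch edges that the
                # stl_scan-based collector misses (e.g. Spectrum/external tables)
                "table_lineage_mode": "sql_based",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()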
Athena
For Athena, automatic lineage is not supported out of the box. According to a <Slack conversation>, Athena does not support lineage at the moment. However, if Athena is orchestrated with Airflow, you might be able to see lineage by using the v2 Airflow plugin, which can manually annotate dependencies or potentially parse SQL queries if the operator is supported.
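If you do go through Airflow, the v2 plugin lets you annotate lineage on tasks by hand. In this sketch the datahub_airflow_plugin.entities import path and the Dataset(platform, name) signature are assumptions based on the plugin’s documentation, and the Athena table names are made up:
# pip install acryl-datahub-airflow-plugin
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from datahub_airflow_plugin.entities import Dataset
with DAG(dag_id="athena_lineage_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    # Manually declare that this task reads one Athena table and writes another
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'run athena query here'",
        inlets=[Dataset("athena", "source_db.source_table")],
        outlets=[Dataset("athena", "target_db.target_table")],
    )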
In summary, automatic lineage creation for Redshift Spectrum is limited and may not work for external tables, while Athena does not support automatic lineage unless integrated with Airflow and configured appropriately.
Would you like more detailed information on configuring these ingestion sources or using Airflow for lineage?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/redshift/overview.md|Redshift Ingestion Guide>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/redshift/redshift.py|Redshift Source Code>
- <Slack Conversation on Athena Lineage>
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><