Difference between outputDatasetEdges and output_datasets on DataJobInputOutput Aspect

user-3 · January 13, 2025, 12:02am

<@U06TM7M40RK> What is the difference between outputDatasetEdges and output_datasets on the DataJobInputOutput aspect?

user-1 · January 13, 2025, 12:02am

The outputDatasetEdges and outputDatasets properties on the DataJobInputOutput aspect represent different ways of modeling the output datasets produced by a data job.

outputDatasetEdges: A list of EdgeClass objects, one per output dataset produced by the data job. Each edge carries the destination dataset URN plus optional metadata such as audit stamps, which record who created or last modified the relationship and when. This is the recommended way to model output datasets in DataHub.
outputDatasets: A list of dataset URNs for the output datasets. This property is deprecated in favor of outputDatasetEdges, which can carry the extra relationship metadata described above.

According to the DataHub Models documentation, outputDatasets is deprecated and users are encouraged to use outputDatasetEdges instead.
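
To make the difference concrete, here is a minimal sketch of the two shapes as plain Python dictionaries. It does not use the DataHub SDK: the dictionaries only mirror the aspect's field names (outputDatasets, outputDatasetEdges, destinationUrn, created, lastModified), and the helper to_output_edges, the actor URN, and the timestamp are hypothetical values chosen for illustration.

```python
import json
import time
from typing import Dict, List, Optional


def to_output_edges(
    output_datasets: List[str],
    actor: Optional[str] = None,
    timestamp_ms: Optional[int] = None,
) -> List[Dict]:
    """Turn a legacy list of dataset URN strings into edge-shaped dicts.

    Hypothetical helper for illustration only; not part of the DataHub SDK.
    """
    edges = []
    for urn in output_datasets:
        edge: Dict = {"destinationUrn": urn}
        if actor is not None:
            # An audit stamp records when and by whom the edge was written.
            stamp = {
                "time": timestamp_ms if timestamp_ms is not None else int(time.time() * 1000),
                "actor": actor,
            }
            edge["created"] = stamp
            edge["lastModified"] = dict(stamp)
        edges.append(edge)
    return edges


# Legacy shape: output datasets are bare URN strings.
legacy = {
    "inputDatasets": [],
    "outputDatasets": ["urn:li:dataset:(urn:li:dataPlatform:hive,output_dataset,PROD)"],
}

# Edge-based shape: each output dataset becomes an edge that can carry metadata.
edge_based = {
    "inputDatasets": [],
    "outputDatasets": [],
    "outputDatasetEdges": to_output_edges(
        legacy["outputDatasets"],
        actor="urn:li:corpuser:datahub",  # example actor, an assumption
        timestamp_ms=1736726520000,  # example timestamp, an assumption
    ),
}

print(json.dumps(edge_based, indent=2))
```

The edge form carries the same dataset URN as the legacy list, with room for the audit stamps the plain string list cannot hold.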

Would you like more detailed information on how to implement or transition to using outputDatasetEdges?

Sources:

Models | DataHub documentation

user-3 · January 13, 2025, 12:02am

<@U06TM7M40RK> How do I update outputDatasetEdges?

user-1 · January 13, 2025, 12:02am

To update outputDatasetEdges in the DataJobInputOutput aspect from Java, you can use the DataJobInputOutputPatchBuilder class to construct and apply patches. This approach lets you add or remove individual edges between a data job and its output datasets.

Here’s a step-by-step guide on how to update outputDatasetEdges:

1. Initialize the Patch Builder: Create an instance of DataJobInputOutputPatchBuilder and set the data job URN on it.
2. Add or Remove Edges: Use methods like addOutputDatasetEdge or removeOutputDatasetEdge to modify the edges. These methods take the DatasetUrn of the dataset you want to add or remove.
3. Build and Emit the Patch: Once you’ve added or removed the necessary edges, build the patch and emit it using a graph client.

Here’s a code snippet illustrating these steps:

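import com.linkedin.common.urn.DataJobUrn;
import com.linkedin.common.urn.DatasetUrn;
import com.linkedin.metadata.aspect.patch.builder.DataJobInputOutputPatchBuilder;
import com.linkedin.mxe.MetadataChangeProposal;

// URN of the data job to patch (example value; createFromString throws URISyntaxException on a malformed URN)
DataJobUrn dataJobUrn = DataJobUrn.createFromString("urn:li:dataJob:(urn:li:dataFlow:(airflow,exampleFlow,prod),exampleJob)");

// Initialize the patch builder and set the data job URN
DataJobInputOutputPatchBuilder patchBuilder = new DataJobInputOutputPatchBuilder().urn(dataJobUrn);

// Add an output dataset edge
DatasetUrn outputDatasetUrn = DatasetUrn.createFromString("urn:li:dataset:(urn:li:dataPlatform:hive,output_dataset,PROD)");
patchBuilder.addOutputDatasetEdge(outputDatasetUrn);

// Build the patch
MetadataChangeProposal patch = patchBuilder.build();

// Emit the patch. graphClient is a placeholder for whichever client your application uses to send MetadataChangeProposals
graphClient.emitMcp(patch);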

This example demonstrates how to add an output dataset edge. You can similarly use removeOutputDatasetEdge to remove an edge.

Would you like more detailed information on any of these steps?

Sources:

DataJobInputOutputPatchBuilder.java: https://github.com/datahub-project/datahub/blob/master/entity-registry/src/main/java/com/linkedin/metadata/aspect/patch/builder/DataJobInputOutputPatchBuilder.java
LineageService.java: https://github.com/datahub-project/datahub/blob/master/metadata-service/services/src/main/java/com/linkedin/metadata/service/LineageService.java

user-3 · January 13, 2025, 12:02am

<@U06TM7M40RK> How do I update outputDatasetEdges via the Python SDK?

user-1 · January 13, 2025, 12:02am

To update outputDatasetEdges using the Python SDK, you can utilize the DataJobPatchBuilder class to construct and apply patches. This involves creating or modifying edges that represent the relationships between a data job and its output datasets.

Here’s a step-by-step guide on how to update outputDatasetEdges using Python:

1. Initialize the Patch Builder: Create an instance of DataJobPatchBuilder with the data job URN.
2. Add or Remove Edges: Use add_output_dataset or remove_output_dataset to modify the edges. add_output_dataset accepts an EdgeClass object (or a dataset URN) for the dataset you want to add; remove_output_dataset takes the URN of the dataset you want to remove.
3. Build and Emit the Patch: Once you’ve added or removed the necessary edges, build the patch and emit it using a graph client.

Here’s a Python code snippet illustrating these steps:

from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import EdgeClass
from datahub.specific.datajob import DataJobPatchBuilder

# Connect to DataHub (the server URL is an example; use your own GMS endpoint)
graph_client = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Initialize the data job URN
datajob_urn = make_data_job_urn(orchestrator="airflow", flow_id="exampleFlow", job_id="exampleJob")

# Initialize the patch builder with the data job URN
patch_builder = DataJobPatchBuilder(datajob_urn)

# Create an output dataset URN
output_dataset_urn = make_dataset_urn(platform="hive", name="output_dataset", env="PROD")

# Create an edge for the output dataset
output_edge = EdgeClass(destinationUrn=output_dataset_urn)

# Add the output dataset edge
patch_builder.add_output_dataset(output_edge)

# Build the patch (returns a list of metadata change proposals)
patches = patch_builder.build()

# Emit each patch using the graph client
for patch_mcp in patches:
    graph_client.emit_mcp(patch_mcp)

This example demonstrates how to add an output dataset edge. To remove an edge instead, call patch_builder.remove_output_dataset(output_dataset_urn) with the dataset URN before building the patch.

Would you like more detailed information on any of these steps?

Sources:

test_datajob_patches.py: https://github.com/datahub-project/datahub/blob/master/smoke-test/tests/patch/test_datajob_patches.py
