Difference between outputDatasetEdges and outputDatasets on the DataJobInputOutput Aspect

Original Slack Thread

<@U06TM7M40RK> what is the difference between outputDatasetEdges and output_datasets on dataJobInputOutput aspect ?

The outputDatasetEdges and outputDatasets properties on the DataJobInputOutput aspect represent different ways of modeling the output datasets produced by a data job.

  1. outputDatasetEdges: This property is a list of Edge objects (EdgeClass in the Python SDK). Each edge points to an output dataset through its destinationUrn and can carry additional metadata such as audit stamps, which record who created or last modified the relationship and when. This is the recommended way to model output datasets in DataHub.

  2. outputDatasets: This property is a plain list of dataset URNs with no per-relationship metadata. It is marked as deprecated; use outputDatasetEdges instead, since it can represent everything outputDatasets can plus the edge metadata.

According to the <Models | DataHub documentation>, the outputDatasets property is deprecated, and users are encouraged to use outputDatasetEdges for a more robust representation.
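To make the difference concrete, here is a minimal sketch that converts a legacy outputDatasets list into edge-shaped entries. It uses plain dictionaries rather than SDK classes; the helper name urns_to_edges and the default actor are hypothetical, and the field names (destinationUrn, created, lastModified, time, actor) are assumed to follow the Edge and AuditStamp models.

```python
import time


def urns_to_edges(dataset_urns, actor="urn:li:corpuser:datahub"):
    """Convert a legacy list of dataset URN strings into edge-shaped dicts.

    Hypothetical helper for illustration; not part of the DataHub SDK.
    """
    # One audit stamp (milliseconds since epoch plus actor) reused for every edge.
    stamp = {"time": int(time.time() * 1000), "actor": actor}
    return [
        {"destinationUrn": urn, "created": dict(stamp), "lastModified": dict(stamp)}
        for urn in dataset_urns
    ]


# Deprecated shape: a bare list of URNs.
legacy_output_datasets = [
    "urn:li:dataset:(urn:li:dataPlatform:hive,output_dataset,PROD)",
]

# Edge shape: one entry per dataset, with room for audit metadata.
output_dataset_edges = urns_to_edges(legacy_output_datasets)
```

The edge form carries the same URN as the legacy list, so nothing is lost in the transition; it only adds metadata about the relationship.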

Would you like more detailed information on how to implement or transition to using outputDatasetEdges?

Sources:

  • <Models | DataHub documentation>


<@U06TM7M40RK> how to update outputDatasetEdges ?

To update outputDatasetEdges in the DataJobInputOutput aspect from Java, you can use the DataJobInputOutputPatchBuilder class to construct and apply patches. A patch adds or removes individual edges between a data job and its output datasets without rewriting the rest of the aspect.

Here’s a step-by-step guide on how to update outputDatasetEdges:

  1. Initialize the Patch Builder: Create an instance of DataJobInputOutputPatchBuilder and give it the data job URN.

  2. Add or Remove Edges: Use methods like addOutputDatasetEdge or removeOutputDatasetEdge to modify the edges. These methods require a DatasetUrn for the dataset you want to add or remove.

  3. Build and Emit the Patch: Once you’ve added or removed the necessary edges, build the patch and emit it to DataHub with an emitter or client.

Here’s a code snippet illustrating these steps:

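import com.linkedin.common.urn.DatasetUrn;
import com.linkedin.common.urn.UrnUtils;
import com.linkedin.metadata.aspect.patch.builder.DataJobInputOutputPatchBuilder;
import com.linkedin.mxe.MetadataChangeProposal;
import datahub.client.rest.RestEmitter;

// Output dataset to link to the data job
DatasetUrn outputDatasetUrn = DatasetUrn.createFromString("urn:li:dataset:(urn:li:dataPlatform:hive,output_dataset,PROD)");

// Build the patch: set the data job URN, then add the output dataset edge.
// This assumes the builder's fluent urn(...) setter; check the signature in your SDK version.
MetadataChangeProposal patch = new DataJobInputOutputPatchBuilder()
    .urn(UrnUtils.getUrn("urn:li:dataJob:(urn:li:dataFlow:(airflow,exampleFlow,PROD),exampleJob)"))
    .addOutputDatasetEdge(outputDatasetUrn)
    .build();

// Emit the patch; the server address is a placeholder for your DataHub GMS endpoint
RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080"));
emitter.emit(patch);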

This example demonstrates how to add an output dataset edge. You can similarly use removeOutputDatasetEdge to remove an edge.
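For intuition about why a patch is used here, the following sketch simulates add and remove operations on a list of edges in plain Python. It is not DataHub code: apply_edge_patch is a hypothetical helper, and it assumes that edges are identified by their destinationUrn, so a patch touches one edge and leaves the others alone.

```python
def apply_edge_patch(edges, op, destination_urn):
    """Return a new edge list with one edge added or removed.

    Hypothetical simulation of patch semantics; edges are keyed by destinationUrn.
    """
    by_urn = {edge["destinationUrn"]: edge for edge in edges}
    if op == "add":
        # Adding an edge that already exists replaces it instead of duplicating it.
        by_urn[destination_urn] = {"destinationUrn": destination_urn}
    elif op == "remove":
        # Removing a missing edge is treated as a no-op in this sketch.
        by_urn.pop(destination_urn, None)
    else:
        raise ValueError(f"unsupported op: {op}")
    return list(by_urn.values())


table_a = "urn:li:dataset:(urn:li:dataPlatform:hive,table_a,PROD)"
table_b = "urn:li:dataset:(urn:li:dataPlatform:hive,table_b,PROD)"

existing = [{"destinationUrn": table_a}]
after_add = apply_edge_patch(existing, "add", table_b)
after_remove = apply_edge_patch(after_add, "remove", table_a)
```

Replacing the whole aspect would instead require reading the current edges first and writing all of them back.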

Would you like more detailed information on any of these steps?

<@U06TM7M40RK> how to update outputDatasetEdges via python sdk?

To update outputDatasetEdges using the Python SDK, you can utilize the DataJobPatchBuilder class to construct and apply patches. This involves creating or modifying edges that represent the relationships between a data job and its output datasets.

Here’s a step-by-step guide on how to update outputDatasetEdges using Python:

  1. Initialize the Patch Builder: Create an instance of DataJobPatchBuilder with the data job URN.

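  2. Add or Remove Edges: Use methods like add_output_dataset or remove_output_dataset to modify the edges. add_output_dataset accepts an EdgeClass object for the dataset (recent SDK versions also accept a plain dataset URN), while remove_output_dataset takes the URN of the dataset to remove.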

  3. Build and Emit the Patch: Once you’ve added or removed the necessary edges, build the patch and emit it using a graph client.

Here’s a Python code snippet illustrating these steps:

from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import EdgeClass
from datahub.specific.datajob import DataJobPatchBuilder

# Connect to DataHub; the server address is a placeholder for your GMS endpoint
graph_client = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Initialize the data job URN
datajob_urn = make_data_job_urn(orchestrator="airflow", flow_id="exampleFlow", job_id="exampleJob")

# Initialize the patch builder with the data job URN
patch_builder = DataJobPatchBuilder(datajob_urn)

# Create an output dataset URN
output_dataset_urn = make_dataset_urn(platform="hive", name="output_dataset", env="PROD")

# Create an edge for the output dataset
output_edge = EdgeClass(destinationUrn=output_dataset_urn)

# Add the output dataset edge
patch_builder.add_output_dataset(output_edge)

# Build the patch; build() returns one or more metadata change proposals
patches = patch_builder.build()

# Emit each patch proposal using the graph client
for patch_mcp in patches:
    graph_client.emit_mcp(patch_mcp)

This example demonstrates how to add an output dataset edge. You can similarly use remove_output_dataset to remove an edge.
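It can also help to picture what the builder sends. The sketch below builds JSON Patch style operations for outputDatasetEdges with the standard library only. The path layout (/outputDatasetEdges/ followed by the escaped dataset URN) is an assumption about how the patch is keyed, and escape_pointer_token and output_edge_op are hypothetical helpers, not SDK functions.

```python
import json


def escape_pointer_token(value):
    """Escape a string for use as one JSON Pointer segment (RFC 6901)."""
    # "~" must be escaped before "/" so the "~1" sequences are not escaped again.
    return value.replace("~", "~0").replace("/", "~1")


def output_edge_op(op, dataset_urn):
    """Build one patch operation for an output dataset edge (illustrative only)."""
    operation = {
        "op": op,
        "path": "/outputDatasetEdges/" + escape_pointer_token(dataset_urn),
    }
    if op == "add":
        # Only "add" carries a value; "remove" identifies the edge by path alone.
        operation["value"] = {"destinationUrn": dataset_urn}
    return operation


# A dataset name containing "/" shows why the URN has to be escaped in the path.
urn = "urn:li:dataset:(urn:li:dataPlatform:s3,bucket/path/table,PROD)"
add_op = output_edge_op("add", urn)
remove_op = output_edge_op("remove", urn)

print(json.dumps([add_op, remove_op], indent=2))
```

Because each operation addresses a single edge by its dataset URN, add and remove calls can be combined on one builder before calling build().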

Would you like more detailed information on any of these steps?
