<@U06TM7M40RK> what is the difference between outputDatasetEdges and output_datasets on dataJobInputOutput aspect ?
The outputDatasetEdges and outputDatasets properties on the DataJobInputOutput aspect represent different ways of modeling the output datasets produced by a data job.
- outputDatasetEdges: This property is a list of EdgeClass objects, each representing an output dataset produced by the data job. EdgeClass carries additional metadata such as audit stamps, which record who created or last modified the relationship edge and when. This is the more detailed, structured representation and the recommended way to model output datasets in DataHub.
- outputDatasets: This property is a list of dataset URNs representing the output datasets. It is marked as deprecated; use outputDatasetEdges instead, as it provides a more comprehensive representation of the relationships between data jobs and datasets.
According to the <Models | DataHub documentation>, the outputDatasets property is deprecated, and users are encouraged to use outputDatasetEdges for a more robust representation.
Would you like more detailed information on how to implement or transition to using outputDatasetEdges?
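To make the difference concrete, here is a rough plain-Python sketch of the two shapes. It uses ordinary dicts rather than the DataHub SDK classes. The field names (destinationUrn, created, lastModified, time, actor) are assumed to mirror the edge and audit stamp models, and to_edges, the example URN, and the actor are hypothetical, made up for illustration.

```python
import time
from typing import Dict, List, Optional


def to_edges(output_datasets: List[str], actor: str, now_ms: Optional[int] = None) -> List[Dict]:
    """Convert a deprecated-style list of dataset URNs into edge-style dicts."""
    stamp = {
        "time": now_ms if now_ms is not None else int(time.time() * 1000),
        "actor": actor,
    }
    return [
        {"destinationUrn": urn, "created": dict(stamp), "lastModified": dict(stamp)}
        for urn in output_datasets
    ]


# Deprecated shape: bare dataset URNs, nothing about the relationship itself.
output_datasets = ["urn:li:dataset:(urn:li:dataPlatform:hive,output_dataset,PROD)"]

# Edge shape: the same URN plus audit stamps saying who recorded the edge and when.
output_dataset_edges = to_edges(
    output_datasets, actor="urn:li:corpuser:datahub", now_ms=1700000000000
)
print(output_dataset_edges[0])
```

The point of the sketch is only that an edge wraps the same dataset URN with extra metadata, which is what the deprecated list cannot carry.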
Sources:
- <Models | DataHub documentation>
Hint: Mention @runllm in the thread for followups.
Hi <@U06JN8TT06Q>, how did my answer work out for you? Need any more assistance? You can also tag <@U07FV3UHE4D> to let your fellow community members know you’re still looking for guidance.
<@U06TM7M40RK> how to update outputDatasetEdges ?
To update outputDatasetEdges in the DataJobInputOutput aspect from Java, you can use the DataJobInputOutputPatchBuilder class to construct and apply patches. This approach allows you to add or remove edges representing the relationships between a data job and its output datasets.
Here’s a step-by-step guide on how to update outputDatasetEdges:
- Initialize the Patch Builder: Create an instance of DataJobInputOutputPatchBuilder for the data job URN.
- Add or Remove Edges: Use methods like addOutputDatasetEdge or removeOutputDatasetEdge to modify the edges. These methods take a DatasetUrn for the dataset you want to add or remove.
- Build and Emit the Patch: Once you’ve added or removed the necessary edges, build the patch and emit it to DataHub.
Here’s a code snippet illustrating these steps:
import com.linkedin.common.urn.DatasetUrn;
import com.linkedin.common.urn.UrnUtils;
import com.linkedin.metadata.aspect.patch.builder.DataJobInputOutputPatchBuilder;
import com.linkedin.mxe.MetadataChangeProposal;
// Output dataset to attach to the job (example URN; createFromString throws URISyntaxException)
DatasetUrn outputDatasetUrn = DatasetUrn.createFromString("urn:li:dataset:(urn:li:dataPlatform:hive,output_dataset,PROD)");
// Build the patch: set the data job URN (placeholder value), then add the output dataset edge
MetadataChangeProposal patch = new DataJobInputOutputPatchBuilder()
    .urn(UrnUtils.getUrn("urn:li:dataJob:(urn:li:dataFlow:(airflow,exampleFlow,PROD),exampleJob)"))
    .addOutputDatasetEdge(outputDatasetUrn)
    .build();
// Emit the patch; `emitter` is assumed to be an already configured datahub.client.rest.RestEmitter
emitter.emit(patch);
This example demonstrates how to add an output dataset edge. You can similarly use removeOutputDatasetEdge to remove an edge.
Would you like more detailed information on any of these steps?
Sources:
- https://github.com/datahub-project/datahub/blob/master/entity-registry/src/main/java/com/linkedin/metadata/aspect/patch/builder/DataJobInputOutputPatchBuilder.java|DataJobInputOutputPatchBuilder.java
- https://github.com/datahub-project/datahub/blob/master/metadata-service/services/src/main/java/com/linkedin/metadata/service/LineageService.java|LineageService.java
<@U06TM7M40RK> how to update outputDatasetEdges via python sdk?
To update outputDatasetEdges using the Python SDK, you can use the DataJobPatchBuilder class to construct and apply patches. This involves adding or removing edges that represent the relationships between a data job and its output datasets.
Here’s a step-by-step guide on how to update outputDatasetEdges using Python:
- Initialize the Patch Builder: Create an instance of DataJobPatchBuilder with the data job URN.
- Add or Remove Edges: Use methods like add_output_dataset or remove_output_dataset to modify the edges. add_output_dataset accepts an EdgeClass object (or a dataset URN) for the dataset you want to add; remove_output_dataset takes the URN of the dataset you want to remove.
- Build and Emit the Patch: Once you’ve added or removed the necessary edges, build the patch and emit it using a graph client.
Here’s a Python code snippet illustrating these steps:
from datahub.emitter.mce_builder import make_data_job_urn, make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import EdgeClass
from datahub.specific.datajob import DataJobPatchBuilder
# Connect to DataHub (the server URL is a placeholder; point it at your own GMS endpoint)
graph_client = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))
# Initialize the data job URN
datajob_urn = make_data_job_urn(orchestrator="airflow", flow_id="exampleFlow", job_id="exampleJob")
# Initialize the patch builder with the data job URN
patch_builder = DataJobPatchBuilder(datajob_urn)
# Create an output dataset URN
output_dataset_urn = make_dataset_urn(platform="hive", name="output_dataset", env="PROD")
# Create an edge for the output dataset
output_edge = EdgeClass(destinationUrn=output_dataset_urn)
# Add the output dataset edge
patch_builder.add_output_dataset(output_edge)
# Build the patch; build() returns a list of metadata change proposals
patches = patch_builder.build()
# Emit each patch using the graph client
for patch_mcp in patches:
    graph_client.emit_mcp(patch_mcp)
This example demonstrates how to add an output dataset edge. You can similarly use remove_output_dataset to remove an edge.
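As a rough mental model of why a patch is convenient here: adding or removing one edge leaves the job's other output edges untouched, whereas rewriting the whole aspect would replace the full list. The sketch below is a toy, plain-Python model of that behaviour, keyed by destination URN. apply_edge_patch and its (op, urn) tuples are hypothetical and are not part of the DataHub SDK or its patch wire format.

```python
from typing import Dict, Iterable, Tuple

Edge = Dict[str, str]


def apply_edge_patch(edges: Dict[str, Edge], ops: Iterable[Tuple[str, str]]) -> Dict[str, Edge]:
    """Return a new edge map with ("add" | "remove", dataset_urn) operations applied."""
    result = dict(edges)  # copy so the caller's map is left unchanged
    for op, urn in ops:
        if op == "add":
            result[urn] = {"destinationUrn": urn}
        elif op == "remove":
            result.pop(urn, None)  # removing a missing edge is a no-op in this toy model
        else:
            raise ValueError(f"unsupported op: {op}")
    return result


old_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,old_output,PROD)"
new_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,output_dataset,PROD)"

existing = {old_urn: {"destinationUrn": old_urn}}
patched = apply_edge_patch(existing, [("add", new_urn)])
print(sorted(patched))  # both the old and the new edge are present
```

In this model, removing an edge only needs the dataset URN, since the URN is what identifies the edge.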
Would you like more detailed information on any of these steps?
Sources:
- https://github.com/datahub-project/datahub/blob/master/smoke-test/tests/patch/test_datajob_patches.py|test_datajob_patches.py
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><