Creating a Dataset URN with the DataHub Java SDK and emitting it to DataHub GMS

Original Slack Thread

In the Java SDK for DataHub, how do I programmatically create a Dataset URN and emit it to DataHub GMS?


To programmatically create a Dataset URN and emit it to DataHub GMS using the Java SDK, you can follow these steps:

  1. Create a Dataset URN: Use the UrnUtils.toDatasetUrn method to create a Dataset URN. This method requires the data platform, dataset name, and environment as parameters.

  2. Prepare Metadata: Construct the metadata for the dataset using SchemaMetadata and SchemaField classes. This includes setting the schema name, platform, version, and fields.

  3. Create a MetadataChangeProposalWrapper: Use the MetadataChangeProposalWrapper to wrap the metadata change proposal. This involves specifying the entity type, entity URN, and the aspect (metadata) to be upserted.

  4. Emit to DataHub GMS: Use the RestEmitter to send the metadata change proposal to the DataHub GMS server.
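
For orientation, the URN created in step 1 has a simple string form. The following is a minimal sketch in plain Java, with no DataHub dependency, that assembles that string by hand from the same three parameters. The class name DatasetUrnFormat and the method datasetUrn are hypothetical and are not part of the SDK; in real code, UrnUtils.toDatasetUrn builds the typed DatasetUrn for you.

```java
/** Illustrative only: shows the string form of a DataHub dataset URN. */
final class DatasetUrnFormat {

  private DatasetUrnFormat() {}

  // Hypothetical helper, not an SDK method: platform, dataset name, and
  // environment are placed into the standard dataset URN layout.
  static String datasetUrn(String platform, String name, String env) {
    return String.format("urn:li:dataset:(urn:li:dataPlatform:%s,%s,%s)", platform, name, env);
  }

  public static void main(String[] args) {
    // prints urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)
    System.out.println(datasetUrn("hive", "fct_users_deleted", "PROD"));
  }
}
```

Seeing the string form helps when searching for the dataset in the DataHub UI or when checking what was actually written.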

Here is a code example based on the provided data sources:

import com.linkedin.common.AuditStamp;
import com.linkedin.common.urn.CorpuserUrn;
import com.linkedin.common.urn.DataPlatformUrn;
import com.linkedin.common.urn.DatasetUrn;
import com.linkedin.common.urn.UrnUtils;
import com.linkedin.schema.OtherSchema;
import com.linkedin.schema.SchemaField;
import com.linkedin.schema.SchemaFieldArray;
import com.linkedin.schema.SchemaFieldDataType;
import com.linkedin.schema.SchemaMetadata;
import com.linkedin.schema.StringType;
import datahub.client.MetadataWriteResponse;
import datahub.client.rest.RestEmitter;
import datahub.event.MetadataChangeProposalWrapper;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class DatasetAdd {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // Create Dataset URN from platform, dataset name, and environment
    DatasetUrn datasetUrn = UrnUtils.toDatasetUrn("hive", "fct_users_deleted", "PROD");
    CorpuserUrn userUrn = new CorpuserUrn("ingestion");
    AuditStamp lastModified = new AuditStamp().setTime(1640692800000L).setActor(userUrn);

    // Prepare Schema Metadata
    // The schema name, version, hash, and native data type below are placeholder values.
    SchemaMetadata schemaMetadata = new SchemaMetadata()
        .setSchemaName("customer")
        .setPlatform(new DataPlatformUrn("hive"))
        .setVersion(0L)
        .setHash("")
        .setPlatformSchema(SchemaMetadata.PlatformSchema.create(new OtherSchema().setRawSchema("__insert raw schema here__")))
        .setLastModified(lastModified);

    SchemaFieldArray fields = new SchemaFieldArray();
    fields.add(new SchemaField().setFieldPath("address.zipcode")
        .setType(new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())))
        .setNativeDataType("VARCHAR(50)")
        .setDescription("This is the zipcode of the address.")
        .setLastModified(lastModified));
    schemaMetadata.setFields(fields);

    // Create MetadataChangeProposalWrapper: entity type, entity URN, and the aspect to upsert
    MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
        .entityType("dataset")
        .entityUrn(datasetUrn)
        .upsert()
        .aspect(schemaMetadata)
        .build();

    // Emit to DataHub GMS
    String token = ""; // Add your token here
    RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080").token(token));
    Future<MetadataWriteResponse> response = emitter.emit(mcpw, null);

    // emit() is asynchronous: block until GMS answers, then release the HTTP client
    System.out.println(response.get().getResponseContent());
    emitter.close();
  }
}

This example demonstrates how to create a dataset URN, prepare the schema metadata, wrap it in a metadata change proposal, and emit it to the DataHub GMS using the Java SDK.
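
Because emitter.emit returns a Future, the write is not confirmed until the future is resolved. The following is a rough sketch, using only the Java standard library, of one way to wait with a timeout rather than blocking indefinitely. The class AwaitSketch and its await method are hypothetical helpers, and a CompletableFuture stands in for the future returned by the emitter; none of this is SDK code.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only: bounded wait on a Future such as the one emit() returns. */
final class AwaitSketch {

  private AwaitSketch() {}

  // Returns the result if it arrives within the timeout; otherwise cancels
  // the pending work and returns an empty Optional.
  static <T> Optional<T> await(Future<T> pending, long timeout, TimeUnit unit)
      throws InterruptedException, ExecutionException {
    try {
      return Optional.ofNullable(pending.get(timeout, unit));
    } catch (TimeoutException e) {
      pending.cancel(true);
      return Optional.empty();
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for a response that has already arrived.
    Future<String> done = CompletableFuture.completedFuture("ok");
    System.out.println(await(done, 1, TimeUnit.SECONDS)); // prints Optional[ok]

    // Stand-in for a response that never arrives.
    Future<String> stuck = new CompletableFuture<>();
    System.out.println(await(stuck, 50, TimeUnit.MILLISECONDS)); // prints Optional.empty
  }
}
```

With the real SDK, the same helper could be passed the Future<MetadataWriteResponse> from emit, and the returned response inspected before deciding whether to retry.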

Would you like more details on any specific part of this process?

