Creating a Dataset URN with the DataHub Java SDK and emitting it to DataHub GMS

Original Slack Thread

In the Java SDK for DataHub, how do I programmatically create a Dataset URN and emit it to DataHub GMS?


To programmatically create a Dataset URN and emit it to DataHub GMS using the Java SDK, you can follow these steps:

  1. Create a Dataset URN: Use the UrnUtils.toDatasetUrn method to create a Dataset URN. This method requires the data platform, dataset name, and environment as parameters.

  2. Prepare Metadata: Construct the metadata for the dataset using SchemaMetadata and SchemaField classes. This includes setting the schema name, platform, version, and fields.

  3. Create a MetadataChangeProposalWrapper: Use the MetadataChangeProposalWrapper to wrap the metadata change proposal. This involves specifying the entity type, entity URN, and the aspect (metadata) to be upserted.

  4. Emit to DataHub GMS: Use the RestEmitter to send the metadata change proposal to the DataHub GMS server.
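To make step 1 more concrete: a dataset URN serializes to a string of the form urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>). The sketch below builds that string by hand using only the standard library, purely to show what the three parameters turn into. The class DatasetUrnFormat and its datasetUrn method are hypothetical names for illustration and are not part of the DataHub SDK; in real code, UrnUtils.toDatasetUrn should be preferred.

```java
// Illustration only: builds the string form of a dataset URN by hand.
// DatasetUrnFormat is a hypothetical helper, not part of the DataHub SDK.
final class DatasetUrnFormat {

  private DatasetUrnFormat() {}

  // Combines platform, dataset name, and environment into the URN string.
  static String datasetUrn(String platform, String name, String env) {
    String platformUrn = "urn:li:dataPlatform:" + platform;
    return "urn:li:dataset:(" + platformUrn + "," + name + "," + env + ")";
  }

  public static void main(String[] args) {
    // Prints: urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)
    System.out.println(datasetUrn("hive", "fct_users_deleted", "PROD"));
  }
}
```

Knowing the string form can help when searching for the entity in the DataHub UI or checking what was actually written.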

Here is a complete code example:

import com.linkedin.common.AuditStamp;
import com.linkedin.common.urn.CorpuserUrn;
import com.linkedin.common.urn.DataPlatformUrn;
import com.linkedin.common.urn.DatasetUrn;
import com.linkedin.common.urn.UrnUtils;
import com.linkedin.schema.DateType;
import com.linkedin.schema.OtherSchema;
import com.linkedin.schema.SchemaField;
import com.linkedin.schema.SchemaFieldArray;
import com.linkedin.schema.SchemaFieldDataType;
import com.linkedin.schema.SchemaMetadata;
import com.linkedin.schema.StringType;
import datahub.client.MetadataWriteResponse;
import datahub.client.rest.RestEmitter;
import datahub.event.MetadataChangeProposalWrapper;
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class DatasetAdd {

  public static void main(String[] args)
      throws IOException, ExecutionException, InterruptedException {
    // Create Dataset URN
    DatasetUrn datasetUrn = UrnUtils.toDatasetUrn("hive", "fct_users_deleted", "PROD");
    CorpuserUrn userUrn = new CorpuserUrn("ingestion");
    AuditStamp lastModified = new AuditStamp().setTime(1640692800000L).setActor(userUrn);

    // Prepare Schema Metadata
    SchemaMetadata schemaMetadata = new SchemaMetadata()
        .setSchemaName("customer")
        .setPlatform(new DataPlatformUrn("hive"))
        .setVersion(0L)
        .setPlatformSchema(SchemaMetadata.PlatformSchema.create(new OtherSchema().setRawSchema("__insert raw schema here__")))
        .setLastModified(lastModified);

    SchemaFieldArray fields = new SchemaFieldArray();
    fields.add(new SchemaField().setFieldPath("address.zipcode")
        .setType(new SchemaFieldDataType().setType(SchemaFieldDataType.Type.create(new StringType())))
        .setNativeDataType("VARCHAR(50)")
        .setDescription("This is the zipcode of the address.")
        .setLastModified(lastModified));

    schemaMetadata.setFields(fields);

    // Create MetadataChangeProposalWrapper
    MetadataChangeProposalWrapper mcpw = MetadataChangeProposalWrapper.builder()
        .entityType("dataset")
        .entityUrn(datasetUrn)
        .upsert()
        .aspect(schemaMetadata)
        .build();

    // Emit to DataHub GMS
    String token = ""; // Add your token here
    RestEmitter emitter = RestEmitter.create(b -> b.server("http://localhost:8080").token(token));
    Future<MetadataWriteResponse> response = emitter.emit(mcpw, null);
    System.out.println(response.get().getResponseContent());
  }
}

This example demonstrates how to create a dataset URN, prepare the schema metadata, wrap it in a metadata change proposal, and emit it to the DataHub GMS using the Java SDK.
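One thing to be aware of: response.get() in the example blocks until GMS answers, and it throws if the request fails. A rough sketch of a more defensive way to wait on the returned Future is shown below. It uses only java.util.concurrent, with a CompletableFuture standing in for the emitter's result so that it runs without a DataHub server. The class EmitResultHandling and the await helper are hypothetical names, not SDK APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of the DataHub SDK.
final class EmitResultHandling {

  private EmitResultHandling() {}

  // Waits up to timeoutSeconds for the future and returns a short status
  // string instead of propagating exceptions to the caller.
  static <T> String await(Future<T> future, long timeoutSeconds) {
    try {
      T result = future.get(timeoutSeconds, TimeUnit.SECONDS);
      return "ok: " + result;
    } catch (TimeoutException e) {
      future.cancel(true); // give up on a response that is taking too long
      return "timed out";
    } catch (ExecutionException e) {
      return "failed: " + e.getCause().getMessage();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      return "interrupted";
    }
  }

  public static void main(String[] args) {
    // Stand-ins for the Future returned by emitter.emit(...).
    Future<String> done = CompletableFuture.completedFuture("200");
    System.out.println(await(done, 5));

    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(new IllegalStateException("connection refused"));
    System.out.println(await(failed, 5));
  }
}
```

With the real SDK, the same helper could be passed the Future returned by emitter.emit(mcpw, null); the 5-second timeout is an arbitrary choice for the sketch.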

Would you like more details on any specific part of this process?
