Adding Domains to Datasets via CLI and API in DataHub

Original Slack Thread

<@U06TM7M40RK> How to add Domains via CLI?

Currently, DataHub does not support creating domains directly via the CLI. According to an earlier Slack discussion, the CLI cannot create domains or tags. You can, however, manage domains through the DataHub GraphQL API.

Here is an example GraphQL mutation that creates a domain:

mutation createDomain {
  createDomain(input: { name: "My New Domain", description: "An optional description" })
}

You can execute this mutation using a tool like curl or through a GraphQL client. Here is an example using curl:

curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{ "query": "mutation createDomain { createDomain(input: { name: \"My New Domain\", description: \"An optional description\" }) }", "variables":{}}'

This will create a new domain in DataHub. You can also create, view, and manage domains in the DataHub UI.
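For scripting, the same request can be assembled from Python. The following is a minimal sketch using only the standard library, not an official DataHub client: the helper name build_create_domain_request and its defaults are assumptions for illustration, and the endpoint and token placeholder are copied from the curl example. It builds the request without sending it.

```python
import json
import urllib.request


def build_create_domain_request(name, description,
                                gms_url="http://localhost:8080",
                                token="<my-access-token>"):
    """Build (but do not send) the HTTP request for the createDomain mutation."""
    # json.dumps yields a double-quoted, escaped literal; for ordinary text
    # this is also a valid GraphQL string literal.
    query = (
        "mutation createDomain { createDomain(input: { "
        f"name: {json.dumps(name)}, description: {json.dumps(description)}"
        " }) }"
    )
    body = json.dumps({"query": query, "variables": {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{gms_url}/api/graphql",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_domain_request("My New Domain", "An optional description")
print(request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Keeping construction separate from sending makes it easy to inspect the payload before it reaches a real instance.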

Would you like more details on using the GraphQL API for domain management?

<@U06TM7M40RK> Is there a possibility to add a domain to a dataset via cli?

Not directly through the CLI. Instead, you can use the DataHub API or the Python SDK to add a domain to a dataset. Here’s how to do it with each:

Using the DataHub API
You can use a GraphQL mutation to set a domain for a dataset. Here is an example using curl:

curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": "mutation setDomain { setDomain(entityUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", domainUrn: \"urn:li:domain:marketing\") }",
  "variables": {}
}'

Using the Python SDK
You can also use the Python SDK to add a domain to a dataset. Here is an example script:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DomainsClass

# Setup DataHubGraph client
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))

# Define dataset URN and domain URN
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
domain_urn = "urn:li:domain:marketing"

# Create a MetadataChangeProposalWrapper for the domain
event = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DomainsClass(domains=[domain_urn])
)

# Emit the event to DataHub
graph.emit(event)

These methods allow you to programmatically assign a domain to a dataset in DataHub.

Would you like more detailed steps on using the API or Python SDK?

<@U06TM7M40RK> Can I use yaml for that?

Yes, you can use YAML to add a domain to a dataset in DataHub by using the ingestion framework. You can define a transformer in your ingestion recipe YAML file to map datasets to domains. Here’s how you can do it using the simple_add_dataset_domain transformer:

Example YAML Configuration

transformers:
  - type: "simple_add_dataset_domain"
    config:
      semantics: OVERWRITE
      domains:
        - "urn:li:domain:marketing"

Steps to Use the YAML Configuration

  1. Create an Ingestion Recipe: Create a YAML file for your ingestion recipe. This file will define the source of the metadata and the transformers to apply.

  2. Define the Transformer: Use the simple_add_dataset_domain transformer to specify the domain you want to add to the dataset. The semantics field can be set to OVERWRITE to replace any existing domains or PATCH to add to the existing domains.

  3. Run the Ingestion: Use the DataHub CLI to run the ingestion process with your YAML file.

datahub ingest -c your_ingestion_recipe.yaml

This configuration will apply the specified domain to the datasets processed by the ingestion pipeline.
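The transformers block above is only one part of a recipe; a complete recipe also needs a source and a sink. As a rough sketch, the same layout can be expressed as a Python dict and run in-process through the ingestion framework's Pipeline class. The hive source, its host_port value, and the helper names build_recipe and run_recipe are placeholders assumed for illustration; substitute the source you already ingest from.

```python
def build_recipe(domain_urn, semantics="OVERWRITE"):
    """Return a recipe dict mirroring the YAML layout: source, transformers, sink."""
    return {
        # Placeholder source: replace with the source type and config you use.
        "source": {
            "type": "hive",
            "config": {"host_port": "localhost:10000"},
        },
        # Same transformer settings as the YAML example.
        "transformers": [
            {
                "type": "simple_add_dataset_domain",
                "config": {"semantics": semantics, "domains": [domain_urn]},
            }
        ],
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


def run_recipe(recipe):
    """Run the recipe in-process; needs acryl-datahub and the source plugin installed."""
    # Imported lazily so the recipe can be built and inspected without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_recipe("urn:li:domain:marketing")
print(recipe["transformers"])
# run_recipe(recipe)  # uncomment when DataHub and the source are reachable
```

Passing semantics="PATCH" to build_recipe gives the add-to-existing behavior described in step 2.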

Would you like more detailed steps or additional examples?

<@U06TM7M40RK> I need to add domains without using a recipe that includes dataset information

If you want to add domains to datasets without using a full ingestion recipe that includes dataset information, you can use the DataHub API or the Python SDK to directly set domains on existing datasets. This approach allows you to manage domains independently of the dataset ingestion process.

Using the DataHub API

You can use a GraphQL mutation to set a domain for a dataset. Here is an example using curl:

curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": "mutation setDomain { setDomain(entityUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", domainUrn: \"urn:li:domain:marketing\") }",
  "variables": {}
}'

Using the Python SDK

You can also use the Python SDK to add a domain to a dataset. Here is an example script:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DomainsClass

# Setup DataHubGraph client
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))

# Define dataset URN and domain URN
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
domain_urn = "urn:li:domain:marketing"

# Create a MetadataChangeProposalWrapper for the domain
event = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DomainsClass(domains=[domain_urn])
)

# Emit the event to DataHub
graph.emit(event)

These methods allow you to programmatically assign a domain to a dataset in DataHub without using a full ingestion recipe.
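When several existing datasets need the same domain, the setDomain request body can be generated per dataset. This is a minimal standard-library sketch: the helper names dataset_urn and set_domain_payload and the second table name are hypothetical, and the URN shape is copied from the examples in this thread. Each resulting body can be POSTed to /api/graphql with the same headers as the curl example.

```python
import json


def dataset_urn(platform, name, env="PROD"):
    """Build a dataset URN in the same shape as the examples in this thread."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def set_domain_payload(entity_urn, domain_urn):
    """Return the JSON request body for a setDomain mutation on one entity."""
    # json.dumps quotes and escapes each URN as a string literal.
    query = (
        "mutation setDomain { setDomain("
        f"entityUrn: {json.dumps(entity_urn)}, domainUrn: {json.dumps(domain_urn)}"
        ") }"
    )
    return json.dumps({"query": query, "variables": {}})


# Hypothetical table names; replace with datasets that already exist in DataHub.
tables = ["fct_users_created", "fct_users_deleted"]
payloads = [
    set_domain_payload(dataset_urn("hive", table), "urn:li:domain:marketing")
    for table in tables
]
for payload in payloads:
    print(payload)
```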

Would you like more detailed steps on using the API or Python SDK?

<@U06TM7M40RK> I use the CLI to ingest datasets. How can I add domains to those datasets?