<@U06TM7M40RK> How to add Domains via CLI?
Currently, DataHub does not support creating domains directly via the CLI. According to a <Slack discussion>, the CLI does not have the capability to create domains or tags. However, you can manage domains using the DataHub API, specifically through GraphQL mutations.
To create a domain programmatically, you can use the GraphQL API. Here is an example of how you can create a domain using a GraphQL mutation:
mutation createDomain {
  createDomain(input: { name: "My New Domain", description: "An optional description" })
}
You can execute this mutation using a tool like curl or through a GraphQL client. Here is an example using curl:
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{ "query": "mutation createDomain { createDomain(input: { name: \"My New Domain\", description: \"An optional description\" }) }", "variables":{}}'
This will create a new domain in DataHub. You can also manage domains using the DataHub UI, where you can create, view, and manage domains.
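For scripted use, the same createDomain request can be built in Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_create_domain_request` and `send` are made up for illustration, the endpoint and token placeholder are copied from the curl example, and it assumes that a JSON-escaped string is acceptable as a GraphQL string literal.

```python
import json
import urllib.request

# Endpoint taken from the curl example above; adjust for your deployment.
GRAPHQL_URL = "http://localhost:8080/api/graphql"


def build_create_domain_request(name, description, token):
    """Build (but do not send) the POST request for the createDomain mutation.

    Hypothetical helper for illustration only.
    """
    # json.dumps produces a double-quoted, escaped literal (assumption: this is
    # also a valid GraphQL string literal for ordinary names and descriptions).
    query = (
        "mutation createDomain { createDomain(input: { "
        f"name: {json.dumps(name)}, description: {json.dumps(description)}"
        " }) }"
    )
    body = json.dumps({"query": query, "variables": {}}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `send(build_create_domain_request("My New Domain", "An optional description", "<my-access-token>"))` would issue the same request as the curl command. Keeping the build step separate from the send step makes it possible to inspect the payload before anything goes over the network.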
Would you like more details on using the GraphQL API for domain management?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/domains.md|DataHub Domains Documentation>
- <Slack discussion>
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK> Is there a possibility to add a domain to a dataset via cli?
Not directly through the CLI. Instead, you can use the DataHub API or the Python SDK to add a domain to a dataset. Here's how you can do it with each:
Using the DataHub API
You can use a GraphQL mutation to set a domain for a dataset. Here is an example using curl:
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
"query": "mutation setDomain { setDomain(entityUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", domainUrn: \"urn:li:domain:marketing\") }",
"variables": {}
}'
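If several datasets should get the same domain, the setDomain payloads can be generated in a loop. The sketch below only builds the JSON bodies, which could then be POSTed to the same endpoint as in the curl example. The helper names are hypothetical, the URN layout is copied from the example above, and it assumes a JSON-escaped string is acceptable as a GraphQL string literal.

```python
import json


def dataset_urn_str(platform, name, env="PROD"):
    """Format a dataset URN in the same shape as the one in the curl example."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def build_set_domain_payloads(platform, dataset_names, domain_urn, env="PROD"):
    """Return one GraphQL request body (as a dict) per dataset name.

    Hypothetical helper for illustration only.
    """
    payloads = []
    for name in dataset_names:
        entity_urn = dataset_urn_str(platform, name, env)
        # Same mutation text as the curl example, with the URNs substituted in.
        query = (
            "mutation setDomain { setDomain(entityUrn: "
            + json.dumps(entity_urn)
            + ", domainUrn: "
            + json.dumps(domain_urn)
            + ") }"
        )
        payloads.append({"query": query, "variables": {}})
    return payloads


# Example: two datasets on the hive platform, one domain.
for payload in build_set_domain_payloads(
    "hive", ["fct_users_created", "fct_users_deleted"], "urn:li:domain:marketing"
):
    print(json.dumps(payload))
```

The dataset name `fct_users_deleted` is a placeholder added for the example; substitute the names of datasets that already exist in your instance.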
Using the Python SDK
You can also use the Python SDK to add a domain to a dataset. Here is an example script:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DomainsClass

# Set up the DataHubGraph client
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))

# Define dataset URN and domain URN
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
domain_urn = "urn:li:domain:marketing"

# Create a MetadataChangeProposalWrapper that carries the domains aspect
event = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DomainsClass(domains=[domain_urn]),
)

# Emit the event to DataHub
graph.emit(event)
These methods allow you to programmatically assign a domain to a dataset in DataHub.
Would you like more detailed steps on using the API or Python SDK?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/api/tutorials/domains.md|DataHub Domains Documentation>
<@U06TM7M40RK> Can I use yaml for that?
Yes, you can use YAML to add a domain to a dataset in DataHub by using the ingestion framework. You can define a transformer in your ingestion recipe YAML file to map datasets to domains. Here's how you can do it using the simple_add_dataset_domain transformer:
Example YAML Configuration
transformers:
  - type: "simple_add_dataset_domain"
    config:
      semantics: OVERWRITE
      domains:
        - "urn:li:domain:marketing"
Steps to Use the YAML Configuration
- Create an Ingestion Recipe: Create a YAML file for your ingestion recipe. This file will define the source of the metadata and the transformers to apply.
- Define the Transformer: Use the simple_add_dataset_domain transformer to specify the domain you want to add to the dataset. The semantics field can be set to OVERWRITE to replace any existing domains or PATCH to add to the existing domains.
- Run the Ingestion: Use the DataHub CLI to run the ingestion process with your YAML file.
datahub ingest -c your_ingestion_recipe.yaml
This configuration will apply the specified domain to the datasets processed by the ingestion pipeline.
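To make the difference between the two semantics values concrete, here is a toy Python model of the behavior described above. It is only an illustration of "replace" versus "add to", not DataHub's actual implementation; the function name `resolve_domains` and the `finance` domain URN are invented for the example.

```python
def resolve_domains(existing, incoming, semantics="OVERWRITE"):
    """Toy model of the transformer's semantics setting (not DataHub code).

    existing: domain URNs already set on the dataset
    incoming: domain URNs listed in the transformer config
    """
    if semantics == "OVERWRITE":
        # Replace whatever was there with the configured domains.
        return list(incoming)
    if semantics == "PATCH":
        # Keep existing domains and append any configured ones not yet present.
        merged = list(existing)
        for urn in incoming:
            if urn not in merged:
                merged.append(urn)
        return merged
    raise ValueError(f"unknown semantics: {semantics!r}")


existing = ["urn:li:domain:finance"]      # hypothetical pre-existing domain
incoming = ["urn:li:domain:marketing"]    # domain from the YAML example

print(resolve_domains(existing, incoming, "OVERWRITE"))
print(resolve_domains(existing, incoming, "PATCH"))
```

Under this model, OVERWRITE leaves only the marketing domain, while PATCH keeps the finance domain and adds marketing alongside it.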
Would you like more detailed steps or additional examples?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/transformer/dataset_transformer.md|Simple Add Dataset Domain Transformer Documentation>
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps!
<@U06TM7M40RK> I need to add domains without using a recipe that includes dataset information
If you want to add domains to datasets without using a full ingestion recipe that includes dataset information, you can use the DataHub API or the Python SDK to directly set domains on existing datasets. This approach allows you to manage domains independently of the dataset ingestion process.
Using the DataHub API
You can use a GraphQL mutation to set a domain for a dataset. Here is an example using curl:
curl --location --request POST 'http://localhost:8080/api/graphql' \
--header 'Authorization: Bearer <my-access-token>' \
--header 'Content-Type: application/json' \
--data-raw '{
"query": "mutation setDomain { setDomain(entityUrn: \"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)\", domainUrn: \"urn:li:domain:marketing\") }",
"variables": {}
}'
Using the Python SDK
You can also use the Python SDK to add a domain to a dataset. Here is an example script:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DomainsClass

# Set up the DataHubGraph client
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(config=DatahubClientConfig(server=gms_endpoint))

# Define dataset URN and domain URN
dataset_urn = make_dataset_urn(platform="hive", name="fct_users_created", env="PROD")
domain_urn = "urn:li:domain:marketing"

# Create a MetadataChangeProposalWrapper that carries the domains aspect
event = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=DomainsClass(domains=[domain_urn]),
)

# Emit the event to DataHub
graph.emit(event)
These methods allow you to programmatically assign a domain to a dataset in DataHub without using a full ingestion recipe.
Would you like more detailed steps on using the API or Python SDK?
Sources:
- <https://datahubspace.slack.com/archives/CUMUWQU66/p1726213603.217929>
- <https://datahubspace.slack.com/archives/CUMUWQU66/p1673422642.899729>
<@U06TM7M40RK> I use the CLI to ingest datasets. How can I add domains to those datasets?