<@U06TM7M40RK>
How do I make a new dbt ingestion source using the datahub python client, and how do I keep the manifest / catalog data updated
To create a new dbt ingestion source using the DataHub Python client and keep the manifest/catalog data updated, you can follow these steps:
- Set Up the Ingestion Recipe: You need to define an ingestion recipe that specifies the paths to your dbt artifacts (manifest, catalog, sources, and run_results files). According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/dbt|DataHub DBT Ingestion Config Details>, these files are essential for ingesting metadata from dbt.
- Configure the Ingestion Source: Use the DataHub Python client to configure the dbt ingestion source. You can specify the paths to your dbt files in the configuration. Here’s an example configuration:
source:
  type: dbt
  config:
    manifest_path: '/path/to/manifest.json'
    catalog_path: '/path/to/catalog.json'
    sources_path: '/path/to/sources.json'
    run_results_path: '/path/to/run_results.json'
    target_platform: 'your_target_platform'
- Run the Ingestion: Use the DataHub CLI to run the ingestion process. You can enable debug logs to get more insights into the process:
datahub ingest -c your_ingestion_recipe.yml --debug
- Keep Data Updated: To keep the manifest and catalog data updated, you should regularly regenerate these files using dbt commands. As per the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dbt/dbt_pre.md|DataHub documentation>, you can use the following workflow:
dbt source snapshot-freshness
dbt build
cp target/run_results.json target/run_results_backup.json
dbt docs generate
cp target/run_results_backup.json target/run_results.json
This ensures that the necessary artifact files are up to date and located in the target/ directory of your dbt project.
- Automate the Process: Consider setting up a scheduled job or a CI/CD pipeline to automate the regeneration of dbt artifacts and the ingestion process to keep your DataHub metadata current.
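For example, a small wrapper script can chain these steps so they can be run from cron or a CI job. This is only an illustrative sketch, not from the DataHub docs; the recipe filename your_ingestion_recipe.yml and the use of Python's subprocess module are assumptions.
import subprocess

# Hypothetical automation wrapper: regenerate the dbt artifacts, then run the
# DataHub CLI ingestion with the recipe defined earlier. Adjust paths as needed.
commands = [
    ["dbt", "source", "snapshot-freshness"],
    ["dbt", "build"],
    ["cp", "target/run_results.json", "target/run_results_backup.json"],
    ["dbt", "docs", "generate"],
    ["cp", "target/run_results_backup.json", "target/run_results.json"],
    ["datahub", "ingest", "-c", "your_ingestion_recipe.yml"],  # assumed recipe path
]

for cmd in commands:
    subprocess.run(cmd, check=True)  # fail fast if any step errors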
Would you like more detailed steps or further assistance on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/dbt|DataHub DBT Ingestion Config Details>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dbt/dbt_pre.md|DataHub GitHub Documentation on DBT Ingestion>
is there no way to do this using the acryl datahub plugin? <@U06TM7M40RK>
I’m here to help with questions related to DataHub. Could you please provide more details or clarify your question so I can assist you better?
is there no way to make a new dbt ingestion source using the datahub python client, and how do I keep the manifest / catalog data updated <@U06TM7M40RK>? I don’t want to use the CLI. I want to use the python client, unless it’s not possible
To create a new dbt ingestion source using the DataHub Python client without using the CLI, you can utilize the DataHub Python SDK. Here’s a general approach to achieve this:
- Set Up the Python Environment: Ensure you have the DataHub Python SDK installed. You can install it using pip:
pip install acryl-datahub
(To run the dbt source through the SDK you will likely also need the dbt plugin extra, e.g. pip install 'acryl-datahub[dbt]'.)
- Configure the Ingestion Source: You can use the Python SDK to configure the dbt ingestion source. The SDK allows you to programmatically define the ingestion configuration, similar to what you would do in a YAML recipe.
- Use the Python SDK for Ingestion: While the provided data sources do not explicitly mention using the Python SDK for dbt ingestion, you can use the SDK to emit metadata to DataHub. Here’s a basic example of how you might set up a dbt ingestion source using the Python SDK:
from datahub.ingestion.run.pipeline import Pipeline

# Define your dbt ingestion configuration
config = {
    "source": {
        "type": "dbt",
        "config": {
            "manifest_path": "/path/to/manifest.json",
            "catalog_path": "/path/to/catalog.json",
            "sources_path": "/path/to/sources.json",
            "run_results_path": "/path/to/run_results.json",
            "target_platform": "your_target_platform",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://<datahub-instance>.com:8080"},
    },
}

# Create and run the pipeline
pipeline = Pipeline.create(config)
pipeline.run()
pipeline.raise_from_status()
- Keep Data Updated: To keep the manifest and catalog data updated, you should regularly regenerate these files using dbt commands, as mentioned in the previous response. Automate this process using a script or a scheduled job.
- Automate the Process: You can automate the ingestion process by running the Python script on a schedule, ensuring that your DataHub metadata remains current.
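To make the last two points concrete, here is a rough sketch (an assumption, not official DataHub guidance) that regenerates the dbt artifacts and then re-runs the same Pipeline configuration from the example above; scheduling this function with cron, Airflow, or your CI system keeps the ingested metadata in sync.
import subprocess

from datahub.ingestion.run.pipeline import Pipeline

def refresh_and_ingest(config: dict) -> None:
    # Regenerate the dbt artifacts so manifest.json / catalog.json are fresh
    # (same dbt workflow as described in the earlier answer).
    for cmd in (["dbt", "build"], ["dbt", "docs", "generate"]):
        subprocess.run(cmd, check=True)

    # Re-create the pipeline on every run so it reads the newly written
    # artifact files from the dbt target/ directory.
    pipeline = Pipeline.create(config)
    pipeline.run()
    pipeline.raise_from_status()

# Example: call refresh_and_ingest(config) on a schedule, reusing the `config`
# dict from the example above.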
If you need more detailed steps or further assistance on any specific part of this process, please let me know!
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/api/datahub-apis.md>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_core.py>
<@U06TM7M40RK> Would this pipeline be visible in the UI or is it just a disposable one?
When you run an ingestion pipeline using the DataHub Python client, the ingestion process itself is not automatically visible in the DataHub UI as a scheduled or managed ingestion source. The pipeline is essentially a one-time execution unless you set it up to run on a schedule using an external scheduler like cron or a CI/CD pipeline.
However, the metadata ingested by the pipeline will be visible in the DataHub UI. This means that the datasets, schemas, and other metadata entities that are ingested will appear in the DataHub UI, allowing you to browse and search them.
If you want the ingestion pipeline to be visible and managed within the DataHub UI, you would typically set it up using the DataHub UI’s ingestion scheduling features. This involves creating an ingestion source through the UI, which allows you to manage and monitor the ingestion runs directly from the DataHub interface.
For programmatic ingestion using the Python client, the focus is on the metadata being ingested rather than the ingestion process itself being visible in the UI.
Would you like more details on how to set up scheduled ingestion through the UI or any other aspect of DataHub?
Sources:
- <https://datahubproject.io/docs/0.14.0/ui-ingestion|DataHub UI Ingestion Documentation>
<@U06TM7M40RK> is there a way to do it via the graphql api?
Yes, you can manage ingestion sources and trigger ingestion runs using the DataHub GraphQL API. Here’s how you can do it:
- Create an Ingestion Source: You can use the createIngestionSource mutation to create a new ingestion source. This allows you to define the source type, schedule, and configuration.
- Trigger an Ingestion Run: Once the ingestion source is created, you can use the createIngestionExecutionRequest mutation to trigger an ingestion run for the specified source.
Here is an example of how you can use the GraphQL API to create a dbt ingestion source and trigger it:
Create an Ingestion Source
mutation {
createIngestionSource(input: {
name: "My DBT Ingestion Source",
type: "dbt",
description: "Ingestion source for dbt metadata",
schedule: {
interval: "0 0 * * *", # Example CRON schedule
timezone: "UTC"
},
config: {
recipe: "{\"source\":{\"type\":\"dbt\",\"config\":{\"manifest_path\":\"/path/to/manifest.json\",\"catalog_path\":\"/path/to/catalog.json\",\"sources_path\":\"/path/to/sources.json\",\"run_results_path\":\"/path/to/run_results.json\",\"target_platform\":\"your_target_platform\"}},\"pipeline_name\":\"urn:li:dataHubIngestionSource:example-urn\"}",
version: "0.8.18",
executorId: "myExecutorId"
}
}) {
urn
}
}
Trigger an Ingestion Run
mutation {
createIngestionExecutionRequest(input: {
ingestionSourceUrn: "urn:li:dataHubIngestionSource:example-urn"
})
}
Using curl to Execute the GraphQL Mutations
You can execute these GraphQL mutations using a tool like curl:
curl -X POST https://your-datahub-instance/api/graphql \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
-d '{
"query": "mutation { createIngestionSource(input: { name: \"My DBT Ingestion Source\", type: \"dbt\", description: \"Ingestion source for dbt metadata\", schedule: { interval: \"0 0 * * *\", timezone: \"UTC\" }, config: { recipe: \"{\\\"source\\\":{\\\"type\\\":\\\"dbt\\\",\\\"config\\\":{\\\"manifest_path\\\":\\\"/path/to/manifest.json\\\",\\\"catalog_path\\\":\\\"/path/to/catalog.json\\\",\\\"sources_path\\\":\\\"/path/to/sources.json\\\",\\\"run_results_path\\\":\\\"/path/to/run_results.json\\\",\\\"target_platform\\\":\\\"your_target_platform\\\"}},\\\"pipeline_name\\\":\\\"urn:li:dataHubIngestionSource:example-urn\\\"}\", version: \"0.8.18\", executorId: \"myExecutorId\" } }) { urn } }"
}'
Replace <https://your-datahub-instance/api/graphql> with your DataHub GraphQL endpoint and YOUR_ACCESS_TOKEN with your authentication token.
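Since you mentioned wanting to stay in Python, the same mutations can also be sent with any HTTP client instead of curl. The sketch below is only an illustration using the requests library; the endpoint, token, and source URN are placeholders, and it triggers a run for an ingestion source that was already created:
import requests

GRAPHQL_ENDPOINT = "https://your-datahub-instance/api/graphql"  # placeholder endpoint
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",  # placeholder token
}

# Trigger an ingestion run for an existing ingestion source.
query = """
mutation triggerRun($urn: String!) {
  createIngestionExecutionRequest(input: { ingestionSourceUrn: $urn })
}
"""

response = requests.post(
    GRAPHQL_ENDPOINT,
    headers=HEADERS,
    json={
        "query": query,
        "variables": {"urn": "urn:li:dataHubIngestionSource:example-urn"},
    },
)
response.raise_for_status()
print(response.json())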
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- <https://datahubproject.io/docs/0.14.0/ui-ingestion#advanced-ingestion-configs>
- <https://github.com/datahub-project/datahub/blob/master/docs/ui-ingestion.md>
<@U06TM7M40RK> How will this get the updated dbt manifests? Would I have to recreate the pipeline, or can I update it? How would it work?