<@U06TM7M40RK>
How do I make a new dbt ingestion source using the datahub python client, and how do I keep the manifest / catalog data updated
To create a new dbt ingestion source using the DataHub Python client and keep the manifest/catalog data updated, you can follow these steps:
- Set Up the Ingestion Recipe: You need to define an ingestion recipe that specifies the paths to your dbt artifacts (manifest, catalog, sources, and run_results files). According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/dbt|DataHub DBT Ingestion Config Details>, these files are essential for ingesting metadata from dbt.
- Configure the Ingestion Source: Use the DataHub Python client to configure the dbt ingestion source. You can specify the paths to your dbt files in the configuration. Here’s an example configuration:
source:
  type: dbt
  config:
    manifest_path: '/path/to/manifest.json'
    catalog_path: '/path/to/catalog.json'
    sources_path: '/path/to/sources.json'
    run_results_path: '/path/to/run_results.json'
    target_platform: 'your_target_platform'
- Run the Ingestion: Use the DataHub CLI to run the ingestion process. You can enable debug logs to get more insights into the process:
datahub ingest -c your_ingestion_recipe.yml --debug
- Keep Data Updated: To keep the manifest and catalog data updated, you should regularly regenerate these files using dbt commands. As per the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dbt/dbt_pre.md|DataHub documentation>, you can use the following workflow:
dbt source snapshot-freshness
dbt build
cp target/run_results.json target/run_results_backup.json
dbt docs generate
cp target/run_results_backup.json target/run_results.json
This ensures that the necessary artifact files are up to date and located in the target/ directory of your dbt project.
- Automate the Process: Consider setting up a scheduled job or a CI/CD pipeline to automate the regeneration of dbt artifacts and the ingestion process to keep your DataHub metadata current.
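For example, a small wrapper script can chain these steps so they can be run from cron or a CI job. This is only an illustrative sketch, not from the DataHub docs; the recipe filename your_ingestion_recipe.yml and the use of Python's subprocess module are assumptions.
import subprocess

# Hypothetical automation wrapper: regenerate the dbt artifacts, then run the
# DataHub CLI ingestion with the recipe defined earlier. Adjust paths as needed.
commands = [
    ["dbt", "source", "snapshot-freshness"],
    ["dbt", "build"],
    ["cp", "target/run_results.json", "target/run_results_backup.json"],
    ["dbt", "docs", "generate"],
    ["cp", "target/run_results_backup.json", "target/run_results.json"],
    ["datahub", "ingest", "-c", "your_ingestion_recipe.yml"],  # assumed recipe path
]

for cmd in commands:
    subprocess.run(cmd, check=True)  # fail fast if any step errors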
Would you like more detailed steps or further assistance on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/dbt|DataHub DBT Ingestion Config Details>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dbt/dbt_pre.md|DataHub GitHub Documentation on DBT Ingestion>
is there no way to do this using the acryl datahub plugin? <@U06TM7M40RK>
I’m here to help with questions related to DataHub. Could you please provide more details or clarify your question so I can assist you better?
is there no way to make a new dbt ingestion source using the datahub python client, and how do I keep the manifest / catalog data updated <@U06TM7M40RK>? I don’t want to use the CLI. I want to use the python client, unless it’s not possible
To create a new dbt ingestion source using the DataHub Python client without using the CLI, you can utilize the DataHub Python SDK. Here’s a general approach to achieve this:
- Set Up the Python Environment: Ensure you have the DataHub Python SDK installed. You can install it using pip:
pip install acryl-datahub
(To run the dbt source through the SDK you will likely also need the dbt plugin extra, e.g. pip install 'acryl-datahub[dbt]'.)
- Configure the Ingestion Source: You can use the Python SDK to configure the dbt ingestion source. The SDK allows you to programmatically define the ingestion configuration, similar to what you would do in a YAML recipe.
- Use the Python SDK for Ingestion: While the provided data sources do not explicitly mention using the Python SDK for dbt ingestion, you can use the SDK to emit metadata to DataHub. Here’s a basic example of how you might set up a dbt ingestion source using the Python SDK:
from datahub.ingestion.run.pipeline import Pipeline

# Define your dbt ingestion configuration
config = {
    "source": {
        "type": "dbt",
        "config": {
            "manifest_path": "/path/to/manifest.json",
            "catalog_path": "/path/to/catalog.json",
            "sources_path": "/path/to/sources.json",
            "run_results_path": "/path/to/run_results.json",
            "target_platform": "your_target_platform",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://<datahub-instance>.com:8080"},
    },
}

# Create and run the pipeline
pipeline = Pipeline.create(config)
pipeline.run()
pipeline.raise_from_status()
- Keep Data Updated: To keep the manifest and catalog data updated, you should regularly regenerate these files using dbt commands, as mentioned in the previous response. Automate this process using a script or a scheduled job.
- Automate the Process: You can automate the ingestion process by running the Python script on a schedule, ensuring that your DataHub metadata remains current.
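To make the last two points concrete, here is a rough sketch (an assumption, not official DataHub guidance) that regenerates the dbt artifacts and then re-runs the same Pipeline configuration from the example above; scheduling this function with cron, Airflow, or your CI system keeps the ingested metadata in sync.
import subprocess

from datahub.ingestion.run.pipeline import Pipeline

def refresh_and_ingest(config: dict) -> None:
    # Regenerate the dbt artifacts so manifest.json / catalog.json are fresh
    # (same dbt workflow as described in the earlier answer).
    for cmd in (["dbt", "build"], ["dbt", "docs", "generate"]):
        subprocess.run(cmd, check=True)

    # Re-create the pipeline on every run so it reads the newly written
    # artifact files from the dbt target/ directory.
    pipeline = Pipeline.create(config)
    pipeline.run()
    pipeline.raise_from_status()

# Example: call refresh_and_ingest(config) on a schedule, reusing the `config`
# dict from the example above.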
If you need more detailed steps or further assistance on any specific part of this process, please let me know!
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/api/datahub-apis.md>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_core.py>
<@U06TM7M40RK> Would this pipeline be visible in the UI or is it just a disposable one?
When you run an ingestion pipeline using the DataHub Python client, the ingestion process itself is not automatically visible in the DataHub UI as a scheduled or managed ingestion source. The pipeline is essentially a one-time execution unless you set it up to run on a schedule using an external scheduler like cron or a CI/CD pipeline.
However, the metadata ingested by the pipeline will be visible in the DataHub UI. This means that the datasets, schemas, and other metadata entities that are ingested will appear in the DataHub UI, allowing you to browse and search them.
If you want the ingestion pipeline to be visible and managed within the DataHub UI, you would typically set it up using the DataHub UI’s ingestion scheduling features. This involves creating an ingestion source through the UI, which allows you to manage and monitor the ingestion runs directly from the DataHub interface.
For programmatic ingestion using the Python client, the focus is on the metadata being ingested rather than the ingestion process itself being visible in the UI.
Would you like more details on how to set up scheduled ingestion through the UI or any other aspect of DataHub?
Sources:
- <https://datahubproject.io/docs/0.14.0/ui-ingestion|DataHub UI Ingestion Documentation>
<@U06TM7M40RK> is there a way to do it via the graphql api?
Yes, you can manage ingestion sources and trigger ingestion runs using the DataHub GraphQL API. Here’s how you can do it:
- Create an Ingestion Source: You can use the createIngestionSource mutation to create a new ingestion source. This allows you to define the source type, schedule, and configuration.
- Trigger an Ingestion Run: Once the ingestion source is created, you can use the createIngestionExecutionRequest mutation to trigger an ingestion run for the specified source.
Here is an example of how you can use the GraphQL API to create a dbt ingestion source and trigger it:
Create an Ingestion Source
mutation {
createIngestionSource(input: {
name: "My DBT Ingestion Source",
type: "dbt",
description: "Ingestion source for dbt metadata",
schedule: {
interval: "0 0 * * *", # Example CRON schedule
timezone: "UTC"
},
config: {
recipe: "{\"source\":{\"type\":\"dbt\",\"config\":{\"manifest_path\":\"/path/to/manifest.json\",\"catalog_path\":\"/path/to/catalog.json\",\"sources_path\":\"/path/to/sources.json\",\"run_results_path\":\"/path/to/run_results.json\",\"target_platform\":\"your_target_platform\"}},\"pipeline_name\":\"urn:li:dataHubIngestionSource:example-urn\"}",
version: "0.8.18",
executorId: "myExecutorId"
}
}) {
urn
}
}
Trigger an Ingestion Run
mutation {
createIngestionExecutionRequest(input: {
ingestionSourceUrn: "urn:li:dataHubIngestionSource:example-urn"
})
}
Using curl to Execute the GraphQL Mutations
You can execute these GraphQL mutations using a tool like curl:
curl -X POST https://your-datahub-instance/api/graphql \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
-d '{
"query": "mutation { createIngestionSource(input: { name: \"My DBT Ingestion Source\", type: \"dbt\", description: \"Ingestion source for dbt metadata\", schedule: { interval: \"0 0 * * *\", timezone: \"UTC\" }, config: { recipe: \"{\\\"source\\\":{\\\"type\\\":\\\"dbt\\\",\\\"config\\\":{\\\"manifest_path\\\":\\\"/path/to/manifest.json\\\",\\\"catalog_path\\\":\\\"/path/to/catalog.json\\\",\\\"sources_path\\\":\\\"/path/to/sources.json\\\",\\\"run_results_path\\\":\\\"/path/to/run_results.json\\\",\\\"target_platform\\\":\\\"your_target_platform\\\"}},\\\"pipeline_name\\\":\\\"urn:li:dataHubIngestionSource:example-urn\\\"}\", version: \"0.8.18\", executorId: \"myExecutorId\" } }) { urn } }"
}'
Replace <https://your-datahub-instance/api/graphql> with your DataHub GraphQL endpoint and YOUR_ACCESS_TOKEN with your authentication token.
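Since you mentioned wanting to stay in Python, the same mutations can also be sent with any HTTP client instead of curl. The sketch below is only an illustration using the requests library; the endpoint, token, and source URN are placeholders, and it triggers a run for an ingestion source that was already created:
import requests

GRAPHQL_ENDPOINT = "https://your-datahub-instance/api/graphql"  # placeholder endpoint
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",  # placeholder token
}

# Trigger an ingestion run for an existing ingestion source.
query = """
mutation triggerRun($urn: String!) {
  createIngestionExecutionRequest(input: { ingestionSourceUrn: $urn })
}
"""

response = requests.post(
    GRAPHQL_ENDPOINT,
    headers=HEADERS,
    json={
        "query": query,
        "variables": {"urn": "urn:li:dataHubIngestionSource:example-urn"},
    },
)
response.raise_for_status()
print(response.json())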
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- <https://datahubproject.io/docs/0.14.0/ui-ingestion#advanced-ingestion-configs>
- <https://github.com/datahub-project/datahub/blob/master/docs/ui-ingestion.md>
<@U06TM7M40RK> How will this get the updated dbt manifests? Would I have to recreate the pipeline, or can I update it? How would it work?