Strategies for Running Multiple dbt Ingests and Sending Run Results to DataHub API via Python

Original Slack Thread

<@U06TM7M40RK> how do I run hundreds of ingests of run_results.json from dbt daily without ingesting catalog.json and manifest.json every time?

To run multiple ingests of run_results.json from dbt without ingesting catalog.json and manifest.json every time, you can follow a strategy that separates the ingestion of test results from other metadata. Here’s a summarized approach:

  1. Initial Ingestion of Metadata:

    • Run dbt docs generate to produce up-to-date manifest.json and catalog.json, then perform a one-time ingestion of them to capture the model, source, and schema metadata.
  2. Frequent Ingestion of Test Results:

    • For the frequent ingestion of run_results.json, you can configure your DataHub ingestion recipe to focus only on test results. This can be achieved by using the entities_enabled configuration to specify that only test results should be ingested.
  3. Configuration Example:

    • Use the following configuration in your ingestion recipe to focus on test results:
    source:
      type: dbt
      config:
        manifest_path: "path_to_manifest.json"  # Use the path from the initial ingestion
        catalog_path: "path_to_catalog.json"    # Use the path from the initial ingestion
        run_results_paths: ["path_to_run_results.json"]
        entities_enabled:
          test_results: Only
    
  4. Workflow:

    • Run dbt build or dbt test to generate the run_results.json.
    • Copy the run_results.json to a location accessible for ingestion.
    • Use the above configuration to ingest only the test results into DataHub.

This approach lets you ingest test results frequently without repeatedly ingesting the static metadata files (manifest.json and catalog.json).
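
If you would rather trigger these frequent runs from Python than from the datahub CLI, the same recipe can be executed programmatically. Below is a minimal sketch using the SDK's Pipeline API, assuming the placeholder paths from the recipe above and a DataHub instance at http://localhost:8080:

    from datahub.ingestion.run.pipeline import Pipeline

    # The same recipe as above, expressed as a dict; the paths and the
    # server URL are placeholders to substitute with your own values.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "path_to_manifest.json",
                    "catalog_path": "path_to_catalog.json",
                    "run_results_paths": ["path_to_run_results.json"],
                    "entities_enabled": {"test_results": "Only"},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()                # execute the ingestion
    pipeline.raise_from_status()  # raise if the run reported errors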

Would you like more detailed steps or further assistance on any specific part of this process?

<@U06TM7M40RK> instead of passing files back and forth, how would I go about sending the run results data to a Datahub API using python?

To send run_results.json data to DataHub using Python, you can utilize the DataHub Python SDK or the REST API to programmatically emit metadata. Here’s a summarized approach using the Python SDK:

  1. Set Up the DataHub Python SDK:

    • First, ensure you have the DataHub Python SDK installed. You can install it using pip:
      pip install acryl-datahub
      
  2. Read the run_results.json File:

    • Load the run_results.json file into your Python script. This file contains the results of your dbt test runs.
  3. Emit Metadata to DataHub:

    • Use the DataHub Python SDK to create and emit metadata change proposals (MCPs) for the test results. Here’s a basic example:
    import json

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    # Initialize the DataHub emitter and verify connectivity
    emitter = DatahubRestEmitter(gms_server='http://localhost:8080')  # Replace with your DataHub server URL
    emitter.test_connection()

    # Load the run_results.json produced by dbt test / dbt build
    with open('path_to_run_results.json') as f:
        run_results = json.load(f)

    # Example: emit a properties aspect for each test result.
    # Caution: this overwrites the dataset description with the test
    # status; adapt the aspect to the metadata you actually want to store.
    for result in run_results['results']:
        dataset_urn = make_dataset_urn(platform='dbt', name=result['unique_id'], env='PROD')
        dataset_properties = DatasetPropertiesClass(description=result['status'])

        # entityType and changeType (UPSERT) are inferred by the wrapper
        metadata_event = MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=dataset_properties,
        )

        emitter.emit(metadata_event)

  4. Customize the Metadata:

    • Customize the metadata you want to emit based on the contents of run_results.json. You can include additional details such as execution time, status, and any other relevant information.
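
For illustration, here is a small sketch of a builder that carries test status and timing along as custom properties. The fields status and execution_time come from run_results.json; the property names dbt_status and dbt_execution_time_s are just conventions invented for this example:

    from datahub.metadata.schema_classes import DatasetPropertiesClass

    def build_properties(result: dict) -> DatasetPropertiesClass:
        # 'status' and 'execution_time' are standard run_results.json fields;
        # customProperties values must all be strings.
        return DatasetPropertiesClass(
            description=f"Last dbt test status: {result['status']}",
            customProperties={
                "dbt_status": str(result["status"]),
                "dbt_execution_time_s": str(result.get("execution_time", "")),
            },
        )

You could swap build_properties(result) into the loop above in place of the bare DatasetPropertiesClass construction.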

This approach lets you send test results to DataHub programmatically, without shuttling files between systems. You can automate it to run after each dbt test execution, as sketched below.
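
As a closing sketch of that automation, the snippet below runs the tests and then pushes the fresh results. Here emit_run_results is a hypothetical helper wrapping the emitter loop above, and target/run_results.json is dbt's default output location:

    import subprocess

    # Run the dbt tests; dbt exits non-zero when tests fail, but
    # run_results.json is still written, so don't abort on a failure.
    subprocess.run(["dbt", "test"], check=False)

    # Hypothetical helper: wraps the emitter loop from the example above.
    emit_run_results("target/run_results.json")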

Would you like more detailed steps or further assistance on any specific part of this process?
