<@U06TM7M40RK> how do I run hundreds of ingests of run_results.json from dbt daily without ingesting catalog.json and manifest.json every time?
To run frequent ingests of `run_results.json` from dbt without ingesting `catalog.json` and `manifest.json` every time, you can follow a strategy that separates the ingestion of test results from the other metadata. Here's a summarized approach:
- **Initial Ingestion of Metadata:** Perform a one-time ingestion of `manifest.json` and `catalog.json` to capture the model, source, and schema metadata. Run `dbt docs generate` first to ensure these files are up to date.
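For reference, the one-time metadata ingestion recipe might look like the sketch below. The file paths, `target_platform` value, and server URL are placeholders to adapt to your environment:

```yml
source:
  type: dbt
  config:
    manifest_path: "path_to_manifest.json"
    catalog_path: "path_to_catalog.json"
    target_platform: "postgres"  # the warehouse platform dbt runs against; adjust
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```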
- **Frequent Ingestion of Test Results:** For the frequent ingestion of `run_results.json`, configure your DataHub ingestion recipe to focus only on test results. This can be achieved with the `entities_enabled` configuration, specifying that only test results should be ingested.
- **Configuration Example:** Use the following configuration in your ingestion recipe to focus on test results:

  ```yml
  source:
    type: dbt
    config:
      manifest_path: "path_to_manifest.json"  # same path as the initial ingestion
      catalog_path: "path_to_catalog.json"    # same path as the initial ingestion
      run_results_paths: ["path_to_run_results.json"]
      entities_enabled:
        test_results: Only
  ```
- **Workflow:**
  - Run `dbt build` or `dbt test` to generate `run_results.json`.
  - Copy `run_results.json` to a location accessible for ingestion.
  - Use the above configuration to ingest only the test results into DataHub.
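Put together, the daily workflow above can be sketched as a small script. The recipe filename is an assumption, and both `dbt` and the DataHub `datahub` CLI must be on the PATH:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Daily job sketch: regenerate run_results.json, then ingest only the
# test results using a recipe like the configuration example above.
run_daily_test_ingest() {
  dbt test                                       # writes target/run_results.json
  datahub ingest -c dbt_test_results_recipe.yml  # hypothetical recipe filename
}
```

A scheduler (cron, Airflow, etc.) can then call `run_daily_test_ingest` after each dbt run.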
This approach allows you to efficiently manage the ingestion of test results without repeatedly ingesting the static metadata files (`manifest.json` and `catalog.json`).
Would you like more detailed steps or further assistance on any specific part of this process?
Sources:
- <dbt | DataHub documentation on dbt integration>
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dbt/dbt.md|GitHub documentation on dbt setup>
<@U06TM7M40RK> instead of passing files back and forth, how would I go about sending the run results data to a Datahub API using python?
To send `run_results.json` data to DataHub using Python, you can use the DataHub Python SDK or the REST API to emit metadata programmatically. Here's a summarized approach using the Python SDK:
- **Set Up the DataHub Python SDK:** Ensure you have the DataHub Python SDK installed. You can install it with pip:

  ```shell
  pip install acryl-datahub
  ```
- **Read the `run_results.json` File:** Load the `run_results.json` file into your Python script. This file contains the results of your dbt test runs.
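As a sketch, the loading step might pull out just the fields used when emitting (the path is a placeholder for wherever your dbt run writes its artifacts):

```python
import json


def load_test_results(path):
    """Return (unique_id, status) pairs from a dbt run_results.json file.

    Reads only the `results` array; each entry's `unique_id` and
    `status` fields are the ones used when emitting to DataHub.
    """
    with open(path) as f:
        run_results = json.load(f)
    return [(r["unique_id"], r["status"]) for r in run_results["results"]]
```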
- **Emit Metadata to DataHub:** Use the DataHub Python SDK to create and emit metadata change proposals (MCPs) for the test results. Here's a basic example:

  ```python
  import json

  from datahub.emitter.mce_builder import make_dataset_urn
  from datahub.emitter.mcp import MetadataChangeProposalWrapper
  from datahub.emitter.rest_emitter import DatahubRestEmitter
  from datahub.metadata.schema_classes import DatasetPropertiesClass

  # Initialize the DataHub emitter (replace with your DataHub server URL)
  emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
  emitter.test_connection()

  # Load the run_results.json
  with open("path_to_run_results.json") as f:
      run_results = json.load(f)

  # Example: emit metadata for each test result
  for result in run_results["results"]:
      dataset_urn = make_dataset_urn(
          platform="dbt", name=result["unique_id"], env="PROD"
      )
      dataset_properties = DatasetPropertiesClass(description=result["status"])
      metadata_event = MetadataChangeProposalWrapper(
          entityUrn=dataset_urn,
          aspect=dataset_properties,
      )
      emitter.emit(metadata_event)
  ```
- **Customize the Metadata:** Customize the metadata you want to emit based on the contents of `run_results.json`. You can include additional details such as execution time, status, and any other relevant information.
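For example, a richer description string could be assembled from each result entry. This helper is illustrative, not part of the SDK; it reads the `status`, `execution_time`, and `message` fields that dbt writes into `run_results.json`:

```python
def build_description(result):
    """Build a human-readable description from one run_results entry."""
    parts = [f"status: {result['status']}"]
    if result.get("execution_time") is not None:
        parts.append(f"execution time: {result['execution_time']:.2f}s")
    if result.get("message"):
        parts.append(f"message: {result['message']}")
    return "; ".join(parts)
```

The returned string could then be passed as the `description` when constructing `DatasetPropertiesClass` in the emitter loop above.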
This approach allows you to programmatically send test results to DataHub without manually handling files. You can automate this process to run after each dbt test execution.
Would you like more detailed steps or further assistance on any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/api/datahub-apis.md|DataHub Python SDK Documentation>
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><