Strategies for Running Multiple dbt Ingests and Sending Run Results to DataHub API via Python

Original Slack Thread

<@U06TM7M40RK> how do I run hundreds of ingests of run_results.json from dbt daily without ingesting catalog.json and manifest.json every time?

To run multiple ingests of run_results.json from dbt without ingesting catalog.json and manifest.json every time, you can follow a strategy that separates the ingestion of test results from other metadata. Here’s a summarized approach:

  1. Initial Ingestion of Metadata:

    • Run dbt docs generate to produce up-to-date manifest.json and catalog.json, then perform a one-time ingestion of them to capture the model, source, and schema metadata.
  2. Frequent Ingestion of Test Results:

    • For the frequent ingestion of run_results.json, you can configure your DataHub ingestion recipe to focus only on test results. This can be achieved by using the entities_enabled configuration to specify that only test results should be ingested.
  3. Configuration Example:

    • Use the following configuration in your ingestion recipe to focus on test results:
    source:
      type: dbt
      config:
        manifest_path: "path_to_manifest.json"  # Use the path from the initial ingestion
        catalog_path: "path_to_catalog.json"    # Use the path from the initial ingestion
        run_results_paths: ["path_to_run_results.json"]
        entities_enabled:
          test_results: Only
    
  4. Workflow:

    • Run dbt build or dbt test to generate the run_results.json.
    • Copy the run_results.json to a location accessible for ingestion.
    • Use the above configuration to ingest only the test results into DataHub.

This approach lets you ingest test results frequently without repeatedly ingesting the static metadata files (manifest.json and catalog.json).
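
If you would rather trigger these frequent runs from Python than from the datahub CLI, the same recipe can be executed programmatically. Below is a minimal sketch using the SDK's Pipeline API, assuming the placeholder paths from the recipe above and a DataHub instance at http://localhost:8080:

    from datahub.ingestion.run.pipeline import Pipeline

    # The same recipe as above, expressed as a dict; the paths and the
    # server URL are placeholders to substitute with your own values.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "path_to_manifest.json",
                    "catalog_path": "path_to_catalog.json",
                    "run_results_paths": ["path_to_run_results.json"],
                    "entities_enabled": {"test_results": "Only"},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()                # execute the ingestion
    pipeline.raise_from_status()  # raise if the run reported errors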

Would you like more detailed steps or further assistance on any specific part of this process?

<@U06TM7M40RK> instead of passing files back and forth, how would I go about sending the run results data to a Datahub API using python?

To send run_results.json data to DataHub using Python, you can utilize the DataHub Python SDK or the REST API to programmatically emit metadata. Here’s a summarized approach using the Python SDK:

  1. Set Up the DataHub Python SDK:

    • First, ensure you have the DataHub Python SDK installed. You can install it using pip:
      pip install acryl-datahub
      
  2. Read the run_results.json File:

    • Load the run_results.json file into your Python script. This file contains the results of your dbt test runs.
  3. Emit Metadata to DataHub:

    • Use the DataHub Python SDK to create and emit metadata change proposals (MCPs) for the test results. Here’s a basic example:
    import json

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    # Initialize the DataHub emitter and verify connectivity
    emitter = DatahubRestEmitter(gms_server='http://localhost:8080')  # Replace with your DataHub server URL
    emitter.test_connection()

    # Load the run_results.json produced by dbt test / dbt build
    with open('path_to_run_results.json') as f:
        run_results = json.load(f)

    # Example: emit a properties aspect for each test result.
    # Caution: this overwrites the dataset description with the test
    # status; adapt the aspect to the metadata you actually want to store.
    for result in run_results['results']:
        dataset_urn = make_dataset_urn(platform='dbt', name=result['unique_id'], env='PROD')
        dataset_properties = DatasetPropertiesClass(description=result['status'])

        # entityType and changeType (UPSERT) are inferred by the wrapper
        metadata_event = MetadataChangeProposalWrapper(
            entityUrn=dataset_urn,
            aspect=dataset_properties,
        )

        emitter.emit(metadata_event)

  4. Customize the Metadata:

    • Customize the metadata you want to emit based on the contents of run_results.json. You can include additional details such as execution time, status, and any other relevant information.
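
For illustration, here is a small sketch of a builder that carries test status and timing along as custom properties. The fields status and execution_time come from run_results.json; the property names dbt_status and dbt_execution_time_s are just conventions invented for this example:

    from datahub.metadata.schema_classes import DatasetPropertiesClass

    def build_properties(result: dict) -> DatasetPropertiesClass:
        # 'status' and 'execution_time' are standard run_results.json fields;
        # customProperties values must all be strings.
        return DatasetPropertiesClass(
            description=f"Last dbt test status: {result['status']}",
            customProperties={
                "dbt_status": str(result["status"]),
                "dbt_execution_time_s": str(result.get("execution_time", "")),
            },
        )

You could swap build_properties(result) into the loop above in place of the bare DatasetPropertiesClass construction.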

This approach lets you send test results to DataHub programmatically, without shuttling files between systems. You can automate it to run after each dbt test execution, as sketched below.
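
As a closing sketch of that automation, the snippet below runs the tests and then pushes the fresh results. Here emit_run_results is a hypothetical helper wrapping the emitter loop above, and target/run_results.json is dbt's default output location:

    import subprocess

    # Run the dbt tests; dbt exits non-zero when tests fail, but
    # run_results.json is still written, so don't abort on a failure.
    subprocess.run(["dbt", "test"], check=False)

    # Hypothetical helper: wraps the emitter loop from the example above.
    emit_run_results("target/run_results.json")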

Would you like more detailed steps or further assistance on any specific part of this process?
