Configuring DataHub Ingestion of dbt Artifacts from Azure DevOps to a Databricks Cluster

Original Slack Thread

I need to configure DataHub to ingest dbt artifacts. My dbt code lives in an Azure DevOps repository and runs on a Databricks cluster. To configure this ingestion I need the manifest path, the catalog path, and the target platform. What should my recipe look like if my dbt code is in Azure DevOps and executed on a Databricks cluster?

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

To configure DataHub to ingest dbt artifacts when your dbt code is stored in an Azure DevOps repository and executed in a Databricks cluster, you need to ensure that the necessary dbt artifact files (manifest.json, catalog.json, etc.) are accessible to DataHub. Here’s a general outline of how your ingestion recipe might look:

  1. Ensure Access to Artifacts: First, ensure that the dbt artifacts are generated and stored in a location accessible to DataHub. This could be an S3 bucket, Azure Blob Storage, or any other storage service that DataHub can access.

  2. Configure the Ingestion Recipe: You will need to specify the paths to the manifest and catalog files, as well as the target platform. Since your dbt code is executed in a Databricks cluster, you might be using a cloud storage service to store these artifacts.

Here is an example configuration for your ingestion recipe:

source:
  type: "dbt"
  config:
    manifest_path: "s3a://your-bucket/path/to/manifest.json"  # Adjust the path as needed
    catalog_path: "s3a://your-bucket/path/to/catalog.json"    # Adjust the path as needed
    sources_path: "s3a://your-bucket/path/to/sources.json"    # Optional
    run_results_paths: ["s3a://your-bucket/path/to/run_results.json"]  # Optional
    include_column_lineage: true
    infer_dbt_schemas: true
    include_compiled_code: true
    enable_meta_mapping: true
    target_platform: "databricks"  # Ensure this is set correctly
    entities_enabled:
      models: 'Yes'
      sources: 'Yes'
      seeds: 'Yes'
      test_definitions: 'Yes'
      test_results: 'Yes'
    stateful_ingestion:
      remove_stale_metadata: true

sink:
  type: datahub-rest
  config:
    server: "http://your-datahub-server:8080"
    token: "your-datahub-token"

  3. Ensure Connectivity: Make sure that DataHub has the necessary permissions and network access to read from the storage location where your dbt artifacts are stored.

  4. Run the Ingestion: Use the DataHub CLI to run the ingestion with the configured recipe (a minimal programmatic alternative is sketched below).
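
If you prefer to kick off the ingestion from Python instead of the CLI (for example, as a step in your Databricks job), here is a minimal sketch using DataHub's programmatic pipeline API. It assumes the acryl-datahub package (with the dbt plugin) is installed; the paths, token, and server address are placeholders to adjust:

    from datahub.ingestion.run.pipeline import Pipeline

    # Build the same recipe shown above as a Python dict and run it in-process.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "s3a://your-bucket/path/to/manifest.json",
                    "catalog_path": "s3a://your-bucket/path/to/catalog.json",
                    "target_platform": "databricks",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {
                    "server": "http://your-datahub-server:8080",
                    "token": "your-datahub-token",
                },
            },
        }
    )
    pipeline.run()                 # Execute the ingestion
    pipeline.raise_from_status()   # Fail loudly if any step reported errors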

If you have any specific questions or need further assistance with any part of this process, feel free to ask!

<@U06TM7M40RK>
Ingestion pipeline raised an unexpected exception!: [Errno 2] No such file or directory
The pipeline raised a “no such file or directory” error when I tried to access Azure Blob Storage.

It seems likely that the error you’re encountering, [Errno 2] No such file or directory, is related to the configuration of your DataHub ingestion pipeline when trying to access Azure Blob Storage. This error typically indicates that the specified file path is not found or accessible. Here are some steps to troubleshoot and resolve the issue:

  1. Verify File Path: Ensure that the file path specified in your ingestion recipe is correct and points to the actual location of the file in Azure Blob Storage. The path should include the correct container and blob name (a quick Python check is sketched after this list).

  2. Check Azure Blob Storage Configuration: Make sure that your Azure Blob Storage configuration in the ingestion recipe is set up correctly. This includes specifying the correct account name, container, and any necessary credentials.

  3. Use the Correct URI Scheme: When accessing Azure Blob Storage, ensure that you are using the correct URI scheme. For example, use https://<account>.blob.core.windows.net/<container>/<blob>.

  4. Ensure Permissions: Verify that the DataHub service has the necessary permissions to access the Azure Blob Storage. This might involve setting up appropriate access keys or using a service principal with the required permissions.

  5. Example Configuration: Here is an example of how you might configure the Azure Blob Storage access in your ingestion recipe:

    source:
      type: "abs"
      config:
        account_name: "your_account_name"
        container: "your_container_name"
        path: "path/to/your/file"
        azure_config:
          client_id: "your_client_id"
          client_secret: "your_client_secret"
          tenant_id: "your_tenant_id"
    
  6. Check for Known Issues: There might be known issues with the version of DataHub or the Azure Blob Storage ingestion plugin you are using. According to <https://github.com/datahub-project/datahub/issues/12595|this GitHub issue>, there might be specific configurations or updates required.
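
To act quickly on item 1, the sketch below (assuming the azure-storage-blob package; the account, container, blob name, and credential values are placeholders) checks whether the exact path you intend to reference actually resolves to a blob:

    from azure.storage.blob import BlobClient

    blob = BlobClient(
        account_url="https://<account>.blob.core.windows.net",
        container_name="<container>",
        blob_name="path/to/manifest.json",
        credential="<sas-token-or-account-key>",
    )
    # False usually means a typo in the container or blob name (or the wrong account).
    print("Blob exists:", blob.exists())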

Would you like more detailed steps or further assistance on any specific part of this process?

Your ingestion recipe for DataHub using dbt looks mostly correct, but there are a few considerations and potential improvements to ensure it works as expected:

  1. Ensure Correct URI Scheme: The paths to manifest.json and catalog.json are specified as URLs. Ensure that these URLs are accessible and that the necessary permissions are in place for DataHub to access them.

  2. Azure Blob Storage Access: If you are accessing files from Azure Blob Storage, you need to ensure that the DataHub ingestion process has the necessary credentials to access these files. This typically involves setting up an Azure connection configuration.

  3. Check for Required Fields: Ensure that all required fields are included in your configuration. For example, if you are using Azure Blob Storage, you might need to specify additional configuration for authentication.

  4. Example Configuration with Azure Blob Storage: Here is an example of how you might configure the Azure Blob Storage access in your ingestion recipe:

    source:
      type: dbt
      config:
        manifest_path: "https://saeuwedatalakedev01.blob.core.windows.net/test/dbt/manifest.json"
        catalog_path: "https://saeuwedatalakedev01.blob.core.windows.net/test/dbt/catalog.json"
        include_column_lineage: true
        infer_dbt_schemas: true
        include_compiled_code: true
        enable_meta_mapping: true
        target_platform: databricks
        entities_enabled:
          models: 'Yes'
          sources: 'Yes'
          seeds: 'Yes'
          test_definitions: 'Yes'
          test_results: 'Yes'
        stateful_ingestion:
          remove_stale_metadata: true
        azure_config:
          client_id: "your_client_id"
          client_secret: "your_client_secret"
          tenant_id: "your_tenant_id"
    
  5. Verify Permissions: Ensure that the Azure Blob Storage account has the necessary permissions set up to allow access from DataHub. This might involve setting up a service principal or using access keys.

  6. Test Connectivity: Before running the ingestion, test the connectivity to the Azure Blob Storage URLs to ensure they are accessible from the environment where DataHub is running (a short Python check is sketched below).
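
For items 5 and 6, a small sketch like the following (assuming the azure-identity and azure-storage-blob packages; the service-principal values are placeholders, and the account/container/blob names are taken from the example URLs above) verifies that the credentials you plan to use can actually read the manifest:

    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobClient

    credential = ClientSecretCredential(
        tenant_id="your_tenant_id",
        client_id="your_client_id",
        client_secret="your_client_secret",
    )
    blob = BlobClient(
        account_url="https://saeuwedatalakedev01.blob.core.windows.net",
        container_name="test",
        blob_name="dbt/manifest.json",
        credential=credential,
    )
    data = blob.download_blob().readall()  # Raises if the principal lacks read access
    print("Downloaded", len(data), "bytes")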

If you have any specific questions or need further assistance with any part of this process, feel free to ask!

<@U06TM7M40RK>
The recipe does not work for me now; it says: Pipeline Error: Ingestion pipeline raised an unexpected exception!: Expecting value: line 1 column 1 (char 0)

The error message “Expecting value: line 1 column 1 (char 0)” typically indicates that the JSON decoder is encountering an empty response or an invalid JSON format. This can happen if the file paths specified in your recipe are not accessible or if there is an issue with the content of the files. Here are some steps to troubleshoot and resolve the issue:

  1. Verify File Accessibility: Ensure that the URLs specified for manifest_path and catalog_path are correct and accessible. You can test this by opening the URLs in a web browser or using a tool like curl to check the response.

  2. Check JSON Format: Make sure that the manifest.json and catalog.json files are correctly formatted JSON files. You can use online JSON validators to check the syntax.

  3. Inspect Response Content: Before attempting to decode the JSON, print out the response content to see what is being returned. This can help identify if the response is empty or contains an error message.

  4. Enable Debug Mode: Run the ingestion process with debug logs enabled to get more detailed information about where the error is occurring. You can enable debug logs by setting the --debug flag when running the ingestion command:

    datahub --debug ingest -c your_recipe.yml
    
  5. Check for Authentication Issues: Ensure that your requests to Azure Blob Storage are properly authenticated. Missing or incorrect authentication tokens can result in unexpected responses. Make sure that the azure_config section is correctly configured with the necessary credentials.

  6. Example Code to Print Response Content: You can run a small standalone script to print the response content for debugging purposes (a follow-up sketch below also checks that the body parses as JSON):

    import requests

    url = "https://saeuwedatalakedev01.blob.core.windows.net/test/dbt/manifest.json"
    response = requests.get(url)
    print("Status Code:", response.status_code)       # 404 suggests a wrong path; 403 suggests missing permissions
    print("Response Content:", response.text[:500])   # An XML error body here means Azure rejected the request
    

If you provide more details about the specific part of the process where the error occurs, I can give you more specific guidance. Would you like more detailed steps or further assistance on any specific part of this process?

<@U06TM7M40RK>
How do I authenticate to Azure Blob Storage within this dbt pipeline?

To authenticate to Azure Blob Storage within a dbt pipeline for DataHub ingestion, you need to configure the Azure credentials properly. Here’s how you can set up authentication using Azure credentials in your ingestion recipe:

  1. Azure Credentials: You can authenticate to Azure Blob Storage using one of the following methods:

    • Account Key: Use the storage account name and account key.
    • Shared Access Signature (SAS) Token: Use a SAS token for temporary access.
    • Azure Active Directory (AAD) with Client Secret: Use a client ID, client secret, and tenant ID for authentication.
  2. Configure Azure Authentication in the Recipe: Depending on the method you choose, you need to include the appropriate configuration in your ingestion recipe.

Here’s an example configuration using Azure Active Directory (AAD) with a client secret:

source:
  type: dbt
  config:
    manifest_path: "https://saeuwedatalakedev01.blob.core.windows.net/test/dbt/manifest.json"
    catalog_path: "https://saeuwedatalakedev01.blob.core.windows.net/test/dbt/catalog.json"
    include_column_lineage: true
    infer_dbt_schemas: true
    include_compiled_code: true
    enable_meta_mapping: true
    target_platform: databricks
    entities_enabled:
      models: 'Yes'
      sources: 'Yes'
      seeds: 'Yes'
      test_definitions: 'Yes'
      test_results: 'Yes'
    stateful_ingestion:
      remove_stale_metadata: true
    azure_config:
      account_name: "your_account_name"
      client_id: "your_client_id"
      client_secret: "your_client_secret"
      tenant_id: "your_tenant_id"

  3. Ensure Permissions: Make sure that the Azure AD application has the necessary permissions to access the Blob Storage. Typically, you would assign the “Storage Blob Data Contributor” role to the application.

  4. Test Connectivity: Before running the ingestion, test the connectivity to Azure Blob Storage using the configured credentials to ensure they are correct.

  5. Use the Correct URI Scheme: Ensure that the paths to the manifest and catalog files use the correct URI scheme (https://). If you would rather not keep credentials in the recipe, a time-limited SAS URL can be appended to each path instead (a sketch follows below).
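
For the SAS option listed above, a sketch like this (assuming the azure-storage-blob package and access to the storage account key, which is a placeholder) generates a time-limited, read-only SAS URL; the resulting URL can then be used directly as manifest_path or catalog_path without embedding credentials in the recipe:

    from datetime import datetime, timedelta

    from azure.storage.blob import BlobSasPermissions, generate_blob_sas

    account_name = "saeuwedatalakedev01"
    container_name = "test"
    blob_name = "dbt/manifest.json"

    sas_token = generate_blob_sas(
        account_name=account_name,
        container_name=container_name,
        blob_name=blob_name,
        account_key="<storage-account-key>",  # Placeholder: key for the storage account
        permission=BlobSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1),  # Read access for one hour
    )
    print(f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}")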

If you have any specific questions or need further assistance with any part of this process, feel free to ask!

<@U06TM7M40RK> This does not work within this dbt recipe; I cannot add azure_config because it is not recognized.

The DataHub documentation I have ingested does not mention how to directly include Azure authentication details like azure_config within a dbt ingestion recipe. You might want to ensure that the Azure Blob Storage paths are publicly accessible, or use a different method to authenticate, such as setting up a network configuration that allows access without explicit credentials in the recipe. If this doesn’t help, please reach out to the DataHub support team for further assistance.
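
One practical workaround, sketched below on the assumption that the dbt source can read local file paths and that the azure-storage-blob package is available (the connection-string environment variable and the /tmp paths are placeholders): download the artifacts from Blob Storage in a step that runs just before the ingestion, then point manifest_path and catalog_path at the local copies.

    import os

    from azure.storage.blob import BlobServiceClient

    # Authenticate with a storage connection string held outside the recipe.
    service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    container = service.get_container_client("test")

    for blob_name, local_path in [
        ("dbt/manifest.json", "/tmp/manifest.json"),
        ("dbt/catalog.json", "/tmp/catalog.json"),
    ]:
        with open(local_path, "wb") as f:
            f.write(container.download_blob(blob_name).readall())
        print("Downloaded", blob_name, "->", local_path)

    # The recipe can then use:
    #   manifest_path: "/tmp/manifest.json"
    #   catalog_path: "/tmp/catalog.json"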

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><