Issue with `target_platform_instance` field population in data source ingestion recipe

Original Slack Thread

Hello :wave:.

I’ve been running a dbt source ingestion and have a question about the target_platform_instance that can be specified in the ingestion recipe.

I’m wondering whether it is a bug or intended behaviour that the given target platform instance is not populated into the expected field. For example, in a GraphQL query the dataPlatformInstance property of the target platform entity returns null (it is populated for the dbt platform entity).

I am using DataHub CLI version 0.13.1.3 to ingest into DataHub version 0.13.2.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?



To address your question about the target_platform_instance not being populated into the expected field, it appears that this issue has been observed before in different contexts, such as with MySQL and S3 ingestions. Here is a summary and step-by-step reasoning based on the provided data sources:

Summary
The issue where the dataPlatformInstance property returns null for the target platform entity, despite being specified in the ingestion recipe, is likely due to a race condition or a bug in the ingestion process. This has been observed in other contexts and versions of DataHub.

Step-by-Step Reasoning

  1. Observation of Similar Issues:

    • According to <Slack thread>, a user observed that the dataPlatformInstance was null until the dataset was ingested twice.
    • Another user reported a similar issue with MySQL ingestion in <Slack thread>, where the instance field was missing due to a race condition during parallel requests.
  2. Race Condition:

    • As per the discussion in <Slack thread>, the issue was identified as a race condition in which multiple requests tried to update the dataPlatformInstance aspect simultaneously. As a result, the instance field was sometimes omitted.
  3. S3 Ingestion Issue:

    • A similar problem was reported for S3 ingestion, where the dataPlatformInstance was missing from the dataset aspect, as detailed in <Slack thread>. The issue was acknowledged and a fix was proposed.
  4. Potential Bug:

    • Given the pattern of issues across different data sources and ingestion types, it is likely that the problem you are experiencing with the target_platform_instance not being populated is due to a similar bug or race condition in the ingestion process.
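The race condition described above can be pictured with a toy model. This is only a sketch under stated assumptions: a plain function stands in for the aspect store, each write replaces the whole aspect, and the platform name `snowflake` and instance name `prod_wh` are hypothetical. It is not DataHub's actual write path.

```python
from typing import Dict, List, Optional

# Toy model only: not DataHub code. An "aspect" is a flat dict here.
Aspect = Dict[str, Optional[str]]


def apply_writes(writes: List[Aspect]) -> Aspect:
    """Apply whole-aspect overwrites in arrival order; the last one wins."""
    stored: Aspect = {}
    for aspect in writes:
        stored = dict(aspect)  # full replace, no field-level merge
    return stored


# Hypothetical values for illustration.
with_instance: Aspect = {
    "platform": "urn:li:dataPlatform:snowflake",
    "instance": "urn:li:dataPlatformInstance:(urn:li:dataPlatform:snowflake,prod_wh)",
}
without_instance: Aspect = {
    "platform": "urn:li:dataPlatform:snowflake",
    "instance": None,
}

# The stored value depends purely on which write arrives last.
good_order = apply_writes([without_instance, with_instance])
bad_order = apply_writes([with_instance, without_instance])
print(good_order["instance"])
print(bad_order["instance"])
```

If two emitters write the same aspect and only one of them sets the instance, the outcome depends on ordering, which would also be consistent with a second ingestion run appearing to fix the field.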

Next Steps

  1. Check Logs and Configuration:

    • Ensure that your ingestion recipe is correctly configured and check the debug logs for any errors or warnings during the ingestion process.
  2. Re-run Ingestion:

    • As observed in some cases, re-running the ingestion might resolve the issue temporarily. However, this is not a permanent solution.
  3. Report the Issue:

    • If the problem persists, consider reporting it on the DataHub GitHub repository, providing details of your configuration and the issue. This will help the maintainers to investigate and address the bug.

Example Configuration
Here is an example of how your ingestion recipe might look:

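source:
    type: dbt
    config:
        # dbt artifacts; the dbt source reads these files rather than path_specs
        manifest_path: '<path_to_your_dbt_files>/manifest.json'
        catalog_path: '<path_to_your_dbt_files>/catalog.json'
        target_platform: 'your_target_platform'
        target_platform_instance: 'your_target_platform_instance'
        env: 'PROD'
sink:
    type: datahub-rest
    config:
        server: 'http://datahub-gms:8080'
        # single-threaded, synchronous writes, intended to avoid concurrent updates
        max_threads: 1
        mode: 'SYNC'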
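When a platform instance is set, the dataset URN of the target platform entity is, as far as I know, built with the instance prefixed to the dataset name, so it helps to check that you are querying the right URN. The helper below is a standalone sketch that mirrors what I understand to be the convention of `make_dataset_urn_with_platform_instance` in `datahub.emitter.mce_builder`; the function name `dataset_urn` and the sample values are hypothetical.

```python
from typing import Optional


def dataset_urn(
    platform: str,
    name: str,
    platform_instance: Optional[str] = None,
    env: str = "PROD",
) -> str:
    """Build a dataset URN, prefixing the name with the instance if one is set."""
    if platform_instance:
        name = f"{platform_instance}.{name}"
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical values for illustration.
print(dataset_urn("snowflake", "db.schema.table"))
print(dataset_urn("snowflake", "db.schema.table", platform_instance="prod_wh"))
```

If the URN with the instance prefix resolves to an entity but the one without it does not (or the other way round), that tells you which entity the dbt source actually linked to.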

GraphQL Query Example
To query the dataPlatformInstance property, you can use the following GraphQL query:

query getDataset {
  dataset(urn: "urn:li:dataset:(urn:li:dataPlatform:your_platform,your_dataset,PROD)") {
    urn
    type
    lastIngested
    dataPlatformInstance {
      urn
    }
  }
}
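To run this check from a script rather than the UI, something like the following sketch could be used. It only builds the request body and reads the field out of a response; sending it (for example as a POST to the GMS `/api/graphql` endpoint with a bearer token) is left out, and that endpoint path is an assumption to verify against your deployment. The helper names are hypothetical.

```python
import json
from typing import Any, Dict, Optional

QUERY = """
query getDataset($urn: String!) {
  dataset(urn: $urn) {
    urn
    dataPlatformInstance {
      urn
    }
  }
}
"""


def build_payload(urn: str) -> str:
    """Serialize the GraphQL request body for the given dataset URN."""
    return json.dumps({"query": QUERY, "variables": {"urn": urn}})


def extract_instance_urn(response: Dict[str, Any]) -> Optional[str]:
    """Return the dataPlatformInstance URN, or None if it is missing or null."""
    dataset = (response.get("data") or {}).get("dataset") or {}
    instance = dataset.get("dataPlatformInstance") or {}
    return instance.get("urn")


# A response shaped like the one described in the question (field is null).
sample = {"data": {"dataset": {"urn": "urn:li:dataset:(...)", "dataPlatformInstance": None}}}
print(extract_instance_urn(sample))
```

Running the extractor before and after a second ingestion run would show whether the field only appears on re-ingestion, as reported in the earlier threads.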

Sources:

  • <Slack Thread 1>
  • <Slack Thread 2>
  • <Slack Thread 3>