Troubleshooting Configuration Errors in DataHub Ingestion Pipelines

Original Slack Thread

<@U06TM7M40RK>

    with _add_init_error_context(f"configure the source ({self.source_type})"):
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/tmp/datahub/ingest/venv-bigquery-bcd0fd86d8666b0c/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 141, in _add_init_error_context
    raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (bigquery): A graph instance is required. Either pass one in the pipeline context, or set it explicitly in the stateful ingestion provider config.

The error message you’re encountering indicates that there is a problem with configuring the BigQuery source in your DataHub ingestion pipeline. Specifically, it states that a graph instance is required, but it is not being provided in the pipeline context or explicitly set in the stateful ingestion provider configuration.

According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state_provider/datahub_ingestion_checkpointing_provider.py|DataHub Ingestion Checkpointing Provider>, a graph instance is necessary for stateful ingestion. This graph instance can be set either in the pipeline context or explicitly in the stateful ingestion provider configuration.

Here are a few steps you can take to resolve this issue:

  1. Check Pipeline Context: Ensure that your pipeline context is correctly configured to include a graph instance. This can be done by setting the graph attribute in the PipelineContext (see the sketch after this list).

  2. Explicit Configuration: If the graph instance is not set in the pipeline context, you can explicitly configure it in the stateful ingestion provider configuration. This involves setting the datahub_api parameter in the DatahubIngestionStateProviderConfig to a valid DatahubClientConfig.

  3. Review Configuration Files: Double-check your configuration files (e.g., YAML files) to ensure that all necessary parameters are correctly specified and that there are no missing configurations related to the graph instance.
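For step 1, here is a minimal sketch of what passing a graph instance through the pipeline context can look like when you drive ingestion from Python rather than a YAML recipe. The import paths and the graph keyword argument reflect my reading of the DataHub Python SDK and may differ between versions; the server URL is a placeholder.

from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Graph client pointed at your DataHub GMS endpoint (placeholder URL).
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Pipeline context that carries the graph; sources constructed with this
# context can resolve it for stateful ingestion checkpointing instead of
# failing with "A graph instance is required".
ctx = PipelineContext(run_id="bigquery-stateful-run", graph=graph)

When a recipe is run through the CLI instead, the pipeline has to build this graph itself from the DataHub connection settings in the recipe, which is what steps 2 and 3 above are about.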

Would you like more detailed guidance on how to implement these steps?

<@U06TM7M40RK> show me examples of setting a graph instance in a recipe YAML for a BigQuery source

Setting a graph instance in a recipe YAML for a BigQuery source most likely comes down to configuring the stateful ingestion settings so that they point at a DataHub graph. However, I’m not entirely certain, as the available sources do not provide a direct example of this configuration.

Based on the available information, you might consider the following approach:

  1. Stateful Ingestion Configuration: Ensure that your YAML configuration includes the necessary settings for stateful ingestion. This typically involves specifying the stateful_ingestion section in your YAML file.

  2. DataHub API Configuration: You may need to configure the datahub_api section to specify the server endpoint for the DataHub metadata service (GMS).

Here’s a conceptual example of how you might structure your YAML configuration:

source:
  type: bigquery
  config:
    project_id: your_project_id
    credential:
      project_id: your_project_id
      private_key_id: your_private_key_id
      private_key: your_private_key
      client_email: your_client_email
      client_id: your_client_id
    include_table_lineage: true
    include_view_lineage: true
    profiling:
      enabled: true
    stateful_ingestion:
      enabled: true
      datahub_api:
        server: "http://localhost:8080"  # Replace with your DataHub GMS endpoint

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL

This example assumes that you have a DataHub GMS running at http://localhost:8080. The datahub_api section under stateful_ingestion is where you might specify the graph instance configuration.
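As a hedged alternative (I’m not certain of the exact field placement across DataHub versions), the recipe may also accept a datahub_api block at the top level, which the pipeline can use to construct the graph instance itself. Here is a sketch of that variant driven from Python with Pipeline.create; all endpoints and identifiers are placeholders.

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery",
            "config": {
                "project_id": "your_project_id",
                "stateful_ingestion": {"enabled": True},
            },
        },
        # Top-level DataHub client config; the pipeline can use this to build
        # the graph instance that the stateful ingestion error complains about.
        "datahub_api": {"server": "http://localhost:8080"},
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()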

Would you like more detailed steps or additional examples on any specific part of this process?

<@U06TM7M40RK> [2025-01-02 19:05:59,806] ERROR {datahub.ingestion.run.pipeline:78} - failed to write record with workunit urn:li:container:2b83b5081f4169d2a2a996eab644700f-dataPlatformInstance with KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="[Errno 110] Connection timed out"} and info {'error': ValueSerializationError(KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="[Errno 110] Connection timed out"}), 'msg': 'Failed to write record: KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="[Errno 110] Connection timed out"}'}

The error message you’re encountering indicates a ValueSerializationError caused by a connection timeout. This typically means that the DataHub ingestion pipeline could not serialize and send a record to Kafka; because the failure occurs during value serialization, the timeout can point at the schema registry (which the serializer contacts) as well as the broker itself. Here are some steps to troubleshoot and resolve this issue:

  1. Check Kafka Broker Status: Ensure that your Kafka broker is running and accessible. You can verify the status of the Kafka broker using commands like kubectl get pods if you’re using Kubernetes, or by checking the Kafka service status directly.

  2. Network Connectivity: Verify that there are no network issues preventing the DataHub ingestion pipeline from reaching the Kafka broker. This includes checking firewall rules, network policies, and ensuring that the correct ports are open.

  3. Kafka Configuration: Double-check your Kafka configuration settings in the DataHub ingestion pipeline. Ensure that the bootstrap.servers configuration is correctly set to the address of your Kafka broker.

  4. Serialization Configuration: Ensure that the serialization settings for Kafka are correctly configured. This includes setting the appropriate serializers for the key and value of the Kafka messages.

  5. Increase Timeout Settings: Consider increasing the timeout settings for Kafka connections to allow more time for the connection to be established. This can be done by adjusting the request.timeout.ms or similar configuration settings (see the sketch after this list).

  6. Review Logs: Check the logs of both the DataHub ingestion pipeline and the Kafka broker for any additional error messages or warnings that might provide more context about the issue.
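For steps 3 through 5, here is a hedged sketch of how the Kafka connection details and a longer producer timeout can be expressed when the recipe writes through the datahub-kafka sink. The broker and schema-registry addresses and the timeout value are placeholders, and the exact field names should be checked against your DataHub version's sink documentation.

from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery",
            "config": {"project_id": "your_project_id"},
        },
        "sink": {
            "type": "datahub-kafka",
            "config": {
                "connection": {
                    # Kafka broker(s) the sink writes metadata records to.
                    "bootstrap": "broker:9092",
                    # Schema registry used while serializing record values; a
                    # timeout here can also surface as a serialization error.
                    "schema_registry_url": "http://schema-registry:8081",
                    # Extra settings passed through to the underlying Kafka
                    # producer; raising the request timeout allows more time
                    # before a send attempt is abandoned.
                    "producer_config": {"request.timeout.ms": 60000},
                },
            },
        },
    }
)
pipeline.run()
pipeline.raise_from_status()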

Would you like more detailed guidance on any specific part of this process?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @ RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><