Troubleshooting Metadata Ingestion Error 'Failed to produce MCLs'

Original Slack Thread

Hi All,
I am trying to ingest metadata as file-based lineage and I am getting the following error:

```
'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
         'message': 'java.lang.RuntimeException: Failed to produce MCLs',
         'status': 500,
         'urn': 'urn:li:dataset:(urn:li:dataPlatform:looker,thirty_madison_snowflake.view.real_time_benefits,PROD)'}}]
```
Can you please suggest where I should look to debug this issue? I am not sure why we are getting the error `Failed to produce MCLs`.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?



To help you effectively, I need a bit more information:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery, MSSQL)?

Additionally, if you could provide any error messages or issues you’re encountering, that would be very helpful.

Based on the error message `Failed to produce MCLs`, it seems like there might be an issue with Metadata Change Log (MCL) production in the DataHub GMS (Generalized Metadata Service). This error can occur for various reasons, such as misconfiguration, connectivity issues, or specific bugs in the DataHub version you are using.

Here are some steps and considerations to help troubleshoot and resolve the issue:

  1. Check Logs for Detailed Errors:

    • Review the logs from the DataHub ingestion executor and the DataHub GMS to identify any specific error messages. This can provide insights into why the jobs are marked as failed.
    • You can access the logs by clicking on the Details button of the failed ingestion run in the UI.
  2. Common Issues:

    • Recipe Misconfiguration: Ensure that your ingestion recipe is correctly configured. Refer to the <https://datahubproject.io/docs/metadata-ingestion|Metadata Ingestion Framework> source docs for the required configurations.
    • Connectivity / Network Reachability: Verify that DataHub can reach your data source. Network issues can cause ingestion failures.
    • Authentication: Ensure that the credentials provided in your ingestion recipe are correct and have the necessary permissions to access the data source.
  3. Specific Error Messages:

    • If you encounter errors related to encoding or other specific issues, you may need to adjust your configuration accordingly.

For more detailed steps on debugging failed ingestion runs, you can refer to the Debugging a Failed Ingestion Run guide in the DataHub ingestion docs.

Would you like more detailed steps or specific examples based on your setup?


Answering the questions:

• CLI
• 0.13.0
• File as a source
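
For context, a minimal file-based lineage recipe run from the CLI could look roughly like the sketch below. The source type `datahub-lineage-file` comes from DataHub's file-based lineage source; the `./lineage.yml` path and the `${DATAHUB_GMS_TOKEN}` environment variable are illustrative placeholders (not from this thread), and the GMS address mirrors the in-cluster one shared below.

```
# recipe.yml -- sketch of a file-based lineage ingestion recipe.
# Paths and the token variable are placeholders.
source:
  type: datahub-lineage-file
  config:
    file: ./lineage.yml                 # lineage definitions to ingest
sink:
  type: datahub-rest
  config:
    server: http://datahub-datahub-gms.datahub.svc.cluster.local:8080
    token: "${DATAHUB_GMS_TOKEN}"       # personal access token, expanded from the environment
```

Such a recipe would be run with `datahub ingest -c recipe.yml`.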

Some background on the issue: I am trying to run the DataHub ingestion within the same EKS cluster where my DataHub instance is running.

The pod that is running the job has the DataHub connection configured as follows:

```
server: http://datahub-datahub-gms.datahub.svc.cluster.local:8080
token: <my token>
```
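
As an aside, when the ingestion job runs inside the cluster, the endpoint and token are often injected as environment variables rather than written into the config file; the DataHub CLI also honors `DATAHUB_GMS_URL` / `DATAHUB_GMS_TOKEN` (check the CLI docs for your version). A sketch of the relevant container spec, assuming a hypothetical Kubernetes Secret named `datahub-access-token`:

```
# Hypothetical excerpt of the ingestion pod/container spec: pull the GMS URL
# and token from a Secret instead of hardcoding the token.
env:
  - name: DATAHUB_GMS_URL
    value: http://datahub-datahub-gms.datahub.svc.cluster.local:8080
  - name: DATAHUB_GMS_TOKEN
    valueFrom:
      secretKeyRef:
        name: datahub-access-token   # assumed Secret name
        key: token                   # assumed key within the Secret
```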

I would inspect the logs from GMS; this error indicates that GMS is returning a 500 from the Rest.li endpoint. There is likely a stack trace in the GMS logs which contains additional information.
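
Worth adding: GMS produces the MCL events onto Kafka, so a failed write to Kafka is exactly what surfaces as this 500 from the Rest.li endpoint. Besides the stack trace, it can help to confirm which broker address GMS is configured to use; in the standard Docker/Helm setup this is the `KAFKA_BOOTSTRAP_SERVER` environment variable on the GMS container (the value below is illustrative, not from this thread).

```
# Illustrative excerpt of the datahub-gms container spec: the Kafka bootstrap
# address GMS uses when producing MetadataChangeLog events. If this broker is
# unreachable, writes fail with "Failed to produce MCLs".
env:
  - name: KAFKA_BOOTSTRAP_SERVER
    value: prerequisites-kafka:9092   # service name/port depend on your Helm values
```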

Thanks <@U03MF8MU5P0> for pointing me in the right direction. Looking into the datahub-gms logs, I could see the following error:

```
Connection to node 0 (prerequisites-kafka-broker-0.prerequisites-kafka-broker-headless.datahub.svc.cluster.local/10.40.1.153:9092) could not be established. Broker may not be available
```

I could see that the container in prerequisites-kafka-broker is in a failed state. Looking into the logs for the Kafka container, I found the following error:

```
java.io.IOException: No space left on device
```

Can you please let me know how I can clear old logs to free up space? Alternatively, is there a way to increase the size of the disk?

Bumping this up, as it is causing an outage in our ingestion pipeline.
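
No resolution was posted in the thread, but two common approaches to this situation are: (1) lower the broker's retention (for example Kafka's `log.retention.hours` / `log.retention.bytes` broker settings, or per-topic `retention.ms`) so old log segments get cleaned up, and (2) expand the broker's PersistentVolumeClaim, which Kubernetes can do in place when the StorageClass has `allowVolumeExpansion: true`. A hedged sketch of the second option, assuming the PVC is named `data-prerequisites-kafka-broker-0` (find the real name with `kubectl get pvc -n datahub`) and an increase to 20Gi:

```
# pvc-resize.yaml -- sketch of a strategic merge patch that grows the Kafka
# broker's volume. Apply with:
#   kubectl -n datahub patch pvc data-prerequisites-kafka-broker-0 --patch-file pvc-resize.yaml
# The PVC name and target size are assumptions; the StorageClass must allow
# volume expansion, and the broker pod may need a restart to use the extra space.
spec:
  resources:
    requests:
      storage: 20Gi
```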