Debugging 'Unable to emit metadata to DataHub GMS: java.lang.RuntimeException: Failed to produce MCLs' error after upgrading to 0.11.0

Original Slack Thread

Getting a new error after upgrading to 0.11.0. It is affecting one Snowflake ingestion, while multiple separate Snowflake ingestions are running completely fine. Curious if anyone has debugged this before.

```
'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
         'message': 'java.lang.RuntimeException: Failed to produce MCLs',
         'status': 500,
```

<@U058HUSFHLL> Could you please share the GMS logs? Steps are here: https://datahubproject.io/docs/how/extract-container-logs/

<@U04N9PYJBEW> might help you

Anything I’m looking for specifically <@U0348BYAS56>? I found the INGEST PROPOSAL for the dataset aspects that seem to be failing in each nightly ingestion, but nothing insightful in the logs besides a bunch of ingest proposals for the dataset.
Each nightly ingestion seems to be failing for the same table, which is interesting. Reverting back to 0.10.5 fixed it, but I would like to stay on 0.11.0.

It happens to be a very large table, 500M rows, but I don’t see why that would make a difference. Also… when I check that table in DataHub it looks like it was ingested correctly; we got the most recent row counts and the operation aspect for it, even though the ingestion finishes with a failure.

Can you post the full error? Is there any extra info after the “500”? It would be great to see every example of the 500 error as well. <@U04UKA5L5LK> any ideas on why we might be getting this error from gms?

I can only really grab the error from the Sink (datahub-rest) report; the logs themselves from the GMS have no clear error. This is the whole error…

```
[2023-10-10, 04:06:15 PDT] {{pod_manager.py:381}} INFO -                'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
[2023-10-10, 04:06:15 PDT] {{pod_manager.py:381}} INFO -                         'message': 'java.lang.RuntimeException: Failed to produce MCLs',
[2023-10-10, 04:06:15 PDT] {{pod_manager.py:381}} INFO -                         'status': 500,
[2023-10-10, 04:06:15 PDT] {{pod_manager.py:381}} INFO -                         'id': 'urn:li:dataset:(urn:li:dataPlatform:snowflake,db1.schema1.table_name,PROD)'}}],
```

I’m realizing that all the tables do in fact seem to be successfully ingested, but since it’s registering this error it’s marking the ingestion as failed.

Yeah, currently if there are any sink errors, we’ll report the ingestion as failed. We’re in the process of changing that, but that’ll take some time. We’re likely missing one aspect for one table at the moment.

Hmm, missing from the source side or the sink? Like, is specifically the upsert failing? Are there any checks I can do?

It just means that we failed to send one aspect to DataHub. Unfortunately, we don’t have great logs on what that aspect was, but you can see which URN it’s for.
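For anyone digging into this later, here is a minimal sketch of a check you could run, assuming the acryl-datahub Python package is installed and GMS is reachable at http://localhost:8080 (replace with your own host). It looks up a couple of common aspects for the failing URN and re-emits a small aspect directly, so that any "Failed to produce MCLs" 500 can be reproduced outside the nightly run:

```python
# Sketch only: assumes acryl-datahub is installed and GMS is at http://localhost:8080.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import DatasetPropertiesClass, StatusClass

# URN taken from the sink error above.
urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db1.schema1.table_name,PROD)"

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Check whether a couple of common aspects actually landed in GMS.
props = graph.get_aspect(entity_urn=urn, aspect_type=DatasetPropertiesClass)
status = graph.get_aspect(entity_urn=urn, aspect_type=StatusClass)
print("datasetProperties present:", props is not None)
print("status present:", status is not None)

# Re-emit a small aspect directly; if GMS cannot produce MCLs for this entity,
# the same 500 should surface here, with a fuller stack trace in the GMS logs.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.emit_mcp(
    MetadataChangeProposalWrapper(entityUrn=urn, aspect=StatusClass(removed=False))
)
```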

I see… in that case I’m not sure what the best way to get around this is. It is marking our daily Airflow task as a failure and causing noise. Anything you think I can try to get it working?

Ah I see. If it’s okay for now, you can skip ingesting the table via the table_pattern parameter (I assume this is Snowflake ingestion), which should avoid the error. And we’ll look into what’s causing this error.
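For reference, a rough sketch of that workaround, expressed as a recipe dict run through the DataHub Python SDK (credentials, account, and the denied table name are placeholders to adapt to your setup):

```python
# Sketch only: placeholder credentials and table name; adapt to your own recipe.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": "my_account",
                "username": "my_user",
                "password": "my_password",
                "warehouse": "my_warehouse",
                # Skip the one table whose aspect keeps failing to emit.
                "table_pattern": {
                    "deny": [r"db1\.schema1\.table_name"],
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```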

<@U04N9PYJBEW> If there is a bug opened for this or a way I can track it please let me know, thanks!

<@U04N9PYJBEW> Hey, checking in a few months later, this ingestion still fails each night because of one table’s aspect that is not correctly being sent to DataHub. Has there been any development on this? I would rather not exclude the table because its other aspects are getting emitted properly. Would prefer just a WARNING and the ingestion completing successfully, or ideally for the aspect to be emitted with proper logs if it fails. We are running 0.12.1.

Unfortunately we haven’t finished our ingestion status refactor yet, so this will still cause an ingestion failure. We don’t want to make all “failed to ingest” sink errors warnings, because if all MCPs fail to emit properly, that should certainly be considered an ingestion failure. Our desired end state is that we allow a certain percentage of sink failures, e.g. 1%, where the ingestion status is a warning, but if that threshold is passed we consider the ingestion failed. Building that requires counting how many MCPs are emitted and how many fail to get emitted, which we haven’t gotten around to yet. I’d like to get this in by the end of Q1 next year, but we don’t have any firm commitments there.
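To illustrate the thresholding idea described above (this is not existing DataHub code, just a sketch of the proposed behaviour with a hypothetical 1% threshold):

```python
# Hypothetical sketch of the proposed sink-failure threshold, not real DataHub code.
def ingestion_status(emitted_ok: int, emit_failures: int, threshold: float = 0.01) -> str:
    total = emitted_ok + emit_failures
    if total == 0 or emit_failures == 0:
        return "SUCCESS"
    failure_rate = emit_failures / total
    # A small fraction of sink failures downgrades the run to a warning
    # instead of a hard failure; above the threshold it is still a failure.
    return "WARNING" if failure_rate <= threshold else "FAILURE"

print(ingestion_status(emitted_ok=9990, emit_failures=10))   # WARNING (0.1% failed)
print(ingestion_status(emitted_ok=900, emit_failures=100))   # FAILURE (10% failed)
```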

Hi, I seemed to get this error too for Snowflake, but when I upgraded to 0.12.1 it worked. I keep getting this error, though, for Vertica and PowerBI resources.