Debugging issues with dbt lineage loading in v0.10.5 and lag in the Kafka consumer group

Original Slack Thread

hi folks, I am running into a problem loading dbt lineage on v0.10.5

I had previously loaded lineage via datahub ingest of a dbt manifest.json and catalog.json. Then I did some cleanup of datasets that were named incorrectly via the /entities?action=delete endpoint.
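(For reference, the deletes described here go against GMS's entities endpoint; a rough sketch of one such call is below. The host/port and the example urn are placeholders, not values from this thread.)

```
# Hard-delete a single dataset urn via the GMS REST endpoint mentioned above.
# localhost:8080 and the example urn are placeholders.
curl -X POST 'http://localhost:8080/entities?action=delete' \
  -H 'X-RestLi-Protocol-Version: 2.0.0' \
  -H 'Content-Type: application/json' \
  --data '{"urn": "urn:li:dataset:(urn:li:dataPlatform:dbt,my_db.my_schema.my_model,PROD)"}'
```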

When I rerun datahub ingest I can see that the datasets are reloaded, but they aren't showing up in searches, and lineage is not reflected in the UI (or via GraphQL).

Any thoughts on where to start looking to debug this?

Hey <@U046TV6LVT5>, we had a similar issue; in our case the problem was with ElasticSearch, which had trouble refreshing the indexes.

Thanks <@U03NZ8JDYR4>. Were you able to diagnose that via any specific error messages in the log? And do you recall which indexes were affected, or was it all of them?

So it doesn't look like our issue was related to ElasticSearch. We noticed quite a bit of lag on the kafka consumer group generic-mae-consumer-job-client, which we started looking at after tracing log messages like "Invoking MCL hooks for urn:" that seemed to be re-issuing deletes. Working on the theory that the delete messages were being reprocessed by that consumer group, we manually updated its offset to latest. After doing this I was able to re-publish the dbt lineage; datasets are searchable again, and the lineage is showing up as expected. One oddity is that the dbt models and Snowflake tables are no longer composed together, but I haven't had time to dig into that yet.

The bigger question is what was causing the lag in this consumer group. I've pulled up some of the MCL messages in the topic near where our consumer got stuck, and they look like the deletes themselves. Does anyone have any suggestions for how to dig further?
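For anyone retracing these steps, the lag check, the inspection near the stuck offset, and the offset reset described above can all be done with the stock Kafka CLI tools. A sketch only: the broker address is a placeholder, and the topic name and offset are the ones that show up in the logs further down this thread.

```
# Placeholder broker address; the group and topic names are the ones from this thread.
BOOTSTRAP=broker:9092
GROUP=generic-mae-consumer-job-client

# 1. See how far behind the group is on each partition (LAG column).
kafka-consumer-groups.sh --bootstrap-server "$BOOTSTRAP" --group "$GROUP" --describe

# 2. Peek at the MCL messages sitting around the stuck offset.
kafka-console-consumer.sh --bootstrap-server "$BOOTSTRAP" \
  --topic dp-datahub-metadata-changelog-versioned-v1 \
  --partition 0 --offset 80749 --max-messages 5

# 3. Skip past them, as described above. The group must have no active members
#    (scale the consumer down first), and anything between the old and new
#    offsets will not be processed.
kafka-consumer-groups.sh --bootstrap-server "$BOOTSTRAP" --group "$GROUP" \
  --reset-offsets --to-latest --all-topics --execute
```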

A few things to take a look at:

  1. Are you running standalone consumers, or consumers embedded within GMS? Splitting out the consumers can reduce load on GMS and give more resources specifically to the consumers, which can help if the issue is timeout related.
  2. ElasticSearch resource consumption and metrics: deletes are fairly hefty on the ES side and can't be batched as efficiently. If a delete isn't getting processed, definitely take a look there.
  3. Depending on your answer to 1, the logs to look at would be GMS or the MAE consumer. If any errors are happening on the message processing itself, they should be there.
  4. You mentioned that "the dbt models and Snowflake tables are no longer composed together, but I haven't had time to dig into that yet". When you inspected the topic, were the messages you looked at particularly large, and has the size gone down now that they're not composed? It could be that they went over the default size limit of 1 MB, in which case you may need to increase it on both the consumer and server side (see the sketch after this list).
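If message size does turn out to be the culprit, the sketch below shows where the broker-side limit lives. The broker address is a placeholder, the MCL topic name is the one from the logs later in this thread, and how the matching producer/consumer limits are exposed in DataHub depends on your version and deployment.

```
# Placeholder broker address.
BOOTSTRAP=broker:9092

# Check the current per-topic limit on the versioned MCL topic
# (falls back to the broker default, roughly 1 MB, if unset).
kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --describe \
  --entity-type topics --entity-name dp-datahub-metadata-changelog-versioned-v1

# Raise it to ~5 MB on the topic.
kafka-configs.sh --bootstrap-server "$BOOTSTRAP" --alter \
  --entity-type topics --entity-name dp-datahub-metadata-changelog-versioned-v1 \
  --add-config max.message.bytes=5242880

# The producer (max.request.size) and consumer (max.partition.fetch.bytes,
# fetch.max.bytes) sides have their own limits that need to be raised to match.
```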

Hey, thanks for the info Ryan.
> Are you running standalone consumers, or consumers embedded within GMS?
I'm not sure I understand this question, so the answer is probably whichever is the easier/default option. I think you might be on the right track with the timeout being an issue: I seem to have recreated the problem this morning by issuing a batch of deletes. I see this in the log (this one I remember seeing last time as well):

```
...mer-job-client-5-4138b9b7-686c-4568-8100-ca963508750f sending LeaveGroup request to coordinator 10.161.112.227:9092 (id: 2147483635 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
```
and then a couple of minutes later I see these messages (which I don't remember seeing last time, but that doesn't mean they weren't there):
```
2023-10-04 13:42:05,756 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Failing OffsetCommit request since the consumer is not part of an active group
2023-10-04 13:42:05,756 [ThreadPoolTaskExecutor-1] WARN  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Synchronous auto-commit of offsets {dp-datahub-metadata-changelog-timeseries-v1-0=OffsetAndMetadata{offset=6, leaderEpoch=13, metadata=''}, dp-datahub-metadata-changelog-versioned-v1-0=OffsetAndMetadata{offset=81164, leaderEpoch=19, metadata=''}} failed: Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group.
2023-10-04 13:42:05,756 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Giving away all assigned partitions as lost since generation has been reset, indicating that consumer is no longer part of the group
2023-10-04 13:42:05,756 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Lost previously assigned partitions dp-datahub-metadata-changelog-timeseries-v1-0, dp-datahub-metadata-changelog-versioned-v1-0
2023-10-04 13:42:05,756 [ThreadPoolTaskExecutor-1] INFO  o.s.k.l.KafkaMessageListenerContainer - generic-mae-consumer-job-client: partitions lost: [dp-datahub-metadata-changelog-timeseries-v1-0, dp-datahub-metadata-changelog-versioned-v1-0]
2023-10-04 13:42:05,756 [ThreadPoolTaskExecutor-1] INFO  o.s.k.l.KafkaMessageListenerContainer - generic-mae-consumer-job-client: partitions revoked: [dp-datahub-metadata-changelog-timeseries-v1-0, dp-datahub-metadata-changelog-versioned-v1-0]
2023-10-04 13:42:05,757 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] (Re-)joining group
2023-10-04 13:42:05,757 [ThreadPoolTaskExecutor-1] INFO  c.d.event.PlatformEventProcessor - Consuming a Platform Event
2023-10-04 13:42:05,757 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Join group failed with org.apache.kafka.common.errors.MemberIdRequiredException: The group member needs to have a valid member id before actually entering a consumer group
2023-10-04 13:42:05,757 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] (Re-)joining group
2023-10-04 13:42:06,743 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener - Successfully fed bulk request. Number of events: 1 Took time ms: -1
2023-10-04 13:42:07,748 [I/O dispatcher 1] INFO  c.l.m.s.e.update.BulkListener - Successfully fed bulk request. Number of events: 1 Took time ms: -1
2023-10-04 13:42:08,758 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Finished assignment for group at generation 1813: {consumer-generic-mae-consumer-job-client-5-6368e370-d3fa-4eb6-b3c5-ee35bfad66d8=Assignment(partitions=[dp-datahub-metadata-changelog-versioned-v1-0, dp-datahub-metadata-changelog-timeseries-v1-0])}
2023-10-04 13:42:08,798 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.AbstractCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Successfully joined group with generation 1813
2023-10-04 13:42:08,799 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Adding newly assigned partitions: dp-datahub-metadata-changelog-timeseries-v1-0, dp-datahub-metadata-changelog-versioned-v1-0
2023-10-04 13:42:08,799 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Setting offset for partition dp-datahub-metadata-changelog-timeseries-v1-0 to the committed offset FetchPosition{offset=6, offsetEpoch=Optional[13], currentLeader=LeaderAndEpoch{leader=Optional[10.161.114.8:9092 (id: 10 rack: us-east-1c)], epoch=13}}
2023-10-04 13:42:08,800 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-generic-mae-consumer-job-client-5, groupId=generic-mae-consumer-job-client] Setting offset for partition dp-datahub-metadata-changelog-versioned-v1-0 to the committed offset FetchPosition{offset=80749, offsetEpoch=Optional[19], currentLeader=LeaderAndEpoch{leader=Optional[10.161.112.227:9092 (id: 12 rack: us-east-1b)], epoch=19}}
```

Also, I should note that all of the urns I tried to delete do appear to have been successfully deleted, and the consumer's offset is stuck at 80749 (the same offset mentioned in the log above).
*edit: the urns have been deleted from the metadata_aspects_v2 RDS table, but they are still returned in a GraphQL dataset query. Is that expected, or does it give a hint at the problem I am running into?

I suspect that our deployment could be improved. What are the easiest levers to pull to make this less likely to happen in the future?

Yeah, this kinda looks like the Kafka poll timing out. The poll time can be configured (see the Kafka Consumer configuration reference in the Confluent documentation), as can max poll records, which we've seen some success with in the past; that's probably the easiest lever without costing you more money. ES instance size is another one, but it has a price tag and may require migrations depending on your deployment. This thread has some more details: https://datahubspace.slack.com/archives/CUMUWQU66/p1674114882345779
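As a concrete starting point, here is a minimal sketch of bumping those two settings on the consumer. The deployment name and the Spring-style environment variable names are assumptions and will vary by DataHub version and deployment; the underlying Kafka client settings are max.poll.interval.ms and max.poll.records.

```
# Hypothetical deployment name and env var names; adjust to however the
# consumers run in your cluster (GMS with embedded consumers here, or the
# standalone MAE consumer if you split it out). The goal is simply to raise
# the Kafka client's max.poll.interval.ms and lower max.poll.records.
kubectl set env deployment/datahub-datahub-gms \
  SPRING_KAFKA_PROPERTIES_MAX_POLL_INTERVAL_MS=600000 \
  SPRING_KAFKA_PROPERTIES_MAX_POLL_RECORDS=50
```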

Splitting out to standalone consumers can also help; there is a helm var you can set for this (see the chart README: datahub-helm/charts/datahub/README.md in acryldata/datahub-helm on GitHub). This will deploy additional pods that are dedicated to consuming events and remove that burden from GMS.
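As a rough sketch, and assuming the chart was installed under the release name datahub from the datahub helm repo, turning that on could look like the following; double-check the exact value name against the chart README linked above for your chart version.

```
# Sketch only: release/chart names are assumptions about this deployment, and
# the value name should be verified against your datahub-helm chart version.
helm upgrade datahub datahub/datahub \
  --reuse-values \
  --set global.datahub_standalone_consumers_enabled=true
```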

> *edit: the urns have been deleted from the metadata_aspects_v2 RDS table, but they are still returned in a GraphQL dataset query. Is that expected, or does it give a hint at the problem I am running into?
Yes, this indicates that the consumer has not processed the delete. When the delete is issued, GMS first deletes it from the DB and then produces an event to remove it from ElasticSearch.
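One way to confirm that interpretation is to check whether the urn still has a document in the dataset search index even though its rows are gone from metadata_aspects_v2. A sketch, assuming direct access to ElasticSearch; the endpoint and the urn are placeholders, and the index name shown is the default (add your index prefix if you configured one).

```
# Placeholder ES endpoint and urn; datasetindex_v2 is the default dataset
# search index name, prefixed if your deployment sets an index prefix.
curl -s -H 'Content-Type: application/json' \
  'http://elasticsearch-master:9200/datasetindex_v2/_search' \
  --data '{"query": {"match": {"urn": "urn:li:dataset:(urn:li:dataPlatform:dbt,my_db.my_schema.my_model,PROD)"}}}'
```

A hit here, while the rows are already gone from the DB, means the delete event is still waiting in the MCL topic to be processed.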