Incomplete Ingestion of Datasets/Entities in DataHub from File Source

Original Slack Thread

Hello! I am ingesting from a file source and noticed that not all of the datasets/entities are ingested into DataHub, even though the ingestion finishes successfully and, as far as I can tell, without errors. Do you have any idea why that is the case and how to resolve it? Thank you!
DataHub CLI version 0.12.1.1
DataHub frontend v0.12.0
(The data in the file is ingested from a MySQL source with DataHub CLI version: 0.13.1.2)

Additional observations 1:
I started looking into the GMS logs from the ingestion and noticed errors of the following type, even though some of the affected entities do appear in the frontend:
2024-04-17 09:35:15,349 [ThreadPoolTaskExecutor-1] INFO c.l.m.k.h.s.SiblingAssociationHook:112 - Urn urn:li:dataset:(urn:li:dataPlatform:mysql,elin_db.View_SHOW_vehicle_states,PROD) received by Sibling Hook.
2024-04-17 09:35:15,349 [ThreadPoolTaskExecutor-1] ERROR c.l.m.k.h.s.SiblingAssociationHook:209 - urn:li:dataset:(urn:li:dataPlatform:mysql,elin_db.View_SHOW_vehicle_states,PROD) has an unexpected number of dbt upstreams: 0. Not adding any as siblings.
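
To see which entities these errors affect, the flagged URNs can be pulled out of the GMS log and compared against what is missing in the UI. This is a minimal sketch, assuming the log lines keep the shape shown above; the function name is hypothetical, and the regex assumes dataset names contain no closing parenthesis.

```python
import re

# Matches a dataset URN such as
# urn:li:dataset:(urn:li:dataPlatform:mysql,db.table,PROD)
# Assumption: the dataset name itself contains no ")" character.
URN_RE = re.compile(r"urn:li:dataset:\([^)]*\)")


def urns_with_sibling_errors(log_lines):
    """Return the set of dataset URNs found on ERROR lines from SiblingAssociationHook."""
    flagged = set()
    for line in log_lines:
        if " ERROR " in line and "SiblingAssociationHook" in line:
            match = URN_RE.search(line)
            if match:
                flagged.add(match.group(0))
    return flagged
```

Passing an open log file (for example `urns_with_sibling_errors(open(path))`, where `path` points at a saved copy of the GMS log) gives the set of URNs to check against the entities that cannot be browsed.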

For other entities, no errors of the above type are logged, yet they don't appear when I browse the data source. They do appear when I search for them by name.

So it seems to me that, for some reason, all entities are ingested, but some of them cannot be reached by browsing the data source and only show up in a text search.
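
One way to test this hypothesis is to check which entities in the ingestion file carry a browse path aspect at all. The following is a minimal sketch, assuming the file is a JSON array that mixes MCP-style records (with `entityUrn` and `aspectName`) and MCE-style records (with `proposedSnapshot`); the function names are hypothetical. It only inspects the file, so an entity listed here may still receive a browse path on the server side.

```python
import json
from collections import defaultdict

# Aspect names that make an entity reachable through the browse view.
BROWSE_ASPECTS = {"browsePaths", "browsePathsV2"}


def load_records(path):
    """Load the ingestion file; assumed to be a JSON array of records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def aspects_by_urn(records):
    """Map each entity URN to the set of aspect names present in the records."""
    found = defaultdict(set)
    for record in records:
        if "entityUrn" in record:
            # MCP-style record: one aspect per record.
            found[record["entityUrn"]].add(record.get("aspectName", ""))
        elif "proposedSnapshot" in record:
            # MCE-style record: a snapshot holding a list of aspects.
            for snapshot in record["proposedSnapshot"].values():
                urn = snapshot.get("urn")
                for aspect in snapshot.get("aspects", []):
                    for full_name in aspect:
                        # e.g. com.linkedin.pegasus2avro.common.BrowsePaths -> browsePaths
                        short = full_name.rsplit(".", 1)[-1]
                        found[urn].add(short[:1].lower() + short[1:])
    return found


def urns_without_browse_path(records):
    """Return the sorted URNs that have no browse path aspect in the file."""
    return sorted(
        urn
        for urn, aspects in aspects_by_urn(records).items()
        if not (aspects & BROWSE_ASPECTS)
    )
```

Calling `urns_without_browse_path(load_records(path))` on the ingestion file lists the entities to compare against those missing from the browse view. If the two lists match, the file's browse path metadata is the likely place to look; if they do not, the cause is probably elsewhere.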

Additional observations 2:
Curiously, I observe the same behavior when I follow the same ingestion procedure with the same source file against our production DataHub. However, a different number of entities is missing from the interface there.

Any suggestions on how to deal with this would be highly appreciated!

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)