Resolving Data Discrepancies and Ingestion Job Issues

Original Slack Thread

Hi Team,
I am using the quickstart version for testing. The problem I’m encountering is a discrepancy between the data displayed in the UI and the data in the database. When I use the web interface, I can’t see the data, but I can retrieve it using the DataHub CLI or by accessing the MySQL container directly. How can I resolve this issue?

Not everything stored in the DB is necessarily displayed in the UI, like the auditStamp info in the aspects

Not sure which piece of info you’re referring to

I mean datasets, which in my case are tables in BigQuery. Some tables are shown in the UI, some are not.

Is it marked as removed=true in a status aspect for the dataset?
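For reference, here is a minimal sketch of fetching just the status aspect over the GMS rest.li API. It assumes quickstart defaults (GMS on localhost:8080) and uses a hypothetical BigQuery URN — substitute your own:

```python
import urllib.parse

GMS = "http://localhost:8080"  # quickstart default GMS address (assumption)

def status_aspect_url(urn: str) -> str:
    """Build the GMS rest.li URL that returns only the status aspect."""
    encoded = urllib.parse.quote(urn, safe="")  # percent-encode the whole URN
    return f"{GMS}/aspects/{encoded}?aspect=status&version=0"

# Hypothetical URN -- substitute your own project/dataset/table.
urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my_dataset.my_table,PROD)"
print(status_aspect_url(urn))

# Against a live quickstart, fetch it with e.g.:
# import json, urllib.request
# print(json.load(urllib.request.urlopen(status_aspect_url(urn))))
```

The CLI equivalent is roughly `datahub get --urn "<urn>" --aspect status`; if the response shows `"removed": true`, the entity was soft-deleted and is hidden from the UI.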

I think it’s not. Btw, when I hard-deleted one dataset entity, it was still shown in the UI, while you can no longer get it via the DataHub CLI.

Seems the datahub-actions container failed to start. I am trying to fix it.

imo, the actions container is not essential to solving this problem.
For the datasets that you cannot find in the UI, have you tried hardcoding the URL to see if the page can be loaded?
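A minimal sketch of building such a hardcoded URL, assuming the quickstart frontend on localhost:9002 and a hypothetical BigQuery URN (substitute your own):

```python
import urllib.parse

UI = "http://localhost:9002"  # quickstart frontend default (assumption)

def dataset_page_url(urn: str) -> str:
    """Build the direct (hardcoded) UI URL for a dataset entity."""
    return f"{UI}/dataset/{urllib.parse.quote(urn, safe='')}"

# Hypothetical URN -- substitute your own project/dataset/table.
urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my_dataset.my_table,PROD)"
print(dataset_page_url(urn))
```

Paste the printed URL into the browser; if the page renders but search and browse cannot find the entity, that points at the search index rather than the primary store.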

Good idea. I will try it. :joy:

<@U01TCN40JKV> The page can be loaded when I hardcode the URL. However, the dataset doesn’t show in the Datasets section, and I can’t search for it either.

It’s probably a sync mismatch between the ES and MySQL DBs

Thanks for your help. I have another question: an ingestion job ran successfully, but no ingested assets can be found.
The following are the logs, with sensitive information removed:

Execution finished successfully!
{'exec_id': '99686e11-0bbf-446c-969a-fad2bcace6a6',
 'infos': ['2023-09-22 11:02:58.269958 INFO: Starting execution for task with name=RUN_INGEST',
           "2023-09-22 15:15:45.692699 INFO: Successfully executed 'datahub ingest'",
           '2023-09-22 15:15:45.693002 INFO: Finished execution for task with name=RUN_INGEST'],
 'errors': []}

~~~~ Ingestion Report ~~~~
  "cli": {
    "cli_version": "",
    "cli_entry_location": "/usr/local/lib/python3.10/site-packages/datahub/",
    "py_version": "3.10.11 (main, May 23 2023, 13:58:30) [GCC 10.2.1 20210110]",
    "py_exec_path": "/usr/local/bin/python",
    "os_details": "Linux-6.2.0-1014-gcp-x86_64-with-glibc2.31",
    "peak_memory_usage": "1.74 GB",
    "mem_info": "1.73 GB",
    "peak_disk_usage": "55.43 GB",
    "disk_info": {
      "total": "103.87 GB",
      "used": "55.35 GB",
      "free": "48.5 GB"
  "source": {
    "type": "bigquery",
    "report": {
      "window_end_time": "2023-09-22 11:03:03.374739+00:00 (4 hours, 12 minutes and 38.56 seconds ago)",
      "window_start_time": "2023-09-21 00:00:00+00:00 (1 day, 15 hours and 15 minutes ago)",
      "ingestion_stage": "*: Lineage Extraction at 2023-09-22 15:13:49.008724+00:00",
 'usage_state_size': "{'main': '76.57 MB', 'queries': '75.65 MB'}",
 'schema_api_perf': '<datahub.ingestion.source.bigquery_v2.bigquery_report.BigQuerySchemaApiPerfReport object at 0x7f74d1fc0dc0>',
 'audit_log_api_perf': '<datahub.ingestion.source.bigquery_v2.bigquery_report.BigQueryAuditLogApiPerfReport object at 0x7f74d1fc0fa0>',
 'lineage_start_time': '2023-09-21 00:00:00+00:00 (1 day, 15 hours and 15 minutes ago)',
 'lineage_end_time': '2023-09-22 11:03:03.374739+00:00 (4 hours, 12 minutes and 38.89 seconds ago)',
 'stateful_lineage_ingestion_enabled': True,
 'usage_start_time': '2023-09-21 00:00:00+00:00 (1 day, 15 hours and 15 minutes ago)',
 'usage_end_time': '2023-09-22 11:03:03.374739+00:00 (4 hours, 12 minutes and 38.89 seconds ago)',
 'stateful_usage_ingestion_enabled': True,
 'start_time': '2023-09-22 11:03:03.401425 (4 hours, 12 minutes and 38.87 seconds ago)',
 'running_time': '4 hours, 12 minutes and 38.86 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 14274,
 'records_written_per_second': 0,
 'warnings': [],
 'failures': [],
 'start_time': '2023-09-22 11:02:59.843766 (4 hours, 12 minutes and 42.42 seconds ago)',
 'current_time': '2023-09-22 15:15:42.268023 (now)',
 'total_duration_in_seconds': 15162.42,
 'gms_version': 'null',
 'pending_requests': 0}

 Pipeline finished successfully; produced 14274 events in 4 hours, 12 minutes and 38.86 seconds.
As you can see, the events have been produced.

The sink report doesn’t necessarily mean the RDBMS and Elasticsearch are in sync (I think; I’m not a DataHub developer).
You can check whether the relevant doc was added to the datasetindex_v2 index.
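A minimal sketch of that check, assuming quickstart defaults (Elasticsearch on localhost:9200) and that DataHub keys index documents by the URL-encoded URN — an assumption worth verifying on your version. The URN below is hypothetical:

```python
import urllib.parse

ES = "http://localhost:9200"  # quickstart default Elasticsearch address (assumption)

def es_doc_url(index: str, urn: str) -> str:
    """Build the Elasticsearch _doc URL for an entity; assumes DataHub uses
    the URL-encoded URN as the document id."""
    return f"{ES}/{index}/_doc/{urllib.parse.quote(urn, safe='')}"

# Hypothetical URN -- substitute your own project/dataset/table.
urn = "urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my_dataset.my_table,PROD)"
print(es_doc_url("datasetindex_v2", urn))

# Against a live quickstart:
# import json, urllib.request
# print(json.load(urllib.request.urlopen(es_doc_url("datasetindex_v2", urn)))["found"])
```

If the document is missing while the row exists in MySQL, rebuilding the search indices from the primary store (DataHub's restore-indices job; see the docs on restoring search and graph indices) is the usual way to bring them back in sync.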

Thanks. I checked and found that no doc was added. In my case, I guess the data was not sent to GMS while the ingestion job was running. After I changed the bucket_duration from DAY to HOUR, it seems the ingestion job went wrong.

Could you please rerun the ingestion and share the logs below?
• ingestion debug logs
• GMS container logs

<@U01GZEETMEZ> might help you