Incremental Ingestion for Hive Data in DataHub

Original Slack Thread

Hello All,

I am looking for incremental ingestion from a single cluster.
For example, I have 1+ million datasets in Hive. After the initial load, is there any way to ingest only updated/new datasets incrementally instead of running the ingestion on the whole cluster?
Currently it takes days to do the ingestion on the complete cluster.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

  1. Using both UI and CLI, but CLI is preferred.
  2. DataHub version 0.12.0 and DataHub Actions 0.0.14
  3. Hive data sources

<@U01GZEETMEZ> might be able to speak to this!

We have support for this sort of incremental ingestion for one or two sources (e.g. powerbi), but it’s not something we’ve built out in general yet

In the particular case of hive, are you ingesting directly from hive or using our hive metastore connector (https://datahubproject.io/docs/next/generated/ingestion/sources/presto-on-hive/)? The latter can be much more performant

Currently ingesting directly from Hive.
Will explore presto-on-hive.
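
For anyone following up on the presto-on-hive suggestion above, here is a rough sketch of what a recipe for that connector might look like when driven from Python. It is only an illustration: the helper names `build_metastore_recipe` and `run_recipe` are made up for this example, the config keys (`host_port`, `username`, `password`, `scheme`) and the `mysql+pymysql` scheme assume a MySQL-backed Hive metastore and should be checked against the connector docs linked above, and the GMS URL is a placeholder. This does not add incremental ingestion; it only swaps the source so that metadata is read from the metastore database rather than through Hive itself.

```python
from typing import Any, Dict


def build_metastore_recipe(
    metastore_host_port: str,
    username: str,
    password: str,
    gms_server: str = "http://localhost:8080",  # placeholder GMS endpoint
) -> Dict[str, Any]:
    """Build a recipe dict for the presto-on-hive (Hive metastore) source.

    Config keys are assumptions based on the connector docs and assume a
    MySQL-backed metastore; verify them for your DataHub version.
    """
    return {
        "source": {
            "type": "presto-on-hive",
            "config": {
                "host_port": metastore_host_port,
                "username": username,
                "password": password,
                "scheme": "mysql+pymysql",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_server},
        },
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe programmatically (requires acryl-datahub installed)."""
    # Imported lazily so the recipe can be built without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


if __name__ == "__main__":
    recipe = build_metastore_recipe("metastore-db.example.com:3306", "hive", "secret")
    print(recipe["source"]["type"])
```

The same dict can be written out as YAML and passed to `datahub ingest -c recipe.yml` if the CLI is preferred.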