Incremental Ingestion for Hive Data in DataHub

user-2 · March 4, 2024, 4:16pm

Hello All,

I am looking for incremental ingestion from a single cluster.
For example, I have 1+ million datasets in Hive. After the initial load, is there any way to ingest only updated/new datasets incrementally, instead of running the ingestion on the whole cluster?
Currently it takes days to do the ingestion on the complete cluster.

datahub_team · March 4, 2024, 4:16pm

Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!

Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)

user-2 · March 4, 2024, 4:16pm

Using both UI and CLI, but CLI is preferred.
DataHub version 0.12.0 and DataHub Actions 0.0.14
Hive data sources

datahub_team · March 4, 2024, 4:16pm

<@U01GZEETMEZ> might be able to speak to this!

user-1 · March 4, 2024, 4:16pm

We have support for this sort of incremental ingestion for one or two sources (e.g. powerbi), but it’s not something we’ve built out in general yet.

In the particular case of Hive, are you ingesting directly from Hive or using our Hive metastore connector (https://datahubproject.io/docs/next/generated/ingestion/sources/presto-on-hive/)? The latter can be much more performant.
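
Since there is no general built-in incremental mode, one possible workaround is to work out outside DataHub which tables changed and restrict each run to them with the source's `table_pattern.allow` list. Below is a minimal sketch of that idea, not a built-in DataHub feature. The helper names `changed_tables` and `build_recipe`, the host names, and the way change timestamps are obtained (for example from the `transient_lastDdlTime` table parameter in the Hive metastore) are all assumptions for illustration. Whether the pattern is matched against `database.table` should be checked against the connector docs for your version.

```python
import re
from typing import Dict, Iterable, List, Tuple


def changed_tables(tables: Iterable[Tuple[str, int]], checkpoint: int) -> List[str]:
    """Return the names of tables changed after the checkpoint (epoch seconds).

    `tables` is a hypothetical list of (qualified_name, last_change_epoch)
    pairs that you collect yourself, e.g. by querying the Hive metastore.
    """
    return sorted(name for name, changed_at in tables if changed_at > checkpoint)


def build_recipe(table_names: List[str], host_port: str, server: str) -> Dict:
    """Build a recipe dict that only allows the given tables.

    Each name is escaped and anchored so it matches exactly one table.
    Assumption: the pattern is matched against the qualified table name.
    """
    allow = [f"^{re.escape(name)}$" for name in table_names]
    return {
        "source": {
            "type": "hive",
            "config": {
                "host_port": host_port,
                "table_pattern": {"allow": allow},
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


if __name__ == "__main__":
    # Hypothetical inventory of tables and their last-change timestamps.
    inventory = [
        ("sales.orders", 1709500000),
        ("sales.customers", 1700000000),
        ("hr.staff", 1709600000),
    ]
    last_run = 1709000000  # checkpoint saved from the previous ingestion run
    names = changed_tables(inventory, last_run)
    if names:  # skip the run entirely when nothing changed
        recipe = build_recipe(names, "hive-host:10000", "http://localhost:8080")
        print(recipe["source"]["config"]["table_pattern"]["allow"])
```

The resulting dict has the same shape as a YAML recipe, so it could be written out for the CLI or passed to `Pipeline.create` from `datahub.ingestion.run.pipeline`. With a very large number of changed tables the allow list gets long, so batching the names across several runs may be more practical.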

user-2 · March 4, 2024, 4:16pm

Currently ingesting directly from Hive.
Will explore presto-on-hive.
