Integrating DataHub with S3 Data Lake: Configuring Folder Structure Visibility for JSON Data

Original Slack Thread

Hi all, my team and I are currently exploring the possibility of integrating DataHub.

We are using an S3 data lake as the source, and we have a bunch of JSON data partitioned by client and date. I want DataHub to show the folder structure rather than capture the metadata by scanning all the JSON data. Can you tell us how to do that?

The problem is that DataHub scans all of the JSON data present rather than a sample, and scanning every JSON document takes a long time since the data size is in TBs.

  1. Ingestion: CLI
  2. DataHub version:
  3. Data source(s): S3 data lake

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

We have a few config parameters for controlling sampling and schema inference, documented here. Have you tried using those?
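For illustration, a minimal recipe sketch along those lines. The bucket name, path layout, and sink address below are placeholders, and the field names (`path_specs`, `sample_files`, `profiling.enabled`) are based on the S3 source docs and should be checked against the version of DataHub you're running:

```yaml
# Hypothetical S3 ingestion recipe sketch -- placeholder bucket/paths.
source:
  type: s3
  config:
    path_specs:
      # {table} and {partition_key[i]}={partition[i]} tell DataHub to treat
      # the folder hierarchy as dataset + partitions instead of flat files.
      - include: "s3://my-bucket/data/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.json"
        # Infer the schema from a sample of files rather than reading everything.
        sample_files: true
    aws_config:
      aws_region: us-east-1
    # Disable profiling so ingestion does not scan full TB-scale data.
    profiling:
      enabled: false

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

With a layout like `s3://my-bucket/data/client=acme/date=2024-01-01/part-0.json`, a path spec of this shape should surface `client` and `date` as partition keys in the UI rather than ingesting each JSON file as a separate dataset.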