Integrating DataHub with S3 Data Lake: Configuring Folder Structure Visibility for JSON Data

Original Slack Thread

Hi all, my team and I are currently exploring the possibility of integrating DataHub.

We are using an S3 data lake as the source, and we have a bunch of JSON data partitioned by client and date. I want DataHub to show the folder structure rather than capture the metadata by scanning all the JSON data. Can you tell us how to do that?

The problem is that DataHub scans all of the JSON data present rather than a sample, and scanning every JSON document takes a long time since the data size is in TBs.

  1. Ingestion: CLI
  2. DataHub version:
  3. Data source(s): S3 data lake

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

We have a few config parameters for controlling sampling and schema inference, documented here. Have you tried using those?
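For illustration, a minimal recipe sketch along those lines. The bucket name, path layout, and sink address below are placeholders, and the field names (`path_specs`, `sample_files`, `profiling.enabled`) are based on the S3 source docs and should be checked against the version of DataHub you're running:

```yaml
# Hypothetical S3 ingestion recipe sketch -- placeholder bucket/paths.
source:
  type: s3
  config:
    path_specs:
      # {table} and {partition_key[i]}={partition[i]} tell DataHub to treat
      # the folder hierarchy as dataset + partitions instead of flat files.
      - include: "s3://my-bucket/data/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.json"
        # Infer the schema from a sample of files rather than reading everything.
        sample_files: true
    aws_config:
      aws_region: us-east-1
    # Disable profiling so ingestion does not scan full TB-scale data.
    profiling:
      enabled: false

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

With a layout like `s3://my-bucket/data/client=acme/date=2024-01-01/part-0.json`, a path spec of this shape should surface `client` and `date` as partition keys in the UI rather than ingesting each JSON file as a separate dataset.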