Ingesting Metadata for S3 Datalake: How to Ingest Only the Latest File and Infer Schema from It?

Original Slack Thread

Hello all,
I want to ingest metadata for my S3 data lake, but I only want to ingest one file (the latest file) from each folder and infer the schema from it, not all files. Is there a way to do that?

What does your path_spec look like?
This is exactly how the ingestion works if you have the proper path_spec.

My path_spec is like s3://bucket-name/*/*/*/*.*
I have 3 levels of subfolders inside the bucket, all with different names, hence the wildcards, and finally all the files within.
Now how do I make sure DataHub only infers and ingests one file? I can't pass the file name.

And I have set files to sample as 1 in the recipe.

Let's say the last subfolder is the partition date, and within it there are multiple part files. I only want the latest partition-date folder ingested, with only the single latest file.
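
For reference, a minimal sketch of the recipe described above (bucket name, region, and wildcard depth are placeholders taken from this thread, not a tested configuration):

```yaml
source:
  type: s3
  config:
    path_specs:
      # Three levels of differently named subfolders, then the files themselves.
      # With no {table} placeholder, every matched file is ingested as its own dataset.
      - include: "s3://bucket-name/*/*/*/*.*"
    aws_config:
      aws_region: us-east-1  # placeholder region
```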

Check this example, which I use for your use case -> https://datahubproject.io/docs/generated/ingestion/sources/s3/#example-3---folder-of-files-as-dataset-with-partitions

If the {table} property is not set in the path_spec, it will scrape and ingest all the individual files, which is most probably something you don't want.
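
A hedged sketch of what that example looks like for the layout described above, assuming the dataset corresponds to the folder just above the partition-date folders (bucket name is a placeholder):

```yaml
source:
  type: s3
  config:
    path_specs:
      # {table} marks the folder that becomes the dataset. Everything below it
      # (the partition-date folders and their part files) is treated as
      # partitions of that single dataset rather than as separate
      # file-level datasets.
      - include: "s3://bucket-name/*/{table}/*/*.*"
```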

Ahh thanks let me try.

I think this works as I expected.
I have 2 questions:

  1. Does it always infer the schema from the latest file?
  2. How do I infer the schema for gz files?
    Thank you
  1. The logic is basically to find the latest folder and infer the schema from the latest file among the first (I think) 100 files in that folder.
  2. It uncompresses the file and infers the schema from it if it has a known format.
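
A sketch of the sampling-related knobs, assuming the S3 source's sample_files path_spec flag and the source-level number_of_files_to_sample setting behave as described above (names and defaults should be checked against the S3 source docs):

```yaml
source:
  type: s3
  config:
    # Assumed source-level cap on how many files per folder are listed when
    # sampling for schema inference (the "first ~100 files" mentioned above).
    number_of_files_to_sample: 100
    path_specs:
      - include: "s3://bucket-name/*/{table}/*/*.*"
        # Sample files instead of listing everything; the schema is then
        # inferred from the latest file among the sampled ones.
        sample_files: true
```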

For JSON files I see the schema inferred, but for gz files there is no schema in DataHub.

Your gzip file should be named like myfile.json.gz for the schema to be inferred.

No, it's not in that format, just name.gz.
What do we do in this scenario?

Try setting the default_extension config property on the path_spec.
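
Something like this, assuming the same path_spec shape as above; default_extension tells the source which format to assume for files whose names don't carry a recognizable extension (such as name.gz):

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://bucket-name/*/{table}/*/*.*"
        # Treat files without a recognizable format extension (e.g. name.gz)
        # as JSON when inferring the schema.
        default_extension: json
```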

I set json as the default and it worked. Thank you!