Ingesting Metadata for S3 Datalake: How to Ingest Only the Latest File and Infer Schema from It?

Original Slack Thread

Hello all,
I want to ingest metadata for my S3 data lake, but I only want to ingest one file (the latest file) from each folder and infer the schema from it, not all files. Is there a way to do that?

What does your path_spec look like?
This is exactly how the ingestion works if you have the proper path_spec.

My path_spec is like s3://bucket-name/*/*/*/*.*
I have 3 levels of subfolders inside the bucket, all with different names, hence the wildcards, and finally all the files within.
Now how do I make sure DataHub only infers and ingests one file? I can't pass the file name.

And I have set files to sample as 1 in the recipe.

Let's say the last subfolder is the partition date, and within it there are multiple part files. I only want the latest partition-date folder ingested, with only the single latest file.
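
For reference, a minimal sketch of the recipe described above (bucket name, region, and wildcard depth are placeholders taken from this thread, not a tested configuration):

```yaml
source:
  type: s3
  config:
    path_specs:
      # Three levels of differently named subfolders, then the files themselves.
      # With no {table} placeholder, every matched file is ingested as its own dataset.
      - include: "s3://bucket-name/*/*/*/*.*"
    aws_config:
      aws_region: us-east-1  # placeholder region
```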

Check this example, which I use for your use case -> https://datahubproject.io/docs/generated/ingestion/sources/s3/#example-3---folder-of-files-as-dataset-with-partitions

If the {table} property is not set in the path_spec, it will scrape and ingest all the individual files, which is most probably something you don't want.
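
A hedged sketch of what that example looks like for the layout described above, assuming the dataset corresponds to the folder just above the partition-date folders (bucket name is a placeholder):

```yaml
source:
  type: s3
  config:
    path_specs:
      # {table} marks the folder that becomes the dataset. Everything below it
      # (the partition-date folders and their part files) is treated as
      # partitions of that single dataset rather than as separate
      # file-level datasets.
      - include: "s3://bucket-name/*/{table}/*/*.*"
```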

Ahh thanks let me try.

I think this works as I expected.
I have 2 questions:

  1. Does it always infer the schema from the latest file?
  2. How do I infer the schema for gz files?
    Thank you
  1. The logic is basically to find the latest folder and infer the schema from the latest file among the first (I think) 100 files in that folder.
  2. It uncompresses the file and infers the schema from it if it has a known format.
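
A sketch of the sampling-related knobs, assuming the S3 source's sample_files path_spec flag and the source-level number_of_files_to_sample setting behave as described above (names and defaults should be checked against the S3 source docs):

```yaml
source:
  type: s3
  config:
    # Assumed source-level cap on how many files per folder are listed when
    # sampling for schema inference (the "first ~100 files" mentioned above).
    number_of_files_to_sample: 100
    path_specs:
      - include: "s3://bucket-name/*/{table}/*/*.*"
        # Sample files instead of listing everything; the schema is then
        # inferred from the latest file among the sampled ones.
        sample_files: true
```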

For JSON files I see the schema inferred, but for gz files there is no schema in DataHub.

Your gzip file should be named like myfile.json.gz for the schema to be inferred.

No, it's not in that format, just name.gz.
What do we do in this scenario?

Try setting the default_extension config property on the path_spec.
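
Something like this, assuming the same path_spec shape as above; default_extension tells the source which format to assume for files whose names don't carry a recognizable extension (such as name.gz):

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://bucket-name/*/{table}/*/*.*"
        # Treat files without a recognizable format extension (e.g. name.gz)
        # as JSON when inferring the schema.
        default_extension: json
```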

I set json as the default and it worked. Thank you!