Hello all,
I want to ingest metadata for my S3 data lake.
But I only want to ingest one file (the latest file) from each folder and infer the schema from it, not all files. Is there a way to do it?
What does your path_spec look like?
This is exactly how the ingestion works if you have the proper path_spec
My path_spec is like `s3://bucket-name/*/*/*/*.*`
As I have 3 levels of subfolders inside the bucket, all with different names, hence the wildcards.
And finally all the files within.
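(For context, a path_spec like that sits in a recipe roughly as sketched below; the bucket name, region, and sink endpoint are placeholders rather than anything from this thread. You would run it with `datahub ingest -c recipe.yaml`.)

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://bucket-name/*/*/*/*.*"  # three wildcard folder levels, then any file
    aws_config:
      aws_region: us-east-1  # placeholder region
    env: "PROD"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # placeholder DataHub endpoint
```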
Now how do I make sure DataHub only infers and ingests one file?
I can't pass the file name.
And I have set the files-to-sample setting to 1 in the recipe.
Let's say the last subfolder is the partition date, and within that there are multiple part files. I only want the latest partition-date folder ingested, with only one latest file.
Check this example, which I use; it sounds like your use case -> https://datahubproject.io/docs/generated/ingestion/sources/s3/#example-3---folder-of-files-as-dataset-with-partitions
If the `{table}` property is not set in the path_spec, it will scrape and ingest all the individual files, which is most probably something you don't want.
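A sketch of that idea applied to the layout described above, with `{table}` on the second folder level so the partition-date folder and the part files underneath it roll up into one dataset (which level should be `{table}` is an assumption here; see the linked example for the full recipe and the partition tokens):

```yaml
source:
  type: s3
  config:
    path_specs:
      # {table} marks the folder that becomes the dataset; everything below it
      # is treated as that dataset's files, so the individual part files are
      # no longer ingested as separate datasets.
      - include: "s3://bucket-name/*/{table}/*/*.*"
```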
Ahh thanks, let me try.
I think this works as I expected.
I had 2 questions:
- Does it always infer the schema from the latest file?
- How do I infer the schema for .gz files?
Thank you
- The logic is basically to find the latest folder and infer the schema from the latest file among (I think) the first 100 files in that folder.
- It uncompresses the file and infers the schema from there if it has a known format.
For JSON files I see the schema inferred, but for .gz there is no schema in DataHub.
Your gzip file should be named like `myfile.json.gz` for the schema to be inferred.
No, it's not in this format, just `name.gz`.
What do we do in this scenario?
Try setting the `default_extension` config property for the path_spec.
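For example, in the path_spec (same assumed layout as above):

```yaml
path_specs:
  - include: "s3://bucket-name/*/{table}/*/*.*"
    default_extension: "json"  # assume JSON when a file (e.g. name.gz after uncompressing) has no recognizable extension
```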
I set `json` as the default and it worked. Thank you <@UV14447EU>