<@U06TM7M40RK> I’m using the S3 source for ingestion, and my S3 paths are formatted like this:
bucket/daily/2025/id/{table}/info_2025_03_24.csv
bucket/daily/2025/id/{table}/info_2025_03_25.csv
Is it possible to pick up only one CSV file from each table folder?
To ingest only one CSV file from each table folder in your S3 bucket, you can use the path_spec configuration in your ingestion recipe. The path_spec allows you to specify the path pattern and control which files are included in the ingestion process.
Based on the S3 / Local Files DataHub documentation, you can use the {table} placeholder to represent the folder level for which the dataset is to be created. However, to pick only one file from each folder, you might need to implement custom logic or use a specific pattern that matches only the desired file.
Here’s a general approach you can take:
- Define the Path Spec: Use the {table} placeholder to represent the folder level. For example:
    path_specs:
      - include: "s3://bucket/daily/2025/id/{table}/info_2025_03_24.csv"
- Custom Logic: If you need to dynamically select the latest or a specific file from each folder, you might need to implement custom logic or a script that determines which file to ingest. This could involve listing the files in each folder and selecting the one you want based on your criteria (e.g., latest modified date).
- Use Exclude Patterns: If there are specific files you want to exclude, you can use the exclude option in the path_spec to filter out unwanted files.
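The "Custom Logic" step could be sketched roughly as below. This is only an illustration, not something the DataHub S3 source does for you: it assumes the object keys have already been listed (for example with an S3 client), that every file follows the info_YYYY_MM_DD.csv naming from the question, and that "the one file per table" means the one with the newest date in its name. The names KEY_PATTERN and latest_key_per_table are hypothetical helpers made up for this example.

```python
import re
from datetime import date

# Assumed key layout (taken from the question):
#   <prefix>/<table>/info_YYYY_MM_DD.csv
KEY_PATTERN = re.compile(
    r"^(?:.*/)?(?P<table>[^/]+)/info_(?P<y>\d{4})_(?P<m>\d{2})_(?P<d>\d{2})\.csv$"
)


def latest_key_per_table(keys):
    """Return {table: key}, keeping the newest dated file per table folder.

    Keys that do not match the assumed naming scheme are ignored.
    """
    latest = {}  # table -> (file_date, key)
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match is None:
            continue
        try:
            file_date = date(int(match["y"]), int(match["m"]), int(match["d"]))
        except ValueError:
            # Skip names such as info_2025_13_40.csv that are not real dates.
            continue
        table = match["table"]
        if table not in latest or file_date > latest[table][0]:
            latest[table] = (file_date, key)
    return {table: key for table, (_, key) in latest.items()}


if __name__ == "__main__":
    # Example keys; in practice these would come from listing the bucket.
    example_keys = [
        "daily/2025/id/orders/info_2025_03_24.csv",
        "daily/2025/id/orders/info_2025_03_25.csv",
        "daily/2025/id/users/info_2025_03_24.csv",
    ]
    for table, key in sorted(latest_key_per_table(example_keys).items()):
        print(table, "->", f"s3://bucket/{key}")
```

How the selected keys are then fed back into the recipe (for instance by generating the include or exclude entries before each run) is left open here and would need to be checked against the DataHub S3 source documentation.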
If you need more detailed guidance on implementing custom logic or have specific requirements, please let me know!
Sources:
- https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/tests/unit/s3/test_s3_source.py
- https://datahubspace.slack.com/archives/CV2KB471C/p1701713054.089729
Hint: Mention @runllm in the thread for followups.
Hi <@U08ML7K1W1J>, how did my answer work out for you? Need any more assistance? You can also tag <@U07FV3UHE4D> to let your fellow community members know you’re still looking for guidance.
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><