Question about S3 Data Lake Connector and path_spec Configuration

Original Slack Thread

Hi all,
I have a question regarding the S3 Data Lake connector and the path_spec configuration.

For example, given the following file structure:
s3://bucket-name/some/dir/001_adv.csv
s3://bucket-name/some/dir/002_adv.csv
s3://bucket-name/some/dir/003_adv.csv
s3://bucket-name/some/dir/001_rsld.csv
s3://bucket-name/some/dir/002_rsld.csv
I’d like to ingest all files ending with _adv as one table and all files ending with _rsld as another table. I don’t mind writing a separate path_spec line for each table, or providing the table names shown in DataHub manually, but I’m struggling to write a path_spec that produces the desired result: s3://bucket-name/some/dir/*_{table}.csv does not work.
Is there any option to provide the table name (and ideally the browse path shown in DataHub) manually when ingesting groups of files via the S3 Data Lake connector? Or any other way to achieve the desired result?
Changing the directory/file structure is unfortunately not an option.

Using DataHub 0.13.0
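
For reference, the attempt described above would sit in an ingestion recipe roughly like this (a sketch only; the aws_config block and region value are assumptions, not taken from the thread):

```yaml
# Sketch of the attempted recipe, for context only.
# The aws_config block and region are placeholders (assumptions), not from the thread.
source:
  type: s3
  config:
    aws_config:
      aws_region: eu-central-1          # assumption: replace with the bucket's region
    path_specs:
      # The pattern reported above as not working:
      - include: "s3://bucket-name/some/dir/*_{table}.csv"
```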


Hi Uwe, do you mind sharing your current recipe for this?

If I understand correctly, I believe you should be able to write a separate ingestion recipe for each group of files.

The connector is most likely interpreting the path spec literally, treating {table} as a literal string rather than as a placeholder variable.

Have you tried * or ? as wildcards?
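
As a concrete starting point for the separate-recipe/path_spec suggestion, here is a minimal sketch with one path_spec per intended table. The aws_config values are assumptions, and whether each path_spec groups its matched files into a single table (rather than emitting one dataset per file) on DataHub 0.13.0 is worth verifying against the S3 source documentation:

```yaml
# Sketch only: one path_spec per intended table, using * as the wildcard.
# aws_config values are placeholders; verify the grouping behaviour on DataHub 0.13.0.
source:
  type: s3
  config:
    aws_config:
      aws_region: eu-central-1          # assumption: replace with the bucket's region
    path_specs:
      - include: "s3://bucket-name/some/dir/*_adv.csv"    # intended "adv" table
      - include: "s3://bucket-name/some/dir/*_rsld.csv"   # intended "rsld" table
```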