Question about S3 Data Lake Connector and path_spec Configuration

Original Slack Thread

Hi all,
I have a question regarding the S3 Data Lake connector and the path_spec configuration.

For example, given the following file structure:
s3://bucket-name/some/dir/001_adv.csv
s3://bucket-name/some/dir/002_adv.csv
s3://bucket-name/some/dir/003_adv.csv
s3://bucket-name/some/dir/001_rsld.csv
s3://bucket-name/some/dir/002_rsld.csv
I’d like to ingest all files ending with _adv as one table and all files ending with _rsld as another table. I don’t mind writing a separate path_spec line for each table, or providing the table names shown in DataHub manually, but I’m struggling to write a path_spec that produces the desired result: s3://bucket-name/some/dir/*_{table}.csv does not work.
Is there any option to provide the table name (and ideally the browse path shown in DataHub) manually when ingesting groups of files via the S3 Data Lake connector? Or any other way to achieve the desired result?
Changing the directory/file structure is unfortunately not an option.

Using DataHub 0.13.0
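
For reference, the attempt described above would sit in an ingestion recipe roughly like this (a sketch only; the aws_config block and region value are assumptions, not taken from the thread):

```yaml
# Sketch of the attempted recipe, for context only.
# The aws_config block and region are placeholders (assumptions), not from the thread.
source:
  type: s3
  config:
    aws_config:
      aws_region: eu-central-1          # assumption: replace with the bucket's region
    path_specs:
      # The pattern reported above as not working:
      - include: "s3://bucket-name/some/dir/*_{table}.csv"
```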


Hi Uwe, do you mind sharing your current recipe for this?

If I understand correctly, I believe you should be able to write a separate ingestion recipe for each group of files.

The connector is most likely interpreting the path spec literally, treating {table} as a literal string rather than as a placeholder variable.

Have you tried * or ? as wildcards?
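
As a concrete starting point for the separate-recipe/path_spec suggestion, here is a minimal sketch with one path_spec per intended table. The aws_config values are assumptions, and whether each path_spec groups its matched files into a single table (rather than emitting one dataset per file) on DataHub 0.13.0 is worth verifying against the S3 source documentation:

```yaml
# Sketch only: one path_spec per intended table, using * as the wildcard.
# aws_config values are placeholders; verify the grouping behaviour on DataHub 0.13.0.
source:
  type: s3
  config:
    aws_config:
      aws_region: eu-central-1          # assumption: replace with the bucket's region
    path_specs:
      - include: "s3://bucket-name/some/dir/*_adv.csv"    # intended "adv" table
      - include: "s3://bucket-name/some/dir/*_rsld.csv"   # intended "rsld" table
```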