Using Ingestion with Profiling on S3 for Aggregating Multiple Entities

Original Slack Thread

Hi all,
Has anyone used ingestion with profiling on S3?
My folder (table) contains several files, and ingestion has generated a separate entity for each of them. I would like to unify them into a single object.
source:
  type: s3
  config:
    path_specs:
      - include: 's3://bucket/table_folder/*.*'
        table_name: kinesis_folder
    aws_config:
      aws_region: eu-central-1
      aws_role: 'arn:aws:iam:acc:role/s3_data_catalog_read_only'
    profiling:
      enabled: true

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

What is the expected output you would like to see?

Hi Tamas,
To add to my question: the source is S3, and I'm using the UI on v0.11.0.

I want to map a folder and all the Parquet files inside it to a single table/entity, because I will have partitioned content in my S3 bucket.

Thanks

Then just use the following path_spec:
's3://bucket/{table}/*.*'
If you also have partitions under it, use something like this:
s3://bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.avro # specify partition key and value format
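
For reference, here is a rough sketch of how the recipe from the original question might look with the `{table}` placeholder applied. The bucket name, region, and role ARN are the placeholders from the question, not real values. Dropping `table_name` in favor of `{table}` is an assumption, as is the expectation that each first-level folder in the bucket then becomes its own dataset.

```yaml
# Sketch only: bucket, region, and role ARN are placeholders from the question.
source:
  type: s3
  config:
    path_specs:
      # {table} marks the folder level used as the dataset name, so all
      # files under that folder are expected to map to a single entity.
      - include: 's3://bucket/{table}/*.*'
    aws_config:
      aws_region: eu-central-1
      aws_role: 'arn:aws:iam:acc:role/s3_data_catalog_read_only'
    profiling:
      enabled: true
```

For partitioned folders, the `include` value would be swapped for the partitioned pattern shown above, with the file extension adjusted to match the actual files.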

Thanks, Tamas, I tried this approach and it worked!

yaay, awesome!