Using Ingestion with Profiling on S3 for Aggregating Multiple Entities

user-2 · March 4, 2024, 3:48pm

Hi all,
Has anyone used ingestion with profiling on S3?
My folder (the table) contains several files, so ingestion has generated a separate entity for each file. I would like to unify them into a single entity. My current recipe:
source:
  type: s3
  config:
    path_specs:
      - include: 's3://bucket/table_folder/*.*'
        table_name: kinesis_folder
    aws_config:
      aws_region: eu-central-1
      aws_role: 'arn:aws:iam:acc:role/s3_data_catalog_read_only'
    profiling:
      enabled: true

datahub_team · March 4, 2024, 3:48pm

Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!

Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)

user-1 · March 4, 2024, 3:48pm

what is the expected output you would like to see?

user-2 · March 4, 2024, 3:48pm

Hi Tamas,
To add to my question: the source is S3, and I'm using the UI on v0.11.0.

I want to map a folder and all the Parquet files inside it to a single table/entity, because the content in my S3 bucket will be partitioned.

Thanks

datahub_team · March 4, 2024, 3:48pm

Then just use the following path_spec:
's3://bucket/{table}/*.*'
And if you have partitions under it as well, then use something like this:
's3://bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.avro' # specify partition key and value format
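
For reference, here is a minimal sketch of how the recipe from the question might look with that path_spec dropped in. The bucket, region, and role are copied from the question as-is. The single-level partition layout in the commented-out alternative and the *.parquet extension are assumptions based on the description above, so adjust them to the actual folder structure.

```yaml
source:
  type: s3
  config:
    path_specs:
      # {table} marks the folder whose files are grouped into one dataset,
      # so every file under s3://bucket/<folder>/ maps to a single entity
      # named after that folder.
      - include: 's3://bucket/{table}/*.*'
      # Assumed alternative for one level of key=value partition folders
      # holding Parquet files (use instead of the line above):
      # - include: 's3://bucket/{table}/{partition_key[0]}={partition[0]}/*.parquet'
    aws_config:
      aws_region: eu-central-1
      aws_role: 'arn:aws:iam:acc:role/s3_data_catalog_read_only'
    profiling:
      enabled: true
```

Note that table_name is left out here on the assumption that the dataset name should come from the {table} folder itself.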

user-2 · March 4, 2024, 3:48pm

Thanks, Tamas, I tried this approach and it worked!

datahub_team · March 4, 2024, 3:48pm

yaay, awesome!
