Has anyone else found the S3 integration to be particularly memory-hungry, and does anyone know of ways to mitigate it? I kept hitting OOMs when trying to pull metadata from a bucket that contains 8 distinct datasets, 80MM gzipped files in total to crawl.
The attached graph shows memory usage on the datahub-actions pod as I tried to extract from S3.
DataHub 0.12.0.
Integration definition:
```
source:
  type: s3
  config:
    platform: s3
    platform_instance: s3-raw-data-sources
    path_specs:
      - include: "s3://BUCKET-NAME/{table-1}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-2}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-3}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-4}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-5}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-6}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-7}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-8}/*.gz"
    env: STG
    profiling:
      enabled: false
    aws_config:
      aws_region: us-east-1
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
```
Instead of {table-1}, can you use just {table}?
Hi, thanks for looking into this! In our case these are the names of different data sources in the bucket, which I'm altering before posting here. A single real example from my spec:
- include: "s3://BUCKET-NAME/{ghost-bid-logs}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.gz"
Here’s what one object key looks like:
s3://BUCKET-NAME/ghost-bid-logs/YYYY=2023/MM=04/dd=20/FILE-NAME.gz
Is my understanding of the path spec in the documentation correct, and is this formatted properly to capture ghost-bid-logs as a "table name", plus all of the partitions in this example?
I’ll note that I hit OOM issues even when I include one path here out of 8.
Nope, looks like I misread it!
Moving to literally "{table}" worked; it ingested everything and didn't OOM.
Paths ended up looking like this to capture all potential permutations:
```
- include: "s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
- include: "s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/*.gz"
- include: "s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.gz"
- include: "s3://BUCKET-NAME/{table}/*.gz"
```
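For anyone reading along, here's roughly how I understand the three-partition spec lining up against the real object key I posted earlier (just my reading of it, not from the docs):
```
# Key:   s3://BUCKET-NAME/ghost-bid-logs/YYYY=2023/MM=04/dd=20/FILE-NAME.gz
# Spec:  s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.gz
#
# Resolves (as I understand it) to:
#   table            -> ghost-bid-logs
#   partition_key[0] -> YYYY, partition[0] -> 2023
#   partition_key[1] -> MM,   partition[1] -> 04
#   partition_key[2] -> dd,   partition[2] -> 20
#   *.gz             -> FILE-NAME.gz
```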
Thanks!
<@U05JJ9WESHL> yeah, the `table` template parameter is a special one, and we use it for various optimizations (like not reading all the files under table folders, etc.).
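To sketch the difference (this is just a paraphrase of the behaviour described above, not lifted from the docs):
```
path_specs:
  # A name other than {table} inside the braces doesn't get the special
  # treatment, so the source falls back to reading all the matching files
  # (which is where the memory blew up here):
  - include: "s3://BUCKET-NAME/{table-1}/{partition_key[0]}={partition[0]}/*.gz"
  # With the {table} template, each top-level folder is treated as one
  # dataset and the source can skip reading every file underneath it:
  - include: "s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/*.gz"
```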
Do you know of any special tricks for S3 buckets where all the objects sit in the root of the bucket, with no "folder" prefixes? We have a ton of data from CloudFront, and it all gets written directly to S3 with no organization. It looks like I can either ingest individual files or all of them, but I'd love to ingest each bucket as an individual "table".
What does the folder structure look like?
Like so:
s3://BUCKET-NAME/HWBNQLUY2PSB3W.2021-06-29-20.30dd9b2a.gz
With thousands of files. Not my idea.
And how do you want to store it? I mean, how do you want to see this in DataHub?
I would want it like:
container: (a randomly generated urn:li:container: URN)
dataset: bucket-name
They could even both be the bucket name, perhaps.
With everything in the root folder, I'm not sure you can achieve this with the current source.
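The closest I can think of would be something like the sketch below (untested, hypothetical), but without a {table} folder level each matching object would show up as its own dataset rather than one dataset per bucket:
```
path_specs:
  # Hypothetical spec for a flat bucket: matches the root-level objects,
  # but every .gz file ends up as an individual dataset.
  - include: "s3://BUCKET-NAME/*.gz"
```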