Mitigating Memory Issues with S3 Integration in DataHub

Original Slack Thread

Has anyone else found the S3 integration to be particularly memory-hungry, and does anyone know of any ways to mitigate it? I kept hitting OOMs when trying to pull metadata from a bucket that contains 8 distinct datasets, 80MM gzipped files in total to crawl.
The attached graph shows memory usage on the datahub-actions pod as I tried to extract from S3.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

DataHub 0.12.0.

Integration definition:

```yaml
source:
  type: s3
  config:
    platform: s3
    platform_instance: s3-raw-data-sources
    path_specs:
      - include: "s3://BUCKET-NAME/{table-1}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-2}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-3}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-4}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-5}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-6}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-7}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table-8}/*.gz"
    env: STG
    profiling:
      enabled: false
    aws_config:
      aws_region: us-east-1
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
```

instead of {table-1} can you use just {table}?

Hi, thanks for looking into this! In our case these are the names of different data sources in the bucket, which I’m altering before posting here. A single real example from my spec:
```yaml
- include: "s3://BUCKET-NAME/{ghost-bid-logs}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.gz"
```
Here’s what one object key looks like:
Is my understanding of the path spec in the documentation correct, and is this formatted properly to capture ghost-bid-logs as a “table name”, and all of the partitions for this given example?

I’ll note that I hit OOM issues even when I include one path here out of 8.

Nope, looks like I misread it!
Moving to literally “{table}” worked; it ingested everything and didn’t OOM.
Paths ended up looking like this to capture all potential permutations:

```yaml
      - include: "s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/{partition_key[4]}={partition[4]}/*.gz"
      - include: "s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/{partition_key[3]}={partition[3]}/*.gz"
      - include: "s3://BUCKET-NAME/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.gz"
      - include: "s3://BUCKET-NAME/{table}/*.gz"
```

<@U05JJ9WESHL> yeah, the {table} template parameter is a special one, and we use it for various optimizations (like not reading all the files under table folders, etc.).
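For reference, a minimal path-spec sketch that keeps the special {table} template in play. The `sample_files` option name is taken from the DataHub S3 source docs as I recall them (it limits schema inference to a sample of files per table rather than reading every object); treat the exact field names as assumptions and check the reference for your version:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://BUCKET-NAME/{table}/*.gz"
        sample_files: true  # assumed option: infer schema from a sample, not every file
    aws_config:
      aws_region: us-east-1
```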

Do you know any special tricks for S3 buckets where all the objects are in the root of the bucket, with no “folder” prefixes? We have a ton of data from CloudFront, and it all gets written directly to S3 with no organization. It looks like I can either ingest individual files or all of them, but I’d love to ingest each bucket as an individual “table”.

What does the folder structure look like?

Like so:
With thousands of files. Not my idea :slightly_smiling_face:

and how do you want to store it?

I mean, how do you want to see this in DataHub?

I would want it like:
container: (randomly generated urn:li:container: URN)
dataset: bucket-name
They could even both be the bucket name perhaps.

With everything in the root folder, I’m not sure you can achieve this with the current source.
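One untested idea, in case anyone wants to experiment: the path-spec model appears (in recent versions) to accept a `table_name` override for naming the resulting dataset. I haven’t verified that it works without a {table} segment in the include path, so treat both the option and the behaviour as assumptions, not a confirmed workaround:

```yaml
path_specs:
  - include: "s3://BUCKET-NAME/*.gz"
    table_name: "BUCKET-NAME"  # hypothetical: force all root-level objects into one dataset
```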