Hi Team,
I'm trying to ingest data from my S3 (MinIO) data lake. When I run the following recipe, no error is shown, but no data is ingested either:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://open-data-lake/tickit/bronze/*/*.parquet"
```
I'm also struggling with how the ingestion pipeline should be structured. I load data from different databases with Spark and perform some data cleaning on it. Should ingestion then happen directly via a Spark listener (if that's even possible?), manually with a specific recipe (the path without the wildcards), or should the whole bucket be scanned at regular intervals? That's not really a DataHub-specific question, but a hint would be awesome (I'm quite new to this topic, sorry about that).
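First, on the empty result: by default the S3 source talks to AWS itself, so for MinIO you have to point it at your MinIO endpoint via aws_config, otherwise it can silently find nothing. A minimal sketch of the recipe, assuming a hypothetical MinIO endpoint at http://localhost:9000 and placeholder credentials:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://open-data-lake/tickit/bronze/*/*.parquet"
    aws_config:
      aws_access_key_id: "minio-access-key"       # placeholder, use your MinIO key
      aws_secret_access_key: "minio-secret-key"   # placeholder, use your MinIO secret
      aws_endpoint_url: "http://localhost:9000"   # hypothetical MinIO endpoint
      aws_region: "us-east-1"                     # MinIO ignores this, but the client needs a value
```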
As for the pipeline question: use Spark lineage to ingest individual S3 folders (by setting a path_spec with the {table} placeholder), not individual files, as file-level datasets can quickly become unmaintainable.
If your folder structure in the bucket looks like s3://my-bucket/event/event_name/year=2023/month=10/day=11/1.parquet, then a path_spec like `s3://my-bucket/event/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.parquet` will work.
If you have mixed data in the bucket, you must specify multiple path_specs in your recipe.
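Applied to your bucket, a sketch of the path_specs section could look like this, assuming each folder under bronze/ corresponds to one table (the second entry is a hypothetical example of another layout mixed into the same lake):

```yaml
path_specs:
  # each folder under bronze/ becomes one dataset named after {table}
  - include: "s3://open-data-lake/tickit/bronze/{table}/*.parquet"
  # hypothetical second layout with Hive-style partitions
  - include: "s3://open-data-lake/events/{table}/{partition_key[0]}={partition[0]}/*.parquet"
```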
Then I would use the Spark lineage plugin to capture lineage edges between the files/folders that the S3 ingestion connector captured. Soon we will have an open-source Spark lineage plugin with path_spec support.
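To answer the listener part of your question: yes, lineage can be emitted directly from the Spark job via the DataHub Spark listener. A minimal sketch in PySpark, assuming DataHub GMS is reachable at http://localhost:8080 and using a placeholder agent version (check the docs for the current artifact):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tickit-bronze-cleaning")  # hypothetical job name
    # DataHub Spark lineage agent; the version below is a placeholder
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://localhost:8080")  # your DataHub GMS endpoint
    .getOrCreate()
)

# Reads and writes done through this session are then reported as lineage,
# e.g. reading from a source database and writing cleaned parquet to the bronze layer.
```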