Troubleshooting Data Ingestion from an S3 MinIO Data Lake

Original Slack Thread

Hi Team,
I'm trying to ingest data from my S3 MinIO data lake. When I run the following recipe, no error is shown, but no data is ingested either:
source:
  type: s3
  config:
    path_specs:
      - include: "s3://open-data-lake/tickit/bronze/*/*.parquet"
    aws_config:
      aws_access_key_id: '****'
      aws_secret_access_key: '****'
      aws_region: 'eu-central'
      aws_endpoint_url: 'http://my-minio-server:9000'

sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-gms:8080"

"Pipeline finished successfully; produced 0 events in 0.88 seconds."

This is the path of one of my tables:
Do I need to specify the partitions even if I don't really have any?

Can you please run the ingestion in debug mode and check in the logs which folder it tries to read?

You can also use the {table} name variable, e.g.
- include: "s3://datalake/{table}/*.parquet"
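
Applied to the bucket from the recipe above, that could look roughly like the sketch below; the folder layout under bronze/ is an assumption, so adjust the glob depth to match how the files were actually written:

```yml
# Sketch only: assumes each table lives in its own folder directly under bronze/,
# e.g. s3://open-data-lake/tickit/bronze/users/part-00000.parquet (names are illustrative)
path_specs:
  - include: "s3://open-data-lake/tickit/bronze/{table}/*.parquet"
```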

Thanks for the replies. It worked for now with the following path: include: "s3://open-data-lake/tickit/bronze/*/*.parquet/*.parquet"
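
That extra wildcard level is most likely needed because Spark writes each dataset as a directory (here apparently named with a .parquet suffix) containing the actual part files. Under that assumption, the {table} variable can be pointed at the table folder so each Spark output becomes one dataset; the table and file names below are illustrative only:

```yml
# Assumed layout (names are illustrative):
#   s3://open-data-lake/tickit/bronze/sales/sales.parquet/part-00000.snappy.parquet
#   where "sales" is the table folder and "sales.parquet" is the directory Spark wrote.
path_specs:
  - include: "s3://open-data-lake/tickit/bronze/{table}/*.parquet/*.parquet"
```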

I'm now struggling with how the ingestion pipeline should be set up. I load the data with Spark from different databases and perform some data cleaning on it. Should the ingestion then be done directly via a Spark listener (if that is even possible), or manually with a specific recipe (the path without the wildcards), or should the whole bucket be scanned at regular intervals? That's not really a DataHub-related question, but if you can give me a hint, that would be awesome (I'm quite new to this topic, sorry about that).

I would do the following:

  1. Use the S3 ingestion source to ingest individual S3 folders (setting a path_spec with a {table} placeholder), not individual files, as that can quickly become unmaintainable.
  2. If your folder structure in the bucket is like `s3://my-bucket/event/event_name/year=2023/month=10/day=11/1.parquet`, then you are good with a path_spec like `s3://my-bucket/event/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.parquet` (see the sketch after this list).
  3. If you have mixed data in the bucket, you must specify multiple path_specs in your recipe.
  4. Then, I would use the Spark lineage plugin to capture lineage edges between the files/folders that the S3 ingestion connector captured. Soon we will have an open-source Spark lineage plugin with path_spec support.
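
As referenced in point 2, here is a sketch of how points 2 and 3 could look in a single recipe. The first path_spec reuses the example bucket layout from point 2; the second path_spec is purely illustrative, and the aws_config/sink values are carried over from the original recipe:

```yml
source:
  type: s3
  config:
    path_specs:
      # Partitioned events: each folder under event/ becomes one dataset,
      # and year/month/day are picked up as partition keys.
      - include: "s3://my-bucket/event/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.parquet"
      # Illustrative second path_spec for non-partitioned data elsewhere in the same bucket.
      - include: "s3://my-bucket/reference/{table}/*.parquet"
    aws_config:
      aws_access_key_id: '****'
      aws_secret_access_key: '****'
      aws_region: 'eu-central'
      aws_endpoint_url: 'http://my-minio-server:9000'

sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-gms:8080"
```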