Hi Team,
I'm trying to ingest data from my S3 (MinIO) data lake. When I run the following recipe, no error is shown, but no data is ingested either:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://open-data-lake/tickit/bronze/*/*.parquet"
```
I'm also struggling with how the ingestion pipeline should be structured. I load data from different databases with Spark and perform some data cleaning on it. Should ingestion then happen directly via a Spark listener (if that's even possible?), manually with a specific recipe (the path without the wildcards), or should the whole bucket be scanned at regular intervals? That's not really a DataHub-specific question, but a hint would be awesome (I'm quite new to this topic, sorry about that).
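First, on the empty result: by default the S3 source talks to AWS itself, so for MinIO you have to point it at your MinIO endpoint via aws_config, otherwise it can silently find nothing. A minimal sketch of the recipe, assuming a hypothetical MinIO endpoint at http://localhost:9000 and placeholder credentials:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: "s3://open-data-lake/tickit/bronze/*/*.parquet"
    aws_config:
      aws_access_key_id: "minio-access-key"       # placeholder, use your MinIO key
      aws_secret_access_key: "minio-secret-key"   # placeholder, use your MinIO secret
      aws_endpoint_url: "http://localhost:9000"   # hypothetical MinIO endpoint
      aws_region: "us-east-1"                     # MinIO ignores this, but the client needs a value
```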
As for the pipeline question: use Spark lineage to ingest individual S3 folders (by setting a path_spec with the {table} placeholder), not individual files, as file-level datasets can quickly become unmaintainable.
If your folder structure in the bucket looks like s3://my-bucket/event/event_name/year=2023/month=10/day=11/1.parquet, then a path_spec like `s3://my-bucket/event/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/{partition_key[2]}={partition[2]}/*.parquet` will work.
If you have mixed data in the bucket, you must specify multiple path_specs in your recipe.
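Applied to your bucket, a sketch of the path_specs section could look like this, assuming each folder under bronze/ corresponds to one table (the second entry is a hypothetical example of another layout mixed into the same lake):

```yaml
path_specs:
  # each folder under bronze/ becomes one dataset named after {table}
  - include: "s3://open-data-lake/tickit/bronze/{table}/*.parquet"
  # hypothetical second layout with Hive-style partitions
  - include: "s3://open-data-lake/events/{table}/{partition_key[0]}={partition[0]}/*.parquet"
```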
Then I would use the Spark lineage plugin to capture lineage edges between the files/folders that the S3 ingestion connector captured. Soon we will have an open-source Spark lineage plugin with path_spec support.
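To answer the listener part of your question: yes, lineage can be emitted directly from the Spark job via the DataHub Spark listener. A minimal sketch in PySpark, assuming DataHub GMS is reachable at http://localhost:8080 and using a placeholder agent version (check the docs for the current artifact):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tickit-bronze-cleaning")  # hypothetical job name
    # DataHub Spark lineage agent; the version below is a placeholder
    .config("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.23")
    .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
    .config("spark.datahub.rest.server", "http://localhost:8080")  # your DataHub GMS endpoint
    .getOrCreate()
)

# Reads and writes done through this session are then reported as lineage,
# e.g. reading from a source database and writing cleaned parquet to the bronze layer.
```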