"Triggering Metadata Ingest and Uploading Recipes on the Fly - AWS Lambda, Airflow, and More"

Original Slack Thread

Newbie Questions:

  1. Is there a way to trigger metadata ingest with an AWS Lambda function (or any other scheduler, e.g. Airflow)?
  2. Can recipes be uploaded on the fly?

There are a couple of options for triggering or scheduling ingestion.

For example, we have a doc on using Airflow here: https://datahubproject.io/docs/metadata-ingestion/schedule_docs/airflow/

Recipes can be uploaded on the fly using the datahub ingest deploy command, but I would caution that it’s a somewhat advanced feature and is rarely needed. It would help to understand what you’re trying to accomplish with it.

We will most likely be ingesting metadata from different sources, including various S3 buckets. It sounds like we’d need to pre-create a recipe for each bucket, which works as long as we know the bucket paths in advance. But if we don’t know the S3 path in advance, for example when a bucket is created on the fly, can we reuse the same recipe with a parameter for the bucket name, or upload a unique recipe for that bucket?

I think the s3 source will iterate over the buckets it can access, so this may not be necessary

It depends on how frequently the set of paths is changing. Your options include dynamically modifying + uploading recipes, running CLI ingestion with an env var in the recipe to control the bucket, or running ingestion programmatically using Pipeline.create(config).run(...)

"running ingestion programmatically using Pipeline.create(config).run(...)" looks interesting… can you share some docs/blogs/readme’s on this?

Yup we have some sample code here https://datahubproject.io/docs/metadata-ingestion/#programmatic-pipeline
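
As a rough illustration of the programmatic route, the sketch below builds a recipe for a bucket whose name is only known at run time and hands it to Pipeline.create. The helper names (build_s3_recipe, run_ingestion, handler), the DATAHUB_GMS_URL variable, the event shape with a "bucket" key, and the include pattern are assumptions made for illustration, not something confirmed in this thread. Check the S3 source and REST sink docs for the exact config your DataHub version expects.

```python
import os


def build_s3_recipe(bucket: str, server: str = "http://localhost:8080") -> dict:
    """Return a recipe dict for a single bucket (hypothetical helper)."""
    return {
        "source": {
            "type": "s3",
            "config": {
                # Assumed pattern; adjust it to match your bucket layout.
                "path_specs": [{"include": f"s3://{bucket}/*.*"}],
                "aws_config": {
                    "aws_region": os.environ.get("AWS_REGION", "us-east-1")
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_ingestion(bucket: str) -> None:
    """Run one ingestion pass for the given bucket."""
    # Imported lazily so the recipe builder works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    # DATAHUB_GMS_URL is an assumed variable name for the DataHub server URL.
    server = os.environ.get("DATAHUB_GMS_URL", "http://localhost:8080")
    pipeline = Pipeline.create(build_s3_recipe(bucket, server))
    pipeline.run()
    pipeline.raise_from_status()


def handler(event, context):
    """AWS Lambda entry point; assumes the event carries a "bucket" key."""
    run_ingestion(event["bucket"])
    return {"status": "ok", "bucket": event["bucket"]}
```

Running this inside Lambda would require packaging the DataHub Python package (with its S3 extra) into the deployment artifact or a container image. The same run_ingestion function could equally be called from an Airflow task.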

Is it possible to create the pipeline via a Java API?

Nope - our ingestion sources are written in Python. That said, for most use cases, using UI ingestion is the easiest and least error-prone option.

Thanks for the reply. I was aiming to create ingestion sources dynamically. I guess I could do it via the GraphQL API.
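
For anyone exploring the GraphQL route, here is a minimal sketch of what creating an ingestion source over HTTP might look like. The mutation name createIngestionSource, the input fields (name, type, schedule, config.recipe, config.executorId), and the helper names are assumptions based on the DataHub GraphQL schema as I understand it; verify them against the schema exposed by your own instance before relying on this.

```python
import json
import urllib.request

# Assumed mutation shape; confirm against your DataHub GraphQL schema.
CREATE_SOURCE_MUTATION = """
mutation createIngestionSource($input: UpdateIngestionSourceInput!) {
  createIngestionSource(input: $input)
}
"""


def build_create_source_payload(name, recipe, cron="0 0 * * *", timezone="UTC"):
    """Build the GraphQL request body for a new ingestion source."""
    return {
        "query": CREATE_SOURCE_MUTATION,
        "variables": {
            "input": {
                "name": name,
                "type": recipe["source"]["type"],
                "schedule": {"interval": cron, "timezone": timezone},
                # The recipe is sent as a JSON-encoded string.
                "config": {"recipe": json.dumps(recipe), "executorId": "default"},
            }
        },
    }


def post_graphql(url, token, payload):
    """POST the payload to a GraphQL endpoint and return the decoded JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A caller would build the payload from a recipe dict and pass it to post_graphql along with the instance's GraphQL endpoint URL and a personal access token. Because the request is plain HTTP, the same call can be made from Java or any other language.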