Scheduling Metadata Ingestion from MWAA into DataHub on EKS

Original Slack Thread

How can we schedule metadata ingestion from MWAA into DataHub hosted on EKS?

To schedule metadata ingestion from Amazon Managed Workflows for Apache Airflow (MWAA) into DataHub hosted on Amazon EKS, you can follow these steps:

  1. Create Ingestion Recipes: Write YAML configuration files (recipes) for each data source you want to ingest metadata from. Each recipe defines a source (the system to pull metadata from) and a sink (the DataHub REST endpoint on your EKS deployment).

  2. Set Up Airflow DAGs: Configure Airflow DAGs in MWAA to run these ingestion recipes on a schedule.

  3. Deploy and Schedule DAGs: Deploy the DAGs to your MWAA environment and schedule them to run at your desired intervals.

Step-by-Step Guide

1. Create Ingestion Recipes
Create a YAML recipe for each data source you want to pull metadata from. Note that DataHub does not provide an MWAA/Airflow recipe source (Airflow task and lineage metadata is emitted by the DataHub Airflow plugin instead), so the recipe targets whichever upstream system you want to ingest. The sketch below assumes a Postgres source purely as an illustration; substitute the connector and settings for your own system:

# example-recipe.yml
source:
  type: postgres          # example source type; use the connector for your system
  config:
    host_port: "<your-db-host>:5432"
    database: "<your-database>"
    username: "<username>"
    password: "<password>"

sink:
  type: "datahub-rest"
  config:
    server: "http://<your-datahub-gms-host>:8080"   # DataHub GMS endpoint exposed from EKS
    token: "<your-datahub-access-token>"

2. Set Up Airflow DAGs
Create a DAG in MWAA to run the ingestion recipe. Below is an example DAG:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
}

with DAG(
    'datahub_ingestion_dag',
    default_args=default_args,
    description='A DAG to ingest metadata into DataHub',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),  # any fixed date in the past
    catchup=False,
) as dag:

    # Requires the DataHub CLI (acryl-datahub) to be installed in the MWAA environment
    # via requirements.txt, and the recipe file to be available on the workers
    # (e.g., uploaded next to the DAGs, which MWAA syncs to /usr/local/airflow/dags).
    ingest_task = BashOperator(
        task_id='ingest_metadata',
        bash_command='datahub ingest -c /path/to/your/example-recipe.yml',
    )

3. Deploy and Schedule DAGs

  1. Deploy the DAG: Upload the DAG file (and the recipe) to the S3 bucket configured for your MWAA environment, under the dags/ prefix, and make sure requirements.txt installs the DataHub CLI, as shown in the sketch below.
  2. Schedule the DAG: The schedule_interval set in the DAG definition (e.g., '@daily') controls how often ingestion runs; adjust it to the cadence you need.
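
A minimal sketch of the upload step using boto3 (the bucket name is a placeholder for your MWAA environment's S3 bucket, and the dags/ and requirements.txt locations assume the default MWAA layout):

import boto3

MWAA_BUCKET = '<your-mwaa-bucket>'  # the S3 bucket configured for your MWAA environment

s3 = boto3.client('s3')

# MWAA watches the dags/ prefix and syncs it to /usr/local/airflow/dags on the workers,
# so shipping the recipe next to the DAG makes it available at a predictable path.
s3.upload_file('datahub_ingestion_dag.py', MWAA_BUCKET, 'dags/datahub_ingestion_dag.py')
s3.upload_file('example-recipe.yml', MWAA_BUCKET, 'dags/example-recipe.yml')

# requirements.txt should include the DataHub CLI, e.g. acryl-datahub[datahub-rest] plus
# the extra for your source, so the datahub command exists on the workers. After uploading
# a new version, update the MWAA environment so it picks up the new file.
s3.upload_file('requirements.txt', MWAA_BUCKET, 'requirements.txt')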

Additional Considerations

  • Authentication: Ensure that your DataHub access token is stored securely (for example in an Airflow Variable or AWS Secrets Manager) rather than hard-coded in the recipe, and injected into the ingestion task at run time; see the sketch after this list.
  • Network Configuration: Ensure that the MWAA environment's VPC can reach the DataHub GMS endpoint on EKS; security groups, routing (same VPC or peering), and the load balancer or ingress in front of GMS all need to allow that traffic.
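
A minimal sketch of the authentication approach, assuming an Airflow Variable named datahub_gms_token (the variable name and recipe path are examples); MWAA can back Airflow Variables with AWS Secrets Manager, and DataHub recipes expand ${...} environment variables, so the recipe's token field can read "${DATAHUB_GMS_TOKEN}" instead of a literal value:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Variant of the DAG from step 2 that injects the token at run time.
with DAG(
    'datahub_ingestion_dag',
    schedule_interval='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    ingest_task = BashOperator(
        task_id='ingest_metadata',
        bash_command='datahub ingest -c /usr/local/airflow/dags/example-recipe.yml',
        # "env" is a templated field, so the token is resolved at run time from the
        # Airflow Variable (which MWAA can store in AWS Secrets Manager).
        env={'DATAHUB_GMS_TOKEN': '{{ var.value.datahub_gms_token }}'},
        append_env=True,  # Airflow 2.3+; keeps PATH so the datahub CLI is still found
    )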

Would you like more details on any specific part of this setup?
