Setting AWS Permissions and Creating Ingestion Recipes for Pulling Metadata from MWAA into DataHub

Original Slack Thread

<@U06TM7M40RK> What are the permissions required on AWS to pull metadata from MWAA to Datahub?

To pull metadata from Amazon Managed Workflows for Apache Airflow (MWAA) into DataHub, your AWS credentials need the relevant MWAA permissions. The provided data sources do not mention MWAA explicitly, so the guidance below is inferred rather than documented.

For comparable AWS services such as AWS Glue, the required permissions are typically list and get actions on metadata, so MWAA most likely needs a similar read-only set.

Here is an example IAM policy with read-only permissions. Note that MWAA's IAM actions use the airflow: service prefix, and that these actions cover environment-level metadata (configuration and tags), not individual DAGs or DAG runs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "airflow:ListEnvironments",
        "airflow:GetEnvironment",
        "airflow:ListTagsForResource"
      ],
      "Resource": "*"
    }
  ]
}
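If you would rather not grant access to every environment, the policy can be narrowed. The following is a minimal Python sketch, not an official template: the function name `build_mwaa_read_policy` and the region, account ID, and environment name are hypothetical placeholders. It assumes the IAM service prefix for MWAA is `airflow` and that listing environments is not scoped to a single environment ARN, so that action keeps `"Resource": "*"`.

```python
import json


def build_mwaa_read_policy(region: str, account_id: str, environment: str) -> dict:
    """Build a read-only IAM policy document for one MWAA environment."""
    # MWAA environment ARN; the IAM service prefix for MWAA is "airflow".
    env_arn = f"arn:aws:airflow:{region}:{account_id}:environment/{environment}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assumption: listing environments is not scoped to a single
                # environment ARN, so it is kept in its own statement.
                "Effect": "Allow",
                "Action": ["airflow:ListEnvironments"],
                "Resource": "*",
            },
            {
                # Read access limited to the one named environment.
                "Effect": "Allow",
                "Action": ["airflow:GetEnvironment", "airflow:ListTagsForResource"],
                "Resource": env_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values; replace with your own region, account, and environment.
    policy = build_mwaa_read_policy("us-east-1", "123456789012", "my-mwaa-env")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console or attached with your usual infrastructure tooling.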

Would you like more details on how to configure these permissions or any other assistance?

<@U06TM7M40RK> Create recipe in this case?

To create an ingestion recipe for pulling metadata from Amazon Managed Workflows for Apache Airflow (MWAA) into DataHub, you define the source and sink configurations in a YAML file. The provided data sources do not explicitly mention MWAA, so the recipe below adapts the general approach used for other sources.

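Here is an illustrative ingestion recipe (mwaa_ingestion.yaml). Treat it as a template only: the airflow source type and the option names below are not confirmed by the provided sources, so check them against DataHub's list of supported sources before use. DataHub's documented Airflow integration is a plugin (acryl-datahub-airflow-plugin) that runs inside Airflow and pushes metadata to DataHub, rather than a pull-based recipe.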

source:
  type: "airflow"
  config:
    airflow_url: "https://your-mwaa-endpoint.amazonaws.com"
    username: "your_airflow_username"
    password: "your_airflow_password"
    env: "PROD"
    include_task_templates: true
    include_lineage: true
    include_task_instances: true
    include_dag_runs: true

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your_datahub_token"  # Replace with your DataHub auth token if required

Explanation:

  1. Source Configuration:

    • type: Specifies the type of source, which is airflow in this case.
    • airflow_url: The URL of your MWAA endpoint.
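    • username and password: Credentials for accessing the Airflow web server. MWAA normally authenticates web access through IAM (web login tokens) rather than a static username and password, so these may not apply as-is.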
    • env: The environment name (e.g., PROD).
    • include_task_templates, include_lineage, include_task_instances, include_dag_runs: Flags to include various metadata from Airflow.
  2. Sink Configuration:

    • type: Specifies the type of sink, which is datahub-rest.
    • server: The URL of your DataHub instance.
    • token: Your DataHub authentication token.
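Since the recipe carries a password and a token, it is usually safer to keep them out of the file. DataHub recipes can reference environment variables with `${VAR}` placeholders. The following is a small Python sketch of a pre-flight check that reports placeholders with no matching environment variable. The helper name `missing_env_vars` and the variable names `DATAHUB_SERVER_URL` and `DATAHUB_TOKEN` are hypothetical, chosen for illustration only.

```python
import os
import re

# Matches ${NAME} placeholders as they would appear in a recipe file.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_env_vars(recipe_text, env=None):
    """Return the sorted names of ${VAR} placeholders that are not set in env."""
    env = os.environ if env is None else env
    names = set(_PLACEHOLDER.findall(recipe_text))
    return sorted(name for name in names if name not in env)


# Hypothetical sink section using placeholders instead of literal secrets.
EXAMPLE_SINK = """\
sink:
  type: "datahub-rest"
  config:
    server: "${DATAHUB_SERVER_URL}"
    token: "${DATAHUB_TOKEN}"
"""

if __name__ == "__main__":
    missing = missing_env_vars(EXAMPLE_SINK)
    if missing:
        print("Set these variables before running ingestion:", ", ".join(missing))
    else:
        print("All referenced environment variables are set.")
```

Running the check before ingestion turns a confusing authentication failure into an explicit list of unset variables.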

Running the Ingestion:
To run the ingestion with the above recipe, use the following command:

datahub ingest -c mwaa_ingestion.yaml

This command reads the recipe, runs the configured source, and sends the resulting metadata to your DataHub instance.
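Before writing to DataHub, it can help to validate the recipe first. The sketch below wraps the same command in Python and adds the CLI's `--dry-run` option, which runs the ingestion steps without writing to the sink. The function names `build_ingest_command` and `run_ingest` are hypothetical helpers, and the sketch assumes the `datahub` CLI is installed and on your PATH.

```python
import subprocess


def build_ingest_command(recipe_path, dry_run=True):
    """Return the datahub CLI invocation for a recipe as an argument list."""
    cmd = ["datahub", "ingest", "-c", recipe_path]
    if dry_run:
        # Run the ingestion steps without writing to the sink.
        cmd.append("--dry-run")
    return cmd


def run_ingest(recipe_path, dry_run=True):
    """Run the ingestion and return the CLI exit code."""
    # Assumes the datahub CLI is installed; raises FileNotFoundError otherwise.
    result = subprocess.run(build_ingest_command(recipe_path, dry_run), check=False)
    return result.returncode


if __name__ == "__main__":
    # Only prints the command; call run_ingest(...) to execute it.
    print(" ".join(build_ingest_command("mwaa_ingestion.yaml")))
```

Once the dry run succeeds, calling `run_ingest("mwaa_ingestion.yaml", dry_run=False)` performs the real ingestion.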

Would you like more detailed steps or further assistance on any specific part of this process?

Sources:

  • Ingestion | DataHub Ingestion Overview
