Synchronizing Deleted DAGs Between Airflow and DataHub to Maintain Consistency

Original Slack Thread

I have connected DataHub to Airflow using the DataHub REST API plugin in Airflow. DAGs that I delete in Airflow still exist in DataHub. How can I keep the DAGs consistent across both platforms?

To make sure that DAGs deleted in Airflow are also removed from DataHub, you need to run a cleanup. According to the Airflow Integration page of the DataHub documentation, metadata for DAGs removed from Airflow is not deleted from DataHub automatically. You can, however, clean up the obsolete pipelines and tasks yourself.

Here are two ways to do this:

  1. Create a Cleanup DAG: Create a DAG in Airflow dedicated to cleaning up obsolete metadata in DataHub. It targets pipelines and tasks that no longer exist in Airflow but are still present in DataHub.

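    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # The DAG id "Datahub_Cleanup" is the name given in the DataHub docs.
    with DAG(
        "Datahub_Cleanup",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,  # no schedule: trigger manually when needed
        catchup=False,
    ) as dag:
        # The task only prints a message; it is attached to the DAG
        # automatically because it is created inside the `with` block.
        task = BashOperator(
            task_id="cleanup_obsolete_data",
            bash_command="echo 'cleaning up the obsolete data from datahub'",
        )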
    

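    Once this DAG is ingested, all obsolete pipelines and tasks are removed from DataHub, scoped by the cluster value set in airflow.cfg. The task itself only echoes a message, so the removal is carried out on the DataHub side of the integration rather than by the bash command.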

  2. Manual Deletion: If you prefer a manual approach, use the DataHub CLI to delete the metadata associated with the removed DAGs. First identify the URNs of the DAGs you want to delete, then run the datahub delete command as either a soft or a hard delete. A soft delete hides the entity but keeps its metadata so it can be restored; a hard delete removes the metadata permanently.

    For example, to perform a soft delete:

    datahub delete --urn "<urn-of-your-dag>" --soft
    

    To perform a hard delete:

    datahub delete --urn "<urn-of-your-dag>" --hard
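
The manual approach above depends on first working out which URNs are obsolete. Here is a minimal sketch of that step, not an official tool. It assumes that Airflow pipelines appear in DataHub as dataFlow URNs of the form urn:li:dataFlow:(airflow,<dag_id>,<cluster>), that the cluster is "prod", and that you have already collected the two input lists yourself. The function names and sample DAG ids are hypothetical. Tasks have their own URNs and would likely need the same treatment.

```python
import shlex


def dataflow_urn(dag_id: str, cluster: str = "prod") -> str:
    """Build the assumed DataHub dataFlow URN for an Airflow DAG."""
    return f"urn:li:dataFlow:(airflow,{dag_id},{cluster})"


def obsolete_urns(airflow_dag_ids, datahub_flow_urns, cluster="prod"):
    """Return URNs found in DataHub whose DAG no longer exists in Airflow.

    Assumes datahub_flow_urns is already limited to the given cluster.
    """
    live = {dataflow_urn(dag_id, cluster) for dag_id in airflow_dag_ids}
    return sorted(urn for urn in set(datahub_flow_urns) if urn not in live)


def delete_command(urn: str, hard: bool = False) -> str:
    """Render the `datahub delete` command from the thread for one URN."""
    mode = "--hard" if hard else "--soft"
    # Quote the URN so the shell does not interpret its parentheses.
    return f"datahub delete --urn {shlex.quote(urn)} {mode}"


if __name__ == "__main__":
    # Hypothetical inputs; in practice these come from Airflow and DataHub.
    airflow_dags = ["daily_sales", "hourly_sync"]
    datahub_urns = [
        "urn:li:dataFlow:(airflow,daily_sales,prod)",
        "urn:li:dataFlow:(airflow,old_report,prod)",
    ]
    for urn in obsolete_urns(airflow_dags, datahub_urns):
        print(delete_command(urn))
```

The sketch only prints the commands instead of running them, so you can review the list before deleting anything. It defaults to soft deletes because those can be undone.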
    
