Implementing Column-Level Lineage for Kafka Topics in DataHub

Original Slack Thread

<@U06TM7M40RK> can't we have column-level lineage for Kafka topics?

DataHub does not currently support column-level lineage for Kafka topics out of the box: according to a <https://github.com/datahub-project/datahub/issues/10594|GitHub issue>, there is no direct support for column-level lineage from Kafka topics’ JSON contracts to tables. However, you can add this lineage programmatically if needed.

For guidance on implementing custom lineage, including column-level lineage, see the <https://datahubproject.io/docs/0.14.0/api/tutorials/lineage#add-column-level-lineage|DataHub API documentation>, which provides Python examples of adding column-level lineage programmatically.
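
As a rough illustration of what "adding it programmatically" involves, the sketch below builds the URNs and the field-to-field mapping for one Kafka topic and one table using only the Python standard library. The topic name `orders`, the Snowflake table `analytics.public.orders`, the field names, and the helper functions are all hypothetical. The dictionary mirrors the fields of DataHub's `upstreamLineage` aspect, but nothing is sent to DataHub here; to emit it, you would build the same structure with the SDK classes shown in the linked documentation.

```python
import json
from typing import Dict


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN string."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def schema_field_urn(ds_urn: str, field_path: str) -> str:
    """Build a DataHub schemaField URN for one column/field of a dataset."""
    return f"urn:li:schemaField:({ds_urn},{field_path})"


def column_lineage_aspect(upstream_ds: str, downstream_ds: str,
                          field_map: Dict[str, str]) -> dict:
    """Describe dataset-level plus column-level lineage as a plain dict.

    field_map maps an upstream field path to a downstream field path.
    """
    return {
        "upstreams": [{"dataset": upstream_ds, "type": "TRANSFORMED"}],
        "fineGrainedLineages": [
            {
                "upstreamType": "FIELD_SET",
                "upstreams": [schema_field_urn(upstream_ds, src)],
                "downstreamType": "FIELD",
                "downstreams": [schema_field_urn(downstream_ds, dst)],
            }
            for src, dst in field_map.items()
        ],
    }


# Hypothetical topic, table, and field names, for illustration only.
KAFKA_TOPIC = dataset_urn("kafka", "orders")
TABLE = dataset_urn("snowflake", "analytics.public.orders")

aspect = column_lineage_aspect(
    KAFKA_TOPIC, TABLE, {"order_id": "ORDER_ID", "amount": "AMOUNT"}
)
print(json.dumps(aspect, indent=2))
```

The field paths must match the paths DataHub has stored for each schema. For Kafka topics ingested from a schema registry these may be longer than the bare field name, so check the topic's schema in DataHub before building the mapping.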

Would you like more detailed guidance on how to implement such a custom solution?

Hi <@U06TM7M40RK>, how can we abort an ongoing ingestion in DataHub?

To abort an ongoing ingestion in DataHub, you can follow these steps:

  1. Identify the Ingestion Run: First, you need to identify the specific ingestion run that is currently running. You can list all the ingestion runs using the datahub ingest list-runs command.

    datahub ingest list-runs
    
  2. Cancel the Ingestion Run: Once you have the run ID of the running ingestion, try to cancel it. Note that the sources for this answer do not document a datahub ingest cancel command, so confirm that your CLI version provides one (for example with datahub ingest --help) before relying on it. If no cancel command is available, stop the process manually as described below.

  3. Manually Stop the Process: If the ingestion run is not responding to cancellation commands, you can manually stop the process by identifying the process ID (PID) and killing it. This can be done using Docker commands if you are running DataHub in Docker.

    # List running containers to find the datahub-actions container
    docker ps | grep datahub-actions
    
    # Capture the container ID (assumes exactly one datahub-actions container is running)
    CONTAINER_ID=$(docker ps | grep datahub-actions | awk '{print $1}')
    
    # List all processes running inside the datahub-actions container
    docker exec "$CONTAINER_ID" ps aux
    
    # Ask the stuck ingestion process to exit (SIGTERM); replace <PID> with its PID
    docker exec "$CONTAINER_ID" kill <PID>
    
    # If it is still running, force-kill it (SIGKILL)
    docker exec "$CONTAINER_ID" kill -9 <PID>
    
  4. Check Logs for Issues: After stopping the stuck ingestion, it is important to check the logs to understand why the ingestion got stuck. You can inspect the logs of the datahub-actions container.

    docker logs $CONTAINER_ID
    
  5. Verify Resource Allocation: Ensure that your Docker environment has sufficient CPU and memory allocated to avoid future issues.

    docker stats $CONTAINER_ID
    
  6. Check Disk Space: Verify that there is enough disk space available, as disk space issues can cause ingestion jobs to fail or get stuck.

    docker exec -it $CONTAINER_ID df -h
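
If you need to do the manual stop in step 3 repeatedly, the PID lookup can be scripted. The following is a minimal sketch, not an official DataHub tool: the helper names are hypothetical, and the default match string `datahub ingest` is an assumption about how the ingestion process appears in the `ps aux` output of your container, so check it against what you actually see.

```python
import subprocess
from typing import List


def find_ingestion_pids(ps_output: str, needle: str = "datahub ingest") -> List[int]:
    """Return PIDs from `ps aux` output whose command line contains `needle`."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # `ps aux` prints 11 columns; the last one (COMMAND) may contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        pid, command = fields[1], fields[10]
        if needle in command and pid.isdigit():
            pids.append(int(pid))
    return pids


def build_kill_command(container_id: str, pid: int, force: bool = False) -> List[str]:
    """Build the `docker exec ... kill` argument list (SIGTERM unless force=True)."""
    signal = "-9" if force else "-15"
    return ["docker", "exec", container_id, "kill", signal, str(pid)]


def stop_ingestion(container_id: str, force: bool = False) -> List[int]:
    """List processes in the container and signal every matching ingestion process."""
    ps = subprocess.run(
        ["docker", "exec", container_id, "ps", "aux"],
        capture_output=True, text=True, check=True,
    )
    pids = find_ingestion_pids(ps.stdout)
    for pid in pids:
        subprocess.run(build_kill_command(container_id, pid, force), check=True)
    return pids
```

Calling `stop_ingestion(container_id)` sends SIGTERM first so the process can exit cleanly. Pass `force=True` only if the process is still running afterwards.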
    

Would you like more detailed steps on any of these troubleshooting points or further assistance on a specific part of the process?

Sources:

  • <Slack Conversation on Ingestion Issues>