Troubleshooting Superset Dataset Ingestion in DataHub

Original Slack Thread

Hi,

I am running Superset v4.0.2 and DataHub v0.14.0. The ingestion seems to work with the following config:

    type: superset
    config:
        connect_uri: 'http://host:8088'
        username: username
        password: password
However, while dashboards and charts are imported, datasets are not (see screenshot).
Is that integration supported? The datasets are Trino-backed.


To address your issue with Superset ingestion where dashboards and charts are imported but datasets are not, let’s break down the possible causes and solutions.

Supported Integration
According to the [Superset ingestion documentation](https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/superset), the Superset source should support the ingestion of charts, dashboards, and associated metadata. However, there have been known issues and ongoing improvements related to dataset ingestion.

Troubleshooting Steps

  1. Verify API Endpoints:
    Ensure that your Superset instance is exposing the necessary API endpoints. The Superset connector extracts datasets from the /api/v1/dataset/ endpoint. Note that Superset's REST API expects a bearer token rather than HTTP basic auth: first log in via /api/v1/security/login, then call the endpoint with the returned access token.

    curl -s -X POST http://host:8088/api/v1/security/login \
        -H 'Content-Type: application/json' \
        -d '{"username": "username", "password": "password", "provider": "db", "refresh": true}'
    curl -s -H 'Authorization: Bearer <access_token>' http://host:8088/api/v1/dataset/
    
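The same check can be scripted so the token handling and paging are automatic. A minimal sketch, assuming the standard `db` auth provider and default API paths; the host, username, and password below are placeholders:

```python
import json
import urllib.request

SUPERSET = "http://host:8088"  # placeholder host


def login_payload(username, password):
    """Request body for Superset's /api/v1/security/login (db auth provider)."""
    return {"username": username, "password": password,
            "provider": "db", "refresh": True}


def api_request(path, token=None, body=None):
    """POST `body` as JSON if given, otherwise GET; parse the JSON reply."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        SUPERSET + path,
        data=json.dumps(body).encode() if body is not None else None,
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    token = api_request("/api/v1/security/login",
                        body=login_payload("username", "password"))["access_token"]
    # Page through the dataset listing; `q` uses Superset's rison query syntax.
    page, total = 0, 0
    while True:
        reply = api_request(f"/api/v1/dataset/?q=(page:{page},page_size:100)", token)
        batch = reply.get("result", [])
        if not batch:
            break
        for ds in batch:
            print(ds.get("table_name"), "-",
                  (ds.get("database") or {}).get("database_name"))
        total += len(batch)
        page += 1
    print(f"{total} datasets visible to this user")
```

If this prints zero datasets, the ingestion user cannot see them either, which points at permissions rather than the connector.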
  2. Check Permissions:
    Ensure that the account used for ingestion has read access to the datasets in Superset (for example, an Admin role or a role granted dataset read permissions). Permissions issues have previously prevented the connector from retrieving dataset metadata even when dashboards and charts came through.

  3. Enable Debug Logging:
    Run the ingestion with the --debug flag to get detailed logs. This can help identify any issues during the ingestion process.

    datahub --debug ingest run -c <recipe-path> &> /tmp/superset-ingest.log
    
  4. Inspect the Payload:
    Add logging to inspect the payload being sent to DataHub. This can help determine if the payload is empty or if there are any issues with the data being retrieved from Superset.
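A low-effort way to inspect the payload is to point the recipe at DataHub's `file` sink and then summarize what was emitted. A sketch, assuming the sink wrote to `/tmp/superset_mces.json` (a hypothetical path) and emitted the usual JSON list of records:

```python
import json
from collections import Counter


def summarize_records(records):
    """Count emitted records per entity type.

    MCP-style records carry an `entityType` key; legacy MCE-style records
    carry a `proposedSnapshot` keyed by the snapshot class name instead.
    """
    counts = Counter()
    for rec in records:
        if "entityType" in rec:
            counts[rec["entityType"]] += 1
        elif "proposedSnapshot" in rec:
            counts["mce:" + next(iter(rec["proposedSnapshot"]))] += 1
        else:
            counts["unknown"] += 1
    return counts


if __name__ == "__main__":
    with open("/tmp/superset_mces.json") as f:  # assumed file-sink output path
        records = json.load(f)
    for entity_type, n in summarize_records(records).most_common():
        print(entity_type, n)
```

If no `dataset` entries appear in the summary, the connector never produced dataset metadata, so the problem is on the extraction side rather than in the DataHub sink.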

  5. Check Stateful Ingestion Configuration:
    Ensure that your stateful ingestion configuration is correctly set up. Verify that the stateful_ingestion settings are properly configured in your ingestion recipe.

Example Configuration

Here is an example of an ingestion recipe for Superset with stateful ingestion enabled:

    source:
      type: superset
      config:
        connect_uri: 'http://host:8088'
        username: 'username'
        password: 'password'
        provider: db
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: false
    sink:
      type: datahub-rest
      config:
        server: 'http://localhost:8080'
        token: 'your-token'

Known Issues and Updates
There have been updates and discussions regarding the ingestion of datasets from Superset. For instance, a [GitHub pull request](https://github.com/datahub-project/datahub/pull/10592) introduces support for the Datasets entity in the Superset data ingestion plugin, enhancing its integration capabilities with DataHub. This might be relevant to your issue.

Would you like more detailed steps or further assistance on any specific part of the process?
