Troubleshooting Dataset Ingestion Issue with Superset Configuration

Original Slack Thread

I'm trying to configure Superset ingestion. I get all the charts and dashboards, but I don't get any datasets.
I'm using DataHub v0.12.0 with this recipe:
source:
  type: superset
  config:
    connect_uri: '<dns>'
    username: <username>
    password: <password>
    provider: db

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


To help you more effectively, could you please provide the following information:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (You mentioned v0.12.0, please confirm)
  3. What data source(s) are you integrating with DataHub? (e.g., Superset)

Additionally, please share the relevant part of your ingestion recipe, especially the source configuration.

Based on the information you provided, it seems like you are able to ingest charts and dashboards but not datasets. This issue might be related to the configuration or permissions. Here are a few steps you can follow to troubleshoot:

  1. Verify API Endpoints: Ensure that your Superset instance is exposing the necessary API endpoints. The Superset connector extracts datasets by contacting the /api/v1/dataset/ endpoint. You can manually check this endpoint to see if it returns the expected data.

  2. Check Permissions: Ensure that the user account used for the ingestion has the necessary permissions to access the datasets. There have been instances where permissions issues prevented the ingestion from retrieving the data.

  3. Enable Debug Logging: Run the ingestion with the --debug flag to get detailed logs. This can help identify any issues during the ingestion process. You can use the following command:

    datahub --debug ingest run -c <recipe-path> &> /tmp/superset-ingest.log
    
  4. Inspect the Payload: Add logging to inspect the payload being sent to DataHub. This can help determine if the payload is empty or if there are any issues with the data being retrieved from Superset.

  5. Check Stateful Ingestion Configuration: Ensure that your stateful ingestion configuration is correctly set up. Verify that the stateful_ingestion settings are properly configured in your ingestion recipe.
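
Once the debug log from step 3 exists, a quick filter can surface the lines worth reading first. The following is a minimal sketch and not part of DataHub: the helper names `interesting_lines` and `scan_log` and the keyword list are illustrative assumptions, and the default path is simply the one used in the command above.

```python
from pathlib import Path
from typing import Iterable, List, Sequence

# Assumed keywords; adjust them to whatever your log actually contains.
KEYWORDS = ("error", "warning", "dataset", "401", "403")


def interesting_lines(lines: Iterable[str],
                      keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Return the lines that mention any keyword, case-insensitively."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line.rstrip("\n"))
    return hits


def scan_log(path: str = "/tmp/superset-ingest.log") -> List[str]:
    """Read the debug log written by the ingest command and filter it."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return interesting_lines(text.splitlines())
```

Calling `scan_log()` after a run returns the matching lines; if nothing mentions datasets at all, that suggests the connector never requested them rather than that a request failed.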

Here is an example of an ingestion recipe for Superset with stateful ingestion enabled:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    username: 'your-username'
    password: 'your-password'
    provider: db
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-token'

Would you like more detailed steps or further assistance?

ui
v0.12.0
Superset

This is my configuration. Why does my problem occur? <@U06TM7M40RK>

Thank you for providing the details. Given that you are using the UI for ingestion with DataHub v0.12.0 and integrating with Superset, let’s focus on the configuration and potential issues.

Here are some steps to troubleshoot why datasets are not being ingested:

  1. Verify API Endpoints:
    Ensure that your Superset instance is exposing the necessary API endpoints. The Superset connector extracts datasets by contacting the /api/v1/dataset/ endpoint. You can manually check this endpoint to see if it returns the expected data.

    curl -u your-username:your-password https://your-superset-instance/api/v1/dataset/
    
  2. Check Permissions:
    Ensure that the user account used for the ingestion has the necessary permissions to access the datasets. There have been instances where permissions issues prevented the ingestion from retrieving the data. For example, the can read on Dataset permission might be required.

  3. Enable Debug Logging:
    Run the ingestion with the --debug flag to get detailed logs. This can help identify any issues during the ingestion process. You can use the following command:

    datahub --debug ingest run -c <recipe-path> &> /tmp/superset-ingest.log
    
  4. Inspect the Payload:
    Add logging to inspect the payload being sent to DataHub. This can help determine if the payload is empty or if there are any issues with the data being retrieved from Superset.

  5. Check Stateful Ingestion Configuration:
    Ensure that your stateful ingestion configuration is correctly set up. Verify that the stateful_ingestion settings are properly configured in your ingestion recipe.
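
To make the check in step 1 concrete: a successful HTTP status is not enough, because the endpoint can answer with an empty list when the caller cannot see any datasets. The sketch below reads the response body. It assumes the list response is a JSON object with a `count` field and a `result` array whose items carry `table_name`; verify this against your own Superset instance. The helper name `summarize_datasets` is hypothetical.

```python
import json
from typing import List, Tuple


def summarize_datasets(body: str) -> Tuple[int, List[str]]:
    """Parse a /api/v1/dataset/ response body into (count, table names)."""
    payload = json.loads(body)
    # "count" and "result" are assumed field names; missing keys fall back to empty.
    results = payload.get("result", [])
    count = payload.get("count", len(results))
    names = [item.get("table_name", "<unnamed>") for item in results]
    return count, names
```

Save the curl output to a file and pass its contents to `summarize_datasets`. A count of 0 for a user who should see datasets suggests a role or permission problem on the Superset side.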

Here is an example of an ingestion recipe for Superset with stateful ingestion enabled:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    username: 'your-username'
    password: 'your-password'
    provider: db
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-token'

Debugging Steps

  1. Verify API Response:
    Manually check the response from the /api/v1/dataset/ endpoint to ensure it returns the expected data.

    curl -u your-username:your-password https://your-superset-instance/api/v1/dataset/
    
  2. Enable Detailed Logging:
    Enable detailed logging to capture more information about the ingestion process. Add a debug statement to log the payload being sent to DataHub: modify the superset.py file to include a debug statement at the point where the payload is created.

    import logging
    
    logger = logging.getLogger(__name__)
    
    # Add this line where `payload` is in scope, to log it
    logger.info(f"payload = {payload}")
    
  3. Run Ingestion Locally:
    If you are running the ingestion in a Kubernetes environment, try running it locally to isolate any potential issues with the Kubernetes setup. Set up a local DataHub instance using the quickstart guide and run the ingestion using the CLI.

    datahub docker quickstart
    datahub ingest -c your-ingestion-recipe.yml
    
  4. Check for Known Issues:
    Review the DataHub GitHub issues (https://github.com/datahub-project/datahub/issues) and the DataHub Slack community for any known issues or similar problems reported by other users.
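
The "Run Ingestion Locally" step uses the CLI; the same recipe can also be run in-process from Python, which makes it easier to attach a debugger to the Superset source. This is a minimal sketch, assuming the acryl-datahub package with the superset plugin is installed. The helper names `build_recipe` and `run_recipe` are hypothetical, and the recipe fields are the ones shown earlier in this thread.

```python
from typing import Any, Dict


def build_recipe(connect_uri: str, username: str, password: str,
                 gms_server: str = "http://localhost:8080") -> Dict[str, Any]:
    """Build the recipe from this thread as a Python dict."""
    return {
        "source": {
            "type": "superset",
            "config": {
                "connect_uri": connect_uri,
                "username": username,
                "password": password,
                "provider": "db",
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe in-process; needs DataHub installed locally."""
    # Imported lazily so build_recipe works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    # Raises if the source or sink reported failures.
    pipeline.raise_from_status()
```

For example, `run_recipe(build_recipe("https://your-superset-instance/", "your-username", "your-password"))` should behave like the CLI command above, against the quickstart instance on localhost.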

Would you like more detailed steps or further assistance on any specific part of the process?

curl -u your-username:your-password https://your-superset-instance/api/v1/dataset/
That works, and the user has admin permission.

Yes, this worked for me as well.

As a note, if your Superset instance has OAuth enabled, then you can only use an admin user. You may have to reset the admin user’s password first, but the admin user can always be used to login to the API.

From here: https://github.com/apache/superset/discussions/26107#discussioncomment-10268086

it “just worked” for me after I reset the admin password by executing this command inside the superset_app container:
superset fab reset-password --username admin --password my_complex_password
I sent a POST request to /api/v1/security/login with the following payload:

    {
        "username" = "admin";
        "password" = "my_complex_password";
        "refresh" = "true";
        "provider" = "db"
    }
Interestingly, it didn't work - response was {"message": "Not authorized"} - before I reset the admin password.
Superset v3.1.3 run via docker-compose.
Running OAuth with 2 different OAuth providers configured, so username/password auth is not available on the login page. I didn't realize the admin username/password would always work on the /api/v1/security/login route.
As a note, this also worked when pointing at Superset from a DataHub instance.

Hi <@U07FU9WQMAQ>, I tried your solution: I reset the admin password and sent the POST request. After that I tried this command:

curl -u your-username:your-password https://your-superset-instance/api/v1/security/login
but this was the output:

{
  "errors": [
    {
      "message": "405 Method Not Allowed: The method is not allowed for the requested URL.",
      "error_type": "GENERIC_BACKEND_ERROR",
      "level": "error",
      "extra": {
        "issue_codes": [
          {
            "code": 1011,
            "message": "Issue 1011 - Superset encountered an unexpected error."
          }
        ]
      }
    }
  ]
}

However, when I ran the ingestion with the Superset config, it succeeded, but no dashboards or charts were ingested.

Am I missing something? [attachment: Screenshot 2024-08-08 at 14.22.14.png]

<@U06TM7M40RK> can you provide further assistance here?

It seems like there are a few issues at play here, including the method not being allowed for the requested URL and the ingestion not capturing dashboards and charts despite the successful configuration.

Steps to Troubleshoot and Resolve the Issues

  1. Verify API Endpoints and Methods:
    Ensure that the API endpoints and methods you are using are correct. The /api/v1/security/login endpoint should be accessed using a POST request, not a GET request.

    curl -X POST -H "Content-Type: application/json" -d '{
      "username": "admin",
      "password": "my_complex_password",
      "provider": "db",
      "refresh": true
    }' https://your-superset-instance/api/v1/security/login
    
  2. Check Superset Version Compatibility:
    Ensure that the version of Superset you are using is compatible with the DataHub version. There might be changes in the API endpoints or authentication mechanisms in different versions of Superset.

  3. Reset Admin Password:
    If you haven’t already, reset the admin password inside the Superset container to ensure that the credentials are correct.

    superset fab reset-password --username admin --password my_complex_password
    
  4. Enable Debug Logging:
    Run the ingestion with the --debug flag to get detailed logs. This can help identify any issues during the ingestion process.

    datahub --debug ingest run -c <recipe-path> &> /tmp/superset-ingest.log
    
  5. Inspect the Payload:
    Add logging to inspect the payload being sent to DataHub. This can help determine if the payload is empty or if there are any issues with the data being retrieved from Superset.

  6. Check Permissions:
    Ensure that the user account used for the ingestion has the necessary permissions to access the dashboards and charts. There have been instances where permissions issues prevented the ingestion from retrieving the data.
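
The 405 reported above is most likely what a GET to the login route produces: curl -u without a request body sends a GET. The following sketch builds the POST from step 1 with Python's standard library instead. The helper name `build_login_request` is hypothetical; the payload fields are the ones from the curl example.

```python
import json
import urllib.request


def build_login_request(base_url: str, username: str, password: str,
                        provider: str = "db") -> urllib.request.Request:
    """Build (but do not send) a POST to Superset's /api/v1/security/login."""
    body = json.dumps({
        "username": username,
        "password": password,
        "provider": provider,
        "refresh": True,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/security/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. Building the request separately keeps the method and body easy to inspect before anything goes over the network.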

Example Configuration

Ensure your configuration looks similar to the following:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    username: 'admin'
    password: 'my_complex_password'
    provider: db
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-token'

Debugging Steps

  1. Verify API Response:
    Manually check the response from the /api/v1/dataset/, /api/v1/chart/, and /api/v1/dashboard/ endpoints to ensure they return the expected data.

    curl -u admin:my_complex_password https://your-superset-instance/api/v1/dataset/
    curl -u admin:my_complex_password https://your-superset-instance/api/v1/chart/
    curl -u admin:my_complex_password https://your-superset-instance/api/v1/dashboard/
    

<@U06TM7M40RK> It seems the output is an access token. Does that confirm the POST command succeeded? Should this access token be used in the sink config?

I checked, and it seems the admin user's role is Public. Should I change it to Admin to make the permissions available?

Yes, if the output of your POST command to the /api/v1/security/login endpoint is an access token, it indicates a successful authentication. This access token can be used for subsequent API requests to Superset. It is a Superset credential, so it does not go in the sink config; the token under sink is a DataHub access token.

Using the Access Token in Configuration

You can try using the access token obtained from the POST request in the source section of your DataHub ingestion recipe. Before relying on this, confirm in the Superset source documentation for your DataHub version that a token option is supported; the recipes earlier in this thread authenticate with username, password, and provider instead.

  1. Update the Superset Source Configuration:
    Instead of using the username and password directly, use the access token for authentication.

  2. Modify the Ingestion Recipe:
    Update your ingestion recipe so that the access token is sent in the Authorization header.

Here is an example of how your ingestion recipe might look:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    token: 'your-access-token'
    provider: db
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-datahub-token'

Example of Obtaining the Access Token

Here is an example of how to obtain the access token using a POST request:

curl -X POST -H "Content-Type: application/json" -d '{
  "username": "admin",
  "password": "my_complex_password",
  "provider": "db",
  "refresh": true
}' https://your-superset-instance/api/v1/security/login

The response should include an access token, which you can then use in your ingestion configuration.
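
For manual checks against the Superset API, the access token is sent as a Bearer token in the Authorization header. Here is a minimal sketch, assuming the login response is a JSON object with an `access_token` field. The helper names `bearer_headers` and `build_list_request` are hypothetical.

```python
import json
import urllib.request
from typing import Dict


def bearer_headers(login_response_body: str) -> Dict[str, str]:
    """Turn the login response JSON into an Authorization header."""
    payload = json.loads(login_response_body)
    token = payload.get("access_token")
    if not token:
        # A missing token usually means the login itself was rejected.
        raise ValueError(f"no access_token in response: {payload}")
    return {"Authorization": f"Bearer {token}"}


def build_list_request(base_url: str, resource: str,
                       headers: Dict[str, str]) -> urllib.request.Request:
    """Build a GET for /api/v1/<resource>/, e.g. dataset, chart or dashboard."""
    url = f"{base_url.rstrip('/')}/api/v1/{resource}/"
    return urllib.request.Request(url=url, headers=headers, method="GET")
```

Sending `build_list_request(base_url, "dataset", headers)` with `urllib.request.urlopen` shows what the authenticated user can actually list, which is the quickest way to tell a permissions problem from a connector problem.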

Checking and Updating Permissions

If the admin user is assigned only the Public role, it might not have the necessary permissions to access all datasets, charts, and dashboards. You should ensure that the user has a role, such as Admin, with the appropriate permissions. Here’s how you can check and update the permissions:

  1. Check Current Permissions:
    Verify the roles assigned to the admin user, and the permissions of those roles, in Superset.

  2. Update Permissions:
    If necessary, assign the Admin role or update the role's permissions so that the user has access to all required resources.

Example of Checking and Updating Permissions

You can check and update permissions in Superset through the UI, under the Security menu (List Users and List Roles). A CLI equivalent might look like the following, but confirm with superset fab --help that your Superset version actually provides a grant-role subcommand before relying on it:

superset fab grant-role --role admin --perm can_read --view all_datasources

This command is intended to grant the can_read permission on all datasources to the admin role.
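
Whichever way the roles are inspected, the comparison itself is simple. This is a minimal sketch, assuming you have collected each role's permissions as (permission, view) pairs, for example by reading them off the role's page in the Superset UI. The `REQUIRED` set and the `missing_permissions` helper are hypothetical, and the exact permissions the connector needs are an assumption based on the "can read on Dataset" permission mentioned earlier in the thread.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

# (permission name, view name), e.g. ("can_read", "Dataset")
Permission = Tuple[str, str]

# Assumed minimum for this thread's goal; adjust for your Superset version.
REQUIRED: Set[Permission] = {
    ("can_read", "Dataset"),
    ("can_read", "Chart"),
    ("can_read", "Dashboard"),
}


def missing_permissions(roles: Dict[str, Iterable[Sequence[str]]],
                        required: Set[Permission] = REQUIRED) -> List[Permission]:
    """Return the required permissions that none of the user's roles grants."""
    granted = {tuple(perm) for perms in roles.values() for perm in perms}
    return sorted(required - granted)
```

A user holding only the Public role with no grants would get all three entries back, which would match the symptom of a login that succeeds while nothing is ingested.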

Summary

  1. Use the Access Token: Include the access token obtained from the POST request in your DataHub ingestion configuration.
  2. Check Permissions: Ensure that the admin role has the necessary permissions to access all datasets, charts, and dashboards.
  3. Update Configuration: Modify your ingestion recipe to use the access token for authentication.

Would you like more detailed steps or further assistance on any specific part of the process?
