Setting Up a DataHub Ingestion Recipe for Superset with Single Sign-On (SSO)

Original Slack Thread

Hi,

If my Superset instance uses Single Sign-On for login, how do I set up the ingestion recipe in DataHub?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses! We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to give its responses a :thumbsup: or :thumbsdown: so other folks in the Community know whether you’re still looking for help on the topic and to help us improve the assistant!

To help you more effectively, could you please provide the following information?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub (e.g., Superset)?

Once you provide this information, I can offer more targeted assistance. In the meantime, here are some general insights on ingesting metadata from Superset when it is configured to use Single Sign-On (SSO).

General Insights

  1. OAuth Authentication:
    DataHub’s current Superset ingestion connector primarily supports basic authentication methods like username and password. OAuth support is not explicitly mentioned in the available documentation or codebase.

  2. Workarounds:

    • Custom Scripts: Some users have created custom scripts to handle OAuth authentication and then use the DataHub ingestion APIs to push metadata.
    • Proxy Authentication: Another approach is to use a proxy service that handles OAuth authentication and forwards requests to Superset with the necessary tokens.
  3. Community Discussions:

    • According to a Slack discussion, users have faced similar issues and have had to verify that their Superset instance exposes the necessary API endpoints and that the user has the required permissions.
    • A <https://github.com/datahub-project/datahub/issues|GitHub issue> discusses problems with LDAP and OAuth configurations, indicating that these are known challenges.
  4. Debugging:

    • Enable detailed logging by running the ingestion with the --debug flag to capture more information about the process.
    • Manually check the API responses from Superset to ensure they return the expected data.
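To make the custom-script workaround concrete, the sketch below reproduces the username/password login that the connector effectively performs against Superset's public /api/v1/security/login endpoint, using only the Python standard library (the base_url and credentials are placeholders). With OAuth-based SSO, this database-provider login is exactly what stops working, which is why a token from the IdP is needed instead.

```python
import json
import urllib.request


def build_login_payload(username: str, password: str, provider: str = "db") -> bytes:
    """JSON body expected by Superset's /api/v1/security/login endpoint."""
    body = {
        "username": username,
        "password": password,
        "provider": provider,  # "db" = Superset's own database auth
        "refresh": True,       # ask for a refresh token as well
    }
    return json.dumps(body).encode("utf-8")


def login(base_url: str, username: str, password: str) -> str:
    """Exchange credentials for an access token (performs a network call)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/security/login",
        data=build_login_payload(username, password),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]
```

The returned access_token is then sent as an Authorization: Bearer header on subsequent API calls such as /api/v1/dataset/.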

Example Ingestion Recipe

Here is an example of a basic ingestion recipe for Superset:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    username: 'your-username'
    password: 'your-password'
    provider: db
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-personal-access-token'

Next Steps

  1. Provide the requested information so I can offer more specific guidance.
  2. Check the community resources like the DataHub Slack community and <https://github.com/datahub-project/datahub/issues|GitHub issues> for any updates or similar issues.

Would you like more detailed steps or further assistance on any specific part of the process?

<@U06TM7M40RK>

  1. I use the UI for setting up ingestion.
  2. I use the head (latest) version.
  3. Superset.

Thank you for providing the details. Given that you are using the UI for ingestion with the latest version of DataHub and integrating with Superset, let’s focus on the configuration and potential issues related to Single Sign-On (SSO).

Steps to Configure Superset Ingestion with SSO

  1. Verify API Endpoints:
    Ensure that your Superset instance is exposing the necessary API endpoints. The Superset connector extracts datasets by contacting the /api/v1/dataset/ endpoint. You can manually check this endpoint to see if it returns the expected data.

    curl -u your-username:your-password https://your-superset-instance/api/v1/dataset/
    
  2. Check Permissions:
    Ensure that the user account used for the ingestion has the necessary permissions to access the datasets. There have been instances where permission issues prevented the ingestion from retrieving the data. For example, the "can read on Dataset" permission might be required.

  3. Enable Debug Logging:
    Run the ingestion with the --debug flag to get detailed logs. This can help identify any issues during the ingestion process. You can use the following command:

    datahub --debug ingest run -c <recipe-path> &> /tmp/superset-ingest.log
    
  4. Inspect the Payload:
    Add logging to inspect the payload being sent to DataHub. This can help determine if the payload is empty or if there are any issues with the data being retrieved from Superset.

  5. Check Stateful Ingestion Configuration:
    Ensure that your stateful ingestion configuration is correctly set up. Verify that the stateful_ingestion settings are properly configured in your ingestion recipe.

Example Configuration

Ensure your configuration looks similar to the following:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    username: 'your-username'
    password: 'your-password'
    provider: db
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-token'

Debugging Steps

  1. Verify API Response:
    Manually check the response from the /api/v1/dataset/ endpoint to ensure it returns the expected data.

    curl -u your-username:your-password https://your-superset-instance/api/v1/dataset/
    
  2. Enable Detailed Logging:
    Enable detailed logging to capture more information about the ingestion process. Add a debug statement to log the payload being sent to DataHub. Modify the superset.py file to include a debug statement at the point where the payload is being created.

    import logging
    
    logger = logging.getLogger(__name__)
    
    # Add this line to log the payload at debug level
    logger.debug(f"payload = {payload}")
    
  3. Run Ingestion Locally:
    If you are running the ingestion in a Kubernetes environment, try running it locally to isolate any potential issues with the Kubernetes setup. Set up a local DataHub instance using the quickstart guide and run the ingestion using the CLI.

    datahub docker quickstart
    datahub ingest -c your-ingestion-recipe.yml
    
  4. Check for Known Issues:
    Review the <https://github.com/datahub-project/datahub/issues|DataHub GitHub issues> and the DataHub Slack community for any known issues or similar problems reported by other users.

Would you like more detailed steps or further assistance on any specific part of the process?

<@U06TM7M40RK>

Failed to get dashboard data: {"message":"Forbidden"}

That is the error message I get. When I tested with the curl command above, it returned:

{"msg":"Missing 'Bearer' type in 'Authorization' header. Expected 'Authorization: Bearer <JWT>'"}

The error message you’re encountering indicates that the Superset API expects a Bearer token for authentication, which is typical for setups using Single Sign-On (SSO). To resolve this, you’ll need to configure your DataHub ingestion recipe to use a Bearer token for authentication.

Steps to Configure Bearer Token Authentication

  1. Obtain a Bearer Token:

    • If your Superset instance uses an Identity Provider (IdP) like Okta, Google, or Azure, you will need to obtain a JWT (JSON Web Token) from the IdP. This usually involves logging in through the IdP and retrieving the token from the response.
  2. Update Ingestion Recipe:

    • Modify your ingestion recipe to include the Bearer token in the Authorization header.

Example Ingestion Recipe with Bearer Token

Here’s an example of how you might configure your ingestion recipe to use a Bearer token (note that support for extra_headers depends on your connector version):

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    provider: db
    extra_headers:
      Authorization: 'Bearer <your-jwt-token>'
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-datahub-token'

Testing with cURL

To verify that your Bearer token is working correctly, you can test the API endpoint using cURL:

curl --location --request GET 'https://your-superset-instance/api/v1/dataset/' \
--header 'Authorization: Bearer <your-jwt-token>'

Debugging Steps

  1. Verify Token Validity:

    • Ensure that the JWT token is valid and has not expired. JWT tokens typically have an expiration time (exp claim).
  2. Check API Permissions:

    • Ensure that the token has the necessary permissions to access the Superset API endpoints.
  3. Enable Detailed Logging:

    • Enable detailed logging in your ingestion setup to capture more information about the process.
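The token-validity check in step 1 can be done locally: a JWT's payload segment is base64url-encoded JSON, so you can read the exp claim without contacting the identity provider. A minimal standard-library sketch (it does not verify the signature):

```python
import base64
import json
import time


def jwt_expiry(token: str) -> int:
    """Return the exp claim from a JWT's payload, without signature verification."""
    payload_b64 = token.split(".")[1]
    # base64url encoding strips '=' padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(payload["exp"])


def is_expired(token: str) -> bool:
    """True if the token's exp claim is in the past."""
    return jwt_expiry(token) < time.time()
```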

If the above steps do not resolve the issue, please provide more details about your setup, such as the Identity Provider you are using for SSO and any specific configurations you have applied.

Would you like more detailed steps or further assistance on any specific part of the process?

<@U06TM7M40RK>

extra_headers
  extra fields not permitted (type=value_error.extra)

The error message indicates that the extra_headers field is not recognized in the Superset source configuration. This means the connector version you are running does not support passing custom headers directly in the recipe.

Workaround for Bearer Token Authentication

Since extra_headers is rejected, the Bearer token cannot be injected through the recipe alone. Note that environment variables will not help here either: REQUESTS_CA_BUNDLE, for example, only points the requests library at a CA certificate bundle for TLS verification and cannot add an Authorization header. Practical workarounds include:

  1. Custom Script:

    • Call the Superset REST API yourself with the Bearer token and push the resulting metadata to DataHub via its ingestion APIs.
  2. Database-Auth Service Account:

    • If your Superset deployment still permits database (AUTH_DB) logins alongside SSO, create a dedicated service account and use its username and password in the recipe.
  3. Connector Upgrade:

    • Check whether a newer DataHub release adds token-based authentication to the Superset source.

Example Ingestion Recipe

Here is your ingestion recipe without the unsupported extra_headers parameter:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    provider: db
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-datahub-token'
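Until the connector supports custom headers, one way to use the Bearer token is a small custom script that calls the Superset API directly (an approach mentioned earlier in the thread) and then pushes the results to DataHub through its ingestion APIs. A minimal standard-library sketch of the fetch side; the URL and token are placeholders:

```python
import json
import urllib.request


def build_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Authenticated GET request for the Superset REST API."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_datasets(base_url: str, token: str) -> dict:
    """Fetch the dataset listing (performs a network call to Superset)."""
    req = build_request(base_url, "/api/v1/dataset/", token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

From the parsed response you could then emit the metadata to DataHub, for example with the acryl-datahub Python emitter or by writing it to a file and ingesting that.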


If the above steps do not resolve the issue, please provide more details about your setup, such as the Identity Provider you are using for SSO and any specific configurations you have applied.

Would you like more detailed steps or further assistance on any specific part of the process?

<@U06TM7M40RK>

I see in the documentation that the Superset recipe has an options parameter, but I can’t find documentation for it. Can you help explain it?

Certainly! The options parameter in the Superset ingestion recipe allows you to specify additional configuration options for the Superset connector. These options can be used to customize the behavior of the ingestion process.

Superset Ingestion Configuration

According to the Superset source configuration details in the DataHub documentation, the options parameter is an object that can include additional settings to control the ingestion process.

Example Configuration with options

Here is an example of how you can use the options parameter in your Superset ingestion recipe:

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    username: 'your-username'
    password: 'your-password'
    provider: db
    options:
      # Add your custom options here
      key1: value1
      key2: value2
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-datahub-token'

Available Options

The documentation does not enumerate the specific keys accepted under options, and unsupported keys are rejected with an "extra fields not permitted" error, exactly as you saw with extra_headers. Treat any key as version-dependent and verify it against the configuration details for the connector version you are running.

Example with Custom Headers

If your connector version accepts header settings through options, a configuration might look like the following (the custom_headers key here is illustrative and not guaranteed to be supported; verify it against the documentation for your version):

source:
  type: superset
  config:
    connect_uri: 'https://your-superset-instance/'
    provider: db
    options:
      custom_headers:
        Authorization: 'Bearer <your-jwt-token>'
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: false
sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'
    token: 'your-datahub-token'


If the above steps do not resolve the issue, please provide more details about your setup, such as the Identity Provider you are using for SSO and any specific configurations you have applied.

Would you like more detailed steps or further assistance on any specific part of the process?
