Troubleshooting Authentication and Configuration Issues in DataHub Deployment

Original Slack Thread

<@U06TM7M40RK>
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/dispatcher/default_dispatcher.py", line 34, in dispatch_async
res = executor.execute(request)
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/execution/reporting_executor.py", line 89, in execute
self._datahub_graph.emit_mcp(kickoff_mcp, async_flag = False)
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/emitter/rest_emitter.py", line 266, in emit_mcp
self._emit_generic(url, payload)
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/aspects?action=ingestProposal'})
2024-11-14T11:45:45.621213009Z

what is this error?

The error you are encountering, 401 Client Error: Unauthorized, indicates that your request to the DataHub GMS (Generalized Metadata Service) was rejected because it was not authenticated. This typically happens when the required authentication token is missing or invalid.

To resolve this issue, you need to ensure that you are using a valid Personal Access Token (PAT) in your request headers. Here are the steps to generate and use a PAT:

  1. Generate a Personal Access Token:

    • Navigate to 'Settings' > 'Access Tokens' in the DataHub UI.
    • Click on 'Generate new token' and fill out the required information.
    • Save the generated token securely.
  2. Use the Access Token in Your Requests:

    • Include the generated token in the Authorization header of your HTTP requests as follows:
      Authorization: Bearer <your-personal-access-token>
      
  3. Update Your Configuration:

    • Ensure that your configuration file includes the token. Here is an example configuration for the datahub-rest sink:
source:
  # source configs
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
    token: "<your-personal-access-token>"

According to the <https://datahubproject.io/docs/0.13.1/authentication/personal-access-tokens|DataHub documentation on Personal Access Tokens>, you can find more details on generating and using these tokens.
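For illustration, here is a minimal sketch of attaching a PAT when calling GMS from Python, using only the standard library. The server URL and token are placeholders; the GMS `/config` endpoint is used here simply as an authenticated GET to verify the token works:

```python
import json
import urllib.request

def auth_headers(token: str) -> dict:
    """Build the headers DataHub's REST API expects for token auth."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }

def get_gms_config(gms_url: str, token: str) -> dict:
    """Make an authenticated GET against the GMS /config endpoint."""
    req = urllib.request.Request(f"{gms_url}/config", headers=auth_headers(token))
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

For example, `get_gms_config("http://datahub-datahub-gms:8080", "<your-personal-access-token>")` should return JSON rather than raising a 401 once the token is valid.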

Would you like more detailed steps or assistance with a specific part of this process?

<@U06TM7M40RK> where is my Configuration file saved?

To determine where your configuration file is saved, it depends on how you have set up your DataHub environment. Here are some common locations and methods to find your configuration file:

  1. Default Configuration Path:

    • For the DataHub CLI, connection settings (server URL and token) are stored by default in ~/.datahubenv in your home directory, created when you run datahub init.
  2. Environment Variables:

    • If you have set environment variables to specify a different configuration path, you should check those variables. The relevant environment variables include DATAHUB_CONFIG_PATH and CONDENSED_DATAHUB_CONFIG_PATH.
  3. Docker or Kubernetes Deployments:

    • If you are using Docker or Kubernetes, the configuration might be mounted as a volume or specified in a values.yaml file for Helm deployments. For Docker Compose, you might have a docker-compose.yml file that specifies the configuration file path.
  4. Custom Paths:

    • If you have customized the location of your configuration file, you would need to refer to your deployment scripts or documentation to find the exact path.

If you need more detailed steps on how to locate or configure your setup, please let me know!
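The lookup order described above can be sketched in a few lines. This is a hedged illustration: the environment-variable name and the ~/.datahubenv default mirror this answer, so verify them against your DataHub version:

```python
import os

def resolve_datahub_config_path() -> str:
    """Resolve the DataHub config file location: an explicit
    DATAHUB_CONFIG_PATH environment variable wins; otherwise fall
    back to ~/.datahubenv in the user's home directory."""
    explicit = os.environ.get("DATAHUB_CONFIG_PATH")
    if explicit:
        return explicit
    return os.path.join(os.path.expanduser("~"), ".datahubenv")
```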

<@U06TM7M40RK> I am using Helm in k8s, so where should I configure it? Under what component?

When using Helm to deploy DataHub on Kubernetes, you typically configure your settings in the values.yaml file. This file is used to customize the deployment of various components of DataHub. Here’s how you can configure it:

  1. Locate the values.yaml File:

    • This file is part of the Helm chart you are using to deploy DataHub. It contains configuration settings for all components of the DataHub deployment.
  2. Configure Metadata Service Authentication:

    • Authentication for the DataHub GMS (Generalized Metadata Service) is toggled in values.yaml. In the acryldata/datahub-helm chart this lives under the global.datahub section (key names may vary by chart version, so check the values.yaml shipped with your chart).
  3. Example Configuration:

    • Here is an example of how you might enable Metadata Service Authentication. Note that personal access tokens are not stored in values.yaml; once authentication is enabled, you generate them in the UI under 'Settings' > 'Access Tokens':
global:
  datahub:
    metadata_service_authentication:
      enabled: true
  4. Apply the Configuration:
    • After updating the values.yaml file, you need to apply the changes by upgrading your Helm release:
helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
  5. Verify the Deployment:
    • Check the status of your Kubernetes pods to ensure that the changes have been applied successfully:
kubectl get pods

This setup ensures that your DataHub deployment is configured with the necessary settings for authentication and other components. If you need more detailed steps or have specific configurations in mind, feel free to ask!

<@U06TM7M40RK> What if I don't want to use tokens? I want all access to be open.

If you want to configure DataHub to allow open access without requiring authentication tokens, you can disable authentication in your deployment. Here’s how you can achieve this when using Helm in a Kubernetes environment:

  1. Disable Authentication in values.yaml:

    • Modify the values.yaml file to disable Metadata Service Authentication for the DataHub GMS (Generalized Metadata Service) by setting the relevant enabled flag to false.
  2. Example Configuration:

    • Here is an example of how you might configure the values.yaml file to disable authentication (in the acryldata/datahub-helm chart the flag lives under global.datahub; verify the key names against your chart version):
global:
  datahub:
    metadata_service_authentication:
      enabled: false
  3. Apply the Configuration:
    • After updating the values.yaml file, apply the changes by upgrading your Helm release:
helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
  4. Verify the Deployment:
    • Check the status of your Kubernetes pods to ensure that the changes have been applied successfully:
kubectl get pods

By disabling authentication, you allow all clients to access DataHub without providing tokens. This setup is appropriate only for environments where security is not a concern, such as development or testing; it is not recommended for production deployments.

If you need further assistance or have specific configurations in mind, feel free to ask!

<@U06TM7M40RK> What is the default?

By default, DataHub's Metadata Service Authentication is disabled. Out of the box, DataHub does not require authentication tokens for programmatic access to its services; this is intended to simplify initial setup and development.

According to the <Metadata Service Authentication | DataHub documentation>, authentication is disabled by default. It is only enabled if you explicitly set the METADATA_SERVICE_AUTH_ENABLED environment variable to true in the datahub-gms container or set authentication.enabled to true in the DataHub Metadata Service configuration (application.yaml).

Would you like more detailed information on how to configure or change these settings?

<@U06TM7M40RK> OK now a different issue:

  1. I have DataHub on k8s via Helm.
  2. I added a Looker source.
  3. From the start it is stuck on pending.

When a DataHub ingestion source remains in a “pending” state, it typically indicates that there might be an issue with the configuration or connectivity. Here are some steps you can take to troubleshoot and resolve this issue:

  1. Check Logs:

    • Start by checking the logs of the DataHub ingestion pod to see if there are any error messages or warnings that can provide more context about the issue. You can do this by running:
      kubectl logs <ingestion-pod-name>
      
    • Look for any errors related to network connectivity, authentication, or configuration.
  2. Verify Looker Configuration:

    • Double-check your Looker source recipe: the base_url should point at your Looker API endpoint, and the client_id and client_secret must belong to Looker API credentials with sufficient permissions.
  3. Network Connectivity:

    • Ensure that your Kubernetes cluster can reach the Looker instance. This might involve checking network policies, firewall rules, or any proxies that might be in place.
  4. Check Kubernetes Resources:

    • Verify that all necessary Kubernetes resources (pods, services, etc.) are running correctly. You can do this by running:
      kubectl get pods
      kubectl get services
      
    • Ensure that there are no pending or failed pods that might indicate resource constraints or misconfigurations.
  5. Resource Allocation:

    • Check if the ingestion pod has sufficient resources (CPU, memory) allocated. Sometimes, insufficient resources can cause the pod to remain in a pending state.
  6. Helm Chart Version:

    • Ensure that you are using a compatible version of the DataHub Helm chart with your Looker setup. Sometimes, upgrading to a newer version can resolve compatibility issues.

If you need more detailed steps or have specific configurations in mind, feel free to ask!
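As an illustrative aid for steps 1 and 4 above, here is a hedged sketch that post-processes `kubectl get pods -o json` output and lists pods stuck in the Pending phase (the field paths follow the standard Kubernetes Pod schema):

```python
import json

def pending_pods(kubectl_json: str) -> list:
    """Return the names of pods whose status.phase is Pending,
    given the output of `kubectl get pods -o json`."""
    items = json.loads(kubectl_json).get("items", [])
    return [
        pod["metadata"]["name"]
        for pod in items
        if pod.get("status", {}).get("phase") == "Pending"
    ]
```

You could feed it with `kubectl get pods -n <your-namespace> -o json > pods.json` and then call `pending_pods(open("pods.json").read())`.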

Sources:

<@U06TM7M40RK> How do I know which pod it is? I don't see one.

If you are unable to see the specific pod related to your DataHub ingestion job, it might be due to several reasons such as namespace issues, pod naming conventions, or the pod not being created due to configuration errors. Here are some steps to help you identify the pod:

  1. Check the Namespace:

    • Ensure that you are looking in the correct Kubernetes namespace. By default, the kubectl command operates in the default namespace unless specified otherwise. If your DataHub deployment is in a different namespace, you need to specify it using the -n flag.
    • List all namespaces to find the correct one:
      kubectl get namespaces
      
    • Once you know the correct namespace, list the pods in that namespace:
      kubectl get pods -n <your-namespace>
      
  2. Filter Pods by Name:

    • If you know part of the pod name, you can filter the pods using grep:
      kubectl get pods -n <your-namespace> | grep datahub
      
  3. Check for Pending Pods:

    • Sometimes, pods might be in a pending state due to resource constraints or scheduling issues. You can list all pods and check their status:
      kubectl get pods -n <your-namespace> -o wide
      
  4. Describe the Deployment:

    • If you still cannot find the pod, you can describe the deployment to see if there are any issues preventing the pod from being created:
      kubectl describe deployment <datahub-deployment-name> -n <your-namespace>
      
  5. Check Helm Release:

    • Ensure that your Helm release is installed correctly. You can list all Helm releases and check their status:
      helm list -n <your-namespace>
      
  6. Check Events:

    • Sometimes, Kubernetes events can provide insights into why a pod is not being created. You can check events in the namespace:
      kubectl get events -n <your-namespace>
      

If you need further assistance or have specific configurations in mind, feel free to ask!
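The grep filter in step 2 can also be mirrored in plain Python if you prefer to post-process `kubectl get pods` output in a script (a minimal sketch, matching on the first column only):

```python
def filter_pods(kubectl_output: str, needle: str) -> list:
    """Return the lines of plain-text `kubectl get pods` output whose
    pod name (first column) contains the given substring."""
    matches = []
    for line in kubectl_output.splitlines():
        columns = line.split()
        if columns and needle in columns[0]:
            matches.append(line)
    return matches
```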

<@U06TM7M40RK>
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/dispatcher/default_dispatcher.py", line 34, in dispatch_async
res = executor.execute(request)
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/acryl/executor/execution/reporting_executor.py", line 89, in execute
self._datahub_graph.emit_mcp(kickoff_mcp, async_flag = False)
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/emitter/rest_emitter.py", line 266, in emit_mcp
self._emit_generic(url, payload)
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'message': '401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/aspects?action=ingestProposal'})
2024-11-14T11:45:45.621213009Z

what is this error?

I checked and I don't have authentication enabled, so what else can it be?

If you are encountering a 401 Client Error: Unauthorized error even though you have confirmed that authentication is disabled, there might be other factors at play. Here are some potential causes and steps to troubleshoot the issue:

  1. Configuration Mismatch:

    • Double-check your values.yaml or application.yml configuration files to ensure that the authentication.enabled setting is indeed set to false for both the DataHub GMS and any other relevant components.
  2. Environment Variables:

    • Verify that there are no environment variables overriding your configuration settings. Environment variables can sometimes take precedence over configuration files.
  3. Network Issues:

    • Ensure that there are no network issues preventing the ingestion service from reaching the DataHub GMS. This includes checking for any firewalls, proxies, or network policies that might be blocking access.
  4. Service Availability:

    • Confirm that the DataHub GMS service is running and accessible. You can do this by checking the status of the GMS pod and ensuring it is in a Running state:
      kubectl get pods -n <your-namespace>
      
  5. Logs and Events:

    • Check the logs of the DataHub GMS pod for any error messages or warnings that might provide more context about the issue:
      kubectl logs <gms-pod-name> -n <your-namespace>
      
    • Additionally, check Kubernetes events for any issues related to pod scheduling or resource allocation:
      kubectl get events -n <your-namespace>
      
  6. Ingress and Service Configuration:

    • Ensure that the ingress and service configurations are correctly set up to route traffic to the DataHub GMS. Misconfigurations here can lead to connectivity issues.
  7. Helm Chart Version:

    • Make sure you are using a compatible version of the DataHub Helm chart. Sometimes, upgrading to a newer version can resolve underlying issues.

If you need further assistance or have specific configurations in mind, feel free to ask!
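One concrete way to work through items 1-4 above is to hit a GMS endpoint directly, with and without a token, and compare status codes: if the unauthenticated probe returns 401 while the token-bearing one returns 200, authentication is still enabled somewhere. A hedged stdlib sketch (URL and token are placeholders):

```python
import urllib.error
import urllib.request
from typing import Optional

def build_probe(url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, optionally carrying a Bearer token."""
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req

def http_status(req: urllib.request.Request) -> int:
    """Return the HTTP status code without raising on 4xx/5xx."""
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

# Example (placeholders):
# gms = "http://datahub-datahub-gms:8080"
# print(http_status(build_probe(f"{gms}/config")))                   # no token
# print(http_status(build_probe(f"{gms}/config", "<your-token>")))   # with token
```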

Sources:

  • <Slack Discussion on 401 Unauthorized Error>

<@U06TM7M40RK>
here is a more detailed log:
2024/11/12 14:50:27 Ready: http://datahub-datahub-gms:8080/health.
No user action configurations found. Not starting user actions.
ls: cannot access '/etc/datahub/actions/conf/': No such file or directory
[2024-11-12 14:50:28,486] INFO {datahub_actions.cli.actions:80} - DataHub Actions version: unavailable (installed editable via git)
[2024-11-12 14:50:28,536] INFO {datahub_actions.plugin.action.propagation.docs.propagation_action:192} - Doc Propagation Config action configured with enabled=True columns_enabled=True datasets_enabled=False
[2024-11-12 14:50:28,541] WARNING {datahub_actions.plugin.action.propagation.docs.propagation_action:434} - Error fetching global settings for doc propagation. Will try again in 1 minute.
Traceback (most recent call last):
File "/actions-src/src/datahub_actions/plugin/action/propagation/docs/propagation_action.py", line 401, in refresh_config
entity_dict = self.ctx.graph.graph.get_entity_raw(
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/graph/client.py", line 481, in get_entity_raw
response.raise_for_status()
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/entitiesV2/urn%3Ali%3AglobalSettings%3A0?aspects=List(globalSettingsInfo)
[2024-11-12 14:50:28,944] WARNING {datahub_actions.cli.actions:107} - Skipping pipeline datahub_slack_action as it is not enabled
[2024-11-12 14:50:28,947] WARNING {datahub_actions.cli.actions:107} - Skipping pipeline datahub_teams_action as it is not enabled
[2024-11-12 14:50:28,947] INFO {datahub_actions.cli.actions:123} - Action Pipeline with name 'datahub_doc_propagation_action' is now running.
[2024-11-12 14:50:28,947] INFO {datahub_actions.cli.actions:123} - Action Pipeline with name 'ingestion_executor' is now running.
[2024-11-12 14:56:20,589] WARNING {datahub_actions.plugin.action.propagation.docs.propagation_action:434} - Error fetching global settings for doc propagation. Will try again in 1 minute.
Traceback (most recent call last):
File "/actions-src/src/datahub_actions/plugin/action/propagation/docs/propagation_action.py", line 401, in refresh_config
entity_dict = self.ctx.graph.graph.get_entity_raw(
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/graph/client.py", line 481, in get_entity_raw
response.raise_for_status()
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/entitiesV2/urn%3Ali%3AglobalSettings%3A0?aspects=List(globalSettingsInfo)
[2024-11-12 15:06:25,117] WARNING {datahub_actions.plugin.action.propagation.docs.propagation_action:434} - Error fetching global settings for doc propagation. Will try again in 1 minute.
Traceback (most recent call last):
File "/actions-src/src/datahub_actions/plugin/action/propagation/docs/propagation_action.py", line 401, in refresh_config
entity_dict = self.ctx.graph.graph.get_entity_raw(
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/ingestion/graph/client.py", line 481, in get_entity_raw
response.raise_for_status()
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/entitiesV2/urn%3Ali%3AglobalSettings%3A0?aspects=List(globalSettingsInfo)
[2024-11-12 15:06:27,778] DEBUG {acryl.executor.dispatcher.default_dispatcher:75} - Started thread <Thread(Thread-3 (dispatch_async), started 139690220959296)> for 188df420-78b6-48e8-824a-078c8ea6d1f9
[2024-11-12 15:06:27,781] ERROR {acryl.executor.dispatcher.default_dispatcher:37} - Failed dispatch for 188df420-78b6-48e8-824a-078c8ea6d1f9: Traceback (most recent call last):
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub/emitter/rest_emitter.py", line 306, in _emit_generic
response.raise_for_status()
File "/datahub-ingestion/.venv/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/aspects?action=ingestProposal

Please let me know how to solve it.

The detailed logs you provided indicate that the 401 Client Error: Unauthorized occurs when the actions and ingestion containers call certain endpoints on the DataHub GMS (Generalized Metadata Service). Here are some steps to troubleshoot and resolve this issue:

  1. Verify Authentication Settings:

    • Double-check that authentication is indeed disabled in your configuration. Ensure that the METADATA_SERVICE_AUTH_ENABLED environment variable is not set to true in your Helm chart or Kubernetes deployment configuration.
  2. Check for Environment Variable Overrides:

    • Ensure that there are no environment variables set in your Kubernetes deployment that might override the configuration files. You can check this by inspecting the environment variables of the running pods.
  3. Inspect Helm Values:

    • Review your values.yaml file to ensure that authentication is not enabled. Look for any settings related to authentication and ensure they are set to false.
  4. Network and Connectivity:

    • Verify that there are no network issues preventing the ingestion service from reaching the DataHub GMS. This includes checking for any firewalls, proxies, or network policies that might be blocking access.
  5. Service and Ingress Configuration:

    • Ensure that the service and ingress configurations are correctly set up to route traffic to the DataHub GMS. Misconfigurations here can lead to connectivity issues.
  6. Check Logs for Additional Context:

    • Continue to monitor the logs for any additional error messages or warnings that might provide more context about the issue. This can help identify specific problems with the service startup or network configuration.
  7. Restart Services:

    • Sometimes, simply restarting the services can resolve transient issues. You can restart the DataHub GMS and related services to see if the issue persists.
  8. Review Documentation:

    • Consult the DataHub Metadata Service Authentication documentation to confirm how each component (GMS, frontend, actions) resolves its authentication settings.
If you need further assistance or have specific configurations in mind, feel free to ask!

<@U06TM7M40RK> In which component should METADATA_SERVICE_AUTH_ENABLED be defined?

The METADATA_SERVICE_AUTH_ENABLED environment variable should be defined on the DataHub GMS (Generalized Metadata Service) component, the backend service responsible for storing and serving metadata. Per the DataHub documentation, it should also be set on the datahub-frontend component so that the UI forwards credentials correctly. This setting controls whether authentication is enabled for the metadata service.

Here’s how you can configure it:

  1. DataHub GMS Configuration:

    • Ensure that the METADATA_SERVICE_AUTH_ENABLED environment variable is set in the deployment configuration for the DataHub GMS. This can be done in the values.yaml file if you are using Helm for deployment.
  2. Example Configuration in values.yaml:

    • In the acryldata/datahub-helm chart, extra environment variables for the GMS are typically supplied via extraEnvs under the datahub-gms key. Recent chart versions also expose a dedicated toggle, global.datahub.metadata_service_authentication.enabled, which wires this up for you; verify the key names against your chart version:
datahub-gms:
  extraEnvs:
    - name: METADATA_SERVICE_AUTH_ENABLED
      value: "true"  # Set to "false" if you want to disable authentication
  3. Apply the Configuration:
    • After updating the values.yaml file, apply the changes by upgrading your Helm release:
helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
  4. Verify the Deployment:
    • Check the status of your Kubernetes pods to ensure that the changes have been applied successfully:
kubectl get pods

By setting the METADATA_SERVICE_AUTH_ENABLED environment variable in the GMS component, you control whether authentication is required for accessing the metadata service.
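For illustration only (this is not DataHub's actual parsing code), boolean environment toggles like this one are conventionally interpreted as follows:

```python
import os

def env_flag(name: str, default: str = "false") -> bool:
    """Interpret an environment variable as a boolean toggle:
    'true' (case-insensitive) means enabled, anything else disabled."""
    return os.environ.get(name, default).strip().lower() == "true"

# e.g. env_flag("METADATA_SERVICE_AUTH_ENABLED") is False unless the
# variable is explicitly set to "true" in the container's environment
```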

If you need further assistance or have specific configurations in mind, feel free to ask!
