Troubleshooting Authentication for Running Ingestion Using `datahub-ingestion-cron`

Original Slack Thread

Hello, I’m looking to run some integration tests with the Helm distribution of DataHub, and I was wondering how I can run an ingestion using “datahub-ingestion-cron” without having to visit the UI to generate a token.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


To help you effectively, I need a bit more information:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub (e.g., BigQuery)?

Based on your initial question, it seems you want to run ingestion using the datahub-ingestion-cron without having to visit the UI to generate a token. You can achieve this by using a Kubernetes ConfigMap to store your ingestion recipes and scheduling the ingestion via the Helm chart.

Here’s a general approach:

  1. Create a Kubernetes ConfigMap: Store your ingestion recipe in a ConfigMap.
  2. Configure values.yaml: Enable the datahub-ingestion-cron subchart and reference your ConfigMap and recipe file.
  3. Use Environment Variables for Tokens: Instead of hardcoding tokens in your ConfigMap, use environment variables to inject them securely.

Example Configuration

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: recipe-config
data:
  mysql_recipe.yml: |-
    source:
      type: mysql
      config:
        host_port: <MYSQL_HOST>:3306
        database: dbname
        username: root
        password: example
    sink:
      type: datahub-rest
      config:
        server: http://<GMS_HOST>:8080
        token: ${DATAHUB_TOKEN}

values.yaml:

datahub-ingestion-cron:
  enabled: true
  crons:
    mysql:
      schedule: "0 * * * *" # Every hour
      recipe:
        configmapName: recipe-config
        fileName: mysql_recipe.yml

Injecting Environment Variables

You can inject the DATAHUB_TOKEN environment variable into your Kubernetes pods using a Kubernetes Secret or directly in your deployment configuration.
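
For example, a minimal sketch of creating such a Secret with kubectl (the Secret name datahub-secrets and the key DATAHUB_TOKEN are placeholders here; align them with whatever your cron pod actually references):

kubectl create secret generic datahub-secrets \
  --from-literal=DATAHUB_TOKEN=<your-token>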

For more detailed steps, you can refer to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/schedule_docs/kubernetes.md|official documentation>.

Would you like more detailed steps or have any specific questions about this process?

  1. CLI
  2. 0.13.2
  3. s3

<@U06TM7M40RK>

Great, thanks for the details! Given that you’re using the CLI for ingestion, DataHub version 0.13.2, and integrating with S3, here’s how you can set up and run ingestion using datahub-ingestion-cron without visiting the UI to generate a token.

Steps to Schedule Ingestion Using datahub-ingestion-cron

  1. Create a Kubernetes ConfigMap: Store your ingestion recipe in a ConfigMap.
  2. Configure values.yaml: Enable the datahub-ingestion-cron subchart and reference your ConfigMap and recipe file.
  3. Use Environment Variables for Tokens: Instead of hardcoding tokens in your ConfigMap, use environment variables to inject them securely.

Example Configuration

ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: recipe-config
data:
  s3_recipe.yml: |-
    source:
      type: s3
      config:
        path_specs:
          - include: "s3://<YOUR_S3_BUCKET>/<PATH_PREFIX>/*.*"
        aws_config:
          aws_access_key_id: ${AWS_ACCESS_KEY_ID}
          aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
          aws_region: <AWS_REGION>
    sink:
      type: datahub-rest
      config:
        server: http://<GMS_HOST>:8080
        token: ${DATAHUB_TOKEN}

values.yaml:

datahub-ingestion-cron:
  enabled: true
  crons:
    s3:
      schedule: "0 * * * *" # Every hour
      recipe:
        configmapName: recipe-config
        fileName: s3_recipe.yml

Injecting Environment Variables

You can inject the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and DATAHUB_TOKEN environment variables into your Kubernetes pods using a Kubernetes Secret or directly in your deployment configuration.

Example Kubernetes Secret:

apiVersion: v1
kind: Secret
metadata:
  name: datahub-secrets
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <base64_encoded_access_key>
  AWS_SECRET_ACCESS_KEY: <base64_encoded_secret_key>
  DATAHUB_TOKEN: <base64_encoded_token>
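
The values under data must be base64-encoded. For example (a sketch; replace the placeholders with your real credentials before encoding):

echo -n '<access_key>' | base64      # value for AWS_ACCESS_KEY_ID
echo -n '<secret_key>' | base64      # value for AWS_SECRET_ACCESS_KEY
echo -n '<datahub_token>' | base64   # value for DATAHUB_TOKEN

Alternatively, creating the Secret with kubectl create secret generic and --from-literal flags performs the encoding for you.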

Referencing the Secret in Your Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datahub-ingestion
spec:
  template:
    spec:
      containers:
      - name: datahub-ingestion
        envFrom:
        - secretRef:
            name: datahub-secrets

Running the Ingestion

Once you have your ConfigMap and values.yaml configured, and your secrets set up, you can deploy the Helm chart to schedule the ingestion.

helm upgrade --install datahub datahub/datahub -f values.yaml

This setup ensures that your ingestion runs on the schedule defined in values.yaml, without needing to visit the UI to generate a token.
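
To confirm that the CronJob was created, and optionally to trigger a one-off run without waiting for the schedule, you can use kubectl (the CronJob name depends on your release name and cron key, so list it first):

kubectl get cronjobs
kubectl create job --from=cronjob/<cronjob-name> s3-ingestion-test-run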

Would you like more detailed steps or have any specific questions about this process?

Where does the token environment variable come from? <@U06TM7M40RK>

The token environment variable (DATAHUB_TOKEN) can be generated and managed programmatically without needing to visit the UI. Here’s how you can achieve this:

Generating a Personal Access Token Programmatically

  1. Enable Metadata Service Authentication: Ensure that metadata service authentication is enabled in your DataHub deployment.
  2. Generate the Token: Use the DataHub API to generate a personal access token.

Step-by-Step Process

1. Enable Metadata Service Authentication

Ensure that the METADATA_SERVICE_AUTH_ENABLED environment variable is set to true for the datahub-gms container or pod. This enables token-based authentication.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datahub-gms
spec:
  template:
    spec:
      containers:
      - name: datahub-gms
        env:
        - name: METADATA_SERVICE_AUTH_ENABLED
          value: "true"

2. Generate the Token

You can generate a personal access token programmatically using the DataHub API. Here’s an example using curl:

curl -X POST "http://<GMS_HOST>:8080/api/authentication/token" \
     -H "Content-Type: application/json" \
     -d '{
           "username": "<your-username>",
           "password": "<your-password>"
         }'

This will return a JSON response containing the token. You can then store this token securely in a Kubernetes Secret.
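
If the call succeeds, one way to wire the result into Kubernetes in a single step is sketched below. Note the assumptions: it relies on the endpoint above working in your deployment and on the response exposing the token under a field named accessToken; inspect the actual payload and adjust the jq filter accordingly (jq must be installed):

# Assumption: the response contains an "accessToken" field; verify against the real response.
TOKEN=$(curl -s -X POST "http://<GMS_HOST>:8080/api/authentication/token" \
     -H "Content-Type: application/json" \
     -d '{"username": "<your-username>", "password": "<your-password>"}' | jq -r '.accessToken')

kubectl create secret generic datahub-secrets --from-literal=DATAHUB_TOKEN="$TOKEN"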

Example Kubernetes Secret

apiVersion: v1
kind: Secret
metadata:
  name: datahub-secrets
type: Opaque
data:
  DATAHUB_TOKEN: <base64_encoded_token>

Referencing the Secret in Your Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datahub-ingestion
spec:
  template:
    spec:
      containers:
      - name: datahub-ingestion
        envFrom:
        - secretRef:
            name: datahub-secrets

Using the Token in Your Ingestion Recipe

In your ingestion recipe, reference the DATAHUB_TOKEN environment variable:

sink:
  type: datahub-rest
  config:
    server: http://<GMS_HOST>:8080
    token: ${DATAHUB_TOKEN}

Summary

By following these steps, you can generate and manage the DATAHUB_TOKEN programmatically, ensuring that your ingestion process is automated and secure.

Would you like more detailed steps or have any specific questions about this process?

<@U06TM7M40RK> I get a 401 when I try to curl that service. Here is how I’m curling:

curl -X POST "http://10.42.0.22:8080/api/authentication/token" \
     -H "Content-Type: application/json" \
     -d '{
           "username": "datahub",
           "password": "datahub"
         }'

The 401 Unauthorized error indicates that the authentication request is not being accepted. Here are a few steps to troubleshoot and resolve this issue:

1. Ensure Metadata Service Authentication is Enabled

First, double-check that the METADATA_SERVICE_AUTH_ENABLED environment variable is set to true for the datahub-gms container or pod. This is crucial for enabling token-based authentication.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datahub-gms
spec:
  template:
    spec:
      containers:
      - name: datahub-gms
        env:
        - name: METADATA_SERVICE_AUTH_ENABLED
          value: "true"

2. Verify Credentials

Ensure that the username and password you are using are correct. The default credentials are often datahub for both username and password, but these might have been changed in your setup.

3. Check API Endpoint

Ensure that the API endpoint URL is correct and accessible. You can test the endpoint by accessing it directly in a browser or using a simple curl command without authentication to see if it responds.
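
For example, a basic reachability check against GMS; the /health endpoint is typically unauthenticated and should return 200 if the service is up (adjust the host and port to your environment):

curl -sS -o /dev/null -w "%{http_code}\n" http://10.42.0.22:8080/health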

4. Correct Curl Command

Ensure that your curl command is correctly formatted. Here is a corrected version of your command:

curl -X POST "http://10.42.0.22:8080/api/authentication/token" \
     -H "Content-Type: application/json" \
     -d '{
           "username": "datahub",
           "password": "datahub"
         }'

5. Check Logs

Check the logs of the datahub-gms service for any errors or warnings that might provide more context on why the authentication is failing.
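
For example, you can pull recent GMS logs and filter for authentication-related messages (the deployment name again assumes the default Helm naming and may differ in your release):

kubectl logs deploy/datahub-datahub-gms --tail=200 | grep -iE "auth|token"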

If you continue to face issues, please provide more details about your setup and any error messages from the logs.

Would you like more detailed steps or have any specific questions about this process?

METADATA_SERVICE_AUTH_ENABLED is enabled by default in the helm chart deployment <@U06TM7M40RK>

If METADATA_SERVICE_AUTH_ENABLED is enabled by default in your Helm chart deployment and you are still receiving a 401 error, there are a few additional steps you can take to troubleshoot and resolve the issue.

1. Verify Helm Chart Configuration

Ensure that the Helm chart values are correctly set to enable authentication. In the official DataHub chart this is controlled by the global metadata service authentication value (shown in the example below), which results in METADATA_SERVICE_AUTH_ENABLED being set to true on the datahub-gms container.

2. Check for Correct Credentials

Ensure that the credentials you are using (username and password) are correct. If you have customized these credentials, make sure they match what is configured in your DataHub deployment.

3. Inspect Logs for Errors

Check the logs of the datahub-gms service for any errors or warnings that might provide more context on why the authentication is failing. This can often provide clues about misconfigurations or other issues.

4. Correct Curl Command

Ensure that your curl command is correctly formatted and does not contain any syntax errors. Here is a corrected version of your command:

curl -X POST "<http://10.42.0.22:8080/api/authentication/token>" \
     -H "Content-Type: application/json" \
     -d '{
           "username": "datahub",
           "password": "datahub"
         }'

Example Configuration

Here is a complete example of how to enable metadata service authentication and generate a token:

Enable Metadata Service Authentication

Ensure that the values.yaml file for your Helm chart includes the following configuration:

global:
  datahub:
    metadata_service_authentication:
      enabled: true

Generate Token Using Curl

curl -X POST "<http://10.42.0.22:8080/api/authentication/token>" \
     -H "Content-Type: application/json" \
     -d '{
           "username": "datahub",
           "password": "datahub"
         }'
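
Once you have a token, a quick sanity check is to call a lightweight GMS endpoint with it; /config is commonly used for this, though the exact endpoints exposed can vary by version:

curl -sS -H "Authorization: Bearer <your-token>" http://10.42.0.22:8080/config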

Additional Resources

For more detailed steps and troubleshooting, refer to the official DataHub documentation on metadata service authentication and personal access tokens.

If you continue to face issues, please provide more details about your setup and any error messages from the logs.

Would you like more detailed steps or have any specific questions about this process?
