Configuring Pre-execution Commands in a Kubernetes CronJob for Datahub Ingestion via Helm Deployment

Original Slack Thread

<@U06TM7M40RK> if I want to run commands before running the datahub ingestion command in an ingestion cron, how can I add it in the YAML? Deployed on k8s via Helm?

It seems likely that you can run commands before the datahub ingestion command in a Kubernetes CronJob by overriding the container command in your Helm chart's values.yaml: specify a shell invocation that runs your pre-execution steps and then calls datahub ingest.

Here’s a general approach based on the available data sources:

  1. Modify the values.yaml File: In your values.yaml file, you can specify a shell command that runs your pre-execution steps followed by the datahub ingestion command. This can be done by using a shell script or inline shell commands.

  2. Example Configuration:

    datahub-ingestion-cron:
      enabled: true
      crons:
        my_ingestion_job:
          schedule: "0 * * * *"  # Adjust the schedule as needed
          recipe:
            configmapName: recipe-config
            fileName: my_recipe.yml
          jobTemplate:
            spec:
              template:
                spec:
                  containers:
                    - name: ingestion
                      image: linkedin/datahub-ingestion:latest
                      command: ["/bin/sh", "-c", "echo 'Running pre-execution steps'; <your-pre-execution-command>; datahub ingest -c /etc/recipe/my_recipe.yml"]
                      volumeMounts:
                        - name: recipe-volume
                          mountPath: /etc/recipe
                  volumes:
                    - name: recipe-volume
                      configMap:
                        name: recipe-config
    
  3. Deploy the Updated Configuration: Apply the updated configuration to your Kubernetes cluster using Helm.

    helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
    

This setup allows you to run any necessary commands before the datahub ingestion command within the same container execution context.

Would you like more detailed steps or additional information on any specific part of the process?

<@U06TM7M40RK> Can you show me a very long command?

I’m here to assist with questions related to DataHub. If you have a specific question about using DataHub, such as configuring ingestion, deploying on Kubernetes, or any other feature, please let me know, and I’ll be happy to help! If you could provide more details or rephrase your question, that would be great.

<@U06TM7M40RK> i want to see a command that also installs gcloud before running the datahub ingestion command

It might be possible to install gcloud before running the datahub ingestion command by using a shell script within your Kubernetes CronJob configuration. This can be done by specifying a shell command that first installs gcloud and then proceeds with the datahub ingestion command.

Here’s an example of how you might configure this in your values.yaml file for a Helm deployment:

datahub-ingestion-cron:
  enabled: true
  crons:
    my_ingestion_job:
      schedule: "0 * * * *"  # Adjust the schedule as needed
      recipe:
        configmapName: recipe-config
        fileName: my_recipe.yml
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: ingestion
                  image: linkedin/datahub-ingestion:latest
                  command: ["/bin/sh", "-c", "echo 'Installing gcloud'; curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-367.0.0-linux-x86_64.tar.gz && tar -xf google-cloud-sdk-367.0.0-linux-x86_64.tar.gz && ./google-cloud-sdk/install.sh -q && export PATH=$PATH:$PWD/google-cloud-sdk/bin; echo 'Running datahub ingestion'; datahub ingest -c /etc/recipe/my_recipe.yml"]
                  volumeMounts:
                    - name: recipe-volume
                      mountPath: /etc/recipe
              volumes:
                - name: recipe-volume
                  configMap:
                    name: recipe-config

Explanation:

  • Install gcloud: The command uses curl to download the Google Cloud SDK archive, extracts it, runs the installation script, and appends the SDK’s bin directory to PATH so that gcloud is available to the commands that follow.
  • Run datahub ingest: After installing gcloud, it proceeds to run the datahub ingest command using the specified recipe.

Deployment:
To apply this configuration, you would update your Helm chart with the modified values.yaml and deploy it to your Kubernetes cluster:

helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml

This setup should allow you to install gcloud before executing the datahub ingestion command within the same job execution.

Would you like more detailed steps or additional information on any specific part of the process?

<@U06TM7M40RK> please show me the example of :
Array of strings denoting the crawling command to be invoked in the cron job. By default it will execute the recipe defined in the crons.recipe object. Cron crawling customization is possible by having extra volumes with custom logic to be executed.

I’m not able to find any data sources to answer your question. You might want to provide more details about the specific use case or configuration you are trying to achieve. If this doesn’t help, please reach out to the DataHub support team for further assistance.
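
For reference, based on the chart documentation quoted in the question above, here is a hedged sketch of what that command field could look like in values.yaml. The /custom-scripts/pre_ingest.sh path is an illustrative assumption (custom logic mounted via an extra volume, as the quoted description suggests), not a documented default; verify the exact field names against the values reference of the datahub-ingestion-cron chart version you deploy.

datahub-ingestion-cron:
  enabled: true
  crons:
    my_ingestion_job:
      schedule: "0 * * * *"
      recipe:
        configmapName: recipe-config
        fileName: my_recipe.yml
      # Array of strings denoting the crawling command; overrides the default
      # invocation of the recipe defined above. The script path is illustrative.
      command: ["/bin/sh", "-c", "/custom-scripts/pre_ingest.sh && datahub ingest -c /etc/recipe/my_recipe.yml"]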

Hi :slight_smile:! Did this interaction resolve your issue or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><

<@U06TM7M40RK>
Is there a way, using the CLI or the GraphQL playground, to list all the soft-deleted objects I have?

Yes, you can list all soft-deleted objects using both the CLI and GraphQL in DataHub.

Using the CLI
You can use the DataHub CLI to list soft-deleted entities by using the datahub delete command with the --dry-run and --only-soft-deleted options. This will simulate a delete operation and list entities that are soft-deleted without actually deleting them.

datahub delete --platform <your_platform> --dry-run --only-soft-deleted

Replace <your_platform> with the specific platform you are interested in, such as metabase, looker, etc.

Using GraphQL
You can also use the GraphQL API to query for soft-deleted entities. Here’s an example of how you might construct a GraphQL query to find such entities:

{
  search(input: {
    type: DATASET,  # Change this to the entity type you are interested in
    query: "*",
    filters: [
      {
        field: "removed",
        value: "true",
        condition: EQUAL
      }
    ],
    start: 0,
    count: 100
  }) {
    entities {
      entity {
        urn
      }
    }
  }
}

This query searches for datasets that have been soft-deleted. You can adjust the type to match the entity type you are interested in, such as CHART, DASHBOARD, etc.

Additional Information

  • Soft Deletion: Soft-deleted entities are not visible in search or browse results by default, but they can still be accessed directly via their URN.
  • Reversibility: Soft deletion is reversible, meaning you can restore these entities if needed.
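
On the reversibility point, here is a hedged sketch of restoring a single soft-deleted entity by rewriting its status aspect with the CLI. The URN is illustrative, and the exact datahub put flags can differ between CLI versions, so treat this as an assumption to verify:

datahub put --urn "urn:li:dataset:(urn:li:dataPlatform:looker,my_model.my_view,PROD)" \
  --aspect status -d '{"removed": false}'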

Would you like more detailed steps or additional examples?

<@U06TM7M40RK>
Using GraphQL, I want to create a new access token with a lifetime of 3 months. Can I do that?

Yes, you can create a new access token with a lifetime of 3 months using the GraphQL API in DataHub. You can use the createAccessToken mutation and specify the duration as THREE_MONTHS.

Here’s an example of how you can construct the GraphQL mutation to achieve this:

mutation {
  createAccessToken(
    input: {
      type: PERSONAL
      actorUrn: "urn:li:corpuser:your_user_urn"
      duration: THREE_MONTHS
      name: "My 3-Month Token"
    }
  ) {
    accessToken
    metadata {
      id
      name
      description
    }
  }
}

Explanation:

  • type: Set to PERSONAL to create a personal access token.
  • actorUrn: Replace "urn:li:corpuser:your_user_urn" with the URN of the user for whom you are creating the token.
  • duration: Set to THREE_MONTHS to specify the token’s lifetime.
  • name: Provide a name for the token, such as "My 3-Month Token".

Execution:
You can execute this mutation using a GraphQL client or the GraphQL playground provided by DataHub.
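
As a rough sketch of the non-playground route, you could also POST the mutation to the GraphQL endpoint with curl. The host and port below assume a local quickstart frontend at localhost:9002, and <existing-token> is a token you already hold for authentication; adjust both for your deployment:

curl -s -X POST http://localhost:9002/api/graphql \
  -H "Authorization: Bearer <existing-token>" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createAccessToken(input: { type: PERSONAL, actorUrn: \"urn:li:corpuser:your_user_urn\", duration: THREE_MONTHS, name: \"My 3-Month Token\" }) { accessToken } }"}'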

Would you like more detailed steps or additional information on executing this mutation?

<@U06TM7M40RK> How can I list all the options for your_user_urn?

To list all the options for your_user_urn, you can use the GraphQL API to query for users in your DataHub instance. This will allow you to retrieve the URNs of all users, which you can then use to identify the specific user URN you are interested in.

Here’s how you can construct a GraphQL query to list all users and their URNs:

{
  listUsers(input: { start: 0, count: 100 }) {
    users {
      urn
      username
      info {
        displayName
        email
      }
    }
  }
}

Explanation:

  • listUsers: This query retrieves a list of users.
  • start and count: These parameters control pagination. Adjust count to retrieve more or fewer users as needed.
  • urn: The unique identifier for each user.
  • username, displayName, email: Additional user information that can help you identify the correct user URN.

Execution:
You can execute this query using a GraphQL client or the GraphQL playground provided by DataHub.

Would you like more detailed steps or additional information on executing this query?

<@U06TM7M40RK> What types of access tokens are there in datahub?
type: PERSONAL
what else?

In DataHub, access tokens are primarily categorized into two types:

  1. PERSONAL: This type of token is used for personal access, allowing users to represent themselves in code and programmatically use DataHub’s APIs. It is typically used for actions that a user would perform manually but wants to automate.

  2. SESSION: This type of token is used for UI-based sessions. It is generated when a user logs into the DataHub UI, for example, via JaaS or OIDC. The session token is used to maintain the user’s session in the web application.

These types are defined in the DataHub Token Service, which is responsible for generating and verifying access tokens. The tokens are issued as JSON Web Tokens (JWTs) and contain claims such as the token type, expiration time, and the actor associated with the token.
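
If you want to inspect those claims for a token you already have, here is a small sketch that simply base64-decodes the middle segment of the JWT and prints it; the token value is a placeholder:

TOKEN="<paste-your-datahub-token>"
python3 - "$TOKEN" <<'EOF'
import base64, json, sys

# A JWT is header.payload.signature; the claims live in the payload segment.
payload = sys.argv[1].split(".")[1]
payload += "=" * (-len(payload) % 4)   # restore base64url padding
print(json.dumps(json.loads(base64.urlsafe_b64decode(payload)), indent=2))
EOF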

Would you like more detailed information on how these tokens are used or managed?

<@U06TM7M40RK> In the open-source version, how can I have authentication for users? Meaning based on Okta or on G Suite?

In the open-source version of DataHub, you can set up authentication for users using OpenID Connect (OIDC) with identity providers like Okta or Google (G Suite). This allows you to delegate authentication to these providers, enabling Single Sign-On (SSO) for your DataHub instance.

Steps to Configure OIDC Authentication

  1. Register an Application with Your Identity Provider:

    • Okta: Log in to your Okta admin account, navigate to the developer console, and create a new application. Select “Web” as the platform and “OpenID Connect” as the sign-on method. Configure the redirect URI to point to your DataHub instance, e.g., https://your-datahub-domain.com/callback/oidc.
    • Google (G Suite): Use the Google API Console to create a new project and configure the OAuth consent screen. Set up OAuth credentials and specify the redirect URI similarly.
  2. Configure DataHub Frontend:

    • Update the datahub-frontend configuration to enable OIDC authentication. This involves setting environment variables with the credentials obtained from your identity provider.

    Example configuration in docker.env or Kubernetes values.yaml:

    extraEnvs:
      - name: AUTH_OIDC_ENABLED
        value: "true"
      - name: AUTH_OIDC_CLIENT_ID
        value: "<your-client-id>"
      - name: AUTH_OIDC_CLIENT_SECRET
        value: "<your-client-secret>"
      - name: AUTH_OIDC_DISCOVERY_URI
        value: "https://<your-provider-discovery-url>/.well-known/openid-configuration"
      - name: AUTH_OIDC_BASE_URL
        value: "https://your-datahub-domain.com"
    
  3. Restart DataHub Frontend:

    • After configuring the environment variables, restart the datahub-frontend service to apply the changes.

Additional Considerations

  • User Provisioning: You can enable Just-In-Time (JIT) provisioning to automatically create user accounts in DataHub when users log in for the first time.
  • Group Synchronization: If your identity provider supports it, you can also synchronize group memberships to manage access control within DataHub.
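
Expanding on the snippet in step 2, here is a hedged sketch of how these options might sit under the datahub-frontend block of a Helm values.yaml. The JIT-provisioning and group-extraction variable names follow the OIDC guide linked below, but verify them against your chart version:

datahub-frontend:
  extraEnvs:
    - name: AUTH_OIDC_ENABLED
      value: "true"
    - name: AUTH_OIDC_CLIENT_ID
      value: "<your-client-id>"
    - name: AUTH_OIDC_CLIENT_SECRET
      value: "<your-client-secret>"
    - name: AUTH_OIDC_DISCOVERY_URI
      value: "https://<your-provider-discovery-url>/.well-known/openid-configuration"
    - name: AUTH_OIDC_BASE_URL
      value: "https://your-datahub-domain.com"
    # Create DataHub accounts on first login (JIT provisioning)
    - name: AUTH_OIDC_JIT_PROVISIONING_ENABLED
      value: "true"
    # Extract group membership from the provider's groups claim
    - name: AUTH_OIDC_EXTRACT_GROUPS_ENABLED
      value: "true"
    - name: AUTH_OIDC_GROUPS_CLAIM
      value: "groups"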

For more detailed instructions, you can refer to the <https://datahubproject.io/docs/authentication/guides/sso/configure-oidc-react|OIDC Authentication Guide>.

Would you like more detailed steps or additional information on any specific part of the process?

<@U06TM7M40RK>
I want to delete all ingestions, or changes caused by ingestions, for the Looker platform that happened before a specific date. How can I do that using the CLI?