<@U06TM7M40RK> if i want to run commands before running the datahub ingestion command in an ingestion cron how can i add it in the yaml? deployed on k8 via helm?
Configuring Pre-execution Commands in a Kubernetes CronJob for Datahub Ingestion via Helm Deployment
It seems likely that you can run commands before the datahub ingest command in a Kubernetes CronJob by modifying the command section in your values.yaml file for the Helm chart. You can achieve this by specifying a shell command that runs the pre-execution steps and then invokes datahub ingest.
Here’s a general approach based on the available data sources:
- Modify the values.yaml file: In your values.yaml file, specify a shell command that runs your pre-execution steps followed by the datahub ingest command. This can be done with a shell script or inline shell commands.
- Example configuration:

datahub-ingestion-cron:
  enabled: true
  crons:
    my_ingestion_job:
      schedule: "0 * * * *" # Adjust the schedule as needed
      recipe:
        configmapName: recipe-config
        fileName: my_recipe.yml
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: ingestion
                  image: linkedin/datahub-ingestion:latest
                  command: ["/bin/sh", "-c", "echo 'Running pre-execution steps'; <your-pre-execution-command>; datahub ingest -c /etc/recipe/my_recipe.yml"]
                  volumeMounts:
                    - name: recipe-volume
                      mountPath: /etc/recipe
              volumes:
                - name: recipe-volume
                  configMap:
                    name: recipe-config
- Deploy the updated configuration: Apply the updated configuration to your Kubernetes cluster using Helm.

helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
This setup allows you to run any necessary commands before the datahub ingest command within the same container execution context.
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- https://datahubspace.slack.com/archives/CUMUWQU66/p1726641121.402129
- https://datahubspace.slack.com/archives/CUMUWQU66/p1723097046.049089
<@U06TM7M40RK> Can you show me a very long command?
I’m here to assist with questions related to DataHub. If you have a specific question about using DataHub, such as configuring ingestion, deploying on Kubernetes, or any other feature, please let me know, and I’ll be happy to help! If you could provide more details or rephrase your question, that would be great.
<@U06TM7M40RK> i want to see a command that also installs gcloud before running the datahub ingestion command
It might be possible to install gcloud before running the datahub ingest command by using a shell script within your Kubernetes CronJob configuration. This can be done by specifying a shell command that first installs gcloud and then proceeds with datahub ingest.
Here’s an example of how you might configure this in your values.yaml file for a Helm deployment:
datahub-ingestion-cron:
  enabled: true
  crons:
    my_ingestion_job:
      schedule: "0 * * * *" # Adjust the schedule as needed
      recipe:
        configmapName: recipe-config
        fileName: my_recipe.yml
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: ingestion
                  image: linkedin/datahub-ingestion:latest
                  command: ["/bin/sh", "-c", "echo 'Installing gcloud'; curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-367.0.0-linux-x86_64.tar.gz && tar -xf google-cloud-sdk-367.0.0-linux-x86_64.tar.gz && ./google-cloud-sdk/install.sh -q && export PATH=$PATH:$(pwd)/google-cloud-sdk/bin; echo 'Running datahub ingestion'; datahub ingest -c /etc/recipe/my_recipe.yml"]
                  volumeMounts:
                    - name: recipe-volume
                      mountPath: /etc/recipe
              volumes:
                - name: recipe-volume
                  configMap:
                    name: recipe-config
Explanation:
- Install gcloud: The command uses curl to download the Google Cloud SDK, extracts it, runs the installation script, and adds the SDK’s bin directory to PATH so that gcloud is available later in the same shell session.
- Run datahub ingest: After installing gcloud, it proceeds to run the datahub ingest command using the specified recipe.
Deployment:
To apply this configuration, update your Helm chart with the modified values.yaml and deploy it to your Kubernetes cluster:
helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
This setup should allow you to install gcloud before executing the datahub ingest command within the same job execution.
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- https://datahubproject.io/docs/0.14.0/deploy/gcp
- https://datahubproject.io/docs/0.13.1/deploy/gcp
Note: I wasn’t able to find highly relevant data sources, but above are a few potentially relevant links.
<@U06TM7M40RK> please show me the example of:
Array of strings denoting the crawling command to be invoked in the cron job. By default it will execute the recipe defined in the crons.recipe
object. Cron crawling customization is possible by having extra volumes with custom logic to be executed.
I’m not able to find any data sources to answer your question. You might want to provide more details about the specific use case or configuration you are trying to achieve. If this doesn’t help, please reach out to the DataHub support team for further assistance.
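For illustration, the command field described in the question above could be set like this in values.yaml (a minimal sketch based on the chart README text quoted above; field names such as extraVolumes and extraVolumeMounts are assumptions to verify against your chart version):

datahub-ingestion-cron:
  enabled: true
  crons:
    my_ingestion_job:
      schedule: "0 * * * *"
      recipe:
        configmapName: recipe-config
        fileName: my_recipe.yml
      # Overrides the default crawling command; by default the chart
      # executes the recipe defined in the crons.recipe object.
      command: ["/bin/sh", "-c", "/custom-scripts/pre-run.sh && datahub ingest -c /etc/recipe/my_recipe.yml"]
      # Extra volumes can carry the custom logic referenced by the command.
      extraVolumes:
        - name: custom-scripts
          configMap:
            name: custom-scripts-config
            defaultMode: 0755
      extraVolumeMounts:
        - name: custom-scripts
          mountPath: /custom-scripts

Here the command array replaces the chart’s default invocation, and the mounted ConfigMap provides the pre-run.sh script containing the custom logic.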
Hi ! Did this interaction resolve your issue or do you still require assistance ?
Please tag @ RunLLM if you still need further assistance !
Or tag @Datahub Community Support if you require a person to take a look
Hope this helps ! ><
<@U06TM7M40RK>
Is there a way using the cli or graphql playground, to list all the soft deleted objects i have?
Yes, you can list all soft-deleted objects using both the CLI and GraphQL in DataHub.
Using the CLI
You can use the DataHub CLI to list soft-deleted entities by running the datahub delete command with the --dry-run and --only-soft-deleted options. This simulates a delete operation and lists entities that are soft-deleted without actually deleting them.

datahub delete --platform <your_platform> --dry-run --only-soft-deleted

Replace <your_platform> with the specific platform you are interested in, such as metabase, looker, etc.
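For instance, to list only the soft-deleted entities for the Looker platform:

datahub delete --platform looker --dry-run --only-soft-deleted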
Using GraphQL
You can also use the GraphQL API to query for soft-deleted entities. Here’s an example of how you might construct a GraphQL query to find such entities:
{
  search(input: {
    type: DATASET, # Change this to the entity type you are interested in
    query: "*",
    filters: [
      {
        field: "removed",
        value: "true",
        condition: EQUAL
      }
    ],
    start: 0,
    count: 100
  }) {
    entities {
      entity {
        urn
      }
    }
  }
}
This query searches for datasets that have been soft-deleted. You can adjust the type to match the entity type you are interested in, such as CHART, DASHBOARD, etc.
Additional Information
- Soft Deletion: Soft-deleted entities are not visible in search or browse results by default, but they can still be accessed directly via their URN (see the example below).
- Reversibility: Soft deletion is reversible, meaning you can restore these entities if needed.
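As an illustration of direct access by URN, a soft-deleted entity’s aspects could be fetched with the CLI like this (a sketch; the URN is a hypothetical placeholder to replace with one from the listing above):

datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:looker,my_model.my_dataset,PROD)"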
Would you like more detailed steps or additional examples?
Sources:
- https://datahubspace.slack.com/archives/CUMUWQU66/p1699265860.278719
- https://github.com/datahub-project/datahub/pull/10920
<@U06TM7M40RK>
Using graphQL i want to create a new access token with a lifetime of 3 months, can i do that?
Yes, you can create a new access token with a lifetime of 3 months using the GraphQL API in DataHub. You can use the createAccessToken mutation and specify the duration as THREE_MONTHS.
Here’s an example of how you can construct the GraphQL mutation to achieve this:
mutation {
createAccessToken(
input: {
type: PERSONAL
actorUrn: "urn:li:corpuser:your_user_urn"
duration: THREE_MONTHS
name: "My 3-Month Token"
}
) {
accessToken
metadata {
id
name
description
}
}
}
Explanation:
- type: Set to PERSONAL to create a personal access token.
- actorUrn: Replace "urn:li:corpuser:your_user_urn" with the URN of the user for whom you are creating the token.
- duration: Set to THREE_MONTHS to specify the token’s lifetime.
- name: Provide a name for the token, such as "My 3-Month Token".
Execution:
You can execute this mutation using a GraphQL client or the GraphQL playground provided by DataHub.
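For example, the mutation could be posted to DataHub’s /api/graphql endpoint with curl (a sketch; the host and bearer token are placeholders, and the calling token needs permission to manage access tokens):

curl -X POST https://your-datahub-domain.com/api/graphql \
  -H "Authorization: Bearer <existing-access-token>" \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createAccessToken(input: { type: PERSONAL, actorUrn: \"urn:li:corpuser:your_user_urn\", duration: THREE_MONTHS, name: \"My 3-Month Token\" }) { accessToken } }"}'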
Would you like more detailed steps or additional information on executing this mutation?
Sources:
- https://github.com/datahub-project/datahub/blob/master/datahub-graphql-core/src/main/resources/auth.graphql
- https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/app/settingsV2/utils.ts
<@U06TM7M40RK> How can i list the all the options for your_user_urn?
To list all the options for your_user_urn, you can use the GraphQL API to query for users in your DataHub instance. This will allow you to retrieve the URNs of all users, which you can then use to identify the specific user URN you are interested in.
Here’s how you can construct a GraphQL query to list all users and their URNs:
{
listUsers(input: { start: 0, count: 100 }) {
users {
urn
username
info {
displayName
email
}
}
}
}
Explanation:
- listUsers: This query retrieves a list of users.
- start and count: These parameters control pagination. Adjust count to retrieve more or fewer users as needed.
- urn: The unique identifier for each user.
- username, displayName, email: Additional user information that can help you identify the correct user URN.
Execution:
You can execute this query using a GraphQL client or the GraphQL playground provided by DataHub.
Would you like more detailed steps or additional information on executing this query?
Sources:
- https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/graphql/user.graphql|https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/graphql/user.graphql
<@U06TM7M40RK> What types of access tokens are there in datahub?
type: PERSONAL
what else?
In DataHub, access tokens are primarily categorized into two types:
- PERSONAL: This type of token is used for personal access, allowing users to represent themselves in code and programmatically use DataHub’s APIs. It is typically used for actions that a user would perform manually but wants to automate.
- SESSION: This type of token is used for UI-based sessions. It is generated when a user logs into the DataHub UI, for example, via JaaS or OIDC. The session token is used to maintain the user’s session in the web application.
These types are defined in the DataHub Token Service, which is responsible for generating and verifying access tokens. The tokens are issued as JSON Web Tokens (JWTs) and contain claims such as the token type, expiration time, and the actor associated with the token.
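To illustrate, the decoded payload of such a JWT might look roughly like the following (an illustrative sketch only; the exact claim names are assumptions that vary by DataHub version, see TokenClaims.java below):

{
  "type": "PERSONAL",
  "actorType": "USER",
  "actorId": "your_user",
  "version": "2",
  "exp": 1735689600
}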
Would you like more detailed information on how these tokens are used or managed?
Sources:
- <https://datahubproject.io/docs/0.13.1/authentication/concepts#what-is-a-datahub-token-service-what-are-access-tokens|DataHub Token Service and Access Tokens>
- <https://github.com/datahub-project/datahub/blob/master/metadata-service/auth-impl/src/main/java/com/datahub/authentication/token/TokenClaims.java|TokenClaims.java>
<@U06TM7M40RK> in opensource version how can i have authentication for users? meaning based on okta or on gsuite
In the open-source version of DataHub, you can set up authentication for users using OpenID Connect (OIDC) with identity providers like Okta or Google (G Suite). This allows you to delegate authentication to these providers, enabling Single Sign-On (SSO) for your DataHub instance.
Steps to Configure OIDC Authentication
- Register an application with your identity provider:
  - Okta: Log in to your Okta admin account, navigate to the developer console, and create a new application. Select “Web” as the platform and “OpenID Connect” as the sign-on method. Configure the redirect URI to point to your DataHub instance, e.g., https://your-datahub-domain.com/callback/oidc.
  - Google (G Suite): Use the Google API Console to create a new project and configure the OAuth consent screen. Set up OAuth credentials and specify the redirect URI similarly.
- Configure DataHub Frontend: Update the datahub-frontend configuration to enable OIDC authentication. This involves setting environment variables with the credentials obtained from your identity provider. Example configuration in docker.env or Kubernetes values.yaml:

  extraEnvs:
    - name: AUTH_OIDC_ENABLED
      value: "true"
    - name: AUTH_OIDC_CLIENT_ID
      value: "<your-client-id>"
    - name: AUTH_OIDC_CLIENT_SECRET
      value: "<your-client-secret>"
    - name: AUTH_OIDC_DISCOVERY_URI
      value: "https://<your-provider-discovery-url>/.well-known/openid-configuration"
    - name: AUTH_OIDC_BASE_URL
      value: "https://your-datahub-domain.com"

- Restart DataHub Frontend: After configuring the environment variables, restart the datahub-frontend service to apply the changes.
Additional Considerations
- User Provisioning: You can enable Just-In-Time (JIT) provisioning to automatically create user accounts in DataHub when users log in for the first time.
- Group Synchronization: If your identity provider supports it, you can also synchronize group memberships to manage access control within DataHub. A sketch of the related settings follows this list.
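For instance, JIT provisioning and group extraction can be toggled with additional environment variables (a sketch; these variable names come from the OIDC guide linked below, so verify them against your DataHub version):

  extraEnvs:
    - name: AUTH_OIDC_JIT_PROVISIONING_ENABLED
      value: "true"
    - name: AUTH_OIDC_EXTRACT_GROUPS_ENABLED
      value: "true"
    - name: AUTH_OIDC_GROUPS_CLAIM
      value: "groups"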
For more detailed instructions, you can refer to the <https://datahubproject.io/docs/authentication/guides/sso/configure-oidc-react|OIDC Authentication Guide>.
Would you like more detailed steps or additional information on any specific part of the process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/authentication/guides/sso/configure-oidc-react.md|DataHub GitHub Repository>
<@U06TM7M40RK>
I want to delete all ingestions, or changes caused by ingestions, for the looker platform that happened before a specific date, how can i do that using the cli?