Configuring Pre-execution Commands in a Kubernetes CronJob for Datahub Ingestion via Helm Deployment

To delete everything ingested from the Looker platform before a specific date using the DataHub CLI, you can follow the general steps below. You will likely need to combine the datahub delete command with a script that filters entities by their ingestion date.

Steps to Achieve This:

  1. Identify Entities to Delete:

    • Run datahub delete with --platform looker and the --dry-run option to list Looker entities without deleting anything. The dry run shows which entities match, but not when each one was ingested.
  2. Filter by Ingestion Date:

    • Unfortunately, the CLI does not directly support filtering by ingestion date. You may need to use a script to query the metadata service for entities and filter them based on their lastIngested timestamp.
  3. Delete Entities:

    • Once you have identified the entities to delete, use the datahub delete command to perform the deletion.

Example Script:

Here’s a conceptual example of how you might script this process using Python and the DataHub CLI:

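import requests
import subprocess
from datetime import datetime, timezone

# Define your DataHub GraphQL endpoint
api_endpoint = 'http://localhost:8080/api/graphql'

# Delete entities last ingested before this date (UTC)
cutoff_date = datetime(2023, 1, 1, tzinfo=timezone.utc)

# GraphQL query to fetch Looker charts with their lastIngested timestamp.
# Assumptions: results are returned under searchResults, and lastIngested is
# exposed on the Chart type in your DataHub version. The platform filter may
# need the full URN (urn:li:dataPlatform:looker) instead of "looker".
# Only the first 100 results are fetched; page with start/count for more.
query = """
{
  search(input: {
    type: CHART,
    query: "*",
    filters: [
      {
        field: "platform",
        value: "looker"
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
        ... on Chart {
          lastIngested
        }
      }
    }
  }
}
"""

# Execute the GraphQL query and fail fast on HTTP errors
response = requests.post(api_endpoint, json={'query': query})
response.raise_for_status()
data = response.json()

def ingested_before(entity, cutoff):
    # Assumption: lastIngested is epoch milliseconds; skip entities without it
    last_ingested = entity.get('lastIngested')
    if last_ingested is None:
        return False
    return datetime.fromtimestamp(last_ingested / 1000, tz=timezone.utc) < cutoff

# Filter entities based on the cutoff date
entities_to_delete = [
    result['entity']['urn']
    for result in data['data']['search']['searchResults']
    if ingested_before(result['entity'], cutoff_date)
]

# Delete the filtered entities using the DataHub CLI
# (add '--dry-run' to the command to preview first)
for urn in entities_to_delete:
    subprocess.run(['datahub', 'delete', '--urn', urn, '--hard'], check=True)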
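
Note that the script above requests a single page of 100 results. A minimal sketch of walking through all pages with start and count follows. The fetch_page callable is a hypothetical stand-in for whatever function runs the search query with a given start and count and returns the list of results; here it is replaced by fake data so the loop can be tried without a DataHub instance.

```python
from typing import Callable, Iterator, List


def iter_search_results(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield every search result by advancing `start` one page at a time."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        yield from page
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        start += page_size


# Stand-in for a real GraphQL call, used only to demonstrate the loop.
fake_results = [{"entity": {"urn": f"urn:li:chart:(looker,{i})"}} for i in range(250)]


def fake_fetch_page(start: int, count: int) -> List[dict]:
    return fake_results[start:start + count]


all_urns = [r["entity"]["urn"] for r in iter_search_results(fake_fetch_page)]
print(len(all_urns))
```

In a real script, fetch_page would post the search query with the given start and count and return the result list from the response.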

Important Considerations:

  • Dry Run: Always perform a dry run first to ensure you are targeting the correct entities.
  • Backup: Consider backing up your metadata before performing bulk deletions.
  • Batch Processing: If you have a large number of entities, consider batching the deletions to avoid overloading the system.
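
The dry-run and batching advice can be sketched roughly as below. This is only an illustration: the helper names, the batch size, and the pause between batches are assumptions, and the run parameter exists so the loop can be exercised without actually calling the CLI. Dry run is the default, so nothing is deleted unless you pass dry_run=False.

```python
import subprocess
import time
from typing import Iterator, List


def batched(urns: List[str], batch_size: int) -> Iterator[List[str]]:
    """Split the URN list into consecutive chunks of at most batch_size."""
    for i in range(0, len(urns), batch_size):
        yield urns[i:i + batch_size]


def build_delete_command(urn: str, dry_run: bool = True) -> List[str]:
    """Build the CLI call; dry_run defaults to True to avoid accidental deletes."""
    command = ["datahub", "delete", "--urn", urn, "--hard"]
    if dry_run:
        command.append("--dry-run")
    return command


def delete_in_batches(urns, batch_size=50, pause_seconds=1.0,
                      dry_run=True, run=subprocess.run):
    """Run one delete per URN, pausing after each batch to spread the load."""
    for batch in batched(urns, batch_size):
        for urn in batch:
            run(build_delete_command(urn, dry_run), check=True)
        time.sleep(pause_seconds)


# Example (not executed here): preview first, then delete for real.
# delete_in_batches(entities_to_delete)                  # dry run
# delete_in_batches(entities_to_delete, dry_run=False)   # actual deletion
```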

Would you like more detailed information on any specific part of this process?

<@U06TM7M40RK> just give me a query

{
  search(input: {
    type: CHART,
    query: "*",
    filters: [
      {
        field: "platform",
        value: "looker"
      }
    ],
    start: 0,
    count: 100
  }) {
    entities {
      entity {
        urn
        lastIngested
      }
    }
  }
}


this doesn't work

The query you provided is intended to search for charts on the Looker platform and retrieve their URNs and lastIngested timestamps. It likely fails for two reasons: the search response exposes its results under searchResults rather than entities, and the lastIngested field might not be directly available on the entity in your GraphQL schema.

Here’s a revised version of the query that uses searchResults and retrieves fields available on the Chart type. If lastIngested is not directly accessible, check which fields your DataHub instance exposes for the CHART entity type:

{
  search(input: {
    type: CHART,
    query: "*",
    filters: [
      {
        field: "platform",
        value: "looker"
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
        ... on Chart {
          properties {
            name
            description
            lastModified {
              time
            }
          }
        }
      }
    }
  }
}

Explanation:

  • searchResults: This field is used to access the results of the search query.
  • entity: This field contains the URN and other details of the entity.
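  • properties: This field includes properties of the chart, such as name, description, and lastModified. Note that lastModified records when the chart was last changed, which is not necessarily when it was ingested.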

Troubleshooting:

  • Check Available Fields: Use the GraphQL schema or introspection to verify which fields are available for the CHART entity type.
  • GraphiQL: Use the GraphiQL interface provided by DataHub to explore the schema and test queries interactively.

If lastIngested is a custom field or part of a specific aspect, you may need to adjust the query to match your DataHub instance’s schema.
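
One way to check the available fields is standard GraphQL introspection. The sketch below asks the server which fields the Chart type exposes; it assumes introspection is enabled on your instance and that the type is named Chart. The function names and the endpoint argument are illustrative, and authentication headers are omitted.

```python
import json
import urllib.request

# Standard GraphQL introspection query; "Chart" is the assumed type name.
INTROSPECTION_QUERY = """
query ChartFields {
  __type(name: "Chart") {
    fields {
      name
    }
  }
}
"""


def field_names(introspection_response: dict) -> list:
    """Return the sorted field names from an introspection response."""
    type_info = (introspection_response.get("data") or {}).get("__type")
    if not type_info:  # unknown type name, or introspection is disabled
        return []
    return sorted(f["name"] for f in (type_info.get("fields") or []))


def fetch_chart_fields(api_endpoint: str) -> list:
    """POST the introspection query to the endpoint and list the Chart fields."""
    request = urllib.request.Request(
        api_endpoint,
        data=json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return field_names(json.load(response))


# Example (not executed here):
# print("lastIngested" in fetch_chart_fields("http://localhost:8080/api/graphql"))
```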

Would you like more detailed steps or additional information on exploring the GraphQL schema?

<@U06TM7M40RK>
How can I see which GraphQL queries are being executed when I click around in the UI?

To see which GraphQL queries are being executed when you interact with the DataHub UI, you can use your browser’s developer tools. Here’s how you can do it:

Steps to Monitor GraphQL Queries in the UI

  1. Open Developer Tools:

    • In most browsers, you can open the developer tools by pressing F12 or Ctrl+Shift+I (Windows/Linux) or Cmd+Option+I (Mac).
  2. Navigate to the Network Tab:

    • Once the developer tools are open, click on the “Network” tab. This tab allows you to see all network requests made by the browser.
  3. Filter for GraphQL Requests:

    • In the filter box, type graphql to filter the requests. This will show you only the requests made to the GraphQL endpoint.
  4. Perform Actions in the UI:

    • Interact with the DataHub UI as you normally would. For example, click on different entities, search for datasets, or view dashboards.
  5. Inspect GraphQL Requests:

    • As you perform actions, you will see GraphQL requests appear in the network tab. Click on a request to see its details.
    • In the “Headers” section, you can see the request URL and any headers sent with the request.
    • In the “Payload” section, you can see the actual GraphQL query or mutation that was executed.
  6. View Response:

    • In the “Response” section, you can see the data returned by the GraphQL server.

Additional Tips

  • Operation Name: Look for the operationName in the payload to identify the specific query or mutation being executed.
  • Use GraphiQL: If you want to experiment with the queries you find, you can use the GraphiQL interface provided by DataHub to run and modify them.

This method allows you to see exactly what GraphQL operations are being executed in response to your interactions with the DataHub UI.
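
If you want to review many requests at once, most browsers can export the Network tab as a HAR file, which is plain JSON. The sketch below pulls the operationName and query out of such an export. It assumes the GraphQL requests have "graphql" in their URL and carry a JSON body, as described in the steps above; the function name and the sample data are made up for illustration.

```python
import json


def graphql_operations(har: dict) -> list:
    """Return (operationName, query) pairs for GraphQL requests in a HAR export."""
    operations = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        if "graphql" not in request.get("url", ""):
            continue
        body_text = request.get("postData", {}).get("text")
        if not body_text:
            continue
        try:
            body = json.loads(body_text)
        except json.JSONDecodeError:
            continue
        # A request body may hold one operation or a batch (a JSON list).
        for op in body if isinstance(body, list) else [body]:
            if isinstance(op, dict):
                operations.append((op.get("operationName"), op.get("query")))
    return operations


# Tiny hand-written HAR fragment, for demonstration only.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "http://localhost:9002/api/v2/graphql",
                         "postData": {"text": json.dumps(
                             {"operationName": "getChart",
                              "query": "query getChart { __typename }"})}}},
            {"request": {"url": "http://localhost:9002/assets/logo.png"}},
        ]
    }
}

for name, query in graphql_operations(sample_har):
    print(name, "->", query)
```

With a real export, load the file with json.load and pass the resulting dictionary to graphql_operations.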

Would you like more detailed steps or additional information on any specific part of the process?
