Using DataHubGraph for Monitoring and Analytics

Original Slack Thread

<@U06TM7M40RK> Is there any API to scrape ingestion-related details for monitoring?

To scrape ingestion-related details for monitoring in DataHub, you can use the DataHub GraphQL API. Specifically, the ingestionSourceExecutionRequests query can be used to fetch details about ingestion runs, including their status. Here’s a brief overview of how you can achieve this:

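  1. GraphQL Query: Use the ingestionSourceExecutionRequests query to retrieve information about ingestion runs. It returns details such as the run ID, start time, end time, and status. Query and field names can differ between DataHub versions, so confirm them against your instance's GraphQL schema.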

  2. Example GraphQL Query:

    {
      ingestionSourceExecutionRequests(input: {start: 0, count: 10}) {
        start
        count
        total
        executionRequests {
          id
          status
          startTime
          endTime
          ingestionSource {
            name
          }
        }
      }
    }
    
  3. Using curl: You can execute the GraphQL query using curl as follows. The query is written on a single line because a JSON string cannot contain raw newlines:

    curl -X POST "http://localhost:8080/api/graphql" \
      -H "Content-Type: application/json" \
      -d '{"query": "{ ingestionSourceExecutionRequests(input: {start: 0, count: 10}) { start count total executionRequests { id status startTime endTime ingestionSource { name } } } }"}'
    
  4. Using Python: You can also use Python to execute the GraphQL query:

    import requests
    
    url = "http://localhost:8080/api/graphql"
    query = {
        "query": """
        {
          ingestionSourceExecutionRequests(input: {start: 0, count: 10}) {
            start
            count
            total
            executionRequests {
              id
              status
              startTime
              endTime
              ingestionSource {
                name
              }
            }
          }
        }
        """
    }
    
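    # Send the query and fail loudly on HTTP errors instead of printing an error body
    response = requests.post(url, json=query, timeout=30)
    response.raise_for_status()
    print(response.json())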
    

This approach allows you to programmatically monitor the status of ingestion processes in DataHub.
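
The start, count, and total fields in the response make it possible to page through every run rather than only the first ten. Below is a minimal sketch, assuming the response has the shape of the query above. Here fetch_page is a hypothetical callable standing in for the HTTP call, and both helper names are illustrative rather than part of any DataHub library.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List


def iter_execution_requests(
    fetch_page: Callable[[int, int], Dict],
    page_size: int = 10,
) -> Iterator[Dict]:
    """Yield every execution request, one page at a time.

    fetch_page(start, count) is a hypothetical callable that returns the
    `ingestionSourceExecutionRequests` object from the query above.
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        requests_on_page: List[Dict] = page.get("executionRequests") or []
        yield from requests_on_page
        start += page_size
        # Stop on an empty page or once the reported total has been covered.
        if not requests_on_page or start >= page.get("total", 0):
            break


def count_by_status(execution_requests: Iterable[Dict]) -> Counter:
    """Count runs per status value, e.g. to feed a monitoring dashboard."""
    return Counter(req.get("status", "UNKNOWN") for req in execution_requests)
```

In practice fetch_page would wrap the requests.post call shown earlier, substituting start and count into the query input.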

Would you like more detailed steps or additional examples?

<@U06TM7M40RK> Is this data exposed by default?

By default, DataHub’s GraphQL API is designed to expose metadata-related information, including ingestion details, to users who have the appropriate permissions. However, access to this data is controlled by DataHub’s security and access control mechanisms. Here are some key points to consider:

  1. Access Control: Access to the GraphQL API and the data it exposes is typically governed by user roles and permissions. Users need to have the necessary permissions to query ingestion-related data.

  2. Authentication: The GraphQL API is usually protected by authentication mechanisms. Users must authenticate themselves to access the API, ensuring that only authorized users can retrieve sensitive information.

  3. Configuration: The level of detail and the specific data exposed by the API can be configured by the DataHub administrators. This means that while the API is capable of exposing ingestion details, the actual data available to a user depends on the configuration and permissions set by the administrators.

  4. Security Best Practices: It is recommended to follow security best practices, such as using secure connections (HTTPS), managing API keys securely, and regularly reviewing user permissions to ensure that only authorized users have access to sensitive data.

If you have specific concerns about data exposure or need to configure access controls, it would be best to consult with your DataHub administrator or refer to the DataHub documentation for guidance on securing the API.
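
When token-based authentication is enabled, requests typically carry a personal access token as a Bearer token in the Authorization header. The sketch below assumes the requests-based call shown earlier in this thread. build_graphql_request is a hypothetical helper that only assembles the request arguments, so nothing is sent until they are passed to requests.post.

```python
from typing import Any, Dict, Optional


def build_graphql_request(
    query: str,
    token: Optional[str] = None,
    url: str = "http://localhost:8080/api/graphql",
) -> Dict[str, Any]:
    """Assemble keyword arguments suitable for requests.post(**kwargs)."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Send the personal access token as a Bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return {"url": url, "json": {"query": query}, "headers": headers, "timeout": 30}
```

A call would then look like requests.post(**build_graphql_request(query, token="<your-personal-access-token>")).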

Would you like more information on configuring access controls or any other specific aspect of DataHub’s security?

<@U06TM7M40RK> How do I use DataHubGraph?

DataHubGraph is a client in the DataHub Python SDK that lets you interact programmatically with DataHub's metadata graph through its API. Here’s a step-by-step guide on how to use it:

  1. Installation: Ensure you have the DataHub Python SDK installed. You can install it using pip:

    pip install --upgrade acryl-datahub
    
  2. Configuration: Set up the DataHubGraph client with the necessary configuration, including the server endpoint and authentication token.

  3. Example Usage: Below is an example of how to use DataHubGraph to fetch metadata about a dataset:

    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import SchemaMetadataClass
    
    # Set the GMS endpoint and your personal access token
    gms_endpoint = "http://localhost:8080"
    access_token = "<your-personal-access-token>"
    
    # Configure the DataHubGraph client with the access token
    config = DatahubClientConfig(
        server=gms_endpoint,
        extra_headers={"Authorization": f"Bearer {access_token}"}
    )
    graph = DataHubGraph(config)
    
    # Example: Fetch the schemaMetadata aspect of a dataset.
    # get_aspect returns None if the entity has no such aspect.
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
    schema_metadata = graph.get_aspect(
        entity_urn=dataset_urn,
        aspect_type=SchemaMetadataClass
    )
    
    print(schema_metadata)
    
  4. Authentication: Make sure to replace <your-personal-access-token> with a valid token. You can generate a personal access token from your DataHub instance.

  5. Operations: You can perform various operations using DataHubGraph, such as fetching metadata, updating entities, and querying relationships.
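
To keep the token out of source code, the client settings can be read from the environment. This is a minimal sketch: the variable names DATAHUB_GMS_URL and DATAHUB_GMS_TOKEN are an assumption chosen to mirror DataHub CLI conventions, so substitute whatever your deployment uses. client_settings_from_env is a hypothetical helper, not part of the SDK.

```python
import os
from typing import Dict, Mapping, Optional


def client_settings_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Build keyword arguments for DatahubClientConfig from environment variables."""
    env = os.environ if env is None else env
    # Variable names are assumptions; adjust them to your deployment.
    settings: Dict[str, object] = {
        "server": env.get("DATAHUB_GMS_URL", "http://localhost:8080")
    }
    token = env.get("DATAHUB_GMS_TOKEN")
    if token:
        settings["extra_headers"] = {"Authorization": f"Bearer {token}"}
    return settings
```

The result can be unpacked into the config from the example above, for instance DatahubClientConfig(**client_settings_from_env()).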

Would you like more detailed steps or specific examples on using DataHubGraph for other operations?

<@U06TM7M40RK> {"operationName":"getHighlights","variables":{},"query":"query getHighlights {\n getHighlights {\n value\n title\n body\n __typename\n }\n}\n"}

How can I call this using DataHubGraph?

To call the getHighlights query using the DataHubGraph client in Python, you need to execute a GraphQL query through the client. Here’s how you can do it:

  1. Set Up the DataHubGraph Client: Ensure you have the DataHub Python SDK installed and set up the DataHubGraph client with the necessary configuration, including the server endpoint and authentication token.

  2. Execute the GraphQL Query: Use the execute_graphql method of the DataHubGraph client to execute the getHighlights query.

Here is a step-by-step example:

from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Set the GMS endpoint and your personal access token
gms_endpoint = "http://localhost:8080"
access_token = "<your-personal-access-token>"

# Configure the DataHubGraph client with the access token
config = DatahubClientConfig(
    server=gms_endpoint,
    extra_headers={"Authorization": f"Bearer {access_token}"}
)
graph = DataHubGraph(config)

# Define the GraphQL query for getHighlights
query = """
query getHighlights {
  getHighlights {
    value
    title
    body
    __typename
  }
}
"""

# Execute the GraphQL query
result = graph.execute_graphql(query=query)

# Print the result
print(result)

Explanation:

  • Configuration: The DatahubClientConfig is configured with the GMS endpoint and the Authorization header containing the Bearer token.
  • GraphQL Query: The getHighlights query is defined to fetch highlights, including value, title, body, and __typename.
  • Execution: The execute_graphql method is used to execute the query and print the results.

Make sure to replace <your-personal-access-token> with your actual access token.
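
Once the query returns, the highlights can be reduced to one line per metric for logging or alerting. This is a minimal sketch, assuming execute_graphql returns the contents of the response's data field, so the list sits under the getHighlights key. format_highlights is a hypothetical helper and the sample values are made up.

```python
from typing import Any, Dict, List


def format_highlights(result: Dict[str, Any]) -> List[str]:
    """Turn a getHighlights result into 'title: value' strings."""
    lines = []
    for item in result.get("getHighlights") or []:
        lines.append(f"{item.get('title', '?')}: {item.get('value', '?')}")
    return lines


# Made-up sample in the shape requested by the query above.
sample = {
    "getHighlights": [
        {"value": 42, "title": "Datasets", "body": "", "__typename": "Highlight"}
    ]
}
print(format_highlights(sample))
```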

Would you like more detailed steps or additional examples?

<@U06TM7M40RK> {"operationName":"getAnalyticsCharts","variables":{},"query":"query getAnalyticsCharts {\n getAnalyticsCharts {\n groupId\n title\n charts {\n ...analyticsChart\n __typename\n }\n __typename\n }\n}\n\nfragment analyticsChart on AnalyticsChart {\n ... on TimeSeriesChart {\n title\n lines {\n name\n data {\n x\n y\n __typename\n }\n __typename\n }\n dateRange {\n start\n end\n __typename\n }\n interval\n __typename\n }\n ... on BarChart {\n title\n bars {\n name\n segments {\n label\n value\n __typename\n }\n __typename\n }\n __typename\n }\n ... on TableChart {\n title\n columns\n rows {\n values\n cells {\n value\n linkParams {\n searchParams {\n types\n query\n filters {\n field\n values\n __typename\n }\n __typename\n }\n entityProfileParams {\n urn\n type\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n }\n __typename\n}\n"}

To execute the getAnalyticsCharts GraphQL query using the DataHubGraph client in Python, you can follow these steps. This query retrieves analytics charts, including time series, bar charts, and table charts, from DataHub.

  1. Set Up the DataHubGraph Client: Ensure you have the DataHub Python SDK installed and configure the DataHubGraph client with the necessary server endpoint and authentication token.

  2. Execute the GraphQL Query: Use the execute_graphql method of the DataHubGraph client to execute the getAnalyticsCharts query.

Here is a step-by-step example:

from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

# Set the GMS endpoint and your personal access token
gms_endpoint = "http://localhost:8080"
access_token = "<your-personal-access-token>"

# Configure the DataHubGraph client with the access token
config = DatahubClientConfig(
    server=gms_endpoint,
    extra_headers={"Authorization": f"Bearer {access_token}"}
)
graph = DataHubGraph(config)

# Define the GraphQL query for getAnalyticsCharts
query = """
query getAnalyticsCharts {
  getAnalyticsCharts {
    groupId
    title
    charts {
      ...analyticsChart
      __typename
    }
    __typename
  }
}

fragment analyticsChart on AnalyticsChart {
  ... on TimeSeriesChart {
    title
    lines {
      name
      data {
        x
        y
        __typename
      }
      __typename
    }
    dateRange {
      start
      end
      __typename
    }
    interval
    __typename
  }
  ... on BarChart {
    title
    bars {
      name
      segments {
        label
        value
        __typename
      }
      __typename
    }
    __typename
  }
  ... on TableChart {
    title
    columns
    rows {
      values
      cells {
        value
        linkParams {
          searchParams {
            types
            query
            filters {
              field
              values
              __typename
            }
            __typename
          }
          entityProfileParams {
            urn
            type
            __typename
          }
          __typename
        }
        __typename
      }
      __typename
    }
    __typename
  }
  __typename
}
"""

# Execute the GraphQL query
result = graph.execute_graphql(query=query)

# Print the result
print(result)

Explanation:

  • Configuration: The DatahubClientConfig is configured with the GMS endpoint and the Authorization header containing the Bearer token.
  • GraphQL Query: The getAnalyticsCharts query is defined to fetch analytics charts, including time series, bar charts, and table charts.
  • Execution: The execute_graphql method is used to execute the query and print the results.

Make sure to replace <your-personal-access-token> with your actual access token.
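
Because each chart type stores its data under a different field (lines, bars, or rows), it helps to dispatch on __typename when processing the result. This is a minimal sketch, assuming execute_graphql returns the contents of the response's data field. summarize_charts is a hypothetical helper, and the field names are taken from the query above.

```python
from typing import Any, Dict, List, Tuple

# Field holding the data items for each chart type in the query above.
ITEMS_FIELD = {"TimeSeriesChart": "lines", "BarChart": "bars", "TableChart": "rows"}


def summarize_charts(result: Dict[str, Any]) -> List[Tuple[str, str, str, int]]:
    """Return (group title, chart title, chart type, item count) per chart."""
    summary = []
    for group in result.get("getAnalyticsCharts") or []:
        for chart in group.get("charts") or []:
            chart_type = chart.get("__typename", "Unknown")
            # Unknown chart types have no known items field and count as 0.
            items = chart.get(ITEMS_FIELD.get(chart_type, "")) or []
            summary.append(
                (group.get("title", ""), chart.get("title", ""), chart_type, len(items))
            )
    return summary
```

Calling summarize_charts(result) on the output of the example above gives a compact overview that can be logged or compared between runs.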

Would you like more detailed steps or additional examples?
