Tracking Changes in Datasets: Utilizing DataHub APIs and Implementing Strategies

Original Slack Thread

<@U06TM7M40RK> How can we track the changes made on different datasets specific to change in tag or business attributes?

To track changes to datasets, specifically changes to tags or business attributes, you can use DataHub’s Timeline API, the Entity Events API, and the Actions Framework. Here’s a brief overview of each:

  1. Timeline API: This API lets you view the version history of an entity and query it by change category, so you can see what was modified over time. It supports datasets and glossary terms. The documented change categories cover tags, ownership, documentation, glossary terms, and schema; confirm that your DataHub version also reports business attribute changes before relying on it for those.

  2. Entity Events API: This API allows you to integrate changes happening on the DataHub Metadata Graph in real time. It can be used to track changes in tags and business attributes by listening to specific events.

  3. Actions Framework: You can also use the Actions Framework to react to changes in real time. This framework allows you to set up pipelines that listen for specific events, such as tag additions or modifications, and trigger actions like notifications or updates to other systems (https://datahubproject.io/docs/0.14.0/actions/concepts|source).

Would you like more detailed information on setting up these APIs or using the Actions Framework?

<@U06TM7M40RK> - best way to implement this

To implement tracking of changes in tags or business attributes on datasets using DataHub, you can follow these steps:

  1. Set Up the Timeline API:

    • Ensure your DataHub instance is running a version that supports the Timeline API (server versions 0.8.28 and higher).
    • Use the Timeline API to query changes for specific datasets. You can filter by categories such as TAG to focus on tag changes.
    • Example command using the CLI:
      datahub timeline --urn "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)" --category TAG --start 7daysago
      
    • This command will show modifications to tags for the specified dataset over the past week (https://datahubproject.io/docs/0.14.0/cli#timeline|source).
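  2. Utilize the Entity Events API:

    • Use the Entity Events API to consume changes on the DataHub Metadata Graph in real time, listening for the events that correspond to tag or business attribute changes.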
  3. Leverage the Actions Framework:

    • Use the Actions Framework to automate responses to changes. For example, you can set up a pipeline that triggers a notification or updates another system when a tag is added or modified.
    • Define an action configuration file specifying the event source, filters, and actions to be taken.
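    • Example configuration snippet (the built-in hello_world action, which prints each matching event, stands in here for the notification action of your choice):
      name: "tag_change_listener"
      source:
        type: "kafka"
        config:
          connection:
            bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
            schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
      filter:
        # The category field is part of EntityChangeEvent_v1, not MetadataChangeLogEvent_v1
        event_type: "EntityChangeEvent_v1"
        event:
          category: "TAG"
      action:
        # Swap in a notification action (for example a Slack action) once the filter behaves as expected
        type: "hello_world"
      
    • This setup listens for tag changes and prints each matching event; replacing the action type turns it into a notification pipeline (https://datahubproject.io/docs/0.14.0/actions/concepts|source).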
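
To make the filtering step concrete, here is a minimal Python sketch of the check such a pipeline performs on each event. It is not the Actions Framework API: `describe_tag_change` and `SAMPLE_EVENT` are hypothetical names, and the sample payload only assumes the general shape of an EntityChangeEvent_v1 (entityType, entityUrn, category, operation, modifier, auditStamp) with made-up values.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical sample shaped like an EntityChangeEvent_v1 payload; values are made up.
SAMPLE_EVENT = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)",
    "category": "TAG",
    "operation": "ADD",
    "modifier": "urn:li:tag:PII",
    "auditStamp": {"actor": "urn:li:corpuser:jdoe", "time": 1649953100653},
}


def describe_tag_change(event: dict) -> Optional[str]:
    """Return a one-line summary for dataset tag changes, or None for any other event."""
    if event.get("entityType") != "dataset" or event.get("category") != "TAG":
        return None
    stamp = event.get("auditStamp", {})
    # auditStamp.time is assumed to be epoch milliseconds.
    when = datetime.fromtimestamp(stamp.get("time", 0) / 1000, tz=timezone.utc)
    return (
        f"{event.get('operation')} {event.get('modifier')} "
        f"on {event.get('entityUrn')} "
        f"by {stamp.get('actor')} at {when.isoformat()}"
    )


if __name__ == "__main__":
    print(describe_tag_change(SAMPLE_EVENT))
```

The same predicate could live inside a custom action, with the returned string passed to whatever notification channel you use.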

Would you like more detailed guidance on any specific part of this implementation?

<@U06TM7M40RK> - can we check these changes in timelines via the UI?

Yes, you can view changes in timelines via the DataHub UI. The Timeline feature in DataHub allows you to see the version history of datasets, including changes to tags and other metadata aspects. Here’s how you can access and use this feature:

  1. Accessing the Timeline in the UI:

    • Navigate to the dataset for which you want to view the timeline.
    • Go to the “Schema” tab of the dataset. If the dataset has multiple versions, you will see a version selector that allows you to view different versions of the dataset’s schema.
  2. Viewing Schema History:

    • The Schema History feature provides a visual representation of changes over time. You can select different versions to see what the schema looked like at each point in time.
    • You can also toggle the “Audit” view to see when the most recent changes were made to each field (https://datahubproject.io/docs/0.14.0/schema-history|source).
  3. Using the Timeline API:

    • While the UI provides a visual way to explore changes, the Timeline API can be used for more detailed queries and is accessible via the DataHub UI’s API explorer. This allows you to query specific changes, such as tag modifications, directly from the UI (https://datahubproject.io/docs/0.14.0/dev-guides/timeline|source).
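
For querying the Timeline API outside the UI, the sketch below builds the request URL for a one-week window of tag changes. It rests on assumptions you should confirm in your deployment’s API explorer: that the endpoint is served under `/openapi/timeline/v1/<url-encoded-urn>` on the metadata service, and that it accepts `categories`, `start`, and `end` (epoch milliseconds) query parameters. `build_timeline_url` is a hypothetical helper name, and `http://localhost:8080` is a placeholder host.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional
from urllib.parse import quote, urlencode


def build_timeline_url(
    gms_base: str,
    urn: str,
    categories: Iterable[str],
    days_back: int = 7,
    now: Optional[datetime] = None,
) -> str:
    """Build a Timeline API URL covering the last `days_back` days."""
    now = now or datetime.now(timezone.utc)
    start_ms = int((now - timedelta(days=days_back)).timestamp() * 1000)
    end_ms = int(now.timestamp() * 1000)
    # One "categories" parameter per requested category (assumed parameter name).
    params = [("categories", c) for c in categories]
    params += [("start", start_ms), ("end", end_ms)]
    # The URN contains ':', '(', ')' and ',' so it must be fully percent-encoded.
    encoded_urn = quote(urn, safe="")
    return f"{gms_base.rstrip('/')}/openapi/timeline/v1/{encoded_urn}?{urlencode(params)}"


if __name__ == "__main__":
    print(
        build_timeline_url(
            "http://localhost:8080",
            "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)",
            ["TAG"],
        )
    )
```

The resulting URL can be fetched with any HTTP client, adding an authorization header if your instance requires one.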

Would you like more detailed instructions on navigating the UI or using the Timeline API?

<@U06TM7M40RK> - how to get history of all business attribute changes on a column using API?

To get the history of all business attribute changes on a column using the API, you can utilize the Timeline API in DataHub. This API allows you to track changes to various aspects of metadata, including business attributes. Here’s how you can achieve this:

  1. Use the Timeline API:

    • The Timeline API supports viewing the version history of entities, including changes to business attributes. You can query the API to get the change history for a specific column.
  2. GraphQL Query:

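    • You can use a GraphQL query to fetch the change history for a specific column, focusing on business attributes. The query below is only an illustrative structure: fields such as businessAttributes and changeLog are not confirmed against the DataHub GraphQL schema, so check them in your deployment’s GraphQL explorer before relying on it:
      query getColumnBusinessAttributeChanges($urn: String!) {
        entity(urn: $urn) {
          ... on Dataset {
            schema {
              fields {
                fieldPath
                businessAttributes {
                  name
                  description
                  changeLog {
                    timestampMillis
                    changeType
                    actor
                  }
                }
              }
            }
          }
        }
      }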
  3. Execute the Query:

    • Pass the URN of the dataset containing the column as the $urn variable. If the fields in the query exist in your schema, the response contains the change log for the business attributes associated with each column, including the timestamp of the change, the type of change, and the actor who made it.
  4. API Endpoint:

    • You can execute this query using the DataHub GraphQL API endpoint. Typically, this would be something like http://<datahub-host>:<port>/api/graphql.
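
As a rough illustration of sending a query to that endpoint from Python, the sketch below only constructs the POST request using the standard library; it does not send it. The helper name `build_graphql_request`, the `http://localhost:8080/api/graphql` URL, and the bearer-token header are assumptions for illustration, and the query is a deliberately minimal placeholder rather than the business attribute query above.

```python
import json
from typing import Optional
from urllib.request import Request

# Minimal placeholder query; substitute the query you have validated against your schema.
PLACEHOLDER_QUERY = "query getEntity($urn: String!) { entity(urn: $urn) { urn type } }"


def build_graphql_request(
    endpoint: str, query: str, variables: dict, token: Optional[str] = None
) -> Request:
    """Build (but do not send) a JSON POST request for a GraphQL endpoint."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumes token-based authentication is enabled on the instance.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return Request(endpoint, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_graphql_request(
        "http://localhost:8080/api/graphql",
        PLACEHOLDER_QUERY,
        {"urn": "urn:li:dataset:(urn:li:dataPlatform:mysql,User.UserAccount,PROD)"},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would execute it; keeping construction separate makes the payload easy to inspect first.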

Would you like more detailed guidance on setting up this GraphQL query or any additional information?

<@U06TM7M40RK> - how to get businessattribute aspects with API? what is the endpoint?

To retrieve business attribute aspects, you can use either DataHub’s REST API or its GraphQL API. Here’s how:

  1. Using the REST API:

    • You can retrieve entity aspects by making a request to the entitiesV2 endpoint. This endpoint allows you to fetch aspects associated with a specific entity URN.
    • Example curl command to retrieve aspects:
      curl --header 'X-RestLi-Protocol-Version: 2.0.0' 'http://localhost:8080/entitiesV2/<url-encoded-entity-urn>'
      
    • Replace <url-encoded-entity-urn> with the URL-encoded URN of the business attribute you are interested in (percent-encode characters such as ":" before placing the URN in the path). This will return all aspects associated with that entity (https://datahubproject.io/docs/0.14.0/metadata-service#retrieving-entity-aspects|source).
  2. Using the GraphQL API:

    • You can use a GraphQL query to fetch specific aspects of a business attribute. Here’s an example query structure:
      query getBusinessAttributeAspects($urn: String!) {
        entity(urn: $urn) {
          ... on BusinessAttribute {
            urn
            type
            aspects {
              aspectName
              payload
            }
          }
        }
      }
      
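    • Pass the URN of the business attribute as the $urn variable. This query is intended to return the aspects associated with the business attribute, including their names and payloads; verify the aspects field against your instance’s GraphQL schema, since available fields can differ between versions.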
  3. API Endpoint:

    • For the REST API, the endpoint is typically http://<datahub-host>:<port>/entitiesV2.
    • For the GraphQL API, the endpoint is typically http://<datahub-host>:<port>/api/graphql.
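
The URL-encoding step for the REST call is easy to get wrong by hand, so here is a small Python sketch that builds the same request as the curl command above without sending it. `build_entities_v2_request` is a hypothetical helper, and the URN `urn:li:businessAttribute:example-id` is a made-up placeholder, not a real identifier.

```python
from urllib.parse import quote
from urllib.request import Request


def build_entities_v2_request(gms_base: str, urn: str) -> Request:
    """Build (but do not send) a GET request for the entitiesV2 endpoint."""
    # safe="" forces ':' and other reserved characters in the URN to be percent-encoded.
    url = f"{gms_base.rstrip('/')}/entitiesV2/{quote(urn, safe='')}"
    return Request(url, headers={"X-RestLi-Protocol-Version": "2.0.0"}, method="GET")


if __name__ == "__main__":
    req = build_entities_v2_request(
        "http://localhost:8080", "urn:li:businessAttribute:example-id"
    )
    print(req.get_method(), req.full_url)
```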

Would you like more detailed guidance on using these APIs or any additional information?
