Using GraphQL to Retrieve S3 Assets with Downstream Asset Count

Original Slack Thread

<@U06TM7M40RK> graphql to get all s3 assets with a downstream asset

To retrieve the downstream assets of an S3 asset using GraphQL in DataHub, you can use the searchAcrossLineage query. This query starts from a single entity, identified by its URN, and searches across that entity's lineage relationships. Here’s a step-by-step guide on how to construct this query:

  1. Define the Query: Use the searchAcrossLineage query with the urn of the S3 asset whose lineage you want to traverse, and set the direction to DOWNSTREAM to find its downstream assets.

  2. Set the Filters: Use orFilters to narrow the results that come back. The example below filters on degree; you can add other fields, such as the platform, if you need to restrict the results further.

  3. Specify the Degree: If you are interested in direct downstream assets, set the degree to 1. If you want to include indirect downstream assets, you can specify higher degrees.

Here is an example GraphQL query:

query searchS3AssetsWithDownstream {
  searchAcrossLineage(
    input: {
      urn: "urn:li:dataset:(urn:li:dataPlatform:s3,YourS3AssetName,PROD)"
      query: "*"
      start: 0
      count: 10
      direction: DOWNSTREAM
      orFilters: [
        {
          and: [
            {
              field: "degree"
              condition: EQUAL
              values: ["1"]
            }
          ]
        }
      ]
    }
  ) {
    searchResults {
      entity {
        urn
        type
        ... on Dataset {
          name
        }
      }
      paths {
        path {
          urn
          type
        }
      }
      degree
    }
  }
}

Explanation:

  • urn: Replace "urn:li:dataset:(urn:li:dataPlatform:s3,YourS3AssetName,PROD)" with the actual URN of your S3 asset. The query accepts one URN at a time.
  • direction: Set to DOWNSTREAM to find downstream assets.
  • degree: A filter value of "1" restricts the results to direct downstream assets.

This query will return the downstream entities of the specified S3 asset, along with the lineage path and degree for each result.
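
As a rough sketch, the same request can be assembled from Python by lifting the literal values into GraphQL variables, which makes it easier to swap in a different asset URN. The helper name build_lineage_payload is hypothetical, and the input type name SearchAcrossLineageInput is an assumption based on the DataHub GraphQL schema. The block only builds the JSON body; it does not send anything.

```python
import json

# Query text mirrors the searchAcrossLineage example, trimmed to a few fields,
# with the input object passed as a GraphQL variable.
# Assumption: the input type is named SearchAcrossLineageInput.
LINEAGE_QUERY = """
query searchS3AssetsWithDownstream($input: SearchAcrossLineageInput!) {
  searchAcrossLineage(input: $input) {
    searchResults {
      entity { urn type }
      degree
    }
  }
}
"""


def build_lineage_payload(urn, start=0, count=10, degree="1"):
    """Return a JSON-serializable request body (hypothetical helper)."""
    variables = {
        "input": {
            "urn": urn,
            "query": "*",
            "start": start,
            "count": count,
            # Enum values are written as plain strings inside JSON variables.
            "direction": "DOWNSTREAM",
            "orFilters": [
                {"and": [{"field": "degree", "condition": "EQUAL", "values": [degree]}]}
            ],
        }
    }
    return {"query": LINEAGE_QUERY, "variables": variables}


payload = build_lineage_payload(
    "urn:li:dataset:(urn:li:dataPlatform:s3,YourS3AssetName,PROD)"
)
body = json.dumps(payload)  # string to send as the HTTP request body
```

The resulting body would be sent as a POST to your DataHub GraphQL endpoint with whatever authentication your deployment requires; those details are left out here because they vary by installation.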

Would you like more detailed information or assistance with a specific part of this query?

<@U06TM7M40RK> I do not want to look at the downstream assets of a specific S3 asset. I want to check how many S3 assets have at least one downstream asset.

It seems likely that you can achieve this by using the aggregateAcrossEntities GraphQL query to count the number of S3 assets that have at least one downstream asset. This query allows you to aggregate data across entities based on specific conditions.

Here’s a step-by-step guide on how to construct this query:

  1. Define the Query: Use the aggregateAcrossEntities query to aggregate S3 assets.

  2. Set the Filters: Use filters to specify that you are interested in S3 assets with downstream relationships.

  3. Specify the Conditions: Use the EXISTS condition on the lineage field so that only entities with at least one downstream relationship are counted.

Here is an example GraphQL query:

query aggregateS3AssetsWithDownstream {
  aggregateAcrossEntities(
    input: {
      types: [DATASET]
      facets: ["_entityType"]
      orFilters: [
        {
          and: [
            {
              field: "platform"
              condition: EQUAL
              values: ["urn:li:dataPlatform:s3"]
            },
            {
              field: "downstream"
              condition: EXISTS
              negated: false
            }
          ]
        }
      ]
    }
  ) {
    facets {
      field
      displayName
      aggregations {
        value
        count
      }
    }
  }
}

Explanation:

  • types: Restricts the aggregation to dataset entities. DATASET is an enum value, so it is written without quotes when used as a literal in the query.
  • facets: Use ["_entityType"] to aggregate by entity type.
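  • orFilters: Restricts the aggregation to datasets on the S3 platform (the platform filter) that have downstream relationships (downstream: EXISTS). This assumes your DataHub instance exposes a filterable downstream field; confirm that before relying on the count.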

If the filters are supported by your instance, the number of S3 assets with at least one downstream asset appears as the count in the aggregations of the _entityType facet.
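
Reading the count out of the response can be sketched in Python as follows. The helper name total_from_facet is hypothetical, and the sample response is hand-written purely to show the shape of the fields selected in the query above; it is not real output and the number in it is a placeholder.

```python
def total_from_facet(response, facet_field="_entityType"):
    """Sum the bucket counts of one facet in an aggregateAcrossEntities response.

    Hypothetical helper. Returns 0 when the facet or the data is missing.
    """
    data = response.get("data") or {}
    result = data.get("aggregateAcrossEntities") or {}
    total = 0
    for facet in result.get("facets") or []:
        if facet.get("field") != facet_field:
            continue
        # Each bucket carries a value and the number of matching entities.
        for bucket in facet.get("aggregations") or []:
            total += bucket.get("count", 0)
    return total


# Hand-written placeholder response, for illustrating the structure only.
sample_response = {
    "data": {
        "aggregateAcrossEntities": {
            "facets": [
                {
                    "field": "_entityType",
                    "displayName": "Type",
                    "aggregations": [{"value": "DATASET", "count": 3}],
                }
            ]
        }
    }
}

s3_assets_with_downstream = total_from_facet(sample_response)
```

Summing every bucket in the facet avoids depending on the exact spelling of the bucket value, which is reasonable here because the query already restricts types to datasets.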

Would you like more detailed information or assistance with a specific part of this query?
