Filtering BigQuery Ingestion for Specific Datasets

Original Slack Thread

<@U06TM7M40RK> how can I filter my Bigquery ingestion so it only includes a specific dataset?

To filter your BigQuery ingestion to include only a specific dataset, you can use the dataset_pattern configuration in your ingestion recipe. This allows you to specify regex patterns to include or exclude datasets during ingestion.

According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/bigquery#config-details|DataHub documentation>, you can set up a dataset filter like this:

dataset_pattern:
  allow:
    - "my_project_id.my_dataset"

This configuration will ensure that only the specified dataset (my_project_id.my_dataset) is included in the ingestion process. You can also use regex patterns to match multiple datasets if needed.

Would you like more details on how to set this up or any other aspect of the ingestion process?

Sources:

<@U06TM7M40RK> if I do this:

  allow:
    - "my_project_id.my_dataset"```
I get both <http://my_project_id.my|my_project_id.my>_dataset and <http://my_project_id.my|my_project_id.my>_dataset2, how can I filter it for only  <http://my_project_id.my|my_project_id.my>_dataset?

To ensure that only the specific dataset my_project_id.my_dataset is included and not any other datasets like my_project_id.my_dataset2, you should use regex anchors to precisely match the dataset name. You can achieve this by using the ^ and $ anchors in your regex pattern to indicate the start and end of the string, respectively.

Here’s how you can modify your dataset_pattern to achieve this:

dataset_pattern:
  allow:
    - "^my_project_id\\.my_dataset$"

This pattern will match exactly my_project_id.my_dataset and exclude any other datasets that start with the same prefix, such as my_project_id.my_dataset2.

Would you like more details on regex patterns or any other aspect of the configuration?

Sources:

  • <Slack Discussion on Regex Anchors> :+1: 0 button :-1: 0 button :light_bulb: Hint: Mention @runllm in the thread for followups.

Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @ RunLLM if you still need further assistance !

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><

<@U06TM7M40RK> and if I want to filter OUT all datasets that start with “dbt_”?

To filter out all datasets that start with “dbt_”, you can use the deny field in the dataset_pattern configuration. This allows you to specify regex patterns for datasets that should be excluded from ingestion.

Here’s how you can set up your configuration to exclude datasets starting with “dbt_”:

dataset_pattern:
  deny:
    - "^dbt_.*"

This pattern uses the ^ anchor to match the start of the dataset name and .* to match any characters following “dbt_”. As a result, any dataset name that begins with “dbt_” will be excluded from the ingestion process.

Would you like more information on configuring patterns or any other aspect of DataHub ingestion?

Sources:

<@U06TM7M40RK> I also want to exclude all datasets that are like “analytics_” except “analytics_322146590”

To exclude all datasets that start with “analytics_” except for a specific one like “analytics_322146590”, you can use both allow and deny patterns in your dataset_pattern configuration. You can first deny all datasets starting with “analytics_” and then explicitly allow the specific dataset you want to include.

Here’s how you can configure this:

dataset_pattern:
  deny:
    - "^analytics_.*"
  allow:
    - "^analytics_322146590$"

In this configuration:

  • The deny pattern ^analytics_.* will exclude all datasets that start with “analytics_”.
  • The allow pattern ^analytics_322146590$ will specifically include the dataset “analytics_322146590”.

This setup ensures that all datasets starting with “analytics_” are excluded except for “analytics_322146590”.

Would you like more details on this configuration or any other aspect of DataHub ingestion?

Sources:

<@U06TM7M40RK> won’t that only allow analytics_322146590 and filter out other datasets like “data”?

You’re correct. The configuration I provided will only allow the dataset “analytics_322146590” and exclude all other datasets, including those that do not start with “analytics_”. If you want to exclude datasets starting with “analytics_” except for “analytics_322146590” while allowing other datasets, you should adjust the configuration to focus on the deny pattern only, as follows:

dataset_pattern:
  deny:
    - "^analytics_(?!322146590$).*"

In this configuration:

  • The deny pattern ^analytics_(?!322146590$).* uses a negative lookahead (?!322146590$) to exclude all datasets starting with “analytics_” except for “analytics_322146590”.

This setup will exclude datasets like “analytics_123” but will allow “analytics_322146590” and any other datasets that do not start with “analytics_”.

Would you like more information on regex patterns or any other aspect of the configuration?

Sources:

<@U06TM7M40RK> I am setting this in my ingestion recipe:
dataset_pattern:
deny:
- ‘^dbt_.
- lodgify_backend
- lodgify_frontend
- '^analytics_(?!322146590$).

and all “dbt_” datasets are being included, and also the analytics_ datasets

It seems likely that the issue might be related to how the regex patterns are being interpreted or applied in your ingestion configuration. Let’s break down the potential issues and solutions:

  1. Regex Pattern for “dbt_” Datasets:

    • The pattern '^dbt_.*' should correctly exclude all datasets starting with “dbt_”. If these datasets are still being included, it might be worth checking if there are any other configurations or overrides in your ingestion setup that might affect this pattern.
  2. Regex Pattern for “analytics_” Datasets:

    • The pattern '^analytics_(?!322146590$).*' is intended to exclude all datasets starting with “analytics_” except for “analytics_322146590”. If this is not working as expected, ensure that the regex engine being used supports negative lookaheads, as some environments might have limitations.
  3. Specific Dataset Names:

    • The patterns lodgify_backend and lodgify_frontend should exclude datasets with these exact names. If they are still being included, verify that the dataset names match exactly and that there are no additional spaces or characters.
  4. General Debugging Steps:

    • Double-check the indentation and formatting of your YAML file to ensure that the patterns are correctly nested under dataset_pattern.
    • Ensure that there are no other conflicting configurations in your ingestion recipe that might override these patterns.
    • Consider testing the regex patterns separately in a regex testing tool to confirm they behave as expected.

If these suggestions do not resolve the issue, it might be helpful to review the ingestion logs for any warnings or errors related to pattern matching.

Would you like more detailed assistance on any specific part of this configuration?

Sources: