<@U06TM7M40RK> how can I filter my Bigquery ingestion so it only includes a specific dataset?
To filter your BigQuery ingestion to include only a specific dataset, you can use the dataset_pattern
configuration in your ingestion recipe. This allows you to specify regex patterns to include or exclude datasets during ingestion.
According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/bigquery#config-details|DataHub documentation>, you can set up a dataset filter like this:
dataset_pattern:
allow:
- "my_project_id.my_dataset"
This configuration will ensure that only the specified dataset (my_project_id.my_dataset
) is included in the ingestion process. You can also use regex patterns to match multiple datasets if needed.
Would you like more details on how to set this up or any other aspect of the ingestion process?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/bigquery#config-details|DataHub BigQuery Ingestion Configuration>
- <Slack Discussion on Dataset Filtering>
0 button
0 button
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK> if I do this:
allow:
- "my_project_id.my_dataset"```
I get both <http://my_project_id.my|my_project_id.my>_dataset and <http://my_project_id.my|my_project_id.my>_dataset2, how can I filter it for only <http://my_project_id.my|my_project_id.my>_dataset?
To ensure that only the specific dataset my_project_id.my_dataset
is included and not any other datasets like my_project_id.my_dataset2
, you should use regex anchors to precisely match the dataset name. You can achieve this by using the ^
and $
anchors in your regex pattern to indicate the start and end of the string, respectively.
Here’s how you can modify your dataset_pattern
to achieve this:
dataset_pattern:
allow:
- "^my_project_id\\.my_dataset$"
This pattern will match exactly my_project_id.my_dataset
and exclude any other datasets that start with the same prefix, such as my_project_id.my_dataset2
.
Would you like more details on regex patterns or any other aspect of the configuration?
Sources:
- <Slack Discussion on Regex Anchors>
0 button
0 button
Hint: Mention @runllm in the thread for followups.
Hi ! Did this interaction resolve your issue or do you still require assistance ?
Please tag @ RunLLM if you still need further assistance !
Or tag @Datahub Community Support if you require a person to take a look
Hope this helps ! ><
<@U06TM7M40RK> and if I want to filter OUT all datasets that start with “dbt_”?
To filter out all datasets that start with “dbt_”, you can use the deny
field in the dataset_pattern
configuration. This allows you to specify regex patterns for datasets that should be excluded from ingestion.
Here’s how you can set up your configuration to exclude datasets starting with “dbt_”:
dataset_pattern:
deny:
- "^dbt_.*"
This pattern uses the ^
anchor to match the start of the dataset name and .*
to match any characters following “dbt_”. As a result, any dataset name that begins with “dbt_” will be excluded from the ingestion process.
Would you like more information on configuring patterns or any other aspect of DataHub ingestion?
Sources:
- https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/mongodb#config-details|https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/mongodb#config-details
- https://datahubspace.slack.com/archives/CUMUWQU66/p1673885771.028139|https://datahubspace.slack.com/archives/CUMUWQU66/p1673885771.028139
0 button
0 button
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK> I also want to exclude all datasets that are like “analytics_” except “analytics_322146590”
To exclude all datasets that start with “analytics_” except for a specific one like “analytics_322146590”, you can use both allow
and deny
patterns in your dataset_pattern
configuration. You can first deny all datasets starting with “analytics_” and then explicitly allow the specific dataset you want to include.
Here’s how you can configure this:
dataset_pattern:
deny:
- "^analytics_.*"
allow:
- "^analytics_322146590$"
In this configuration:
- The
deny
pattern^analytics_.*
will exclude all datasets that start with “analytics_”. - The
allow
pattern^analytics_322146590$
will specifically include the dataset “analytics_322146590”.
This setup ensures that all datasets starting with “analytics_” are excluded except for “analytics_322146590”.
Would you like more details on this configuration or any other aspect of DataHub ingestion?
Sources:
- https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/redshift#config-details|https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/redshift#config-details
- https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/mongodb#config-details|https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/mongodb#config-details
0 button
0 button
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK> won’t that only allow analytics_322146590 and filter out other datasets like “data”?
You’re correct. The configuration I provided will only allow the dataset “analytics_322146590” and exclude all other datasets, including those that do not start with “analytics_”. If you want to exclude datasets starting with “analytics_” except for “analytics_322146590” while allowing other datasets, you should adjust the configuration to focus on the deny
pattern only, as follows:
dataset_pattern:
deny:
- "^analytics_(?!322146590$).*"
In this configuration:
- The
deny
pattern^analytics_(?!322146590$).*
uses a negative lookahead(?!322146590$)
to exclude all datasets starting with “analytics_” except for “analytics_322146590”.
This setup will exclude datasets like “analytics_123” but will allow “analytics_322146590” and any other datasets that do not start with “analytics_”.
Would you like more information on regex patterns or any other aspect of the configuration?
Sources:
- https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/redshift#config-details|https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/redshift#config-details
0 button
0 button
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK> I am setting this in my ingestion recipe:
dataset_pattern:
deny:
- ‘^dbt_.’
- lodgify_backend
- lodgify_frontend
- '^analytics_(?!322146590$).’
and all “dbt_” datasets are being included, and also the analytics_ datasets
It seems likely that the issue might be related to how the regex patterns are being interpreted or applied in your ingestion configuration. Let’s break down the potential issues and solutions:
-
Regex Pattern for “dbt_” Datasets:
- The pattern
'^dbt_.*'
should correctly exclude all datasets starting with “dbt_”. If these datasets are still being included, it might be worth checking if there are any other configurations or overrides in your ingestion setup that might affect this pattern.
- The pattern
-
Regex Pattern for “analytics_” Datasets:
- The pattern
'^analytics_(?!322146590$).*'
is intended to exclude all datasets starting with “analytics_” except for “analytics_322146590”. If this is not working as expected, ensure that the regex engine being used supports negative lookaheads, as some environments might have limitations.
- The pattern
-
Specific Dataset Names:
- The patterns
lodgify_backend
andlodgify_frontend
should exclude datasets with these exact names. If they are still being included, verify that the dataset names match exactly and that there are no additional spaces or characters.
- The patterns
-
General Debugging Steps:
- Double-check the indentation and formatting of your YAML file to ensure that the patterns are correctly nested under
dataset_pattern
. - Ensure that there are no other conflicting configurations in your ingestion recipe that might override these patterns.
- Consider testing the regex patterns separately in a regex testing tool to confirm they behave as expected.
- Double-check the indentation and formatting of your YAML file to ensure that the patterns are correctly nested under
If these suggestions do not resolve the issue, it might be helpful to review the ingestion logs for any warnings or errors related to pattern matching.
Would you like more detailed assistance on any specific part of this configuration?
Sources:
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/bigquery#config-details|DataHub BigQuery Ingestion Configuration>
0 button
0 button
Hint: Mention @runllm in the thread for followups.