Ingesting Multiple Tables with the Same Name from Different Datasets in UI BigQuery

Original Slack Thread

Hello everyone,
During UI BigQuery ingestion we’ve noticed that if multiple tables have the same name but live in different datasets, only the table from the last dataset is ingested.
We have configured our ingestion recipe to allow specific datasets and we are running a single ingestion for the whole project.
Is there a configuration option that we are missing? The ingestion recipe is in the thread.

Should we consider having separate ingestion recipes? We have a single ingestion for Hive and it ingests the data to separate databases.

We are using DataHub v0.12.1

Ingestion recipe:

    config:
        credential:
            client_email: 'client_email'
            client_id: 'client_id'
            private_key: '${bq_private_key}'
            private_key_id: '${bq_private_key_id}'
            project_id: 'project_id'
        dataset_pattern:
            allow:
                - ^dataset_1$
                - ^dataset_2$
        include_table_lineage: true
        include_tables: true
        include_usage_statistics: true
        include_views: true
        column_limit: 1000
        profiling:
            enabled: false
        stateful_ingestion:
            enabled: true
        table_pattern:
            deny:
                - '.*table_1.*'
                - '.*_table_2$'
    type: bigquery

You should see tables organized in this pattern in their URNs: project_id.dataset_name.table_name

Take a look at the URN of the dataset you’re seeing by hitting the share button on that page and choosing “Copy URN”. Is there a dataset name in it?

You might also try using the CLI to get the URN you’re not seeing.
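
To make the URN check concrete, here is a minimal Python sketch of what the two dataset URNs would look like for tables that share a name. The helper `bigquery_dataset_urn` is hypothetical (it is not part of DataHub), the PROD environment is an assumption, and the project, dataset, and table names are the placeholders used in this thread.

```python
def bigquery_dataset_urn(project_id: str, dataset: str, table: str, env: str = "PROD") -> str:
    """Hypothetical helper: format a DataHub dataset URN for a BigQuery table."""
    name = f"{project_id}.{dataset}.{table}"
    return f"urn:li:dataset:(urn:li:dataPlatform:bigquery,{name},{env})"


urn_1 = bigquery_dataset_urn("project_id", "dataset_1", "test_table")
urn_2 = bigquery_dataset_urn("project_id", "dataset_2", "test_table")

# Same table name, different dataset: the URNs differ, so one table
# should not overwrite the other in DataHub.
print(urn_1)
print(urn_2)
```

Either URN can then be passed to `datahub get --urn "<urn>"` to check whether the entity actually exists.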

I only see one table from the last ingested dataset. For example if dataset_1 and dataset_2 contain the table test_table, I only see this:
project_id.dataset_2.test_table

I was able to debug this. The issue was that the missing table has type CLONE, and in the BigQuery ingestion's queries.py the query that fetches table metadata only considers the BASE TABLE and EXTERNAL types.
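
A minimal sketch of the effect, in plain Python rather than the actual SQL in queries.py. The rows and the name `ALLOWED_TABLE_TYPES` are made up for illustration, and the filter mirrors only the condition described above (BASE TABLE and EXTERNAL).

```python
# Illustrative rows, shaped like (dataset, table_name, table_type).
rows = [
    ("dataset_1", "test_table", "CLONE"),
    ("dataset_2", "test_table", "BASE TABLE"),
]

# Assumed to mirror the table types the metadata query accepts.
ALLOWED_TABLE_TYPES = {"BASE TABLE", "EXTERNAL"}

# Keep only rows whose table type passes the filter.
ingested = [
    f"{dataset}.{name}"
    for dataset, name, table_type in rows
    if table_type in ALLOWED_TABLE_TYPES
]

print(ingested)  # ['dataset_2.test_table']
```

The cloned table in dataset_1 never reaches the ingestion step, which matches the symptom of seeing only project_id.dataset_2.test_table.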

Does it make sense to add a configurable filter for table types? I’ve noticed that the Snowflake queries have the same condition. If it does, I can create a feature request.

Do you mean that on Snowflake, cloned tables are not filtered out?
I think it makes sense not to filter out cloned tables, especially if they are not filtered out on Snowflake.

Sorry for the confusion. I was talking about BigQuery ingestion, but noticed the same principle in Snowflake ingestion.
Basically, cloned BigQuery tables are filtered out by this line: https://github.com/datahub-project/datahub/blob/e4bc915c78b83f8f85a27d74696719db7f9e2e9b/metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries.py#L77. Would it make sense to let users include table types other than BASE TABLE and EXTERNAL?
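
One way such an option could look, sketched in Python. Everything here is hypothetical: `build_table_type_filter`, `DEFAULT_TABLE_TYPES`, and the `table_types` parameter are not existing DataHub functions or config keys. They only illustrate making the allowed table types user-configurable while keeping the current behavior as the default.

```python
from typing import Iterable

# Assumed default, matching the two types mentioned in this thread.
DEFAULT_TABLE_TYPES = ("BASE TABLE", "EXTERNAL")


def build_table_type_filter(table_types: Iterable[str] = DEFAULT_TABLE_TYPES) -> str:
    """Hypothetical: render a SQL condition restricting table_type to the given values."""
    types = list(table_types)
    if not types:
        raise ValueError("at least one table type is required")
    # Escape single quotes so each value stays a valid SQL string literal.
    quoted = ", ".join("'" + t.replace("'", "''") + "'" for t in types)
    return f"table_type IN ({quoted})"


default_filter = build_table_type_filter()
with_clones = build_table_type_filter(["BASE TABLE", "EXTERNAL", "CLONE"])

print(default_filter)
print(with_clones)
```

With a setting like this, a recipe that needs cloned tables could opt in to CLONE without changing what existing recipes ingest.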