One more. DataHub 0.13.3. UI. Metabase ingestion.
The Metabase <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|documentation> shows the “Platform Instance” capability as enabled by default. Looking at the <https://github.com/acryldata/datahub/blob/v0.13.3/metadata-ingestion/src/datahub/ingestion/source/metabase.py|code>, there is no option to use the `platform_instance` key in the recipe, as other sources allow. Instead, the platform instance is detected by checking the `database_id_to_instance_map` and `platform_instance_map` mappings. The problem is that two Metabase instances can share the same database ID and therefore map to the same instance. In our case, we have two Metabase instances pointing to the same Snowflake instance, and some of the databases share an ID, since Metabase assigns IDs automatically. Is there any solution for this use case?
To address your issue with Metabase ingestion in DataHub 0.13.3, where two Metabase instances point to the same Snowflake instance and share database IDs: you are correct that the `platform_instance` key is not directly available in the Metabase ingestion source configuration. Instead, the `database_id_to_instance_map` and `platform_instance_map` mappings are used to detect platform instances.
Potential Solutions
- Unique Mapping Configuration: According to the <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|Metabase ingestion documentation>, you can use the `database_id_to_instance_map` and `platform_instance_map` to map database IDs to platform instances. However, since the database IDs are shared, you might need to ensure that each Metabase instance has a unique identifier in your mapping configuration.
- Custom Configuration: If the default mappings are insufficient due to shared database IDs, you might consider creating a custom mapping configuration. This could involve modifying the ingestion source code to handle multiple Metabase instances more effectively.
- Separate Ingestion Runs: Another approach is to run separate ingestion processes for each Metabase instance, ensuring that the configurations do not overlap. This way, you can avoid conflicts caused by shared database IDs.
Example Configuration
Here is an example of how you might configure the mappings in your ingestion recipe:

    source:
      type: metabase
      config:
        connect_uri: "http://your-metabase-instance:3000"
        username: "your-username"
        password: "your-password"
        database_id_to_instance_map:
          "1": "instance1"
          "2": "instance2"
        platform_instance_map:
          snowflake: "snowflake_instance"

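The "separate ingestion runs" idea can be pictured as one recipe per Metabase instance, each with its own `database_id_to_instance_map`. The following is only a sketch: `build_recipe`, the host names `metabase-a` and `metabase-b`, and the instance names are hypothetical placeholders, and the dictionaries simply mirror the shape of the YAML recipe above.

```python
from typing import Dict


def build_recipe(
    connect_uri: str,
    id_map: Dict[str, str],
    platform_map: Dict[str, str],
) -> dict:
    """Return a recipe dict shaped like the YAML example (hypothetical helper)."""
    return {
        "source": {
            "type": "metabase",
            "config": {
                "connect_uri": connect_uri,
                "username": "your-username",
                "password": "your-password",
                # Keys are Metabase database IDs as strings.
                "database_id_to_instance_map": dict(id_map),
                "platform_instance_map": dict(platform_map),
            },
        }
    }


# Database ID "1" exists in both Metabase instances, but each recipe
# carries its own mapping, so the two entries never share a dictionary.
recipes = {
    "metabase-a": build_recipe(
        "http://metabase-a:3000", {"1": "instance1"}, {"snowflake": "snowflake_instance"}
    ),
    "metabase-b": build_recipe(
        "http://metabase-b:3000", {"1": "instance2"}, {"snowflake": "snowflake_instance"}
    ),
}
```

Whether per-recipe maps are enough depends on the naming problem being solved; the sketch only shows that the same database ID can be mapped independently in each run.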
Further Assistance
If these solutions do not fully address your use case, you may need to open a GitHub issue or a Slack discussion to seek further assistance from the DataHub community. You can refer to the <Issues · datahub-project/datahub · GitHub issue> and <Slack discussion> for similar topics.
Would you like more detailed steps on any of these potential solutions or further assistance on a specific part of the process?
Sources:
- <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|Metabase Ingestion Documentation>
- <https://github.com/datahub-project/datahub/pull/8359|GitHub Pull Request on Metabase Ingestion>
- <Issues · datahub-project/datahub · GitHub issue>
- <Slack discussion>
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK>, the ingestion runs are already separated, but this doesn’t seem to solve the `platform_instance` naming issue, judging by the <https://github.com/acryldata/datahub/blob/v0.13.3/metadata-ingestion/src/datahub/ingestion/source/metabase.py#L664|get_platform_instance> logic.
Thank you for providing more context. Given that the ingestion runs are separated but the `platform_instance` naming issue persists due to shared database IDs, let’s delve into the `get_platform_instance` logic and explore potential solutions.
Understanding the `get_platform_instance` Logic
The `get_platform_instance` method in the Metabase ingestion source attempts to detect the `platform_instance` by checking the `database_id_to_instance_map` and `platform_instance_map` mappings. Here’s a snippet of the relevant code:

    @lru_cache(maxsize=None)
    def get_platform_instance(
        self, platform: Optional[str] = None, datasource_id: Optional[int] = None
    ) -> Optional[str]:
        platform_instance = None
        if datasource_id is not None and self.config.database_id_to_instance_map:
            platform_instance = self.config.database_id_to_instance_map.get(str(datasource_id))
        if platform and self.config.platform_instance_map and platform_instance is None:
            platform_instance = self.config.platform_instance_map.get(platform)
        return platform_instance

Issue with Shared Database IDs
The problem arises because the same `datasource_id` (database ID) is used in both Metabase instances, leading to potential conflicts in the `database_id_to_instance_map`.
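To make the lookup order concrete, here is a minimal standalone sketch of the logic in the snippet quoted above. The function `resolve_platform_instance` and the instance names `snowflake_prod` and `snowflake_default` are hypothetical; the two maps stand in for the recipe options of the same name.

```python
from typing import Mapping, Optional


def resolve_platform_instance(
    database_id_to_instance_map: Optional[Mapping[str, str]],
    platform_instance_map: Optional[Mapping[str, str]],
    platform: Optional[str] = None,
    datasource_id: Optional[int] = None,
) -> Optional[str]:
    """Standalone restatement of the quoted lookup order (hypothetical helper)."""
    platform_instance = None
    # 1. The per-database mapping is checked first; keys are IDs as strings.
    if datasource_id is not None and database_id_to_instance_map:
        platform_instance = database_id_to_instance_map.get(str(datasource_id))
    # 2. Otherwise fall back to the per-platform mapping.
    if platform and platform_instance_map and platform_instance is None:
        platform_instance = platform_instance_map.get(platform)
    return platform_instance


# Two recipes, one per Metabase instance, both containing database ID 6.
recipe_a_ids = {"6": "snowflake_prod"}
recipe_b_ids = {"6": "snowflake_prod"}
fallback = {"snowflake": "snowflake_default"}

a = resolve_platform_instance(recipe_a_ids, fallback, "snowflake", 6)
b = resolve_platform_instance(recipe_b_ids, fallback, "snowflake", 6)
c = resolve_platform_instance(recipe_a_ids, fallback, "snowflake", 99)
print(a, b, c)
```

Nothing in the inputs identifies which Metabase instance a database belongs to: the result depends only on the database ID, the platform name, and the maps in the recipe being run.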
Potential Solutions
- Enhanced Mapping Configuration: Modify the mapping configuration to include Metabase instance identifiers. This way, you can distinguish between the same `datasource_id` from different Metabase instances.
- Custom Logic for Platform Instances: Implement custom logic in the `get_platform_instance` method to handle multiple Metabase instances more effectively.
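Enhanced Mapping Configuration
You can enhance the `database_id_to_instance_map` to include Metabase instance identifiers, using one recipe per Metabase instance. For example:

    # Recipe for the first Metabase instance
    source:
      type: metabase
      config:
        connect_uri: "http://your-metabase-instance-1:3000"
        username: "your-username"
        password: "your-password"
        database_id_to_instance_map:
          "metabase1_1": "instance1"
          "metabase1_2": "instance2"
        platform_instance_map:
          snowflake: "snowflake_instance_1"

    # Recipe for the second Metabase instance
    source:
      type: metabase
      config:
        connect_uri: "http://your-metabase-instance-2:3000"
        username: "your-username"
        password: "your-password"
        database_id_to_instance_map:
          "metabase2_1": "instance3"
          "metabase2_2": "instance4"
        platform_instance_map:
          snowflake: "snowflake_instance_2"

Note that the `get_platform_instance` snippet quoted above looks keys up with `str(datasource_id)`, so prefixed keys such as "metabase1_1" are only found if the lookup itself is changed, as in the custom logic that follows.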
Custom Logic for Platform Instances
If the enhanced mapping configuration is not feasible, you might need to customize the `get_platform_instance` method. Here’s an example of how you could modify the method:

    @lru_cache(maxsize=None)
    def get_platform_instance(
        self,
        platform: Optional[str] = None,
        datasource_id: Optional[int] = None,
        metabase_instance: Optional[str] = None,  # new parameter; callers must pass it
    ) -> Optional[str]:
        platform_instance = None
        if datasource_id is not None and self.config.database_id_to_instance_map:
            # Composite key "<metabase_instance>_<datasource_id>", e.g. "metabase1_1"
            key = f"{metabase_instance}_{datasource_id}"
            platform_instance = self.config.database_id_to_instance_map.get(key)
        # Fall back to the per-platform mapping when no per-database entry matched
        if platform and self.config.platform_instance_map and platform_instance is None:
            platform_instance = self.config.platform_instance_map.get(platform)
        return platform_instance

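The composite-key lookup can be tried outside DataHub with a small standalone sketch. `resolve_with_prefix` is a hypothetical helper, and its fallback to the plain database ID is an added assumption that is not part of the modified method above; it keeps un-prefixed entries working and avoids building a key like "None_1" when no Metabase identifier is given.

```python
from typing import Mapping, Optional


def resolve_with_prefix(
    id_map: Mapping[str, str],
    metabase_instance: Optional[str],
    datasource_id: int,
) -> Optional[str]:
    """Try '<metabase_instance>_<id>' first, then the plain ID (hypothetical)."""
    if metabase_instance is not None:
        hit = id_map.get(f"{metabase_instance}_{datasource_id}")
        if hit is not None:
            return hit
    # Assumed fallback: plain database ID, as in the unmodified lookup.
    return id_map.get(str(datasource_id))


id_map = {"metabase1_1": "instance1", "metabase2_1": "instance3", "2": "shared"}

first = resolve_with_prefix(id_map, "metabase1", 1)   # prefixed key for instance 1
second = resolve_with_prefix(id_map, "metabase2", 1)  # same ID, other instance
plain = resolve_with_prefix(id_map, None, 1)          # no prefix: "1" is not a key
legacy = resolve_with_prefix(id_map, "metabase1", 2)  # falls back to plain "2"
```

The `plain` case shows why prefixed keys alone have no effect with an unmodified lookup: a search by bare database ID never matches "metabase1_1".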
Conclusion
These solutions should help you address the platform instance naming issue when dealing with multiple Metabase instances pointing to the same Snowflake instance. Would you like more detailed steps on implementing any of these solutions, or do you have any other specific questions?
Sources:
- <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|Metabase Ingestion Documentation>
- <datahub/metadata-ingestion/src/datahub/ingestion/source/metabase.py at v0.13.3 · acryldata/datahub · GitHub Source Code>
- <https://github.com/datahub-project/datahub/pull/8359|GitHub Pull Request on Metabase Ingestion>
- <Issues · datahub-project/datahub · GitHub issue>
- <Slack discussion>
<@U06TM7M40RK>, why am I getting:
```
"metabase-dbname-12": [
  "Cannot determine database name for platform: snowflake"
],
"metabase-dbname-6": [
  "Cannot determine database name for platform: snowflake"
],
```
Despite the fact that I specified them in `database_alias_map`:
```
database_alias_map:
  '6': 'NAME_1'
  '12': 'NAME_2'
```
The warning message "Cannot determine database name for platform: snowflake" indicates that the ingestion process is unable to map the database IDs to their respective names for the Snowflake platform, despite specifying them in the `database_alias_map`.
Understanding the Issue
The `database_alias_map` is used to map database IDs to user-friendly names. However, this feature appears to be deprecated in favor of `platform_instance_map` and `database_id_to_instance_map`. The warning could be due to several reasons:
- Deprecation of `database_alias_map`: As per the <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|Metabase ingestion documentation>, the `database_alias_map` might be deprecated or not functioning as expected in the newer versions.
- Incorrect Configuration: There might be an issue with the configuration format or the way the IDs are specified.
Recommended Solution
To resolve this issue, you should use the `database_id_to_instance_map` and `platform_instance_map` instead of `database_alias_map`. Here’s how you can configure it:
- Update Configuration: Use `database_id_to_instance_map` to map database IDs to platform instances.
Example Configuration
Here is an example configuration using `database_id_to_instance_map`:

    source:
      type: metabase
      config:
        connect_uri: "http://your-metabase-instance:3000"
        username: "your-username"
        password: "your-password"
        database_id_to_instance_map:
          "6": "snowflake_instance_1"
          "12": "snowflake_instance_2"
        platform_instance_map:
          snowflake: "snowflake_instance"

Detailed Steps
- Update the Configuration File:
  - Replace `database_alias_map` with `database_id_to_instance_map`.
  - Ensure that the IDs and instance names are correctly specified.
- Run the Ingestion Process:
  - Execute the ingestion process with the updated configuration.
  - Check the logs to ensure that the database names are correctly mapped.
Further Assistance
If the issue persists, you might need to check the Metabase source code or open a GitHub issue for further assistance from the DataHub community. You can refer to the <Issues · datahub-project/datahub · GitHub issue> and <Slack discussion> for similar topics.
Would you like more detailed steps on implementing this solution, or do you have any other specific questions?
Sources:
- <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|Metabase Ingestion Documentation>
- <datahub/metadata-ingestion/src/datahub/ingestion/source/metabase.py at v0.13.3 · acryldata/datahub · GitHub Source Code>
- <Issues · datahub-project/datahub · GitHub issue>
- <Slack discussion>