Discussion on Handling Multiple Metabase Instances with Shared Database IDs in DataHub 0.13.3

Original Slack Thread

One more. DataHub 0.13.3. UI. Metabase ingestion.
The Metabase <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|documentation> shows the “Platform Instance” capability as enabled by default. Looking at the <https://github.com/acryldata/datahub/blob/v0.13.3/metadata-ingestion/src/datahub/ingestion/source/metabase.py|code>, there is no option to set the platform_instance key in the recipe, as other sources allow; instead, the platform instance is detected by checking the database_id_to_instance_map and platform_instance_map mappings. The problem is that two Metabase instances can share the same database id and therefore map to the same instance. In our case, we have two Metabase instances pointing to the same Snowflake instance, and some of the databases share an id, since ids are assigned automatically by Metabase. Is there any solution for this use case?

<@U06TM7M40RK> Can you help with the questions in this thread?



To address your issue with Metabase ingestion in DataHub 0.13.3, where two Metabase instances point to the same Snowflake instance and share database IDs, you are correct that the platform_instance key is not directly available in the Metabase ingestion source configuration. Instead, the database_id_to_instance_map and platform_instance_map mappings are used to detect platform instances.

Potential Solutions

  1. Unique Mapping Configuration:
    According to the <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|Metabase ingestion documentation>, you can use the database_id_to_instance_map and platform_instance_map to map database IDs to platform instances. However, since the database IDs are shared, you might need to ensure that each Metabase instance has a unique identifier in your mapping configuration.

  2. Custom Configuration:
    If the default mappings are insufficient due to shared database IDs, you might consider creating a custom mapping configuration. This could involve modifying the ingestion source code to handle multiple Metabase instances more effectively.

  3. Separate Ingestion Runs:
    Another approach is to run a separate ingestion process for each Metabase instance so that their configurations do not overlap. Because each recipe carries its own mappings, the same database id can be pointed at a different instance in each run (see the sketch after this list).
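
As a rough illustration of the third option, here is a minimal sketch assuming the standard Python Pipeline API; the connect URIs, credentials, database id "1", sink address, and instance names are all placeholders, not values from your deployment. Each Metabase instance gets its own recipe, so the shared Metabase-assigned database id can be mapped to a different platform instance per run:

```python
from datahub.ingestion.run.pipeline import Pipeline


def run_metabase_ingestion(connect_uri: str, instance_name: str) -> None:
    # Each call builds and runs an independent recipe, so the maps below are
    # scoped to a single Metabase instance.
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "metabase",
                "config": {
                    "connect_uri": connect_uri,
                    "username": "your-username",
                    "password": "your-password",
                    # The same Metabase-assigned database id ("1" is a placeholder)
                    # maps to a different platform instance in each run.
                    "database_id_to_instance_map": {"1": instance_name},
                    "platform_instance_map": {"snowflake": instance_name},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()


# One run per Metabase instance, each with its own mapping.
run_metabase_ingestion("http://metabase-one:3000", "metabase_one_snowflake")
run_metabase_ingestion("http://metabase-two:3000", "metabase_two_snowflake")
```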

Example Configuration

Here is an example of how you might configure the mappings in your ingestion recipe:

source:
  type: metabase
  config:
    connect_uri: "http://your-metabase-instance:3000"
    username: "your-username"
    password: "your-password"
    database_id_to_instance_map:
      "1": "instance1"
      "2": "instance2"
    platform_instance_map:
      snowflake: "snowflake_instance"
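
For context, the instance name you map to is reflected in the dataset URNs that DataHub builds for the upstream tables. A minimal sketch, assuming the make_dataset_urn_with_platform_instance helper from the Python SDK; the database and table names are placeholders:

```python
from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

# "instance1" and the table path are placeholders matching the example recipe above.
urn = make_dataset_urn_with_platform_instance(
    platform="snowflake",
    name="mydb.myschema.mytable",
    platform_instance="instance1",
    env="PROD",
)
print(urn)
# roughly: urn:li:dataset:(urn:li:dataPlatform:snowflake,instance1.mydb.myschema.mytable,PROD)
```

Because the instance name is embedded in the URN, mapping the shared database id to different instance names in different recipes keeps the resulting datasets distinct.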

Further Assistance

If these solutions do not fully address your use case, you may need to open a GitHub issue or start a Slack discussion to get further help from the DataHub community on this topic.

Would you like more detailed steps on any of these potential solutions or further assistance on a specific part of the process?


<@U06TM7M40RK>, the ingestion runs are separated, but this doesn’t seem to solve the platform_instance naming issue, looking at the <https://github.com/acryldata/datahub/blob/v0.13.3/metadata-ingestion/src/datahub/ingestion/source/metabase.py#L664|get_platform_instance> logic.

Thank you for providing more context. Given that the ingestion runs are separated but the platform_instance naming issue persists due to shared database IDs, let’s delve into the get_platform_instance logic and explore potential solutions.

Understanding the get_platform_instance Logic

The get_platform_instance method in the Metabase ingestion source attempts to detect the platform_instance by checking the database_id_to_instance_map and platform_instance_map mappings. Here’s a snippet of the relevant code:

@lru_cache(maxsize=None)
def get_platform_instance(
    self, platform: Optional[str] = None, datasource_id: Optional[int] = None
) -> Optional[str]:
    platform_instance = None

    # If a database id is given, the database-specific mapping takes precedence.
    if datasource_id is not None and self.config.database_id_to_instance_map:
        platform_instance = self.config.database_id_to_instance_map.get(
            str(datasource_id)
        )

    # Otherwise fall back to the per-platform mapping.
    if platform and self.config.platform_instance_map and platform_instance is None:
        platform_instance = self.config.platform_instance_map.get(platform)

    return platform_instance

Issue with Shared Database IDs

The problem arises because the same datasource_id (database ID) can be assigned in both Metabase instances, while database_id_to_instance_map is keyed by that numeric id alone, so a single map cannot point the shared id at two different platform instances.
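
To make the conflict concrete, here is a small standalone sketch of the same lookup order; it is not the actual source class, and the ids and instance names are hypothetical. Within one recipe the map is keyed only by the numeric id, so one id can only ever resolve to one instance name:

```python
from typing import Dict, Optional


def resolve_instance(
    datasource_id: Optional[int],
    platform: Optional[str],
    database_id_to_instance_map: Dict[str, str],
    platform_instance_map: Dict[str, str],
) -> Optional[str]:
    # Same precedence as get_platform_instance above: the id-specific map wins,
    # and the per-platform map is only a fallback.
    instance = None
    if datasource_id is not None and database_id_to_instance_map:
        instance = database_id_to_instance_map.get(str(datasource_id))
    if platform and platform_instance_map and instance is None:
        instance = platform_instance_map.get(platform)
    return instance


# Both Metabase servers happen to assign id 6 to their Snowflake connection,
# but one shared map can only point "6" at a single instance name.
shared_map = {"6": "snowflake_instance_1"}
print(resolve_instance(6, "snowflake", shared_map, {}))  # snowflake_instance_1, for either server
```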

Potential Solutions

  1. Enhanced Mapping Configuration:
    Modify the mapping configuration to include Metabase instance identifiers, so that the same datasource_id coming from different Metabase instances can be told apart.

  2. Custom Logic for Platform Instances:
    Implement custom logic in the get_platform_instance method to handle multiple Metabase instances more effectively.

Enhanced Mapping Configuration

You can enhance the database_id_to_instance_map keys to include a Metabase instance identifier. Note that the stock get_platform_instance looks the map up by str(datasource_id) alone, so prefixed keys like the ones below only take effect together with the custom lookup shown in the next section. For example:

# Recipe for the first Metabase instance
source:
  type: metabase
  config:
    connect_uri: "http://your-metabase-instance-1:3000"
    username: "your-username"
    password: "your-password"
    database_id_to_instance_map:
      "metabase1_1": "instance1"
      "metabase1_2": "instance2"
    platform_instance_map:
      snowflake: "snowflake_instance_1"

# Recipe for the second Metabase instance (a separate recipe file)
source:
  type: metabase
  config:
    connect_uri: "http://your-metabase-instance-2:3000"
    username: "your-username"
    password: "your-password"
    database_id_to_instance_map:
      "metabase2_1": "instance3"
      "metabase2_2": "instance4"
    platform_instance_map:
      snowflake: "snowflake_instance_2"

Custom Logic for Platform Instances

To make the enhanced mapping keys resolve, the get_platform_instance method also needs to be customized to build the composite key. Here’s an example of how you could modify the method:

@lru_cache(maxsize=None)
def get_platform_instance(
    self,
    platform: Optional[str] = None,
    datasource_id: Optional[int] = None,
    metabase_instance: Optional[str] = None,
) -> Optional[str]:
    platform_instance = None

    # Build a composite key such as "metabase1_6" so the same database id
    # from different Metabase instances maps to different platform instances.
    if datasource_id is not None and self.config.database_id_to_instance_map:
        key = f"{metabase_instance}_{datasource_id}"
        platform_instance = self.config.database_id_to_instance_map.get(key)

    # Fall back to the per-platform mapping, as in the original method.
    if platform and self.config.platform_instance_map and platform_instance is None:
        platform_instance = self.config.platform_instance_map.get(platform)

    return platform_instance
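
For illustration, here is a small standalone sketch of how the composite key would resolve, outside the ingestion source; the "metabase1"/"metabase2" identifiers and instance names are hypothetical placeholders, not values DataHub defines:

```python
from typing import Dict, Optional

# Hypothetical composite-key map matching the enhanced configuration above.
database_id_to_instance_map: Dict[str, str] = {
    "metabase1_6": "instance1",
    "metabase2_6": "instance3",
}


def lookup(metabase_instance: str, datasource_id: int) -> Optional[str]:
    # Mirrors the composite-key lookup from the modified method above.
    return database_id_to_instance_map.get(f"{metabase_instance}_{datasource_id}")


print(lookup("metabase1", 6))  # -> instance1
print(lookup("metabase2", 6))  # -> instance3
```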

Conclusion

These solutions should help you address the platform instance naming issue when dealing with multiple Metabase instances pointing to the same Snowflake instance. Would you like more detailed steps on implementing any of these solutions, or do you have any other specific questions?


<@U06TM7M40RK>, why am I getting:

        "metabase-dbname-12": [
          "Cannot determine database name for platform: snowflake"
        ],
        "metabase-dbname-6": [
          "Cannot determine database name for platform: snowflake"
        ],```
Despite the fact that I specified them in `database_alias_map`
```        database_alias_map:
            '6': 'NAME_1'
            '12': 'NAME_2'```

The warning message "Cannot determine database name for platform: snowflake" indicates that the ingestion process is unable to map the database IDs to their respective names for the Snowflake platform, despite specifying them in the database_alias_map.

Understanding the Issue

The database_alias_map is used to map database IDs to user-friendly names. However, this feature is being deprecated in favor of using platform_instance_map and database_id_to_instance_map. The warning could be due to several reasons:

  1. Deprecation of database_alias_map: As per the <https://datahubproject.io/docs/generated/ingestion/sources/metabase/|Metabase ingestion documentation>, the database_alias_map might be deprecated or not functioning as expected in the newer versions.
  2. Incorrect Configuration: There might be an issue with the configuration format or the way the IDs are specified.

Recommended Solution

To resolve this issue, you should use the database_id_to_instance_map and platform_instance_map instead of database_alias_map. Here’s how you can configure it:

  1. Update Configuration: Use database_id_to_instance_map to map database IDs to platform instances.

Example Configuration

Here is an example configuration using database_id_to_instance_map:

source:
  type: metabase
  config:
    connect_uri: "http://your-metabase-instance:3000"
    username: "your-username"
    password: "your-password"
    database_id_to_instance_map:
      "6": "snowflake_instance_1"
      "12": "snowflake_instance_2"
    platform_instance_map:
      snowflake: "snowflake_instance"

Detailed Steps

  1. Update the Configuration File:

    • Replace database_alias_map with database_id_to_instance_map.
    • Ensure that the IDs and instance names are correctly specified.
  2. Run the Ingestion Process:

    • Execute the ingestion process with the updated configuration (a programmatic sketch follows below).
    • Check the ingestion logs and run report to confirm that the database names are correctly mapped.
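
As a rough sketch, assuming the standard Python Pipeline API and a local datahub-rest sink (the connection details are placeholders), the updated recipe can be run and its report inspected programmatically, which surfaces warnings such as the “Cannot determine database name” message:

```python
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "metabase",
            "config": {
                "connect_uri": "http://your-metabase-instance:3000",
                "username": "your-username",
                "password": "your-password",
                "database_id_to_instance_map": {
                    "6": "snowflake_instance_1",
                    "12": "snowflake_instance_2",
                },
                "platform_instance_map": {"snowflake": "snowflake_instance"},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
# Prints a summary of the run, including any source warnings.
pipeline.pretty_print_summary()
pipeline.raise_from_status()
```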

Further Assistance

If the issue persists, you might need to check the Metabase source code, or open a GitHub issue or a Slack discussion to seek further assistance from the DataHub community.

Would you like more detailed steps on implementing this solution, or do you have any other specific questions?
