Resolving Uppercase and Lowercase Mix-up in Snowflake and dbt-cloud Platforms

user-3 · January 13, 2025, 12:02am

<@U06TM7M40RK> I am struggling with an uppercase/lowercase mix-up in a setup with the Snowflake and dbt-cloud platforms.
The dbt ingests are set up with these parameters:

```
type: dbt-cloud
config:
    ...
    convert_column_urns_to_lowercase: true
    target_platform: snowflake
    target_platform_instance: prod
```
Snowflake ingests:
```
config:
    ...
    include_table_lineage: true
    include_view_lineage: true
    include_tables: true
    include_views: true
    env: prod
    platform_instance: prod
    convert_urns_to_lowercase: true
    profiling:
        enabled: true
        profile_table_level_only: false
        include_field_distinct_value_frequencies: true
        include_field_histogram: true
    stateful_ingestion:
        enabled: true
```
Result:
```
- **Datasets** (2k)
    - ❌ **dbt** 
        - 📁 assignment 
        - 📁 ASSIGNMENT 
        - 📁 CDP 
        - 📁 dpi 
        - 📁 kapp 
        - 📁 KAPP 
        - 📁 oportal 
        - 📁 OPORTAL 
        - 📁 paxtelling 
        - 📁 PAXTELLING 
        - 📁 plandata 
        - 📁 PLANDATA 
        - 📁 progress 
        - 📁 PROGRESS 
        - 📁 telemetry 
        - 📁 TELEMETRY 
        - 📁 trafikk 
        - 📁 TRAFIKK 
        - 📁 UTILITIES 
    - ❄️ **Snowflake** 
        - 📁 prod
            - 📁 assignment 
            - 📁 cdp 
            - 📁 kapp 
            - 📁 oportal 
            - 📁 paxtelling 
            - 📁 plandata 
            - 📁 progress 
            - 📁 telemetry 
            - 📁 trafikk 
            - 📁 utilities 
        - 📁 prod
            - 📁 ANALYSE 
            - 📁 ASSIGNMENT 
            - 📁 CDP 
            - 📁 DATASCIENCE 
            - 📁 DPI 
            - 📁 JOURNEYMON
```
I am struggling to understand why I get both uppercase and lowercase dataset names after ingestion.

user-1 · January 13, 2025, 12:02am

The issue you’re experiencing with mixed uppercase and lowercase dataset names in your Snowflake and dbt-cloud ingestion setup is likely due to inconsistencies in how case sensitivity is handled during the ingestion process. Here’s a breakdown of the potential causes and solutions:

Inconsistent Case Handling:
- The convert_urns_to_lowercase flag is set to true in your Snowflake ingestion configuration, while your dbt-cloud configuration sets convert_column_urns_to_lowercase instead. This should ideally convert all URNs to lowercase, but there seems to be a discrepancy in how the flags are applied across different parts of the ingestion process.
DBT Ingestion Configuration:
- The convert_column_urns_to_lowercase parameter in your dbt-cloud configuration is set to true, which should convert column URNs to lowercase. However, this parameter is not explicitly mentioned in the provided data sources, so its behavior might not be fully supported or consistent.
Snowflake Ingestion Configuration:
- The convert_urns_to_lowercase flag is correctly set to true, which should ensure that all dataset URNs are lowercased during ingestion. However, if there are discrepancies in how dbt and Snowflake handle case sensitivity, this could lead to the observed duplication.
Potential Solutions:
- Ensure that both dbt and Snowflake ingestion configurations are consistent in terms of case handling. You might need to verify if there are any additional settings or flags in your dbt configuration that could affect case sensitivity.
- Consider normalizing the case in your dbt ingestion process to match the Snowflake configuration. This might involve setting a similar convert_urns_to_lowercase flag in the dbt configuration if supported.
Known Issues:
- According to a GitHub issue (Issues · datahub-project/datahub), there have been known issues with inconsistent URN casing during dbt ingestion, which could lead to mismatches between dbt and Snowflake datasets.
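
To illustrate why casing alone can produce the duplicate folders shown above: a dataset is identified by its URN string, and string comparison is case-sensitive. The following is a minimal sketch, not the DataHub SDK. `make_urn` is a hypothetical helper that formats a URN with plain string formatting, the table name `ANALYTICS.KAPP.ORDERS` is invented, and the `prod.` prefix for the platform instance is an assumption about how the instance appears in the URN.

```python
from typing import Optional


def make_urn(platform: str, name: str, env: str = "PROD",
             instance: Optional[str] = None) -> str:
    """Hypothetical helper: format a dataset URN as a plain string."""
    # Assumption: the platform instance is prepended to the dataset name.
    full_name = f"{instance}.{name}" if instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{full_name},{env})"


# Same table, emitted once with its original casing and once lowercased.
upper = make_urn("snowflake", "ANALYTICS.KAPP.ORDERS", instance="prod")
lower = make_urn("snowflake", "ANALYTICS.KAPP.ORDERS".lower(), instance="prod")

# The two strings differ, so they would be treated as two separate entities.
same_entity = upper == lower
print(upper)
print(lower)
print("same entity:", same_entity)
```

If one ingestion source emits the first form and another emits the second, both end up in the catalog side by side, which matches the paired `kapp` / `KAPP` folders in the listing.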

Would you like more detailed steps on how to address this issue or any specific part of the configuration process?

Sources:

<Issues · datahub-project/datahub · GitHub URN casing in DBT ingestion>

user-3 · January 13, 2025, 12:02am

<@U06TM7M40RK> I need more detailed steps on how to address this issue

user-1 · January 13, 2025, 12:02am

To address the issue of mixed uppercase and lowercase dataset names in your Snowflake and dbt-cloud ingestion setup, follow these detailed steps:

Step 1: Verify and Adjust DBT Ingestion Configuration

Check DBT Configuration:
- Ensure that your dbt ingestion configuration is set to consistently handle case sensitivity. The convert_urns_to_lowercase parameter should be set to true to ensure all URNs are lowercased.
Update DBT Ingestion Recipe:
- If the convert_urns_to_lowercase parameter is not explicitly available in your dbt configuration, you might need to manually adjust the case of dataset names in your dbt models or scripts.

Example DBT Configuration:

Here is an example configuration for dbt ingestion:

source:
  type: dbt-cloud
  config:
    convert_urns_to_lowercase: true
    target_platform: snowflake
    target_platform_instance: prod
    # Other necessary configurations

Step 2: Verify and Adjust Snowflake Ingestion Configuration

Check Snowflake Configuration:
- Ensure that the convert_urns_to_lowercase flag is set to true in your Snowflake ingestion configuration. This will ensure that all dataset URNs are lowercased during ingestion.

Example Snowflake Configuration:

Here is an example configuration for Snowflake ingestion:

config:
  include_table_lineage: true
  include_view_lineage: true
  include_tables: true
  include_views: true
  env: prod
  platform_instance: prod
  convert_urns_to_lowercase: true
  profiling:
    enabled: true
    profile_table_level_only: false
    include_field_distinct_value_frequencies: true
    include_field_histogram: true
  stateful_ingestion:
    enabled: true

user-1 · January 13, 2025, 12:02am

Step 3: Re-run Ingestion and Verify Results

Re-run Ingestion:
- After updating the configurations, re-run the ingestion processes for both dbt and Snowflake. This will apply the updated case handling settings.
Verify Results:
- Check the DataHub UI to verify that the datasets are now consistently named in lowercase. If the issue persists, there might be additional configurations or scripts that need to be adjusted.
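
The verification step can also be done outside the UI. As a rough sketch, assuming you have exported the folder or dataset names into a Python list (the export itself is not shown, and `find_case_variants` is a hypothetical helper), grouping the names by their lowercased form reveals which ones still exist in more than one casing. The sample names are a subset of the dbt folders from the listing earlier in the thread.

```python
from collections import defaultdict


def find_case_variants(names):
    """Group names by lowercase key; keep keys seen in more than one casing."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


# Subset of the dbt folder names reported in the thread.
dbt_folders = ["assignment", "ASSIGNMENT", "CDP", "dpi", "kapp", "KAPP", "UTILITIES"]

variants = find_case_variants(dbt_folders)
for key, spellings in sorted(variants.items()):
    print(key, "->", spellings)
```

An empty result after re-ingestion would indicate that the names are consistent. A non-empty result lists exactly the names that still need attention.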

Step 4: Troubleshoot Further if Needed

Inspect Logs:
- Review the logs from the ingestion processes to identify any errors or warnings related to case handling.
Check for Known Issues:
- Refer to the <Issues · datahub-project/datahub · GitHub issue> for any known issues or updates related to URN casing during ingestion.
Manual Adjustments:
- If automatic adjustments do not resolve the issue, consider manually adjusting the case of dataset names in your dbt models or scripts.

Would you like further assistance on any specific part of this process?

Sources:

https://datahubspace.slack.com/archives/CUMUWQU66/p1716555485.025399
https://runllm.slack.com/archives/C06SQ733HGQ/p1717518371.919619

user-3 · January 13, 2025, 12:02am

<@U06TM7M40RK> The proposed solution is identical to what I am already doing

user-1 · January 13, 2025, 12:02am

The issue appears to be more complex than the configuration settings alone. Let’s explore some additional steps and considerations that might help resolve the problem:

Additional Steps to Consider

Review Ingestion Logs:
- Carefully review the logs from both the dbt and Snowflake ingestion processes. Look for any discrepancies or warnings related to case sensitivity that might provide more context on the issue.
Check for Duplicate Entries:
- Verify if there are any duplicate entries in your DataHub instance that might be causing the mix of uppercase and lowercase dataset names. This could be due to previous ingestions with different configurations.
Stateful Ingestion:
- Ensure that stateful ingestion is correctly configured and functioning. This should help manage and clean up any old or duplicate entries. If stateful ingestion is not working as expected, you might need to manually delete old entries.
Manual Cleanup:
- Consider manually cleaning up the existing datasets in DataHub to remove duplicates. You can use the DataHub CLI to delete specific datasets or URNs.
Cross-Check with Other Teams:
- If you are working in a collaborative environment, check with other teams to ensure that there are no conflicting ingestion processes or configurations that might be affecting the dataset names.
Consult DataHub Community:
- Engage with the DataHub community or support channels to see if others have faced similar issues and if there are any known workarounds or solutions.

Example CLI Command for Cleanup

If you decide to manually clean up datasets, you can use the following DataHub CLI command to delete specific datasets:

datahub delete --urn <urn> --soft

This command will mark the dataset as removed without permanently deleting it, allowing you to re-ingest with the correct configuration.
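
If many datasets are affected, the commands can be generated rather than typed by hand. The sketch below only prints `datahub delete --urn <urn> --soft` commands and executes nothing. It assumes that the lowercase URNs are the ones you want to keep, which may not be the right choice in every setup. `dataset_name` and `soft_delete_commands` are hypothetical helpers, and the sample URNs are invented.

```python
import shlex

PREFIX = "urn:li:dataset:("


def dataset_name(urn):
    """Extract the name part from urn:li:dataset:(<platform>,<name>,<env>)."""
    inner = urn[len(PREFIX):-1]
    _platform, rest = inner.split(",", 1)
    name, _env = rest.rsplit(",", 1)
    return name


def soft_delete_commands(urns):
    """Build one soft-delete command per URN whose name is not all lowercase."""
    stale = [u for u in urns if dataset_name(u) != dataset_name(u).lower()]
    # shlex.quote protects the parentheses in the URN from the shell.
    return [f"datahub delete --urn {shlex.quote(u)} --soft" for u in stale]


urns = [
    "urn:li:dataset:(urn:li:dataPlatform:dbt,prod.db.kapp.orders,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:dbt,prod.DB.KAPP.ORDERS,PROD)",
]

commands = soft_delete_commands(urns)
for command in commands:
    print(command)
```

Review the printed commands before running any of them, since the script cannot tell which casing your lineage and sibling links depend on.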

Would you like more detailed guidance on any of these additional steps?

Sources:

https://datahubspace.slack.com/archives/CUMUWQU66/p1716555485.025399
https://runllm.slack.com/archives/C06SQ733HGQ/p1717518371.919619

user-3 · January 13, 2025, 12:02am

<@U06TM7M40RK> In fact, omitting the dbt config variable
convert_urns_to_lowercase: true
improves the dbt parsing. I see several datasets now have much more information on the schema page, which was missing when the lowercasing was applied.

user-1 · January 13, 2025, 12:02am

It seems that omitting the convert_urns_to_lowercase: true configuration in your dbt setup has improved the parsing and resulted in more comprehensive schema information. This suggests that the lowercasing of URNs might have been causing mismatches or conflicts with the existing dataset names, leading to incomplete schema information.

Explanation and Recommendations

Case Sensitivity in URNs:
- When convert_urns_to_lowercase is set to true, all URNs are converted to lowercase, which can lead to mismatches if the original dataset names are case-sensitive. This can result in incomplete or incorrect schema information being displayed.
Improved Schema Information:
- By omitting the lowercasing configuration, the URNs retain their original casing, which seems to align better with the actual dataset names in your environment. This alignment allows for more accurate parsing and display of schema information.
Recommendations:
- Consistency Across Platforms: Ensure that both dbt and Snowflake configurations are consistent in terms of case handling. If Snowflake datasets are case-sensitive, it might be beneficial to avoid lowercasing in dbt as well.
- Re-ingestion: After making configuration changes, re-ingest the metadata to ensure that the changes take effect and the schema information is updated.
- Monitor and Validate: Continuously monitor the ingestion results and validate that the schema information remains accurate and complete.
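
One way to check the consistency recommendation is to compare the case-related flags of the two recipes directly. This is only a sketch: the recipes are represented as already-parsed dictionaries using the values posted at the top of the thread, `compare_case_flags` is a hypothetical helper, and a value of `None` means only that the flag is not set in that recipe. Whether each source actually honours each flag is not confirmed here.

```python
LOWERCASE_FLAGS = (
    "convert_urns_to_lowercase",
    "convert_column_urns_to_lowercase",
)


def compare_case_flags(dbt_config, snowflake_config):
    """Return (flag, dbt_value, snowflake_value) for every flag that differs."""
    differences = []
    for flag in LOWERCASE_FLAGS:
        dbt_value = dbt_config.get(flag)  # None when the flag is absent
        snowflake_value = snowflake_config.get(flag)
        if dbt_value != snowflake_value:
            differences.append((flag, dbt_value, snowflake_value))
    return differences


# Values taken from the recipes posted in the thread.
dbt_config = {
    "convert_column_urns_to_lowercase": True,
    "target_platform": "snowflake",
    "target_platform_instance": "prod",
}
snowflake_config = {
    "convert_urns_to_lowercase": True,
    "platform_instance": "prod",
    "env": "prod",
}

differences = compare_case_flags(dbt_config, snowflake_config)
for flag, dbt_value, snowflake_value in differences:
    print(f"{flag}: dbt={dbt_value} snowflake={snowflake_value}")
```

For the recipes as posted, both flags show up as differing, because each recipe sets a flag that the other does not.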

Would you like further assistance or more detailed guidance on any specific aspect of this process?

Sources:

<Issues · datahub-project/datahub · GitHub URN casing in DBT ingestion>
<dbt | DataHub DBT Ingestion Configuration>

datahub_team · January 13, 2025, 12:02am

Hi! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look

Hope this helps ! ><

user-3 · January 13, 2025, 12:02am

<@U07FV3UHE4D> Hi, are you able to take a look? I am hitting a wall here and can’t seem to find any solution. Planning on onboarding teams that are affected by this soon

datahub_team · January 13, 2025, 12:02am

Thank you for letting us know you need a person to take a look at this thread.

Either a community member or one of our team members will try to assist when they have a moment. Responses unfortunately might not be immediate. We hope this helps speed things along.
