<@U06TM7M40RK> I am struggling with mixed uppercase and lowercase dataset names in a setup with the Snowflake and dbt-cloud platforms.
The dbt ingestions are set up with these parameters:
The issue you're experiencing with mixed uppercase and lowercase dataset names in your Snowflake and dbt-cloud ingestion setup is likely due to inconsistencies in how case sensitivity is handled during the ingestion process. Here's a breakdown of the potential causes and solutions:
Inconsistent Case Handling:
The convert_urns_to_lowercase flag is set to true in both your dbt-cloud and Snowflake ingestion configurations. This should ideally convert all URNs to lowercase, but it seems there might be a discrepancy in how this is applied across different parts of the ingestion process.
DBT Ingestion Configuration:
The convert_column_urns_to_lowercase parameter in your dbt-cloud configuration is set to true, which should convert column URNs to lowercase. However, this parameter is not explicitly mentioned in the provided data sources, so its behavior might not be fully supported or consistent.
Snowflake Ingestion Configuration:
The convert_urns_to_lowercase flag is correctly set to true, which should ensure that all dataset URNs are lowercased during ingestion. However, if there are discrepancies in how dbt and Snowflake handle case sensitivity, this could lead to the observed duplication.
Potential Solutions:
Ensure that both dbt and Snowflake ingestion configurations are consistent in terms of case handling. You might need to verify if there are any additional settings or flags in your dbt configuration that could affect case sensitivity.
Consider normalizing the case in your dbt ingestion process to match the Snowflake configuration. This might involve setting a similar convert_urns_to_lowercase flag in the dbt configuration if supported.
Known Issues:
According to a GitHub issue in the datahub-project/datahub repository, inconsistent URN casing during dbt ingestion is a known problem and could lead to mismatches between dbt and Snowflake datasets.
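To make the duplication mechanism above concrete, here is a minimal Python sketch. It assumes the usual DataHub dataset URN layout, urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>), and treats URNs as opaque, case-sensitive strings. The helper make_dataset_urn_sketch and the table name ANALYTICS.PUBLIC.Orders are hypothetical and only for illustration; this is not DataHub's own implementation.

```python
def make_dataset_urn_sketch(
    platform: str, name: str, env: str = "PROD", lowercase: bool = False
) -> str:
    """Build a dataset URN string, optionally lowercasing the dataset name."""
    if lowercase:
        name = name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical table name, for illustration only.
mixed = make_dataset_urn_sketch("snowflake", "ANALYTICS.PUBLIC.Orders")
lowered = make_dataset_urn_sketch(
    "snowflake", "ANALYTICS.PUBLIC.Orders", lowercase=True
)

# Assumption: URNs are compared as plain strings, so a difference in
# casing alone yields two separate dataset entities.
print(mixed == lowered)  # False
print(mixed)
print(lowered)
```

If one source emits the first form and the other emits the second, the same table would show up twice, which matches the symptom you describe.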
Would you like more detailed steps on how to address this issue or any specific part of the configuration process?
To address the issue of mixed uppercase and lowercase dataset names in your Snowflake and dbt-cloud ingestion setup, follow these detailed steps:
Step 1: Verify and Adjust DBT Ingestion Configuration
Check DBT Configuration:
Ensure that your dbt ingestion configuration is set to consistently handle case sensitivity. The convert_urns_to_lowercase parameter should be set to true to ensure all URNs are lowercased.
Update DBT Ingestion Recipe:
If the convert_urns_to_lowercase parameter is not explicitly available in your dbt configuration, you might need to manually adjust the case of dataset names in your dbt models or scripts.
Example DBT Configuration:
Here is an example configuration for dbt ingestion:
Step 2: Verify and Adjust Snowflake Ingestion Configuration
Check Snowflake Configuration:
Ensure that the convert_urns_to_lowercase flag is set to true in your Snowflake ingestion configuration. This will ensure that all dataset URNs are lowercased during ingestion.
Example Snowflake Configuration:
Here is an example configuration for Snowflake ingestion:
Step 3: Re-run Ingestion
After updating the configurations, re-run the ingestion for both dbt and Snowflake so that the updated case-handling settings take effect.
Verify Results:
Check the DataHub UI to verify that the datasets are now consistently named in lowercase. If the issue persists, there might be additional configurations or scripts that need to be adjusted.
Step 4: Troubleshoot Further if Needed
Inspect Logs:
Review the logs from the ingestion processes to identify any errors or warnings related to case handling.
The issue may be more complex than the configuration settings alone. Let's explore some additional steps and considerations that might help resolve the problem:
Additional Steps to Consider
Review Ingestion Logs:
Carefully review the logs from both the dbt and Snowflake ingestion processes. Look for any discrepancies or warnings related to case sensitivity that might provide more context on the issue.
Check for Duplicate Entries:
Verify if there are any duplicate entries in your DataHub instance that might be causing the mix of uppercase and lowercase dataset names. This could be due to previous ingestions with different configurations.
Stateful Ingestion:
Ensure that stateful ingestion is correctly configured and functioning. This should help manage and clean up any old or duplicate entries. If stateful ingestion is not working as expected, you might need to manually delete old entries.
Manual Cleanup:
Consider manually cleaning up the existing datasets in DataHub to remove duplicates. You can use the DataHub CLI to delete specific datasets or URNs.
Cross-Check with Other Teams:
If you are working in a collaborative environment, check with other teams to ensure that there are no conflicting ingestion processes or configurations that might be affecting the dataset names.
Consult DataHub Community:
Engage with the DataHub community or support channels to see if others have faced similar issues and if there are any known workarounds or solutions.
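For the duplicate check mentioned in the steps above, a rough Python sketch like the following could help. It assumes you have already exported a list of dataset URNs from your DataHub instance by whatever means you prefer; the function name find_case_variants and the sample URNs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def find_case_variants(urns: Iterable[str]) -> Dict[str, List[str]]:
    """Group URNs that differ only by letter case.

    The lowercased URN is used purely as a grouping key; only groups
    with more than one distinct spelling are returned.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for urn in urns:
        groups[urn.lower()].add(urn)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical URNs, for illustration only.
urns = [
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,ANALYTICS.PUBLIC.ORDERS,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.orders,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,analytics.public.customers,PROD)",
]

duplicates = find_case_variants(urns)
for variants in duplicates.values():
    print(variants)
```

Each printed group is a candidate for cleanup: one spelling is probably left over from an earlier ingestion run with different case settings.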
Example CLI Command for Cleanup
If you decide to manually clean up datasets, you can use the following DataHub CLI command to delete specific datasets:
datahub delete --urn <urn> --soft
This command will mark the dataset as removed without permanently deleting it, allowing you to re-ingest with the correct configuration.
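If there are many URNs to clean up, a small Python sketch such as this one could generate the command above once per URN. It only builds the command strings and does not run anything, so you can review them first. The function name build_soft_delete_commands and the sample URN are hypothetical.

```python
import shlex
from typing import Iterable, List


def build_soft_delete_commands(urns: Iterable[str]) -> List[str]:
    """Return one shell-quoted `datahub delete --urn <urn> --soft` line per URN."""
    return [
        # shlex.join quotes the URN, since parentheses are special in a shell.
        shlex.join(["datahub", "delete", "--urn", urn, "--soft"])
        for urn in urns
    ]


# Hypothetical URN, for illustration only.
stale_urns = [
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,ANALYTICS.PUBLIC.ORDERS,PROD)",
]

commands = build_soft_delete_commands(stale_urns)
for command in commands:
    print(command)
```

Reviewing the generated lines before running them is a cheap safeguard against deleting the wrong casing variant.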
Would you like more detailed guidance on any of these additional steps?
<@U06TM7M40RK> In fact, omitting the dbt config variable convert_urns_to_lowercase: true improves the dbt parsing. Several datasets now show much more information on the schema page, which was missing when the lowercasing was applied.
It seems that omitting the convert_urns_to_lowercase: true configuration in your dbt setup has improved the parsing and resulted in more comprehensive schema information. This suggests that the lowercasing of URNs might have been causing mismatches or conflicts with the existing dataset names, leading to incomplete schema information.
Explanation and Recommendations
Case Sensitivity in URNs:
When convert_urns_to_lowercase is set to true, all URNs are converted to lowercase, which can lead to mismatches if the original dataset names are case-sensitive. This can result in incomplete or incorrect schema information being displayed.
Improved Schema Information:
By omitting the lowercasing configuration, the URNs retain their original casing, which seems to align better with the actual dataset names in your environment. This alignment allows for more accurate parsing and display of schema information.
Recommendations:
Consistency Across Platforms: Ensure that both dbt and Snowflake configurations are consistent in terms of case handling. If Snowflake datasets are case-sensitive, it might be beneficial to avoid lowercasing in dbt as well.
Re-ingestion: After making configuration changes, re-ingest the metadata to ensure that the changes take effect and the schema information is updated.
Monitor and Validate: Continuously monitor the ingestion results and validate that the schema information remains accurate and complete.
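To check whether the two sources agree on casing, a sketch along these lines might be useful. It assumes you can list the fully qualified table names that each ingestion emits; the function classify_matches and the sample names are hypothetical, and the matching logic is a simplification rather than how DataHub links dbt nodes to Snowflake tables.

```python
from typing import Dict, Iterable, List


def classify_matches(
    dbt_names: Iterable[str], snowflake_names: Iterable[str]
) -> Dict[str, List[str]]:
    """Sort dbt-side names into exact, case-only, and unmatched buckets."""
    exact = set(snowflake_names)
    folded = {name.lower() for name in exact}
    result: Dict[str, List[str]] = {"exact": [], "case_only": [], "unmatched": []}
    for name in dbt_names:
        if name in exact:
            result["exact"].append(name)
        elif name.lower() in folded:
            # Same table, different casing: likely to appear as a duplicate.
            result["case_only"].append(name)
        else:
            result["unmatched"].append(name)
    return result


# Hypothetical names, for illustration only.
dbt_names = [
    "analytics.public.orders",
    "ANALYTICS.PUBLIC.CUSTOMERS",
    "analytics.public.payments",
]
snowflake_names = ["analytics.public.orders", "analytics.public.customers"]

report = classify_matches(dbt_names, snowflake_names)
print(report)
```

Any name in the case_only bucket suggests that the two configurations still differ in how they handle casing.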
Would you like further assistance or more detailed guidance on any specific aspect of this process?
<@U07FV3UHE4D> Hi, are you able to take a look? I am hitting a wall here and can't seem to find any solution. Planning on onboarding teams that are affected by this soon
Thank you for letting us know you need a person to take a look at this thread.
Either a community member or one of our team members will try to assist when they have a moment; responses unfortunately might not be immediate. We hope this helps speed things along.