<@U06TM7M40RK> in snowflake ingestion, we have a problem where some of the datasets are being ingested twice, duplicated with lower-case DB, schema, and table names. The actual table, database, and schema names are in upper case on the source and are also ingested into DataHub that way. For example, the SOC_TO_PROFILE table under DB INGESTION_DB and schema CDATA is present in DataHub in the correct format, but a lower-case entity named h2_system_norway.ingestion_db.cdata.soc_to_profile is also present, all in lower case and without any schema. This entity doesn't exist in the Snowflake source.
What could be the reason for this?
The issue you’re experiencing with duplicate datasets in DataHub, where one version is in uppercase and the other in lowercase, is likely due to inconsistencies in how case sensitivity is handled during the ingestion process. This is a known issue that can occur due to different ingestion configurations or tools that normalize the case differently.
Possible Causes:
- Inconsistent Case Handling in Ingestion Tools: Different ingestion tools or connectors may handle case sensitivity differently; some convert table names to lowercase while others retain the original case. The extra leading segment in the lower-case name (h2_system_norway) also suggests it may have come from a different connector or job than your main Snowflake ingestion. This inconsistency can lead to duplicate entries with different cases.
- Normalization in DataHub Code: Some parts of the DataHub code explicitly lowercase URNs to keep them consistent, but this is not done everywhere, which leads to inconsistencies. According to <Issues · datahub-project/datahub · GitHub issue>, this can cause datasets to appear in both uppercase and lowercase in the UI.
- Configuration Settings: The convert_urns_to_lowercase setting in your Snowflake ingestion configuration might be set to True, which converts all dataset URNs to lowercase. This setting is detailed in the <Snowflake | DataHub ingestion documentation>.
Steps to Resolve:
- Check Ingestion Configurations: Ensure that all your ingestion configurations handle case consistently. Specifically, check the convert_urns_to_lowercase setting in your Snowflake ingestion recipe; if you want to preserve the original case, set it to False.
- Normalize URNs: Use a consistent case across all ingestion jobs. Setting the convert_urns_to_lowercase flag to False in every recipe retains the original casing.
- Review and Update Ingestion Recipes: Review your ingestion recipes to make sure they are not introducing case inconsistencies, and that all tools and configurations are aligned on case sensitivity.
Here is an example configuration for Snowflake ingestion to ensure consistent case handling:
source:
  type: snowflake
  config:
    convert_urns_to_lowercase: false
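Note that this setting only affects future ingestion runs; any lower-case duplicates that were already ingested will remain until you remove them, for example by soft-deleting the duplicate entities' URNs with the DataHub CLI's datahub delete command.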
Would you like more detailed guidance on any of these steps?
Sources:
- <Slack Thread on Duplicate Snowflake Databases>
- <Issues · datahub-project/datahub · GitHub Issue on Inconsistent Datasets Name Case>
- <Snowflake | DataHub Ingestion Configuration Details>
Hi ! Did this interaction resolve your issue or do you still require assistance ?
Please tag @ RunLLM if you still need further assistance !
Or tag @Datahub Community Support if you require a person to take a look
Hope this helps ! ><
<@U06TM7M40RK> in our snowflake ingestion there is a wait period of 4 hrs where nothing is emitted and nothing is written to the ingestion job log. We get the line below -
{datahub.ingestion.source.snowflake.snowflake_usage_v2:341} - Getting access history
and after that the job hangs for 4 hrs. This increases the job run time. What could be the issue?
The issue you’re experiencing, where the Snowflake ingestion job hangs for an extended period during the “Getting access history” step, could be related to several factors, including network latency, large data volumes, or inefficient query execution. Here are some potential causes and solutions:
Potential Causes:
- Large Volume of Data: If your Snowflake account has a large volume of access history, retrieving it can take a significant amount of time. This is especially true if the include_usage_stats or include_operational_stats configurations are enabled, as they require querying the Snowflake account usage views (such as snowflake.account_usage.access_history), which can be slow and have high latency.
- Network Latency: Network issues between your DataHub instance and Snowflake can cause delays in data retrieval, for example network congestion, DNS resolution issues, or other connectivity problems.
- Inefficient Queries: The queries used to retrieve access history might not be optimized for performance, especially if they involve complex joins or filters over a long time window.
Steps to Troubleshoot and Resolve:
- Optimize Query Performance: Review the queries used to retrieve access history (visible in Snowflake's query history) and check whether they are queued or running slowly. The account usage views cannot be indexed or partitioned by users, so running the ingestion on a larger or dedicated warehouse is the main lever for speeding them up.
- Limit Data Retrieval: If possible, limit the amount of data being retrieved by narrowing the time window for access history. This can be done by configuring the start_time and end_time parameters to a smaller range (see the example recipe after these steps).
- Check Network Connectivity: Ensure that there are no network issues affecting the connection between DataHub and Snowflake, including DNS resolution issues, network latency, and firewall rules that might affect connectivity.
- Review Configuration Settings: Double-check your ingestion configuration to ensure that unnecessary data retrieval is not being performed. For example, if you do not need usage statistics, consider setting include_usage_stats to False (also shown in the recipe below).
- Monitor Snowflake Performance: Use Snowflake's performance monitoring tools, such as the query history and warehouse load views, to identify any bottlenecks or slow queries. This can help pinpoint the cause of the delay.
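As a rough illustration, here is a hedged sketch of a Snowflake recipe that narrows the usage window and disables statistics you may not need. It assumes the standard start_time, end_time, include_usage_stats, and include_operational_stats options of the Snowflake source; the timestamps are placeholders to adjust for your environment:
source:
  type: snowflake
  config:
    # Narrow the usage/access-history window (placeholder values)
    start_time: '2024-01-01T00:00:00Z'
    end_time: '2024-01-02T00:00:00Z'
    # Skip usage and operational statistics if they are not needed
    include_usage_stats: false
    include_operational_stats: false
If both usage and operational statistics are disabled, the access history query should be skipped entirely, which is usually the quickest way to confirm that this step is the bottleneck.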
Would you like more detailed guidance on any of these steps?
Sources:
- <Issues · datahub-project/datahub · GitHub Issue on Snowflake Ingestion Performance>
<@U06TM7M40RK> for the airflow integration with datahub, how do we make sure that the lineage is emitted in a PATCH manner and does not OVERWRITE the existing lineage with every run?
To ensure that lineage is emitted in a PATCH manner rather than overwriting existing lineage with every run in your Airflow integration with DataHub, you need to configure your DataHub Airflow plugin correctly. Here are the steps and considerations to achieve this:
Steps to Ensure PATCH Lineage Emission:
- Use the DataHub Airflow Plugin: Make sure you are using the DataHub Airflow plugin, which supports lineage emission and should be configured to update existing metadata rather than overwrite it.
- Configure the Plugin for PATCH Emission:
  - Ensure the plugin is set up to append or update lineage information rather than replace it.
  - According to the <Airflow Integration | DataHub Airflow documentation>, you can use the DatahubEmitterOperator to emit lineage explicitly; it can be configured to update existing lineage by specifying the correct parameters.
- Use the DatahubEmitterOperator:
  - The DatahubEmitterOperator allows you to emit lineage information explicitly. You can configure it to update existing lineage by specifying the inlets and outlets for your tasks.
  - Example configuration (the platform and dataset names are placeholders):
from datahub_airflow_plugin.entities import Dataset
from datahub_airflow_plugin.operators.datahub import DatahubEmitterOperator

emit_lineage = DatahubEmitterOperator(
    task_id="emit_lineage",
    datahub_conn_id="datahub_rest_default",
    inlets=[Dataset("platform", "dataset_name")],
    outlets=[Dataset("platform", "output_dataset_name")],
)
- Ensure Correct Use of Inlets and Outlets:
  - When defining your Airflow tasks, make sure to correctly specify the inlets and outlets for each task. This ensures that lineage is appended to the existing metadata rather than overwriting it.
- Check Plugin Logs:
  - Monitor the Airflow logs to ensure that lineage is being emitted correctly. Look for log messages indicating that lineage is being updated rather than replaced.
- Review DataHub Configuration:
  - Ensure that your DataHub instance is configured to handle PATCH updates correctly. This might involve checking the DataHub server settings to ensure it supports incremental updates. If you need to emit lineage as an explicit patch yourself, see the SDK sketch after this list.
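If your plugin version still replaces task lineage on each run, one option is to emit the lineage yourself (for example from a PythonOperator task) using the DataHub Python SDK's DatasetPatchBuilder, which appends upstream edges as a patch instead of rewriting the whole upstreamLineage aspect. This is a minimal sketch, not the plugin's built-in mechanism; the GMS URL and dataset names are placeholders to replace with your own:
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetLineageTypeClass, UpstreamClass
from datahub.specific.dataset import DatasetPatchBuilder

# Placeholder URNs: replace with your real platform and dataset names.
downstream_urn = make_dataset_urn(platform="snowflake", name="db.schema.output_table", env="PROD")
upstream_urn = make_dataset_urn(platform="snowflake", name="db.schema.input_table", env="PROD")

# Build a PATCH that adds one upstream edge without touching existing lineage.
patch_builder = DatasetPatchBuilder(downstream_urn)
patch_builder.add_upstream_lineage(
    UpstreamClass(dataset=upstream_urn, type=DatasetLineageTypeClass.TRANSFORMED)
)

# Emit the patch proposals to the DataHub GMS (placeholder URL).
emitter = DatahubRestEmitter("http://localhost:8080")
for mcp in patch_builder.build():
    emitter.emit(mcp)
Whether this is needed depends on how your plugin version emits lineage, so it is worth verifying the emitted aspects on a test dataset first.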
Would you like more detailed guidance on any of these steps?
Sources:
- https://datahubspace.slack.com/archives/C029A3M079U/p1726078739.502529
- https://datahubspace.slack.com/archives/CUMUWQU66/p1726471219.830139
Hi ! Did this interaction resolve your issue or do you still require assistance ?
Just checking in a final time, Please tag @ RunLLM if you still need further assistance !
Or tag @Datahub Community Support if you require a person to take a look at this thread
Hope this helps ! ><