Handling Duplicate Datasets and Snowflake Job Hangs in DataHub Ingestion

Original Slack Thread

<@U06TM7M40RK> in snowflake ingestion, we have a problem where some of the datasets are being ingested in duplicate, with lowercase DB name, schema name and table name. The actual table name, database name and schema name are uppercase on the source and are also ingested into DataHub in that form. For example, the SOC_TO_PROFILE table under DB INGESTION_DB and schema CDATA is present in DataHub, which is the correct format. But a lowercase entity is also present with the name h2_system_norway.ingestion_db.cdata.soc_to_profile, all lowercase and without any schema. This entity doesn't exist in the Snowflake source.
What could be the reason for this?

The issue you’re experiencing with duplicate datasets in DataHub, where one version is uppercase and the other lowercase, is most likely caused by inconsistent case handling during ingestion: two ingestion paths, or two configurations of the same connector, normalizing names differently will produce two distinct URNs for the same physical table.

Possible Causes:

  1. Inconsistent Case Handling Across Sources: Different connectors, or different recipes for the same connector, can handle case differently: some lowercase names while others preserve the original case, which yields two URNs for one table. The lowercase URN in your example also carries an extra h2_system_norway prefix and no schema hierarchy, which suggests it was emitted by a different source (or by a lineage edge from another system) rather than by the Snowflake connector itself.

  2. Normalization in DataHub Code: Some code paths in DataHub explicitly lowercase names to keep URNs consistent, but this is not done everywhere, so the same table can surface under both an uppercase and a lowercase URN in the UI. There are open issues in the datahub-project/datahub GitHub repository tracking this behaviour.

  3. Configuration Settings: The convert_urns_to_lowercase setting in your Snowflake ingestion recipe controls this behaviour, and it defaults to true for the Snowflake source, so a recipe that does not set it explicitly will emit lowercase URNs, as illustrated below. The setting is described in the Snowflake source page of the DataHub ingestion documentation.
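For illustration, these are roughly the two URN forms you end up with for the table from the thread, depending on whether names are lowercased (the PROD environment is assumed, and the platform-instance prefix from your example is omitted):

urn:li:dataset:(urn:li:dataPlatform:snowflake,INGESTION_DB.CDATA.SOC_TO_PROFILE,PROD)
urn:li:dataset:(urn:li:dataPlatform:snowflake,ingestion_db.cdata.soc_to_profile,PROD)

Because the URN string is the identity of the dataset, DataHub treats these as two unrelated entities.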

Steps to Resolve:

  1. Check Ingestion Configurations: Ensure that all of your ingestion configurations handle case the same way. Specifically, check the convert_urns_to_lowercase setting in every Snowflake ingestion recipe; if you want to preserve the original uppercase names, set it to false everywhere, since mixing values across recipes is exactly what produces duplicates.

  2. Normalize URNs Consistently: Whichever casing convention you choose, apply it across all ingestion jobs, including any other sources that emit lineage pointing at these Snowflake tables; otherwise the second URN form will keep reappearing.

  3. Review and Update Ingestion Recipes: Review your ingestion recipes (and any custom emitters) to confirm none of them introduces a different casing, and align all tools and configurations on the same convention.

Here is an example configuration for Snowflake ingestion to ensure consistent case handling:

source:
  type: snowflake
  config:
    convert_urns_to_lowercase: false
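Note that changing the flag only affects future runs; entities that were already ingested with the other casing will remain until they are removed. One way to clean them up is to soft-delete the spurious URNs. Below is a minimal sketch using the DataHub Python emitter, assuming a GMS endpoint at http://localhost:8080 and using the lowercase name from the thread as an illustrative URN (adjust both to your environment):

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

# Illustrative URN of the spurious lowercase duplicate; adjust name/env as needed.
duplicate_urn = (
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,"
    "h2_system_norway.ingestion_db.cdata.soc_to_profile,PROD)"
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed GMS address

# Soft delete = set the Status aspect's removed flag; the entity disappears from the UI
# and search but can be restored later by emitting removed=False.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=duplicate_urn,
        aspect=StatusClass(removed=True),
    )
)

The same cleanup can be done with the DataHub CLI's delete command if you prefer not to script it.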

Would you like more detailed guidance on any of these steps?

<@U06TM7M40RK> in our snowflake ingestion there is a wait period of 4 hrs where nothing is emitted and nothing is written to the ingestion job log. We get the line below:
{datahub.ingestion.source.snowflake.snowflake_usage_v2:341} - Getting access history
and after that the job hangs for 4 hrs. This increases the job run time. What could be the issue?

The issue you’re experiencing, where the Snowflake ingestion job goes silent for an extended period during the “Getting access history” step, is usually a single long-running query rather than a stuck process: at that point the connector is querying Snowflake’s ACCOUNT_USAGE views for access history, which can take a long time on busy accounts. Here are some potential causes and solutions:

Potential Causes:

  1. Large Volume of Access History: The “Getting access history” step queries SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY, which can be extremely large on busy accounts. This query runs as part of usage extraction (when include_usage_stats or include_operational_stats is enabled), and scanning a wide time window over it can easily take hours.

  2. Network Latency: Network issues between your DataHub instance and Snowflake can cause delays in data retrieval. This can be due to network congestion, DNS resolution issues, or other connectivity problems.

  3. Insufficient Compute: The connector's access-history query itself is fixed, so when it is slow the usual culprit is that the virtual warehouse used by the ingestion role is too small for a scan over weeks or months of access history.

Steps to Troubleshoot and Resolve:

  1. Give the Query More Compute: Snowflake has no traditional indexes to tune, so the main levers are running the ingestion with a larger virtual warehouse and keeping the scanned time window small. Check which warehouse the ingestion role uses and consider sizing it up for the ingestion run.

  2. Limit Data Retrieval: If possible, limit the amount of data being retrieved by narrowing the time window for access history. This can be done by setting the start_time and end_time parameters to a smaller range (see the example recipe after this list).

  3. Check Network Connectivity: Ensure that there are no network issues affecting the connection between DataHub and Snowflake. This includes checking for DNS resolution issues, network latency, and any firewall rules that might be affecting connectivity.

  4. Review Configuration Settings: Double-check your ingestion configuration to avoid retrieving data you do not need. For example, if you do not need usage statistics, set include_usage_stats and include_operational_stats to false, which skips the access-history query entirely.

  5. Monitor Snowflake Performance: While the job appears hung, look up the long-running query in Snowflake’s query history and query profile. This confirms the query is actually executing (rather than queued behind other workloads) and shows where the time is going.
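Here is a minimal recipe sketch showing the relevant options; the timestamps and the choice of flags are illustrative, so keep whichever parts apply to your setup:

source:
  type: snowflake
  config:
    # Skip the ACCOUNT_USAGE access-history query entirely if usage data is not needed
    include_usage_stats: false
    include_operational_stats: false
    # ...or keep usage extraction but narrow the window it has to scan
    # start_time: "2024-06-01T00:00:00Z"
    # end_time: "2024-06-02T00:00:00Z"

With usage extraction disabled, the “Getting access history” step should not run at all; with a narrower window, it should return much faster.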

Would you like more detailed guidance on any of these steps?

<@U06TM7M40RK> for the Airflow integration with DataHub, how do we make sure that the lineage is emitted in a PATCH manner and does not OVERWRITE the existing lineage with every run?

To ensure that lineage is emitted in a PATCH manner rather than overwriting existing lineage on every run of your Airflow integration with DataHub, you need to control which lineage aspects get written and how they are written. Here are the steps and considerations:

Steps to Ensure PATCH Lineage Emission:

  1. Use the DataHub Airflow Plugin: Make sure you are using the current datahub-airflow-plugin (rather than the older, deprecated lineage backend) and keep it up to date, since lineage emission behaviour has changed across plugin versions.

  2. Understand What Gets Replaced:

    • Writing a dataset’s full upstreamLineage aspect replaces whatever upstreams were stored before, while emitting a lineage patch only adds or removes the specific edges you name. If you need true PATCH semantics, emit patches explicitly; a sketch using the Python SDK’s patch builder appears after this list.
    • The DataHub Airflow documentation describes the DatahubEmitterOperator, which can be used to emit lineage metadata from a dedicated task, as shown in the next step.
  3. Use the DatahubEmitterOperator:

    • The DatahubEmitterOperator lets you emit lineage explicitly from a dedicated task by passing it metadata change events, for example ones built with make_lineage_mce. Be aware that make_lineage_mce writes the complete upstreamLineage aspect for the downstream dataset, so any upstreams not listed in the call are dropped; use the patch approach shown after this list if you need additive behaviour.
    • Example configuration:
      import datahub.emitter.mce_builder as builder
      from datahub_airflow_plugin.operators.datahub import DatahubEmitterOperator

      # Table names below are illustrative placeholders.
      # Note: make_lineage_mce writes the full upstreamLineage aspect for the downstream table.
      emit_lineage = DatahubEmitterOperator(
          task_id="emit_lineage",
          datahub_conn_id="datahub_rest_default",
          mces=[
              builder.make_lineage_mce(
                  upstream_urns=[builder.make_dataset_urn("snowflake", "db.schema.upstream_table")],
                  downstream_urn=builder.make_dataset_urn("snowflake", "db.schema.downstream_table"),
              )
          ],
      )
      
  4. Ensure Correct Use of Inlets and Outlets:

    • If you rely on the plugin’s automatic lineage instead, declare inlets and outlets on each task (using datahub_airflow_plugin.entities.Dataset). These are recorded as the inputs and outputs of that task’s DataJob, so each run updates that task’s own lineage rather than lineage contributed by other tasks or sources.
  5. Check Plugin Logs:

    • Monitor the Airflow task logs (and the plugin’s emitter output) to confirm that lineage is being emitted and to see exactly which aspects are written on each run.
  6. Review DataHub Server Version:

    • Aspect patches are applied server-side, so make sure your DataHub server is recent enough to support patch semantics for the upstreamLineage aspect; on older server versions a patch may not be applied as expected.
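For explicit PATCH semantics, the DataHub Python SDK provides a patch builder that adds individual lineage edges without rewriting the whole aspect. Below is a minimal sketch, assuming a reasonably recent acryl-datahub version that includes DatasetPatchBuilder, a GMS endpoint at http://localhost:8080, and illustrative table names; it can be wrapped in a PythonOperator or called from any task:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetLineageTypeClass, UpstreamClass
from datahub.specific.dataset import DatasetPatchBuilder

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # assumed GMS address

downstream = make_dataset_urn(platform="snowflake", name="db.schema.downstream_table")
upstream = make_dataset_urn(platform="snowflake", name="db.schema.upstream_table")

# Build a patch that adds one upstream edge; existing upstreams on the
# downstream dataset are left in place rather than being overwritten.
patch_builder = DatasetPatchBuilder(downstream).add_upstream_lineage(
    UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)
)
for mcp in patch_builder.build():
    emitter.emit(mcp)

Because each run only adds the edges it knows about, repeated runs do not wipe lineage that was ingested from other sources.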

Would you like more detailed guidance on any of these steps?
