Understanding Case Sensitivity and Configuration Options in Vertica Data Ingestion

Original Slack Thread

Hi <@U06TM7M40RK>, how does case sensitivity play a role when ingesting a specific view from a specific schema in vertica?

Case sensitivity can significantly impact the ingestion process in DataHub, especially when dealing with specific views from specific schemas in Vertica. According to the <Vertica | DataHub ingestion documentation>, DataHub provides configuration options to handle case sensitivity during pattern matching for various entities, including tables and views.

Key Points on Case Sensitivity in Vertica Ingestion:

  1. Regex Patterns for Views:

    • You can specify regex patterns to filter views during ingestion using the view_pattern configuration. This pattern should match the entire view name in the format database.schema.view.
    • The ignoreCase boolean parameter within view_pattern determines whether case sensitivity should be ignored during pattern matching. By default, this is set to True, meaning case sensitivity is ignored.
  2. Configuration Example:

    • To ingest views from a specific schema while ignoring case sensitivity, you can configure the view_pattern as follows:
view_pattern:
  allow: [ 'Customer.public.customer_view.*' ]
  deny: [ ]
  ignoreCase: True

This configuration will include all views starting with customer_view in the public schema of the Customer database, regardless of the case.

  1. Impact on Column Level Lineage:
    • As per the <Issues · datahub-project/datahub · GitHub issue on SQL Column Level Lineage Case Sensitivity>, case sensitivity can affect column-level lineage. If column names contain uppercase characters and the convert_urns_to_lowercase option is enabled, it may lead to mismatches in lineage.

Would you like more detailed information on configuring these patterns or addressing specific issues related to case sensitivity in your ingestion process?

Sources:

What does convert_urns_to_lowercase option do in an ingestion recipe <@U06TM7M40RK>?

The convert_urns_to_lowercase option in an ingestion recipe is used to ensure that all URNs (Uniform Resource Names) are converted to lowercase during the ingestion process. This can help avoid issues related to case sensitivity, such as duplicate entries or mismatches in lineage and metadata.

Key Points on convert_urns_to_lowercase:

  1. Purpose:

    • The primary purpose of this option is to standardize the casing of URNs across different ingestion sources, which can help prevent inconsistencies and conflicts.
  2. Configuration:

    • This option can be set in the ingestion recipe for various data sources. When enabled, it converts all dataset URNs to lowercase.
  3. Example Usage:

    • Here is an example of how to use the convert_urns_to_lowercase option in a DBT ingestion recipe:
source:
  type: dbt
  config:
    manifest_path: /path/to/manifest.json
    catalog_path: /path/to/catalog.json
    target_platform: bigquery
    convert_urns_to_lowercase: true
  • Similarly, for a BigQuery ingestion recipe:
source:
  type: bigquery
  config:
    project_id: your_project_id
    convert_urns_to_lowercase: true
  1. Impact:
    • Enabling this option ensures that all URNs are consistently lowercased, which can help avoid issues where the same entity might be ingested with different cases (e.g., Customer vs. customer).

References:

Would you like more detailed steps or further assistance on any specific part of this process?

Sources:

Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @ RunLLM if you still need further assistance !

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><