Configuring Schema, Table, and View Patterns in DataHub SQLAlchemy Ingestion Pipeline

Original Slack Thread

<@U06TM7M40RK>

here in ingestion pipeline

  type: sqlalchemy
  config:
    connect_uri: '<sybase+pyodbc://KDW_DATA_GOV>:${App_01298_SybaseIQ_EDW_prod_password}@SYBASE_PROD'
    include_views: true
    schema_pattern:
      allow:
        - INTR
        - CDR
        - DRIMRT
        - REL
        - AUDIT
    table_pattern:
      allow:
        - 'INTR.*'
        - 'CDR.*'
        - 'DRIMRT.*'
        - 'REL.*'
        - 'AUDIT.*'
    view_pattern:
      allow:
        - 'INTR.*'
        - 'CDR.*'
        - 'DRIMRT.*'
        - 'REL.*'
        - 'AUDIT.*'
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    platform: Sybase
deployment:
  name: "App_01298_SybaseIQ_EDW_Part4"
  schedule: "15 8 * * MON" # At 08:15 AM, only on Monday
  time_zone: "Asia/Kolkata"
transformers:
  -
    type: simple_add_dataset_properties
    config:
      replace_existing: true
      properties:
        app_id: APP-01298
        app_name: EDW
  -
    type: simple_add_dataset_tags
    config:
      replace_existing: false
      semantics: patch
      tag_urns:
        - 'urn:li:tag:APP-01298'
        - 'urn:li:tag:EDW'
  -
    type: simple_add_dataset_ownership
    config:
      replace_existing: true
      semantics: patch
      owner_urns:
        - 'urn:li:corpuser:kailash.thadani@kotak.com'
      ownership_type: 'urn:li:ownershipType:7059d03e-a15e-4cb3-aab7-ab5dd870411e'
  -
    type: simple_add_dataset_ownership
    config:
      replace_existing: false
      semantics: patch
      owner_urns:
        - 'urn:li:corpuser:yogesh.atkari@kotak.com'
      ownership_type: 'urn:li:ownershipType:__system__technical_owner'```
I wan to deny the above schema patterns

To deny specific schema patterns in your DataHub ingestion pipeline, you can modify the schema_pattern configuration by adding the deny field with the desired patterns. According to the <SQLAlchemy | DataHub documentation for SQLAlchemy ingestion>, you can specify regex patterns to exclude certain schemas from ingestion.

Here’s how you can update your configuration to deny the specified schema patterns:

source:
  type: sqlalchemy
  config:
    connect_uri: '<sybase+pyodbc://KDW_DATA_GOV>:${App_01298_SybaseIQ_EDW_prod_password}@SYBASE_PROD'
    include_views: true
    schema_pattern:
      deny:
        - INTR
        - CDR
        - DRIMRT
        - REL
        - AUDIT
    table_pattern:
      allow:
        - 'INTR.*'
        - 'CDR.*'
        - 'DRIMRT.*'
        - 'REL.*'
        - 'AUDIT.*'
    view_pattern:
      allow:
        - 'INTR.*'
        - 'CDR.*'
        - 'DRIMRT.*'
        - 'REL.*'
        - 'AUDIT.*'
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    platform: Sybase
deployment:
  name: "App_01298_SybaseIQ_EDW_Part4"
  schedule: "15 8 * * MON" # At 08:15 AM, only on Monday
  time_zone: "Asia/Kolkata"
transformers:
  -
    type: simple_add_dataset_properties
    config:
      replace_existing: true
      properties:
        app_id: APP-01298
        app_name: EDW
  -
    type: simple_add_dataset_tags
    config:
      replace_existing: false
      semantics: patch
      tag_urns:
        - 'urn:li:tag:APP-01298'
        - 'urn:li:tag:EDW'
  -
    type: simple_add_dataset_ownership
    config:
      replace_existing: true
      semantics: patch
      owner_urns:
        - 'urn:li:corpuser:kailash.thadani@kotak.com'
      ownership_type: 'urn:li:ownershipType:7059d03e-a15e-4cb3-aab7-ab5dd870411e'
  -
    type: simple_add_dataset_ownership
    config:
      replace_existing: false
      semantics: patch
      owner_urns:
        - 'urn:li:corpuser:yogesh.atkari@kotak.com'
      ownership_type: 'urn:li:ownershipType:__system__technical_owner'

This configuration will exclude schemas named INTR, CDR, DRIMRT, REL, and AUDIT from being ingested. Would you like more details on any specific part of this configuration?

Sources:

  • <SQLAlchemy | DataHub SQLAlchemy Ingestion Configuration> :+1: 0 button :-1: 0 button :bulb: Hint: Mention @runllm in the thread for followups.

<@U06TM7M40RK> and what to put in table pattern and view pattern

To configure the table_pattern and view_pattern in your DataHub ingestion pipeline, you can use regex patterns to specify which tables and views to include or exclude. According to the <SQLAlchemy | DataHub documentation for SQLAlchemy ingestion>, you can use the allow and deny fields within these patterns to control the inclusion and exclusion of tables and views.

If you want to deny specific tables and views based on the schema patterns you mentioned earlier, you can set the deny field in both table_pattern and view_pattern to match the tables and views within those schemas. Here’s how you can update your configuration:

source:
  type: sqlalchemy
  config:
    connect_uri: '<sybase+pyodbc://KDW_DATA_GOV>:${App_01298_SybaseIQ_EDW_prod_password}@SYBASE_PROD'
    include_views: true
    schema_pattern:
      deny:
        - INTR
        - CDR
        - DRIMRT
        - REL
        - AUDIT
    table_pattern:
      deny:
        - 'INTR.*'
        - 'CDR.*'
        - 'DRIMRT.*'
        - 'REL.*'
        - 'AUDIT.*'
    view_pattern:
      deny:
        - 'INTR.*'
        - 'CDR.*'
        - 'DRIMRT.*'
        - 'REL.*'
        - 'AUDIT.*'
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    platform: Sybase
deployment:
  name: "App_01298_SybaseIQ_EDW_Part4"
  schedule: "15 8 * * MON" # At 08:15 AM, only on Monday
  time_zone: "Asia/Kolkata"
transformers:
  -
    type: simple_add_dataset_properties
    config:
      replace_existing: true
      properties:
        app_id: APP-01298
        app_name: EDW
  -
    type: simple_add_dataset_tags
    config:
      replace_existing: false
      semantics: patch
      tag_urns:
        - 'urn:li:tag:APP-01298'
        - 'urn:li:tag:EDW'
  -
    type: simple_add_dataset_ownership
    config:
      replace_existing: true
      semantics: patch
      owner_urns:
        - 'urn:li:corpuser:kailash.thadani@kotak.com'
      ownership_type: 'urn:li:ownershipType:7059d03e-a15e-4cb3-aab7-ab5dd870411e'
  -
    type: simple_add_dataset_ownership
    config:
      replace_existing: false
      semantics: patch
      owner_urns:
        - 'urn:li:corpuser:yogesh.atkari@kotak.com'
      ownership_type: 'urn:li:ownershipType:__system__technical_owner'

In this configuration:

  • table_pattern.deny and view_pattern.deny are set to exclude any tables and views that start with INTR, CDR, DRIMRT, REL, and AUDIT.

Would you like more details on any specific part of this configuration?

Sources:

<@U06TM7M40RK> schema_pattern, table_pattern, view_pattern

To configure schema_pattern, table_pattern, and view_pattern in your DataHub ingestion pipeline, you can use regex patterns to specify which schemas, tables, and views to include or exclude. These patterns are part of the configuration that allows you to filter the data being ingested based on your requirements.

schema_pattern

  • Purpose: To filter which schemas are included or excluded during ingestion.
  • Configuration: You can use allow and deny fields to specify regex patterns.
  • Example:
    schema_pattern:
      allow:
        - 'analytics'
      deny:
        - 'test'
    

table_pattern

  • Purpose: To filter which tables are included or excluded during ingestion.
  • Configuration: Similar to schema_pattern, use allow and deny fields.
  • Example:
    table_pattern:
      allow:
        - 'analytics.*'
      deny:
        - 'analytics.temp.*'
    

view_pattern

  • Purpose: To filter which views are included or excluded during ingestion.
  • Configuration: Defaults to table_pattern if not specified. Use allow and deny fields.
  • Example:
    view_pattern:
      allow:
        - 'analytics.*'
      deny:
        - 'analytics.temp.*'
    

Example Configuration
Here’s how you can configure these patterns in your ingestion pipeline:

source:
  type: sqlalchemy
  config:
    connect_uri: '<your_connection_uri>'
    include_views: true
    schema_pattern:
      allow:
        - 'analytics'
      deny:
        - 'test'
    table_pattern:
      allow:
        - 'analytics.*'
      deny:
        - 'analytics.temp.*'
    view_pattern:
      allow:
        - 'analytics.*'
      deny:
        - 'analytics.temp.*'

This configuration will:

  • Include only the analytics schema and exclude the test schema.
  • Include tables and views that start with analytics but exclude those that start with analytics.temp.

Would you like more details on any specific part of this configuration?

Sources:

  • <SQLAlchemy | DataHub SQLAlchemy Ingestion Configuration> :+1: 0 button :-1: 0 button :bulb: Hint: Mention @runllm in the thread for followups.

Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @ RunLLM if you still need further assistance !

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><