Configurations for Ingesting Data and Profiling Specific Schemas in Databricks

Original Slack Thread

How do I ingest from Databricks? Please give me an example that sets include_usage_statistics to false.
<@U06TM7M40RK>

To configure DataHub to ingest data from Databricks and set include_usage_statistics to false, you need to modify your ingestion recipe accordingly. Here is an example configuration:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    include_usage_statistics: false  # Set this to false to disable usage statistics
    # other configurations

In this configuration:

  • workspace_url is the URL of your Databricks workspace.
  • token is your Databricks personal access token.
  • include_hive_metastore is set to true to include the Hive Metastore, but you can set it to false if not needed.
  • warehouse_id is required if include_hive_metastore is set to true.
  • include_usage_statistics is explicitly set to false to disable the ingestion of usage statistics.

Make sure to replace placeholders like 'https://your-databricks-workspace-url', 'your-databricks-token', and 'your-warehouse-id' with your actual Databricks details.
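
For completeness, a full recipe also needs a sink. The following is a minimal sketch, assuming a DataHub instance reachable at http://localhost:8080 and a token exported as the DATABRICKS_TOKEN environment variable; both are illustrative assumptions, not values from this thread:

```yaml
# recipe.yml (hypothetical file name)
source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    # Read the secret from an environment variable instead of hardcoding it
    token: '${DATABRICKS_TOKEN}'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    include_usage_statistics: false  # Disable usage statistics ingestion

sink:
  type: datahub-rest
  config:
    server: 'http://localhost:8080'  # Assumed DataHub GMS address
```

A recipe like this is typically run with datahub ingest -c recipe.yml.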

Would you like more detailed steps or further assistance on a specific part of the process?

profiling:
  method: ge
  enabled: true
I want to get only schemas x and y in catalog z <@U06TM7M40RK>

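To restrict ingestion (and therefore profiling) to specific schemas (e.g., x and y) within a specific catalog (e.g., z), you can use the catalog_pattern and schema_pattern configuration options in your ingestion recipe. Here’s how you can set it up:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    profiling:
      method: "ge"
      enabled: true
    catalog_pattern:
      allow:
        - '^z$'  # Only include catalog 'z'
    schema_pattern:
      allow:
        # Assumes schemas are matched by qualified name (catalog.schema)
        - '^z\.x$'  # Only include schema 'x' in catalog 'z'
        - '^z\.y$'  # Only include schema 'y' in catalog 'z'
    # other configurations

In this configuration:

  • catalog_pattern is set to allow only the catalog named z.
  • schema_pattern is set to allow only the schemas named x and y. The unity-catalog source appears to match this pattern against the qualified schema name (catalog.schema) rather than the bare name, so the patterns include the catalog prefix; check the source documentation for your DataHub version.
  • The profiling section is configured to use the ge method and is enabled.
  • Because these patterns filter ingestion as a whole, nothing outside z.x and z.y is ingested or profiled.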

Make sure to replace placeholders like 'https://your-databricks-workspace-url', 'your-databricks-token', and 'your-warehouse-id' with your actual Databricks details.
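
If you would rather keep everything in catalog z and exclude just a few schemas, a deny list can be used instead. This is a rough sketch: the schema name scratch is hypothetical, and it assumes schema patterns are matched against the qualified catalog.schema name, which is worth verifying for your DataHub version:

```yaml
source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    warehouse_id: 'your-warehouse-id'
    catalog_pattern:
      allow:
        - '^z$'  # Only include catalog 'z'
    schema_pattern:
      deny:
        - '^z\.scratch$'  # Hypothetical schema to exclude from catalog 'z'
```

When a name matches both an allow rule and a deny rule, the deny rule wins.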

<@U06TM7M40RK> Will it impact other lineage? I only want to profile a limited set of schemas, but I want to get all lineage.

It seems likely that it will. catalog_pattern and schema_pattern filter ingestion as a whole, not just profiling, so restricting them to schemas x and y in catalog z also drops the other catalogs and schemas from metadata extraction, and lineage involving tables outside those schemas will probably be incomplete. Profiling and lineage can be configured independently in DataHub, but not through these two patterns.

To capture all lineage information while limiting profiling to specific schemas, you can configure your ingestion recipe as follows:

  1. Leave Ingestion Unfiltered: Remove the catalog_pattern and schema_pattern restrictions so that all tables and views are ingested.
  2. Limit Profiling Only: Use the pattern option under profiling, which the unity-catalog source documents as matching table names in catalog.schema.table format.
  3. Ensure Lineage is Enabled: Make sure that lineage-related configurations are set to include all necessary tables and views for lineage extraction.

Here’s an example configuration:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    profiling:
      method: "ge"
      enabled: true
      pattern:
        allow:
          - '^z\.x\..*'  # Only profile tables in schema 'x' of catalog 'z'
          - '^z\.y\..*'  # Only profile tables in schema 'y' of catalog 'z'
    include_table_lineage: true  # Ensure lineage is enabled
    include_view_lineage: true   # Ensure view lineage is enabled
    # other configurations

In this setup:

  • Ingestion is not restricted by catalog_pattern or schema_pattern, so lineage is extracted across all catalogs and schemas.
  • Profiling is limited to tables in schemas x and y of catalog z through profiling.pattern. Verify that this option is available in your DataHub version before relying on it.
