Configurations for Ingesting Data and Profiling Specific Schemas in Databricks

hyejin.yoon · December 16, 2024, 12:02am

How do I ingest from Databricks? Please give me an example that sets include_usage_statistics to False.

user-2 · December 16, 2024, 12:03am

To configure DataHub to ingest data from Databricks and set include_usage_statistics to false, you need to modify your ingestion recipe accordingly. Here is an example configuration:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    include_usage_statistics: false  # Set this to false to disable usage statistics
    # other configurations

In this configuration:

workspace_url is the URL of your Databricks workspace.
token is your Databricks personal access token.
include_hive_metastore is set to true to include the Hive Metastore, but you can set it to false if not needed.
warehouse_id is required if include_hive_metastore is set to true.
include_usage_statistics is explicitly set to false to disable the ingestion of usage statistics.

Make sure to replace placeholders like 'https://your-databricks-workspace-url', 'your-databricks-token', and 'your-warehouse-id' with your actual Databricks details.
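If you drive ingestion from Python rather than a YAML file, the same recipe can be sketched as a dictionary. This is only a minimal sketch: `build_recipe` and the `DATABRICKS_TOKEN` environment variable are hypothetical names chosen for illustration, and the keys simply mirror the recipe above.

```python
import json
import os


def build_recipe(workspace_url: str, warehouse_id: str) -> dict:
    """Build the unity-catalog recipe shown above as a plain dict.

    build_recipe and DATABRICKS_TOKEN are illustrative names, not part of DataHub.
    """
    return {
        "source": {
            "type": "unity-catalog",
            "config": {
                "workspace_url": workspace_url,
                # Read the token from the environment instead of hard-coding it.
                "token": os.environ.get("DATABRICKS_TOKEN", "your-databricks-token"),
                "include_hive_metastore": True,
                "warehouse_id": warehouse_id,
                # A real boolean, not the string "False".
                "include_usage_statistics": False,
            },
        }
    }


recipe = build_recipe("https://your-databricks-workspace-url", "your-warehouse-id")

# Work on a copy and redact the token so it never lands in logs.
printable = json.loads(json.dumps(recipe))
printable["source"]["config"]["token"] = "***"
print(json.dumps(printable, indent=2))
```

A dictionary of this shape should be usable with `Pipeline.create(...)` from `datahub.ingestion.run.pipeline`; that step is left out here because it needs a live workspace.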

Would you like more detailed steps or further assistance on a specific part of the process?

Sources:

https://datahubspace.slack.com/archives/CUMUWQU66/p1719896965.332949
https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/databricks#config-details

user-1 · December 16, 2024, 12:03am

profiling:
  method: ge
  enabled: true
I only want to get schemas x and y in catalog z.

user-2 · December 16, 2024, 12:03am

To configure DataHub to ingest and profile only specific schemas (e.g., x and y) within a specific catalog (e.g., z), you can use the schema_pattern and catalog_pattern configuration options in your ingestion recipe. These patterns filter what is ingested, so profiling only runs on whatever they allow. Here’s how you can set it up:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    profiling:
      method: "ge"
      enabled: true
    catalog_pattern:
      allow:
        - '^z$'  # Only include catalog 'z'
    schema_pattern:
      allow:
        - '^x$'  # Only include schema 'x'
        - '^y$'  # Only include schema 'y'
    # other configurations

In this configuration:

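catalog_pattern is set to allow only the catalog named z.
schema_pattern is set to allow only the schemas named x and y. Depending on your DataHub version, this pattern may be matched against the qualified catalog.schema name rather than the bare schema name, in which case '^z\.x$' and '^z\.y$' would be needed; check the config details for your version.
The profiling section is configured to use the ge method and is enabled.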

Make sure to replace placeholders like 'https://your-databricks-workspace-url', 'your-databricks-token', and 'your-warehouse-id' with your actual Databricks details.
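The `^` and `$` anchors in the allow patterns matter. The sketch below is a simplified stand-in for an allow/deny filter, written with plain Python `re` rather than DataHub's own code; `is_allowed` is a hypothetical helper, and it assumes patterns are applied with `re.match`, which only anchors at the start of the name.

```python
import re
from typing import Sequence


def is_allowed(name: str, allow: Sequence[str], deny: Sequence[str] = ()) -> bool:
    """Simplified allow/deny check: deny wins, then any allow pattern must match.

    re.match only requires a match at the start of the name, so a pattern
    without a trailing `$` also accepts longer names with the same prefix.
    """
    if any(re.match(p, name) for p in deny):
        return False
    return any(re.match(p, name) for p in allow)


anchored = ["^x$", "^y$"]
unanchored = ["x", "y"]

# Compare how each pattern style treats a few hypothetical schema names.
for schema in ["x", "y", "x_staging", "yellow", "analytics"]:
    print(schema, is_allowed(schema, anchored), is_allowed(schema, unanchored))
```

With the anchored patterns only the exact names x and y pass; the unanchored versions would also let through names such as x_staging.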

Would you like more detailed steps or further assistance on a specific part of the process?

Sources:

Databricks Unity Catalog Configuration: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/unity/config.py

user-1 · December 16, 2024, 12:03am

will it impact other lineage?

user-1 · December 16, 2024, 12:03am

Will it impact other lineage? I only want to profile a limited set of schemas, but I want to get all lineage.

user-2 · December 16, 2024, 12:03am

It seems likely that restricting catalog_pattern and schema_pattern would impact lineage for other schemas, because these patterns filter what is ingested at all, not only what is profiled. Profiling and lineage are generally separate processes in DataHub, though, and you can configure them independently.

To ensure that you still capture all lineage information while limiting profiling to specific schemas, you can configure your ingestion recipe as follows:

Limit Profiling Only: Leave catalog_pattern and schema_pattern unset (or broad) and restrict profiling with the pattern option inside the profiling section, which appears to be matched against the catalog.schema.table name. Check the config details for your DataHub version.
Ensure Lineage is Enabled: Make sure that lineage-related configurations are set to include all necessary tables and views for lineage extraction.

Here’s an example configuration:

source:
  type: unity-catalog
  config:
    workspace_url: 'https://your-databricks-workspace-url'
    token: 'your-databricks-token'
    include_hive_metastore: true
    warehouse_id: 'your-warehouse-id'
    profiling:
      method: "ge"
      enabled: true
      pattern:
        allow:
          - '^z\.x\..*'  # Only profile tables in schema 'x' of catalog 'z'
          - '^z\.y\..*'  # Only profile tables in schema 'y' of catalog 'z'
    include_table_lineage: true  # Ensure lineage is enabled
    include_view_lineage: true   # Ensure view lineage is enabled
    # other configurations

In this setup:

Profiling is limited to tables in schemas x and y of catalog z.
No catalog or schema filter is applied to ingestion, so lineage extraction still covers all tables and views.
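To illustrate limiting profiling to schemas x and y in catalog z while every table stays in scope for ingestion and lineage, here is a small sketch of a regex applied to fully qualified names. The table names are made up, and the catalog.schema.table name format is an assumption about what a table-level profiling pattern is matched against; this is plain Python `re`, not DataHub code.

```python
import re

# Hypothetical fully qualified table names (catalog.schema.table).
tables = [
    "z.x.orders",
    "z.y.customers",
    "z.other.events",
    "zz.x.orders",
    "w.x.orders",
]

# One pattern covering schemas x and y in catalog z. The dots are escaped so
# they match a literal '.' rather than any character.
profile_allow = re.compile(r"^z\.(x|y)\..*")

# Every table remains in the ingestion set; only the matches get profiled.
to_profile = [t for t in tables if profile_allow.match(t)]
skipped = [t for t in tables if not profile_allow.match(t)]

print("profiled:", to_profile)
print("ingested but not profiled:", skipped)
```

Escaping the dots and anchoring on the catalog name keeps lookalikes such as zz.x.orders out of the profiled set.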

Would you like more detailed steps or further assistance on a specific part of the process?

Sources:

Databricks Unity Catalog Configuration: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/unity/config.py

datahub_team · December 16, 2024, 12:03am

Hi! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look.

Hope this helps! ><
