Troubleshooting SQL View Profiling in Snowflake Datahub

Original Slack Thread

Hello everyone,

I am trying to use SQL view in Snowflake and ingest sample data in Datahub. Column names was successfully extracted, but I still don’t see sample data in Datahub.

There is my config below:

    type: snowflake
    config:
        account_id: <account_id>
        include_views: true
        profiling:
            enabled: true
            sample_size: 10
        warehouse: <warehouse>
        username: <username>
        password: <password>
        enable_stateful_profiling: false
        role: ACCOUNTADMIN
        database_pattern:
            allow:
                - '<pattern>*'
        schema_pattern:
            allow:
                - '<pattern>*'```
I am using UI and Datahub version is `0.13.1.3`.

Thanks !

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To better assist you, could you please provide the following information?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)

This information will help in diagnosing the issue more effectively.

Additionally, based on your provided configuration, it seems like you have enabled profiling and included views, but you might be missing some specific configurations for profiling views. According to <Slack Slack thread>, profiling for views might require some code modifications in the SnowflakeProfiler class.

Would you like more detailed guidance on how to modify the code to enable profiling for views?

Sources:

  • <Slack Thread on Profiling Views> :+1: 0 button :-1: 0 button :bulb: Hint: Mention @runllm in the thread for followups.

Any ideas ?

<@U06TM7M40RK>

Given your configuration and the issue you’re facing, it seems like the profiling for views might not be working as expected. Here are a few steps you can take to troubleshoot and potentially resolve the issue:

  1. Check DataHub Version: Ensure you are using a compatible version of DataHub that supports the features you need. Profiling for views might have specific requirements or bugs fixed in later versions.

  2. Enable Debug Logging: Enable debug logging to get more detailed insights into what might be going wrong during the ingestion process. This can help identify if there are any errors or warnings related to profiling views.

  3. Verify Profiling Configuration: Double-check your profiling configuration to ensure it is correctly set up. For instance, make sure profiling.enabled is set to true and sample_size is appropriately configured.

  4. Custom Modifications: As mentioned earlier, profiling for views might require custom modifications in the SnowflakeProfiler class. If you are comfortable with code changes, you can modify the ingestion framework to handle profiling for views specifically.

Here is a sample configuration snippet based on your provided configuration:

source:
    type: snowflake
    config:
        account_id: <account_id>
        include_views: true
        profiling:
            enabled: true
            sample_size: 10
        warehouse: <warehouse>
        username: <username>
        password: <password>
        enable_stateful_profiling: false
        role: ACCOUNTADMIN
        database_pattern:
            allow:
                - '<pattern>*'
        schema_pattern:
            allow:
                - '<pattern>*'

If these steps do not resolve the issue, you might want to reach out to the DataHub community on Slack for more specific guidance or check the <https://datahubproject.io/docs/0.13.1/components|DataHub documentation> for any updates or additional configuration options.

Would you like more detailed guidance on how to modify the SnowflakeProfiler class to enable profiling for views?

Sources:

<@U06TM7M40RK> Yes I want. I want sample data {stats} to be in the UI… It works in case of tables, but it doesn’t work in case of views…

To enable profiling for Snowflake views and have the sample data appear in the DataHub UI, you will need to modify the SnowflakeProfiler class. Here are the steps to achieve this:

  1. Locate the SnowflakeProfiler Class: The SnowflakeProfiler class is responsible for handling profiling in Snowflake. You can find it in the snowflake_profiler.py file.

  2. Modify the Profiling Logic: Update the logic in the SnowflakeProfiler class to include views in the profiling process.

  3. Rebuild and Deploy: After making the necessary changes, you will need to rebuild and deploy the ingestion framework.

Step-by-Step Guide

Step 1: Locate the SnowflakeProfiler Class
The SnowflakeProfiler class can be found in the following file:

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_profiler.py

Step 2: Modify the Profiling Logic
Edit the SnowflakeProfiler class to ensure it includes views in the profiling process. Here is an example modification:

class SnowflakeProfiler(GenericProfiler, SnowflakeCommonMixin):
    def __init__(
        self,
        config: SnowflakeV2Config,
        report: SnowflakeV2Report,
        state_handler: Optional[ProfilingHandler] = None,
    ) -> None:
        super().__init__(config, report, self.platform, state_handler)
        self.config: SnowflakeV2Config = config
        self.report: SnowflakeV2Report = report
        self.logger = logger
        self.database_default_schema: Dict[str, str] = dict()

    def get_workunits(
        self, database: SnowflakeDatabase, db_tables: Dict[str, List[SnowflakeTable]]
    ) -> Iterable[MetadataWorkUnit]:
        if self.config.is_profiling_enabled():
            self.config.options.setdefault(
                "max_overflow", self.config.profiling.max_workers
            )

        if PUBLIC_SCHEMA not in db_tables:
            self.database_default_schema[database.name] = list(db_tables.keys())[0]

        profile_requests = []
        for schema in database.schemas:
            for table in db_tables[schema.name]:
                # Include views in profiling
                if (
                    not self.config.profiling.profile_external_tables
                    and table.type == "EXTERNAL TABLE"
                ):
                    logger.info(
                        f"Skipping profiling of external table {database.name}.{schema.name}.{table.name}"
                    )
                    self.report.profiling_skipped_other[schema.name] += 1
                    continue

                # Add condition to include views
                if table.type == "VIEW" or table.type == "BASE TABLE":
                    profile_request = self.get_profile_request(
                        table, schema.name, database.name
                    )
                    if profile_request is not None:
                        self.report.report_entity_profiled(profile_request.pretty_name)

                        profile_requests.append(profile_request)

        if len(profile_requests) == 0:
            return

        yield from self.generate_profile_workunits(
            profile_requests,
            max_workers=self.config.profiling.max_workers,
            db_name=database.name,
            platform=self.platform,
            profiler_args=self.get_profile_args(),
        )

Step 3: Rebuild and Deploy
After making the changes, you will need to rebuild the ingestion framework and deploy it. Follow the standard build and deployment procedures for your environment.

Additional Configuration
Ensure your configuration file includes the necessary parameters to enable profiling for views:

source:
    type: snowflake
    config:
        account_id: &lt;account_id&gt;
        include_views: true
        profiling:
            enabled: true
            sample_size: 10
        warehouse: &lt;warehouse&gt;
        username: &lt;username&gt;
        password: &lt;password&gt;
        enable_stateful_profiling: false
        role: ACCOUNTADMIN
        database_pattern:
            allow:
                - '&lt;pattern&gt;*'
        schema_pattern:
            allow:
                - '&lt;pattern&gt;*'

By following these steps, you should be able to enable profiling for views in Snowflake and have the sample data appear in the DataHub UI.

Would you like more detailed guidance on any specific part of this process?

Sources:

No, I don’t want to write code. I want to configure all the stuff from Ingestion.

To configure profiling for Snowflake views without modifying the code, you need to ensure that your ingestion configuration is correctly set up. Based on the provided data sources, here are the steps you should follow:

  1. Ensure Profiling is Enabled: Make sure that profiling is enabled in your configuration.
  2. Include Views in Profiling: Ensure that views are included in the profiling process.
  3. Set Sample Size: Define the sample size for profiling.
  4. Verify Permissions: Ensure that the Snowflake user has the necessary permissions to profile views.

Configuration Steps

  1. Ensure Profiling is Enabled:
    Your configuration should have the profiling.enabled set to true and sample_size defined.

  2. Include Views in Profiling:
    Make sure the include_views parameter is set to true in your configuration.

  3. Set Sample Size:
    Define the sample size for the profiling process.

  4. Verify Permissions:
    Ensure that the Snowflake user has the necessary permissions to access and profile views.

Here is an example configuration based on your requirements:

source:
    type: snowflake
    config:
        account_id: <account_id>
        include_views: true
        profiling:
            enabled: true
            sample_size: 10
        warehouse: <warehouse>
        username: <username>
        password: <password>
        enable_stateful_profiling: false
        role: ACCOUNTADMIN
        database_pattern:
            allow:
                - '<pattern>*'
        schema_pattern:
            allow:
                - '<pattern>*'

Detailed Steps

  1. Create a ConfigMap:
    Store your ingestion recipe in a Kubernetes ConfigMap.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: recipe-config
    data:
      snowflake_recipe.yml: |-
        source:
          type: snowflake
          config:
            account_id: <account_id>
            include_views: true
            profiling:
                enabled: true
                sample_size: 10
            warehouse: <warehouse>
            username: <username>
            password: <password>
            enable_stateful_profiling: false
            role: ACCOUNTADMIN
            database_pattern:
                allow:
                    - '<pattern>*'
            schema_pattern:
                allow:
                    - '<pattern>*'
    
  2. Configure values.yaml:
    Enable the datahub-ingestion-cron subchart and reference your ConfigMap and recipe file.

    datahub-ingestion-cron:
      enabled: true
      crons:
        snowflake:
          schedule: "0 * * * *" # Every hour
          recipe:
            configmapName: recipe-config
            fileName: snowflake_recipe.yml
    
  3. Inject Environment Variables:
    Use environment variables to securely inject credentials.

    apiVersion: v1
    kind: Secret
    metadata:
      name: datahub-secrets
    type: Opaque
    data:
    

     SNOWFLAKE_PASSWORD: &lt;base64_encoded_password&gt;

   Reference the secret in your deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
name: datahub-ingestion
spec:
template:
spec:
containers:
- name: datahub-ingestion
envFrom:
- secretRef:
name: datahub-secrets


4. **Deploy the Helm Chart**:
Deploy the Helm chart to schedule the ingestion.

helm upgrade --install datahub datahub/datahub -f values.yaml


By following these steps, you should be able to configure profiling for Snowflake views and have the sample data appear in the DataHub UI without modifying the code.

Would you like more detailed steps or have any specific questions about this process?

Sources:
- [Slack Thread on Profiling Views](<https://datahubspace.slack.com/archives/CUMUWQU66/p1708696303.163829>)
- [Using Kubernetes for Scheduling Ingestion](<https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/schedule_docs/kubernetes.md>)
- [Snowflake Configuration Guide](<https://github.com/datahub-project/datahub/blob/master/docs/quick-ingestion-guides/snowflake/configuration.md>) :+1: 0 button :-1: 0 button :bulb: *Hint:* Mention @runllm in the thread for followups.

<@U06TM7M40RK> Can I ingest sample data from SQL view or this is just for tables ?

Based on the provided data sources, it appears that profiling and ingestion of sample data for SQL views in Snowflake might not be fully supported by default in DataHub. Profiling and sample data ingestion are generally more straightforward for tables.

Profiling SQL Views

From the Slack conversations:

  • Profiling for Views: There are indications that profiling for views might not be fully supported out-of-the-box. For example, in <Slack Slack thread>, a user mentions that profiling for views is not available by default and requires custom profiling.

Configuration for Profiling Tables and Views

While the configuration you provided is correct for enabling profiling, it is primarily designed for tables. Here is a summary of the configuration for profiling:

source:
    type: snowflake
    config:
        account_id: <account_id>
        include_views: true
        profiling:
            enabled: true
            sample_size: 10
        warehouse: <warehouse>
        username: <username>
        password: <password>
        enable_stateful_profiling: false
        role: ACCOUNTADMIN
        database_pattern:
            allow:
                - '<pattern>*'
        schema_pattern:
            allow:
                - '<pattern>*'

Steps to Verify and Enable Profiling for Views

  1. Ensure Permissions: Make sure that the Snowflake user has the necessary permissions to access and profile views. This includes SELECT permissions on the views.

  2. Check DataHub Version: Ensure you are using a version of DataHub that supports the features you need. Profiling for views might have specific requirements or bugs fixed in later versions.

  3. Enable Debug Logging: Enable debug logging to get more detailed insights into what might be going wrong during the ingestion process. This can help identify if there are any errors or warnings related to profiling views.

Conclusion

As per the available data sources, it seems that profiling for views might not be fully supported by default in DataHub. If you need to profile views, you might need to implement custom profiling logic or reach out to the DataHub community for more specific guidance.

Would you like more detailed guidance on how to implement custom profiling for views, or do you have any other questions?

Sources: