Troubleshooting Snowflake Classification Issue in DataHub Version 0.13.1.2

Original Slack Thread

My classification for Snowflake is not working on version 0.13.1.2. I tried checking the logs in debug mode, but nothing appears except 'num_tables_classification_failed': 11. How can I investigate further what the problem is?

<@U06TM7M40RK> Can you help with the questions in this thread?

To troubleshoot the issue with classification for Snowflake in DataHub version 0.13.1.2, you can follow these steps:

  1. Verify Configuration: Ensure that your classification configuration in the ingestion recipe is correctly set up. The configuration should look something like this:

    source:
      type: snowflake
      config:
        account_id: <your-account-id>
        username: <your-username>
        password: <your-password>
        role: <your-role>
        warehouse: <your-warehouse>
        include_tables: true
        include_views: true
        profiling:
          enabled: true
        classification:
          enabled: true
          classifiers:
            - type: datahub
              config:
                confidence_level_threshold: 0.7
    

    Make sure that the classification section, including its classifiers list, is nested under source.config (at the same level as profiling in the recipe above), not directly under source.

  2. Check Privileges: Ensure that the Snowflake user has the necessary privileges. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/snowflake/snowflake_pre.md|Snowflake Configuration in DataHub>, the user needs SELECT privileges on the tables for classification to work.

    grant select on all tables in database "<your-database>" to role datahub_role;
    grant select on future tables in database "<your-database>" to role datahub_role;
    
  3. Inspect Debug Logs: Run the ingestion with debug logs enabled to get more detailed information. You can enable debug logs by running the ingestion command with the --debug flag:

    datahub --debug ingest -c <your-recipe-file>.yaml
    

    This will generate detailed logs that can help identify the issue. Look for any specific error messages related to classification.

  4. Check for Null Values: Classification might fail if there are too many null values in the columns. As mentioned in <Slack Discussion>, ensure that the columns have sufficient non-null values for classification.

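  5. Update Classification Library: Ensure that a compatible version of the classification library (acryl-datahub-classify) is installed. If necessary, you can install a specific version manually. The command below pins one version, which may not be the latest release: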

    pip install acryl-datahub-classify==0.0.6
    
  6. Verify Network Connectivity: If you encounter SSL certificate errors, ensure that your environment can access external resources required by the classification library. You can test this by running a curl command from within the container:

    docker exec -it datahub-actions bash
    curl https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json
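
The nesting check from step 1 can be sketched in Python. This is a minimal, hypothetical helper (check_classification_config is not part of DataHub) that assumes the recipe has already been parsed into a dict and that it follows the layout shown in step 1. It only inspects structure and does not validate the recipe against DataHub's own config models.

```python
from typing import Any


def check_classification_config(recipe: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a parsed ingestion recipe.

    Hypothetical helper: it only checks the nesting shown in step 1.
    """
    problems: list[str] = []

    source = recipe.get("source")
    if not isinstance(source, dict):
        return ["recipe has no 'source' mapping"]

    config = source.get("config")
    if not isinstance(config, dict):
        return ["'source' has no 'config' mapping"]

    # Common indentation mistake: 'classification' beside 'config', not inside it.
    if "classification" in source:
        problems.append(
            "'classification' is under 'source'; move it under 'source.config'"
        )

    classification = config.get("classification")
    if not isinstance(classification, dict):
        problems.append("'source.config.classification' is missing")
        return problems

    if classification.get("enabled") is not True:
        problems.append("'source.config.classification.enabled' is not true")

    return problems
```

To run it against a real recipe, the file can be loaded with PyYAML (yaml.safe_load) and the resulting dict passed to the function. An empty list means the nesting matches the recipe in step 1. It does not mean that classification will succeed.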
    

Would you like more detailed steps or further clarification on any specific part of the process?

Can you share your logs with us?
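
Before sharing, it can help to cut a debug log down to the lines that mention classification and to pull out the failure counter quoted in the original question. The sketch below assumes the debug output was saved to a text file. The file name ingest.log and both function names are hypothetical, and the counter pattern is based only on the 'num_tables_classification_failed': 11 text quoted above.

```python
import re
from typing import Iterable, Iterator, Optional

# Matches "classification", "classifier", "classifying", and similar words.
_CLASSIFICATION = re.compile(r"classif", re.IGNORECASE)

# Matches the counter as quoted in the thread, e.g. 'num_tables_classification_failed': 11
_FAILED_COUNT = re.compile(r"num_tables_classification_failed['\"]?\s*:\s*(\d+)")


def classification_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that mention classification."""
    for line in lines:
        if _CLASSIFICATION.search(line):
            yield line.rstrip("\n")


def failed_table_count(text: str) -> Optional[int]:
    """Return the reported number of failed tables, or None if it is absent."""
    match = _FAILED_COUNT.search(text)
    return int(match.group(1)) if match else None


# Example usage (assumes the debug output was saved to a file named ingest.log):
#
#     with open("ingest.log", encoding="utf-8") as f:
#         for line in classification_lines(f):
#             print(line)
```

If the filter returns nothing beyond the counter itself, that is worth stating when sharing the logs, since it suggests the failures are not being logged at the current log level.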