Troubleshooting Hive Source Classification Configuration Issues

Original Slack Thread

Hi Team,

I am unable to run Hive Source Classification. I've created the GitHub issue below with the details.
https://github.com/datahub-project/datahub/issues/10897

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to give its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To assist you effectively with your Hive Source Classification issue, could you please provide the following details?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery, MSSQL)

Additionally, any specific error messages or issues you’re encountering would be very helpful. Based on your GitHub issue, it seems like there might be a configuration or compatibility issue.

For context, similar issues have been discussed in the DataHub community, and errors during Hive ingestion often come down to configuration settings or permissions. For example, ensuring that hive.strict.checks.no.partition.filter is set to false and that hive.mapred.mode is not set to strict can resolve some of these issues (see GitHub issue #10897).

Would you like more detailed troubleshooting steps or specific examples based on your setup?

<@U06TM7M40RK> More details

Based on the details provided in the GitHub issue and the information available, it seems like the main problem is related to the configuration settings for partitioned tables in Hive. Specifically, the error message indicates that queries against partitioned tables without a partition filter are disabled for safety reasons.

Here are the steps to troubleshoot and resolve this issue:

1. Check Hive Configuration

Ensure that the Hive configuration settings hive.strict.checks.no.partition.filter and hive.mapred.mode are correctly set.

  • Set hive.strict.checks.no.partition.filter to false:
    This setting allows queries against partitioned tables without a partition filter.

  • Ensure hive.mapred.mode is not set to strict:
    Setting it to nonstrict permits operations that strict mode rejects, such as queries against partitioned tables that lack a partition predicate.
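If you want these settings to persist across sessions, the same properties can be added to hive-site.xml (a minimal sketch; the file's exact location depends on your Hive distribution):

<property>
  <name>hive.strict.checks.no.partition.filter</name>
  <value>false</value>
</property>
<property>
  <name>hive.mapred.mode</name>
  <value>nonstrict</value>
</property>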

2. Update Hive Configuration

You can update the Hive configuration settings either through the Hive configuration files or by setting them dynamically in your Hive session.

Example of Updating Hive Configuration in Hive Session:

SET hive.strict.checks.no.partition.filter=false;
SET hive.mapred.mode=nonstrict;
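To confirm the values took effect in the current session, issue SET with just the property name and Hive will print the current value:

SET hive.strict.checks.no.partition.filter;
SET hive.mapred.mode;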

3. Verify Permissions

Ensure that the user running the ingestion job has the necessary SELECT permissions on the dataset. The error message indicates a potential permission issue, so double-check that the user has the appropriate access.
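If SQL-standard based authorization is enabled on your cluster, you can inspect and grant access with HiveQL. A sketch using the service account and table from the recipe below:

SHOW GRANT USER svcdlcnifi ON TABLE aucdl.tcaef_auth_dtl_full;
GRANT SELECT ON TABLE aucdl.tcaef_auth_dtl_full TO USER svcdlcnifi;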

4. Review Ingestion Recipe

Make sure your ingestion recipe is correctly configured. Here is an example recipe for Hive ingestion with classification enabled:

source:
  type: hive
  config:
    platform_instance: OPR.OCE.DEV
    env: DEV
    host_port: 'host.visa.com:10000'
    username: svcdlcnifi
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
    classification:
      enabled: true
      sample_size: 1000
      classifiers:
        - type: datahub
          config:
            confidence_level_threshold: 0.7
    schema_pattern:
      allow:
        - "^aucdl$" # match the aucdl schema exactly
    table_pattern:
      allow:
        - '^aucdl\.tcaef_auth_dtl_full$' # match the aucdl.tcaef_auth_dtl_full table exactly (dot escaped so it is literal)
sink:
  type: datahub-rest
  config:
    server: '${DATAHUB_GMS_HOST}'
    token: '${DATAHUB_GMS_TOKEN}'
    max_threads: 1
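Before a full run, you can validate the recipe without committing metadata: the --dry-run flag skips writing to the sink, and --preview ingests only a small number of workunits for a quick look:

datahub ingest -c your_recipe.yml --dry-run
datahub ingest -c your_recipe.yml --preview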

5. Enable Debug Logs

Enable debug logs during the ingestion process to get more detailed error messages. This can help identify the root cause of the issue.

Example Command to Enable Debug Logs:

datahub --debug ingest -c your_recipe.yml
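Since debug output can be long, consider capturing it to a file for review:

datahub --debug ingest -c your_recipe.yml > ingest.log 2>&1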

Summary

  1. Update Hive Configuration: Set hive.strict.checks.no.partition.filter to false and ensure hive.mapred.mode is not set to strict.
  2. Verify Permissions: Ensure the user has SELECT permissions on the dataset.
  3. Review Ingestion Recipe: Make sure the recipe is correctly configured.
  4. Enable Debug Logs: Use debug logs to get more detailed error messages.

Would you like more detailed steps on any of these points or further assistance with a specific part of the process?

<@U06TM7M40RK> How can I get classification by passing partition filters?

To run classification on partitioned tables by passing partition filters, the queries that sample table data must include a partition filter. In principle, this can be expressed in the ingestion recipe, as sketched in the steps below.

Here’s a step-by-step guide to achieve this:

1. Update Hive Configuration

As a fallback for any sampling queries that still run without a partition filter, set hive.strict.checks.no.partition.filter to false and ensure hive.mapred.mode is not set to strict.

SET hive.strict.checks.no.partition.filter=false;
SET hive.mapred.mode=nonstrict;

2. Configure Ingestion Recipe with Partition Filters

One way to attempt this is through the options field in the Hive source configuration. Note that partition_filters is not a documented option of the DataHub Hive source, so treat the recipe below as a sketch and verify support in your connector version before relying on it.

Here is an example ingestion recipe that includes partition filters:

source:
  type: hive
  config:
    platform_instance: OPR.OCE.DEV
    env: DEV
    host_port: 'host.visa.com:10000'
    username: svcdlcnifi
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
      # NOTE: partition_filters is illustrative and not a documented option of
      # the DataHub Hive source; verify support in your connector version.
      partition_filters:
        - "date='2024-07-01'" # example partition filter
    classification:
      enabled: true
      sample_size: 1000
      classifiers:
        - type: datahub
          config:
            confidence_level_threshold: 0.7
    schema_pattern:
      allow:
        - "^aucdl$" # match the aucdl schema exactly
    table_pattern:
      allow:
        - '^aucdl\.tcaef_auth_dtl_full$' # match the aucdl.tcaef_auth_dtl_full table exactly
sink:
  type: datahub-rest
  config:
    server: '${DATAHUB_GMS_HOST}'
    token: '${DATAHUB_GMS_TOKEN}'
    max_threads: 1
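As a sanity check, you can run the equivalent sampling query manually in Hive to confirm the partition filter satisfies the strict checks (backticks around date since it may be treated as a reserved word; the LIMIT matches sample_size):

SELECT * FROM aucdl.tcaef_auth_dtl_full WHERE `date`='2024-07-01' LIMIT 1000;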

3. Run the Ingestion Job

Execute the ingestion job with the updated recipe. Make sure to enable debug logs to capture detailed information about the process.

datahub --debug ingest -c your_recipe.yml

4. Verify Classification Results

After the ingestion job completes, verify the classification results in DataHub to ensure that the correct partitions were processed.
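One way to spot-check from the command line is datahub get, which fetches aspects for a dataset URN; classification results appear as glossary terms on the schema fields. The URN below is an educated guess based on the recipe's platform_instance and env, so adjust it to match what you see in the UI:

datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:hive,OPR.OCE.DEV.aucdl.tcaef_auth_dtl_full,DEV)" -a schemaMetadata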

Summary

  1. Update Hive Configuration: Ensure hive.strict.checks.no.partition.filter is set to false and hive.mapred.mode is not set to strict.
  2. Configure Ingestion Recipe: Include partition filters in the ingestion recipe.
  3. Run the Ingestion Job: Execute the ingestion job with the updated recipe.
  4. Verify Results: Check the classification results in DataHub.

Would you like more detailed steps on any of these points or further assistance with a specific part of the process?
