Troubleshooting SQL Parsing Errors in Metadata Ingestion from Databricks in DataHub

Original Slack Thread

<@U06TM7M40RK>
I'm using DataHub to ingest metadata from Databricks. I have many queries for analysing the data, and some of them are malformed, but DataHub uses them to analyse table usage. I got this error:

```
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/15 07:07:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
```

How can I configure it as a warning, or ignore it completely? This is blocking me from having a successful ingestion.

To address the issue of SQL parsing errors during metadata ingestion from Databricks in DataHub, you can configure the system to treat these errors as warnings instead of failures. This will prevent the ingestion process from being blocked by such errors.

  1. Mark SQL Parsing Failures as Warnings: According to a <https://forum.datahubproject.io/t/troubleshooting-metabase-ingestion-issues-in-datahub-and-suggestions-for-improvement/632|discussion on the DataHub forum>, you can modify the configuration to mark SQL parsing failures as warnings. This change will allow the ingestion process to continue even if some queries cannot be parsed.

  2. Adjust Logging Level: You can set the default log level to “WARN” to reduce the verbosity of logs and focus on warnings. This can be done by configuring the logging settings in your environment or script. For Spark, you can use the following command to set the log level:

    sc.setLogLevel("WARN")
    
  3. Ignore Specific Errors: If you want to completely ignore specific parsing errors, you might need to customize the ingestion logic or use a pre-processing step to filter out problematic queries before they are ingested.
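
If you control how queries reach DataHub, for example in a custom ingestion script (the stock connector does not expose such a pre-filter), a minimal sketch of option 3 could look like the following. It uses sqlglot, the parser DataHub relies on internally; `filter_parseable` is a hypothetical helper name, not a DataHub API:

```
import sqlglot
from sqlglot.errors import ParseError

def filter_parseable(queries, dialect="databricks"):
    """Return only the queries sqlglot can parse; skip and log the rest."""
    parseable = []
    for query in queries:
        try:
            sqlglot.parse(query, dialect=dialect)
            parseable.append(query)
        except ParseError as e:
            # Drop drafts and malformed statements instead of failing the run
            print(f"Skipping unparseable query: {query[:60]!r} ({e})")
    return parseable
```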

Would you like more detailed steps on how to implement these changes, or do you need further assistance with specific configurations?


<@U06TM7M40RK> I ran through this discussion but still don't know what to do next :slightly_smiling_face:

  1. Mark SQL Parsing Failures as Warnings: According to a <https://forum.datahubproject.io/t/troubleshooting-metabase-ingestion-issues-in-datahub-and-suggestions-for-improvement/632|discussion on the DataHub forum>, you can modify the configuration to mark SQL parsing failures as warnings. This change will allow the ingestion process to continue even if some queries cannot be parsed.

To configure DataHub to treat SQL parsing failures as warnings instead of errors, you can follow these steps:

  1. Update DataHub Configuration: You need to modify the DataHub ingestion configuration to handle SQL parsing errors as warnings. This involves changing the error handling logic in the ingestion source code or configuration files.

  2. Modify the Ingestion Source Code: If you have access to the source code, you can update the logic where SQL parsing errors are handled. For example, in the DataHub ingestion source code, you can change the error reporting from a failure to a warning. This is similar to the change made in the <https://github.com/datahub-project/datahub/pull/10924|DataHub GitHub pull request> where SQL parsing issues were reduced in severity from failures to warnings.

  3. Use a Custom Docker Image: If you are deploying DataHub using Docker, you might need to create a custom Docker image with the updated source code. This involves building a Docker image with the necessary changes and using it in your deployment.

  4. Deploy the Updated Configuration: Once you have made the necessary changes, deploy the updated configuration or Docker image to your DataHub environment.

Here is a general example of how you might modify the source code to report warnings instead of failures:

```
# Example: downgrade SQL parsing failures to warnings
import logging

from sqllineage.exceptions import SQLLineageException

logger = logging.getLogger(__name__)

try:
    # Attempt to parse the SQL query (parse_sql_query is a placeholder)
    parse_sql_query(query)
except SQLLineageException as e:
    # Log a warning instead of raising an error, so ingestion continues
    logger.warning(f"SQL parsing warning: {e}")
```
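
In the open-source connectors, the equivalent change usually goes through the source's `SourceReport` rather than a bare logger, which is what the pull request above did. A rough sketch of the same downgrade using that pattern, with `parse_sql_query` again standing in for the connector's real parsing call; the exact report API varies by DataHub version, so treat this as illustrative:

```
from datahub.ingestion.api.source import SourceReport

report = SourceReport()

def analyze_query(query: str) -> None:
    try:
        parse_sql_query(query)  # placeholder, as in the snippet above
    except Exception as e:
        # A warning shows up in the ingestion summary on the UI but does
        # not mark the run as failed, unlike report_failure
        report.report_warning("sql-parsing", f"Could not parse query: {e}")
```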

Would you like more detailed guidance on any of these steps, or do you need help with a specific part of the process?


<@U06TM7M40RK> this will not work for me
2. Adjust Logging Level: You can set the default log level to “WARN” to reduce the verbosity of logs and focus on warnings. This can be done by configuring the logging settings in your environment or script. For Spark, you can use the following command to set the log level:
sc.setLogLevel("WARN")
because the error is on the DataHub UI; there's no Spark code on my side!

If the error is occurring on the DataHub UI and you don’t have direct access to Spark code, you can still manage how SQL parsing errors are handled by adjusting the DataHub ingestion configuration or environment settings. Here are some steps you can take:

  1. Modify DataHub Ingestion Configuration: Check whether your DataHub ingestion setup exposes a configuration option that controls how errors are handled. Some ingestion frameworks provide options to continue on error or to log errors as warnings; a recipe-level sketch follows this list.

  2. Check DataHub Documentation: Review the <https://datahubproject.io/docs|DataHub documentation> to see if there are any specific settings or flags that can be used to adjust error handling behavior during ingestion. This might include options to ignore certain types of errors or to log them at a different severity level.

  3. Custom Error Handling: If the standard configuration options do not provide the flexibility you need, consider implementing a custom error handling mechanism. This might involve writing a custom ingestion script or plugin that wraps the standard ingestion process and handles errors according to your requirements.

  4. Consult DataHub Community: If you are unable to find a solution through configuration or documentation, consider reaching out to the DataHub community for support. The community may have encountered similar issues and could offer insights or workarounds.
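
For the Unity Catalog source specifically, and assuming the failures come from usage/query analysis as described above, one concrete recipe-level workaround is to turn off query-log processing so malformed or draft queries never reach the SQL parser. The `include_usage_statistics` flag is documented for this source, but verify it against the docs for your DataHub version, and note that you lose usage charts while it is disabled:

```
source:
    type: unity-catalog
    config:
        workspace_url: 'https://your-workspace.cloud.databricks.com'
        token: '${datahub_databricks_connector_access_token}'
        # Skip fetching and parsing the query history; table/column lineage
        # from system tables is controlled by separate flags
        include_usage_statistics: false
```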

Would you like more detailed guidance on any of these steps, or do you need help with a specific part of the process?


<@U06TM7M40RK> how ???

  1. Modify DataHub Ingestion Configuration: Check if there is a configuration option in your DataHub ingestion setup that allows you to specify how errors are handled. Some ingestion frameworks provide options to continue on error or to log errors as warnings.

<@U06TM7M40RK> this is actually a limitation of DataHub. Why does it collect draft queries to compute table statistics and then throw an error? It would be better to skip that error and proceed. My whole pipeline is corrupted because of this!

I’m not able to find any data sources to answer your question. You might want to provide details about the specific DataHub version and ingestion configuration you are using. If this doesn’t help, please reach out to the DataHub support team for further assistance. :bulb: Hint: Mention @runllm in the thread for followups.

<@U06TM7M40RK>
Datahub version 0.14.1
source: Databricks

I’m not able to find any data sources to answer your question. You might want to provide information on whether you have customized any ingestion scripts or if you are using the default configuration. If this doesn’t help, please reach out to the DataHub support team for further assistance. :bulb: Hint: Mention @runllm in the thread for followups.

<@U06TM7M40RK> can you help me reach out to the DataHub support team? Here is my ingestion recipe:

```
type: unity-catalog
config:
    workspace_url: '<https://dbc-sfaf-bgeq5f.cloud.databricks.com>'
    include_metastore: false
    include_ownership: true
    include_table_lineage: true
    include_column_lineage: true
    profiling:
        method: ge
        enabled: true
        warehouse_id: xxx123
        profile_table_level_only: false
        max_wait_secs: 300
        pattern:
            allow:
                - '.*\.a1'
                - '.*\.a2'
    stateful_ingestion:
        enabled: true
    softDelete: false
    warehouse_id: xxx123
    token: '${datahub_databricks_connector_access_token}'
```

To reach out to the DataHub support team, you can utilize the following resources:

  1. DataHub Slack Community: Join the DataHub Slack community where you can ask questions and get support from other users and the core team. You can sign up <https://datahubproject.io/slack?utm_source=docs&utm_medium=page_link&utm_campaign=docs_page_link|here> and navigate to relevant channels such as #troubleshoot for debugging help or #office-hours for live support.

  2. GitHub Issues: If you believe this is a bug or a feature request, you can open an issue on the <https://github.com/datahub-project/datahub/issues|DataHub GitHub repository>. Provide detailed information about your configuration, the error you are encountering, and any relevant logs.

  3. Email Support: If you are using DataHub Cloud or have a support agreement, you might have direct email support available. Check any documentation or agreements you have for contact details.

  4. Documentation: Review the <https://datahubproject.io/docs|DataHub documentation> for any troubleshooting guides or configuration tips that might help resolve your issue.

Would you like more detailed guidance on any of these options, or do you need help with a specific part of the process?


Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @ RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><

<@U07FV3UHE4D> please take a look at my issue.

Thank you for letting us know you need a person to take a look at this thread.

Either a community member or one of our team members will try to assist when they have a moment, responses unfortunately might not be immediate. We hope this helps speed things along :slight_smile: