Troubleshooting error in dbt metadata ingestion caused by sqlglot compatibility issue

Original Slack Thread

I am running `datahub ingest -c recipe.dhub.yaml` to import dbt metadata,

with the following package settings:

```
sqlglot = "20.8.0"
```
The output I get:
```
INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.2.4
ERROR    {datahub.entrypoints:205} - Command failed: Failed to find a registered source for type dbt: dbt is disabled due to an error in initialization
```
caused by:
```
File "/usr/local/lib/python3.11/site-packages/sqlglot/tokens.py", line 479, in __new__
    settings = RsTokenizerSettings(
TypeError: TokenizerSettings.__new__() missing 1 required positional argument: 'heredoc_tag_is_identifier'
```
What could be the cause of the error and the solution?


To address the issue you’re encountering with the dbt metadata ingestion, let’s break down the potential solutions step-by-step:

Summary
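The error TypeError: TokenizerSettings.__new__() missing 1 required positional argument: 'heredoc_tag_is_identifier' points to a version mismatch within the sqlglot installation. The traceback fails inside sqlglot/tokens.py at the call to RsTokenizerSettings: the calling code does not pass heredoc_tag_is_identifier, but the tokenizer settings class that actually got loaded requires it. In other words, the Python code and the tokenizer implementation it calls (most likely the optional Rust tokenizer package, sqlglotrs) appear to come from different releases. Because the dbt source fails while it is being loaded, DataHub reports it as disabled.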
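To make the failure mode concrete, here is a minimal sketch of how this kind of TypeError arises. The TokenizerSettings class below is a hypothetical stand-in written for illustration, not sqlglot's or sqlglotrs's actual class; the only thing taken from the thread is the name of the missing argument.

```python
class TokenizerSettings:
    """Hypothetical stand-in for a newer tokenizer settings class.

    It requires an argument that an older caller does not know about.
    """

    def __new__(cls, white_space, heredoc_tag_is_identifier):
        instance = super().__new__(cls)
        instance.white_space = white_space
        instance.heredoc_tag_is_identifier = heredoc_tag_is_identifier
        return instance


def build_settings_like_older_caller():
    # An older caller was written before the new argument existed,
    # so it only passes the arguments it knows about.
    return TokenizerSettings(white_space={})


def reproduce_mismatch():
    """Return the error message raised by the mismatched call, or None."""
    try:
        build_settings_like_older_caller()
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce_mismatch())
```

The message has the same shape as the one in the thread, which is why aligning the versions of the two sides is the likely fix, rather than changing the ingestion recipe.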

Steps to Resolve

  1. Check sqlglot Version:
    Confirm which sqlglot version is actually installed in the environment that runs the ingestion, and whether it is the version your DataHub CLI release expects. The error shows that the calling code does not supply heredoc_tag_is_identifier, so the installed sqlglot pieces are likely out of sync with each other. You can check the installed version using:

    pip show sqlglot
    
  2. Update or Downgrade sqlglot:
    If the installed versions are out of sync, update or downgrade sqlglot so that it matches what the rest of your environment expects. For example, to install a specific version, you can use:

    pip install sqlglot==<desired_version>
    
  3. Modify Ingestion Configuration:
    Ensure that your ingestion recipe is correctly set up. Refer to the DataHub metadata ingestion documentation (https://datahubproject.io/docs/metadata-ingestion) for the required configuration.

  4. Rebuild and Redeploy:
    After making the necessary changes, re-run the ingestion (rebuilding and redeploying your ingestion setup first if it runs from an image), and confirm that the dbt source is registered and no longer reported as disabled.
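The version check and the "is the source loadable" check from the steps above can be sketched in Python as follows. This is a diagnostic sketch, not an official DataHub tool: the list of package names is an assumption about which distributions are worth inspecting (acryl-datahub, sqlglot, sqlglotrs, and the acryl-sqlglot fork), and the helper names are made up for illustration.

```python
import importlib
from importlib import metadata

# Assumed set of distributions that could be involved in the mismatch.
PACKAGES = ("acryl-datahub", "sqlglot", "sqlglotrs", "acryl-sqlglot")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def import_error(module_name):
    """Return None if the module imports cleanly, else a short error description."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # report any failure raised at import time
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
    print("import sqlglot:", import_error("sqlglot") or "ok")
```

Run it with the same Python interpreter that runs `datahub ingest`. If more than one sqlglot-related distribution is installed, or the import check reports the same TypeError, that supports the version-mismatch explanation.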

Example of Custom Source Code Adjustment
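Here is a skeleton of what a custom source class might look like. Note that dbt is a built-in DataHub source, so the sqlglot error itself is more likely fixed by aligning package versions than by changing source code. DbtConfig below is a hypothetical stand-in for a real config model:

from dataclasses import dataclass

from sqlglot import expressions  # Unused here; adjust the import based on the sqlglot version


@dataclass
class DbtConfig:
    """Hypothetical minimal config; a real source would define many more fields."""
    manifest_path: str

    @classmethod
    def parse_obj(cls, config_dict):
        return cls(**config_dict)


class DbtSource:
    def __init__(self, config, ctx):
        self.config = config
        self.ctx = ctx

    def get_data(self):
        # Your data extraction logic here
        return []

    @classmethod
    def create(cls, config_dict, ctx):
        config = DbtConfig.parse_obj(config_dict)
        return cls(config, ctx)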

Additional Information

  • Ingestion Method: Are you using the UI or CLI for ingestion?
  • DataHub Version: Which version of DataHub are you using? (e.g., 0.13.2.4)
  • Data Source: What data source(s) are you integrating with DataHub? (e.g., dbt, Snowflake)

Would you like more detailed steps or further assistance on any specific part of this process?
