Fastest way to add an SQLAlchemy dialect for the DataHub UI using pip

Original Slack Thread

Hi!
We are running a PoC with the DataHub UI, following the quickstart guide. What’s the fastest way to add an SQLAlchemy dialect? The documentation states that you can use pip install, but gives no further context. We tried to bash into the running datahub-actions container and run pip install there, but the ingestion job fails because SQLAlchemy can’t see the dialect or the installed libraries.

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses!

To add an SQLAlchemy dialect for use with DataHub, you need to ensure that the required dialect package is installed in the environment where the ingestion job actually executes. According to the DataHub SQLAlchemy documentation (https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/sqlalchemy), the SQLAlchemy source is useful when an SQLAlchemy dialect is defined elsewhere, but you must install the required dialect packages yourself.

Here are the steps you can follow:

  1. Install the SQLAlchemy Plugin: First, ensure that the DataHub SQLAlchemy plugin is installed. You can do this using pip:

    pip install 'acryl-datahub[sqlalchemy]'
    
  2. Install the Required Dialect: You need to install the SQLAlchemy dialect package in the same environment where the DataHub ingestion job runs. If you are using the CLI, this should be done in the environment where the CLI is executed (a concrete, filled-in example follows this list). For example:

    pip install <your-sqlalchemy-dialect-package>
    
  3. Configure the Ingestion Recipe: Create a YAML recipe file to configure the ingestion. Here is an example:

    source:
      type: sqlalchemy
      config:
        connect_uri: "dialect+driver://username:password@host:port/database"
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"  # Change this to your DataHub server URL
    
  4. Run the Ingestion: Execute the ingestion using the DataHub CLI:

    datahub ingest -c your_recipe.yaml
    

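To make the placeholders concrete, here is a hedged, filled-in version of steps 2–3 for IBM DB2 (the database discussed later in this thread). The ibm-db and ibm-db-sa package names and platform: db2 come from the follow-up below; host, port, credentials, and database name are placeholders (50000 is just the conventional DB2 port):

    pip install 'acryl-datahub[sqlalchemy]' ibm-db ibm-db-sa

with a recipe along these lines:

    source:
      type: sqlalchemy
      config:
        platform: db2
        connect_uri: "db2+ibm_db://username:password@db2host:50000/MYDB"
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
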
If you are using a managed environment or UI-based ingestion, it might create a separate virtual environment for the ingestion runs, and the packages should be installed there. Unfortunately, as per this Slack discussion, there is currently no way to affect the virtual environment’s requirements directly from the UI (though see the note at the end of this thread about adding extra dependencies in the “advanced” section).

Would you like more detailed guidance on any of these steps?

Hi, how did my answer work out for you? Need any more assistance? You can also tag <@U07FV3UHE4D> to let your fellow community members know you’re still looking for guidance.

As a reference to any future user who might face the same problem, here is a short summary of how we got forward with our PoC:
• Our database is IBM DB2, which is not supported by the DataHub UI
• One could build the project by forking the GitHub repository and adding the needed ibm-db and ibm-db-sa packages plus the DB2 client driver DLLs, but that requires far too much development time for a quick PoC
• Using the DataHub CLI circumvents the lack of DB2 support
• First, you must be familiar with connecting to a DB2 database from your local Python development environment using sqlalchemy, ibm-db, and ibm-db-sa, with the DB2 client drivers installed (see the connectivity sketch after this list). The db2+ibm_db connect URL will be handy later
• Install both the datahub and acryl-datahub[sqlalchemy] packages
• Now you can create a custom recipe whose ingested metadata will show up in the DataHub UI
• In our case the recipe was the following:

    source:
      type: sqlalchemy
      config:
        platform: db2
        connect_uri: <the connect URL you have used before in your local env to connect to your DB2 database>
• Per our understanding, the datahub CLI runs the ingestion job in your local environment against the DB2 database; the IBM DB2 drivers are never installed into the DataHub containers
• The ingestion job runs locally, and only the extracted metadata is transferred to the DataHub containers
• After the job is executed, you’ll see the metadata available in the UI
• Again, the DataHub UI will still lack the ability to connect to DB2 databases
• Theoretically you’d need another container that runs CLI ingestion jobs against the DB2 database and then pushes the metadata to the other containers (a rough sketch of this follows below)
• Progress on our PoC was slow at times; building the project from source is overkill when all a PoC needs is a basic sqlalchemy connection
• The documentation could be better, since it leads you to assume that getting this to work is as simple as a pip install :slight_smile:
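
For the local connectivity check mentioned in the list, here is a minimal Python sketch, assuming ibm-db, ibm-db-sa, and the DB2 client drivers are installed; host, port, credentials, and database name are placeholders:

    from sqlalchemy import create_engine, text

    # The db2+ibm_db URL is the same one you will later paste into the
    # recipe's connect_uri; every component below is a placeholder.
    engine = create_engine("db2+ibm_db://user:password@db2host:50000/MYDB")

    # SYSIBM.SYSDUMMY1 is DB2's built-in one-row dummy table; a successful
    # SELECT proves SQLAlchemy can load the ibm_db dialect and reach the server.
    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1 FROM SYSIBM.SYSDUMMY1")).scalar())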

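And a rough, untested sketch of the “extra container” idea from the list, assuming the same packages as above (the ibm-db wheel bundles the DB2 client driver on common platforms, but verify for yours):

    FROM python:3.10-slim
    # acryl-datahub[sqlalchemy] provides the datahub CLI and the generic
    # sqlalchemy source; ibm-db / ibm-db-sa supply the DB2 dialect.
    RUN pip install 'acryl-datahub[sqlalchemy]' ibm-db ibm-db-sa
    COPY db2_recipe.yaml /db2_recipe.yaml
    # The recipe's sink must point at a DataHub REST endpoint reachable from
    # inside this container (e.g. the datahub-gms service in the quickstart compose).
    CMD ["datahub", "ingest", "-c", "/db2_recipe.yaml"]
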
Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><

<@U01GZEETMEZ> ^

You can add extra python dependencies in the “advanced” section of the ingestion setup process. You should not need to build things from source!

[attachment: image.png — screenshot of the ingestion setup’s “advanced” section]