Resolving Errors in Custom Source Ingestion and Data Visibility in DataHub UI

Original Slack Thread

Dear Community and <@U0348BYAS56>,

I am ingesting a custom source and getting this error:
datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type my-source.custom_ingestion_source.GreenplumSource: cannot import name 'LegacyRow' from 'sqlalchemy.engine.row'
Could you please suggest a potential solution to this issue?

Thank you in advance!

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?



To address the issue you’re encountering with the custom source ingestion, let’s break down the potential solutions step-by-step:

Summary
The error PipelineInitError: Failed to find a registered source for type my-source.custom_ingestion_source.GreenplumSource: cannot import name 'LegacyRow' from 'sqlalchemy.engine.row' points to a SQLAlchemy version incompatibility. LegacyRow was a transitional class in the SQLAlchemy 1.4.x line and was removed in SQLAlchemy 2.0, so any code that imports it fails against a 2.x install.

Steps to Resolve

  1. Check SQLAlchemy Version:
    Ensure that the version of SQLAlchemy you are using is compatible with your custom source code. The LegacyRow class might not be available in the version you are using. You can check the installed version using:

    pip show SQLAlchemy
    
  2. Update or Downgrade SQLAlchemy:
    If the LegacyRow class is not available in your current version, you might need to update or downgrade SQLAlchemy. For example, to install a specific version, you can use:

    pip install SQLAlchemy==1.4.39
    
  3. Modify Custom Source Code:
    If updating or downgrading SQLAlchemy does not resolve the issue, you might need to modify your custom source code to be compatible with the installed version of SQLAlchemy. Ensure that the import statements and usage of SQLAlchemy classes are correct.

  4. Rebuild and Redeploy:
    After making the necessary changes, rebuild and redeploy your custom source. Ensure that the custom source is correctly registered and available for ingestion.
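
The decision in steps 1–2 can be reduced to a one-line version check. Here is a minimal stdlib-only sketch (the function name is ours, not part of DataHub) that takes the version string reported by pip show SQLAlchemy:

```python
def needs_legacy_row_shim(sqlalchemy_version: str) -> bool:
    """Return True if this SQLAlchemy version no longer ships LegacyRow.

    LegacyRow was a transitional class in the 1.4.x line and was removed
    in SQLAlchemy 2.0, so any 2.x install needs the importing code
    adapted (or SQLAlchemy pinned back to a 1.4.x release).
    """
    major = int(sqlalchemy_version.split(".")[0])
    return major >= 2

print(needs_legacy_row_shim("2.0.23"))  # → True
print(needs_legacy_row_shim("1.4.39"))  # → False
```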

Example of Custom Source Code Adjustment
Here is an example of how you might adjust your custom source code to ensure compatibility:

# Import Row, which is the supported class on both SQLAlchemy 1.4 and 2.x.
# (LegacyRow existed only in 1.4 and was removed in 2.0.)
from sqlalchemy.engine import Row

class GreenplumSource:
    def __init__(self, config, ctx):
        self.config = config
        self.ctx = ctx

    def get_data(self):
        # Your data extraction logic here
        pass

    @classmethod
    def create(cls, config_dict, ctx):
        # GreenplumConfig is your source's config model, defined elsewhere.
        config = GreenplumConfig.parse_obj(config_dict)
        return cls(config, ctx)
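
Separately, make sure the ingestion recipe references the custom source by its fully qualified type (the type string from the error above). A minimal recipe sketch, in which the sink server URL and the config keys are placeholders for your environment:

```yaml
source:
  type: my-source.custom_ingestion_source.GreenplumSource
  config:
    # source-specific options go here
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```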

Additional Information

  • Ingestion Method: Are you using the UI or CLI for ingestion?
  • DataHub Version: Which version of DataHub are you using? (e.g., 0.12.0)
  • Data Source: What data source(s) are you integrating with DataHub? (e.g., BigQuery)

Sources:

  • <Slack Slack Discussion 1>
  • <Slack Slack Discussion 2>
  • <Slack Slack Discussion 3>

I am using CLI for ingestion

I have successfully ingested the custom source, but I do not see the datasets in the DataHub UI. I am running DataHub via the command datahub docker quickstart and using the DataHub CLI for the custom source ingestion.

<@U06TM7M40RK>

Summary
If you have successfully ingested the custom source but do not see the datasets in the DataHub UI, the issue might be related to indexing in Elasticsearch or the ingestion process itself. Here are the steps to troubleshoot and resolve this issue.

Steps to Resolve

  1. Check Elasticsearch Health:
    Ensure that your Elasticsearch instance is running and healthy. You can check the health status by accessing the Elasticsearch health endpoint:

    curl -X GET "localhost:9200/_cluster/health?pretty"
    
  2. Verify Ingestion Logs:
    Check the ingestion logs to ensure that the ingestion process completed successfully and that the metadata was correctly ingested. Look for any errors or warnings in the logs.

  3. Check MCE and MAE Consumers:
    Ensure that the Metadata Change Event (MCE) and Metadata Audit Event (MAE) consumers are running correctly. These consumers process the ingested metadata and index it in Elasticsearch. You can check their status by looking at the logs or using the following commands:

    docker logs datahub-mae-consumer
    docker logs datahub-mce-consumer
    
  4. Reindex Elasticsearch:
    If the datasets are not showing up in the UI, you might need to rebuild the Elasticsearch indices from the primary metadata store. This can be done with the DataHub CLI; the exact subcommand varies by version, so verify it against datahub --help. One form is:

    datahub elasticsearch reindex
    
  5. Check DataHub UI:
    After reindexing, check the DataHub UI to see if the datasets are now visible. You can also try searching for the datasets using their URNs.
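
To go one step beyond eyeballing the curl output from step 1, here is a minimal stdlib-only sketch (the function name is ours) that interprets the JSON body returned by the _cluster/health endpoint:

```python
import json

def cluster_is_healthy(health_body: str) -> bool:
    """Interpret the JSON body returned by GET /_cluster/health.

    Elasticsearch reports "green" (all shards allocated) or "yellow"
    (primaries allocated, some replicas missing) for a usable cluster;
    "red" means some primary shards are unassigned.
    """
    status = json.loads(health_body).get("status")
    return status in ("green", "yellow")

# Example response body, trimmed to the field we care about:
sample = '{"cluster_name": "docker-cluster", "status": "yellow"}'
print(cluster_is_healthy(sample))  # → True
```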


Additional Information

  • Ingestion Method: You mentioned using the CLI for ingestion.
  • DataHub Version: Ensure you are using a compatible version of DataHub.
  • Data Source: Custom source (e.g., Greenplum).


<@U06FB43J33L> Can you see the aspects in MySQL?
Also, can you check the GMS logs to see if there are any errors there?

Thank you for your reply. I have solved the issue by downgrading the SQLAlchemy Python module from 2.x.x to:

awesome, thanks for the heads up