I am injecting a custom source and getting this error: datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type my-source.custom_ingestion_source.GreenplumSource: cannot import name 'LegacyRow' from 'sqlalchemy.engine.row'
Could you please suggest a potential solution to this issue?
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to give its responses a 👍 or 👎 so other folks in the Community know if you’re still looking for help on the topic!
To address the issue you’re encountering with the custom source ingestion, let’s break down the potential solutions step-by-step:
Summary
The error PipelineInitError: Failed to find a registered source for type my-source.custom_ingestion_source.GreenplumSource: cannot import name 'LegacyRow' from 'sqlalchemy.engine.row' points to a SQLAlchemy version incompatibility. The LegacyRow class exists in SQLAlchemy 1.4 but was removed in SQLAlchemy 2.0, so your environment most likely has SQLAlchemy 2.x installed while your custom source (or one of its dependencies) still imports LegacyRow.
Steps to Resolve
Check SQLAlchemy Version:
Ensure that the version of SQLAlchemy you are using is compatible with your custom source code; LegacyRow is only available in SQLAlchemy 1.x. You can check the installed version using:
pip show SQLAlchemy
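As a quick sanity check, this short Python snippet (a minimal sketch, assuming nothing beyond a standard SQLAlchemy install) confirms both the installed version and whether LegacyRow is importable:

import sqlalchemy

print(sqlalchemy.__version__)
try:
    # LegacyRow exists in SQLAlchemy 1.4 but was removed in 2.0
    from sqlalchemy.engine.row import LegacyRow  # noqa: F401
    print("LegacyRow is available")
except ImportError:
    print("LegacyRow is missing; you are likely on SQLAlchemy 2.x")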
Update or Downgrade SQLAlchemy:
Since LegacyRow was removed in SQLAlchemy 2.0, downgrading to a 1.4.x release restores it (alternatively, update your code for 2.0, as described in the next step). To install a specific version:
pip install SQLAlchemy==1.4.39
Modify Custom Source Code:
If changing the SQLAlchemy version does not resolve the issue (or is not an option), update your custom source code to be compatible with the installed version: check that every import statement and usage of SQLAlchemy classes matches that version's API. A guarded-import pattern is sketched below.
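One common pattern (a minimal sketch, not DataHub-specific) is a guarded import that falls back to the modern Row class when LegacyRow is missing; note that this aliasing only helps if your code relies on the behavior the two classes share:

try:
    # SQLAlchemy 1.4: LegacyRow is still present
    from sqlalchemy.engine.row import LegacyRow
except ImportError:
    # SQLAlchemy 2.0+: LegacyRow was removed; fall back to Row
    from sqlalchemy.engine import Row as LegacyRow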
Rebuild and Redeploy:
After making the necessary changes, rebuild and redeploy your custom source, and ensure it is correctly registered and importable for ingestion; a quick import check is sketched below.
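Before re-running the pipeline, it can help to confirm that the module path in your recipe's type field actually imports. This sketch assumes a hypothetical module name custom_ingestion_source on your PYTHONPATH; note that a package whose name contains a hyphen (like my-source) cannot be imported with a regular Python import statement, which is worth double-checking:

import importlib

# Hypothetical module name; replace it with the module from your recipe's type field.
module = importlib.import_module("custom_ingestion_source")
source_cls = getattr(module, "GreenplumSource")
print(f"Resolved source class: {source_cls}")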
Example of Custom Source Code Adjustment
Here is an example of how you might adjust your custom source code to ensure compatibility:
# Adjust the import based on the SQLAlchemy version you target;
# Row is available in both SQLAlchemy 1.4 and 2.0.
from sqlalchemy.engine import Row


class GreenplumSource:
    def __init__(self, config, ctx):
        self.config = config
        self.ctx = ctx

    def get_data(self):
        # Your data extraction logic here
        pass

    @classmethod
    def create(cls, config_dict, ctx):
        # GreenplumConfig is assumed to be your source's config model
        config = GreenplumConfig.parse_obj(config_dict)
        return cls(config, ctx)
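Note that a real DataHub custom source typically subclasses datahub.ingestion.api.source.Source and, depending on your DataHub version, must also implement methods such as get_workunits and get_report in addition to create; the snippet above only illustrates the import fix and constructor shape.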
Additional Information
Ingestion Method: Are you using the UI or CLI for ingestion?
DataHub Version: Which version of DataHub are you using? (e.g., 0.12.0)
Data Source: What data source(s) are you integrating with DataHub? (e.g., BigQuery)
I have successfully ingested the custom source, but I do not see the datasets in the DataHub UI. I am running DataHub with the command datahub docker quickstart, and I am using the DataHub CLI for the custom source ingestion. <@U06TM7M40RK>
Summary
If you have successfully ingested the custom source but do not see the datasets in the DataHub UI, the issue might be with indexing in Elasticsearch or with the ingestion itself: the UI's search and browse views are served from Elasticsearch, so metadata can be stored in DataHub without being visible until it is indexed. Here are the steps to troubleshoot and resolve this issue.
Steps to Resolve
Check Elasticsearch Health:
Ensure that your Elasticsearch instance is running and healthy. You can check the health status by accessing the Elasticsearch health endpoint:
curl -X GET "localhost:9200/_cluster/health?pretty"
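If you prefer Python over curl, an equivalent check (a minimal sketch, assuming the quickstart default of Elasticsearch on localhost:9200 and that the requests library is installed) looks like:

import requests

# Query the cluster health endpoint of the quickstart Elasticsearch container
resp = requests.get("http://localhost:9200/_cluster/health")
resp.raise_for_status()
print(resp.json()["status"])  # "green" or "yellow" is normal for a single-node setup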
Verify Ingestion Logs:
Check the ingestion logs to confirm the run completed successfully; the datahub ingest CLI prints a summary of the source and sink reports at the end of the run. Look for any errors or warnings, and verify that the expected number of records was actually emitted.
Check MCE and MAE Consumers:
Ensure that the Metadata Change Event (MCE) and Metadata Audit Event (MAE) consumers are running correctly; they process the ingested metadata and index it in Elasticsearch. In a quickstart deployment they typically run embedded in the datahub-gms container, so inspect that container's logs (for example, with docker logs) for consumer errors.
Reindex Elasticsearch:
If the metadata was stored but not indexed, restoring the Elasticsearch indices can help. With a quickstart deployment this can be done through the DataHub CLI:
datahub docker quickstart --restore-indices
Check DataHub UI:
After reindexing, check the DataHub UI to see if the datasets are now visible. You can also try searching for the datasets using their URNs.
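To distinguish "not ingested" from "not indexed", you can ask GMS directly whether an entity exists. Here is a minimal sketch using the DataHub Python client against the quickstart default GMS endpoint, with a hypothetical Greenplum dataset URN that you would replace with one from your ingestion report:

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Quickstart default GMS endpoint
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Hypothetical URN; substitute one emitted by your custom source
urn = "urn:li:dataset:(urn:li:dataPlatform:greenplum,my_schema.my_table,PROD)"
print(graph.exists(urn))

If exists returns True but the dataset still does not appear in search, the problem is almost certainly the Elasticsearch indices, and the restore step above applies.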
Example Commands
Checking Elasticsearch Health
curl -X GET "localhost:9200/_cluster/health?pretty"
Restoring Elasticsearch Indices (quickstart)
datahub docker quickstart --restore-indices
Additional Information
Ingestion Method: You mentioned using the CLI for ingestion.
DataHub Version: Ensure you are using a compatible version of DataHub.