Issue with Oracle ingestion via Airflow pipeline and solution through shell script and DataHub recipe

Original Slack Thread

<@U06TM7M40RK> oracle ingestion via airflow pipeline doesn't work for us. It requires the cx_Oracle package and the Oracle Instant Client, but due to an error in extracting the Instant Client and setting it as the LD_LIBRARY_PATH env variable, the ingestion via airflow doesn't work. Oracle ingestion via shell script and datahub recipe works though.

It seems likely that the issue you’re encountering with Oracle ingestion via an Airflow pipeline is related to the configuration of the Oracle client libraries, specifically the cx_Oracle package and the Oracle Instant Client. The error you’re facing with setting the LD_LIBRARY_PATH environment variable might be causing the ingestion process to fail in Airflow, even though it works with a shell script and DataHub recipe.

Based on the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/oracle|DataHub Oracle Ingestion Documentation>, the Oracle ingestion source requires the correct drivers, such as the Oracle Instant Client, to be installed. The cx_Oracle package is used to connect to Oracle databases, and it relies on the Oracle Instant Client being correctly installed and configured.
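That dependency chain can be checked up front with a small stdlib-only script before involving Airflow at all. This is a diagnostic sketch, not part of DataHub or cx_Oracle; the directory path in the example comment is a placeholder, not a value from the thread.

```python
import os
from pathlib import Path

def check_instant_client(client_dir: str) -> list[str]:
    """Return a list of problems found with an Oracle Instant Client setup.

    An empty list means the directory contains the client library and is
    visible on LD_LIBRARY_PATH, which is what cx_Oracle needs at import time.
    """
    problems = []
    path = Path(client_dir)
    # libclntsh.so is the core client library cx_Oracle loads dynamically.
    if not any(path.glob("libclntsh.so*")):
        problems.append(f"no libclntsh.so* found in {client_dir}")
    # The dynamic linker only searches directories listed on LD_LIBRARY_PATH.
    ld_dirs = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    if str(path) not in ld_dirs:
        problems.append(f"{client_dir} is not on LD_LIBRARY_PATH")
    return problems

# Example (placeholder path):
# print(check_instant_client("/opt/oracle/instantclient_21_9"))
```

Running this inside the same environment as the Airflow worker (for instance from a one-off task) quickly shows whether the extraction or the environment variable is the broken piece.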

Here are some steps you might consider to resolve the issue:

  1. Verify Oracle Instant Client Installation: Ensure that the Oracle Instant Client is correctly installed on the machine where Airflow is running. You can follow the <https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html|cx_Oracle installation guide> for detailed instructions.

  2. Set LD_LIBRARY_PATH Correctly: Make sure that the LD_LIBRARY_PATH environment variable is set to include the directory where the Oracle Instant Client libraries are located. This can be done in the Airflow environment configuration or within the Airflow DAG script.

  3. Check Airflow Environment: Note that environment variables exported in a shell profile are not inherited by Airflow workers launched as services (e.g. via systemd or supervisor). Set LD_LIBRARY_PATH where the scheduler and workers are actually started, such as the service unit file, the container image, or the worker's startup script.

  4. Test Outside Airflow: Since the ingestion works via a shell script, you can try running the same script within an Airflow BashOperator to see if it works. This can help isolate whether the issue is with the Airflow Python environment or the overall system configuration.

  5. Use a Virtual Environment: Consider using a virtual environment for Airflow that includes the cx_Oracle package and the necessary Oracle client libraries. This can help manage dependencies more effectively.

Would you like more detailed instructions or examples for any specific part of this process?

Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @RunLLM if you still need further assistance !

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><