DataHub Ingestion Issues with the Impala Hive Connector and SQLAlchemy Recipes

Original Slack Thread

Hi Team -
We are on DataHub version 0.10.5, though the problem below has been present since well before that.
We are ingesting some Hive tables from Apache Impala using the Hive connector. The ingestion jobs report several errors about failing to ingest metadata. In the DataHub UI, the tables show up, but without any schemas.
Below are some of the errors we see:

[2023-10-02T04:43:11.998+0000] {process_utils.py:187} INFO - config = SQLAlchemyGenericConfig.parse_obj(config_dict)
[2023-10-02T04:43:12.465+0000] {process_utils.py:187} INFO - _impala_builtins => Views error:
[2023-10-02T04:43:12.465+0000] {process_utils.py:187} INFO - Traceback (most recent call last):
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - File "/tmp/venv457bp1g4/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 881, in loop_views
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - for view in inspector.get_view_names(schema):
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - File "/tmp/venv457bp1g4/lib/python3.9/site-packages/sqlalchemy/engine/reflection.py", line 412, in get_view_names
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - return self.dialect.get_view_names(
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO - File "/tmp/venv457bp1g4/lib/python3.9/site-packages/sqlalchemy/engine/interfaces.py", line 332, in get_view_names
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO - raise NotImplementedError()
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO - NotImplementedError
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO -
[2023-10-02T04:43:12.563+0000] {process_utils.py:187} INFO - adh_adhoc_cdl4ran.atoll_cdds_l8 => unable to get column information due to an error -> (impala.error.HiveServer2Error) AnalysisException: Failed to load metadata for table: 'adh_adhoc_cdl4ran.atoll_cdds_l8'
[2023-10-02T04:43:12.563+0000] {process_utils.py:187} INFO - CAUSED BY: TableLoadingException: Could not load table adh_adhoc_cdl4ran.atoll_cdds_l8 from catalog
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - CAUSED BY: TException: TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, error_msgs:[TableLoadingException: Failed to load metadata for table: adh_adhoc_cdl4ran.atoll_cdds_l8
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - CAUSED BY: InvalidStorageDescriptorException: Impala does not support tables of this type. REASON: SerDe library 'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.]), lookup_status:OK)
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO -
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM adh_adhoc_cdl4ran.atoll_cdds_l8 LIMIT 0]
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - (Background on this error at: <https://sqlalche.me/e/14/dbapi>)

[2023-10-02T04:50:59.728+0000] {process_utils.py:187} INFO -
[2023-10-02T04:50:59.728+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM prod_tic_sandboxes.t_salesforce_accounts_111 LIMIT 0]
[2023-10-02T04:50:59.728+0000] {process_utils.py:187} INFO - (Background on this error at: <https://sqlalche.me/e/14/dbapi>)
[2023-10-02T04:50:59.764+0000] {process_utils.py:187} INFO - prod_tic_sandboxes.t_salesforce_accounts_222 => unable to get column information due to an error -> (impala.error.HiveServer2Error) AnalysisException: Failed to load metadata for table: 'prod_tic_sandboxes.t_salesforce_accounts_222'
[2023-10-02T04:50:59.765+0000] {process_utils.py:187} INFO - CAUSED BY: TableLoadingException: Could not load table prod_tic_sandboxes.t_salesforce_accounts_222 from catalog
[2023-10-02T04:50:59.765+0000] {process_utils.py:187} INFO - CAUSED BY: TException: TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, error_msgs:[TableLoadingException: Failed to load metadata for table: prod_tic_sandboxes.t_salesforce_accounts_222
[2023-10-02T04:50:59.765+0000] {process_utils.py:187} INFO - CAUSED BY: AnalysisException: Invalid avro.schema.url: <hdfs://nameservice-cdlpv2/data/prod/tic/base/salesforce/schema/salesforce_schema.avsc>. Path does not exist.]), lookup_status:OK)

[2023-10-03T03:02:15.723+0000] {process_utils.py:187} INFO - default.medallia_euc6173 => unable to get column information due to an error -> 'ARRAY'
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - default => Views error:
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - Traceback (most recent call last):
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - File "/tmp/venvnrgb6ir9/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 881, in loop_views
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - for view in inspector.get_view_names(schema):
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - File "/tmp/venvnrgb6ir9/lib/python3.9/site-packages/sqlalchemy/engine/reflection.py", line 412, in get_view_names
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - return self.dialect.get_view_names(
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - File "/tmp/venvnrgb6ir9/lib/python3.9/site-packages/sqlalchemy/engine/interfaces.py", line 332, in get_view_names
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - raise NotImplementedError()
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - NotImplementedError

[2023-10-03T03:02:32.193+0000] {process_utils.py:187} INFO - metadata.edhub_group => unable to get column information due to an error -> (impala.error.HiveServer2Error) Error while compiling statement: FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.kudu.mapreduce.KuduTableInputFormat
[2023-10-03T03:02:32.194+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM metadata.edhub_group LIMIT 0]
[2023-10-03T03:02:32.194+0000] {process_utils.py:187} INFO - (Background on this error at: <https://sqlalche.me/e/14/dbapi>)
Can someone please help us understand what is wrong here?
<@UV14447EU> <@U03BEML16LB> <@U01GCJKA8P9> <@U05QQUDHTKJ>
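Two distinct failures show up in these logs. The NotImplementedError is raised because the SQLAlchemy dialect in use does not implement get_view_names, so view enumeration fails for every schema. The AnalysisException / TableLoadingException messages, on the other hand, are returned by Impala itself when the connector runs SELECT * FROM <table> LIMIT 0 to reflect columns (unsupported SerDe, missing Avro schema file, and so on). A rough way to confirm the latter comes from Impala rather than from DataHub is to issue the same probe query directly through impyla; this is only a sketch, the host and port are placeholders, authentication options are omitted, and the table name is taken from the error log above:

from impala.dbapi import connect

# Placeholder connection details; replace with the real Impala daemon host/port.
# Authentication options (e.g. Kerberos) are omitted here for brevity.
conn = connect(host='impala-host', port=21050)
cur = conn.cursor()
# Same probe the connector issues when reflecting columns.
cur.execute('SELECT * FROM adh_adhoc_cdl4ran.atoll_cdds_l8 LIMIT 0')
print(cur.description)  # column metadata, if Impala can load the table

If this query fails with the same AnalysisException, the problem lies with the table definitions in the Impala catalog rather than with the ingestion recipe.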

It seems PyHive doesn’t fully support Impala.
Maybe you can try out the generic sqlalchemy source with impyla.

ok we’ll try that

source:
  type: sqlalchemy
  config:
    platform: "hive"
    include_views: true
    env: PROD
    platform_instance:
    connect_uri: "impala://"

sink:
  type: "datahub-rest"

This is the recipe we are already using. I believe this is the sqlalchemy source with impyla? If so, it is not working and is throwing the above errors. <@UV14447EU>

Can someone help us here please? Our SQLAlchemy recipe for the Hive platform using Impala is not able to get metadata for many source tables, and even for the tables it does ingest, the schema field details are missing.

[2023-10-11T04:58:03.348+0000] {process_utils.py:187} INFO - prod_swe_base.t_rxdod_nni => unable to get column information due to an error -> (impala.error.HiveServer2Error) LocalCatalogException: org.apache.impala.catalog.TableLoadingException: Failed to load metadata for table 'prod_swe_base.t_rxdod_nni' because of unsupported partition-column type 'org.apache.impala.catalog.StructType@3d7ce8a0' in partition column 'key'
[2023-10-11T04:58:03.348+0000] {process_utils.py:187} INFO - CAUSED BY: TableLoadingException: Failed to load metadata for table 'prod_swe_base.t_rxdod_nni' because of unsupported partition-column type 'org.apache.impala.catalog.StructType@3d7ce8a0' in partition column 'key'

[2023-10-11T04:55:53.905+0000] {process_utils.py:187} INFO - prod_swe_base.t_mnc_cvd2_netb_v_d1033_closed_order => unable to get column information due to an error -> (impala.error.HiveServer2Error) AnalysisException: Failed to load metadata for table: 'prod_swe_base.t_mnc_cvd2_netb_v_d1033_closed_order'
[2023-10-11T04:55:53.905+0000] {process_utils.py:187} INFO - CAUSED BY: TableLoadingException: Could not load table prod_swe_base.t_mnc_cvd2_netb_v_d1033_closed_order from catalog
[2023-10-11T04:55:53.905+0000] {process_utils.py:187} INFO - CAUSED BY: TException: TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, error_msgs:[TableLoadingException: Failed to load metadata for table: prod_swe_base.t_mnc_cvd2_netb_v_d1033_closed_order
[2023-10-11T04:55:53.905+0000] {process_utils.py:187} INFO - CAUSED BY: AnalysisException: Invalid avro.schema.url: /data/prod2/swe/base/mnc_cvd2/schema/netb_v_d1033_closed_order.avsc. Path does not exist.]), lookup_status:OK)
[2023-10-11T04:55:53.905+0000] {process_utils.py:187} INFO -
[2023-10-11T04:55:53.906+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM prod_swe_base.t_mnc_cvd2_netb_v_d1033_closed_order LIMIT 0]

[2023-10-11T04:04:39.352+0000] {process_utils.py:187} INFO - prod_swe_access.v_h_contact_level_abt => unable to get column information due to an error -> (impala.error.HiveServer2Error) AnalysisException: Failed to load metadata for table: 'prod_swe_access.v_h_contact_level_abt'
[2023-10-11T04:04:39.352+0000] {process_utils.py:187} INFO - CAUSED BY: TableLoadingException: Could not load table prod_swe_access.v_h_contact_level_abt from catalog
[2023-10-11T04:04:39.352+0000] {process_utils.py:187} INFO - CAUSED BY: TException: TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, error_msgs:[TableLoadingException: Failed to parse view-definition statement of view: prod_swe_access.v_h_contact_level_abt]), lookup_status:OK)
[2023-10-11T04:04:39.352+0000] {process_utils.py:187} INFO -
[2023-10-11T04:04:39.352+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM prod_swe_access.v_h_contact_level_abt LIMIT 0]

<@U0121TRV0FL> <@UV14447EU> <@U03MF8MU5P0> <@U03BEML16LB> we have been getting these failures for a long time, and our end users aren’t happy about not being able to see all the metadata for Hive. <@U05C3CJDPD4>

Does Impala have a SQLAlchemy driver? Maybe you can try with that and the generic sqlalchemy source.
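For reference, the impyla package appears to ship a SQLAlchemy dialect (impala.sqlalchemy, registered under the impala:// scheme), which is what a connect_uri like the one in the recipe above would resolve to. A minimal standalone check might look like the following sketch; the host, port, and table name are placeholders, and it assumes impyla is installed (pip install impyla):

from sqlalchemy import create_engine, inspect

# Placeholder host/port; assumes the impyla-provided "impala" dialect is available.
engine = create_engine('impala://impala-host:21050')
insp = inspect(engine)
print(insp.get_table_names(schema='default'))
# Column reflection for a single (hypothetical) table, roughly the same call path
# the ingestion source relies on when it reports "unable to get column information".
print(insp.get_columns('some_table', schema='default'))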

<@UV14447EU> I have even tried with a Hive connection and it gives a different error.

here is the code

from sqlalchemy.engine import create_engine

# PyHive connection to HiveServer2 with Kerberos authentication
engine = create_engine('hive://XXX', connect_args={'auth': 'KERBEROS', 'kerberos_service_name': 'hive'})
conn = engine.connect()
print(conn)

here is the error stack

    return self._connection_cls(self)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 145, in __init__
    self._dbapi_connection = engine.raw_connection()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3292, in raw_connection
    return self.pool.connect()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 452, in connect
    return _ConnectionFairy._checkout(self)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 1269, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 716, in checkout
    rec = pool._do_get()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 170, in _do_get
    self._dec_overflow()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/impl.py", line 167, in _do_get
    return self._create_connection()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 393, in _create_connection
    return _ConnectionRecord(self)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 678, in __init__
    self.__connect()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 903, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 146, in __exit__
    raise exc_value.with_traceback(exc_tb)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 898, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 637, in connect
    return dialect.connect(*cargs, **cparams)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 616, in connect
    return self.loaded_dbapi.connect(*cargs, **cparams)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/pyhive/hive.py", line 143, in connect
    return Connection(*args, **kwargs)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/pyhive/hive.py", line 269, in __init__
    self._transport.open()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/thrift_sasl/__init__.py", line 93, in open
    status, payload = self._recv_sasl_message()
  File "/datahub/management-app/venv/lib/python3.9/site-packages/thrift_sasl/__init__.py", line 112, in _recv_sasl_message
    header = self._trans_read_all(5)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/thrift_sasl/__init__.py", line 210, in _trans_read_all
    return read_all(sz)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/thrift/transport/TTransport.py", line 62, in readAll
    chunk = self.read(sz - have)
  File "/datahub/management-app/venv/lib/python3.9/site-packages/thrift/transport/TSocket.py", line 166, in read
    raise TTransportException(type=TTransportException.END_OF_FILE,
thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
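For comparison, a Kerberos connection attempted through impyla’s DB-API rather than PyHive’s thrift_sasl transport would look roughly like the sketch below; the host, port, and service name are placeholders, not values from the thread:

from impala.dbapi import connect

# Placeholder connection details; 'GSSAPI' is impyla's auth_mechanism value for Kerberos.
conn = connect(host='impala-host', port=21050,
               auth_mechanism='GSSAPI',
               kerberos_service_name='impala')
cur = conn.cursor()
cur.execute('SHOW DATABASES')
print(cur.fetchall())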

Hello <@UV14447EU>, did you get a chance to look into the above issue?

Hello <@UV14447EU>, can you help us here?

I’m afraid that if neither the Hive source nor the vanilla SQLAlchemy source works with Impala, then we need a custom source built on the official Impala SQLAlchemy driver (does this exist?).