Hi Team -
We are on DataHub version 0.10.5, though the problem below has been present since well before this version.
We are ingesting some Hive tables from Apache Impala using the Hive connector. The ingestion jobs report several errors about being unable to ingest metadata. In the DataHub UI, the tables are present but have no schemas.
Below are some of the errors we see:
[2023-10-02T04:43:11.998+0000] {process_utils.py:187} INFO - config = SQLAlchemyGenericConfig.parse_obj(config_dict)
[2023-10-02T04:43:12.465+0000] {process_utils.py:187} INFO - _impala_builtins => Views error:
[2023-10-02T04:43:12.465+0000] {process_utils.py:187} INFO - Traceback (most recent call last):
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - File "/tmp/venv457bp1g4/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 881, in loop_views
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - for view in inspector.get_view_names(schema):
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - File "/tmp/venv457bp1g4/lib/python3.9/site-packages/sqlalchemy/engine/reflection.py", line 412, in get_view_names
[2023-10-02T04:43:12.466+0000] {process_utils.py:187} INFO - return self.dialect.get_view_names(
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO - File "/tmp/venv457bp1g4/lib/python3.9/site-packages/sqlalchemy/engine/interfaces.py", line 332, in get_view_names
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO - raise NotImplementedError()
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO - NotImplementedError
[2023-10-02T04:43:12.467+0000] {process_utils.py:187} INFO -
[2023-10-02T04:43:12.563+0000] {process_utils.py:187} INFO - adh_adhoc_cdl4ran.atoll_cdds_l8 => unable to get column information due to an error -> (impala.error.HiveServer2Error) AnalysisException: Failed to load metadata for table: 'adh_adhoc_cdl4ran.atoll_cdds_l8'
[2023-10-02T04:43:12.563+0000] {process_utils.py:187} INFO - CAUSED BY: TableLoadingException: Could not load table adh_adhoc_cdl4ran.atoll_cdds_l8 from catalog
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - CAUSED BY: TException: TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, error_msgs:[TableLoadingException: Failed to load metadata for table: adh_adhoc_cdl4ran.atoll_cdds_l8
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - CAUSED BY: InvalidStorageDescriptorException: Impala does not support tables of this type. REASON: SerDe library 'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.]), lookup_status:OK)
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO -
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM adh_adhoc_cdl4ran.atoll_cdds_l8 LIMIT 0]
[2023-10-02T04:43:12.564+0000] {process_utils.py:187} INFO - (Background on this error at: <https://sqlalche.me/e/14/dbapi>)
[2023-10-02T04:50:59.728+0000] {process_utils.py:187} INFO -
[2023-10-02T04:50:59.728+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM prod_tic_sandboxes.t_salesforce_accounts_111 LIMIT 0]
[2023-10-02T04:50:59.728+0000] {process_utils.py:187} INFO - (Background on this error at: <https://sqlalche.me/e/14/dbapi>)
[2023-10-02T04:50:59.764+0000] {process_utils.py:187} INFO - prod_tic_sandboxes.t_salesforce_accounts_222 => unable to get column information due to an error -> (impala.error.HiveServer2Error) AnalysisException: Failed to load metadata for table: 'prod_tic_sandboxes.t_salesforce_accounts_222'
[2023-10-02T04:50:59.765+0000] {process_utils.py:187} INFO - CAUSED BY: TableLoadingException: Could not load table prod_tic_sandboxes.t_salesforce_accounts_222 from catalog
[2023-10-02T04:50:59.765+0000] {process_utils.py:187} INFO - CAUSED BY: TException: TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, error_msgs:[TableLoadingException: Failed to load metadata for table: prod_tic_sandboxes.t_salesforce_accounts_222
[2023-10-02T04:50:59.765+0000] {process_utils.py:187} INFO - CAUSED BY: AnalysisException: Invalid avro.schema.url: <hdfs://nameservice-cdlpv2/data/prod/tic/base/salesforce/schema/salesforce_schema.avsc>. Path does not exist.]), lookup_status:OK)
[2023-10-03T03:02:15.723+0000] {process_utils.py:187} INFO - default.medallia_euc6173 => unable to get column information due to an error -> 'ARRAY'
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - default => Views error:
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - Traceback (most recent call last):
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - File "/tmp/venvnrgb6ir9/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_common.py", line 881, in loop_views
[2023-10-03T03:02:16.484+0000] {process_utils.py:187} INFO - for view in inspector.get_view_names(schema):
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - File "/tmp/venvnrgb6ir9/lib/python3.9/site-packages/sqlalchemy/engine/reflection.py", line 412, in get_view_names
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - return self.dialect.get_view_names(
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - File "/tmp/venvnrgb6ir9/lib/python3.9/site-packages/sqlalchemy/engine/interfaces.py", line 332, in get_view_names
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - raise NotImplementedError()
[2023-10-03T03:02:16.485+0000] {process_utils.py:187} INFO - NotImplementedError
[2023-10-03T03:02:32.193+0000] {process_utils.py:187} INFO - metadata.edhub_group => unable to get column information due to an error -> (impala.error.HiveServer2Error) Error while compiling statement: FAILED: RuntimeException java.lang.ClassNotFoundException: org.apache.kudu.mapreduce.KuduTableInputFormat
[2023-10-03T03:02:32.194+0000] {process_utils.py:187} INFO - [SQL: SELECT * FROM metadata.edhub_group LIMIT 0]
[2023-10-03T03:02:32.194+0000] {process_utils.py:187} INFO - (Background on this error at: <https://sqlalche.me/e/14/dbapi>)
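In case it helps with triage, here is a rough sketch of how the "unable to get column information" lines above could be grouped by table. The function name `group_column_failures` and the regex are our own assumptions based on the log format shown here, not anything provided by DataHub.

```python
import re

# Matches lines shaped like the ones in the log above, e.g.
#   "... INFO - db.table => unable to get column information due to an error -> <message>"
# The pattern is an assumption derived from this log output only.
FAILURE_RE = re.compile(
    r"INFO - (?P<table>\S+) => unable to get column information "
    r"due to an error -> (?P<msg>.*)$"
)


def group_column_failures(log_lines):
    """Return {table_name: first line of the error message} for each failing table."""
    failures = {}
    for line in log_lines:
        match = FAILURE_RE.search(line)
        if match:
            failures[match.group("table")] = match.group("msg")
    return failures
```

Running this over the full job log would give one entry per table that lost its schema, which makes it easier to see that the failures fall into a few distinct causes (unsupported SerDe, missing Avro schema file, the 'ARRAY' error, missing Kudu class).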
Can someone please help us understand what is wrong here?
<@UV14447EU> <@U03BEML16LB> <@U01GCJKA8P9> <@U05QQUDHTKJ>