<@U06TM7M40RK> Why is the Trino connector automatically creating sibling tables on the Hive platform for every Trino table that gets ingested?
It seems likely that the automatic creation of sibling tables for Trino and Hive platforms during ingestion is related to the configuration and features of DataHub’s ingestion framework. According to the <Trino | DataHub ingestion source documentation>, there is an option to set Trino as the primary entity of a set of siblings, which might be influencing this behavior. This feature is described as experimental and is controlled by the trino_as_primary configuration parameter.
Additionally, the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/trino.py|Trino ingestion source code> indicates that sibling relationships can be generated during ingestion. The code includes logic to create sibling work units, which are used to establish relationships between datasets from different platforms that are considered equivalent or related.
If you are experiencing unexpected behavior with sibling creation, it might be helpful to review your ingestion configuration to ensure that the trino_as_primary setting aligns with your expectations. You may also want to check if there are any custom transformers or scripts in your ingestion pipeline that could be affecting sibling relationships.
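As an illustration of what "primary entity of a set of siblings" could mean in practice, here is a minimal stdlib-only sketch. It assumes, based only on the description of trino_as_primary above, that a sibling link is written on both datasets and that the flag merely selects which side is marked primary. `SiblingLink` and `model_sibling_links` are hypothetical names, not DataHub APIs, and the URNs are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SiblingLink:
    """Illustrative stand-in for a siblings record attached to one dataset."""
    entity_urn: str
    primary: bool
    siblings: Tuple[str, ...]


def model_sibling_links(
    trino_urn: str, source_urn: str, trino_as_primary: bool
) -> List[SiblingLink]:
    # Assumption: one link per side, with the flag deciding which side is primary.
    return [
        SiblingLink(trino_urn, trino_as_primary, (source_urn,)),
        SiblingLink(source_urn, not trino_as_primary, (trino_urn,)),
    ]


# Placeholder URNs, not taken from any real deployment.
trino_urn = "urn:li:dataset:(urn:li:dataPlatform:trino,example_catalog.example_schema.example_table,DEV)"
hive_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,example_schema.example_table,PROD)"

for flag in (True, False):
    links = model_sibling_links(trino_urn, hive_urn, trino_as_primary=flag)
    print(flag, [(link.entity_urn.split(",")[0][-5:], link.primary) for link in links])
```

Under that assumption, both values of the flag still produce two links, so changing trino_as_primary alone would alter which dataset is primary rather than whether siblings exist.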
Would you like more detailed instructions or examples on how to configure or troubleshoot this behavior?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/trino.py|Trino ingestion source code>
- <Trino | DataHub ingestion source documentation>
Hint: Mention @runllm in the thread for followups.
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps!
<@U06TM7M40RK> In Trino ingestion, I see messages like this in the logs:
{datahub.ingestion.source.ge_data_profiler:1206} - Unable to delete temporary table: "data"."t_ra_brm_assetswe"
What does this message mean, and why is it unable to delete the temporary table? Because of this error, we are getting an unnecessary sibling relation between Trino and Hive datasets. The temporary table data.t_ra_brm_assetswe is treated as a Hive dataset without any schema.
Some more similar messages:
WARNING {datahub.ingestion.source.ge_data_profiler:1206} - Unable to delete temporary table: "data"."t_sweb2x_emu_commissionbase"
WARNING {datahub.ingestion.source.ge_data_profiler:1206} - Unable to delete temporary table: "information_schema"."applicable_roles"
It seems likely that the messages you are seeing in the logs, such as “Unable to delete temporary table,” are related to the process of data profiling during the ingestion of Trino datasets. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/ge_data_profiler.py|DataHub ingestion source code>, temporary tables are often created as part of the profiling process to gather statistics or perform other operations. These tables are intended to be deleted after their use to avoid clutter and potential conflicts.
The inability to delete these temporary tables could be due to several reasons:
- Permissions: The user account used for ingestion might not have the necessary permissions to delete tables in the specified schema. This is a common issue, and ensuring that the account has the appropriate DROP or DELETE permissions on the temporary tables is crucial.
- Locks or Dependencies: There might be locks or dependencies on the temporary tables that prevent them from being deleted. This could happen if other processes are accessing the tables simultaneously.
- Configuration Issues: There might be a misconfiguration in the ingestion setup that affects the handling of temporary tables. Reviewing the ingestion configuration for any settings related to temporary tables might help identify the issue.
The creation of unnecessary sibling relationships between Trino and Hive datasets could be a side effect of these temporary tables not being deleted. If the temporary tables are mistakenly treated as Hive datasets, it might be due to the ingestion framework’s logic for identifying and categorizing datasets.
To address this issue, you might want to:
- Verify that the user account has the necessary permissions to delete temporary tables.
- Check for any locks or dependencies that might be preventing the deletion of these tables.
- Review the ingestion configuration for any settings related to temporary tables and ensure they are correctly set.
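Before checking permissions or locks, it can help to know exactly which tables are affected. The sketch below collects them from an ingestion log, assuming the warning format quoted earlier in this thread ("schema"."table" in double quotes). `find_undeleted_temp_tables` is a hypothetical helper written for this illustration, not part of DataHub.

```python
import re

# Assumes the message format seen in this thread: ... table: "schema"."table"
_UNDELETED = re.compile(r'Unable to delete temporary table: "([^"]+)"\."([^"]+)"')


def find_undeleted_temp_tables(log_lines):
    """Return the sorted, de-duplicated (schema, table) pairs named in the warnings."""
    found = set()
    for line in log_lines:
        match = _UNDELETED.search(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return sorted(found)


sample_log = [
    'WARNING {datahub.ingestion.source.ge_data_profiler:1206} - '
    'Unable to delete temporary table: "data"."t_sweb2x_emu_commissionbase"',
    'WARNING {datahub.ingestion.source.ge_data_profiler:1206} - '
    'Unable to delete temporary table: "information_schema"."applicable_roles"',
    "INFO an unrelated log line",
]

for schema, table in find_undeleted_temp_tables(sample_log):
    print(f"{schema}.{table}")
```

The resulting list can then be compared against the schemas the ingestion account is allowed to modify.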
Would you like more detailed steps or examples on how to troubleshoot or resolve this issue?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/ge_data_profiler.py|DataHub ingestion source code>
<@U06TM7M40RK> I have the below configuration for Trino ingestion -
source:
  type: trino
  config:
    # Coordinates
    host_port:
    database:
    env: DEV
    # Credentials
    username: ${DATAHUB_TRINO_USER}
    password: ${DATAHUB_TRINO_PASSWORD}
    trino_as_primary: False
    profiling:
      enabled: False
but the ingestion is still creating unwanted siblings on the Hive platform, without any schema. Below are the ingestion log messages:
'siblings': ['urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_ra_brm_assetsweaccrued,DEV)',
'urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_qup740_new_table,DEV)',
'urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_sweb2x_netquery_address,DEV)',
'urn:li:dataset:(urn:li:dataPlatform:hive,data.t_na_mobileum_rts,PROD)',
'urn:li:dataset:(urn:li:dataPlatform:hive,data.t_ra_wh12p_billnor,PROD)',
'urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_ra_ime_nobillirap,DEV)',
'urn:li:dataset:(urn:li:dataPlatform:hive,data.t_ra_wh12p_featurenor,PROD)',
'urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_sweb2x_netquery_fiberlan,DEV)',
'urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_na_rgw_bandwidth,DEV)',
'urn:li:dataset:(urn:li:dataPlatform:hive,data.t_jqp615_arcus_demo,PROD)',
'... sampled of 324 total elements']
What config parameter will completely stop the sibling creation? I’ve tried setting trino_as_primary to False and disabling profiling, but neither worked.
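The sampled siblings list above mixes two URN shapes: Trino URNs with a minio. catalog prefix and env DEV, and Hive URNs without the catalog prefix and with env PROD. A small sketch for tallying a siblings list by platform and environment follows; `parse_dataset_urn` and `summarize_siblings` are hypothetical helpers for this illustration, and the parsing assumes the simple three-part dataset URN form seen in the log.

```python
import re
from collections import Counter

# Assumes URNs of the form urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_DATASET_URN = re.compile(
    r"urn:li:dataset:\(urn:li:dataPlatform:([^,]+),([^,]+),([^)]+)\)"
)


def parse_dataset_urn(urn):
    """Split a dataset URN into (platform, name, env)."""
    match = _DATASET_URN.fullmatch(urn)
    if match is None:
        raise ValueError(f"not a recognised dataset URN: {urn!r}")
    return match.groups()


def summarize_siblings(urns):
    """Count URNs per (platform, env) pair."""
    return Counter((platform, env) for platform, _, env in map(parse_dataset_urn, urns))


# A few entries copied from the log excerpt in this thread.
sample = [
    "urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_qup740_new_table,DEV)",
    "urn:li:dataset:(urn:li:dataPlatform:hive,data.t_na_mobileum_rts,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:hive,data.t_ra_wh12p_billnor,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:trino,minio.data.t_na_rgw_bandwidth,DEV)",
]

for (platform, env), count in sorted(summarize_siblings(sample).items()):
    print(platform, env, count)
```

Seeing the Hive entries grouped under PROD while the recipe sets env: DEV suggests the Hive URNs are built from different settings than the Trino ones, which may narrow down where to look.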
I’m not able to find any data sources to answer your question. You might want to provide details about any custom scripts or transformers used in your ingestion pipeline. If this doesn’t help, please reach out to the DataHub support team for further assistance.
<@U06TM7M40RK> I already provided the complete recipe for Trino ingestion above. Which parameter in the recipe configuration controls the creation of Hive siblings?
I’m not able to find any data sources to answer your question. You might want to specify if there are any additional plugins or custom logic applied during ingestion. If this doesn’t help, please reach out to the DataHub support team for further assistance.