Hey there! :hello-dog:
I have a question about the DataHub integration with Great Expectations. We are using DataHub 0.12.0 and GE version 0.15.50.
We created a datasource inside the great_expectations.yml file like this:
```
name: my_bigquery_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  class_name: SqlAlchemyExecutionEngine
  module_name: great_expectations.execution_engine
  connection_string: bigquery://myproject/tmp
data_connectors:
  default_runtime_data_connector_name:
    name: default_runtime_data_connector_name
    class_name: RuntimeDataConnector
    module_name: great_expectations.datasource.data_connector
    batch_identifiers:
      - default_identifier_name
  default_inferred_data_connector_name:
    name: default_inferred_data_connector_name
    class_name: InferredAssetSqlDataConnector
    module_name: great_expectations.datasource.data_connector
    include_schema_name: true
```
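For reference, this is roughly how we load the context and inspect that datasource (a minimal sketch assuming the default project layout; GE 0.15.x API):
```python
# Minimal sketch: load the GE context and list the assets the datasource exposes
import great_expectations as gx

context = gx.get_context()
# With InferredAssetSqlDataConnector this returns tables from every dataset
# in the project, not only `tmp`
print(context.get_available_data_asset_names())
```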
When I execute `print(context.get_available_data_asset_names())`, I noticed that all tables from all datasets inside my GCP project appear. My intention is to use `tmp` as the dataset where GE saves the temporary tables it generates. So far so good. However, I want to send the validation results to DataHub, and that is where the problem begins. We configured the checkpoint as follows (the code below is the main part of the file):
```
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
  - name: datahub_action
    action:
      module_name: datahub.integrations.great_expectations.action
      class_name: DataHubValidationAction
      server_url: http://datahub-gms:8080
validations:
  - batch_request:
      datasource_name: my_bigquery_datasource
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: mydataset.mytable
      data_connector_query:
        index: -1
    expectation_suite_name: myproject.mydataset.mytable
```
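We then trigger it roughly like this (the checkpoint name `my_checkpoint` is just a placeholder):
```python
# Sketch of how we run the checkpoint that carries the DataHubValidationAction
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="my_checkpoint")
print(result.success)
```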
This setup fails when sending to DataHub, because the URN includes the `tmp` dataset instead of `mydataset`, even though I am explicitly setting `mydataset` in `data_asset_name`. So it seems DataHub builds the URN from the connection string defined in the `great_expectations.yml` file: when I changed the dataset in the datasource's connection string, it worked.
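To illustrate, the URN that gets emitted versus the one I would expect looks roughly like this (project/dataset/table names are placeholders):
```
urn:li:dataset:(urn:li:dataPlatform:bigquery,myproject.tmp.mytable,PROD)        <- what is emitted
urn:li:dataset:(urn:li:dataPlatform:bigquery,myproject.mydataset.mytable,PROD)  <- what I expect
```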
Is there a way to force the URN to use the `data_asset_name` instead of the connection string? Or is there maybe another parameter inside the checkpoint that would let us set the desired dataset?
We ask because we have several datasets inside the GCP project, and performance would be really bad if we defined one datasource per dataset. Besides, it wouldn't make sense to define many datasources for the same project with different datasets, since each of them retrieves all data assets anyway.
Thanks in advance!