Troubleshooting SSL Certificate Verification Issue in Apache Superset Integration

Original Slack Thread

Hi Team!
I have a problem with SSL certificates when ingesting from Apache Superset. Can you help me? Maybe I need to change my ingestion recipe?

Error log

```
Execution finished with errors.
{'exec_id': '9c78f121-03f4-4693-b30b-1bc1a283eb1a',
 'infos': ['2024-06-05 07:58:49.669538 INFO: Starting execution for task with name=RUN_INGEST',
           "2024-06-05 07:58:53.715320 INFO: Failed to execute 'datahub ingest', exit code 1",
           '2024-06-05 07:58:53.715496 INFO: Caught exception EXECUTING task_id=9c78f121-03f4-4693-b30b-1bc1a283eb1a, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 140, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 282, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv is already set up
venv setup time = 0 sec
This version of datahub supports report-to functionality
+ exec datahub ingest run -c /tmp/datahub/ingest/9c78f121-03f4-4693-b30b-1bc1a283eb1a/recipe.yml --report-to /tmp/datahub/ingest/9c78f121-03f4-4693-b30b-1bc1a283eb1a/ingestion_report.json
[2024-06-05 07:58:52,376] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.3rc1
[2024-06-05 07:58:52,419] INFO     {datahub.ingestion.run.pipeline:254} - Sink configured successfully. 
[2024-06-05 07:58:52,751] ERROR    {datahub.entrypoints:205} - Command failed: Failed to configure the source (superset): HTTPSConnectionPool(host='superset.dev.datastorage.int.rolfcorp.ru', port=443): Max retries exceeded with url: /login/api/v1/security/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
    conn.connect()
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/connection.py", line 653, in connect
    sock_and_verified = _ssl_wrap_socket_and_match_hostname(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/connection.py", line 806, in _ssl_wrap_socket_and_match_hostname
    ssl_sock = ssl_wrap_socket(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 465, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 509, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/local/lib/python3.10/ssl.py", line 1104, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.10/ssl.py", line 1375, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/connectionpool.py", line 491, in _make_request
    raise new_e
urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/urllib3/util/retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='superset.dev.datastorage.int.rolfcorp.ru', port=443): Max retries exceeded with url: /login/api/v1/security/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 123, in _add_init_error_context
    yield
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 272, in __init__
    self.source = source_class.create(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/ingestion/source/superset.py", line 219, in create
    return cls(ctx, config)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/ingestion/source/superset.py", line 182, in __init__
    login_response = requests.post(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/requests/adapters.py", line 698, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='superset.dev.datastorage.int.rolfcorp.ru', port=443): Max retries exceeded with url: /login/api/v1/security/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/entrypoints.py", line 192, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 170, in run_ingestion_and_check_upgrade
    pipeline = Pipeline.create(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 388, in create
    return cls(
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 271, in __init__
    with _add_init_error_context(f"configure the source ({self.source_type})"):
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/tmp/datahub/ingest/venv-superset-03575587e416950c/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 125, in _add_init_error_context
    raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (superset): HTTPSConnectionPool(host='superset.dev.datastorage.int.rolfcorp.ru', port=443): Max retries exceeded with url: /login/api/v1/security/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))
```
My recipe:

```
run_id: 'urn:li:dataHubExecutionRequest:9c78f121-03f4-4693-b30b-1bc1a283eb1a'
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: 'broker:29092'
      schema_registry_url: 'http://schema-registry:8081'
source:
  type: superset
  config:
    connect_uri: 'https://superset.dev.datastorage.int.rolfcorp.ru/login/'
    password: '${DWH_SUPERSET_PASSWORD}'
    provider: db
    username: admin
pipeline_name: 'urn:li:dataHubIngestionSource:1d9947ae-f280-47d3-b553-3fb94ff22e2a'
```
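
One detail in the log may be worth checking alongside the certificate problem: the failing request goes to /login/api/v1/security/login, which suggests the API path is appended to connect_uri as given. The Python sketch below models that composition. build_login_url is a hypothetical helper inferred from the log, not DataHub's actual code, and the hostname is a placeholder.

```python
def build_login_url(connect_uri: str) -> str:
    """Hypothetical model: strip trailing slashes, then append the API path."""
    return connect_uri.rstrip("/") + "/api/v1/security/login"


# A connect_uri ending in /login/ reproduces the path seen in the error log.
print(build_login_url("https://superset.example.internal/login/"))
# A bare base URL yields Superset's REST login endpoint.
print(build_login_url("https://superset.example.internal"))
```

If that reading is right, setting connect_uri to the bare base URL would be worth trying once the certificate issue is resolved.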

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in this channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to give its responses a :thumbsup: or :thumbsdown: so other folks in the Community know whether you’re still looking for help on the topic!

Based on the error log you provided, it seems like the SSL certificate verification is failing because the local issuer certificate cannot be found. This is a common issue when the SSL certificate chain is not properly configured or the certificate is not trusted by the client.
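
To confirm this diagnosis outside DataHub, a TLS handshake can be attempted from the same host or container using only the Python standard library. The sketch below is a minimal probe and not part of DataHub; probe_tls and its parameters are names chosen here for illustration.

```python
import socket
import ssl
from typing import Optional, Tuple


def probe_tls(host: str, port: int = 443, cafile: Optional[str] = None,
              timeout: float = 5.0) -> Tuple[bool, str]:
    """Attempt a verified TLS handshake and report whether it succeeded."""
    # With cafile=None the system default trust store is loaded;
    # otherwise only the certificates in cafile are trusted.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return True, f"verified, protocol {tls.version()}"
    except ssl.SSLCertVerificationError as exc:
        # Same failure class as in the ingestion log above.
        return False, f"certificate verification failed: {exc.verify_message}"
    except OSError as exc:
        # DNS errors, refused connections, timeouts, and other TLS errors.
        return False, f"connection failed: {exc}"
```

Calling probe_tls("superset.dev.datastorage.int.rolfcorp.ru") and then repeating the call with cafile set to the internal CA file should show whether that file is sufficient. Note that requests defaults to the certifi bundle rather than the system store, so results can differ unless cafile is passed explicitly.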

Potential Solutions

  1. Disable SSL Verification: If you are using self-signed certificates or if you are in a development environment, you can disable SSL verification by setting the disable_ssl_verification parameter to true in your ingestion recipe. This is not recommended for production environments.

    source:
      type: superset
      config:
        connect_uri: 'https://superset.dev.datastorage.int.rolfcorp.ru/login/'
        password: '${DWH_SUPERSET_PASSWORD}'
        provider: db
        username: admin
        disable_ssl_verification: true
    
  2. Provide CA Certificate Path: If you have a CA certificate, you can specify the path to the CA certificate in your ingestion recipe using the ca_certificate_path parameter.

    source:
      type: superset
      config:
        connect_uri: 'https://superset.dev.datastorage.int.rolfcorp.ru/login/'
        password: '${DWH_SUPERSET_PASSWORD}'
        provider: db
        username: admin
        ca_certificate_path: '/path/to/ca_certificate.pem'
    
  3. Set Environment Variable: You can set the REQUESTS_CA_BUNDLE environment variable to point to your CA certificate file. This will ensure that the requests library used by DataHub can verify the SSL certificate.

    export REQUESTS_CA_BUNDLE=/path/to/ca_certificate.pem
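
Because REQUESTS_CA_BUNDLE replaces the default trust store rather than extending it, the file it points to usually needs to contain every CA the ingestion process must trust, including any intermediate certificates. A minimal Python sketch of assembling such a file by concatenating PEM files follows; build_ca_bundle and all file paths are hypothetical.

```python
from pathlib import Path
from typing import Iterable

PEM_MARKER = "-----BEGIN CERTIFICATE-----"


def build_ca_bundle(sources: Iterable[str], output: str) -> int:
    """Concatenate PEM certificate files into one bundle; return the cert count."""
    parts = []
    for source in sources:
        text = Path(source).read_text(encoding="utf-8").strip()
        if PEM_MARKER not in text:
            raise ValueError(f"{source} does not look like a PEM certificate file")
        parts.append(text)
    bundle = "\n".join(parts) + "\n"
    Path(output).write_text(bundle, encoding="utf-8")
    return bundle.count(PEM_MARKER)
```

For example, build_ca_bundle(["/path/to/internal_root.pem", "/path/to/intermediate.pem"], "/path/to/ca_certificate.pem") would produce the file that REQUESTS_CA_BUNDLE points to.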
    

Example Recipe with CA Certificate Path

Here is an example of how your ingestion recipe might look with the ca_certificate_path parameter:

run_id: 'urn:li:dataHubExecutionRequest:9c78f121-03f4-4693-b30b-1bc1a283eb1a'
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: 'broker:29092'
      schema_registry_url: 'http://schema-registry:8081'
source:
  type: superset
  config:
    connect_uri: 'https://superset.dev.datastorage.int.rolfcorp.ru/login/'
    password: '${DWH_SUPERSET_PASSWORD}'
    provider: db
    username: admin
    ca_certificate_path: '/path/to/ca_certificate.pem'
pipeline_name: 'urn:li:dataHubIngestionSource:1d9947ae-f280-47d3-b553-3fb94ff22e2a'

Would you like more details on any of these solutions?

<@U06TM7M40RK>

Solution 1 is not working:

```
disable_ssl_verification
  extra fields not permitted (type=value_error.extra)
```

It appears that the disable_ssl_verification parameter is not supported in the Superset source configuration. This is confirmed by the error message you received: extra fields not permitted.
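
The message format (type=value_error.extra) is what pydantic reports when a model forbids unknown fields, so any key the source's config class does not declare fails validation before ingestion starts. The stdlib-only Python sketch below imitates that behavior. ALLOWED_KEYS and validate_config are illustrative stand-ins, and the key set is taken only from the recipe in this thread, not from the real Superset config schema.

```python
from typing import Any, Dict, Set

# Illustrative only: the keys used in this thread's recipe, not the full schema.
ALLOWED_KEYS: Set[str] = {"connect_uri", "username", "password", "provider"}


def validate_config(config: Dict[str, Any], allowed: Set[str] = ALLOWED_KEYS) -> None:
    """Reject any key outside the allowed set, as a strict config model would."""
    extra = sorted(set(config) - allowed)
    if extra:
        lines = [f"{key}\n  extra fields not permitted" for key in extra]
        raise ValueError("\n".join(lines))


try:
    validate_config({"connect_uri": "https://superset.example.internal",
                     "username": "admin",
                     "disable_ssl_verification": True})
except ValueError as err:
    print(err)
```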

Alternative Solutions

  1. Set Environment Variable: You can point the REQUESTS_CA_BUNDLE environment variable at a CA certificate file that includes the issuer of the Superset certificate. This approach is often used when no configuration parameter is available. Note that an empty value does not disable verification; the requests library then falls back to its default CA bundle.

    export REQUESTS_CA_BUNDLE=/path/to/ca_certificate.pem
    
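  2. Modify the Source Code: If you have control over the environment where DataHub is running, you can modify the source code to disable SSL verification. This involves changing the rest_emitter.py file, as mentioned in a GitHub issue in the datahub-project/datahub repository. Note that rest_emitter.py handles the connection to the DataHub REST API, so this change may not affect the Superset login request that fails in the traceback above.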

    # Modify line 65 in rest_emitter.py
    disable_ssl_verification: bool = True
    
  3. Use a Custom Recipe: If you are using a custom recipe, ensure that you are using the correct parameters supported by the DataHub version you are using. You might need to update your DataHub CLI to the latest version if the parameter is supported in newer versions.
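
How the requests library picks its CA setting from the environment can be modeled in a few lines. The sketch below is a simplified, stdlib-only approximation of the lookup order in recent requests releases, not the library's actual code, and is worth checking against the installed version; resolve_verify is a hypothetical name.

```python
from typing import Mapping, Union


def resolve_verify(env: Mapping[str, str],
                   verify: Union[bool, str, None] = True) -> Union[bool, str]:
    """Approximate which CA setting applies when the environment is trusted."""
    if verify is True or verify is None:
        # Empty strings are falsy, so an empty variable counts as unset.
        return (env.get("REQUESTS_CA_BUNDLE")
                or env.get("CURL_CA_BUNDLE")
                or True)
    # An explicit verify=False or verify="/some/path" wins over the environment.
    return verify


print(resolve_verify({"REQUESTS_CA_BUNDLE": "/path/to/ca_certificate.pem"}))
# → /path/to/ca_certificate.pem
print(resolve_verify({"REQUESTS_CA_BUNDLE": ""}))
# → True
```

Under this model an empty REQUESTS_CA_BUNDLE leaves verification on with the default bundle, so only a path to a suitable CA file changes the outcome.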

Setting Environment Variable Example

Here is how you can set the environment variable in a Docker container:

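  1. Dockerfile: Add the environment variable in your Dockerfile. The certificate file must also exist inside the container, for example via a COPY instruction or a mounted volume.

    ENV REQUESTS_CA_BUNDLE=/path/to/ca_certificate.pem
    
  2. Docker Compose: Add the environment variable in your docker-compose.yml file.

    version: '3.7'
    services:
      datahub:
        image: datahubproject/datahub:latest
        environment:
          - REQUESTS_CA_BUNDLE=/path/to/ca_certificate.pem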
    

Example Recipe without Unsupported Parameters

Here is your ingestion recipe without the unsupported disable_ssl_verification parameter:

run_id: 'urn:li:dataHubExecutionRequest:9c78f121-03f4-4693-b30b-1bc1a283eb1a'
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: 'broker:29092'
      schema_registry_url: 'http://schema-registry:8081'
source:
  type: superset
  config:
    connect_uri: 'https://superset.dev.datastorage.int.rolfcorp.ru/login/'
    password: '${DWH_SUPERSET_PASSWORD}'
    provider: db
    username: admin
pipeline_name: 'urn:li:dataHubIngestionSource:1d9947ae-f280-47d3-b553-3fb94ff22e2a'

Would you like more details on any of these solutions?

We don’t currently support the disable_ssl_verification flag on Superset ingestion

It probably wouldn’t be too hard to add - if you’re up for contributing it, I can provide some pointers