Troubleshooting a BigQuery Metadata Ingestion Issue with the DataHub Team

Original Slack Thread

Hello DataHub team. I have a problem when using BigQuery metadata ingestion: [Issue #8550](https://github.com/datahub-project/datahub/issues/8550).
This is my ingest YAML:

```
source:
    type: bigquery
    config:
        env: TEST
        rate_limit: true
        requests_per_min: 60
        include_table_lineage: true
        include_usage_statistics: true
        include_tables: true
        include_views: true
        extract_column_lineage: true
        lineage_use_sql_parser: true
        incremental_lineage: false
        max_query_duration: 1800
        profiling:
            enabled: true
            profile_table_level_only: false
        stateful_ingestion:
            enabled: true
        credential:
            xxxx
        dataset_pattern:
            allow:
                - hods_utc
                - ods_utc
```
Here is the log:

```
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': 'f22174e2-0a08-4229-b766-038193888ed6',
 'infos': ['2023-08-03 07:17:24.999134 INFO: Starting execution for task with name=RUN_INGEST',
           "2023-08-03 07:17:47.442757 INFO: Failed to execute 'datahub ingest'",
           '2023-08-03 07:17:47.443519 INFO: Caught exception EXECUTING task_id=f22174e2-0a08-4229-b766-038193888ed6, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv setup time = 0
This version of datahub supports report-to functionality
datahub  ingest run -c /tmp/datahub/ingest/f22174e2-0a08-4229-b766-038193888ed6/recipe.yml --report-to /tmp/datahub/ingest/f22174e2-0a08-4229-b766-038193888ed6/ingestion_report.json
[2023-08-03 07:17:29,555] INFO     {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.0.7
No ~/.datahubenv file found, generating one for you...
[2023-08-03 07:17:29,973] INFO     {datahub.ingestion.run.pipeline:184} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
[2023-08-03 07:17:43,997] WARNING  {py.warnings:109} - /usr/local/lib/python3.10/site-packages/ratelimiter.py:127: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead
  __aexit__ = asyncio.coroutine(__exit__)

[2023-08-03 07:17:44,778] WARNING  {py.warnings:109} - /usr/local/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery_config.py:198: ConfigurationWarning: env is deprecated and will be removed in a future release. Please use platform_instance instead.
  super().__init__(**data)

[2023-08-03 07:17:44,782] WARNING  {datahub.ingestion.source.bigquery_v2.bigquery_config:256} - Please update `dataset_pattern` to match against fully qualified schema name `<project_id>.<dataset_name>` and set config `match_fully_qualified_names : True`.Current default `match_fully_qualified_names: False` is only to maintain backward compatibility. The config option `match_fully_qualified_names` will be deprecated in future and the default behavior will assume `match_fully_qualified_names: True`.
Failed to configure the source (bigquery): 1 validation error for BigQueryV2Config
extract_column_lineage
  extra fields not permitted (type=value_error.extra)
```

I’m not quite sure what the cause of this problem is, but when I remove the field `extract_column_lineage`, it works fine.

My DataHub version is 0.10.4, for both server and CLI.
Can someone please help me? Thank you very, very much.
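As an aside, the `ConfigurationWarning` about `dataset_pattern` in the log above is unrelated to the failure, but addressing it would look roughly like this sketch, where `my-project` is a placeholder for the real GCP project ID:

```yml
# Match datasets against the fully qualified <project_id>.<dataset_name>
# form, as the warning requests; "my-project" is a placeholder.
match_fully_qualified_names: true
dataset_pattern:
    allow:
        - my-project.hods_utc
        - my-project.ods_utc
```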

Hi,
the thing is that `extract_column_lineage` is not available in 0.10.4. It was added in 0.10.5.
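A quick way to confirm which client version is actually in use, and therefore which config fields apply, is a check like this sketch:

```bash
# Print the installed DataHub CLI version before trusting the hosted docs,
# which track the master branch:
datahub version
pip show acryl-datahub
```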

I can’t believe it’s so obvious. It’s embarrassing. hhh…
Thanks <@U02AF5P6QDS>! Let me try.

no problem. I ran into the same thing, since the hosted documentation shows the fields from the master branch.

Hi <@U02AF5P6QDS>
After I upgraded to v0.10.5, there is still an

```
  extra fields not permitted (type=value_error.extra)
```

error in my service.

I used the Docker quickstart images to upgrade DataHub.

did you upgrade your client as well? i.e., did a pip install of the 0.10.5.* version?

yes, I upgraded my local client:

```
acryl-datahub, version 0.10.5
```

I’m very confused: I run the ingestion from the web UI and the web log shows that the CLI version is 0.10.4.2, but I actually have 0.10.5 installed on my machine.
By the way, the `docker ps` output below already confirms that I have the corresponding version of DataHub installed, right?

```
CONTAINER ID   IMAGE                                     COMMAND                  CREATED          STATUS                    PORTS                                                           NAMES
94f4afc4fdc5   acryldata/datahub-actions:head            "/bin/sh -c 'dockeri…"   31 minutes ago   Up 28 minutes                                                                             datahub-actions
d06b829cd4fb   linkedin/datahub-frontend-react:v0.10.5   "/bin/sh -c ./start.…"   31 minutes ago   Up 28 minutes (healthy)   0.0.0.0:9002->9002/tcp, :::9002->9002/tcp                       datahub-frontend-react
7104bb242bb1   linkedin/datahub-gms:v0.10.5              "/bin/sh -c /datahub…"   31 minutes ago   Up 30 minutes (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp                       datahub-gms
f102d6eca5fa   mysql:5.7                                 "docker-entrypoint.s…"   7 hours ago      Up 7 hours (healthy)      0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp            mysql
e53f4c46d30f   confluentinc/cp-schema-registry:7.4.0     "/etc/confluent/dock…"   2 days ago       Up 2 days (healthy)       0.0.0.0:8081->8081/tcp, :::8081->8081/tcp                       schema-registry
9814392624bf   confluentinc/cp-kafka:7.4.0               "/etc/confluent/dock…"   2 days ago       Up 2 days (healthy)       0.0.0.0:9092->9092/tcp, :::9092->9092/tcp                       broker
25399a96b558   confluentinc/cp-zookeeper:7.4.0           "/etc/confluent/dock…"   2 days ago       Up 2 days (healthy)       2888/tcp, 0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 3888/tcp   zookeeper
2ee99519f02b   elasticsearch:7.10.1                      "/tini -- /usr/local…"   2 days ago       Up 2 days (healthy)       0.0.0.0:9200->9200/tcp, :::9200->9200/tcp, 9300/tcp             elasticsearch
```
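Worth noting: UI-triggered ingestion executes inside the `datahub-actions` container, so the CLI version that matters there is the one bundled in that image, not the one on the host. A quick check, sketched here using the container name from the `docker ps` output above:

```bash
# Print the CLI version inside the actions container, which runs UI ingestion:
docker exec datahub-actions datahub version
```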

Oh wait, here is

```
linkedin/datahub-gms:v0.10.5
```

that’s what I find on Docker Hub.
But there are also

```
acryldata/datahub-frontend-react:v0.10.5
acryldata/datahub-gms:v0.10.5
```
which one should I choose? :joy: (screenshot attached)

the linkedin ones should be the right ones, if not even the same image

can you try running the ingestion from your local machine and not via the UI?
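For reference, a local debug run along these lines produced the log that follows; `recipe.yml` is a placeholder for the saved recipe path:

```bash
# The global --debug flag enables the DEBUG-level log lines shown below.
datahub --debug ingest run -c recipe.yml
```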

```
[2023-08-04 09:26:39,769] DEBUG    {datahub.telemetry.telemetry:309} - Sending telemetry for function-call
[2023-08-04 09:26:40,005] INFO     {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.5
[2023-08-04 09:26:40,123] DEBUG    {datahub.ingestion.sink.datahub_rest:118} - Setting env variables to override config
[2023-08-04 09:26:40,124] DEBUG    {datahub.ingestion.sink.datahub_rest:120} - Setting gms config
[2023-08-04 09:26:40,124] DEBUG    {datahub.ingestion.run.pipeline:212} - Sink type datahub-rest (<class 'datahub.ingestion.sink.datahub_rest.DatahubRestSink'>) configured
[2023-08-04 09:26:40,124] INFO     {datahub.ingestion.run.pipeline:213} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms.as-in.io with token: eyJh**********U7Sc
[2023-08-04 09:26:40,183] DEBUG    {datahub.ingestion.sink.datahub_rest:118} - Setting env variables to override config
[2023-08-04 09:26:40,183] DEBUG    {datahub.ingestion.sink.datahub_rest:120} - Setting gms config
[2023-08-04 09:26:40,183] DEBUG    {datahub.ingestion.reporting.datahub_ingestion_run_summary_provider:125} - Ingestion source urn = urn:li:dataHubIngestionSource:cli-05a5746e49c4efc8a3d2044d1a297d76
[2023-08-04 09:26:40,185] DEBUG    {datahub.emitter.rest_emitter:260} - Attempting to emit to DataHub GMS; using curl equivalent to:
curl -X POST -H 'User-Agent: python-requests/2.28.1' -H 'Accept-Encoding: gzip, deflate' -H 'Accept: */*' -H 'Connection: keep-alive' -H 'X-RestLi-Protocol-Version: 2.0.0' -H 'Content-Type: application/json' -H 'Authorization: <redacted>' --data '{"proposal": {"entityType": "dataHubIngestionSource", "entityUrn":"urn:li:dataHubIngestionSource:cli-05a5746e49c4efc8a3d2044d1a297d76", "changeType": "UPSERT", "aspectName": "dataHubIngestionSourceInfo", "aspect": {"value": "{\"name\": \"[CLI] bigquery [urn:li:dataHubIngestionSource:72dc6fb3-7573-4991-91b9-068969132b08]\", \"type\": \"bigquery\", \"platform\": \"urn:li:dataPlatform:unknown\", \"config\": {\"recipe\": \"{\\\"source\\\": {\\\"type\\\": \\\"bigquery\\\", \\\"config\\\": {\\\"extract_column_lineage\\\": false, \\\"env\\\": \\\"TEST\\\", \\\"rate_limit\\\": true, \\\"requests_per_min\\\": 60, \\\"include_table_lineage\\\": true, \\\"include_usage_statistics\\\": true, \\\"include_tables\\\": true, \\\"include_views\\\": true, \\\"lineage_use_sql_parser\\\": true, \\\"incremental_lineage\\\": false, \\\"max_query_duration\\\": 1800, \\\"profiling\\\": {\\\"enabled\\\": true, \\\"profile_table_level_only\\\": false}, \\\"stateful_ingestion\\\": {\\\"enabled\\\": true}, \\\"credential\\\": {\\\"project_id\\\": \\\"aftership-test\\\", \\\"private_key\\\": \\\"********\\\", \\\"private_key_id\\\": \\\"********\\\", \\\"client_email\\\": \\\"tssd-792@aftership-test.iam.gserviceaccount.com\\\", \\\"client_id\\\": \\\"101426640770366287958\\\"}, \\\"dataset_pattern\\\": {\\\"allow\\\": [\\\"hods_utc\\\", \\\"ods_utc\\\"]}}}, \\\"pipeline_name\\\": \\\"urn:li:dataHubIngestionSource:72dc6fb3-7573-4991-91b9-068969132b08\\\"}\", \"version\": \"0.10.5\", \"executorId\": \"__datahub_cli_\"}}", "contentType": "application/json"}}}' 'http://datahub-gms.as-in.io/aspects?action=ingestProposal'
[2023-08-04 09:26:40,243] DEBUG    {datahub.ingestion.run.pipeline:287} - Reporter type:datahub,<class 'datahub.ingestion.reporting.datahub_ingestion_run_summary_provider.DatahubIngestionRunSummaryProvider'> configured.
[2023-08-04 09:26:43,577] WARNING  {py.warnings:109} - /usr/local/lib/python3.8/dist-packages/ratelimiter.py:127: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead
  __aexit__ = asyncio.coroutine(__exit__)

[2023-08-04 09:26:43,694] WARNING  {py.warnings:109} - /usr/local/lib/python3.8/dist-packages/datahub/ingestion/source/bigquery_v2/bigquery_config.py:210: ConfigurationWarning: env is deprecated and will be removed in a future release. Please use platform_instance instead.
  super().__init__(**data)

[2023-08-04 09:26:43,696] WARNING  {datahub.ingestion.source.bigquery_v2.bigquery_config:268} - Please update `dataset_pattern` to match against fully qualified schema name `<project_id>.<dataset_name>` and set config `match_fully_qualified_names : True`.Current default `match_fully_qualified_names: False` is only to maintain backward compatibility. The config option `match_fully_qualified_names` will be deprecated in future and the default behavior will assume `match_fully_qualified_names: True`.
[2023-08-04 09:26:43,697] DEBUG    {datahub.telemetry.telemetry:309} - Sending telemetry for function-call
[2023-08-04 09:26:43,938] DEBUG    {datahub.entrypoints:196} - Error: Failed to configure the source (bigquery): 1 validation error for BigQueryV2Config
extract_column_lineage
  extra fields not permitted (type=value_error.extra)
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 120, in _add_init_error_context
    yield
  File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 226, in __init__
    self.source = source_class.create(
  File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 263, in create
    config = BigQueryV2Config.parse_obj(config_dict)
  File "pydantic/main.py", line 526, in pydantic.main.BaseModel.parse_obj
  File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/source/bigquery_v2/bigquery_config.py", line 210, in __init__
    super().__init__(**data)
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for BigQueryV2Config
extract_column_lineage
  extra fields not permitted (type=value_error.extra)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/datahub/entrypoints.py", line 186, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datahub/telemetry/telemetry.py", line 448, in wrapper
    raise e
  File "/usr/local/lib/python3.8/dist-packages/datahub/telemetry/telemetry.py", line 397, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
    return func(ctx, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/datahub/cli/ingest_cli.py", line 187, in run
    pipeline = Pipeline.create(
  File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 336, in create
    return cls(
  File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 230, in __init__
    logger.info("Source configured successfully.")
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/run/pipeline.py", line 122, in _add_init_error_context
    raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to configure the source (bigquery): 1 validation error for BigQueryV2Config
extract_column_lineage
  extra fields not permitted (type=value_error.extra)
Failed to configure the source (bigquery): 1 validation error for BigQueryV2Config
extract_column_lineage
  extra fields not permitted (type=value_error.extra)
[2023-08-04 09:26:43,941] DEBUG    {datahub.entrypoints:201} - DataHub CLI version: 0.10.5 at /usr/local/lib/python3.8/dist-packages/datahub/__init__.py
[2023-08-04 09:26:43,941] DEBUG    {datahub.entrypoints:204} - Python version: 3.8.0 (default, Dec  9 2021, 17:53:27)
[GCC 8.4.0] at /usr/bin/python3.8 on Linux-5.4.0-1103-gcp-x86_64-with-glibc2.27
[2023-08-04 09:26:43,941] DEBUG    {datahub.entrypoints:209} - GMS config {'models': {}, 'patchCapable': True, 'versions': {'linkedin/datahub': {'version': 'v0.10.5', 'commit': '4f9fc671dcea03cbb22b7c0e02e29bdf88ba955f'}}, 'managedIngestion': {'defaultCliVersion': '@cliMajorVersion@', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'timeZone': 'GMT', 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
```
Here is the log from my local machine, cc <@U02AF5P6QDS>.
I ran it in debug mode.

can you show the recipe as well?

Sure. I’ve masked the BigQuery credentials:

```
source:
    type: bigquery
    config:
        extract_column_lineage: false
        env: TEST
        rate_limit: true
        requests_per_min: 60
        include_table_lineage: true
        include_usage_statistics: true
        include_tables: true
        include_views: true
        lineage_use_sql_parser: true
        incremental_lineage: false
        max_query_duration: 1800
        profiling:
            enabled: true
            profile_table_level_only: false
        stateful_ingestion:
            enabled: true
        credential:
            project_id: ******
            private_key: ******
            private_key_id: ******
            client_email: ******
            client_id: '101426640770366287958'
        dataset_pattern:
            allow:
                - hods_utc
                - ods_utc
                  #- dw_utc_00
                  #- dwb_utc_00
                  #- dwd_utc_00
                  #- dws_utc_00
                  #- fdm_utc_00
                  #- hdb_utc_00
                  #- hds_utc_00
                  #- app_utc_00
                  #start_time: '2023-01-04T12:00:00Z'
pipeline_name: 'urn:li:dataHubIngestionSource:72dc6fb3-7573-4991-91b9-068969132b08'
```

mhm, for me it works. Can you try doing a `pip install --upgrade "acryl-datahub[bigquery]==0.10.5.*"`?

I think you are not on the latest client version.
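One way to rule out a stale or shadowed client, sketched here assuming a Unix-like shell:

```bash
# If pip reports 0.10.5 but the error persists, the `datahub` on PATH may
# come from a different Python environment; verify they agree:
which datahub
datahub version
python3 -m pip show acryl-datahub
```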

I gotta say, you’re a real lifesaver. <@U02AF5P6QDS> Thank you!!!
I realized that the problem was still with my local version of the CLI, so I upgraded acryl-datahub and acryl-datahub[bigquery] with pip --upgrade, leaving the version unspecified.
Guess what: the ingestion is now running, though not yet finished.
Let me see how BigQuery’s field lineage works.
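For the record, the unpinned upgrade described above would look roughly like this:

```bash
# Upgrade to the latest release; the bigquery extra pulls in the source plugin.
pip install --upgrade "acryl-datahub[bigquery]"
datahub version  # should now report a release that supports extract_column_lineage
```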