Troubleshooting 401 Unauthorized Error in DataHub v12 File-Based Lineage Ingest Jobs

Original Slack Thread

Hi, I’m using DataHub v12 and am having trouble getting my [file-based lineage](https://datahubproject.io/docs/generated/ingestion/sources/file-based-lineage/) ingest jobs working. They fail with a 401 Unauthorized error when trying to hit GMS. Is there a way to pass credentials for the job to use? I haven’t been able to find any documentation showing how to do this when using a Kafka sink.

Recipe:

```
datahub_api:
  server: "http://datahub-datahub-gms:8080"
source:
  type: datahub-lineage-file
  config:
    file: 'http://file-host.svc.cluster.local/lineage.yaml'
    preserve_upstream: true
sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: ...
      schema_registry_url: ...
```

Logs:

```
[2024-02-09 15:01:10,513] ERROR    {datahub.entrypoints:201} - Command failed: 401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/entitiesV2/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Asnowflake%2Cdata-warehouse.fivetran.prod_realtime_scriptdash_aurora_public.deliveries%2CPROD%29?aspects=List(upstreamLineage)
Traceback (most recent call last):
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
    ret = await ingestion_future
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
    raise e
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
    pipeline.run()
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 404, in run
    for wu in itertools.islice(
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 150, in auto_workunit_reporter
    for wu in stream:
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 70, in auto_status_aspect
    for wu in stream:
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/ingestion/source/metadata/lineage.py", line 166, in get_workunits_internal
    mcp = _get_lineage_mcp(entity_node, self.config.preserve_upstream)
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/ingestion/source/metadata/lineage.py", line 208, in _get_lineage_mcp
    old_upstream_lineage = get_aspects_for_entity(
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/cli/cli_utils.py", line 528, in get_aspects_for_entity
    entity_response = get_entity(
  File "/datahub-ingestion/.local/lib/python3.10/site-packages/datahub/cli/cli_utils.py", line 449, in get_entity
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: http://datahub-datahub-gms:8080/entitiesV2/urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Asnowflake%2Cdata-warehouse.fivetran.prod_realtime_scriptdash_aurora_public.deliveries%2CPROD%29?aspects=List(upstreamLineage)
```
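
One way to read the log: the traceback ends in `lineage.py` calling `get_aspects_for_entity` from `cli_utils.py`, so the 401 appears to come from the lineage source reading existing metadata from GMS, not from the Kafka sink. The sketch below uses only the Python standard library to decode the failing URL and show which entity and aspect were being read. The helper name `describe_failed_request` is made up for illustration.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# URL copied from the error log.
FAILED_URL = (
    "http://datahub-datahub-gms:8080/entitiesV2/"
    "urn%3Ali%3Adataset%3A%28urn%3Ali%3AdataPlatform%3Asnowflake%2C"
    "data-warehouse.fivetran.prod_realtime_scriptdash_aurora_public.deliveries"
    "%2CPROD%29?aspects=List(upstreamLineage)"
)


def describe_failed_request(url: str) -> dict:
    """Split a GMS entity URL into host, endpoint, decoded URN, and aspects."""
    parts = urlsplit(url)
    # The last path segment is the percent-encoded entity URN.
    encoded_urn = parts.path.rsplit("/", 1)[-1]
    return {
        "host": parts.netloc,
        "endpoint": parts.path.split("/")[1],
        "urn": unquote(encoded_urn),
        "aspects": parse_qs(parts.query)["aspects"][0],
    }


if __name__ == "__main__":
    for key, value in describe_failed_request(FAILED_URL).items():
        print(f"{key}: {value}")
```

The decoded request is a read of the `upstreamLineage` aspect for the Snowflake dataset, which fits the `preserve_upstream: true` setting in the recipe.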

I discovered that setting `preserve_upstream: false` in the recipe stops the requests to GMS, but I don’t want to hard-replace the upstream data for a given entity if I don’t have to.
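
A possible way to check whether a token would resolve the 401, independent of the ingestion job: rebuild the same request the log shows and send it with an `Authorization: Bearer` header, which is how GMS accepts personal access tokens when metadata service authentication is enabled. This is a minimal sketch using only the Python standard library. The function names `build_lineage_request` and `probe` are hypothetical, and it does not show how the v12 lineage source itself picks up credentials. The DataHub CLI documents the `DATAHUB_GMS_URL` and `DATAHUB_GMS_TOKEN` environment variables; whether this source honors them in this version is an assumption to verify.

```python
import urllib.error
import urllib.request
from typing import Optional
from urllib.parse import quote

# Dataset URN decoded from the failing URL in the log.
EXAMPLE_URN = (
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,"
    "data-warehouse.fivetran.prod_realtime_scriptdash_aurora_public.deliveries,PROD)"
)


def build_lineage_request(
    gms_url: str, urn: str, token: Optional[str] = None
) -> urllib.request.Request:
    """Build the upstreamLineage read seen in the log, optionally with a token."""
    # Percent-encode the whole URN, including ':' '(' ')' and ','.
    encoded_urn = quote(urn, safe="")
    url = f"{gms_url.rstrip('/')}/entitiesV2/{encoded_urn}?aspects=List(upstreamLineage)"
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


def probe(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code, including 4xx/5xx."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Running `probe(build_lineage_request("http://datahub-datahub-gms:8080", EXAMPLE_URN, token))` from inside the cluster, once with a valid token and once with `token=None`, should show whether authentication is the only thing standing between the job and GMS.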