Investigating Duplicate Containers in the DataHub Navigation Bar

Original Slack Thread

<@U05NDLFDWPK> I have a possibly similar issue with Presto.
For Presto we have only one upper-level hive container, but for some reason Datahub displays two of them - and the second one (the duplicate) contains only two tables in one schema, while the first one contains all the rest. This separation only exists in the navigation bar; if I open it, it becomes one proper container.
What I’ve found is that in the URL and the filter, the duplicate hive has a unit separator in its name for some reason. See the screenshot: it shows as ␟hive instead of hive - probably that’s why it displays separately, but I have no idea where that unit separator came from.
I tried hard deleting those two tables in the duplicate hive and re-ingesting them - but nothing changed, it keeps displaying a duplicate.

yeah, when I click on the duplicate hive I can see this filter in the network request in my browser’s dev tools - the ␟hive is there:

```
[
  {
    "and": [
      {
        "field": "browsePathV2",
        "condition": "EQUAL",
        "values": [
          "␟hive"
        ],
        "negated": false
      },
      {
        "field": "platform",
        "condition": "EQUAL",
        "values": [
          "urn:li:dataPlatform:presto"
        ],
        "negated": false
      },
      {
        "field": "_entityType",
        "values": [
          "DATASET"
        ]
      }
    ]
  }
]
```

while for the proper hive container it just requests the container by its URN - no such filters

hey Nadia! I know we talked in office hours today - let us know if updating your ingestion source solves your problem here

<@U03BEML16LB> yes, setting the CLI to 0.12.1.1 fixed the problem, thank you!
Also, I remember you explaining the ␟ symbol yesterday in the Zoom chat, but I didn’t get to read your answer fully before leaving the call, and apparently chat history doesn’t get saved if you leave the call and then join it again. If it’s not too much trouble, could you please explain it again?

Same issue with postgres

(screenshot attachment)

I’m running into the same issue with the Clickhouse source. As a short-term fix, I ran the restore-indices job (because the data is intact in the database). However, it will break again with the next ingestion run.

<@U03BEML16LB> since I’m not using the CLI but UI-based ingestion, is there a way to specify the version in the recipe?

<@U05ED3WJ21Y> I’m also using UI ingestion and you can specify it there - on the last step, open the Advanced part and it’s there:

Ah, nice! Thank you :thanks: I knew it was somewhere but I couldn’t find it :see_no_evil: I’ll try it out

yeah, it’s actually pretty useful and helped me with several bugs along the way of using Datahub:)

Worked like a charm :slightly_smiling_face: Thanks again! I now just need to remind myself to un-pin this at some point in the future :sweat_smile:
Speaking of this: do you happen to know how the executor selects the CLI version if none is specified? The most recent package on PyPI is already 0.12.1.1, but it was definitely not used in my recent runs before pinning :thinking_face:

Usually it chooses the CLI according to the Datahub version, I think - I’m still on Datahub 10.5, so by default it used CLI 10.5. Though sometimes even if you upgrade Datahub the CLI does not change accordingly, it’s somewhat of a bug :woman-shrugging: But you can always pin it manually. You can see which version is used in the log file of each ingestion run and change it if needed

awesome, i’m glad upgrading the CLI worked out for you folks! :slightly_smiling_face: yeah, there was a funky race condition in ingestion prior to the fix in the release you upgraded to, where the browsePathsV2 aspect would not always get the full container properly but would just use the name (instead of the urn reference, which is desired)
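For context, here is a rough sketch of the two shapes of that aspect, written with the `acryl-datahub` Python SDK aspect classes (class names as in recent SDK versions; the connectors’ actual emission code may differ):

```
from datahub.metadata.schema_classes import (
    BrowsePathEntryClass,
    BrowsePathsV2Class,
)

container_urn = "urn:li:container:07522b910626b933699819e95664e72b"

# Desired shape: each browse level carries the container URN, so the sidebar
# can resolve the folder to the real container entity.
fixed_aspect = BrowsePathsV2Class(
    path=[BrowsePathEntryClass(id=container_urn, urn=container_urn)]
)

# Shape the pre-fix race condition could produce: only the raw name is stored,
# which the sidebar then shows as a separate, unresolvable "duplicate" folder.
buggy_aspect = BrowsePathsV2Class(
    path=[BrowsePathEntryClass(id="hive")]
)
```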

and then Nadia - the ␟ is used in the browsePathsV2 aspect to separate each level of the browse path, and we chose this symbol as a delimiter to ensure it won’t clash with anything in the path. we want to store the path as a string so we can easily search and filter against anything in the path for an entity (i.e. when you click on a folder in your browse sidebar)
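To make the delimiter point concrete, here is a minimal sketch (an illustration only, not DataHub’s actual indexing code) of how joining browse levels with the ␟ glyph behaves, and why a raw “hive” level shows up as ␟hive in the filter above:

```
# "␟" is U+241F (SYMBOL FOR UNIT SEPARATOR) - the glyph visible in the
# browsePathV2 filter values in the network request above.
UNIT_SEP = "\u241f"

def to_browse_path_field(levels: list[str]) -> str:
    """Join browse-path levels into one searchable string, e.g. ['hive'] -> '␟hive'."""
    return "".join(UNIT_SEP + level for level in levels)

def split_browse_path_field(value: str) -> list[str]:
    """Split the indexed string back into its levels."""
    return [level for level in value.split(UNIT_SEP) if level]

print(to_browse_path_field(["hive"]))         # ␟hive
print(split_browse_path_field("\u241fhive"))  # ['hive']
print(to_browse_path_field(["urn:li:container:07522b910626b933699819e95664e72b"]))
```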

> Usually it chooses the CLI according to the Datahub version, I think - I’m still on Datahub 10.5, so by default it used CLI 10.5. Though sometimes even if you upgrade Datahub the CLI does not change accordingly, it’s somewhat of a bug :woman-shrugging:
<@U03LYB2ESJ0> I took some time to trace it down and it is determined by the GMS variable UI_INGESTION_DEFAULT_CLI_VERSION, which is set in the acryldata/datahub-helm chart (charts/datahub/values.yaml) at:

```
  datahub:
    managed_ingestion:
      defaultCliVersion: "0.12.0"
```

I tried re-ingesting data (with the latest CLI version, I double checked it), and for me duplication is still happening with BigQuery.


CLI report:
```
{'cli_version': '0.12.1.1',
...
```

It actually creates 2 containers - one with some hex UID and another, human-readable one:

    "data": {
        "browseV2": {
            "groups": [
                {
                    "name": "urn:li:container:07522b910626b933699819e95664e72b",
                    "count": 514,
                    "hasSubGroups": true,
                    "entity": {
                        "urn": "urn:li:container:07522b910626b933699819e95664e72b",
                        "type": "CONTAINER",
                        "properties": {
                            "name": "container-name",
                            "__typename": "ContainerProperties"
                        },
                        "__typename": "Container"
                    },
                    "__typename": "BrowseResultGroupV2"
                },
                {
                    "name": "container-name",
                    "count": 419,
                    "hasSubGroups": true,
                    "entity": null,
                    "__typename": "BrowseResultGroupV2"
                }
            ],
            "start": 0,
            "count": 20,
            "total": 2,
            "metadata": {
                "path": [],
                "totalNumEntities": 933,
                "__typename": "BrowseResultMetadata"
            },
            "__typename": "BrowseResultsV2"
        }
    },
    "extensions": {}
}
```
```
❯ datahub get --urn urn:li:container:07522b910626b933699819e95664e72b
{
  "browsePaths": {
    "paths": [
      ""
    ]
  },
  "browsePathsV2": {
    "path": [
      {
        "id": "Default"
      }
    ]
  },
  "containerKey": {
    "guid": "07522b910626b933699819e95664e72b"
  },
  "containerProperties": {
    "customProperties": {
      "env": "PROD",
      "platform": "bigquery",
      "project_id": "container-name"
    },
    "name": "container-name"
  },
  "dataPlatformInstance": {
    "platform": "urn:li:dataPlatform:bigquery"
  },
  "status": {
    "removed": false
  },
  "subTypes": {
    "typeNames": [
      "Project"
    ]
  }
}
```
```
❯ datahub get --urn urn:li:container:container-name
{
  "containerKey": {
    "guid": "container-name"
  }
}
```
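As an aside on why the two URNs above look so different: the healthy container’s id is a hash computed from its container key, while the stray duplicate literally carries the project name as its “guid”. A minimal illustration of the idea (an MD5 over the key fields; not necessarily DataHub’s exact key serialization):

```
import hashlib
import json

def container_guid(key_fields: dict) -> str:
    """Hash the container key fields into a stable hex id (illustrative only)."""
    payload = json.dumps(key_fields, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

# A properly keyed BigQuery project container gets a hex guid like
# 07522b910626b933699819e95664e72b ...
print("urn:li:container:" + container_guid(
    {"platform": "bigquery", "project_id": "container-name"}
))

# ... whereas the stray duplicate above was created with the raw name as its id:
# urn:li:container:container-name
```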

However, after purging everything and ingesting again - everything looks fine, no more duplicated containers in the navigation sidebar.

Why does the CLI version still appear as 0.12.1.0 in the log when I specify it as 0.12.1.1?

(screenshot attachment)