Troubleshooting browsePathsV2 aspects not being filled in during ingestion after DataHub upgrade to v0.12.1

Original Slack Thread

Hi everyone. My team recently upgraded from DataHub v0.10.4 to v0.12.1 and we have been liking the new features so far. However, we are having a bit of trouble with the browsePathsV2 aspects: they were not backfilled for most of our data sources. Only the datasets from one of our custom connectors had their browsePathsV2 aspects filled in. I’m not entirely sure why that’s the case, as all of the upgrade jobs finished successfully, but I figured they’d get filled in the next time our scheduled ingestion runs executed.

However, after some waiting and additional testing it seems these aspects are not getting filled in. The ingestion pipeline finishes successfully without any errors or warnings, but no browsePathsV2 aspects are emitted. I’ve made sure the CLI version matches as well (v0.12.1). Here is the summary output from an example mssql run:

```
{'cli_version': '0.12.1.0',
 'cli_entry_location': '/usr/local/lib/python3.9/site-packages/datahub/__init__.py',
 'py_version': '3.9.17 (main, Jun 13 2023, 16:05:09) \n[GCC 8.3.0]',
 'py_exec_path': '/usr/local/bin/python',
 'os_details': 'Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.28',
 'peak_memory_usage': '102.59 MB',
 'mem_info': '102.59 MB',
 'peak_disk_usage': '21.12 GB',
 'disk_info': {'total': '321.97 GB', 'used': '21.12 GB', 'free': '300.85 GB'}}

Source (mssql) report:
{'events_produced': 276,
 'events_produced_per_sec': 69,
 'entities': {'container': ['<example container urns>',
                            '... sampled of 15 total elements'],
              'dataset': ['<example dataset urns>',
                          '... sampled of 64 total elements']},
 'aspects': {'container': {'containerProperties': 15, 'status': 15, 'dataPlatformInstance': 15, 'subTypes': 15, 'container': 14},
             'dataset': {'container': 64, 'status': 64, 'datasetProperties': 64, 'schemaMetadata': 64, 'subTypes': 64, 'viewProperties': 10}},
 'warnings': {},
 'failures': {},
 'soft_deleted_stale_entities': [],
 'tables_scanned': 54,
 'views_scanned': 10,
 'entities_profiled': 0,
 'filtered': [],
 'start_time': '2024-02-05 16:38:07.937584 (4 seconds ago)',
 'running_time': '4 seconds'}

Sink (datahub-kafka) report:
{'total_records_written': 276,
 'records_written_per_second': 63,
 'warnings': [],
 'failures': [],
 'start_time': '2024-02-05 16:38:07.560390 (4.38 seconds ago)',
 'current_time': '2024-02-05 16:38:11.937851 (now)',
 'total_duration_in_seconds': 4.38}
```
I'm not entirely sure why this is the case and will continue to update this thread as I research and debug more, but wanted to make a post in case anyone else has encountered a similar issue. Is it the expected behavior that browsePathsV2 aspects should be created when they don't exist for existing entities during ingestion? Is there anything I may be forgetting which would impact the behavior here? Appreciate any and all help. Thanks!
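
A quick way to confirm this symptom from a run summary is to scan the per-entity aspect counts in the source report. The sketch below assumes the `aspects` section of the report has been loaded as a Python dict shaped like the output above; the helper name `entity_types_missing_aspect` is hypothetical and not part of the DataHub CLI.

```python
from typing import Dict, List


def entity_types_missing_aspect(
    aspects: Dict[str, Dict[str, int]], aspect_name: str = "browsePathsV2"
) -> List[str]:
    """Return the entity types whose report shows no emissions of aspect_name."""
    return sorted(
        entity_type
        for entity_type, counts in aspects.items()
        if counts.get(aspect_name, 0) == 0
    )


# Aspect counts copied from the mssql source report in this thread.
report_aspects = {
    "container": {
        "containerProperties": 15,
        "status": 15,
        "dataPlatformInstance": 15,
        "subTypes": 15,
        "container": 14,
    },
    "dataset": {
        "container": 64,
        "status": 64,
        "datasetProperties": 64,
        "schemaMetadata": 64,
        "subTypes": 64,
        "viewProperties": 10,
    },
}

missing = entity_types_missing_aspect(report_aspects)
print(missing)  # ['container', 'dataset']
```

For the run above, both entity types are flagged, which matches the observation that no browsePathsV2 aspects were emitted.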

Also, out of curiosity, which job would have performed the browsePathsV2 backfill? I could try rerunning that, but wanted to investigate this strange behavior of the aspects not being created at ingestion time first.

Browse paths v2 can be generated from a couple of different places: GMS or the system-update job. Ultimately, the system-update job should perform this operation when a default value is found and then never touch those aspects again. This job runs on every helm upgrade/install. That said, I have had to force reprocessing of the browse paths on a few systems. I introduced an environment variable for this (see https://github.com/datahub-project/datahub/blob/a3ef587f54067598141afe3e584aa5742f817fc7/datahub-upgrade/src/main/java/com/linkedin/datahub/upgrade/system/entity/steps/BackfillBrowsePathsV2Step.java#L94); it hasn’t been needed yet, so it is not an option in helm. You can add it to the extraEnvs for the system-update job and run the job to force this reprocessing.

I should add that the migration from v1 to v2 is what I am referring to. After that migration, the ingestion code is responsible for populating these aspects, with some logic on the server side to produce a default aspect if one is not specified already (assuming v2 is enabled).

Thanks for the reply David! A few questions on this:

  1. Is a default browsePathV2 just a browsePathsV2 aspect with the only entry being the string id of “Default”?
    a. I’m assuming the situation in which you would need this is when some entities incorrectly had their browsePathsV2 aspects set to just id: "Default". This is a slightly different situation from the one I am in. In my case the entities don’t have browsePathsV2 aspects at all; they only have their original browsePaths aspects.
    b. I did try adding REPROCESS_DEFAULT_BROWSE_PATHS_V2 to the extraEnvs and running the system-update job, but nothing happened because the stage was skipped. However, I read through the file you linked and saw it required BACKFILL_BROWSE_PATHS_V2. Setting that instead caused the job to rerun correctly and the browsePathsV2 to be filled in.
  2. I’m still seeing some inconsistencies in how browsePathsV2 aspects are created for newly ingested entities.
    a. When I run a fresh install of datahub locally via docker on my laptop everything works as expected. I ingest a schema, it creates the containers and datasets, and in the Source report section of the terminal output I can see that a number of browsePathsV2 aspects have been created (twice as many as all the other aspects, is this relevant?). Looking in the backend and UI confirms that the browsePathsV2 aspects have been created correctly and reference the container hierarchy properly.
    b. In my Helm deployed instance of datahub on k8s, I re-ingested an existing schema after adding a new table to it. The existing tables had already been backfilled and showed correctly in the UI. After re-ingestion completed the new table was created, but the source report did not list browsePathsV2 among the aspects processed. When I check the backend database, I can see that it does have a browsePathsV2 aspect, but it just used the dataset key to fill in the id field without listing the container urns. This causes the browse tree to show duplicate entries (see the attached picture).
    I’m trying to figure out what could be different in these scenarios and why browsePathsV2 aspects are not being emitted for my k8s helm chart deployment. Any help in better understanding exactly how these aspects are created and under what scenarios would help make this easier to diagnose. Also any ideas on external factors that could be having an impact. I am using datahub version 0.12.1 and CLI version 0.12.1 in all cases.
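
To tell a container-based path apart from the key-based fallback described in 2b, it can help to inspect the stored aspect directly. The sketch below assumes the browsePathsV2 aspect has been fetched as JSON with a `path` list whose entries have an `id` and, for container-backed entries, a `urn`. The helper name and the sample values are hypothetical.

```python
from typing import Any, Dict


def references_containers(browse_paths_v2: Dict[str, Any]) -> bool:
    """Return True if at least one path entry points at a container urn."""
    entries = browse_paths_v2.get("path", [])
    return any(
        str(entry.get("urn") or "").startswith("urn:li:container:")
        for entry in entries
    )


# Hypothetical payloads: one built from the container hierarchy,
# one built only from pieces of the dataset key.
container_based = {
    "path": [
        {"id": "urn:li:container:aaa", "urn": "urn:li:container:aaa"},
        {"id": "urn:li:container:bbb", "urn": "urn:li:container:bbb"},
    ]
}
key_based = {"path": [{"id": "my_db"}, {"id": "my_schema"}]}

print(references_containers(container_based))  # True
print(references_containers(key_based))        # False
```

Entities for which this returns False are candidates for the duplicate browse-tree entries described above.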

I saw that DefaultAspectsUtil.buildDefaultBrowsePaths() (https://github.com/datahub-project/datahub/blob/20b9050732f6a78225c70dc20eaade82e07859a9/metadata-io/src/main/java/com/linkedin/metadata/aspect/utils/DefaultAspectsUtil.java#L229) has a required parameter, useContainerPaths, which determines whether it pulls the container urns, when applicable, to create the aspect or just splits the entity key. When called from the backfill job (datahub-upgrade/src/main/java/com/linkedin/datahub/upgrade/system/entity/steps/BackfillBrowsePathsV2Step.java at commit 159a013b0515f8a94b88d62e4ad20aad228fac9d), this value is explicitly set to true. However, when called from generateDefaultAspectsIfMissing() (metadata-io/src/main/java/com/linkedin/metadata/aspect/utils/DefaultAspectsUtil.java at commit 20b9050732f6a78225c70dc20eaade82e07859a9), this value is explicitly set to false. I’m curious: in what cases would you want this set to false?

My theory is that for some reason the ingestion CLI is not emitting browsePathsV2 aspects in all scenarios. When they are not emitted, I assume this generateDefaultAspectsIfMissing() function is being called which is hard coded to just build the browse path from the entity key even though it has a container aspect.
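
The difference between the two call sites can be illustrated with a simplified Python model. This is not the Java implementation: the function name, the sample urns, and the rule of splitting the dataset name on "." and dropping the last segment are all illustrative assumptions. Only the role of the useContainerPaths flag is taken from the description above.

```python
from typing import Dict, List, Optional

PathEntry = Dict[str, Optional[str]]


def build_default_browse_path(
    dataset_name: str,
    container_urns: List[str],
    use_container_paths: bool,
) -> List[PathEntry]:
    """Simplified model of a default browsePathsV2 path for a dataset."""
    if use_container_paths and container_urns:
        # Backfill-style behavior: one entry per container in the hierarchy.
        return [{"id": urn, "urn": urn} for urn in container_urns]
    # Fallback behavior (assumed): derive ids from the key, with no urns.
    parts = dataset_name.split(".")[:-1]
    return [{"id": part, "urn": None} for part in parts]


# Hypothetical dataset and container urns.
name = "mydb.dbo.orders"
containers = ["urn:li:container:db", "urn:li:container:schema"]

backfill_path = build_default_browse_path(name, containers, use_container_paths=True)
fallback_path = build_default_browse_path(name, containers, use_container_paths=False)

print([entry["id"] for entry in backfill_path])
# ['urn:li:container:db', 'urn:li:container:schema']
print([entry["id"] for entry in fallback_path])
# ['mydb', 'dbo']
```

Under this model, an entity that gets its aspect through the fallback branch ends up with plain string ids that do not line up with the container nodes in the browse tree, which is consistent with the duplicate entries seen in the UI.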

Sorry for the long post, but wanted to give as much context as possible. Any help or extra insights are appreciated!

Hey Ryan! We do support backfilling data when you upgrade, and that requires setting the BACKFILL_BROWSE_PATHS_V2 env variable on your GMS pod to true. It defaults to false since it does a lot of work looping over data to backfill this. However, with 1.6k datasets you should be more than fine to do this.
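
The flag behavior reported in this thread (setting only REPROCESS_DEFAULT_BROWSE_PATHS_V2 left the step skipped, while setting BACKFILL_BROWSE_PATHS_V2 made it run) can be modeled roughly as follows. This is a sketch of the observed behavior, not the actual upgrade-step code; the function names and the returned labels are hypothetical, and how the two flags combine when both are set is an assumption.

```python
from typing import Mapping


def _flag(env: Mapping[str, str], name: str) -> bool:
    """Treat a variable as enabled only when it is set to 'true'; unset means false."""
    return env.get(name, "false").strip().lower() == "true"


def backfill_step_plan(env: Mapping[str, str]) -> str:
    """Describe what the backfill step would do for a given environment."""
    if not _flag(env, "BACKFILL_BROWSE_PATHS_V2"):
        # Observed in the thread: without this flag the step is skipped,
        # regardless of the reprocess flag.
        return "skip"
    if _flag(env, "REPROCESS_DEFAULT_BROWSE_PATHS_V2"):
        # Assumed combination: backfill and also revisit default paths.
        return "backfill and reprocess defaults"
    return "backfill"


print(backfill_step_plan({"REPROCESS_DEFAULT_BROWSE_PATHS_V2": "true"}))  # skip
print(backfill_step_plan({"BACKFILL_BROWSE_PATHS_V2": "true"}))           # backfill
```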

Just wanted to follow up here to close this thread out. The issue ended up being a patch that had been installed at some point in the container where the ingestion process runs. I worked with some people on my end to update that, and now everything is running smoothly. Thank you to everyone who chipped in with ideas and explained the end-to-end process for how the browsePathsV2 aspects are created. I’m excited to start working with some of the new features!