This repository has been archived by the owner on Nov 23, 2023. It is now read-only.

fix: Prevent metadata duplication #1915

Merged
merged 8 commits into master from fix/metadata-duplication on Sep 12, 2022

Conversation

Contributor

@Jimlinz Jimlinz commented Aug 4, 2022

PySTAC's normalize_hrefs resolves all links and mutates the entire tree, causing <item>.json and collection.json for every dataset in the catalog to be rewritten. save_object saves only the particular entity within the catalog, preventing unnecessary metadata duplication.
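A minimal pure-Python sketch of the difference described above (this is a toy model, not the real PySTAC API; all names here are illustrative):

```python
# Toy catalog tree illustrating why a tree-wide normalize rewrites every
# file while a targeted save touches only one entity.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    href: str = ""
    children: list["Node"] = field(default_factory=list)


def normalize_hrefs(node: Node, prefix: str, touched: list[str]) -> None:
    """Like pystac's normalize_hrefs: visits and rewrites EVERY node."""
    node.href = f"{prefix}/{node.name}.json"
    touched.append(node.name)
    for child in node.children:
        normalize_hrefs(child, f"{prefix}/{node.name}", touched)


def save_object(node: Node, touched: list[str]) -> None:
    """Like save_object: writes only this one entity."""
    touched.append(node.name)


root = Node("catalog", children=[Node("dataset-a"), Node("dataset-b")])

all_touched: list[str] = []
normalize_hrefs(root, "s3://bucket", all_touched)
# every dataset's metadata is rewritten, even unrelated siblings

one_touched: list[str] = []
save_object(root.children[0], one_touched)
# only the entity being added is written
```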

Reference

stac-utils/pystac#284
stac-utils/pystac#294
stac-utils/pystac#90

@Jimlinz Jimlinz requested review from l0b0 and billgeo August 4, 2022 23:10
@Jimlinz Jimlinz linked an issue Aug 4, 2022 that may be closed by this pull request
11 tasks
@@ -95,8 +95,4 @@ def handle_message(metadata_key: str) -> None:
if root_catalog.get_child(dataset_metadata.id) is None:
root_catalog.add_child(child=dataset_metadata, strategy=GeostoreSTACLayoutStrategy())

root_catalog.normalize_hrefs(
Contributor

@billgeo billgeo Aug 4, 2022


We do want to normalize hrefs across the Geostore (hopefully we have a test to enforce that). Have you checked whether we have code elsewhere for that, and/or a test for this?

Contributor

@billgeo billgeo Aug 4, 2022


We definitely don't need to normalize every href every time we add a dataset, though; we only need to do catalog.json and the STAC files in the child/dataset that is being added. We can ignore all the sibling datasets' STAC files.
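A hedged sketch of the selective update described above; the file layout and names here are illustrative, not the real Geostore layout:

```python
# Compute which STAC files actually need rewriting when one child dataset is
# added: the root catalog plus the new child's own files, never the siblings.
def files_to_rewrite(new_child: str, child_items: dict[str, list[str]]) -> set[str]:
    rewrite = {"catalog.json"}                # root catalog gains a child link
    rewrite.add(f"{new_child}/catalog.json")  # the new child itself
    rewrite.update(f"{new_child}/{item}" for item in child_items[new_child])
    return rewrite


tree = {
    "dataset-a": ["item-1.json"],
    "dataset-b": ["item-2.json", "item-3.json"],
}
to_rewrite = files_to_rewrite("dataset-b", tree)
# dataset-a is a sibling of the new child, so none of its files are touched
```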

Contributor Author


This was addressed in stac-utils/pystac#294 (if I read the PR correctly). Nevertheless, tests are failing, which means more work is needed to figure out how to fix this issue.

@Jimlinz Jimlinz marked this pull request as draft August 5, 2022 01:11
Contributor Author

Jimlinz commented Aug 5, 2022

Changing the PR to draft since tests are failing. save_object appears to be a suggested workaround for #1818; however, it is causing tests to fail. More work is needed to figure out why the tests are failing and whether there is a better solution for the problem above.

@Jimlinz Jimlinz force-pushed the fix/metadata-duplication branch 2 times, most recently from 0300e6e to 82b591b Compare August 5, 2022 06:22
Contributor

l0b0 commented Aug 9, 2022

Looking through the test failures:

  • should_update_root_catalog_with_new_version_catalog [dataset version links] fails because an optional title is missing. Possible solutions:
    • Ignore the irrelevant field in the test. This is probably the simplest, but means the test might have to change again if PySTAC changes.
    • Change the test to only check the URLs rather than the whole link array. More robust, but also more work and less coverage.
  • should_update_root_catalog_with_new_version_collection [item links] fails for two separate reasons:
    • Missing title, as above.
    • Wrong root link. This is an actual issue, and needs fixing.
  • should_successfully_run_dataset_version_creation_process_and_again_with_partial_upload fails because of a missing/wrong root link (expected_link_object is never in (load(smart_open.open(f"{storage_bucket_prefix}{dataset_title}/{item_metadata_filename}", mode="rb")))[STAC_LINKS_KEY]). Also needs fixing.

In summary, to make this work ASAP we could:

  1. Modify tests to remove the title from the relevant expected links entries.
  2. Change the code which writes metadata files (geostore.import_metadata_file.task.importer) to modify the root URL as it's currently specified in the tests.
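Option 1 above can be sketched as plain dictionary filtering; the link dicts here are illustrative, not taken from the real test fixtures:

```python
# Strip the optional "title" key from the expected link entries so the
# comparison no longer depends on whether PySTAC emits it.
expected_links = [
    {"rel": "root", "href": "../catalog.json", "title": "Geostore"},
    {"rel": "parent", "href": "./collection.json", "title": "My dataset"},
]

expected_links_without_title = [
    {key: value for key, value in link.items() if key != "title"}
    for link in expected_links
]
```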

@Jimlinz Jimlinz force-pushed the fix/metadata-duplication branch 5 times, most recently from 09fb141 to 3b3a279 Compare August 10, 2022 08:11
assert catalog_json[STAC_LINKS_KEY] == expected_root_catalog_links
updated_catalog_json = []
for i in catalog_json[STAC_LINKS_KEY]:
    del i["title"]
Contributor


Rather than deleting the title from the actual state you could simply remove the non-existent title from the expected state. That should result in less code overall, and very slightly more coverage.

Contributor Author

@Jimlinz Jimlinz Aug 10, 2022


Do you mean something like bb4fd62?
That was my initial attempt, but it didn't work. Maybe I missed something?

Contributor


Something like bb4fd62, except some of the titles should still be there.

root_catalog.normalize_hrefs(
f"{S3_URL_PREFIX}{Resource.STORAGE_BUCKET_NAME.resource_name}",
strategy=GeostoreSTACLayoutStrategy(),
)
Contributor


Isn't there some missing code to actually replace the root href?

>               assert item_json[STAC_LINKS_KEY] == expected_item_links
E               AssertionError: assert [{'href': './CCRrWAn9FPPOyNFd1Q6I.json',\n  'rel': 'root',\n  'type': 'application/json'},\n {'href': './CCRrWAn9FPPOyNFd1Q6I.json',\n  'rel': 'parent',\n  'type': 'application/json'}] == [{'href': '../catalog.json', 'rel': 'root', 'type': 'application/json'},\n {'href': './CCRrWAn9FPPOyNFd1Q6I.json',\n  'rel': 'parent',\n  'type': 'application/json'}]
E                 At index 0 diff: {'rel': 'root', 'href': './CCRrWAn9FPPOyNFd1Q6I.json', 'type': 'application/json'} != {'rel': 'root', 'href': '../catalog.json', 'type': 'application/json'}
E                 Full diff:
E                   [
E                 -  {'href': '../catalog.json',
E                 +  {'href': './CCRrWAn9FPPOyNFd1Q6I.json',
E                     'rel': 'root',
E                     'type': 'application/json'},
E                    {'href': './CCRrWAn9FPPOyNFd1Q6I.json',
E                     'rel': 'parent',
E                     'type': 'application/json'},
E                   ]

Contributor Author

@Jimlinz Jimlinz Aug 22, 2022


Took a while for me to trace this down. There are two options to tackle this bug: root_catalog.save_object() or root_catalog.save(catalog_type=CatalogType.SELF_CONTAINED) (see populate_catalog/task.py).

The first option, save_object, is the workaround suggested by PySTAC (see stac-utils/pystac#284). This would be the ideal implementation, since none of the metadata gets duplicated during dataset version creation (at least in my own tests). However, from what I can tell, this option prevents importer from being called in import_metadata_file/task.py, which is why you are seeing the error message above (something I tried in an early commit). Exactly why this happens is still unclear to me; I have tried tracing it down, but without success so far. Perhaps you have a better understanding of how Geostore uses PySTAC and can shed some light on this? I will continue to explore and will update if I work it out.

The second option, root_catalog.save(catalog_type=CatalogType.SELF_CONTAINED), is the one I went with in this PR. It appears to be only a partial fix: it prevents item.json from being rewritten with every new dataset version creation, but it doesn't prevent collection.json from being rewritten every time a dataset is added or updated.

@Jimlinz Jimlinz force-pushed the fix/metadata-duplication branch 7 times, most recently from 9b50390 to 26b008f Compare August 14, 2022 23:17
@Jimlinz Jimlinz marked this pull request as ready for review August 15, 2022 05:11
@Jimlinz Jimlinz marked this pull request as draft August 15, 2022 20:10
@@ -16,6 +16,7 @@
from ..types import JsonObject

S3_BODY_KEY = "Body"
CATALOG_FILENAME = "catalog.json"
Contributor

@l0b0 l0b0 Aug 16, 2022


This constant already exists in geostore/populate_catalog/task.py, so it should probably be moved to a file in the geostore directory (so that it's shared between subdirectories at deployment time) and imported.

Contributor Author


Have changed this to reference geostore/populate_catalog/task.py (for now). Good idea to move it to another file, but a quick look shows it is referenced in quite a few places, so I might move that change to another PR.

Contributor Author


Spoke too soon. The value needs to be accessible from the Lambda function. Changing this back to ../catalog.json for now; I will create another PR to address the above.

geostore/populate_catalog/task.py (resolved)
tests/test_processing_stack.py (outdated, resolved)
@Jimlinz Jimlinz force-pushed the fix/metadata-duplication branch 2 times, most recently from 9340732 to a86b121 Compare August 18, 2022 00:47
PySTAC's normalize_hrefs resolves all links and mutates the entire tree, causing all .json metadata to update. Removing normalize_hrefs fixes this particular issue. GeostoreSTACLayoutStrategy's get_item_href is not called after the removal; however, it may become relevant again as the codebase grows. Excluding this function from coverage for now.
@Jimlinz Jimlinz force-pushed the fix/metadata-duplication branch 3 times, most recently from 1ceaf00 to 3a98ed8 Compare August 19, 2022 05:54
A missing/wrong root link was causing tests to fail after the removal of normalize_hrefs. This commit modifies the root URL as it's currently specified in the tests.
@Jimlinz Jimlinz marked this pull request as ready for review August 19, 2022 07:30
Test values need to match expected values after the removal of PySTAC's normalize_hrefs.
Removing normalize_hrefs requires the root href to be updated to "../catalog.json". An update_root_link function was added to geostore/import_metadata_file/task.py. To ensure full coverage, this test covers both root and non-root href keys.
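An assumed shape of the update_root_link fix described in the commit message above (the real function lives in geostore/import_metadata_file/task.py and may differ):

```python
# Point every rel="root" link at "../catalog.json", leaving all other links
# (e.g. rel="parent") untouched. Constant names mirror those in the PR.
STAC_LINKS_KEY = "links"
CATALOG_FILENAME = "catalog.json"


def update_root_link(metadata: dict) -> dict:
    for link in metadata.get(STAC_LINKS_KEY, []):
        if link.get("rel") == "root":
            link["href"] = f"../{CATALOG_FILENAME}"
    return metadata


# The item metadata from the failing assertion above, before the fix:
item = {
    "links": [
        {"rel": "root", "href": "./CCRrWAn9FPPOyNFd1Q6I.json", "type": "application/json"},
        {"rel": "parent", "href": "./CCRrWAn9FPPOyNFd1Q6I.json", "type": "application/json"},
    ]
}
update_root_link(item)
```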
The get_item_href function isn't called after the removal of normalize_hrefs; however, it is needed to override all the abstract methods of the parent class before the class can be instantiated. raise NotImplementedError is added to ensure the pragma: no cover marker gets removed if this method is called again in the future as the codebase expands.
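The pattern in that commit message can be sketched as follows; the base class here is a stand-in for pystac.layout.HrefLayoutStrategy, and the method body is illustrative rather than the real Geostore code:

```python
# An abstract base class forces subclasses to implement get_item_href even
# when it is currently unreachable; raising NotImplementedError makes any
# future call fail loudly instead of silently returning a wrong href.
from abc import ABC, abstractmethod
from typing import Any


class HrefLayoutStrategy(ABC):  # stand-in for pystac.layout.HrefLayoutStrategy
    @abstractmethod
    def get_item_href(self, item: Any, parent_dir: str) -> str: ...


class GeostoreSTACLayoutStrategy(HrefLayoutStrategy):
    def get_item_href(self, item: Any, parent_dir: str) -> str:
        # Not called since normalize_hrefs was removed; fail loudly if that
        # changes so the coverage exclusion gets revisited.
        raise NotImplementedError
```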
l0b0 previously approved these changes Aug 23, 2022
@Jimlinz Jimlinz marked this pull request as draft August 29, 2022 06:16
@kodiakhq kodiakhq bot merged commit fc3c3e0 into master Sep 12, 2022
@kodiakhq kodiakhq bot deleted the fix/metadata-duplication branch September 12, 2022 00:09
@billgeo billgeo mentioned this pull request Sep 14, 2022
34 tasks
Development

Successfully merging this pull request may close these issues.

Bug: New version of every metadata file when a new dataset is added
4 participants