
Issues with manifests after multiple slash update fix (#162) #178

Closed
jordanpadams opened this issue Sep 3, 2024 · 8 comments · Fixed by #184

@jordanpadams
Member

jordanpadams commented Sep 3, 2024

Checked for duplicates

No - I haven't checked

🐛 Describe the bug

pds-deep-registry-archive -s PDS_RNG urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.0

in cassini_uvis_solarocc_beckerjarmak2023_v1.0_20240903_checksum_manifest_v1.0.tab, lines like this:

5c0fa44ac664f34d840b1d67a7777a0a        uvis_euv_2006_257_solar_time_series_ingress.tab

should really be:

5c0fa44ac664f34d840b1d67a7777a0a        data/uvis_euv_2006_257_solar_time_series_ingress.tab

another example:

pds-deep-registry-archive -s PDS_RNG urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1

in cassini_uvis_solarocc_beckerjarmak2023_v1.1_20240903_checksum_manifest_v1.0.tab, lines like this:

886004291721b477a887eb01c7313e23        ata/uvis_euv_2006_257_solar_time_series_ingress.xml

should really be:

886004291721b477a887eb01c7313e23        data/uvis_euv_2006_257_solar_time_series_ingress.xml

🕵️ Expected behavior

The generated manifest paths to be correct

📜 To Reproduce

See above

🖥 Environment Info

macOS

📚 Version of Software Used

Latest -dev version as of today

🩺 Test Data / Additional context

No response

🦄 Related requirements

🦄 #162

@jordanpadams
Member Author

jordanpadams commented Sep 3, 2024

@nutjob4life while you are waiting on the devops blocker, here is a bug I found. this is still backseat to NASA-PDS/devops#69 but next up.

@nutjob4life
Member

@jordanpadams superb, thanks! Will work on this

@nutjob4life
Member

nutjob4life commented Sep 5, 2024

On hold while https://pds.nasa.gov/api/search/1.0/ is reporting HTTP 502

Update from Slack: the new URL is https://pds.nasa.gov/api/search/1/

However the server is returning 501 when querying for urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1:

Update: merging this PR fixes the invocation

@nutjob4life
Member

nutjob4life commented Sep 5, 2024

Okay, this isn't actually a problem with the Deep Archive so much as with the data being "bad".

For urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.0, the API returns this:

$ curl --location --silent 'https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.0/members' | json_pp | egrep //data
            "label_url" : "https://pds-rings.seti.org/pds4/bundles/cassini_uvis_solarocc_beckerjarmak2023//data/collection_data.xml",
               "https://pds-rings.seti.org/pds4/bundles/cassini_uvis_solarocc_beckerjarmak2023//data/collection_data.csv"
               "https://pds-rings.seti.org/pds4/bundles/cassini_uvis_solarocc_beckerjarmak2023//data/collection_data.xml"

These URLs have //data with an extra slash.

For urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1, the API gives:

$ curl --location --silent 'https://pds.nasa.gov/api/search/1/products/urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1/members' | json_pp | less
…
            "ops:Data_File_Info.ops:file_ref" : [
               "https://pds-rings.seti.org/pds4/bundles/cassini_uvis_solarocc_beckerjarmak2023_v1.0/xml_schema/collection_xml_schema.csv"
            ],
…

This file_ref URL is missing /data.

I can have Deep Archive work around the double-slash in //data, but I don't think it should insert /data/ into file refs where they're missing. Let me know what you think.

@jordanpadams
Member Author

jordanpadams commented Sep 5, 2024

@nutjob4life the /data isn't the problem I was noting; it just happens to highlight the issue, which shows up in 2 different scenarios:

Scenario 1: Contains >1 slash

  • urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.0
  • Looks like it is arbitrarily removing additional parts of the path (e.g. data/ is removed)

Scenario 2: Does Not Contain extra slashes

  • urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.1
  • Looks like it is arbitrarily removing additional characters from the path (e.g. ata/ when it should be data/)

Agreed deep-archive should not care about what the actual path is, just about encountering >1 slash and replacing that with 1 slash
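In code, the replacement I have in mind is something like the following (a minimal illustration, not the Deep Archive's actual code; the collapse_slashes name is made up): collapse any run of slashes in the URL's path component to a single slash, leaving the scheme's // untouched.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def collapse_slashes(url: str) -> str:
    """Collapse runs of "/" in a URL's path to a single "/".

    Only the path component is touched, so the "//" after the
    scheme (https://...) is preserved.
    """
    parts = urlsplit(url)
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

For example, the //data URL from the API above would come out with a single slash, while already-clean URLs pass through unchanged.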

@tloubrieu-jpl
Member

Should be done by the end of the week.

@nutjob4life
Member

@jordanpadams thanks for spelling it out.

Extra slashes in the bundle URL are easy enough to fix and affect just one line of code.

However, upon digging into the Deep Archive, I see a deeper, more concerning problem.

Slicing and Dicing URLs

In the past, Deep Archive would take everything up to the right-most / in the bundle URL and assume that's the prefix for all other URLs referred to by the bundle. For example, suppose the bundle URL was

http://astro.com/pds4/balmy/bundle.xml

Then the prefix http://astro.com/pds4/balmy could be safely stripped from the URLs the bundle contained, leaving the relative paths that go into the checksum manifest:

  • http://astro.com/pds4/balmy/README.txt → README.txt
  • http://astro.com/pds4/balmy/data/collection.csv → data/collection.csv
  • http://astro.com/pds4/balmy/calibration/times.txt → calibration/times.txt
  • etc.

Works great! Until it doesn't.
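As a minimal sketch of that historical behavior (the manifest_path name is illustrative, not from the codebase): strip the bundle URL's directory from each file_ref to get the manifest path.

```python
def manifest_path(bundle_url: str, file_ref: str) -> str:
    """Strip the bundle URL's directory prefix from a file_ref.

    Assumes every file_ref shares the bundle's directory as a
    prefix -- which, as it turns out, isn't always true.
    """
    prefix = bundle_url.rsplit('/', 1)[0] + '/'
    if file_ref.startswith(prefix):
        return file_ref[len(prefix):]
    # No common prefix at all: fall back to the full file_ref
    return file_ref
```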

Record-Scratch Moment

Enter urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.0; here the bundle URL and many file_refs don't share a common prefix that ends at a /. Take a look at the following actual data:

bundleurl: https://pds-rings/pds4/cassini_solarocc_beckerjarmak2023_v1.0/bundle.xml
           ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑×××××↑
file_ref:  https://pds-rings/pds4/cassini_solarocc_beckerjarmak2023/data/collection_data.csv
           ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑

As you can see, the bundle URL has this extra _v1.0 text that doesn't appear in the typical file_ref in that bundle. (I've elided some of the extra hosts/paths for clarity.) Worse, this is the case for only about half of the file_refs in that bundle: some of them do contain _v1.0!

The checksum manifest should only contain suffixes after the common prefix, but there's not really a common prefix we can compute for the bundle URL if it doesn't have to have anything in common with file_refs.

I'm not sure what to do about this. One option is to make the checksum manifests much longer; instead of

MD5 bundle.xml
MD5 data/collection_data.csv

have

MD5 cassini_solarocc_beckerjarmak2023_v1.0/bundle.xml
MD5 cassini_solarocc_beckerjarmak2023/data/collection_data.csv

Is that okay?
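A minimal sketch of how those longer manifest entries could be produced, assuming we strip only the scheme, the host, and a fixed site root such as /pds4/ (the long_manifest_path name and the site_root default are both illustrative assumptions, not shipped behavior):

```python
from urllib.parse import urlsplit

def long_manifest_path(url: str, site_root: str = "/pds4/") -> str:
    """Keep everything after the host, minus a fixed site root.

    Avoids computing a bundle/file_ref common prefix entirely, at
    the cost of longer paths in the checksum manifest.
    """
    path = urlsplit(url).path
    if path.startswith(site_root):
        path = path[len(site_root):]
    return path.lstrip('/')
```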

Another option would be to write a little function to find a common prefix not based on / but on text alone. OK, but what if we get this?

bundleurl: https://geo.wustl.edu/pds4/earth/bundle.xml
file_ref:  https://geo.wustl.edu/pds4/earth/browse/collection.xml

The common prefix in this case is https://geo.wustl.edu/pds4/earth/b, including the letter b, meaning rowse/collection.xml gets written to the manifest. (This isn't hypothetical; the cassini_uvis_solarocc_beckerjarmak2023 bundle happens to include both /bundle and /browse paths.)
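For the record, Python's stdlib helper exhibits exactly this pitfall, since os.path.commonprefix compares strings character by character rather than per path segment:

```python
from os.path import commonprefix

urls = [
    "https://geo.wustl.edu/pds4/earth/bundle.xml",
    "https://geo.wustl.edu/pds4/earth/browse/collection.xml",
]
# Character-wise comparison stops at "bundle" vs "browse", so the
# prefix ends mid-segment with the stray letter "b".
prefix = commonprefix(urls)
```

Stripping that prefix from the second URL leaves rowse/collection.xml, which is exactly the mangled-path symptom reported above.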

Or what if we get this? 😱

bundleurl: https://atmos.nmsu.edu/pds4/airhead/bundle.xml
file_ref:  ftp://mirror.nmsu.edu/ftp/data/airhead-data-files/collection_data.csv

Do we add a command-line argument like --product-base-urls that lets users specify an entire list of prefixes to remove (which would help the cassini_uvis_solarocc_beckerjarmak2023::1.0 case)? Are users supposed to figure that out? Let me know what you think.

@jordanpadams
Member Author

@nutjob4life let's just go with this:

MD5 cassini_solarocc_beckerjarmak2023_v1.0/bundle.xml
MD5 cassini_solarocc_beckerjarmak2023/data/collection_data.csv
