Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

url-encode parameters URL paths and the fallback_id header #55

Merged
merged 2 commits into from
Jul 13, 2023

Conversation

lindsay-stevens
Copy link
Contributor

@lindsay-stevens lindsay-stevens commented May 10, 2023

Closes #53
Closes #54

  • Add manual / smoke test against current Central behaviour
  • Add automated test for above as well

What has been done to verify that this works as intended?

New unit tests for the encoding functions on Session, and mocked integration test on client.forms.update.

Why is this the best possible solution? Were any other approaches considered?

Similar to Central frontend behaviour. In terms of implementation, 1) could have used a custom string class but that seemed over the top, or 2) could have put the functionality in the URL classes, but it seemed like it might be useful to put on Session so that users that are using client.get (or other verbs) can use it easily as well.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Supports use cases where form_id or instance_id contains non-ASCII characters.

Do we need any specific form for testing your changes? If so, please attach one.

Required forms are in the repository under tests/resources/forms.

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Probably not needed, unless there is a gotcha when interacting with Central

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run nosetests and verified all tests pass
  • run python bin/pre_commit.py to format / lint code
  • verified that any code or assets from external sources are properly credited in comments

- strings passed in to request(data=some_str) get encoded as latin-1 by
  default. Data passed in as dictionaries to `json` or `params` are
  encoded to utf-8 by default. So encode the XML.
- forms.py: tidy up docstring wrap length to 90 chars
- submissions.py:
  - add encoding parameter, default utf-8, for encoding the XML, for
    submission.create and submission.edit
  - move example submission edit XML from private ._put method to
    public .edit method so users can see it.
- test_client.py: add integration tests for non-ascii form_ids
  and instance_ids, incl. check for retrieval after create/update.
@lindsay-stevens
Copy link
Contributor Author

@matthew-white submission XML also needed explicit encoding to utf-8 since it was defaulting to latin-1. I've fixed that in the PR but came across another issue. I'm wondering if this is a bug in pyODK (in the PR), or a bug in Central, or just undefined behaviour.

I created a submission and edited it a couple of times (the below XML is I think the 3rd edit). I can see the submission in the Central, but when I click the "edit" link for that submission I get an error 404.1 shown in the UI as a red toast box "Could not find the resource you were looking for.".

        client = Client()
        xml = """
        <data id="'=+/*-451%/%" version="1">
          <meta>
            <deprecatedID>~!@#$%^&*()_+=-✘✘</deprecatedID>
            <instanceID>~!@#$%^&*()_+=-✘✘✘</instanceID>
          </meta>
          <fruit>Papaya</fruit>
          <note_fruit/>
        </data>
        """
        client.submissions.edit(
            xml=xml, form_id="'=+/*-451%/%", instance_id="~!@#$%^&*()_+=-✅✅"
        )

Form URL

https://staging.getodk.cloud/#/projects/1/forms/'%3D%2B%2F*-451%25%2F%25

Submission URL Expected - the edit was submitted with this:

https://staging.getodk.cloud/#/projects/1/forms/%27%3D%2B%2F%2A-451%25%2F%25/submissions/~%21%40%23%24%25%5E%26%2A%28%29_%2B%3D-%E2%9C%85%E2%9C%85

Submission URL Actual - partially encoded?

https://staging.getodk.cloud/#/projects/1/forms/'%3D%2B%2F*-451%25%2F%25/submissions/~!%40%23%24%25%5E%26*()_%2B%3D-%E2%9C%85%E2%9C%85

@lindsay-stevens lindsay-stevens marked this pull request as ready for review May 11, 2023 04:52
@matthew-white
Copy link
Member

when I click the "edit" link for that submission I get an error 404.1 shown in the UI as a red toast box "Could not find the resource you were looking for.".

Is that when you click the "More" link or the edit button (the pencil icon)? I see error messages in both cases, but I only see "Could not find the resource…" after clicking "More".

Clicking the "More" link is definitely running into a bug with the Central API. In most cases, Central doesn't assume that submission instance IDs take any particular form. However, for requesting an individual submission over OData, Central assumes that the instance ID is a UUID (either with or without the uuid: prefix). It probably says something that we haven't heard about users running into this bug. I think almost all instance IDs end up being a UUID.

Clicking the edit button, I see the following error in Enketo: "Record not present. It may have expired." I see one request that resulted in an error, to the URL https://staging.getodk.cloud/-/submission/HYP8U3AqTpcUsPP8oE7eGEHeJG8k4N7?enketoId=HYP8U3AqTpcUsPP8oE7eGEHeJG8k4N7&instanceId=~!%40. %40 is @, so it looks like the first 3 characters of the instance ID are sent, then the ID is truncated at #. I'm guessing that # isn't encoded, so it's treated as an actual URL hash, and everything after it is ignored. @lognaturel, let me know if you'd like me to file an Enketo issue about this.

Also, @lognaturel, could you confirm that I'm right in thinking that in practice, almost all instances IDs are UUIDs? Is there any chance it'd make sense to make that part of the XForms spec? Or are we intentionally leaving open the possibility that we could do something different in the future?

Submission URL Actual - partially encoded?

https://staging.getodk.cloud/#/projects/1/forms/'%3D%2B%2F*-451%25%2F%25/submissions/~!%40%23%24%25%5E%26*()_%2B%3D-%E2%9C%85%E2%9C%85

I think this one is OK actually! Those characters are allowed to be encoded, but they don't have to be. From the MDN article about encodeURI():

Both encodeURI() and encodeURIComponent() do not encode the characters -.!~*'(), known as "unreserved marks", which do not have a reserved purpose but are allowed in a URI "as is".

@matthew-white
Copy link
Member

Submission URL Actual - partially encoded?

https://staging.getodk.cloud/#/projects/1/forms/'%3D%2B%2F*-451%25%2F%25/submissions/~!%40%23%24%25%5E%26*()_%2B%3D-%E2%9C%85%E2%9C%85

I think this one is OK actually! Those characters are allowed to be encoded, but they don't have to be.

I was looking at this a little more. It seems like the JavaScript function encodeURIComponent() follows RFC 2396, which doesn't encode as many characters as the newer RFC 3986. Maybe the "Submission URL Expected" above follows RFC 3986, which is why it encodes more characters?

The Central API and frontend both use encodeURIComponent(). Should they be following RFC 3986 instead? We haven't run into any issues with encodeURIComponent(), and I've seen the following paragraph of RFC 3986 quoted to justify not encoding all reserved characters:

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.

This paragraph also seems relevant to me:

Under normal circumstances, the only time when octets within a URI
are percent-encoded is during the process of producing the URI from
its component parts. This is when an implementation determines which
of the reserved characters are to be used as subcomponent delimiters
and which can be safely used as data. Once produced, a URI is always
in its percent-encoded form.

I think I'll leave it here unless @lognaturel wants me to look into this further. Mostly I just wanted to clarify my previous comment. It looks like pyODK might be encoding according to RFC 3986, which should work well with Central.

Copy link
Member

@lognaturel lognaturel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests look good. I made sure the edits make sense. I did not spend time looking for potential missed opportunities to encode.

@lognaturel
Copy link
Member

it'd make sense to make that part of the XForms spec? Or are we intentionally leaving open the possibility that we could do something different in the future?

I don’t think it had previously been a hard requirement anywhere so it would be a restriction for Central only. I’m not sure it belongs in the spec? At the same time, I think it is a de facto standard.

@lognaturel lognaturel merged commit b097da1 into getodk:master Jul 13, 2023
@lindsay-stevens lindsay-stevens deleted the url-encoding branch July 13, 2023 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encode Unicode in X-XlsForm-FormId-Fallback Encode URL syntax characters in form IDs
3 participants