On 17/02/2021 12:16, Greg Wilkins wrote:
> However, that is not compliant with RFC 3986 which says that
> normalization should happen before decoding.
Where does it say that? I just looked through all the references to
normalization and couldn't find it.
When a URI is dereferenced, the components and subcomponents
significant to the scheme-specific dereferencing process (if any)
must be parsed and separated before the percent-encoded octets within
those components can be safely decoded, as otherwise the data may be
mistaken for component delimiters. The only exception is for
percent-encoded octets corresponding to characters in the unreserved
set, which can be decoded at any time.
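
To illustrate the passage above, here is a minimal sketch (not any container's actual code) of why the order matters. The function names and the sample path are made up for illustration; only `urllib.parse.unquote` is a real library call.

```python
from urllib.parse import unquote


def split_then_decode(raw_path):
    # Separate on the literal "/" delimiters first, then
    # percent-decode each segment on its own.
    return [unquote(segment) for segment in raw_path.split("/")]


def decode_then_split(raw_path):
    # Decoding first turns %2F into "/", which is then
    # mistaken for a segment delimiter.
    return unquote(raw_path).split("/")


raw = "/a%2Fb/c"
print(split_then_decode(raw))  # ['', 'a/b', 'c']
print(decode_then_split(raw))  # ['', 'a', 'b', 'c']
```

The first result keeps "a/b" as a single segment; the second silently turns it into two.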
> That is all fine if you
> remember the segment boundaries so that segments like "%2e%2e", "%2f"
> and "..;" would be seen as decoded segments after normalization of "..",
> "/" and "..".
My reading of section 2.2 is that reserved characters should not be %nn
decoded prior to normalization.
Exactly, so if we decode %2f or %2e%2e after normalization, we end up with a string that could come out wrong if it is normalized again.
If our implementations apply their algorithms consistently, this will not be a problem. But if we give an application a path that contains
a ".." or a "/" that is segment content rather than a delimiter, then the path is ambiguous and the application can rightly get confused.
Or it can just hand that path back to the container, which will normalize it again and get confused itself.
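
A rough sketch of that round trip, assuming a simplified dot-segment remover for absolute paths (this is not the full RFC 3986 section 5.2.4 algorithm, and `normalize` and the sample path are hypothetical names chosen for illustration):

```python
from urllib.parse import unquote


def normalize(path):
    # Simplified dot-segment removal for absolute paths:
    # drop "." segments and let ".." remove the previous segment.
    kept = []
    for segment in path.split("/")[1:]:
        if segment == ".":
            continue
        if segment == "..":
            if kept:
                kept.pop()
            continue
        kept.append(segment)
    return "/" + "/".join(kept)


raw = "/files/..%2f..%2fetc"

once = normalize(raw)       # encoded segment is left alone
decoded = unquote(once)     # what the application is handed
twice = normalize(decoded)  # application passes it back to the container

print(once)     # /files/..%2f..%2fetc
print(decoded)  # /files/../../etc
print(twice)    # /etc
```

The first normalization sees two segments and changes nothing; once the decoded string loses its segment boundaries, the second normalization resolves to a different resource.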
I think the Servlet spec is the place to be more explicit about this.
This has been on my radar for a while:
https://github.com/eclipse-ee4j/servlet-api/issues/18
Good one. I'll note a summary of this email there.