digitalmars.D - std.uri.decodeComponent decodes invalid UTF-8
- kdevel (45/45) Aug 04
- Richard (Rikki) Andrew Cattermole (8/9) Aug 05 Yes but also no.
- kdevel (14/19) Aug 05 The bug is as follows:
```D
import std;

void main ()
{
   writeln (decodeComponent ("%c0%80")); // prints U+0000
   writeln (decodeComponent ("%c0%af")); // prints U+002F (SOLIDUS)
   string s = [0xc0, 0xaf];
   validate (s); // throws Invalid UTF-8 sequence (at index 2)
}
```

Quote from [1]

```
Implementations of the decoding algorithm above MUST protect against
decoding invalid sequences.  For instance, a naive implementation may
decode the overlong UTF-8 sequence C0 80 into the character U+0000,
or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
invalid sequences may have security consequences or cause other
problems.  See Security Considerations (Section 10) below.
```

Quote from [2]

```
Perhaps the most famous UTF-8 attack was against unpatched Microsoft
Internet Information Server (IIS) 4 and IIS 5 servers. If an attacker
made a request that looked like this
http://servername/scripts/..%c0%af../winnt/system32/cmd.exe
the server didn't correctly handle %c0%af in the URL. What do you
think %c0%af means? It's 11000000 10101111 in binary; and if it's
broken up using the UTF-8 mapping rules, we get this:
11000000 10101111. Therefore, the character is 00000101111, or 0x2F,
the slash (/) character! The %c0%af is an invalid UTF-8 representation
of the / character. Such an invalid UTF-8 escape is often referred to
as an overlong sequence.
```

Has the UTF-8 decoding been implemented in multiple places? [3]

[1] *UTF-8, a transformation format of ISO 10646* (2003)
https://www.rfc-editor.org/rfc/rfc3629#page-5

[2] *CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic*
https://capec.mitre.org/data/definitions/80.html

[3] *Re: Reducing the cost of autodecoding* (2016)
https://forum.dlang.org/post/htxicoxzningxnpeyzui forum.dlang.org
Aug 04
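The arithmetic in the CAPEC quote can be reproduced in a few lines. This is only an illustrative sketch: `naiveDecode2` and `isOverlong2` are hypothetical names, not Phobos functions, and the code covers two-byte sequences only. It shows how a decoder that skips the range check turns C0 AF into U+002F, and what the missing check looks like.

```D
/// Hypothetical naive decoder for a two-byte UTF-8 sequence.
/// It strips the marker bits and joins the payload bits,
/// without checking that the result needed two bytes at all.
dchar naiveDecode2(ubyte lead, ubyte trail)
{
    // lead is 110xxxxx, trail is 10yyyyyy; the result is xxxxxyyyyyy.
    return cast(dchar)(((lead & 0x1F) << 6) | (trail & 0x3F));
}

/// A two-byte sequence may only encode U+0080 .. U+07FF.
/// Anything below U+0080 fits in one byte, so the sequence is overlong.
bool isOverlong2(ubyte lead, ubyte trail)
{
    return naiveDecode2(lead, trail) < 0x80;
}
```

With these definitions, `naiveDecode2(0xC0, 0xAF)` yields `'/'` and `isOverlong2(0xC0, 0xAF)` is `true`, while a legitimate sequence such as C3 A9 (U+00E9) is not flagged.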
On 05/08/2025 3:09 PM, kdevel wrote:
> Has the UTF-8 decoding been implemented in multiple places? [3]

Yes but also no.

A URI is ASCII. Any input to that function will be ASCII, it won't be UTF-8.

The hex encoding is not UTF-8, it's its own encoding, that gets reencoded out to UTF-8.

https://github.com/dlang/phobos/blob/ae07a90aabb34e34e1e73419780549aeb95e8f9c/std/uri.d#L194

This does not validate to the extent that one may like.
Aug 05
The bug is as follows:

On Tuesday, 5 August 2025 at 19:28:06 UTC, Richard (Rikki) Andrew Cattermole wrote:
> [...]
> A URI is ASCII.

Sure. It is this: %c0%af

> Any input to that function will be ASCII, it won't be UTF-8. The hex encoding is not UTF-8, it's its own encoding, that gets reencoded out to UTF-8.

Which is decoded by decodeComponent into / which is valid ASCII and valid UTF-8. But the mapping of %c0%af -> / is an invalid one.

It may be debatable whether decodeComponent could legitimately have returned invalid UTF-8, i.e. "\xc0\xaf". The bug is that decodeComponent decodes invalid UTF-8 without notifying the user of that function. This behavior is a violation of RFC 3629.
Aug 05
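Until the behavior is changed in Phobos, one possible workaround is to percent-decode to raw bytes first and let `std.utf.validate` reject malformed sequences, as it does in the first post. The following is a minimal sketch under stated assumptions: `strictDecodeComponent` and `hexValue` are hypothetical helpers, not part of `std.uri`, and the function is not a drop-in replacement for `decodeComponent` (it throws a plain `Exception` on malformed escapes rather than `URIException`).

```D
import std.utf : validate, UTFException;

/// Hypothetical helper: value of one hex digit, or -1 if `c` is not one.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/// Hypothetical strict decoder (not part of Phobos): turns every %XX
/// escape into one raw byte, then validates the whole result as UTF-8.
string strictDecodeComponent(string encoded)
{
    char[] raw;
    size_t i = 0;
    while (i < encoded.length)
    {
        if (encoded[i] != '%')
        {
            raw ~= encoded[i];
            i += 1;
            continue;
        }
        // A percent sign must be followed by exactly two hex digits.
        if (i + 2 >= encoded.length)
            throw new Exception("truncated percent escape");
        immutable hi = hexValue(encoded[i + 1]);
        immutable lo = hexValue(encoded[i + 2]);
        if (hi < 0 || lo < 0)
            throw new Exception("malformed percent escape");
        raw ~= cast(char)(hi * 16 + lo);
        i += 3;
    }
    // Throws UTFException on overlong or otherwise invalid sequences,
    // so "%c0%af" never becomes "/".
    validate(raw);
    return raw.idup;
}
```

With this sketch, `strictDecodeComponent("a%2Fb")` returns `"a/b"`, whereas `strictDecodeComponent("%c0%af")` throws a `UTFException` instead of silently producing a slash.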