digitalmars.D - std.uri.decodeComponent decodes invalid UTF-8


kdevel <kdevel vogtner.de> writes:

```D
import std;

void main ()
{
    writeln (decodeComponent ("%c0%80")); // prints U+0000
    writeln (decodeComponent ("%c0%af")); // prints U+002F (SOLIDUS)
    string s = [0xc0, 0xaf];
    validate (s); // throws Invalid UTF-8 sequence (at index 2)
}
```

Quote from [1]
```
    Implementations of the decoding algorithm above MUST protect against
    decoding invalid sequences.  For instance, a naive implementation may
    decode the overlong UTF-8 sequence C0 80 into the character U+0000,
    or the surrogate pair ED A1 8C ED BE B4 into U+233B4.  Decoding
    invalid sequences may have security consequences or cause other
    problems.  See Security Considerations (Section 10) below.
```

Quote from [2]
```
Perhaps the most famous UTF-8 attack was against unpatched 
Microsoft Internet Information Server (IIS) 4 and IIS 5 servers. 
If an attacker made a request that looked like this
http://servername/scripts/..%c0%af../winnt/system32/ cmd.exe

the server didn't correctly handle %c0%af in the URL. What do you 
think %c0%af means? It's 11000000 10101111 in binary; and if it's 
broken up using the UTF-8 mapping rules, we get this: 11000000 
10101111. Therefore, the character is 00000101111, or 0x2F, the 
slash (/) character! The %c0%af is an invalid UTF-8 
representation of the / character. Such an invalid UTF-8 escape 
is often referred to as an overlong sequence.
```
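The bit arithmetic in that quote can be reproduced in a few lines. This is only a sketch of what a naive two-byte decoder does, not the code in std.uri; `naiveDecode2` and `isOverlong2` are made-up names for illustration.

```D
import std.stdio : writefln;

// Naive two-byte UTF-8 decode: takes 5 payload bits from the lead byte
// and 6 from the continuation byte, with no check for overlong forms.
dchar naiveDecode2(ubyte lead, ubyte cont)
{
    return cast(dchar)(((lead & 0x1F) << 6) | (cont & 0x3F));
}

// A two-byte sequence is overlong when its value would fit in 7 bits,
// which is the case exactly when the lead byte is 0xC0 or 0xC1.
bool isOverlong2(ubyte lead)
{
    return lead == 0xC0 || lead == 0xC1;
}

void main()
{
    immutable c = naiveDecode2(0xC0, 0xAF);
    writefln("U+%04X", cast(uint) c); // U+002F
    assert(c == '/');
    assert(isOverlong2(0xC0));  // C0 AF must be rejected
    assert(!isOverlong2(0xC3)); // C3 A4 is a legitimate two-byte sequence
}
```

A decoder that checks the lead byte before combining the payload bits never produces the slash from C0 AF.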

Has the UTF-8 decoding been implemented in multiple places? [3]


[1] *UTF-8, a transformation format of ISO 10646* (2003)
     https://www.rfc-editor.org/rfc/rfc3629#page-5

[2] *CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic*
     https://capec.mitre.org/data/definitions/80.html

[3] *Re: Reducing the cost of autodecoding* (2016)
     https://forum.dlang.org/post/htxicoxzningxnpeyzui@forum.dlang.org

Aug 04 2025

"Richard (Rikki) Andrew Cattermole" <richard cattermole.co.nz> writes:

On 05/08/2025 3:09 PM, kdevel wrote:
> Has the UTF-8 decoding been implemented in multiple places? [3]

Yes but also no.

A URI is ASCII.

Any input to that function will be ASCII, it won't be UTF-8.

The hex encoding is not UTF-8; it's its own encoding, which gets re-encoded 
out to UTF-8.

https://github.com/dlang/phobos/blob/ae07a90aabb34e34e1e73419780549aeb95e8f9c/std/uri.d#L194

This does not validate to the extent that one may like.
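
To illustrate the layering described above, here is a small sketch using `std.uri.encodeComponent` and `std.uri.decodeComponent`. It only shows the well-formed case: the encoded form is pure ASCII, and the escapes stand for the UTF-8 octets of the original character. The choice of "ä" as test input is arbitrary.

```D
import std.uri : decodeComponent, encodeComponent;

void main()
{
    // "ä" is U+00E4; its UTF-8 encoding is the two octets C3 A4.
    auto encoded = encodeComponent("\u00E4");

    // The encoded component consists of ASCII characters only.
    foreach (char c; encoded)
        assert(c < 0x80);

    // Decoding reassembles the octets and yields the character again.
    assert(decodeComponent(encoded) == "\u00E4");

    // A reserved ASCII character round-trips the same way.
    assert(decodeComponent(encodeComponent("/")) == "/");
}
```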

Aug 05 2025

kdevel <kdevel vogtner.de> writes:

The bug is as follows:

On Tuesday, 5 August 2025 at 19:28:06 UTC, Richard (Rikki) Andrew 
Cattermole wrote:
> [...]
> A URI is ASCII.

Sure. It is this:

    %c0%af

> Any input to that function will be ASCII, it won't be UTF-8.
>
> The hex encoding is not UTF-8, it's its own encoding, that gets
> reencoded out to UTF-8.

Which is decoded by decodeComponent into

    /

which is valid ASCII and valid UTF-8. But the mapping of

    %c0%af -> /

is an invalid one. It may be debatable if decodeComponent could
legitimately have returned invalid UTF-8, i.e. "\xc0\xaf".

The bug is that decodeComponent decodes invalid UTF-8 without
notifying the user of that function. This behavior is a violation
of RFC 3629.
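
Until that is fixed, one possible workaround is to percent-decode into raw octets first and run `std.utf.validate` on the result, so that overlong sequences are reported instead of silently mapped. The following is a minimal sketch under that assumption; `strictDecode` and `hexVal` are hypothetical helpers, not part of Phobos, and the sketch does not reproduce everything decodeComponent does.

```D
import std.exception : assertThrown, enforce;
import std.utf : UTFException, validate;

// Hypothetical helper: value of one hex digit, or -1 if not a hex digit.
int hexVal(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: percent-decodes `s` into raw octets and only then
// checks that the octets form well-formed UTF-8.
string strictDecode(string s)
{
    char[] buf;
    for (size_t i = 0; i < s.length; ++i)
    {
        if (s[i] != '%')
        {
            buf ~= s[i];
            continue;
        }
        immutable hi = i + 2 < s.length ? hexVal(s[i + 1]) : -1;
        immutable lo = i + 2 < s.length ? hexVal(s[i + 2]) : -1;
        enforce(hi >= 0 && lo >= 0, "malformed percent escape");
        buf ~= cast(char)(hi * 16 + lo);
        i += 2; // skip the two hex digits just consumed
    }
    validate(buf); // throws UTFException on invalid UTF-8
    return buf.idup;
}

void main()
{
    assert(strictDecode("a%2Fb") == "a/b");
    assert(strictDecode("%c3%a4") == "\u00E4");
    assertThrown!UTFException(strictDecode("%c0%af"));
}
```

Validating the octets after unescaping keeps the two layers separate: the escape syntax is checked by the loop, the UTF-8 well-formedness by `validate`.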

Aug 05 2025
