
digitalmars.D - Auto-UTF-detection - Feature Request

reply Arcane Jill <Arcane_member pathlink.com> writes:
In the source text analysis phase, the compiler does this (according to the
manual):

"The source text is assumed to be in UTF-8, unless one of the following BOMs
(Byte Order Marks) is present at the beginning of the source text".

However, it is heuristically possible to distinguish between the various UTFs
even /without/ a BOM. Okay, so it is /theoretically/ possible for an ambiguity
to exist, but those edge cases are going to be almost infinitesimally rare for
text files in general, and I'd say zero for D source files (which will consist
mostly of ASCII characters). I say, try to auto-detect the difference.

Here's how ya do it:

Since a D source file mostly consists of ASCII characters, excluding NULL, any
4-byte-aligned fragment of a D source file is likely to look like this, where xx
stands for any non-zero byte, and ?? stands for any byte at all:

#    xx xx xx xx   likely to be UTF-8
#    xx 00 xx 00   likely to be UTF-16LE
#    00 xx 00 xx   likely to be UTF-16BE
#    xx 00 00 00   likely to be UTF-32LE
#    00 00 00 xx   likely to be UTF-32BE
#
#    00 ?? ?? ??   definitely not UTF-8
#    ?? 00 ?? ??   definitely not UTF-8
#    ?? ?? 00 ??   definitely not UTF-8
#    ?? ?? ?? 00   definitely not UTF-8
#
#    00 00 ?? ??   definitely not UTF-16LE or UTF-16BE
#    ?? ?? 00 00   definitely not UTF-16LE or UTF-16BE
#
#    xx ?? ?? ??   definitely not UTF-32BE
#    ?? ?? ?? xx   definitely not UTF-32LE

Simply by analysing a few such four-byte chunks (say, the first 1024 bytes of
the file) and counting how many fit each pattern, you can easily determine the
most likely encoding. This is a statistical test, obviously, since not /all/
bytes will be ASCII, but it will catch all but a few extreme edge cases. If you
haven't made your mind up within the first 1024 bytes of the file, read the
/next/ 1024 bytes and try again, and so on.
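[A minimal sketch of this vote-counting heuristic in C; the names (`guessEncoding`, the `GUESS_*` constants) are illustrative, not anything DMD actually ships:]

```c
#include <stddef.h>

enum Guess { GUESS_UTF8, GUESS_UTF16LE, GUESS_UTF16BE,
             GUESS_UTF32LE, GUESS_UTF32BE, GUESS_UNKNOWN };

/* Score each 4-byte-aligned chunk against the zero-byte patterns above
 * and return the encoding with the most votes. */
enum Guess guessEncoding(const unsigned char *buf, size_t len)
{
    size_t votes[5] = {0};
    for (size_t i = 0; i + 4 <= len; i += 4) {
        const unsigned char *p = buf + i;
        if      ( p[0] &&  p[1] &&  p[2] &&  p[3]) votes[GUESS_UTF8]++;
        else if ( p[0] && !p[1] &&  p[2] && !p[3]) votes[GUESS_UTF16LE]++;
        else if (!p[0] &&  p[1] && !p[2] &&  p[3]) votes[GUESS_UTF16BE]++;
        else if ( p[0] && !p[1] && !p[2] && !p[3]) votes[GUESS_UTF32LE]++;
        else if (!p[0] && !p[1] && !p[2] &&  p[3]) votes[GUESS_UTF32BE]++;
        /* chunks matching no pattern cast no vote */
    }
    enum Guess best = GUESS_UNKNOWN;
    size_t bestVotes = 0;
    for (int e = 0; e < 5; e++)
        if (votes[e] > bestVotes) { bestVotes = votes[e]; best = (enum Guess)e; }
    return best;
}
```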

Alternatively, if that sounds too hard, there's an even easier (but less
efficient) algorithm:

1) Assume UTF-32LE. Validate.
2) Assume UTF-32BE. Validate.
3) Assume UTF-16LE. Validate.
4) Assume UTF-16BE. Validate.
5) Assume UTF-8.    Validate.

If precisely one of these validations succeeds, you've sussed it. If more than
one succeeds, it's still ambiguous, but the chances of this happening are
microscopic. If none succeed, the source file is not UTF. 
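[As a rough illustration of what step 5's validation involves, here is a minimal structural UTF-8 check in C. This is a sketch only: it verifies the lead-byte/continuation-byte layout and nothing more; a production validator must also reject overlong forms, surrogate code points, and values above U+10FFFF.]

```c
#include <stddef.h>
#include <stdbool.h>

/* Returns true if s[0..len) parses as a sequence of well-formed
 * UTF-8 lead/continuation byte groups. */
bool validUtf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t trail;
        if      (b < 0x80)           trail = 0;   /* ASCII */
        else if ((b & 0xE0) == 0xC0) trail = 1;   /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) trail = 2;   /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) trail = 3;   /* 4-byte sequence */
        else return false;                        /* invalid lead byte */
        if (i + trail >= len && trail > 0) return false;  /* truncated */
        for (size_t j = 1; j <= trail; j++)
            if ((s[i + j] & 0xC0) != 0x80) return false;  /* bad trailer */
        i += trail + 1;
    }
    return true;
}
```

The UTF-16 and UTF-32 validators are analogous: check surrogate pairing for UTF-16, and range/surrogate exclusion for UTF-32, in each byte order.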

Arcane Jill
Jul 25 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
Actually, it's just occurred to me that it's /really easy/ to tell the encoding
of a D source file, because of the fact that the very first character of a D
source file MUST be either a UTF BOM or a non-NULL ASCII. So all we have to do
is to test for each of these contingencies. Here's a short function that does
just that:

#    // Input: s: the first four bytes of the D source file
#    // Output: the D source file encoding
#
#    enum Encoding { UNKNOWN, UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE }
#
#    Encoding determineDSourceEncoding(ubyte[] s)
#    in
#    {
#        assert(s.length >= 4);
#    }
#    body
#    {
#        ubyte a = s[0];
#        ubyte b = s[1];
#        ubyte c = s[2];
#        ubyte d = s[3];
#
#        if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return Encoding.UTF_32LE; // BOM
#        if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return Encoding.UTF_32BE; // BOM
#        if (a==0xFF && b==0xFE)                       return Encoding.UTF_16LE; // BOM
#        if (a==0xFE && b==0xFF)                       return Encoding.UTF_16BE; // BOM
#        if (a==0xEF && b==0xBB && c==0xBF)            return Encoding.UTF_8;    // BOM
#        if (b==0x00 && c==0x00 && d==0x00)            return Encoding.UTF_32LE; // ASCII
#        if (a==0x00 && b==0x00 && c==0x00)            return Encoding.UTF_32BE; // ASCII
#        if (b==0x00)                                  return Encoding.UTF_16LE; // ASCII
#        if (a==0x00)                                  return Encoding.UTF_16BE; // ASCII
#        return Encoding.UTF_8;                                                  // ASCII
#    }

This is an important issue, because I just did a quick test. Using my favorite
text editor (TextPad), I saved a text file in UTF-16LE. I then examined the
saved file with a hex editor. I can confirm that the file was saved in UTF-16LE,
but, critically, /without a BOM/. I don't know what other text editors do, but
clearly, if there is even a remote chance that source files will get saved
without a BOM, then we really ought to be able to compile those source files!

Arcane Jill
Jul 25 2004
next sibling parent reply J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
...
 This is an important issue, because I just did a quick test. Using my favorite
 text editor (TextPad), I saved a text file in UTF-16LE. I then examined the
 saved file with a hex editor. I can confirm that the file was saved in
UTF-16LE,
 but, critically, /without a BOM/. I don't know what other text editors do, but
 clearly, if there is even a remote chance that source files will get saved
 without a BOM, then we really ought to be able to compile those source files!
 
 Arcane Jill

All is not lost. I downloaded and installed TextPad. I observed the problem of
no BOMs, but I also found a solution:

* Choose Configure/Preferences from the menubar.
* Find Document Classes/Default in the tree on the left (I guess you might
  have to choose something else here if you've set up a "D mode").
* Under "Document class options", check [x] Write Unicode and UTF-8 BOM.

Now when you save again, it'll add the BOMs:

Unicode:              FF FE    (UTF-16 LE)
Unicode (big endian): FE FF    (UTF-16 BE)
UTF-8:                EF BB BF (UTF-8)

So there's no problem using TextPad. (I don't know why the BOMs wouldn't be
enabled by default, but that's a whole other issue.)

The BOMs are standard, right? If a supposedly Unicode-capable editor won't add
BOMs, it might not really be considered Unicode-capable. If a person wants to
use one of those editors, fine. But please stick to UTF-8.

-- 
Justin (a/k/a jcc7)   http://jcc_7.tripod.com/d/
Jul 25 2004
parent "Walter" <newshound digitalmars.com> writes:
"J C Calvarese" <jcc7 cox.net> wrote in message
news:ce1d18$2sa5$1 digitaldaemon.com...
 So there's no problem using TextPad. (I don't know why the BOMs wouldn't
 be enabled by default, but that's a whole other issue.)

Send them a bug report!
 The BOMs are standard, right?

Yes.
 If a supposedly Unicode-capable editor won't add
 BOMs, it might not really be considered Unicode-capable. If a person
 wants to use one of those editors, fine. But please stick to UTF-8.

Jul 25 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce1848$2p2j$1 digitaldaemon.com...
 This is an important issue, because I just did a quick test. Using my
 favorite text editor (TextPad), I saved a text file in UTF-16LE. I then
 examined the saved file with a hex editor. I can confirm that the file was
 saved in UTF-16LE, but, critically, /without a BOM/. I don't know what other
 text editors do, but clearly, if there is even a remote chance that source
 files will get saved without a BOM, then we really ought to be able to
 compile those source files!
Ack. What are the Textpad programmers thinking? They need to fix Textpad to
put out the BOM.
Jul 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce21g6$93r$1 digitaldaemon.com>, Walter says...

Ack. What are the Textpad programmers thinking? They need to fix Textpad to
put out the BOM.

This is incorrect. UTF-16LE does not require a BOM.

Almost all questions regarding UTFs and BOMs can be answered by heading over
to the Unicode web site (www.unicode.org) and clicking on "FAQ". I cite from
this FAQ here:

"In particular, if a text data stream is marked as UTF-16BE, UTF-16LE,
UTF-32BE or UTF-32LE, a BOM is neither necessary nor /permitted/" (italics
present in original FAQ).

This doesn't apply to D source code, of course, since D source files are not
"marked" in any way; however, read that FAQ. Everything about BOMs in that
FAQ tells you that BOMs are "useful". Nowhere does it say they are
"required" - and, as noted, in some cases they are even prohibited.

Fortunately, since D syntax requires that the first character of a D source
file /must/ be an ASCII character, detecting the encoding is quick and easy.
See my other posts on this thread for source code which works.

You *WILL* encounter BOM-less source files in the wild. Insisting that the
BOM be there is in defiance of Unicode rules, and is just going to cripple
DMD.

Arcane Jill
Jul 26 2004
parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce2onb$pug$1 digitaldaemon.com...
 Fortunately, since D syntax requires that the first character of a D source
 file /must/ be an ASCII character, detecting the encoding is quick and easy.

Yes, and that's a key insight that I'd missed. Thanks!
Jul 26 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce1848$2p2j$1 digitaldaemon.com>, Arcane Jill says...

Revised source - now handles files which are less than four bytes long:

#    // Input: s: the first four byte of the D source file,
#    //        (or all of them, if the file size is less than four bytes).
#    // Output: the D source file encoding
#
#    enum Encoding { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE }
#
#    Encoding determineDSourceEncoding(ubyte[] s)
#    {
#        ubyte a = s.length >= 1 ? s[0] : 1;
#        ubyte b = s.length >= 2 ? s[1] : 1;
#        ubyte c = s.length >= 3 ? s[2] : 1;
#        ubyte d = s.length >= 4 ? s[3] : 1;
#
#        if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return Encoding.UTF_32LE;
#        if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return Encoding.UTF_32BE;
#        if (a==0xFF && b==0xFE)                       return Encoding.UTF_16LE;
#        if (a==0xFE && b==0xFF)                       return Encoding.UTF_16BE;
#        if (a==0xEF && b==0xBB && c==0xBF)            return Encoding.UTF_8;
#        if (b==0x00 && c==0x00 && d==0x00)            return Encoding.UTF_32LE;
#        if (a==0x00 && b==0x00 && c==0x00)            return Encoding.UTF_32BE;
#        if (b==0x00)                                  return Encoding.UTF_16LE;
#        if (a==0x00)                                  return Encoding.UTF_16BE;
#        return Encoding.UTF_8;
#    }

Come on, that's a /tiny/ function, and I've written it all for you. The
"overhead" is to call it /once/ during the source text stage (and that's
/instead of/, not as well as, the current detection routine). As its input,
it needs only the first four bytes of the source file (fewer if the file
size is less than four bytes).

You are correct in that many applications which are out there are buggy when
it comes to Unicode, and equally correct that we should complain about that.
But I've just given you a new /feature/ which is almost trivially small and
which you can boast about freely - and which will save you from receiving a
few misdirected bug reports in the future.

Not even worth thinking about?

Arcane Jill
Jul 26 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce2due$jos$1 digitaldaemon.com>, Arcane Jill says...

Revised source - now handles files which are less than four bytes long:

I should have written that in C, shouldn't I? Never mind - I think just
replacing "ubyte" with "unsigned char" should do it.

Jill
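[For reference, a straight C rendering along those lines, with "unsigned char" in place of "ubyte" and an explicit length parameter standing in for the D array's `.length`; a sketch, not DMD's actual source:]

```c
#include <stddef.h>

enum Encoding { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE };

/* Input:  s, len: the first four bytes of the D source file
 *         (or all of them, if the file is shorter than four bytes).
 * Output: the D source file encoding. */
enum Encoding determineDSourceEncoding(const unsigned char *s, size_t len)
{
    /* Pad missing bytes with a nonzero value, as in the D version. */
    unsigned char a = len >= 1 ? s[0] : 1;
    unsigned char b = len >= 2 ? s[1] : 1;
    unsigned char c = len >= 3 ? s[2] : 1;
    unsigned char d = len >= 4 ? s[3] : 1;

    if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return UTF_32LE; /* BOM */
    if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return UTF_32BE; /* BOM */
    if (a==0xFF && b==0xFE)                       return UTF_16LE; /* BOM */
    if (a==0xFE && b==0xFF)                       return UTF_16BE; /* BOM */
    if (a==0xEF && b==0xBB && c==0xBF)            return UTF_8;    /* BOM */
    if (b==0x00 && c==0x00 && d==0x00)            return UTF_32LE; /* ASCII */
    if (a==0x00 && b==0x00 && c==0x00)            return UTF_32BE; /* ASCII */
    if (b==0x00)                                  return UTF_16LE; /* ASCII */
    if (a==0x00)                                  return UTF_16BE; /* ASCII */
    return UTF_8;                                                  /* ASCII */
}
```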
Jul 26 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce2due$jos$1 digitaldaemon.com...
 Come on, that's a /tiny/ function, and I've written it all for you. The
 "overhead" is to call it /once/ during the source text stage (and that's
 /instead of/, not as well as, the current detection routine). As its input,
 it needs only the first four bytes of the source file (fewer if the file
 size is less than four bytes).

 You are correct in that many applications which are out there are buggy
 when it comes to Unicode, and equally correct that we should complain about
 that. But I've just given you a new /feature/ which is almost trivially
 small and which you can boast about freely - and which will save you from
 receiving a few misdirected bug reports in the future.

 Not even worth thinking about?

I'd already added it to my todo list, Jill <g>. But these things sometimes
have hidden gotchas, so I wanted to let it simmer for a bit. I've put in
stuff too quickly before, and had to back it out later :-(
Jul 26 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce2gk5$l00$1 digitaldaemon.com>, Walter says...

I'd already added it to my todo list, Jill <g>. But these things sometimes
have hidden gotchas, so I wanted to let it simmer for a bit. I've put in
stuff too quickly before, and had to back it out later :-(

Fair enough. No hurry. I'm sure we're all in agreement that more important
bug-fixes should come first. Keep up the good work. :)

Jill
Jul 26 2004
prev sibling parent James McComb <alan jamesmccomb.id.au> writes:
Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:ce2due$jos$1 digitaldaemon.com...
 
 Come on, that's a /tiny/ function, and I've written it all for you. The
 "overhead" is to call it /once/ during the source text stage (and that's
 /instead of/, not as well as, the current detection routine). As its input,
 it needs only the first four bytes of the source file.

I'd already added it to my todo list, Jill <g>. But these things sometimes
have hidden gotchas, so I wanted to let it simmer for a bit. I've put in
stuff too quickly before, and had to back it out later :-(

Okay, I agree that the overhead is non-existent. This may be a potential
"gotcha" (but I don't know how significant it is):

Walter implements this detection routine, but D compilers from other vendors
don't. Then there would be D files that only compile on DMD. To prevent this
from happening, the new Unicode detection algorithm needs to be explicit in
the D spec.

James McComb
Jul 26 2004
prev sibling parent James McComb <alan jamesmccomb.id.au> writes:
Arcane Jill wrote:
 In the source text analysis phase, the compiler does this (according to the
 manual):
 
 "The source text is assumed to be in UTF-8, unless one of the following BOMs
 (Byte Order Marks) is present at the beginning of the source text".

I like this rule. It says that D is not going to try and *guess* the
character encoding of the file.

Okay, maybe Walter can write some code that can guess the encoding correctly
99% of the time, but I don't think that it is worth complicating the compiler
and slightly increasing compile times just to handle missing BOMs. Here's
why:

Almost all the time, code will be written in UTF-8. If someone has gone to
the trouble of writing their code in UTF-16 or UTF-32, they can go to the
trouble of including a BOM in their file. After all, people are advised to
use a BOM in those circumstances anyway.

James McComb
Jul 25 2004