
digitalmars.D - Auto-UTF-detection - Feature Request

reply Arcane Jill <Arcane_member pathlink.com> writes:
In the source text analysis phase, the compiler does this (according to the
manual):

"The source text is assumed to be in UTF-8, unless one of the following BOMs
(Byte Order Marks) is present at the beginning of the source text".

However, it is heuristically possible to distinguish between the various UTFs
even /without/ a BOM. Okay, so it is /theoretically/ possible for an ambiguity
to exist, but those edge cases are going to be almost infinitesimally rare for
text files in general, and I'd say zero for D source files (which will consist
mostly of ASCII characters). I say, try to auto-detect the difference.

Here's how ya do it:

Since a D source file mostly consists of ASCII characters, excluding NULL, any
4-byte-aligned fragment of a D source file is likely to look like this, where xx
stands for any non-zero byte, and ?? stands for any byte at all:

#    xx xx xx xx   likely to be UTF-8
#    xx 00 xx 00   likely to be UTF-16LE
#    00 xx 00 xx   likely to be UTF-16BE
#    xx 00 00 00   likely to be UTF-32LE
#    00 00 00 xx   likely to be UTF-32BE
#
#    00 ?? ?? ??   definitely not UTF-8
#    ?? 00 ?? ??   definitely not UTF-8
#    ?? ?? 00 ??   definitely not UTF-8
#    ?? ?? ?? 00   definitely not UTF-8
#
#    00 00 ?? ??   definitely not UTF-16LE or UTF-16BE
#    ?? ?? 00 00   definitely not UTF-16LE or UTF-16BE
#
#    xx ?? ?? ??   definitely not UTF-32BE
#    ?? ?? ?? xx   definitely not UTF-32LE

Simply by analysing a few such four-byte chunks (say, the first 1024 bytes of
the file) and counting how many fit each pattern, you can easily determine the
most likely encoding. This is a statistical test, obviously, since not /all/
bytes will be ASCII, but it will catch all but a few extreme edge cases. If you
haven't made your mind up within the first 1024 bytes of the file, read the
/next/ 1024 bytes and try again, and so on.
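[A minimal sketch of this vote-counting heuristic in C; the names (`guessEncoding`, the `GUESS_*` constants) are illustrative, not anything DMD actually ships:]

```c
#include <stddef.h>

enum Guess { GUESS_UTF8, GUESS_UTF16LE, GUESS_UTF16BE,
             GUESS_UTF32LE, GUESS_UTF32BE, GUESS_UNKNOWN };

/* Score each 4-byte-aligned chunk against the zero-byte patterns above
 * and return the encoding with the most votes. */
enum Guess guessEncoding(const unsigned char *buf, size_t len)
{
    size_t votes[5] = {0};
    for (size_t i = 0; i + 4 <= len; i += 4) {
        const unsigned char *p = buf + i;
        if      ( p[0] &&  p[1] &&  p[2] &&  p[3]) votes[GUESS_UTF8]++;
        else if ( p[0] && !p[1] &&  p[2] && !p[3]) votes[GUESS_UTF16LE]++;
        else if (!p[0] &&  p[1] && !p[2] &&  p[3]) votes[GUESS_UTF16BE]++;
        else if ( p[0] && !p[1] && !p[2] && !p[3]) votes[GUESS_UTF32LE]++;
        else if (!p[0] && !p[1] && !p[2] &&  p[3]) votes[GUESS_UTF32BE]++;
        /* chunks matching no pattern cast no vote */
    }
    enum Guess best = GUESS_UNKNOWN;
    size_t bestVotes = 0;
    for (int e = 0; e < 5; e++)
        if (votes[e] > bestVotes) { bestVotes = votes[e]; best = (enum Guess)e; }
    return best;
}
```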

Alternatively, if that sounds too hard, there's an even easier (but less
efficient) algorithm:

1) Assume UTF-32LE. Validate.
2) Assume UTF-32BE. Validate.
3) Assume UTF-16LE. Validate.
4) Assume UTF-16BE. Validate.
5) Assume UTF-8.    Validate.

If precisely one of these validations succeeds, you've sussed it. If more than
one succeeds, it's still ambiguous, but the chances of this happening are
microscopic. If none succeed, the source file is not UTF. 
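[As a rough illustration of what step 5's validation involves, here is a minimal structural UTF-8 check in C. This is a sketch only: it verifies the lead-byte/continuation-byte layout and nothing more; a production validator must also reject overlong forms, surrogate code points, and values above U+10FFFF.]

```c
#include <stddef.h>
#include <stdbool.h>

/* Returns true if s[0..len) parses as a sequence of well-formed
 * UTF-8 lead/continuation byte groups. */
bool validUtf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t trail;
        if      (b < 0x80)           trail = 0;   /* ASCII */
        else if ((b & 0xE0) == 0xC0) trail = 1;   /* 2-byte sequence */
        else if ((b & 0xF0) == 0xE0) trail = 2;   /* 3-byte sequence */
        else if ((b & 0xF8) == 0xF0) trail = 3;   /* 4-byte sequence */
        else return false;                        /* invalid lead byte */
        if (i + trail >= len && trail > 0) return false;  /* truncated */
        for (size_t j = 1; j <= trail; j++)
            if ((s[i + j] & 0xC0) != 0x80) return false;  /* bad trailer */
        i += trail + 1;
    }
    return true;
}
```

The UTF-16 and UTF-32 validators are analogous: check surrogate pairing for UTF-16, and range/surrogate exclusion for UTF-32, in each byte order.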

Arcane Jill
Jul 25 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
Actually, it's just occurred to me that it's /really easy/ to tell the encoding
of a D source file, because of the fact that the very first character of a D
source file MUST be either a UTF BOM or a non-NULL ASCII. So all we have to do
is to test for each of these contingencies. Here's a short function that does
just that:

#    // Input: s: the first four bytes of the D source file
#    // Output: the D source file encoding
#
#    enum Encoding { UNKNOWN, UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE }
#
#    Encoding determineDSourceEncoding(ubyte[] s)
#    in
#    {
#        assert(s.length >= 4);
#    }
#    body
#    {
#        ubyte a = s[0];
#        ubyte b = s[1];
#        ubyte c = s[2];
#        ubyte d = s[3];
#
#        if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return Encoding.UTF_32LE; // BOM
#        if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return Encoding.UTF_32BE; // BOM
#        if (a==0xFF && b==0xFE)                       return Encoding.UTF_16LE; // BOM
#        if (a==0xFE && b==0xFF)                       return Encoding.UTF_16BE; // BOM
#        if (a==0xEF && b==0xBB && c==0xBF)            return Encoding.UTF_8;    // BOM
#        if (b==0x00 && c==0x00 && d==0x00)            return Encoding.UTF_32LE; // ASCII
#        if (a==0x00 && b==0x00 && c==0x00)            return Encoding.UTF_32BE; // ASCII
#        if (b==0x00)                                  return Encoding.UTF_16LE; // ASCII
#        if (a==0x00)                                  return Encoding.UTF_16BE; // ASCII
#        return Encoding.UTF_8;                                                  // ASCII
#    }

This is an important issue, because I just did a quick test. Using my favorite
text editor (TextPad), I saved a text file in UTF-16LE. I then examined the
saved file with a hex editor. I can confirm that the file was saved in UTF-16LE,
but, critically, /without a BOM/. I don't know what other text editors do, but
clearly, if there is even a remote chance that source files will get saved
without a BOM, then we really ought to be able to compile those source files!

Arcane Jill
Jul 25 2004
next sibling parent reply J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
...
 This is an important issue, because I just did a quick test. Using my favorite
 text editor (TextPad), I saved a text file in UTF-16LE. I then examined the
 saved file with a hex editor. I can confirm that the file was saved in
UTF-16LE,
 but, critically, /without a BOM/. I don't know what other text editors do, but
 clearly, if there is even a remote chance that source files will get saved
 without a BOM, then we really ought to be able to compile those source files!
 
 Arcane Jill

All is not lost. I downloaded and installed TextPad. I observed the problem of
no BOMs, but I also found a solution:

* Choose Configure/Preferences from the menubar.
* Find Document Classes/Default in the tree on the left (I guess you might
  have to choose something else here if you've set up a "D mode").
* Under "Document class options", check [x] Write Unicode and UTF-8 BOM.

Now when you save again, it'll add the BOMs:

Unicode:              FF FE    (UTF-16 LE)
Unicode (big endian): FE FF    (UTF-16 BE)
UTF-8:                EF BB BF (UTF-8)

So there's no problem using TextPad. (I don't know why the BOMs wouldn't be
enabled by default, but that's a whole other issue.)

The BOMs are standard, right? If a supposedly Unicode-capable editor won't add
BOMs, it might not really be considered Unicode-capable. If a person wants to
use one of those editors, fine. But please stick to UTF-8.

-- 
Justin (a/k/a jcc7)   http://jcc_7.tripod.com/d/
Jul 25 2004
parent "Walter" <newshound digitalmars.com> writes:
"J C Calvarese" <jcc7 cox.net> wrote in message
news:ce1d18$2sa5$1 digitaldaemon.com...
 So there's no problem using TextPad. (I don't know why the BOMs wouldn't
 be enabled by default, but that's a whole other issue.)

Send them a bug report!
 The BOMs are standard, right?

Yes.
 If a supposedly Unicode-capable editor won't add
 BOMs, it might not really be considered Unicode-capable. If a person
 wants to use one of those editors, fine. But please stick to UTF-8.

Jul 25 2004
prev sibling next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce1848$2p2j$1 digitaldaemon.com...
 This is an important issue, because I just did a quick test. Using my
 favorite text editor (TextPad), I saved a text file in UTF-16LE. I then
 examined the saved file with a hex editor. I can confirm that the file was
 saved in UTF-16LE, but, critically, /without a BOM/. I don't know what other
 text editors do, but clearly, if there is even a remote chance that source
 files will get saved without a BOM, then we really ought to be able to
 compile those source files!
Ack. What are the Textpad programmers thinking? They need to fix Textpad to
put out the BOM.
Jul 25 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce21g6$93r$1 digitaldaemon.com>, Walter says...

Ack. What are the Textpad programmers thinking? They need to fix Textpad to
put out the BOM.

This is incorrect. UTF-16LE does not require a BOM.

Almost all questions regarding UTFs and BOMs can be answered by heading over
to the Unicode web site (www.unicode.org) and clicking on "FAQ". I cite from
this FAQ here:

"In particular, if a text data stream is marked as UTF-16BE, UTF-16LE,
UTF-32BE or UTF-32LE, a BOM is neither necessary nor /permitted/" (italics
present in original FAQ).

This doesn't apply to D source code, of course, since D source files are not
"marked" in any way; however, read that FAQ. Everything about BOMs in that
FAQ tells you that BOMs are "useful". Nowhere does it say they are
"required" - and, as noted, in some cases they are even prohibited.

Fortunately, since D syntax requires that the first character of a D source
file /must/ be an ASCII character, detecting the encoding is quick and easy.
See my other posts on this thread for source code which works.

You *WILL* encounter BOM-less source files in the wild. Insisting that the
BOM be there is in defiance of Unicode rules, and is just going to cripple
DMD.

Arcane Jill
Jul 26 2004
parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce2onb$pug$1 digitaldaemon.com...
 Fortunately, since D syntax requires that the first character of a D source
 file /must/ be an ASCII character, detecting the encoding is quick and easy.

Yes, and that's a key insight that I'd missed. Thanks!
Jul 26 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce1848$2p2j$1 digitaldaemon.com>, Arcane Jill says...

Revised source - now handles files which are less than four bytes long:

#    // Input: s: the first four byte of the D source file,
#    //        (or all of them, if the file size is less than four bytes).
#    // Output: the D source file encoding
#
#    enum Encoding { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE }
#
#    Encoding determineDSourceEncoding(ubyte[] s)
#    {
#        ubyte a = s.length >= 1 ? s[0] : 1;
#        ubyte b = s.length >= 2 ? s[1] : 1;
#        ubyte c = s.length >= 3 ? s[2] : 1;
#        ubyte d = s.length >= 4 ? s[3] : 1;
#
#        if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return Encoding.UTF_32LE;
#        if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return Encoding.UTF_32BE;
#        if (a==0xFF && b==0xFE)                       return Encoding.UTF_16LE;
#        if (a==0xFE && b==0xFF)                       return Encoding.UTF_16BE;
#        if (a==0xEF && b==0xBB && c==0xBF)            return Encoding.UTF_8;
#        if (b==0x00 && c==0x00 && d==0x00)            return Encoding.UTF_32LE;
#        if (a==0x00 && b==0x00 && c==0x00)            return Encoding.UTF_32BE;
#        if (b==0x00)                                  return Encoding.UTF_16LE;
#        if (a==0x00)                                  return Encoding.UTF_16BE;
#        return Encoding.UTF_8;
#    }

Come on, that's a /tiny/ function, and I've written it all for you. The
"overhead" is to call it /once/ during the source text stage (and that's
/instead of/, not as well as, the current detection routine). As its input,
it needs only the first four bytes of the source file (fewer if the file
size is less than four bytes).

You are correct in that many applications which are out there are buggy when
it comes to Unicode, and equally correct that we should complain about that.
But I've just given you a new /feature/ which is almost trivially small and
which you can boast about freely - and which will save you from receiving a
few misdirected bug reports in the future.

Not even worth thinking about?

Arcane Jill
Jul 26 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce2due$jos$1 digitaldaemon.com>, Arcane Jill says...

Revised source - now handles files which are less than four bytes long:

I should have written that in C, shouldn't I? Never mind - I think just
replacing "ubyte" with "unsigned char" should do it.

Jill
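[For reference, a straight C rendering along those lines, with "unsigned char" in place of "ubyte" and an explicit length parameter standing in for the D array's `.length`; a sketch, not DMD's actual source:]

```c
#include <stddef.h>

enum Encoding { UTF_8, UTF_16LE, UTF_16BE, UTF_32LE, UTF_32BE };

/* Input:  s, len: the first four bytes of the D source file
 *         (or all of them, if the file is shorter than four bytes).
 * Output: the D source file encoding. */
enum Encoding determineDSourceEncoding(const unsigned char *s, size_t len)
{
    /* Pad missing bytes with a nonzero value, as in the D version. */
    unsigned char a = len >= 1 ? s[0] : 1;
    unsigned char b = len >= 2 ? s[1] : 1;
    unsigned char c = len >= 3 ? s[2] : 1;
    unsigned char d = len >= 4 ? s[3] : 1;

    if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return UTF_32LE; /* BOM */
    if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return UTF_32BE; /* BOM */
    if (a==0xFF && b==0xFE)                       return UTF_16LE; /* BOM */
    if (a==0xFE && b==0xFF)                       return UTF_16BE; /* BOM */
    if (a==0xEF && b==0xBB && c==0xBF)            return UTF_8;    /* BOM */
    if (b==0x00 && c==0x00 && d==0x00)            return UTF_32LE; /* ASCII */
    if (a==0x00 && b==0x00 && c==0x00)            return UTF_32BE; /* ASCII */
    if (b==0x00)                                  return UTF_16LE; /* ASCII */
    if (a==0x00)                                  return UTF_16BE; /* ASCII */
    return UTF_8;                                                  /* ASCII */
}
```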
Jul 26 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ce2due$jos$1 digitaldaemon.com...
 Come on, that's a /tiny/ function, and I've written it all for you. The
 "overhead" is to call it /once/ during the source text stage (and that's
 /instead of/, not as well as, the current detection routine). As its input,
 it needs only the first four bytes of the source file (fewer if the file
 size is less than four bytes).

 You are correct in that many applications which are out there are buggy
 when it comes to Unicode, and equally correct that we should complain about
 that. But I've just given you a new /feature/ which is almost trivially
 small and which you can boast about freely - and which will save you from
 receiving a few misdirected bug reports in the future.

 Not even worth thinking about?

I'd already added it to my todo list, Jill <g>. But these things sometimes
have hidden gotchas, so I wanted to let it simmer for a bit. I've put in
stuff too quickly before, and had to back it out later :-(
Jul 26 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce2gk5$l00$1 digitaldaemon.com>, Walter says...

I'd already added it to my todo list, Jill <g>. But these things sometimes
have hidden gotchas, so I wanted to let it simmer for a bit. I've put in
stuff too quickly before, and had to back it out later :-(

Fair enough. No hurry. I'm sure we're all in agreement that more important
bug-fixes should come first. Keep up the good work. :)

Jill
Jul 26 2004
prev sibling parent James McComb <alan jamesmccomb.id.au> writes:
Walter wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:ce2due$jos$1 digitaldaemon.com...
 
 Come on, that's a /tiny/ function, and I've written it all for you. The
 "overhead" is to call it /once/ during the source text stage (and that's
 /instead of/, not as well as, the current detection routine). As its input,
 it needs only the first four bytes of the source file.

I'd already added it to my todo list, Jill <g>. But these things sometimes
have hidden gotchas, so I wanted to let it simmer for a bit. I've put in
stuff too quickly before, and had to back it out later :-(

Okay, I agree that the overhead is non-existent. This may be a potential
"gotcha" (but I don't know how significant it is):

Walter implements this detection routine, but D compilers from other vendors
don't. Then there would be D files that only compile on DMD. To prevent this
from happening, the new Unicode detection algorithm needs to be explicit in
the D spec.

James McComb
Jul 26 2004
prev sibling parent James McComb <alan jamesmccomb.id.au> writes:
Arcane Jill wrote:
 In the source text analysis phase, the compiler does this (according to the
 manual):
 
 "The source text is assumed to be in UTF-8, unless one of the following BOMs
 (Byte Order Marks) is present at the beginning of the source text".

I like this rule. It says that D is not going to try and *guess* the
character encoding of the file.

Okay, maybe Walter can write some code that can guess the encoding correctly
99% of the time, but I don't think that it is worth complicating the compiler
and slightly increasing compile times just to handle missing BOMs. Here's
why:

Almost all the time, code will be written in UTF-8. If someone has gone to
the trouble of writing their code in UTF-16 or UTF-32, they can go to the
trouble of including a BOM in their file. After all, people are advised to
use a BOM in those circumstances anyway.

James McComb
Jul 25 2004