digitalmars.D.bugs - utf.d update

Sean Kelly <sean f4.ca> writes:
I needed some new features for the readf work I've been doing.  I think they
will be useful in general as I suspect it will become pretty common to want to
encode or decode directly to a stream.  Here are the new prototypes:

// for all char types CharT
bit decode(out dchar val, bit delegate(out CharT) get)
dchar decode(bit delegate(out CharT) get)
void encode(bit delegate(CharT) put, dchar c)

The decode overload that returns a bit will return false only if the first call
to get fails (i.e. the stream is already at EOF), and will throw on any other
failure.  The remaining calls throw in all the same circumstances as the
original calls.  All decode and encode functions have been rewritten on top of
these new functions.  The old functions retain their weak guarantee, while the
new functions necessarily have only the basic guarantee.
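
The EOF-versus-throw contract described above can be sketched roughly as follows. This is not the code from the linked utf.d: decodeFrom and decodeAll are hypothetical names, only the char (UTF-8) case is shown, bool stands in for the bit type of the time, and a plain Exception stands in for whatever error type the real module throws. Overlong forms and surrogates are not rejected here, to keep the sketch short.

```d
// Hypothetical stand-in for the proposed
//     bit decode(out dchar val, bit delegate(out CharT) get)
// with CharT = char. Returns false only when the very first get() fails.
bool decodeFrom(out dchar val, scope bool delegate(out char) get)
{
    char c;
    if (!get(c))
        return false;                 // stream already at EOF: not an error

    if (c < 0x80)                     // ASCII: single code unit
    {
        val = c;
        return true;
    }

    // Work out how many continuation bytes follow the lead byte.
    uint n;
    uint v;
    if ((c & 0xE0) == 0xC0)      { n = 1; v = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { n = 2; v = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { n = 3; v = c & 0x07; }
    else throw new Exception("invalid UTF-8 lead byte");

    foreach (i; 0 .. n)
    {
        char t;
        if (!get(t))                  // EOF in mid-sequence is an error
            throw new Exception("truncated UTF-8 sequence");
        if ((t & 0xC0) != 0x80)
            throw new Exception("invalid UTF-8 continuation byte");
        v = (v << 6) | (t & 0x3F);
    }
    val = cast(dchar) v;
    return true;
}

// Hypothetical helper: drives decodeFrom from an in-memory string,
// which plays the role of the stream.
dstring decodeAll(string src)
{
    size_t pos = 0;
    bool next(out char ch)
    {
        if (pos >= src.length)
            return false;
        ch = src[pos++];
        return true;
    }

    dstring result;
    dchar d;
    while (decodeFrom(d, &next))
        result ~= d;
    return result;
}

unittest
{
    import std.exception : assertThrown;

    assert(decodeAll("a\u00E9\u20AC") == "a\u00E9\u20AC"d);
    assert(decodeAll("") == ""d);     // immediate EOF: no throw

    // A lone lead byte: the second get() fails, so decodeFrom throws.
    immutable(ubyte)[] raw = [0xC3];
    assertThrown(decodeAll(cast(string) raw));
}
```

Because val may already have been reset and some input consumed by the time an exception is thrown, a function shaped like this can only offer the basic guarantee, which matches the remark above.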

http://home.f4.ca/sean/d/utf.d
Jul 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <ce8sac$hhq$1 digitaldaemon.com>, Sean Kelly says...

http://home.f4.ca/sean/d/utf.d

In your code:

bit isValidDchar(dchar c)
{
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF && c != 0xFFFE && c != 0xFFFF);
}

should read:

bit isValidDchar(dchar c)
{
    dchar d = c & 0xFFFF;
    if (d == 0xFFFE || d == 0xFFFF) return false;
    return c < 0xD800 || (c >= 0xE000 && c < 0xFDD0) || (c >= 0xFDF0 && c < 0x110000);
}

or something functionally equivalent thereto.

Anything I may previously have said about isValidChar() is wrong. The Unicode FAQ
(which appears to have changed its wording, since I remember it being ambiguous
in the past) now says, unambiguously: "These invalid code points are the 66
noncharacters (including FFFE and FFFF), as well as unpaired surrogates."

Ergo, we must exclude all 66 noncharacters, not merely FFFE and FFFF.

Jill
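
As a quick check of the arithmetic, here is a minimal sketch of the corrected test in present-day D, where bool replaces bit. The counting helpers countInvalid and countNoncharacters are hypothetical and exist only to confirm that exactly 66 noncharacters are rejected in addition to the 2048 surrogates.

```d
// The corrected predicate from the post, with bool in place of bit.
bool isValidDchar(dchar c)
{
    uint low = c & 0xFFFF;
    // xxFFFE and xxFFFF are noncharacters in every plane.
    if (low == 0xFFFE || low == 0xFFFF)
        return false;
    return c < 0xD800                       // below the surrogates
        || (c >= 0xE000 && c < 0xFDD0)      // up to the FDD0..FDEF block
        || (c >= 0xFDF0 && c < 0x110000);   // rest of the code space
}

// Hypothetical helper: number of rejected values in 0 .. 0x10FFFF.
uint countInvalid()
{
    uint n = 0;
    foreach (uint i; 0 .. 0x110000)
        if (!isValidDchar(cast(dchar) i))
            ++n;
    return n;
}

// Hypothetical helper: rejected values that are not surrogates.
uint countNoncharacters()
{
    uint n = 0;
    foreach (uint i; 0 .. 0x110000)
        if ((i < 0xD800 || i > 0xDFFF) && !isValidDchar(cast(dchar) i))
            ++n;
    return n;
}

unittest
{
    assert(countNoncharacters() == 66);     // 32 in FDD0..FDEF + 2 per plane * 17
    assert(countInvalid() == 2048 + 66);    // surrogates plus noncharacters
    assert(!isValidDchar(0xFDD0) && !isValidDchar(0x1FFFE));
    assert(isValidDchar(0xFDF0) && isValidDchar(0x10FFFD));
    assert(!isValidDchar(0x110000));
}
```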
Jul 28 2004
Sean Kelly <sean f4.ca> writes:
Done.  I'll also integrate your other changes this evening and repost.


Sean
Jul 28 2004
Sean Kelly <sean f4.ca> writes:
Well, the isValidDchar change is up, but I'm going to hold off on the rest until
I can rework the code to be optimized for ASCII, as per Walter's comment.  The
existing code already works this way, so there's no loss in the meantime.


Sean
Jul 28 2004
Sean Kelly <sean f4.ca> writes:
In article <ce8sac$hhq$1 digitaldaemon.com>, Sean Kelly says...
http://home.f4.ca/sean/d/utf.d

Just a note that I've incorporated Stewart Gordon's fixes in the file I have
online (thanks Stewart!).

Sean
Aug 06 2004