digitalmars.D - Unicode library now in Deimos

Arcane Jill <Arcane_member pathlink.com> writes:

With humungous thanks to Hauke for ideas, suggestions, algorithms, inspiration,
etc., I've got the first version of etc.unicode uploaded to Deimos. It gives you
access to pretty much all Unicode properties. For an idea of the flavor of the
thing, these are the functions you get:

char[]              getAge(dchar c)
char[]              getArabicShapingName(dchar c)
BidiClass           getBidiClass(dchar c)
char[]              getBidiClassName(BidiClass e)
char[]              getBidiClassName(dchar c)
dchar               getBidiMirroringGlyph(dchar c)
char[]              getBlock(dchar c)
uint                getCanonicalCombiningClass(dchar c)
int                 getDecimalDigit(dchar c)
wchar[]             getDecompositionMappingUTF16(dchar c)
dchar[]             getDecompositionMappingUTF32(dchar c)
char[]              getDecompositionMappingUTF8(dchar c)
DecompositionType   getDecompositionType(dchar c)
char[]              getDecompositionTypeName(DecompositionType e)
char[]              getDecompositionTypeName(dchar c)
int                 getDigit(dchar c)
EastAsianWidth      getEastAsianWidth(dchar c)
char[]              getEastAsianWidthName(EastAsianWidth e)
char[]              getEastAsianWidthName(dchar c)
GeneralCategory     getGeneralCategory(dchar c)
char[]              getGeneralCategoryName(GeneralCategory e)
char[]              getGeneralCategoryName(dchar c)
HangulSyllableType  getHangulSyllableType(dchar c)
char[]              getHangulSyllableTypeName(HangulSyllableType e)
char[]              getHangulSyllableTypeName(dchar c)
int                 getHexValue(dchar c)
char[]              getISOComment(dchar c)
char[]              getJamo(dchar c)
char[]              getJoiningGroup(dchar c)
JoiningType         getJoiningType(dchar c)
char[]              getJoiningTypeName(JoiningType e)
char[]              getJoiningTypeName(dchar c)
LineBreak           getLineBreak(dchar c)
char[]              getLineBreakName(LineBreak e)
char[]              getLineBreakName(dchar c)
wchar[]             getLowercaseMappingLocalUTF16(dchar c, char[] locale)
dchar[]             getLowercaseMappingLocalUTF32(dchar c, char[] locale)
char[]              getLowercaseMappingLocalUTF8(dchar c, char[] locale)
wchar[]             getLowercaseMappingUTF16(dchar c)
dchar[]             getLowercaseMappingUTF32(dchar c)
char[]              getLowercaseMappingUTF8(dchar c)
char[]              getName(dchar c)
char[]              getNormalizationCorrectionVersion(dchar c)
dchar               getNormalizationCorrectionsCorrection(dchar c)
dchar               getNormalizationCorrectionsOriginal(dchar c)
char[]              getNumeric(dchar c)
uint                getNumericType(dchar c)
Script              getScript(dchar c)
char[]              getScriptName(Script e)
char[]              getScriptName(dchar c)
dchar               getSimpleCaseFolding(dchar c)
dchar               getSimpleLowercaseMapping(dchar c)
dchar               getSimpleTitlecaseMapping(dchar c)
dchar               getSimpleUppercaseMapping(dchar c)
char[]              getSpecialCaseCondition(dchar c)
char[]              getSpecialCaseConditionLocal(dchar c)
wchar[]             getTitlecaseMappingLocalUTF16(dchar c, char[] locale)
dchar[]             getTitlecaseMappingLocalUTF32(dchar c, char[] locale)
char[]              getTitlecaseMappingLocalUTF8(dchar c, char[] locale)
wchar[]             getTitlecaseMappingUTF16(dchar c)
dchar[]             getTitlecaseMappingUTF32(dchar c)
char[]              getTitlecaseMappingUTF8(dchar c)
char[]              getUnicode1Name(dchar c)
wchar[]             getUppercaseMappingLocalUTF16(dchar c, char[] locale)
dchar[]             getUppercaseMappingLocalUTF32(dchar c, char[] locale)
char[]              getUppercaseMappingLocalUTF8(dchar c, char[] locale)
wchar[]             getUppercaseMappingUTF16(dchar c)
dchar[]             getUppercaseMappingUTF32(dchar c)
char[]              getUppercaseMappingUTF8(dchar c)
bool                isASCIIHexDigit(dchar c)
bool                isAlphabetic(dchar c)
bool                isBidiControl(dchar c)
bool                isBidiMirrored(dchar c)
bool                isCompositionExclusion(dchar c)
bool                isDash(dchar c)
bool                isDefaultIgnorableCodePoint(dchar c)
bool                isDeprecated(dchar c)
bool                isDiacritic(dchar c)
bool                isExtender(dchar c)
bool                isGraphemeBase(dchar c)
bool                isGraphemeExtend(dchar c)
bool                isGraphemeLink(dchar c)
bool                isHexDigit(dchar c)
bool                isHyphen(dchar c)
bool                isIDContinue(dchar c)
bool                isIDSBinaryOperator(dchar c)
bool                isIDSTrinaryOperator(dchar c)
bool                isIDStart(dchar c)
bool                isIdeographic(dchar c)
bool                isJoinControl(dchar c)
bool                isLogicalOrderException(dchar c)
bool                isLowercase(dchar c)
bool                isMath(dchar c)
bool                isNoncharacterCodePoint(dchar c)
bool                isOtherAlphabetic(dchar c)
bool                isOtherDefaultIgnorableCodePoint(dchar c)
bool                isOtherGraphemeExtend(dchar c)
bool                isOtherIDStart(dchar c)
bool                isOtherLowercase(dchar c)
bool                isOtherMath(dchar c)
bool                isOtherUppercase(dchar c)
bool                isQuotationMark(dchar c)
bool                isRadical(dchar c)
bool                isSTerm(dchar c)
bool                isSoftDotted(dchar c)
bool                isTerminalPunctuation(dchar c)
bool                isUnifiedIdeograph(dchar c)
bool                isUppercase(dchar c)
bool                isVariationSelector(dchar c)
bool                isWhiteSpace(dchar c)
bool                isXIDContinue(dchar c)
bool                isXIDStart(dchar c)

Pretty much every function is in its own module. This means that when you link
against it you only get those functions which you actually call. In addition,
the tables that get linked in are tiny (well, most of them), and in some cases
even non-existent, thanks to some seriously aggressive space optimization. For
instance, if you call getSimpleUppercaseMapping(), which converts a character to
uppercase, you will add only 5K to the size of your executable.

Despite this space saving, the functions should still be pretty fast. The code
for that uppercasing function consists of two if tests, a shift, a switch
statement with seven cases, and a table lookup. And nothing else.
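
A lookup of that general shape can be sketched roughly as follows. This is a hand-written illustration, not the robot-generated etc.unicode code: the names `simpleUpperSketch`, `page00` and `page04` are made up, only two 256-code-point pages carry data (so most of Unicode falls through unchanged), and the real function's exact two tests and seven cases are not reproduced here.

```d
// Hypothetical sketch of a paged simple-uppercase lookup.
// Deliberately incomplete: only pages 0x00 and 0x04 carry data.

/// Page 0x00 (U+0000..U+00FF): per-code-point offset to the uppercase form.
short[256] buildPage00() pure nothrow @nogc @safe
{
    short[256] t;                              // zero means "no change"
    foreach (i; 'a' .. 'z' + 1) t[i] = -32;    // a..z -> A..Z
    foreach (i; 0xE0 .. 0xFF)                  // U+00E0..U+00FE
        if (i != 0xF7) t[i] = -32;             // U+00F7 (division sign) is caseless
    t[0xB5] = 743;                             // U+00B5 -> U+039C
    t[0xFF] = 121;                             // U+00FF -> U+0178
    return t;
}

/// Page 0x04 (U+0400..U+04FF): basic Cyrillic only, the rest left at zero.
short[256] buildPage04() pure nothrow @nogc @safe
{
    short[256] t;
    foreach (i; 0x30 .. 0x50) t[i] = -32;      // U+0430..U+044F -> U+0410..U+042F
    foreach (i; 0x50 .. 0x60) t[i] = -80;      // U+0450..U+045F -> U+0400..U+040F
    return t;
}

// Both tables are computed at compile time and stored as constant data.
immutable short[256] page00 = buildPage00();
immutable short[256] page04 = buildPage04();

dchar simpleUpperSketch(dchar c) pure nothrow @nogc @safe
{
    if (c > 0x10FFFF) return c;                // not a Unicode code point
    const low = c & 0xFF;                      // index within the page
    switch (c >> 8)                            // the shift picks a 256-entry page
    {
        case 0x00: return cast(dchar)(cast(int) c + page00[low]);
        case 0x04: return cast(dchar)(cast(int) c + page04[low]);
        default:   return c;                   // pages without data: unchanged
    }
}

unittest
{
    assert(simpleUpperSketch('a') == 'A');
    assert(simpleUpperSketch('\u00FF') == '\u0178');   // ÿ -> Ÿ
    assert(simpleUpperSketch('\u0436') == '\u0416');   // ж -> Ж
    assert(simpleUpperSketch('\u00F7') == '\u00F7');   // ÷ stays as it is
    assert(simpleUpperSketch('7') == '7');
}
```

The point of the sketch is the cost model: the work per call is a range check, a shift, a jump and one array read, regardless of which character is passed in. Pages that contain no cased characters need no table at all, which is where the space saving comes from.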

Most of the other functions go the same way. Some are optimized in different
ways, but I believe we now have a very good balance between speed and size.

This is only the first step toward full Unicode support for D. Character
properties are the heart of the Unicode algorithms. You need those first - so
here they are.

Currently, Deimos is not very well organized, so my next task will be trying to
get that together. There are lots of interesting things in Deimos now (and some
of them I don't even know what they are), but what we're lacking is overall
organization, a build script, a "ready-to-go" downloadable library, proper
doxygen documentation, and so on. It's a bit irritating, so I guess now is the
time to deal with that. In the meantime, you can download the etc.unicode source
files and documentation and build it yourself. (But be patient. There are A LOT
of files to compile.)

Arcane Jill

Jun 27 2004

"Walter" <newshound digitalmars.com> writes:

Cool! Is this a supplement or a replacement for Hauke's earlier work?

Jun 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
Cool! Is this a supplement or a replacement for Hauke's earlier work?

Both really. etc.unicode is a very different beast from unichar. Hauke's unichar
is very easy to use - there's just one source file. You stick that source file
in your project and you're done. With etc.unicode it is (for now) more
complicated, as there are many, many source files, and, what with Deimos being
slightly disorganized at present, it may be a while before we get a build script
together. Deimos will work like Phobos - you download a lib and link against it.
But we're not that organized yet. Getting Deimos organized now has quite a high
priority for me (more even than writing code, and that's saying something).

It's a replacement for PART of Hauke's work. Hauke's utype module will always be
necessary if you want a drop-in replacement for ctype. We should keep that
forever. I don't know how isprint() and isgraph() are implemented in utype right
now, but they could in any case be implemented in terms of etc.unicode if
needed. (isgraph() == isGraphemeBase() || isGraphemeExtend()), etc.

etc.unicode does overlap the functionality of unichar though. That's because
etc.unicode is written by robot, and it was easier to get the robot to write the
lot rather than just some of it. I need to make the codebuilder robot public -
or at least available to Hauke - because he may want to tweak it in places.

Arcane Jill

Jun 28 2004

"Walter" <newshound digitalmars.com> writes:

"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cboirj$1ca$1 digitaldaemon.com...
 In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
Cool! Is this a supplement or a replacement for Hauke's earlier work?

 It's a replacement for PART of Hauke's work.

Hmm. This can be confusing. Can the functionality of each be made unique?

Jul 01 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cc297e$ik$2 digitaldaemon.com>, Walter says...
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cboirj$1ca$1 digitaldaemon.com...
 In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
Cool! Is this a supplement or a replacement for Hauke's earlier work?

 It's a replacement for PART of Hauke's work.

Hmm. This can be confusing.

I meant, it's a replacement for "unichar", but not for "utype".


Can the functionality of each be made unique?

Hauke's "utype" module (drop-in replacement for ctype) is unique. It is not
duplicated in etc.unicode. (The FUNCTIONALITY of some functions is duplicated,
of course, but that's because what makes utype special is the fact that all
functions have the same _name_ as the corresponding ctype functions). For
example, to convert a character to uppercase using simple-casing, you can
currently use any of:

(a) toupper(c)                    // ASCII only, from ctype
(b) toupper(c)                    // all Unicode chars, from utype
(c) charToUpper(c)                // all Unicode chars, from unichar
(d) getSimpleUppercaseMapping(c)  // all Unicode chars, from etc.unicode

all of the above are locale-unaware, but very shortly, there will also be:

(e) getUppercaseMapping(c, locale);
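
The gap between (a) and the Unicode-aware variants is easy to see with the two `toUpper` functions in present-day Phobos. Note the assumption: `std.ascii.toUpper` and `std.uni.toUpper` are used here only as stand-ins. They are not the ctype, utype, unichar or etc.unicode functions discussed in this thread, and the wrapper names `upperAsciiOnly` and `upperUnicode` are made up for the example.

```d
// Renamed imports keep the two toUpper functions apart.
import ascii = std.ascii;
import uni = std.uni;

/// ASCII-only: anything outside a..z comes back unchanged.
dchar upperAsciiOnly(dchar c)
{
    return ascii.toUpper(c);
}

/// Simple (one-to-one), locale-unaware Unicode uppercasing.
dchar upperUnicode(dchar c)
{
    return uni.toUpper(c);
}

unittest
{
    // Both agree on ASCII letters.
    assert(upperAsciiOnly('q') == 'Q');
    assert(upperUnicode('q') == 'Q');

    // U+00E9 (é): the ASCII version leaves it alone, the Unicode one maps it to U+00C9.
    assert(upperAsciiOnly('\u00E9') == '\u00E9');
    assert(upperUnicode('\u00E9') == '\u00C9');
}
```

Neither call takes a locale, so both correspond to the locale-unaware cases in the list above.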



It is not possible, however, to make "etc.unicode" and "unichar" unique. The
former is a superset of the latter. The codebuilder could, of course, be
instructed NOT to generate those functions for which similar functionality
exists in unichar, but then you'd have problems with keeping both versions in
step with each other, function names being inconsistent, linking strategy being
different, and so on.


My vote would go to retaining "utype" (because people are familiar with ctype),
but using "etc.unicode" in place of "unichar". What "etc.unicode" can do is a
superset of what "unichar" can do, in terms of provided functions, and in
addition can be rebuilt for any future (or even past) version of Unicode by any
end-user in a matter of minutes (once the codebuilder program goes public). It's
also likely to be smaller, because you link in only those parts you need,
whereas with "unichar" it's all or nothing.

It should be noted of course that "etc.unicode" is well optimized for space (but
still with guaranteed constant-time lookup), and Hauke's input is what made this
possible.

The function names are definitely confusing, I agree. But in the case of
"etc.unicode", the names originate from the Unicode Consortium. These folk
define property names, such as "Simple_Uppercase_Mapping", and so on. The names
are part of the Unicode standard - so I just slavishly put my metaphorical
blinkers on, removed the underscores and added a "get" or "is" prefix ("get" for
non-boolean properties, "is" for boolean properties) to conform to the D style
guide. Such names may be cumbersome, but I still think it's better than using
made-up names, and it's a consistent methodology to extend to the remaining
properties we haven't added yet.
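
The renaming rule is mechanical enough to write down. A minimal sketch, assuming a helper called `accessorName` (a made-up name, not part of etc.unicode or the codebuilder) that takes the Consortium's property name and a flag saying whether the property is boolean:

```d
import std.array : replace;

/// Hypothetical helper: strip the underscores and prepend "is" or "get".
string accessorName(string propertyName, bool isBoolean)
{
    return (isBoolean ? "is" : "get") ~ propertyName.replace("_", "");
}

unittest
{
    assert(accessorName("Simple_Uppercase_Mapping", false) == "getSimpleUppercaseMapping");
    assert(accessorName("White_Space", true) == "isWhiteSpace");
    assert(accessorName("ASCII_Hex_Digit", true) == "isASCIIHexDigit");
}
```

Because the rule has no special cases, anyone who knows the Unicode property name can predict the D function name without consulting the documentation.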

I hope that makes things less confusing. Unfortunately, right now, Deimos is not
well organized, because it is in the hands of many people. To me it makes more
sense that people should be able to download the whole of Deimos in one go
(instead of individual packages), just like they can currently download the
whole of Phobos in one go. That sort of organization would be easy if Deimos
were a one-person-project, or even a project with one leader whose word was law,
but it's a collective effort, and I think those involved are going to HAVE to
put some effort into making it look like a unified effort. This will happen in
time, but I mention it because, right now (and I hope this is a temporary
phase), "unichar" is easier to use than "etc.unicode", even though both are
currently supplied in source code form, if only for the simple reason that
"unichar" is one file and "etc.unicode" is many files. In the (near?) future, I
would hope to have the following:

(1) Deimos being easy-to-download and easy-to-use, with pre-built linkable
libraries for all platforms, in both Debug and Release builds.

(2) Headers for etc.unicode (by which I mean, stripped versions of the source
code, with large tables removed), to speed up compilation time.

(3) The codebuilder program (which generates etc.unicode) being made public,
along with documentation, so that people can compile Unicode lookups for any
version of Unicode, past, present or future (or even customized).

Until this is done, unichar is likely to be easier to use. However, once these
steps are taken, I would then have no hesitation in suggesting that we use as
standard:


(i) etc.unicode
(possibly renamed to std.unicode - the codebuilder can locate it anywhere).

(ii) utype
(possibly renamed to std.utype).


Arcane Jill

Jul 02 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

I agree that Phobos should only have one Unicode package - everything 
else would be a bad idea.

My hope is that I'll be able to integrate some of the current advantages 
of unichar (faster lookup, smaller footprint) with AJ's work so that 
they can apply to all Unicode functions. I'll know more about the
possibilities once AJ releases the code generator.

I also have a feeling that it is not necessary to have as many separate 
modules as etc.unicode currently has (but since I have only glanced at 
etc.unicode I could be wrong). Since the linker will always throw out 
uncalled functions and unaccessed data (correct?) it should be possible 
to make it easier to use. The main thing we'd have to keep an eye on is 
that static module constructors do not pull in all the data and functions.

The function names could also use some tuning - right now they feel a 
little clunky (as do the unichar function names, of course). The main 
problem here is that people knowing Unicode will recognize the property 
names and that the functions will still be sufficiently different from 
utype/ctype to prevent confusion (since utype/ctype define quite a few
properties with the same name in a different way).

Hauke

Jul 02 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cc3a30$1ni0$1 digitaldaemon.com>, Hauke Duden says...
I agree that Phobos should only have one Unicode package - everything 
else would be a bad idea.

My hope is that I'll be able to integrate some of the current advantages 
of unichar (faster lookup, smaller footprint) with AJ's work so that 
they can apply to all Unicode functions. I'll know more about the
possibilities once AJ releases the code generator.

No problem. I'll do that soon, like within a week. You should definitely be
given write access, and if you can make it better/faster/whatever that would
certainly be great.


I also have a feeling that it is not necessary to have as many separate 
modules as etc.unicode currently has (but since I have only glanced at 
etc.unicode I could be wrong).

As currently written, the codebuilder generates between two and four modules per
Unicode property - one is just a wrapper to present a usable interface to
humans; the others are purely robot-generated and (in general) will comprise one
module per lookup table (but you still need a module even if there are zero
lookup tables). You could certainly have a bash at reducing the number of
modules per Unicode property, but it would be bad (in my opinion) to go below
the minimum of one module per Unicode property.

I don't know if having many modules is actually a problem though. From the end
users' point of view, they will still need to import ONE module, and link with
ONE library, and the rest happens automatically. But yeah, if you can make that
happen better or faster, great!



Since the linker will always throw out 
uncalled functions and unaccessed data (correct?) it should be possible 
to make it easier to use.

No, the linker can only throw out whole unused modules. It cannot throw out
/parts/ of modules. Therefore, each "optional thing" needs to be in its own
module or modules.




The main thing we'd have to keep an eye on is 
that static module constructors do not pull in all the data and functions.

There aren't any static module constructors. None at all. Zero.
So keeping an eye on that should be fairly easy, I'd say.



The function names could also use some tuning - right now they feel a 
little clunky (as do the unichar function names, of course). The main 
problem here is that people knowing Unicode will recognize the property 
names and that the functions will still be sufficiently different from 
utype/ctype to prevent confusion (since utype/ctype define quite a few
properties with the same name in a different way).

Yes, I agree. The problem is that the names are defined by the Unicode
Consortium (though I did tweak them to match D style). They are the standard,
official names for properties. I think we'd have to be quite imaginative to come
up with any reasonable alternative.

Arcane Jill

Jul 02 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Arcane Jill wrote:
Since the linker will always throw out 
uncalled functions and unaccessed data (correct?) it should be possible 
to make it easier to use.

 
 
 No, the linker can only throw out whole unused modules. It cannot throw out
 /parts/ of modules. Therefore, each "optional thing" needs to be in its own
 module or modules.

Hmmm. I ran a few tests and it seems that you're right. As soon as you 
specify a module to the linker it will be included fully in the 
executable, regardless of whether it is used or not. It is unfortunate 
that it isn't a little more sophisticated :(.

GDC should be able to do better (since GCC/G++ does for C/C++), but 
since the unicode lib should work well with all compilers it seems that 
your current approach is the only feasible one.

Hauke

Jul 03 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

This is incredible!

Hopefully I'll have some free time next week to check this out :).

Great work.

Hauke



Arcane Jill wrote:
 With humungous thanks to Hauke for ideas, suggestions, algorithms, inspriation,
 etc., I've got the first version of etc.unicode uploaded to Deimos. It gives
you
 access to pretty much all Unicode properties. For an idea of the flavor of the
 thing, these are the functions you get:
 
 char[]              getAge(dchar c)
 char[]              getArabicShapingName(dchar c)
 BidiClass           getBidiClass(dchar c)
 char[]              getBidiClassName(BidiClass e)
 char[]              getBidiClassName(dchar c)
 dchar               getBidiMirroringGlyph(dchar c)
 char[]              getBlock(dchar c)
 uint                getCanonicalCombiningClass(dchar c)
 int                 getDecimalDigit(dchar c)
 wchar[]             getDecompositionMappingUTF16(dchar c)
 dchar[]             getDecompositionMappingUTF32(dchar c)
 char[]              getDecompositionMappingUTF8(dchar c)
 DecompositionType   getDecompositionType(dchar c)
 char[]              getDecompositionTypeName(DecompositionType e)
 char[]              getDecompositionTypeName(dchar c)
 int                 getDigit(dchar c)
 EastAsianWidth      getEastAsianWidth(dchar c)
 char[]              getEastAsianWidthName(EastAsianWidth e)
 char[]              getEastAsianWidthName(dchar c)
 GeneralCategory     getGeneralCategory(dchar c)
 char[]              getGeneralCategoryName(GeneralCategory e)
 char[]              getGeneralCategoryName(dchar c)
 HangulSyllableType  getHangulSyllableType(dchar c)
 char[]              getHangulSyllableTypeName(HangulSyllableType e)
 char[]              getHangulSyllableTypeName(dchar c)
 int                 getHexValue(dchar c)
 char[]              getISOComment(dchar c)
 char[]              getJamo(dchar c)
 char[]              getJoiningGroup(dchar c)
 JoiningType         getJoiningType(dchar c)
 char[]              getJoiningTypeName(JoiningType e)
 char[]              getJoiningTypeName(dchar c)
 LineBreak           getLineBreak(dchar c)
 char[]              getLineBreakName(LineBreak e)
 char[]              getLineBreakName(dchar c)
 wchar[]             getLowercaseMappingLocalUTF16(dchar c, char[] locale)
 dchar[]             getLowercaseMappingLocalUTF32(dchar c, char[] locale)
 char[]              getLowercaseMappingLocalUTF8(dchar c, char[] locale)
 wchar[]             getLowercaseMappingUTF16(dchar c)
 dchar[]             getLowercaseMappingUTF32(dchar c)
 char[]              getLowercaseMappingUTF8(dchar c)
 char[]              getName(dchar c)
 char[]              getNormalizationCorrectionVersion(dchar c)
 dchar               getNormalizationCorrectionsCorrection(dchar c)
 dchar               getNormalizationCorrectionsOriginal(dchar c)
 char[]              getNumeric(dchar c)
 uint                getNumericType(dchar c)
 Script              getScript(dchar c)
 char[]              getScriptName(Script e)
 char[]              getScriptName(dchar c)
 dchar               getSimpleCaseFolding(dchar c)
 dchar               getSimpleLowercaseMapping(dchar c)
 dchar               getSimpleTitlecaseMapping(dchar c)
 dchar               getSimpleUppercaseMapping(dchar c)
 char[]              getSpecialCaseCondition(dchar c)
 char[]              getSpecialCaseConditionLocal(dchar c)
 wchar[]             getTitlecaseMappingLocalUTF16(dchar c, char[] locale)
 dchar[]             getTitlecaseMappingLocalUTF32(dchar c, char[] locale)
 char[]              getTitlecaseMappingLocalUTF8(dchar c, char[] locale)
 wchar[]             getTitlecaseMappingUTF16(dchar c)
 dchar[]             getTitlecaseMappingUTF32(dchar c)
 char[]              getTitlecaseMappingUTF8(dchar c)
 char[]              getUnicode1Name(dchar c)
 wchar[]             getUppercaseMappingLocalUTF16(dchar c, char[] locale)
 dchar[]             getUppercaseMappingLocalUTF32(dchar c, char[] locale)
 char[]              getUppercaseMappingLocalUTF8(dchar c, char[] locale)
 wchar[]             getUppercaseMappingUTF16(dchar c)
 dchar[]             getUppercaseMappingUTF32(dchar c)
 char[]              getUppercaseMappingUTF8(dchar c)
 bool                isASCIIHexDigit(dchar c)
 bool                isAlphabetic(dchar c)
 bool                isBidiControl(dchar c)
 bool                isBidiMirrored(dchar c)
 bool                isCompositionExclusion(dchar c)
 bool                isDash(dchar c)
 bool                isDefaultIgnorableCodePoint(dchar c)
 bool                isDeprecated(dchar c)
 bool                isDiacritic(dchar c)
 bool                isExtender(dchar c)
 bool                isGraphemeBase(dchar c)
 bool                isGraphemeExtend(dchar c)
 bool                isGraphemeLink(dchar c)
 bool                isHexDigit(dchar c)
 bool                isHyphen(dchar c)
 bool                isIDContinue(dchar c)
 bool                isIDSBinaryOperator(dchar c)
 bool                isIDSTrinaryOperator(dchar c)
 bool                isIDStart(dchar c)
 bool                isIdeographic(dchar c)
 bool                isJoinControl(dchar c)
 bool                isLogicalOrderException(dchar c)
 bool                isLowercase(dchar c)
 bool                isMath(dchar c)
 bool                isNoncharacterCodePoint(dchar c)
 bool                isOtherAlphabetic(dchar c)
 bool                isOtherDefaultIgnorableCodePoint(dchar c)
 bool                isOtherGraphemeExtend(dchar c)
 bool                isOtherIDStart(dchar c)
 bool                isOtherLowercase(dchar c)
 bool                isOtherMath(dchar c)
 bool                isOtherUppercase(dchar c)
 bool                isQuotationMark(dchar c)
 bool                isRadical(dchar c)
 bool                isSTerm(dchar c)
 bool                isSoftDotted(dchar c)
 bool                isTerminalPunctuation(dchar c)
 bool                isUnifiedIdeograph(dchar c)
 bool                isUppercase(dchar c)
 bool                isVariationSelector(dchar c)
 bool                isWhiteSpace(dchar c)
 bool                isXIDContinue(dchar c)
 bool                isXIDStart(dchar c)
 
 Pretty much every function is in its own module. This means that when you link
 against it you only get those functions which you actually call. In addition,
 the tables that get linked in are tiny (well, most of them), and in some cases
 even non-existent, thanks to some seriously aggressive space optimization. For
 instance, if you call getSimpleUppercaseMapping(), which converts a character to
 uppercase, you will add only 5K to the size of your executable.
 
 Despite this space saving, the functions should still be pretty fast. The code
 for that uppercasing function consists of two if tests, a shift, a switch
 statement with seven cases, and a table lookup. And nothing else.
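
 As a rough illustration of that shape (not the generated code itself), here is
 a sketch in present-day D. The page size of 256, the names simpleUpper, page00
 and page04, and the contents of the two sample pages are all assumptions made
 up for the example; the real function has seven switch cases, and what its two
 tests actually check is not spelled out here.

```d
import std.stdio : writeln;

enum PAGE_BITS = 8;                      // assumed: 256 code points per page
enum PAGE_SIZE = 1 << PAGE_BITS;

// Hypothetical page for U+0000..U+00FF: offset to add to reach the uppercase
// form. Only 'a'..'z' are filled in; generated code would hold literal data.
immutable int[PAGE_SIZE] page00 = () {
    int[PAGE_SIZE] t;                    // elements default to 0 ("no mapping")
    foreach (i; 0x61 .. 0x7B)            // 'a' .. 'z'
        t[i] = -0x20;
    return t;
}();

// Hypothetical page for U+0400..U+04FF (only U+0430..U+044F filled in).
immutable int[PAGE_SIZE] page04 = () {
    int[PAGE_SIZE] t;
    foreach (i; 0x30 .. 0x50)
        t[i] = -0x20;
    return t;
}();

dchar simpleUpper(dchar c)
{
    if (c > 0x10FFFF)                    // test 1: not a Unicode code point
        return c;

    const(int)[] page;                   // stays null when no page applies
    switch (c >> PAGE_BITS)              // the shift picks a page number
    {
        case 0x00: page = page00[]; break;
        case 0x04: page = page04[]; break;
        default:   break;                // every other page: nothing stored
    }

    if (page is null)                    // test 2: no table for this page
        return c;

    // The table lookup: add the stored offset to the code point.
    return cast(dchar)(cast(int) c + page[c & (PAGE_SIZE - 1)]);
}

void main()
{
    writeln(simpleUpper('a'));           // Latin small letter
    writeln(simpleUpper('\u044F'));      // Cyrillic small letter ya
    writeln(simpleUpper('7'));           // no mapping, returned unchanged
}
```

 Pages with no mappings cost nothing here beyond the default branch of the
 switch, which is where the space saving comes from.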
 
 Most of the other functions go the same way. Some are optimized in different
 ways, but I believe we now have a very good balance between speed and size.
 
 This is only the first step toward full Unicode support for D. Character
 properties are the heart of the Unicode algorithms. You need those first - so
 here they are.
 
 Currently, Deimos is not very well organized, so my next task will be trying to
 get that together. There are lots of interesting things in Deimos now (and some
 of them I don't even know what they are), but what we're lacking is overall
 organization, a build script, a "ready-to-go" downloadable library, proper
 doxygen documentation, and so on. It's a bit irritating, so I guess now is the
 time to deal with that. In the meantime, you can download the etc.unicode source
 files and documentation and build-it-yourself. (But be patient. There are A LOT
 of files to compile).
 
 Arcane Jill

Jun 27 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Arcane Jill wrote:
 Despite this space saving, the functions should still be pretty fast. The code
 for that uppercasing function consists of two if tests, a shift, a switch
 statement with seven cases, and a table lookup. And nothing else.

Hmmm. Why do you store each page separately with a manual switch for 
choosing the right one? A second lookup table should be a lot faster.

You could also save some more cycles if you add a single page that 
contains only 0 values instead of returning a null-pointer. That way you 
do not need to check for null every time you read a value.
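
To make the suggestion concrete, here is a sketch in present-day D of a
two-level lookup with a shared all-zero page. It is only an illustration under
assumed names and sizes (pageIndex, pages, 256 code points per page, one
populated page); it is not code from either library.

```d
import std.stdio : writeln;

enum PAGE_BITS  = 8;                       // assumed page size: 256 code points
enum PAGE_SIZE  = 1 << PAGE_BITS;
enum PAGE_COUNT = 0x110000 >> PAGE_BITS;   // 4352 pages span all of Unicode

// Second level: the actual data pages. pages[0] is the shared page of zeros;
// pages[1] is a hypothetical page for U+0000..U+00FF with only 'a'..'z' set.
immutable int[PAGE_SIZE][2] pages = () {
    int[PAGE_SIZE][2] p;                   // everything defaults to 0
    foreach (i; 0x61 .. 0x7B)
        p[1][i] = -0x20;
    return p;
}();

// First level: one byte per page number, saying which data page to use.
// Entries left at 0 point at the shared zero page, so no null check is needed.
immutable ubyte[PAGE_COUNT] pageIndex = () {
    ubyte[PAGE_COUNT] t;
    t[0x00] = 1;
    return t;
}();

dchar simpleUpper(dchar c)
{
    if (c > 0x10FFFF)                      // keep the index in range
        return c;
    immutable which = pageIndex[c >> PAGE_BITS];
    return cast(dchar)(cast(int) c + pages[which][c & (PAGE_SIZE - 1)]);
}

void main()
{
    writeln(simpleUpper('q'));             // mapped through pages[1]
    writeln(simpleUpper('\u3042'));        // falls through to the zero page
}
```

The cost is the first-level table (4352 bytes with these assumed sizes) plus
one zero page, in exchange for dropping the switch and the null test.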

But these are just minor points. This is a great piece of work.

Hauke

Jun 27 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cbneog$1cu3$1 digitaldaemon.com>, Hauke Duden says...
Arcane Jill wrote:
 Despite this space saving, the functions should still be pretty fast. The code
 for that uppercasing function consists of two if tests, a shift, a switch
 statement with seven cases, and a table lookup. And nothing else.

Hmmm. Why do you store each page separately with a manual switch for 
choosing the right one? A second lookup table should be a lot faster.

You could also save some more cycles if you add a single page that 
contains only 0 values instead of returning a null-pointer. That way you 
do not need to check for null every time you read a value.


The old space/speed tradeoff. You can probably tweak this, once I make the
codebuilder program public. There are various parameters which control decisions
the robot makes, and right now those parameters are just constants. But we COULD
change that so that more popular lookups get biased in favor of speed, while less
popular lookups get biased in favor of minimal space. In the current
configuration, the robot decides that a big table full of zeroes is a waste of
space compared with a test for null, and that a switch statement is acceptable
so long as there are fewer than sixteen cases. But all these things are
ultimately tweakable if we later decide to tweak them. If we do that, the robot
will simply write different code (i.e. faster but bigger).

Personally, I think that the choices currently made are quite reasonable for
most properties. There may be a case for speeding up uppercasing and a few
others, but you still have to think in terms of how much RAM that would consume
at runtime. Right now it's 5K for uppercasing, and another 5K for lowercasing.
Unicode contains a *lot* of data, and I'm hesitant to give speed too high a
priority here, for fear of everything getting huge.

Arcane Jill

Jun 28 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Arcane Jill wrote:
 In article <cbneog$1cu3$1 digitaldaemon.com>, Hauke Duden says...
 
Arcane Jill wrote:

Despite this space saving, the functions should still be pretty fast. The code
for that uppercasing function consists of two if tests, a shift, a switch
statement with seven cases, and a table lookup. And nothing else.

Hmmm. Why do you store each page separately with a manual switch for 
choosing the right one? A second lookup table should be a lot faster.

You could also save some more cycles if you add a single page that 
contains only 0 values instead of returning a null-pointer. That way you 
do not need to check for null every time you read a value.

 
 
 
 The old space/speed tradeoff. You can probably tweak this, once I make the
 codebuilder program public. There are various parameters which control decisions
 the robot makes, and right now those parameters are just constants. But we COULD
 change that so that more popular lookups get biased in favor of speed, while less
 popular lookups get biased in favor of minimal space. In the current
 configuration, the robot decides that a big table full of zeroes is a waste of
 space compared with a test for null, and that a switch statement is acceptable
 so long as there are fewer than sixteen cases. But all these things are
 ultimately tweakable if we later decide to tweak them. If we do that, the robot
 will simply write different code (i.e. faster but bigger).

Not necessarily bigger. If you add RLE compression it can even get 
smaller than what you have now.

I'd love to take a look at the "robot" code. I have some things in mind 
that might improve on both speed and size. I'd like to see how easy it 
would be to integrate them into your current system.


 Personally, I think that the choices currently made are quite reasonable for
 most properties. There may be a case for speeding up uppercasing and a few
 others, but you still have to think in terms of how much RAM that would consume
 at runtime. Right now it's 5K for uppercasing, and another 5K for lowercasing.
 Unicode contains a *lot* of data, and I'm hesitant to give speed too high a
 priority here, for fear of everything getting huge.

We're on the same page here. Both sides need to be optimized but you 
have to find a good balance.


Hauke

Jun 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cbosl4$f9d$1 digitaldaemon.com>, Hauke Duden says...

Not necessarily bigger. If you add RLE compression it can even get 
smaller than what you have now.

You mentioned that before, but I'm not sure I agree. RLE is pretty much the
/only/ one of your ideas that I didn't go with. You see, I take the view that
hard disk space is plentiful, but RAM is not. With that perspective, compressing
on disk, but decompressing into RAM, is /not/ a good thing to do. You might as
well load it into RAM in the uncompressed state in the first place.



I'd love to take a look at the "robot" code. I have some things in mind 
that might improve on both speed and size. I'd like to see how easy it 
would be to integrate them into your current system.

I thought you might. Rest assured, you will be the /first/ person to get write
access. I'll probably need to start a new project for it though. The
codebuilder itself doesn't REALLY belong in Deimos, as it's not general purpose.



We're on the same page here. Both sides need to be optimized but you 
have to find a good balance.

The codebuilder is a good way to get that balance. Change a few constants, run
it again and new source gets written reflecting the new balance. But fine tuning
it is probably more your area of expertise. You seem to know more about this
sort of stuff than I, anyway.

Jill

Jun 28 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Arcane Jill wrote:
 In article <cbosl4$f9d$1 digitaldaemon.com>, Hauke Duden says...
 
 
Not necessarily bigger. If you add RLE compression it can even get 
smaller than what you have now.

 
 
 You mentioned that before, but I'm not sure I agree. RLE is pretty much the
 /only/ one of your ideas that I didn't go with. You see, I take the view that
 hard disk space is plentiful, but RAM is not. With that perspective, compressing
 on disk, but decompressing into RAM, is /not/ a good thing to do. You might as
 well load it into RAM in the uncompressed state in the first place.

Heh. Sorry if I seem like I want to push my ideas onto you. I usually 
write these messages in a hurry and I'm often not sure if I have 
mentioned something before ;)

Regarding the size problem: for me the trade-off is not so much disk 
space against RAM usage but executable size against RAM usage. My 
concern is that if the executables get too big then people might not 
want to use the unicode functions for some applications (and fall back 
to ASCII instead). For example, good Setup software adds as little 
overhead as possible to the installed data. If all the Unicode stuff 
together amounts to 100K then that may already be too much.

You should also keep in mind that the executable is held in RAM as well, 
so increasing executable size to save RAM does not always give you an 
advantage.

Please also let me emphasize that I'm not advocating holding completely 
uncompressed tables in RAM. On the contrary: I think the layout you 
currently have is good for the uncompressed version.
What I mean is not storing this data directly in the executable, but 
storing an RLE compressed version and unpacking it into the current form 
at runtime. The RLE'ed version should be quite a bit smaller (it reduces 
the mapping data in unichar to about 1/4th). So the increase in RAM 
usage would be around 125% of what you have now (assuming that the other 
data packs similarly well - the 25% increase comes from the second RLE 
compressed version of the data in the executable). But executable size 
goes down to 25%. I think that's worth it.
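
As a worked example of that estimate: if the uncompressed tables take Y = 400 KB
and RLE packs them to a quarter, the executable carries X = 100 KB (25% of Y),
while RAM at runtime holds both copies, X + Y = 500 KB (125% of Y). The sketch
below shows the kind of unpacking step I mean, in present-day D. The Run
struct, the name rleDecode and the sample data are assumptions for
illustration, not the format used in unichar.

```d
import std.stdio : writeln;

// One run: `count` consecutive table entries that all hold `value`.
struct Run
{
    ushort count;
    int value;
}

// Hypothetical packed form of a 256-entry page: zeros, 26 offsets, zeros.
immutable Run[] packedPage = [Run(0x61, 0), Run(26, -0x20), Run(0x85, 0)];

// Expand the runs into a flat table that can be indexed directly.
int[] rleDecode(const(Run)[] runs)
{
    size_t total = 0;
    foreach (r; runs)
        total += r.count;

    auto table = new int[total];           // the explicitly allocated copy
    size_t pos = 0;
    foreach (r; runs)
    {
        table[pos .. pos + r.count] = r.value;   // fill the whole run
        pos += r.count;
    }
    return table;
}

void main()
{
    auto page = rleDecode(packedPage);
    writeln(page.length);                  // number of entries after unpacking
    writeln(page[0x61]);                   // offset stored for 'a'
}
```

Three runs replace 256 table entries in this made-up page; real pages will
pack less neatly, which is why the 1/4 figure is only an estimate.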

Also keep in mind that we're talking about kilobyte sizes here. 500 KB 
of RAM is not much nowadays (my estimate for an application that uses 
just about everything), but downloading 500 KB more from the internet is 
very noticeable for modem users.

I'd love to take a look at the "robot" code. I have some things in mind 
that might improve on both speed and size. I'd like to see how easy it 
would be to integrate them into your current system.

 
 
 I thought you might. Rest assured, you will be the /first/ person to get write
 access. I'll probably need to start a new project for it though. The
 codebuilder itself doesn't REALLY belong in Deimos, as it's not general purpose.

Take your time - I'm curious, but I don't have much free time to spend 
on this anyway. I can wait :).


Hauke

Jun 28 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cbpp7k$1pgi$1 digitaldaemon.com>, Hauke Duden says...

You should also keep in mind that the executable is held in RAM as well, 

As well as what?

The tables are directly contained in the RAM image of the executable. They are
not duplicated or otherwise reconstructed. They are accessed in-place.



so increasing executable size to save RAM does not always give you an 
advantage.

Curiously, you seem to be arguing in favor of my position. Had we used RLE
decompression, THEN we'd have to worry about the "as well".

Jill

Jun 29 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Arcane Jill wrote:
 In article <cbpp7k$1pgi$1 digitaldaemon.com>, Hauke Duden says...
 
 
You should also keep in mind that the executable is held in RAM as well, 

 
 
 As well as what?

As well as the data you store in explicitly allocated memory.

 The tables are directly contained in the RAM image of the executable. They are
 not duplicated or otherwise reconstructed. They are accessed in-place.

Yes, but they ARE in RAM. My point was that you don't save RAM if you 
put the data in the executable instead of an explicitly allocated memory 
block.

so increasing executable size to save RAM does not always give you an 
advantage.

 
 
 Curiously, you seem to be arguing in favor of my position. Had we used RLE
 decompression, THEN we'd have to worry about the "as well".

I don't think I understand what you mean. If I understood your last post 
correctly you didn't want to use RLE compression because disk space is 
cheap, but RAM is not. I am arguing that:

- executable size (=disk space) is more expensive than RAM if the file 
is downloaded from the internet

- RAM usage is increased only slightly (my rough estimate was 125% of 
the original space) but executable size is reduced significantly (down 
to 25%).

In an age where many programs are downloaded from the internet that is 
worth thinking about.

Hauke

Jun 29 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cbrvnp$1uf8$1 digitaldaemon.com>, Hauke Duden says...
You should also keep in mind that the executable is held in RAM as well, 

 
 As well as what?

As well as the data you store in explicitly allocated memory.

That would be zero.


 The tables are directly contained in the RAM image of the executable. They are
 not duplicated or otherwise reconstructed. They are accessed in-place.

Yes, but they ARE in RAM. My point was that you don't save RAM if you 
put the data in the executable instead of an explicitly allocated memory 
block.

But clearly you do. If the compressed size is X, and the uncompressed size is Y,
then storing the uncompressed table in the executable costs Y bytes of RAM.
Decompressing at runtime costs (X+Y) bytes of RAM, since you can't un-allocate
the X. Since X is not negative, it follows that Y will always be less than (X+Y)



I don't think I understand what you mean. If I understood your last post 
correctly you didn't want to use RLE compression because disk space is 
cheap, but RAM is not. I am arguing that:

- executable size (=disk space) is more expensive than RAM if the file 
is downloaded from the internet

That's what zip files are for.

Besides which, I don't think my obj files actually do contain large arrays full
of zeroes. Such zero blocks will all have been removed and replaced by null
pointer returns. Or were you arguing that zero-blocks should be re-inserted?



- RAM usage is increased only slightly (my rough estimate was 125% of 
the original space) but executable size is reduced significantly (down 
to 25%).

I'm not arguing with that. I'm arguing in favor of not increasing RAM usage *AT
ALL*. Like, not even slightly.


In an age where many programs are downloaded from the internet that is 
worth thinking about.

Zip files use much better compression than simple RLE. I say zip 'em.

Jill

Jun 29 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Arcane Jill wrote:
 In article <cbrvnp$1uf8$1 digitaldaemon.com>, Hauke Duden says...
 
You should also keep in mind that the executable is held in RAM as well, 

As well as what?

As well as the data you store in explicitly allocated memory.

 
 
 That would be zero.
 
 
 
The tables are directly contained in the RAM image of the executable. They are
not duplicated or otherwise reconstructed. They are accessed in-place.

Yes, but they ARE in RAM. My point was that you don't save RAM if you 
put the data in the executable instead of an explicitly allocated memory 
block.

 
 
 But clearly you do. If the compressed size is X, and the uncompressed size is Y,
 then storing the uncompressed table in the executable costs Y bytes of RAM.
 Decompressing at runtime costs (X+Y) bytes of RAM, since you can't un-allocate
 the X. Since X is not negative, it follows that Y will always be less than (X+Y)

In this particular case, yes, as I stated in my post. I just wanted to 
emphasize that moving data into statically compiled arrays (as opposed 
to dynamic ones) doesn't automatically reduce RAM usage.

I don't think I understand what you mean. If I understood your last post 
correctly you didn't want to use RLE compression because disk space is 
cheap, but RAM is not. I am arguing that:

- executable size (=disk space) is more expensive than RAM if the file 
is downloaded from the internet

 
 
 That's what zip files are for.

What about installers and self-extractors? You do not ZIP those because 
they ARE the ZIP file (in a manner of speaking). I'd like to be able to 
write such applications in D.

Besides, half of my reasoning is that even if D executables could be 
compressed better than C++ ones, many people would still compare their 
size in uncompressed form. A similar thing happened to C++: C++ 
executables usually compress better than C ones (templates and exception 
handling create lots of similar code), yet C++ is often said to be the 
"bloat king" among languages.

I just don't want people to shun the Unicode routines because of the 
size difference, even if it may not have such a big impact on the end 
result as they might think.


 Besides which, I don't think my obj files actually do contain large arrays full
 of zeroes. Such zero blocks will all have been removed and replaced by null
 pointer returns. Or were you arguing that zero-blocks should be re-inserted?

RLE doesn't just pack zero arrays. Unicode contains lots of ranges with 
the same values.


- RAM usage is increased only slightly (my rough estimate was 125% of 
the original space) but executable size is reduced significantly (down 
to 25%).

 
 
 I'm not arguing with that. I'm arguing in favor of not increasing RAM usage *AT
 ALL*. Like, not even slightly.

As I said, I think 100 KB of extra RAM usage is a lot better than 400 KB 
of increased executable size. Especially for a garbage collected 
language that will always use more RAM than strictly necessary.

Hauke

Jun 29 2004

"Martin M. Pedersen" <martin moeller-pedersen.dk> writes:

"Hauke Duden" <H.NS.Duden gmx.net> skrev i en meddelelse news:cbrvnp$1uf8> >
The tables are directly contained in the RAM image of the executable. They are
 not duplicated or otherwise reconstructed. They are accessed in-place.

 Yes, but they ARE in RAM. My point was that you don't save RAM if you
 put the data in the executable instead of an explicitly allocated memory
 block.

Are you sure about that? I would expect individual pages to be loaded on
demand.

Regards,
Martin

Jun 29 2004

Hauke Duden <H.NS.Duden gmx.net> writes:

Martin M. Pedersen wrote:
 "Hauke Duden" <H.NS.Duden gmx.net> skrev i en meddelelse news:cbrvnp$1uf8> >
 The tables are directly contained in the RAM image of the executable. They are
 
not duplicated or otherwise reconstructed. They are accessed in-place.

Yes, but they ARE in RAM. My point was that you don't save RAM if you
put the data in the executable instead of an explicitly allocated memory
block.

 
 
 Are you sure about that? I would expect individual pages to be loaded on
 demand.

I'm pretty sure about it, but not 100% sure. I have frequently observed 
in the past that RAM usage increases by the size of a DLL as soon as it 
is loaded.

Hauke

Jun 30 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cbsj6s$2s9m$1 digitaldaemon.com>, Martin M. Pedersen says...
"Hauke Duden" <H.NS.Duden gmx.net> skrev i en meddelelse news:cbrvnp$1uf8> >
The tables are directly contained in the RAM image of the executable. They are
 not duplicated or otherwise reconstructed. They are accessed in-place.

 Yes, but they ARE in RAM. My point was that you don't save RAM if you
 put the data in the executable instead of an explicitly allocated memory
 block.

Are you sure about that? I would expect individual pages to be loaded on
demand.

Regards,
Martin


Once upon a time, I did just that. I had to write some Unicode stuff for my
employer a few years back, and I adopted exactly that "load on demand" approach.
It is with this hindsight that I now believe it to have been a bad idea
(although with some modification, it might be a good idea in a DLL).

The basic load-on-demand approach is this: The user calls something like
isUppercase(c); The library evaluates (c >> N) (for some N) to get what we might
loosely call a "page" number. Then it says to itself, "Is this page cached?". If
so, use the in-RAM table to look up the answer; if not, load the page from disk,
decompress it into RAM, cache it so we don't have to do all that again, and THEN
look up the value. Plus, you'd have to do this every time you re-ran your
application, which might matter for some very small applications.
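
For anyone who wants to see the shape of it, here is a sketch of that scheme in
present-day D. It is not the code I wrote back then: the names
(isUppercaseCached, loadPage, cache, loads), the page size of 256, and the fake
"disk read" that just fabricates a page in memory are all assumptions made so
that the example stands on its own.

```d
import std.stdio : writeln;

enum PAGE_BITS = 8;                    // assumed N: 256 code points per page
enum PAGE_SIZE = 1 << PAGE_BITS;

bool[][uint] cache;                    // page number -> decompressed page
size_t loads;                          // how many times we had to "hit disk"

// Stand-in for "find the file, read it, decompress it". Here it only knows
// that 'A'..'Z' are uppercase; a real loader would read packed data.
bool[] loadPage(uint pageNo)
{
    ++loads;
    auto page = new bool[PAGE_SIZE];
    if (pageNo == 0)
        page[0x41 .. 0x5B] = true;     // 'A' .. 'Z'
    return page;
}

bool isUppercaseCached(dchar c)
{
    immutable uint pageNo = c >> PAGE_BITS;

    auto cached = pageNo in cache;     // "Is this page cached?"
    if (cached is null)
    {
        cache[pageNo] = loadPage(pageNo);   // load, decompress, remember
        cached = pageNo in cache;
    }
    return (*cached)[c & (PAGE_SIZE - 1)];  // and THEN look up the value
}

void main()
{
    writeln(isUppercaseCached('A'));
    writeln(isUppercaseCached('a'));   // same page, served from the cache
    writeln(loads);
}
```

Note that the cache test runs on every single call, even once everything is
loaded, and the cache starts out empty each time the program is run.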

At the time, it seemed like this was quite a promising approach, but it had a
huge number of drawbacks. For one thing, neither C++ nor D has any concept of
resource files (essentially a Java concept), so, in order to load anything off
disk, YOU FIRST HAVE TO FIND IT. This means either a DLL (per "page"? per
property?), or you have to read an environment variable to tell you where to
look. Requiring users to set an environment variable just to get isUppercase()
working is not desirable. For another thing, the extra code you have to go
through at runtime to answer the question "is it cached?" is itself a few extra
cycles.

I took a different approach this time round. In this new approach, there are two
important principles: (1) The source code shall be written by robot. This
protects us against Unicode itself being updated (and it is /constantly/ being
updated), and it also allows for some SERIOUS optimization, because of course a
robot can try many, many different optimization strategies and, using sheer
brute force, pick the best. (2) Each property shall be in its own object module,
so that you do not link in those properties which you do not need. This is
load-on-demand in a sense, but it's COMPILE-TIME load-on-demand, which is better
(in my opinion) and the split is in a different direction (property-required
rather than codepoint-range).

The new approach makes complete sense when you realize just how small things
get. That isUppercase() function generates a linkable object module which is a
measly 3420 /bytes/ in size, and does not depend on (pull in) anything else. I
mean - come on guys - just 3K! Are we REALLY saying that's too much? Plus, as a
bonus, you get isUppercase() data for all characters, so there is no bias
toward any particular subrange.

And in any case, I didn't get that 3K figure by measuring the size of the
generated tables, I got it by compiling an object module and typing dir at the
command line to see how big it ended up. For reference, the following do-nothing
functions:








by the same standard comes out at 232 bytes. However, this measure
overestimates, because, even if it isn't inlined, there are some linker symbols
in there that will be discarded when constructing an executable. So even these
small estimates are overestimates. I think you would be hard-pressed to do
better than my robot.

The good news, however, is that nobody has to agree with me. (I guess it was too
much to hope for that everyone would). Because, pretty soon, I'm going to make
the codebuilder robot open source, once I've added a few more tweaks and got a
few other things sorted out. That means that if anyone can come up with a better
strategy than I, they would be perfectly welcome to take a branch of the
codebuilder source tree and modify it to do something else. Then we could run
various efficiency tests to compare all versions. You could do this to see if
your load-on-demand-by-codepoint-range idea was feasible; Hauke could do it to
see if his RLE encoding idea works out better than what I've done, and without
doubt, the most efficient one is the one we'll keep (though I'm not sure how you
define efficient).

In any case, I'm putting my next efforts into (a) fixing some bugs in Int, and
(b) making Deimos easy to download and use. This sort of feature modification
you suggest is not on my agenda in the near future, because, although there ARE
a few more functions I need to add to etc.unicode, I've basically achieved what
I set out to achieve, and I'm happy with it, and pretty soon I'm going to be
keen to get back to my crypto stuff.

I hope that helps.

Arcane Jill

Jun 30 2004

"Martin M. Pedersen" <martin moeller-pedersen.dk> writes:

"Arcane Jill" <Arcane_member pathlink.com> skrev i en meddelelse
news:cbu6ru$2nm4$1 digitaldaemon.com...
Are you sure about that? I would expect individual pages to be loaded on
demand.

 Once upon a time, I did just that. I had to write some Unicode stuff for my
 employer a few years back, and I adopted exactly that "load on demand" approach.
 It is with this hindsight that I now believe it to have been a bad idea
 (although with some modification, it might be a good idea in a DLL).

What I meant was load-on-demand implemented by the operating system. I think
you have done a great job, and also in this respect made the right decision
:-) When a modern operating system starts executing a program, it does not
actually load the program into RAM. Instead, it sets up page tables, and
uses the page-fault mechanism of the CPU to implement load-on-demand for the
code segment. For example, when the very first instruction of the program is
to be executed, it generates a page-fault, and it is at this time, the very
first page is loaded into RAM. At least, this is my understanding. Static,
constant tables like the ones in the Unicode library, would be - or can be -
embraced by the same mechanism, meaning that you only pay for what you use.
A decompression scheme at application level would mean that you would always
pay for everything. It is better to leave that kind of stuff to the
operating system, and mechanisms for doing that have existed for a long time,
although they might not be universally available. NTFS has built-in support
for transparent file compression (an LZ77-style scheme), and it is a long time
since "Stacker" was invented.
For distribution, we have de-facto standards such as zip.

Regards,
Martin

Jun 30 2004

Sam McCall <tunah.d tunah.net> writes:

Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and 
  simple folding/casing atm, I'll learn what the others do later ;-)
The unicode stuff isn't in the library in subversion AFAICS though.
(Actually, I can't get subversion to check out properly, I've been using 
the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as 
the location in TortoiseSVN, does that look right?)
Sam

Jun 30 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cbu5r3$2kpj$1 digitaldaemon.com>, Sam McCall says...
Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and 
  simple folding/casing atm, I'll learn what the others do later ;-)
The unicode stuff isn't in the library in subversion AFAICS though.

It is, kind of. At least the source code is there, at
http://svn.dsource.org/svn/projects/deimos/trunk/etc/unicode/.

But what we really NEED is a downloadable pre-built library. That's the part
that's currently missing.


(Actually, I can't get subversion to check out properly, I've been using 
the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as 
the location in TortoiseSVN, does that look right?)
Sam

Er - I hope someone else can answer that...?

Jill

Jun 30 2004

Sam McCall <tunah.d tunah.net> writes:

Arcane Jill wrote:
 In article <cbu5r3$2kpj$1 digitaldaemon.com>, Sam McCall says...
 
Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and 
 simple folding/casing atm, I'll learn what the others do later ;-)
The unicode stuff isn't in the library in subversion AFAICS though.

 
 
 It is, kind of. At least the source code is there, at
 http://svn.dsource.org/svn/projects/deimos/trunk/etc/unicode/.
 
 But what we really NEED is a downloadable pre-built library. That's the part
 that's currently missing.

Sorry, yeah, that's what I meant.
I built one here with (from trunk/)
for /R etc %f in (*.d) do dmd -c -release %f
lib -c deimos.lib age.obj
for %f in (*.obj) do lib deimos.lib %f

(I'm not at all familiar with this stuff, so this may well be the Wrong 
Way, in particular the age.obj thing is a hack because i can't seem to 
get the lib tool to add an obj, creating the library if it doesn't exist).
Then I hit the problem that I couldn't use most of the functions, due to 
unknown symbol. Sure enough, most of the functions weren't in the 
library. I changed the ones I wanted to use to "public" in the source, 
and that worked. I'm not sure if this is the right fix.
Sam

 
 
 
(Actually, I can't get subversion to check out properly, I've been using 
the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as 
the location in TortoiseSVN, does that look right?)
Sam

 
 
 Er - I hope someone else can answer that...?
 
 Jill

Jun 30 2004
