digitalmars.D - Unicode library now in Deimos

Arcane Jill <Arcane_member pathlink.com> writes:
With humungous thanks to Hauke for ideas, suggestions, algorithms, inspiration,
etc., I've got the first version of etc.unicode uploaded to Deimos. It gives you
access to pretty much all Unicode properties. For an idea of the flavor of the
thing, these are the functions you get:

char[]              getAge(dchar c)
char[]              getArabicShapingName(dchar c)
BidiClass           getBidiClass(dchar c)
char[]              getBidiClassName(BidiClass e)
char[]              getBidiClassName(dchar c)
dchar               getBidiMirroringGlyph(dchar c)
char[]              getBlock(dchar c)
uint                getCanonicalCombiningClass(dchar c)
int                 getDecimalDigit(dchar c)
wchar[]             getDecompositionMappingUTF16(dchar c)
dchar[]             getDecompositionMappingUTF32(dchar c)
char[]              getDecompositionMappingUTF8(dchar c)
DecompositionType   getDecompositionType(dchar c)
char[]              getDecompositionTypeName(DecompositionType e)
char[]              getDecompositionTypeName(dchar c)
int                 getDigit(dchar c)
EastAsianWidth      getEastAsianWidth(dchar c)
char[]              getEastAsianWidthName(EastAsianWidth e)
char[]              getEastAsianWidthName(dchar c)
GeneralCategory     getGeneralCategory(dchar c)
char[]              getGeneralCategoryName(GeneralCategory e)
char[]              getGeneralCategoryName(dchar c)
HangulSyllableType  getHangulSyllableType(dchar c)
char[]              getHangulSyllableTypeName(HangulSyllableType e)
char[]              getHangulSyllableTypeName(dchar c)
int                 getHexValue(dchar c)
char[]              getISOComment(dchar c)
char[]              getJamo(dchar c)
char[]              getJoiningGroup(dchar c)
JoiningType         getJoiningType(dchar c)
char[]              getJoiningTypeName(JoiningType e)
char[]              getJoiningTypeName(dchar c)
LineBreak           getLineBreak(dchar c)
char[]              getLineBreakName(LineBreak e)
char[]              getLineBreakName(dchar c)
wchar[]             getLowercaseMappingLocalUTF16(dchar c, char[] locale)
dchar[]             getLowercaseMappingLocalUTF32(dchar c, char[] locale)
char[]              getLowercaseMappingLocalUTF8(dchar c, char[] locale)
wchar[]             getLowercaseMappingUTF16(dchar c)
dchar[]             getLowercaseMappingUTF32(dchar c)
char[]              getLowercaseMappingUTF8(dchar c)
char[]              getName(dchar c)
char[]              getNormalizationCorrectionVersion(dchar c)
dchar               getNormalizationCorrectionsCorrection(dchar c)
dchar               getNormalizationCorrectionsOriginal(dchar c)
char[]              getNumeric(dchar c)
uint                getNumericType(dchar c)
Script              getScript(dchar c)
char[]              getScriptName(Script e)
char[]              getScriptName(dchar c)
dchar               getSimpleCaseFolding(dchar c)
dchar               getSimpleLowercaseMapping(dchar c)
dchar               getSimpleTitlecaseMapping(dchar c)
dchar               getSimpleUppercaseMapping(dchar c)
char[]              getSpecialCaseCondition(dchar c)
char[]              getSpecialCaseConditionLocal(dchar c)
wchar[]             getTitlecaseMappingLocalUTF16(dchar c, char[] locale)
dchar[]             getTitlecaseMappingLocalUTF32(dchar c, char[] locale)
char[]              getTitlecaseMappingLocalUTF8(dchar c, char[] locale)
wchar[]             getTitlecaseMappingUTF16(dchar c)
dchar[]             getTitlecaseMappingUTF32(dchar c)
char[]              getTitlecaseMappingUTF8(dchar c)
char[]              getUnicode1Name(dchar c)
wchar[]             getUppercaseMappingLocalUTF16(dchar c, char[] locale)
dchar[]             getUppercaseMappingLocalUTF32(dchar c, char[] locale)
char[]              getUppercaseMappingLocalUTF8(dchar c, char[] locale)
wchar[]             getUppercaseMappingUTF16(dchar c)
dchar[]             getUppercaseMappingUTF32(dchar c)
char[]              getUppercaseMappingUTF8(dchar c)
bool                isASCIIHexDigit(dchar c)
bool                isAlphabetic(dchar c)
bool                isBidiControl(dchar c)
bool                isBidiMirrored(dchar c)
bool                isCompositionExclusion(dchar c)
bool                isDash(dchar c)
bool                isDefaultIgnorableCodePoint(dchar c)
bool                isDeprecated(dchar c)
bool                isDiacritic(dchar c)
bool                isExtender(dchar c)
bool                isGraphemeBase(dchar c)
bool                isGraphemeExtend(dchar c)
bool                isGraphemeLink(dchar c)
bool                isHexDigit(dchar c)
bool                isHyphen(dchar c)
bool                isIDContinue(dchar c)
bool                isIDSBinaryOperator(dchar c)
bool                isIDSTrinaryOperator(dchar c)
bool                isIDStart(dchar c)
bool                isIdeographic(dchar c)
bool                isJoinControl(dchar c)
bool                isLogicalOrderException(dchar c)
bool                isLowercase(dchar c)
bool                isMath(dchar c)
bool                isNoncharacterCodePoint(dchar c)
bool                isOtherAlphabetic(dchar c)
bool                isOtherDefaultIgnorableCodePoint(dchar c)
bool                isOtherGraphemeExtend(dchar c)
bool                isOtherIDStart(dchar c)
bool                isOtherLowercase(dchar c)
bool                isOtherMath(dchar c)
bool                isOtherUppercase(dchar c)
bool                isQuotationMark(dchar c)
bool                isRadical(dchar c)
bool                isSTerm(dchar c)
bool                isSoftDotted(dchar c)
bool                isTerminalPunctuation(dchar c)
bool                isUnifiedIdeograph(dchar c)
bool                isUppercase(dchar c)
bool                isVariationSelector(dchar c)
bool                isWhiteSpace(dchar c)
bool                isXIDContinue(dchar c)
bool                isXIDStart(dchar c)

Pretty much every function is in its own module. This means that when you link
against it you only get those functions which you actually call. In addition,
the tables that get linked in are tiny (well, most of them), and in some cases
even non-existent, thanks to some seriously aggressive space optimization. For
instance, if you call getSimpleUppercaseMapping(), which converts a character to
uppercase, you will add only 5K to the size of your executable.

Despite this space saving, the functions should still be pretty fast. The code
for that uppercasing function consists of two if tests, a shift, a switch
statement with seven cases, and a table lookup. And nothing else.
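
To picture that shape, here is a minimal sketch in D. It is not the generated etc.unicode code: the name sketchSimpleUpper, the 128-entry page size, and the choice to populate only U+0000..U+00FF are all assumptions made for illustration, and it has two switch cases rather than seven.

```d
// Hypothetical sketch, not the generated etc.unicode code.
// Pages hold 128 code points each; only U+0000..U+00FF has data here.
enum pageShift = 7;
enum pageMask  = (1 << pageShift) - 1;

// Build one page at compile time. Entry i is the simple uppercase
// mapping of code point (base + i), or the code point itself.
ushort[128] makePage(uint base)
{
    ushort[128] page;
    foreach (i; 0 .. 128)
    {
        immutable uint c = base + i;
        uint u = c;
        if (c >= 'a' && c <= 'z')
            u = c - 32;                 // a..z -> A..Z
        else if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
            u = c - 32;                 // Latin-1 letters, skipping U+00F7
        else if (c == 0xB5)
            u = 0x39C;                  // MICRO SIGN -> GREEK CAPITAL LETTER MU
        else if (c == 0xFF)
            u = 0x178;                  // y with diaeresis -> U+0178
        page[i] = cast(ushort) u;
    }
    return page;
}

immutable ushort[128] page0 = makePage(0x00);
immutable ushort[128] page1 = makePage(0x80);

dchar sketchSimpleUpper(dchar c)
{
    if (c > 0xFFFF) return c;           // if test 1: beyond the tabulated range

    const(ushort)[] page;
    switch (c >> pageShift)             // the shift and the switch pick a page
    {
        case 0:  page = page0[]; break;
        case 1:  page = page1[]; break;
        default: page = null;    break; // no table stored for this page
    }

    if (page is null) return c;         // if test 2: nothing to look up
    return cast(dchar) page[c & pageMask];  // the table lookup
}
```

Pages with no case mappings cost nothing in this layout, because they are represented by the default branch rather than by stored data.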

Most of the other functions go the same way. Some are optimized in different
ways, but I believe we now have a very good balance between speed and size.

This is only the first step toward full Unicode support for D. Character
properties are the heart of the Unicode algorithms. You need those first - so
here they are.

Currently, Deimos is not very well organized, so my next task will be trying to
get that together. There are lots of interesting things in Deimos now (and some
of them I don't even know what they are), but what we're lacking is overall
organization, a build script, a "ready-to-go" downloadable library, proper
doxygen documentation, and so on. It's a bit irritating, so I guess now is the
time to deal with that. In the meantime, you can download the etc.unicode source
files and documentation and build them yourself. (But be patient. There are A LOT
of files to compile.)

Arcane Jill
Jun 27 2004
"Walter" <newshound digitalmars.com> writes:
Cool! Is this a supplement or a replacement for Hauke's earlier work?
Jun 27 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
> Cool! Is this a supplement or a replacement for Hauke's earlier work?

Both really. etc.unicode is a very different beast from unichar. Hauke's unichar is very easy to use - there's just one source file. You stick that source file in your project and you're done. With etc.unicode it is (for now) more complicated, as there are many, many source files, and, what with Deimos being slightly disorganized at present, it may be a while before we get a build script together. Deimos will work like Phobos - you download a lib and link against it. But we're not that organized yet. Getting Deimos organized now has quite a high priority for me (more even than writing code, and that's saying something).

It's a replacement for PART of Hauke's work. Hauke's utype module will always be necessary if you want a drop-in replacement for ctype. We should keep that forever. I don't know how isprint() and isgraph() are implemented in utype right now, but they could in any case be implemented in terms of etc.unicode if needed (isgraph(c) == isGraphemeBase(c) || isGraphemeExtend(c), and so on).

etc.unicode does overlap the functionality of unichar though. That's because etc.unicode is written by robot, and it was easier to get the robot to write the lot rather than just some of it. I need to make the codebuilder robot public - or at least available to Hauke - because he may want to tweak it in places.

Arcane Jill
Jun 28 2004
"Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cboirj$1ca$1 digitaldaemon.com...
> In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
>> Cool! Is this a supplement or a replacement for Hauke's earlier work?
> It's a replacement for PART of Hauke's work.

Hmm. This can be confusing. Can the functionality of each be made unique?
Jul 01 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cc297e$ik$2 digitaldaemon.com>, Walter says...
> "Arcane Jill" <Arcane_member pathlink.com> wrote in message
> news:cboirj$1ca$1 digitaldaemon.com...
>> In article <cbnbdl$182e$1 digitaldaemon.com>, Walter says...
>>> Cool! Is this a supplement or a replacement for Hauke's earlier work?
>> It's a replacement for PART of Hauke's work.
> Hmm. This can be confusing.

I meant, it's a replacement for "unichar", but not for "utype".

> Can the functionality of each be made unique?

Hauke's "utype" module (drop-in replacement for ctype) is unique. It is not duplicated in etc.unicode. (The FUNCTIONALITY of some functions is duplicated, of course, but that's because what makes utype special is the fact that all functions have the same _name_ as the corresponding ctype functions.) For example, to convert a character to uppercase using simple casing, you can currently use any of:

(a) toupper(c)                    // ASCII only, from ctype
(b) toupper(c)                    // all Unicode chars, from utype
(c) charToUpper(c)                // all Unicode chars, from unichar
(d) getSimpleUppercaseMapping(c)  // all Unicode chars, from etc.unicode

All of the above are locale-unaware, but very shortly, there will also be:

(e) getUppercaseMapping(c, locale);

It is not possible, however, to make "etc.unicode" and "unichar" unique. The former is a superset of the latter. The codebuilder could, of course, be instructed NOT to generate those functions for which similar functionality exists in unichar, but then you'd have problems with keeping both versions in step with each other, function names being inconsistent, linking strategy being different, and so on.

My vote would go to retaining "utype" (because people are familiar with ctype), but using "etc.unicode" in place of "unichar". What "etc.unicode" can do is a superset of what "unichar" can do, in terms of provided functions, and in addition it can be rebuilt for any future (or even past) version of Unicode by any end-user in a matter of minutes (once the codebuilder program goes public). It's also likely to be smaller, because you link in only those parts you need, whereas with "unichar" it's all or nothing. It should be noted of course that "etc.unicode" is well optimized for space (but still with guaranteed constant-time lookup), and Hauke's input is what made this possible.

The function names are definitely confusing, I agree. But in the case of "etc.unicode", the names originate from the Unicode Consortium. These folk define property names, such as "Simple_Uppercase_Mapping", and so on. The names are part of the Unicode standard - so I just slavishly put my metaphorical blinkers on, removed the underscores and added a "get" or "is" prefix ("get" for non-boolean properties, "is" for boolean properties) to conform to the D style guide. Such names may be cumbersome, but I still think it's better than using made-up names, and it's a consistent methodology to extend to the remaining properties we haven't added yet.

I hope that makes things less confusing. Unfortunately, right now, Deimos is not well organized, because it is in the hands of many people. To me it makes more sense that people should be able to download the whole of Deimos in one go (instead of individual packages), just like they can currently download the whole of Phobos in one go. That sort of organization would be easy if Deimos were a one-person project, or even a project with one leader whose word was law, but it's a collective effort, and I think those involved are going to HAVE to put some effort into making it look like a unified effort. This will happen in time, but I mention it because, right now (and I hope this is a temporary phase), "unichar" is easier to use than "etc.unicode", even though both are currently supplied in source code form, if only for the simple reason that "unichar" is one file and "etc.unicode" is many files.

In the (near?) future, I would hope to have the following:

(1) Deimos being easy to download and easy to use, with pre-built linkable libraries for all platforms, in both Debug and Release builds.
(2) Headers for etc.unicode (by which I mean, stripped versions of the source code, with large tables removed), to speed up compilation time.
(3) The codebuilder program (which generates etc.unicode) being made public, along with documentation, so that people can compile Unicode lookups for any version of Unicode, past, present or future (or even customized).

Until this is done, unichar is likely to be easier to use. However, once these steps are taken, I would then have no hesitation in suggesting that we use as standard:

(i) etc.unicode (possibly renamed to std.unicode - the codebuilder can locate it anywhere).
(ii) utype (possibly renamed to std.utype).

Arcane Jill
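
The naming rule described in this post (drop the underscores from the official property name, then prefix "get" or "is") is mechanical enough to sketch in a few lines of D. The helper name propertyFunctionName is hypothetical and is not taken from the codebuilder, which has not been published; this only illustrates the rule as stated.

```d
import std.array : replace;

// Hypothetical helper, not part of etc.unicode or the codebuilder:
// derive a D function name from an official Unicode property name.
string propertyFunctionName(string propertyName, bool isBoolean)
{
    // "is" for boolean properties, "get" for everything else.
    immutable prefix = isBoolean ? "is" : "get";
    // Unicode property names separate words with underscores; drop them.
    return prefix ~ propertyName.replace("_", "");
}
```

For instance, propertyFunctionName("Simple_Uppercase_Mapping", false) gives "getSimpleUppercaseMapping", and propertyFunctionName("White_Space", true) gives "isWhiteSpace", matching the names in the function list.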
Jul 02 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
I agree that Phobos should only have one Unicode package - everything else would be a bad idea.

My hope is that I'll be able to integrate some of the current advantages of unichar (faster lookup, smaller footprint) with AJ's work so that they can apply to all Unicode functions. I'll know more about the possibilities once AJ releases the code generator.

I also have a feeling that it is not necessary to have as many separate modules as etc.unicode currently has (but since I have only glanced at etc.unicode I could be wrong). Since the linker will always throw out uncalled functions and unaccessed data (correct?) it should be possible to make it easier to use. The main thing we'd have to keep an eye on is that static module constructors do not pull in all the data and functions.

The function names could also use some tuning - right now they feel a little clunky (as do the unichar function names, of course). The main challenge here is making sure that people who know Unicode will still recognize the property names, and that the functions remain sufficiently different from utype/ctype to prevent confusion (since utype/ctype define quite a few properties with the same name in a different way).

Hauke
Jul 02 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cc3a30$1ni0$1 digitaldaemon.com>, Hauke Duden says...
> I agree that Phobos should only have one Unicode package - everything
> else would be a bad idea.
>
> My hope is that I'll be able to integrate some of the current advantages
> of unichar (faster lookup, smaller footprint) with AJ's work so that
> they can apply to all Unicode functions. I'll know more about the
> possibilities once AJ releases the code generator.

No problem. I'll do that soon, like within a week. You should definitely be given write access, and if you can make it better/faster/whatever that would certainly be great.

> I also have a feeling that it is not necessary to have as many separate
> modules as etc.unicode currently has (but since I have only glanced at
> etc.unicode I could be wrong).

As currently written, the codebuilder generates between two and four modules per Unicode property - one is just a wrapper to present a usable interface to humans; the others are purely robot-generated and (in general) will comprise one module per lookup table (but you still need a module even if there are zero lookup tables). You could certainly have a bash at reducing the number of modules per Unicode property, but it would be bad (in my opinion) to go below one module per property. That is the minimum you need.

I don't know if having many modules is actually a problem though. From the end users' point of view, they will still need to import ONE module, and link with ONE library, and the rest happens automatically. But yeah, if you can make that happen better or faster, great!

> Since the linker will always throw out
> uncalled functions and unaccessed data (correct?) it should be possible
> to make it easier to use.

No, the linker can only throw out whole unused modules. It cannot throw out /parts/ of modules. Therefore, each "optional thing" needs to be in its own module or modules.

> The main thing we'd have to keep an eye on is
> that static module constructors do not pull in all the data and functions.

There aren't any static module constructors. None at all. Zero. So keeping an eye on that should be fairly easy, I'd say.

> The function names could also use some tuning - right now they feel a
> little clunky (as do the unichar function names, of course). The main
> problem here is that people knowing Unicode will recognize the property
> names and that the functions will still be sufficiently different from
> utype/ctype to prevent confusion (since utype/ctype define quite
> properties with the same name in different way).

Yes, I agree. The problem is that the names are defined by the Unicode Consortium (though I did tweak them to match D style). They are the standard, official names for properties. I think we'd have to be quite imaginative to come up with any reasonable alternative.

Arcane Jill
Jul 02 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
>> Since the linker will always throw out
>> uncalled functions and unaccessed data (correct?) it should be possible
>> to make it easier to use.
> No, the linker can only throw out whole unused modules. It cannot throw out /parts/ of modules. Therefore, each "optional thing" needs to be in its own module or modules.

Hmmm. I ran a few tests and it seems that you're right. As soon as you specify a module to the linker it will be included fully in the executable, regardless of whether it is used or not. It is unfortunate that it isn't a little more sophisticated :(.

GDC should be able to do better (since GCC/G++ does for C/C++), but since the unicode lib should work well with all compilers it seems that your current approach is the only feasible one.

Hauke
Jul 03 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
This is incredible!

Hopefully I'll have some free time next week to check this out :).

Great work.

Hauke



Jun 27 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 Despite this space saving, the functions should still be pretty fast. The code
 for that uppercasing function consists of two if tests, a shift, a switch
 statement with seven cases, and a table lookup. And nothing else.
Hmmm. Why do you store each page separately with a manual switch for choosing the right one? A second lookup table should be a lot faster. You could also save some more cycles if you add a single page that contains only 0 values instead of returning a null pointer. That way you do not need to check for null every time you read a value.

But these are just minor points. This is a great piece of work.

Hauke
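A minimal sketch of the layout suggested in this post, assuming a 256-entry page size and made-up table contents: a first-level index maps each page number to a page, and every empty page points at one shared all-zero page, so the lookup needs neither a switch nor a null check. The names pages, pageIndex and toUpperTwoLevel are hypothetical, and only a-z is filled in.

```d
enum pageShift = 8;                       // assumed: 256 code points per page
enum pageCount = 0x110000 >> pageShift;   // 0x1100 pages cover all of Unicode

// pages[0] is the shared all-zero page; pages[1] holds the a-z deltas.
immutable short[256][2] pages = buildPages();

// First-level table: page number -> index into `pages`.
immutable ubyte[pageCount] pageIndex = buildIndex();

short[256][2] buildPages()
{
    short[256][2] p;                // everything starts out as 0
    foreach (i; 0x61 .. 0x7B)       // 'a' .. 'z'
        p[1][i] = -32;
    return p;
}

ubyte[pageCount] buildIndex()
{
    ubyte[pageCount] idx;           // default 0: the shared zero page
    idx[0] = 1;                     // code points 0x00-0xFF use pages[1]
    return idx;
}

dchar toUpperTwoLevel(dchar c)
{
    if (c > 0x10FFFF)               // keep the index lookup in bounds
        return c;
    immutable delta = pages[pageIndex[c >> pageShift]][c & 0xFF];
    return cast(dchar)(cast(int) c + delta);   // delta is 0 on the zero page
}

unittest
{
    assert(toUpperTwoLevel('q') == 'Q');
    assert(toUpperTwoLevel(0x4E00) == 0x4E00);
}
```

The price is the first-level index (one byte per page in this sketch) plus one zero page, which is the space side of the tradeoff discussed in the replies.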
Jun 27 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbneog$1cu3$1 digitaldaemon.com>, Hauke Duden says...
Arcane Jill wrote:
 Despite this space saving, the functions should still be pretty fast. The code
 for that uppercasing function consists of two if tests, a shift, a switch
 statement with seven cases, and a table lookup. And nothing else.
Hmmm. Why do you store each page separately with a manual switch for choosing the right one? A second lookup table should be a lot faster. You could also save some more cycles if you add a single page that contains only 0 values instead of returning a null-pointer. That way you do not need to check for null every time you read a value.
The old space/speed tradeoff. You can probably tweak this, once I make the codebuilder program public. There are various parameters which control decisions the robot makes, and right now those parameters are just constants. But we COULD change that so that more popular lookups get biased in favor of speed, while less popular lookups get biased in favor of minimal space.

In the current configuration, the robot decides that a big table full of zeroes is a waste of space compared with a test for null, and that a switch statement is acceptable so long as there are fewer than sixteen cases. But all these things are ultimately tweakable if we later decide to tweak them. If we do that, the robot will simply write different code (i.e. faster but bigger).

Personally, I think that the choices currently made are quite reasonable for most properties. There may be a case for speeding up uppercasing and a few others, but you still have to think in terms of how much RAM that would consume at runtime. Right now it's 5K for uppercasing, and another 5K for lowercasing. Unicode contains a *lot* of data, and I'm hesitant to give speed too high a priority here, for fear of everything getting huge.

Arcane Jill
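The codebuilder itself is not public, so the following is only a guess at what one such decision could look like. It assumes the two rules stated in this post (all-zero pages are dropped in favor of a null test, and a switch is used while there are fewer than sixteen cases); the names Tuning, Lookup, isAllZero and choose are invented for the sketch.

```d
// Which kind of page selection the generated code would use.
enum Lookup { switchOnPage, indexTable }

// The tweakable constant mentioned in the post, as a hypothetical parameter.
struct Tuning
{
    size_t maxSwitchCases = 16;
}

// A page of zeroes carries no information and can be replaced by null.
bool isAllZero(const short[] page)
{
    foreach (v; page)
        if (v != 0)
            return false;
    return true;
}

// Only non-empty pages need a switch case, so count those.
Lookup choose(const short[][] pages, Tuning t = Tuning.init)
{
    size_t nonEmpty = 0;
    foreach (p; pages)
        if (!isAllZero(p))
            ++nonEmpty;
    return nonEmpty < t.maxSwitchCases ? Lookup.switchOnPage
                                       : Lookup.indexTable;
}

unittest
{
    auto few = new short[][](3, 256);   // three pages, one of them non-empty
    few[1][0x61] = -32;
    assert(choose(few) == Lookup.switchOnPage);
}
```

Biasing a popular property toward speed would then amount to passing a smaller maxSwitchCases for that property and regenerating the source.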
Jun 28 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 In article <cbneog$1cu3$1 digitaldaemon.com>, Hauke Duden says...
 
Arcane Jill wrote:

Despite this space saving, the functions should still be pretty fast. The code
for that uppercasing function consists of two if tests, a shift, a switch
statement with seven cases, and a table lookup. And nothing else.
Hmmm. Why do you store each page separately with a manual switch for choosing the right one? A second lookup table should be a lot faster. You could also save some more cycles if you add a single page that contains only 0 values instead of returning a null-pointer. That way you do not need to check for null every time you read a value.
The old space/speed tradeoff. You can probably tweak this, once I make the codebuilder program public. There are various parameters which control decisions the robot makes, and right now those parameters are just constants. But we COULD change that so that more popular lookups get biased in favor of speed, while less popular lookups get biased in favor of minimal space. In the current configuration, the robot decides that a big table full of zeroes is a waste of space compared with a test for null, and that a switch statement is acceptable so long as there are fewer than sixteen cases. But all these things are ultimately tweakable if we later decide to tweak them. If we do that, the robot will simply write different code (i.e. faster but bigger).
Not necessarily bigger. If you add RLE compression it can even get smaller than what you have now. I'd love to take a look at the "robot" code. I have some things in mind that might improve on both speed and size. I'd like to see how easy it would be to integrate them into your current system.
 Personally, I think that the choices currently made are quite reasonable for
 most properties. There may be a case for speeding up uppercasing and a few
 others, but you still have to think in terms of how much RAM that would consume
 at runtime. Right now it's 5K for uppercasing, and another 5K for lowercasing.
 Unicode contains a *lot* of data, and I'm hesitant to give speed too high a
 priority here, for fear of everything getting huge.
We're on the same page here. Both sides need to be optimized but you have to find a good balance. Hauke
Jun 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbosl4$f9d$1 digitaldaemon.com>, Hauke Duden says...

Not necessarily bigger. If you add RLE compression it can even get 
smaller than what you have now.
You mentioned that before, but I'm not sure I agree. RLE is pretty much the /only/ one of your ideas that I didn't go with. You see, I take the view that hard disk space is plentiful, but RAM is not. With that perspective, compressing on disk, but decompressing into RAM, is /not/ a good thing to do. You might as well load it into RAM in the uncompressed state in the first place.
I'd love to take a look at the "robot" code. I have some things in mind 
that might improve on both speed and size. I'd like to see how easy it 
would be to integrate them into your current system.
I thought you might. Rest assured, you will be the /first/ person to get write access. I'll probably need to start a new project for it though. The codebuilder itself doesn't REALLY belong in Deimos, as it's not general purpose.
We're on the same page here. Both sides need to be optimized but you 
have to find a good balance.
The codebuilder is a good way to get that balance. Change a few constants, run it again and new source gets written reflecting the new balance. But fine tuning it is probably more your area of expertise. You seem to know more about this sort of stuff than I, anyway. Jill
Jun 28 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 In article <cbosl4$f9d$1 digitaldaemon.com>, Hauke Duden says...
 
 
Not necessarily bigger. If you add RLE compression it can even get 
smaller than what you have now.
You mentioned that before, but I'm not sure I agree. RLE is pretty much the /only/ one of your ideas that I didn't go with. You see, I take the view that hard disk space is plentiful, but RAM is not. With that perspective, compressing on disk, but decompressing into RAM, is /not/ a good thing to do. You might as well load it into RAM in the uncompressed state in the first place.
Heh. Sorry if I seem like I want to push my ideas onto you. I usually write these messages in a hurry and I'm often not sure if I have mentioned something before ;)

Regarding the size problem: for me the trade-off is not so much disk space against RAM usage but executable size against RAM usage. My concern is that if the executables get too big then people might not want to use the Unicode functions for some applications (and fall back to ASCII instead). For example, good setup software adds as little overhead as possible to the installed data. If all the Unicode stuff together amounts to 100K then that may already be too much.

You should also keep in mind that the executable is held in RAM as well, so increasing executable size to save RAM does not always give you an advantage.

Please also let me emphasize that I'm not advocating holding completely uncompressed tables in RAM. On the contrary: I think the layout you currently have is good for the uncompressed version. What I mean is not storing this data directly in the executable, but storing an RLE-compressed version and unpacking it into the current form at runtime.

The RLE'ed version should be quite a bit smaller (it reduces the mapping data in unichar to about a quarter). So RAM usage would rise to around 125% of what you have now (assuming that the other data packs similarly well; the extra 25% is the RLE-compressed copy of the data that stays in the executable). But executable size goes down to 25%. I think that's worth it.

Also keep in mind that we're talking about kilobyte sizes here. 500 KB of RAM is not much nowadays (my estimate for an application that uses just about everything), but downloading 500 KB more from the internet is very noticeable for modem users.
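To make the proposal concrete, here is a small sketch of run-length unpacking at startup. It is not taken from unichar or etc.unicode: the Run layout, the name unpack and the sample page (deltas for a-z only) are all assumptions made for the example.

```d
// One run: `count` consecutive entries that all hold `value`.
struct Run
{
    ushort count;
    short  value;
}

// Expand the packed runs into a flat table that can be indexed directly.
short[] unpack(const Run[] runs)
{
    size_t total = 0;
    foreach (r; runs)
        total += r.count;

    auto table = new short[total];
    size_t pos = 0;
    foreach (r; runs)
    {
        table[pos .. pos + r.count] = r.value;   // fill the whole run at once
        pos += r.count;
    }
    return table;
}

// A 256-entry page of uppercase deltas, stored as three runs:
// 0x61 zeroes, 26 entries of -32 for a-z, then 0x85 zeroes.
immutable Run[] packedPage00 = [Run(0x61, 0), Run(26, -32), Run(0x85, 0)];

unittest
{
    auto page = unpack(packedPage00);
    assert(page.length == 256);
    assert(page[0x61] == -32 && page[0x7A] == -32);
    assert(page[0x60] == 0 && page[0x7B] == 0);
}
```

The packed array stays in the executable image while the unpacked table is allocated on top of it, which is where the estimate of roughly 125% RAM for 25% executable size comes from. This toy page compresses far better than a real one would, so it says nothing about the actual ratio.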
I'd love to take a look at the "robot" code. I have some things in mind 
that might improve on both speed and size. I'd like to see how easy it 
would be to integrate them into your current system.
I thought you might. Rest assured, you will be the /first/ person to get write access. I'll probably need to start a new project for it though. The codebuilder itself doesn't REALLY belong in Deimos, as it's not general purpose.
Take your time - I'm curious, but I don't have much free time to spend on this anyway. I can wait :). Hauke
Jun 28 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbpp7k$1pgi$1 digitaldaemon.com>, Hauke Duden says...

You should also keep in mind that the executable is held in RAM as well, 
As well as what? The tables are directly contained in the RAM image of the executable. They are not duplicated or otherwise reconstructed. They are accessed in-place.
so increasing executable size to save RAM does not always give you an 
advantage.
Curiously, you seem to be arguing in favor of my position. Had we used RLE decompression, THEN we'd have to worry about the "as well". Jill
Jun 29 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 In article <cbpp7k$1pgi$1 digitaldaemon.com>, Hauke Duden says...
 
 
You should also keep in mind that the executable is held in RAM as well, 
As well as what?
As well as the data you store in explicitly allocated memory.
 The tables are directly contained in the RAM image of the executable. They are
 not duplicated or otherwise reconstructed. They are accessed in-place.
Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
so increasing executable size to save RAM does not always give you an 
advantage.
Curiously, you seem to be arguing in favor of my position. Had we used RLE decompression, THEN we'd have to worry about the "as well".
I don't think I understand what you mean. If I understood your last post correctly you didn't want to use RLE compression because disk space is cheap, but RAM is not. I am arguing that:

- executable size (=disk space) is more expensive than RAM if the file is downloaded from the internet
- RAM usage is increased only slightly (my rough estimate was 125% of the original space) but executable size is reduced significantly (down to 25%).

In an age where many programs are downloaded from the internet that is worth thinking about.

Hauke
Jun 29 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbrvnp$1uf8$1 digitaldaemon.com>, Hauke Duden says...
You should also keep in mind that the executable is held in RAM as well, 
As well as what?
As well as the data you store in explicitly allocated memory.
That would be zero.
 The tables are directly contained in the RAM image of the executable. They are
 not duplicated or otherwise reconstructed. They are accessed in-place.
Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
But clearly you do. If the compressed size is X, and the uncompressed size is Y, then storing the uncompressed table in the executable costs Y bytes of RAM. Decompressing at runtime costs (X+Y) bytes of RAM, since you can't un-allocate the X. Since X is not negative, it follows that Y will always be less than (X+Y)
I don't think I understand what you mean. If I understood your last post 
correctly you didn't want to use RLE compression because disk space is 
cheap, but RAM is not. I am arguing that:

- executable size (=disk space) is more expensive than RAM if the file 
is downloaded from the internet
That's what zip files are for. Besides which, I don't think my obj files actually do contain large arrays full of zeroes. Such zero blocks will all have been removed and replaced by null pointer returns. Or were you arguing that zero-blocks should be re-inserted?
- RAM usage is increased only slightly (my rough estimate was 125% of 
the original space) but executable size is reduced significantly (down 
to 25%).
I'm not arguing with that. I'm arguing in favor of not increasing RAM usage *AT ALL*. Like, not even slightly.
In an age where many programs are downloaded from the internet that is 
worth thinking about.
Zip files use much better compression than simple RLE. I say zip 'em. Jill
Jun 29 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 In article <cbrvnp$1uf8$1 digitaldaemon.com>, Hauke Duden says...
 
You should also keep in mind that the executable is held in RAM as well, 
As well as what?
As well as the data you store in explicitly allocated memory.
That would be zero.
The tables are directly contained in the RAM image of the executable. They are
not duplicated or otherwise reconstructed. They are accessed in-place.
Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
But clearly you do. If the compressed size is X, and the uncompressed size is Y, then storing the uncompressed table in the executable costs Y bytes of RAM. Decompressing at runtime costs (X+Y) bytes of RAM, since you can't un-allocate the X. Since X is not negative, it follows that Y will always be less than (X+Y)
In this particular case, yes, as I stated in my post. I just wanted to emphasize that moving data into statically compiled arrays (as opposed to dynamic ones) doesn't automatically reduce RAM usage.
I don't think I understand what you mean. If I understood your last post 
correctly you didn't want to use RLE compression because disk space is 
cheap, but RAM is not. I am arguing that:

- executable size (=disk space) is more expensive than RAM if the file 
is downloaded from the internet
That's what zip files are for.
What about installers and self-extractors? You do not ZIP those because they ARE the ZIP file (in a manner of speaking). I'd like to be able to write such applications in D.

Besides, half of my reasoning is that even if D executables could be compressed better than C++ ones, many people would still compare their size in uncompressed form. A similar thing happened to C++: C++ executables usually compress better than C ones (templates and exception handling create lots of similar code), yet C++ is often said to be the "bloat king" among languages.

I just don't want people to shun the Unicode routines because of the size difference, even if it may not have such a big impact on the end result as they might think.
 Besides which, I don't think my obj files actually do contain large arrays full
 of zeroes. Such zero blocks will all have been removed and replaced by null
 pointer returns. Or were you arguing that zero-blocks should be re-inserted?
RLE doesn't just pack zero arrays. Unicode contains lots of ranges with the same values.
- RAM usage is increased only slightly (my rough estimate was 125% of 
the original space) but executable size is reduced significantly (down 
to 25%).
I'm not arguing with that. I'm arguing in favor of not increasing RAM usage *AT ALL*. Like, not even slightly.
As I said, I think 100 KB of extra RAM usage is a lot better than 400 KB of increased executable size. Especially for a garbage collected language that will always use more RAM than strictly necessary. Hauke
Jun 29 2004
"Martin M. Pedersen" <martin moeller-pedersen.dk> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:cbrvnp$1uf8> >
The tables are directly contained in the RAM image of the executable. They
are
 not duplicated or otherwise reconstructed. They are accessed in-place.
Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
Are you sure about that? I would expect individual pages to be loaded on demand. Regards, Martin
Jun 29 2004
Hauke Duden <H.NS.Duden gmx.net> writes:
Martin M. Pedersen wrote:
 "Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:cbrvnp$1uf8> >
 The tables are directly contained in the RAM image of the executable. They
 are
 
not duplicated or otherwise reconstructed. They are accessed in-place.
Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
Are you sure about that? I would expect individual pages to be loaded on demand.
I'm pretty sure about it, but not 100% sure. I have frequently observed in the past that RAM usage increases by the size of a DLL as soon as it is loaded. Hauke
Jun 30 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbsj6s$2s9m$1 digitaldaemon.com>, Martin M. Pedersen says...
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message news:cbrvnp$1uf8> >
The tables are directly contained in the RAM image of the executable. They
are
 not duplicated or otherwise reconstructed. They are accessed in-place.
Yes, but they ARE in RAM. My point was that you don't save RAM if you put the data in the executable instead of an explicitly allocated memory block.
Are you sure about that? I would expect individual pages to be loaded on demand. Regards, Martin
Once upon a time, I did just that. I had to write some Unicode stuff for my employer a few years back, and I adopted exactly that "load on demand" approach. It is with this hindsight that I now believe it to have been a bad idea (although with some modification, it might be a good idea in a DLL).

The basic load-on-demand approach is this: The user calls something like isUppercase(c); The library evaluates (c >> N) (for some N) to get what we might loosely call a "page" number. Then it says to itself, "Is this page cached?". If so, use the in-RAM table to look up the answer; if not, load the page from disk, decompress it into RAM, cache it so we don't have to do all that again, and THEN look up the value. Plus, you'd have to do this every time you re-ran your application, which might matter for some very small applications.

At the time, it seemed like this was quite a promising approach, but it had a huge number of drawbacks. For one thing, neither C++ nor D has any concept of resource files (essentially a Java concept), so, in order to load anything off disk, YOU FIRST HAVE TO FIND IT. This means either a DLL (per "page"? per property?), or you have to read an environment variable to tell you where to look. Requiring users to set an environment variable just to get isUppercase() working is not desirable. For another thing, the extra code you have to go through at runtime to answer the question "is it cached?" is itself a few extra cycles.

I took a different approach this time round. In this new approach, there are two important principles:

(1) The source code shall be written by robot. This protects us against Unicode itself being updated (and it is /constantly/ being updated), and it also allows for some SERIOUS optimization, because of course a robot can try many, many different optimization strategies and, using sheer brute force, pick the best.

(2) Each property shall be in its own object module, so that you do not link in those properties which you do not need.
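For comparison, the runtime load-on-demand scheme described earlier in this post (shift to get a page number, check the cache, load on a miss, then look up) might be sketched as below. This is an illustration only, not the code from that earlier project: PageCache, Loader and the 256-entry page size are invented, and the loader is a plain delegate standing in for "find the file, read it, decompress it".

```d
enum pageShift = 8;                 // assumed: 256 code points per page
enum pageMask  = 0xFF;

// Stand-in for locating, reading and decompressing one page from disk.
alias Loader = bool[] delegate(uint pageNo);

struct PageCache
{
    Loader load;
    bool[][uint] cached;            // page number -> decompressed page

    bool lookup(dchar c)
    {
        immutable pageNo = c >> pageShift;
        auto hit = pageNo in cached;            // the "is this page cached?" test
        if (hit is null)
        {
            cached[pageNo] = load(pageNo);      // slow path, once per page
            hit = pageNo in cached;
        }
        return (*hit)[c & pageMask];
    }
}

unittest
{
    int loads = 0;
    auto isUpper = PageCache((uint pageNo) {
        ++loads;
        auto page = new bool[256];
        if (pageNo == 0)
            page[0x41 .. 0x5B] = true;          // A-Z only, for the example
        return page;
    });

    assert(isUpper.lookup('A'));
    assert(!isUpper.lookup('a'));
    assert(loads == 1);                         // second call hit the cache
}
```

Every call pays for the cache test, and the loader still has to find its data somewhere, which are the two drawbacks named in the post.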
This is load-on-demand in a sense, but it's COMPILE-TIME load-on-demand, which is better (in my opinion) and the split is in a different direction (property-required rather than codepoint-range).

The new approach makes complete sense when you realize just how small things get. That isUppercase() function generates a linkable object module which is a measly 3420 /bytes/ in size, and does not depend on (pull in) anything else. I mean - come on guys - just 3K! Are we REALLY saying that's too much? Plus, as a bonus, you get isUppercase() data for all characters, so there is no bias toward any particular subrange.

And in any case, I didn't get that 3K figure by measuring the size of the generated tables, I got it by compiling an object module and typing dir at the command line to see how big it ended up. For reference, the following do-nothing functions: by the same standard comes out at 232 bytes. However, this measure overestimates, because, even if it isn't inlined, there are some linker symbols in there that will be discarded when constructing an executable. So even these small estimates are overestimates. I think you would be hard-pressed to do better than my robot.

The good news, however, is that nobody has to agree with me. (I guess it was too much to hope for that everyone would). Because, pretty soon, I'm going to make the codebuilder robot open source, once I've added a few more tweaks and got a few other things sorted out. That means that if anyone can come up with a better strategy than I, they would be perfectly welcome to take a branch of the codebuilder source tree and modify it to do something else. Then we could run various efficiency tests to compare all versions. You could do this to see if your load-on-demand-by-codepoint-range idea was feasible; Hauke could do it to see if his RLE encoding idea works out better than what I've done, and without doubt, the most efficient one is the one we'll keep (though I'm not sure how you define efficient).
In any case, I'm putting my next efforts into (a) fixing some bugs in Int, and (b) making Deimos easy to download and use. This sort of feature modification you suggest is not on my agenda in the near future, because, although there ARE a few more functions I need to add to etc.unicode, I've basically achieved what I set out to achieve, and I'm happy with it, and pretty soon I'm going to be keen to get back to my crypto stuff.

I hope that helps.

Arcane Jill

Jun 30 2004
"Martin M. Pedersen" <martin moeller-pedersen.dk> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cbu6ru$2nm4$1 digitaldaemon.com...
Are you sure about that? I would expect individual pages to be loaded on
demand.
Once upon a time, I did just that. I had to write some Unicode stuff for
my
 employer a few years back, and I adopted exactly that "load on demand"
approach.
 It is with this hindsight that I now believe it to have been a bad idea
 (although with some modification, it might be a good idea in a DLL).
What I meant was load-on-demand implemented by the operating system. I think you have done a great job, and also in this respect made the right decision :-)

When a modern operating system starts executing a program, it does not actually load the program into RAM. Instead, it sets up page tables, and uses the page-fault mechanism of the CPU to implement load-on-demand for the code segment. For example, when the very first instruction of the program is to be executed, it generates a page fault, and it is at this time that the very first page is loaded into RAM. At least, this is my understanding.

Static, constant tables like the ones in the Unicode library would be - or can be - covered by the same mechanism, meaning that you only pay for what you use. A decompression scheme at application level would mean that you would always pay for everything. It is better to leave that kind of stuff to the operating system, and mechanisms for doing that have existed for a long time, although they might not be universally available. NTFS has built-in support for compression (LZNT1, an LZ77 variant), and it has been a long time since "Stacker" was invented. For distribution, we have de facto standards such as zip.

Regards,
Martin
Jun 30 2004
Sam McCall <tunah.d tunah.net> writes:
Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and 
  simple folding/casing atm, I'll learn what the others do later ;-)
The unicode stuff isn't in the library in subversion AFAICS though.
(Actually, I can't get subversion to check out properly, I've been using 
the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as 
the location in TortoiseSVN, does that look right?)
Sam
Jun 30 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cbu5r3$2kpj$1 digitaldaemon.com>, Sam McCall says...
Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and 
  simple folding/casing atm, I'll learn what the others do later ;-)
The unicode stuff isn't in the library in subversion AFAICS though.
It is, kind of. At least the source code is there, at http://svn.dsource.org/svn/projects/deimos/trunk/etc/unicode/. But what we really NEED is a downloadable pre-built library. That's the part that's currently missing.
(Actually, I can't get subversion to check out properly, I've been using 
the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as 
the location in TortoiseSVN, does that look right?)
Sam
Er - I hope someone else can answer that...? Jill
Jun 30 2004
Sam McCall <tunah.d tunah.net> writes:
Arcane Jill wrote:
 In article <cbu5r3$2kpj$1 digitaldaemon.com>, Sam McCall says...
 
Hi, this is really impressive! (Okay, so I'm only using isWhiteSpace and 
 simple folding/casing atm, I'll learn what the others do later ;-)
The unicode stuff isn't in the library in subversion AFAICS though.
It is, kind of. At least the source code is there, at http://svn.dsource.org/svn/projects/deimos/trunk/etc/unicode/. But what we really NEED is a downloadable pre-built library. That's the part that's currently missing.
Sorry, yeah, that's what I meant. I built one here with (from trunk/)

    for /R etc %f in (*.d) do dmd -c -release %f
    lib -c deimos.lib age.obj
    for %f in (*.obj) do lib deimos.lib %f

(I'm not at all familiar with this stuff, so this may well be the Wrong Way; in particular the age.obj thing is a hack because I can't seem to get the lib tool to add an obj, creating the library if it doesn't exist).

Then I hit the problem that I couldn't use most of the functions, due to unknown symbol errors. Sure enough, most of the functions weren't in the library. I changed the ones I wanted to use to "public" in the source, and that worked. I'm not sure if this is the right fix.

Sam
 
 
 
(Actually, I can't get subversion to check out properly, I've been using 
the HTTP gateway. I tried http://svn.dsource.org/svn/projects/deimos as 
the location in TortoiseSVN, does that look right?)
Sam
Er - I hope someone else can answer that...? Jill
Jun 30 2004