
digitalmars.D - Internationalization library - advice/help

"Uwe Salomon" <post uwesalomon.de> writes:
During the writing of a string class for my Indigo library i "discovered"  
the need for a thorough internationalization library for D. I think a good  
implementation of i18n functionality would be very important for the  
development of applications in D, thus for the future of D. There is the  
ICU port of the Mango tree, but as ICU is a C/C++ library, this is not as  
natural and fast as it could be. I would like to write a native D i18n  
library which is independent of third party libraries.

As this is too big a project to develop by myself, and (i hope) of public  
interest for the D community, i would like to ask for:

- Advice: What is needed? How should it be implemented?
- Help: Who has the time and wants to help me? A total of 2 or 3  
developers should be sufficient?

My ideas are to write a compact core library that contains the most  
important features (character properties, UTF encodings, basic message  
translation), and then write some localization modules (number formatting,  
date formatting, comparing and searching). The goals should be simplicity  
and speed (but perhaps the community wants other things more?), avoiding  
complicated implementations and "template magic". And it should be well  
documented from the beginning, not a construction site on every corner.

But those are just some ideas that come to my mind right now. I hope that  
everybody makes some helpful statements about what he/she thinks the  
library should cover in any case, and what would be very nice to have.

Thanks & ciao
uwe
May 15 2005
"Andrew Fedoniouk" <news terrainformatica.com> writes:
Good idea, I like it.

FYI: On Windows MultiByteToWideChar and WideCharToMultiByte
support many encodings other than mentioned directly in MSDN.
I am using this list:

// (lang_t is a simple pair of encoding name and codepage id, e.g.
//  struct lang_t { const char* name; unsigned codepage; }; -- the
//  definition was not included in the original post.)
lang_t langs[] = {
    {"asmo-708",708},
    {"dos-720",720},
    {"iso-8859-6",28596},
    {"x-mac-arabic",10004},
    {"windows-1256",1256},
    {"ibm775",775},
    {"iso-8859-4",28594},
    {"windows-1257",1257},
    {"ibm852",852},
    {"iso-8859-2",28592},
    {"x-mac-ce",10029},
    {"windows-1250",1250},
    {"euc-cn",51936},
    {"gb2312",936},
    {"hz-gb-2312",52936},
    {"x-mac-chinesesimp",10008},
    {"big5",950},
    {"x-chinese-cns",20000},
    {"x-chinese-eten",20002},
    {"x-mac-chinesetrad",10002},
    {"cp866",866},
    {"iso-8859-5",28595},
    {"koi8-r",20866},
    {"koi8-u",21866},
    {"x-mac-cyrillic",10007},
    {"windows-1251",1251},
    {"x-europa",29001},
    {"x-ia5-german",20106},
    {"ibm737",737},
    {"iso-8859-7",28597},
    {"x-mac-greek",10006},
    {"windows-1253",1253},
    {"ibm869",869},
    {"dos-862",862},
    {"iso-8859-8-i",38598},
    {"iso-8859-8",28598},
    {"x-mac-hebrew",10005},
    {"windows-1255",1255},
    {"x-ebcdic-arabic",20420},
    {"x-ebcdic-cyrillicrussian",20880},
    {"x-ebcdic-cyrillicserbianbulgarian",21025},
    {"x-ebcdic-denmarknorway",20277},
    {"x-ebcdic-denmarknorway-euro",1142},
    {"x-ebcdic-finlandsweden",20278},
    {"x-ebcdic-finlandsweden-euro",1143},
    {"x-ebcdic-france-euro",1147},
    {"x-ebcdic-germany",20273},
    {"x-ebcdic-germany-euro",1141},
    {"x-ebcdic-greekmodern",875},
    {"x-ebcdic-greek",20423},
    {"x-ebcdic-hebrew",20424},
    {"x-ebcdic-icelandic",20871},
    {"x-ebcdic-icelandic-euro",1149},
    {"x-ebcdic-international-euro",1148},
    {"x-ebcdic-italy",20280},
    {"x-ebcdic-italy-euro",1144},
    {"x-ebcdic-japaneseandkana",50930},
    {"x-ebcdic-japaneseandjapaneselatin",50939},
    {"x-ebcdic-japaneseanduscanada",50931},
    {"x-ebcdic-japanesekatakana",20290},
    {"x-ebcdic-koreanandkoreanextended",50933},
    {"x-ebcdic-koreanextended",20833},
    {"cp870",870},
    {"x-ebcdic-simplifiedchinese",50935},
    {"x-ebcdic-spain",20284},
    {"x-ebcdic-spain-euro",1145},
    {"x-ebcdic-thai",20838},
    {"x-ebcdic-traditionalchinese",50937},
    {"cp1026",1026},
    {"x-ebcdic-turkish",20905},
    {"x-ebcdic-uk",20285},
    {"x-ebcdic-uk-euro",1146},
    {"ebcdic-cp-us",37},
    {"x-ebcdic-cp-us-euro",1140},
    {"ibm861",861},
    {"x-mac-icelandic",10079},
    {"x-iscii-as",57006},
    {"x-iscii-be",57003},
    {"x-iscii-de",57002},
    {"x-iscii-gu",57010},
    {"x-iscii-ka",57008},
    {"x-iscii-ma",57009},
    {"x-iscii-or",57007},
    {"x-iscii-pa",57011},
    {"x-iscii-ta",57004},
    {"x-iscii-te",57005},
    {"euc-jp",51932},
    {"iso-2022-jp",50220},
    {"iso-2022-jp",50222},
    {"csiso2022jp",50221},
    {"x-mac-japanese",10001},
    {"shift_jis",932},
    {"ks_c_5601-1987",949},
    {"euc-kr",51949},
    {"iso-2022-kr",50225},
    {"johab",1361},
    {"x-mac-korean",10003},
    {"iso-8859-3",28593},
    {"iso-8859-15",28605},
    {"x-ia5-norwegian",20108},
    {"ibm437",437},
    {"x-ia5-swedish",20107},
    {"windows-874",874},
    {"ibm857",857},
    {"iso-8859-9",28599},
    {"x-mac-turkish",10081},
    {"windows-1254",1254},
    //{(const char *)L"unicode",1200},
    //{"unicodefffe",1201},
    {"utf-7",65000},
    {"utf-8",65001},
    //{"us-ascii",20127},
    {"us-ascii",1252},
    {"windows-1258",1258},
    {"ibm850",850},
    {"x-ia5",20105},
    {"iso-8859-1",1252}, //was 28591
    {"macintosh",10000},
    {"windows-1252",1252},
    {"system",CP_ACP}
  };

The second member in these structs is the codepage id used directly as
the first parameter of MultiByteToWideChar and WideCharToMultiByte.

Hope this will help. At least it might help to build
translation tables automatically :)
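For what it's worth, a table like the one above lends itself to a simple name lookup. Here is a minimal C sketch under the assumption that lang_t is just a name/codepage pair; the struct layout and the codepage_for helper are my guesses, not code from this post:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed layout of lang_t: an encoding name plus the Windows codepage id
   that MultiByteToWideChar/WideCharToMultiByte take as first parameter. */
typedef struct { const char *name; unsigned codepage; } lang_t;

/* A few entries from the table above, for illustration. */
static const lang_t langs[] = {
    {"utf-8", 65001},
    {"koi8-r", 20866},
    {"windows-1252", 1252},
};

/* Find the codepage id for a MIME-style encoding name; 0 if unknown. */
static unsigned codepage_for(const char *name)
{
    for (size_t i = 0; i < sizeof(langs) / sizeof(langs[0]); ++i)
        if (strcmp(langs[i].name, name) == 0)
            return langs[i].codepage;
    return 0;
}
```

On Windows the result would then be passed straight to MultiByteToWideChar; on other platforms the same table could drive a mapping onto iconv encoding names instead.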

Andrew.




"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqtvopik6yjbe6 sandmann.maerchenwald.net...
 [snip]
May 15 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 FYI: On Windows MultiByteToWideChar and WideCharToMultiByte
 support many encodings other than mentioned directly in MSDN.
Hmm, thanks for that. As libiconv is no standard for Windows :) this will
come in handy.

Is there anyone who knows about encoding/decoding (and programming
specialties in general) on the Mac? Regrettably, i don't know a thing
about the Mac programming environment at all. :(

uwe
May 15 2005
Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Uwe Salomon schrieb am Sun, 15 May 2005 19:47:03 +0200:
 During the writing of a string class for my Indigo library i "discovered"  
 the need for a thorough internationalization library for D. I think a good  
 implementation of i18n functionality would be very important for the  
 development of applications in D, thus for the future of D. There is the  
 ICU port of the Mango tree, but as ICU is a C/C++ library, this is not as  
 natural and fast as it could be. I would like to write a native D i18n  
 library which is independent of third party libraries.
[snip]

some links:

http://www.i18ngurus.com/
http://www.openi18n.org/
http://java.sun.com/j2se/corejava/intl/
http://doc.trolltech.com/3.3/i18n.html

Thomas
May 17 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 some links:
[snip]

These are very good and informative, thanks a lot!

uwe
May 17 2005
Lars Ivar Igesund <larsivar igesund.net> writes:
Uwe Salomon wrote:

 some links:
 [snip]
 These are very good and informative, thanks a lot!
 uwe
Also, look at http://i18n.kde.org and
http://developer.kde.org/documentation/library/kdeqt/kde3arch/kde-i18n-howto.html/

While KDE is based on Qt, it seems like they've expanded on the
functionality, especially the part that has to do with translations of
messages and gui.

Lars Ivar Igesund
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 While KDE is based on Qt, it seems like they've expanded on the
 functionality, especially the part that has to do with translations of
 messages and gui.
Hmm, they are using GNU gettext() instead of the Qt tr(). Perhaps it would
be a good idea to go at least one of these ways, instead of inventing
something totally new. I like the KDE markup i18n("String to translate").
If i used that, all the existing tools (KBabel, Emacs PO mode) as well as
string extractors and friends would be available already. But it would
make the lib dependent on GNU gettext(), or i would have to write my own
.mo reader. gettext() is nonstandard for Windows, right?

Please, would anybody be so kind and explain to me how translation of user
messages works under Windows (roughly)? I remember them using resource
files. Does the application load the right resource file at runtime? And
how does it work for the Mac?

Thanks for the help!
uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
This is a first implementation for conversion between UTF encodings. I  
used UTF-8 <=> UTF-16 as an example. In sum, this is what i thought of:


char[] toUtf8(wchar[] str, inout size_t eaten, char[] buffer);
char[] toUtf8(wchar[] str, inout size_t eaten);
char[] toUtf8(wchar[] str, char[] buffer);
char[] toUtf8(wchar[] str);

* The first function converts str into UTF-8, beginning at str[eaten],  
adjusting eaten up to where it converted (stopping before an incomplete  
sequence at the end of str), and using buffer if large enough,  
reallocating the buffer if space is not sufficient. It throws an exception  
if faced with invalid input encoding.

* The second function allocates a sufficient buffer itself.

* The third function converts str as a whole, asserting on an incomplete  
sequence at the end of str. It uses buffer if possible.

* The fourth function does like the third, and allocates the buffer itself.

* For every function there is a variant called fast_toUtf8() with the same  
parameters which relies on valid input, producing invalid output  
otherwise. It can be used if the input is guaranteed to be valid, and is  
much faster.
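To make the contract of the first overload concrete, here is a hedged C sketch of the UTF-16 -> UTF-8 direction. The name to_utf8 and the caller-provided output buffer are illustrative only; Uwe's actual function grows its buffer and throws on invalid input, which this sketch does not do:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 units str[*eaten .. len) to UTF-8, stopping before an
   incomplete surrogate pair at the end of the input.  *eaten is advanced
   to the first unconverted unit; the byte count written to out is
   returned.  out is assumed large enough (3 bytes per input unit covers
   the worst case), and the input is assumed otherwise well-formed. */
static size_t to_utf8(const uint16_t *str, size_t len, size_t *eaten,
                      char *out)
{
    size_t o = 0, i = *eaten;
    while (i < len) {
        uint32_t c = str[i];
        if (c >= 0xD800 && c < 0xDC00) {   /* high surrogate */
            if (i + 1 >= len)
                break;                     /* incomplete sequence: stop */
            c = 0x10000 + ((c - 0xD800) << 10) + (str[i + 1] - 0xDC00);
            i += 2;
        } else {
            i += 1;
        }
        if (c < 0x80) {                    /* 1-byte sequence */
            out[o++] = (char)c;
        } else if (c < 0x800) {            /* 2-byte sequence */
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {          /* 3-byte sequence */
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {                           /* 4-byte sequence */
            out[o++] = (char)(0xF0 | (c >> 18));
            out[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    *eaten = i;
    return o;
}
```

The second and fourth overloads would wrap this with their own allocation; the third would additionally assert that *eaten reached len.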


For more explanations and a coding example visit:
http://www.uwesalomon.de/code/unicode/files/conversion-d.html

The source is at
http://www.uwesalomon.de/code/unicode/conversion.d


This is a draft, and i will be very happy if everyone who is interested  
comments on it, especially the API "design" (i know, fast_toUtf8() is a  
clumsy name :). And another question (i hope this is not arrogant): should  
these functions (or especially the simple form, without eaten) be included  
into Phobos std.utf? They are *much* faster than the current  
implementation. If someone would say, "Nice stuff, kiddo. Debug that  
properly, adjust it to the std.utf module (use their exception etc.) and  
submit a patch. Perhaps we will look at it then." i would sure do that.  
:)  But i am afraid that this kind of guerrilla action is rather  
unwanted, and i had better keep my mouth shut and code some useful  
stuff...

Thanks
uwe
May 18 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqyw3zok6yjbe6 sandmann.maerchenwald.net...
 [snip]
Speeding up std.utf would be good - how can one argue with that? :-)
Three thoughts come to mind:

1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked
to indicate to the user that it's not just a faster version of another
routine (since I'd call fast_foo over foo every time!) but one that makes
significant assumptions about the input. I'm not actually sure how often
it would be ok to call such a function anyway so maybe it isn't even
needed. Getting the wrong answer quickly is not a good trade-off.

2) it looks like you reallocate the output buffer inside the loop - can
it be moved to outside?

3) the formatting of the source code is somewhat unusual. I missed the
loop at first:

   // Now do the conversion.
   if (pIn < endIn)
     do
     {
       // Check for enough space left in the buffer.
       if (pOut >= endOut)
   [snip 50 lines of code or so]
     } while (++pIn < endIn);

On that first line my eye skipped right over the "do", and I had to
backtrack once I saw a "while" down at the bottom.
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked
Yes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(), regrettably that is very long.
 I'm not actually sure how often it
 would be ok to call such a function anyway so maybe it isn't even needed.
 Getting the wrong answer quickly is not a good trade-off.
You are right, that is an important fact, especially for a standard
library. Easy test: i converted a german email (mostly ASCII, some
special characters) with 5000 characters from UTF8 to UTF16. I provided
the buffer, because both functions are equally good at allocating memory.

Normal compilation:
   * safe function: 0.100 ms
   * unsafe function: 0.088 ms (12% faster)

Compilation -release -O:
   * safe function: 0.050 ms
   * unsafe function: 0.046 ms (8% faster)

I am not sure how all this could benefit from an assembler
implementation. Anyway, the speed gain is minimal (actually, i thought it
would be a lot more!). Well, no need to search for a good "unsafe" name
then. ;)
 2) it looks like you reallocate the output buffer inside the loop - can  
 it be moved to outside?
Why? To shorten the loop? I thought the buffer should only be reallocated
if the conversion itself shows it is too short. Do you want to move it
before the loop (so that a reallocation *cannot* occur inside it), or
just outside (with a goto SomeWhereOutsideTheLoop and, after the
reallocation, a goto BackIntoTheLoop)?
 3) the formatting of the source code is somewhat unusual. I missed the  
 loop at first.
Changed. Thanks for the reply, uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 Normal compilation:
    * safe function: 0.100 ms
    * unsafe function: 0.088 ms (12% faster)

 Compilation -release -O:
    * safe function: 0.050 ms
    * unsafe function: 0.046 ms (8 % faster)
Maybe i should add that if you convert a text which contains a lot of
UTF8 2/3-byte encodings (asian languages), the unsafe function saves
more: about 20% in comparison to the safe function.

uwe
May 18 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqy2lzec6yjbe6 sandmann.maerchenwald.net...
 1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked
Yes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(), regrettably that is very long.
 I'm not actually sure how often it
 would be ok to call such a function anyway so maybe it isn't even needed.
 Getting the wrong answer quickly is not a good trade-off.
 You are right, that is an important fact, especially for a standard
 library. [snip benchmark results]
I could see using the unsafe versions when you check the input once and then convert many slices that one then knows to be safe. So it isn't unreasonable to have it in there. I don't know the use cases well enough to offer up an opinion.
 2) it looks like you reallocate the output buffer inside the loop - can 
 it be moved to outside?
 Why? To shorten the loop? I thought the buffer should only be
 reallocated if the conversion itself shows it is too short. Do you want
 to move it before the loop (so that a reallocation *cannot* occur inside
 it), or just outside (with a goto SomeWhereOutsideTheLoop and, after the
 reallocation, a goto BackIntoTheLoop)?
How about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.
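Ben's grow-once idea boils down to allocating from a worst-case estimate up front instead of growing in small steps. A small C sketch of the bounds involved (the helper names are mine, not from the thread): n UTF-8 bytes can never decode to more than n UTF-16 code units, and n UTF-16 code units never need more than 3*n UTF-8 bytes, since a 4-byte UTF-8 sequence always corresponds to a surrogate pair of two code units.

```c
#include <assert.h>
#include <stddef.h>

/* Worst-case output size when decoding UTF-8 to UTF-16: each input byte
   becomes at most one UTF-16 code unit (the ASCII case), so n units
   always suffice for n bytes. */
static size_t utf16_units_bound(size_t utf8_bytes_left)
{
    return utf8_bytes_left;
}

/* Worst-case output size when encoding UTF-16 to UTF-8: a single code
   unit expands to at most 3 bytes; a surrogate pair (2 units) expands to
   4 bytes, which is still within 3 bytes per unit. */
static size_t utf8_bytes_bound(size_t utf16_units_left)
{
    return 3 * utf16_units_left;
}
```

Allocating these bounds once before the loop removes the need for any reallocation check inside it, at the cost of possibly over-allocating for mostly-ASCII input.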
 3) the formatting of the source code is somewhat unusual. I missed the 
 loop at first.
Changed. Thanks for the reply, uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 I could see using the unsafe versions when you check the input once and
 then convert many slices that one then knows to be safe. So it isn't
 unreasonable to have it in there. I don't know the use cases well enough
 to offer up an opinion.
Imagine a program that reads a lot of files from disk, does some fuzzy
work on them, and writes some others back, for example a doc tool. It
reads the source files in UTF8 format and converts them to the internally
used UTF16 (using the safe functions). It then processes here and there,
extracts the comments and formats them round. After that it puts out HTML
files in UTF8. The comments need to be converted back to UTF8, and that's
where the program could use the unsafe functions.

At least those were my thoughts. But if the speed gain is under 30%, i
think the fast versions are unnecessary. Imagine the doc tool needs a
minute for output. With the current functions this would drop to 50
seconds at most, provided that the output consists only of UTF conversion
(which is very unlikely).
 2) it looks like you reallocate the output buffer inside the loop - can
 it be moved to outside?
Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows it is too short. Do you want to move it before (so that a reallocation *cannot* occure inside the loop), or just outside (with a goto SomeWhereOutsideTheLoop and after the reallocation goto BackIntoTheLoop)?
How about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.
Hmm, the current source is:

   if (pOut >= endOut)
   {
     // ...
     buffer.length = buffer.length + (endIn - pIn) + 2; // Will be enough.
     // ...
   }

This will grow the buffer only once? (endIn - pIn) is the number of UTF8
characters to be processed, and they cannot expand to more than the same
amount of UTF16 characters (1-byte encoded UTF8 becomes 1-word encoded
UTF16, 4-byte encoded UTF8 becomes 2-word encoded UTF16). The same goes
for toUtf8().

But you are right, this could still be moved before the loop, especially
this one in toUtf16(). That's because (endIn - pIn) is a very accurate
guess for languages with a lot of ASCII in them.

Ciao
uwe
May 18 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqy4o0ud6yjbe6 sandmann.maerchenwald.net...
 I could see using the unsafe versions when you check the input once and
 then convert many slices that one then knows to be safe. So it isn't
 unreasonable to have it in there. I don't know the use cases well enough
 to offer up an opinion.
 Imagine a program that reads a lot of files from disk, does some fuzzy
 work on them, and writes some others back, for example a doc tool.
 [snip]
sounds reasonable
 2) it looks like you reallocate the output buffer inside the loop - can
 it be moved to outside?
[snip]
Hmm, the current source is: [snip] This will grow the buffer only once?
(endIn - pIn) is the number of UTF8 characters to be processed, and they
cannot expand to more than the same amount of UTF16 characters. The same
goes for toUtf8().
ok - I didn't look at the details. I just saw the resizing happening in the loop and guessed it was resizing a little bit each time. What you have seems reasonable.
 But you are right, this could still be moved before the loop, especially 
 this one in toUtf16(). That's because (endIn - pIn) is a very accurate 
 guess for languages with a lot of ASCII in them.

 Ciao
 uwe 
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 This will grow the buffer only once? (endIn - pIn) is the number of UTF8
 characters to be processed, and they cannot expand to more than the same
 amount of UTF16 characters (1-byte encoded UTF8 becomes 1-word encoded
 UTF16, 4-byte encoded UTF8 becomes 2-word encoded UTF16). The same goes
 for toUtf8().
ok - I didn't look at the details. I just saw the resizing happening in the loop and guessed it was resizing a little bit each time. What you have seems reasonable.
Still you are right. I moved it out of the loop in toUtf16(). I will
think about it in the other functions, not sure what is best in each case
(well, it always depends on the characters in the string).

I am now writing the other 4 functions (that is much easier now, as the
two were the most complex). After finishing and testing them, i'll beep
again. :)

By the way... how are the Phobos docs generated? Hand-crafted? I will
also update the corresponding sections if you let me...

Ciao
uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
I have now moved the UTF conversion code into the std.utf module. I have  
made the following changes:

* The tabs are now spaces. Sorry... :)
* Slight change in the UTF8stride array. Unicode 4.0.1 declares some  
encodings illegal, including 5- and 6-byte encodings and some at the  
beginning of the 2-byte range.
* Slight change in stride(wchar) and toUTFindex(wchar) and  
toUCSindex(wchar). I just changed the detection of UTF16 surrogate values  
to a faster variant that does not need a local variable as well.
* Replacement of all toUTF() functions, except the ones where the return  
type has the same encoding as the parameter, which only validate.  
toUTF16z() is still there as well, but changed to use my own toUTF16 (it  
zero-terminates the strings anyway).
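As an aside, one plausible shape for a surrogate test "without a local variable" is a single mask-and-compare, since all surrogates U+D800..U+DFFF share their top five bits. This C sketch is my guess at that kind of variant, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

/* U+D800..U+DFFF all match the bit pattern 11011xxx xxxxxxxx,
   so one mask-and-compare detects any surrogate. */
static int is_surrogate(uint16_t c)
{
    return (c & 0xF800) == 0xD800;
}

/* High (leading) surrogates are U+D800..U+DBFF: pattern 110110xx. */
static int is_high_surrogate(uint16_t c)
{
    return (c & 0xFC00) == 0xD800;
}
```

Compared with a range test like `c >= 0xD800 && c < 0xE000`, this needs no temporaries and compiles to a single AND plus compare, which matters in a tight decoding loop.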

I have not changed the encode/decode functions, even though they really  
need some changes (especially the UTF8 decode() function). I will happily  
do that, but i want to know first if my previous work is ok.

Ciao
uwe
May 21 2005
"Uwe Salomon" <post uwesalomon.de> writes:
And here goes the attachment %)
May 21 2005