
digitalmars.D - Internationalization library - advice/help

"Uwe Salomon" <post uwesalomon.de> writes:
During the writing of a string class for my Indigo library i "discovered"  
the need for a thorough internationalization library for D. I think a good  
implementation of i18n functionality would be very important for the  
development of applications in D, thus for the future of D. There is the  
ICU port of the Mango tree, but as ICU is a C/C++ library, this is not as  
natural and fast as it could be. I would like to write a native D i18n  
library which is independent of third party libraries.

As this is too big a project to develop by myself, and (i hope) of public  
interest for the D community, i would like to ask for:

- Advice: What is needed? How should it be implemented?
- Help: Who has the time and wants to help me? A total of 2 or 3  
developers should be sufficient?

My ideas are to write a compact core library that contains the most  
important features (character properties, UTF encodings, basic message  
translation), and then write some localization modules (number formatting,  
date formatting, comparing and searching). The goals should be simplicity  
and speed (but perhaps the community wants other things more?), avoiding  
complicated implementations and "template magic". And it should be well  
documented from the beginning, not a construction site on every corner.

But those are just some ideas that come to my mind right now. I hope that  
everybody makes some helpful statements about what he/she thinks the  
library should cover in any case, and what would be very nice to have.

Thanks & ciao
uwe
May 15 2005
"Andrew Fedoniouk" <news terrainformatica.com> writes:
Good idea, I like it.

FYI: On Windows MultiByteToWideChar and WideCharToMultiByte
support many encodings other than mentioned directly in MSDN.
I am using this list:

// (lang_t is a simple pair of encoding name and codepage id, e.g.
//  struct lang_t { const char* name; unsigned codepage; }; -- the
//  definition was not included in the original post.)
lang_t langs[] = {
    {"asmo-708",708},
    {"dos-720",720},
    {"iso-8859-6",28596},
    {"x-mac-arabic",10004},
    {"windows-1256",1256},
    {"ibm775",775},
    {"iso-8859-4",28594},
    {"windows-1257",1257},
    {"ibm852",852},
    {"iso-8859-2",28592},
    {"x-mac-ce",10029},
    {"windows-1250",1250},
    {"euc-cn",51936},
    {"gb2312",936},
    {"hz-gb-2312",52936},
    {"x-mac-chinesesimp",10008},
    {"big5",950},
    {"x-chinese-cns",20000},
    {"x-chinese-eten",20002},
    {"x-mac-chinesetrad",10002},
    {"cp866",866},
    {"iso-8859-5",28595},
    {"koi8-r",20866},
    {"koi8-u",21866},
    {"x-mac-cyrillic",10007},
    {"windows-1251",1251},
    {"x-europa",29001},
    {"x-ia5-german",20106},
    {"ibm737",737},
    {"iso-8859-7",28597},
    {"x-mac-greek",10006},
    {"windows-1253",1253},
    {"ibm869",869},
    {"dos-862",862},
    {"iso-8859-8-i",38598},
    {"iso-8859-8",28598},
    {"x-mac-hebrew",10005},
    {"windows-1255",1255},
    {"x-ebcdic-arabic",20420},
    {"x-ebcdic-cyrillicrussian",20880},
    {"x-ebcdic-cyrillicserbianbulgarian",21025},
    {"x-ebcdic-denmarknorway",20277},
    {"x-ebcdic-denmarknorway-euro",1142},
    {"x-ebcdic-finlandsweden",20278},
    {"x-ebcdic-finlandsweden-euro",1143},
    {"x-ebcdic-france-euro",1147},
    {"x-ebcdic-germany",20273},
    {"x-ebcdic-germany-euro",1141},
    {"x-ebcdic-greekmodern",875},
    {"x-ebcdic-greek",20423},
    {"x-ebcdic-hebrew",20424},
    {"x-ebcdic-icelandic",20871},
    {"x-ebcdic-icelandic-euro",1149},
    {"x-ebcdic-international-euro",1148},
    {"x-ebcdic-italy",20280},
    {"x-ebcdic-italy-euro",1144},
    {"x-ebcdic-japaneseandkana",50930},
    {"x-ebcdic-japaneseandjapaneselatin",50939},
    {"x-ebcdic-japaneseanduscanada",50931},
    {"x-ebcdic-japanesekatakana",20290},
    {"x-ebcdic-koreanandkoreanextended",50933},
    {"x-ebcdic-koreanextended",20833},
    {"cp870",870},
    {"x-ebcdic-simplifiedchinese",50935},
    {"x-ebcdic-spain",20284},
    {"x-ebcdic-spain-euro",1145},
    {"x-ebcdic-thai",20838},
    {"x-ebcdic-traditionalchinese",50937},
    {"cp1026",1026},
    {"x-ebcdic-turkish",20905},
    {"x-ebcdic-uk",20285},
    {"x-ebcdic-uk-euro",1146},
    {"ebcdic-cp-us",37},
    {"x-ebcdic-cp-us-euro",1140},
    {"ibm861",861},
    {"x-mac-icelandic",10079},
    {"x-iscii-as",57006},
    {"x-iscii-be",57003},
    {"x-iscii-de",57002},
    {"x-iscii-gu",57010},
    {"x-iscii-ka",57008},
    {"x-iscii-ma",57009},
    {"x-iscii-or",57007},
    {"x-iscii-pa",57011},
    {"x-iscii-ta",57004},
    {"x-iscii-te",57005},
    {"euc-jp",51932},
    {"iso-2022-jp",50220},
    {"iso-2022-jp",50222},
    {"csiso2022jp",50221},
    {"x-mac-japanese",10001},
    {"shift_jis",932},
    {"ks_c_5601-1987",949},
    {"euc-kr",51949},
    {"iso-2022-kr",50225},
    {"johab",1361},
    {"x-mac-korean",10003},
    {"iso-8859-3",28593},
    {"iso-8859-15",28605},
    {"x-ia5-norwegian",20108},
    {"ibm437",437},
    {"x-ia5-swedish",20107},
    {"windows-874",874},
    {"ibm857",857},
    {"iso-8859-9",28599},
    {"x-mac-turkish",10081},
    {"windows-1254",1254},
    //{(const char *)L"unicode",1200},
    //{"unicodefffe",1201},
    {"utf-7",65000},
    {"utf-8",65001},
    //{"us-ascii",20127},
    {"us-ascii",1252},
    {"windows-1258",1258},
    {"ibm850",850},
    {"x-ia5",20105},
    {"iso-8859-1",1252}, //was 28591
    {"macintosh",10000},
    {"windows-1252",1252},
    {"system",CP_ACP}
  };

The second member in these structs is the codepage id used directly as
the first parameter of MultiByteToWideChar and WideCharToMultiByte.

Hope this will help. At least it might help to build
translation tables automatically :)
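For what it's worth, a table like the one above lends itself to a simple name lookup. Here is a minimal C sketch under the assumption that lang_t is just a name/codepage pair; the struct layout and the codepage_for helper are my guesses, not code from this post:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed layout of lang_t: an encoding name plus the Windows codepage id
   that MultiByteToWideChar/WideCharToMultiByte take as first parameter. */
typedef struct { const char *name; unsigned codepage; } lang_t;

/* A few entries from the table above, for illustration. */
static const lang_t langs[] = {
    {"utf-8", 65001},
    {"koi8-r", 20866},
    {"windows-1252", 1252},
};

/* Find the codepage id for a MIME-style encoding name; 0 if unknown. */
static unsigned codepage_for(const char *name)
{
    for (size_t i = 0; i < sizeof(langs) / sizeof(langs[0]); ++i)
        if (strcmp(langs[i].name, name) == 0)
            return langs[i].codepage;
    return 0;
}
```

On Windows the result would then be passed straight to MultiByteToWideChar; on other platforms the same table could drive a mapping onto iconv encoding names instead.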

Andrew.




"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqtvopik6yjbe6 sandmann.maerchenwald.net...
 [snip]
May 15 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 FYI: On Windows MultiByteToWideChar and WideCharToMultiByte
 support many encodings other than mentioned directly in MSDN.
Hmm, thanks for that. As libiconv is no standard for Windows :) this will
come in handy.

Is there anyone who knows about encoding/decoding (and programming
specialties in general) on the Mac? Regrettably, i don't know a thing
about the Mac programming environment at all. :(

uwe
May 15 2005
Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Uwe Salomon schrieb am Sun, 15 May 2005 19:47:03 +0200:
 During the writing of a string class for my Indigo library i "discovered"  
 the need for a thorough internationalization library for D. I think a good  
 implementation of i18n functionality would be very important for the  
 development of applications in D, thus for the future of D. There is the  
 ICU port of the Mango tree, but as ICU is a C/C++ library, this is not as  
 natural and fast as it could be. I would like to write a native D i18n  
 library which is independent of third party libraries.
[snip]

some links:

http://www.i18ngurus.com/
http://www.openi18n.org/
http://java.sun.com/j2se/corejava/intl/
http://doc.trolltech.com/3.3/i18n.html

Thomas
May 17 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 some links:
[snip]

These are very good and informative, thanks a lot!

uwe
May 17 2005
Lars Ivar Igesund <larsivar igesund.net> writes:
Uwe Salomon wrote:

 some links:
 [snip]
 These are very good and informative, thanks a lot!
 uwe
Also, look at http://i18n.kde.org and
http://developer.kde.org/documentation/library/kdeqt/kde3arch/kde-i18n-howto.html/

While KDE is based on Qt, it seems like they've expanded on the
functionality, especially the part that has to do with translations of
messages and gui.

Lars Ivar Igesund
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 While KDE is based on Qt, it seems like they've expanded on the
 functionality, especially the part that has to do with translations of
 messages and gui.
Hmm, they are using GNU gettext() instead of the Qt tr(). Perhaps it would
be a good idea to go at least one of these ways, instead of inventing
something totally new. I like the KDE markup i18n("String to translate").
If i used that, all the existing tools (KBabel, Emacs PO mode) as well as
string extractors and friends would be available already. But it would
make the lib dependent on GNU gettext(), or i would have to write my own
.mo reader. gettext() is nonstandard for Windows, right?

Please, would anybody be so kind and explain to me how translation of user
messages works under Windows (roughly)? I remember them using resource
files. Does the application load the right resource file at runtime? And
how does it work for the Mac?

Thanks for the help!
uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
This is a first implementation for conversion between UTF encodings. I  
used UTF-8 <=> UTF-16 as an example. In sum, this is what i thought of:


char[] toUtf8(wchar[] str, inout size_t eaten, char[] buffer);
char[] toUtf8(wchar[] str, inout size_t eaten);
char[] toUtf8(wchar[] str, char[] buffer);
char[] toUtf8(wchar[] str);

* The first function converts str into UTF-8, beginning at str[eaten],  
adjusting eaten up to where it converted (stopping before an incomplete  
sequence at the end of str), and using buffer if large enough,  
reallocating the buffer if space is not sufficient. It throws an exception  
if faced with invalid input encoding.

* The second function allocates a sufficient buffer itself.

* The third function converts str as a whole, asserting on an incomplete  
sequence at the end of str. It uses buffer if possible.

* The fourth function does like the third, and allocates the buffer itself.

* For every function there is a variant called fast_toUtf8() with the same  
parameters which relies on valid input, producing invalid output  
otherwise. It can be used if the input is guaranteed to be valid, and is  
much faster.
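To make the contract of the first overload concrete, here is a hedged C sketch of the UTF-16 -> UTF-8 direction. The name to_utf8 and the caller-provided output buffer are illustrative only; Uwe's actual function grows its buffer and throws on invalid input, which this sketch does not do:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert UTF-16 units str[*eaten .. len) to UTF-8, stopping before an
   incomplete surrogate pair at the end of the input.  *eaten is advanced
   to the first unconverted unit; the byte count written to out is
   returned.  out is assumed large enough (3 bytes per input unit covers
   the worst case), and the input is assumed otherwise well-formed. */
static size_t to_utf8(const uint16_t *str, size_t len, size_t *eaten,
                      char *out)
{
    size_t o = 0, i = *eaten;
    while (i < len) {
        uint32_t c = str[i];
        if (c >= 0xD800 && c < 0xDC00) {   /* high surrogate */
            if (i + 1 >= len)
                break;                     /* incomplete sequence: stop */
            c = 0x10000 + ((c - 0xD800) << 10) + (str[i + 1] - 0xDC00);
            i += 2;
        } else {
            i += 1;
        }
        if (c < 0x80) {                    /* 1-byte sequence */
            out[o++] = (char)c;
        } else if (c < 0x800) {            /* 2-byte sequence */
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {          /* 3-byte sequence */
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {                           /* 4-byte sequence */
            out[o++] = (char)(0xF0 | (c >> 18));
            out[o++] = (char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    *eaten = i;
    return o;
}
```

The second and fourth overloads would wrap this with their own allocation; the third would additionally assert that *eaten reached len.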


For more explanations and a coding example visit:
http://www.uwesalomon.de/code/unicode/files/conversion-d.html

The source is at
http://www.uwesalomon.de/code/unicode/conversion.d


This is a draft, and i will be very happy if everyone who is interested  
comments on it, especially the API "design" (i know, fast_toUtf8() is a  
clumsy name :). And another question (i hope this is not arrogant): should  
these functions (or especially the simple form, without eaten) be included  
into Phobos std.utf? They are *much* faster than the current  
implementation. If someone would say, "Nice stuff, kiddo. Debug that  
properly, adjust it to the std.utf module (use their exception etc.) and  
submit a patch. Perhaps we will look at it then." i would sure do that.  
:)  But i am afraid that this kind of guerrilla action is rather  
unwanted, and i had better keep my mouth shut and code some useful  
stuff...

Thanks
uwe
May 18 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqyw3zok6yjbe6 sandmann.maerchenwald.net...
 [snip]
Speeding up std.utf would be good - how can one argue with that? :-)
Three thoughts come to mind:

1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked
to indicate to the user that it's not just a faster version of another
routine (since I'd call fast_foo over foo every time!) but one that makes
significant assumptions about the input. I'm not actually sure how often
it would be ok to call such a function anyway so maybe it isn't even
needed. Getting the wrong answer quickly is not a good trade-off.

2) it looks like you reallocate the output buffer inside the loop - can
it be moved to outside?

3) the formatting of the source code is somewhat unusual. I missed the
loop at first:

   // Now do the conversion.
   if (pIn < endIn)
     do
     {
       // Check for enough space left in the buffer.
       if (pOut >= endOut)
   [snip 50 lines of code or so]
     } while (++pIn < endIn);

On that first line my eye skipped right over the "do", and I had to
backtrack once I saw a "while" down at the bottom.
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked
Yes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(), regrettably that is very long.
 I'm not actually sure how often it
 would be ok to call such a function anyway so maybe it isn't even needed.
 Getting the wrong answer quickly is not a good trade-off.
You are right, that is an important fact, especially for a standard
library. Easy test: i converted a german email (mostly ASCII, some
special characters) with 5000 characters from UTF8 to UTF16. I provided
the buffer, because both functions are equally good at allocating memory.

Normal compilation:
   * safe function: 0.100 ms
   * unsafe function: 0.088 ms (12% faster)

Compilation -release -O:
   * safe function: 0.050 ms
   * unsafe function: 0.046 ms (8% faster)

I am not sure how all this could benefit from an assembler
implementation. Anyway, the speed gain is minimal (actually, i thought it
would be a lot more!). Well, no need to search for a good "unsafe" name
then. ;)
 2) it looks like you reallocate the output buffer inside the loop - can  
 it be moved to outside?
Why? To shorten the loop? I thought the buffer should only be reallocated
if the conversion itself shows it is too short. Do you want to move it
before the loop (so that a reallocation *cannot* occur inside it), or
just outside (with a goto SomeWhereOutsideTheLoop and, after the
reallocation, a goto BackIntoTheLoop)?
 3) the formatting of the source code is somewhat unusual. I missed the  
 loop at first.
Changed. Thanks for the reply, uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 Normal compilation:
    * safe function: 0.100 ms
    * unsafe function: 0.088 ms (12% faster)

 Compilation -release -O:
    * safe function: 0.050 ms
    * unsafe function: 0.046 ms (8 % faster)
Maybe i should add that if you convert a text which contains a lot of
UTF8 2/3-byte encodings (asian languages), the unsafe function saves
more: about 20% in comparison to the safe function.

uwe
May 18 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqy2lzec6yjbe6 sandmann.maerchenwald.net...
 1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked
Yes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(), regrettably that is very long.
 I'm not actually sure how often it
 would be ok to call such a function anyway so maybe it isn't even needed.
 Getting the wrong answer quickly is not a good trade-off.
 You are right, that is an important fact, especially for a standard
 library. [snip benchmark results]
I could see using the unsafe versions when you check the input once and then convert many slices that one then knows to be safe. So it isn't unreasonable to have it in there. I don't know the use cases well enough to offer up an opinion.
 2) it looks like you reallocate the output buffer inside the loop - can 
 it be moved to outside?
 Why? To shorten the loop? I thought the buffer should only be
 reallocated if the conversion itself shows it is too short. Do you want
 to move it before the loop (so that a reallocation *cannot* occur inside
 it), or just outside (with a goto SomeWhereOutsideTheLoop and, after the
 reallocation, a goto BackIntoTheLoop)?
How about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.
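Ben's grow-once idea boils down to allocating from a worst-case estimate up front instead of growing in small steps. A small C sketch of the bounds involved (the helper names are mine, not from the thread): n UTF-8 bytes can never decode to more than n UTF-16 code units, and n UTF-16 code units never need more than 3*n UTF-8 bytes, since a 4-byte UTF-8 sequence always corresponds to a surrogate pair of two code units.

```c
#include <assert.h>
#include <stddef.h>

/* Worst-case output size when decoding UTF-8 to UTF-16: each input byte
   becomes at most one UTF-16 code unit (the ASCII case), so n units
   always suffice for n bytes. */
static size_t utf16_units_bound(size_t utf8_bytes_left)
{
    return utf8_bytes_left;
}

/* Worst-case output size when encoding UTF-16 to UTF-8: a single code
   unit expands to at most 3 bytes; a surrogate pair (2 units) expands to
   4 bytes, which is still within 3 bytes per unit. */
static size_t utf8_bytes_bound(size_t utf16_units_left)
{
    return 3 * utf16_units_left;
}
```

Allocating these bounds once before the loop removes the need for any reallocation check inside it, at the cost of possibly over-allocating for mostly-ASCII input.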
 3) the formatting of the source code is somewhat unusual. I missed the 
 loop at first.
Changed. Thanks for the reply, uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 I could see using the unsafe versions when you check the input once and
 then convert many slices that one then knows to be safe. So it isn't
 unreasonable to have it in there. I don't know the use cases well enough
 to offer up an opinion.
Imagine a program that reads a lot of files from disk, does some fuzzy
work on them, and writes some others back, for example a doc tool. It
reads the source files in UTF8 format and converts them to the internally
used UTF16 (using the safe functions). It then processes here and there,
extracts the comments and formats them round. After that it puts out HTML
files in UTF8. The comments need to be converted back to UTF8, and that's
where the program could use the unsafe functions.

At least those were my thoughts. But if the speed gain is under 30%, i
think the fast versions are unnecessary. Imagine the doc tool needs a
minute for output. With the current functions this would drop to 50
seconds at most, provided that the output consists only of UTF conversion
(which is very unlikely).
 2) it looks like you reallocate the output buffer inside the loop - can
 it be moved to outside?
Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows it is too short. Do you want to move it before (so that a reallocation *cannot* occure inside the loop), or just outside (with a goto SomeWhereOutsideTheLoop and after the reallocation goto BackIntoTheLoop)?
How about if it needs to grow the buffer it does so with a large chunk instead of many small chunks. That is, the buffer doesn't have to fit exactly. Basically I have in mind that you estimate the maximum buffer size based on the number of input characters left and allocate that.
Hmm, the current source is:

   if (pOut >= endOut)
   {
     // ...
     buffer.length = buffer.length + (endIn - pIn) + 2; // Will be enough.
     // ...
   }

This will grow the buffer only once? (endIn - pIn) is the number of UTF8
characters to be processed, and they cannot expand to more than the same
amount of UTF16 characters (1-byte encoded UTF8 becomes 1-word encoded
UTF16, 4-byte encoded UTF8 becomes 2-word encoded UTF16). The same goes
for toUtf8().

But you are right, this could still be moved before the loop, especially
this one in toUtf16(). That's because (endIn - pIn) is a very accurate
guess for languages with a lot of ASCII in them.

Ciao
uwe
May 18 2005
"Ben Hinkle" <ben.hinkle gmail.com> writes:
"Uwe Salomon" <post uwesalomon.de> wrote in message 
news:op.sqy4o0ud6yjbe6 sandmann.maerchenwald.net...
 I could see using the unsafe versions when you check the input once and
 then convert many slices that one then knows to be safe. So it isn't
 unreasonable to have it in there. I don't know the use cases well enough
 to offer up an opinion.
 Imagine a program that reads a lot of files from disk, does some fuzzy
 work on them, and writes some others back, for example a doc tool.
 [snip]
sounds reasonable
 2) it looks like you reallocate the output buffer inside the loop - can
 it be moved to outside?
[snip]
Hmm, the current source is: [snip] This will grow the buffer only once?
(endIn - pIn) is the number of UTF8 characters to be processed, and they
cannot expand to more than the same amount of UTF16 characters. The same
goes for toUtf8().
ok - I didn't look at the details. I just saw the resizing happening in the loop and guessed it was resizing a little bit each time. What you have seems reasonable.
 But you are right, this could still be moved before the loop, especially 
 this one in toUtf16(). That's because (endIn - pIn) is a very accurate 
 guess for languages with a lot of ASCII in them.

 Ciao
 uwe 
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
 This will grow the buffer only once? (endIn - pIn) is the number of UTF8
 characters to be processed, and they cannot expand to more than the same
 amount of UTF16 characters (1-byte encoded UTF8 becomes 1-word encoded
 UTF16, 4-byte encoded UTF8 becomes 2-word encoded UTF16). The same goes
 for toUtf8().
ok - I didn't look at the details. I just saw the resizing happening in the loop and guessed it was resizing a little bit each time. What you have seems reasonable.
Still you are right. I moved it out of the loop in toUtf16(). I will
think about it in the other functions, not sure what is best in each case
(well, it always depends on the characters in the string).

I am now writing the other 4 functions (that is much easier now, as the
two were the most complex). After finishing and testing them, i'll beep
again. :)

By the way... how are the Phobos docs generated? Hand-crafted? I will
also update the corresponding sections if you let me...

Ciao
uwe
May 18 2005
"Uwe Salomon" <post uwesalomon.de> writes:
I have now moved the UTF conversion code into the std.utf module. I have  
made the following changes:

* The tabs are now spaces. Sorry... :)
* Slight change in the UTF8stride array. Unicode 4.0.1 declares some  
encodings illegal, including 5- and 6-byte encodings and some at the  
beginning of the 2-byte range.
* Slight change in stride(wchar) and toUTFindex(wchar) and  
toUCSindex(wchar). I just changed the detection of UTF16 surrogate values  
to a faster variant that does not need a local variable as well.
* Replacement of all toUTF() functions, except the ones where the return  
type has the same encoding as the parameter, which only validate.  
toUTF16z() is still there as well, but changed to use my own toUTF16 (it  
zero-terminates the strings anyway).
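As an aside, one plausible shape for a surrogate test "without a local variable" is a single mask-and-compare, since all surrogates U+D800..U+DFFF share their top five bits. This C sketch is my guess at that kind of variant, not the actual patch:

```c
#include <assert.h>
#include <stdint.h>

/* U+D800..U+DFFF all match the bit pattern 11011xxx xxxxxxxx,
   so one mask-and-compare detects any surrogate. */
static int is_surrogate(uint16_t c)
{
    return (c & 0xF800) == 0xD800;
}

/* High (leading) surrogates are U+D800..U+DBFF: pattern 110110xx. */
static int is_high_surrogate(uint16_t c)
{
    return (c & 0xFC00) == 0xD800;
}
```

Compared with a range test like `c >= 0xD800 && c < 0xE000`, this needs no temporaries and compiles to a single AND plus compare, which matters in a tight decoding loop.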

I have not changed the encode/decode functions, even though they really  
need some changes (especially the UTF8 decode() function). I will happily  
do that, but i want to know first if my previous work is ok.

Ciao
uwe
May 21 2005
"Uwe Salomon" <post uwesalomon.de> writes:
And here goes the attachment %)
May 21 2005