
digitalmars.D - std.string and unicode

reply "Todor Totev" <umbra.tenebris list.ru> writes:
Hello all,
are std.string functions supposed to be UNICODE aware?
The documentation says nothing and looking at the source it appears that
the functions work only for ASCII characters, despite the fact that
their arguments are char[].
If these functions are designed to be ascii only, where can I find the
unicode ones?
Best Regards,
Todor
Dec 16 2006
next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?

Yes.
 The documentation says nothing and looking at the source it appears that
 the functions work only for ASCII characters, despite the fact that
 their arguments are char[].

In many cases UTF-8 or ASCII doesn't actually make a difference, so the code doesn't need to do anything special. IIRC that was in fact one of the design goals of UTF-8. But if you find any functions that only work for ASCII, be sure to report them to the digitalmars.D.bugs newsgroup[1]. Make sure to test them first, though. In fact, include code that fails (but shouldn't).

[1]: The Bugzilla seems to be down at the moment, otherwise that would have been preferred.
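The UTF-8 property referred to here can be shown concretely. The sketch below is in Python rather than D, purely to demonstrate the byte-level behavior, which is the same in any language:

```python
# UTF-8 is self-synchronizing: the encoding of one character never
# appears inside the encoding of another, so a plain byte-wise
# substring search on UTF-8 data never matches at a bogus position.
text = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

pos = text.find(needle)            # byte search, no UTF-8 decoding at all
assert text[pos:pos + len(needle)].decode("utf-8") == "café"
print(pos)                         # 7: "naïve " is 7 bytes (ï takes two)
```

This is why substring-oriented functions can stay ASCII-oblivious and still be UTF-8 correct.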
Dec 16 2006
parent reply =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?

Yes.

No they are not. The string constants (lowercase, letters, etc.) only feature ASCII characters. Some functions have "BUG: only works with ASCII" attached to them. Many of the functions expect that the char[] string consists of 8-bit characters.

The Mango project (http://dsource.org/projects/mango) has functions with better Unicode compatibility.
Dec 16 2006
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Jari-Matti Mäkelä wrote:
 Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?


No they are not.

Whether they're _supposed_ to be Unicode aware, and whether they actually _are_ Unicode aware, are two very different matters.
 The string constants (lowercase, letters, etc.) only
 feature ASCII characters. Some functions have "BUG: only works with
 ASCII" attached to them. Many of the functions expect that the char[]
 string consists of 8 bit characters.

maketrans expects the char[] string to consist of 7-bit characters. Which functions expect it to consist of 8-bit characters?

But indeed, somebody needs to define a Unicode translation table format that isn't going to take up 4MB or so. Why was the current translation table format put in in the first place, considering:
- it's obvious that it won't work in Unicode
- a dchar[dchar] is an intuitive way to do it
?

Stewart.
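The dchar[dchar] idea maps one code point to another, which is essentially how Python's str.translate tables work; the folding table below is a made-up example for illustration, not anything from Phobos:

```python
# A sparse per-code-point mapping, in the spirit of the dchar[dchar]
# suggestion: only the entries actually used are stored, so the table
# stays tiny instead of becoming a multi-megabyte dense array.
table = str.maketrans({"ä": "a", "ö": "o"})   # hypothetical folding table
print("Mäkelä".translate(table))              # Makela
```

A sparse associative table scales with the number of mappings, not with the size of the Unicode code space.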
Dec 16 2006
parent =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Stewart Gordon wrote:
 maketrans expects the char[] string to consist of 7-bit characters.
 Which functions expect it to consist of 8-bit characters?

Right, I meant 7 + 1, to make the characters aligned on 8-bit boundaries.
Dec 16 2006
prev sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Jari-Matti Mäkelä wrote:
 Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?


No they are not.

All I said was they're _supposed_ to be Unicode aware, which was the question ;). I also said to report any that aren't.
 The string constants (lowercase, letters, etc.) only
 feature ASCII characters. Some functions have "BUG: only works with
 ASCII" attached to them. Many of the functions expect that the char[]
 string consists of 8 bit characters.

First, let me note I was mostly just repeating what I remember from a previous iteration of this discussion. I didn't actually look through the entire source. I guess those with "BUG" in them are known bugs, which will hopefully be fixed.

But functions like find() (substring form) and replace(), for instance, shouldn't need special code to deal with UTF-8 vs. ASCII. In fact, anything that doesn't deal with properties of individual characters should probably be fine.
 Mango project (http://dsource.org/projects/mango) has functions with
 better Unicode compatibility.

Always good to know if I ever need it.

... [some time later]

Actually, I just did look through the entire source. The functions that seem to have trouble with UTF-8:

icmp: only checks for the range 'A'-'Z'.
ifind, irfind (substring forms): use icmp and are thus equally guilty.
rjustify, ljustify, center, zfill: fail to account for multi-byte characters still being only one character (i.e. they align according to byte length, not character length).
maketrans, translate: not really practical for full Unicode, but buggy still.
soundex: AFAIK the Soundex algorithm is undefined for characters outside the ranges a-z and A-Z, so the fault lies more in the algorithm than in this implementation of it.

The rest seems to handle it just fine, as long as you only pass valid UTF-8 and any indexes passed in are at character boundaries. That's a score of 10 ASCII functions and about 70 UTF-8 functions. I think the phrase 'many of the functions' is a bit exaggerated here (especially since the last 7 are a bit obscure IMHO), but YMMV.

Note that I didn't follow my own advice and actually test these; this is based purely on browsing through the file.
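The byte-length vs. character-length confusion behind the justify bugs is easy to reproduce; here in Python, where len() on a string counts code points:

```python
# With non-ASCII text, byte length and character count diverge --
# padding by byte length, as the justify functions described above do,
# misaligns the output.
s = "Mäkelä"
assert len(s) == 6                    # 6 characters (code points)
assert len(s.encode("utf-8")) == 8    # 8 bytes: each ä is 2 bytes

# Correct centering pads by character count:
print(s.center(10, "."))              # ..Mäkelä..
```

Padding by byte count would have added two fewer dots here, shifting the field for every multi-byte character in the string.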
Dec 16 2006
parent reply =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Frits van Bommel wrote:
 Jari-Matti Mäkelä wrote:
 Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?


No they are not.

All I said was they're _supposed_ to be Unicode aware, which was the question ;). I also said to report any that aren't.

Ok, sorry - didn't mean to be rude. I just thought it might not be realistic to expect them all to be Unicode aware in the near future.

Many of them were implemented to be pretty limited to ASCII last time I checked, and they haven't changed a lot in the last few years. I don't know - the 1.0 stable will be out on Jan 1, and I cannot foretell how much they will change before or after that. For example, Ruby has been out a while and still some of its string functions are not compatible with Unicode.
 But functions like find() (substring form) and replace(), for instance,
 shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
 anything that doesn't deal with properties of individual characters
 should probably be fine.

Unless the functions must be aware of different forms of the same characters. The Unicode gurus can shed more light into this. I recall there has been previous discussion of this also.
 Actually, I just did look through the entire source. The functions that
 seem to have trouble with UTF-8:
 
 icmp: Only checks for range 'A'-'Z'
 ifind, irfind (substring forms): use icmp and are thus equally guilty.
 rjustify, ljustify, center, zfill: fail to account for multi-byte
 characters still being only one character (i.e. align according to byte
 length, not character length)
 maketrans, translate: Not really practical for full Unicode, but buggy
 still.
 soundex: AFAIK the Soundex algorithm is undefined for characters out of
 the ranges a-z and A-Z, so the fault lies more in the algorithm than
 this implementation of it.

Also tolower & toupper have problems with characters like Ә (U+04D8) and Һ (U+04BA). Ok, I'm not 100% sure they are supposed to have upper and lower case counterparts. Then there are some characters like Σ (U+03A3) that are used both as symbols in e.g. mathematics and as ordinary letters in some languages. Anyway, one cannot tell those methods about the locale to use in conversions, so I thought they are not supposed to be Unicode compatible. And the spec does not say whether they ought to be compatible or not. (OTOH I'm pretty impressed by the capabilities of tolower/toupper. Last time I tested those they were not even able to change the case of ä:s and ö:s.)

I'm most probably the wrong person to answer this, but this is just the first impression I got. Now that you mention it, it really seems that Walter has improved those a lot. I'm not sure how much it matters to the OP how things are supposed to be if they are not implemented yet. I'm still using D for small-scale projects that will reach the end of their life cycle before the spec / std library manages to get through such massive changes.
Dec 16 2006
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Jari-Matti Mäkelä wrote:
<snip>
 Ok, sorry - didn't mean to be rude. I just thought it might not be
 realistic to expect them all to be Unicode aware in the near future.
 
 Many of them were implemented to be pretty limited to ASCII last time I
 checked. And they haven't changed a lot in the last few years. I don't
 know, the 1.0 stable will be out on Jan 1.

At this rate, I think to call it stable is wishful thinking. <snip>
 But functions like find() (substring form) and replace(), for instance,
 shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
 anything that doesn't deal with properties of individual characters
 should probably be fine.

Unless the functions must be aware of different forms of the same characters. The Unicode gurus can shed more light into this. I recall there has been previous discussion of this also.

Eventually, we should probably have two versions of the find functions: one that matches codepoint for codepoint and therefore byte for byte, and one that matches character for character. See point 6 at http://www.textpad.info/forum/viewtopic.php?t=4778

However, it does appear that that post mixes up terms a bit - I think by "character" it means "codepoint", and by "glyph" it means "character". But the question is: which version should be called find, etc., as opposed to something else?

<snip>
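The codepoint-vs-character distinction shows up as soon as combining marks are involved; a minimal illustration in Python (the unicodedata module is used here only for normalization):

```python
import unicodedata

# The same user-perceived word "café" as two different codepoint sequences:
nfc = "caf\u00e9"      # precomposed é (U+00E9)
nfd = "cafe\u0301"     # 'e' followed by combining acute accent (U+0301)

assert nfc != nfd      # a codepoint-for-codepoint find() misses this match
# A character-for-character match normalizes both sides first:
assert unicodedata.normalize("NFC", nfd) == nfc
```

A byte-for-byte find would report these two spellings as different strings, while a character-level find would have to treat them as equal.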
 Also tolower & toupper have problems with characters like Ә (04D8) and Һ
 (04BA). Ok, I'm not 100% sure they are supposed to have upper and lower
 case counterparts. Then there are some characters like Σ (03A3) that are
 also used as symbols in e.g. mathematics and ordinary letters in some
 languages.

U+03A3 is the Greek letter and U+2211 is the mathematical symbol. Similarly, U+03A0 and U+220F. Here they are: Σ ∑ Π ∏ Stewart.
Dec 17 2006
parent reply =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Stewart Gordon wrote:
 Jari-Matti Mäkelä wrote:
 But functions like find() (substring form) and replace(), for instance,
 shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
 anything that doesn't deal with properties of individual characters
 should probably be fine.

Unless the functions must be aware of different forms of the same characters. The Unicode gurus can shed more light into this. I recall there has been previous discussion of this also.

Eventually, we should probably have two versions of the find functions: one that matches codepoint for codepoint and therefore byte for byte, and one that matches character for character. See point 6 at http://www.textpad.info/forum/viewtopic.php?t=4778 However, it does appear that that post mixes up terms a bit - I think by "character" it means "codepoint", and by "glyph" it means "character". But the question is: which version should be called find, etc., as opposed to something else?

I guess it's bad practice to use Unicode-unaware algorithms today, but sometimes it's just a lot easier to do it the old-fashioned way. And of course, a bit faster too. D can be abused to use 8-bit ISO 8859-x code pages. This is what some people still want. The Linux people are pretty much comfortable with Unicode, but on Windows things are worse. I haven't invested much time on the Windows side of things, but I think there are some libraries to work around this now.

One friend of mine also had some problems with Unicode a while back. He would have wanted a Unicode-aware file stream in the standard library. The standard file stream just writes whatever comes to the file, so it's basically pretty easy to mix up different UTF encodings on the same stream.

Currently D does not try to be very Unicode friendly. I really don't know how it should be. There are several different opinions and implementations, and this has been discussed before. To me it just seems that the current state of D is a weird mixture of several approaches.
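The mixed-encoding hazard described here is easy to reproduce with any raw byte stream; the Python sketch below stands in for an encoding-unaware file:

```python
import io

# A raw stream happily accepts bytes in any encoding...
buf = io.BytesIO()
buf.write("abc".encode("utf-8"))
buf.write("def".encode("utf-16"))   # BOM + 2-byte code units interleaved

# ...but the result is no longer valid in either encoding:
data = buf.getvalue()
try:
    data.decode("utf-8")
    mixed_ok = True
except UnicodeDecodeError:
    mixed_ok = False
assert not mixed_ok                 # the UTF-16 BOM bytes break UTF-8
```

An encoding-aware stream would pin one encoding at open time and transcode (or reject) anything else, which is what the friend mentioned above was missing.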
Dec 17 2006
parent reply "Todor Totev" <umbra.tenebris list.ru> writes:
Thank you for your responses. First of all, I'm a beginner D programmer.
Actually my problem is with functions in the std.path module.
I'm trying to figure out how to rewrite them to have more robust support
for the Windows NTFS file system, and I noticed that fncharmatch() is
ASCII-only.
So I started looking at how to improve it and discovered std.string.icmp.
But it was ASCII-only too, so I decided to post my question.
The icmp function has different versions for Windows/Linux, but my
quick test with Cyrillic letters showed that it is broken on Windows.

About the fncharmatch() function, my best idea so far is to use either
std.uni.toUniUpper() and then compare, or just to uppercase entire
strings with std.string.toupper().

It is completely unrelated, but I have a vague memory that the German
letter "ß" (U+00DF), when uppercased, is replaced with "SS". I'm not sure
if this is true, but if it is then std.uni.toUniUpper() has a bug, because
I don't see its code check for this case. Could someone who speaks
German check this, please?

 Currently D does not try to be very Unicode friendly. I really don't
 know how it should be. There are several different opinions and
 implementation and this has been discussed before.

Could you please point me to these discussions? I'm very interested.

Regards,
Todor
Dec 17 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Todor Totev schrieb am 2006-12-17:

<snip>

 It is completely unrelated, but I have a vague memory that the German
 letter "ß" (U+00DF), when uppercased, is replaced with "SS". I'm not sure
 if this is true, but if it is then std.uni.toUniUpper() has a bug, because
 I don't see its code check for this case. Could someone who speaks
 German check this, please?

The uppercase version of "ß" is "SS". (At least according to Unicode and DIN; many Germans however treat "ß" as caseless ...)

Unicode allows two types of toUpper/toLower: complete and simplified. The simplified version doesn't change the casing if the number of codepoints would change. Phobos currently excludes all changes where the simplified version would change the length of the UTF-8 encoded string.

For an updated std.uni see http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D&artnum=34218

Thomas
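For comparison, Python's str.upper implements the complete (length-changing) mapping described here, which makes the ß behavior easy to check outside of Phobos:

```python
# Complete case mapping may change the number of code points:
# German ß uppercases to "SS" (one character becomes two).
assert "ß".upper() == "SS"
assert len("ß") == 1 and len("ß".upper()) == 2

# The mapping is not invertible -- lowercasing does not restore ß:
print("SS".lower())   # ss
```

A simplified mapping, as described above, would instead leave ß unchanged to keep the string length stable.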
Dec 17 2006
prev sibling parent Serg Kovrov <kovrov no.spam> writes:
Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?
 The documentation says nothing and looking at the source it appears that
 the functions work only for ASCII characters, despite the fact that
 their arguments are char[].
 If these functions are designed to be ascii only, where can I find the
 unicode ones?

Ok, there was some discussion on this issue, but I still can't get it... Will Unicode support, and support for wchar/dchar, be in std.string in 1.0?

It would be a nice idea to have a roadmap for '1.0' (and at least for 'post 1.0') stuff.

-- serg.
Dec 27 2006