
digitalmars.D - std.string and unicode

reply "Todor Totev" <umbra.tenebris list.ru> writes:
Hello all,
are std.string functions supposed to be UNICODE aware?
The documentation says nothing and looking at the source it appears that
the functions work only for ASCII characters, despite the fact that
their arguments are char[].
If these functions are designed to be ascii only, where can I find the
unicode ones?
Best Regards,
Todor
Dec 16 2006
next sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?

Yes.
 The documentation says nothing and looking at the source it appears that
 the functions work only for ASCII characters, despite the fact that
 their arguments are char[].

In many cases UTF-8 or ASCII doesn't actually make a difference, so the code doesn't need to do anything special. IIRC that was in fact one of the design goals of UTF-8. But if you find any functions that only work for ASCII, be sure to report them to the digitalmars.D.bugs newsgroup[1]. Make sure to test them first, though. In fact, include code that fails (but shouldn't).

[1]: The Bugzilla seems to be down at the moment, otherwise that would have been preferred.
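The UTF-8 property referred to here can be shown concretely. The sketch below is in Python rather than D, purely to demonstrate the byte-level behavior, which is the same in any language:

```python
# UTF-8 is self-synchronizing: the encoding of one character never
# appears inside the encoding of another, so a plain byte-wise
# substring search on UTF-8 data never matches at a bogus position.
text = "naïve café".encode("utf-8")
needle = "café".encode("utf-8")

pos = text.find(needle)            # byte search, no UTF-8 decoding at all
assert text[pos:pos + len(needle)].decode("utf-8") == "café"
print(pos)                         # 7: "naïve " is 7 bytes (ï takes two)
```

This is why substring-oriented functions can stay ASCII-oblivious and still be UTF-8 correct.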
Dec 16 2006
parent reply =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?

Yes.

No they are not. The string constants (lowercase, letters, etc.) only feature ASCII characters. Some functions have "BUG: only works with ASCII" attached to them. Many of the functions expect that the char[] string consists of 8-bit characters.

The Mango project (http://dsource.org/projects/mango) has functions with better Unicode compatibility.
Dec 16 2006
next sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Jari-Matti Mäkelä wrote:
 Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?


No they are not.

Whether they're _supposed_ to be Unicode aware, and whether they actually _are_ Unicode aware, are two very different matters.
 The string constants (lowercase, letters, etc.) only
 feature ASCII characters. Some functions have "BUG: only works with
 ASCII" attached to them. Many of the functions expect that the char[]
 string consists of 8 bit characters.

maketrans expects the char[] string to consist of 7-bit characters. Which functions expect it to consist of 8-bit characters?

But indeed, somebody needs to define a Unicode translation table format that isn't going to take up 4MB or so. Why was the current translation table format put in in the first place, considering:
- it's obvious that it won't work in Unicode
- a dchar[dchar] is an intuitive way to do it
?

Stewart.
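The dchar[dchar] idea maps one code point to another, which is essentially how Python's str.translate tables work; the folding table below is a made-up example for illustration, not anything from Phobos:

```python
# A sparse per-code-point mapping, in the spirit of the dchar[dchar]
# suggestion: only the entries actually used are stored, so the table
# stays tiny instead of becoming a multi-megabyte dense array.
table = str.maketrans({"ä": "a", "ö": "o"})   # hypothetical folding table
print("Mäkelä".translate(table))              # Makela
```

A sparse associative table scales with the number of mappings, not with the size of the Unicode code space.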
Dec 16 2006
parent =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Stewart Gordon wrote:
 maketrans expects the char[] string to consist of 7-bit characters.
 Which functions expect it to consist of 8-bit characters?

Right, I meant 7 + 1, to make the characters aligned on 8-bit boundaries.
Dec 16 2006
prev sibling parent reply Frits van Bommel <fvbommel REMwOVExCAPSs.nl> writes:
Jari-Matti Mäkelä wrote:
 Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?


No they are not.

All I said was they're _supposed_ to be Unicode aware, which was the question ;). I also said to report any that aren't.
 The string constants (lowercase, letters, etc.) only
 feature ASCII characters. Some functions have "BUG: only works with
 ASCII" attached to them. Many of the functions expect that the char[]
 string consists of 8 bit characters.

First, let me note I was mostly just repeating what I remember from a previous iteration of this discussion. I didn't actually look through the entire source. I guess those with "BUG" in them are known bugs, which will hopefully be fixed.

But functions like find() (substring form) and replace(), for instance, shouldn't need special code to deal with UTF-8 vs. ASCII. In fact, anything that doesn't deal with properties of individual characters should probably be fine.
 Mango project (http://dsource.org/projects/mango) has functions with
 better Unicode compatibility.

Always good to know if I ever need it.

... [some time later]

Actually, I just did look through the entire source. The functions that seem to have trouble with UTF-8:

icmp: only checks for the range 'A'-'Z'.
ifind, irfind (substring forms): use icmp and are thus equally guilty.
rjustify, ljustify, center, zfill: fail to account for multi-byte characters still being only one character (i.e. they align according to byte length, not character length).
maketrans, translate: not really practical for full Unicode, but buggy still.
soundex: AFAIK the Soundex algorithm is undefined for characters outside the ranges a-z and A-Z, so the fault lies more in the algorithm than in this implementation of it.

The rest seems to handle it just fine, as long as you only pass valid UTF-8 and any indexes passed in are at character boundaries. That's a score of 10 ASCII functions and about 70 UTF-8 functions. I think the phrase 'many of the functions' is a bit exaggerated here (especially since the last 7 are a bit obscure IMHO), but YMMV.

Note that I didn't follow my own advice and actually test these; this is based purely on browsing through the file.
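The byte-length vs. character-length confusion behind the justify bugs is easy to reproduce; here in Python, where len() on a string counts code points:

```python
# With non-ASCII text, byte length and character count diverge --
# padding by byte length, as the justify functions described above do,
# misaligns the output.
s = "Mäkelä"
assert len(s) == 6                    # 6 characters (code points)
assert len(s.encode("utf-8")) == 8    # 8 bytes: each ä is 2 bytes

# Correct centering pads by character count:
print(s.center(10, "."))              # ..Mäkelä..
```

Padding by byte count would have added two fewer dots here, shifting the field for every multi-byte character in the string.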
Dec 16 2006
parent reply =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Frits van Bommel wrote:
 Jari-Matti Mäkelä wrote:
 Frits van Bommel wrote:
 Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?


No they are not.

All I said was they're _supposed_ to be Unicode aware, which was the question ;). I also said to report any that aren't.

Ok, sorry - didn't mean to be rude. I just thought it might not be realistic to expect them all to be Unicode aware in the near future.

Many of them were implemented to be pretty limited to ASCII last time I checked, and they haven't changed a lot in the last few years. I don't know - the 1.0 stable will be out on Jan 1, and I cannot foretell how much they will change before or after that. For example, Ruby has been out a while and still some of its string functions are not compatible with Unicode.
 But functions like find() (substring form) and replace(), for instance,
 shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
 anything that doesn't deal with properties of individual characters
 should probably be fine.

Unless the functions must be aware of different forms of the same characters. The Unicode gurus can shed more light into this. I recall there has been previous discussion of this also.
 Actually, I just did look through the entire source. The functions that
 seem to have trouble with UTF-8:
 
 icmp: Only checks for range 'A'-'Z'
 ifind, irfind (substring forms): use icmp and are thus equally guilty.
 rjustify, ljustify, center, zfill: fail to account for multi-byte
 characters still being only one character (i.e. align according to byte
 length, not character length)
 maketrans, translate: Not really practical for full Unicode, but buggy
 still.
 soundex: AFAIK the Soundex algorithm is undefined for characters out of
 the ranges a-z and A-Z, so the fault lies more in the algorithm than
 this implementation of it.

Also tolower & toupper have problems with characters like Ә (U+04D8) and Һ (U+04BA). Ok, I'm not 100% sure they are supposed to have upper and lower case counterparts. Then there are some characters like Σ (U+03A3) that are used both as symbols in e.g. mathematics and as ordinary letters in some languages. Anyway, one cannot tell those methods about the locale to use in conversions, so I thought they are not supposed to be Unicode compatible. And the spec does not say whether they ought to be compatible or not. (OTOH I'm pretty impressed by the capabilities of tolower/toupper. Last time I tested those they were not even able to change the case of ä:s and ö:s.)

I'm most probably the wrong person to answer this, but this is just the first impression I got. Now that you mention it, it really seems that Walter has improved those a lot. I'm not sure how much it matters to the OP how things are supposed to be if they are not implemented yet. I'm still using D for small-scale projects that will reach the end of their life cycle before the spec / std library manages to get through such massive changes.
Dec 16 2006
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Jari-Matti Mäkelä wrote:
<snip>
 Ok, sorry - didn't mean to be rude. I just thought it might not be
 realistic to expect them all to be Unicode aware in the near future.
 
 Many of them were implemented to be pretty limited to ASCII last time I
 checked. And they haven't changed a lot in the last few years. I don't
 know, the 1.0 stable will be out on Jan 1.

At this rate, I think to call it stable is wishful thinking. <snip>
 But functions like find() (substring form) and replace(), for instance,
 shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
 anything that doesn't deal with properties of individual characters
 should probably be fine.

Unless the functions must be aware of different forms of the same characters. The Unicode gurus can shed more light into this. I recall there has been previous discussion of this also.

Eventually, we should probably have two versions of the find functions: one that matches codepoint for codepoint and therefore byte for byte, and one that matches character for character. See point 6 at http://www.textpad.info/forum/viewtopic.php?t=4778

However, it does appear that that post mixes up terms a bit - I think by "character" it means "codepoint", and by "glyph" it means "character". But the question is: which version should be called find, etc., as opposed to something else?

<snip>
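The codepoint-vs-character distinction shows up as soon as combining marks are involved; a minimal illustration in Python (the unicodedata module is used here only for normalization):

```python
import unicodedata

# The same user-perceived word "café" as two different codepoint sequences:
nfc = "caf\u00e9"      # precomposed é (U+00E9)
nfd = "cafe\u0301"     # 'e' followed by combining acute accent (U+0301)

assert nfc != nfd      # a codepoint-for-codepoint find() misses this match
# A character-for-character match normalizes both sides first:
assert unicodedata.normalize("NFC", nfd) == nfc
```

A byte-for-byte find would report these two spellings as different strings, while a character-level find would have to treat them as equal.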
 Also tolower & toupper have problems with characters like Ә (04D8) and Һ
 (04BA). Ok, I'm not 100% sure they are supposed to have upper and lower
 case counterparts. Then there are some characters like Σ (03A3) that are
 also used as symbols in e.g. mathematics and ordinary letters in some
 languages.

U+03A3 is the Greek letter and U+2211 is the mathematical symbol. Similarly, U+03A0 and U+220F. Here they are: Σ ∑ Π ∏ Stewart.
Dec 17 2006
parent reply =?UTF-8?B?SmFyaS1NYXR0aSBNw6RrZWzDpA==?= <jmjmak utu.fi.invalid> writes:
Stewart Gordon wrote:
 Jari-Matti Mäkelä wrote:
 But functions like find() (substring form) and replace(), for instance,
 shouldn't need special code to deal with UTF-8 vs. ASCII. In fact,
 anything that doesn't deal with properties of individual characters
 should probably be fine.

Unless the functions must be aware of different forms of the same characters. The Unicode gurus can shed more light into this. I recall there has been previous discussion of this also.

Eventually, we should probably have two versions of the find functions: one that matches codepoint for codepoint and therefore byte for byte, and one that matches character for character. See point 6 at http://www.textpad.info/forum/viewtopic.php?t=4778 However, it does appear that that post mixes up terms a bit - I think by "character" it means "codepoint", and by "glyph" it means "character". But the question is: which version should be called find, etc., as opposed to something else?

I guess it's bad practice to use Unicode-unaware algorithms today, but sometimes it's just a lot easier to do it the old-fashioned way. And of course, a bit faster too. D can be abused to use 8-bit ISO 8859-x code pages. This is what some people still want. The Linux people are pretty much comfortable with Unicode, but on Windows things are worse. I haven't invested much time on the Windows side of things, but I think there are some libraries to work around this now.

One friend of mine also had some problems with Unicode a while back. He would have wanted a Unicode-aware file stream in the standard library. The standard file stream just writes whatever comes to the file, so it's basically pretty easy to mix up different UTF encodings on the same stream.

Currently D does not try to be very Unicode friendly. I really don't know how it should be. There are several different opinions and implementations, and this has been discussed before. To me it just seems that the current state of D is a weird mixture of several approaches.
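The mixed-encoding hazard described here is easy to reproduce with any raw byte stream; the Python sketch below stands in for an encoding-unaware file:

```python
import io

# A raw stream happily accepts bytes in any encoding...
buf = io.BytesIO()
buf.write("abc".encode("utf-8"))
buf.write("def".encode("utf-16"))   # BOM + 2-byte code units interleaved

# ...but the result is no longer valid in either encoding:
data = buf.getvalue()
try:
    data.decode("utf-8")
    mixed_ok = True
except UnicodeDecodeError:
    mixed_ok = False
assert not mixed_ok                 # the UTF-16 BOM bytes break UTF-8
```

An encoding-aware stream would pin one encoding at open time and transcode (or reject) anything else, which is what the friend mentioned above was missing.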
Dec 17 2006
parent reply "Todor Totev" <umbra.tenebris list.ru> writes:
Thank you for your responses. First of all, I'm a beginner D programmer.
Actually my problem is with functions in the std.path module.
I'm trying to figure out how to rewrite them to have more robust support
for the Windows NTFS file system, and I noticed that fncharmatch() is
ASCII-only.
So I started looking at how to improve it and discovered std.string.icmp.
But it was ASCII-only too, so I decided to post my question.
The icmp function has different versions for Windows/Linux, but my
quick test with Cyrillic letters showed that it is broken on Windows.

About the fncharmatch() function, my best idea so far is to use either
std.uni.toUniUpper() and then compare, or just to uppercase entire
strings with std.string.toupper().

It is completely unrelated, but I have a vague memory that the German
letter "ß" (U+00DF), when uppercased, is replaced with "SS". I'm not sure
if this is true, but if it is then std.uni.toUniUpper() has a bug, because
I don't see its code check for this case. Could someone who speaks
German check this, please?

 Currently D does not try to be very Unicode friendly. I really don't
 know how it should be. There are several different opinions and
 implementation and this has been discussed before.

Could you please point me to these discussions? I'm very interested.

Regards,
Todor
Dec 17 2006
parent Thomas Kuehne <thomas-dloop kuehne.cn> writes:

Todor Totev schrieb am 2006-12-17:

<snip>

 It is completely unrelated, but I have a vague memory that the German
 letter "ß" (U+00DF), when uppercased, is replaced with "SS". I'm not sure
 if this is true, but if it is then std.uni.toUniUpper() has a bug, because
 I don't see its code check for this case. Could someone who speaks
 German check this, please?

The uppercase version of "ß" is "SS". (At least according to Unicode and DIN; many Germans however treat "ß" as caseless ...)

Unicode allows two types of toUpper/toLower: complete and simplified. The simplified version doesn't change the casing if the number of codepoints would change. Phobos currently excludes all changes where the simplified version would change the length of the UTF-8 encoded string.

For an updated std.uni see http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D&artnum=34218

Thomas
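For comparison, Python's str.upper implements the complete (length-changing) mapping described here, which makes the ß behavior easy to check outside of Phobos:

```python
# Complete case mapping may change the number of code points:
# German ß uppercases to "SS" (one character becomes two).
assert "ß".upper() == "SS"
assert len("ß") == 1 and len("ß".upper()) == 2

# The mapping is not invertible -- lowercasing does not restore ß:
print("SS".lower())   # ss
```

A simplified mapping, as described above, would instead leave ß unchanged to keep the string length stable.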
Dec 17 2006
prev sibling parent Serg Kovrov <kovrov no.spam> writes:
Todor Totev wrote:
 Hello all,
 are std.string functions supposed to be UNICODE aware?
 The documentation says nothing and looking at the source it appears that
 the functions work only for ASCII characters, despite the fact that
 their arguments are char[].
 If these functions are designed to be ascii only, where can I find the
 unicode ones?

Ok, there was some discussion on this issue, but I still can't get it... Will Unicode support, and support for wchar/dchar, be in std.string in 1.0?

It would be a nice idea to have a roadmap for '1.0' (and at least for 'post 1.0') stuff.

-- serg.
Dec 27 2006