
digitalmars.D - upper case

reply FLorian Rivoal <FLorian_member pathlink.com> writes:
Overall, D fully integrates Unicode strings, in data structures as well as in
the various functions provided. But a few small things seem to have been
forgotten along the way in std.string:

Everything concerning upper-case and lower-case characters processes only
unaccented Roman letters. This is the behaviour I would expect from functions
processing ANSI strings, but since D strings encode Unicode characters, it might
be a good idea to extend their behaviour to other characters such as accented
Roman letters, Cyrillic letters, and so on. Those also have upper-case and
lower-case forms.

For the sake of efficiency, clarity or something, maybe those could be supplied
as separate functions. Maybe not. But anyway, I think this would have its place
in std.string. Otherwise, include something like "assert(language is english);"
in the preconditions of the functions ;)

Of course, this is not difficult for the programmer who needs it to implement.
But neither would be the current version, which processes only unaccented Roman
letters. So if it is considered worth including for this case, why not for the
other?
Jul 12 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
Panicke ye not. The full Unicode casing algorithms are on their way, complete with locale-sensitivity as required by Turkish, Azeri and Lithuanian, and context-sensitivity as required by Greek and a few others. Just wait a little bit longer.

Right now, the functions getSimpleLowercaseMapping(), getSimpleUppercaseMapping() and getSimpleTitlecaseMapping() in etc.unicode.unicode perform "Default Simple Case Mapping" as defined by the Unicode standard. "Default" means not locale sensitive, and "Simple" means "one character at a time, as defined in UnicodeData.txt". They perform case mappings on a character-by-character basis, and work for ALL languages (except Turkish, Azeri and Lithuanian, which will have to wait for the next version).

The forthcoming version will do everything. Including casefolding and normalization. It's a few weeks away, unfortunately, so be patient.

It would not have been possible for std.string to do all that you require, because a Unicode casing algorithm cannot possibly work unless it can first access all the Unicode properties. std.string does not have that advantage - hence etc.unicode.unicode. One day in the future, it is my hope that all of this will be integrated into Phobos.

Arcane Jill.

Oh - PS - must apologize. A pre-linked downloadable version of etc.unicode.unicode is STILL not available (so it's still just source code). The reason for this was that it was my birthday last weekend, and I was partying instead of coding. Since I actually have a day job, it will have to wait until next weekend now.
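To make "Default Simple Case Mapping" concrete, here is a minimal sketch of uppercasing a UTF-8 string one code point at a time. It does not use etc.unicode.unicode, whose signatures are not shown in this thread; it assumes a present-day D compiler and the dchar overload of std.uni.toUpper from current Phobos, and simpleUpper is a name made up for the example.

```d
import std.stdio : writeln;
import std.uni : toUpper;

// Uppercase a UTF-8 string one code point at a time.
// No locale rules (Turkish dotted/dotless i) and no context rules are applied.
string simpleUpper(string s)
{
    string result;
    foreach (dchar c; s)        // foreach over dchar decodes the UTF-8 input
        result ~= toUpper(c);   // appending a dchar re-encodes it as UTF-8
    return result;
}

void main()
{
    writeln(simpleUpper("perché"));
    writeln(simpleUpper("привет"));
}
```

Because each character is mapped on its own, the result always has the same number of code points as the input, which is exactly what separates the simple mapping from the full casing algorithms described above.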
Jul 13 2004
parent reply "Blandger" <zeroman prominvest.com.ua> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cd04a3$2280$1 digitaldaemon.com...
 In article <ccve36$r2o$1 digitaldaemon.com>, FLorian Rivoal says...
 The forthcoming version will do everything. Including casefolding and
 normalization. It's a few weeks away, unfortunately, so be patient.
Sounds great. Thank you in advance, Jill. I think D lacks a good and consistent String class like the one Java has. For example, I recently got stuck with: Object { ... char[] toString() ... } but I need wchar[] at least, to support non-ASCII languages. DMD complains about the different return type. It seems that many good libs are about to reach their first versions very soon. I'm looking forward to the first DTL as well.
 Oh - PS - must apologize. A pre-linked downloadable version of
 etc.unicode.unicode is STILL not available (so it's still just source
code). The
 reason for this was that it was my birthday last weekend,
Congratulations! It's a good reason for the rest. :))
Jul 13 2004
next sibling parent reply Hauke Duden <H.NS.Duden gmx.net> writes:
Blandger wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cd04a3$2280$1 digitaldaemon.com...
 
In article <ccve36$r2o$1 digitaldaemon.com>, FLorian Rivoal says...
The forthcoming version will do everything. Including casefolding and
normalization. It's a few weeks away, unfortunately, so be patient.
Sounds great. Thank you Jill in advance. I think D is lack of good and consistent String class as java has. For example, recently I stuck with: Object { ... char[] toString() ... } but I need wchar[] at least for supporting non ASCII languages. DMD complains about another return type. It seems that many good libs are coming out to the first versions very soon. I looking forward for first DTL also.
I'm currently working on this. A String interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...). It provides some very useful (IMHO) functionality too (like "split", which is so rarely implemented in non-script languages). It is near completion and needs only a few more hours of work on documentation and testing. I hope to find the time within the next one or two weeks. Hauke
Jul 13 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd0bgb$2g5g$1 digitaldaemon.com>, Hauke Duden says...

I'm currently working on this. A String interface that abstracts from 
the specific encoding + a bunch of implementations for the most common 
ones (UTF-8, 16, 32, system codepage, etc...). It provides some very 
useful (IMHO) functionality too (like "split", which is so rarely 
implemented in non-script languages).
Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway - you're brilliant. We need this. I've always been annoyed that, while std.string has got some amazing functions in it, like find() and so forth, they ONLY work on chars! Huh???? I reckon that now that we have templates, find() should be made to work for ANY kind of array - no need to limit it even to strings. Same for all the other nice stringy functions.
It is near completion and needs only a few more hours of work on 
documentation and testing. I hope to find the time within the next one 
or two weeks.

Hauke
Yay. Looking forward to it. Jill
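The idea of a template find() that works on any array can be sketched in a few lines. This is only an illustration written against a present-day D compiler, not the std.string function discussed here and not Hauke's String library; findSub is a hypothetical name.

```d
import std.stdio : writeln;

// Index of the first occurrence of `needle` in `haystack`, or -1 if absent.
// T can be char, wchar, dchar, int, or any element type that supports ==.
ptrdiff_t findSub(T)(const(T)[] haystack, const(T)[] needle)
{
    if (needle.length > haystack.length)
        return -1;
    foreach (i; 0 .. haystack.length - needle.length + 1)
        if (haystack[i .. i + needle.length] == needle)
            return i;
    return -1;
}

void main()
{
    writeln(findSub("héllo wörld", "wörld"));     // index in UTF-8 code units
    writeln(findSub("héllo wörld"w, "wörld"w));   // index in UTF-16 code units
    writeln(findSub([1, 2, 3, 4], [3, 4]));       // plain int arrays work too
}
```

Note that the returned index counts array elements, so the same text gives different positions in char[] and wchar[] form.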
Jul 13 2004
parent Hauke Duden <H.NS.Duden gmx.net> writes:
Arcane Jill wrote:
 In article <cd0bgb$2g5g$1 digitaldaemon.com>, Hauke Duden says...
 
 
I'm currently working on this. A String interface that abstracts from 
the specific encoding + a bunch of implementations for the most common 
ones (UTF-8, 16, 32, system codepage, etc...). It provides some very 
useful (IMHO) functionality too (like "split", which is so rarely 
implemented in non-script languages).
Hauke, dude, did anyone ever tell you you're brilliant? Well, I'll say it anyway - you're brilliant. We need this.
Not recently, so thank you very much ;).
 I've always been annoyed that, while std.string has got some amazing functions
 in it, like find() and so forth, they ONLY work chars! Huh????
 
 I reckon that now that we have templates, find() should be made to work for ANY
 kind of array - no need to limit it even to strings. Same for all the other
nice
 stringy functions.
Yes. I've written a mixin that contains the string algorithms and that is used in the String classes. I've also gone to some length to ensure that the character decoding stuff can be inlined into the mixed-in algorithms. So performance will (hopefully - I haven't done any tests yet) be good. Hauke
Jul 13 2004
prev sibling parent "Blandger" <zeroman prominvest.com.ua> writes:
"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:cd0bgb$2g5g$1 digitaldaemon.com...
 Blandger wrote:
 I'm currently working on this. A String interface that abstracts from
 the specific encoding + a bunch of implementations for the most common
 ones (UTF-8, 16, 32, system codepage, etc...). It provides some very
 useful (IMHO) functionality too (like "split", which is so rarely
 implemented in non-script languages).
Wow! Nice to hear it. :)
 It is near completion and needs only a few more hours of work on
 documentation and testing. I hope to find the time within the next one
 or two weeks.
Good. Don't rush; just make it good, consistent and handy to work with. Thanks!
Jul 13 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd085g$29tq$1 digitaldaemon.com>, Blandger says...

but I need wchar[] at least for supporting non ASCII languages.
Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays.

[string literal example not preserved by the forum]

is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping() to uppercase it too).

Arcane Jill
Jul 13 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd0jdn$2sru$1 digitaldaemon.com>, Arcane Jill says...
In article <cd085g$29tq$1 digitaldaemon.com>, Blandger says...

but I need wchar[] at least for supporting non ASCII languages.
Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is available to char[] arrays. is perfectly legal. (And you can use etc.unicode's getSimpleUppercaseMapping() to uppercase it too). Arcane Jill
Okay, so it doesn't come out right on this forum! But it will work in D source.
Jul 13 2004
prev sibling parent reply "Blandger" <zeroman prominvest.com.ua> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cd0jdn$2sru$1 digitaldaemon.com...
 In article <cd085g$29tq$1 digitaldaemon.com>, Blandger says...

but I need wchar[] at least for supporting non ASCII languages.
Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is
available to
 char[] arrays.




 is perfectly legal. (And you can use etc.unicode's
 getSimpleUppercaseMapping() to uppercase it too).
Thanks for the addition. You are right, it's legal, but it looks (and I think works) ugly. It seems to me there is no 'normal way' to work with upper/lowercase, sort, search, collate, replace and code page stuff with non-ASCII letters within Phobos in this case. Or have I missed something??
Jul 13 2004
next sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Blandger" <zeroman prominvest.com.ua> wrote in message
news:cd0lhh$30mc$1 digitaldaemon.com...
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cd0jdn$2sru$1 digitaldaemon.com...
 In article <cd085g$29tq$1 digitaldaemon.com>, Blandger says...

but I need wchar[] at least for supporting non ASCII languages.
Not true. char[] stores UTF-8, not ASCII. The whole of Unicode is
available to
 char[] arrays.




 is perfectly legal. (And you can use etc.unicode's
getSimpleUppercaseMapping() to uppercase it too). Thanks for addition. You are right it's legal but it looks (and I think works) ugly. It seems
to
 me there is no 'normal way' to work with upper/lowecase, sort, search,
 collate, replace, code pages stuff  with non ASCII letters within Phobos
in
 this case . Or am I something missed ??
It looks ugly because it's written with unicode code numbers rather than the actual characters. If you write your source code using an editor that supports UTF-8, UTF-16, or UTF-32 you can write it using the actual characters. The D compiler can handle UTF-8, UTF-16, or UTF-32 source text.
Jul 13 2004
next sibling parent reply "Blandger" <zeroman aport.ru> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:cd17f0$115j$2 digitaldaemon.com...

 It looks ugly because it's written with unicode code numbers rather than the
 actual characters. If you write your source code using an editor that
 supports UTF-8, UTF-16, or UTF-32 you can write it using the actual
 characters. The D compiler can handle UTF-8, UTF-16, or UTF-32 source text.
I keep catching myself being afraid to write code in a UTF editor. I don't actually know why! Maybe it's an old, outdated habit, maybe it's some kind of 'internal fear' of UTF-x stuff. I really don't know why it's so. So I decided to ask: how many people in this NG use UTF-x editors to write their sources??
Jul 13 2004
next sibling parent Thomas Kuehne <eisvogel users.sourceforge.net> writes:
Blandger wrote:
 So I decided to ask how many people in NG use UTF-x editors coding
 sources??
Hasn't this been the standard for several years now - at least in the perl and Java world? Thomas
Jul 13 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd1fmv$1fqa$3 digitaldaemon.com>, Blandger says...
So I decided to ask how many people in NG use UTF-x editors coding sources??
I wasn't aware that there were still any _non_ UTF-XX editors in use! Even Microsoft Notepad - the bottom end of text editors if you're a programmer (no syntax highlighting, etc.) understands UTF-8. These days, what text editors don't? Me, I use TextPad. TextPad is not fully Unicode-aware (yet), but it CAN save files in UTF-8 format, which is all I need. Arcane Jill
Jul 14 2004
parent reply "Blandger" <zeroman prominvest.com.ua> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cd2lsj$hsu$1 digitaldaemon.com...
 Me, I use TextPad. TextPad is not fully Unicode-aware (yet), but it CAN
save
 files in UTF-8 format, which is all I need.
Actually I'd like to ask: how many people currently use 'Unicode editors' for their project sources on a regular basis, rather than occasionally? It seems to me it happens very rarely (if ever), and it's not a 'strict rule' in companies/projects. So I ask myself: why is that, if Unicode is so wonderful?
Jul 14 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd2vp0$164f$1 digitaldaemon.com>, Blandger says...
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cd2lsj$hsu$1 digitaldaemon.com...
 Me, I use TextPad. TextPad is not fully Unicode-aware (yet), but it CAN
save
 files in UTF-8 format, which is all I need.
Actually I'd like to ask: how many people at present time use 'unicode editors' for their project's sources on the 'regular base' but not occasionally. It seems to me it happens very rarely (if ever) and it's not the 'strict rule' in companies/projects. So I think myself why i's so if unicode is so wonderful?
And I say again, almost ALL text editors these days can save in UTF. In fact, I'm not even sure I can name one that doesn't. On that basis, then, the probable answer is almost everyone (although they may not consciously be aware of it). Arcane Jill
Jul 14 2004
prev sibling parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <cd17f0$115j$2 digitaldaemon.com>, Walter says...

[...]
If you write your source code using an editor that
supports UTF-8, UTF-16, or UTF-32 you can write it using the actual
characters. The D compiler can handle UTF-8, UTF-16, or UTF-32 source text.
This leads to some questions: How can it detect the right coding? Does endianness matter? And what about my current default codepage (windows-1252)? If I pass an HTML file as source, does it honor the encoding specified in the header? Ciao
Jul 13 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd2ksg$fng$1 digitaldaemon.com>, Roberto Mariottini says...

This leads to some questions:
How can it detect the right coding?
UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are very easy to tell apart, either with or without a BOM (a BOM is a special prefix). It cannot, however, distinguish the above from any OTHER encoding.
Does endianness matter?
With the UTF family, no. As I said, they are easy to tell apart.
And what about my current default codepage (windows-1252)?
D is designed with a global philosophy, so it will ignore your default codepage, and signal an error if you rely upon it. This is a good thing, because in D (unlike C/C++), the same source file will compile identically on all machines.

Consider the following fragment of C++:

[C++ fragment not preserved by the forum]

(assuming the existence of a C++ toUTF16() function). Even in Western Europe and America, if you run that on Linux (where the default encoding is ISO-8859-1) you'll end up with s containing U+0080, but if you run it on Windows (where the default encoding is WINDOWS-1252) you'll end up with s containing U+20AC. Outside of Western Europe and America, the situation would be decidedly worse.

D, on the other hand, will produce a consistent binary for the same source, no matter where you live or what your encoding is. In other words, the short answer to your question:
And what about my current default codepage (windows-1252)?
is, if you're using D, forget it.
If I pass an HTML as source, does it honor the encoding specified in the header?
No. It can't, because DMD doesn't come armed with hundreds of different decoders. Arcane Jill
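As a rough illustration of how the UTF family can be told apart from a BOM, here is a minimal sketch for a present-day D compiler. It is not DMD's lexer code. It handles only the with-BOM case and falls back to UTF-8 when no mark is found, which is an assumption of this sketch; SourceEncoding, startsWithBytes and detectByBom are names invented for the example.

```d
import std.stdio : writeln;

enum SourceEncoding { utf8, utf16le, utf16be, utf32le, utf32be }

// True if `data` begins with the bytes in `prefix`.
bool startsWithBytes(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

// Classify a byte stream by its byte order mark.
SourceEncoding detectByBom(const(ubyte)[] data)
{
    // UTF-32LE must be tested before UTF-16LE: FF FE 00 00 begins with FF FE.
    if (startsWithBytes(data, [0xFF, 0xFE, 0x00, 0x00])) return SourceEncoding.utf32le;
    if (startsWithBytes(data, [0x00, 0x00, 0xFE, 0xFF])) return SourceEncoding.utf32be;
    if (startsWithBytes(data, [0xFF, 0xFE])) return SourceEncoding.utf16le;
    if (startsWithBytes(data, [0xFE, 0xFF])) return SourceEncoding.utf16be;
    // EF BB BF, or no BOM at all: treat as UTF-8 (plain ASCII is a subset).
    return SourceEncoding.utf8;
}

void main()
{
    writeln(detectByBom([0xFF, 0xFE, 0x69, 0x00]));   // BOM, then 'i' as UTF-16LE
    writeln(detectByBom([0x69, 0x6E, 0x74]));         // "int" with no BOM
}
```

A file saved in windows-1252 carries no such mark, so a check like this has no way to recognise it, which matches the point that nothing outside the UTF family can be distinguished.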
Jul 14 2004
parent reply Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <cd2p4m$p0c$1 digitaldaemon.com>, Arcane Jill says...
In article <cd2ksg$fng$1 digitaldaemon.com>, Roberto Mariottini says...

This leads to some questions:
How can it detect the right coding?
UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are very easy to tell apart, either with or without a BOM (a BOM is a special prefix). It cannot, however, distinguish the above from any OTHER encoding.
Does endianness matter?
With the UTF family, no. As I said, they are easy to tell apart.
And what about my current default codepage (windows-1252)?
[...]
is, if you're using D, forget it.
Thanks for the answer. I should have RTFM before asking, though. http://www.digitalmars.com/d/lex.html states that D supports only ASCII and UTF-*; if there isn't a BOM at the beginning, then UTF-8 is assumed (so ASCII is safe too).
If I pass an HTML as source, does it honor the encoding specified in the header?
No. It can't, because DMD doesn't come armed with hundreds of different decoders.
Well, do you know any translator from 1252 to UTF-8? Ciao
Jul 14 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd372c$1i77$1 digitaldaemon.com>, Roberto Mariottini says...

Well, do you know any translator from 1252 to UTF-8?
How about I just make one up right now:

[code not preserved by the forum]

Arcane Jill
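Since the translator posted here did not survive, the following is a minimal sketch of what a windows-1252 to UTF-8 conversion can look like. It is not Arcane Jill's code. It assumes a present-day D compiler, and it maps the five byte values that windows-1252 leaves undefined to U+FFFD, which is a choice made for this sketch; windows1252ToUtf8 and cp1252High are hypothetical names.

```d
import std.stdio : writeln;

// Code points for windows-1252 bytes 0x80 .. 0x9F. Every other byte value has
// the same code point as in ISO-8859-1. The five undefined bytes
// (0x81, 0x8D, 0x8F, 0x90, 0x9D) are mapped to U+FFFD in this sketch.
static immutable dchar[32] cp1252High = [
    '\u20AC', '\uFFFD', '\u201A', '\u0192', '\u201E', '\u2026', '\u2020', '\u2021',
    '\u02C6', '\u2030', '\u0160', '\u2039', '\u0152', '\uFFFD', '\u017D', '\uFFFD',
    '\uFFFD', '\u2018', '\u2019', '\u201C', '\u201D', '\u2022', '\u2013', '\u2014',
    '\u02DC', '\u2122', '\u0161', '\u203A', '\u0153', '\uFFFD', '\u017E', '\u0178',
];

// Convert raw windows-1252 bytes into a UTF-8 string.
string windows1252ToUtf8(const(ubyte)[] input)
{
    string result;
    foreach (b; input)
    {
        dchar c = (b >= 0x80 && b <= 0x9F) ? cp1252High[b - 0x80] : cast(dchar) b;
        result ~= c;   // appending a dchar encodes it as UTF-8
    }
    return result;
}

void main()
{
    // "Perché" as stored by a windows-1252 editor: é is the single byte 0xE9.
    immutable ubyte[] raw = [0x50, 0x65, 0x72, 0x63, 0x68, 0xE9];
    string utf8 = windows1252ToUtf8(raw);
    writeln(utf8.length);   // the é now occupies two bytes
}
```

The byte 0x80 is the case described earlier in the thread: through this table it becomes U+20AC, whereas an ISO-8859-1 reading of the same byte would give U+0080.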
Jul 14 2004
parent Roberto Mariottini <Roberto_member pathlink.com> writes:
I've played a little with this, but I don't seem to find a suitable solution.
Attached is the Jill code modified to get a filter program.

My test program is this:

import std.c.stdio;
import std.utf;

int main(char[][] args)
{
    int perché;

    printf("Perché\n");

    return 0;
}

Obviously, if I compile it in its original encoding (Windows 1252) I get an
error:

test.d(6): invalid UTF-8 sequence
test.d(6): invalid UTF-8 sequence
test.d(6): unsupported char 0xe9

So I translate it to UTF-8, using:
w2u.exe test.d > test2.d

The newly encoded file compiles without errors, but the printf output is
scrambled by the conversion: two characters are printed instead of the special
one. In fact the filter translates the special character into a two-byte UTF-8
sequence, and printf doesn't recognize UTF-8 encoded strings.
So I changed it to use wprintf:

prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cd0lhh$30mc$1 digitaldaemon.com>, Blandger says...


You are right it's legal but it looks (and I think works) ugly.
Errm. That was an artifact of this forum's web interface. When I typed it in, it looked to me like a nice bunch of Russian and Chinese characters with a few Runes and Dingbats thrown in. It would look like that in my text editor too. And it would work. Alas, the HTML capacities of the D forum web site were not up to the job, so you didn't see what I intended for you to see. Apparently you have to be a virgin to see unicode. :) Something like that anyway. Walter says Unicode is the future. I think he's right, but unfortunately it isn't the present.
It seems to
me there is no 'normal way' to work with upper/lowecase, sort, search,
collate, replace, code pages stuff  with non ASCII letters within Phobos in
this case . Or am I something missed ??
Right now, no. But you can use the getSimpleUppercaseMapping() etc. functions from Deimos to do casing. Lexicographical sort isn't a problem, obviously.

Search - depends what you mean. If you're waiting for the Unicode regular expression engine, you'll have to wait a while - that will be one of the last things we get. If you want an exact match though, that's pretty easy right now - a string is just an array, after all.

Collation will be available (but isn't yet) via the Unicode Collation Algorithm - for which we'll have to download the CLDR (Common Locale Data Repository) from Unicode to get all the locale-specific weightings, but that will come.

"Code pages", note, have nothing to do with Unicode. That comes into play in our sphere during transcoding (encoding/decoding), which is something that I imagine will ultimately be built into streams.

Much of Phobos was written in the early days of D, when there was no access to Unicode property data. It takes time to organize a proper Unicode library. Unicode has layers of features, with each algorithm relying on the services of the next layer down. Phobos had access to none of this, when it was written. Even now, Deimos's Unicode support is still only at the character level, but we'll get to the string level eventually.

But all this will come. And I strongly suspect that D's Unicode support will eventually make it the language of choice for Unicode projects.

Arcane Jill
Jul 13 2004
parent "Blandger" <zeroman aport.ru> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cd18gg$135d$1 digitaldaemon.com...
 In article <cd0lhh$30mc$1 digitaldaemon.com>, Blandger says...

 it would work. Alas, the HTML capacities of the D forum web site were not
up the
 job, so you didn't see what I intended for you to see.
I see. :)
 Apparently you have to be a virgin to see unicode.  :)

 Something like that anyway. Walter says Unicode is the future. I think
he's
 right, but unfortunately it isn't the present.
Agree with you both.
 "Code pages", note, have nothing to do with Unicode. That comes into play
in our
 sphere during transoding (encoding/decoding), which is something that I
imagine
 will ultimately be built into streams.
Exactly. I meant that I don't want to think about code pages when I use something like a 'String class' in D code, because it should be 'internally Unicode' as it is in Java. But I do have to think about code pages for I/O, because there are lots of 'old files' with 'old non-Unicode' content.
 Even now, Deimos's Unicode support is still only at the character level,
but
 we'll get to the string level eventually.
 But all this will come. And I strongly suspect that D's Unicode support
will
 eventually make it the language of choice for Unicode projects.
Hope so. :)
Jul 13 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Blandger" <zeroman prominvest.com.ua> wrote in message
news:cd085g$29tq$1 digitaldaemon.com...
 For example, recently I stuck with:
 Object {
 ...
 char[] toString()
 ...
 }
 but I need wchar[] at least for supporting non ASCII languages. DMD
 complains about another return type.
char[] isn't ASCII, it's UTF-8. Any UTF-8 string can be converted to UTF-16 (which is wchar[]) by calling std.utf.toUTF16(). So, char[] toString() does fully support non-ASCII languages.
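A small sketch of the point that UTF-8 text in a char-based string converts losslessly to UTF-16. It assumes a present-day D compiler, where string and wstring are the immutable forms of char[] and wchar[]; std.utf.toUTF16 is the function named above, and std.utf.count is used to count code points.

```d
import std.stdio : writeln;
import std.utf : count, toUTF16;

void main()
{
    string  s = "Perché";     // UTF-8: 'é' occupies two code units
    wstring w = toUTF16(s);   // UTF-16: 'é' occupies one code unit

    writeln(s.length);   // 7 UTF-8 code units
    writeln(w.length);   // 6 UTF-16 code units
    writeln(count(s));   // 6 code points in either encoding
}
```

The two arrays differ in length because .length counts code units, not characters; the text they hold is the same.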
Jul 13 2004
parent reply "Blandger" <zeroman aport.ru> writes:
"Walter" <newshound digitalmars.com> wrote in message
news:cd17ev$115j$1 digitaldaemon.com...

 char[] isn't ASCII, it's UTF-8. Any UTF-8 string can be converted to
UTF-16
 (which is wchar[]) by calling std.utf.toUTF16(). So, char[] toString()
does
 fully support non-ASCII languages.
Sorry for misleading all of you a little. DWT has an 'internal convention' of using 'alias wchar[] String;' as a replacement for the 'java String class'. I don't know why. It seems it was Andy's decision. I hope it's right but...

Recently I got stuck with this:

alias wchar[] String;
  public class ToStringTest {
    this() {
    }
    String toString() {
      return "ff";
    }
  }
DMD complains about the different return type:
//function toString overrides but is not covariant with toString

How can we get past this 'probable error'? The error has gone away for now, for an unknown reason (it happened before), but I'm not sure it won't come back again later... (sorry for the probably wrong English grammar here).
Jul 13 2004
parent "Walter" <newshound digitalmars.com> writes:
"Blandger" <zeroman aport.ru> wrote in message
news:cd1fmq$1fqa$2 digitaldaemon.com...
 "Walter" <newshound digitalmars.com> wrote in message
 news:cd17ev$115j$1 digitaldaemon.com...

 char[] isn't ASCII, it's UTF-8. Any UTF-8 string can be converted to
UTF-16
 (which is wchar[]) by calling std.utf.toUTF16(). So, char[] toString()
does
 fully support non-ASCII languages.
Sorry for mistaking all of you a little. DWT has a 'internal convention' to use 'alias wchar[] String;' for 'java String class' replacement. I don't know why. Seem it was Andy's decision.
I
 hope it's right but...

 Recently I stuck with this:

 alias wchar[] String;
   public class ToStringTest {
     this() {
     }
     String toString() {
       return "ff";
     }
   }
 DMD complains about another return type:
 //function toString overrides but is not covariant with toString

 How we can go throught this 'probable error'? This error has gone away by
 this time with unknow reason (it happed before) but I'm not sure if it
 doesn't come back again later... (sorry for probobly wrong english gramma
 here).
The "not covariant" error happens when the overriding function has a return type that is not the same as the return type of the overridden function, or is not derived from that type.
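The covariance rule can be shown with a small sketch, written for a present-day D compiler; Shape and Circle are hypothetical classes, not taken from DWT. An override may narrow the return type to a derived class, but it may not switch to an unrelated type such as wstring.

```d
import std.stdio : writeln;

class Shape
{
    Shape clone() { return new Shape; }
    string name() { return "shape"; }
}

class Circle : Shape
{
    // Allowed: Circle derives from Shape, so this return type is covariant.
    override Circle clone() { return new Circle; }

    // Allowed: same return type as the overridden function.
    override string name() { return "circle"; }

    // Replacing the override above with the line below would be rejected,
    // because wstring is neither string nor derived from it:
    // override wstring name() { return "circle"w; }
}

void main()
{
    Shape s = new Circle;
    writeln(s.clone().name());   // dispatches to Circle's overrides
}
```

This is the same situation as the String alias for wchar[]: the base toString() returns a char-based string, so an override returning wchar[] is not covariant with it.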
Jul 13 2004