digitalmars.D.bugs - switch (dchar[])

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Using dchar[] as case-keys within a switch results in:
Internal error: s2ir.c 670

http://svn.kuehne.cn/dstress/nocompile/switch_14.d


Using multiple identical dchar[]s as case-keys within a switch results
in:
expression.c:1367: virtual int StringExp::compare(Object*): Assertion `0' failed

http://svn.kuehne.cn/dstress/nocompile/switch_13.d

I don't know why, but the current documentation states that only
"integral types or char[] or wchar[]" are allowed for switch statements.
It is certainly useful if wchar[] and floating types are allowed too.

Thomas

Nov 17 2004

"Simon Buchan" <currently no.where> writes:

On Wed, 17 Nov 2004 10:09:41 +0100, Thomas Kuehne  
<thomas-dloop kuehne.thisisspam.cn> wrote:

 <some bugs>
 I don't know why, but the current documentation states that only
 "integral types or char[] or wchar[]" are allowed for switch statements.
 It is certainly useful if wchar[] and floating types are allowed too.

 Thomas

you mean dchar[]. Floating points cannot be compared very well,  
unfortunately.
(1.0/3.0 != 0.2/0.6 is possible, for example) But you probably knew that.
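
A minimal sketch of the comparison problem, using Python floats (IEEE 754 doubles) rather than D. The tolerance passed to `math.isclose` is an arbitrary value chosen for illustration, not something from this thread.

```python
import math

# Both expressions equal 1/3 mathematically, but 0.2 and 0.6 are already
# rounded when stored as binary doubles, so the quotients can differ.
a = 1.0 / 3.0
b = 0.2 / 0.6

exactly_equal = (a == b)

# A tolerance-based comparison treats the two values as the same number.
# rel_tol=1e-9 is an illustrative choice, not a recommended constant.
nearly_equal = math.isclose(a, b, rel_tol=1e-9)

print(exactly_equal, nearly_equal)
```

A switch over floating-point case keys would need exact matching, which is why results like this make it awkward.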

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/

Nov 17 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Simon Buchan schrieb am Thu, 18 Nov 2004 01:01:50 +1300:
 I don't know why, but the current documentation states that only
 "integral types or char[] or wchar[]" are allowed for switch statements.
 It is certainly useful if wchar[] and floating types are allowed too.

 you mean dchar[]. Floating points cannot be compared very well,  
 unfortunately.
 (1.0/3.0 != 0.2/0.6 is possible, for example) But you probably knew that.

I'm aware of this problem.
Exactly for this purpose IEEE 754 defines a set of different rounding
modes. D's specification is pretty vague here, so there are no hard facts
for this discussion.

float.html:



Thomas

Nov 17 2004

"Simon Buchan" <currently no.where> writes:

On Wed, 17 Nov 2004 13:47:46 +0100, Thomas Kuehne  
<thomas-dloop kuehne.thisisspam.cn> wrote:

at least (punchline drums) we get operators like !<>=
<g>

-- 
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/

Nov 17 2004

Anders F Björklund <afb algonet.se> writes:

Thomas Kuehne wrote:

 I don't know why, but the current documentation states that only
 "integral types or char[] or wchar[]" are allowed for switch statements.
 It is certainly useful if wchar[] and floating types are allowed too.

Isn't dchar[] a pretty useless type ?
(dchar isn't, but an UTF-32 string...)

OTOH, good if it doesn't crash anything!

But I suspect that wchar[] is better for
storing a bunch of (dchar) code points ?

--anders

Nov 17 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Anders F Björklund schrieb am Wed, 17 Nov 2004 14:09:54 +0100:
 Isn't dchar[] a pretty useless type ?
 (dchar isn't, but an UTF-32 string...)

 But I suspect that wchar[] is better for
 storing a bunch of (dchar) code points ?

When you are dealing with extended CJK, ancient or private scripts
dchar is useful. For simple operations you might use wchar, but
as soon as you start extensive text processing you add a huge amount
of overhead (checking whether each unit is a surrogate).
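
The surrogate lookup can be sketched as follows, in Python rather than D. The helper names `utf16_units` and `count_code_points_utf16` are made up for this illustration; the point is only that every UTF-16 unit needs a range check, while a UTF-32 buffer has one element per code point.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def count_code_points_utf16(units):
    """Count code points in UTF-16 code units; each unit needs a range check."""
    count = 0
    i = 0
    while i < len(units):
        # A high surrogate (0xD800-0xDBFF) starts a two-unit pair.
        if 0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units):
            i += 2
        else:
            i += 1
        count += 1
    return count

text = "a\U00020000b"  # U+20000 is a CJK Extension B ideograph
units = utf16_units(text)
utf32_len = len(text.encode("utf-32-le")) // 4  # one 4-byte unit per code point

print(len(units), count_code_points_utf16(units), utf32_len)
```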

Thomas

Nov 17 2004

Anders F Björklund <afb algonet.se> writes:

Thomas Kuehne wrote:

 When you are dealing with extended CJK, ancient or private scripts
 dchar is useful. For simple operations you might use wchar, but 
 as soon as you start extensive text processing you add an huge amount
 of overhead(lookup if this is a surrogate).

OK. My legacy Unicode code is all in Java, which does wchar only...
(Java only got support for surrogates in the brand new 1.5 version)

I was actually talking about the array representation: dchar[], not the
variable which might as well be declared dchar (32-bit registers anyway)

To be honest, I just used "foreach (dchar c; str)" and let D worry
about the implementation. Then again, str is just a standard char[].

My texts are just ISO-8859-1, with about 90-95% of it being US-ASCII.
(actually most of them are in "MacRoman"*, but that's about the same)

--anders

PS.
* = http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT

Nov 17 2004

Sean Kelly <sean f4.ca> writes:

In article <h00s62-v9a.ln1 kuehne.cn>, Thomas Kuehne says...
Anders F Björklund schrieb am Wed, 17 Nov 2004 14:09:54 +0100:
 Isn't dchar[] a pretty useless type ?
 (dchar isn't, but an UTF-32 string...)

 But I suspect that wchar[] is better for
 storing a bunch of (dchar) code points ?

When you are dealing with extended CJK, ancient or private scripts
dchar is useful. For simple operations you might use wchar, but 
as soon as you start extensive text processing you add an huge amount
of overhead(lookup if this is a surrogate).

Semi-related question.  Is it possible for there to be multiple UTF-8 (or
UTF-16) sequences which represent the same UTF-32 character?  I would assume
not, but don't want to make any assumptions.


Sean

Nov 17 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Sean Kelly schrieb am Wed, 17 Nov 2004 19:45:54 +0000 (UTC):
 Isn't dchar[] a pretty useless type ?
 (dchar isn't, but an UTF-32 string...)

 But I suspect that wchar[] is better for
 storing a bunch of (dchar) code points ?



When you are dealing with extended CJK, ancient or private scripts
dchar is useful. For simple operations you might use wchar, but 
as soon as you start extensive text processing you add an huge amount
of overhead(lookup if this is a surrogate).


 Semi-related question.  Is it possible for there to be multiple UTF-8 (or
 UTF-16) sequences which represent the same UTF-32 character?  I would assume
 not, but don't want to make any assumptions.

The bit layout of UTF-8 could technically represent one codepoint with
several byte sequences of different lengths. But the standard requires you
to use the shortest possible sequence; anything longer is invalid. In
UTF-16 each codepoint has exactly one encoding.
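
A small sketch of the shortest-form rule, in Python. The bytes `0xC0 0xAF` are a commonly cited overlong form of U+002F; the helper `decodes` is a name invented for this example.

```python
# U+002F ('/') is the single byte 0x2F in UTF-8.
shortest = b"\x2f"
# 0xC0 0xAF carries the same bits padded into two bytes: an overlong form.
overlong = b"\xc0\xaf"

def decodes(data):
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodes(shortest), decodes(overlong))
```

Because a conforming decoder rejects the longer form, two valid UTF-8 strings holding the same codepoints are byte-for-byte identical.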

Please don't confuse characters and codepoints.
E.g. "small Latin letter a with accent grave" can be represented
with 2 different codepoint sequences and thus with different UTF-8/16
sequences.
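
To illustrate the character versus codepoint distinction, here is a short Python sketch using the standard `unicodedata` module. Choosing NFC as the normalization form is an assumption made for the example; the thread does not name one.

```python
import unicodedata

precomposed = "\u00e0"  # LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"  # 'a' followed by COMBINING GRAVE ACCENT

# Compared codepoint by codepoint, the two strings differ.
same_raw = (precomposed == decomposed)

# After normalizing both to the same form they compare equal.
same_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_raw, same_nfc)
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))
```

So string matching on raw code units, in any UTF encoding, only treats these as the same character if the input has been normalized first.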

Thomas

Nov 17 2004

Sean Kelly <sean f4.ca> writes:

In article <bens62-32r.ln1 kuehne.cn>, Thomas Kuehne says...
Sean Kelly schrieb am Wed, 17 Nov 2004 19:45:54 +0000 (UTC):
 Isn't dchar[] a pretty useless type ?
 (dchar isn't, but an UTF-32 string...)

 But I suspect that wchar[] is better for
 storing a bunch of (dchar) code points ?



When you are dealing with extended CJK, ancient or private scripts
dchar is useful. For simple operations you might use wchar, but 
as soon as you start extensive text processing you add an huge amount
of overhead(lookup if this is a surrogate).


 Semi-related question.  Is it possible for there to be multiple UTF-8 (or
 UTF-16) sequences which represent the same UTF-32 character?  I would assume
 not, but don't want to make any assumptions.

The used encodings could technically present one codepoint with
different UTF-16/UTF-8 sequences. But the standards require you to
use the shortest possible sequence.

Please don't confuse characters and codepoints.
e.g "small Latin letter a with accent grave" can be represented in
with 2 different codepoint sequences and thus with different UTF8/16 
sequences.

The reason I asked was for string matching.  I wanted to be sure there was no
advantage to doing comparisons in UTF-32 vs. UTF-8, for example.  So you're
saying that while it's theoretically possible to have two different UTF-8/16
sequences present the same codepoint, the requirements of the standard make this
effectively impossible.  Is that correct?


Sean

Nov 17 2004

Thomas Kuehne <thomas-dloop kuehne.thisisspam.cn> writes:

Sean Kelly schrieb am Wed, 17 Nov 2004 20:33:07 +0000 (UTC):
 Isn't dchar[] a pretty useless type ?
 (dchar isn't, but an UTF-32 string...)

 But I suspect that wchar[] is better for
 storing a bunch of (dchar) code points ?



When you are dealing with extended CJK, ancient or private scripts
dchar is useful. For simple operations you might use wchar, but 
as soon as you start extensive text processing you add an huge amount
of overhead(lookup if this is a surrogate).


 Semi-related question.  Is it possible for there to be multiple UTF-8 (or
 UTF-16) sequences which represent the same UTF-32 character?  I would assume
 not, but don't want to make any assumptions.

The used encodings could technically present one codepoint with
different UTF-16/UTF-8 sequences. But the standards require you to
use the shortest possible sequence.

Please don't confuse characters and codepoints.
e.g "small Latin letter a with accent grave" can be represented in
with 2 different codepoint sequences and thus with different UTF8/16 
sequences.

 The reason I asked was for string matching.  I wanted to be sure there was no
 advantage to doing comparisons in UTF-32 vs. UTF-8, for example.  So you're
 saying that while it's theoretically possible to have two different UTF-8/16
 sequences present the same codepoint, the requirements of the standard make this
 effectively impossible.  Is that correct?

Yes - with one important exception.
Whenever you interact with Java you have to be aware that its "UTF-8" is
'adapted'. It doesn't encode the code point sequence (aka UTF-32) but
encodes UTF-16 chars. So if you have a UTF-16 surrogate pair (encoding a
code point above 0xFFFF), Java will generate _2_ UTF-8 sequences - one for
the high surrogate and one for the low surrogate - instead of _1_ for the
code point.
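
The difference can be sketched in Python by encoding each UTF-16 unit on its own. This only imitates the surrogate behaviour described above and is not Java's actual encoder (which, for instance, also treats U+0000 specially); `encode_surrogates_separately` is a hypothetical helper name.

```python
def encode_surrogates_separately(text):
    """Encode each UTF-16 code unit as its own UTF-8 sequence."""
    raw = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form for a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

ch = "\U00010400"  # a code point above U+FFFF
standard = ch.encode("utf-8")
per_surrogate = encode_surrogates_separately(ch)

print(standard.hex(), per_surrogate.hex())
```

The standard encoding is one 4-byte sequence, while the per-surrogate variant yields two 3-byte sequences, so the two byte strings never compare equal.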

Concerning the speed. It's a matter of encoded string/byte length.
For mostly Latin scripts use UTF-8.
For everything else - e.g. Greek, Japanese ... - use UTF-16.
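
A rough way to check the size argument is to measure a few samples. The strings below are arbitrary examples chosen for this sketch, not data from the thread.

```python
samples = {
    "latin": "switch statement",
    "greek": "αβγδ",
    "japanese": "日本語",
}

def encoded_sizes(text):
    """Return (UTF-8, UTF-16, UTF-32) byte counts, without a BOM."""
    return (
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),
        len(text.encode("utf-32-le")),
    )

sizes = {name: encoded_sizes(text) for name, text in samples.items()}
for name, (u8, u16, u32) in sizes.items():
    print(name, u8, u16, u32)
```

For these samples the Latin text is smallest in UTF-8 and the Japanese text is smallest in UTF-16, while the Greek text takes the same number of bytes in both, so the gain from UTF-16 depends on the script.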

http://unicode.org

Thomas

Nov 17 2004

Sean Kelly <sean f4.ca> writes:

In article <mgrs62-3ig.ln1 kuehne.cn>, Thomas Kuehne says...
Concerning the speed. It's a matter of encoded string/byte length.
For mostly Latin scripts use UTF-8.
For everything else - e.g. Greek, Japanese ... - use UTF-16.

Thanks a lot.  This was all in reference to my readf routines, which are
currently converting everything to UTF-32 before matching and such.  I'll change
them to use UTF-16 instead.


Sean

Nov 17 2004

"Simon Buchan" <currently no.where> writes:

On Wed, 17 Nov 2004 10:09:41 +0100, Thomas Kuehne  
<thomas-dloop kuehne.thisisspam.cn> wrote:

 Using dchar[] as case-keys within a switch results in:
 Internal error: s2ir.c 670

 http://svn.kuehne.cn/dstress/nocompile/switch_14.d


 Using multiple identical dchar[]s as case-keys within a switch results
 in:
 expression.c:1367: virtual int StringExp::compare(Object*): Assertion  
 `0' failed

 http://svn.kuehne.cn/dstress/nocompile/switch_13.d

 I don't know why, but the current documentation states that only
 "integral types or char[] or wchar[]" are allowed for switch statements.
 It is certainly useful if wchar[] and floating types are allowed too.

 Thomas

Fixed in 1.06

-- 
"Unhappy Microsoft customers have a funny way of becoming Linux,
Salesforce.com and Oracle customers." - www.microsoft-watch.com:
"The Year in Review: Microsoft Opens Up"

Nov 29 2004

"Simon Buchan" <currently no.where> writes:

On Tue, 30 Nov 2004 19:06:37 +1300, Simon Buchan <currently no.where>  
wrote:

<snip>
 Fixed in 1.06

err.... 1.07 <smack forehead>

-- 
"Unhappy Microsoft customers have a funny way of becoming Linux,
Salesforce.com and Oracle customers." - www.microsoft-watch.com:
"The Year in Review: Microsoft Opens Up"

Nov 29 2004
