
digitalmars.D - A char is also not an int

Arcane Jill <Arcane_member pathlink.com> writes:
While we're on the subject of disunifying one type from another, may I point out
that a char is also not an int.

Back in the old days of C, there was no 8-bit wide type other than char, so if
you wanted an 8-bit wide numeric type, you used a char.

Similarly, in Java, there is no UNSIGNED 16-bit wide type other than char, so if
that's what you need, you use char.

D has no such problems, so maybe it's about time to make the distinction clear.
Logically, it makes no sense to try to do addition and subtraction with the
at-sign or the square-right-bracket symbol. We all KNOW that the zero glyph is
*NOT* the same thing as the number 48.

This was true even back in the days of ASCII, but it's even more true in
Unicode. A char in D stores, not a character, but a fragment of UTF-8, an
encoding of a Unicode character - and even a Unicode character is /itself/ an
encoding. There is no longer a one-to-one correspondence between character and
glyph. (There IS such a one-to-one correspondence in the old ASCII range of
\u0020 to \u007E, of course, since Unicode is a superset of ASCII.)

Perhaps it's time to change this one too?

       int a = 'X';            // wrong
       char a = 'X';           // right
       int a = cast(int) 'X';  // right

Arcane Jill
May 27 2004
"Matthew" <matthew.hat stlsoft.dot.org> writes:
 While we're on the subject of disunifying one type from another, may I
 point out that a char is also not an int.

Can one implicitly convert char to int? Man, that sucks! Pardon my indignation, and credit my claim never to have tried it to a long-standing aversion to such things from C/C++. If it's true, it needs to be made untrue ASAP. (Was that strong enough? I hope so ...)
May 27 2004
Benji Smith <dlanguage xxagg.com> writes:
On Thu, 27 May 2004 07:16:19 +0000 (UTC), Arcane Jill
<Arcane_member pathlink.com> wrote:

Perhaps it's time to change this one too?

       int a = 'X';            // wrong
       char a = 'X';           // right
       int a = cast(int) 'X';  // right


I don't even like the notion of being able to explicitly cast from a char to an int. Especially in the case of unicode characters, the semantics of a cast (even an explicit cast) are not very well defined. Getting the int value of a character should, in my opinion, be the province of a static method from a specific string class.

--Benji
May 27 2004
Kevin Bealer <Kevin_member pathlink.com> writes:
In article <fd8cb0dfge0cm85o781a2rjpp9ait6fskq 4ax.com>, Benji Smith says...
On Thu, 27 May 2004 07:16:19 +0000 (UTC), Arcane Jill
<Arcane_member pathlink.com> wrote:

Perhaps it's time to change this one too?

       int a = 'X';            // wrong
       char a = 'X';           // right
       int a = cast(int) 'X';  // right


I don't even like the notion of being able to explicitly cast from a char to an int. Especially in the case of unicode characters, the semantics of a cast (even an explicit cast) are not very well defined. Getting the int value of a character should, in my opinion, be the province of a static method from a specific string class.

--Benji

I think the opposite is true; with Unicode, the semantics CAN be solid. In a normal C program, this is not the case. Consider:

       int chA = 'A';
       int chZ = 'Z';

       if ((chZ - chA) == 25) {
           // Is this true for EBCDIC? I dunno.
       }

In C, the encoding is assumed to be the default system architecture encoding, which is not necessarily Unicode or ASCII. But, if the language DEFINES unicode as the operative representation, then the value 'A' should always be the same integer value. In any case, sometimes you need the integer value.

Kevin
May 27 2004
Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 D has no such problems, so maybe it's about time to make the
 distinction clear. Logically, it makes no sense to try to do addition
 and subtraction with the at-sign or the square-right-bracket symbol.

Not even in cryptography and the like?
 We all KNOW that the zero glyph is *NOT* the same thing as the number
 48.
 
 This was true even back in the days of ASCII, but it's even more true
 in Unicode. A char in D stores, not a character, but a fragment of
 UTF-8, an encoding of Unicode character - and even a Unicode
 character is /itself/ an encoding. There is no longer a one-to-one
 correspondence between character and glyph.

By 'character' do you mean 'character' or 'char value'?

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
May 27 2004
"Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:c944k3$1o53$1 digitaldaemon.com...
 While we're on the subject of disunifying one type from another, may I
 point out that a char is also not an int.

 Back in the old days of C, there was no 8-bit wide type other than char,
 so if you wanted an 8-bit wide numeric type, you used a char.

 Similarly, in Java, there is no UNSIGNED 16-bit wide type other than
 char, so if that's what you need, you use char.

 D has no such problems, so maybe it's about time to make the distinction
 clear. Logically, it makes no sense to try to do addition and subtraction
 with the at-sign or the square-right-bracket symbol. We all KNOW that the
 zero glyph is *NOT* the same thing as the number 48.

 This was true even back in the days of ASCII, but it's even more true in
 Unicode. A char in D stores, not a character, but a fragment of UTF-8, an
 encoding of a Unicode character - and even a Unicode character is /itself/
 an encoding. There is no longer a one-to-one correspondence between
 character and glyph. (There IS such a one-to-one correspondence in the old
 ASCII range of \u0020 to \u007E, of course, since Unicode is a superset of
 ASCII.)

 Perhaps it's time to change this one too?

       int a = 'X';            // wrong
       char a = 'X';           // right
       int a = cast(int) 'X';  // right


I understand where you're coming from, and this is a compelling idea, but this idea has been tried out before in Pascal. And I can say from personal experience it is one reason I hate Pascal <g>. Chars do want to be integral data types, and requiring a cast for it leads to execrably ugly expressions filled with casts. In moving to C, one of the breaths of fresh air was to not need all those %^&*^^% casts any more.

Let me enumerate a few ways that chars are used as integral types:

1) converting case
2) using char as index into a translation table
3) encoding/decoding UTF strings
4) encryption/decryption software
5) compression code
6) hashing
7) regex internal implementation
8) char value as input to a state machine like a lexer
9) encoding/decoding strings to/from integers

In other words, routine system programming tasks. The improvement D has, however, is to have chars be a separate type from byte, which makes for better self-documenting code, and one can have different overloads for them.
May 27 2004
James McComb <alan jamesmccomb.id.au> writes:
Walter wrote:

 I understand where you're coming from, and this is a compelling idea, but
 this idea has been tried out before in Pascal. And I can say from personal
 experience it is one reason I hate Pascal <g>. Chars do want to be integral
 data types, and requiring a cast for it leads to execrably ugly expressions
 filled with casts.

I agree with you about chars Walter, but this is because I think chars are different from bools.

The way I see it, bools can be either TRUE or FALSE, and these values are not numeric. TRUE + 32 is not defined. (Of course, bools will be *implemented* as numeric values, but I'm talking about syntax.)

But character standards, such as ASCII and Unicode, *define* characters as numeric quantities. ASCII *defines* 'A' to be 65. So characters really are numeric. 'A' + 32 equals 'a'. This behaviour is well-defined.

So I'd like to have a proper bool type, but I'd prefer D chars to remain as they are.

James McComb
May 27 2004
Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <c95al5$19mr$1 digitaldaemon.com>, Walter says...
I understand where you're coming from, and this is a compelling idea, but
this idea has been tried out before in Pascal. And I can say from personal
experience it is one reason I hate Pascal <g>. 

That's strange, because this is one of the reasons that make me *like* Pascal :-)
Chars do want to be integral
data types, and requiring a cast for it leads to execrably ugly expressions
filled with casts. In moving to C, one of the breaths of fresh air was to
not need all those %^&*^^% casts any more.

In my experience, only poor programming practice leads to many int <-> char casts.
Let me enumerate a few ways that
chars are used as integral types:

1) converting case

This is true only for English. Real natural languages are more complex than this, needing collating tables. I don't know about non-latin alphabets.
2) using char as index into a translation table

       type
         a: array['a'..'z'] of 'A'..'Z';
         b: array[char] of char;
3) encoding/decoding UTF strings
4) encryption/decryption software
5) compression code
6) hashing
7) regex internal implementation

This is something you just won't do frequently, once they are in a library. Simply converting all input to integers and reconverting the final output to chars should work.
8) char value as input to a state machine like a lexer
9) encoding/decoding strings to/from integers

I don't see the point here.
in other words, routine system programming tasks. The improvement D has,
however, is to have chars be a separate type from byte, which makes for
better self-documenting code, and one can have different overloads for them.

This is better than nothing :-) Ciao
May 28 2004
"Phill" <phill pacific.net.au> writes:
Roberto:

Can you explain what you mean by "Real natural languages"?
May 28 2004
Roberto Mariottini <Roberto_member pathlink.com> writes:
In article <c99c0u$12gr$1 digitaldaemon.com>, Phill says...
Roberto:

Can you explain what you mean by "Real natural languages"?

"French", "Italian" ? ;-) Ciao
May 31 2004
"Matthew" <matthew.hat stlsoft.dot.org> writes:
 I understand where you're coming from, and this is a compelling idea, but
 this idea has been tried out before in Pascal. And I can say from personal
 experience it is one reason I hate Pascal <g>. Chars do want to be integral
 data types, and requiring a cast for it leads to execrably ugly expressions
 filled with casts. In moving to C, one of the breaths of fresh air was to
 not need all those %^&*^^% casts any more. Let me enumerate a few ways that
 chars are used as integral types:

 1) converting case
 2) using char as index into a translation table
 3) encoding/decoding UTF strings
 4) encryption/decryption software
 5) compression code
 6) hashing
 7) regex internal implementation
 8) char value as input to a state machine like a lexer
 9) encoding/decoding strings to/from integers

 in other words, routine system programming tasks. The improvement D has,
 however, is to have chars be a separate type from byte, which makes for
 better self-documenting code, and one can have different overloads for them.

<Horse state="dead" action="flog">But yet we cannot overload on single-bit integrals and boolean values!</Horse>
Jun 04 2004
Derek Parnell <derek psych.ward> writes:
On Thu, 27 May 2004 07:16:19 +0000 (UTC), Arcane Jill wrote:

 While we're on the subject of disunifying one type from another, may I
 point out that a char is also not an int.
 
 Back in the old days of C, there was no 8-bit wide type other than char,
 so if you wanted an 8-bit wide numeric type, you used a char.
 
 Similarly, in Java, there is no UNSIGNED 16-bit wide type other than
 char, so if that's what you need, you use char.
 
 D has no such problems, so maybe it's about time to make the distinction
 clear. Logically, it makes no sense to try to do addition and subtraction
 with the at-sign or the square-right-bracket symbol. We all KNOW that the
 zero glyph is *NOT* the same thing as the number 48.
 
 This was true even back in the days of ASCII, but it's even more true in
 Unicode. A char in D stores, not a character, but a fragment of UTF-8, an
 encoding of a Unicode character - and even a Unicode character is /itself/
 an encoding. There is no longer a one-to-one correspondence between
 character and glyph. (There IS such a one-to-one correspondence in the old
 ASCII range of \u0020 to \u007E, of course, since Unicode is a superset of
 ASCII.)
 
 Perhaps it's time to change this one too?
 
       int a = 'X';            // wrong
       char a = 'X';           // right
       int a = cast(int) 'X';  // right


Maybe...

Another way of looking at it is that a character has (at least) two properties: a Glyph and an Identifier. Within an encoding set (eg. Unicode, ASCII, EBCDIC, ...), no two characters have the same identifier even though they may have the same glyph (eg. Space and Non-Breaking Space).

One may then argue that an efficient datatype for the identifier is an unsigned integer value. This makes it simple to be used as an index into a glyph table. In fact, an encoding set is likely to have multiple glyph tables for various font representations, but that is another issue altogether.

So, an implicit cast from char to int would be just getting the character's identifier value, which is not such a bad thing. What is a bad thing is making assumptions about the relationships between character identifiers. There is no necessary correlation between a character set's collation sequence and the characters' identifiers.

I frequently work with encryption algorithms, and integer character identifiers are a *very* handy thing indeed.

-- 
Derek
28/May/04 10:50:16 AM
May 27 2004