
D - char vs ascii

reply "Walter" <walter digitalmars.com> writes:
What do people think about using the keyword:

    ascii or char?
    unicode or wchar?

-Walter
Aug 14 2001
next sibling parent reply Jan Knepper <jan smartsoft.cc> writes:
I guess ascii makes more sense than char and unicode makes more
sense than wchar or wchar_t...



Walter wrote:

 What do people think about using the keyword:

     ascii or char?
     unicode or wchar?

 -Walter
Aug 14 2001
parent reply "Erik Funkenbusch" <erikf seahorsesoftware.com> writes:
Just some suggestions that come to mind, in no particular order or
coherence:

No, ascii makes little sense.  ascii refers explicitly to one character set.
There are many 8 bit character sets or locales or code pages or whatever you
want to call them.

Also, unicode can be 8 bit or 16 bit, and there is talk of a 32 bit as well
in the future.  I think any language that expects to stick around for any
length of time needs to address the forward compatibility of new code sets.

I'd much rather see a way to define your character type and use it
throughout your program.  Also remember that you might be creating an
application that needs to display multiple character sets simultaneously
(for instance, both English and Japanese).

Now, while much of this will be OS specific, and doesn't belong in a
language, you at least need some way to deal with such things cleanly in
that language.  char_t and wchar_t do not provide specific sizes, but can be
implementation defined.

I'd say define the types.  char8 and char16; this allows char32 or char64
(or char12 for that matter - remember that some CPUs have non-standard word
sizes).

An alternative would be a syntax like char(8) or char(16), perhaps even a
simple "char" and a modifier like "unicode(16) char"

Finally, I might suggest doing away with char altogether and making the
entire language unicode.  On platforms that don't support it, provide a
seamless mapping mechanism to downconvert 16 bit chars to 8 bit.

"Jan Knepper" <jan smartsoft.cc> wrote in message
news:3B79CF33.94F71602 smartsoft.cc...
 I guess ascii makes more sence than char and unicode makes more
 sence than wchar or wchar_t...

 Walter wrote:

 What do people think about using the keyword:

     ascii or char?
     unicode or wchar?
Aug 14 2001
parent reply "Walter" <walter digitalmars.com> writes:
"Erik Funkenbusch" <erikf seahorsesoftware.com> wrote in message
news:9lcsqr$2s9p$1 digitaldaemon.com...
 Just some suggestions that come to mind, in no particular order or
 coherance:
 No, ascii makes little sense.  ascii refers explicitly to one character
set.
 There are many 8 bit character sets or locales or code pages or whatever
you
 want to call them.
Yes, I think it should just be called "char", and it will be an unsigned
8-bit type.
 Also, unicode can be 8 bit or 16 bit, and there is talk of a 32 bit as
well
 in the future.  I think any language that expects to stick around for any
 length of time needs to address the forward compatibility of new code
 sets.
32-bit wchar_t's are a reality on Linux now. I think it will work out best to
just make a wchar type that maps to whatever wchar_t is for the local native
C compiler.
 I'd much rather see a way to define your character type and use it
 throughout your program.  Also remember that you might be creating an
 application that needs to display multiple character sets simultaneously
 (for instance, both English and Japanese).
I've found I've wanted to support both ascii and unicode simultaneously in
programs, hence I thought two different types were appropriate. I was
constantly irritated by having to go through and either subtract or add L's
in front of the strings. The macros to do it automatically are ugly. Hence,
the idea that the string literals should be implicitly convertible to either
char[] or wchar[].

Next, there is the D typedef facility, which actually does introduce a new,
overloadable type. So, you could write:

    typedef char mychar;

or:

    typedef wchar mychar;

and through the magic of overloading <g> the rest of the code should not need
changing.
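A rough sketch of the idea (the print function and its two overloads are made
up purely for illustration; this shows the behavior being proposed, not what
any existing compiler does):

    void print(char[] s)  { /* narrow, 8-bit output */ }
    void print(wchar[] s) { /* wide, unicode output */ }

    typedef wchar mychar;          // flip this one line to: typedef char mychar;

    mychar[] greeting = "hello";   // literal converts to char[] or wchar[]
    print(greeting);               // overloading picks the matching version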
 Now, while much of this will be OS specific, and doesn't belong in a
 language, you at least need some way to deal with such things cleanly in
 that language.  char_t and wchar_t do not provide specific sizes, but can
be
 implementation defined.
 I'd say define the types.  char8 and char16, this allows char32 or char64
 (or char12 for that matter, remember that some CPU's have non-standard
word
 sizes).
 An alternative would be a syntax like char(8) or char(16), perhaps even a
 simple "char" and a modifier like "unicode(16) char"
 Finally, I might suggest doing away with char all together and making the
 entire language unicode.  On platforms that don't support it, provide a
 seamless mapping mechanism to downconvert 16 bit chars to 8 bit.
Java went the way of chucking ascii entirely. While that makes sense for a
web language, I think ascii is going to be around for a long time in systems
languages, so we might as well make it easy to deal with! Ascii is really
never going to be anything but an 8-bit type - it is unicode whose size
varies. Hence I think having a wchar type of varying size is the way to go.
Aug 15 2001
next sibling parent reply "Ivan Frohne" <frohne gci.net> writes:
There's something clean and neat about calling things
what they are.  Instead of larding up your code with

    typedef char ascii
    typedef wchar unicode

why not just use 'ascii' and 'unicode' in the first place?
Save the typedefs for

    typedef ascii ebcdic

Now, about that cast notation ....


--Ivan Frohne
Aug 15 2001
parent reply "Walter" <walter digitalmars.com> writes:
I suspect that ascii and unicode are trademarked names!

"Ivan Frohne" <frohne gci.net> wrote in message
news:9lf20l$11og$1 digitaldaemon.com...
 There's something clean and neat about calling things
 what they are.  Instead of larding up your code with

     typedef char ascii
     typedef wchar unicode

 why not just use 'ascii' and 'unicode' in the first place?
 Save the typedefs for

     typedef ascii ebcdic

 Now, about that cast notation ....


 --Ivan Frohne
Aug 15 2001
parent Jan Knepper <jan smartsoft.cc> writes:
<g>
I had not thought of that one!

Jan



Walter wrote:

 I suspect that ascii and unicode are trademarked names!
Aug 15 2001
prev sibling parent reply "Sheldon Simms" <sheldon semanticedge.com> writes:
In article <9levtq$10ji$1 digitaldaemon.com>, "Walter"
<walter digitalmars.com> wrote:

 I've found I've wanted to support both ascii and unicode simultaneously
 in programs, hence I thought two different types was appropriate. I was
 constantly irritated by having to go through and either subtract or add
 L's in front of the strings. The macros to do it automatically are ugly.
 Hence, the idea that the string literals should be implicitly
 convertible to either char[] or wchar[].
Well, it seems that you already have standard-sized integral types: byte,
short, int, long. Why not make char be a 2- or 4-byte unicode char and use
the syntax

    byte[] str = "My ASCII string";

for ascii?

--
Sheldon Simms / sheldon semanticedge.com
Aug 16 2001
parent reply "Walter" <walter digitalmars.com> writes:
Sheldon Simms wrote in message <9lgvsh$2jb7$1 digitaldaemon.com>...
Im Artikel <9levtq$10ji$1 digitaldaemon.com> schrieb "Walter"
<walter digitalmars.com>:

 I've found I've wanted to support both ascii and unicode simultaneously
 in programs, hence I thought two different types was appropriate. I was
 constantly irritated by having to go through and either subtract or add
 L's in front of the strings. The macros to do it automatically are ugly.
 Hence, the idea that the string literals should be implicitly
 convertible to either char[] or wchar[].
Well, it seems that you already have standard-sized integral types: byte,
short, int, long. Why not make char be a 2- or 4-byte unicode char and use
the syntax

    byte[] str = "My ASCII string";

for ascii?
It seems useful to be able to overload char and byte separately.
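For instance (dump is a made-up name, just to illustrate the kind of
overloading meant):

    void dump(byte b) { /* show the value as a number:    65  */ }
    void dump(char c) { /* show the value as a character: 'A' */ }

    dump(cast(byte) 65);   // picks the numeric overload
    dump('A');             // picks the character overload

If char were merely another name for byte, those two overloads could not
coexist.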
Aug 17 2001
parent reply c. keith ray <c._member pathlink.com> writes:
In article <9lih2u$10ca$2 digitaldaemon.com>, Walter says...
Sheldon Simms wrote in message <9lgvsh$2jb7$1 digitaldaemon.com>...
Im Artikel <9levtq$10ji$1 digitaldaemon.com> schrieb "Walter"
<walter digitalmars.com>:

 I've found I've wanted to support both ascii and unicode simultaneously
 in programs, hence I thought two different types was appropriate. I was
 constantly irritated by having to go through and either subtract or add
 L's in front of the strings. The macros to do it automatically are ugly.
 Hence, the idea that the string literals should be implicitly
 convertible to either char[] or wchar[].
Perhaps some consideration of an existing long-lived internationalized class
library would be appropriate... [Cocoa]

Representing strings as objects allows you to use strings wherever you use
other objects. It also provides the benefits of encapsulation, so that string
objects can use whatever encoding and storage [single-byte, multi-byte, or
unicode] is needed for efficiency while simply appearing as arrays of
characters. The class-cluster's two public classes, NSString and
NSMutableString, declare the programmatic interface for noneditable and
editable strings, respectively. Even though a string presents itself as an
array of Unicode characters (Unicode is a registered trademark of Unicode,
Inc.), its internal representation could be otherwise...

A class cluster is one public class whose visible 'constructors' (aka
'factory methods') instantiate appropriate hidden subclasses. So a
UnicodeString subclass, a JapaneseShiftJISString subclass, a
ChineseBigFiveString subclass, and an AsciiString subclass are hidden, but
their parent classes are visible. [I made up those names. Since they are
hidden, it doesn't matter how many subclasses of NSString and NSMutableString
there are - they all conform to the same public interface.]

I believe the Objective-C compiler translates "some string" into an NSString
(I'm not sure if the compiler supports unicode string constants yet).

---
C. Keith Ray
<http://homepage.mac.com/keithray/resume2.html>
<http://homepage.mac.com/keithray/xpminifaq.html>
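A very rough D-flavored sketch of the class-cluster shape described above
(all names and signatures are hypothetical and far simpler than NSString's
real interface):

    // One public, abstract class...
    abstract class String
    {
        // ...whose factory methods hand back hidden concrete subclasses.
        static String create(byte[] ascii)    { return new AsciiString(ascii); }
        static String create(wchar[] unicode) { return new UnicodeString(unicode); }

        abstract size_t length();
        abstract wchar  charAt(size_t i);   // always presents Unicode characters
    }

    // Hidden (module-private) storage strategies behind the same interface.
    private class AsciiString : String
    {
        private byte[] data;
        this(byte[] b) { data = b; }
        override size_t length()         { return data.length; }
        override wchar  charAt(size_t i) { return cast(wchar) data[i]; }
    }

    private class UnicodeString : String
    {
        private wchar[] data;
        this(wchar[] w) { data = w; }
        override size_t length()         { return data.length; }
        override wchar  charAt(size_t i) { return data[i]; }
    }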
Apr 29 2002
parent reply "Walter" <walter digitalmars.com> writes:
"c. keith ray" <c._member pathlink.com> wrote in message
news:aak0n0$14gi$1 digitaldaemon.com...
<http://developer.apple.com/techpubs/macosx/Cocoa/Reference/Foundation/ObjC_


It's a great idea, but it appears to be copyrighted by Apple.
Apr 29 2002
parent reply Keith Ray <k1e2i3t4h5r6a7y 1m2a3c4.5c6o7m> writes:
In article <aal1h7$23cc$1 digitaldaemon.com>,
 "Walter" <walter digitalmars.com> wrote:

 "c. keith ray" <c._member pathlink.com> wrote in message
 news:aak0n0$14gi$1 digitaldaemon.com...
 <http://developer.apple.com/techpubs/macosx/Cocoa/Reference/Foundation/ObjC_

 
 It's a great idea, but it appears to be copyrighted by Apple.
See also: <http://www.gnustep.org/>

The Objective-C version of Apple's Foundation library (which defines the
String classes) is not open-source. The C version is open-source and has
equivalent functionality.

Apple's open-source license is at:
<http://www.opensource.apple.com/apsl/>

The C version of the Foundation library is at:
<http://www.opensource.apple.com/projects/darwin/1.4/projects.html>
Look for "CoreFoundation 226-14.1 Core Foundation tool kit".

--
C. Keith Ray
<http://homepage.mac.com/keithray/xpminifaq.html>
Apr 30 2002
parent reply "Walter" <walter digitalmars.com> writes:
Ok, Apple's open source license looks like it can be used. Do you want to
take the lead in converting it to D?

"Keith Ray" <k1e2i3t4h5r6a7y 1m2a3c4.5c6o7m> wrote in message
news:k1e2i3t4h5r6a7y-9BED6F.07530430042002 digitalmars.com...
 In article <aal1h7$23cc$1 digitaldaemon.com>,
  "Walter" <walter digitalmars.com> wrote:

 "c. keith ray" <c._member pathlink.com> wrote in message
 news:aak0n0$14gi$1 digitaldaemon.com...
<http://developer.apple.com/techpubs/macosx/Cocoa/Reference/Foundation/ObjC_


 It's a great idea, but it appears to be copyrighted by Apple.
See also: <http://www.gnustep.org/>

The Objective-C version of Apple's Foundation library (which defines the
String classes) is not open-source. The C version is open-source and has
equivalent functionality.

Apple's open-source license is at:
<http://www.opensource.apple.com/apsl/>

The C version of the Foundation library is at:
<http://www.opensource.apple.com/projects/darwin/1.4/projects.html>
Look for "CoreFoundation 226-14.1 Core Foundation tool kit".

--
C. Keith Ray
<http://homepage.mac.com/keithray/xpminifaq.html>
Apr 30 2002
parent reply Keith Ray <k1e2i3t4h5r6a7y 1m2a3c4.5c6o7m> writes:
In article <aamfkp$2p6o$1 digitaldaemon.com>,
 "Walter" <walter digitalmars.com> wrote:

 Ok, Apple's open source license looks like it can be used. Do you want to
 take the lead in converting it to D?
... in my extensive free time? I wish I did have time for that...

I have the desire to implement an OO language very similar to Smalltalk
[objects all the way down] but with syntax more like JavaScript or Java
without type declarations, using techniques from Threaded Interpreted
Languages (kind of like Forth or PostScript). I do plan to look at D in more
detail real soon now.

PS: I'm a Macintosh user by choice (I spend most of my day-job time
programming on Windows), so I can't use your D compiler yet.

--
C. Keith Ray
<http://homepage.mac.com/keithray/xpminifaq.html>
Apr 30 2002
parent "Walter" <walter digitalmars.com> writes:
"Keith Ray" <k1e2i3t4h5r6a7y 1m2a3c4.5c6o7m> wrote in message
news:k1e2i3t4h5r6a7y-35A6F7.20384830042002 digitalmars.com...
 PS: I'm a Macintosh user by choice (I spend most of day-job time
 programming on Windows), so I can't use your D compiler yet.
If you want, you can also do a Mac port starting with the GNU compiler
sources for the Mac.
May 03 2002
prev sibling next sibling parent weingart cs.ualberta.ca (Tobias Weingartner) writes:
In article <9lchvd$2miu$1 digitaldaemon.com>, Walter wrote:
 What do people think about using the keyword:
 
     ascii or char?
     unicode or wchar?
Ascii makes little sense. In most cases where it is used (other than for
strings), it is to get a "byte". Since you have a byte type, char is sort of
redundant.

IMHO it would be better to extend the string type (unicode, etc.) to be able
to specify a restricted subset. Unicode would be the superset (for strings,
and the default if not constrained), and some other things
(unicode.byte[10] string_of_10_byte_sized_positions) for restricting the type
of "string" you have.

--
Tobias Weingartner | Unix Guru, Admin, Systems-Dude
Apt B 7707-110 St. | http://www.tepid.org/~weingart/
Edmonton, AB       |-------------------------------------------------
Canada, T6G 1G3    | %SYSTEM-F-ANARCHISM, The OS has been overthrown
Aug 16 2001
prev sibling next sibling parent reply Jeff Frohwein <"jeff " SPAMLESSdevrs.com> writes:
Walter wrote:
 
 What do people think about using the keyword:
 
     ascii or char?
     unicode or wchar?
I personally think C might have started a bad habit by using types that were
generally vague in nature. All I ask is that simplicity be given impartial
consideration.

Since we are all used to seeing types such as short, long, and int in code,
perhaps it would be better for all of us to spend some time thinking about
the following types rather than form an immediate opinion. I can easily
identify with the fact that any unfamiliar-looking types can look highly
offensive to the newly or barely acquainted, as they did to me at one time:

    u8, s8, u16, s16, u32, s32, ...

Some will be adamantly opposed because they don't use these, or know anyone
that does. SGI, for one, has used these types for Nintendo 64 development and
now Nintendo is using them for GameBoy Advance development. There are
probably others...

As 128-bit and 256-bit systems are released, adding new types would be as
easy as u128, s128, u256, s256... rather than having to consider something
like "long long long long", or a new name in general. Those that want to use
vague types can always typedef their own.

Thanks for listening, :)

Jeff
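A tiny sketch of how the two styles could coexist (the names below are only
an illustration, not a worked-out proposal):

    u8  flags = 0xFF;        // unsigned 8-bit
    s16 delta = -32768;      // signed 16-bit
    u32 mask  = 0xFFFFFFFF;  // unsigned 32-bit

    typedef u8  uchar;       // the "vague" familiar names, layered on top
    typedef s32 plainint;    // for those who prefer them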
Aug 16 2001
next sibling parent Charles Hixson <charleshixsn earthlink.net> writes:
Jeff Frohwein wrote:
 Walter wrote:
 
What do people think about using the keyword:

    ascii or char?
    unicode or wchar?
... u8,s8,u16,s16,u32,s32,... ... As 128 bit and 256 bit systems are released, adding new types would be as easy as u128,s128,u256,s256... rather than have to consider something like "long long long long", or a new name in general. Those that want to use vague types can always typedef their own types. Thanks for listening, :) Jeff
That's a good idea. These could be the basic language-defined types, and then
a "standard library" could include typedefs for the types people are more
familiar with. This would allow code to be written that could either easily
adapt to changing word sizes, be fixed for particular sizes, or both - and
still be fairly portable.
Aug 17 2001
prev sibling parent "Walter" <walter digitalmars.com> writes:
Jeff Frohwein <"jeff "  SPAMLESSdevrs.com>
 Thanks for listening, :)
Oh, I am reading all of this stuff. It's a lot of fun, and people have great
ideas. I'm a little surprised at the sheer volume of replies and comments!

-Walter
Aug 17 2001
prev sibling parent reply Russell Bornschlegel <kaleja estarcion.com> writes:
Walter wrote:
 
 What do people think about using the keyword:
 
     ascii or char?
     unicode or wchar?
 
My votes would be "char" and "unicode". Erik makes a good case against
"ascii". "wchar" and "wchar_t" are ugly C-committeeisms, IMO.

-Russell B
Aug 16 2001
parent reply "Walter" <walter digitalmars.com> writes:
Oh, I hate the "_t" suffix too. I'd love to name it unicode, but since there
is a Unicode, Inc., I don't think I can.


Russell Bornschlegel wrote in message <3B7C4455.ADFB4496 estarcion.com>...
Walter wrote:
 What do people think about using the keyword:

     ascii or char?
     unicode or wchar?
My votes would be "char" and "unicode". Erik makes a good case against "ascii". "wchar" and "wchar_t" are ugly C-committeeisms, IMO. -Russell B
Aug 17 2001
parent reply "Walter" <walter digitalmars.com> writes:
Walter wrote in message <9lk4ij$2d7a$2 digitaldaemon.com>...
Oh, I hate the "_t" suffix too. I'd love to name it unicode, but since
there
is a Unicode, Inc., I don't think I can.
I checked. Unicode is a registered trademark of Unicode, Inc. They
specifically say that "unicode" can't be included in a product. Oh well.

I guess that's why the ANSI committee picked "wchar_t". Looks like "wchar" is
what D will use.

-Walter
Aug 17 2001
next sibling parent reply "Kent Sandvik" <sandvik excitehome.net> writes:
"Walter" <walter digitalmars.com> wrote in message
news:9lk4vh$2dj3$1 digitaldaemon.com...

 I checked. Unicode is a registered trademark of Unicode, Inc. They
 specifically say that "unicode" can't be included in a product. Oh well.

 I guess that's why the ANSI committee picked "wchar_t".
XML uses UTF, so you could think about using 'utf' as one possible keyword.

--Kent
Aug 17 2001
parent reply Russ Lewis <russ deming-os.org> writes:
Kent Sandvik wrote:

 "Walter" <walter digitalmars.com> wrote in message
 news:9lk4vh$2dj3$1 digitaldaemon.com...

 I checked. Unicode is a registered trademark of Unicode, Inc. They
 specifically say that "unicode" can't be included in a product. Oh well.

 I guess that's why the ANSI committee picked "wchar_t".
XML uses UTF, so you could think about using 'utf' as one possible keyword. --Kent
Any clarification on what UTF might mean? It's not necessarily obvious.
Neither is wchar... but it's closer.
Aug 17 2001
parent "Kent Sandvik" <sandvik excitehome.net> writes:
Google is our friend. UTF, or actually UTF-8, is one encoding scheme; it
stands for UCS Transformation Format, and UCS - Universal Character Set - is
more in line with the Unicode definition. Anyway, if those buzzwords are too
unfamiliar, then wchar_t maybe is the way to go.

--Kent
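To make "encoding scheme" concrete, here is how a few characters come out in
UTF-8 (these are the standard encodings, shown only to illustrate the
variable length):

    'A'   U+0041  ->  41         (1 byte, identical to ASCII)
    'é'   U+00E9  ->  C3 A9      (2 bytes)
    '€'   U+20AC  ->  E2 82 AC   (3 bytes)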



"Russ Lewis" <russ deming-os.org> wrote in message
news:3B7D9CA1.6A25385C deming-os.org...
 Kent Sandvik wrote:

 "Walter" <walter digitalmars.com> wrote in message
 news:9lk4vh$2dj3$1 digitaldaemon.com...

 I checked. Unicode is a registered trademark of Unicode, Inc. They
 specifically say that "unicode" can't be included in a product. Oh
well.
 I guess that's why the ANSI committee picked "wchar_t".
XML uses UTF, so you could think about using 'utf' as one possible keyword. --Kent
Any clarification what UTF might mean? It's not necessarily obvious.
Neither
 is wchar...but it's closer.
Aug 17 2001
prev sibling parent reply weingart cs.ualberta.ca (Tobias Weingartner) writes:
In article <9lk4vh$2dj3$1 digitaldaemon.com>, Walter wrote:
 
 I checked. Unicode is a registered trademark of Unicode, Inc. They
 specifically say that "unicode" can't be included in a product. Oh well.
 I guess that's why the ANSI committee picked "wchar_t".
 
 Looks like "wchar" is what D will use.
Please don't. I say, make form follow function. wchar is a throwback to some
weird ansi'ism, having "wide char's". That's stupid.

If you want to have D handle strings natively, *and* you want it to be some
sort of internationalized version of a string, make it be a string, or even a
"char", or "character". Make it sufficiently different from C, such that
people will know. For 1-byte things, use the type "byte".

Say what you mean, mean what you say. wchar? If you use UTF, it could be
vchar (variable length), etc...

--Toby.
Aug 20 2001
parent reply "Walter" <walter digitalmars.com> writes:
Tobias Weingartner wrote in message ...
For 1-byte things, use the type "byte".
Say what you mean, mean what you say.  wchar?  If you use UTF, it could
be vchar (variable length), etc...
I frequently want to overload characters differently than bytes, so using "byte" for ascii doesn't work well for me.
Aug 20 2001
parent weingart cs.ualberta.ca (Tobias Weingartner) writes:
In article <9lsc7e$1di3$2 digitaldaemon.com>, Walter wrote:
 
 Tobias Weingartner wrote in message ...
For 1-byte things, use the type "byte".
Say what you mean, mean what you say.  wchar?  If you use UTF, it could
be vchar (variable length), etc...
I frequently want to overload characters differently than bytes, so using "byte" for ascii doesn't work well for me.
That's exactly what I'm saying. For characters, use the character type. An
array of these could be a string. Could be that the base library (or the
language if necessary) could define a string class as well (index entries are
of type character).

What I'm saying is that wchar is a bad name. They are not "wide" chars; what
you really want is a "character". So name it as such. A char can be anything,
even variable length (UTF-8 for example). If you need byte-sized quantities
in your program, use "byte". If you need a character (possibly byte, word,
qword, or variable length), use character.

--Toby.
Aug 22 2001