
D - Types and sizes

reply Cem Karan <cfkaran2 eos.ncsu.edu> writes:
I've been going over the discussions on what kind of character support
D should have: should it be Unicode, ASCII, etc.  and it just struck me
that there are a series of fundamental problems that are exposed by
this train of thought.  I'll address my thoughts for types first, and
then argue that you should ditch the char type completely.

First off, we have too many types that are not orthogonal in C: the
byte, short, int, and long (and char, although that is a hack in my
mind) along with float and double.  When specifying a variable, you
should specify the kind that it is, the amount of storage that it
needs, and if it is signed or not.  E.g.:
   
   unsigned 8 int foo;
   12 float bar;

where the leading numerals are the number of bytes of storage.  If you
want to be really specific, make it the number of bits; that way, you
never run into the problem that the concept of byte means different
things on different machines.

With this scheme, there are only 2 types: integers and floats.  bytes
are (unsigned 1 int) and if you really want to use 'byte' instead,
typedef it.

As for the char, there are a LOT of problems with it, and arguing over
the idea of Unicode or something else won't help much.

1) Not all character encodings are a uniform size. UTF-8 can have
anywhere from 1 to 6 bytes necessary for encoding characters.

2) Character encodings that do have a uniform size often are
incomplete.  ISO 10646 (also known as UCS-4) has a large number of
character planes that are deliberately left unencoded.  Unicode has
ranges that are also undefined.  This means that even though you created
a Unicode file, it may not appear the same to two different sets of
users on two different machines.

3) Character sorting is a MAJOR headache.  Read the specs for UCS-4 at
http://anubis.dkuug.dk/JTC1/SC22/WG20/docs/projects#14651 for a better
idea of what I'm talking about.   Here is the problem in a nutshell: in
certain scripts, characters are combined when they are displayed.  At
the same time, there is another single character that when rendered
looks exactly the same as the two other characters that were combined. 
Because of their locations in the code tables, you can't just cast them
to ints and compare them for sorting.  To a human, they have the same
meaning and should therefore be sorted next to each other. In addition,
how do you compare two strings that are from different scripts
(Japanese and English for example)?

So this brings me to my point.  Ditch the idea of the char completely. 
Make it a compiler supplied class instead.  This will allow you to
support multiple types of encodings.  I know that this might sound like
a dumb idea, but try looking at the complete specifications for some of
these encodings, and you'll see why I'm saying this.  Also, you need to
think about the number of encodings that have been invented so far. 
IBM has a package called International Components for Unicode which
just does transcoding from one encoding to another.  Currently it
handles 150 different encoding types. 

The problem is that people are used to the idea of a char.  The trick
is to make it possible to use a shorthand notation for 0 argument and
single argument constructors.

   Unicode mychar;
         would be equivalent to:
   Unicode mychar = new Unicode();

   Unicode yourchar = 'c';
         would be equivalent to:
   Unicode yourchar = new Unicode('c');

   Unicode theirchar = "\U0036";
         would be equivalent to:
   Unicode theirchar = new Unicode("\U0036");
         which would allow you to use characters that your own system
can't handle.

References:

http://www.unicode.org

The link below goes to UCS-4, which is a superset of Unicode.  It is 32
bits in size.
http://anubis.dkuug.dk/JTC1/SC2/WG2/

The link below goes to the ISO working group on internationalization of
programming languages, environments, and subsystems.
http://anubis.dkuug.dk/jtc1/sc22/

READ THIS.  They have spent quite a bit of time on how Strings and such
should be handled, and this is what they've come up with.  Even if you
don't support it directly, it is necessary.
http://anubis.dkuug.dk/JTC1/SC22/WG20/docs/projects#15435

-- 
Cem Karan

"Why would I want to conquer the world?!  Then I'd be expected to solve its
problems!"
Aug 16 2001
next sibling parent "Kent Sandvik" <sandvik excitehome.net> writes:
"Cem Karan" <cfkaran2 eos.ncsu.edu> wrote in message
news:160820012037366028%cfkaran2 eos.ncsu.edu...

 So this brings me to my point.  Ditch the idea of the char completely.
 Make it a compiler supplied class instead.  This will allow you to
 support multiple types of encodings.  I know that this might sound like
 a dumb idea, but try looking at the complete specifications for some of
 these encodings, and you'll see why I'm saying this.  Also, you need to
 think about the number of encodings that have been invented so far.
 IBM has a package called International Components for Unicode which
 just does transcoding from one encoding to another.  Currently it
 handles 150 different encoding types.

Yes, this is a good point.  In some markets, for example in Japan, you
need to worry about multiple character encodings.  In a project I was
involved in, we converted everything from whatever input format into
Java (UTF-8), stored the data as XML (UTF-8), and when rendering the
text back through the browser, converted it into Shift-JIS using the
Java libraries.  This way it's much easier: you use one single canonical
string representation in the actual runtime environment, do the actual
operations, and then let various translation libraries handle the
output.  Unicode UTF-8 would be one natural internal format. --Kent
Aug 16 2001
prev sibling next sibling parent "Walter" <walter digitalmars.com> writes:
I understand it is a complicated subject, and you've well explained why. But
I'm not willing to give up on ascii char's yet, they're too useful! -Walter

"Cem Karan" <cfkaran2 eos.ncsu.edu> wrote in message
news:160820012037366028%cfkaran2 eos.ncsu.edu...
 -- snip --

Aug 16 2001
prev sibling next sibling parent reply "smilechaser" <smilechaser SPAMGUARDyahoo.com> writes:
"Cem Karan" <cfkaran2 eos.ncsu.edu> wrote in message
news:160820012037366028%cfkaran2 eos.ncsu.edu...
 I've been going over the discussions on what kind of character support
 D should have: should it be Unicode, ASCII, etc.  and it just struck me
 that there are a series of fundamental problems that are exposed by
 this train of thought.  I'll address my thoughts for types first, and
 then argue that you should ditch the char type completely.

 First off, we have too many types that are not orthogonal in C: the
 byte, short, int, and long (and char, although that is a hack in my
 mind) along with float and double.  When specifying a variable, you
 should specify the kind that it is, the amount of storage that it
 needs, and if it is signed or not.  E.g.:

    unsigned 8 int foo;
    12 float bar;

-- snip --

I agree with your treatment of types (int and float).  In my experience
the fact that different platforms define "int" and "long" differently
leads to confusion.  You also end up with absurd size qualifiers grafted
onto the base types when the platform "gets bigger" (i.e. 8088 -> 80x86
-> Intel's 64-bit architecture), so that you end up with a "long long
long int" or some other such nonsense.

One solution is to do something like Java did and support only the
lowest common denominator - the byte.  If a larger type is required it
is "built" from the basic building block (the byte).  The advantage here
is that you can write code that seamlessly uses variables bigger than
what the host machine supports.

Now, the major problem with this is that it doesn't make good use of the
host hardware.  If I have a machine that can process 64-bit words at a
time, the previous solution is still going to be banging around byte
values to emulate a 64-bit number.  My solution to this is:

   1) Declare two base types (float and int) that can have a variable
size.

   2) This size would always be expressed in bytes, and would have a
maximum value (say 16).

   3) The compiler would treat each datatype as a "plugin".  If the data
type was supported natively (i.e. a 32-bit int on a 32-bit machine) the
plugin used would be the "native int32" plugin.  If it wasn't, the
"emulated int32" plugin would be used.

This ensures that programs will compile/run on just about any platform.
Also, previously written code could take advantage of hardware advances
just by recompiling it (assuming of course that the compiler author
would create a native plugin for the new data types).

Now the use of this technology for only floats and ints may seem like
overkill, but think of extending it to other data types such as the
matrix or vector...
Aug 16 2001
parent Cem Karan <cfkaran2 eos.ncsu.edu> writes:
In article <9li2tp$jfr$1 digitaldaemon.com>, smilechaser
<smilechaser SPAMGUARDyahoo.com> wrote:

 >>SNIP<<

2) This size would always be expressed in bytes, and would have a maximum value (say 16).

No maximums please. I do a fair amount of high precision scientific computing, and there isn't anything quite as irritating as limitations on precision.
     3) The compiler would treat each datatype as a "plugin". If the data
 type was supported natively (i.e. a 32-bit int on a 32-bit machine) the
 plugin used would be the "native int32" plugin. If it wasn't the "emulated
 int32" plugin would be used.
 
 This ensures that programs will compile/run on just about any platform. Also
 previously written code could take advantage of hardware advances just by
 recompiling them (assuming of course that the compiler author would create a
 native plugin for the new data types).
 
 Now the use of this technology for only floats and ints may seem like
 overkill, but think of extending it to other data types such as the matrix
 or vector...

I LIKE this idea!  It actually allows you to have arbitrary precision
integers and floats on any hardware.  If you decide that you really,
really need 256 byte integers, you can do it, and the compiler
transparently supports it.

There is only one possible problem that this might cause: the
definition of what a 'byte' is has not always been the same on all
machines (most modern machines consider a byte to be 8 bits, but there
have been cases in the past where this wasn't so).  I would like to
suggest that you define a byte to be 8 bits, no matter what the
underlying hardware says it is.  That way the developer isn't bitten by
a 'bug' when doing a simple recompile...
Aug 17 2001
prev sibling parent reply Cem Karan <cfkaran2 eos.ncsu.edu> writes:
Here's a few more thoughts to extend what I was talking about.  Include
the concept of range for your variables.  This means that when a
compiler is running, it can perform greater logic error checking, and
it can perform some fairly unique optimizations.  First, the error
checking.

Have any of you guys had the problem of adding '1' one too many times
to a variable?  Like in a loop where you had something like this:

   for (short i = 0 ; i < q; i++)
      // do stuff

What if q is greater than 32767?  This loop won't end, because i wraps
around once it passes the largest short value, and because the compiler
doesn't have a concept of range for q, it won't flag the possible error.

Or what about divide by 0 errors?  

   for (int i = 0; i < 300; i++)
      for (int j = a; j < b; j++)
         k = i/j;

As long as a * b > 0 (so the range [a, b) cannot contain zero), this
will work, but if that isn't true at any point in time, then you divide
by zero (a hardware fault for integer division; Inf only for floats).
This is probably not what we wanted.  So how to solve this?  It
requires two parts.  First, you need
to identify all of the use-def chains in your program, and trace back
to those variables that are not defined within the program (user input
for example)  and define ranges for those variables.  The compiler will
have to identify those variables that are unspecified, and then the
programmer will have to specify them.  Once this is done, the compiler
can check for logic errors that are possible.  (This won't guarantee
that an error will occur, it only states that it is possible to have an
error occur.  Also, this technique won't catch all logic errors, it
just makes the program more robust.)

This also allows an optimisation that isn't currently possible; you can
tell the compiler to reduce the size of variables to the minimum
necessary.  If the ranges are known, then you can do another trick; you
can pack multiple variables into vectors.  (Anti-flame alert.  I know
that this optimization is not always useful, and may even be virtually
impossible to prove in certain cases.  My example only shows part of
the checking that would be necessary for this case to work.  The point
is that this is an optimization that cannot be done currently, that
could be done if you knew the ranges)

For example, lets say that you have a long string of 7 bit ASCII
characters.  You know that they are all lowercase characters and you
want to uppercase them.  The code that you use is:

   for (char* temp = charArrayPointer;
        temp < charArrayPointer + charArraySize; temp++)
      *temp -= 32;

If the compiler knows nothing about the range of values of each of the
chars in the array, then it must treat each one as having the
possibility of becoming negative in value, and that would require all 8
bits in a char to hold.  That means that it cannot treat the chars as a
vector of bytes and operate on all of them at the same time.  On a 32
bit machine, that means that each byte gets promoted to an integer,
worked on, then demoted to a byte.  (Promoting to a short and operating
in pairs won't work; there's too much overhead in splitting the shorts
up and then recombining them.)

On the other hand, if the compiler determines that all values in the
array are bounded to the range [97, 122] decimal, then it knows that
the most significant bit is immaterial.  That means that it can
concatenate 4 bytes together into an integer, concatenate 4 bytes that
each contain the number 32 together, and then subtract the latter from
the former.  It will have done 4 operations in a single clock (or 8 if
we're talking about a 64 bit machine).  It can then do the next 4
operations and so on until it runs out of data.  And since the integer
is 'packed', that means that you don't have to do some sort of demotion
to bytes; just replace those 4 bytes and keep on rolling.  (And you
don't have to point out to me that this only works on integer-aligned
arrays that are a multiple of 4 in size; again, this would require
range checking to know for sure if it is possible.)

I know that this idea won't hold in a number of cases, especially those
where the ranges truly aren't known.  But it might help in other cases.
Aug 17 2001
parent "Walter" <walter digitalmars.com> writes:
The idea of ranges as an extension is a great one, for exactly the reasons
you mention. -Walter

Cem Karan wrote in message <170820011658229482%cfkaran2 eos.ncsu.edu>...
 -- snip --

Aug 25 2001