
digitalmars.D - Re: Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 
 Generally Linux systems use UTF-8 so I guess the "system
 encoding" there will be UTF-8. But then if you start to use
 QT you have to use UTF-16, but you might have to intermix
 UTF-8 to work with other libraries in the backend (libraries
 which are not necessarily D libraries, nor system
 libraries). So you may have a UTF-8 backend (such as the
 MySQL library), UTF-8 "system encoding" glue code, and
 UTF-16 GUI code (QT). That might be a good or a bad choice,
 depending on various factors, such as whether the glue code
 sends more strings to the backend or the GUI.
 
 Now try to port the thing to Windows where you define the
 "system encoding" as UTF-16. Now you still have the same
 UTF-8 backend, and the same UTF-16 GUI code, but for some
 reason you're changing the glue code in the middle to
 UTF-16? Sure, it can be made to work, but all the string
 conversions will start to happen elsewhere, which may change
 the performance characteristics and add some potential for
 bugs, and all this for no real reason.
 
 The problem is that what you call "system encoding" is only
 the encoding used by the system frameworks. It is relevant
 when working with the system frameworks, but when you're
 working with any other API, you'll probably want to use the
 same character type as that API does, not necessarily the
 "system encoding". Not all programs are based on extensive
 use of the system frameworks. In some situations you'll want
 to use UTF-16 on Linux, or UTF-8 on Windows, because you're
 dealing with libraries that expect that (QT, MySQL).
 

Agreed, the system encoding is not always clear-cut. Still, UTF-8 is usually the common choice on Linux (consider also Gtk, wxWidgets, system calls, etc.), while UTF-16 is more common on Windows (consider win32api, DFL, system calls, etc.). Some programs written in C even define their own 'tchar' so that they can be compiled differently depending on the platform.
 A compiler switch is a poor choice there, because you can't
 mix libraries compiled with different compiler switches
 when that switch changes the default character type.

A compiler switch is only necessary for the system programmer. For instance, gcc has '-fshort-wchar', which changes the width of wchar_t to 16 bits. It DOES break code, too, because libraries are normally compiled with a 32-bit wchar_t. Again, it's generally not meant for the application programmer.
 
 In most cases, it's much better in my opinion if the
 programmer just uses the same character type as one of the
 libraries it uses, stick to that, and is aware of what he's
 doing. If someone really wants to deal with the complexity of

The programmer generally should not need to know which encoding he works with. For both UTF-8 and UTF-16, it's easy to determine the number of bytes (words) in a multi-byte (multi-word) sequence just by looking at the first code unit. This could even be a built-in function (e.g. numberOfChars(tchar firstChar)). The size of each element can easily be determined with sizeof. Conversion to UTF-32 and back can be done very transparently. The only problem it might cause is bindings with other libraries (but in that case you can just use fromUTFxx and toUTFxx; you do this conversion anyway). The same goes for transferring data over the network: you can just stick to one particular encoding (for network and files, UTF-8 is better since it's byte-order free).
 supporting both character types depending on the environment
 it runs on, it's easy to create a "tchar" and "tstring"
 alias that depends on whether it's Windows or Linux, or on a
 custom version flag from a compiler switch, but that'll be
 his choice and his responsibility to make everything work.

If it's left to the programmer's choice, then almost all advantages of tchar are lost. It's like the garbage collector: if it is used by everybody, you can expect the benefits of using it. However, if it's optional, everybody will write libraries assuming no GC is available, and thus almost all of its advantages are lost. And after all, one of the goals of D (if I am not wrong) is to be flexible, so that performance gains are available for particular configurations where they can be achieved (it's a fully compiled language). It does not stick to something particular and say 'you must use UTF-8' or 'you must use UTF-16'.
 michel.fortin michelf.com
 http://michelf.com/
 
 

Jun 08 2010
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
please stop top-posting - just click on the post you want to reply to and
then click reply - you're flooding the newsgroup root with replies ...

Am 08.06.2010 17:11, schrieb Ruslan Nikolaev:
  Generally Linux systems use UTF-8 so I guess the "system
  encoding" there will be UTF-8. [...]


Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"dennis luehring" <dl.soluz gmx.net> wrote in message 
news:hulqni$1ssj$1 digitalmars.com...
 please stop top-posting - just click on the post you want to reply to and
 then click reply - you're flooding the newsgroup root with replies ...

 Am 08.06.2010 17:11, schrieb Ruslan Nikolaev:
  Generally Linux systems use UTF-8 so I guess the "system
  encoding" there will be UTF-8. But then if you start to use



Speaking of top-posting... ;)
Jun 08 2010
prev sibling parent "Yao G." <nospamyao gmail.com> writes:
Every time you reply to somebody, a new message is created. It's kinda
difficult to follow this discussion when you need to look at more than 15
separate messages about the same issue. Please check your news client or
something.

Yao G.

On Tue, 08 Jun 2010 10:11:34 -0500, Ruslan Nikolaev  
<nruslan_devel yahoo.com> wrote:

 Generally Linux systems use UTF-8 so I guess the "system
 encoding" there will be UTF-8. [...]

-- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Jun 08 2010