D - UTF-8

ET_yoza <ET_yoza_member pathlink.com> writes:

Dear Mr. Walter,

I am a D language user in Japan and am testing it. I have Windows 2000
Japanese Edition installed on my development PC, along with the D
compiler.
I found a problem while using D, concerning the encoding of multi-byte
character sequences.
I know that D is a Unicode-oriented language, so I wrote my source code
in Unicode.
The API of my OS requires Shift_JIS as the encoding of character sequences.
(The MBCS that the A versions of the Windows API require is a
country-specific encoding that is not compatible with UTF-8.)
I expected D to convert the encoding implicitly, but D does not seem to
perform the conversion when calling the API.
As a result, in order to display Japanese, the source code cannot be
written in Unicode.
If the source code is written in Shift_JIS, the short-term purpose will
be achieved. However, that is contrary to the specification. Moreover, a
vicious problem from C will recur, since encodings other than Unicode
have characters that contain an escape character:

"undefined escape sequence \?"

I would like to write code in Unicode as the specification says, so
please solve this problem by converting the encoding when Phobos calls
the API.

Dec 25 2003

"Walter" <walter digitalmars.com> writes:

D currently can handle UTF-8 (unicode) strings, but it does not handle
shift-JIS strings. To make shift-JIS work in your programs, you'll need to
write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
shift-JIS on output. I intend to do this for all code pages, but have not
written those filters yet.

"ET_yoza" <ET_yoza_member pathlink.com> wrote in message
news:bseb83$2qvq$1 digitaldaemon.com...
 Dear Mr. Walter,

 I am a D language user in Japan and am testing it. I have Windows 2000
 Japanese Edition installed on my development PC, along with the D
 compiler.
 I found a problem while using D, concerning the encoding of multi-byte
 character sequences.
 I know that D is a Unicode-oriented language, so I wrote my source code
 in Unicode.
 The API of my OS requires Shift_JIS as the encoding of character sequences.
 (The MBCS that the A versions of the Windows API require is a
 country-specific encoding that is not compatible with UTF-8.)
 I expected D to convert the encoding implicitly, but D does not seem to
 perform the conversion when calling the API.
 As a result, in order to display Japanese, the source code cannot be
 written in Unicode.
 If the source code is written in Shift_JIS, the short-term purpose will
 be achieved. However, that is contrary to the specification. Moreover, a
 vicious problem from C will recur, since encodings other than Unicode
 have characters that contain an escape character:

 "undefined escape sequence \?"

 I would like to write code in Unicode as the specification says, so
 please solve this problem by converting the encoding when Phobos calls
 the API.

Dec 26 2003

ET_yoza <ET_yoza_member pathlink.com> writes:

Thank you. I understand.

In article <bsgrfu$rpp$1 digitaldaemon.com>, Walter says...
D currently can handle UTF-8 (unicode) strings, but it does not handle
shift-JIS strings. To make shift-JIS work in your programs, you'll need to
write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
shift-JIS on output. I intend to do this for all code pages, but have not
written those filters yet.

Dec 26 2003

Hauke Duden <H.NS.Duden gmx.net> writes:

Walter wrote:
 D currently can handle UTF-8 (unicode) strings, but it does not handle
 shift-JIS strings. To make shift-JIS work in your programs, you'll need to
 write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
 shift-JIS on output. I intend to do this for all code pages, but have not
 written those filters yet.

I don't think you have to. The C runtime library provides functions to 
convert from wide char to the local code page and vice versa. We can use 
those for conversions of this kind.

I know I'm repeating the same stuff over and over, but maybe this real 
world example has shifted your position somewhat. I REALLY think UTF-8 
strings should not use the "char" type. The CRT expects chars to be 
encoded in the local code page, so this will lead to all kinds of 
confusion when you mix C functions with D functions. The latter expects 
UTF-8, the former the local code page, but both use the same type. 
Actually, if you get right down to the definition, they use different 
types with the same name and none of the type-safety features you expect 
from a typed programming language!

It would be a lot easier if the types had different names.

Hauke

Dec 26 2003
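
The suggestion above, that local-code-page text and UTF-8 text should carry different type names, can be sketched in present-day D with two small wrapper structs. This is only an illustration under stated assumptions: `LocalText`, `Utf8Text`, and `passToC` are hypothetical names invented for the sketch, not Phobos types or anything proposed in the thread, and the D dialect of 2003 differs in detail.

```d
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical wrapper: bytes encoded in the local code page (what the CRT expects).
struct LocalText
{
    immutable(ubyte)[] bytes;
}

/// Hypothetical wrapper: UTF-8 text (what D functions expect).
struct Utf8Text
{
    string text;
}

/// Stand-in for a C function that takes local-code-page text.
/// It only reports the byte count so the sketch stays portable.
size_t passToC(LocalText s)
{
    return s.bytes.length;
}

void main()
{
    auto local = LocalText("abc".representation);
    auto utf8 = Utf8Text("abc");

    // Accepted: the argument type matches what the function declares.
    writeln(passToC(local));

    // Rejected at compile time: UTF-8 text is not local-code-page text.
    static assert(!__traits(compiles, passToC(utf8)));
}
```

With distinct types, mixing the two encodings becomes a compile error rather than a run-time display problem, which is the type-safety argument made in the post.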

"Ben Hinkle" <bhinkle4 juno.com> writes:

oh boy, more unicode! ;)

"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:bshas9$1kgh$1 digitaldaemon.com...
 Walter wrote:
 D currently can handle UTF-8 (unicode) strings, but it does not handle
 shift-JIS strings. To make shift-JIS work in your programs, you'll need to
 write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
 shift-JIS on output. I intend to do this for all code pages, but have not
 written those filters yet.

 I don't think you have to. The C runtime library provides functions to
 convert from wide char to the local code page and vice versa. We can use
 those for conversions of this kind.

 I know I'm repeating the same stuff over and over, but maybe this real
 world example has shifted your position somewhat. I REALLY think UTF-8
 strings should not use the "char" type. The CRT expects chars to be
 encoded in the local code page, so this will lead to all kinds of
 confusion when you mix C functions with D functions. The latter expects
 UTF-8, the former the local code page, but both use the same type.
 Actually, if you get right down to the definition, they use different
 types with the same name and none of the type-safety features you expect
 from a typed programming language!

 It would be a lot easier if the types had different names.

 Hauke

agreed.

I have a question about local code pages: do they all contain ASCII? I
thought so but I don't know for sure. Googling around it seems like some old
encodings were not actually compatible with ASCII but I don't know if those
encodings are in use anymore (EBCDIC?). Remember ASCII is only the first 7
bits. That was why I proposed using the types "ascii","utf8","utf16" and
"utf32". Any ascii[] that has non-ASCII bytes is assumed to be encoded in
the local code page. The standard C functions all take local encodings, so
they would be declared as taking ascii[] strings. If naming a type "ascii"
is too offensive then maybe someone can come up with a word for "8-bit local
encoding suitable for printf and friends". Hauke had suggested "charz",
which doesn't sound too bad to me but then it also makes a statement about
the trailing 0. Maybe something like "lchar" for local encoding, "uchar8"
for unicode utf-8, "uchar16" for utf-16 and "uchar32" for utf-32.

-Ben

Dec 27 2003

Hauke Duden <H.NS.Duden gmx.net> writes:

Ben Hinkle wrote:
 I have a question about local code pages: do they all contain ASCII?

According to the C docs there are no guarantees about the way characters 
are encoded. However, I read some time ago that "most" code pages are 
backward compatible with ASCII. No clue whether that means "most" as in 
"all we care about" or "most" as in "except for that one important code 
page we need to support". Murphy tells me that we should assume the 
latter ;).

 The standard C functions all take local encodings, so
 they would be declared as taking ascii[] strings.

That seems like another hack to me. Even if the code pages are ASCII 
compatible, the strings you pass to the functions are not necessarily 
all ASCII, so I think the type should have a different name.


 If naming a type "ascii"
 is too offensive then maybe someone can come up with a word for "8-bit local
 encoding suitable for printf and friends". Hauke had suggested "charz",
 which doesn't sound too bad to me but then it also makes a statement about
 the trailing 0. Maybe something like "lchar" for local encoding, "uchar8"
 for unicode utf-8, "uchar16" for utf-16 and "uchar32" for utf-32.

I like the uchar8 and uchar16. There is a clash with the naming 
convention that uTYPE means the unsigned version of that type. But that 
seems ok, since UTF-8 and UTF-16 chars would be unsigned anyway.

How about the following:

cchar: for C-type strings, made up of 8 bit elements encoded in the local 
code page. Maybe also null-terminated by convention?

wchar: for C-type wide char strings, made up of 16 bit or 32 bit 
elements depending on the system (16 for windows, 32 for linux). The 
encoding is UTF-16 or UTF-32. We need a type like this to be able to 
write portable code that interoperates with C-functions that take the 
wchar_t type, which has the same properties. Also null-terminated by 
convention?

uchar8: UTF-8 encoded string

uchar16: UTF-16 encoded string

uchar32: UTF-32 encoded string

Additionally, I would advocate adding either a "char" or "dchar" alias 
for uchar32, to make the point that this should be the default character 
type. uchar32 just does not stand out among the other char types.

Hauke

Dec 28 2003

"Matthew" <matthew.hat stlsoft.dot.org> writes:

"Hauke Duden" <H.NS.Duden gmx.net> wrote in message
news:bshas9$1kgh$1 digitaldaemon.com...
 Walter wrote:
 D currently can handle UTF-8 (unicode) strings, but it does not handle
 shift-JIS strings. To make shift-JIS work in your programs, you'll need to
 write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
 shift-JIS on output. I intend to do this for all code pages, but have not
 written those filters yet.

 I don't think you have to. The C runtime library provides functions to
 convert from wide char to the local code page and vice versa. We can use
 those for conversions of this kind.

 I know I'm repeating the same stuff over and over, but maybe this real
 world example has shifted your position somewhat. I REALLY think UTF-8
 strings should not use the "char" type. The CRT expects chars to be
 encoded in the local code page, so this will lead to all kinds of
 confusion when you mix C functions with D functions. The latter expects
 UTF-8, the former the local code page, but both use the same type.
 Actually, if you get right down to the definition, they use different
 types with the same name and none of the type-safety features you expect
 from a typed programming language!

 It would be a lot easier if the types had different names.

I think you have some mileage in this idea.

Walter, internationalisation is an issue so fraught with confusion and
misunderstanding that I would suggest it is a +ve step to have new, and
ugly, types with which to deal with the different coding schemes. No-one who
does not understand it should go near such things.

Naturally that leads us to the position where all these issues must be
handled by the language for us, so people (which I think includes just about
all of us) who do not understand the issues do not need to care and yet can
still write correct programs.

Dec 28 2003

Hauke Duden <H.NS.Duden gmx.net> writes:

Matthew wrote:
 Walter, internationalisation is an issue so fraught with confusion and
 misunderstanding that I would suggest it is a +ve step to have new, and
 ugly, types with which to deal with the different coding schemes. No-one who
 does not understand it should go near such things.
 
 Naturally that leads us to the position where all these issues must be
 handled by the language for us, so people (which I think includes just about
 all of us) who do not understand the issues do not need to care and yet can
 still write correct programs.

I agree. To get us near this goal, I have begun work on a set of string 
interfaces and classes over the holidays. They have the properties I 
expect of good string handling (I have mentioned most of these in other 
threads):

- people should only have to think "string", not 
"UTF-8/16/32/ASCII/Shift-JIS..."
- hide the encoding most of the time
- allow character-based indexing and iteration
- provide basic string operations: (case-insensitive) comparison, 
concatenation, ...
- prevent unnecessary copying of the data
- allow access to the raw encoded data, if necessary (for interacting 
with C functions)
- automatically make strings null-terminated if needed, but do not treat 
the terminator as part of the string (e.g. do not include it in the length)
- enable implementations using other encodings. There is a lot of 
non-ASCII legacy code out there, so UTF-8 alone just doesn't cut it.

My hope is that when I'm done with this, Walter will declare those 
interfaces (or a similar solution) the default way to deal with strings. 
The goal is to never see those raw data strings anywhere in a normal D 
program, except under special circumstances or when interacting with C code.

And if that doesn't happen then at the very least I hope that this will 
inspire some more work to be done in this area. Bad string support can 
take all the fun out of a language ;) .

Hauke

Dec 28 2003
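
The difference between raw encoded data and character-based iteration, which several items in the list above rely on, can be seen in a few lines of present-day D. This is a minimal sketch, not the string classes described in the post; `codePointCount` is a hypothetical helper, and it counts Unicode code points, which is only an approximation of what a user perceives as characters.

```d
import std.stdio : writeln;

/// Hypothetical helper: counts Unicode code points by letting
/// foreach decode the UTF-8 data one dchar at a time.
size_t codePointCount(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "日本語";

    // .length counts UTF-8 code units (bytes), not characters.
    writeln(s.length);

    // Decoding iteration sees one element per code point.
    writeln(codePointCount(s));
}
```

A string interface that hides the encoding would expose the second number as the length and keep the first available only for code that hands the raw data to C functions.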

"Walter" <walter digitalmars.com> writes:

"Matthew" <matthew.hat stlsoft.dot.org> wrote in message
news:bsnpkn$30qk$1 digitaldaemon.com...
 Naturally that leads us to the position where all these issues must be
 handled by the language for us, so people (which I think includes just about
 all of us) who do not understand the issues do not need to care and yet can
 still write correct programs.

There is no way you can "not care" and still write correct programs unless
you're also willing to abandon all hope of writing competitively fast
applications.

D will provide the capability to write correct programs, but it will still
be up to the programmer to use that capability. What I'll probably wind up
doing is writing a tutorial page on it.

Jan 03 2004

"Matthew" <matthew.hat stlsoft.dot.org> writes:

"Walter" <walter digitalmars.com> wrote in message
news:bt7pts$28dc$2 digitaldaemon.com...
 "Matthew" <matthew.hat stlsoft.dot.org> wrote in message
 news:bsnpkn$30qk$1 digitaldaemon.com...
 Naturally that leads us to the position where all these issues must be
 handled by the language for us, so people (which I think includes just about
 all of us) who do not understand the issues do not need to care and yet can
 still write correct programs.

 There is no way you can "not care" and still write correct programs unless
 you're also willing to abandon all hope of writing competitively fast
 applications.

 D will provide the capability to write correct programs, but it will still
 be up to the programmer to use that capability. What I'll probably wind up
 doing is writing a tutorial page on it.

Good answer. :)

Jan 03 2004

"Y.Tomino" <demoonlit inter7.jp> writes:

 The C runtime library provides functions to 
 convert from wide char to the local code page and vice versa. We can use 
 those for conversions of this kind.

Oh! I did not know. I don't have a good knowledge of C. Thank you.
We can use setlocale, mbstowcs, wcstombs, etc.

By the way, excuse me. Are these different?
Only in the second case, I can get the right result.

setlocale(LC_ALL, null); (returns "C")
setlocale(LC_ALL, ""); (returns "Japanese_Japan.932")

 I know I'm repeating the same stuff over and over, but maybe this real 
 world example has shifted your position somewhat. I REALLY think UTF-8 
 strings should not use the "char" type. The CRT expects chars to be 
 encoded in the local code page, so this will lead to all kinds of 
 confusion when you mix C functions with D functions. The latter expects 
 UTF-8, the former the local code page, but both use the same type. 
 Actually, if you get right down to the definition, they use different 
 types with the same name and none of the type-safety features you expect 
 from a typed programming language!
 
 It would be a lot easier if the types had different names.

I agree.

YT

Dec 28 2003

Hauke Duden <H.NS.Duden gmx.net> writes:

Y.Tomino wrote:
The C runtime library provides functions to 
convert from wide char to the local code page and vice versa. We can use 
those for conversions of this kind.

 
 
 Oh! I did not know. I don't have a good knowledge of C. Thank you.
 We can use setlocale, mbstowcs, wcstombs, etc.

Actually, I was only referring to converting from/to the current code 
page, not changing the code page. I'm not sure whether all possible code 
pages are supported on all systems.

 By the way, excuse me. Are these different?
 Only in the second case, I can get the right result.
 
 setlocale(LC_ALL, null); (returns "C")
 setlocale(LC_ALL, ""); (returns "Japanese_Japan.932")

The first one does not change anything. It simply returns the current 
locale settings. From the CRT docs:

"""
The null pointer is a special directive that tells setlocale to query 
rather than set the international environment.
"""

Hauke

Dec 28 2003
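
The query-versus-set behaviour of setlocale described above can be checked with a short present-day D program that calls the C runtime through `core.stdc.locale`. This is a sketch under assumptions: `currentLocale` and `useDefaultLocale` are hypothetical helper names, and the locale names printed depend entirely on the system, so none are shown here.

```d
import core.stdc.locale : LC_ALL, setlocale;
import std.stdio : writeln;
import std.string : fromStringz;

/// Hypothetical helper: queries the current locale without changing it.
/// A null second argument tells setlocale to report, not to set.
string currentLocale()
{
    return fromStringz(setlocale(LC_ALL, null)).idup;
}

/// Hypothetical helper: switches to the user's default locale.
/// An empty string asks the CRT to take the locale from the environment.
/// Returns an empty string if the CRT could not honor the request.
string useDefaultLocale()
{
    return fromStringz(setlocale(LC_ALL, "")).idup;
}

void main()
{
    writeln("at startup:  ", currentLocale());
    writeln("after set:   ", useDefaultLocale());
    writeln("query again: ", currentLocale());
}
```

Only the call with the empty string changes the locale, which is why the multi-byte conversion functions give the expected result only after it has been made.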

"Y.Tomino" <demoonlit inter7.jp> writes:

Thank you.

I traced setlocale(LC_ALL, "") with a debugger; it calls
GetUserDefaultLangID.
I expect the CRT to use the correct code page of the OS if I call
setlocale(LC_ALL, "") at the start of my code.

YT

Dec 29 2003

"Y.Tomino" <demoonlit inter7.jp> writes:

A "filter" mechanism is unnecessary.
It's the same problem as the WriteString one I posted about.
The A versions of the Windows API always require strings in the current code page.
** Don't call A-version APIs with UTF-8!! **
Phobos simply needs to convert with WideCharToMultiByte before calling
A-version APIs.
WideCharToMultiByte converts Unicode to Shift-JIS in Japan,
and to whatever encoding is in use in other countries.

for example...

//current
void mkdir(char[] pathname)
{
  if (!CreateDirectoryA(toStringz(pathname), null))
  {
    throw new FileException(pathname, GetLastError());
  }
}

//correct
void mkdir(char[] utf8_pathname)
{
  char[] codepage_pathname = toMBCS(toUTF16(utf8_pathname));
  if (!CreateDirectoryA(toStringz(codepage_pathname), null))
  {
    throw new FileException(utf8_pathname, GetLastError());
  }
}

char[] toMBCS(wchar[] s)
{
  char[] result;
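  // The first call, with a zero-length output buffer, returns the required size.
  // Code page 0 is CP_ACP, the current ANSI code page.
  result.length = WideCharToMultiByte(0, 0, s, s.length, null, 0, null, null);
  WideCharToMultiByte(0, 0, s, s.length, result, result.length, null, null);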
  return result;
}

//ideal
//Unicode has many letters more than code-page encoding.
//Please try W-Version API first.
void mkdir(char[] utf8_pathname)
{
  wchar[] utf16_pathname = toUTF16(utf8_pathname);
  if(!CreateDirectoryW(cast(wchar*)(utf16_pathname ~ "\0"), null))
  {
    char[] codepage_pathname = toMBCS(utf16_pathname);
    if (!CreateDirectoryA(toStringz(codepage_pathname), null))
    {
      throw new FileException(utf8_pathname, GetLastError());
    }
  }
}

Thanks.
YT

"Walter" <walter digitalmars.com> wrote in message
news:bsgrfu$rpp$1 digitaldaemon.com...
 D currently can handle UTF-8 (unicode) strings, but it does not handle
 shift-JIS strings. To make shift-JIS work in your programs, you'll need to
 write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
 shift-JIS on output. I intend to do this for all code pages, but have not
 written those filters yet.

 "ET_yoza" <ET_yoza_member pathlink.com> wrote in message
 news:bseb83$2qvq$1 digitaldaemon.com...
 Dear Mr. Walter,

 I am a D language user in Japan and am testing it. I have Windows 2000
 Japanese Edition installed on my development PC, along with the D
 compiler.
 I found a problem while using D, concerning the encoding of multi-byte
 character sequences.
 I know that D is a Unicode-oriented language, so I wrote my source code
 in Unicode.
 The API of my OS requires Shift_JIS as the encoding of character sequences.
 (The MBCS that the A versions of the Windows API require is a
 country-specific encoding that is not compatible with UTF-8.)
 I expected D to convert the encoding implicitly, but D does not seem to
 perform the conversion when calling the API.
 As a result, in order to display Japanese, the source code cannot be
 written in Unicode.
 If the source code is written in Shift_JIS, the short-term purpose will
 be achieved. However, that is contrary to the specification. Moreover, a
 vicious problem from C will recur, since encodings other than Unicode
 have characters that contain an escape character:

 "undefined escape sequence \?"

 I would like to write code in Unicode as the specification says, so
 please solve this problem by converting the encoding when Phobos calls
 the API.

Dec 26 2003
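
The UTF-8 to UTF-16 step that the mkdir examples above depend on can be tried without any Windows call. The sketch below uses `toUTF16` and `toUTF16z` from the present-day `std.utf` module; `lengthZ` is a hypothetical helper written for this example, the path is made up, and the code only illustrates the conversion, not how Phobos implements mkdir.

```d
import std.stdio : writeln;
import std.utf : toUTF16, toUTF16z;

/// Hypothetical helper: counts UTF-16 code units up to the 0 terminator.
size_t lengthZ(const(wchar)* p)
{
    size_t n = 0;
    while (p[n] != 0)
        ++n;
    return n;
}

void main()
{
    // Made-up path: three ASCII characters followed by three kanji.
    string utf8Path = "C:\\日本語";

    // Same text as UTF-16, without a terminator.
    wstring utf16Path = toUTF16(utf8Path);

    // Zero-terminated UTF-16, the form a W-version API expects.
    const(wchar)* forWApi = toUTF16z(utf8Path);

    writeln(utf8Path.length);  // UTF-8 code units
    writeln(utf16Path.length); // UTF-16 code units
    writeln(lengthZ(forWApi)); // terminator is not counted
}
```

The byte count and the UTF-16 length differ for the same text, so a buffer sized for one encoding cannot be assumed to fit the other; this is why toMBCS in the post asks WideCharToMultiByte for the required size first.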

ET_yoza <ET_yoza_member pathlink.com> writes:

Thank you. Thank you. Thank you!!!

In article <bshb44$1kr2$1 digitaldaemon.com>, Y.Tomino says...
A "filter" mechanism is unnecessary.
It's the same problem as the WriteString one I posted about.
The A versions of the Windows API always require strings in the current code page.
** Don't call A-version APIs with UTF-8!! **
Phobos simply needs to convert with WideCharToMultiByte before calling
A-version APIs.
WideCharToMultiByte converts Unicode to Shift-JIS in Japan,
and to whatever encoding is in use in other countries.

for example...

//current
void mkdir(char[] pathname)
{
  if (!CreateDirectoryA(toStringz(pathname), null))
  {
    throw new FileException(pathname, GetLastError());
  }
}

//correct
void mkdir(char[] utf8_pathname)
{
  char[] codepage_pathname = toMBCS(toUTF16(utf8_pathname));
  if (!CreateDirectoryA(toStringz(codepage_pathname), null))
  {
    throw new FileException(utf8_pathname, GetLastError());
  }
}

char[] toMBCS(wchar[] s)
{
  char[] result;
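  // The first call, with a zero-length output buffer, returns the required size.
  // Code page 0 is CP_ACP, the current ANSI code page.
  result.length = WideCharToMultiByte(0, 0, s, s.length, null, 0, null, null);
  WideCharToMultiByte(0, 0, s, s.length, result, result.length, null, null);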
  return result;
}

//ideal
//Unicode has many letters more than code-page encoding.
//Please try W-Version API first.
void mkdir(char[] utf8_pathname)
{
  wchar[] utf16_pathname = toUTF16(utf8_pathname);
  if(!CreateDirectoryW(cast(wchar*)(utf16_pathname ~ "\0"), null))
  {
    char[] codepage_pathname = toMBCS(utf16_pathname);
    if (!CreateDirectoryA(toStringz(codepage_pathname), null))
    {
      throw new FileException(utf8_pathname, GetLastError());
    }
  }
}

Dec 26 2003

"Matthew" <matthew.hat stlsoft.dot.org> writes:

 //ideal
 //Unicode has many letters more than code-page encoding.
 //Please try W-Version API first.
 void mkdir(char[] utf8_pathname)
 {
   wchar[] utf16_pathname = toUTF16(utf8_pathname);
   if(!CreateDirectoryW(cast(wchar*)(utf16_pathname ~ "\0"), null))
   {
     char[] codepage_pathname = toMBCS(utf16_pathname);
     if (!CreateDirectoryA(toStringz(codepage_pathname), null))
     {
       throw new FileException(utf8_pathname, GetLastError());
     }
   }
 }

I don't have any criticisms to make of your internationalisation postulates,
but the above code is *not* the right way to write A/W flexible functions in
Win32.

For example, what do I get when I retrieve the win32 error code from the
FileException?

I've not followed enough of these threads to know whether much, if any,
library code is being written at the moment, but if it is, and if it is done
like this, it is a bad thing.

Matthew

Dec 28 2003

"Y.Tomino" <demoonlit inter7.jp> writes:

 I don't have any criticisms to make of your internationalisation postulates,
 but the above code is *not* the right way to write A/W flexible functions in
 Win32.

OK, it's not the right way. I neglected the OS version check.

OSVERSIONINFO osVersion;
// GetVersionEx requires the structure size to be set before the call.
osVersion.dwOSVersionInfoSize = OSVERSIONINFO.sizeof;
GetVersionEx(&osVersion);
if(osVersion.dwPlatformId==VER_PLATFORM_WIN32_NT){
  //try W-version API
}else{
  //try A-version API
}

But my code works because Windows 9x has entries for some W-version APIs
(they return FALSE).
(What I really want is for Phobos to use only the W-version APIs.)

 For example, what do I get when I retrieve the win32 error code from the
 FileException?

FileException in the current Phobos has "errno".
(I am not sure whether this usage is right...)

Thanks.
YT

Dec 28 2003
