digitalmars.D.bugs - Only support for UTF-8?

Nick <Nick_member pathlink.com> writes:

The D text formatting system seems to only support Unicode, where most
"international" characters must be coded in two or more bytes. However, many
systems are not set up with UTF-8 or similar by default. For example, I'm
currently on a Linux system set up with (non-Unicode) ISO-8859-Latin1 charset,
and the following fails:

[code sample not preserved in the archive]

The streams only work because they are made for raw binary data. Also,
format(...) fails. Most text files anywhere will not be Unicode, so this is a
BIG problem. Is there a plan to fix this? Or is it supposed to work already?

Also, the compiler itself won't accept my poor Scandinavian characters in string
literals...

Nick

Aug 13 2004

J C Calvarese <jcc7 cox.net> writes:

Nick wrote:
 The D text formatting system seems to only support Unicode, where most
 "international" characters must be coded in two or more bytes. However, many
 systems are not set up with UTF-8 or similar by default. For example, I'm
 currently on a Linux system set up with (non-Unicode) ISO-8859-Latin1 charset,
 and the following fails:
 
 [code sample not preserved in the archive]
 
 The streams only work because they are made for raw binary data. Also,
 format(...) fails. Most text files anywhere will not be Unicode, so this is a
 BIG problem. Is there a plan to fix this? Or is it supposed to work already?

I think you'll either need to program this yourself or wait for someone
to write a library that does this. I can think of someone who might
write such a library, but I'm not going to name names. (If you can't
guess who I'm thinking of, you need to read some of the recent posts
about Unicode in the main newsgroup.)

I've heard the reason why D is Unicode-based (when it's not ASCII-based)
is because there are so many different charsets out there that it'd be
hard to cover every one of them.

 
 Also, the compiler itself won't accept my poor Scandinavian characters in string
 literals...

Look at http://www.digitalmars.com/d/lex.html

D source text can be in one of the following formats:
   * ASCII
   * UTF-8
   * UTF-16BE
   * UTF-16LE
   * UTF-32BE
   * UTF-32LE

UTF-8 is a superset of traditional 7-bit ASCII. One of the following UTF
BOMs (Byte Order Marks) can be present at the beginning of the source
text ...
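
For illustration, the BOM check that the spec excerpt alludes to could be sketched in C roughly as follows. The enum and function names are made up for this example, and this is not a description of how the D compiler itself is implemented; it only covers the case where a BOM is actually present.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_NO_BOM,   /* no BOM found: could be ASCII or BOM-less UTF-8 */
    ENC_UTF8,
    ENC_UTF16BE,
    ENC_UTF16LE,
    ENC_UTF32BE,
    ENC_UTF32LE
} bom_encoding;

/* Inspect the first bytes of a source file and report which UTF BOM,
 * if any, they contain. The 4-byte UTF-32 marks are tested before the
 * 2-byte UTF-16 marks because FF FE 00 00 begins with the UTF-16LE BOM. */
bom_encoding detect_bom(const unsigned char *p, size_t len)
{
    assert(p != NULL || len == 0);
    if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return ENC_UTF32BE;
    if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return ENC_UTF32LE;
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return ENC_UTF16BE;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return ENC_UTF16LE;
    return ENC_NO_BOM;
}
```

A Latin-1 file never starts with a valid BOM in practice, so a check like this would report ENC_NO_BOM and the bytes above 0x7F would then be rejected as malformed UTF-8.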

If you're saving files as ISO-8859-Latin1, I'd wager that you're out of 
luck.

Even if UTF-8/UTF-16BE/UTF-16LE/UTF-32BE/UTF-32LE isn't the default, it
is an option, right?

If you want to see why Unicode is so popular, you might want to review
some of the Unicode threads:
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues



 Nick




-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/

Aug 13 2004

Sean Kelly <sean f4.ca> writes:

In article <cfjgul$1a79$1 digitaldaemon.com>, Nick says...
The streams only work because they are made for raw binary data. Also,
format(...) fails. Most text files anywhere will not be Unicode, so this is a
BIG problem. Is there a plan to fix this? Or is it supposed to work already?

I've already got a fix for this (as far as it's been tested anyway).  readf is
meant to be a complement to writef and is available as a part of unformat:

http://home.f4.ca/sean/d/unformat.d
http://home.f4.ca/sean/d/utf.d

utf.d is a drop-in replacement for std.utf.  The readf calls pretty much follow
the spec for scanf.  You may also have to compile std.format into your
application as the compiler seems to get hung up on multiply defined stdarg
symbols otherwise.

If you've got any comments, please feel free to send them.  There are some
differences between readf and writef that I'm not sure I should fix (readf
doesn't throw an exception on an argument mismatch, for example).


Sean

Aug 14 2004

Nick <Nick_member pathlink.com> writes:

In article <cfld0n$2ga6$1 digitaldaemon.com>, Sean Kelly says...
I've already got a fix for this (as far as it's been tested anyway).  readf is
meant to be a complement to writef and is available as a part of unformat:

http://home.f4.ca/sean/d/unformat.d
http://home.f4.ca/sean/d/utf.d

utf.d is a drop-in replacement for std.utf.  The readf calls pretty much follow
the spec for scanf.  You may also have to compile std.format into your
application as the compiler seems to get hung up on multiply defined stdarg
symbols otherwise.

Thank you, I'll make sure to play around with them later.

Nick

Aug 14 2004

Arcane Jill <Arcane_member pathlink.com> writes:

In article <cfjgul$1a79$1 digitaldaemon.com>, Nick says...
The D text formatting system seems to only support Unicode,

When I read that D "only" supports Unicode, I had to smile, because of course,
many other systems only support a subset of Unicode, whereas D supports all of
it. But I do know what you mean - your problem in fact is that D only supports
UTF character encodings.


where most
"international" characters must be coded in two or more bytes.

Right, so your problem is the encoding, not the character set.


However, many
systems are not set up with UTF-8 or similar by default. For example, I'm
currently on a linux system set up with (non-Unicode) ISO-8859-Latin1 charset,

I assume you mean ISO-8859-1, colloquially known as "Latin 1". Incidentally, these
days, ISO-8859-1 is considered to be an encoding, not a character set (though
the word "charset" continues to exist in HTML and other suchlike, for historical
reasons).


and the following fails:
<snip>
Is there a plan to fix this? Or is it supposed to work already?

There is a plan to fix this (I think), but I'm not clear on the details. Sean is
doing some stream stuff, and Hauke, last I heard, was doing some string stuff
involving the handling of transcoding issues (and conversion to/from ISO-8859-1
is certainly a transcoding issue). We could probably do with an update on who's
doing what.
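
To make the transcoding point concrete, here is a minimal sketch in C of converting ISO-8859-1 bytes to UTF-8 by hand. It relies on the fact that every ISO-8859-1 byte value is equal to its Unicode code point, so only the encoding changes. The function name is invented for this example and is not part of any D or Phobos library.

```c
#include <assert.h>
#include <stddef.h>

/* Transcode `len` bytes of ISO-8859-1 text in `src` to UTF-8 in `dst`.
 * Bytes below 0x80 are plain ASCII and are copied unchanged; bytes in
 * 0x80..0xFF become a two-byte UTF-8 sequence.
 * `dst` must have room for at least 2 * len bytes.
 * Returns the number of bytes written to `dst`. */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));   /* lead byte: top 2 bits */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation: low 6 bits */
        }
    }
    return out;
}
```

For example, the Latin-1 byte 0xF8 (the letter ø) comes out as the two bytes 0xC3 0xB8. The reverse direction needs an error path, since most Unicode characters have no ISO-8859-1 equivalent.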



Also, the compiler itself won't accept my poor Scandinavian characters in string
literals...

Well, /this/ one, at least, we can solve for you, right now. Re-save your source
file in UTF-8, and then recompile it. Then your characters will be accepted.

Arcane Jill

Aug 15 2004

Nick <Nick_member pathlink.com> writes:

In article <cfnuq3$nq5$1 digitaldaemon.com>, Arcane Jill says...
When I read that D "only" supports Unicode, I had to smile, because of course,
many other systems only support a subset of Unicode, whereas D supports all of
it. But I do know what you mean - your problem in fact is that D only supports
UTF character encodings.

That is correct. I've read up a bit on Unicode the last couple of days, and I'm
starting to appreciate the fact that D supports it fully and natively. Seeing
how many character encodings there are in existence and all the problems
related, it's easy to see that ubiquitous support for Unicode would be a
blessing, and I think D is now a part of setting that standard. But D also needs
some support for other encodings, since these are still used everywhere. More on
this below.

There is a plan to fix this (I think), but I'm not clear on the details.
Sean is doing some stream stuff, and Hauke, last I heard, was doing some
string stuff involving the handling of transcoding issues (and conversion
to/from ISO-8859-1 is certainly a transcoding issue). We could probably
do with an update on who's doing what.

I will add to that and say I'm currently writing some simple wrappers for the C
iconv functions, hopefully to be posted soon (hence the errno posts in the other
NG.) These support a large set of encodings but are native to Unix, I think. I'm
not sure what to do for Win32.
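
As a rough idea of what such a wrapper sits on top of, this is a minimal C sketch of a one-shot conversion using the POSIX iconv calls (iconv_open, iconv, iconv_close). The wrapper name and its return convention are assumptions made for this example; it is not the code being written for D.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert `inlen` bytes of `in` from encoding `from` to encoding `to`.
 * Returns the number of bytes written to `out`, or (size_t)-1 on failure
 * (unknown encoding name, invalid input, or `out` too small). */
size_t convert_encoding(const char *from, const char *to,
                        const char *in, size_t inlen,
                        char *out, size_t outcap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open(to, from);   /* note: target encoding comes first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;              /* iconv takes char **, input is not modified */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;
}
```

A call such as convert_encoding("ISO-8859-1", "UTF-8", buf, len, out, sizeof out) would cover the Latin-1 case from this thread. On some systems iconv lives in a separate library and needs -liconv at link time.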

Of course, it would also be nice to have some way to autodetect locale settings
and convert all stdin/stdout traffic automatically, but I really have other
things to do :-)
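
The detection half of that idea is fairly small on POSIX systems. The following C sketch asks the C library which encoding the user's locale environment selects; the function name is invented for this example, and the automatic conversion of stdin/stdout traffic is not shown.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Return the name of the character encoding selected by the user's
 * locale environment (LC_ALL, LC_CTYPE, LANG), such as "UTF-8" or
 * "ISO-8859-1". The returned string is owned by the C library. */
const char *locale_codeset(void)
{
    /* Adopt the locale from the environment. If this fails, the
     * previously active locale simply stays in effect. */
    setlocale(LC_CTYPE, "");

    const char *name = nl_langinfo(CODESET);
    assert(name != NULL);
    return name;
}
```

The returned name can be handed to iconv_open as the source or target encoding, which is one way the locale setting and the conversion step could be tied together.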

Well, /this/ one, at least, we can solve for you, right now. Re-save your
source file in UTF-8, and then recompile it. Then your characters will be
accepted.

Yes, I figured that out. Emacs cooperated after I threatened to kick it in the
pants, as is the usual practice.

I also recently found out how ridiculously easy it was to switch to UTF-8 in
Linux (which involved setting an entire environment variable!) Legacy apps are
still a problem, though.

Nick

Aug 15 2004
