
digitalmars.D.bugs - Only support for UTF-8?

Nick <Nick_member pathlink.com> writes:
The D text formatting system seems to only support Unicode, where most
"international" characters must be coded in two or more bytes. However, many
systems are not set up with UTF-8 or similar by default. For example, I'm
currently on a linux system set up with (non-Unicode) ISO-8859-Latin1 charset,
and the following fails:

# import std.stdio;
# import std.stream;
#
# alias std.stream.stdin stdin;
# alias std.stream.stdout stdout;
#
# void main()
# {
#   // Type some characters with byte values > 127, e.g. æøå
#   char[] test = stdin.readLine();
#
#   // This works as intended
#   stdout.writeLine("stream: " ~ test);
#
#   // "Error: invalid UTF-8 sequence"
#   writefln("writefln: ", test);
# }

The streams only work because they are made for raw binary data. Also,
format(...) fails. Most text files anywhere will not be Unicode, so this is a
BIG problem. Is there a plan to fix this? Or is it supposed to work already?

Also, the compiler itself won't accept my poor Scandinavian characters in string
literals...

Nick
Aug 13 2004
J C Calvarese <jcc7 cox.net> writes:
Nick wrote:
 The D text formatting system seems to only support Unicode, where most
 "international" characters must be coded in two or more bytes. However, many
 systems are not set up with UTF-8 or similar by default. For example, I'm
 currently on a linux system set up with (non-Unicode) ISO-8859-Latin1 charset,
 and the following fails:
 
 # import std.stdio;
 # import std.stream;
 #
 # alias std.stream.stdin stdin;
 # alias std.stream.stdout stdout;
 #
 # void main()
 # {
 #   // Type some characters with byte values > 127, e.g. æøå
 #   char[] test = stdin.readLine();
 #
 #   // This works as intended
 #   stdout.writeLine("stream: " ~ test);
 #
 #   // "Error: invalid UTF-8 sequence"
 #   writefln("writefln: ", test);
 # }
 
 The streams only work because they are made for raw binary data. Also,
 format(...) fails. Most text files anywhere will not be Unicode, so this is a
 BIG problem. Is there a plan to fix this? Or is it supposed to work already?

I think you'll either need to program this yourself or wait for someone to write a library that does this. I can think of someone who might write such a library, but I'm not going to name names. (If you can't guess who I'm thinking of, you need to read some of the recent posts about Unicode in the main newsgroup.) I've heard that the reason D is Unicode-based (when it's not ASCII-based) is that there are so many different charsets out there that it'd be hard to cover every one of them.
 
 Also, the compiler itself won't accept my poor Scandinavian characters in string
 literals...

Look at http://www.digitalmars.com/d/lex.html

 D source text can be in one of the following formats:

   * ASCII
   * UTF-8
   * UTF-16BE
   * UTF-16LE
   * UTF-32BE
   * UTF-32LE

 UTF-8 is a superset of traditional 7-bit ASCII. One of the following UTF
 BOMs (Byte Order Marks) can be present at the beginning of the source text ...

If you're saving files as ISO-8859-Latin1, I'd wager that you're out of luck. Even if UTF-8/UTF-16BE/UTF-16LE/UTF-32BE/UTF-32LE isn't the default, it is an option, right?

If you want to see why Unicode is so popular, you might want to review some of the Unicode threads: http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
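To make the BOM part of the quoted lexer page concrete, here is a minimal sketch in D of how a tool might recognise the five byte order marks. The names `SourceEncoding` and `detectBom` are made up for this illustration and are not part of the compiler or Phobos; the byte values themselves are the standard Unicode BOMs. A Latin-1 file has no BOM and would simply fall through to `noBom`.

```d
import std.stdio : writeln;

/// Result of looking for a byte order mark (hypothetical type for this sketch).
enum SourceEncoding { noBom, utf8, utf16be, utf16le, utf32be, utf32le }

// The standard Unicode byte order marks.
immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] bomUtf16be = [0xFE, 0xFF];
immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];
immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];

/// True if `data` begins with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Report which BOM, if any, starts the buffer.
SourceEncoding detectBom(const(ubyte)[] data)
{
    // UTF-32LE must be checked before UTF-16LE: both start with FF FE.
    if (hasPrefix(data, bomUtf32le)) return SourceEncoding.utf32le;
    if (hasPrefix(data, bomUtf32be)) return SourceEncoding.utf32be;
    if (hasPrefix(data, bomUtf8))    return SourceEncoding.utf8;
    if (hasPrefix(data, bomUtf16be)) return SourceEncoding.utf16be;
    if (hasPrefix(data, bomUtf16le)) return SourceEncoding.utf16le;
    return SourceEncoding.noBom;
}

void main()
{
    // A UTF-8 BOM followed by the two ASCII bytes "//".
    immutable ubyte[] sample = [0xEF, 0xBB, 0xBF, 0x2F, 0x2F];
    writeln(detectBom(sample));
}
```

The order of the checks matters only for the two little-endian marks; the others cannot be confused with each other.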
 Nick

-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Aug 13 2004
Sean Kelly <sean f4.ca> writes:
In article <cfjgul$1a79$1 digitaldaemon.com>, Nick says...
The streams only work because they are made for raw binary data. Also,
format(...) fails. Most text files anywhere will not be Unicode, so this is a
BIG problem. Is there a plan to fix this? Or is it supposed to work already?

I've already got a fix for this (as far as it's been tested anyway). readf is meant to be a complement to writef and is available as part of unformat:

http://home.f4.ca/sean/d/unformat.d
http://home.f4.ca/sean/d/utf.d

utf.d is a drop-in replacement for std.utf. The readf calls pretty much follow the spec for scanf. You may also have to compile std.format into your application, as the compiler otherwise seems to get hung up on multiply defined stdarg symbols.

If you've got any comments, please feel free to share them. There are some differences between readf and writef that I'm not sure I should fix (it doesn't throw an exception on an argument mismatch, for example).

Sean
Aug 14 2004
Nick <Nick_member pathlink.com> writes:
In article <cfld0n$2ga6$1 digitaldaemon.com>, Sean Kelly says...
I've already got a fix for this (as far as it's been tested anyway).  readf is
meant to be a complement to writef and is available as a part of unformat:

http://home.f4.ca/sean/d/unformat.d
http://home.f4.ca/sean/d/utf.d

utf.d is a drop-in replacement for std.utf.  The readf calls pretty much follow
the spec for scanf.  You may also have to compile std.format into your
application as the compiler seems to get hung up on multiply defined stdarg
symbols otherwise.

Thank you, I'll make sure to play around with them later.

Nick
Aug 14 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfjgul$1a79$1 digitaldaemon.com>, Nick says...
The D text formatting system seems to only support Unicode,

When I read that D "only" supports Unicode, I had to smile, because of course, many other systems only support a subset of Unicode, whereas D supports all of it. But I do know what you mean - your problem in fact is that D only supports UTF character encodings.
where most
"international" characters must be coded in two or more bytes.

Right, so your problem is the encoding, not the character set.
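A small D sketch may help show the distinction. It assumes a current Phobos (`std.utf.validate`, `std.exception.collectException`, `std.string.representation`); the helper `isValidUtf8` is a name invented for this example. The character æ has the same code point, U+00E6, in Latin-1 and in Unicode; only the bytes used to store it differ, and the single Latin-1 byte is not a complete UTF-8 sequence.

```d
import std.exception : collectException;
import std.stdio : writefln;
import std.string : representation;
import std.utf : UTFException, validate;

/// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    // validate() throws UTFException on a malformed sequence.
    return collectException!UTFException(
        validate(cast(const(char)[]) bytes)) is null;
}

void main()
{
    // Same character, U+00E6, in two encodings.
    string utf8 = "\u00E6";              // D stores string literals as UTF-8
    immutable ubyte[] latin1 = [0xE6];   // one byte in ISO-8859-1

    writefln("UTF-8 bytes:   %(%02X %)", utf8.representation);
    writefln("Latin-1 bytes: %(%02X %)", latin1);

    // Treating the Latin-1 byte as UTF-8 text is what a UTF-only
    // formatter rejects.
    writefln("UTF-8 form valid:   %s", isValidUtf8(utf8.representation));
    writefln("Latin-1 form valid: %s", isValidUtf8(latin1));
}
```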
However, many
systems are not set up with UTF-8 or similar by default. For example, I'm
currently on a linux system set up with (non-Unicode) ISO-8859-Latin1 charset,

I assume you mean ISO-8859-1, colloquially known as "Latin 1". Incidentally, these days, ISO-8859-1 is considered to be an encoding, not a character set (though the word "charset" continues to exist in HTML and other suchlike, for historical reasons).
and the following fails:
<snip>
Is there a plan to fix this? Or is it supposed to work already?

There is a plan to fix this (I think), but I'm not clear on the details. Sean is doing some stream stuff, and Hauke, last I heard, was doing some string stuff involving the handling of transcoding issues (and conversion to/from ISO-8859-1 is certainly a transcoding issue). We could probably do with an update on who's doing what.
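For the specific case of ISO-8859-1, the transcoding is simple enough to sketch by hand, because every Latin-1 byte value is also its Unicode code point. The function below is only an illustration of that idea, not the library work mentioned above; `latin1ToUtf8` is a hypothetical name. Bytes below 0x80 are copied unchanged, and every other byte becomes a two-byte UTF-8 sequence.

```d
import std.stdio : writefln;

/// Hypothetical helper: convert ISO-8859-1 bytes to a UTF-8 string.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    result.reserve(input.length * 2); // worst case: two bytes per character

    foreach (b; input)
    {
        if (b < 0x80)
        {
            // ASCII range: identical in Latin-1 and UTF-8.
            result ~= cast(char) b;
        }
        else
        {
            // Code points 0x80 .. 0xFF need two bytes: 110xxxxx 10xxxxxx.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

void main()
{
    // "æøå" as it would arrive from a Latin-1 terminal.
    immutable ubyte[] input = [0xE6, 0xF8, 0xE5];

    // After conversion the text is valid UTF-8 and safe to format.
    writefln("writefln: %s", latin1ToUtf8(input));
}
```

Other legacy encodings need lookup tables or a library such as iconv; current Phobos also has a `std.encoding` module that may be a better fit than a hand-written loop.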
Also, the compiler itself won't accept my poor Scandinavian characters in string
literals...

Well, /this/ one, at least, we can solve for you, right now. Re-save your source file in UTF-8, and then recompile it. Then your characters will be accepted.

Arcane Jill
Aug 15 2004
Nick <Nick_member pathlink.com> writes:
In article <cfnuq3$nq5$1 digitaldaemon.com>, Arcane Jill says...
When I read that D "only" supports Unicode, I had to smile, because of course,
many other systems only support a subset of Unicode, whereas D supports all of
it. But I do know what you mean - your problem in fact is that D only supports
UTF character encodings.

That is correct. I've read up a bit on Unicode the last couple of days, and I'm starting to appreciate the fact that D supports it fully and natively. Seeing how many character encodings there are in existence and all the problems related, it's easy to see that ubiquitous support for Unicode would be a blessing, and I think D is now a part of setting that standard. But D also needs some support for other encodings, since these are still used everywhere. More on this below.
There is a plan to fix this (I think), but I'm not clear on the details.
Sean is doing some stream stuff, and Hauke, last I heard, was doing some
string stuff involving the handling of transcoding issues (and conversion
to/from ISO-8859-1 is certainly a transcoding issue). We could probably
do with an update on who's doing what.

I will add to that and say I'm currently writing some simple wrappers for the C iconv functions, hopefully to be posted soon (hence the errno posts in the other NG). These support a large set of encodings but are native to Unix, I think. I'm not sure what to do for Win32. Of course, it would also be nice to have some way to autodetect locale settings and convert all stdin/stdout traffic automatically, but I really have other things to do :-)
Well, /this/ one, at least, we can solve for you, right now. Re-save your
source file in UTF-8, and then recompile it. Then your characters will be
accepted.

Yes, I figured that out. Emacs cooperated after I threatened to kick it in the pants, as is the usual practice. I also recently found out how ridiculously easy it was to switch to UTF-8 in Linux (which involved setting an entire environment variable!). Legacy apps are still a problem, though.

Nick
Aug 15 2004