
digitalmars.D.bugs - Bug in std.string.format?

reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
If I do:

//Also with any other ascii 8 bit chars:
char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");

The program says (in runtime):

Error: invalid UTF-8 sequence

AFAIK 'Ñ' is UTF-8.
Jul 09 2004
next sibling parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Juanjo Álvarez wrote:

 If I do:
 
 //Also with any other ascii 8 bit chars:
 char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");

std.string.format isn't documented, as far as I can see. Is this the string counterpart of writef, which I'd just pointed out we should have over on d.D? Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 09 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cclofh$1qrr$1 digitaldaemon.com>, Juanjo Álvarez says...
If I do:

//Also with any other ascii 8 bit chars:
char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");

The program says (in runtime):

Error: invalid UTF-8 sequence

This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.
AFAIK 'Ñ' is UTF-8.

It is not. The Unicode character U+00D1, LATIN CAPITAL LETTER N WITH TILDE, is represented in UTF-8 by the two-byte sequence { 0xC3, 0x91 }. UTF-8 is backward compatible with ASCII. It is /not/, however, backward compatible with ISO-8859-1. Any character with a codepoint greater than 0x7F must be correctly UTF-8 encoded. You can get the correct UTF-8 sequence by starting with a string of dchars and passing it to std.utf.toUTF8(). Arcane Jill
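Those byte values are easy to verify in any Unicode-aware language; here is a quick Python illustration (an editorial addition, not part of the original post):

```python
# U+00D1 (LATIN CAPITAL LETTER N WITH TILDE) in UTF-8 vs. ISO-8859-1.
ntilde = "\u00D1"

# In UTF-8 it is the two-byte sequence 0xC3 0x91 ...
assert ntilde.encode("utf-8") == b"\xc3\x91"

# ... while ISO-8859-1 (Latin-1) uses the single byte 0xD1, which is an
# incomplete UTF-8 sequence on its own - hence the runtime error above.
assert ntilde.encode("latin-1") == b"\xd1"
```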
Jul 09 2004
next sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ccm0is$2768$1 digitaldaemon.com>, Arcane Jill says...
char[] str = std.string.format("STRING WITH NON ASCII7BIT CHARS ÑÑÑ");

Error: invalid UTF-8 sequence

This is not a bug. You have an invalid UTF-8 sequence. The library is correctly reporting it.

Oh - and here's the fix. Save your source-code text file in UTF-8 format before attempting to compile it. I suspect it is currently saved in some ANSI format or other - probably ISO-8859-1 or WINDOWS-1252, depending on your operating system. You need a text editor which can save in UTF-8. D source files should always be saved in UTF-8 format if you want string literals to be correctly interpreted. Jill
Jul 09 2004
prev sibling next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
Actually, come to think of it, it would be very, very helpful to users of D if
the D compiler actually checked the integrity of all string literals at compile
time. If any string literal were found (at compile time) to contain an invalid
UTF-8 sequence, it would help the user ENORMOUSLY if an error message along the
lines of:

#   ERROR - D source file not saved as UTF-8. Cannot compile.

were to be printed. (Strictly speaking, the D compiler should always pass the
entire source file to toUTF32(), and generate the above error if toUTF32()
fails. However, the source file encoding won't make any difference EXCEPT to
string literals).

So ... although it /is/ a user-error, it is nonetheless a user-error which DMD
could have detected at compile-time, instead of leaving the error reporting to
run time. The error message itself (as it stands) doesn't really help people to
understand what's wrong.

Arcane Jill
Jul 09 2004
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 #   ERROR - D source file not saved as UTF-8. Cannot compile.

Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too. <snip>
 So ... although it /is/ a user-error, it is nonetheless a user-error which DMD
 could have detected at compile-time, instead of leaving the error reporting to
 run time. The error message itself (as it stands) doesn't really help people to
 understand what's wrong.

Some debate is possible. Obviously the compiler isn't being UTF compliant. But what if someone wants to include, in a string literal, characters in the native OS or other character set that don't match UTF-8? (FTM, how are escaped characters supposed to be handled ITR, considering that a string literal can be either a char[], wchar[] or dchar[]?) Speaking of lexical.html... "There are no digraphs or trigraphs in D." What is meant by this, exactly? Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 09 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ccmo0h$8u0$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:

<snip>
 #   ERROR - D source file not saved as UTF-8. Cannot compile.

Hang on ... according to the docs, the compiler is supposed to accept UTF-16 and UTF-32 too.

I stand corrected. However, the UTFs are all very easy to tell apart. UTF-16 looks very different from UTF-8, and it only takes a simple algorithm to distinguish them. Ditto UTF-32. What I *SHOULD* have said is that DMD assumes that the source file is encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE. What it can't do is tell 8-bit encodings apart from each other, so it assumes that, if it's an 8-bit encoding, it will be UTF-8.
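The BOM-based part of that distinction is mechanical. Here is a minimal Python sketch (the function name is mine, and the real DMD heuristic may differ) of picking among the five encodings, falling back to UTF-8 for BOM-less input:

```python
# BOMs, longest first: the UTF-32-LE BOM begins with the UTF-16-LE BOM,
# so the 4-byte patterns must be tested before the 2-byte ones.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

def guess_utf(data: bytes) -> str:
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "utf-8"  # no BOM: assume UTF-8, as DMD does for 8-bit input
```

A real detector would also look at the pattern of zero bytes near the start of the file to catch BOM-less UTF-16/32, but the BOM table covers the common case.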
<snip>
 So ... although it /is/ a user-error, it is nonetheless a user-error which DMD
 could have detected at compile-time, instead of leaving the error reporting to
 run time. The error message itself (as it stands) doesn't really help people to
 understand what's wrong.

Some debate is possible. Obviously the compiler isn't being UTF compliant.

Yes, it is. The compiler is being 100% UTF compliant. Problems only arise if the source code isn't.
But what if someone wants to include, in a string literal, 
characters in the native OS or other character set that don't match 
UTF-8?

There ain't no such character. UTF-8 can encode all of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode. Oh wait - I believe the ZX Spectrum had some weird clunky graphics characters which are not in Unicode. But we don't need to worry about that because D has not been ported to that platform.
(FTM, how are escaped characters supposed to be handled ITR, 
considering that a string literal can be either a char[], wchar[] or 
dchar[]?)

They are supposed to be represented as is, not escaped in any way (beyond being encoded in UTF-whatever). Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.
Speaking of lexical.html...
"There are no digraphs or trigraphs in D."

What is meant by this, exactly?

Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.
Jul 09 2004
next sibling parent Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:


What is meant by this, exactly?

Old, old stuff from the early days of C. You have to go back a long time, but once, there were keyboards without square brackets or curly braces and things, and which were not remappable in software. Digraphs are two-character sequences which a C compiler will replace with those single missing characters. Trigraphs are similar three character sequences.

And they are (or at least were) extensively used in the obfuscated C contests :)
Jul 10 2004
prev sibling parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:

<snip>
 I stand corrected. However, the UTFs are all very easy to tell apart.  
 UTF-16 looks very different from UTF-8, and it only takes a simple 
 algorithm to distinguish them. Ditto UTF-32.

Are we talking of the byte-order mark, or the fallback for if that's missing?
 What I *SHOULD* have said is that DMD assumes that the source file is 
 encoded in UTF-8, UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE. What it 
 can't do is tell 8-bit encodings apart from each other, so it assumes 
 that, if it's an 8-bit encoding, it will be UTF-8.

Actually, there is a BOM for UTF-8 according to the docs. But no doubt many UTF-8 files are typed without it. <snip>
 Yes, it is. The compiler is being 100% UTF compliant. Problems only
 arise if the source code isn't.

Actually, I read that UTF compliance of a text reader necessarily means rejecting input that isn't UTF compliant.
 But what if someone wants to include, in a string literal, 
 characters in the native OS or other character set that don't match 
 UTF-8?

There ain't no such character. UTF-8 can encode all of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode.

By "match" I actually meant be represented by the same byte sequence. An important issue when it comes to generating console output, interfacing the OS API and stuff like that. <snip>
 (FTM, how are escaped characters supposed to be handled ITR, 
 considering that a string literal can be either a char[], wchar[]
 or dchar[]?)

They are supposed to be represented as is, not escaped in any way (beyond being encoded in UTF-whatever). Unless of course you mean stuff like "\n" - which obviously is stored in source as backslash followed by 'n'. The compiler can figure THAT out because it's part of D.

I meant stuff like "\xA3" actually, and in terms of what it becomes in the actual string data being represented. <snip>
 Old, old stuff from the early days of C. You have to go back a long
 time, but once, there were keyboards without square brackets or curly
 braces and things, and which were not remappable in software.
 Digraphs are two-character sequences which a C compiler will replace
 with those single missing characters. Trigraphs are similar three
 character sequences.

My dad had an old C manual (which I first learned from, but only the very basics) with handwritten notes in it about teletypes from well before my time. From what I remember, you typed something like: MAIN() \( PRINTF("\HELLO, WORLD!\\N"); \) But I don't remember there being any trigraphs in those notes. And back in those days, you wrote x =- 4 instead of x -= 4. I don't know at what point someone decided to break existing code by redefining the former to be the same as x = -4. Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 12 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cctnge$1801$1 digitaldaemon.com>, Stewart Gordon says...
Arcane Jill wrote:

<snip>
 I stand corrected. However, the UTFs are all very easy to tell apart.  
 UTF-16 looks very different from UTF-8, and it only takes a simple 
 algorithm to distinguish them. Ditto UTF-32.

Are we talking of the byte-order mark, or the fallback for if that's missing?

I meant heuristically. Although obviously, if there's a BOM, you can tell just by reading the first (at most) four bytes.
Actually, there is a BOM for UTF-8 according to the docs.  But no doubt 
many UTF-8 files are typed without it.

Yep. Plenty of text editors save UTF-8 without a BOM. Some even offer you the choice of BOM or no-BOM. So the absence of a BOM does not imply that a text file is not UTF-8.
Actually, I read that UTF compliance of a text reader necessarily means 
rejecting input that isn't UTF compliant.

Gotcha. In that case, you are correct. So I guess this means that DMD really /must/ validate the source file, or be itself in error. Well spotted.
 But what if someone wants to include, in a string literal, 
 characters in the native OS or other character set that don't match 
 UTF-8?

There ain't no such character. UTF-8 can encode all of Unicode. I'm not sure there's an OS on the planet which uses characters which are not in Unicode.

By "match" I actually meant be represented by the same byte sequence. An important issue when it comes to generating console output, interfacing the OS API and stuff like that.

Aha. Well, that's an implementation-dependent thing, is it not? Not really a D matter, I would have thought. Would I be correct in assuming that most console escape sequences can be composed entirely out of ASCII characters? If that is so, there isn't a problem anyway.
 Unless of course you mean stuff like "\n" - which obviously is stored
 in source as backslash followed by 'n'. The compiler can figure THAT
 out because it's part of D.

I meant stuff like "\xA3" actually, and in terms of what it becomes in the actual string data being represented.

Understood. Well, there's a little-known difference between '\xA3' and '\u00A3'. '\xA3' means "the byte 0xA3" - or, if it's a character, "the character represented by codepoint 0xA3 in whatever encoding I happen to be using at the time" - whereas '\u00A3' means specifically "the character represented by codepoint 0xA3 in /Unicode/". That is, U+00A3, POUND SIGN. In the particular case of D, a char[] contains UTF-8. So, I imagine it would be perfectly OK to construct valid UTF-8 sequences by hand. That is, I would _HOPE_ that all three of the following lines would produce identical results: # char[] x = "\xC2\xA3"; // UTF-8 for U+00A3 # char[] x = "\u00A3"; # char[] x = "£"; but I haven't tested this, so I don't know for sure. If not, it's a bug. For console escape sequences which are absolutely NOT UTF-8, I would encourage you to store such strings in ubyte[] arrays instead of char[] arrays, where such validity restrictions don't apply. There's nothing to stop you from passing a ubyte[] to std.stream.Stream.write(), after all.
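The byte values in that hope are checkable; a quick Python confirmation (an editorial aside, not from the thread) that U+00A3 encodes in UTF-8 as the two bytes 0xC2 0xA3, so a hand-built byte sequence and the escaped form denote the same character:

```python
pound = "\u00A3"  # U+00A3, POUND SIGN

# UTF-8 for U+00A3 is the two-byte sequence 0xC2 0xA3.
assert pound.encode("utf-8") == b"\xc2\xa3"

# Hand-assembled bytes decode back to the same character, which is the
# property a hand-constructed char[] literal would rely on.
assert b"\xc2\xa3".decode("utf-8") == pound
```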
And back in those days, you wrote x =- 4 instead of x -= 4.  I don't 
know at what point someone decided to break existing code by redefining 
the former to be the same as x = -4.

I don't know when that happened either. I gather that that change happened though because compilers had a hard time distinguishing between: (a) x =- 4; (b) x = -4; Arcane Jill
Jul 12 2004
parent Stewart Gordon <smjg_1998 yahoo.com> writes:
Arcane Jill wrote:
<snip>
 For console escape sequences which are absolutely NOT UTF-8, I would encourage
 you to store such strings in ubyte[] arrays instead of char[] arrays, where
such
 validity restrictions don't apply. There's nothing to stop you from passing a
 ubyte[] to std.stream.Stream.write(), after all.

As long as you don't confuse its semantics with those of the other methods called write.
And back in those days, you wrote x =- 4 instead of x -= 4.  I don't 
know at what point someone decided to break existing code by redefining 
the former to be the same as x = -4.

I don't know when that happened either. I gather that that change happened though because compilers had a hard time distinguishing between: (a) x =- 4; (b) x = -4;

It's no harder than distinguishing between (a) + +x; (b) ++ x; But no doubt programmers confused them, particularly when they tried writing x=-4 without any spaces. Stewart. -- My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 13 2004
prev sibling parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:

AFAIK 'Ñ' is UTF-8.

It is not. The Unicode character U+00D1, LATIN CAPITAL LETTER N WITH TILDE, is represented in UTF-8 by the two-byte sequence { 0xC3, 0x91 }. UTF-8 is backward compatible with ASCII. It is /not/, however, backward compatible with ISO-8859-1. Any character with a codepoint greater than 0x7F must be correctly UTF-8 encoded.

Then I was confused by the fact that inserting the line: # -*- coding: UTF-8 -*- at the start of a Python script makes the interpreter work with latin1 chars directly.
 You can get the correct UTF-8 sequence by starting with a string of dchars
 and passing it to std.utf.toUTF8().

Could you please provide an example of how that would be done? Because if I try: dchar[] dstr = "ESPAÑA"; the compiler says: otroformat.d(7): invalid UTF-8 sequence and if I instead try: dchar[] dstr = std.utf.toUTF8("ESPAÑA"); it says: otroformat.d(7): function toUTF8 overloads char[](char[]s) and char[](dchar[]s) both match argument list for toUTF8 So I'm a little lost here.
Jul 09 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ccn00c$khq$1 digitaldaemon.com>, Juanjo Álvarez says...

Then I was confused by the fact that inserting the line:

# -*- coding: UTF-8 -*-

at the start of a Python script makes the interpreter work with latin1
chars directly.

That may be a red herring, but I don't know what Python does and I'm not qualified to comment. If I had to guess, I'd say that declaration tells Python the encoding with which the source file was saved. I can tell you though that D also interprets all Latin-1 characters (and indeed, all Unicode characters) directly ... *IF* the source file is saved in a UTF format. (See below). DMD may be "deficient" in the sense that it does not understand ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that as a strength, not a weakness. Simple. Neat. Clean. However, this does need to be better documented.
Could you please provide and example of how would be that done? Because if I
try:

dchar[] dstr = "ESPAÑA";

the compiler says:

otroformat.d(7): invalid UTF-8 sequence

Honestly - this has got nothing whatsoever to do with the compiler. There's a stage BEFORE compiling - it's called saving the text file. Let's say you're using Microsoft Notepad. Type something into it, such as: # dchar[] dstr = "ESPA—A"; Now - instead of clicking on "Save", click instead on "Save As". You'll see three drop-down menus at the bottom of the dialog. One of them is labelled "Encoding", and it will have "ANSI" selected by default. *** CHANGE IT TO UTF-8 ***. Now save. Now the D compiler will be happy with it. Pretty much all text editors these days offer such a choice - however it is usually not the default, so you have to remember to explicitly do the Save As / UTF-8 thing. And you can use ALL characters too, not just Latin-1. You can use Latin-2, Greek, Russian, Chinese, whatever. Just remember that trick - SAVE AS UTF-8 before you attempt to compile.
and if I instead try:

dchar[] dstr = std.utf.toUTF8("ESPAÑA");

it says:

otroformat.d(7): function toUTF8 overloads char[](char[]s) and char[](dchar[]s) both match argument list for toUTF8

So I'm a little lost here.

I can understand that because, as I said, the DMD error message is not helpful. However, bear in mind that the fault lies with your use of the text editor, not with your use of D. If Walter would care to help everyone out with this one by improving the error message (if only to lay blame somewhere other than DMD), what he should do is this. The compiler should pass the entire source file contents to std.utf.validate (or some equivalent function written in C/C++). If it passes, go ahead and compile. If it fails, issue an error message that the source file is not correctly encoded, and needs to be re-saved as UTF-8 before it will compile. Of course, if the source file contains only ASCII characters then it is automatically valid UTF-8, even if it was saved as "ANSI". Arcane Jill
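The suggested check amounts to a single whole-file UTF-8 validation pass before lexing. A rough Python sketch of the idea (function name and error wording are mine, not DMD's or std.utf's):

```python
def check_source_is_utf8(data: bytes) -> str:
    """Decode a whole source file as UTF-8, or fail with a helpful error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"source file is not valid UTF-8 (bad byte at offset {err.start}); "
            "re-save it as UTF-8 before compiling"
        ) from err
```

As the post notes, a pure-ASCII file passes unchanged, since ASCII is a subset of UTF-8.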
Jul 09 2004
parent reply Juanjo Álvarez <juanjuxNO SPAMyahoo.es> writes:
Arcane Jill wrote:

First things first; thanks for your comments and your patience with me.

# -*- coding: UTF-8 -*-

at the start of a Python script make the interpreters works with latin1
chars directly.

That may be a red herring, but I don't know what Python does and I'm not qualified to comment. If I had to guess, I'd say that declaration tells Python the encoding with which the source files was saved.

True, but the funny thing is that the files are saved (I've just tested it) as latin1, and it works and doesn't issue the warning it issues if you leave that line out.
 I can tell you though that D also interprets all Latin-1 characters (and
 indeed, all Unicode characters) directly ... *IF* the source file is saved
 in a UTF format. (See below).

I didn't notice that my editor was saving the files as ISO-8859-15 and, as you said, the compiler error message didn't help with that.
 DMD may be "deficient" in the sense that it does not understand
 ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that
 as a strength, not a weakness. Simple. Neat. Clean. However, this does
 need to be better documented.

I really think that making it also understand ISO-8859-1 (like virtually every other compiler and interpreter out there) would do no harm.
 Let's say you're using Microsoft Notepad. Type something into it, such as:

I'm using vim/KDE Kate/KDevelop; after your comment I've configured them to save in UTF-8 by default and everything seems to work OK now (well, almost; I still have to configure my terminal emulator to use Unicode so that the D program's non-ASCII text output is shown correctly.)
 Pretty much all text editors these days offer such a choice - however it
 is usually not the default, so you have to remember to explicitly do the
 Save As / UTF-8 thing.

Also true; it wasn't the default _because_ my LC_ALL environment variable (Linux) was set to "es_ES.ISO-8859-15".
 Just remember that trick - SAVE AS UTF-8 before you attempt to compile.

I'll, sure.
 If Walter would care to help everyone out with this one by improving the
 error message (if only to lay blame somewhere other than DMD), what he
 should do is this. The compiler should pass the entire source file
 contents to std.utf.validate (or some equivalent function written in
 C/C++). If it passes, go ahead and compile. If it fails, issue an error
 message that the source file is not correctly encoded, and needs to be
 re-saved as UTF-8 before it will compile.

That would be perfectly logical. Now, abusing your knowledge about the issue, how can I transform (in D) a default utf-8 encoded font into ISO-latin1? In the program I'm writing most users will use it from a unix console (graphical or not) and I don't want to force them to configure their consoles to utf-8. Thanks again for your answers
Jul 10 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ccp45f$r1r$1 digitaldaemon.com>, Juanjo Álvarez says...

 DMD may be "deficient" in the sense that it does not understand
 ISO-8859-1, ISO-8859-2, WINDOWS-1252, etc, etc. - but I would regard that
 as a strength, not a weakness. Simple. Neat. Clean. However, this does
 need to be better documented.

I really think that making it also understand ISO-8859-1 (like virtually every other compiler and interpreter out there) would do no harm.

In point of fact, your assertion that virtually every other compiler and interpreter out there "understands" ISO-8859-1 is not correct. D is superior in this regard. In a traditional C compiler, the encoding of the source file is essentially *IGNORED*. There is absolutely no "understanding" going on. A string literal is just a sequence of uninterpreted bytes. The illusion of "understanding" is simply caused by the fact that the text editor at one end, and the console or whatever at the other, happen to use the same encoding as each other. With that borne in mind, you should appreciate that what a C compiler APPEARS to understand is not, in fact, ISO-8859-1 at all. It is simply the default OS encoding, whatever that happens to be. It sounds to me like Python may have real understanding of encodings - but if that's true, Python would be the exception rather than the rule. However, D *CANNOT* ignore the encoding of the source file. In D, a char[] array must contain, *BY DEFINITION*, UTF-8. Ignoring the encoding of the source file would break that definition, and result in invalid UTF-8 sequences within char[] arrays, and consequent run-time errors. This means that D has two choices: (1) it could mandate that the source file encoding MUST be one of the UTF family, or (2) it could be made to understand and decode other encodings. In effect, this would mean transcoding the source file at compile-time from its original encoding into UTF-8 before feeding it to the existing compilation process. D has chosen option (1), and I think it was the right choice. Option (2) would have added a tremendous amount of bloat to the compiler - and all so that users don't have to get the hang of "Save As". If D were to "understand" ISO-8859-1 specifically, there would be complaints from those whose native encoding were ISO-8859-2. Why is THEIR encoding supported, but not MINE? The UTF family are the only truly global encodings we have right now. They can be understood anywhere in the world, and can encode each and every Unicode character. By insisting that D source files must be UTF-XX, D is helping to educate people to think globally, to be less parochial. ISO-8859-1 is not understood everywhere. UTF-XX is.
 If Walter would care to help everyone out with this one by improving the
 error message (if only to lay blame somewhere other than DMD), what he
 should do is this. The compiler should pass the entire source file
 contents to std.utf.validate (or some equivalent function written in
 C/C++). If it passes, go ahead and compile. If it fails, issue an error
 message that the source file is not correctly encoded, and needs to be
 re-saved as UTF-8 before it will compile.

That would be perfectly logical.

It's something I would encourage. That said, I'm not sure if the current (unhelpful) error message could actually be deemed a bug. It does, after all, give AN error.
Now, abusing you knowledge about the issue, how can I transform (in D) a
default utf-8 encoded font into ISO-latin1?

I'll assume that where you wrote "font", you meant "string". In general, to convert a UTF encoded string into another encoding, you need to do "transcoding". This was discussed in the open streams discussion on the main forum not so long ago. In general, you need classes (called Readers) to translate from ENCODING-X to UTF, and other classes (called Writers) to translate from UTF to ENCODING-X. Some people prefer the generic term Filter to the terms Reader and Writer. So, you'd need an ISO-8859-1 Writer class. Unfortunately, such readers and writers don't exist yet. They are part of the ongoing discussion about the future of streams. Fortunately for you, as it happens, the algorithm for converting ISO-8859-1 to Unicode is dead simple, so you can roll your own. In function form, it is this: # char[] latin1ToUTF8(ubyte[] latin1) # { # wchar[] s = new wchar[latin1.length]; # for (uint i = 0; i < s.length; ++i) # { # s[i] = cast(wchar) latin1[i]; # } # return toUTF8(s); # } Observe that the input is declared as ubyte[], not char[] - this is because, in D, you can't use a char[] array for anything other than UTF-8. Obviously, this algorithm won't work for ISO-8859-2, WINDOWS-1252, or indeed ANY encoding other than Latin1.
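For comparison, here is the same one-way conversion in Python (an editorial sketch, not from the thread), relying on the same fact the D version uses: every Latin-1 byte value equals the Unicode codepoint it represents.

```python
def latin1_to_utf8(latin1: bytes) -> bytes:
    # Widen each Latin-1 byte to the codepoint with the same value,
    # then re-encode the resulting string as UTF-8.
    return "".join(chr(b) for b in latin1).encode("utf-8")

# b"ESPA\xd1A" is "ESPAÑA" in Latin-1; in UTF-8 the letter Ñ (U+00D1)
# becomes the two bytes 0xC3 0x91.
```

Going the other way (UTF-8 to Latin-1, which is what the question actually asked for) is the same idea in reverse, and only works for characters that exist in Latin-1 at all.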
In the program I'm writing most
users will use it from a unix console (graphical or not) and I don't want
to force them to configure their consoles to utf-8.

But, if I have understood you correctly, you *ARE* going to force them to configure their consoles to ISO-8859-1. That seems most unfair to people who happen not to live in Western Europe or America.
Thanks again for your answers

No probs. But we seem not to be talking about D bugs any more, so maybe we should re-title this thread and move the discussion over to the main forum? Arcane Jill
Jul 12 2004