
digitalmars.D - [Contribution] std.windows.charset

Stewart Gordon <smjg_1998 yahoo.com> writes:
I've got together some functions for converting between Windows 
character sets and UTF-8.  I propose that it be added to Phobos, since 
it's a significant step forward in compatibility between D and Windows.

It's basically the stuff taken from std.file with a few additions:
- ability to specify the ANSI or OEM codepage
- the corresponding fromMBSz
- throws an exception on error such as an invalid codepage
- toUTF8(wchar*), essential for converting back null-terminated UTF-16 
strings received from WinAPI (though this ought to be moved to std.utf).

Once this is done, they can be removed/deprecated from std.file, and the 
file functions adjusted to use the ones in the new module.  And listdir 
can certainly shrink.
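
A minimal sketch of what the toUTF8(wchar*) item in the list above might look like, written against current D (D2) rather than the 2005 compiler. The name utf16zToUTF8 is hypothetical and this is not the code from the proposed module; only std.utf.toUTF8 is an existing Phobos function.

```d
import std.utf : toUTF8;

/// Converts a null-terminated UTF-16 string to UTF-8.
/// Hypothetical helper, for illustration only.
string utf16zToUTF8(const(wchar)* p)
{
    if (p is null)
        return null;

    // Find the terminating null to learn the length in code units.
    size_t len = 0;
    while (p[len] != 0)
        ++len;

    // Slice the buffer and let Phobos do the actual transcoding.
    return toUTF8(p[0 .. len]);
}

unittest
{
    wstring w = "h\u00E9llo\0"w;
    assert(utf16zToUTF8(w.ptr) == "h\u00E9llo");
    assert(utf16zToUTF8(null) is null);
}
```

The scan for the terminator is the only part that differs from converting an ordinary wchar[]; a WinAPI function that also reports a length could skip it.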

I'd thought about making it auto-detect MSLU, so that the same app can 
run without MSLU or make use of it if it's there.  But having read up a 
bit more, it would appear that an app has to be linked to depend on MSLU 
anyway, in which case this won't work.

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB  P+ L E  W++  N+++ o K- w++  O? M V? PS- PE- Y? 
PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.
Aug 31 2005
"Regan Heath" <regan netwin.co.nz> writes:

On Wed, 31 Aug 2005 11:20:15 +0100, Stewart Gordon <smjg_1998 yahoo.com>  
wrote:
 I've got together some functions for converting between Windows  
 character sets and UTF-8.  I propose that it be added to Phobos, since  
 it's a significant step forward in compatibility between D and Windows.

 It's basically the stuff taken from std.file with a few additions:
 - ability to specify the ANSI or OEM codepage
 - the corresponding fromMBSz
 - throws an exception on error such as an invalid codepage
 - toUTF8(wchar*), essential for converting back null-terminated UTF-16  
 strings received from WinAPI (though this ought to be moved to std.utf).

 Once this is done, they can be removed/deprecated from std.file, and the  
 file functions adjusted to use the ones in the new module.  And listdir  
 can certainly shrink.

 I'd thought about making it auto-detect MSLU, so that the same app can  
 run without MSLU or make use of it if it's there.  But having read up a  
 bit more, it would appear that an app has to be linked to depend on MSLU  
 anyway, in which case this won't work.

Good idea.

I don't know if this will help or if you have something already, but here are the functions I used for converting code page 1252 to/from UTF-8/16/32.

I used them for logging cp1252 data to the screen, which doesn't actually display correctly anyway (unless you tell your Windows console to go into UTF-8 mode), but it did stop writef from throwing an exception.

Regan

Attachment: cp1252.d

module cp1252;

import std.utf;

// cp1252 -> UTF-8, by way of UTF-16
char[] cp1252toUTF8(void[] raw)
{
    return toUTF8(cp1252toUTF16(raw));
}

// cp1252 -> UTF-16: every cp1252 byte maps to exactly one UTF-16 code unit
wchar[] cp1252toUTF16(void[] raw)
{
    wchar[] result;

    result.length = raw.length;
    foreach(int i, ubyte b; cast(ubyte[])raw) {
        // bytes below 0x80 are plain ASCII; 0x80 itself is the euro sign
        // and must go through the table
        if (b < 0x80) result[i] = b;
        else result[i] = table[b];
    }

    return result;
}

// cp1252 -> UTF-32, by way of UTF-16
dchar[] cp1252toUTF32(void[] raw)
{
    return toUTF32(cp1252toUTF16(raw));
}

// Unicode code point for each cp1252 byte value;
// 0x0000 in the 0x80..0x9F range marks a byte cp1252 leaves undefined
ushort[] table = [
    0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007,
    0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
    0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017,
    0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
    0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027,
    0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
    0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037,
    0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
    0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047,
    0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
    0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057,
    0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
    0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067,
    0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
    0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077,
    0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178,
    0x00A0, 0x00A1, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7,
    0x00A8, 0x00A9, 0x00AA, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
    0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7,
    0x00B8, 0x00B9, 0x00BA, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x00BF,
    0x00C0, 0x00C1, 0x00C2, 0x00C3, 0x00C4, 0x00C5, 0x00C6, 0x00C7,
    0x00C8, 0x00C9, 0x00CA, 0x00CB, 0x00CC, 0x00CD, 0x00CE, 0x00CF,
    0x00D0, 0x00D1, 0x00D2, 0x00D3, 0x00D4, 0x00D5, 0x00D6, 0x00D7,
    0x00D8, 0x00D9, 0x00DA, 0x00DB, 0x00DC, 0x00DD, 0x00DE, 0x00DF,
    0x00E0, 0x00E1, 0x00E2, 0x00E3, 0x00E4, 0x00E5, 0x00E6, 0x00E7,
    0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x00EC, 0x00ED, 0x00EE, 0x00EF,
    0x00F0, 0x00F1, 0x00F2, 0x00F3, 0x00F4, 0x00F5, 0x00F6, 0x00F7,
    0x00F8, 0x00F9, 0x00FA, 0x00FB, 0x00FC, 0x00FD, 0x00FE, 0x00FF
];

/* UNTESTED! */

// UTF-8 -> cp1252, by way of UTF-16
ubyte[] UTF16toCP1252(char[] raw)
{
    return UTF16toCP1252(toUTF16(raw));
}

// UTF-16 -> cp1252: linear search of the table for each code unit
ubyte[] UTF16toCP1252(wchar[] raw)
{
    ubyte[] result;

    result.length = raw.length;
    foreach(int i, wchar c; raw) {
        foreach(int j, ushort s; table) {
            if (c == s) {
                // the table index is the cp1252 byte, not the table entry
                result[i] = cast(ubyte) j;
                goto found;
            }
        }
        throw new Exception("Data cannot be encoded");
found:
        ;
    }

    return result;
}

// UTF-32 -> cp1252, by way of UTF-16
ubyte[] UTF16toCP1252(dchar[] raw)
{
    return UTF16toCP1252(toUTF16(raw));
}
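
The encoding direction in the attachment searches the whole table for every character. As a hedged alternative, here is a sketch in current D (D2) that relies on the observation that only bytes 0x80..0x9F differ from their Unicode values, so a 32-entry table plus a reverse map is enough. The names high and toCP1252 are hypothetical and are not part of the attachment.

```d
import std.exception : assertThrown;

// Code points for bytes 0x80..0x9F; 0 marks the five bytes cp1252 leaves undefined.
immutable ushort[32] high = [
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178,
];

/// Encodes UTF-16 text as cp1252, throwing if a character has no cp1252 byte.
/// Hypothetical helper, for illustration only.
ubyte[] toCP1252(const(wchar)[] text)
{
    // Build the reverse map on each call; a real module would probably cache it.
    ubyte[ushort] reverse;
    foreach (i, c; high)
        if (c != 0)
            reverse[c] = cast(ubyte)(0x80 + i);

    auto result = new ubyte[text.length];
    foreach (i, wchar c; text)
    {
        const ushort unit = cast(ushort) c;
        if (unit < 0x80 || (unit >= 0xA0 && unit <= 0xFF))
            result[i] = cast(ubyte) unit;   // same value in cp1252 and Unicode
        else if (auto p = unit in reverse)
            result[i] = *p;                 // one of the 0x80..0x9F specials
        else
            throw new Exception("Data cannot be encoded");
    }
    return result;
}

unittest
{
    ubyte[] expected = [0x80, 0xE9];
    assert(toCP1252("\u20AC\u00E9"w) == expected);   // euro sign, e-acute
    assertThrown(toCP1252("\u0100"w));               // not in cp1252
}
```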
Aug 31 2005
Stewart Gordon <smjg_1998 yahoo.com> writes:
Regan Heath wrote:
<snip>
 Good idea. I don't know if this will help or if you have something 
 already but here are the functions I used for converting code page 1252 
 to/from UTF-8/16/32.

Both my contribution and yours are written with specific aims in mind. Mine is to convert UTF-8 strings to/from null-terminated strings in Windows character sets, and hence facilitate communication with the Windows API. Yours is made to convert D arrays between 1252 and the UTFs.

Neither is designed to be a comprehensive converter between character encodings. There's a project underway (part of Indigo) to implement such a thing, and I'm involved in it. But what I've contributed here is designed to be something simple to do the job it's made for, as is yours.

The only things yours adds are:
- cross-platform support, though only for one codepage
- direct translation between 1252 and the other two UTFs
- treating 1252 text as a D array

none of which are needed for a std.windows module, though admittedly one or two WinAPI functions (such as TextOut) could find the (address, length) form of D strings useful.

 I used them for logging cp1252 data to the screen, 
 which doesn't actually display correctly anyway (unless you tell your 
 windows console to go into utf-8 mode), but it did stop writef from 
 throwing an exception.

For that matter, which versions of Windows support UTF-8 console mode? Windows 9x doesn't, and so writef is of little use if you want to output anything but plain ASCII to the console. And even if it did, it would still be good to be able to detect the codepage and output appropriately.

I'm working towards getting TextStream implemented in Indigo, which will facilitate this.

Stewart.
Aug 31 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Wed, 31 Aug 2005 15:23:53 +0100, Stewart Gordon <smjg_1998 yahoo.com>  
wrote:
<snip>

Ahh. No problem.

 I used them for logging cp1252 data to the screen, which doesn't  
 actually display correctly anyway (unless you tell your windows console  
 to go into utf-8 mode), but it did stop writef from throwing an  
 exception.

For that matter, which versions of Windows support UTF-8 console mode? Windows 9x doesn't, and so writef is of little use if you want to output anything but plain ASCII to the console. And even if it did, it would still be good to be able to detect the codepage and output appropriately. I'm working towards getting TextStream implemented in Indigo, which will facilitate this.

I briefly tried to add a conversion stream to my code. The immediate problem was that when you try to implement, say, readExact, to read exactly x bytes, you might find that after conversion x bytes of cp1252 become x+y bytes of UTF (due to some multibyte codepoints). In fact I believe it can be impossible to get exactly x bytes, if say you had x-1 bytes and the next character became a 2-byte codepoint.

That applies if you're using char or wchar; it would probably work fine if you used dchar.

Regan
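
To put numbers on the expansion described above, here is a small sketch in current D (D2). It assumes the cp1252 text has already been decoded to UTF-16, where each cp1252 byte becomes exactly one code unit; utf8LengthOf is a hypothetical name, and std.utf.codeLength is the existing Phobos function doing the work.

```d
import std.utf : codeLength;

/// UTF-8 byte count for text that occupies one byte per character in cp1252.
/// `decoded` is that text as UTF-16, one code unit per original byte.
size_t utf8LengthOf(const(wchar)[] decoded)
{
    size_t total = 0;
    foreach (wchar c; decoded)
        total += codeLength!char(c);   // 1, 2 or 3 bytes for these characters
    return total;
}

unittest
{
    // 'a', e-acute and the euro sign: 3 bytes in cp1252, 1 + 2 + 3 bytes in UTF-8.
    wstring text = "a\u00E9\u20AC"w;
    assert(text.length == 3);
    assert(utf8LengthOf(text) == 6);
}
```

With dchar the count stays at one element per character, which matches the remark that dchar would probably work fine.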
Aug 31 2005
Stewart Gordon <smjg_1998 yahoo.com> writes:
Regan Heath wrote:
<snip>
 I briefly tried to add a conversion stream to my code. The immediate 
 problem was that when you try to implement say, readExact, to read 
 exactly x bytes, you might find after conversion x bytes of cp1252 
 becomes x+y bytes of UTF (due to some multibyte codepoints).

What is the practical use of being able to read x UTF-8 bytes from a stream that isn't in UTF-8?

Stewart.
Sep 01 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Thu, 01 Sep 2005 12:09:24 +0100, Stewart Gordon <smjg_1998 yahoo.com>  
wrote:
 Regan Heath wrote:
 <snip>
 I briefly tried to add a conversion stream to my code. The immediate  
 problem was that when you try to implement say, readExact, to read  
 exactly x bytes, you might find after conversion x bytes of cp1252  
 becomes x+y bytes of UTF (due to some multibyte codepoints).

What is the practical use of being able to read x UTF-8 bytes from a stream that isn't in UTF-8?

Being able to read exactly x bytes wasn't the purpose, it was simply the problem I encountered writing a conversion stream. A stream that would allow:

    ConversionStream s = new ConversionStream(new BufferedFile("a.txt"));
    char[] line;

    while(!s.eof()) {
        line = s.readLine(line);
        ..etc..
    }
    s.close();

This would read my cp1252 encoded file as char[] lines, instead of reading them into ubyte[] or char[] (technically illegally) and then converting each line.

The problem arose in implementing the readExact method for the ConversionStream. Firstly it would have required a buffer to deal with the expansion of the data; secondly, as I mentioned last post, it is possible to fail to get exactly x bytes, but have x-1 or x+1 instead.

So, while I didn't want to read exactly x bytes, the Stream interface requires that it is possible, right?

Regan
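
The buffering problem described above can be sketched as follows, in current D (D2). ConvertingReader is a hypothetical stand-in, not the ConversionStream from the post and not a std.stream class. To stay short it treats the input as ISO-8859-1, where every byte equals its code point, so the cp1252 specials in 0x80..0x9F would still need a table lookup.

```d
import std.utf : encode;

/// Hypothetical reader: converts single-byte text to UTF-8 on demand.
struct ConvertingReader
{
    const(ubyte)[] source;   // unread input bytes, one byte per character
    char[] pending;          // converted UTF-8 bytes not yet handed out

    /// Returns exactly `n` UTF-8 bytes, or fewer only at end of input.
    /// The result may end in the middle of a multibyte sequence.
    char[] readExact(size_t n)
    {
        while (pending.length < n && source.length > 0)
        {
            char[4] tmp;
            // Latin-1 assumption: the byte value is the code point.
            size_t len = encode(tmp, cast(dchar) source[0]);
            pending ~= tmp[0 .. len];
            source = source[1 .. $];
        }
        size_t take = n < pending.length ? n : pending.length;
        char[] result = pending[0 .. take].dup;
        pending = pending[take .. $];   // leftovers wait for the next call
        return result;
    }
}

unittest
{
    const(ubyte)[] input = [0x41, 0xE9, 0x42];   // 'A', e-acute, 'B'
    auto reader = ConvertingReader(input);
    char[] first = reader.readExact(2);
    char[] second = reader.readExact(2);

    ubyte[] expectedFirst = [0x41, 0xC3];        // e-acute is split here
    assert(cast(ubyte[]) first == expectedFirst);
    assert(first ~ second == "A\u00E9B");
}
```

The pending buffer is the extra storage the post mentions: the second byte of the split character sits there until the next call asks for it.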
Sep 01 2005
"Regan Heath" <regan netwin.co.nz> writes:
On Fri, 02 Sep 2005 00:11:05 +1200, Regan Heath <regan netwin.co.nz> wrote:
 On Thu, 01 Sep 2005 12:09:24 +0100, Stewart Gordon <smjg_1998 yahoo.com>  
 wrote:
 Regan Heath wrote:
 <snip>
 I briefly tried to add a conversion stream to my code. The immediate  
 problem was that when you try to implement say, readExact, to read  
 exactly x bytes, you might find after conversion x bytes of cp1252  
 becomes x+y bytes of UTF (due to some multibyte codepoints).

What is the practical use of being able to read x UTF-8 bytes from a stream that isn't in UTF-8?

Being able to read exactly x bytes wasn't the purpose, it was simply the problem I encountered writing a conversion stream. A stream that would allow:

    ConversionStream s = new ConversionStream(new BufferedFile("a.txt"));
    char[] line;

    while(!s.eof()) {
        line = s.readLine(line);
        ..etc..
    }
    s.close();

This would read my cp1252 encoded file as char[] lines, instead of reading them into ubyte[] or char[] (technically illegally) and then converting each line.

The problem arose in implementing the readExact method for the ConversionStream. Firstly it would have required a buffer to deal with the expansion of the data; secondly, as I mentioned last post, it is possible to fail to get exactly x bytes, but have x-1 or x+1 instead.

So, while I didn't want to read exactly x bytes, the Stream interface requires that it is possible, right?

I didn't pursue this far because I didn't need it, but the more I think on it...

I guess breaking the stream in the middle of a char 'character' (one which is several bytes/codepoints long) isn't a problem, as the next read will grab the rest and append it to the first part, unless the user is reading a char at a time and expecting them to be complete and/or valid (which is just wrong).

When reading lines it will keep reading till it gets a '\n' or '\r\n', so it shouldn't ever return any "part of a character"s.

So, perhaps there is no problem after all :)

Regan
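
For a caller that does want only whole characters from each read, one possible approach is to hold back an incomplete trailing sequence until the next read. The sketch below, in current D (D2), finds how much of a buffer is safe to hand out. completePrefixLength is a hypothetical name, and the function assumes the bytes before the tail are valid UTF-8; it does not validate them.

```d
import std.string : representation;

/// Returns how many leading bytes of `buf` can be handed out without
/// ending in the middle of a UTF-8 sequence.
size_t completePrefixLength(const(ubyte)[] buf)
{
    // Step back over at most three continuation bytes (10xxxxxx).
    size_t i = buf.length;
    size_t back = 0;
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80)
    {
        --i;
        ++back;
    }
    if (i == 0)
        return buf.length;   // empty, or no lead byte in sight: nothing to trim

    // The byte before the continuation run should be the lead byte.
    const ubyte lead = buf[i - 1];
    size_t need;
    if (lead < 0x80)                need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return buf.length;  // malformed input is left for a validator

    const size_t have = back + 1;
    return have < need ? i - 1 : buf.length;
}

unittest
{
    auto bytes = "a\u20AC".representation;   // 0x61, then 0xE2 0x82 0xAC
    assert(completePrefixLength(bytes) == 4);
    assert(completePrefixLength(bytes[0 .. 3]) == 1);
    assert(completePrefixLength(bytes[0 .. 2]) == 1);
    assert(completePrefixLength(bytes[0 .. 1]) == 1);
}
```

Line-oriented reads never need this, since '\n' is a single byte that cannot appear inside a multibyte sequence, which fits the conclusion above.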
Sep 01 2005
Stewart Gordon <smjg_1998 yahoo.com> writes:

Stewart Gordon wrote:
 I've got together some functions for converting between Windows 
 character sets and UTF-8.  I propose that it be added to Phobos, since 
 it's a significant step forward in compatibility between D and Windows.

I could've sworn I'd attached the thing! Why did nobody point this out?

Stewart.
Sep 07 2005