digitalmars.D - [Contribution] std.windows.charset

Stewart Gordon <smjg_1998 yahoo.com> writes:

I've got together some functions for converting between Windows 
character sets and UTF-8.  I propose that it be added to Phobos, since 
it's a significant step forward in compatibility between D and Windows.

It's basically the stuff taken from std.file with a few additions:
- ability to specify the ANSI or OEM codepage
- the corresponding fromMBSz
- throws an exception on error such as an invalid codepage
- toUTF8(wchar*), essential for converting back null-terminated UTF-16 
strings received from WinAPI (though this ought to be moved to std.utf).
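
A minimal sketch of the wchar* conversion described in the last item, assuming present-day Phobos. The helper name fromUTF16z is hypothetical and is not taken from the proposed module, which does not appear in this thread; toUTF16z merely stands in for a buffer that a WinAPI call would fill.

```d
import std.utf : toUTF8, toUTF16z;

/// Hypothetical helper: copies a null-terminated UTF-16 string into a
/// garbage-collected UTF-8 string.
string fromUTF16z(const(wchar)* p)
{
    if (p is null)
        return null;

    // Count code units up to (not including) the terminating zero.
    size_t len = 0;
    while (p[len] != 0)
        ++len;

    // Slice the buffer without copying, then let std.utf transcode it.
    return toUTF8(p[0 .. len]);
}

void main()
{
    // toUTF16z stands in for a buffer filled by a W-suffixed WinAPI call.
    const(wchar)* z = toUTF16z("na\u00EFve");
    assert(fromUTF16z(z) == "na\u00EFve");
}
```

The slice p[0 .. len] only borrows the caller's buffer; toUTF8 produces a fresh string, so the result stays valid after the buffer is reused.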

Once this is done, they can be removed/deprecated from std.file, and the 
file functions adjusted to use the ones in the new module.  And listdir 
can certainly shrink.

I'd thought about making it auto-detect MSLU, so that the same app can 
run without MSLU or make use of it if it's there.  But having read up a 
bit more, it would appear that an app has to be linked to depend on MSLU 
anyway, in which case this won't work.

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB  P+ L E  W++  N+++ o K- w++  O? M V? PS- PE- Y? 
PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.

Aug 31 2005

"Regan Heath" <regan netwin.co.nz> writes:

Good idea. I don't know if this will help, or whether you already have
something similar, but here are the functions I used for converting code
page 1252 to/from UTF-8/16/32. I used them for logging cp1252 data to the
screen, which doesn't actually display correctly anyway (unless you tell
your Windows console to go into UTF-8 mode), but it did stop writef from
throwing an exception.

Regan

Aug 31 2005

Stewart Gordon <smjg_1998 yahoo.com> writes:

Regan Heath wrote:
<snip>
> Good idea. I don't know if this will help or if you have something
> already but here are the functions I used for converting code page 1252
> to/from UTF-8/16/32.

Both my contribution and yours are written with specific aims in mind.
Mine converts UTF-8 strings to and from null-terminated strings in Windows
character sets, and hence facilitates communication with the Windows API.
Yours converts D arrays between 1252 and the UTFs.

Neither is designed to be a comprehensive converter between character 
encodings.  There's a project underway (part of Indigo) to implement 
such a thing, and I'm involved in it.  But what I've contributed here is 
designed to be something simple to do the job it's made for, as is yours.

The only things yours adds are:
- cross-platform support, though only for one codepage
- direct translation between 1252 and the other two UTFs
- treating 1252 text as a D array

none of which are needed for a std.windows module, though admittedly one 
or two WinAPI functions (such as TextOut) could find the (address, 
length) form of D strings useful.
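
For illustration, a conversion of a cp1252 D array to UTF-8 can be sketched with std.encoding from present-day Phobos. This is not the code either poster wrote; the helper name cp1252ToUTF8 is hypothetical, and the sketch does nothing special about byte values that code page 1252 leaves undefined.

```d
import std.encoding : Windows1252String, transcode;

/// Hypothetical helper: treats `raw` as Windows-1252 and returns UTF-8.
string cp1252ToUTF8(const(ubyte)[] raw)
{
    // idup gives an immutable copy that can be viewed as a Windows1252String.
    auto src = cast(Windows1252String) raw.idup;
    string result;
    transcode(src, result);
    return result;
}

void main()
{
    // 0x80 is the euro sign and 0xE9 is e-acute in code page 1252.
    ubyte[] raw = [0x80, 0x20, 0x63, 0x61, 0x66, 0xE9];
    assert(cp1252ToUTF8(raw) == "\u20AC caf\u00E9");
}
```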

> I used them for logging cp1252 data to the screen,
> which doesn't actually display correctly anyway (unless you tell your
> windows console to go into utf-8 mode), but it did stop writef from
> throwing an exception.

For that matter, which versions of Windows support UTF-8 console mode? 
Windows 9x doesn't, and so writef is of little use if you want to output 
anything but plain ASCII to the console.  And even if it did, it would 
still be good to be able to detect the codepage and output 
appropriately.  I'm working towards getting TextStream implemented in 
Indigo, which will facilitate this.

Stewart.

Aug 31 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Wed, 31 Aug 2005 15:23:53 +0100, Stewart Gordon <smjg_1998 yahoo.com>  
wrote:
<snip>

Ahh. No problem.

>> I used them for logging cp1252 data to the screen, which doesn't
>> actually display correctly anyway (unless you tell your windows console
>> to go into utf-8 mode), but it did stop writef from throwing an
>> exception.
>
> For that matter, which versions of Windows support UTF-8 console mode?
> Windows 9x doesn't, and so writef is of little use if you want to output
> anything but plain ASCII to the console.  And even if it did, it would
> still be good to be able to detect the codepage and output
> appropriately.  I'm working towards getting TextStream implemented in
> Indigo, which will facilitate this.

I briefly tried to add a conversion stream to my code. The immediate
problem was that when you try to implement, say, readExact to read exactly
x bytes, you might find that after conversion x bytes of cp1252 become x+y
bytes of UTF (because some characters need multibyte sequences).

In fact I believe it can be impossible to get exactly x bytes: say you
already have x-1 bytes and the next character becomes a 2-byte sequence.
That is the case if you're using char or wchar; it would probably work
fine if you used dchar.
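
The expansion can be checked with a small sketch, assuming std.encoding from present-day Phobos; the helper name unitCounts is hypothetical. Note that it counts code units, not bytes: every cp1252 character lies in the Basic Multilingual Plane, so UTF-16 and UTF-32 each need one code unit per input byte, while UTF-8 needs between one and three.

```d
import std.encoding : Windows1252String, transcode;

/// Code-unit counts after converting `raw` from Windows-1252:
/// [UTF-8 chars, UTF-16 wchars, UTF-32 dchars].
size_t[3] unitCounts(const(ubyte)[] raw)
{
    auto src = cast(Windows1252String) raw.idup;
    string u8;
    wstring u16;
    dstring u32;
    transcode(src, u8);
    transcode(src, u16);
    transcode(src, u32);
    return [u8.length, u16.length, u32.length];
}

void main()
{
    // Three cp1252 bytes: 'a', e-acute (0xE9), euro sign (0x80).
    ubyte[] raw = [0x61, 0xE9, 0x80];
    auto n = unitCounts(raw);
    assert(n[0] == 6); // 1 + 2 + 3 UTF-8 code units
    assert(n[1] == 3); // one wchar per cp1252 byte
    assert(n[2] == 3); // one dchar per cp1252 byte
}
```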

Regan

Aug 31 2005

Stewart Gordon <smjg_1998 yahoo.com> writes:

Regan Heath wrote:
<snip>
> I briefly tried to add a conversion stream to my code. The immediate
> problem was that when you try to implement say, readExact, to read
> exactly x bytes, you might find after conversion x bytes of cp1252
> becomes x+y bytes of UTF (due to some multibyte codepoints).

What is the practical use of being able to read x UTF-8 bytes from a 
stream that isn't in UTF-8?

Stewart.

Sep 01 2005

"Regan Heath" <regan netwin.co.nz> writes:

On Thu, 01 Sep 2005 12:09:24 +0100, Stewart Gordon <smjg_1998 yahoo.com>  
wrote:
> Regan Heath wrote:
> <snip>
>> I briefly tried to add a conversion stream to my code. The immediate
>> problem was that when you try to implement say, readExact, to read
>> exactly x bytes, you might find after conversion x bytes of cp1252
>> becomes x+y bytes of UTF (due to some multibyte codepoints).
>
> What is the practical use of being able to read x UTF-8 bytes from a
> stream that isn't in UTF-8?

Being able to read exactly x bytes wasn't the purpose; it was simply the
problem I encountered while writing a conversion stream, one that would
allow:

ConversionStream s = new ConversionStream(new BufferedFile("a.txt"));
char[] line;

while(!s.eof()) {
	line = s.readLine(line);
	..etc..
}

s.close();

This would read my cp1252 encoded file as char[] lines, instead of reading  
them into ubyte[] or char[] (technically illegally) and then converting  
each line.

The problem arose in implementing the readExact method for the
ConversionStream. Firstly, it would have required a buffer to deal with the
expansion of the data. Secondly, as I mentioned in my last post, it is
possible to fail to get exactly x bytes and end up with x-1 or x+1 instead.

So, while I didn't want to read exactly x bytes, the Stream interface
requires that it is possible, right?
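
One way to get char[] lines without a readExact at all is to decode the whole buffer first and split afterwards. The following is only a sketch of that alternative, not the ConversionStream discussed above; it assumes std.encoding, std.file and std.string from present-day Phobos, and both helper names are hypothetical.

```d
import std.encoding : Windows1252String, transcode;
import std.file : read;
import std.string : splitLines;

/// Hypothetical helper: decodes a whole cp1252 buffer and splits it into lines.
string[] cp1252Lines(const(ubyte)[] raw)
{
    string text;
    transcode(cast(Windows1252String) raw.idup, text);
    return text.splitLines(); // handles "\n" and "\r\n" endings
}

/// Hypothetical helper: same, reading the bytes from a file first.
string[] readCp1252Lines(string path)
{
    return cp1252Lines(cast(const(ubyte)[]) read(path));
}

void main()
{
    // "caf" + e-acute + CRLF, then euro sign + "5" + LF, all in cp1252.
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9, 0x0D, 0x0A, 0x80, 0x35, 0x0A];
    assert(cp1252Lines(raw) == ["caf\u00E9", "\u20AC5"]);
}
```

The trade-off is memory: the whole file is held twice (raw and decoded), which a real stream would avoid.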

Regan

Sep 01 2005

"Regan Heath" <regan netwin.co.nz> writes:

I didn't pursue this far because I didn't need it, but the more I think on  
it...

I guess breaking the stream in the middle of a char 'character' (one which
is several bytes/code units long) isn't a problem, as the next read will
grab the rest and append it to the first part. The exception is a user who
reads a char at a time and expects each one to be complete and/or valid
(which is just wrong).

When reading lines it will keep reading until it gets a '\n' or '\r\n', so
it shouldn't ever return part of a character.
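
This can be illustrated with a short sketch, assuming std.utf from present-day Phobos (the helper name isCompleteUTF8 is hypothetical): each half of a split sequence is invalid on its own, but appending the second read to the first restores a valid string.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8 with no
/// truncated sequence at either end.
bool isCompleteUTF8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string text = "caf\u00E9";            // e-acute is two code units: C3 A9
    const(char)[] first = text[0 .. 4];   // ends on the lead byte C3
    const(char)[] second = text[4 .. $];  // only the continuation byte A9

    assert(!isCompleteUTF8(first));
    assert(!isCompleteUTF8(second));
    assert(isCompleteUTF8(first ~ second)); // the next read completes it
}
```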

So, perhaps there is no problem after all :)

Regan

Sep 01 2005

Stewart Gordon <smjg_1998 yahoo.com> writes:

Stewart Gordon wrote:
> I've got together some functions for converting between Windows
> character sets and UTF-8.  I propose that it be added to Phobos, since
> it's a significant step forward in compatibility between D and Windows.

<snip>

I could've sworn I'd attached the thing!  Why did nobody point this out?

Stewart.

Sep 07 2005
