digitalmars.D - 8-bit character encodings

Anders F Björklund <afb algonet.se> writes:
I've written some test code for encodings...


They take a mapping (wchar[256]) from ubyte,
which defines the 8-bit charset / encoding.

Then it can convert to and from Unicode.
(such as the default char[] strings in D)


The unoptimized D code looks like this:

 import std.utf;

 /// converts an 8-bit charset encoding string into unicode
 char[] decode_string(ubyte[] string, wchar[256] mapping)
 {
 	wchar[] result;
 	foreach (ubyte c; string)
 	{
 		// 0xFFFF marks a byte with no Unicode mapping; skip it
 		if (mapping[c] != 0xFFFF)
 			result ~= mapping[c];
 	}
 	return std.utf.toUTF8(result);
 }

 /// converts a unicode string into 8-bit charset encoding
 ubyte[] encode_string(char[] string, wchar[256] mapping)
 {
 	ubyte[] result;
 	foreach (wchar c; string)
 	{
 		// linear search of the table; characters not found are dropped
 		foreach (int i, wchar m; mapping)
 		{
 			if (c == m)
 			{
 				result ~= cast(ubyte) i;
 				break; // one byte per character, even if the table repeats an entry
 			}
 		}
 	}
 	return result;
 }

I added four mappings, just to have something to test with:
iso88591, cp1252, cp437, macroman
(each lookup table is 512 bytes, so that's 2K)

The ubyte[] can then be used as a C string (char *), by
nul-terminating as usual, e.g. for printf("%s")

It works just fine for both input and output, e.g. as Latin-1

It should probably throw an exception or something like that
when it encounters unmapped characters ?
(for instance: Win CP-1252 has 5 bytes with no Unicode mapping)

Surely someone must have written this before ?
Just that I couldn't find it in the libraries...

--anders

PS. The real code builds reverse lookup tables too.
    (one with chars < 0x0100, and one with the rest)

PPS. wchar[] versions left as exercise for the reader.
     They would avoid all the UTF-8 conversions above.
Nov 23 2004
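
The exception idea and the reverse lookup tables mentioned in the post above could look roughly like the sketch below. It is an illustration under stated assumptions, not the posted code: it targets present-day D (D2) rather than the 2004 compiler, and the names EncodingException, UNMAPPED, latin1Table, reverseTable, decodeStrict and encodeStrict are made up for the example. It also uses one associative array for the reverse lookup instead of the two tables the PS describes.

```d
import std.utf : toUTF8;

/// Hypothetical error type for this sketch; not part of the posted code.
class EncodingException : Exception
{
    this(string msg) { super(msg); }
}

/// Same "no Unicode mapping" marker the posted code tests for.
enum wchar UNMAPPED = wchar.init; // wchar.init is 0xFFFF

/// ISO-8859-1 is the identity mapping: byte n is U+00nn.
wchar[256] latin1Table()
{
    wchar[256] table;
    foreach (i, ref m; table)
        m = cast(wchar) i;
    return table;
}

/// Reverse lookup (UTF-16 code unit -> byte), built once so that encoding
/// no longer scans all 256 table entries for every character.
ubyte[wchar] reverseTable(const ref wchar[256] mapping)
{
    ubyte[wchar] reverse;
    foreach (i, m; mapping)
    {
        // Skip unmapped slots; the first byte wins if a character repeats.
        if (m != UNMAPPED && m !in reverse)
            reverse[m] = cast(ubyte) i;
    }
    return reverse;
}

/// Strict decode: throws instead of silently dropping unmapped bytes.
string decodeStrict(const(ubyte)[] input, const ref wchar[256] mapping)
{
    wchar[] result;
    foreach (b; input)
    {
        if (mapping[b] == UNMAPPED)
            throw new EncodingException("byte has no Unicode mapping");
        result ~= mapping[b];
    }
    return toUTF8(result);
}

/// Strict encode: throws when a character has no byte in the charset.
ubyte[] encodeStrict(string input, const ubyte[wchar] reverse)
{
    ubyte[] result;
    foreach (wchar c; input) // decodes the UTF-8 input to UTF-16 code units
    {
        auto p = c in reverse;
        if (p is null)
            throw new EncodingException("character is not in the charset");
        result ~= *p;
    }
    return result;
}

unittest
{
    auto table = latin1Table();
    auto reverse = reverseTable(table);

    // Round trip through Latin-1: o-umlaut becomes the single byte 0xF6.
    auto bytes = encodeStrict("Bj\u00F6rklund", reverse);
    assert(bytes.length == 9 && bytes[2] == 0xF6);
    assert(decodeStrict(bytes, table) == "Bj\u00F6rklund");
}
```

Building with unit tests enabled (for example dmd -unittest -main) runs the round trip. Whether to throw, skip, or substitute a replacement character for unmapped input is a design choice the post leaves open; the sketch only shows the throwing variant.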
"Walter" <newshound digitalmars.com> writes:
There's a Microsoft API function to do this, I think it's
WideCharToMultiByte() and MultiByteToWideChar().
Nov 23 2004
Anders F Björklund <afb algonet.se> writes:
Walter wrote:
 There's a Microsoft API function to do this, I think it's
 WideCharToMultiByte() and MultiByteToWideChar().

But that's only on Windows, right ?

(got the lookups from unicode.org)

--anders
Nov 23 2004
"Walter" <newshound digitalmars.com> writes:
"Anders F Björklund" <afb algonet.se> wrote in message
news:co0f4b$28ip$1 digitaldaemon.com...
 Walter wrote:
 There's a Microsoft API function to do this, I think it's
 WideCharToMultiByte() and MultiByteToWideChar().
 But that's only on Windows, right ? (got the lookups from unicode.org)

Right. I don't know what the corresponding linux API is.
Nov 23 2004
"Kris" <fu bar.com> writes:
| "Anders F Björklund" <afb algonet.se> wrote in message
| > Walter wrote:
| > > There's a Microsoft API function to do this, I think it's
| > > WideCharToMultiByte() and MultiByteToWideChar().
| >
| > But that's only on Windows, right ?
| >
| > (got the lookups from unicode.org)
|
| Right. I don't know what the corresponding linux API is.


Mango.io has optional bindings to any/all of the extensive ICU converters.
Stdio is covered there also, so it should probably handle the above case
without issue. I'd like to encourage folks to consider Mango.io and
Mango.icu as part of any Unicode oriented project. Naturally, I'm somewhat
biased :-)

For those not familiar with Mango, it comprises a set of related packages
(the Mango Tree) including:

- Cohesive, type-safe, and highly extensible IO package. Now with ICU hooks.
Supports all the D types along with all their array variants, and makes it
trivial to bind your own classes directly to the IO layer. Provides both the
put/get & <</>> syntactical flavors.

- Configurable runtime logging, a la Log4J, with a bonus HTML-based manager
to dynamically adjust the settings of a running executable. Also hooks into
Chainsaw for remote monitoring.

- Servlet engine. Supports the best parts of what the Java servlet spec
provides, and has better IO.

- A customizable and extensible HTTP server (used by the servlet engine).
Perhaps the fastest HTTP server available, since it can happily process
requests without making a single memory allocation. Just goes to show what
thread-locals and D array-slicing can do for performance! Also has a
separate HttpClient.

- High performance clustering. Based loosely around a Linda design, with
aspects of pub/sub and queuing mixed in. Uses D class-serialization to send
objects around a cluster, and is easy to use.

- Wrappers around the extensive ICU (unicode) project. This currently covers
around 85% of the ICU functionality, and includes a very usable
unicode-enabled UString class.


These packages are available as separate libraries. That is, Mango.icu and
Mango.log can be used in complete isolation. Mango.io can also be used
standalone. Mango.cluster, Mango.http, Mango.servlet, and Mango.cache
leverage the IO package to one degree or another.

Beta 9.6 will be released before the week is out, and v1.0 of some packages
will occur shortly thereafter. You can find out more about Mango over here:
http://www.dsource.org/forums/
Nov 23 2004
Anders F Björklund <afb algonet.se> writes:
Kris wrote:

 Mango.io has optional bindings to any/all of the extensive ICU converters.
 Stdio is covered there also, so it should probably handle the above case
 without issue. I'd like to encourage folks to consider Mango.io and
 Mango.icu as part of any Unicode oriented project. Naturally, I'm somewhat
 biased :-)

OK, will check it out. Only difference being: 12 MB versus 32 KB :-)

But there's probably other neat stuff in there, and had ICU already.

 These packages are available as separate libraries. That is, Mango.icu and
 Mango.log can be used in complete isolation. Mango.io can also be used
 standalone. Mango.cluster, Mango.http, Mango.servlet, and Mango.cache
 leverage the IO package to one degree or another.

Looks extensive! Wonder if it compiles on Darwin ? Hmm, no makefile...

--anders
Nov 23 2004
"Kris" <fu bar.com> writes:
"Anders F Björklund" <afb algonet.se> wrote in message news:co1du1
|
| Looks extensive! Wonder if it compiles on Darwin ? Hmm, no makefile...
|
| --anders

I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux
makefile will work? This one is compatible with the Beta 9.5 download
(accessible via the dsource download section), and I'll update it tomorrow
with the Beta 9.6 equivalent (to match the current checkins)

http://svn.dsource.org/svn/projects/mango/trunk/

Given that the ICU stuff is so recent, it has not been linked to the *nix
libs. The effort to get there is a known (and limited) quantity, but hasn't
happened yet. Everything else compiles and links just fine on linux, and the
vast majority of it runs without issue (there is one known problem regarding
Mango.cluster on that platform).

If you'd perhaps be willing to lend a hand regarding Darwin (or with the ICU
bindings, or whatever else), that would be great! :-)
Nov 24 2004
Anders F Björklund <afb algonet.se> writes:
Kris wrote:

 I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux
 makefile will work? This one is compatible with the Beta 9.5 download
 (accessible via the dsource download section), and I'll update it tomorrow
 with the Beta 9.6 equivalent (to match the current checkins)

I copied the linux makefile to darwin.make, and tried it.
Threw some errors and then gdc hung on FileConduit.d...

I think it was, will post the actual errors on Mango forum

 Given that the ICU stuff is so recent, it has not been linked to the *nix
 libs. The effort to get there is a known (and limited) quantity, but hasn't
 happened yet. Everything else compiles and links just fine on linux, and the
 vast majority of it runs without issue (there is one known problem regarding
 Mango.cluster on that platform).

Looks like most of it is POSIX-ish, should be compilable ?

--anders
Nov 24 2004
"Kris" <fu bar.com> writes:
"Anders F Björklund" <afb algonet.se> wrote in message
news:co23ph$1k5f$1 digitaldaemon.com...
| Kris wrote:
|
| > I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux
| > makefile will work? This one is compatible with the Beta 9.5 download
| > (accessible via the dsource download section), and I'll update it tomorrow
| > with the Beta 9.6 equivalent (to match the current checkins)
|
| I copied the linux makefile to darwin.make, and tried it.
| Threw some errors and then gdc hung on FileConduit.d...
|
| I think it was, will post the actual errors on Mango forum

Thanks; I'll check it out ...

|
| > Given that the ICU stuff is so recent, it has not been linked to the *nix
| > libs. The effort to get there is a known (and limited) quantity, but hasn't
| > happened yet. Everything else compiles and links just fine on linux, and the
| > vast majority of it runs without issue (there is one known problem regarding
| > Mango.cluster on that platform).
|
| Looks like most of it is POSIX-ish, should be compilable ?

Yep. We have to provide a little bit of linker glue, in place of the Win32
DLL binding-mechanism. The file ULocale has an example of how this should
work. It's not much effort, but it just hasn't been done.
Nov 24 2004
Anders F Björklund <afb algonet.se> writes:
Kris wrote:

 I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux
 makefile will work? This one is compatible with the Beta 9.5 download
 (accessible via the dsource download section), and I'll update it tomorrow
 with the Beta 9.6 equivalent (to match the current checkins)

Also, the Makefile seems a little broken since it recompiles everything?
It should reference the object files, and not the source code directly.

Something like:

 %.o : %.d
 	$(DMD) -c $(DFLAGS) -o $@ $<
 
 libmango.a : $(OBJECTS)
 	$(AR) -r $@ $(OBJECTS)

Perhaps adapted to use the $(OBJ) dir?

"all", "clean" and "install" targets seem to be missing, by the way.
They are phony targets that just reference the others or run shell commands.

One could also add a "check" target, that would run the unit-tests...

--anders
Nov 24 2004
"Kris" <fu bar.com> writes:
"Anders F Björklund" <afb algonet.se> wrote in message
| Also, the Makefile seems a little broken since it recompiles everything?
| It should reference the object files, and not the source code directly.
|
| Something like:
|
| > %.o : %.d
| > $(DMD) -c $(DFLAGS) -o $@ $<
| >
| > libmango.a : $(OBJECTS)
| > $(AR) -r $@ $(OBJECTS)
|
| Perhaps adapted to use the $(OBJ) dir?

That's because it's often faster to recompile everything than doing it
piecemeal :-)

One of the benefits of D is the speed at which it ploughs through source,
leaving tools like make in its wake (so to speak). The latest Win32 make
file does things somewhat differently, and is more along the lines of which
you speak (builds things a package at a time, rather than the whole
enchilada), and the linux makefile is expected to migrate to a similar
strategy.

There again, I have limited experience with make; and would be more than
happy if someone were to do it properly.
Nov 24 2004
"Walter" <newshound digitalmars.com> writes:
That's good work!

"Kris" <fu bar.com> wrote in message news:co0pv7$2oh2$1 digitaldaemon.com...
 Mango.io has optional bindings to any/all of the extensive ICU converters.
 Stdio is covered there also, so it should probably handle the above case
 without issue. I'd like to encourage folks to consider Mango.io and
 Mango.icu as part of any Unicode oriented project. Naturally, I'm somewhat
 biased :-)

 <snip>
Nov 23 2004
Stewart Gordon <smjg_1998 yahoo.com> writes:
Anders F Björklund wrote:
 I've written some test code for encodings...
 
 They take a mapping (wchar[256]) from ubyte,
 which defines the 8-bit charset / encoding.
 
 Then it can convert to and from Unicode.
 (such as the default char[] strings in D)
 
 
 The unoptimized D code looks like this:
 
 /// converts a 8-bit charset encoding string into unicode
 char[] decode_string(ubyte[] string, wchar[256] mapping)
<snip>

Why restrict yourself to 8-bit character sets that don't include U+10000 or above?

Stewart.
Nov 24 2004
Anders F Björklund <afb algonet.se> writes:
Stewart Gordon wrote:

 I've written some test code for encodings...
 
 They take a mapping (wchar[256]) from ubyte,
 which defines the 8-bit charset / encoding.
 Why restrict yourself to 8-bit character sets that don't include U+10000 or above?

Because it was a quick and dirty hack, with the sole purpose of being
able to provide input and output with consoles that don't talk Unicode...

ICU has a better "full" implementation of this ?
(as used by the Mango library posted here earlier)

--anders
Nov 24 2004
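
The limit raised in the last exchange comes from the table type: a wchar holds one UTF-16 code unit, so a wchar[256] mapping cannot name a character at U+10000 or above. A minimal sketch of the alternative, assuming present-day D (D2) and a hypothetical dchar[256] table, is shown below; decodeWide and UNMAPPED_WIDE are names invented for the example, and none of this appears in the thread.

```d
import std.utf : toUTF8;

/// dchar.init is U+FFFF, so it can serve as the "unmapped" marker here too.
enum dchar UNMAPPED_WIDE = dchar.init;

/// Decodes with a table of full code points instead of UTF-16 code units.
/// Each entry may be any code point up to U+10FFFF; the price is a table
/// of 1024 bytes instead of 512.
string decodeWide(const(ubyte)[] input, const ref dchar[256] mapping)
{
    dchar[] result;
    foreach (b; input)
    {
        // As in the posted decode_string, unmapped bytes are skipped.
        if (mapping[b] != UNMAPPED_WIDE)
            result ~= mapping[b];
    }
    return toUTF8(result);
}

unittest
{
    dchar[256] table;                // every entry starts as dchar.init (U+FFFF)
    table[0x41] = 'A';
    table[0x80] = '\U0001D11E';      // MUSICAL SYMBOL G CLEF, above U+FFFF

    ubyte[] input = [0x41, 0x80, 0x81];
    auto text = decodeWide(input, table);
    assert(text == "A\U0001D11E");   // byte 0x81 is unmapped and dropped
}
```

None of the four charsets named in the thread needs this, since all their characters sit below U+10000; the wider table only matters for a charset that maps a byte beyond that range.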