digitalmars.D - 8-bit character encodings
- Anders F Björklund <afb algonet.se> Nov 23 2004
- "Walter" <newshound digitalmars.com> Nov 23 2004
- Anders F Björklund <afb algonet.se> Nov 23 2004
- "Walter" <newshound digitalmars.com> Nov 23 2004
- "Kris" <fu bar.com> Nov 23 2004
- Anders F Björklund <afb algonet.se> Nov 23 2004
- "Kris" <fu bar.com> Nov 24 2004
- Anders F Björklund <afb algonet.se> Nov 24 2004
- "Kris" <fu bar.com> Nov 24 2004
- Anders F Björklund <afb algonet.se> Nov 24 2004
- "Kris" <fu bar.com> Nov 24 2004
- "Walter" <newshound digitalmars.com> Nov 23 2004
- Stewart Gordon <smjg_1998 yahoo.com> Nov 24 2004
I've written some test code for encodings...

They take a mapping (wchar[256]) from ubyte, which defines the 8-bit charset / encoding. Then it can convert to and from Unicode (such as the default char[] strings in D).

The unoptimized D code looks like this:

    /// converts an 8-bit charset encoding string into unicode
    char[] decode_string(ubyte[] string, wchar[256] mapping)
    {
        wchar[] result;

        foreach (ubyte c; string)
        {
            if (mapping[c] != 0xFFFF)
                result ~= mapping[c];
        }

        return std.utf.toUTF8(result);
    }

    /// converts a unicode string into 8-bit charset encoding
    ubyte[] encode_string(char[] string, wchar[256] mapping)
    {
        ubyte[] result;

        foreach (wchar c; string)
        {
            foreach (int i, wchar m; mapping)
            {
                if (c == m)
                    result ~= cast(ubyte) i;
            }
        }

        return result;
    }
I added four mappings, just to have something to test with: iso88591, cp1252, cp437, macroman (each lookup table is 512 bytes, so that's 2K).

The ubyte[] can then be used as C (char *), by nul-terminating as usual, for e.g. printf("%s").

It works just fine for both input and output, e.g. as Latin-1.

It should probably throw an exception or something like that when it encounters unmapped characters? (For instance, Win CP-1252 has 5 byte values with no Unicode mapping.)

Surely someone must have written this before? Just that I couldn't find it in the libraries...

--anders

PS. The real code builds reverse lookup tables too (one with chars < 0x0100, and one with the rest).

PPS. wchar[] versions left as an exercise for the reader. They would avoid all the UTF-8 conversions above.
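For readers following along, here is a minimal sketch of the two ideas raised in the post: throwing on unmapped characters and using a reverse lookup table. It is written in present-day D rather than the 2004 dialect, and it is not the poster's real code. The names `UNMAPPED`, `buildReverse`, `decodeString` and `encodeString` are made up for illustration, and a single associative array stands in for the two reverse tables mentioned in the PS.

```d
import std.exception : enforce;
import std.utf : toUTF8;

enum wchar UNMAPPED = 0xFFFF; // same sentinel as the tables in the post

/// Builds the reverse table (UTF-16 code unit -> byte) once per charset.
ubyte[wchar] buildReverse(ref const(wchar[256]) mapping)
{
    ubyte[wchar] reverse;
    foreach (i, m; mapping)
    {
        // Skip unmapped bytes; on duplicates keep the lowest byte value.
        if (m != UNMAPPED && m !in reverse)
            reverse[m] = cast(ubyte) i;
    }
    return reverse;
}

/// Decodes 8-bit text to UTF-8, throwing on bytes the charset leaves undefined.
string decodeString(const(ubyte)[] bytes, ref const(wchar[256]) mapping)
{
    wchar[] result;
    foreach (b; bytes)
    {
        enforce(mapping[b] != UNMAPPED, "byte has no Unicode mapping");
        result ~= mapping[b];
    }
    return toUTF8(result);
}

/// Encodes UTF-8 text to the 8-bit charset, throwing on unmappable characters.
ubyte[] encodeString(const(char)[] text, const ubyte[wchar] reverse)
{
    ubyte[] result;
    foreach (wchar c; text) // foreach decodes UTF-8 into UTF-16 code units
    {
        auto p = c in reverse;
        enforce(p !is null, "character not representable in this charset");
        result ~= *p;
    }
    return result;
}

unittest
{
    // ISO-8859-1 maps every byte to the code point of the same value.
    wchar[256] latin1;
    foreach (i, ref c; latin1)
        c = cast(wchar) i;

    auto reverse = buildReverse(latin1);
    ubyte[] bytes = encodeString("Björklund", reverse);
    assert(bytes.length == 9 && bytes[2] == 0xF6); // 'ö' becomes one byte
    assert(decodeString(bytes, latin1) == "Björklund");
}
```

The reverse table replaces the inner 256-entry scan of encode_string with one lookup per character, and `enforce` turns the silent drop of unmapped input into an exception.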
Nov 23 2004
There's a Microsoft API function to do this, I think it's WideCharToMultiByte() and MultiByteToWideChar().
Nov 23 2004
Walter wrote:
> There's a Microsoft API function to do this, I think it's WideCharToMultiByte() and MultiByteToWideChar().

But that's only on Windows, right ? (got the lookups from unicode.org) --anders
Nov 23 2004
"Anders F Björklund" <afb algonet.se> wrote in message news:co0f4b$28ip$1 digitaldaemon.com...
> Walter wrote:
> > There's a Microsoft API function to do this, I think it's WideCharToMultiByte() and MultiByteToWideChar().
>
> But that's only on Windows, right ? (got the lookups from unicode.org)

Right. I don't know what the corresponding linux API is.
Nov 23 2004
| "Anders F Björklund" <afb algonet.se> wrote in message
| > Walter wrote:
| > > There's a Microsoft API function to do this, I think it's
| > > WideCharToMultiByte() and MultiByteToWideChar().
| >
| > But that's only on Windows, right ?
| >
| > (got the lookups from unicode.org)
|
| Right. I don't know what the corresponding linux API is.

Mango.io has optional bindings to any/all of the extensive ICU converters. Stdio is covered there also, so it should probably handle the above case without issue.

I'd like to encourage folks to consider Mango.io and Mango.icu as part of any Unicode oriented project. Naturally, I'm somewhat biased :-)

For those not familiar with Mango, it comprises a set of related packages (the Mango Tree) including:

- Cohesive, type-safe, and highly extensible IO package. Now with ICU hooks. Supports all the D types along with all their array variants, and makes it trivial to bind your own classes directly to the IO layer. Provides both the put/get & <</>> syntactical flavors.

- Configurable runtime logging, a la Log4J, with a bonus HTML-based manager to dynamically adjust the settings of a running executable. Also hooks into Chainsaw for remote monitoring.

- Servlet engine. Supports the best parts of what the Java servlet spec provides, and has better IO.

- A customizable and extensible HTTP server (used by the servlet engine). Perhaps the fastest HTTP server available, since it can happily process requests without making a single memory allocation. Just goes to show what thread-locals and D array-slicing can do for performance! Also has a separate HttpClient.

- High performance clustering. Based loosely around a Linda design, with aspects of pub/sub and queuing mixed in. Uses D class-serialization to send objects around a cluster, and is easy to use.

- Wrappers around the extensive ICU (unicode) project. This currently covers around 85% of the ICU functionality, and includes a very usable unicode-enabled UString class.

These packages are available as separate libraries. That is, Mango.icu and Mango.log can be used in complete isolation. Mango.io can also be used standalone. Mango.cluster, Mango.http, Mango.servlet, and Mango.cache leverage the IO package to one degree or another.

Beta 9.6 will be released before the week is out, and v1.0 of some packages will occur shortly thereafter.

You can find out more about Mango over here: http://www.dsource.org/forums/
Nov 23 2004
Kris wrote:
> Mango.io has optional bindings to any/all of the extensive ICU converters. Stdio is covered there also, so it should probably handle the above case without issue. I'd like to encourage folks to consider Mango.io and Mango.icu as part of any Unicode oriented project. Naturally, I'm somewhat biased :-)

OK, will check it out. Only difference being: 12 MB versus 32 KB :-)
But there's probably other neat stuff in there, and had ICU already.

> These packages are available as separate libraries. That is, Mango.icu and Mango.log can be used in complete isolation. Mango.io can also be used standalone. Mango.cluster, Mango.http, Mango.servlet, and Mango.cache leverage the IO package to one degree or another.

Looks extensive! Wonder if it compiles on Darwin ? Hmm, no makefile...

--anders
Nov 23 2004
"Anders F Björklund" <afb algonet.se> wrote in message news:co1du1
|
| Looks extensive! Wonder if it compiles on Darwin ? Hmm, no makefile...
|
| --anders

I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux makefile will work? This one is compatible with the Beta 9.5 download (accessible via the dsource download section), and I'll update it tomorrow with the Beta 9.6 equivalent (to match the current checkins)

http://svn.dsource.org/svn/projects/mango/trunk/

Given that the ICU stuff is so recent, it has not been linked to the *nix libs. The effort to get there is a known (and limited) quantity, but hasn't happened yet. Everything else compiles and links just fine on linux, and the vast majority of it runs without issue (there is one known problem regarding Mango.cluster on that platform).

If you'd perhaps be willing to lend a hand regarding Darwin (or with the ICU bindings, or whatever else), that would be great! :-)
Nov 24 2004
Kris wrote:
> I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux makefile will work? This one is compatible with the Beta 9.5 download (accessible via the dsource download section), and I'll update it tomorrow with the Beta 9.6 equivalent (to match the current checkins)

I copied the linux makefile to darwin.make, and tried it. It threw some errors and then gdc hung on FileConduit.d, I think it was. Will post the actual errors on the Mango forum.

> Given that the ICU stuff is so recent, it has not been linked to the *nix libs. The effort to get there is a known (and limited) quantity, but hasn't happened yet. Everything else compiles and links just fine on linux, and the vast majority of it runs without issue (there is one known problem regarding Mango.cluster on that platform).

Looks like most of it is POSIX-ish, should be compilable ?

--anders
Nov 24 2004
"Anders F Björklund" <afb algonet.se> wrote in message news:co23ph$1k5f$1 digitaldaemon.com...
| Kris wrote:
|
| > I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux
| > makefile will work? This one is compatible with the Beta 9.5 download
| > (accessible via the dsource download section), and I'll update it tomorrow
| > with the Beta 9.6 equivalent (to match the current checkins)
|
| I copied the linux makefile to darwin.make, and tried it.
| Throwed some errors and then gdc hung on FileConduit.d...
|
| I think it was, will post the actual errors on Mango forum

Thanks; I'll check it out ...

|
| > Given that the ICU stuff is so recent, it has not been linked to the *nix
| > libs. The effort to get there is a known (and limited) quantity, but hasn't
| > happened yet. Everything else compiles and links just fine on linux, and the
| > vast majority of it runs without issue (there is one known problem regarding
| > Mango.cluster on that platform).
|
| Looks like most of it is POSIX-ish, should be compilable ?

Yep. We have to provide a little bit of linker glue, in place of the Win32 DLL binding-mechanism. The file ULocale has an example of how this should work. It's not much effort, but it just hasn't been done.
Nov 24 2004
Kris wrote:
> I'm not sure that anyone has tried it on Darwin as yet. Perhaps the linux makefile will work? This one is compatible with the Beta 9.5 download (accessible via the dsource download section), and I'll update it tomorrow with the Beta 9.6 equivalent (to match the current checkins)

Also, the Makefile seems a little broken since it recompiles everything? It should reference the object files, and not the source code directly.

Something like:

    %.o : %.d
        $(DMD) -c $(DFLAGS) -o $@ $<

    libmango.a : $(OBJECTS)
        $(AR) -r $@ $(OBJECTS)

Perhaps adapted to use the $(OBJ) dir?

"all", "clean" and "install" targets seem to be missing, by the way. They are phony targets that just reference the others or run shell commands. One could also add a "check" target that would run the unit-tests...

--anders
Nov 24 2004
"Anders F Björklund" <afb algonet.se> wrote in message
| Also, the Makefile seems a little broken since it recompiles everything?
| It should reference the object files, and not the source code directly.
|
| Something like:
|
| > %.o : %.d
| >     $(DMD) -c $(DFLAGS) -o $@ $<
| >
| > libmango.a : $(OBJECTS)
| >     $(AR) -r $@ $(OBJECTS)
|
| Perhaps adapted to use the $(OBJ) dir?

That's because it's often faster to recompile everything than doing it piecemeal :-)

One of the benefits of D is the speed at which it ploughs through source, leaving tools like make in its wake (so to speak). The latest Win32 make file does things somewhat differently, and is more along the lines of which you speak (builds things a package at a time, rather than the whole enchilada), and the linux makefile is expected to migrate to a similar strategy.

There again, I have limited experience with make, and would be more than happy if someone were to do it properly.
Nov 24 2004
That's good work!

"Kris" <fu bar.com> wrote in message news:co0pv7$2oh2$1 digitaldaemon.com...
> Mango.io has optional bindings to any/all of the extensive ICU converters. Stdio is covered there also, so it should probably handle the above case without issue.
>
> I'd like to encourage folks to consider Mango.io and Mango.icu as part of any Unicode oriented project. Naturally, I'm somewhat biased :-)
>
> For those not familiar with Mango, it comprises a set of related packages (the Mango Tree) including:
>
> - Cohesive, type-safe, and highly extensible IO package. Now with ICU hooks. Supports all the D types along with all their array variants, and makes it trivial to bind your own classes directly to the IO layer. Provides both the put/get & <</>> syntactical flavors.
>
> - Configurable runtime logging, a la Log4J, with a bonus HTML-based manager to dynamically adjust the settings of a running executable. Also hooks into Chainsaw for remote monitoring.
>
> - Servlet engine. Supports the best parts of what the Java servlet spec provides, and has better IO.
>
> - A customizable and extensible HTTP server (used by the servlet engine). Perhaps the fastest HTTP server available, since it can happily process requests without making a single memory allocation. Just goes to show what thread-locals and D array-slicing can do for performance! Also has a separate HttpClient.
>
> - High performance clustering. Based loosely around a Linda design, with aspects of pub/sub and queuing mixed in. Uses D class-serialization to send objects around a cluster, and is easy to use.
>
> - Wrappers around the extensive ICU (unicode) project. This currently covers around 85% of the ICU functionality, and includes a very usable unicode-enabled UString class.
>
> These packages are available as separate libraries. That is, Mango.icu and Mango.log can be used in complete isolation. Mango.io can also be used standalone. Mango.cluster, Mango.http, Mango.servlet, and Mango.cache leverage the IO package to one degree or another.
>
> Beta 9.6 will be released before the week is out, and v1.0 of some packages will occur shortly thereafter.
>
> You can find out more about Mango over here: http://www.dsource.org/forums/
Nov 23 2004
Anders F Björklund wrote:
> I've written some test code for encodings... They take a mapping (wchar[256]) from ubyte, which defines the 8-bit charset / encoding. Then it can convert to and from Unicode (such as the default char[] strings in D). The unoptimized D code looks like this:
>
>     /// converts an 8-bit charset encoding string into unicode
>     char[] decode_string(ubyte[] string, wchar[256] mapping)

Why restrict yourself to 8-bit character sets that don't include U+10000 or above?

Stewart.
Nov 24 2004
Stewart Gordon wrote:
> > I've written some test code for encodings... They take a mapping (wchar[256]) from ubyte, which defines the 8-bit charset / encoding.
>
> Why restrict yourself to 8-bit character sets that don't include U+10000 or above?

Because it was a quick and dirty hack, with the sole purpose of being able to provide input and output with consoles that don't talk Unicode...

ICU has a better "full" implementation of this ? (as used by the Mango library posted here earlier)

--anders
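
To make the limitation behind Stewart's question concrete: a wchar holds one UTF-16 code unit, so a single wchar[256] entry cannot hold a code point at U+10000 or above (UTF-16 needs a surrogate pair for those). A rough sketch of a variant with a dchar[256] table follows, again in present-day D. The function name `decodeWide` and the table contents are invented for illustration and do not correspond to any real charset or to code from this thread.

```d
import std.exception : enforce;

/// Hypothetical variant of decode_string whose table holds full code points.
string decodeWide(const(ubyte)[] bytes, ref const(dchar[256]) mapping)
{
    string result;
    foreach (b; bytes)
    {
        // 0xFFFF is kept as the "unmapped" sentinel, as in the wchar tables.
        enforce(mapping[b] != 0xFFFF, "byte has no Unicode mapping");
        result ~= mapping[b]; // appending a dchar encodes it as UTF-8
    }
    return result;
}

unittest
{
    dchar[256] table;              // dchar.init is 0xFFFF, so all bytes start unmapped
    table[0x41] = 'A';
    table[0x80] = '\U0001D11E';    // made-up slot holding a code point above U+FFFF
    ubyte[] input = [0x41, 0x80];
    assert(decodeWide(input, table) == "A\U0001D11E");
}
```

The table doubles in size to 1 KB per charset, which is the price of dropping the 16-bit restriction.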
Nov 24 2004