
digitalmars.D - Transcoding - who's doing what?

reply Arcane Jill <Arcane_member pathlink.com> writes:
There have been loads and loads of discussions in recent weeks about Unicode,
streams, and transcodings. There seems to be a general belief that "things are
happening", but I'm not quite clear on the specifics - hence this post, which is
basically a question.

To clarify my own plans on the Unicode front, the purpose of the etc.unicode
library is to implement all of the algorithms defined by the Unicode standard on
the Unicode website. ("All" is quite ambitious, actually, and it will take a
long time to achieve that, but obviously the core ones will come first, and most
of the property-getting functions are already there). But I'm /not/ planning on
writing any transcoding functions, simply because they're not part of the
Unicode standard. Transcoding, in fact, is all about converting /to/  Unicode
from something else (and vice versa).

Transcoding functions are easy to write - for most encodings a simple 256-entry
lookup table will suffice, at least in one direction. But transcoding in strings
is not necessarily the best architecture, and it would probably be better to do
it at a lower level, using streams (aka filters/readers/writers) - basically
just classes which implement a read() function and/or a write() function.
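For the easy direction, the whole decoder amounts to something like this (a throwaway sketch - the table contents, of course, would be generated from the encoding's published mapping tables):

import std.stdio;

// byte -> Unicode for a single-byte encoding: one table lookup per byte.
dchar[256] table;   // hypothetical; filled from the encoding's mapping file

dchar[] decode(ubyte[] input)
{
    dchar[] output = new dchar[input.length];
    foreach (int i, ubyte b; input)
        output[i] = table[b];
    return output;
}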

I don't know who, if anyone, is currently working on this. In post
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/5925, Hauke said: "I'm
currently working on ... a string interface that abstracts from the specific
encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32,
system codepage, etc...).", but it's possible I may have read too much into
that.

I also know that Sean is doing some stream stuff, and that in post
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/8236, he said 'Rather
than having the encoding handled
directly by the Stream layer perhaps it should be dropped into another class.  I
can't imagine coding a base lib to support "Joe's custom encoding scheme."  For
the moment though, I think I'll leave stream.d as-is.  This seems like a design
issue that will take a bit of talk to get right.' and 'I'll
probably have the first cut of stream.d done in a few more days and after that
we can talk about what's wrong with it, etc.'

I also need a bit of educating on the future of D's streams. Are we going to get
separate InputStream and OutputStream interfaces, or what?

Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't
mind either way, but if Phobos is going to go off in some completely tangential
direction, I want to know that too.

So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't
been done yet, basically because we haven't agreed on an architecture, and I for
one am not really sure who's doing it anyway.

Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders
yet, or is it still up in the air?

And, (2), if the answer to (1) is no, I'd like to suggest that a couple of
simple classes be written which, I believe, will slot nicely into whatever
architecture we eventually come up with. This is what I suspect will do the job.
Two classes:
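(A rough sketch - the names, and the choice of std.stream's Stream as the byte layer, are placeholders until we settle the architecture; see the questions below.)

import std.stream;

// Latin-1 decoder: every byte value maps directly to the same Unicode
// codepoint, so decoding is just a widening cast.
class Latin1Decoder
{
    private Stream source;

    this(Stream source) { this.source = source; }

    // Read one byte from the underlying stream and widen it to a dchar.
    dchar next()
    {
        ubyte b;
        source.read(b);
        return cast(dchar) b;
    }
}

// Latin-1 encoder: only codepoints U+0000 to U+00FF are representable.
class Latin1Encoder
{
    private Stream sink;

    this(Stream sink) { this.sink = sink; }

    // Write one character, or complain if Latin-1 can't represent it.
    void write(dchar c)
    {
        if (c > 0xFF)
            throw new Exception("character not representable in Latin-1");
        sink.write(cast(ubyte) c);
    }
}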

Now these will probably need some adapting to fit into our final architecture.
(Should they derive from Stream? Or from some yet-to-be-defined transcoding
Reader/Writer base classes? Should they implement some interface? Should they be
merged into a single class? etc.) BUT - they won't need /much/ adaptation, and
once we've got Latin-1 working, we'll have an example on which to model all the
others. So feel free to take the above code and adapt it as necessary.

But I do think we should nail down the architecture soon, as we're getting a lot
of questions and discussion on this. But one thing at a time. Someone tell me
where streams are going (with regard to above questions) and then I'll have more
suggestions.

Arcane Jill
Aug 15 2004
next sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 I also need a bit of educating on the future of D's streams. Are we going
 to get separate InputStream and OutputStream interfaces, or what?
std.stream.InputStream and OutputStream interfaces already exist (since 0.89). All the "new" stuff in std.stream isn't in the phobos.html doc. Are you thinking of a different InputStream and OutputStream?
Aug 15 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfogis$12on$1 digitaldaemon.com>, Ben Hinkle says...
 I also need a bit of educating on the future of D's streams. Are we going
 to get separate InputStream and OutputStream interfaces, or what?
std.stream.InputStream and OutputStream interfaces already exist (since 0.89).
I didn't know that. Thanks.
All the "new" stuff in std.stream isn't in the phobos.html doc.
Ah. That would be why I didn't know it. I've only read the HTML, not the D source. I know a lot of folk have suggested that I should read the source, but I guess it's an ideological thing - using the specifics of the source smacks of relying on undocumented features to me, something not guaranteed to work in future incarnations. How hard would it be to update the documentation?
Are
you thinking of a different InputStream and OutputStream?
I wasn't thinking of anything. I just didn't know there was such a beast. Thanks for educating me.

Jill
Aug 15 2004
prev sibling next sibling parent reply "antiAlias" <gblazzer corneleus.com> writes:
I'm not doing anything specific for transcoding (yet) Jill; but will as soon
as the appropriate knowledge is made available in the shape of some
low-level libraries. If etc.unicode already has those, well, I'll get on the
job pronto.

As for architecture, this is how mango.io approaches it:

One might consider mango.io to have three separate, but related and
bindable, entities. These are Conduit, Buffer, and Reader/Writer. Conduits
represent things like files, sockets, and other 'physical', block oriented
devices. You can talk to a Conduit directly (via read/write methods) with an
instance of a Buffer. The next stage up in pecking order is the Buffer,
which acts as a bi-directional queue for Conduit data (or used independently
like Outbuffer, for that matter). You can read and write to a buffer using
void[], or map it directly to a local array if desired.

Buffers are intended as an abstraction over the more physical Conduit. You
can use a common Buffer for both read and write purposes, or you can have a
separate instance for both read and write purposes. On top of the Buffer,
one can map either a set of Tokenizers (for scanf like processing), or a set
of Readers/Writers. The latter convert between representations: usually
programmer-idioms to Conduit-idioms and back again. For example, a Reader
might convert Buffer content into ints, longs, char[] arrays and so on.
Writer does the opposite.

You can make a Reader/Writer pair do whatever you wish in terms of
conversion: a classic example is endian conversion, but others might include
various other tasks, including Unicode transcoding. In addition, you can map
multiple Readers/Writers onto a common Buffer, and they will all behave
sequentially as one might imagine. The latter is handy for when you need to
see what the content is before reading it in some other manner (think HTTP
headers, followed by content that's been zip-compressed). You might think of
the Reader/Writer layer as "piecemeal" IO: they usually work with small
amounts of data at a time.

Finally, the Conduit actually has an optional filter "intercept" layer: you
can build a filter to modify either the input or output in void[] style.
That is, an output filter is given a void[], and does whatever it wants
with it (usually calls the next filter in the chain, which will ultimately
cause the modulated content to be written somewhere).

This sounds somewhat complex, but the APIs make it really easy (certainly as
easy as phobos.io) to get things hooked up. For example, when reading a file
you typically do the following:

FileConduit fc = new FileConduit ("file.name");
Reader r = new Reader (fc);

r.get(x).get(y).get(z);

(or r >> x >> y >> z;)

etc.

So, whenever the appropriate unicode converters are available, I (or someone
else) can hook them up either at the Buffer layer, or at the Conduit-filter
layer. If you'd be interested in doing that, I'd be very, very, grateful!

Aug 15 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfoiln$13re$1 digitaldaemon.com>, antiAlias says...
I'm not doing anything specific for transcoding (yet) Jill; but will as soon
as the appropriate knowledge is made available in the shape of some
low-level libraries. If etc.unicode already has those, well, I'll get on the
job pronto.
I'm afraid it doesn't have anything relevant to encoding or decoding, sorry - just character properties, like isWhitespace(dchar) and so on. Transcoding is a different issue, basically just a mapping to/from a sequence of bytes from/to a Unicode character, and the actual mapping will be different for each encoding. Latin-1 is easy, because the codepoints are identical to those of Unicode.
As for architecture, this is how mango.io approaches it:
<snip>

Cool.
So, whenever the appropriate unicode converters are available, I (or someone
else) can hook them up either at the Buffer layer, or at the Conduit-filter
layer. If you'd be interested in doing that, I'd be very, very, grateful!
I think I follow that. But presumably, if people don't want it to be std-specific, then it shouldn't be mango-specific either. I can write a converter for Latin-1, once we're all happy with the architecture. (Actually, I think any of us could). But I certainly wouldn't be able to do (for example) SHIFT-JIS. I imagine once we have the architecture nailed down, lots of transcoder classes will get written (one for each encoding). Jill
Aug 15 2004
parent "antiAlias" <gblazzer corneleus.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cforp2$18vc$1 digitaldaemon.com...
So, whenever the appropriate unicode converters are available, I (or
someone
else) can hook them up either at the Buffer layer, or at the
Conduit-filter
layer. If you'd be interested in doing that, I'd be very, very, grateful!
Oops. Should have written "either at the Reader/Writer layer, or at the Conduit-filter layer" instead.
 I think I follow that. But presumably, if people don't want it to be
 std-specific, then it shouldn't be mango-specific either.
Yep; I think it's feasible to avoid all dependencies by limiting the API to arrays.
Aug 15 2004
prev sibling next sibling parent reply teqDruid <me teqdruid.com> writes:
On Sun, 15 Aug 2004 20:15:35 +0000, Arcane Jill wrote:
<snip>
I, for one, would prefer that the core functionality NOT be phobos-streams specific. I.e., make a set of functions to do the transcoding, then use those to create the readers and writers. This way, it'll be easier to put the transcoding stuff into mango, which I prefer over std.streams.

John
Aug 15 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <pan.2004.08.15.21.19.34.123236 teqdruid.com>, teqDruid says...

I, for one, would prefer that the core functionality NOT be phobos-streams
specific.
Fair enough.
IE, make a set of functions to do the transcoding, then use
those to create the readers and writers.  This way, it'll be easier to put
the transcoding stuff into mango, which I prefer over std.streams.
Right, but this "set of functions" (or classes, which I'd prefer) would still have to have a common format, or you wouldn't be able to call them polymorphically at runtime.

Would you have a problem if they just implemented (or relied upon) the InputStream and OutputStream interfaces which I only just learned about a few posts ago?

Jill
Aug 15 2004
parent reply "antiAlias" <gblazzer corneleus.com> writes:
Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);

where both return the number of bytes converted (or something like that). I
think it's perhaps best to make these kind of things completely independent
of any other layer, if at all possible. These also happen to be the kind of
functions that might be worth optimizing with a smattering of assembly ...
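For the UTF-8 direction, the implementation might be as simple as this (a sketch leaning on std.utf.decode; error handling is whatever std.utf does):

import std.utf;

// Decode as much of 'input' as fits in 'output'; return the number of
// input bytes consumed. std.utf.decode throws on malformed UTF-8.
// NB: in practice a second count (dchars produced) is needed too.
int utf8ToDChar (char[] input, dchar[] output)
{
    size_t i = 0;
    size_t o = 0;
    while (i < input.length && o < output.length)
        output[o++] = std.utf.decode(input, i);
    return cast(int) i;
}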

Aug 15 2004
next sibling parent reply Nick <Nick_member pathlink.com> writes:
In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...
Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);

where both return the number of bytes converted (or something like that). I
think it's perhaps best to make these kind of things completely independent
of any other layer, if at all possible. These also happen to be the kind of
functions that might be worth optimizing with a smattering of assembly ...
Ok, here's my shot at it: http://folk.uio.no/mortennk/encoding/ (released under LGPL)

I'm not a professional programmer, so please excuse bad programming style, naming conventions or other crimes against humanity. Like mentioned earlier, I use iconv() from libiconv, which can convert between a large set of encodings with little hassle.

Only tested on Linux. I'll leave the Windows porting/testing to someone else. A Win32 port of libiconv can be found here: http://gnuwin32.sourceforge.net/packages/libiconv.htm

Nick
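The heart of the iconv() approach is only a few calls - a simplified sketch (assumed extern(C) bindings for the standard iconv_open/iconv/iconv_close API, minimal error handling; not the exact code from the link above):

import std.string;   // toStringz

extern (C)
{
    alias void* iconv_t;
    iconv_t iconv_open(char* tocode, char* fromcode);
    size_t iconv(iconv_t cd, char** inbuf, size_t* inbytesleft,
                 char** outbuf, size_t* outbytesleft);
    int iconv_close(iconv_t cd);
}

// Latin-1 -> UTF-8 via libiconv. A Latin-1 byte expands to at most two
// UTF-8 bytes, so the output buffer below is always big enough.
char[] latin1ToUtf8(ubyte[] input)
{
    if (input.length == 0)
        return null;

    iconv_t cd = iconv_open(toStringz("UTF-8"), toStringz("ISO-8859-1"));
    if (cd == cast(iconv_t) -1)
        throw new Exception("iconv_open failed");

    char[] output = new char[2 * input.length];
    char* inp = cast(char*) &input[0];
    char* outp = &output[0];
    size_t inleft = input.length;
    size_t outleft = output.length;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == cast(size_t) -1)
        throw new Exception("conversion failed");

    return output[0 .. output.length - outleft];
}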
Aug 15 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfp7v5$1h84$1 digitaldaemon.com>, Nick says...

Ok, here's my shot at it:
I think we should establish what we need, who needs what and why, etc., before committing any code to a public library. Although the transcoding issue is "urgent" in the sense that lots of people want it, I'd say it was more important to get it right than to write it fast.

There's nothing wrong with your code. I just think that it addresses a different problem than the ones faced by stream developers.

Jill
Aug 16 2004
parent Nick <Nick_member pathlink.com> writes:
That is ok. You raise some interesting points in your other post, and I might
rewrite my code later based on what you said, if I have the time. My code is
more a proof of concept, and the point was that encoding can be done easily
through libiconv and you don't have to reinvent the wheel. The library already
supports all the features you want, and rewriting my code for use with streams
shouldn't be very hard.

Nick

In article <cfpvc5$2297$1 digitaldaemon.com>, Arcane Jill says...
I think we should establish what we need, who needs what and why, etc., before
committing any code to a public library. Although the transcoding issue is
"urgent" in the sense that lots of people want it, I'd say it was more important
to get it right, than to write it fast.

There's nothing wrong with your code. I just think that it addresses a different
problem than the ones faced by stream developers.

Jill
Aug 16 2004
prev sibling next sibling parent reply teqDruid <me teqdruid.com> writes:
On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:

 Might I suggest something along the following lines:
 
 int utf8ToDChar (char[] input, dchar[] output);
 int dCharToUtf8 (dchar[] input, char[] output);
That's what I was getting at... I don't know much about Unicode transcoding, but I don't see a reason for the core functionality to be any more complicated than that. John
Aug 15 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <pan.2004.08.16.06.29.47.206851 teqdruid.com>, teqDruid says...
On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:

 Might I suggest something along the following lines:
 
 int utf8ToDChar (char[] input, dchar[] output);
 int dCharToUtf8 (dchar[] input, char[] output);
That's what I was getting at... I don't know much about Unicode transcoding, but I don't see a reason for the core functionality to be any more complicated than that. John
Suppose you want to decode a dchar from a stream, and then immediately read a ubyte from the same stream. The above functions won't let you do that.

To decode a dchar from a stream you must first read /some/ bytes from that stream, in order to pass those bytes to the above function. But how many? One? Two? Four? In UTF-7, some Unicode characters require no less than /eight/ bytes. (One can invent or imagine encodings that require even more). If you've read too few bytes from the stream, your conversion function will throw an exception. If you've read too many, the stream's seek position will be incorrect for the next read.

You could argue that streams themselves could be rewritten to call functions like the above internally, but now you're adding complexity to something that doesn't need it.

You said: "I don't see a reason for the core functionality to be any more complicated than that". But those functions are not "core" - they are constructable from yet lower level functionality. The lowest level of abstraction about which it makes sense to talk is "get one Unicode character from somewhere" and "write one Unicode character somewhere". The minute you start talking about /strings/ instead of merely /characters/, you've made an implementation assumption.

Anyway, it's not the function/class/interface/whatever that needs to be simple, it's the code which calls it. We make classes do complicated things so that callers don't have to.

Arcane Jill
Aug 16 2004
parent "Martin M. Pedersen" <martin moeller-pedersen.dk> writes:
"Arcane Jill" <Arcane_member pathlink.com> skrev i en meddelelse
news:cfqcrh$2cs2$1 digitaldaemon.com...
 In article <pan.2004.08.16.06.29.47.206851 teqdruid.com>, teqDruid says...
 To decode a dchar from a stream you must first read /some/ bytes from that
 stream, in order to pass those bytes to the above function. But how many?
One?
 Two? Four? In UTF-7, some Unicode characters require no less than /eight/
bytes.
 (One can invent or imagine encodings that require even more).
Another verbose, yet useful representation is the character entities used in HTML: http://www.w3.org/TR/REC-html40/sgml/entities.html

Regards,
Martin M. Pedersen
Aug 16 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);
where both return the number of bytes converted (or something like that).
That would be bad. I think it's possible you haven't understood the issues, so I'll try to explain in this post what some of them are, and why you would want to do certain things in certain ways.
I think it's perhaps best to make these kind of things completely independent
of any other layer, if at all possible.
I don't have any problem with that.
These also happen to be the kind of
functions that might be worth optimizing with a smattering of assembly ...
I disagree. Transcoding almost never happens in performance-critical code. It happens during input and output. A typical scenario is to get input from a console and then decode it, or to encode a string and then write it to a file. The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor. Of course it still makes sense to do this efficiently, but assembler - given that it's not portable, decreases maintainability, etc. - is probably going a bit too far.

Okay, back to these function signatures:
int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);
(1) The encoding is not necessarily known at compile time. This problem would also exist had you used classes/interfaces, of course, but at least with classes or interfaces instead of plain functions, you can rely on polymorphism and factory methods to do the dispatching, giving you a single point of decision. Functions like the above would lead to switch statements all over the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only with a single point of decision can you enforce the IANA encoding names, case conventions, etc..

I see that in "charset.d" you made the encoding name a runtime parameter - but that too is bad, partly because you don't have a single point of decision, but partly also because you're now having to make that runtime check with /every/ fragment of text - not merely at construction time.

(2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.

(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

In fact, the minimal functionality that a decoder requires - and the minimal functionality upon which a decoder would rely - is a pair of single-character interfaces, sketched at the end of this post (next() could be called get(), or read(), or whatever).

For comparison, look at the way Walter's format() function uses an underlying put() function to write a single character. He /could/ have used strings throughout, but he recognised (correctly) that the one-byte-at-a-time approach was conceptually at a lower level. Strings can then be handled /in terms of/ those lower-level functions.

With these two interfaces, you can put together the concept of a decoder, and a /specific/ decoder can derive from it - again, see the sketch below. It could be implemented more efficiently, but I wrote it that way to illustrate the point that the decoder - not the caller - is the only entity capable of knowing the length of the byte sequence corresponding to the next (dchar) character.

So, NOW, if you want to plug this into a std.Stream, you make a byte-source adapter for the stream and then simply make the magic decoder from it (the last part of the sketch). And similarly for mango streams, InputStreams, strings, and so on. Strings are just not sufficiently low-level. We can rely on the compiler to inline these very simple functions.

Encoding - the reverse process - would follow a similar pattern. You wouldn't need hasMore(), but something like done() or close() might be appropriate to indicate that you've finished.

Arcane Jill
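Here is the sketch referred to above (the interface and class names are placeholders; std.utf does the validation for me):

import std.stream;
import std.utf;

// The minimal functionality a decoder must provide: one Unicode
// character at a time.
interface UnicodeSource
{
    dchar next();
    bool hasMore();
}

// The minimal functionality a decoder relies upon: one byte at a time.
interface ByteSource
{
    ubyte next();
    bool hasMore();
}

// The concept of a decoder: a UnicodeSource built on a ByteSource.
abstract class Decoder : UnicodeSource
{
    protected ByteSource source;

    this(ByteSource source) { this.source = source; }

    abstract dchar next();
    bool hasMore() { return source.hasMore(); }
}

// A specific decoder. Only the decoder can know how many bytes make up
// the next character; in UTF-8 the lead byte tells us.
class Utf8Decoder : Decoder
{
    this(ByteSource source) { super(source); }

    dchar next()
    {
        char c = cast(char) source.next();
        int trailing;
        if (c < 0x80) trailing = 0;
        else if (c < 0xE0) trailing = 1;   // illegal lead bytes are
        else if (c < 0xF0) trailing = 2;   // caught by toUTF32 below
        else trailing = 3;

        char[] seq;
        seq ~= c;
        for (int i = 0; i < trailing; i++)
            seq ~= cast(char) source.next();

        return std.utf.toUTF32(seq)[0];
    }
}

// Plugging it into a std.Stream:
class StreamByteSource : ByteSource
{
    private Stream s;

    this(Stream s) { this.s = s; }

    ubyte next() { ubyte b; s.read(b); return b; }
    bool hasMore() { return !s.eof(); }
}

And then simply make the magic decoder like so:

UnicodeSource u = new Utf8Decoder(new StreamByteSource(myStream));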
Aug 16 2004
next sibling parent reply teqDruid <me teqdruid.com> writes:
On Mon, 16 Aug 2004 09:34:36 +0000, Arcane Jill wrote:

 In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...
 
Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);
where both return the number of bytes converted (or something like that).
...
 (3) /This is most important/. In the typical scenario, the caller will be
 reading bytes from some source - which /could/ be a stream - and will want to
 get a single dchar. We're talking about a "get the next Unicode character"
 function, which is about as low level as it gets (in terms of functionality).
 But you can't build such a function out of your string routines, because you
 have no way of knowing in advance how many bytes will need to be consumed from
 the stream in order to build one character. So what do you do? Read too many and
 then put some back? Not all byte sources will allow you to "put back" or
 "unconsume" bytes.
<snip>
 Arcane Jill
Understood. This code looks reasonably agnostic, and even simple enough to use. The only difference is in thinking - streams vs strings.

I might note, however, that you use:

dchar[] toUTF32(char[] s);

Which could also be written as:

int toUTF32(char[] s, out dchar[]);

Which looks very similar to:

int utf8ToDChar (char[] input, dchar[] output);

This is the function that I would define as implementing the "core" functionality. You then (to quote myself) "use those to create the readers and writers." The stream implementation is a bit more complex than I imagined, but I can chalk that up to a total lack of experience with variable-width character encodings. (And hey, I'm a first-year undergrad... what'dya expect?)

John
Aug 16 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <pan.2004.08.16.18.58.22.898270 teqdruid.com>, teqDruid says...

Understood. This code looks reasonably agnostic, and even simple enough to use.
The only difference is in thinking - streams vs strings.
Yes. That's because a string can always be viewed as a stream, but a stream cannot always be viewed as a string.
I might
note, however that you use:
dchar[] toUTF32(char[] s);
Which could also be written as:
int toUTF32(char[] s, out dchar[]);
Actually I was just calling the function in std.utf. For any other encoding, I probably would have inlined the code right there, rather than written a function, but I figured, why re-invent stuff? std.utf.toUTF32() throws exceptions if the input is wrong, so it's just what you'd need in this circumstance. (The tests I made to determine the length didn't weed out illegal sequences - I was relying on std.utf to do that for me).
Which looks very similar to:
int utf8ToDChar (char[] input, dchar[] output);

This is the function that I would define as implementing the "core"
functionality.
Fair enough. Guess it just depends what you call "core". The main thing is the dispatch mechanism.
The stream implementation is a bit more complex than I imagined, but I can
blame that up to a total lack of experience with variable-width character
encodings.
There's more. Some encodings are not merely variable-width, but are also /stateful/. Consider UTF-7. A UTF-7 stream is always in one of two states: "ASCII" or "Radix 64". A '+' character in the stream changes the state to "Radix 64", and a '-' character changes the state back to "ASCII". A UTF-7 decoder needs to be aware at all times of the state of the stream. Incoming bytes are interpreted differently (as though they were two entirely different encodings) depending on the stream state. A function such as:
int utf7ToDchar (char[] input, dchar[] output);
just wouldn't do the job, because it doesn't preserve/know the state of the stream. You'd need a class, with a member variable to contain the current state of the stream (unless you wanted to use a global variable to store the state - yuk!). Something like the skeleton below.

So, in general, basing your architecture on a set of functions with similar signatures just wouldn't be adequate to do the job.

Arcane Jill
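To make the state issue concrete, here is a skeleton (the names are hypothetical, the radix-64 arithmetic is elided, and ByteSource is the interface from my earlier sketch - only the state handling matters here):

// The decoder object remembers which mode the stream is in between calls -
// exactly what a free function with the above signature cannot do.
class Utf7Decoder
{
    private bool radix64;   // false = "ASCII" state, true = "Radix 64" state

    dchar next(ByteSource source)
    {
        for (;;)
        {
            ubyte b = source.next();
            if (!radix64)
            {
                if (b == '+') { radix64 = true; continue; }
                return cast(dchar) b;               // plain ASCII character
            }
            if (b == '-') { radix64 = false; continue; }
            return decodeRadix64(b, source);        // hypothetical helper
        }
    }

    // Would accumulate base-64 sextets into UTF-16 code units; elided here.
    private dchar decodeRadix64(ubyte b, ByteSource source)
    {
        throw new Exception("radix-64 decoding elided from this sketch");
    }
}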
Aug 17 2004
prev sibling next sibling parent reply "antiAlias" <gblazzer corneleus.com> writes:
Confusion abounds! I follow you Jill, but please don't underestimate the
usefulness of D arrays. I'll try to explain as we go along ...


"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfpv3c$2253$1 digitaldaemon.com...
 In article <cfou0o$1a91$1 digitaldaemon.com>, antiAlias says...

Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);
where both return the number of bytes converted (or something like that).
 That would be bad. I think it's possible you haven't understood the issues, so
 I'll try to explain in this post what some of them are, and why you would want
 to do certain things in certain ways.


I think it's perhaps best to make these kind of things completely independent
of any other layer, if at all possible.
I don't have any problem with that.
These also happen to be the kind of functions that might be worth optimizing
with a smattering of assembly ...
 I disagree. Transcoding almost never happens in performance-critical code. It
 happens during input and output. A typical scenario is to get input from a
 console and then decode it, or to encode a string and then write it to a file.
 The CPU time utilized in the I/O will outweigh the time spent transcoding by a
 very large factor. Of course it still makes sense to do this efficiently, but
 assembler - given that it's not portable, decreases maintainability, etc. - is
 probably going a bit too far.
What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently. The latter still matters, and perhaps always will. Still, it was just a suggestion.
 Okay, back to these function signatures:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);
 (1) The encoding is not necessarily known at compile time. This problem would
 also exist had you used classes/interfaces, of course, but at least with
 classes or interfaces instead of plain functions, you can rely on polymorphism
 and factory methods to do the dispatching, giving you a single point of
 decision. Functions like the above would lead to switch statements all over
 the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs
 "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only with a single point of decision
 can you enforce the IANA encoding names, case conventions, etc..
Agreed. I wouldn't presume to fashion a "complete" solution on /this/ NG <g>. Thus, encoding was deliberately omitted to clarify the means of getting data into and out of these converters. As far as encoding-names go, I would have expected such converters to be implemented as methods in a class; the constructor would be given the encoding identifier.
 I see that in "charset.d" you made the encoding name a runtime parameter - but
 that too is bad, partly because you don't have a single point of decision, but
 partly also because you're now having to make that runtime check with /every/
 fragment of text - not merely at construction time.
Not sure what you mean. I've never written anything called "charset.d" ... besides, you can safely assume that efficiency is important to me.
 (2) (Trivial) you forgot "out" on the output variables. You cannot expect the
 caller to be aware in advance of the resulting required buffer size.
Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance).

The caller has already said "here's a set of input data, and here's a place to put the output. Convert what you can within the constraints of input & output limits, and tell me the resultant outcome". If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted.

Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.
 (3) /This is most important/. In the typical scenario, the caller will be
 reading bytes from some source - which /could/ be a stream - and will want to
 get a single dchar. We're talking about a "get the next Unicode character"
 function, which is about as low level as it gets (in terms of functionality).
 But you can't build such a function out of your string routines, because you
 have no way of knowing in advance how many bytes will need to be consumed from
 the stream in order to build one character. So what do you do? Read too many
 and then put some back? Not all byte sources will allow you to "put back" or
 "unconsume" bytes.
 In fact, the minimal functionality that a decoder requires, is this:
<snip>
 (next() could be called get(), or read(), or whatever). The minimal
 functionality upon which a decoder would rely, is this:
<snip>
 For comparison, look at the way Walter's format() function uses an underlying
 put() function to write a single character. He /could/ have used strings
 throughout, but he recognised (correctly) that the one-byte-at-a-time approach
 was conceptually at a lower level. Strings can then be handled /in terms of/
 those lower-level functions.
There are several valid ways to skin that particular cat <g>

<snip>

Here's a fuller implementation of the array approach (in pseudo-code):

class Transcoder
{
      this (char[] encoding) {...}

      dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
      {
          while (room_for_more_output)
                    while (enough_input_for_another_dchar)
                              do_actual_conversion_into_output_buffer;

          emit_quantity_of_input_consumed;
          return_slice_of_output_representing_converted_dchars;
      }

      char[] toUtf8 (dchar[] input, char[] output, out int consumed)
      {
          while (room_for_more_output)
                    while (enough_input_for_another_char)
                              do_actual_conversion_into_output_buffer;

          emit_quantity_of_input_consumed;
          return_slice_of_output_representing_converted_chars;
      }
}

This would be wrapped at some higher level such as within a Phobos Stream, or a Mango Reader/Writer, to handle the mapping of arrays to variables. The benefit of this approach is its throughput, and the ability for the 'controller' to direct the input and output arrays to anywhere it likes (including scalar variables), leading to further efficiencies. Functions such as these do not need to be exposed to the typical programmer. In fact, I vaguely recall Java has something along these lines that's hidden in some sun.x.x library, which the Java Streams utilize at some level.

A variation on the theme might initially provide a buffer to house the conversion output instead. There are pros and cons to both approaches. In this case, you'd probably want to split the transcoding into separate encoding and decoding:

class Decoder
{
      private dchar[] unicode;

      this (char[] encoding, dchar[] output)
      {
          do_something_with_encoding;
          unicode = output;
      }

      this (char[] encoding, int outputSize)
      {
          this (encoding, new dchar[outputSize]);
      }

      dchar[] convert (char[] input, out int consumed)
      {
          while (room_for_more_output_in_output_buffer)
                    while (enough_input_for_another_dchar)
                              do_actual_conversion_into_output_buffer;

          emit_quantity_of_input_consumed;
          return_slice_of_output_representing_converted_dchars;
      }
}

class Encoder
{
      // similar approach to Decoder
}

These are just suggestions, to take or leave at one's discretion.
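For the trivial Latin-1 case, convert() boils down to a bounded copy - concrete D rather than pseudo-code (the class name is hypothetical, and the encoding dispatch is omitted):

// Latin-1 needs no state and no multi-byte sequences, so convert()
// degenerates to a bounded copy with a widening cast.
class Latin1ArrayDecoder
{
    dchar[] convert (ubyte[] input, dchar[] output, out int consumed)
    {
        int n = cast(int) (input.length < output.length ? input.length
                                                        : output.length);
        for (int i = 0; i < n; i++)
            output[i] = cast(dchar) input[i];
        consumed = n;
        return output[0 .. n];
    }
}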
Aug 16 2004
next sibling parent "antiAlias" <gblazzer corneleus.com> writes:
 class Transcoder
 {
       this (char[] encoding) {...}

       dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
       {
           while (room_for_more_output)
                     while (enough_input_for_another_dchar)
                               do_actual_conversion_into_output_buffer;

           emit_quantity_of_input_consumed;
           return_slice_of_output_representing_converted_dchars;
       }
 }
Whoops! Those twin while loops should, of course, be a single while() with an && between the two conditions.
Aug 16 2004
prev sibling parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfr0tf$2pgm$1 digitaldaemon.com>, antiAlias says...

These also happen to be the kind of
functions that might be worth optimizing with a smattering of assembly
The CPU time utilized in the I/O will outweigh the time spent transcoding by a
very large factor.
What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently.
I would say that here the time spent getting the web page from the server to client across the internet will outweigh the time spent encoding by many orders of magnitude. But I'm not /against/ efficiency. If people want to recode this stuff in assembler then obviously I'm not going to object.
Not sure what you mean. I've never written anything called "charset.d" ...
besides, you can safely assume that efficiency is important to me.
I think I was confusing you with Nick. My bad.
 (2) (Trivial) you forgot "out" on the output variables. You cannot expect the
 caller to be aware in advance of the resulting required buffer size.
Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance). The caller has already said "here's a set of input data, and here's a place to put the output. Convert what you can within the constraints of input & output limits, and tell me the resultant outcome". If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted. Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.
Gotcha. Sorry - I misinterpretted the intent of the function signatures.
Wholly agreed: pushback is a big "no no". But it's not an issue when using a
pair of arrays in the suggested manner.
Wholly agreed.
There are several valid ways to skin that particular cat <g>
Here's a fuller implementation of the array approach (in pseudo-code)

 <snip>

This would be wrapped at some higher level such as within a Phobos Stream,
or a Mango Reader/Writer, to handle the mapping of arrays to variables. The
benefit of this approach is it's throughput, and the ability for the
'controller' to direct the input and output arrays to anywhere it likes
(including scalar variables), leading to further efficiencies. Functions
such as these do not need to be exposed to the typical programmer. In fact,
I vaguely recall Java has something along these lines that's hidden in some
sun.x.x library, which the Java Streams utilize at some level.

A variation on the theme might initially provide a buffer to house the
conversion output instead. There's pros and cons to both approaches. In this
case, you'd probably want to split the transcoding into separate encoding
and decoding:

 <snip>

These are just suggestions, to take or leave at one's discretion.
They are good suggestions. They have the benefit of efficiency without losing generality. They have the disadvantage of having a slightly confusing signature, but good documentation should solve that. Nice one.

Arcane Jill
Aug 17 2004
prev sibling parent reply Ben Hinkle <bhinkle4 juno.com> writes:
 (3) /This is most important/. In the typical scenario, the caller will be
 reading bytes from some source - which /could/ be a stream - and will want
 to get a single dchar. We're talking about a "get the next Unicode
 character" function, which is about as low level as it gets (in terms of
 functionality). But you can't build such a function out of your string
 routines, because you have no way of knowing in advance how many bytes
 will need to be consumed from the stream in order to build one character.
 So what do you do? Read too many and then put some back? Not all byte
 sources will allow you to "put back" or "unconsume" bytes.
std.stream supports ungetc, which pushes a character back by maintaining an array of pushed-back characters. Right now only the text functions check this array for content, though. I think the idea was that if one is storing text and binary data mixed together, the text is stored with writeString, which puts a length byte followed by the text.
Aug 17 2004
parent Sean Kelly <sean f4.ca> writes:
In article <cfsu27$122d$1 digitaldaemon.com>, Ben Hinkle says...
 (3) /This is most important/. In the typical scenario, the caller will be
 reading bytes from some source - which /could/ be a stream - and will want
 to get a single dchar. We're talking about a "get the next Unicode
 character" function, which is about as low level as it gets (in terms of
 functionality). But you can't build such a function out of your string
 routines, because you have no way of knowing in advance how many bytes
 will need to be consumed from the stream in order to build one character.
 So what do you do? Read too many and then put some back? Not all byte
 sources will allow you to "put back" or "unconsume" bytes.
For the record, this is exactly what my mods to std.utf are for. In fact, unFormat and my stream mods already use them.
std.stream supports ungetc, which pushes a character back by maintaining an
array of pushed-back characters. Right now only the text functions check
this array for content, though. I think the idea was that if one is storing
 text and binary data mixed together, the text is stored with
 writeString, which puts a length byte followed by the text.
Most stream routines allow for at least one byte to put back. Obviously this isn't possible in all cases, but it *is* always possible to carry an unget buffer around with the stream, as std.stream already does. Only the formatted routines check this area for content and I consider that correct behavior, as some translation may have been done between the stream and the buffer. Sean
Aug 17 2004
prev sibling next sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfog97$12n2$1 digitaldaemon.com...
 There have been loads and loads of discussions in recent weeks about Unicode,
 streams, and transcodings. There seems to be a general belief that "things are
 happening", but I'm not quite clear on the specifics - hence this post, which
 is basically a question.
What I am excited about is that D is becoming the premier language to do Unicode in, by a wide margin. And that's thanks to you guys!
Aug 15 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cfog97$12n2$1 digitaldaemon.com>, Arcane Jill says...
I also need a bit of educating on the future of D's streams. Are we going to get
separate InputStream and OutputStream interfaces, or what?
I'd like to have them. In fact my partial rewrite of stream.d already has this.
Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't
mind either way, but if Phobos is going to go off in some completely tangential
direction, I want to know that too.
Hard to answer as I don't really know what will happen with Phobos in the long term, however...

I would like to see unFormat/readf get into Phobos, though that may have to wait for TypeInfo to be working for pointers, since the current calling convention is still a bit inconsistent with writef (ie. it requires a format string as the first argument, just like scanf). I could work around this with a big if/else block to get the underlying type of pointer arguments, but I'd prefer to just work off classinfo.name like Walter does for doFormat. Perhaps I'll just drop that into a separate function and replace it later when TypeInfo gets fixed.

As for my std.stream rewrite... I like it better than what's in std.stream now, but I have no idea what will sort out in the long term. Is adopting Mango.io a better idea? Perhaps streams should be dropped from Phobos completely? I consider my version of stream.d to be more of a prototype than a full-featured replacement.
So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't
been done yet, basically because we haven't agreed on an architecture, and I for
one am not really sure who's doing it anyway.
Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16. My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate. Both of these functions use the functions in std.utf for conversion. Is this enough to start with?
Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders
yet, or is it still up in the air?
I guess that depends on what still needs to be done.
But I do think we should nail down the architecture soon, as we're getting a lot
of questions and discussion on this. But one thing at a time. Someone tell me
where streams are going (with regard to above questions) and then I'll have more
suggestions.
Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode. I think doFormat/unFormat might be the answer to this, but I don't know the remaining issues well enough to say for sure. Sean
Aug 16 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfrc06$31g8$1 digitaldaemon.com>, Sean Kelly says...

I also need a bit of educating on the future of D's streams. Are we going to get
separate InputStream and OutputStream interfaces, or what?
I'd like to have them. In fact my partial rewrite of stream.d already has this.
Apparently, so does Phobos, although I didn't know that at the time I posted the question. Now isn't that cute - an interface with an undocumented interface!
Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't
mind either way, but if Phobos is going to go off in some completely tangential
direction, I want to know that too.
Hard to answer as I don't really know what will happen with Phobos in the long term, however...
Okay - I just wasn't sure if you were working for Walter in some capacity. Forgive the dumb question.
As for my std.stream rewrite... I like it better than what's in std.stream now
It's hard to know what's in std.stream now without reading the source. I /really/ wish someone would document it.
but I have no idea what will sort out in the long term.  Is adopting Mango.io a
better idea?
Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library depends on a different third-party library, things start to get messy).
Perhaps streams should be dropped from Phobos completely?
Perhaps, but I find it unlikely that that will happen. Only Walter is empowered to do that.
I
consider my version of stream.d to be more of a prototype than a full-featured
replacement.
Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"? If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?
Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or
UTF-16.  My unFormat will read UTF-8 and UTF-16 following the same convention as
std.doFormat and will convert everything to char/wchar/dchar strings as
appropriate.  Both of these functions use the functions in std.utf for
conversion.  Is this enough to start with?
It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.
Frankly, I can live without streams so long as there is *some* way to do
formatted i/o that can handle Unicode.
Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.
I think doFormat/unFormat might be the
answer to this, but I don't know the remaining issues well enough to say for
sure.

Sean
Well, thanks. I think I've got a picture now of what's going on. I'll post a summary shortly, then we can start calling for volunteers for the missing bits. Jill
Aug 17 2004
next sibling parent "antiAlias" <fu bar.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfscq6$s8u$1 digitaldaemon.com...
 std.stream. But I don't know what that future is. The mango folk have said that
 they don't want mango.io moved into std, so it will always be an external
 library. That isn't a problem for applications, of course, since mango is free

Please allow me to clarify: from recollection, the position has always been that both mango.io and std.streams should be excluded from Phobos. I think that was Matthew's position also. If mango.io turns out to be a better solution, then it can certainly move from its current home if that's what people want; but I'm not holding my breath waiting for a consensus on that one <g>

BTW; mango.io is completely independent from the rest of the Mango Tree (has no dependencies), so it can be easily cut away. In fact, it's almost totally independent of Phobos too ...
 and open-source, but it might be considered a problem for libraries (- when one
 third-party library depends on a different third-party library, things start
 to get messy).
Right -- that thread regarding placing the .lib dependencies inside the D source-code might help out with this (to the extent that it can).
Aug 17 2004
prev sibling next sibling parent stonecobra <scott stonecobra.com> writes:
Arcane Jill wrote:

but I have no idea what will sort out in the long term.  Is adopting Mango.io a
better idea?
Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (- when one third-party library dependends on a different third-party library, things start to get messy).
Or, since it is open source, you can just compile it in à la std.* and not have a library dependency.

Scott
Aug 17 2004
prev sibling parent Sean Kelly <sean f4.ca> writes:
In article <cfscq6$s8u$1 digitaldaemon.com>, Arcane Jill says...
In article <cfrc06$31g8$1 digitaldaemon.com>, Sean Kelly says...

I
consider my version of stream.d to be more of a prototype than a full-featured
replacement.
Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"? If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?
It's only a prototype in the sense that I haven't really finished it yet. There are some notable functions missing (like ignore), etc. If people are interested then I'll flesh it out a bit. I don't have a ton of free time so I figured I'd see what the response was before I worked any more on it.
Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or
UTF-16.  My unFormat will read UTF-8 and UTF-16 following the same convention as
std.doFormat and will convert everything to char/wchar/dchar strings as
appropriate.  Both of these functions use the functions in std.utf for
conversion.  Is this enough to start with?
It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.
Frankly, I can live without streams so long as there is *some* way to do
formatted i/o that can handle Unicode.
Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.
unFormat operates entirely in terms of dchars. So the width modifiers are in terms of UTF-32 characters, etc. But I agree. If doFormat doesn't work this way then it probably should. The results are unpredictable otherwise.

Sean
Aug 17 2004