
digitalmars.D - Transcoding - Summary

Arcane Jill <Arcane_member pathlink.com> writes:
We have two separate problems:
(1) formatted I/O
(2) unformatted I/O

For unformatted I/O, we need the ability to read a sequence of dchars from some
source, and the ability to write a sequence of dchars to some sink. The class
which acts as a dchar source must perform decoding from some underlying ubyte
source. The class which acts as a dchar sink must perform encoding to some
underlying ubyte sink.

The source and sink could be anything - a string; a console; a file; a socket; -
even a simple counter which counts bytes and throws away data. So, to keep
things generic, I shall use the terms "ubyte source", "ubyte sink", "dchar
source" and "dchar sink". The traditional terms are:

ubyte source = input stream
ubyte sink = output stream
dchar source = reader
dchar sink = writer

(I'm using new terms merely in order to avoid confusion with objects in
std.stream, mango.io, and Java).
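As a rough illustration of the four roles, here is a minimal sketch in present-day D. Every name in it (UbyteSource, UbyteSink, DcharSource, DcharSink, CountingSink, ArraySource) is hypothetical, and the signatures are assumptions; nothing in this post fixes them, and none of this is an existing Phobos or Mango API. It includes the "simple counter which counts bytes and throws away data" as the smallest possible sink.

```d
// Hypothetical interfaces for the four roles described above.
interface UbyteSource { size_t read(ubyte[] buf); }        // bytes read; 0 at end of input
interface UbyteSink   { void write(const(ubyte)[] data); }
interface DcharSource { bool next(out dchar c); }          // false at end of input
interface DcharSink   { void put(dchar c); }

/// The "simple counter" sink: counts bytes and throws the data away.
class CountingSink : UbyteSink
{
    size_t count;
    void write(const(ubyte)[] data) { count += data.length; }
}

/// A ubyte source backed by an in-memory array (assumed for the demo).
class ArraySource : UbyteSource
{
    private const(ubyte)[] data;
    this(const(ubyte)[] data) { this.data = data; }

    size_t read(ubyte[] buf)
    {
        import std.algorithm.comparison : min;
        immutable n = min(buf.length, data.length);
        buf[0 .. n] = data[0 .. n];   // copy what fits
        data = data[n .. $];          // advance past the bytes handed out
        return n;
    }
}

unittest
{
    // Pump a source into a sink through a small buffer.
    ubyte[] bytes = [1, 2, 3, 4, 5];
    auto src = new ArraySource(bytes);
    auto sink = new CountingSink;
    ubyte[2] buf;
    size_t n;
    while ((n = src.read(buf[])) != 0)
        sink.write(buf[0 .. n]);
    assert(sink.count == 5);
}
```

The pump loop knows nothing about what the source or sink really is, which is the point of keeping the terms generic.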

For formatted I/O, we need:
(1a) a replacement for printf() which emits a formatted sequence of dchars to an
arbitrary dchar sink
(1b) a replacement for scanf() which parses a sequence of dchars obtained from
an arbitrary dchar source

Further, for reasons of internationalization, our printf replacement must be
able to random-access its variadic arguments.

Observe that if the output of (1a) is plumbed into an encoder, and the input to
(1b) is plumbed into a decoder, then formatted transcoding is achieved. This
makes our printf/scanf replacements relatively easy to write. They are likely to
require very little modification from the existing format()/unformat() routines,
with essentially the only difference being that they must be dchar-based, not
char-based. (Random-access of the arguments would be a new feature, however,
though not necessarily an urgent one).
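The plumbing described above can be sketched roughly as follows. This is only an assumption about how the pieces might fit: Utf16LeEncoder and writeFormatted are hypothetical names, the encoder appends to an array where a real one would forward to a ubyte sink, and today's std.format.format stands in for the dchar-based formatter that does not exist yet.

```d
import std.format : format;
import std.utf : encode;

/// Hypothetical dchar sink that encodes each code point as UTF-16LE bytes.
struct Utf16LeEncoder
{
    ubyte[] output; // stands in for the underlying ubyte sink

    void put(dchar c)
    {
        wchar[2] units;
        immutable n = encode(units, c); // 1 unit, or 2 for a surrogate pair
        foreach (u; units[0 .. n])
        {
            output ~= cast(ubyte)(u & 0xFF); // low byte first (little-endian)
            output ~= cast(ubyte)(u >> 8);
        }
    }
}

/// Format first, then feed the result to the sink one dchar at a time.
void writeFormatted(Sink, Args...)(ref Sink sink, string fmt, Args args)
{
    foreach (dchar c; format(fmt, args)) // foreach decodes UTF-8 into dchars
        sink.put(c);
}

unittest
{
    Utf16LeEncoder enc;
    writeFormatted(enc, "%s=%d", "\u00E9", 7);
    ubyte[] expected = [0xE9, 0x00, 0x3D, 0x00, 0x37, 0x00];
    assert(enc.output == expected);
}
```

The formatter never learns which encoding is in use; swapping the encoder changes the output bytes without touching the formatting code.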

Another oft-voiced requirement is that transcoding be independent of any
particular string/stream implementation. (I suspect that if Phobos streams were
fully-featured, fully-documented, bug-free and intuitive, then nobody would be
asking for this requirement. But as things are, the requirement is there).

So ... listed below are the jobs which need to be done. Volunteers are requested
for any unclaimed jobs:

(1) The source and sink interfaces need to be nailed down.
(2) Given (1), dchar-based format()/unformat() replacements can be written.
(3) Given (1), encoder and decoder classes/interfaces can be written.
(4) Given (3), classes can be written to attach our encoders/decoders to std and
mango streams, to strings, etc.
(5) Given (3), encoders and decoders for SPECIFIC encodings can now be written.
(6) Will somebody /please/ document std.Stream?

I volunteer for (1) and (3). I'm hoping Sean will volunteer for (2). AntiAlias's
excellent ideas for throughput enhancement using buffers are part of (1) and
(3), so I suggest AntiAlias and I send each other code back and forth until we
are both happy with it.

Volunteers still needed for (4), (5) and (6) (though (4) and (5) are dependent
upon (3)). Anyone who's a dab hand at Wiki might like to volunteer for (6).

Arcane Jill
Aug 17 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfsm6d$va0$1 digitaldaemon.com>, Arcane Jill says...

(1) The source and sink interfaces need to be nailed down.
(2) Given (1), dchar-based format()/unformat() replacements can be written.
(3) Given (1), encoder and decoder classes/interfaces can be written.
(4) Given (3), classes can be written to attach our encoders/decoders to std and
mango streams, to strings, etc.
(5) Given (3), encoders and decoders for SPECIFIC encodings can now be written.
(6) Will somebody /please/ document std.Stream?

Nick, I think your work falls into category (5). If you want that job, I guess it's yours, but if so, please wait for (3) before you start.

Jill
Aug 17 2004
Derek <derek psyc.ward> writes:
On Tue, 17 Aug 2004 10:21:01 +0000 (UTC), Arcane Jill wrote:

 We have two separate problems:
 (1) formatted I/O
 (2) unformatted I/O
 

I hope I'm not stating the bleeding obvious, but you are talking about TEXT I/O, aren't you? There is also a lot of other I/O that is not text based - sound and image files, databases, etc...

--
Derek
Melbourne, Australia
Aug 17 2004
"Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfsm6d$va0$1 digitaldaemon.com...
 Further, for reasons of internationalization, our printf replacement must
 be able to random-access its variadic arguments.

I disagree with this requirement. It breaks the nice way that std.format works. The only place where reordering the arguments is useful is in date/time formatting, and a specialized formatter would be suitable for that (and there are many other nice things one can do with a specialized date/time formatter).
Aug 17 2004
Regan Heath <regan netwin.co.nz> writes:
On Tue, 17 Aug 2004 15:00:28 -0700, Walter <newshound digitalmars.com> 
wrote:
 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cfsm6d$va0$1 digitaldaemon.com...
  Further, for reasons of internationalization, our printf replacement
  must be able to random-access its variadic arguments.

 I disagree with this requirement. It breaks the nice way that std.format
 works. The only place where reordering the arguments is useful is in
 date/time formatting, and a specialized formatter would be suitable for
 that (and there are many other nice things one can do with a specialized
 date/time formatter).

Did you miss the thread that mentioned that sentence structure differs
between languages?
Example:

  english :- "The DOG is BIG"
  other   :- ".. BIG .. DOG"

(I don't actually know any other languages)

So, it would be kind of useful to be able to define the format strings as:

  english :- "The $1 is $2"
  other   :- ".. $2 .. $1"

and be able to go:

  printf(format[lang_id],"DOG","BIG");

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 17 2004
Russ Lewis <spamhole-2001-07-16 deming-os.org> writes:
Regan Heath wrote:
 Did you miss the thread that mentioned that sentence structure in 
 various languages differ?
 Example:
 
   english :- "The DOG is BIG"
   other   :- ".. BIG .. DOG"
 
 (I don't actually know any other languages)
 
 So, it would be kind of useful to be able to define the format strings as:
 
   english :- "The $1 is $2"
   other   :- ".. $2 .. $1"
 
 and be able to go:
 
   printf(format[lang_id],"DOG","BIG");

This isn't strictly a requirement of the formatting tools. Perhaps a
library function which, given a number of varargs, reordered them and
passed them to another function?

Your code could look (very roughly) like this:

     char[] formatString  = LookupNLSFormat (msgID, language);
     char[] reorderString = LookupNLSReorder(msgID, language);
     vwritef(formatString, doArgumentReorder(reorderString, <args>));

The advantage here is that you can do reordering for NLS support but
writef stays simple.
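To make the reordering step concrete, here is a minimal sketch in present-day D. It is an assumption about what doArgumentReorder might do, not something specified above: reorderArgs is a hypothetical name, the reorder information is taken to be a list of 1-based argument indices, and the arguments are restricted to strings to keep the example short.

```d
import std.array : join;

/// Hypothetical stand-in for doArgumentReorder: `order` lists, for each
/// output slot, the 1-based index of the argument that should appear there.
string[] reorderArgs(const size_t[] order, string[] args)
{
    auto result = new string[order.length];
    foreach (slot, index; order)
        result[slot] = args[index - 1];
    return result;
}

unittest
{
    string[] args = ["la", "rouge", "maison"];
    size_t[] order = [1, 3, 2];          // assumed to come from the NLS lookup
    auto reordered = reorderArgs(order, args);
    assert(reordered == ["la", "maison", "rouge"]);
    assert(reordered.join(" ") == "la maison rouge");
}
```

The formatter itself is untouched, but the format string and the order list now have to be kept in step for every message and language.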
Aug 17 2004
Regan Heath <regan netwin.co.nz> writes:
On Tue, 17 Aug 2004 19:45:47 -0700, Russ Lewis 
<spamhole-2001-07-16 deming-os.org> wrote:
 Regan Heath wrote:
 Did you miss the thread that mentioned that sentence structure in 
 various languages differ?
 Example:

   english :- "The DOG is BIG"
   other   :- ".. BIG .. DOG"

 (I don't actually know any other languages)

 So, it would be kind of useful to be able to define the format strings 
 as:

   english :- "The $1 is $2"
   other   :- ".. $2 .. $1"

 and be able to go:

   printf(format[lang_id],"DOG","BIG");

 This isn't strictly a requirement of the formatting tools. Perhaps a
 library function which, given a number of varargs, reordered them and
 passed them to another function?

 Your code could look (very roughly) like this:

      char[] formatString  = LookupNLSFormat (msgID, language);
      char[] reorderString = LookupNLSReorder(msgID, language);
      vwritef(formatString, doArgumentReorder(reorderString, <args>));

 The advantage here is that you can do reordering for NLS support but
 writef stays simple.

The disadvantage being that the above idea is harder to maintain, there
are 2 things that define how the message is displayed, 2 things in which
a mistake could be made, 2 things in which you have to make changes, ..

How hard or complex is it to implement a writef that can do:

  writef("The %1 is %2","dog","big");

(%1 and %2 can be changed to any symbol that fits with the current symbol
set used in writef)

I can't see it being a particularly big leap from what it currently does.

Also consider:

  writef("A really long %1 that contains the same %1 several times. %1's like this could be quite common, yes?","string");

Regan
Aug 17 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <opscwr50rl5a2sq9 digitalmars.com>, Regan Heath says...
On Tue, 17 Aug 2004 19:45:47 -0700, Russ Lewis 

 This isn't strictly a requirement of the formatting tools.  Perhaps a 
 library function which, given a number of varargs, reordered them and 
 passed them to another function?

 Your code could look (very roughly) like this:

      char[] formatString  = LookupNLSFormat (msgID, language);
      char[] reorderString = LookupNLSReorder(msgID, language);
      vwritef(formatString, doArgumentReorder(reorderString, <args>));

 The disadvantage being that the above idea is harder to maintain, there
 are 2 things that define how the message is displayed, 2 things in which
 a mistake could be made, 2 things in which you have to make changes, ..

 How hard or complex is it to implement a writef that can do:

   writef("The %1 is %2","dog","big");

 (%1 and %2 can be changed to any symbol that fits with the current symbol
 set used in writef)

 I can't see it being a particularly big leap from what it currently does.

 Also consider:

   writef("A really long %1 that contains the same %1 several times. %1's like this could be quite common, yes?","string");

Well, I didn't mean to cause trouble here. :)

Anyway. I'm agreeing with Regan, and slightly disagreeing with Walter. There /is/ a need to be able to do:

#    // English
#    article = "the";
#    adjective = "red";
#    noun = "house";
#    formatString = "%s %s %s"; // default order
#
#    // French
#    article = "la";
#    adjective = "rouge";
#    noun = "maison";
#    formatString = "%(1)s %(3)s %(2)s";
#
#    writef(formatString, article, adjective, noun);

Sorry, but that's a requirement. It's not an /urgent/ requirement, but you can bet vast sums of money that internationalization will start to become more and more of an issue once other transcoding issues have been dealt with.

Russ's idea is good, but obviously not /as/ good as simply coming up with an improved printf() replacement. Right now, POSIX-printf() can do this random-access, but D's writef() can't. It's not urgent, and we'll solve it in time. But it /is/ an internationalization issue, and it won't go away.

Arcane Jill
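For comparison, later releases of Phobos accept the POSIX-style positional form (%1$s rather than the %(1)s spelling used above) in std.format. The sketch below assumes such a release, so it does not describe the writef() available at the time of this thread; the word lists are the ones from the example above.

```d
import std.format : format;

unittest
{
    // The arguments are always passed in the same order;
    // only the format string changes per language.
    string en = format("%1$s %2$s %3$s", "the", "red", "house");
    string fr = format("%1$s %3$s %2$s", "la", "rouge", "maison");
    assert(en == "the red house");
    assert(fr == "la maison rouge");

    // A positional argument can also be used more than once.
    assert(format("%1$s, %1$s and %1$s again", "string")
           == "string, string and string again");
}
```

With this form the word order lives entirely in the format string, so a translator only has one thing to edit per message.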
Aug 18 2004
Derek Parnell <derek psych.ward> writes:
On Tue, 17 Aug 2004 15:00:28 -0700, Walter wrote:

 "Arcane Jill" <Arcane_member pathlink.com> wrote in message
 news:cfsm6d$va0$1 digitaldaemon.com...
 Further, for reasons of internationalization, our printf replacement must
 be able to random-access its variadic arguments.

I disagree with this requirement. It breaks the nice way that std.format works. The only place where reordering the arguments is useful is in date/time formatting, and a specialized formatter would be suitable for that (and there are many other nice things one can do with a specialized date/time formatter).

I think that AJ was suggesting that there exists a business need for a type of formatter that can express, in its template, the order in which arguments will appear in the resultant string, regardless of the order in which they are presented to the formatter.

For example (contrived for simplicity):

   char[] Msg;
   char[] temp;

   if (gUserLang == LANG_english)
     temp = "%{1}s %{2}s %{3}s %{4}s %{5}s\n";
   else
     temp = "%{2}s %{1}s %{5}s %{4}s %{3}s\n";

   Msg = expand(temp, pSubjectDesc, pSubject, pVerb, pObjectDesc, pObject);
   writef(Msg);

--
Derek
Melbourne, Australia
18/Aug/04 10:31:55 AM
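A template expander of this kind is not much code. The following is a minimal sketch in present-day D, assuming that expand() takes only string arguments and that the only directive is %{n}s with a 1-based index; both are simplifications, and expand() is the hypothetical function from the example above, not an existing library routine.

```d
import std.array : appender;
import std.ascii : isDigit;

/// Hypothetical expand(): replaces each "%{n}s" with the n-th argument (1-based).
/// Anything that is not a well-formed "%{n}s" is copied through unchanged.
string expand(string tmpl, string[] args...)
{
    auto result = appender!string();
    size_t i = 0;
    while (i < tmpl.length)
    {
        // Try to match "%{digits}s" starting at position i.
        if (tmpl[i] == '%' && i + 1 < tmpl.length && tmpl[i + 1] == '{')
        {
            size_t j = i + 2;
            size_t n = 0;
            while (j < tmpl.length && isDigit(tmpl[j]))
            {
                n = n * 10 + (tmpl[j] - '0');
                ++j;
            }
            immutable wellFormed = j > i + 2
                && j + 1 < tmpl.length && tmpl[j] == '}' && tmpl[j + 1] == 's';
            if (wellFormed && n >= 1 && n <= args.length)
            {
                result.put(args[n - 1]);
                i = j + 2; // skip past "}s"
                continue;
            }
        }
        result.put(tmpl[i]);
        ++i;
    }
    return result.data;
}

unittest
{
    auto english = "%{1}s %{2}s %{3}s %{4}s %{5}s";
    auto other   = "%{2}s %{1}s %{5}s %{4}s %{3}s";
    assert(expand(english, "the", "dog", "chased", "a", "cat") == "the dog chased a cat");
    assert(expand(other, "the", "dog", "chased", "a", "cat") == "dog the cat a chased");
    assert(expand("%{1}s and %{1}s", "x") == "x and x");
}
```

Because the index is explicit, the same argument can appear several times or not at all, which sequential %s cannot express.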
Aug 17 2004
"antiAlias" <fu bar.com> writes:
Jill ~ I have a utf-8 transcoder that I'm using as a plaything within Mango;
if you're interested, I'll send it on.


"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:cfsm6d$va0$1 digitaldaemon.com...
 We have two separate problems:
 (1) formatted I/O
 (2) unformatted I/O


Aug 18 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cfv1d0$26s7$1 digitaldaemon.com>, antiAlias says...
Jill ~ I have a utf-8 transcoder that I'm using as a plaything within Mango;
if you're interested, I'll send it on.

Not really interested because (a) there's one in std.utf, and (b) I could write my own in just a few lines of code anyway.

But we're really talking about general concepts here, not specific encodings. We need to get the architecture "right" first - which I guess means, in a form that everyone is happy with - and /then/ we start plugging in specific encodings. UTF-8 is one of the easiest, so I'm really not troubled by it. (ASCII is /the/ easiest, obviously).

Antialias, it was you who came up with some ideas for throughput enhancement using buffers. I think we can use those ideas without sacrificing genericity, which is why I suggested we collaborate on the generic interface. Would you be interested in that?

Jill
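To give a sense of scale for "a few lines of code", here is a minimal UTF-8 encoder sketch in present-day D. encodeUtf8 is a hypothetical name, and the function assumes its input is already a valid code point (no surrogates, nothing above U+10FFFF); std.utf remains the real implementation to use.

```d
/// Encodes one code point as UTF-8 and returns the number of bytes used (1 to 4).
/// Assumes c is a valid scalar value; no error checking is done.
size_t encodeUtf8(dchar c, ref ubyte[4] buf)
{
    if (c < 0x80)
    {
        buf[0] = cast(ubyte) c;
        return 1;
    }
    if (c < 0x800)
    {
        buf[0] = cast(ubyte)(0xC0 | (c >> 6));
        buf[1] = cast(ubyte)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000)
    {
        buf[0] = cast(ubyte)(0xE0 | (c >> 12));
        buf[1] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));
        buf[2] = cast(ubyte)(0x80 | (c & 0x3F));
        return 3;
    }
    buf[0] = cast(ubyte)(0xF0 | (c >> 18));
    buf[1] = cast(ubyte)(0x80 | ((c >> 12) & 0x3F));
    buf[2] = cast(ubyte)(0x80 | ((c >> 6) & 0x3F));
    buf[3] = cast(ubyte)(0x80 | (c & 0x3F));
    return 4;
}

unittest
{
    ubyte[4] buf;

    assert(encodeUtf8('A', buf) == 1);
    assert(buf[0] == 0x41);

    assert(encodeUtf8('\u20AC', buf) == 3);      // EURO SIGN
    ubyte[] euro = [0xE2, 0x82, 0xAC];
    assert(buf[0 .. 3] == euro);

    assert(encodeUtf8('\U0001F600', buf) == 4);  // outside the BMP
    ubyte[] face = [0xF0, 0x9F, 0x98, 0x80];
    assert(buf[0 .. 4] == face);
}
```

The per-encoding part really is this small, which is why the shape of the surrounding source and sink interfaces matters more than any single encoder.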
Aug 18 2004
"antiAlias" <fu bar.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
 Antialias, it was you who came up with some ideas for throughput
 enhancement using buffers. I think we can use those ideas without
 sacrificing genericity, which is why I suggested we collaborate on the
 generic interface. Would you be interested in that?

Sure, Jill. That's what I was attempting <g> I was offering a transcoder built in the manner suggested, to experiment with said interface. Sometimes it's easier to deal with a more concrete entity as opposed to something completely virtual -- if nothing else, it should serve to more fully describe the suggested approach.

I'll need an email address, if this module would be of any value to you?
Aug 18 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cg01qr$13u$1 digitaldaemon.com>, antiAlias says...

I'll need an email
address, if this module would be of any value to you?

If you have an account on dsource, you can contact me privately there. My username is "Arcane Jill".

Jill
Aug 19 2004
J C Calvarese <jcc7 cox.net> writes:
Arcane Jill wrote:
...
 (6) Will somebody /please/ document std.Stream?
 
 I volunteer for (1) and (3). I'm hoping Sean will volunteer for (2).
 AntiAlias's excellent ideas for throughput enhancement using buffers are
 part of (1) and (3), so I suggest AntiAlias and I send each other code
 back and forth until we are both happy with it.
 
 Volunteers still needed for (4), (5) and (6) (though (4) and (5) are dependent
 upon (3)). Anyone who's a dab hand at Wiki might like to volunteer for (6).

I'm not volunteering to single-handedly re-document std.stream, but I did start a wiki page for that purpose: http://www.prowiki.org/wiki4d/wiki.cgi?DocComments/Phobos/StdStream (Anyone can edit it by clicking on the "Edit" link in the upper right corner of the page.)
 

-- Justin (a/k/a jcc7) http://jcc_7.tripod.com/d/
Aug 18 2004