
D - new streams

reply "Pavel Minayev" <evilone omen.ru> writes:
You can find the new stream module at my site, http://int19h.tamb.ru.

By far the most interesting addition is scanf(). What is even better,
it can read D strings!

    char[] s;
    stdin.scanf("%.*s", &s);

Yes, this really works! Of course, you can still read C strings
(%s), but who needs it anymore? Note, however, that scanf wasn't
tested much, so it might contain bugs. Be careful!

Going further, readLine() has learnt to read lines terminated with
a single CR (aka It Came From Mac). And writeLine() now follows
Windows conventions, and writes CR/LF terminated lines. On Linux,
it should write a single LF, and a CR on Mac, whenever D gets
there - a bit of underlying platform transparency.

Unicode strings now work - no, really! =) readStringW(),
writeStringW(), readLineW(), and writeLineW() do their job
no worse than their ANSI counterparts.

Generic read() and write() can now handle strings as well. Unlike
readString() and writeString(), these also store the length in
the stream:

    char[] s;
    ...
    file.write(s);   // writes s.length, followed by s
    ...
    file.read(s);    // reads length, then string of that length
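
A minimal sketch of the length-prefixed layout described above, written in present-day D against an in-memory buffer rather than the stream class itself. The helper names putCounted and takeCounted are hypothetical, and the 32-bit length in native byte order is an assumption; the post does not say how wide the stored length is.

```d
import std.exception : enforce;

/// Append a 32-bit length (native byte order, an assumption) and then the
/// characters of `s` to `sink`.
void putCounted(ref ubyte[] sink, const(char)[] s)
{
    uint len = cast(uint) s.length;
    sink ~= (cast(ubyte*) &len)[0 .. len.sizeof];
    sink ~= cast(const(ubyte)[]) s;
}

/// Read the length back, then exactly that many characters, advancing `src`.
char[] takeCounted(ref const(ubyte)[] src)
{
    uint len;
    enforce(src.length >= len.sizeof, "truncated length");
    (cast(ubyte*) &len)[0 .. len.sizeof] = src[0 .. len.sizeof];
    src = src[len.sizeof .. $];

    enforce(src.length >= len, "truncated string");
    auto s = cast(const(char)[]) src[0 .. len];
    src = src[len .. $];
    return s.dup;
}
```

Because the length travels with the data, the reader never has to guess where the string ends, which is the difference from readString() and writeString().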

Two new functions: getc() and ungetc(). I guess you know what
these are for. =) They also have Unicode versions, getcw() and
ungetcw().

Enumerations changed names again:

    enum SeekPos
    {
        Set,
        Current,
        End
    }

    enum FileMode
    {
        In,
        Out
    }

Now these have proper case, and should be more consistent with other
Phobos modules.

And finally, the module is NO LONGER DEPENDENT on my windows.d import,
and can be used with the one that comes with D. Thus, it can easily
replace the old and outdated stream module in Phobos.

By the way, Walter, could you pleeease replace the old version that
you'd put into Phobos with this new one? It's much better, has
fewer bugs, and since it is now self-sufficient (no need for my crappy
win32 import module), it should be easy to do...
May 10 2002
next sibling parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:abglum$2kij$1 digitaldaemon.com...
 You can find the new stream module at my site, http://int19h.tamb.ru.

Cool!
 Going further, readLine() has learnt to read lines terminated with
 a single CR (aka It Came From Mac). And writeLine() now follows
 Windows conventions, and writes CR/LF terminated lines. On Linux,
 it should write a single LF, and a CR on Mac, whenever D gets
 there - a bit of underlying platform transparency.

What I do is treat as "newline" any of the following:

1) CR
2) CR LF
3) LF

It requires a bit of lookahead to distinguish case 1 from case 2, but it
works with files generated by Windows, linux, and Mac.
 By the way, Walter, could you pleeease replace the old version, that
 you'd put into Phobos, with this new one?

Sure!
May 10 2002
next sibling parent "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:abgshp$2qbp$1 digitaldaemon.com...

 What I do is treat as "newline" any of the following:

 1) CR
 2) CR LF
 3) LF

 It requires a bit of lookahead to distinguish case 1 from case 2, but it
 works with files generated by Windows, linux, and Mac.

That's exactly what I did in the new version. It requires ungetc() (for the case when CR is not followed by LF) though, so I had to add it as well.
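
The lookahead-plus-ungetc() approach described above can be sketched roughly as follows. This is present-day D over an in-memory source, not the stream module's actual code; the Source struct and its one-character pushback are assumptions made for illustration.

```d
/// A tiny in-memory character source with one character of pushback
/// (hypothetical stand-in for a stream with getc()/ungetc()).
struct Source
{
    const(char)[] data;
    size_t pos;
    int pushed = -1;              // -1 means "nothing pushed back"

    /// Returns the next character, or -1 at end of input.
    int getc()
    {
        if (pushed >= 0)
        {
            int c = pushed;
            pushed = -1;
            return c;
        }
        return pos < data.length ? data[pos++] : -1;
    }

    void ungetc(int c) { pushed = c; }
}

/// Reads one line terminated by CR, CR LF, or LF.
/// Returns false once the input is exhausted.
bool readLine(ref Source src, out char[] line)
{
    int c = src.getc();
    if (c < 0)
        return false;
    for (; c >= 0; c = src.getc())
    {
        if (c == '\n')
            return true;
        if (c == '\r')
        {
            int next = src.getc();            // the lookahead
            if (next >= 0 && next != '\n')
                src.ungetc(next);             // CR alone: give the character back
            return true;
        }
        line ~= cast(char) c;
    }
    return true;                              // final line without a terminator
}
```

On a real stream the getc() call marked as the lookahead is the one that may block when no further character is available yet.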
May 10 2002
prev sibling parent reply Russ Lewis <spamhole-2001-07-16 deming-os.org> writes:
Walter wrote:

 What I do is treat as "newline" any of the following:

 1) CR
 2) CR LF
 3) LF

 It requires a bit of lookahead to distinguish case 1 from case 2, but it
 works with files generated by Windows, linux, and Mac.

This has caused me some HUGE headaches doing streaming on UNIX boxes. At least some of the tools do "lookahead", so they don't echo a line out until you have printed 1 character AFTER the newline...in some cases, it has caused my programs to hang for minutes or hours (while, say, a long find command runs) until either another (unnecessary) line is printed, or the stream runs into EOF.

IMHO, you should immediately interpret CR as a newline, but put a marker on the stream such that if another character is read and that character is a LF, then it will be consumed LATER. DON'T lookahead for it :(

--
The Villagers are Online! villagersonline.com

.[ (the fox.(quick,brown)) jumped.over(the dog.lazy) ]
.[ (a version.of(English).(precise.more)) is(possible) ]
?[ you want.to(help(develop(it))) ]
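
The marker idea in this post could look something like the sketch below: a CR ends the line at once, and a flag remembers that a directly following LF should be dropped on the next read. The LineReader struct and the skipLF field are hypothetical names, and the in-memory buffer stands in for a real stream.

```d
/// Line reader that never looks past a CR; a following LF is swallowed
/// lazily, at the start of the next read (hypothetical sketch).
struct LineReader
{
    const(char)[] data;
    size_t pos;
    bool skipLF;                  // set after a CR has ended a line

    /// Returns false once the input is exhausted.
    bool readLine(out char[] line)
    {
        if (skipLF)
        {
            skipLF = false;
            // Only now, when the caller wants more input anyway,
            // do we check whether the CR was part of a CR LF pair.
            if (pos < data.length && data[pos] == '\n')
                ++pos;
        }
        if (pos >= data.length)
            return false;
        while (pos < data.length)
        {
            char c = data[pos++];
            if (c == '\n')
                return true;
            if (c == '\r')
            {
                skipLF = true;    // return immediately, no lookahead
                return true;
            }
            line ~= c;
        }
        return true;              // final line without a terminator
    }
}
```

The line is handed back as soon as the CR arrives, so an interactive producer is never held up waiting for a character that may not come for a long time.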
May 10 2002
next sibling parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Russ Lewis" <spamhole-2001-07-16 deming-os.org> wrote in message
news:3CDBF812.D68ED4F6 deming-os.org...

 IMHO, you should immediately interpret CR as a newline, but put a marker

 the stream such that if another character is read and that character is a
 LF, then it will be consumed LATER.  DON'T lookahead for it :(

I do a lookahead, but I have ungetc() implemented and working...
May 10 2002
parent reply Russ Lewis <spamhole-2001-07-16 deming-os.org> writes:
Pavel Minayev wrote:

 "Russ Lewis" <spamhole-2001-07-16 deming-os.org> wrote in message
 news:3CDBF812.D68ED4F6 deming-os.org...

 IMHO, you should immediately interpret CR as a newline, but put a marker

 the stream such that if another character is read and that character is a
 LF, then it will be consumed LATER.  DON'T lookahead for it :(

I do a lookahead, but I have ungetc() implemented and working...

Ungetc doesn't help the problem I was talking about. If you do lookahead but there is not a character available, then your library will block until one more character is available to read (or you detect EOF)...which could be a LONG time from now.

--
The Villagers are Online! villagersonline.com

.[ (the fox.(quick,brown)) jumped.over(the dog.lazy) ]
.[ (a version.of(English).(precise.more)) is(possible) ]
?[ you want.to(help(develop(it))) ]
May 10 2002
next sibling parent reply Andrew Feldstein <Andrew_member pathlink.com> writes:
I agree that Russ's way is better, but it is still not ideal.  The user should
be able to set some sort of library flag to determine how to handle end of lines
*correctly* given the needs of the program.  This flag could control both
writing as well as reading, knowing how to handle \n, for example.  For example,
under *nix, it is incorrect to treat CR as part of a newline, and under Mac, I
believe, the same goes for LF.  Of course, any implementation should default
to the text model used by the underlying operating system and should handle the
oddball cases cleanly.  Of course reading and writing don't *have* to be the
same....

Pavel, how would your new function read, say, a file containing nothing but
three <CR>'s followed by two <LF>'s?  Under various text models this could be
interpreted as any of 1, 2, 3, 4, or 5 blank lines.

In article <3CDBFA09.5F8DD76D deming-os.org>, Russ Lewis says...
Pavel Minayev wrote:

 "Russ Lewis" <spamhole-2001-07-16 deming-os.org> wrote in message
 news:3CDBF812.D68ED4F6 deming-os.org...

 IMHO, you should immediately interpret CR as a newline, but put a marker

 the stream such that if another character is read and that character is a
 LF, then it will be consumed LATER.  DON'T lookahead for it :(

I do a lookahead, but I have ungetc() implemented and working...

Ungetc doesn't help the problem I was talking about. If you do lookahead but there is not a character available, then your library will block until one more character is available to read (or you detect EOF)...which could be a LONG time from now. -- The Villagers are Online! villagersonline.com .[ (the fox.(quick,brown)) jumped.over(the dog.lazy) ] .[ (a version.of(English).(precise.more)) is(possible) ] ?[ you want.to(help(develop(it))) ]

May 10 2002
next sibling parent "Pavel Minayev" <evilone omen.ru> writes:
"Andrew Feldstein" <Andrew_member pathlink.com> wrote in message
news:abh3e0$3031$1 digitaldaemon.com...

 I agree that Russ's way is better, but it is still not ideal.  The user

 be able to set some sort of library flag to determine how to handle end of

 *correctly* given the needs of the program.  This flag could control both
 writing as well as reading, knowing how to handle \n, for example.  For

 under *nix, it is incorrect to treat CR as part of a newline, and under

 believe, the LF the same.  Of course, any implementation should should

 to the text model used by the underlying operating system and should

 oddball cases cleanly.  Of course reading and writing don't *have* to be

 same....

Under *nix, CR is a control character, and thus it is NOT supposed to be seen in ASCII files - which is what readLine() is designed for. However, if it occasionally comes across a file made in a Windows or Mac text editor, it will still be able to read it properly. The same is true for Mac - text files SHOULDN'T contain LF. The stream's ability to handle it is an advantage, not a bug.
 Pavel, how would your new function read, say, a file containing nothing

 three <CR>'s followed by two <LF>'s?  Under various text models this could

 interpreted as any of 1, 2, 3, 4, or 5 blank lines.

It will treat it as CR, CR, CR+LF, LF - 4 lines.
May 10 2002
prev sibling parent "Walter" <walter digitalmars.com> writes:
"Andrew Feldstein" <Andrew_member pathlink.com> wrote in message
news:abh3e0$3031$1 digitaldaemon.com...
 I agree that Russ's way is better, but it is still not ideal.  The user

 be able to set some sort of library flag to determine how to handle end of

 *correctly* given the needs of the program.  This flag could control both
 writing as well as reading, knowing how to handle \n, for example.  For

 under *nix, it is incorrect to treat CR as part of a newline, and under

 believe, the LF the same.  Of course, any implementation should should

 to the text model used by the underlying operating system and should

 oddball cases cleanly.  Of course reading and writing don't *have* to be

 same....

The problem is that files are transferred from machine to machine, and a program cannot reliably know where a given file came from.
 Pavel, how would your new function read, say, a file containing nothing

 three <CR>'s followed by two <LF>'s?  Under various text models this could

 interpreted as any of 1, 2, 3, 4, or 5 blank lines.

That would be CR, CR, CR LF, LF, or 4 lines.
May 10 2002
prev sibling parent "Robert W. Cunningham" <rcunning acm.org> writes:
Russ Lewis wrote:

 Pavel Minayev wrote:

 "Russ Lewis" <spamhole-2001-07-16 deming-os.org> wrote in message
 news:3CDBF812.D68ED4F6 deming-os.org...

 IMHO, you should immediately interpret CR as a newline, but put a marker

 the stream such that if another character is read and that character is a
 LF, then it will be consumed LATER.  DON'T lookahead for it :(

I do a lookahead, but I have ungetc() implemented and working...

Ungetc doesn't help the problem I was talking about. If you do lookahead but there is not a character available, then your library will block until one more character is available to read (or you detect EOF)...which could be a LONG time from now.

On serial device drivers I've written, and on at least one of the many RTOS systems I've used, we had peekc() and/or lookc() calls that would, without side-effects, look at the next character in the device driver's buffer, and if that buffer was empty, the call would wait a single character time and sneak a nondestructive look at the uart buffer (a tricky thing to do on some uarts). I have no idea if Windows has similar capabilities. -BobC
May 10 2002
prev sibling parent "Walter" <walter digitalmars.com> writes:
"Russ Lewis" <spamhole-2001-07-16 deming-os.org> wrote in message
news:3CDBF812.D68ED4F6 deming-os.org...
 This has caused me some HUGE headaches doing streaming on UNIX boxes.  At
 least some of the tools do "lookahead", so they don't echo a line out

 you have printed 1 character AFTER the newline...in some cases, it has
 caused my programs to hang for minutes or hours (while, say, a long find
 command runs) until either another (unnecessary) line is printed, or the
 stream runs into EOF.

One solution is to use isatty() and if it is a stream, not a file, timeout instead of blocking for the lookahead. I've used similar tricks when reading escape sequences from terminals.
May 10 2002
prev sibling parent reply Burton Radons <loth users.sourceforge.net> writes:
On Fri, 10 May 2002 18:42:01 +0400, "Pavel Minayev" <evilone omen.ru>
wrote:

You can find the new stream module at my site, http://int19h.tamb.ru.

Far the most interesting addition is scanf(). What is even better,
it can read D strings!

    char[] s;
    stdin.scanf("%.*s", &s);

Yes, this really works! Of course, you can still read C strings
(%s), but who needs it anymore? Note, however, that scanf wasn't
tested much, so it might contain bugs. Be careful!

I think we should get the scanf and fmt format codes aligned. My method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and "%+S" for wchar*. Different semantics for what looks like the same thing is bad city. [snip]
Generic read() and write() can now handle strings as well. Unlike
readString() and writeString(), these also store the length in
the stream:

    char[] s;
    ...
    file.write(s);   // writes s.length, followed by s
    ...
    file.read(s);    // reads length, then string of that length

Since this format is our own (that is to say, there's no standard for counted strings -- some are 32-bit, some are 16-bit, some are 8-bit, with varying rules on NUL termination and alignment), we may as well use dynamic-sized integers for this. For each byte we take the first seven bits and read another byte if the eighth bit is set, like:

    /* Write an unsigned long using the minimum number of bytes */
    void dwrite(ulong value)
    {
        do
        {
            /* low seven bits, plus the eighth bit if more bytes follow */
            write (cast(ubyte) ((value & 127) | (value > 127 ? 128 : 0)));
            value = value >> 7;
        }
        while (value);
    }

    /* Read an unsigned long using the minimum number of bytes */
    void dread(out ulong value)
    {
        ulong shift = 0;
        ubyte buffer;

        value = 0;
        do
        {
            if (shift >= 64)
                throw new ReadError("integer overflow on reading value");
            read (buffer);
            value |= cast(ulong) (buffer & 127) << shift;
            shift += 7;
        }
        while (buffer & 128);
    }

When writing uint you'll usually save two or three bytes, which really adds up when writing meshes, and you have your future covered, and it's endian neutral.

Signed values can be written by preprocessing them for writing:

    if (value < 0)
        ovalue = (-value << 1) | 1;
    else
        ovalue = value << 1;

and postprocessing them after reading:

    ovalue = (value >> 1);
    if (value & 1)
        ovalue = -ovalue;

Uh, except that you can't write the minimum value of long then. Think of the byte case - you start with a range of -128 to 127 and end with a range of -127 to 127 if you kept just to byte. If they existed I'd cast to a bignum and save that, although real bignums should be saved like counted strings.

For my code I won't be able to use the class if it doesn't handle endian properly - I'm just too ethically opposed to blindly writing values. It's just one step down from writing structs, IMO. Standard read/write could use little endian, with bread/bwrite for big endian.

[snip]
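
For reference, the seven-bits-per-byte scheme from this post can be tried out in present-day D with the self-contained sketch below. It works on an in-memory buffer instead of the stream class, and the names putVarint, takeVarint, zigzag and unzigzag are hypothetical. For signed values it swaps in the "zigzag" mapping rather than the sign-bit mapping given in the post; that is a different technique, chosen here because it also round-trips the minimum value of long.

```d
import std.exception : enforce;

/// Append `value` to `sink`, seven bits per byte, low bits first.
void putVarint(ref ubyte[] sink, ulong value)
{
    do
    {
        ubyte b = cast(ubyte)(value & 0x7F);
        value >>= 7;
        if (value != 0)
            b |= 0x80;            // eighth bit set: another byte follows
        sink ~= b;
    }
    while (value != 0);
}

/// Decode one value from the front of `src`, advancing it.
ulong takeVarint(ref const(ubyte)[] src)
{
    ulong value = 0;
    uint shift = 0;
    while (true)
    {
        enforce(shift < 64, "integer overflow on reading value");
        enforce(src.length > 0, "unexpected end of data");
        immutable ubyte b = src[0];
        src = src[1 .. $];
        value |= cast(ulong)(b & 0x7F) << shift;
        if ((b & 0x80) == 0)
            return value;
        shift += 7;
    }
}

/// Zigzag mapping: 0, -1, 1, -2, ... becomes 0, 1, 2, 3, ...
ulong zigzag(long v)
{
    return (cast(ulong) v << 1) ^ cast(ulong)(v >> 63);
}

/// Inverse of zigzag().
long unzigzag(ulong u)
{
    return cast(long)(u >> 1) ^ -cast(long)(u & 1);
}
```

Small magnitudes of either sign stay short under this mapping, and the bytes come out the same on any host, which is the endian-neutral property mentioned above.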
Enumerations changed names again:

    enum SeekPos
    {
        Set,
        Current,
        End
    }

Why not Cur? "Set" is already nonsensical; Start or Beginning would be more appropriate, so we may as well use the convenient nonsense we're used to.

Hm. I don't like writing the name of the enumeration when there's only one type that can fit in the argument. How about we have this:

    file.seek (x, .Current);
    file.seek (x, .Set);
    file.seek (x, .End);

Minimise namespace pollution and too much writingitis at the same time. Of course, it means that you can't find the enumeration value until after the function has been decided upon, but it shouldn't be ambiguous; it's clearly an enumeration of some sort.

[snip]
May 10 2002
next sibling parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:sjunduovt39c2tcntmkv6rp23cn8thmk9g 4ax.com...

 I think we should get the scanf and fmt format codes aligned.  My
 method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and
 "%+S" for wchar*.  Different semantics for what looks like the same
 thing is bad city.

Agreed, but I think we should first have Walter agree with this, so it'd become "official". Once it is, I will be happy to standardize streams appropriately.
 Since this format is our own (that is to say, there's no standard for
 counted strings -- some are 32-bit, some are 16-bit, some are 8-bit,
 with varying rules on NUL termination and alignment), we may as well
 use dynamic-sized integers for this.  For each byte we take the first
 seven bits and read another byte if the eighth bit is set, like:

 When writing uint you'll usually get three or two bytes savings, which
 really adds up when writing meshes, and you have your future covered,
 and it's endian neutral.

But at a cost of speed... and I wonder if it is really needed? Is file size so important?
 For my code I won't be able to use the class if it doesn't handle
 endian properly - I'm just too ethically opposed to blindly writing
 values.  It's just one step down from writing structs, IMO.  Standard
 read/write could use little endian, with bread/bwrite for big endian.

I would prefer read() and write() to operate in "current endianness" (because often you just don't care - all you want is that your program should be able to read data it previously wrote, on that computer, savegames etc). If you really care about endianness, you'll have to use functions like bread() and lread().
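
To make the difference concrete, here is a small sketch of fixed-endian helpers in present-day D, using the conversion functions from std.bitmanip. The names lwrite, bwrite, lread and bread follow the naming floated in this thread, but their signatures and the in-memory buffer are assumptions, not the stream module's API.

```d
import std.bitmanip : bigEndianToNative, littleEndianToNative,
                      nativeToBigEndian, nativeToLittleEndian;

/// Append a uint in little-endian byte order, whatever the host is.
void lwrite(ref ubyte[] sink, uint v)
{
    ubyte[4] raw = nativeToLittleEndian(v);
    sink ~= raw[];
}

/// Append a uint in big-endian byte order, whatever the host is.
void bwrite(ref ubyte[] sink, uint v)
{
    ubyte[4] raw = nativeToBigEndian(v);
    sink ~= raw[];
}

/// Read a little-endian uint from the first four bytes of `src`.
uint lread(const(ubyte)[] src)
{
    ubyte[4] raw = src[0 .. 4];
    return littleEndianToNative!uint(raw);
}

/// Read a big-endian uint from the first four bytes of `src`.
uint bread(const(ubyte)[] src)
{
    ubyte[4] raw = src[0 .. 4];
    return bigEndianToNative!uint(raw);
}
```

A file written with lwrite() reads back identically on any machine, whereas a "current endianness" write only round-trips between hosts that share a byte order.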
 Why not Cur?  "Set" is already nonsensical; Start or Beginning would
 be more appropriate, so we may as well use the convenient nonsense
 we're used to.

Because "Set" is a word, and so is "Current", but not "Cur". But if you really think that "Start" looks better, I'll probably change it...
May 10 2002
next sibling parent reply Burton Radons <loth users.sourceforge.net> writes:
On Fri, 10 May 2002 22:20:55 +0400, "Pavel Minayev" <evilone omen.ru>
wrote:

"Burton Radons" <loth users.sourceforge.net> wrote in message
news:sjunduovt39c2tcntmkv6rp23cn8thmk9g 4ax.com...

 I think we should get the scanf and fmt format codes aligned.  My
 method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and
 "%+S" for wchar*.  Different semantics for what looks like the same
 thing is bad city.

Agreed, but I think we should first have Walter agree with this, so it'd become "official". Once it is, I will be happy to standardize streams appropriately.

Sure, but I'm leaving the option open to kick his ass if he decides to go with "%format-a-string;". ;-)
 Since this format is our own (that is to say, there's no standard for
 counted strings -- some are 32-bit, some are 16-bit, some are 8-bit,
 with varying rules on NUL termination and alignment), we may as well
 use dynamic-sized integers for this.  For each byte we take the first
 seven bits and read another byte if the eighth bit is set, like:

 When writing uint you'll usually get three or two bytes savings, which
 really adds up when writing meshes, and you have your future covered,
 and it's endian neutral.

But at a cost of speed... and I wonder if it is really needed? Is file size so important?

It should be a little faster on a competent compiler. We have to buffer the data anyway; flushing the buffer takes a long time; loops can be unrolled; dynamic-sized integers lower the incidence of flushing; dynamic-sized integers are faster. But this is splitting hairs in any case. Endian independence and a much smaller normal case are far more important.
 For my code I won't be able to use the class if it doesn't handle
 endian properly - I'm just too ethically opposed to blindly writing
 values.  It's just one step down from writing structs, IMO.  Standard
 read/write could use little endian, with bread/bwrite for big endian.

I would prefer read() and write() to operate in "current endianness" (because often you just don't care - all you want is that your program should be able to read data it previously wrote, on that computer, savegames etc). If you really care about endianness, you'll have to use functions like bread() and lread().

Uh, if you don't care, then it can default to little endian. :-)
 Why not Cur?  "Set" is already nonsensical; Start or Beginning would
 be more appropriate, so we may as well use the convenient nonsense
 we're used to.

Because "Set" is a word, and so is "Current", but not "Cur". But if you really think that "Start" looks better, I'll probably change it...

It's a word, but so is "Catholicity", and it's as appropriate as "Set". My dictionary gives 125 meanings for set. The only thing that could be related is in the context of "setting sun", which is quite the opposite. Besides which, cur is a word. Uh, perhaps not in your part of the world. It means a worthless dog, or contemptible scoundrel.
May 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:3f4oduspb6fuiseeg7a4a025c92pnnfl1e 4ax.com...

 It's a word, but so is "Catholicity", and it's as appropriate as
 "Set".  My dictionary gives 125 meanings for set.  The only thing that
 could be related is in the context of "setting sun", which is quite
 the opposite.

Hmm.. I always thought that "set" is a short form of "just set that #%$ file pointer to whatever I say", but I could be wrong...
May 10 2002
parent "OddesE" <OddesE_XYZ hotmail.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:abha3v$4f7$1 digitaldaemon.com...
 "Burton Radons" <loth users.sourceforge.net> wrote in message
 news:3f4oduspb6fuiseeg7a4a025c92pnnfl1e 4ax.com...

 It's a word, but so is "Catholicity", and it's as appropriate as
 "Set".  My dictionary gives 125 meanings for set.  The only thing that
 could be related is in the context of "setting sun", which is quite
 the opposite.

Hmm.. I always thought that "set" is a short form of "just set that #%$ file pointer to whatever I say", but I could be wrong...

LOL :)

--
Stijn
OddesE_XYZ hotmail.com
http://OddesE.cjb.net
_________________________________________________
Remove _XYZ from my address
when replying by mail
May 11 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:abh2p1$2vfr$1 digitaldaemon.com...
 "Burton Radons" <loth users.sourceforge.net> wrote in message
 news:sjunduovt39c2tcntmkv6rp23cn8thmk9g 4ax.com...
 I think we should get the scanf and fmt format codes aligned.  My
 method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and
 "%+S" for wchar*.  Different semantics for what looks like the same
 thing is bad city.

so it'd become "official". Once it is, I will be happy to standardize streams appropriately.

But that means I have to think about it <g>. In any case, I think it is just a matter of reviewing the C printf and scanf format strings, and coming up with something as equivalent as practical that still supports the full set of D types. Note that D enables some cool things like a format specifier for Objects, too, which will cast the argument to an Object and call toString() on it.
May 10 2002
parent reply "Pavel Minayev" <evilone omen.ru> writes:
"Walter" <walter digitalmars.com> wrote in message
news:abhjha$c6h$1 digitaldaemon.com...

 But that means I have to think about it <g>. In any case, I think it is

 a matter of reviewing the C printf and scanf format strings, and coming up
 with something as equivalent as practical but still support the full D

Yes, exactly. But you don't want anarchy here, do you?
May 10 2002
parent "Walter" <walter digitalmars.com> writes:
"Pavel Minayev" <evilone omen.ru> wrote in message
news:abi475$qe3$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:abhjha$c6h$1 digitaldaemon.com...
 But that means I have to think about it <g>. In any case, I think it is

 a matter of reviewing the C printf and scanf format strings, and coming


 with something as equivalent as practical but still support the full D


No, but it's a matter of getting spread too thin making it hard to give each issue the attention it needs. I'm currently trying to finish another project (and get paid for it) so I can spend more time on D.
May 11 2002
prev sibling next sibling parent "Martin M. Pedersen" <mmp www.moeller-pedersen.dk> writes:
Hi,

"Burton Radons" <loth users.sourceforge.net> wrote in message
news:sjunduovt39c2tcntmkv6rp23cn8thmk9g 4ax.com...
 Since this format is our own (that is to say, there's no standard for
 counted strings -- some are 32-bit, some are 16-bit, some are 8-bit,
 with varying rules on NUL termination and alignment), we may as well
 use dynamic-sized integers for this.  For each byte we take the first
 seven bits and read another byte if the eighth bit is set, like:

This is very much like the ASN.1/DER encoding of lengths, but not exactly. We might consider that encoding, see: ftp://ftp.rsasecurity.com/pub/pkcs/ascii/layman.asc
 When writing uint you'll usually get three or two bytes savings, which
 really adds up when writing meshes, and you have your future covered,
 and it's endian neutral.

These are good properties. ASN.1/DER also specifies how to encode type information used to distinguish between ASCII and UNICODE strings - that might be usable too.

Regards,
Martin M. Pedersen
May 11 2002
prev sibling parent "Walter" <walter digitalmars.com> writes:
"Burton Radons" <loth users.sourceforge.net> wrote in message
news:sjunduovt39c2tcntmkv6rp23cn8thmk9g 4ax.com...
 I think we should get the scanf and fmt format codes aligned.  My
 method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and
 "%+S" for wchar*.  Different semantics for what looks like the same
 thing is bad city.

I've been thinking about the problem of writing wide characters vs ascii characters. Embedding it in the format string isn't going to work too well, as what happens with:

    printf("foo %S bar");

Are foo and bar written as unicode or ascii? I don't know many applications that would want to mix the two. The practical solution I see is to have two printf's, one for ascii and one for unicode. I.e. printf and wprintf.
May 11 2002