digitalmars.D - readf/unformat 1.4 released

Sean Kelly <sean f4.ca> writes:
For those of you who don't know, readf began as an attempt at a full
C99-compliant scanf implementation in D.  It's since been renamed to match the
Phobos writef/format functions a bit more closely, and this version attempts to
bring usage a bit closer to readf.  What's new in this version:

- Support for negative zero and negative infinity (previous version ignored sign
in these cases).  This version also does not allow the optional sign to appear
before "NAN" as IMO it's meaningless.  So "+NAN" and "-NAN" will both cause an
error.  If you don't like this, please let me know.  It is contrary to the C99
spec.

- Format strings are no longer necessary.  Default formats are:
%s: char arrays
%c: char pointers
%i: integer/bit
%f: floating point
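
For comparison, the idea of choosing a conversion from the target's type can be imitated in C11 with `_Generic`. This is only a sketch of the concept, not readf's implementation: the macro names `DEFAULT_FORMAT` and `READ_DEFAULT` are made up here, and C needs "%lf" for a double where the list above says "%f".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Pick a default sscanf conversion from the static type of the target.
   Hypothetical macro, loosely mirroring the defaults listed above. */
#define DEFAULT_FORMAT(target) _Generic((target), \
    char *:   "%c",                               \
    int *:    "%i",                               \
    double *: "%lf")

/* Read one value from text using the default conversion for its type.
   Evaluates to 1 on success, like sscanf with a single conversion. */
#define READ_DEFAULT(text, target) \
    sscanf((text), DEFAULT_FORMAT(target), (target))

/* Self-check: each target type selects a different conversion. */
void check_default_formats(void)
{
    int i = 0;
    double d = 0.0;
    char c = '\0';

    assert(strcmp(DEFAULT_FORMAT(&i), "%i") == 0);
    assert(READ_DEFAULT("0x1f", &i) == 1 && i == 31);  /* %i accepts hex */
    assert(READ_DEFAULT("2.5", &d) == 1 && d == 2.5);
    assert(READ_DEFAULT("x", &c) == 1 && c == 'x');
}
```

Unlike D's _arguments, this selection happens at compile time, so it cannot reject a mismatched argument at run time.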

unFormat still will not throw an exception on parameter mismatch, but will
return immediately instead.  This is the only interface issue I know of where
this package diverges from doFormat/writef.  The format string parsing is still
fully scanf-compliant, so there are some redundant format specifiers.  Check the
C99 spec or the included text file to get an idea of what specifiers do what.
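
On the sign question raised in the first item above, here is what the C99 library itself accepts. A minimal sketch using the standard strtod rather than readf; the helper name `parse_double` is an assumption made for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse the whole string as a double with strtod.
   Returns false if nothing was consumed or trailing characters remain. */
bool parse_double(const char *text, double *out)
{
    char *end;
    double value = strtod(text, &end);
    if (end == text || *end != '\0')
        return false;
    *out = value;
    return true;
}

/* Self-check showing which signed special values C99 accepts. */
void check_signed_specials(void)
{
    double d = 1.0;

    /* The sign of zero and of infinity is preserved. */
    assert(parse_double("-0.0", &d) && d == 0.0 && signbit(d));
    assert(parse_double("-inf", &d) && isinf(d) && d < 0);

    /* C99 allows an optional sign before NAN; readf 1.4 as described
       above rejects these two inputs instead. */
    assert(parse_double("-NAN", &d) && isnan(d));
    assert(parse_double("+nan", &d) && isnan(d));
}
```

Whether the sign bit of a parsed "-NAN" is set is left to the implementation, which is why the check only tests isnan.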

By the way, the code currently assumes input data to be in UTF-8 or UTF-16
(native) format as it uses the Phobos toUTFXX functions for conversion.  The
package includes a custom utf.d that allows delegates and consists of two
implementation files: unformat.d and stdio.d.  The format string and all
incoming data are converted to UTF-32 before evaluation to facilitate
comparison.  As usual, please let me know what you think.  The file is here:

http://home.f4.ca/sean/d/stdio.zip
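
The conversion to UTF-32 mentioned above boils down to reassembling code points from UTF-8 bytes. A minimal C sketch of that step follows; it is not the Phobos or utf.d code, and the names `decode_utf8` and `DECODE_ERROR` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE_ERROR 0xFFFFFFFFu   /* not a valid code point */

/* Decode one UTF-8 sequence starting at s[*idx] (s holds len bytes).
   On success, advances *idx and returns the code point.
   On malformed or truncated input, returns DECODE_ERROR and leaves
   *idx unchanged. */
uint32_t decode_utf8(const unsigned char *s, size_t len, size_t *idx)
{
    size_t i = *idx;
    if (i >= len)
        return DECODE_ERROR;

    uint32_t c = s[i];
    size_t extra;    /* number of continuation bytes expected */
    uint32_t min;    /* smallest code point allowed for this length */

    if (c < 0x80)                { extra = 0; min = 0; }
    else if ((c & 0xE0) == 0xC0) { extra = 1; min = 0x80;    c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; min = 0x800;   c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; min = 0x10000; c &= 0x07; }
    else
        return DECODE_ERROR;     /* stray continuation or invalid lead byte */

    if (len - i - 1 < extra)
        return DECODE_ERROR;     /* truncated sequence */

    for (size_t k = 1; k <= extra; k++) {
        unsigned char b = s[i + k];
        if ((b & 0xC0) != 0x80)
            return DECODE_ERROR; /* not a continuation byte */
        c = (c << 6) | (b & 0x3F);
    }

    /* Reject overlong forms, surrogates and values past U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return DECODE_ERROR;

    *idx = i + extra + 1;
    return c;
}

/* Self-check: 'h', U+00E9 and U+20AC encoded as UTF-8. */
void check_decode_utf8(void)
{
    const unsigned char text[] = { 'h', 0xC3, 0xA9, 0xE2, 0x82, 0xAC };
    size_t idx = 0;

    assert(decode_utf8(text, sizeof text, &idx) == 'h' && idx == 1);
    assert(decode_utf8(text, sizeof text, &idx) == 0xE9 && idx == 3);
    assert(decode_utf8(text, sizeof text, &idx) == 0x20AC && idx == 6);
    assert(decode_utf8(text, sizeof text, &idx) == DECODE_ERROR);
}
```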


Sean
Sep 24 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
How does unFormat take advantage of D's _arguments feature (if it does)? I'm not quite sure why the parsing code needs to think about %s or %i or whatever since it can look at the type of the target variable. If it sees int* then it parses an int and if it sees char[]* it parses a string. The only role of the format would be to specify where to parse and where to match literal characters.

-Ben
Sep 24 2004
Sean Kelly <sean f4.ca> writes:
In article <cj1r40$e0k$1 digitaldaemon.com>, Ben Hinkle says...
> How does unFormat take advantage of D's _arguments feature (if it does)? I'm
> not quite sure why the parsing code needs to think about %s or %i or
> whatever since it can look at the type of the target variable. If it sees
> int* then it parses an int and if it sees char[]* it parses a string. The
> only role of the format would be to specify where to parse and where to
> match literal characters.

Format strings can also specify how to parse the incoming data.  Integers, for
example, have a bunch of different format specifiers for different types of
input.  I chose "%i" as the default, since it's the most flexible, but "%d"
specifies decimal numbers only, "%o" is octal, you can include width specifiers,
etc.  I also may have forgotten to allow a bit to be parsed as a string (%s) to
convert "true" and "false" to 1 and 0, respectively.

For a contrived example:

# int i, r;
# char[] s;
#
# r = sreadf( "0x1 hello", &i, &s );
# assert( r == 2 && i == 1 && s == "hello" );
#
# i = i.init; s = s.init;
# r = sreadf( "0x1 hello", "%d%*s", &i, &s );
# assert( r == 1 && i == 0 && s == "hello" );

In the second case, 0x1 is expected to be a decimal number so the "x" is
interpreted as non-numeric.  The "%*s" indicates that a string should be read
but assignment should be suppressed (which throws out the "x1"), and the final
string is read as normal because the format string has been exhausted.

So in many cases there's no need to use format specifiers.  The code still uses
them internally even when one isn't supplied because it simplifies things, but
this is all invisible to the programmer.

unFormat takes advantage of the _arguments collection by using it to determine
what type is being written to (it will return if you try to read a string into
an integer, for example--writef would throw a FormatError in this situation),
and to determine what to expect if no format string is supplied.

You can also do stuff like this:

# int i;
# char c;
# char[] s, t;
#
# sreadf( "0x1 hello", &i, "%2s", &s, "%2c", &t, &c );
# assert( i == 1 && s == "he" && t == "ll" && c == 'o' );

So there's no restriction on the number or the location of format strings.  All
arguments are evaluated left to right.


Sean
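
For readers who want to try the conversions discussed above, the same inputs can be fed to the standard C sscanf, which readf is modelled on. This is a sketch of the C behaviour only: the format string is required in C and cannot be interleaved with the arguments, and sscanf counts every assigned target, so the second call returns 2 here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* %i versus %d on the input "0x1 hello". */
void check_integer_conversions(void)
{
    int i = -1, r;
    char s[16] = "";

    /* %i detects the base from the prefix, so "0x1" is hexadecimal 1. */
    r = sscanf("0x1 hello", "%i %15s", &i, s);
    assert(r == 2 && i == 1 && strcmp(s, "hello") == 0);

    /* %d is decimal only: it stops at 'x', and %*s discards "x1". */
    i = -1;
    s[0] = '\0';
    r = sscanf("0x1 hello", "%d%*s %15s", &i, s);
    assert(r == 2 && i == 0 && strcmp(s, "hello") == 0);
}

/* Width specifiers on the same input. */
void check_field_widths(void)
{
    int i = 0;
    char s[3] = "", t[3] = "", c = '\0';

    /* %2s reads at most two characters and appends a terminator;
       %2c reads exactly two characters and does not terminate. */
    int r = sscanf("0x1 hello", "%i %2s%2c%c", &i, s, t, &c);
    t[2] = '\0';

    assert(r == 4 && i == 1);
    assert(strcmp(s, "he") == 0 && strcmp(t, "ll") == 0 && c == 'o');
}
```
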
Sep 24 2004
"Ben Hinkle" <bhinkle mathworks.com> writes:
cool! so a C scanf call

 scanf("%d %d",&i,&j)

can be any of

 readf("%d %d",&i,&j)
 readf("%d",&i,"%d",&j)
 readf(&i,&j);

assuming i and j are ints. very nifty.
Sep 24 2004
Sean Kelly <sean f4.ca> writes:
In article <cj219m$h14$1 digitaldaemon.com>, Ben Hinkle says...
> cool! so a C scanf call
>
>  scanf("%d %d",&i,&j)
>
> can be any of
>
>  readf("%d %d",&i,&j)
>  readf("%d",&i,"%d",&j)
>  readf(&i,&j);
>
> assuming i and j are ints. very nifty.

Yup. I didn't think about it until just now, but this may come in handy for internationalization, since the whole "%1" concept that's been talked about can be faked just by reordering parameters.

Sean
Sep 24 2004
Anders F Björklund <afb algonet.se> writes:
Sean Kelly wrote: (back in 2004-09-24, that was)

> For those of you who don't know, readf began as an attempt at a full
> C99-compliant scanf implementation in D.  It's since been renamed to match the
> Phobos writef/format functions a bit more closely, and this version attempts to
> bring usage a bit closer to readf.
>
> unFormat still will not throw an exception on parameter mismatch, but will
> return immediately instead.  This is the only interface issue I know of where
> this package diverges from doFormat/writef.

I changed this package to break it into std.stdio.readf and std.string.unformat...
I also made it throw Exceptions on % FormatError and parameter mismatch, e.g. not passing pointers.

The missing TypeInfo for pointers, and the fact that you cannot pass _arguments and _argptr around with GDC without losing info makes it a horrible kludge at the moment - but it does work! ;-)
(currently unformat only works from within std.string, though...)

Sean Kelly's old declaration of unFormat was:
 int unFormat( bit delegate( out dchar ) getc,
               bit delegate( dchar ) ungetc,
               TypeInfo[] arguments,
               void* argptr )

This was changed to use EOF and lose the "bit":
 void unFormat( dchar delegate() getc, dchar delegate(dchar) ungetc,
               TypeInfo[] arguments, va_list argptr,
               Mangle[] mangle = null, Mangle[] mangle2 = null)

(last two parameters being part of the GDC kludge, you should already know "va_list" from std.stdarg:)
 version (GNU) {
     // va_list might be a pointer, but assuming so is not portable.
     private import gcc.builtins;
     alias __builtin_va_list va_list;
 } else {
     alias void* va_list;
 }
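
As a point of comparison for passing the argument list along, here is how forwarding looks in plain C. It is a sketch with hypothetical names (`unformat_v`, `unformat`) built on the standard vsscanf, and it shows the limitation discussed in this thread: a C va_list carries no type information, so the format string is the only description of the targets.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Inner routine: receives the argument list as a va_list, much as
   unFormat receives argptr. Hypothetical name. */
int unformat_v(const char *input, const char *format, va_list args)
{
    return vsscanf(input, format, args);
}

/* Outer variadic wrapper: captures its own arguments and forwards them. */
int unformat(const char *input, const char *format, ...)
{
    va_list args;
    va_start(args, format);
    int assigned = unformat_v(input, format, args);
    va_end(args);
    return assigned;
}

/* Self-check: two integers read through the forwarding wrapper. */
void check_forwarding(void)
{
    int i = 0, j = 0;
    assert(unformat("1 2", "%d %d", &i, &j) == 2 && i == 1 && j == 2);
}
```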

You only provide two delegates: getc and ungetc, which are very similar to their C counterparts...
(making the wrappers for fgetc and ungetc simple)

 dchar getc();
 dchar ungetc(dchar c);

Should an EOF occur, the "new" versions now return a cast(dchar) std.c.stdio.EOF, or: 0xFFFFFFFF as UTF-32.
(which is not a valid code point, and thus "safe" here)

Otherwise, the internals work more or less as before (except that it doesn't internalize the exceptions...)

Here is the "ideal" version of std.string.unformat, ignoring the current GDC Mangling preprocessing hacks:
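 void unformat(char[] s, ...)
 {
     // idx is the next read position; old_idx remembers where the
     // last getc started, so ungetc can step back one character.
     size_t idx = 0, old_idx;

     dchar getc()
     {
         old_idx = idx;
         if (idx >= s.length)
             return cast(dchar) EOF;
         // decode advances idx past one UTF-8 sequence
         return std.utf.decode(s, idx);
     }

     dchar ungetc(dchar c)
     {
         idx = old_idx;
         return c;
     }

     std.unformat.unFormat(&getc, &ungetc, _arguments, _argptr);
 }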
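
The same getc/ungetc design can be sketched in C, where the state captured by the nested functions becomes an explicit struct. This is an illustration under assumptions, not a port: all names (`StringSource`, `source_getc`, `source_ungetc`, `read_unsigned`, `READ_EOF`) are made up, the source is treated as ASCII where the D code decodes UTF-8, and overflow is not checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define READ_EOF 0xFFFFFFFFu   /* sentinel: not a valid code point */

/* State that the nested D functions capture implicitly. */
typedef struct {
    const char *text;
    size_t length;
    size_t idx;      /* next position to read */
    size_t old_idx;  /* position before the last get, for one-step pushback */
} StringSource;

/* Return the next character, or READ_EOF at end of input. */
uint32_t source_getc(StringSource *src)
{
    src->old_idx = src->idx;
    if (src->idx >= src->length)
        return READ_EOF;
    return (unsigned char)src->text[src->idx++];
}

/* Push back the character most recently returned by source_getc. */
uint32_t source_ungetc(StringSource *src, uint32_t c)
{
    src->idx = src->old_idx;
    return c;
}

/* Parse an unsigned decimal number using only the two callbacks.
   Returns 1 if at least one digit was read, 0 otherwise. */
int read_unsigned(StringSource *src, unsigned *out)
{
    unsigned value = 0;
    int digits = 0;
    uint32_t c;

    while ((c = source_getc(src)) != READ_EOF && c >= '0' && c <= '9') {
        value = value * 10 + (unsigned)(c - '0');
        digits++;
    }
    if (c != READ_EOF)
        source_ungetc(src, c);   /* leave the non-digit for the next reader */
    if (digits == 0)
        return 0;
    *out = value;
    return 1;
}

/* Self-check: "12 7" yields 12, a space, 7, then end of input. */
void check_string_source(void)
{
    StringSource src = { "12 7", 4, 0, 0 };
    unsigned a = 0, b = 0;

    assert(read_unsigned(&src, &a) == 1 && a == 12);
    assert(source_getc(&src) == ' ');
    assert(read_unsigned(&src, &b) == 1 && b == 7);
    assert(source_getc(&src) == READ_EOF);
}
```

The parser never touches the string directly, which is what lets the same parsing code sit behind both a string reader and a file reader.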

You can use this as: (very similar to "format")

 int i, j;
 unformat("1 2", "%d %d", &i, &j);
 unformat("1 2", "%d", &i, "%d", &j);
 unformat("1 2", &i, &j);
 assert(i == 1 && j == 2);

Since i and j are ints, it'll default to "%d".

Then there is the readf function, which also works as expected:
 import std.stdio;
 
 void main()
 {
   char[] s;
   write("What's is your name: ");
   readf("%s", &s);
   writefln("Hello, %s!", s);
 }

Which inputs/outputs something like:

 What's is your name: Anders
 Hello, Anders!

(yes, this is the actual D program)

Note that if you pass "s" instead of "&s", an Error will be thrown...
(this should stop the usual scanf bugs, with forgetting to &-prefix ?)

The program also uses the formatless version of writef called "write", which doesn't treat '%' characters special but just prints them out...

Once the new version of GDC is out, I will try to see if the TypeInfo passing can't be fixed for that compiler too and then post some code.

  * Copyright (C) 2004 by Sean Kelly
  * Copyright (C) 2005 by Anders F Bjoerklund
  *
  * Permission to use, copy, modify, distribute and sell this software
  * and its documentation for any purpose is hereby granted without fee,
  * provided that the above copyright notice appear in all copies and
  * that both that copyright notice and this permission notice appear
  * in supporting documentation.  Author makes no representations about
  * the suitability of this software for any purpose. It is provided
  * "as is" without express or implied warranty.

Original file came from: http://home.f4.ca/sean/d/stdio.zip
(which had the Open Source license agreement duplicated above)

Thanks to Sean for doing the grunt-work with format parsing, so we (still) don't have to rename it to "std.stdo" anymore :-)

--anders

PS. Yes, it uses pointers. Kris has already written C++-style bitshift-operator overloads for people who want that... ?
http://svn.dsource.org/svn/projects/mango/trunk/doc/html/classAbstractReader.html
http://svn.dsource.org/svn/projects/mango/trunk/doc/html/classAbstractWriter.html

When (and if) D supports "out" arguments for variadic lists, the code can be changed to support those "out" vars instead. Although it would then also need some kind of R/O attribute to be able to differentiate between format strings and string params?

Meanwhile, the pointers work just fine (and it checks the types!)
Mar 17 2005
Sean Kelly <sean f4.ca> writes:
In article <d1bhnv$ek6$1 digitaldaemon.com>, Anders F Björklund says...
> I changed this package to break it into std.stdio.readf and std.string.unformat...

Nice! Is this version available online?

Sean
Mar 17 2005
Anders F Björklund <afb algonet.se> writes:
Sean Kelly wrote:

>> I changed this package to break it into
>> std.stdio.readf and std.string.unformat...
>
> Nice! Is this version available online?

Not yet, have to clean it up and backport it into the DMD release again... (currently it's done to a tweaked GDC, you see) And it *really* wants TypeInfo?

Main reason was to get opinions on:
1) changed getc delegate definitions
2) throwing exceptions on errors

Also, I still need to write the multibyte wrappers of getc/ungetc for file based streams (regular, as well as the wide orientation kind)

--anders
Mar 17 2005
Sean Kelly <sean f4.ca> writes:
In article <d1co48$1nlr$1 digitaldaemon.com>, Anders F Björklund says...
> Not yet, have to clean it up and backport it back into the DMD release again... (currently it's done to a tweaked GDC, you see) And it *really* wants TypeInfo?

I'll have to re-evaluate TypeInfo in DMD. I don't suppose it's working yet for pointer types? And why in the world doesn't GDC properly generate copies of the TypeInfo array?

> Main reason was to get opinions on:
> 1) changed getc delegate definitions

I mostly created the getc/ungetc specs as they were because they were easier to embed in boolean expressions. ie.

 if( !getc( ch ) ) return;

is easier to write than:

 if( ( ch = getc() ) == WEOF ) return;

I figured the C function would need to be wrapped either way, so this seemed a decent gain. Especially since I think the reason the C routines are written the way they are is because C lacks an output qualifier. But it's mostly a cosmetic issue, so it doesn't matter much to me either way.

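The two calling styles compared above can be put side by side in C. A minimal sketch with made-up names (`Cursor`, `get_flag`, `get_sentinel`, `END_OF_INPUT`); a pointer argument stands in for the D "out" parameter, and a string cursor stands in for the real input stream.

```c
#include <assert.h>
#include <stdbool.h>

#define END_OF_INPUT (-1)   /* sentinel, playing the role of WEOF */

/* Hypothetical input source: a cursor over a NUL-terminated string. */
typedef struct {
    const char *pos;
} Cursor;

/* Style 1: report success in the return value and deliver the character
   through a pointer, the nearest C equivalent of a D "out" parameter. */
bool get_flag(Cursor *in, int *ch)
{
    if (*in->pos == '\0')
        return false;
    *ch = (unsigned char)*in->pos++;
    return true;
}

/* Style 2: return the character itself, or a sentinel at end of input. */
int get_sentinel(Cursor *in)
{
    if (*in->pos == '\0')
        return END_OF_INPUT;
    return (unsigned char)*in->pos++;
}

/* The same counting loop written in the flag style. */
int count_flag(Cursor in)
{
    int n = 0, ch;
    while (get_flag(&in, &ch))
        n++;
    return n;
}

/* The same counting loop written in the sentinel style. */
int count_sentinel(Cursor in)
{
    int n = 0, ch;
    while ((ch = get_sentinel(&in)) != END_OF_INPUT)
        n++;
    return n;
}

/* Self-check: both styles see the same three characters. */
void check_both_styles(void)
{
    Cursor text = { "abc" };
    assert(count_flag(text) == 3);
    assert(count_sentinel(text) == 3);
}
```

Both loops do the same work; the difference is only where the end-of-input test appears in the condition.
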
> 2) throwing exceptions on errors

I mostly didn't do this with my version of the functions because I thought it made sense that unFormat should throw the same exceptions as format, but doing so created a dependency I wasn't happy with for an add-on library. If this stuff made it into Phobos I fully support the idea of consistency between the functions. This fix should be pretty easy anyway, as it just amounts to putting a "throw" in the necessary catch blocks at the bottom of the unFormat implementation (unFormat uses exceptions internally for flow control).

> Also, I still need to write the multibyte wrappers of getc/ungetc
> for file based streams (regular, as well as the wide orientation kind)

My release used the same wrappers for file i/o as it used for file i/o. Are these functions not available in Linux?

Sean
Mar 17 2005
Anders F Björklund <afb algonet.se> writes:
Sean Kelly wrote:

> I'll have to re-evaluate TypeInfo in DMD.  I don't suppose it's working yet for
> pointer types?

Nope, they are all of the "TypeInfo" base class... :-(

> And why in the world doesn't GDC properly generate copies of the
> TypeInfo array?

Maybe I explained myself badly. You can of course pass _arguments and _argptr off to subroutines. It is just that "arguments[i] is typeid(int*)" will no longer work... The identity is lost, when doing the workaround like that.

It still works, if they are done against the original _arguments and in the same module (I'm a little hazy on the details why that is so, just *that* it is so...)

If the TypeInfo/typeid was working, all would be cool.

> I figured the C function would need to be wrapped either way, so this seemed a
> decent gain.  Especially since I think the reason the C routines are written the
> way they are is because C lacks an output qualifier.  But it's mostly a cosmetic
> issue, so it doesn't matter much to me either way.

The read/write functions in std.stream work like you describe, with out parameters (they throw Exceptions on EOF, instead of returning a bit, but that's just a matter of preference...)

Just thought that "dchar getc()" was a better match for "putc(dchar)", and that there seemed to be a lot of checking for eof spread out in the code ? That's all.

> I mostly didn't do this with my version of the functions because I thought it
> made sense that unFormat should throw the same exceptions as format, but doing
> so created a dependency I wasn't happy with for an add-on library.

Yeah, it did mean hacking a few things in std.format... And there are some *nasty* circular dependencies going on, and double if not trouble defines of things like "stdin" and "va_list"

Had to resort to e.g. "alias std.stdarg.va_list va_list;"

> If this stuff made it into Phobos I fully support the idea of consistency
> between the functions.  This fix should be pretty easy anyway, as it just
> amounts to putting a "throw" in the necessary catch blocks at the bottom of
> the unFormat implementation (unFormat uses exceptions internally for flow
> control).

I did leave the overflow checks in, but most should be passed further ?

>> Also, I still need to write the multibyte wrappers of getc/ungetc
>> for file based streams (regular, as well as the wide orientation kind)
>
> My release used the same wrappers for file i/o as it used for file i/o. Are these functions not available in Linux?

Yes, I just didn't loop over the bytes to reassemble UTF-32 (yet)

Either way, readf and unformat *definitely* have a place in a future release of Phobos - next to writef and format...

Just need to get the TypeInfo stuff completed first ? (the major part being adding ti for pointer types...)

--anders
Mar 17 2005