digitalmars.D - readf/unformat 1.4 released

Sean Kelly <sean f4.ca> writes:

For those of you who don't know, readf began as an attempt at a full
C99-compliant scanf implementation in D.  It's since been renamed to match the
Phobos writef/format functions a bit more closely, and this version attempts to
bring usage a bit closer to readf.  What's new in this version:

- Support for negative zero and negative infinity (previous version ignored sign
in these cases).  This version also does not allow the optional sign to appear
before "NAN" as IMO it's meaningless.  So "+NAN" and "-NAN" will both cause an
error.  If you don't like this, please let me know.  It is contrary to the C99
spec.

- Format strings are no longer necessary.  Default formats are:
%s: char arrays
%c: char pointers
%i: integer/bit
%f: floating point
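The sign handling in the first item can be illustrated with C's strtod, which the C99 scanf family follows for floating-point input. This is only a sketch in C, not the package's code, and `parse_double` and `parsed_negative` are hypothetical helper names. Note that C99's strtod accepts a signed NAN, which is the point where this release says it deliberately differs.

```c
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse a whole string as a double with C99 strtod.
   Returns false if no conversion happened or trailing characters remain. */
bool parse_double(const char *text, double *out)
{
    char *end = NULL;
    double value = strtod(text, &end);
    if (end == text || *end != '\0')
        return false;
    *out = value;
    return true;
}

/* True when the parsed value keeps its minus sign, including -0.0 and -inf. */
bool parsed_negative(const char *text)
{
    double value;
    return parse_double(text, &value) && signbit(value) != 0;
}
```

A plain comparison such as `value < 0` would miss negative zero, which is why the sketch checks the sign bit instead.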

unFormat still will not throw an exception on parameter mismatch, but will
return immediately instead.  This is the only interface issue I know of where
this package diverges from doFormat/writef.  The format string parsing is still
fully scanf-compliant, so there are some redundant format specifiers.  Check the
C99 spec or the included text file to get an idea of what specifiers do what.

By the way, the code currently assumes input data to be in UTF-8 or UTF-16
(native) format as it uses the Phobos toUTFXX functions for conversion.  The
package includes a custom utf.d that allows delegates and consists of two
implementation files: unformat.d and stdio.d.  The format string and all
incoming data are converted to UTF-32 before evaluation to facilitate
comparison.  As usual, please let me know what you think.  The file is here:

http://home.f4.ca/sean/d/stdio.zip


Sean

Sep 24 2004

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Sean Kelly" <sean f4.ca> wrote in message
news:cj1of5$cjp$1 digitaldaemon.com...
 For those of you who don't know, readf began as an attempt at a full
 C99-compliant scanf implementation in D.  It's since been renamed to match the
 Phobos writef/format functions a bit more closely, and this version attempts to
 bring usage a bit closer to readf.  What's new in this version:

 - Support for negative zero and negative infinity (previous version ignored sign
 in these cases).  This version also does not allow the optional sign to appear
 before "NAN" as IMO it's meaningless.  So "+NAN" and "-NAN" will both cause an
 error.  If you don't like this, please let me know.  It is contrary to the C99
 spec.

 - Format strings are no longer necessary.  Default formats are:
 %s: char arrays
 %c: char pointers
 %i: integer/bit
 %f: floating point

 unFormat still will not throw an exception on parameter mismatch, but will
 return immediately instead.  This is the only interface issue I know of where
 this package diverges from doFormat/writef.  The format string parsing is still
 fully scanf-compliant, so there are some redundant format specifiers.  Check the
 C99 spec or the included text file to get an idea of what specifiers do what.

 By the way, the code currently assumes input data to be in UTF-8 or UTF-16
 (native) format as it uses the Phobos toUTFXX functions for conversion.  The
 package includes a custom utf.d that allows delegates and consists of two
 implementation files: unformat.d and stdio.d.  The format string and all
 incoming data are converted to UTF-32 before evaluation to facilitate
 comparison.  As usual, please let me know what you think.  The file is here:

 http://home.f4.ca/sean/d/stdio.zip


 Sean

How does unFormat take advantage of D's _arguments feature (if it does)? I'm
not quite sure why the parsing code needs to think about %s or %i or
whatever since it can look at the type of the target variable. If it sees
int* then it parses an int and if it sees char[]* it parses a string. The
only role of the format would be to specify where to parse and where to
match literal characters.

-Ben

Sep 24 2004

Sean Kelly <sean f4.ca> writes:

In article <cj1r40$e0k$1 digitaldaemon.com>, Ben Hinkle says...
How does unFormat take advantage of D's _arguments feature (if it does)? I'm
not quite sure why the parsing code needs to think about %s or %i or
whatever since it can look at the type of the target variable. If it sees
int* then it parses an int and if it sees char[]* it parses a string. The
only role of the format would be to specify where to parse and where to
match literal characters.

Format strings can also specify how to parse the incoming data.  Integers, for
example, have a bunch of different format specifiers for different types of
input.  I chose "%i" as the default, since it's the most flexible, but "%d"
specifies decimal numbers only, "%o" is octal, you can include width specifiers,
etc.  I also may have forgotten to allow a bit to be parsed as a string (%s) to
convert "true" and "false" to 1 and 0, respectively.

For a contrived example:











In the second case, 0x1 is expected to be a decimal number so the "x" is
interpreted as non-numeric.  The "%*s" indicates that a string should be read
but assignment should be suppressed (which throws out the "x1"), and the final
string is read as normal because the format string has been exhausted.
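The original example did not survive in this archive. As a rough stand-in, the same distinction can be sketched with C's sscanf, whose specifiers the package follows. The function names and the input "0x1 hello" are assumptions made for illustration, not the code from the post.

```c
#include <stdio.h>

/* Reads an integer with %i, which accepts decimal, octal (leading 0)
   and hexadecimal (leading 0x) input, then reads one word.
   Returns the number of values assigned. */
int read_flexible(const char *input, int *number, char word[32])
{
    return sscanf(input, "%i %31s", number, word);
}

/* Reads an integer with %d (decimal only), discards the rest of that
   token with %*s, then reads the next word.
   Suppressed conversions are not counted in the return value. */
int read_decimal(const char *input, int *number, char word[32])
{
    return sscanf(input, "%d%*s %31s", number, word);
}
```

With the input "0x1 hello", the %i version stores 1, while the %d version stops at the "x" and stores 0; the %*s then swallows the leftover "x1" so that the word can still be read.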

So in many cases there's no need to use format specifiers.  The code still uses
them internally even when one isn't supplied because it simplifies things, but
this is all invisible to the programmer.

unFormat takes advantage of the _arguments collection by using it to determine
what type is being written to (it will return if you try to read a string into
an integer, for example--writef would throw a FormatError in this situation),
and to determine what to expect if no format string is supplied.

You can also do stuff like this:








So there's no restriction on the number or the location of format strings.  All
arguments are evaluated left to right.
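Plain C has no direct equivalent of interleaving format strings and targets, but the left-to-right evaluation can be approximated by scanning in steps and resuming where the previous step stopped. A minimal sketch, assuming a string input; `read_two_steps` is a hypothetical name and %n is the standard scanf directive that reports how many characters have been consumed.

```c
#include <stdio.h>

/* Parses two integers with two separate format strings, resuming the
   second scan where the first one stopped (tracked with %n).
   Returns the total number of values assigned. */
int read_two_steps(const char *input, int *first, int *second)
{
    int consumed = 0;
    if (sscanf(input, "%d%n", first, &consumed) != 1)
        return 0;
    if (sscanf(input + consumed, "%d", second) != 1)
        return 1;
    return 2;
}
```

Each step behaves like one format/argument group in the call style described above: a failure in a later group leaves the earlier results in place.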


Sean

Sep 24 2004

"Ben Hinkle" <bhinkle mathworks.com> writes:

"Sean Kelly" <sean f4.ca> wrote in message
news:cj1thq$fd3$1 digitaldaemon.com...
 In article <cj1r40$e0k$1 digitaldaemon.com>, Ben Hinkle says...
How does unFormat take advantage of D's _arguments feature (if it does)? I'm
not quite sure why the parsing code needs to think about %s or %i or
whatever since it can look at the type of the target variable. If it sees
int* then it parses an int and if it sees char[]* it parses a string. The
only role of the format would be to specify where to parse and where to
match literal characters.

 Format strings can also specify how to parse the incoming data.  Integers, for
 example, have a bunch of different format specifiers for different types of
 input.  I chose "%i" as the default, since it's the most flexible, but "%d"
 specifies decimal numbers only, "%o" is octal, you can include width specifiers,
 etc.  I also may have forgotten to allow a bit to be parsed as a string (%s) to
 convert "true" and "false" to 1 and 0, respectively.

 For a contrived example:

 In the second case, 0x1 is expected to be a decimal number so the "x" is
 interpreted as non-numeric.  The "%*s" indicates that a string should be read
 but assignment should be suppressed (which throws out the "x1"), and the final
 string is read as normal because the format string has been exhausted.

 So in many cases there's no need to use format specifiers.  The code still uses
 them internally even when one isn't supplied because it simplifies things, but
 this is all invisible to the programmer.

 unFormat takes advantage of the _arguments collection by using it to determine
 what type is being written to (it will return if you try to read a string into
 an integer, for example--writef would throw a FormatError in this situation),
 and to determine what to expect if no format string is supplied.

 You can also do stuff like this:

 So there's no restriction on the number or the location of format strings.  All
 arguments are evaluated left to right.


 Sean

cool! so a C scanf call

 scanf("%d %d",&i,&j)

can be any of

 readf("%d %d",&i,&j);
 readf("%d",&i,"%d",&j);
 readf(&i,&j);

assuming i and j are ints. very nifty.

Sep 24 2004

Sean Kelly <sean f4.ca> writes:

In article <cj219m$h14$1 digitaldaemon.com>, Ben Hinkle says...
cool! so a C scanf call

 scanf("%d %d",&i,&j)

can be any of

 readf("%d %d",&i,&j);
 readf("%d",&i,"%d",&j);
 readf(&i,&j);

assuming i and j are ints. very nifty.

Yup.  I didn't think about it until just now, but this may come in handy for
internationalization, since the whole "%1" concept that's been talked about can
be faked just by reordering parameters.


Sean

Sep 24 2004

Anders F Björklund <afb algonet.se> writes:

Sean Kelly wrote: (back in 2004-09-24, that was)

 For those of you who don't know, readf began as an attempt at a full
 C99-compliant scanf implementation in D.  It's since been renamed to match the
 Phobos writef/format functions a bit more closely, and this version attempts to
 bring usage a bit closer to readf. 

[...]
 unFormat still will not throw an exception on parameter mismatch, but will
 return immediately instead.  This is the only interface issue I know of where
 this package diverges from doFormat/writef.  

I changed this package to break it into
std.stdio.readf and std.string.unformat...

I also made it throw Exceptions on '%' format errors (FormatError)
and on parameter mismatch, e.g. when not passing pointers.


The missing TypeInfo for pointers, and the fact that you cannot
pass _arguments and _argptr around with GDC without losing info
makes it a horrible kludge at the moment - but it does work! ;-)
(currently unformat only works from within std.string, though...)


Sean Kelly's old declaration of unFormat was:

 int unFormat( bit delegate( out dchar ) getc,
               bit delegate( dchar ) ungetc,
               TypeInfo[] arguments,
               void* argptr )

This was changed to use EOF and lose the "bit":

 void unFormat( dchar delegate() getc, dchar delegate(dchar) ungetc,
               TypeInfo[] arguments, va_list argptr,
               Mangle[] mangle = null, Mangle[] mangle2 = null)

(last two parameters being part of the GDC kludge,
you should already know "va_list" from std.stdarg:)

 version (GNU) {
     // va_list might be a pointer, but assuming so is not portable.
     private import gcc.builtins;
     alias __builtin_va_list va_list;
 } else {
     alias void* va_list;
 }


You only provide two delegates: getc and ungetc,
which are very similar to their C counterparts...
(making the wrappers for fgetc and ungetc simple)

     dchar getc();
     dchar ungetc(dchar c);

Should an EOF occur, the "new" versions now return
cast(dchar) std.c.stdio.EOF, or 0xFFFFFFFF as UTF-32
(which is not a valid code point, and thus "safe" here).
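Why the converted EOF value cannot collide with real input can be checked with a few lines of C. This is a sketch under the assumption that code points are held in 32-bit unsigned values; `eof_sentinel` and `is_valid_code_point` are hypothetical names. EOF is only guaranteed to be a negative int, so the exact value 0xFFFFFFFF holds where EOF is -1, which is typical but not required.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Highest code point defined by Unicode. */
#define MAX_CODE_POINT 0x10FFFFu

/* EOF converted to a 32-bit code unit; 0xFFFFFFFF when EOF is -1,
   which is typical but not required by the C standard. */
uint32_t eof_sentinel(void)
{
    return (uint32_t)EOF;
}

/* A value is a valid Unicode scalar value if it is in range and is
   not a UTF-16 surrogate. */
bool is_valid_code_point(uint32_t c)
{
    return c <= MAX_CODE_POINT && !(c >= 0xD800u && c <= 0xDFFFu);
}
```

Because any negative int converts to a value of at least 0x80000000, the sentinel always lies above U+10FFFF and can never be mistaken for a decoded character.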

Otherwise, the internals work more or less as before
(except that it doesn't internalize the exceptions...)


Here is the "ideal" version of std.string.unformat,
ignoring the current GDC Mangling preprocessing hacks:

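 void unformat(char[] s, ...)
 {
     // idx is the read position in s; old_idx remembers where the
     // most recently read character started, for a one-level ungetc.
     size_t idx = 0, old_idx;

     // Decode and return the next code point, or EOF at end of input.
     dchar getc()
     {
         old_idx = idx;
         if (idx >= s.length)
             return cast(dchar) EOF;
         return std.utf.decode(s, idx);
     }

     // Push back the last character by rewinding to where it started.
     dchar ungetc(dchar c)
     {
         idx = old_idx;
         return c;
     }

     std.unformat.unFormat(&getc, &ungetc, _arguments, _argptr);
 }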

You can use this as: (very similar to "format")

int i, j;
unformat("1 2", "%d %d",&i,&j);
unformat("1 2", "%d",&i,"%d",&j);
unformat("1 2", &i,&j);
assert(i == 1 && j == 2);

Since i and j are ints, it'll default to "%d".


Then there is the readf function, which also works as expected:

 import std.stdio;
 
 void main()
 {
   char[] s;
   write("What is your name: ");
   readf("%s", &s);
   writefln("Hello, %s!", s);
 }

Which inputs/outputs something like:
     What is your name: Anders
     Hello, Anders!
(yes, this is the actual D program)

Note that if you pass "s" instead of "&s", an Error will be thrown...
(this should stop the usual scanf bugs, with forgetting to &-prefix ?)

The program also uses the formatless version of writef called "write",
which doesn't treat '%' characters specially but just prints them out...


Once the new version of GDC is out, I will try to see if the TypeInfo
passing can't be fixed for that compiler too and then post some code.

  * Copyright (C) 2004 by Sean Kelly
  * Copyright (C) 2005 by Anders F Bjoerklund
  *
  * Permission to use, copy, modify, distribute and sell this software
  * and its documentation for any purpose is hereby granted without fee,
  * provided that the above copyright notice appear in all copies and
  * that both that copyright notice and this permission notice appear
  * in supporting documentation.  Author makes no representations about
  * the suitability of this software for any purpose. It is provided
  * "as is" without express or implied warranty.

Original file came from: http://home.f4.ca/sean/d/stdio.zip
(it carried the open-source license notice reproduced above)


Thanks to Sean for doing the grunt-work with format parsing,
so we (still) don't have to rename it to "std.stdo" anymore :-)
--anders


PS. Yes, it uses pointers. Kris has already written C++-style
     bitshift-operator overloads for people who want that... ?

http://svn.dsource.org/svn/projects/mango/trunk/doc/html/classAbstractReader.html
http://svn.dsource.org/svn/projects/mango/trunk/doc/html/classAbstractWriter.html

     When (and if) D supports "out" arguments for variadic lists,
     the code can be changed to support those "out" vars instead.
     Although it would then also need some kind of R/O attribute to be
     able to differentiate between format strings and string params?
     Meanwhile, the pointers work just fine (and it checks the types!)

Mar 17 2005

Sean Kelly <sean f4.ca> writes:

In article <d1bhnv$ek6$1 digitaldaemon.com>,
Anders F Björklund says...
Sean Kelly wrote: (back in 2004-09-24, that was)

 For those of you who don't know, readf began as an attempt at a full
 C99-compliant scanf implementation in D.  It's since been renamed to match the
 Phobos writef/format functions a bit more closely, and this version attempts to
 bring usage a bit closer to readf. 

[...]
 unFormat still will not throw an exception on parameter mismatch, but will
 return immediately instead.  This is the only interface issue I know of where
 this package diverges from doFormat/writef.  

I changed this package to break it into
std.stdio.readf and std.string.unformat...

Nice!  Is this version available online?


Sean

Mar 17 2005

Anders F Björklund <afb algonet.se> writes:

Sean Kelly wrote:

I changed this package to break it into
std.stdio.readf and std.string.unformat...

 
 Nice!  Is this version available online?

Not yet, have to clean it up and backport
it back into the DMD release again...

(currently it's done to a tweaked GDC,
you see) And it *really* wants TypeInfo?


Main reason was to get opinions on:
1) changed getc delegate definitions
2) throwing exceptions on errors

Also, I still need to write the
multibyte wrappers of getc/ungetc
for file based streams (regular, as
well as the wide orientation kind)

--anders

Mar 17 2005

Sean Kelly <sean f4.ca> writes:

In article <d1co48$1nlr$1 digitaldaemon.com>,
Anders F Björklund says...
Sean Kelly wrote:

I changed this package to break it into
std.stdio.readf and std.string.unformat...

 
 Nice!  Is this version available online?

Not yet, have to clean it up and backport
it back into the DMD release again...

(currently it's done to a tweaked GDC,
you see) And it *really* wants TypeInfo?

I'll have to re-evaluate TypeInfo in DMD.  I don't suppose it's working yet for
pointer types?  And why in the world doesn't GDC properly generate copies of the
TypeInfo array?

Main reason was to get opinions on:
1) changed getc delegate definitions

I mostly created the getc/ungetc specs as they were because they were easier to
embed in boolean expressions.  ie.

if( !getc( ch ) ) return;

is easier to write than:

if( ( ch = getc() ) == WEOF ) return;

I figured the C function would need to be wrapped either way, so this seemed a
decent gain.  Especially since I think the reason the C routines are written the
way they are is because C lacks an output qualifier.  But it's mostly a cosmetic
issue, so it doesn't matter much to me either way.
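The two calling styles can be put side by side in C, where the out-parameter has to be written as a pointer. This is only a sketch over a string-backed source; `struct reader`, `reader_getc`, `reader_next` and `count_chars` are hypothetical names, not part of either version of the package.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical string-backed input source. */
struct reader {
    const char *text;
    size_t pos;
};

/* Sentinel style: returns the next character, or EOF at end of input. */
int reader_getc(struct reader *r)
{
    if (r->text[r->pos] == '\0')
        return EOF;
    return (unsigned char)r->text[r->pos++];
}

/* Out-parameter style: stores the character in *ch and reports success. */
bool reader_next(struct reader *r, int *ch)
{
    int c = reader_getc(r);
    if (c == EOF)
        return false;
    *ch = c;
    return true;
}

/* Counts characters using the out-parameter style as a loop condition. */
size_t count_chars(const char *text)
{
    struct reader r = { text, 0 };
    size_t n = 0;
    int ch;
    while (reader_next(&r, &ch))
        n++;
    return n;
}
```

The out-parameter form reads naturally in a condition, while the sentinel form mirrors putc more closely; the second is easily built on the first, which fits the remark that the difference is mostly cosmetic.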

2) throwing exceptions on errors

I mostly didn't do this with my version of the functions because I thought it
made sense that unFormat should throw the same exceptions as format, but doing
so created a dependency I wasn't happy with for an add-on library.  If this
stuff made it into Phobos I fully support the idea of consistency between the
functions.  This fix should be pretty easy anyway, as it just amounts to putting
a "throw" in the necessary catch blocks at the bottom of the unFormat
implementation (unFormat uses exceptions internally for flow control).

Also, I still need to write the multibyte wrappers of getc/ungetc
for file based streams (regular, as well as the wide orientation kind)

My release used the same wrappers for file i/o as it used for file i/o.  Are
these functions not available in Linux?


Sean

Mar 17 2005

Anders F Björklund <afb algonet.se> writes:

Sean Kelly wrote:

 I'll have to re-evaluate TypeInfo in DMD.  I don't suppose it's working yet for
 pointer types? 

Nope, they are all of the "TypeInfo" base class... :-(

 And why in the world doesn't GDC properly generate copies of the
 TypeInfo array

Maybe I explained myself badly. You can of course pass
_arguments and _argptr off to subroutines. It is just
that "arguments[i] is typeid(int*)" will no longer work...

The identity is lost, when doing the workaround like that.

It still works, if they are done against the original
_arguments and in the same module (I'm a little shady
on the details why that is so, just *that* it is so...)

If the TypeInfo/typeid was working, all would be cool.

 I figured the C function would need to be wrapped either way, so this seemed a
 decent gain.  Especially since I think the reason the C routines are written the
 way they are is because C lacks an output qualifier.  But it's mostly a cosmetic
 issue, so it doesn't matter much to me either way.

The read/write functions in std.stream work like you describe,
with out parameters (they throw Exceptions on EOF instead
of returning a bit, but that's just a matter of preference...)

Just thought that "dchar getc()" was a better match for
"putc(dchar)", and that there seemed to be a lot of
checking for eof spread out in the code ? That's all.

 I mostly didn't do this with my version of the functions because I thought it
 made sense that unFormat should throw the same exceptions as format, but doing
 so created a dependency I wasn't happy with for an add-on library. 

Yeah, it did mean hacking a few things in std.format...

And there are some *nasty* circular dependencies going on, and
double if not trouble defines of things like "stdin" and "va_list"

Had to resort to e.g. "alias std.stdarg.va_list va_list;"

 If this
 stuff made it into Phobos I fully support the idea of consistency between the
 functions.  This fix should be pretty easy anyway, as it just amounts to putting
 a "throw" in the necessary catch blocks at the bottom of the unFormat
 implementation (unFormat uses exceptions internally for flow control).

I did leave the overflow checks in, but most should be passed further ?

Also, I still need to write the multibyte wrappers of getc/ungetc
for file based streams (regular, as well as the wide orientation kind)

 
 My release used the same wrappers for file i/o as it used for file i/o.  Are
 these functions not available in Linux?

Yes, I just didn't loop over the bytes to reassemble UTF-32 (yet)
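For reference, looping over the bytes of a UTF-8 stream to reassemble one UTF-32 value looks roughly like the following in C. It is a sketch with a hypothetical name (`utf8_decode`) that works on a byte buffer rather than a FILE, and it is not taken from either version of the package.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one UTF-8 sequence from bytes[0..len) into *out.
   Returns the number of bytes consumed, or 0 on malformed or truncated
   input. Rejects overlong forms, surrogates and values above U+10FFFF. */
size_t utf8_decode(const unsigned char *bytes, size_t len, uint32_t *out)
{
    /* Smallest value that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t need;
    uint32_t c;

    if (len == 0) return 0;
    if (bytes[0] < 0x80)                { need = 1; c = bytes[0]; }
    else if ((bytes[0] & 0xE0) == 0xC0) { need = 2; c = bytes[0] & 0x1F; }
    else if ((bytes[0] & 0xF0) == 0xE0) { need = 3; c = bytes[0] & 0x0F; }
    else if ((bytes[0] & 0xF8) == 0xF0) { need = 4; c = bytes[0] & 0x07; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (len < need) return 0;
    for (size_t i = 1; i < need; i++) {
        if ((bytes[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        c = (c << 6) | (bytes[i] & 0x3F);
    }
    if (c < min_value[need] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *out = c;
    return need;
}
```

A getc wrapper for a byte-oriented file would read the lead byte, work out how many continuation bytes follow, and apply the same loop; ungetc then has to push back the whole sequence rather than a single byte.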


Either way, readf and unformat *definitely* have a place in
a future release of Phobos - next to writef and format...

Just need to get the TypeInfo stuff completed first ?
(the major part being adding ti for pointer types...)

--anders

Mar 17 2005
