
digitalmars.D - Streams and encoding

reply Sean Kelly <sean f4.ca> writes:
I finally got back on my stream mods today and had a question:  how should the
wrapper class know the encoding scheme of the low-level data?

For example, say all of the formatted IO code is in a mixin or base class
(assume a base class for the sake of discussion) that calls a read(void*, size_t)
or write(void*, size_t) method in the derived class.  Now say I want to read a
char, wchar, or dchar from the stream.  How many bytes should I read, and how do
I know what the encoding format is?  C++ streams handle this fairly simply by
making the char type a template parameter:

# class Stream(CharT) {
#     Stream get(CharT) {}
#     Stream put(CharT) {}
# }

This has the obvious limitation that the programmer must instantiate the proper
type of stream for the data format he is trying to read (as there is only one
get/put method for any char type: CharT).  But it makes things pretty explicit:
Stream!(char) means "this is a stream formatted in UTF8."

The other option I can think of offhand would be to have a class member that
the derived class could set which specifies the encoding format:

# class Stream {
#     enum Encoding{ UTF8, UTF16, UTF32 }
#     Encoding encoding;
#     this() { encoding = Encoding.UTF8; }
#     Stream get(char) {}
#     Stream get(wchar) {}
#     Stream get(dchar) {}
#     ...
# }
# 
# class File: Stream {
#     void open(wchar[] filename) { encoding = Encoding.UTF16; }
# }

This has the benefit of allowing the user to read and write any char type with a
single instantiation, but requires greater complexity in the Stream class and in
the derived class.  And I wonder if such flexibility is truly necessary.

Any other design possibilities?  Preferences?  I'm really trying more to
establish a good formatted IO design than to work out the perfect stream API.
Any other weird issues would be welcome also.


Sean
Aug 03 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceopfj$1hcl$1 digitaldaemon.com>, Sean Kelly says...
I finally got back on my stream mods today and had a question:  how should the
wrapper class know the encoding scheme of the low-level data?

Simple answer - it shouldn't have to.

I suggest using a specialized transcoding filter for such things. That's what Java does (Java calls them Readers and Writers), and Java's streams have been hailed as a shining example of how to do things correctly. Then your streams just connect together naturally, as others have shown in other recent threads. e.g.:

# Stream s = new ZipStream(new BufferedStream(new FilterStream(
#     new Windows1252Reader(stdin))));

(or something similar). You can have factory methods to create transcoders where the encoding is not known until runtime.

Jill
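A factory of the kind Jill mentions might look like this minimal D sketch (the Reader hierarchy and the transcoder class names here are hypothetical, for illustration only - not an existing API):

# // Hypothetical transcoder factory; Reader, Utf8Reader, Utf16Reader and
# // Windows1252Reader are assumed class names.
# Reader createReader(char[] encoding, Stream source)
# {
#     switch (encoding)
#     {
#         case "UTF-8":        return new Utf8Reader(source);
#         case "UTF-16LE":     return new Utf16Reader(source);
#         case "windows-1252": return new Windows1252Reader(source);
#         default:
#             throw new Error("unsupported encoding: " ~ encoding);
#     }
# }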
Aug 03 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <ceor8d$1ihu$1 digitaldaemon.com>, Arcane Jill says...
In article <ceopfj$1hcl$1 digitaldaemon.com>, Sean Kelly says...
I finally got back on my stream mods today and had a question:  how should the
wrapper class know the encoding scheme of the low-level data?

Simple answer - it shouldn't have to.

Works for me. So how does a formatted read/write routine know which format it's targeting?
I suggest using a specialized transcoding filter for such things. That's what
Java does (Java calls them Readers and Writers), and Java's streams have been
hailed as a shining example of how to do things correctly. Then your streams
just connect together naturally, as others have shown in other recent threads.
e.g.:

# Stream s = new ZipStream(new BufferedStream(new FilterStream(new
Windows1252Reader(stdin))));

Okay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format? ie. there would be a UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc?

Sean
Aug 03 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceorv6$1iti$1 digitaldaemon.com>, Sean Kelly says...

Okay, so all the formatted IO routines go in a Reader class and the type of the
reader class determines the format?  ie. there would be an UTF8Writer,
UTF8Reader, UTF16Writer, UTF16Reader, etc?

Got it in one. Plus, you can have a factory function like createReader(char[]), so you can do

# Reader r = createReader("UTF-16LE");

etc. (for when the type is known at run time, not compile time, which is usually). The implementation of createReader() is just a big switch statement, with each case returning a new instance of the relevant class.

(I swapped your questions around. Here's the first one).
Works for me.  So how does a formatted read/write routine know which format it's
targeting?

You got me there. I think the question's too vague, and the answer application-specific.

Generally speaking, at some level, the encoding is known, somehow. Maybe it's specified in the text file itself (XML and HTTP pull this trick - for it to work the very start of the file must comprise only ASCII characters (although they can be encoded in a UTF)); maybe it's specified in a configuration file; maybe it's deduced using some heuristic test; maybe the OS default is assumed. At the level where the encoding is known, decode it (into UTF-8), and then you can use byte streams from then on.

As parabolis said, a stream, in the abstract, deals in ubytes, not chars (because that's what you write to files, sockets, etc.). Classes which implement read() or write() in units other than ubyte shouldn't really be called "streams", which of course is why Java calls them Readers and Writers. (Maybe "filters" for the general case).

Arcane Jill
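One concrete heuristic of this sort is sniffing a Unicode byte-order mark at the start of the data. A minimal D sketch (the function name is made up for illustration; it falls back to UTF-8 when no BOM is present):

# // Guess an encoding from a leading byte-order mark, if any.
# // Illustrative only - real detection would also consider UTF-32 BOMs
# // and content heuristics.
# char[] sniffEncoding(ubyte[] prefix)
# {
#     if (prefix.length >= 3 && prefix[0] == 0xEF
#         && prefix[1] == 0xBB && prefix[2] == 0xBF)
#         return "UTF-8";
#     if (prefix.length >= 2 && prefix[0] == 0xFF && prefix[1] == 0xFE)
#         return "UTF-16LE";
#     if (prefix.length >= 2 && prefix[0] == 0xFE && prefix[1] == 0xFF)
#         return "UTF-16BE";
#     return "UTF-8"; // assume a default
# }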
Aug 03 2004
parent Sean Kelly <sean f4.ca> writes:
In article <ceou9t$1kbq$1 digitaldaemon.com>, Arcane Jill says...
In article <ceorv6$1iti$1 digitaldaemon.com>, Sean Kelly says...

Okay, so all the formatted IO routines go in a Reader class and the type of the
reader class determines the format?  ie. there would be an UTF8Writer,
UTF8Reader, UTF16Writer, UTF16Reader, etc?

Got it in one. Plus, you can have a factory function like createReader(char[]), so you can do

# Reader r = createReader("UTF-16LE");

etc. (for when the type is known at run time, not compile time, which is usually). The implementation of createReader() is just a big switch statement, with each case returning a new instance of the relevant class.

(I swapped your questions around. Here's the first one).
Works for me.  So how does a formatted read/write routine know which format it's
targeting?

You got me there.

No worries. If we've got a class per format then it knows implicitly what format to convert to/from.

Sean
Aug 03 2004
prev sibling next sibling parent reply parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 I finally got back on my stream mods today and had a question:  how should the
 wrapper class know the encoding scheme of the low-level data?
 

I have been wondering who was working on a Stream library. I have many thoughts, many of which are covered in OT - scanf in Java. Here are some notes:

In C (and C++ by extension, I would imagine) the char type is the smallest addressable cell in memory. In D the char is a UTF-8 8-bit code unit, which is quite a different thing. I would suggest you seriously consider defining basic IO using either the ubyte (which represents a general 8-bit value) or possibly the data type that is the native cell size used in memory (something like size_t, I believe).

Also I have noticed the tendency for people to not make the distinction between Input and Output streams. This leads to some problems. Say I want to write a class to handle CRC32 on stream data. It is far simpler and less error prone to compute such a digest on a stream in which data flows in only one direction, especially in a multi-threaded environment. Also the Input and Output distinction allows for stream pumps that automatically pull data from one and push data into another. This is especially useful with bifurcating streams that also do logging.

As for the templatization of streams, I believe a pair of generic data input/output stream classes can be written using templates which will do impedance matching from the 8-bit streams to the n-bit data type you want to read. So you have to write 8-, 16-, 32- and possibly 64- and 128-bit functions.
Here is the foundation of the stream library I imagine:
================================================================
interface DataSink {
     uint write( ubyte[] data, uint off = 0, uint len = 0);
}

interface DataSource {
     uint read( inout ubyte[] data, uint off = 0, uint len = 0);
     ulong seek( ulong size );
}
================================================================

The data being read/written by native interface classes:
================================================================
FileInputStream : DataSource
FileOutputStream : DataSink
SocketInputStream : DataSource
SocketOutputStream : DataSink
MMapInputStream : DataSource
MMapOutputStream : DataSink
================================================================

The data is then manipulated providing buffering, digesting, en/de-cryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...
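The impedance matching described above might be sketched roughly as follows (DataInputStream and readUInt are hypothetical names layered on the DataSource interface; little-endian byte order is assumed):
================================================================
// Hypothetical typed reader over an 8-bit DataSource. Sketch only;
// error handling and endianness policy are simplified.
class DataInputStream
{
    private DataSource src;

    this(DataSource src) { this.src = src; }

    uint readUInt()
    {
        ubyte[] buf = new ubyte[4];
        if (src.read(buf) != 4)
            throw new Error("unexpected end of stream");
        return buf[0]
             | (cast(uint)buf[1]) << 8
             | (cast(uint)buf[2]) << 16
             | (cast(uint)buf[3]) << 24;
    }
}
================================================================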
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
wrote:

<snip>

 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

I think you need functions in the form:

# ulong write(void* data, ulong len = 0, ulong off = 0);

notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong.

If you use ubyte[] you don't need len or off, as you can call with:

# ubyte[] big = cast(ubyte[])"regan was here";
# write(big[6..9]);

to achieve both.

The void* allows easy specialised write functions, eg.

# bool write(int x) { write(&x, x.sizeof); }

I'm not sure whether uint or ulong should be used, anyone got opinions/reasons for one or the other?
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to write 
 interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

See my earlier post (with source) on how this works. Note there was a problem with it which I have since fixed, changing 'super.' to 'this.' in the stream template class.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
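The bolt-in shape Regan describes might look roughly like this in D (a loose sketch based on the description here, not his actual posted source; RawFile is a hypothetical byte-level class):

class InputStream(Base) : Base
{
    alias Base.read read;   // keep the raw read(ubyte[]) visible

    // Typed convenience read built on the Base's raw byte read().
    bool read(out int x)
    {
        ubyte[] buf = new ubyte[x.sizeof];
        if (read(buf) != buf.length)
            return false;
        x = *cast(int*)buf.ptr;
        return true;
    }
}

// Usage: bolt the generic operations onto a concrete byte source.
// alias InputStream!(RawFile) FileReader;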
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 <snip>
 
 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

 I think you need functions in the form:

 # ulong write(void* data, ulong len = 0, ulong off = 0);

 notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off, as you can call with:

 # ubyte[] big = cast(ubyte[])"regan was here";
 # write(big[6..9]);

 to achieve both.

I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.
 
 The void* allows easy specialised write functions, eg.
   bool write(int x) { write(&x,x.sizeof); }

The void* is a pointer with no associated type. The arrays in D are infinitely better than void* pointers because arrays have extra information. As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to 
 write interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

I made an argument that I believe input and output should be clearly separated, which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables separate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.
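The CRC32 example from earlier in the thread shows why the one-directional split helps; a minimal sketch (CRC32InputStream and updateCrc32 are hypothetical names, and the actual CRC table/update is omitted):
================================================================
// Hypothetical input-only CRC32 filter. Because data only flows one
// way, the running digest shares no state with any output path.
class CRC32InputStream : DataSource
{
    private DataSource src;
    private uint crc = 0xFFFFFFFF;

    this(DataSource src) { this.src = src; }

    uint read(inout ubyte[] data, uint off = 0, uint len = 0)
    {
        uint n = src.read(data, off, len);
        for (uint i = 0; i < n; i++)
            crc = updateCrc32(crc, data[off + i]); // assumed helper
        return n;
    }

    ulong seek(ulong size) { return src.seek(size); }

    uint digest() { return crc ^ 0xFFFFFFFF; }
}
================================================================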
Aug 03 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis softhome.net> 
wrote:
 Regan Heath wrote:
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:

 <snip>

 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

 I think you need functions in the form:

 # ulong write(void* data, ulong len = 0, ulong off = 0);

 notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off, as you can call with:

 # ubyte[] big = cast(ubyte[])"regan was here";
 # write(big[6..9]);

 to achieve both.

I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed.

So.. ?
 The len and off parameters allow a caller to take either approach.

Yeah.. we have default parameters, we can provide both options at no cost, so why not.
 The void* allows easy specialised write functions, eg.
   bool write(int x) { write(&x,x.sizeof); }

The void* is a pointer with no associated type.

Correct.
 The arrays in D are infinitely better than void* pointers because arrays 
 have extra information.

Incorrect. D arrays are better for some things, those that need/want the extra information.

Let's ignore our opinions on the use of void* for now: can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?
 As I said earlier in my post the behavior of providing data in a 
 particular non-byte format should be done elsewhere in a single 
 DataXXStream.

Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to 
 write interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

I made an argument that I believe input and output should be clearly separated, which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables separate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.

Sure, wanting to do this does not stop you using bolt-ins. I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis softhome.net> 
 
 The arrays in D are infinitely better than void* pointers because 
 arrays have extra information.

Incorrect. D arrays are better for some things, those that need/want the extra information.

Here I must argue that where C really went wrong was with char*, which allows buffer overruns because you do not know how long the buffer is... I also do not see how you could have used slicing and a void*. How would you know when to stop reading before you had off and len?
 Lets ignore our opinions on the use of void* for now, can you write the 
 write(int x) function above as easily if you do not use void* but use 
 ubyte[] instead?
 

I will do both at the same time... (read on)
 Sure, and when/where you provide it, what will it look like if the 
 underlying write operation takes a ubyte[] and not a void*? is it 
 possible? is it worse than simply using a void*?

I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P

One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin.

Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
================================================================
int readInt( ubyte buf, uint off = 0 ) {
    if( buf.length < off+4 )
        throw new Error( "Buffer overrun" );
    uint result = buf[off+0];
    result |= (cast(int)(buf[off+1])) << 8;
    result |= (cast(int)(buf[off+2])) << 16;
    result |= (cast(int)(buf[off+3])) << 24;
    return result;
}
================================================================
 
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to 
 write interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

I made an argument that I believe input and output should be clearly seperated which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables seperate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.

Sure, wanting to do this does not stop you using bolt-ins. I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning.

I am glad to hear you decided to split them. I think you will find it makes life simpler.

I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 21:41:51 -0400, parabolis <parabolis softhome.net> 
wrote:
 Regan Heath wrote:

 Incorrect. D arrays are better for some things, those that need/want 
 the extra information.

Here I must argue that any knowledge of where C went really wrong was with char* which allows buffer overruns because you do not know how long the buffer is...

 I also do not see how you could have used slicing and a void*.

I didn't/don't use slicing. I think you may be confusing two different points I made.

My first point was that off and len were not required because you can slice into a ubyte[]. So _if_ you use ubyte[] you don't _need_ off and len.

My second point was that instead of ubyte[] you should use void* for convenience. If you use void* you definitely need len.
 How would you know when to stop reading before you had off and len?

I have always had len; my fn prototype is:

# ulong write(void* address, ulong length);

which simply writes length bytes starting at address.
 Lets ignore our opinions on the use of void* for now, can you write the 
 write(int x) function above as easily if you do not use void* but use 
 ubyte[] instead?

I will do both at the same time... (read on)

both? .. on I read ..
 Sure, and when/where you provide it, what will it look like if the 
 underlying write operation takes a ubyte[] and not a void*? is it 
 possible? is it worse than simply using a void*?

I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P

One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin.

Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
================================================================
int readInt( ubyte buf, uint off = 0 ) {

Typo, you missed the [], I have added them below.
 int readInt( ubyte[] buf, uint off = 0 ) {
      if( buf.length <= off+4 )
          throw Error( "Buffer overrun" );
      uint result = buf[off+0];
      result |= (cast(int)(buf[off+1])) << 8;
      result |= (cast(int)(buf[off+2])) << 16;
      result |= (cast(int)(buf[off+3])) << 24;
      return result;
 }
 ================================================================

And this is supposed to be nicer/easier/more efficient than..

bool readInt(out int x) {
    if (read(&x,x.sizeof) != x.sizeof)
        throw new Exception("Out of data");
    return true;
}

As you can see using void* allows very convenient and totally buffer overrun safe code.

<snip>
 I am glad to hear you decided to split them. I think you will find it 
 makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how you 
 deal with the combinatorial problem before I am sold on the idea. If you 
 can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 I didn't/don't use slicing. I think you may be confusing two different 
 points I made.
 
 My first point was that off and len were not required because you can 
 slice into a ubyte[]. So _if_ you use ubyte[] you don't _need_ off and len.
 
 My second point was that instead of ubyte[] you should use void* for 
 convenience. If you use void* you definitely need len.

I see now. I was confused. Sorry.
 Sure, and when/where you provide it, what will it look like if the 
 underlying write operation takes a ubyte[] and not a void*? is it 
 possible? is it worse than simply using a void*?

I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P

One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin.

Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
================================================================
int readInt( ubyte buf, uint off = 0 ) {

Typo, you missed the [], I have added them below.
 int readInt( ubyte[] buf, uint off = 0 ) {
      if( buf.length <= off+4 )
          throw Error( "Buffer overrun" );
      uint result = buf[off+0];
      result |= (cast(int)(buf[off+1])) << 8;
      result |= (cast(int)(buf[off+2])) << 16;
      result |= (cast(int)(buf[off+3])) << 24;
      return result;
 }
 ================================================================

And this is supposed to be nicer/easier/more efficient than..

bool readInt(out int x) {
    if (read(&x,x.sizeof) != x.sizeof)
        throw new Exception("Out of data");
    return true;
}

As you can see using void* allows very convenient and totally buffer overrun safe code.

Show me a safe function that takes void* as a parameter. That was really more the point I was making. There is no way to guarantee in read(void*, uint len) that len is not actually longer than the array someone passes in. When that happens your read function will overwrite the end of the array and eventually write over executable code. Somebody will find that bug and send a specially formatted overly long string that has machine code in it and hijack the program.
 
 <snip>
 
 I am glad to hear you decided to split them. I think you will find it 
 makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how you 
 deal with the combinatorial problem before I am sold on the idea. If 
 you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now. The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> 
wrote:

<snip>

 Show me a safe function that takes void* as a parameter. That was really 
 more the point I was making. There is no way to guarantee in 
 read(void*,uint len) that len is not actually longer than the array 
 someone passes in. When that happens your read function will overwrite 
 the end of the array and eventually write over executable code. Somebody 
 will find that bug and send a specially formatted overly long string 
 that has machine code in it and hijack the program.

I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void*; instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..

.. I have just discovered you can use ubyte[] and get the same sort of function as my void* one, check out...

class Stream
{
    ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0)
    {
        if (length == 0) length = buffer.length;
        buffer[offset..length] = 65;
        return length-offset;
    }

    bool read(out char x)
    {
        if (read(cast(ubyte[])(&x)[0..x.sizeof]) != x.sizeof)
            return false;
        return true;
    }
}

void main()
{
    Stream st = new Stream();
    char c;
    st.read(c);
    printf("%c\n",c);
}

as you can see using a cast, a slice and the address of the char we can do the same thing as with a void*. So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider...

void badBuggyRead(out char x)
{
    read(cast(ubyte[])(&x)[0..1000]);
}

so even tho read uses a ubyte[] it can still overrun.
 <snip>

 I am glad to hear you decided to split them. I think you will find it 
 makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how you 
 deal with the combinatorial problem before I am sold on the idea. If 
 you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other thread in 
 which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
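A runtime factory over those compile-time compositions might be sketched like this (the Reader base type and the switch keys are hypothetical, for illustration only):

// Hypothetical factory bridging compile-time aliases to a runtime
// choice; each alias is still instantiated at compile time, the
// factory merely picks among the concrete classes.
Reader createFileReader(char[] kind, char[] filename)
{
    switch (kind)
    {
        case "crc": return new CRCReader!(File)(filename);
        case "zip": return new ZIPReader!(File)(filename);
        default:    return new File(filename);
    }
}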
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 Show me a safe function that takes void* as a parameter. That was 
 really more the point I was making. There is no way to guarantee in 
 read(void*,uint len) that len is not actually longer than the array 
 someone passes in. When that happens your read function will overwrite 
 the end of the array and eventually write over executable code. 
 Somebody will find that bug and send a specially formatted overly long 
 string that has machine code in it and hijack the program.

I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void*; instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..

That is a good point.
 
 ...
 
 I have just discovered you can use ubyte[] and get the same sort of 
 function as my void* one, check out...
 
 class Stream
 {
     ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0)
     {
         if (length == 0) length = buffer.length;
         buffer[offset..length] = 65;

Now that is pretty neat.
 
 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, consider...
 
 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }
 
 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.
 
 <snip>

 I am glad to hear you decided to split them. I think you will find 
 it makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how 
 you deal with the combinatorial problem before I am sold on the 
 idea. If you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other thread 
 in which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have, eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

Now, this is fine for types we know about at compile time; however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

Consider the number of combinations of just Readers that are possible:

File,Net,Mem       - choose 1 of 3
Compression }
CRC         } - choose any number and in any order
Buffering   }
Image,Audio,Video  - choose 1 of 3

If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> 
wrote:

 Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> 
 wrote:

 Show me a safe function that takes void* as a parameter. That was 
 really more the point I was making. There is no way to guarantee in 
 read(void*,uint len) that len is not actually longer than the array 
 someone passes in. When that happens your read function will overwrite 
 the end of the array and eventually write over executable code. 
 Somebody will find that bug and send a specially formatted overly long 
 string that has machine code in it and hijack the program.

I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void*; instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..

That is a good point.
 ...

 I have just discovered you can use ubyte[] and get the same sort of 
 function as my void* one, check out...

 class Stream
 {
     ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0)
     {
         if (length == 0) length = buffer.length;
         buffer[offset..length] = 65;

Now that is pretty neat.

Just that bit.. or the whole thing? That bit above was a little hack, it sets the whole buffer to 65 or ascii 'A'.
 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, consider...

 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }

 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.

But people will. Assume you're trying to read/write a struct, int, float, whatever: you _have_ to write code like that above and you might get it wrong. It's exactly the same as if you were using:

  read(void* address, ulong length);

you might call that wrong too. I cannot see a difference, and void* is easier to use and smaller than void[].
 <snip>

 I am glad to hear you decided to split them. I think you will find 
 it makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how 
 you deal with the combinatorial problem before I am sold on the 
 idea. If you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other thread 
 in which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have, eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

Now, this is fine for types we know about at compile time; however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

Consider the number of combinations of just Readers that are possible:

File,Net,Mem       - choose 1 of 3
Compression }
CRC         } - choose any number and in any order
Buffering   }
Image,Audio,Video  - choose 1 of 3

If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.

Yeah.. so? When I need one I make an alias and use it.. when I need another I make an alias and use it. It's no different to simply typing new A(new B(new C())) when you use it, _except_ that if you re-use it in several places then my alias is neater. I am not going to alias all x possible combinations right now :)

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis 
 <parabolis softhome.net> wrote:

 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, consider...

 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }

 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.

But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:

Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure, then it is expected they will not do it in a way that causes a buffer overrun.

Most buffer overruns are a result of the fact that dealing with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]). You fail to do that with void*.
 
   read(void* address, ulong length);
 
 you might call that wrong too. I cannot see a difference and void* is 
 easier to use and smaller than void[].
 
 <snip>

 I am glad to hear you decided to split them. I think you will find 
 it makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how 
 you deal with the combinatorial problem before I am sold on the 
 idea. If you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other 
 thread in which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have, eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

Now, this is fine for types we know about at compile time; however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

Consider the number of combinations of just Readers that are possible:

File,Net,Mem       - choose 1 of 3
Compression }
CRC         } - choose any number and in any order
Buffering   }
Image,Audio,Video  - choose 1 of 3

If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.

Yeah.. so? When I need one I make an alias and use it.. when I need another I make an alias and use it. It's no different to simply typing new A(new B(new C())) when you use it, _except_ that if you re-use it in several places then my alias is neater. I am not going to alias all x possible combinations right now :)

So for something that reads from a file, then does buffering, then decompression, then computes a CRC check of the input stream and reads image data, you would use something like this:

================================================================
alias BufferedInputStream!(FileInputStream)
     BufferedFileInputStream;
alias DecompressionInputStream!(BufferedFileInputStream)
     DecompressionBufferedFileInputStream;
alias CRCInputStream!(DecompressionBufferedFileInputStream)
     CRCDecompressionBufferedFileInputStream;
alias ImageInputStream!(CRCDecompressionBufferedFileInputStream)
     ImageCRCDecompressionBufferedFileInputStream;

CRCInputStream crc_in = new
     CRCDecompressionBufferedFileInputStream(filename);
ImageInputStream iin = new
     ImageCRCDecompressionBufferedFileInputStream(crc_in);
================================================================
File - 10 times
Buffered - 10 times
Decompression - 8 times
CRC - 7 times
Image - 4 times
================================

I cannot imagine why you would like having all that alias clutter up your file instead of just using the minimal:

================================================================
CRCInputStream crc_in = new CRCInputStream
(   new DecompressionInputStream
     (   new BufferedInputStream
         (  new FileInputStream( filename )
         )
     )
);
ImageInputStream iin = new ImageInputStream( crc_in );
================================================================
File - 1 time
Buffered - 1 time
Decompression - 1 time
CRC - 2 times
Image - 2 times
================
Aug 04 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 04 Aug 2004 11:37:05 -0400, parabolis <parabolis softhome.net> 
wrote:
 Regan Heath wrote:

 On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> 
 wrote:

 Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis 
 <parabolis softhome.net> wrote:

 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, 
 consider...

 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }

 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.

But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:

Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure, then it is expected they will not do it in a way that causes a buffer overrun.

Most buffer overruns are a result of the fact that dealing with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]).

I don't think so.
 You fail to do that with void*.

I don't try. Because it's impossible. <snip>
 I am not going to alias all x possible combinations right now :)

So for something that reads from a file then does buffering then decompression then computes a CRC check of the input stream and reads image data you would use something like this:

Nope.

alias ImageStream!(CRCStream!(DecompressStream!(File))) CompressedImageCRC; // my 'File' is buffered.
CompressedImageCRC f = new CompressedImageCRC();

Or more likely 'CompressedImageCRC' will be replaced by a name that has context where I use it; if for example it was an image resource for a game it might be simply 'Image'.
 ================================================================
 alias BufferedInputStream!(FileInputStream)
      BufferedFileInputStream;
 alias DecompressionInputStream!(BufferedFileInputStream)
      DecompressionBufferedFileInputStream;
 alias CRCInputStream!(DecompressionBufferedFileInputStream)
      CRCDecompressionBufferedFileInputStream;
 alias ImageInputStream!(CRCDecompressionBufferedFileInputStream)
      ImageCRCDecompressionBufferedFileInputStream;

 CRCInputStream crc_in = new
      CRCDecompressionBufferedFileInputStream(filename);
 ImageInputStream iin = new
      ImageCRCDecompressionBufferedFileInputStream(crc_in);
 ================================================================
 File - 10 times
 Buffered - 10 times
 Decompression - 8 times
 CRC - 7 times
 Image - 4 times
 ================================

 I cannot imagine why you would like having all that alias clutter up 
 your file instead of just using the minimal:
 ================================================================
 CRCInputStream crc_in = new CRCInputStream
 (   new DecompressionInputStream
      (   new BufferedInputStream
          (  new FileInputStream( filename )
          )
      )
 );
 ImageInputStream iin = new ImageInputStream( crc_in );
 ================================================================
 File - 1 time
 Buffered - 1 time
 Decompression - 1 time
 CRC - 2 times
 Image - 2 times
 ================

Now instantiate it 10 times and give me a tally.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
prev sibling parent reply Andy Friesen <andy ikagames.com> writes:
parabolis wrote:

 Regan Heath wrote:
 
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:

 <snip>

 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

I think you need functions in the form:

    ulong write(void* data, ulong len = 0, ulong off = 0);

notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off, as you can call with:

    ubyte[] big = cast(ubyte[]) "regan was here";
    write(big[6..9]);

to achieve both.

I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.

Slicing does not create garbage. Arrays really are value types that get copied when you pass them to a function. You can generally treat them as reference types because the data they refer to is not copied along with them.

An array is quite literally little more than this:

    struct Array(T) {
        T* data;
        int length;
    }

Might I suggest that DataSources and DataSinks use void[]?

void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenario is uncertain, but it is exceptionally well suited to the task regardless.

(incidentally, slicing void* is legal as well)
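The claim that a slice is just a new (pointer, length) pair over the same storage, not a copy of the data, can be checked directly. A minimal sketch in modern D syntax:

```d
void main() {
    int[] a = [1, 2, 3, 4, 5];
    int[] s = a[1 .. 4];   // new slice header, no copy of the elements

    assert(s.length == 3);
    assert(s.ptr == a.ptr + 1);   // same underlying storage

    s[0] = 42;
    assert(a[1] == 42);           // the write is visible through 'a'
}
```

Since the element data is shared, only the small header is passed around, which is why slicing in a function call costs no heap allocation.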
 The void* is a pointer with no associated type. The arrays in D are 
 infinitely better than void* pointers because arrays have extra 
 information. As I said earlier in my post the behavior of providing data 
 in a particular non-byte format should be done elsewhere in a single 
 DataXXStream.

The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning.

This is a textbook case of the right place to use void*. :) (or void[])

-- andy
Aug 03 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 21:30:29 -0700, Andy Friesen <andy ikagames.com> wrote:
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:
 I will concede the order was wrong. However I believe the slicing will 
 need to create another array wrapper in memory which is then going to 
 have to be GCed. The len and off parameters allow a caller to take 
 either approach.

Slicing does not create garbage.

Really? Doesn't slicing create another array structure (the one you have described below) exactly the same as if/when you pass one to a function, so..

void foo(char[] a) { }
void main() {
    char[] a = "12345";
    foo(a[1..3]);
}

the above code creates 3 arrays:
1 - 'a' at the start of main
2 - one for the slice
3 - one for the function call.

Leaving out the slice creates one less copy of the array (not the data). I think that is what parabolis meant.
 Arrays really are value types that get copied when you pass them to a 
 function.  You can generally treat them as reference types because the 
 data they refer to is not copied along with them.

 An array is quite literally little more than this:

      struct Array(T) {
          T* data;
          int length;
      }

 Might I suggest that DataSources and DataSinks use void[]?

 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

 (incidently, slicing void* is legal as well)

 The void* is a pointer with no associated type. The arrays in D are 
 infinitely better than void* pointers because arrays have extra 
 information. As I said earlier in my post the behavior of providing 
 data in a particular non-byte format should be done elsewhere in a 
 single DataXXStream.

The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[])

I agree void* or void[] should be used. Parabolis's other concern was a buffer overrun, but as I see it neither void[], void*, nor ubyte[] is any more buffer safe (see my other post for a detailed explanation).

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply Andy Friesen <andy ikagames.com> writes:
Regan Heath wrote:

 Slicing does not create garbage.

Really? Doesn't slicing create another array structure (the one you have described below) exactly the same as if/when you pass one to a function, so..

void foo(char[] a) { }
void main() {
    char[] a = "12345";
    foo(a[1..3]);
}

the above code creates 3 arrays:
1 - 'a' at the start of main
2 - one for the slice
3 - one for the function call.

Leaving out the slice creates one less copy of the array (not the data). I think that is what parabolis meant.

Sure, but the second two can probably be optimized into one and the same. Besides, it's stack space. Nothing is faster than stack allocation. (sub esp, ...)
 The whole idea behind DataSources and DataSinks is that they just pull 
 bytes in and out of some other place without ever having any concern 
 for their meaning.

 This is a textbook case of the right place to use void*. :)  (or void[])

I agree void* or void[] should be used. Parabolis's other concern was a buffer overrun, but as I see it neither void[], void * or ubyte[] are any more buffer safe (see my other post for a detailed explaination)

References are to be preferred over pointers in C++ because constructing a null reference isn't easily possible to do by accident. It's easy to do on purpose, but if you do, Santa will put you on his Naughty list and give you coal. Also, your programs might crash or something.

D arrays are the same way. Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)

-- andy
Aug 03 2004
next sibling parent reply parabolis <parabolis softhome.net> writes:
Andy Friesen wrote:

 Besides, it's stack space.  Nothing is faster than stack allocation. 
 (sub esp, ...)

Sure there is. Not allocating is infinitely faster. :)
Aug 03 2004
parent reply Andy Friesen <andy ikagames.com> writes:
parabolis wrote:
 Andy Friesen wrote:
 
 Besides, it's stack space.  Nothing is faster than stack allocation. 
 (sub esp, ...)

Sure there is. Not allocating is infinitely faster. :)

Passing an array slice as an argument is exactly the same as passing a pointer to its contents and a size. (the exact same code should be emitted)

This is why the %.*s trick works with printf. The length gets pushed first, then the pointer, which just so happens to be the same format as expected by %.*s.

    printf("%.*s\n", str);  <===>  printf("%.*s\n", str.length, &str[0]);

While we're on the topic of speed hacking, though, might I suggest the following for improving application performance:

    main() { return 0; }

(it reduces memory consumption too!) ;)

-- andy
Aug 03 2004
parent "Bent Rasmussen" <exo bent-rasmussen.info> writes:
 While we're on the topic of speed hacking, though, might I suggest the
 following for improving application performance:

      main() { return 0; }

 (it reduces memory consumption too!)

That is post-mature optimization. You should never have created application.d in the first place! :-)
Aug 04 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...
D arrays are the same way.  Accidentally constructing an invalid array 
is much less likely to occur than using an explicit pointer/length pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I had actually added wrapper functions to unformatted read/write all primitive types but recently removed them because they seemed redundant. I suppose if there's enough of a demand I'll add them back.

Sean
Aug 04 2004
next sibling parent reply parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...
 
D arrays are the same way.  Accidentally constructing an invalid array 
is much less likely to occur than using an explicit pointer/length pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I am pretty sure the second read in your example parses by treating the address of x as a ubyte array and then slicing into it, which creates a valid ubyte[] array to pass to a function.
Aug 04 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 04 Aug 2004 11:55:43 -0400, parabolis <parabolis softhome.net> 
wrote:

 Sean Kelly wrote:

 In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...

 D arrays are the same way.  Accidentally constructing an invalid array 
 is much less likely to occur than using an explicit pointer/length 
 pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I am pretty sure the second read in your example parses by treating the address of x as a ubyte array and then slicing into it, which creates a valid ubyte[] array to pass to a function.

It's not guaranteed to be valid. Replace x.sizeof with 1000 and it's an invalid ubyte[] array.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
prev sibling parent Andy Friesen <andy ikagames.com> writes:
Sean Kelly wrote:

 In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...
 
D arrays are the same way.  Accidentally constructing an invalid array 
is much less likely to occur than using an explicit pointer/length pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I changed my mind. You're right. :)

Getting an invalid array is hard, except when you start slicing pointers, at which point it becomes a bit too easy.

-- andy
Aug 04 2004
prev sibling next sibling parent parabolis <parabolis softhome.net> writes:
Andy Friesen wrote:
 I will concede the order was wrong. However I believe the slicing will 
 need to create another array wrapper in memory which is then going to 
 have to be GCed. The len and off parameters allow a caller to take 
 either approach.

Slicing does not create garbage. Arrays really are value types that get copied when you pass them to a function. You can generally treat them as reference types because the data they refer to is not copied along with them.

An array is quite literally little more than this:

    struct Array(T) {
        T* data;
        int length;
    }

That is what I meant by a wrapper. It is actually defined in phobos\internal\adi.d. Given that it is a struct, it will be created on the stack and thus not GCed. However I still like to have the option to decide between the two. :)
 Might I suggest that DataSources and DataSinks use void[]?
 
 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.
 
 (incidently, slicing void* is legal as well)
 
 The void* is a pointer with no associated type. The arrays in D are 
 infinitely better than void* pointers because arrays have extra 
 information. As I said earlier in my post the behavior of providing 
 data in a particular non-byte format should be done elsewhere in a 
 single DataXXStream.

The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[])

I had no idea there is a void[] in D and will have to consider it. As I explained in another post this is a textbook example of when *not* to use void*. If void[] exists then its use might be justified but honestly it warps my mind even trying to consider it.
Aug 03 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Andy Friesen wrote:

 Might I suggest that DataSources and DataSinks use void[]?
 
 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[]. This suggests that at least some people using/writing functions with void[] parameters will do strange things.

I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.
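The .length surprise described above is easy to reproduce. A minimal sketch, assuming modern D syntax:

```d
void main() {
    int[] ints = [1, 2, 3];
    void[] raw = ints;   // any array converts implicitly to void[]

    // int[].length counts elements...
    assert(ints.length == 3);

    // ...but void[].length counts bytes (int.sizeof is 4 in D).
    assert(raw.length == 3 * int.sizeof);
}
```

So the same data reports 3 through one view and 12 through the other, which is exactly the mismatch that could trip up a casual reader of a void[]-based API.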
Aug 05 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis softhome.net> 
wrote:
 Andy Friesen wrote:

 Might I suggest that DataSources and DataSinks use void[]?

 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].
 However I think the conceptual problems void[] introduces outweigh the 
 benefits. void[] does a rather unexpected thing when it gives you a byte 
 count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.
 The default assumption would be (or at least my default assumption was) 
 that the .length would be the same for an int[] being treated as a 
 void[].

But then you cannot address each of the 4 bytes of each int.
 This suggests that at least some people using/writing functions with 
 void[] parameters will do strange things.

Have you used 'void' as a type before? I suspect only people who have not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.
 I believe the ensuing confusion warrants using a ubyte[] which which has 
 behaviour that people will already understand.

I agree ubyte[] is the 'right' type: the data itself is a bunch of unsigned bytes. But void[] or void* give you ease of use that ubyte[] lacks.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 05 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 Andy Friesen wrote:

 Might I suggest that DataSources and DataSinks use void[]?

 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].

My argument is that there exists a program in which a bug will be caught. Your argument is that there does not exist a program in which a bug will be caught (that is, that for all programs no bug is caught). Assuming we have the functions:

     read_bad(void*, uint len)
    read_good(ubyte[], uint len)

An excerpt from program P in which a bug is caught is as follows:
============================== P ==============================
    ubyte ex[256];
    read_bad(ex, 0xFFFF_FFFF);  // memory overwritten
    read_good(ex, 0xFFFF_FFFF); // exception thrown
================================================================
P contains a bug that is caught using an array parameter. The existence of P simultaneously proves my argument and disproves yours.

Yet we have had this discussion before, and you seem to insist that since you can find examples where a bug is not caught, my argument must be wrong somehow. I am not familiar with any logic in which such claims are expected. Either you will have to explain the logic system you are using so I can explain my claim properly, or you will have to use the one I am using. Here are some links to mine:

http://en.wikipedia.org/wiki/Logic
http://en.wikipedia.org/wiki/Predicate_logic
http://en.wikipedia.org/wiki/Universal_quantifier
http://en.wikipedia.org/wiki/Existential_quantifier
 
 However I think the conceptual problems void[] introduces outweigh the 
 benefits. void[] does a rather unspected thing when it gives you a 
 byte count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

Wonderful guess. It is entirely more complicated than a ubyte[], which is simply a partition of memory on 8-bit boundaries where you already know how .length and .sizeof will work.
 The default assumption would be (or at least my default assumption 
 was) that the .length would be the same for an int[] being treated as 
 a void[].

But then you cannot address each of the 4 bytes of each int.

Yes that was exactly my point.
 
 This suggests that at least some people using/writing functions with 
 void[] parameters will do strange things.

Have you used 'void' as a type before, I suspect only people who have

No, I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. I have used void* frequently in C/C++, but the size of any void* variable is of course just the size of a pointer.
 not used the concept before will get this wrong, and a simple line of 
 documentation describing void[] will put them right.

Or using ubyte[] will write the documentation for me, and provide some assurance that, in cases where people did not read the docs, they will have a chance of getting it right from the start.
 I believe the ensuing confusion warrants using a ubyte[] which which 
 has behaviour that people will already understand.

I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.

No, actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into bytes forces an arbitrary partition of general data that would not happen conceptually with void. I just feel that using void[] lacks the ease of use you get with ubyte[].
Aug 06 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Fri, 06 Aug 2004 14:29:19 -0400, parabolis <parabolis softhome.net> 
wrote:

<snip>

 I still don't agree with the last bit, void[] gives no _assurance_ at 
 all, neither does ubyte[] or any other [].

My argument is that there exists a program in which a bug will be caught.

Ok.
 You argument is that there does not exist a program such that a bug will 
 be caught (or that for all programs there is no program such that a bug 
 is caught).

I make no such argument. In fact I am having a hard time following the above sentence.
 Assuming we have the function:
       read_bad(void*,uint len)
      read_good(ubyte[],uint len)

 A exerpt from program P in which a bug is caught is as follows:
 ============================== P ==============================
      ubyte ex[256];
       read_bad(ex,0xFFFF_FFFF); // memory overwritten
      read_good(ex,0xFFFF_FFFF); // exception thrown
 ================================================================

 P contains a bug that is caught using an array parameter. The existance 
 of P simultaneously proves my argument and disproves yours.

I don't think you understand my argument.
 Yet we have had this discussion before and you seem to insist that since 
 you can find examples where a bug is not caught my argument must be 
 wrong somehow. I am not familiar with any logic in which such claims are 
 expected. Either you will have to explain the logic system you are using 
 to me so I can explain my claim properly or you will have to use the one 
 I am using. Here are some links to mine:

 http://en.wikipedia.org/wiki/Logic
 http://en.wikipedia.org/wiki/Predicate_logic
 http://en.wikipedia.org/wiki/Universal_quantifier
 http://en.wikipedia.org/wiki/Existential_quantifier

You are missing my point. There is one pivotal fact in this debate, and that is that an array is _not_ guaranteed to be correct about its own length. Consider:

void read(void[] a, int length) {
    if (length > a.length) throw new Exception(..);
}

void main() {
    char* p = "0123456789";
    read(&p[0..1000], 1000);
}

No exception is thrown and memory is overwritten. My point, which you seem to have missed, is simply: "An array is _not_ guaranteed to be correct about its own length."

The reason this point is all-important in this debate is that when trying to write basic types and structs you _will_ need to create the array from the basic type, and when doing so you _will_ need to define the array's length manually. So there is _always_ going to be the same risk of error regardless of whether you use void[] or void*. Since the risk is the same in either case, I vote for the clearest/cleanest/simplest code, as this reduces the risk of error slightly. The code I proposed:

bool read(out int x) { return read(&x, x.sizeof) == x.sizeof; }

is cleaner/clearer and simpler than

bool read(out int x) { return read(&x[0..x.sizeof], x.sizeof) == x.sizeof; }
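For what it's worth, both halves of this exchange can be seen in a few lines of present-day D (a sketch, not either poster's actual code): slicing a real array is bounds-checked, but slicing a raw pointer is not, so an array's .length is only as trustworthy as whoever constructed it:

```d
import core.exception : RangeError;

// Hypothetical sink that trusts the array's own length (illustrative only).
void read(void[] a)
{
    // would copy a.length bytes into a.ptr here
}

void main()
{
    ubyte[] buf = new ubyte[256];
    size_t n = 1000;

    // Slicing a real array IS bounds-checked (in non-release builds):
    bool caught = false;
    try
        read(buf[0 .. n]);      // throws RangeError before read() even runs
    catch (RangeError)
        caught = true;
    assert(caught);

    // Slicing a raw pointer is NOT checked: the slice lies about its length.
    ubyte* p = buf.ptr;
    void[] v = p[0 .. n];       // no exception; v claims 1000 bytes
    assert(v.length == n);
    read(v);                    // read() has no way to detect the lie
}
```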
 However I think the conceptual problems void[] introduces outweigh the 
 benefits. void[] does a rather unspected thing when it gives you a 
 byte count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

Wonderful guess. It is entirely more complicated than a ubyte[] being a partition of memory on 8-bit boundries and knowing how the length and sizeof will work.

The semantics of void* are quite well known: anyone who has used it knows what I have described above, and anyone who doesn't will read the docs on void* before they start. Anyone who _guesses_ what will happen is asking for trouble.
 The default assumption would be (or at least my default assumption 
 was) that the .length would be the same for an int[] being treated as 
 a void[].

But then you cannot address each of the 4 bytes of each int.

Yes that was exactly my point.

So we agree, void[] works in a logical fashion.
 This suggests that at least some people using/writing functions with 
 void[] parameters will do strange things.

Have you used 'void' as a type before, I suspect only people who have

No I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. I have used void* frequently in C/C++ but the size of any void* variables is of course the size of any pointer.

In that case why didn't you read the documentation on the void[] type?
 not used the concept before will get this wrong, and a simple line of 
 documentation describing void[] will put them right.

Or using ubyte[] will write the documentation for me and provide some assurance that in cases in which people did not read the docs will have a chance of getting it right from the start.

Rubbish. You're basically asserting that ubyte[] is known by everyone, and that is simply not true.
 I believe the ensuing confusion warrants using a ubyte[] which which 
 has behaviour that people will already understand.

I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.

No actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into a byte forces an arbitrary partition of general data that would not happen conceptually with void. I just feel that using void[] lacks the ease of use you get with ubyte[].

Ease of use?! How is this:

bool read(out int x) { return read(&x[0..x.sizeof], x.sizeof) == x.sizeof; }

easier than:

bool read(out int x) { return read(&x, x.sizeof) == x.sizeof; }

? There seem to be two points in this argument, "what is easier" and "what is safer"; my opinion, which I have tried to demonstrate, is that neither is safer and void* is easier.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 08 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceulss$2fj6$1 digitaldaemon.com>, parabolis says...

void[] does a rather unspected thing when 
it gives you a byte count in .length. The default assumption 
would be (or at least my default assumption was) that the 
.length would be the same for an int[] being treated as a 
void[].

For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1. Jill
Aug 06 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ceulss$2fj6$1 digitaldaemon.com>, parabolis says...
 
 
void[] does a rather unspected thing when 
it gives you a byte count in .length. The default assumption 
would be (or at least my default assumption was) that the 
.length would be the same for an int[] being treated as a 
void[].

For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1.

Sorry, I meant from the docs http://www.digitalmars.com/d/type.html:

     void  no type
      bit  single bit
     byte  signed 8 bits
    ubyte  unsigned 8 bits
    ....
Aug 06 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cf0e4q$mqi$1 digitaldaemon.com>, parabolis says...
Sorry I meant from the docs

http://www.digitalmars.com/d/type.html:

     void  no type
      bit  single bit
     byte  signed 8 bits
    ubyte  unsigned 8 bits
    ....

Probably an esoteric question, but I assume that the byte-size guarantee only holds on machines with the proper architecture? Not that I expect to see a D compiler for the very few machines that support strange byte sizes; just wondering...

Sean
Aug 06 2004
parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 In article <cf0e4q$mqi$1 digitaldaemon.com>, parabolis says...
 
Sorry I meant from the docs

http://www.digitalmars.com/d/type.html:

    void  no type
     bit  single bit
    byte  signed 8 bits
   ubyte  unsigned 8 bits
   ....

Probably an esoteric question, but I assume that the byte size gurantee is only for machines with the proper architecture? Not that I expect to see a D compiler for the very few machines that support strange byte sizes, just wondering...

Actually that is not a terribly esoteric question. I do not believe the D byte is the same as the C/C++ char (which is what I assume you are referring to in this case). I would be curious to know the answer as well. I would also be curious how a compiler would deal with a Harvard architecture.
Aug 06 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
For another perspective/idea have a look at my thread entitled "My stream 
concept".

I use template bolt-ins.

There was a little problem with it, which was actually trivial to fix: I 
simply replaced the 'super.' calls with 'this.' calls.

It should also be noted that my idea was strictly for creating the base 
level stream classes from the various devices i.e. File, Socket, Memory 
etc. The next step is to add filters (as described by Arcane Jill) I am 
hoping an idea will come to me as to how I can do that, without needing:
   new MemoryMap(new UTF16Filter(new Stream()));
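That chain of constructors is the classic decorator pattern; a toy sketch in present-day D (all names hypothetical, not mango or Phobos API) shows why it composes cleanly but gets verbose:

```d
// Hypothetical decorator chain for stream filters (names illustrative only).
interface InputStream
{
    ubyte[] read(size_t n);
}

class MemoryStream : InputStream
{
    private ubyte[] data;
    private size_t pos;
    this(ubyte[] d) { data = d; }

    ubyte[] read(size_t n)
    {
        auto end = pos + n > data.length ? data.length : pos + n;
        auto slice = data[pos .. end];
        pos = end;
        return slice;
    }
}

class DoublingFilter : InputStream   // stand-in for a real transcoding filter
{
    private InputStream inner;
    this(InputStream s) { inner = s; }

    ubyte[] read(size_t n)
    {
        ubyte[] result;
        foreach (b; inner.read(n))
            result ~= [b, b];        // toy transform: duplicate each byte
        return result;
    }
}

void main()
{
    // The "new X(new Y(new Z))" chaining pattern in action:
    InputStream s = new DoublingFilter(new MemoryStream([1, 2]));
    assert(s.read(2) == [1, 1, 2, 2]);
}
```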

Regan

On Tue, 3 Aug 2004 19:36:19 +0000 (UTC), Sean Kelly <sean f4.ca> wrote:

<snip>

-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <opsb6d5aw85a2sq9 digitalmars.com>, Regan Heath says...
For another perspective/idea have a look at my thread entitled "My stream 
concept".

I use template bolt-ins.

There was a little problem with it, which was actually trivial to fix, I 
simply replaced the 'super.' calls with 'this.' calls.

It should also be noted that my idea was strictly for creating the base 
level stream classes from the various devices i.e. File, Socket, Memory 
etc. The next step is to add filters (as described by Arcane Jill) I am 
hoping an idea will come to me as to how I can do that, without needing:
   new MemoryMap(new UTF16Filter(new Stream()));

My design really set out to extend the original stream approach, and it seemed the logical extension was pretty C++-like. I ended up creating a basic set of interfaces--Stream, InputStream, and OutputStream--and putting all the implementation in templates meant to be mixins. This was somewhat necessary to support the multiple-inheritance type model. So the input file stream looks something like this:

# class InFile : InputStream {
#     mixin StreamDefs SD;
#     mixin InputStreamDefs!(readFile) ISD;
# private:
#     uint readFile(void* buf, size_t size) {}
# }

Works quite well, but it's very different from the Java approach. I'm still not sure which I like better, though I'll grant that the Java version is more flexible (at the expense of verbosity). The other potential issue is the top-heaviness of the design. I am warming up to the idea of separate reader/writer adaptor classes.

Sean
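A runnable sketch of the bolt-in idea in present-day D (names and details are illustrative, not Sean's actual code): the mixin contributes a typed read() that forwards to a raw readFile() supplied by the host class through `this`:

```d
// Hypothetical "bolt-in": the mixin adds typed readers on top of a raw
// readFile() that the host class provides.
mixin template InputStreamDefs()
{
    bool read(out int x)
    {
        return this.readFile(&x, int.sizeof) == int.sizeof;
    }
}

class InFile
{
    mixin InputStreamDefs;

    this() { data = [1, 0, 0, 0]; }   // pretend file contents

private:
    ubyte[] data;
    size_t pos;

    size_t readFile(void* buf, size_t size)
    {
        import core.stdc.string : memcpy;
        size_t remaining = data.length - pos;
        size_t n = size < remaining ? size : remaining;
        memcpy(buf, data.ptr + pos, n);
        pos += n;
        return n;
    }
}

void main()
{
    auto f = new InFile;
    int x;
    assert(f.read(x));
    assert(x == 1);   // bytes [1,0,0,0] as an int, assuming a little-endian host
}
```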
Aug 03 2004
next sibling parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 Works quite well but it's very different from the Java approach.  I'm still not
 sure which I like better, though I'll grant that the Java version is more
 flexible (at the expense of verbosity).  The other potential issue is the
 top-heaviness of the design.  I am warming up to the the idea of separate
 reader/writer adaptor classes.
 

I probably should have made the argument explicit, but I do believe a design that deals with incoming and outgoing data at the same time is suspect when it comes to multi-threading issues. If your code is MT-safe, then you probably did much more work than you had to, with little apparent benefit.
Aug 03 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 Works quite well but it's very different from the Java approach.  I'm still not
 sure which I like better, though I'll grant that the Java version is more
 flexible (at the expense of verbosity).  The other potential issue is the

I also meant to suggest that I really like much less verbose class names like: FileIS and FileOS...
Aug 03 2004
prev sibling next sibling parent reply "antiAlias" <gblazzer corneleus.com> writes:
What you good folks seem to be describing is pretty much how mango.io
operates. All the questions raised so far are quietly handled by that
library (even the separate input & output buffers, if you want that), so it
might be worthwhile checking it out. It's also house-trained, documented,
and has a raft of additional features that you selectively apply where
appropriate (it's not all tragically intertwined).

As a bonus, there's a ton of functionality already built on top of mango.io,
including http-server, servlet-engine, clustering, logging, local & remote
object caching; even tossing remote D executable objects around a local
network. The DSP project is also targeting Mango as a delivery mechanism.
Check them out over at dsource.org.

I think it's great to have "competing" libraries under way, but at some
point is it worth considering funneling efforts instead? Perhaps not?


"Sean Kelly" <sean f4.ca> wrote in message
news:ceopfj$1hcl$1 digitaldaemon.com...
<snip>

Aug 03 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cep4dd$1nde$1 digitaldaemon.com>, antiAlias says...
What you good folks seem to be describing is pretty much how mango.io
operates. All the questions raised so far are quietly handled by that
library (even the separate input & output buffers, if you want that), so it
might be worthwhile checking it out. It's also house-trained, documented,
and has a raft of additional features that you selectively apply where
appropriate (it's not all tragically intertwined).

Yup. I've played around with Mango and kind of like it. One of the reasons I started these stream mods was to have an alternate design to compare to Mango for the sake of discussion, i.e. I don't want folks to settle on Mango simply because the other choices are missing features.
I think it's great to have "competing" libraries under way, but at some
point is it worth considering funneling efforts instead? Perhaps not?

Definitely.

Sean
Aug 03 2004
parent "antiAlias" <gblazzer corneleus.com> writes:
You are absolutely right. But not many people seem to know about Mango, so
the opportunity for "spreading the news" was too great to pass up :-)

"Sean Kelly" <sean f4.ca> wrote in message
news:cep64d$1o0t$1 digitaldaemon.com...
<snip>

Aug 03 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
antiAlias wrote:

 What you good folks seem to be describing is pretty much how mango.io
 operates. All the questions raised so far are quietly handled by that
 library (even the separate input & output buffers, if you want that), so it
 might be worthwhile checking it out. It's also house-trained, documented,
 and has a raft of additional features that you selectively apply where
 appropriate (it's not all tragically intertwined).

I can't help but ask how it manages to do both input and output and still avoid multi-threading issues?
 As a bonus, there's a ton of functionality already built on top of mango.io,
 including http-server, servlet-engine, clustering, logging, local & remote
 object caching; even tossing remote D executable objects around a local
 network. The DSP project is also targeting Mango as a delivery mechanism.
 Check them out over at dsource.org.

I have only started looking over the library. It is rather extensive. The source is well documented and organized. Both are rare to see. I am not fond of the pdf format. Anyway I am impressed at the surface. I will take a look deeper within.
 I think it's great to have "competing" libraries under way, but at some
 point is it worth considering funneling efforts instead? Perhaps not?

On the note of competing libraries I could not help but notice your primes.d implementation. You might want to look at the primes.d on Deimos and consider using that instead. It is rather cleverly designed and could be tuned to do no worse than your bsearch for all ushort values.
Aug 03 2004
parent reply "antiAlias" <gblazzer corneleus.com> writes:
The primes.d thing is now a distant and foggy memory :-)

Can I hook you up with a copy of the latest (much better, with annotated
source) documentation? You'll see Primes.d is gone, along with some other
warts:
http://svn.dsource.org/svn/projects/mango/downloads/mango_beta_9-2_doc.zip


"parabolis" <parabolis softhome.net> wrote in message
news:cep9ee$1ov1$1 digitaldaemon.com...
<snip>

Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
antiAlias wrote:

 The primes.d thing is now a distant and foggy memory :-)
 
 Can I hook you up with a copy of the latest (much better, with annotated
 source) documentation? You'll see Primes.d is gone, along with some other
 warts:
 http://svn.dsource.org/svn/projects/mango/downloads/mango_beta_9-2_doc.zip

lol - The Mango Tree... just got it from the docs :)

I am now without question under the belief that the mango docs are great. I was going to suggest in my last post that I would like to see some docs that cover more of the concept area than just doxygen stuff. I decided that it would probably be too much to expect :)

Quote:
================================================================
Note that these Tokenizers do not maintain any state of their
own. Thus they are all thread-safe.
================================================================
This is always good to know from documentation. :)

However I am curious about IPickle's design. Would it not be possible to serialize objects based on the data in ClassInfo?
Aug 03 2004
parent reply "antiAlias" <gblazzer corneleus.com> writes:
"parabolis" wrote..

 Quote:
 ================================================================
 Note that these Tokenizers do not maintain any state of their
 own. Thus they are all thread-safe.
 ================================================================
 This is always good to know from documentation. :)


 However I am curious about IPickle's design. Would it not be
 possible to serialize objects based on the data in ClassInfo?

Doing it the introspection way (a la Java) has a bunch of issues all of its own, and D doesn't have the power to expose all the requisite data as yet (I could be wrong on the latter though). IPickle was a nice and simple way to approach it; there's no monkey business anywhere (like Java has), it's explicit, and it's very fast. While not an overriding design factor, throughput is one of the main things all the Mango branches/packages keep a watchful eye upon.

Frankly, I'd like to see a decent introspection approach emerge along the way; perhaps as a complement rather than a replacement: within Mango there's no obvious reason why the two approaches could not produce an equivalent serialized stream, and therefore be interchangeable at the endpoints. This is one area where I think getting other people involved in the project would help tremendously.
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
antiAlias wrote:

 "parabolis" wrote..
 
 
Quote:
================================================================
Note that these Tokenizers do not maintain any state of their
own. Thus they are all thread-safe.
================================================================
This is always good to know from documentation. :)


However I am curious about IPickle's design. Would it not be
possible to serialize objects based on the data in ClassInfo?

Doing it the introspection way (ala Java) has a bunch of issues all of it's own, and D doesn't have the power to expose all the requisite data as yet (I could be wrong on the latter though).

I think I was premature to suppose D could do that. I just gave the issue some thought, and there is only enough introspection to make a shallow copy, which is obviously not sufficient.
 IPickle was a nice and simple way to approach it; there's no monkey business
 anywhere (like Java has), it's explicit, and it's very fast. While not an
 overriding design factor, throughput is one of the main things all the Mango
 branches/packages keep an watchful eye upon. Frankly, I'd like to see a
 decent introspection approach emerge along the way; perhaps as a complement
 rather than a replacement: within Mango there's no obvious reason why the
 two approaches could not produce an equivalent serialized stream, and
 therefore be interchangeable at the endpoints.

Any automated serializing algorithm would have to either allow IPickles to [de-]serialize themselves or ignore read/write. However, given that one of those holds, the serialization ought to be compatible.
 This is one area where I think getting other people involved in the project
 would help tremendously.

I think I am probably sold on being willing to help. It is more an issue of whether I can provide anything that will further mango. :)
Aug 03 2004
parent "antiAlias" <gblazzer corneleus.com> writes:
"parabolis" <parabolis softhome.net> wrote in message
news:cepppv$1ugt$1 digitaldaemon.com...
 antiAlias wrote:

 "parabolis" wrote..


Quote:
================================================================
Note that these Tokenizers do not maintain any state of their
own. Thus they are all thread-safe.
================================================================
This is always good to know from documentation. :)


However I am curious about IPickle's design. Would it not be
possible to serialize objects based on the data in ClassInfo?

 Doing it the introspection way (a la Java) has a bunch of issues all of its
 own, and D doesn't have the power to expose all the requisite data as yet (I
 could be wrong on the latter though).

I think I was premature to suppose D could do that. I just gave the issue some thought and there is just enough introspection to make a shallow copy which is obviously not sufficient.
 IPickle was a nice and simple way to approach it; there's no monkey business
 anywhere (like Java has), it's explicit, and it's very fast. While not an
 overriding design factor, throughput is one of the main things all the Mango
 branches/packages keep a watchful eye on. Frankly, I'd like to see a
 decent introspection approach emerge along the way; perhaps as a complement
 rather than a replacement: within Mango there's no obvious reason why the
 two approaches could not produce an equivalent serialized stream, and
 therefore be interchangeable at the endpoints.

Any automated serializing algorithm would have to either allow IPickles to [de-]serialize themselves or ignore read/write. However given one of those holds then the serialization ought to be compatible.
 This is one area where I think getting other people involved in the project
 would help tremendously.

I think I am probably sold on being willing to help. It is more an issue of whether I can provide anything that will further mango. :)

Here's some things that have been noted:
http://www.dsource.org/forums/viewtopic.php?t=174&sid=f5f234d101f0405ebaf9cbdf728af44a

And here's some more:
http://www.dsource.org/forums/viewtopic.php?t=157&sid=f5f234d101f0405ebaf9cbdf728af44a

That's just the tip of the iceberg though. For example, there's no Unicode support as yet since we decided to wait until Hauke & AJ released all the requisite pieces (better to do it properly); IO filters/decorators such as companders have not actually been implemented yet, although there's a solid placeholder for them; there's some annoying things that are currently unimplemented on Unix (noted in the documentation todo list); etc. etc. Plenty of room for improvement all over the place, and that's before you hit the upper decks :-)

The project is very open to other packages hooking in at any level: as a peer, as part of the Mango Tree itself, or as a package user. For example, there's currently a bit-sliced XML/SAX engine in the works (okay; "byte-sliced" then), plus the DSP project mentioned earlier (which looks to be really uber cool ... everyone should check that one out). Having real-world user-code drive the design and functionality is of truly immense value: the bad stuff is typically identified and removed/replaced rather quickly.

Anyone who would like to get involved, please jump on the dsource.org forums!
Aug 03 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:ceopfj$1hcl$1 digitaldaemon.com...
 This has the benefit of allowing the user to read and write any char type with
 a single instantiation, but requires greater complexity in the Stream class and
 in the Derived class.  And I wonder if such flexibility is truly necessary.

 Any other design possibilities?  Preferences?  I'm really trying to establish a
 good formatted IO design rather than work out the perfect stream API.  Any
 other weird issues would be welcome also.

I'm one of those folks who is very much in favor of a file reader being able to automatically detect the encoding in it. Hence, D can auto-detect the UTF formatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviors can be handled with virtual functions. Also, formats like UTF-16 have two variants, big end and little end. It should also be able to read data in other formats, such as code pages, and convert them to utf. These cannot be auto-detected.
Aug 03 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...

I'm one of those folks who is very much in favor of a file reader being able
to automatically detect the encoding in it. Hence, D can auto-detect the UTF
formatting. So, I'd recommend that the format be an enum that can be
specifically set or can be auto-detected. Different resulting behaviors can
be handled with virtual functions.

With all due respect, Walter, that's not really feasible. It is very hard, for example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not all encodings make life that easy for us. You can't use an enum, because there are an unlimited number of possible encodings.

Besides, if you're parsing an HTTP header, and if, within that header, you read "Content-Type: text/plain; encoding=MAC-ROMAN", then you can be pretty sure you know what the encoding of the following document is going to be. Other formats have different indicators (HTML meta tags; Python source file comments; the list is endless). Only at the application level can you /really/ sort this out, because the application presumably knows what it's looking at.
Also, formats like UTF-16 have two variants, big end and little end.

Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.
It should also be able to read data in other formats, such as code pages,
and convert them to utf. These cannot be auto-detected.

I think that's the whole point. Windows code pages /are/ encodings. WINDOWS-1252 is an encoding, same as UTF-8. I think people here are talking about encodings generally, not just UTFs. Jill
Aug 03 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <ceq0mg$20d8$1 digitaldaemon.com>, Arcane Jill says...
In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...

Also, formats like UTF-16 have two variants, big end and little end.

Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.

That reminds me.  Which format does the code in utf.d use?  I'm thinking I may do something like this for encoding for now:

# enum Format {
#     UTF8    = 0,
#     UTF16   = 1,
#     UTF16LE = 1,
#     UTF16BE = 2
# }

So "UTF-16" would actually default to one of the two methods.

Sean
Aug 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceqv9a$15b$1 digitaldaemon.com>, Sean Kelly says...
That reminds me.  Which format does the code in utf.d use?

To be honest, I don't understand the question.
I'm thinking I may
do something like this for encoding for now:

enum Format {
    UTF8    = 0,
    UTF16   = 1,
    UTF16LE = 1,
    UTF16BE = 2
}

So "UTF-16" would actually default to one of the two methods.

Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map name to number? (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)

Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host-byte order? Is there a big endian version? Is there a network byte order version? Should there be?

Jill
Aug 04 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cer4k8$7jj$1 digitaldaemon.com>, Arcane Jill says...
In article <ceqv9a$15b$1 digitaldaemon.com>, Sean Kelly says...
That reminds me.  Which format does the code in utf.d use?

To be honest, I don't understand the question.

std.utf has methods like toUTF16. But does this target the big or little endian encoding scheme? I suppose I could assume it corresponds to the byte order of the target machine, but this would imply different behavior on different platforms.
I'm thinking I may
do something like this for encoding for now:

enum Format {
    UTF8    = 0,
    UTF16   = 1,
    UTF16LE = 1,
    UTF16BE = 2
}

So "UTF-16" would actually default to one of the two methods.

Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map name to number. (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)

This raises an interesting question. Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class. I can't imagine coding a base lib to support "Joe's custom encoding scheme." For the moment though, I think I'll leave stream.d as-is. This seems like a design issue that will take a bit of talk to get right.
Got an unrelated question for you. In the stream function void read(out int),
there is an assumption that the bytes will be embedded in the stream in
little-endian order. Should applications assume (a) it's always little endian,
regardless of host architecture, or (b) it's always host-byte order. Is there a
big endian version? Is there a network byte order version?

Not currently. This corresponds to the C++ design: unformatted IO is assumed to be in the byte order of the host platform.
Should there be?

Probably. Or at least one that converts to/from network byte order. I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc. Sean
Aug 04 2004
next sibling parent "Ben Hinkle" <bhinkle mathworks.com> writes:
 Whatever works, works. But I'd make the enum private. Encodings should be
 universally known by their IANA registered name, otherwise how can you map name
 to number. (For example, you encounter an XML file which declares its own
 encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an
 enum?)

 This raises an interesting question.  Rather than having the encoding handled
 directly by the Stream layer perhaps it should be dropped into another class.  I
 can't imagine coding a base lib to support "Joe's custom encoding scheme."  For
 the moment though, I think I'll leave stream.d as-is.  This seems like a design
 issue that will take a bit of talk to get right.
I wonder if delegates could help out here. Instead of subclasses or wrapping a stream in another stream the primary Stream class could have a delegate to sort out big/little endian or encoding issues. I'm not exactly sure how it would work but it's worth investigating. There might be issues with sharing data between the stream and the encoder/decoder delegate.
 Got an unrelated question for you. In the stream function void read(out int),
 there is an assumption that the bytes will be embedded in the stream in
 little-endian order. Should applications assume (a) it's always little endian,
 regardless of host architecture, or (b) it's always host-byte order. Is there a
 big endian version? Is there a network byte order version?

 Not currently. This corresponds to the C++ design: unformatted IO is assumed to
 be in the byte order of the host platform.

 Should there be?

 Probably. Or at least one that converts to/from network byte order. I'll
 probably have the first cut of stream.d done in a few more days and after that
 we can talk about what's wrong with it, etc.

 Sean

Aug 04 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cer7fh$9t5$1 digitaldaemon.com>, Sean Kelly says...
std.utf has methods like toUTF16.  But does this target the big or little endian
encoding scheme?  I suppose I could assume it corresponds to the byte order of
the target machine, but this would imply different behavior on different
platforms.

Neither, really. toUTF16 returns an array of wchars, not an array of chars, so (conceptually) there is no byte-order issue involved. A wchar is (conceptually) a sixteen bit wide value, with bit 0 being the low order bit, and bit 15 being the high order bit. Byte ordering doesn't come into it.

Problems occur, however, when a wchar or a dchar leaves the nice safe environment of D and heads out into a stream. Only then does byte ordering become an issue (as it does also with arrays of ints, etc.).

If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data don't change, only the reference has a different type. In practice, this means you have (inadvertently) applied a host-byte-order encoding to the array. There doesn't seem to be much that a stream can do about this, so I reckon the problem here lies not with the stream, but with the cast. In short, a cast is not the most architecture-independent way to convert an arbitrary array into a void[]. Maybe some new functions could be written to implement this?
This raises an interesting question.  Rather than having the encoding handled
directly by the Stream layer perhaps it should be dropped into another class.  I
can't imagine coding a base lib to support "Joe's custom encoding scheme."  For
the moment though, I think I'll leave stream.d as-is.  This seems like a design
issue that will take a bit of talk to get right.

Right. Someone writing an application ought to be able to make their own transcoder (extending a library-defined base class; implementing a library-defined interface; whatever). Let's say that (in an application, not a library) I define classes JoesCustomReader and JoesCustomWriter. Now, I should still be able to do: # Stream s = new Reader(underlyingCharFilter, "X-JOES-CUSTOM-ENCODING"); and read the file. If a reader needs to be identified by a globally unique enum, then I can't do that without the possibility of an enum value clash. But if, on the other hand, they are identified by a string, then the possibility of a clash becomes vanishingly small. I do agree with you that registration of readers/writers and the dispatching mechanism is something best left until later, however. Jill
Aug 04 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <ceraen$c48$1 digitaldaemon.com>, Arcane Jill says...
Problems occur, however, when a wchar or a dchar leaves the nice safe
environment of D and heads out into a stream. Only then does byte ordering
become an issue (as it does also with arrays of ints, etc.).

Bah. Of course. So the two UTF schemes just depend on the byte order when serialized. Makes sense.
If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data
don't change, only the reference has a different type. In practice, this means
you have (inadvertantly) applied a host-byte-order encoding to the array. There
doesn't seem to be much that a stream can do about this, so, I reckon the
problem here lies not with the stream, but with the cast. In short, a cast is
not the most architecture-independent way to convert an arbitrary array into a
void[]. Maybe some new functions could be written to implement this?

I think byte order should be specified, perhaps as a quality of the stream. It could default to native and perhaps be switchable? The only other catch I see is that a console stream should probably ignore this setting and always leave everything in native format. In any case, this byte order would affect encoding schemes using > 1 byte characters and perhaps a new set of unformatted IO methods as well. Again something I'm going to ignore for now as it's more complexity than we need quite yet. Sean
Aug 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cerbsa$d02$1 digitaldaemon.com>, Sean Kelly says...

In short, a cast is
not the most architecture-independent way to convert an arbitrary array into a
void[]. Maybe some new functions could be written to implement this?


I think byte order should be specified, perhaps as a quality of the stream.  It
could default to native and perhaps be switchable?

Well, from one point of view, the problem we've got here is serialization. How do you serialize an array of primitive types having sizeof > 1? This boils down to a simpler question: how do you serialize a single primitive with sizeof > 1. Let's cut to a clear example - how do you serialize an int?

std.stream.Stream.write(int) serializes in little-endian order. But the specs say "Outside of byte, ubyte, and char, the format is implementation-specific and should only be used in conjunction with read." I think this is scary. Perhaps it would be better for a stream to /mandate/ the order. As you suggest, it could be a property of the stream, but there are disadvantages to that - if you chain a whole bunch of streams together, each with different endianness, you could end up with a lot of byteswapping going on.

Another possibility might be to ditch the function write(int), and replace it with two functions, writeBE(int) and writeLE(int) (and similarly with all other primitive types). That would be absolutely guaranteed to be platform independent.

Of course that applies to wchar and dchar too, but the whole point of encodings (well, /one/ of the points of encodings anyway) is that you never have to spit out anything other than a stream of /bytes/. The encoding itself determines the byte order. There really is no such encoding as "UTF-16" (although calling wchar[]s UTF-16 does make sense). As far as actual encodings are concerned, the name "UTF-16" is just a shorthand way of saying "either UTF-16LE or UTF-16BE". When reading, you have to auto-detect between them, but once you've /established/ the encoding, then you rewind the stream and start reading it again with the now known encoding. When writing, you get to choose, arbitrarily (so you would probably choose native byte order), but you can make it easier for subsequent readers to auto-detect by writing a BOM at the start of the stream.

How does this affect users' code?
Well, you simply don't allow anyone to write

# Reader s = new UTF16Reader(underlyingStream)

(i.e. you define no such class). Instead, give them a factory method. Make them write:

# Reader s = createUTF16Reader(underlyingStream)

or even

# Reader s = createReader(underlyingStream, "UTF-16");

(but we said we wouldn't talk about dispatching yet, so let's stick with createUTF16Reader() to keep things simple)

The function createUTF16Reader() reads the underlying stream, auto-detects between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a UTF16BEReader, and returns it. Somehow it needs a method of pushing back the characters it's already read into the stream. Then, when the caller calls s.read(), the exact encoding is known, and the stream is (re)read from the start.
The only other catch I see
is that a console stream should probably ignore this setting and always leave
everything in native format.

Maybe writeLE() and writeBE() could be supplemented by writeNative(), with the warning that it's no longer cross-platform? (Of course, the function write() does that right now, but calling it writeNative() would give you a clue that you were doing something a bit parochial).
In any case, this byte order would affect encoding
schemes using > 1 byte characters and perhaps a new set of unformatted IO
methods as well.

I don't think it would affect encodings at all, only the serialization of primitive types other than byte, ubyte and char. Transcoders, as I said, read or write /bytes/ to or from an underlying stream (but have dchar read() and/or void write(dchar) methods for callers to use).
Again something I'm going to ignore for now as it's more
complexity than we need quite yet.

Righty ho. I vaguely remember Hauke saying he was working on a class to do something about transcoding issues, but I don't know the specifics. Arcane Jill
Aug 04 2004
parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> escribió en el mensaje
news:cerk3u$i4f$1 digitaldaemon.com
| (so you would probably choose native byte order), but you can make it easier
for
| subsequent readers to auto-detect by writing a BOM at the start of the stream.
|
| ...
|
| between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a
| UTF16BEReader, and returns it. Somehow it needs a method of pushing back the
| characters it's already read into the stream. Then, when the caller calls
| s.read(), the exact encoding is known, and the stream is (re)read from the
| start.
|


In the former case (the stream includes a BOM), would re-reading from the start
include the BOM? If so, what good would it be for a user who just wants to read
the file, independent of the encoding? (did I make myself clear?)

-----------------------
Carlos Santander Bernal
Aug 04 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ces5mu$r8p$1 digitaldaemon.com>, Carlos Santander B. says...

In the former case (the stream includes a BOM), would re-reading from the start
include the BOM?

Good question. I guess probably not. If the encoding is known, then it's known; since a BOM serves only to identify the encoding, you don't need to re-read it in this instance. That said, it's still best that readers be prepared to ignore it. That is, if a reader reads U+FEFF as the first character, it would be harmless to throw that character away and return instead the second one.

Pretty much all BOM related questions are answered here:
http://www.unicode.org/faq/utf_bom.html#BOM
If so, what good would it be for a user who just wants to read
the file, independent of the encoding? (did I make myself clear?)

If you fail to discard a BOM, and accidently treat it as a character, it will appear to your application as the character U+FEFF (ZERO WIDTH NON-BREAKING SPACE). It will display as a zero-width space. It has a general category of Cf (which actually makes it a formatting control, not a space!). Basically, it tries as hard as it can to do nothing at all. So it's useless to the "user who just wants to read the file" - useless, but harmless, most especially if you can recognise it and throw it away. Arcane Jill
Aug 04 2004
prev sibling parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 4 Aug 2004 16:58:48 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:

<snip>

 Got an unrelated question for you. In the stream function void read(out 
 int),
 there is an assumption that the bytes will be embedded in the stream in
 little-endian order. Should applications assume (a) it's always little 
 endian,
 regardless of host architecture, or (b) it's always host-byte order. Is 
 there a
 big endian version? Is there a network byte order version?

 Should there be?

I think we go with (b). I think it is best handled with a filter. eg.

# Stream s = new BigEndian(new FileStream("test.dat",FileMode.READ));

so BigEndian looks like:

# class BigEndian {
#     ulong read(void* address, ulong length) {
#         version(LittleEndian) {
#             // on a little endian system we convert.
#         }
#         else {
#             // no conversion is required.
#         }
#     }
# }

You'll need a LittleEndian one too. Using the filter you can guarantee the endian-ness of the data.

Of course if you're sending binary data from a LE to BE system via sockets you need to know what you're doing, and you need to decide what endian-ness will be used for the transmission; in this case on the one end of the socket you'll need a toBigEndian/toLittleEndian filter.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ceq0mg$20d8$1 digitaldaemon.com...
 In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...

 I'm one of those folks who is very much in favor of a file reader being able
 to automatically detect the encoding in it. Hence, D can auto-detect the UTF
 formatting. So, I'd recommend that the format be an enum that can be
 specifically set or can be auto-detected. Different resulting behaviors can
 be handled with virtual functions.

 With all due respect, Walter, that's not really feasible. It is very hard, for
 example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention
 ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not
 all encodings make life that easy for us. You can't use an enum, because there
 are an unlimited number of possible encodings.

I understand there are limits to this. I think it should be done where possible, and that it should not be precluded by design.
 Besides, if you're parsing an HTTP header, and if, within that header, you read
 "Content-Type: text/plain; encoding=MAC-ROMAN", then you can be pretty sure you
 know what the encoding of the following document is going to be. Other formats
 have different indicators (HTML meta tags; Python source file comments; the
 list is endless). Only at the application level can you /really/ sort this out,
 because the application presumably knows what it's looking at.

Yes. And this argues for a capability to switch horses midstream, so to speak.
Aug 04 2004