
digitalmars.D - Streams and encoding

reply Sean Kelly <sean f4.ca> writes:
I finally got back on my stream mods today and had a question:  how should the
wrapper class know the encoding scheme of the low-level data?

For example, say all of the formatted IO code is in a mixin or base class
(assume a base class for the sake of discussion) that calls a read(void*, size_t)
or write(void*, size_t) method in the derived class.  Now say I want to read a
char, wchar, or dchar from the stream.  How many bytes should I read, and how do
I know what the encoding format is?  C++ streams handle this fairly simply by
making the char type a template parameter:

# class Stream(CharT) {
#     Stream get(CharT) {}
#     Stream put(CharT) {}
# }

This has the obvious limitation that the programmer must instantiate the proper
type of stream for the data format he is trying to read (as there is only one
get/put method for any char type: CharT).  But it makes things pretty explicit:
Stream!(char) means "this is a stream formatted in UTF8."

The other option I can think of offhand would be to have a class member that
the derived class could set which specifies the encoding format:

# class Stream {
#     enum Encoding{ UTF8, UTF16, UTF32 }
#     Encoding encoding;
#     this() { encoding = Encoding.UTF8; }
#     Stream get(char) {}
#     Stream get(wchar) {}
#     Stream get(dchar) {}
#     ...
# }
# 
# class File: Stream {
#     void open(wchar[] filename) { encoding = Encoding.UTF16; }
# }

This has the benefit of allowing the user to read and write any char type with a
single instantiation, but requires greater complexity in the Stream class and in
the derived class.  And I wonder if such flexibility is truly necessary.

Any other design possibilities?  Preferences?  I'm really trying more to
establish a good formatted IO design than to work out the perfect stream API.
Any other weird issues would be welcome also.


Sean
Aug 03 2004
next sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceopfj$1hcl$1 digitaldaemon.com>, Sean Kelly says...
I finally got back on my stream mods today and had a question:  how should the
wrapper class know the encoding scheme of the low-level data?

Simple answer - it shouldn't have to.

I suggest using a specialized transcoding filter for such things. That's what Java does (Java calls them Readers and Writers), and Java's streams have been hailed as a shining example of how to do things correctly. Then your streams just connect together naturally, as others have shown in other recent threads. e.g.:

# Stream s = new ZipStream(new BufferedStream(new FilterStream(
#     new Windows1252Reader(stdin))));

(or something similar). You can have factory methods to create transcoders where the encoding is not known until runtime.

Jill
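A factory of the kind Jill mentions might look like this minimal D sketch (the Reader hierarchy and the transcoder class names here are hypothetical, for illustration only - not an existing API):

# // Hypothetical transcoder factory; Reader, Utf8Reader, Utf16Reader and
# // Windows1252Reader are assumed class names.
# Reader createReader(char[] encoding, Stream source)
# {
#     switch (encoding)
#     {
#         case "UTF-8":        return new Utf8Reader(source);
#         case "UTF-16LE":     return new Utf16Reader(source);
#         case "windows-1252": return new Windows1252Reader(source);
#         default:
#             throw new Error("unsupported encoding: " ~ encoding);
#     }
# }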
Aug 03 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <ceor8d$1ihu$1 digitaldaemon.com>, Arcane Jill says...
In article <ceopfj$1hcl$1 digitaldaemon.com>, Sean Kelly says...
I finally got back on my stream mods today and had a question:  how should the
wrapper class know the encoding scheme of the low-level data?

Simple answer - it shouldn't have to.

Works for me. So how does a formatted read/write routine know which format it's targeting?
I suggest using a specialized transcoding filter for such things. That's what
Java does (Java calls them Readers and Writers), and Java's streams have been
hailed as a shining example of how to do things correctly. Then your streams
just connect together naturally, as others have shown in other recent threads.
e.g.:

# Stream s = new ZipStream(new BufferedStream(new FilterStream(new
Windows1252Reader(stdin))));

Okay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format? ie. there would be a UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc?

Sean
Aug 03 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceorv6$1iti$1 digitaldaemon.com>, Sean Kelly says...

Okay, so all the formatted IO routines go in a Reader class and the type of the
reader class determines the format?  ie. there would be an UTF8Writer,
UTF8Reader, UTF16Writer, UTF16Reader, etc?

Got it in one. Plus, you can have a factory function like createReader(char[]), so you can do

# Reader r = createReader("UTF-16LE");

etc. (for when the type is known at run time, not compile time, which is usually). The implementation of createReader() is just a big switch statement, with each case returning a new instance of the relevant class.

(I swapped your questions around. Here's the first one).
Works for me.  So how does a formatted read/write routine know which format it's
targeting?

You got me there. I think the question's too vague, and the answer application-specific.

Generally speaking, at some level, the encoding is known, somehow. Maybe it's specified in the text file itself (XML and HTTP pull this trick - for it to work the very start of the file must comprise only ASCII characters (although they can be encoded in a UTF)); maybe it's specified in a configuration file; maybe it's deduced using some heuristic test; maybe the OS default is assumed. At the level where the encoding is known, decode it (into UTF-8), and then you can use byte streams from then on.

As parabolis said, a stream, in the abstract, deals in ubytes, not chars (because that's what you write to files, sockets, etc.). Classes which implement read() or write() in units other than ubyte shouldn't really be called "streams", which of course is why Java calls them Readers and Writers. (Maybe "filters" for the general case).

Arcane Jill
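One concrete heuristic of this sort is sniffing a Unicode byte-order mark at the start of the data. A minimal D sketch (the function name is made up for illustration; it falls back to UTF-8 when no BOM is present):

# // Guess an encoding from a leading byte-order mark, if any.
# // Illustrative only - real detection would also consider UTF-32 BOMs
# // and content heuristics.
# char[] sniffEncoding(ubyte[] prefix)
# {
#     if (prefix.length >= 3 && prefix[0] == 0xEF
#         && prefix[1] == 0xBB && prefix[2] == 0xBF)
#         return "UTF-8";
#     if (prefix.length >= 2 && prefix[0] == 0xFF && prefix[1] == 0xFE)
#         return "UTF-16LE";
#     if (prefix.length >= 2 && prefix[0] == 0xFE && prefix[1] == 0xFF)
#         return "UTF-16BE";
#     return "UTF-8"; // assume a default
# }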
Aug 03 2004
parent Sean Kelly <sean f4.ca> writes:
In article <ceou9t$1kbq$1 digitaldaemon.com>, Arcane Jill says...
In article <ceorv6$1iti$1 digitaldaemon.com>, Sean Kelly says...

Okay, so all the formatted IO routines go in a Reader class and the type of the
reader class determines the format?  ie. there would be an UTF8Writer,
UTF8Reader, UTF16Writer, UTF16Reader, etc?

Got it in one. Plus, you can have a factory function like createReader(char[]), so you can do

# Reader r = createReader("UTF-16LE");

etc. (for when the type is known at run time, not compile time, which is usually). The implementation of createReader() is just a big switch statement, with each case returning a new instance of the relevant class.

(I swapped your questions around. Here's the first one).
Works for me.  So how does a formatted read/write routine know which format it's
targeting?

You got me there.

No worries. If we've got a class per format then it knows implicitly what format to convert to/from.

Sean
Aug 03 2004
prev sibling next sibling parent reply parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 I finally got back on my stream mods today and had a question:  how should the
 wrapper class know the encoding scheme of the low-level data?
 

I have been wondering who was working on a Stream library. I have many thoughts, many of which are covered in OT - scanf in Java. Here are some notes:

In C (and C++ by extension, I would imagine) the char type is the smallest addressable cell in memory. In D the char is a UTF-8 8-bit code unit, which is quite a different thing. I would suggest you seriously consider defining basic IO using either the ubyte (which represents a general 8-bit value) or possibly the data type that is the native cell size used in memory (something like size_t, I believe).

Also I have noticed the tendency for people to not make the distinction between Input and Output streams. This leads to some problems. Say I want to write a class to handle CRC32 on stream data. It is far simpler and less error prone to compute such a digest on a stream in which data flows in only one direction, especially in a multi-threaded environment. Also the Input and Output distinction allows for stream pumps that automatically pull data from one and push data into another. This is especially useful with bifurcating streams that also do logging.

As for the templatization of streams, I believe a pair of generic data input/output stream classes can be written using templates which will do impedance matching from the 8-bit streams to the n-bit data type you want to read. So you have to write 8-, 16-, 32- and possibly 64- and 128-bit functions.
Here is the foundation of the stream library I imagine:
================================================================
interface DataSink {
     uint write( ubyte[] data, uint off = 0, uint len = 0);
}

interface DataSource {
     uint read( inout ubyte[] data, uint off = 0, uint len = 0);
     ulong seek( ulong size );
}
================================================================

The data being read/written by native interface classes:
================================================================
FileInputStream : DataSource
FileOutputStream : DataSink
SocketInputStream : DataSource
SocketOutputStream : DataSink
MMapInputStream : DataSource
MMapOutputStream : DataSink
================================================================

The data is then manipulated providing buffering, digesting, en/de-cryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...
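The impedance matching described above might be sketched roughly as follows (DataInputStream and readUInt are hypothetical names layered on the DataSource interface; little-endian byte order is assumed):
================================================================
// Hypothetical typed reader over an 8-bit DataSource. Sketch only;
// error handling and endianness policy are simplified.
class DataInputStream
{
    private DataSource src;

    this(DataSource src) { this.src = src; }

    uint readUInt()
    {
        ubyte[] buf = new ubyte[4];
        if (src.read(buf) != 4)
            throw new Error("unexpected end of stream");
        return buf[0]
             | (cast(uint)buf[1]) << 8
             | (cast(uint)buf[2]) << 16
             | (cast(uint)buf[3]) << 24;
    }
}
================================================================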
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
wrote:

<snip>

 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

I think you need functions in the form:

# ulong write(void* data, ulong len = 0, ulong off = 0);

notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong.

If you use ubyte[] you don't need len or off, as you can call with:

# ubyte[] big = cast(ubyte[])"regan was here";
# write(big[6..9]);

to achieve both.

The void* allows easy specialised write functions, eg.

# bool write(int x) { write(&x, x.sizeof); }

I'm not sure whether uint or ulong should be used, anyone got opinions/reasons for one or the other?
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to write 
 interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

See my earlier post (with source) on how this works. Note there was a problem with it which I have since fixed, changing 'super.' to 'this.' in the stream template class.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
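The bolt-in shape Regan describes might look roughly like this in D (a loose sketch based on the description here, not his actual posted source; RawFile is a hypothetical byte-level class):

class InputStream(Base) : Base
{
    alias Base.read read;   // keep the raw read(ubyte[]) visible

    // Typed convenience read built on the Base's raw byte read().
    bool read(out int x)
    {
        ubyte[] buf = new ubyte[x.sizeof];
        if (read(buf) != buf.length)
            return false;
        x = *cast(int*)buf.ptr;
        return true;
    }
}

// Usage: bolt the generic operations onto a concrete byte source.
// alias InputStream!(RawFile) FileReader;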
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 <snip>
 
 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

 I think you need functions in the form:

 # ulong write(void* data, ulong len = 0, ulong off = 0);

 notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off, as you can call with:

 # ubyte[] big = cast(ubyte[])"regan was here";
 # write(big[6..9]);

 to achieve both.

I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.
 
 The void* allows easy specialised write functions, eg.
   bool write(int x) { write(&x,x.sizeof); }

The void* is a pointer with no associated type. The arrays in D are infinitely better than void* pointers because arrays have extra information. As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to 
 write interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

I made an argument that I believe input and output should be clearly separated, which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables separate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.
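The CRC32 example from earlier in the thread shows why the one-directional split helps; a minimal sketch (CRC32InputStream and updateCrc32 are hypothetical names, and the actual CRC table/update is omitted):
================================================================
// Hypothetical input-only CRC32 filter. Because data only flows one
// way, the running digest shares no state with any output path.
class CRC32InputStream : DataSource
{
    private DataSource src;
    private uint crc = 0xFFFFFFFF;

    this(DataSource src) { this.src = src; }

    uint read(inout ubyte[] data, uint off = 0, uint len = 0)
    {
        uint n = src.read(data, off, len);
        for (uint i = 0; i < n; i++)
            crc = updateCrc32(crc, data[off + i]); // assumed helper
        return n;
    }

    ulong seek(ulong size) { return src.seek(size); }

    uint digest() { return crc ^ 0xFFFFFFFF; }
}
================================================================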
Aug 03 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis softhome.net> 
wrote:
 Regan Heath wrote:
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:

 <snip>

 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

 I think you need functions in the form:

 # ulong write(void* data, ulong len = 0, ulong off = 0);

 notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off, as you can call with:

 # ubyte[] big = cast(ubyte[])"regan was here";
 # write(big[6..9]);

 to achieve both.

I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed.

So.. ?
 The len and off parameters allow a caller to take either approach.

Yeah.. we have default parameters, we can provide both options at no cost, so why not.
 The void* allows easy specialised write functions, eg.
   bool write(int x) { write(&x,x.sizeof); }

The void* is a pointer with no associated type.

Correct.
 The arrays in D are infinitely better than void* pointers because arrays 
 have extra information.

Incorrect. D arrays are better for some things, those that need/want the extra information.

Let's ignore our opinions on the use of void* for now: can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?
 As I said earlier in my post the behavior of providing data in a 
 particular non-byte format should be done elsewhere in a single 
 DataXXStream.

Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to 
 write interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

I made an argument that I believe input and output should be clearly separated, which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables separate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.

Sure, wanting to do this does not stop you using bolt-ins. I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis softhome.net> 
 
 The arrays in D are infinitely better than void* pointers because 
 arrays have extra information.

Incorrect. D arrays are better for some things, those that need/want the extra information.

Here I must argue that where C really went wrong was with char*, which allows buffer overruns because you do not know how long the buffer is... I also do not see how you could have used slicing and a void*. How would you know when to stop reading before you had off and len?
 Lets ignore our opinions on the use of void* for now, can you write the 
 write(int x) function above as easily if you do not use void* but use 
 ubyte[] instead?
 

I will do both at the same time... (read on)
 Sure, and when/where you provide it, what will it look like if the 
 underlying write operation takes a ubyte[] and not a void*? is it 
 possible? is it worse than simply using a void*?

I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P

One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin.

Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
================================================================
int readInt( ubyte buf, uint off = 0 ) {
    if( buf.length < off+4 )
        throw new Error( "Buffer overrun" );
    uint result = buf[off+0];
    result |= (cast(int)(buf[off+1])) << 8;
    result |= (cast(int)(buf[off+2])) << 16;
    result |= (cast(int)(buf[off+3])) << 24;
    return result;
}
================================================================
 
 The data being read/written by native interface classes:
 ================================================================
 FileInputStream : DataSource
 FileOutputStream : DataSink
 SocketInputStream : DataSource
 SocketOutputStream : DataSink
 MMapInputStream : DataSource
 MMapOutputStream : DataSink
 ================================================================

 The data is then manipulated providing buffering, digesting, 
 en/de-cryption and [de]compression, etc. Finally it is possible to 
 write interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

I made an argument that I believe input and output should be clearly seperated which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables seperate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.

Sure, wanting to do this does not stop you using bolt-ins. I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning.

I am glad to hear you decided to split them. I think you will find it makes life simpler.

I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 21:41:51 -0400, parabolis <parabolis softhome.net> 
wrote:
 Regan Heath wrote:

 Incorrect. D arrays are better for some things, those that need/want 
 the extra information.

Here I must argue that any knowledge of where C went really wrong was with char* which allows buffer overruns because you do not know how long the buffer is...

 I also do not see how you could have used slicing and a void*.

I didn't/don't use slicing. I think you may be confusing two different points I made.

My first point was that off and len were not required because you can slice into a ubyte[]. So _if_ you use ubyte[] you don't _need_ off and len.

My second point was that instead of ubyte[] you should use void* for convenience. If you use void* you definitely need len.
 How would you know when to stop reading before you had off and len?

I have always had len; my fn prototype is:

# ulong write(void* address, ulong length);

which simply writes length bytes starting at address.
 Lets ignore our opinions on the use of void* for now, can you write the 
 write(int x) function above as easily if you do not use void* but use 
 ubyte[] instead?

I will do both at the same time... (read on)

both? .. on I read ..
 Sure, and when/where you provide it, what will it look like if the 
 underlying write operation takes a ubyte[] and not a void*? is it 
 possible? is it worse than simply using a void*?

I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P

One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin.

Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
================================================================
int readInt( ubyte buf, uint off = 0 ) {

Typo, you missed the [], I have added them below.
 int readInt( ubyte[] buf, uint off = 0 ) {
      if( buf.length <= off+4 )
          throw Error( "Buffer overrun" );
      uint result = buf[off+0];
      result |= (cast(int)(buf[off+1])) << 8;
      result |= (cast(int)(buf[off+2])) << 16;
      result |= (cast(int)(buf[off+3])) << 24;
      return result;
 }
 ================================================================

And this is supposed to be nicer/easier/more efficient than..

bool readInt(out int x) {
    if (read(&x,x.sizeof) != x.sizeof)
        throw new Exception("Out of data");
    return true;
}

As you can see using void* allows very convenient and totally buffer overrun safe code.

<snip>
 I am glad to hear you decided to split them. I think you will find it 
 makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how you 
 deal with the combinatorial problem before I am sold on the idea. If you 
 can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 I didn't/don't use slicing. I think you may be confusing two different 
 points I made.
 
 My first point was that off and len were not required because you can 
 slice into a ubyte[]. So _if_ you use ubyte[] you don't _need_ off and len.
 
 My second point was that instead of ubyte[] you should use void* for 
 convenience. If you use void* you definitely need len.

I see now. I was confused. Sorry.
 Sure, and when/where you provide it, what will it look like if the 
 underlying write operation takes a ubyte[] and not a void*? is it 
 possible? is it worse than simply using a void*?

I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P

One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info builtin.

Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
================================================================
int readInt( ubyte buf, uint off = 0 ) {

Typo, you missed the [], I have added them below.
 int readInt( ubyte[] buf, uint off = 0 ) {
      if( buf.length <= off+4 )
          throw Error( "Buffer overrun" );
      uint result = buf[off+0];
      result |= (cast(int)(buf[off+1])) << 8;
      result |= (cast(int)(buf[off+2])) << 16;
      result |= (cast(int)(buf[off+3])) << 24;
      return result;
 }
 ================================================================

And this is supposed to be nicer/easier/more efficient than..

bool readInt(out int x) {
    if (read(&x,x.sizeof) != x.sizeof)
        throw new Exception("Out of data");
    return true;
}

As you can see using void* allows very convenient and totally buffer overrun safe code.

Show me a safe function that takes void* as a parameter. That was really more the point I was making. There is no way to guarantee in read(void*, uint len) that len is not actually longer than the array someone passes in. When that happens your read function will overwrite the end of the array and eventually write over executable code. Somebody will find that bug and send a specially formatted overly long string that has machine code in it and hijack the program.
 
 <snip>
 
 I am glad to hear you decided to split them. I think you will find it 
 makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how you 
 deal with the combinatorial problem before I am sold on the idea. If 
 you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now. The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> 
wrote:

<snip>

 Show me a safe function that takes void* as a parameter. That was really 
 more the point I was making. There is no way to guarantee in 
 read(void*,uint len) that len is not actually longer than the array 
 someone passes in. When that happens your read function will overwrite 
 the end of the array and eventually write over executable code. Somebody 
 will find that bug and send a specially formatted overly long string 
 that has machine code in it and hijack the program.

I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void*; instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..

.. I have just discovered you can use ubyte[] and get the same sort of function as my void* one, check out...

class Stream
{
    ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0)
    {
        if (length == 0) length = buffer.length;
        buffer[offset..length] = 65;
        return length-offset;
    }

    bool read(out char x)
    {
        if (read(cast(ubyte[])(&x)[0..x.sizeof]) != x.sizeof)
            return false;
        return true;
    }
}

void main()
{
    Stream st = new Stream();
    char c;
    st.read(c);
    printf("%c\n",c);
}

as you can see using a cast, a slice and the address of the char we can do the same thing as with a void*. So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider...

void badBuggyRead(out char x)
{
    read(cast(ubyte[])(&x)[0..1000]);
}

so even tho read uses a ubyte[] it can still overrun.
 <snip>

 I am glad to hear you decided to split them. I think you will find it 
 makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how you 
 deal with the combinatorial problem before I am sold on the idea. If 
 you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other thread in 
 which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
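A runtime factory over those compile-time compositions might be sketched like this (the Reader base type and the switch keys are hypothetical, for illustration only):

// Hypothetical factory bridging compile-time aliases to a runtime
// choice; each alias is still instantiated at compile time, the
// factory merely picks among the concrete classes.
Reader createFileReader(char[] kind, char[] filename)
{
    switch (kind)
    {
        case "crc": return new CRCReader!(File)(filename);
        case "zip": return new ZIPReader!(File)(filename);
        default:    return new File(filename);
    }
}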
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 Show me a safe function that takes void* as a parameter. That was 
 really more the point I was making. There is no way to guarantee in 
 read(void*,uint len) that len is not actually longer than the array 
 someone passes in. When that happens your read function will overwrite 
 the end of the array and eventually write over executable code. 
 Somebody will find that bug and send a specially formatted overly long 
 string that has machine code in it and hijack the program.

I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void*; instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..

That is a good point.
 
 ...
 
 I have just discovered you can use ubyte[] and get the same sort of 
 function as my void* one, check out...
 
 class Stream
 {
     ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0)
     {
         if (length == 0) length = buffer.length;
         buffer[offset..length] = 65;

Now that is pretty neat.
 
 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, consider...
 
 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }
 
 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.
 
 <snip>

 I am glad to hear you decided to split them. I think you will find 
 it makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how 
 you deal with the combinatorial problem before I am sold on the 
 idea. If you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other thread 
 in which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have, eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

Now, this is fine for types we know about at compile time; however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

Consider the number of combinations of just Readers that are possible:

File,Net,Mem       - choose 1 of 3
Compression }
CRC         } - choose any number and in any order
Buffering   }
Image,Audio,Video  - choose 1 of 3

If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.
Aug 03 2004
parent reply Regan Heath <regan netwin.co.nz> writes:
On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> 
wrote:

 Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis softhome.net> 
 wrote:

 Show me a safe function that takes void* as a parameter. That was 
 really more the point I was making. There is no way to guarantee in 
 read(void*,uint len) that len is not actually longer than the array 
 someone passes in. When that happens your read function will overwrite 
 the end of the array and eventually write over executable code. 
 Somebody will find that bug and send a specially formatted overly long 
 string that has machine code in it and hijack the program.

I agree this is a problem, I have been dealing with it for years at work (we work with C only). The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void*; instead they call the ones provided for int, float, ubyte[], and so on. However, someone might want the void* ones in order to read/write a struct..

That is a good point.
 ...

 I have just discovered you can use ubyte[] and get the same sort of 
 function as my void* one, check out...

 class Stream
 {
     ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0)
     {
         if (length == 0) length = buffer.length;
         buffer[offset..length] = 65;

Now that is pretty neat.

Just that bit.. or the whole thing? That bit above was a little hack, it sets the whole buffer to 65 or ascii 'A'.
 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, consider...

 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }

 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.

But people will. Assume you're trying to read/write a struct, int, float, whatever: you _have_ to write code like that above and you might get it wrong. It's exactly the same as if you were using:

  read(void* address, ulong length);

you might call that wrong too. I cannot see a difference, and void* is easier to use and smaller than void[].
 <snip>

 I am glad to hear you decided to split them. I think you will find 
 it makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how 
 you deal with the combinatorial problem before I am sold on the 
 idea. If you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other thread 
 in which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have, eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

Now, this is fine for types we know about at compile time; however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

Consider the number of combinations of just Readers that are possible:

File,Net,Mem       - choose 1 of 3
Compression }
CRC         } - choose any number and in any order
Buffering   }
Image,Audio,Video  - choose 1 of 3

If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.

Yeah.. so? When I need one I make an alias and use it.. when I need another I make an alias and use it. It's no different to simply typing new A(new B(new C())) when you use it, _except_ that if you re-use it in several places then my alias is neater. I am not going to alias all x possible combinations right now :)

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis 
 <parabolis softhome.net> wrote:

 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, consider...

 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }

 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.

But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:

Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure, then it is expected they will not do it in a way that causes a buffer overrun.

Most buffer overruns are a result of the fact that dealing with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]). You fail to do that with void*.
 
   read(void* address, ulong length);
 
 you might call that wrong too. I cannot see a difference and void* is 
 easier to use and smaller than void[].
 
 <snip>

 I am glad to hear you decided to split them. I think you will find 
 it makes life simpler.

 I am not much of a generic programmer. So I am waiting to see how 
 you deal with the combinatorial problem before I am sold on the 
 idea. If you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Sorry I meant the problem with threads and shared buffers should be easier now.

:)
 The bit about the combinatorial problem goes back to the other 
 thread in which I wanted to see how you combine multiple streams...

Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have, eg.

alias OutputStream!(InputStream!(RawFile)) File;

or something, I have not tried splitting them yet, then..

alias CRCReader!(File) CRCFileReader;
alias CRCWriter!(File) CRCFileWriter;
alias ZIPReader!(File) ZIPFileReader;
alias ZIPWriter!(File) ZIPFileWriter;

Now, this is fine for types we know about at compile time; however we may need to choose at runtime, so some sort of factory approach will have to be used...

Regan

Consider the number of combinations of just Readers that are possible:

File,Net,Mem       - choose 1 of 3
Compression }
CRC         } - choose any number and in any order
Buffering   }
Image,Audio,Video  - choose 1 of 3

If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.

Yeah.. so? When I need one I make an alias and use it.. when I need another I make an alias and use it. It's no different to simply typing new A(new B(new C())) when you use it, _except_ that if you re-use it in several places then my alias is neater. I am not going to alias all x possible combinations right now :)

So for something that reads from a file, then does buffering, then decompression, then computes a CRC check of the input stream and reads image data, you would use something like this:

================================================================
alias BufferedInputStream!(FileInputStream)
     BufferedFileInputStream;
alias DecompressionInputStream!(BufferedFileInputStream)
     DecompressionBufferedFileInputStream;
alias CRCInputStream!(DecompressionBufferedFileInputStream)
     CRCDecompressionBufferedFileInputStream;
alias ImageInputStream!(CRCDecompressionBufferedFileInputStream)
     ImageCRCDecompressionBufferedFileInputStream;

CRCInputStream crc_in = new
     CRCDecompressionBufferedFileInputStream(filename);
ImageInputStream iin = new
     ImageCRCDecompressionBufferedFileInputStream(crc_in);
================================================================
File - 10 times
Buffered - 10 times
Decompression - 8 times
CRC - 7 times
Image - 4 times
================================

I cannot imagine why you would like having all that alias clutter up your file instead of just using the minimal:

================================================================
CRCInputStream crc_in = new CRCInputStream
(   new DecompressionInputStream
     (   new BufferedInputStream
         (  new FileInputStream( filename )
         )
     )
);
ImageInputStream iin = new ImageInputStream( crc_in );
================================================================
File - 1 time
Buffered - 1 time
Decompression - 1 time
CRC - 2 times
Image - 2 times
================
Aug 04 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 04 Aug 2004 11:37:05 -0400, parabolis <parabolis softhome.net> 
wrote:
 Regan Heath wrote:

 On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis softhome.net> 
 wrote:

 Regan Heath wrote:

 On Tue, 03 Aug 2004 23:30:03 -0400, parabolis 
 <parabolis softhome.net> wrote:

 So now the read function takes a ubyte[] and is itself buffer safe.. 
 however this does not mean buffer overruns are not possible, 
 consider...

 void badBuggyRead(out char x)
 {
     read(cast(ubyte[])(&x)[0..1000]);
 }

 so even tho read uses a ubyte[] it can still overrun.

You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.

But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:

Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure, then it is expected they will not do it in a way that causes a buffer overrun.

Most buffer overruns are a result of the fact that dealing with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]).

I don't think so.
 You fail to do that with void*.

I don't try. Because it's impossible. <snip>
 I am not going to alias all x possible combinations right now :)

So for something that reads from a file then does buffering then decompression then computes a CRC check of the input stream and reads image data you would use something like this:

Nope.

alias ImageStream!(CRCStream!(DecompressStream!(File))) CompressedImageCRC; // my 'File' is buffered.
CompressedImageCRC f = new CompressedImageCRC();

Or more likely 'CompressedImageCRC' will be replaced by a name that has context where I use it; if for example it was an image resource for a game it might be simply 'Image'.
 ================================================================
 alias BufferedInputStream!(FileInputStream)
      BufferedFileInputStream;
 alias DecompressionInputStream!(BufferedFileInputStream)
      DecompressionBufferedFileInputStream;
 alias CRCInputStream!(DecompressionBufferedFileInputStream)
      CRCDecompressionBufferedFileInputStream;
 alias ImageInputStream!(CRCDecompressionBufferedFileInputStream)
      ImageCRCDecompressionBufferedFileInputStream;

 CRCInputStream crc_in = new
      CRCDecompressionBufferedFileInputStream(filename);
 ImageInputStream iin = new
      ImageCRCDecompressionBufferedFileInputStream(crc_in);
 ================================================================
 File - 10 times
 Buffered - 10 times
 Decompression - 8 times
 CRC - 7 times
 Image - 4 times
 ================================

 I cannot imagine why you would like having all that alias clutter up 
 your file instead of just using the minimal:
 ================================================================
 CRCInputStream crc_in = new CRCInputStream
 (   new DecompressionInputStream
      (   new BufferedInputStream
          (  new FileInputStream( filename )
          )
      )
 );
 ImageInputStream iin = new ImageInputStream( crc_in );
 ================================================================
 File - 1 time
 Buffered - 1 time
 Decompression - 1 time
 CRC - 2 times
 Image - 2 times
 ================

Now instantiate it 10 times and give me a tally.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
prev sibling parent reply Andy Friesen <andy ikagames.com> writes:
parabolis wrote:

 Regan Heath wrote:
 
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:

 <snip>

 Here is the foundation of the stream library I imagine:
 ================================================================
 interface DataSink {
      uint write( ubyte[] data, uint off = 0, uint len = 0);
 }

 interface DataSource {
      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
      ulong seek( ulong size );
 }
 ================================================================

I think you need functions in the form:

    ulong write(void* data, ulong len = 0, ulong off = 0);

notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong. If you use ubyte[] you don't need len or off, as you can call with:

    ubyte[] big = cast(ubyte[]) "regan was here";
    write(big[6..9]);

to achieve both.

I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.

Slicing does not create garbage. Arrays really are value types that get copied when you pass them to a function. You can generally treat them as reference types because the data they refer to is not copied along with them.

An array is quite literally little more than this:

    struct Array(T) {
        T* data;
        int length;
    }

Might I suggest that DataSources and DataSinks use void[]?

void[] knows how many bytes it points to and is slicable. Whether or not void[] was created for this exact scenario is uncertain, but it is exceptionally well suited to the task regardless.

(incidentally, slicing void* is legal as well)
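The claim that a slice is just a new (pointer, length) pair over the same storage, not a copy of the data, can be checked directly. A minimal sketch in modern D syntax:

```d
void main() {
    int[] a = [1, 2, 3, 4, 5];
    int[] s = a[1 .. 4];   // new slice header, no copy of the elements

    assert(s.length == 3);
    assert(s.ptr == a.ptr + 1);   // same underlying storage

    s[0] = 42;
    assert(a[1] == 42);           // the write is visible through 'a'
}
```

Since the element data is shared, only the small header is passed around, which is why slicing in a function call costs no heap allocation.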
 The void* is a pointer with no associated type. The arrays in D are 
 infinitely better than void* pointers because arrays have extra 
 information. As I said earlier in my post the behavior of providing data 
 in a particular non-byte format should be done elsewhere in a single 
 DataXXStream.

The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning.

This is a textbook case of the right place to use void*. :) (or void[])

-- andy
Aug 03 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Tue, 03 Aug 2004 21:30:29 -0700, Andy Friesen <andy ikagames.com> wrote:
 On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis softhome.net> 
 wrote:
 I will concede the order was wrong. However I believe the slicing will 
 need to create another array wrapper in memory which is then going to 
 have to be GCed. The len and off parameters allow a caller to take 
 either approach.

Slicing does not create garbage.

Really? Doesn't slicing create another array structure (the one you have described below) exactly the same as if/when you pass one to a function, so..

void foo(char[] a) { }
void main() {
    char[] a = "12345";
    foo(a[1..3]);
}

the above code creates 3 arrays:
1 - 'a' at the start of main
2 - one for the slice
3 - one for the function call.

Leaving out the slice creates one less copy of the array (not the data). I think that is what parabolis meant.
 Arrays really are value types that get copied when you pass them to a 
 function.  You can generally treat them as reference types because the 
 data they refer to is not copied along with them.

 An array is quite literally little more than this:

      struct Array(T) {
          T* data;
          int length;
      }

 Might I suggest that DataSources and DataSinks use void[]?

 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

 (incidently, slicing void* is legal as well)

 The void* is a pointer with no associated type. The arrays in D are 
 infinitely better than void* pointers because arrays have extra 
 information. As I said earlier in my post the behavior of providing 
 data in a particular non-byte format should be done elsewhere in a 
 single DataXXStream.

The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[])

I agree void* or void[] should be used. Parabolis's other concern was a buffer overrun, but as I see it neither void[], void*, nor ubyte[] is any more buffer safe (see my other post for a detailed explanation).

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply Andy Friesen <andy ikagames.com> writes:
Regan Heath wrote:

 Slicing does not create garbage.

Really? Doesn't slicing create another array structure (the one you have described below) exactly the same as if/when you pass one to a function, so..

void foo(char[] a) { }
void main() {
    char[] a = "12345";
    foo(a[1..3]);
}

the above code creates 3 arrays:
1 - 'a' at the start of main
2 - one for the slice
3 - one for the function call.

Leaving out the slice creates one less copy of the array (not the data). I think that is what parabolis meant.

Sure, but the second two can probably be optimized into one and the same. Besides, it's stack space. Nothing is faster than stack allocation. (sub esp, ...)
 The whole idea behind DataSources and DataSinks is that they just pull 
 bytes in and out of some other place without ever having any concern 
 for their meaning.

 This is a textbook case of the right place to use void*. :)  (or void[])

I agree void* or void[] should be used. Parabolis's other concern was a buffer overrun, but as I see it neither void[], void * or ubyte[] are any more buffer safe (see my other post for a detailed explaination)

References are to be preferred over pointers in C++ because constructing a null reference isn't easily possible to do by accident. It's easy to do on purpose, but if you do, Santa will put you on his Naughty list and give you coal. Also, your programs might crash or something.

D arrays are the same way. Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)

-- andy
Aug 03 2004
next sibling parent reply parabolis <parabolis softhome.net> writes:
Andy Friesen wrote:

 Besides, it's stack space.  Nothing is faster than stack allocation. 
 (sub esp, ...)

Sure there is. Not allocating is infinitely faster. :)
Aug 03 2004
parent reply Andy Friesen <andy ikagames.com> writes:
parabolis wrote:
 Andy Friesen wrote:
 
 Besides, it's stack space.  Nothing is faster than stack allocation. 
 (sub esp, ...)

Sure there is. Not allocating is infinitely faster. :)

Passing an array slice as an argument is exactly the same as passing a pointer to its contents and a size. (the exact same code should be emitted)

This is why the %.*s trick works with printf. The length gets pushed first, then the pointer, which just so happens to be the same format as expected by %.*s.

    printf("%.*s\n", str);  <===>  printf("%.*s\n", str.length, &str[0]);

While we're on the topic of speed hacking, though, might I suggest the following for improving application performance:

    main() { return 0; }

(it reduces memory consumption too!) ;)

-- andy
Aug 03 2004
parent "Bent Rasmussen" <exo bent-rasmussen.info> writes:
 While we're on the topic of speed hacking, though, might I suggest the
 following for improving application performance:

      main() { return 0; }

 (it reduces memory consumption too!)

That is post-mature optimization. You should never have created application.d in the first place! :-)
Aug 04 2004
prev sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...
D arrays are the same way.  Accidentally constructing an invalid array 
is much less likely to occur than using an explicit pointer/length pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I had actually added wrapper functions to unformatted read/write all primitive types but recently removed them because they seemed redundant. I suppose if there's enough of a demand I'll add them back.

Sean
Aug 04 2004
next sibling parent reply parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...
 
D arrays are the same way.  Accidentally constructing an invalid array 
is much less likely to occur than using an explicit pointer/length pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I am pretty sure the second read in your example parses by treating the address of x as a ubyte array and then slicing into it, which creates a valid ubyte[] array to pass to a function.
Aug 04 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 04 Aug 2004 11:55:43 -0400, parabolis <parabolis softhome.net> 
wrote:

 Sean Kelly wrote:

 In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...

 D arrays are the same way.  Accidentally constructing an invalid array 
 is much less likely to occur than using an explicit pointer/length 
 pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I am pretty sure the second read in your example parses by treating the address of x as a ubyte array and then slicing into it, which creates a valid ubyte[] array to pass to a function.

It's not guaranteed to be valid. Replace x.sizeof with 1000 and it's an invalid ubyte[] array.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
prev sibling parent Andy Friesen <andy ikagames.com> writes:
Sean Kelly wrote:

 In article <cepsao$1vbo$1 digitaldaemon.com>, Andy Friesen says...
 
D arrays are the same way.  Accidentally constructing an invalid array 
is much less likely to occur than using an explicit pointer/length pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( cast(ubyte[]) &x[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory. To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I changed my mind. You're right. :)

Getting an invalid array is hard, except when you start slicing pointers, at which point it becomes a bit too easy.

-- andy
Aug 04 2004
prev sibling next sibling parent parabolis <parabolis softhome.net> writes:
Andy Friesen wrote:
 I will concede the order was wrong. However I believe the slicing will 
 need to create another array wrapper in memory which is then going to 
 have to be GCed. The len and off parameters allow a caller to take 
 either approach.

Slicing does not create garbage. Arrays really are value types that get copied when you pass them to a function. You can generally treat them as reference types because the data they refer to is not copied along with them.

An array is quite literally little more than this:

    struct Array(T) {
        T* data;
        int length;
    }

That is what I meant by a wrapper. It is actually defined in phobos\internal\adi.d. Given that it is a struct, it will be created on the stack and thus not GCed. However I still like to have the option to decide between the two. :)
 Might I suggest that DataSources and DataSinks use void[]?
 
 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.
 
 (incidently, slicing void* is legal as well)
 
 The void* is a pointer with no associated type. The arrays in D are 
 infinitely better than void* pointers because arrays have extra 
 information. As I said earlier in my post the behavior of providing 
 data in a particular non-byte format should be done elsewhere in a 
 single DataXXStream.

The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning. This is a textbook case of the right place to use void*. :) (or void[])

I had no idea there is a void[] in D and will have to consider it. As I explained in another post this is a textbook example of when *not* to use void*. If void[] exists then its use might be justified but honestly it warps my mind even trying to consider it.
Aug 03 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
Andy Friesen wrote:

 Might I suggest that DataSources and DataSinks use void[]?
 
 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[]. This suggests that at least some people using/writing functions with void[] parameters will do strange things.

I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.
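The .length surprise described above is easy to reproduce. A minimal sketch, assuming modern D syntax:

```d
void main() {
    int[] ints = [1, 2, 3];
    void[] raw = ints;   // any array converts implicitly to void[]

    // int[].length counts elements...
    assert(ints.length == 3);

    // ...but void[].length counts bytes (int.sizeof is 4 in D).
    assert(raw.length == 3 * int.sizeof);
}
```

So the same data reports 3 through one view and 12 through the other, which is exactly the mismatch that could trip up a casual reader of a void[]-based API.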
Aug 05 2004
next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis softhome.net> 
wrote:
 Andy Friesen wrote:

 Might I suggest that DataSources and DataSinks use void[]?

 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].
 However I think the conceptual problems void[] introduces outweigh the 
 benefits. void[] does a rather unexpected thing when it gives you a byte 
 count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.
 The default assumption would be (or at least my default assumption was) 
 that the .length would be the same for an int[] being treated as a 
 void[].

But then you cannot address each of the 4 bytes of each int.
 This suggests that at least some people using/writing functions with 
 void[] parameters will do strange things.

Have you used 'void' as a type before? I suspect only people who have not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.
 I believe the ensuing confusion warrants using a ubyte[] which which has 
 behaviour that people will already understand.

I agree ubyte[] is the 'right' type: the data itself is a bunch of unsigned bytes. But void[] or void* give you ease of use that ubyte[] lacks.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 05 2004
parent reply parabolis <parabolis softhome.net> writes:
Regan Heath wrote:

 On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis softhome.net> 
 wrote:
 
 Andy Friesen wrote:

 Might I suggest that DataSources and DataSinks use void[]?

 void[] knows how many bytes it points to and is slicable.  Whether or 
 not void[] was created for this exact scenerio is uncertain, but they 
 are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].

My argument is that there exists a program in which a bug will be caught. Your argument is that there does not exist a program in which a bug will be caught (that is, that for all programs no bug is caught). Assuming we have the functions:

     read_bad(void*, uint len)
    read_good(ubyte[], uint len)

An excerpt from program P in which a bug is caught is as follows:
============================== P ==============================
    ubyte ex[256];
    read_bad(ex, 0xFFFF_FFFF);  // memory overwritten
    read_good(ex, 0xFFFF_FFFF); // exception thrown
================================================================
P contains a bug that is caught using an array parameter. The existence of P simultaneously proves my argument and disproves yours.

Yet we have had this discussion before, and you seem to insist that since you can find examples where a bug is not caught, my argument must be wrong somehow. I am not familiar with any logic in which such claims are expected. Either you will have to explain the logic system you are using so I can explain my claim properly, or you will have to use the one I am using. Here are some links to mine:

http://en.wikipedia.org/wiki/Logic
http://en.wikipedia.org/wiki/Predicate_logic
http://en.wikipedia.org/wiki/Universal_quantifier
http://en.wikipedia.org/wiki/Existential_quantifier
 
 However I think the conceptual problems void[] introduces outweigh the 
 benefits. void[] does a rather unspected thing when it gives you a 
 byte count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

Wonderful guess. It is entirely more complicated than a ubyte[], which is simply a partition of memory on 8-bit boundaries where you already know how .length and .sizeof will work.
 The default assumption would be (or at least my default assumption 
 was) that the .length would be the same for an int[] being treated as 
 a void[].

But then you cannot address each of the 4 bytes of each int.

Yes that was exactly my point.
 
 This suggests that at least some people using/writing functions with 
 void[] parameters will do strange things.

Have you used 'void' as a type before, I suspect only people who have

No, I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. I have used void* frequently in C/C++, but the size of any void* variable is of course just the size of a pointer.
 not used the concept before will get this wrong, and a simple line of 
 documentation describing void[] will put them right.

Or using ubyte[] will write the documentation for me, and provide some assurance that, in cases where people did not read the docs, they will have a chance of getting it right from the start.
 I believe the ensuing confusion warrants using a ubyte[] which which 
 has behaviour that people will already understand.

I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.

No, actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into bytes forces an arbitrary partition of general data that would not happen conceptually with void. I just feel that using void[] lacks the ease of use you get with ubyte[].
Aug 06 2004
parent Regan Heath <regan netwin.co.nz> writes:
On Fri, 06 Aug 2004 14:29:19 -0400, parabolis <parabolis softhome.net> 
wrote:

<snip>

 I still don't agree with the last bit, void[] gives no _assurance_ at 
 all, neither does ubyte[] or any other [].

My argument is that there exists a program in which a bug will be caught.

Ok.
 You argument is that there does not exist a program such that a bug will 
 be caught (or that for all programs there is no program such that a bug 
 is caught).

I make no such argument. In fact I am having a hard time following the above sentence.
 Assuming we have the function:
       read_bad(void*,uint len)
      read_good(ubyte[],uint len)

 A exerpt from program P in which a bug is caught is as follows:
 ============================== P ==============================
      ubyte ex[256];
       read_bad(ex,0xFFFF_FFFF); // memory overwritten
      read_good(ex,0xFFFF_FFFF); // exception thrown
 ================================================================

 P contains a bug that is caught using an array parameter. The existance 
 of P simultaneously proves my argument and disproves yours.

I don't think you understand my argument.
 Yet we have had this discussion before and you seem to insist that since 
 you can find examples where a bug is not caught my argument must be 
 wrong somehow. I am not familiar with any logic in which such claims are 
 expected. Either you will have to explain the logic system you are using 
 to me so I can explain my claim properly or you will have to use the one 
 I am using. Here are some links to mine:

 http://en.wikipedia.org/wiki/Logic
 http://en.wikipedia.org/wiki/Predicate_logic
 http://en.wikipedia.org/wiki/Universal_quantifier
 http://en.wikipedia.org/wiki/Existential_quantifier

You are missing my point. There is one pivotal fact in this debate, and that is that an array is _not_ guaranteed to be correct about its own length. Consider:

void read(void[] a, int length) {
    if (length > a.length) throw new Exception(..);
}

void main() {
    char* p = "0123456789";
    read(&p[0..1000], 1000);
}

No exception is thrown and memory is overwritten. My point, which you seem to have missed, is simply: "An array is _not_ guaranteed to be correct about its own length."

The reason this point is all-important in this debate is that when trying to write basic types and structs you _will_ need to create the array from the basic type, and when doing so you _will_ need to define the array's length manually. So there is _always_ going to be the same risk of error regardless of whether you use void[] or void*. Since the risk is the same in either case, I vote for the clearest/cleanest/simplest code, as this reduces the risk of error slightly. The code I proposed:

bool read(out int x) { return read(&x, x.sizeof) == x.sizeof; }

is cleaner/clearer and simpler than

bool read(out int x) { return read(&x[0..x.sizeof], x.sizeof) == x.sizeof; }
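For what it's worth, both halves of this exchange can be seen in a few lines of present-day D (a sketch, not either poster's actual code): slicing a real array is bounds-checked, but slicing a raw pointer is not, so an array's .length is only as trustworthy as whoever constructed it:

```d
import core.exception : RangeError;

// Hypothetical sink that trusts the array's own length (illustrative only).
void read(void[] a)
{
    // would copy a.length bytes into a.ptr here
}

void main()
{
    ubyte[] buf = new ubyte[256];
    size_t n = 1000;

    // Slicing a real array IS bounds-checked (in non-release builds):
    bool caught = false;
    try
        read(buf[0 .. n]);      // throws RangeError before read() even runs
    catch (RangeError)
        caught = true;
    assert(caught);

    // Slicing a raw pointer is NOT checked: the slice lies about its length.
    ubyte* p = buf.ptr;
    void[] v = p[0 .. n];       // no exception; v claims 1000 bytes
    assert(v.length == n);
    read(v);                    // read() has no way to detect the lie
}
```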
 However I think the conceptual problems void[] introduces outweigh the 
 benefits. void[] does a rather unspected thing when it gives you a 
 byte count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

Wonderful guess. It is entirely more complicated than a ubyte[] being a partition of memory on 8-bit boundries and knowing how the length and sizeof will work.

The semantics of void* are quite well known: anyone who has used it knows what I have described above, and anyone who doesn't will read the docs on void* before they start. Anyone who _guesses_ what will happen is asking for trouble.
 The default assumption would be (or at least my default assumption 
 was) that the .length would be the same for an int[] being treated as 
 a void[].

But then you cannot address each of the 4 bytes of each int.

Yes that was exactly my point.

So we agree, void[] works in a logical fashion.
 This suggests that at least some people using/writing functions with 
 void[] parameters will do strange things.

Have you used 'void' as a type before, I suspect only people who have

No I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. I have used void* frequently in C/C++ but the size of any void* variables is of course the size of any pointer.

In that case why didn't you read the documentation on the void[] type?
 not used the concept before will get this wrong, and a simple line of 
 documentation describing void[] will put them right.

Or using ubyte[] will write the documentation for me and provide some assurance that in cases in which people did not read the docs will have a chance of getting it right from the start.

Rubbish. You're basically asserting that ubyte[] is known by everyone, and that is simply not true.
 I believe the ensuing confusion warrants using a ubyte[] which which 
 has behaviour that people will already understand.

I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.

No actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into a byte forces an arbitrary partition of general data that would not happen conceptually with void. I just feel that using void[] lacks the ease of use you get with ubyte[].

Ease of use?! How is this:

bool read(out int x) { return read(&x[0..x.sizeof], x.sizeof) == x.sizeof; }

easier than:

bool read(out int x) { return read(&x, x.sizeof) == x.sizeof; }

? There seem to be two points in this argument, "what is easier" and "what is safer"; my opinion, which I have tried to demonstrate, is that neither is safer and void* is easier.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 08 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceulss$2fj6$1 digitaldaemon.com>, parabolis says...

void[] does a rather unspected thing when 
it gives you a byte count in .length. The default assumption 
would be (or at least my default assumption was) that the 
.length would be the same for an int[] being treated as a 
void[].

For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1. Jill
Aug 06 2004
parent reply parabolis <parabolis softhome.net> writes:
Arcane Jill wrote:

 In article <ceulss$2fj6$1 digitaldaemon.com>, parabolis says...
 
 
void[] does a rather unspected thing when 
it gives you a byte count in .length. The default assumption 
would be (or at least my default assumption was) that the 
.length would be the same for an int[] being treated as a 
void[].

For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1.

Sorry, I meant from the docs http://www.digitalmars.com/d/type.html:

     void  no type
      bit  single bit
     byte  signed 8 bits
    ubyte  unsigned 8 bits
    ....
Aug 06 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <cf0e4q$mqi$1 digitaldaemon.com>, parabolis says...
Sorry I meant from the docs

http://www.digitalmars.com/d/type.html:

     void  no type
      bit  single bit
     byte  signed 8 bits
    ubyte  unsigned 8 bits
    ....

Probably an esoteric question, but I assume that the byte-size guarantee only holds on machines with the proper architecture? Not that I expect to see a D compiler for the very few machines that support strange byte sizes; just wondering...

Sean
Aug 06 2004
parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 In article <cf0e4q$mqi$1 digitaldaemon.com>, parabolis says...
 
Sorry I meant from the docs

http://www.digitalmars.com/d/type.html:

    void  no type
     bit  single bit
    byte  signed 8 bits
   ubyte  unsigned 8 bits
   ....

Probably an esoteric question, but I assume that the byte size gurantee is only for machines with the proper architecture? Not that I expect to see a D compiler for the very few machines that support strange byte sizes, just wondering...

Actually that is not a terribly esoteric question. I do not believe the D byte is the same as the C/C++ char (which is what I assume you are referring to in this case). I would be curious to know the answer as well. I would also be curious how a compiler would deal with a Harvard architecture.
Aug 06 2004
prev sibling next sibling parent reply Regan Heath <regan netwin.co.nz> writes:
For another perspective/idea have a look at my thread entitled "My stream 
concept".

I use template bolt-ins.

There was a little problem with it, which was actually trivial to fix: I 
simply replaced the 'super.' calls with 'this.' calls.

It should also be noted that my idea was strictly for creating the base 
level stream classes from the various devices i.e. File, Socket, Memory 
etc. The next step is to add filters (as described by Arcane Jill) I am 
hoping an idea will come to me as to how I can do that, without needing:
   new MemoryMap(new UTF16Filter(new Stream()));
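That chain of constructors is the classic decorator pattern; a toy sketch in present-day D (all names hypothetical, not mango or Phobos API) shows why it composes cleanly but gets verbose:

```d
// Hypothetical decorator chain for stream filters (names illustrative only).
interface InputStream
{
    ubyte[] read(size_t n);
}

class MemoryStream : InputStream
{
    private ubyte[] data;
    private size_t pos;
    this(ubyte[] d) { data = d; }

    ubyte[] read(size_t n)
    {
        auto end = pos + n > data.length ? data.length : pos + n;
        auto slice = data[pos .. end];
        pos = end;
        return slice;
    }
}

class DoublingFilter : InputStream   // stand-in for a real transcoding filter
{
    private InputStream inner;
    this(InputStream s) { inner = s; }

    ubyte[] read(size_t n)
    {
        ubyte[] result;
        foreach (b; inner.read(n))
            result ~= [b, b];        // toy transform: duplicate each byte
        return result;
    }
}

void main()
{
    // The "new X(new Y(new Z))" chaining pattern in action:
    InputStream s = new DoublingFilter(new MemoryStream([1, 2]));
    assert(s.read(2) == [1, 1, 2, 2]);
}
```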

Regan

On Tue, 3 Aug 2004 19:36:19 +0000 (UTC), Sean Kelly <sean f4.ca> wrote:

<snip>

-- Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 03 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <opsb6d5aw85a2sq9 digitalmars.com>, Regan Heath says...
For another perspective/idea have a look at my thread entitled "My stream 
concept".

I use template bolt-ins.

There was a little problem with it, which was actually trivial to fix, I 
simply replaced the 'super.' calls with 'this.' calls.

It should also be noted that my idea was strictly for creating the base 
level stream classes from the various devices i.e. File, Socket, Memory 
etc. The next step is to add filters (as described by Arcane Jill) I am 
hoping an idea will come to me as to how I can do that, without needing:
   new MemoryMap(new UTF16Filter(new Stream()));

My design really set out to extend the original stream approach, and it seemed the logical extension was pretty C++-like. I ended up creating a basic set of interfaces--Stream, InputStream, and OutputStream--and putting all the implementation in templates meant to be mixins. This was somewhat necessary to support the multiple-inheritance type model. So the input file stream looks something like this:

# class InFile : InputStream {
#     mixin StreamDefs SD;
#     mixin InputStreamDefs!(readFile) ISD;
# private:
#     uint readFile(void* buf, size_t size) {}
# }

Works quite well, but it's very different from the Java approach. I'm still not sure which I like better, though I'll grant that the Java version is more flexible (at the expense of verbosity). The other potential issue is the top-heaviness of the design. I am warming up to the idea of separate reader/writer adaptor classes.

Sean
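A runnable sketch of the bolt-in idea in present-day D (names and details are illustrative, not Sean's actual code): the mixin contributes a typed read() that forwards to a raw readFile() supplied by the host class through `this`:

```d
// Hypothetical "bolt-in": the mixin adds typed readers on top of a raw
// readFile() that the host class provides.
mixin template InputStreamDefs()
{
    bool read(out int x)
    {
        return this.readFile(&x, int.sizeof) == int.sizeof;
    }
}

class InFile
{
    mixin InputStreamDefs;

    this() { data = [1, 0, 0, 0]; }   // pretend file contents

private:
    ubyte[] data;
    size_t pos;

    size_t readFile(void* buf, size_t size)
    {
        import core.stdc.string : memcpy;
        size_t remaining = data.length - pos;
        size_t n = size < remaining ? size : remaining;
        memcpy(buf, data.ptr + pos, n);
        pos += n;
        return n;
    }
}

void main()
{
    auto f = new InFile;
    int x;
    assert(f.read(x));
    assert(x == 1);   // bytes [1,0,0,0] as an int, assuming a little-endian host
}
```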
Aug 03 2004
next sibling parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 Works quite well but it's very different from the Java approach.  I'm still not
 sure which I like better, though I'll grant that the Java version is more
 flexible (at the expense of verbosity).  The other potential issue is the
 top-heaviness of the design.  I am warming up to the the idea of separate
 reader/writer adaptor classes.
 

I probably should have made the argument explicit, but I do believe a design that deals with incoming and outgoing data at the same time is suspect when it comes to multi-threading issues. If your code is MT-safe, then you probably did much more work than you had to, with little apparent benefit.
Aug 03 2004
prev sibling parent parabolis <parabolis softhome.net> writes:
Sean Kelly wrote:

 Works quite well but it's very different from the Java approach.  I'm still not
 sure which I like better, though I'll grant that the Java version is more
 flexible (at the expense of verbosity).  The other potential issue is the

I also meant to suggest that I really like much less verbose class names like: FileIS and FileOS...
Aug 03 2004
prev sibling next sibling parent reply "antiAlias" <gblazzer corneleus.com> writes:
What you good folks seem to be describing is pretty much how mango.io
operates. All the questions raised so far are quietly handled by that
library (even the separate input & output buffers, if you want that), so it
might be worthwhile checking it out. It's also house-trained, documented,
and has a raft of additional features that you selectively apply where
appropriate (it's not all tragically intertwined).

As a bonus, there's a ton of functionality already built on top of mango.io,
including http-server, servlet-engine, clustering, logging, local & remote
object caching; even tossing remote D executable objects around a local
network. The DSP project is also targeting Mango as a delivery mechanism.
Check them out over at dsource.org.

I think it's great to have "competing" libraries under way, but at some
point is it worth considering funneling efforts instead? Perhaps not?


"Sean Kelly" <sean f4.ca> wrote in message
news:ceopfj$1hcl$1 digitaldaemon.com...
<snip>

Aug 03 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cep4dd$1nde$1 digitaldaemon.com>, antiAlias says...
What you good folks seem to be describing is pretty much how mango.io
operates. All the questions raised so far are quietly handled by that
library (even the separate input & output buffers, if you want that), so it
might be worthwhile checking it out. It's also house-trained, documented,
and has a raft of additional features that you selectively apply where
appropriate (it's not all tragically intertwined).

Yup. I've played around with Mango and kind of like it. One of the reasons I started these stream mods was to have an alternate design to compare to Mango for the sake of discussion, i.e. I don't want folks to settle on Mango simply because the other choices are missing features.
I think it's great to have "competing" libraries under way, but at some
point is it worth considering funneling efforts instead? Perhaps not?

Definitely.

Sean
Aug 03 2004
parent "antiAlias" <gblazzer corneleus.com> writes:
You are absolutely right. But not many people seem to know about Mango, so
the opportunity for "spreading the news" was too great to pass up :-)

"Sean Kelly" <sean f4.ca> wrote in message
news:cep64d$1o0t$1 digitaldaemon.com...
<snip>

Aug 03 2004
prev sibling parent reply parabolis <parabolis softhome.net> writes:
antiAlias wrote:

 What you good folks seem to be describing is pretty much how mango.io
 operates. All the questions raised so far are quietly handled by that
 library (even the separate input & output buffers, if you want that), so it
 might be worthwhile checking it out. It's also house-trained, documented,
 and has a raft of additional features that you selectively apply where
 appropriate (it's not all tragically intertwined).

I can't help but ask how it manages to do both input and output and still avoid multi-threading issues?
 As a bonus, there's a ton of functionality already built on top of mango.io,
 including http-server, servlet-engine, clustering, logging, local & remote
 object caching; even tossing remote D executable objects around a local
 network. The DSP project is also targeting Mango as a delivery mechanism.
 Check them out over at dsource.org.

I have only started looking over the library. It is rather extensive. The source is well documented and organized. Both are rare to see. I am not fond of the pdf format. Anyway I am impressed at the surface. I will take a look deeper within.
 I think it's great to have "competing" libraries under way, but at some
 point is it worth considering funneling efforts instead? Perhaps not?

On the note of competing libraries I could not help but notice your primes.d implementation. You might want to look at the primes.d on Deimos and consider using that instead. It is rather cleverly designed and could be tuned to do no worse than your bsearch for all ushort values.
Aug 03 2004
parent reply "antiAlias" <gblazzer corneleus.com> writes:
The primes.d thing is now a distant and foggy memory :-)

Can I hook you up with a copy of the latest (much better, with annotated
source) documentation? You'll see Primes.d is gone, along with some other
warts:
http://svn.dsource.org/svn/projects/mango/downloads/mango_beta_9-2_doc.zip


"parabolis" <parabolis softhome.net> wrote in message
news:cep9ee$1ov1$1 digitaldaemon.com...
<snip>

Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
antiAlias wrote:

 The primes.d thing is now a distant and foggy memory :-)
 
 Can I hook you up with a copy of the latest (much better, with annotated
 source) documentation? You'll see Primes.d is gone, along with some other
 warts:
 http://svn.dsource.org/svn/projects/mango/downloads/mango_beta_9-2_doc.zip

lol - The Mango Tree... just got it from the docs :)

I am now without question under the belief that the mango docs are great. I was going to suggest in my last post that I would like to see some docs that cover more of the concept area than just doxygen stuff. I decided that it would probably be too much to expect :)

Quote:
================================================================
Note that these Tokenizers do not maintain any state of their
own. Thus they are all thread-safe.
================================================================
This is always good to know from documentation. :)

However I am curious about IPickle's design. Would it not be possible to serialize objects based on the data in ClassInfo?
Aug 03 2004
parent reply "antiAlias" <gblazzer corneleus.com> writes:
"parabolis" wrote..

 Quote:
 ================================================================
 Note that these Tokenizers do not maintain any state of their
 own. Thus they are all thread-safe.
 ================================================================
 This is always good to know from documentation. :)


 However I am curious about IPickle's design. Would it not be
 possible to serialize objects based on the data in ClassInfo?

Doing it the introspection way (a la Java) has a bunch of issues all of its own, and D doesn't have the power to expose all the requisite data as yet (I could be wrong on the latter though). IPickle was a nice and simple way to approach it; there's no monkey business anywhere (like Java has), it's explicit, and it's very fast. While not an overriding design factor, throughput is one of the main things all the Mango branches/packages keep a watchful eye upon.

Frankly, I'd like to see a decent introspection approach emerge along the way; perhaps as a complement rather than a replacement: within Mango there's no obvious reason why the two approaches could not produce an equivalent serialized stream, and therefore be interchangeable at the endpoints. This is one area where I think getting other people involved in the project would help tremendously.
Aug 03 2004
parent reply parabolis <parabolis softhome.net> writes:
antiAlias wrote:

 "parabolis" wrote..
 
 
Quote:
================================================================
Note that these Tokenizers do not maintain any state of their
own. Thus they are all thread-safe.
================================================================
This is always good to know from documentation. :)


However I am curious about IPickle's design. Would it not be
possible to serialize objects based on the data in ClassInfo?

Doing it the introspection way (ala Java) has a bunch of issues all of it's own, and D doesn't have the power to expose all the requisite data as yet (I could be wrong on the latter though).

I think I was premature to suppose D could do that. I just gave the issue some thought, and there is only enough introspection to make a shallow copy, which is obviously not sufficient.
 IPickle was a nice and simple way to approach it; there's no monkey business
 anywhere (like Java has), it's explicit, and it's very fast. While not an
 overriding design factor, throughput is one of the main things all the Mango
 branches/packages keep an watchful eye upon. Frankly, I'd like to see a
 decent introspection approach emerge along the way; perhaps as a complement
 rather than a replacement: within Mango there's no obvious reason why the
 two approaches could not produce an equivalent serialized stream, and
 therefore be interchangeable at the endpoints.

Any automated serializing algorithm would have to either allow IPickles to [de-]serialize themselves or ignore read/write. However, given that one of those holds, the serialization ought to be compatible.
 This is one area where I think getting other people involved in the project
 would help tremendously.

I think I am probably sold on being willing to help. It is more an issue of whether I can provide anything that will further mango. :)
Aug 03 2004
parent "antiAlias" <gblazzer corneleus.com> writes:
"parabolis" <parabolis softhome.net> wrote in message
news:cepppv$1ugt$1 digitaldaemon.com...
 antiAlias wrote:

 "parabolis" wrote..


Quote:
================================================================
Note that these Tokenizers do not maintain any state of their
own. Thus they are all thread-safe.
================================================================
This is always good to know from documentation. :)


However I am curious about IPickle's design. Would it not be
possible to serialize objects based on the data in ClassInfo?

 Doing it the introspection way (a la Java) has a bunch of issues all of its
 own, and D doesn't have the power to expose all the requisite data as yet (I
 could be wrong on the latter though).

I think I was premature to suppose D could do that. I just gave the issue some thought and there is just enough introspection to make a shallow copy which is obviously not sufficient.
 IPickle was a nice and simple way to approach it; there's no monkey business
 anywhere (like Java has), it's explicit, and it's very fast. While not an
 overriding design factor, throughput is one of the main things all the Mango
 branches/packages keep a watchful eye on. Frankly, I'd like to see a
 decent introspection approach emerge along the way; perhaps as a complement
 rather than a replacement: within Mango there's no obvious reason why the
 two approaches could not produce an equivalent serialized stream, and
 therefore be interchangeable at the endpoints.

Any automated serializing algorithm would have to either allow IPickles to [de-]serialize themselves or ignore read/write. However given one of those holds then the serialization ought to be compatible.
 This is one area where I think getting other people involved in the project
 would help tremendously.

I think I am probably sold on being willing to help. It is more an issue of whether I can provide anything that will further mango. :)

Here's some things that have been noted:
http://www.dsource.org/forums/viewtopic.php?t=174&sid=f5f234d101f0405ebaf9cbdf728af44a

And here's some more:
http://www.dsource.org/forums/viewtopic.php?t=157&sid=f5f234d101f0405ebaf9cbdf728af44a

That's just the tip of the iceberg though. For example, there's no Unicode support as yet since we decided to wait until Hauke & AJ released all the requisite pieces (better to do it properly); IO filters/decorators such as companders have not actually been implemented yet, although there's a solid placeholder for them; there's some annoying things that are currently unimplemented on Unix (noted in the documentation todo list); etc. etc. Plenty of room for improvement all over the place, and that's before you hit the upper decks :-)

The project is very open to other packages hooking in at any level: as a peer, as part of the Mango Tree itself, or as a package user. For example, there's currently a bit-sliced XML/SAX engine in the works (okay; "byte-sliced" then), plus the DSP project mentioned earlier (which looks to be really uber cool ... everyone should check that one out). Having real-world user-code drive the design and functionality is of truly immense value: the bad stuff is typically identified and removed/replaced rather quickly.

Anyone who would like to get involved, please jump on the dsource.org forums!
Aug 03 2004
prev sibling parent reply "Walter" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:ceopfj$1hcl$1 digitaldaemon.com...
 This has the benefit of allowing the user to read and write any char type with
 a single instantiation, but requires greater complexity in the Stream class and
 in the Derived class.  And I wonder if such flexibility is truly necessary.

 Any other design possibilities?  Preferences?  I'm really trying to establish a
 good formatted IO design rather than work out the perfect stream API.  Any
 other weird issues would be welcome also.

I'm one of those folks who is very much in favor of a file reader being able to automatically detect the encoding in it. Hence, D can auto-detect the UTF formatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviors can be handled with virtual functions. Also, formats like UTF-16 have two variants, big end and little end. It should also be able to read data in other formats, such as code pages, and convert them to utf. These cannot be auto-detected.
Aug 03 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...

I'm one of those folks who is very much in favor of a file reader being able
to automatically detect the encoding in it. Hence, D can auto-detect the UTF
formatting. So, I'd recommend that the format be an enum that can be
specifically set or can be auto-detected. Different resulting behaviors can
be handled with virtual functions.

With all due respect, Walter, that's not really feasible. It is very hard, for example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not all encodings make life that easy for us. You can't use an enum, because there are an unlimited number of possible encodings.

Besides, if you're parsing an HTTP header, and if, within that header, you read "Content-Type: text/plain; encoding=MAC-ROMAN", then you can be pretty sure you know what the encoding of the following document is going to be. Other formats have different indicators (HTML meta tags; Python source file comments; the list is endless). Only at the application level can you /really/ sort this out, because the application presumably knows what it's looking at.
Also, formats like UTF-16 have two variants, big end and little end.

Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.
It should also be able to read data in other formats, such as code pages,
and convert them to utf. These cannot be auto-detected.

I think that's the whole point. Windows code pages /are/ encodings. WINDOWS-1252 is an encoding, same as UTF-8. I think people here are talking about encodings generally, not just UTFs. Jill
Aug 03 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <ceq0mg$20d8$1 digitaldaemon.com>, Arcane Jill says...
In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...

Also, formats like UTF-16 have two variants, big end and little end.

Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.

That reminds me.  Which format does the code in utf.d use?  I'm thinking I may do something like this for encoding for now:

# enum Format {
#     UTF8    = 0,
#     UTF16   = 1,
#     UTF16LE = 1,
#     UTF16BE = 2
# }

So "UTF-16" would actually default to one of the two methods.

Sean
Aug 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <ceqv9a$15b$1 digitaldaemon.com>, Sean Kelly says...
That reminds me.  Which format does the code in utf.d use?

To be honest, I don't understand the question.
I'm thinking I may
do something like this for encoding for now:

enum Format {
    UTF8    = 0,
    UTF16   = 1,
    UTF16LE = 1,
    UTF16BE = 2
}

So "UTF-16" would actually default to one of the two methods.

Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map name to number? (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)

Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host-byte order? Is there a big endian version? Is there a network byte order version? Should there be?

Jill
Aug 04 2004
next sibling parent reply Sean Kelly <sean f4.ca> writes:
In article <cer4k8$7jj$1 digitaldaemon.com>, Arcane Jill says...
In article <ceqv9a$15b$1 digitaldaemon.com>, Sean Kelly says...
That reminds me.  Which format does the code in utf.d use?

To be honest, I don't understand the question.

std.utf has methods like toUTF16. But does this target the big or little endian encoding scheme? I suppose I could assume it corresponds to the byte order of the target machine, but this would imply different behavior on different platforms.
I'm thinking I may
do something like this for encoding for now:

enum Format {
    UTF8    = 0,
    UTF16   = 1,
    UTF16LE = 1,
    UTF16BE = 2
}

So "UTF-16" would actually default to one of the two methods.

Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map name to number. (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)

This raises an interesting question. Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class. I can't imagine coding a base lib to support "Joe's custom encoding scheme." For the moment though, I think I'll leave stream.d as-is. This seems like a design issue that will take a bit of talk to get right.
Got an unrelated question for you. In the stream function void read(out int),
there is an assumption that the bytes will be embedded in the stream in
little-endian order. Should applications assume (a) it's always little endian,
regardless of host architecture, or (b) it's always host-byte order. Is there a
big endian version? Is there a network byte order version?

Not currently. This corresponds to the C++ design: unformatted IO is assumed to be in the byte order of the host platform.
Should there be?

Probably. Or at least one that converts to/from network byte order. I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc. Sean
Aug 04 2004
next sibling parent "Ben Hinkle" <bhinkle mathworks.com> writes:
 Whatever works, works. But I'd make the enum private. Encodings should be
 universally known by their IANA registered name, otherwise how can you map name
 to number. (For example, you encounter an XML file which declares its own
 encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an
 enum?)

 This raises an interesting question.  Rather than having the encoding handled
 directly by the Stream layer perhaps it should be dropped into another class.  I
 can't imagine coding a base lib to support "Joe's custom encoding scheme."  For
 the moment though, I think I'll leave stream.d as-is.  This seems like a design
 issue that will take a bit of talk to get right.
I wonder if delegates could help out here. Instead of subclasses or wrapping a stream in another stream the primary Stream class could have a delegate to sort out big/little endian or encoding issues. I'm not exactly sure how it would work but it's worth investigating. There might be issues with sharing data between the stream and the encoder/decoder delegate.
 Got an unrelated question for you. In the stream function void read(out int),
 there is an assumption that the bytes will be embedded in the stream in
 little-endian order. Should applications assume (a) it's always little endian,
 regardless of host architecture, or (b) it's always host-byte order. Is there a
 big endian version? Is there a network byte order version?

 Not currently. This corresponds to the C++ design: unformatted IO is assumed to
 be in the byte order of the host platform.

 Should there be?

 Probably. Or at least one that converts to/from network byte order. I'll
 probably have the first cut of stream.d done in a few more days and after that
 we can talk about what's wrong with it, etc.

 Sean

Aug 04 2004
prev sibling parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cer7fh$9t5$1 digitaldaemon.com>, Sean Kelly says...
std.utf has methods like toUTF16.  But does this target the big or little endian
encoding scheme?  I suppose I could assume it corresponds to the byte order of
the target machine, but this would imply different behavior on different
platforms.

Neither, really. toUTF16 returns an array of wchars, not an array of chars, so (conceptually) there is no byte-order issue involved. A wchar is (conceptually) a sixteen bit wide value, with bit 0 being the low order bit, and bit 15 being the high order bit. Byte ordering doesn't come into it.

Problems occur, however, when a wchar or a dchar leaves the nice safe environment of D and heads out into a stream. Only then does byte ordering become an issue (as it does also with arrays of ints, etc.).

If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data don't change, only the reference has a different type. In practice, this means you have (inadvertently) applied a host-byte-order encoding to the array. There doesn't seem to be much that a stream can do about this, so I reckon the problem here lies not with the stream, but with the cast. In short, a cast is not the most architecture-independent way to convert an arbitrary array into a void[]. Maybe some new functions could be written to implement this?
This raises an interesting question.  Rather than having the encoding handled
directly by the Stream layer perhaps it should be dropped into another class.  I
can't imagine coding a base lib to support "Joe's custom encoding scheme."  For
the moment though, I think I'll leave stream.d as-is.  This seems like a design
issue that will take a bit of talk to get right.

Right. Someone writing an application ought to be able to make their own transcoder (extending a library-defined base class; implementing a library-defined interface; whatever). Let's say that (in an application, not a library) I define classes JoesCustomReader and JoesCustomWriter. Now, I should still be able to do: # Stream s = new Reader(underlyingCharFilter, "X-JOES-CUSTOM-ENCODING"); and read the file. If a reader needs to be identified by a globally unique enum, then I can't do that without the possibility of an enum value clash. But if, on the other hand, they are identified by a string, then the possibility of a clash becomes vanishingly small. I do agree with you that registration of readers/writers and the dispatching mechanism is something best left until later, however. Jill
Aug 04 2004
parent reply Sean Kelly <sean f4.ca> writes:
In article <ceraen$c48$1 digitaldaemon.com>, Arcane Jill says...
Problems occur, however, when a wchar or a dchar leaves the nice safe
environment of D and heads out into a stream. Only then does byte ordering
become an issue (as it does also with arrays of ints, etc.).

Bah. Of course. So the two UTF schemes just depend on the byte order when serialized. Makes sense.
If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data
don't change, only the reference has a different type. In practice, this means
you have (inadvertantly) applied a host-byte-order encoding to the array. There
doesn't seem to be much that a stream can do about this, so, I reckon the
problem here lies not with the stream, but with the cast. In short, a cast is
not the most architecture-independent way to convert an arbitrary array into a
void[]. Maybe some new functions could be written to implement this?

I think byte order should be specified, perhaps as a quality of the stream. It could default to native and perhaps be switchable? The only other catch I see is that a console stream should probably ignore this setting and always leave everything in native format. In any case, this byte order would affect encoding schemes using > 1 byte characters and perhaps a new set of unformatted IO methods as well. Again something I'm going to ignore for now as it's more complexity than we need quite yet. Sean
Aug 04 2004
parent reply Arcane Jill <Arcane_member pathlink.com> writes:
In article <cerbsa$d02$1 digitaldaemon.com>, Sean Kelly says...

In short, a cast is
not the most architecture-independent way to convert an arbitrary array into a
void[]. Maybe some new functions could be written to implement this?


I think byte order should be specified, perhaps as a quality of the stream.  It
could default to native and perhaps be switchable?

Well, from one point of view, the problem we've got here is serialization. How do you serialize an array of primitive types having sizeof > 1? This boils down to a simpler question: how do you serialize a single primitive with sizeof > 1. Let's cut to a clear example - how do you serialize an int?

std.stream.Stream.write(int) serializes in little-endian order. But the specs say "Outside of byte, ubyte, and char, the format is implementation-specific and should only be used in conjunction with read." I think this is scary. Perhaps it would be better for a stream to /mandate/ the order. As you suggest, it could be a property of the stream, but there are disadvantages to that - if you chain a whole bunch of streams together, each with different endianness, you could end up with a lot of byteswapping going on.

Another possibility might be to ditch the function write(int), and replace it with two functions, writeBE(int) and writeLE(int) (and similarly with all other primitive types). That would be absolutely guaranteed to be platform independent.

Of course that applies to wchar and dchar too, but the whole point of encodings (well, /one/ of the points of encodings anyway) is that you never have to spit out anything other than a stream of /bytes/. The encoding itself determines the byte order. There really is no such encoding as "UTF-16" (although calling wchar[]s UTF-16 does make sense). As far as actual encodings are concerned, the name "UTF-16" is just a shorthand way of saying "either UTF-16LE or UTF-16BE". When reading, you have to auto-detect between them, but once you've /established/ the encoding, then you rewind the stream and start reading it again with the now known encoding. When writing, you get to choose, arbitrarily (so you would probably choose native byte order), but you can make it easier for subsequent readers to auto-detect by writing a BOM at the start of the stream.

How does this affect users' code?
Well, you simply don't allow anyone to write

# Reader s = new UTF16Reader(underlyingStream)

(i.e. you define no such class). Instead, give them a factory method. Make them write:

# Reader s = createUTF16Reader(underlyingStream)

or even

# Reader s = createReader(underlyingStream, "UTF-16");

(but we said we wouldn't talk about dispatching yet, so let's stick with createUTF16Reader() to keep things simple)

The function createUTF16Reader() reads the underlying stream, auto-detects between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a UTF16BEReader, and returns it. Somehow it needs a method of pushing back the characters it's already read into the stream. Then, when the caller calls s.read(), the exact encoding is known, and the stream is (re)read from the start.
The only other catch I see
is that a console stream should probably ignore this setting and always leave
everything in native format.

Maybe writeLE() and writeBE() could be supplemented by writeNative(), with the warning that it's no longer cross-platform? (Of course, the function write() does that right now, but calling it writeNative() would give you a clue that you were doing something a bit parochial).
In any case, this byte order would affect encoding
schemes using > 1 byte characters and perhaps a new set of unformatted IO
methods as well.

I don't think it would affect encodings at all, only the serialization of primitive types other than byte, ubyte and char. Transcoders, as I said, read or write /bytes/ to or from an underlying stream (but have dchar read() and/or void write(dchar) methods for callers to use).
Again something I'm going to ignore for now as it's more
complexity than we need quite yet.

Righty ho. I vaguely remember Hauke saying he was working on a class to do something about transcoding issues, but I don't know the specifics. Arcane Jill
Aug 04 2004
parent reply "Carlos Santander B." <carlos8294 msn.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> escribió en el mensaje
news:cerk3u$i4f$1 digitaldaemon.com
| (so you would probably choose native byte order), but you can make it easier
for
| subsequent readers to auto-detect by writing a BOM at the start of the stream.
|
| ...
|
| between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a
| UTF16BEReader, and returns it. Somehow it needs a method of pushing back the
| characters it's already read into the stream. Then, when the caller calls
| s.read(), the exact encoding is known, and the stream is (re)read from the
| start.
|


In the former case (the stream includes a BOM), would re-reading from the start
include the BOM? If so, what good would it be for a user who just wants to read
the file, independent of the encoding? (did I make myself clear?)

-----------------------
Carlos Santander Bernal
Aug 04 2004
parent Arcane Jill <Arcane_member pathlink.com> writes:
In article <ces5mu$r8p$1 digitaldaemon.com>, Carlos Santander B. says...

In the former case (the stream includes a BOM), would re-reading from the start
include the BOM?

Good question. I guess probably not. If the encoding is known, then it's known; since a BOM serves only to identify the encoding, you don't need to re-read it in this instance. That said, it's still best that readers be prepared to ignore it. That is, if a reader reads U+FEFF as the first character, it would be harmless to throw that character away and return instead the second one.

Pretty much all BOM related questions are answered here:
http://www.unicode.org/faq/utf_bom.html#BOM
If so, what good would it be for a user who just wants to read
the file, independent of the encoding? (did I make myself clear?)

If you fail to discard a BOM, and accidently treat it as a character, it will appear to your application as the character U+FEFF (ZERO WIDTH NON-BREAKING SPACE). It will display as a zero-width space. It has a general category of Cf (which actually makes it a formatting control, not a space!). Basically, it tries as hard as it can to do nothing at all. So it's useless to the "user who just wants to read the file" - useless, but harmless, most especially if you can recognise it and throw it away. Arcane Jill
Aug 04 2004
prev sibling parent Regan Heath <regan netwin.co.nz> writes:
On Wed, 4 Aug 2004 16:58:48 +0000 (UTC), Arcane Jill 
<Arcane_member pathlink.com> wrote:

<snip>

 Got an unrelated question for you. In the stream function void read(out 
 int),
 there is an assumption that the bytes will be embedded in the stream in
 little-endian order. Should applications assume (a) it's always little 
 endian,
 regardless of host architecture, or (b) it's always host-byte order. Is 
 there a
 big endian version? Is there a network byte order version?

 Should there be?

I think we go with (b). I think it is best handled with a filter. eg.

# Stream s = new BigEndian(new FileStream("test.dat",FileMode.READ));

so BigEndian looks like:

# class BigEndian {
#     ulong read(void* address, ulong length) {
#         version(LittleEndian) {
#             // on a little endian system we convert.
#         }
#         else {
#             // no conversion is required.
#         }
#     }
# }

You'll need a LittleEndian one too. Using the filter you can guarantee the endian-ness of the data.

Of course if you're sending binary data from a LE to BE system via sockets you need to know what you're doing, and you need to decide what endian-ness will be used for the transmission; in this case on the one end of the socket you'll need a toBigEndian/toLittleEndian filter.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
Aug 04 2004
prev sibling parent "Walter" <newshound digitalmars.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> wrote in message
news:ceq0mg$20d8$1 digitaldaemon.com...
 In article <cep6nb$1o72$1 digitaldaemon.com>, Walter says...

 I'm one of those folks who is very much in favor of a file reader being able
 to automatically detect the encoding in it. Hence, D can auto-detect the UTF
 formatting. So, I'd recommend that the format be an enum that can be
 specifically set or can be auto-detected. Different resulting behaviors can
 be handled with virtual functions.

 With all due respect, Walter, that's not really feasible. It is very hard, for
 example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention
 ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not
 all encodings make life that easy for us. You can't use an enum, because there
 are an unlimited number of possible encodings.

I understand there are limits to this. I think it should be done where possible, and that it should not be precluded by design.
 Besides, if you're parsing an HTTP header, and if, within that header, you read
 "Content-Type: text/plain; encoding=MAC-ROMAN", then you can be pretty sure you
 know what the encoding of the following document is going to be. Other formats
 have different indicators (HTML meta tags; Python source file comments; the
 list is endless). Only at the application level can you /really/ sort this out,
 because the application presumably knows what it's looking at.

Yes. And this argues for a capability to switch horses midstream, so to speak.
Aug 04 2004