digitalmars.D - Why ranges don't return vectors?
- Piotr Szturmaj (27/27) Feb 28 2013 Seriously, nobody will ever get performance from single-element
- bearophile (14/16) Feb 28 2013 Time ago I have suggested in this newsgroup that idea, naming it
- Piotr Szturmaj (2/14) Mar 01 2013 Thanks for the references. They look interesting indeed.
- Chris Cain (6/8) Feb 28 2013 There's no reason you can't do that. In fact, some ranges already
- H. S. Teoh (12/21) Feb 28 2013 You can make a buffered range that reads in a large chunk of data at a
- Brad Anderson (3/16) Feb 28 2013 Isn't this essentially what is going on in stdio when a range
- Piotr Szturmaj (18/35) Mar 01 2013 Yes, I've written some buffered ranges myself, but my original post was
- monarch_dodra (19/24) Mar 01 2013 Well, why don't you write a range whose "elements" are vectors?
- Chris Cain (35/53) Mar 01 2013 My point wasn't about things being buffered. I was just
- deadalnix (3/9) Mar 04 2013 And still, some time you'd be surprised by what the compiler can
- Marco Leise (8/10) Feb 28 2013 Side note: HDD manufacturers recently switched to 4 KiB
- renoX (5/7) Mar 04 2013 As Marco Leise wrote: now this 4KiB in many case..
Seriously, nobody will ever get performance from the single-element iterator/range pattern - it makes the CPU stall! Any time you read one byte from a typical hard disk, the system reads a full sector - typically 512 bytes, no more, no less. Any time you read one byte from memory, the CPU loads an entire cache line from RAM into the cache (64 bytes on all modern Intel CPUs).

Why not exploit that with ranges? Ranges could return arrays from front() (or from frontVector/whatever), so they would basically become ranges of ranges where the deepest range is always a RandomAccessRange. This has obvious performance benefits. Everyone knows that traversing memory is faster than iterating by popFront(). On the other hand, memory lacks the flexibility of ranges - so let's make a hybrid range!

Advantages:
* performance (!)
* limited lookahead/backtracking

How?
1. instruction-level parallelism: CPUs can execute a few instructions in parallel, given that they operate on different data (different vector elements)
2. SIMD: load an entire vector into an SSE/AVX register, then run the operation on all elements at once
3. use of the L1 cache: traversing a vector in memory is way faster than calling popFront() for each vector element - even if popFront() is inlined

Disadvantages:
* potential inconvenience

I know it can't be an easy drop-in addition to the current range implementation, but who knows...
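A rough sketch of such a hybrid range, with invented names (VectorRange and its chunkSize parameter are illustrative assumptions, not an existing Phobos type): front() exposes a slice of up to chunkSize elements, so a consumer can apply whole-slice (vectorizable) operations instead of touching one element per call.

//----
import std.algorithm : min;
import std.stdio : writeln;

struct VectorRange(T, size_t chunkSize = 16)
{
    T[] data;

    @property bool empty() const { return data.length == 0; }

    // Expose up to chunkSize elements at once instead of one element.
    @property T[] front()
    {
        return data[0 .. min(chunkSize, data.length)];
    }

    void popFront()
    {
        data = data[min(chunkSize, data.length) .. $];
    }
}

void main()
{
    ubyte[] bytes = [1, 2, 3, 4, 5];
    foreach (chunk; VectorRange!(ubyte, 4)(bytes))
    {
        chunk[] += 1;   // operates on the whole slice at once
        writeln(chunk);
    }
}
//----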
Feb 28 2013
Piotr Szturmaj:I know it can't be easy drop-in addition to the current range implementation, but who knows...Some time ago I suggested that idea in this newsgroup, naming it "vectorized laziness"; it was mostly ignored: http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS It's also related to the idea of Segmented Iterators, which I discussed later in this same newsgroup (I think I received no answers): http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf I think that so far there is only a small amount of vectorized laziness implemented in Phobos; search for "semi-lazy parallel map" here: http://dlang.org/phobos/std_parallelism.html Bye, bearophile
Feb 28 2013
bearophile wrote:Piotr Szturmaj:I know it can't be easy drop-in addition to the current range implementation, but who knows...Time ago I have suggested in this newsgroup that idea, naming it "vectorized lazyness", that was mostly ignored: http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS It's also related to the idea of Segmented Iterators that I have discussed later in this same newsgroup (I think I received no answers): http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf I think that so far there is only a little amount of vectorized lazyness implemented in Phobos, search for "semi-lazy parallel map" here: http://dlang.org/phobos/std_parallelism.html Thanks for the references. They look interesting indeed.
Mar 01 2013
On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:Seriously, nobody will ever get performance from single-element iterator/range pattern - this makes CPU stall!There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.
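For instance, std.stdio.File.byChunk hands the consumer a whole buffer per iteration rather than a single byte (the file name and chunk size below are arbitrary):

//----
import std.stdio;

void main()
{
    // Each iteration yields an ubyte[] of up to 4096 bytes.
    foreach (ubyte[] chunk; File("data.bin").byChunk(4096))
    {
        // process the whole chunk at once
    }
}
//----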
Feb 28 2013
On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:You can make a buffered range that reads in a large chunk of data at a time, and .front and .popFront merely move an internal pointer until it reaches the end of the buffer, then the next buffer page is loaded in. In this case, .front is very simple (just a pointer lookup) and will probably be inlined by an optimizing compiler. Just because the abstraction is reading per-element, doesn't mean the implementation has to literally do that! T -- "Maybe" is a strange word. When mom or dad says it it means "yes", but when my big brothers say it it means "no"! -- PJ jr.Seriously, nobody will ever get performance from single-element iterator/range pattern - this makes CPU stall!There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.
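A minimal sketch of the buffered range described above (the type name BufferedByteRange and the 4096-byte buffer size are assumptions for illustration):

//----
import std.stdio;

struct BufferedByteRange
{
    private File file;
    private ubyte[4096] buffer;
    private ubyte[] window;   // unread part of the buffer
    private bool done;

    this(File f) { file = f; fill(); }

    // Load the next buffer page from the file.
    private void fill()
    {
        window = file.rawRead(buffer[]);
        done = window.length == 0;
    }

    @property bool empty() const { return done; }

    // .front is just an array lookup into the current buffer page.
    @property ubyte front() { return window[0]; }

    void popFront()
    {
        window = window[1 .. $];
        if (window.length == 0)
            fill();   // next buffer page is loaded in
    }
}

void main()
{
    // e.g. iterate stdin byte by byte without a read() call per byte
    foreach (b; BufferedByteRange(stdin)) { }
}
//----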
Feb 28 2013
On Friday, 1 March 2013 at 03:44:30 UTC, H. S. Teoh wrote:You can make a buffered range that reads in a large chunk of data at a time, and .front and .popFront merely move an internal pointer until it reaches the end of the buffer, then the next buffer page is loaded in. In this case, .front is very simple (just a pointer lookup) and will probably be inlined by an optimizing compiler. Just because the abstraction is reading per-element, doesn't mean the implementation has to literally do that! TIsn't this essentially what is going on in stdio when a range uses fread?
Feb 28 2013
H. S. Teoh wrote:On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:Yes, I've written some buffered ranges myself, but my original post was not about buffering.On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:Seriously, nobody will ever get performance from single-element iterator/range pattern - this makes CPU stall!There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.You can make a buffered range that reads in a large chunk of data at a time, and .front and .popFront merely move an internal pointer until it reaches the end of the buffer, then the next buffer page is loaded in. In this case, .front is very simple (just a pointer lookup) and will probably be inlined by an optimizing compiler.Inlining doesn't always happen, though.Just because the abstraction is reading per-element, doesn't mean the implementation has to literally do that!The range abstraction limits some useful optimizations, mainly vectorization. Even if a range internally uses a buffer and does some SIMD operations on it, it exposes only one element from this buffer at a time, which can't be used for the next SIMD operation. In other words, it limits pipelining. I'm arguing that (vector) ranges of primitive types should always expose at least a 16-byte vector, even for a one-element range - then the abstraction guarantees that the vector can be used for SSE ops. They could return slices pointing to fixed-size buffers (at least 16-byte ones); the slice can be shortened on the last iteration to indicate the actual number of elements in the vector. The SIMD op would still be done on the full vector (the underlying buffer) - the CPU doesn't care whether an element in the vector is garbage or not.
So, in short: ranges pass these vectors between themselves, and each subsequent range mutates the same vector - possibly using SIMD.
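A minimal sketch of that idea, with invented names (SseChunks is illustrative, not a real type): front() returns a slice into a fixed 16-byte backing buffer, so a consumer could always load the full 16 bytes into an SSE register, with the slice length giving the number of valid elements.

//----
import std.algorithm : min;
import std.stdio : writeln;

struct SseChunks
{
    ubyte[] data;
    ubyte[16] buf;   // fixed 16-byte backing store

    @property bool empty() const { return data.length == 0; }

    // Always a slice of the fixed buffer; bytes past the slice length
    // are harmless garbage as far as a 16-byte SIMD load is concerned.
    @property ubyte[] front()
    {
        auto n = min(16, data.length);
        buf[0 .. n] = data[0 .. n];
        return buf[0 .. n];
    }

    void popFront()
    {
        data = data[min(16, data.length) .. $];
    }
}

void main()
{
    auto input = new ubyte[](20);
    foreach (vec; SseChunks(input))
        writeln(vec.length);   // 16 on the first iteration, then 4
}
//----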
Mar 01 2013
On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:Range abstraction is limiting some useful optimizations, mainly vectorization. Even if range internally uses a buffer and does some SIMD operations on it, it exposes only one element from this buffer, which can't be used for next SIMD operation. In other words it limits pipelining.Well, why don't you write a range whose "elements" are vectors? Or one that operates on vectors. The range abstraction only defines how you iterate and view a collection. YOU are the one that defines what the elements in that collection are. For example, "chunks" will do exactly that: given a range, it transforms it into a range of sub-slices:

//----
import std.array, std.range, std.stdio;

void main()
{
    int[] arr = iota(0, 16).array(); // [0, 1, 2, ..., 15]
    auto chunks = std.range.chunks(arr, 4);
    foreach (int[] chunk; chunks)
    {
        chunk[] *= 2;
        writeln(chunk);
    }
}
//----

What else are you asking for? Nothing is stopping you from piping the chunk elements through another range, btw.
Mar 01 2013
On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:Yes, I've written some buffered ranges myself, but my original post was not about buffering.My point wasn't about things being buffered. I was just suggesting that many ranges do return arrays, when it's appropriate. If you want to conceptualize a "single item" as a grouping of items from a range (and such a grouping may be a vector of 16 bytes if you so choose), then you are free to do so. The concept is flexible enough to allow this. However, not all ranges should return arrays. As an answer to the topic's name, the reason why ranges don't (always) return arrays is that the concept of a range is to _abstract_ operations on _any_ type of container (and even things that are not containers, such as RNGs), as long as it has certain capabilities. Returning arrays isn't helpful when you're trying to process arrays, for instance, because if you didn't want an abstract look at an array, you would have worked on the original array without the abstraction in the first place.Range abstraction is limiting some useful optimizations, mainly vectorization. Even if range internally uses a buffer and does some SIMD operations on it, it exposes only one element from this buffer, which can't be used for next SIMD operation. In other words it limits pipelining. I'm arguing that (vector) ranges of primitive types should always expose at least 16 byte vector, even if it's one element range - but then abstraction guarantees that it can be used for SSE ops. They could return slices pointing to fixed size buffers (at least 16 byte ones), then slices may be adjusted for the last iteration to specify actual number of elements in a vector. Anyway SIMD op will be done on the full vector (underlying buffer) - CPU don't care if element in a vector is garbage or not. So, in short: ranges pass these vectors between themselfs and each subsequent range mutates the same vector - possibly using SIMD.Okay, sure, performance. 
But if you want to code for performance, there's nothing stopping you now. The idea behind ranges is to unify lots of concepts. If you need lower-level work and require every bit of performance, but still want to use ranges, then you can - but it should be a specialized range type, one that meets your precise specification.

I hope I'm explaining it clearly: ranges are more of a universal specification, and it sounds like you have a very particular set of requirements which is a subset of what ranges are supposed to cover. Creating a specialized range type that meets your exact specifications would give you better performance than a generalized range concept (even an optimized one), and an optimized generalized range concept would not be appropriate for all uses.

Perhaps you could code up a vectorRange that does things as you want and submit it in an enhancement request. And you can show off how awesome it is so that everyone will want to use it when it's appropriate. But certainly not all ranges should be vectorRanges, because most of the time we're just trying to work on single units at a time.
Mar 01 2013
On Friday, 1 March 2013 at 22:18:10 UTC, Chris Cain wrote:Okay, sure, performance. But if you want to code for performance, there's nothing stopping you now. The idea about ranges is to unify lots of concepts. If you need lower level work and require every bit of performance, but still want to use ranges, then you can, but it should be a specialized range type that you do it on which meets your precise specification.And still, sometimes you'd be surprised by what the compiler can do for you. Cf. the discussion about SentinelRange.
Mar 04 2013
Am Fri, 01 Mar 2013 02:02:16 +0100 schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:Anytime you read one byte from typical hard disk, system reads full sector - typically 512 bytes, no more, no less.Side note: HDD manufacturers recently switched to 4 KiB sectors (8x size). Everyone who bypasses operating system buffers to do raw disk access now needs to query the system for optimal buffer alignment. -- Marco
Feb 28 2013
On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:Anytime you read one byte from typical hard disk, system reads full sector - typically 512 bytes, no more, no less.As Marco Leise wrote: it's now 4 KiB in many cases. That's one big issue with performance tuning: if you hardcoded your code for 512 B sectors, you don't get optimal performance on a 4 KiB sector system.
Mar 04 2013