www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - My experience making an on-disk merge sort using ranges

reply "Chris Cain" <clcain uncg.edu> writes:
Greetings everyone,

First, let me apologize that this is quite a long post. I hope 
that it doesn't scare you off.

I wasn't exactly sure where to put this because I have the 
intention of learning more about D and its overall design as well 
as to let others know about some various difficulties I've been 
experiencing while using it for various purposes and point out 
potential fixes. As such I was torn between the D.Learn forum and 
the vastly more observed D forum. I hope you don't mind my 
eventual decision on the matter.

Let me say that I was inspired to try out the "Component 
Programming" aspect of D using its ranges because of the 
excellently done recent talk by Walter. As such, I'm mostly 
interested in perfecting this particular approach. Obviously, I 
have no doubt that I could program this in other ways and have no 
issues whatsoever.

--

So, with that said, let's get to my point:

My goal is to write an on-disk merge sort (nearly) exclusively 
using ranges. The part of the problem I'm highlighting in this 
post is the process of merging multiple sorted files together 
into one file (in this case, stdout). I claim that this code 
should be acceptable:

---

import std.stdio, std.file, std.algorithm, std.conv;

void main() {
	dirEntries("data", "*.out", SpanMode.shallow)
		.map!(e => File(e.name).byLine(KeepTerminator.yes)
			.map!(l => l.to!string())()
		)()
		.array()
		.nWayUnion()
		.copy(stdout.lockingTextWriter);
}

---

I was certainly surprised to find out that nWayUnion existed in 
std.algorithm which essentially does the work for me. However, I 
was more surprised to find out that this does not compile. So, 
who here thinks this _shouldn't_ work? Look at it pretty closely 
and, without investigating the error, justify to yourself why 
this shouldn't work before you continue on. I will explain how I 
perceive the problem.

I thought I was already way ahead of the game by using to!string 
on each line in byLine (because I'm aware of its oddities because 
it gives you the actual buffer it writes to). Honestly, I was a 
bit shocked to discover that such a simple approach won't work 
because moveFront doesn't work with MapResult. Keep in mind, 
there are two maps here, but the MapResult that causes a problem 
is the *inner* one (the one which is applied on each line of the 
file).

---

Okay, so with that out of the way, I'd like to know what the 
idiomatic way to solve this particular problem is. Here's what I 
did (I don't claim the ranges I created are robustly implemented):

https://gist.github.com/Zshazz/c488e70eee4fd352b789

The first thing I figured out was that if you turned the 
MapResult into an array (using .array()), then it would work as 
expected. However, this is obviously not an acceptable solution 
to my problem because I'm doing an on-disk merge sort to sort 
things that wouldn't work in RAM. So, finding this out, I 
realized that the (likely) problem with MapResult is that it's a 
value type and that prevents moveFront from being applicable to 
it for some reason (I'm unknowledgable to that reason 
unfortunately).

My second solution was to write a wrapper for it that turns it 
into a reference type (just made a quick class that forwards 
calls to MapResult which it holds a copy of) to test my theory. 
Sure enough, my wrapper worked as expected. However, I compared 
it to another program and found it to be much slower, so I 
assumed it might be because of the forwarding mechanism slowing 
it down.

My third solution was to write a reference-based map. I couldn't 
make a value type of map that would work because I didn't know 
what prevented it from playing nicely with moveFront. This 
approach was much faster and actually managed to match my 
hand-written merge code. I was pretty satisfied with that 
approach, but it bothered me that I was essentially duplicating 
the functionality provided by std.algorithm.

My final solution was born out of me spending more time looking 
at the reason why my refMap was faster. I discovered that when I 
implemented refMap, I cached the result of applying the function 
on front (which differs from the map implementation in 
std.algorithm). So, I decided to rewrite my original wrapper so 
that it caches the result. This provided me the performance 
benefits of my refMap, but I didn't feel guilty about duplicating 
existing functionality. So, my caching range is my favorite 
solution thus far. Still, I would have much rather have had my 
original attempt work and a caching range be invented soley to 
improve performance, not to just get it to work at all.

---

With all of that in mind, my question is this: Why doesn't 
MapResult work with moveFront? Also, if it cannot be made to work 
with moveFront as a value type, would it be a good idea to turn 
it into a reference type? Is there any way to make it transparent 
to the user so that they don't have to do this sort of 
investigation to solve what should be a fairly simple problem 
(merging multiple ranges together, in this case)?


Thank you for taking the time to read this,
Take care.
Feb 26 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Chris Cain:

 import std.stdio, std.file, std.algorithm, std.conv;

 void main() {
 	dirEntries("data", "*.out", SpanMode.shallow)
 		.map!(e => File(e.name).byLine(KeepTerminator.yes)
 			.map!(l => l.to!string())()
 		)()
 		.array()
 		.nWayUnion()
 		.copy(stdout.lockingTextWriter);
 }

 ---

 I was certainly surprised to find out that nWayUnion existed in 
 std.algorithm which essentially does the work for me. However, 
 I was more surprised to find out that this does not compile.

With the latest compiler this compiles to me: import std.stdio, std.file, std.algorithm, std.conv, std.array; void main() { dirEntries(".", "data*.txt", SpanMode.shallow) .map!(e => File(e.name).byLine(KeepTerminator.yes).map!text) .array .nWayUnion .copy(stdout.lockingTextWriter); } Then at runtime that array() gives me this: object.Error: Access Violation ---------------- 0x0041DC9F in void* gc.gcx.GC.mallocNoSync(uint, uint, uint*) 0x00363297 ---------------- Bye, bearophile
Feb 26 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 2/26/13 10:31 PM, Chris Cain wrote:
 After reading bearophile's post, I realized I neglected to mention I was
 using 2.062.

 On Wednesday, 27 February 2013 at 03:16:39 UTC, bearophile wrote:
 With the latest compiler this compiles to me
 ...

Interesting. map!text() compiles but it crashes. It's worth noting that map!text().cached() works without crashing on my end. That said, it's not the first time I've seen Object Violation while working on this problem. Although, I was previously certain the times it was failing with that was my fault (because I was trying to implement a version of a wrapper where I mistakenly took the address of the value type MapResult without thinking that it meant I was taking the address of the stack ... whoops). I wonder why using

In all likelihood these are manifestations of simple bugs. Please submit all cases that should work but are failing (in the same bug report) and we'll get them fixed. Thanks for your work! nWayUnion is a pretty darn slick algorithm. Andrei
Feb 26 2013
prev sibling next sibling parent "Chris Cain" <clcain uncg.edu> writes:
After reading bearophile's post, I realized I neglected to 
mention I was using 2.062.

On Wednesday, 27 February 2013 at 03:16:39 UTC, bearophile wrote:
 With the latest compiler this compiles to me
 ...

Interesting. map!text() compiles but it crashes. It's worth noting that map!text().cached() works without crashing on my end. That said, it's not the first time I've seen Object Violation while working on this problem. Although, I was previously certain the times it was failing with that was my fault (because I was trying to implement a version of a wrapper where I mistakenly took the address of the value type MapResult without thinking that it meant I was taking the address of the stack ... whoops). I wonder why using
Feb 26 2013
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Tue, Feb 26, 2013 at 11:21:41PM -0500, Andrei Alexandrescu wrote:
 On 2/26/13 10:31 PM, Chris Cain wrote:
After reading bearophile's post, I realized I neglected to mention I
was using 2.062.

On Wednesday, 27 February 2013 at 03:16:39 UTC, bearophile wrote:
With the latest compiler this compiles to me
...

Interesting. map!text() compiles but it crashes. It's worth noting that map!text().cached() works without crashing on my end. That said, it's not the first time I've seen Object Violation while working on this problem. Although, I was previously certain the times it was failing with that was my fault (because I was trying to implement a version of a wrapper where I mistakenly took the address of the value type MapResult without thinking that it meant I was taking the address of the stack ... whoops).


 
 In all likelihood these are manifestations of simple bugs. Please
 submit all cases that should work but are failing (in the same bug
 report) and we'll get them fixed.
 
 Thanks for your work! nWayUnion is a pretty darn slick algorithm.

Random marginally-related remark: levenshteinDistanceAndPath is an even cooler inclusion in Phobos. The name is kinda off-putting, though, since it didn't immediately occur to me what Levenshtein distance at a glance, but this is essentially the Unix diff utility built right into Phobos. Let me put that another way... <anecdote> I was writing a utility for verifying the coordinates of some geometric data I'm working on, and my first stab at it was a little D program that could tell whether there was an error, but couldn't pinpoint where it might be. I had large arrays of coordinates, so I wondered how to show the errors. I could display the first few mismatches, but what if an entry or two were merely missing? Then there'd be lots of irrelevant output from the remaining shifted items that would have matched otherwise. Then I vaguely remember something that might be helpful in std.algorithm and looked over it again more carefully, and lo and behold, sitting right there was levenshteinDistanceAndPath. It allowed me to instantly pop it in and display a minimal diff of the incorrect data vs. what the expected data should be. Instant Phobos win!! </anecdote> <rave> If only the naysayers knew how cool Phobos will be once we clean it up and polish it into the high-quality product that it should be. It's already showing its coolness right now by being able to do diffs in a completely generic way -- not of mere text lines like the Unix diff can only do, but of *arbitrary D data*!! Need to do some custom diffing? It's already in Phobos in completely generic form, just waiting for you to instantiate it with your particular data types! </rave> T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Feb 26 2013
prev sibling next sibling parent "Chris Cain" <clcain uncg.edu> writes:
On Wednesday, 27 February 2013 at 04:21:41 UTC, Andrei 
Alexandrescu wrote:
 In all likelihood these are manifestations of simple bugs. 
 Please submit all cases that should work but are failing (in 
 the same bug report) and we'll get them fixed.

It's honestly a bit scary opening my first bug report here. I wasn't sure if it was just something that I'm overlooking about the behavior. However, I've investigated a bit further and I can see that it only applies to ByLine (a string[][], for instance, works). I put that additional knowledge into the bug report: http://d.puremagic.com/issues/show_bug.cgi?id=9598
 Thanks for your work! nWayUnion is a pretty darn slick 
 algorithm.

Thank you too. I was about to write my own range-based solution to solve this problem and just happened across nWayUnion while I was looking for useful components. I was close to coding my own solution (using the heap from std.container, no less) and found out that my solution was already written. I've run across many such gems in the standard library and it never ceases to amaze me just how flexible, effective, and (usually) fast everything is.
Feb 26 2013
prev sibling next sibling parent "Chris Cain" <clcain uncg.edu> writes:
On Wednesday, 27 February 2013 at 05:23:39 UTC, H. S. Teoh wrote:
 <rave>
 If only the naysayers knew how cool Phobos will be once we 
 clean it up
 and polish it into the high-quality product that it should be. 
 It's
 already showing its coolness right now by being able to do [...]
 </rave>

To add on to your rave & anecdote: I used D in a class last year which was centered around solving various problems using programming. We were allowed to use (essentially) any language we wanted, so I thought I'd learn D by solving the problems. There were a few extra challenges along the way (such as who has the fastest solution to a problem) that really allowed me to flex some of D's muscles. Needless to say, by the end of the class, everyone came to know that D enables essentially all of the problems solved in (close to) the most efficient way easily because of its overall design and its standard library. Usually I was either 1st or 2nd in speed of my solutions which were often only beaten by some crazy smart dude writing pure C code whose solutions were neigh incomprehensible. It became a running joke that whenever anyone asked "how did you solve this using only x" someone would interrupt me with "Because it's D." The highly efficient slicing mechanics of D's arrays are one of the huge reasons for so many wins. Honestly, if I took the class again with the knowledge I've gained with D over the last six months of me using it occasionally (and with the massive improvements made to it since the class), I'm certain that I could do things much, much better. Many of the problems we solved would be easy enough to solve in 10-15 readable lines of code using ranges and the standard library.
Feb 26 2013
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Wednesday, February 27, 2013 08:05:28 Chris Cain wrote:
 Thank you too. I was about to write my own range-based solution
 to solve this problem and just happened across nWayUnion while I
 was looking for useful components. I was close to coding my own
 solution (using the heap from std.container, no less) and found
 out that my solution was already written. I've run across many
 such gems in the standard library and it never ceases to amaze me
 just how flexible, effective, and (usually) fast everything is.

In a way, all of that power and flexibility is exactly the problem. When dealing with generic code like Phobos does, it's fairly difficult to test it thoroughly enough to make sure that it works with all of the various inputs that it can be given - especially when you start combining all kinds of stuff. Both the code and the tests are continually improving, but for a lot of it, if you poke it hard enough, you'll be able to find corner cases that don't work quite right yet. But the more that gets tried, and the more problems that get found, the more of them that will get fixed, and the less frequently you'll run into bugs when dealing with std.algorithm and its ilk. - Jonathan M Davis
Feb 26 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Wednesday, 27 February 2013 at 07:32:58 UTC, Jonathan M Davis 
wrote:
 In a way, all of that power and flexibility is exactly the 
 problem. When
 dealing with generic code like Phobos does, it's fairly 
 difficult to test it
 thoroughly enough to make sure that it works with all of the 
 various inputs
 that it can be given - especially when you start combining all 
 kinds of stuff.
 Both the code and the tests are continually improving, but for 
 a lot of it, if
 you poke it hard enough, you'll be able to find corner cases 
 that don't work
 quite right yet. But the more that gets tried, and the more 
 problems that get
 found,  the more of them that will get fixed, and the less 
 frequently you'll
 run into bugs when dealing with std.algorithm and its ilk.

I think this is mostly due to things having corners cases, or not being defined. For instance, it is unknown what passing a range by value does (and it in fact does different things on different ranges). The same goes for many many others things.
Feb 26 2013
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 nWayUnion is a pretty darn slick algorithm.

I have some problems with it: http://d.puremagic.com/issues/show_bug.cgi?id=6718 Bye, bearophile
Feb 27 2013