
digitalmars.D - tolf and detab

reply Walter Bright <newshound2 digitalmars.com> writes:
I wrote these two trivial utilities to canonicalize source code before 
checkins, to cope with FreeBSD's inability to handle CRLF line endings, and 
because I can never figure out the right settings to make git do the 
canonicalization.
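[For reference, the git settings in question are roughly these; a sketch, adjust per repository:]

```shell
# Per clone: convert CRLF to LF on commit, leave line endings alone on checkout
git config core.autocrlf input

# Or, committed in the repository root as .gitattributes:
#   * text=auto eol=lf
```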

tolf - converts LF, CR, and CRLF line endings to LF.

detab - converts all tabs to the equivalent number of spaces, assuming 
8-column tab stops. Also removes trailing whitespace from lines.

Posted here just in case someone wonders what they are.
---------------------------------------------------------
/* Replace tabs with spaces, and remove trailing whitespace from lines.
  */

import std.file;
import std.path;

int main(string[] args)
{
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
}


char[] filter(char[] input)
{
     char[] output;
     size_t j;

     int column;
     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\t':
                 while ((column & 7) != 7)
                 {   output ~= ' ';
                     j++;
                     column++;
                 }
                 c = ' ';
                 column++;
                 break;

             case '\r':
             case '\n':
                 while (j && output[j - 1] == ' ')
                     j--;
                 output = output[0 .. j];
                 column = 0;
                 break;

             default:
                 column++;
                 break;
         }
         output ~= c;
         j++;
     }
     while (j && output[j - 1] == ' ')
         j--;
     return output[0 .. j];
}
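[For comparison, here is a sketch of the same tab-expansion logic in Python; the name `detab` is my own, and like the D version above it assumes fixed tab stops and strips trailing whitespace:]

```python
def detab(text, tabsize=8):
    """Expand tabs to spaces and strip trailing whitespace, line by line."""
    out = []
    for line in text.split("\n"):
        col = 0
        buf = []
        for ch in line:
            if ch == "\t":
                n = tabsize - col % tabsize  # spaces needed to reach the next tab stop
                buf.append(" " * n)
                col += n
            else:
                buf.append(ch)
                col += 1
        out.append("".join(buf).rstrip())
    return "\n".join(out)
```

[Python's built-in str.expandtabs does the tab expansion, but not the trailing-whitespace stripping.]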
-----------------------------------------------------
/* Replace line endings with LF
  */

import std.file;
import std.path;

int main(string[] args)
{
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
}


char[] filter(char[] input)
{
     char[] output;
     size_t j;

     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\r':
                 c = '\n';
                 break;

             case '\n':
                 if (i && input[i - 1] == '\r')
                     continue;
                 break;

             case 0:
                 continue;

             default:
                 break;
         }
         output ~= c;
         j++;
     }
     return output[0 .. j];
}
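[The line-ending normalization above can be sketched in Python in a couple of lines; `tolf` here is a hypothetical helper, not part of the original tools, and unlike the D version it does not drop NUL bytes. Note that CRLF must be replaced before lone CR:]

```python
def tolf(text):
    # Replace CRLF first; otherwise the lone-CR pass would split "\r\n" into "\n\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")
```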
------------------------------------------
Aug 06 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/06/2010 08:34 PM, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences. Andrei
Aug 06 2010
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Or improve your google-fu by finding some existing tools that do the job
right. :)

I'm pretty sure Uncrustify is good at most of these issues, not to mention
it's a very nice source-code "prettifier/indenter". There's a front-end
called UniversalIndentGUI, which has about a dozen integrated versions of
source-code prettifiers (including uncrustify, and for many languages). It
has various settings on the left, and a togglable *Live* preview mode which you
can view on the right.

I invite you guys to try it out sometime:

http://universalindent.sourceforge.net/

(+ you can save different settings which is neat when you're coding for
different projects that have different "code design & look" standards)

On Sat, Aug 7, 2010 at 3:50 AM, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 08/06/2010 08:34 PM, Walter Bright wrote:

 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences. Andrei
Aug 06 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Andrej Mitrovic wrote:
 Or improve your google-fu by finding some existing tools that do the job 
 right. :)
Sure, but I suspect it's faster to write the utility! After all, they are trivial.
Aug 06 2010
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 A good exercise would be rewriting these tools in idiomatic D2 and 
 assess the differences.
Some D2-fu would be cool. Any takers?
Aug 06 2010
prev sibling next sibling parent reply "Yao G." <nospamyao gmail.com> writes:
What does idiomatic D mean?

On Fri, 06 Aug 2010 20:50:52 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 08/06/2010 08:34 PM, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences. Andrei
-- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 06 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/06/2010 09:33 PM, Yao G. wrote:
 What does idiomatic D mean?
At a quick glance - I'm thinking two elements would be using string and possibly byLine. Andrei
Aug 06 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Yao G." <nospamyao gmail.com> wrote in message 
news:op.vg1qpcjfxeuu2f miroslava.gateway.2wire.net...
 What does idiomatic D mean?
"idiomatic D" -> "In typical D style"
Aug 06 2010
prev sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:
 On 08/06/2010 08:34 PM, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.
 
 tolf - converts LF, CR, and CRLF line endings to LF.
 
 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.
 
 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences. Andrei
I didn't try and worry about multiline string literals, but here are my more 
idiomatic solutions:

detab:
-------------------------------------------
/* Replace tabs with spaces, and remove trailing whitespace from lines.
 */

import std.conv;
import std.file;
import std.stdio;
import std.string;

void main(string[] args)
{
    const int tabSize = to!int(args[1]);

    foreach(f; args[2 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    foreach(line; file.byLine())
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const int tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const int numSpaces = tabSize - tab % tabSize;

            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
        }

        output ~= line.idup;
    }

    std.file.write(fileName, output.join("\n"));
}
-------------------------------------------
The three differences between mine and Walter's are that mine takes the tab 
size as the first argument, it doesn't put a newline at the end of the file, 
and it writes the file even if nothing changed (you could test for that, but 
when using byLine(), it's a bit harder). Interestingly enough, from the few 
tests that I ran, mine seems to be somewhat faster. I also happen to think 
that the code is clearer (it's certainly shorter), though that might be up 
for debate.

tolf:
-------------------------------------------
/* Replace line endings with LF
 */

import std.file;
import std.string;

void main(string[] args)
{
    foreach(f; args[1 .. $])
        fixEndLines(f);
}

void fixEndLines(string fileName)
{
    auto fileStr = std.file.readText(fileName);
    auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");
    std.file.write(fileName, result);
}
-------------------------------------------
This version is ludicrously simple. And it was also faster than Walter's in 
the few tests that I ran. In either case, I think that it is definitely 
clearer code.

I would have thought that being more idiomatic would have resulted in slower 
code than what Walter did, but interestingly enough, both programs are faster 
with my code. They might take more memory though; I'm not quite sure how to 
check that. In any case, you wanted some idiomatic D2 solutions, so there you 
go.

- Jonathan M Davis
Aug 07 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Jonathan M Davis:
 I would have thought that being more idiomatic would have resulted in
 slower code than what Walter did, but interestingly enough, both programs
 are faster with my code. They might take more memory though. I'm not quite
 sure how to check that. In any case, you wanted some idiomatic D2
 solutions, so there you go.
Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, such as Python :-) In this case a Python version is more readable, shorter, and probably faster too, because reading the lines of a _normal_ text file is faster in Python than in D (Python is more optimized for such purposes; I can show benchmarks on request). On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I would use Python to write them. Bye, bearophile
Aug 07 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/07/2010 11:16 PM, bearophile wrote:
 Jonathan M Davis:
 I would have thought that being more idomatic would have resulted
 in slower code than what Walter did, but interestingly enough, both
 programs are faster with my code. They might take more memory
 though. I'm not quite sure how to check that. In any cases, you
 wanted some idiomatic D2 solutions, so there you go.
Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-) In this case a Python version is more readable, shorter and probably faster too because reading the lines of a _normal_ text file is faster in Python compared to D (because Python is more optimized for such purposes. I can show benchmarks on request). On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I want to use Python to write them.
I think it's worth targeting D2 at tasks that are usually handled by scripting languages. I've done a lot of that, and it beats the hell out of rewriting in D a script that's grown out of control. Andrei
Aug 07 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/07/2010 11:16 PM, bearophile wrote:
 In this case a Python version is more readable, shorter and probably
 faster too because reading the lines of a _normal_ text file is
 faster in Python compared to D (because Python is more optimized for
 such purposes. I can show benchmarks on request).
That would be great so we can tune our approach. Thanks! Andrei
Aug 07 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 This makes me think we should have a range that detects and replaces 
 patterns lazily and on the fly.
In Python there is a helper module: http://docs.python.org/library/fileinput.html
 I think it's worth targeting D2 to tasks that are usually handled by
 scripting languages. I've done a lot of that and it beats the hell out
 of rewriting in D a script that's grown out of control
Dynamic languages are handy but they require some rigour when you program. Python is probably unfit for writing million-line programs, but if you train yourself a little and keep your code clean, you usually become able to write clean largish programs in Python.
 That would be great so we can tune our approach. Thanks!
In my dlibs I have the xio module that reads lines efficiently; it was faster 
than iterating on the lines of a BufferedFile. There are tons of different 
benchmarks that you may use, but a simple one to start with is better, one 
that just iterates over the file lines. See below.

Related: experiments have shown that the (oldish) Java GC improves its 
performance if it is able to keep strings (which are immutable) in a separate 
memory pool, _and_ is able to recognize duplicated strings, of course keeping 
only one string for each equality set. It would be interesting to do a similar 
experiment with the D GC, but first you need applications that use the GC to 
test whether this idea is an improvement :-)

So I have used a minimal benchmark:
--------------------------
# Python code
from sys import argv

def process(file_name):
    total = 0
    for line in open(file_name):
        total += len(line)
    return total

print "Total:", process(argv[1])
--------------------------
// D2 code
import std.stdio: File, writeln;

int process(string fileName) {
    int total = 0;
    auto file = File(fileName);

    foreach (rawLine; file.byLine()) {
        string line = rawLine.idup;
        total += line.length;
    }

    file.close();
    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}
--------------------------
In the D code I have added an idup to make the comparison more fair, because 
in the Python code the "line" is a true newly allocated line; you can safely 
use it as a dictionary key.

I have used Python 2.7 with no Psyco JIT (http://psyco.sourceforge.net/ ) to 
speed up the Python code, because it's not available yet for Python 2.7. The 
D code was compiled with dmd 2.047, optimized build.

As test data I have used a concatenation of all the text files here (they are 
copyrighted, but freely usable):
http://gnosis.cx/TPiP/
The result on Windows is a file of 1_116_552 bytes.

I have appended the file to itself, doubling its length, several times; the 
result is a file of 71_459_328 bytes (this is not a fully realistic case 
because you often have many small files to read instead of one very large 
one).

The timings are taken with a warm disk cache, so the data is essentially read 
from RAM. This is not fully realistic, but if you want to write a benchmark 
you have to do this, because for me it's very hard on Windows to make sure 
the disk cache is fully empty. So it's better to do the opposite and 
benchmark a warm file.

The output of the Python code is:
Total: 69789888
Found in 0.88 seconds (best of 6, the variance is minimal).

The output of the D code is:
Total: 69789888
Found in 1.28 seconds (best of 6, minimal variance).

If in the D2 code I comment out the idup like this:

foreach (rawLine; file.byLine()) {
    total += rawLine.length;
}

then the output of the D code without idup is:
Total: 69789888
Found in 0.75 seconds (best of 6, minimal variance).

As you see it's a matter of GC efficiency too. Besides the GC, the cause of 
the higher performance of the Python code is a tuned design; you can see the 
function getline_via_fgets here:
http://svn.python.org/view/python/trunk/Objects/fileobject.c?revision=81275&view=markup
It uses a "stack buffer" (char buf[MAXBUFSIZE]; where MAXBUFSIZE is 300) too.

Bye, bearophile
Aug 08 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, because
 in the Python code the "line" is a true newly allocated line, you can safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations. Also, I object in general to this method of making things "more fair". Using a less efficient approach in X because Y cannot use such an approach is not a legitimate comparison.
Aug 08 2010
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:i3mpnb$2hcf$1 digitalmars.com...
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, 
 because
 in the Python code the "line" is a true newly allocated line, you can 
 safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
I thought byLine just re-uses the same buffer each time?
Aug 08 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, because
 in the Python code the "line" is a true newly allocated line, you can safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
I think you are wrong twice:

1) byLine() doesn't return a newly allocated line; you can see it with this 
small program:

import std.stdio: File, writeln;

void main(string[] args) {
    char[][] lines;
    auto file = File(args[1]);

    foreach (rawLine; file.byLine()) {
        writeln(rawLine.ptr);
        lines ~= rawLine;
    }

    file.close();
}

Its output shows that all "strings" (char[]) share the same pointer:

14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
...

2) You can't use the lines yielded by byLine() as string keys for an 
associative array, as I have said you can in Python. Currently you can, but 
according to Andrei this is a bug. And if it's not a bug then I'll reopen 
this closed bug 4474:
http://d.puremagic.com/issues/show_bug.cgi?id=4474
 Also, I object in general to this method of making things "more fair". Using a
 less efficient approach in X because Y cannot use such an approach is not a
 legitimate comparison.
I generally agree, but this is not the case. In some situations you indeed don't need a newly allocated string for each loop iteration, because for example you just want to read and process the lines, not change or store them. You can't do this in Python, but this is not what I want to test. As I have explained in bug 4474, this behaviour is useful, but it is acceptable only if explicitly requested by the programmer, and not as the default. The language is safe, as Andrei explains there, because you are supposed to idup the char[] to use it as a key for an associative array (if your associative array is declared as int[char[]] then it can accept such lines as keys, but you can clearly see those aren't strings; this is why I have closed bug 4474). Bye, bearophile
Aug 08 2010
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair,
 because in the Python code the "line" is a true newly allocated line, you
 can safely use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
 I think you are wrong twice:

 1) byLine() doesn't return a newly allocated line; you can see it with this
 small program:

 import std.stdio: File, writeln;

 void main(string[] args) {
     char[][] lines;
     auto file = File(args[1]);
     foreach (rawLine; file.byLine()) {
         writeln(rawLine.ptr);
         lines ~= rawLine;
     }
     file.close();
 }

 Its output shows that all "strings" (char[]) share the same pointer:

 14E5E00
 14E5E00
 14E5E00
 14E5E00
 14E5E00
 14E5E00
 14E5E00
 ...
Eh, you're right. The Phobos documentation for byLine needs to be fixed.
 You can't do this in Python, but this is not what I want to test.
If you want to conclude that Python is better at processing files, you need to show each language doing it in a way well suited to that language, rather than burdening one so it uses the same method as the less powerful one.
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 If you want to conclude that Python is better at processing files, you need
 to show each language doing it in a way well suited to that language,
 rather than burdening one so it uses the same method as the less powerful
 one.
byLine() yields a char[], so if you want to do most kinds of string processing, or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D.

You can of course use the raw char[], but then you lose the advantages advertised when you introduced the safer immutable D2 strings. And in many situations you have to dup the char[] anyway, otherwise you have all kinds of bugs that Python lacks. In D1, to avoid them, I used to use dup more often than necessary. I have explained this in bug 4474.

In this newsgroup my purpose is to show D's faults, suggest improvements, etc. In this case my purpose was just to show that byLine()+idup is slow, and you should be thankful for my benchmarks. In my dlibs1 for D1 I have a xio module that reads files by line and is faster than iterating on a BufferedFile, so it's not a limit of the language; it's Phobos that has a performance bug that can be fixed.

Bye, bearophile
Aug 08 2010
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Andrei used to!string() in an early example in TDPL for some line-by-line
processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.

On Sun, Aug 8, 2010 at 11:44 PM, bearophile <bearophileHUGS lycos.com>wrote:

 Walter Bright:
 If you want to conclude that Python is better at processing files, you
need to
 show it using each language doing it a way well suited to that language,
rather
 than burdening one so it uses the same method as the less powerful one.
 byLine() yields a char[], so if you want to do most kinds of string
 processing or you want to store the line (or parts of it), you have to idup
 it. So in this case Python is not significantly less powerful than D.

 You can of course use the raw char[], but then you lose the advantages
 advertised when you introduced the safer immutable D2 strings. And in many
 situations you have to dup the char[] anyway, otherwise you have all kinds
 of bugs that Python lacks. In D1 to avoid it I used to use dup more often
 than necessary. I have explained this in the bug 4474.

 In this newsgroup my purpose is to show D faults, suggest improvements,
 etc. In this case my purpose was just to show that byLine()+idup is slow.
 In my dlibs1 for D1 I have a xio module that reads files by line that is
 faster than iterating on a BufferedFile, so it's not a limit of the
 language; it's Phobos that has a performance bug that can be improved.

 Bye,
 bearophile
Aug 08 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 04:48 PM, Andrej Mitrovic wrote:
 Andrei used to!string() in an early example in TDPL for some
 line-by-line processing. I'm not sure of the advantages/disadvantages of
 to!type vs .dup.
For example, to!string(someString) does not duplicate the string. Andrei
Aug 08 2010
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
 so it's not a limit of the language, it's Phobos that has a performance bug
that can be improved.
I don't know where the performance bug is, maybe it's a matter of GC, not a Phobos performance bug. Bye, bearophile
Aug 08 2010
prev sibling next sibling parent reply "Yao G." <nospamyao gmail.com> writes:
On Sun, 08 Aug 2010 16:44:09 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:

 Walter Bright:
 If you want to conclude that Python is better at processing files, you  
 need to
 show it using each language doing it a way well suited to that  
 language, rather
 than burdening one so it uses the same method as the less powerful one.
byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. [snip] And you have to [be] thankful for my benchmarks. [snip] Bye, bearophile
<g> What's next? Will you demand attribution like the time Andrei presented the ranges design?
Aug 08 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Yao G.:
 <g> What's next? Will you demand attribution like the time Andrei  
 presented the ranges design?
Of course. In the end all D will be mine <evil laugh with echo effects> :-) Bye, bearophile
Aug 08 2010
parent "Yao G." <nospamyao gmail.com> writes:
On Sun, 08 Aug 2010 17:27:04 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:

 Yao G.:
 <g> What's next? Will you demand attribution like the time Andrei
 presented the ranges design?
Of course. In the end all D will be mine <evil laugh with echo effects> :-) Bye, bearophile
:D That was a good comeback. -- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 08 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 05:17 PM, Yao G. wrote:
 On Sun, 08 Aug 2010 16:44:09 -0500, bearophile
 <bearophileHUGS lycos.com> wrote:

 Walter Bright:
 If you want to conclude that Python is better at processing files,
 you need to
 show it using each language doing it a way well suited to that
 language, rather
 than burdening one so it uses the same method as the less powerful one.
byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. [snip] And you have to [be] thankful for my benchmarks. [snip] Bye, bearophile
<g> What's next? Will you demand attribution like the time Andrei presented the ranges design?
Well, I understand his frustration. I asked him for a comparison and he took the time to write one and play with it. I think the proper answer to that is to see what we can do to improve the situation, not to defend the status quo. Whatever weaknesses the benchmark has should be fixed, and then whatever weaknesses the library has should be addressed. Andrei
Aug 08 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 04:44 PM, bearophile wrote:
 Walter Bright:
 If you want to conclude that Python is better at processing files, you need to
 show it using each language doing it a way well suited to that language, rather
 than burdening one so it uses the same method as the less powerful one.
byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. You can of course use the raw char[], but then you lose the advantages advertised when you have introduced the safer immutable D2 strings. And in many situations you have to dup the char[] anyway, otherwise your have all kinds of bugs, that Python lacks. In D1 to avoid it I used to use dup more often than necessary. I have explained this in the bug 4474. In this newsgroup my purpose it to show D faults, suggest improvements, etc. In this case my purpose was just to show that byLine()+idup is slow. And you have to thankful for my benchmarks. In my dlibs1 for D1 I have a xio module that reads files by line that is faster than iterating on a BufferedFile, so it's not a limit of the language, it's Phobos that has a performance bug that can be improved.
Thanks for your analysis. Where does xio derive its performance advantage from? Andrei
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei:

Where does xio derive its performance advantage from?<
I'd like to give you a good answer, but I can't. dlibs1 (which you can still find online) has a Python licence, so to create xio.xfile() I just translated to D1 the C code of the CPython file-object implementation that I have already linked here.

I think it minimizes heap allocations, and its performance is tuned for a line length found to be the "average" one for normal files. So I presume that if your text file has very short lines (like 5 chars each) or very long ones (like 1000 chars each) it becomes less efficient.

So it's probably a matter of good usage of the C I/O functions, and probably more efficient management by the GC.

Phobos is Boost-licensed, but I don't think the Python devs can get mad if you take a look at how Python reads lines lazily :-) Someone has tried to implement a Python-style associative array in a similar way.

Bye, bearophile
Aug 08 2010
parent reply Kagamin <spam here.lot> writes:
bearophile Wrote:

 I think it minimizes heap allocations, the performance is tuned for a line
length found to be the "average one" for normal files. So I presume if your
text file has very short lines (like 5 chars each) or very long ones (like 1000
chars each) it becomes less efficient.
 
 So it's probably a matter of good usage of the C I/O functions and probably a
more efficient management by the GC.
 
Don't you minimize heap allocations etc. by reading the whole file in one I/O call?
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Kagamin:

 Don't you minimize heap allocations etc. by reading the whole file in one I/O call?
The whole thread was about lazy reading of file lines. If the file is very large it's not wise to load it all in RAM at once. Bye, bearophile
Aug 09 2010
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS lycos.com> said:

 Kagamin:
 
 Don't you minimize heap allocation etc by reading whole file in one io call?
The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once.
For non-huge files that fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice, keeping the slices around (yeah!). The virtual memory system will take care of loading the file's contents as you read from its memory space, so the file isn't loaded all at once. But that's not compatible with the C file I/O functions. Does Python use C file I/O calls when reading from a file? If not, perhaps that's why it's faster. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
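[As a rough illustration of the memory-mapping idea in Python; `demo.txt` is a throwaway file created just so the example is self-contained, and this is a sketch, not what Phobos or CPython actually do:]

```python
import mmap

# Create a small throwaway file so the example is self-contained.
with open("demo.txt", "wb") as f:
    f.write(b"first line\nsecond line\n")

with open("demo.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the map reads only the pages that are touched;
    # the OS faults them in lazily.
    first = mm[:mm.find(b"\n")]
    mm.close()
```

[Note that in Python a slice of an mmap is copied into a separate bytes object, whereas in D you could keep actual slices into the mapped memory, as Michel describes.]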
Aug 09 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Monday, August 09, 2010 05:30:33 Michel Fortin wrote:
 On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS lycos.com> said:
 Kagamin:
 Don't you minimize heap allocation etc by reading whole file in one io
 call?
The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once.
For non-huge files that can fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice and keep the slices around (yeah!). The virtual memory system will take care of loading the file content's as you read from its memory space, so the file isn't loaded all at once. But that's not compatible with the C file IO functions. Does Python uses C file IO calls when reading from a file? If not, perhaps that's why it's faster.
Well, you can just read the whole file in as a string with readText(), and any slices to that could stick around, but presumably, that's using the C file I/O calls underneath. - Jonathan M Davis
Aug 09 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 02:32 PM, bearophile wrote:
 Walter Bright:
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, because
 in the Python code the "line" is a true newly allocated line, you can safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
I think you are wrong two times:

1) byLine() doesn't return a newly allocated line, you can see it with this small program:

import std.stdio: File, writeln;

void main(string[] args) {
    char[][] lines;
    auto file = File(args[1]);
    foreach (rawLine; file.byLine()) {
        writeln(rawLine.ptr);
        lines ~= rawLine;
    }
    file.close();
}

Its output shows that all "strings" (char[]) share the same pointer:

14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
...

2) You can't use the result of rawLine() as a string key for an associative array, as I have said you can in Python. Currently you can, but according to Andrei this is a bug. And if it's not a bug then I'll reopen this closed bug 4474:
http://d.puremagic.com/issues/show_bug.cgi?id=4474
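The buffer-reuse point can be mimicked in Python. This is not Phobos code, just a sketch of a generator that, like byLine(), yields the same buffer object on every iteration, to show why storing the line without copying is a bug (and why the idup-style copy is the fix):

```python
import io

def by_line(f):
    # Mimic D's byLine(): reuse one buffer for every line.
    buf = bytearray()
    for line in f:
        buf[:] = line.encode()   # overwrite the single shared buffer
        yield buf                # the *same* object every iteration

f = io.StringIO("one\ntwo\nthree\n")
kept = [raw for raw in by_line(f)]       # stored without copying...
print([bytes(b) for b in kept])          # every entry shows the *last* line

f.seek(0)
copies = [bytes(raw) for raw in by_line(f)]   # the idup-like fix: copy first
print(copies)                                 # [b'one\n', b'two\n', b'three\n']
```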
 Also, I object in general to this method of making things "more fair". Using a
 less efficient approach in X because Y cannot use such an approach is not a
 legitimate comparison.
I generally agree, but this is not the case. In some situations you indeed don't need a newly allocated string for each loop, because for example you just want to read and process the lines and not change/store them. You can't do this in Python, but this is not what I want to test.

As I have explained in bug 4474 this behaviour is useful, but it is acceptable only if explicitly requested by the programmer, and not as the default one. The language is safe, as Andrei explains there, because you are supposed to idup the char[] to use it as a key for an associative array (if your associative array is declared as int[char[]] then it can accept such rawLine() results as keys, but you can clearly see those aren't strings. This is why I have closed bug 4474).

Bye,
bearophile
I think at the end of the day, regardless of the relative possibilities of file reading in the two languages, we should be faster than Python when allocating one new string per line.

Andrei
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 I think at the end of the day, regardless the relative possibilities of 
 file reading in the two languages, we should be faster than Python when 
 allocating one new string per line.
For now I suggest you to aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy. (Later someday I'd also like D AAs to become about as fast as Python dicts.) Bye, bearophile
Aug 08 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 10:29 PM, bearophile wrote:
 Andrei Alexandrescu:
 I think at the end of the day, regardless the relative
 possibilities of file reading in the two languages, we should be
 faster than Python when allocating one new string per line.
For now I suggest you to aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy.
Why? Andrei
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 For now I suggest you to aim to be just about as fast as Python in
 this task :-) Beating Python significantly on this task is probably
 not easy.
Why?
Because it's core functionality for Python, so its devs have probably optimized it well; it's written in C, and in this case there is very little interpreter overhead.

Bye,
bearophile
Aug 09 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 
 For now I suggest you to aim to be just about as fast as Python
 in this task :-) Beating Python significantly on this task is
 probably not easy.
Why?
Because it's a core functionality for Python so devs probably have optimized it well, it's written in C, and in this case there is very little interpreter overhead.
Then we can do whatever they've done. It's not like they're using APIs nobody has heard of.

It seems such a comparison of file I/O speed becomes in fact a comparison of garbage collectors. That's fine, but in that case the notion that D offers the possibility to avoid allocation should come back to the table.

Andrei
Aug 09 2010
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Jonathan M Davis:
 I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any cases, you wanted some idiomatic D2 solutions, so there you go.

Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-)

In this case a Python version is more readable, shorter and probably faster too, because reading the lines of a _normal_ text file is faster in Python compared to D (because Python is more optimized for such purposes. I can show benchmarks on request).

On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I want to use Python to write them.

Bye,
bearophile
I disagree completely. D is clearly designed from the "simple things should be simple and complicated things should be possible" point of view. If it doesn't work well for these kinds of short scripts then we've failed at making simple things simple and we're just like every other crappy "large scale, industrial strength" language like Java and C++ that's great for megaprojects but makes simple things complicated.

That said, I think D does a great job in this regard. I actually use Python as my language of second choice for things D isn't good at. Mostly this means needing Python's huge standard library, needing 64-bit support, or needing to share my code with people who don't know D. Needing to write a very short script tends not to be a reason for me to switch over. It's not that rare for me to start with a short script and then end up adding something that needs performance to it (like Monte Carlo simulation of a null probability distribution) and I don't find D substantially harder to use for these cases.
Aug 08 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 08/08/2010 14:31, dsimcha wrote:
 I disagree completely.  D is clearly designed from the "simple things should be
 simple and complicated things should be possible" point of view.  If it doesn't
 work well for these kinds of short scripts then we've failed at making simple
 things simple and we're just like every other crappy "large scale, industrial
 strength" language like Java and C++ that's great for megaprojects but makes
 simple things complicated.
dsimcha wrote: "I hate Java and every programming language where a readable hello world takes more than 3 SLOC"

That may be your preference, but other people here in the community, me at least, very much want D to be a "large scale, industrial strength" language that's great for megaprojects. I think that medium and large scale projects are simply much more important and interesting than small scale ones.

I am hoping this would become an *explicit* point of D design goals, if it isn't already. And I will campaign against (so to speak) people like you who think small scale is more important. No personal animosity intended though.

Note: I am not stating that it is not possible to be good, even great, at both things (small and medium/large scale).

-- 
Bruno Medeiros - Software Engineer
Sep 30 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Bruno Medeiros:

 I think that medium and large
 scale projects are simply much more important and interesting than small
 scale ones.
 I am hoping this would become an *explicit* point of D design goals, if
 it isn't already.
 And I will campaign against (so to speak), people like you who think
 small scale is more important. No personal animosity intended though.
 Note: I am not stating that is is not possible to be good, even great,
 at both things (small and medium/large scale).
This is an interesting topic of practical language design; it's a wide problem and I can't have complete answers. D2 design is mostly done, only small parts may be changed now, so those campaigns probably can't change D2 design much.

The name of the Scala language means that it is meant to be a scalable language; this means it is designed to be useful and usable for both large and quite small programs.

A language like Ada is not a bad language. Programming practice shows that in many situations debug time is the larger percentage of the development of a program, so minimizing debug time is usually a very good thing. Ada tries hard to avoid many common bugs, much more than D (it has ranged integers, integer overflow checks, it defines a portable floating point semantics (though there is a way to use the faster IEEE semantics), it forces you to use clear interfaces between modules (much more explicit ones than D's), it never silently changes variable types, its semantics is fully specified, there are very precise Ada semantics specs, all Ada compilers must pass a very large test suite, and so on and on). In practice the language is able to catch many bugs before they happen. So if you want to write a program in critical situations, like important control systems, Ada is a language better than Perl, and probably better than D too :-)

Yet, writing programs in Ada is not handy: if you need to write small programs you need a lot of boilerplate code that is useful only in larger programs. And Ada is a Pascal-like language that many modern programmers don't know/like. Ada looks designed for larger, low-bug-count, costly (and often well planned out from the beginning, with no specs that change with time) programs, but it's not handy for writing small programs. Probably Ada is not the best language to write web code that has to change all the time. Today Ada is not a dead language, but it smells funny; it's not commonly used.
Andrei has expressed the desire to use D2 as a language to write script-like programs too. I think in most cases a language like Python is better than D2 for writing small script-like programs, yet I agree with Andrei that it's good to try to make the D2 language fit for writing small script-like programs too, because to write such programs you need a very handy language, one that catches/avoids many simple common bugs quickly, gives you excellent error messages/stack traces, and allows you to do common operations on web/text files/images/sounds/etc. in few lines of code. My theory is that later those qualities turn out to be useful even in large programs. I think such qualities may help D avoid the Ada fate.

The ability to write small programs with D is also useful to attract programmers to D, because if in your language you need to write a 30-line program to print a "hello world" on the screen then newcomers are likely to stop using that language after their first try.

Designing a language that is both good for small and large programs is not easy, but it is a worthy goal. The D module system must be debugged & finished & improved to improve the usage of D for larger programs. Some features of unittesting and design by contract currently missing are very useful if you want to use D to write large programs. If you want to write large programs, reliability becomes an important concern, so integer overflow tests and some system to avoid null-related bugs (not-nullable types and more) become useful or very useful.

Bye,
bearophile
Sep 30 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 30/09/2010 19:31, bearophile wrote:
 Bruno Medeiros:

 I think that medium and large
 scale projects are simply much more important and interesting than small
 scale ones.
 I am hoping this would become an *explicit* point of D design goals, if
 it isn't already.
 And I will campaign against (so to speak), people like you who think
 small scale is more important. No personal animosity intended though.
 Note: I am not stating that is is not possible to be good, even great,
 at both things (small and medium/large scale).
This is an interesting topic of practical language design, it's a wide problem and I can't have complete answers. D2 design is mostly done, only small parts may be changed now, so those campaigns probably can't change D2 design much.
I'm not so sure about that. Probably backwards-incompatible changes will be very few, if any. But there can be backwards-compatible changes, or changes to stuff that was not mentioned in TDPL. And there may be a D3 eventually (a long way down the road though).

But my main worry is not language changes; I actually think it's very unlikely Walter and Andrei would make a language change that intentionally would adversely affect medium/large scale programs in favor of small scale programs. My main issue is with the time and thinking resources that are expended here in the NG when people argue for changes (or against other changes) with the intention of favoring small-scale programs. If this were explicit in the D design goals, it would help save us from these discussions (which affect NG readers, not just posters).
 The name of the Scala language means that it is meant to be a scalable
language, this means it is designed to be useful and usable for both large and
quite small programs.
Whoa, wait. From my understanding, Scala is a "scalable language" in the sense that it is easy to add new language features, or something similar to that. But let's be clear: that's not what I'm talking about, and neither is scalability of program data/inputs/performance. I'm talking about scalability of source code, software components, developers, teams, requirements, planning changes, project management issues, etc.
 A language like Ada is not a bad language. Programming practice shows that in
many situations the debug time is the larger percentage of the development of a
program. So minimizing debug time is usually a very good thing. Ada tries hard
to avoid many common bugs, much more than D (it has ranged integers, integer
overflows, it defines a portable floating point semantics (despite there is a
way to use the faster IEEE semantics), it forces to use clear interfaces
between modules (much more explicit ones than D ones), it never silently
changes variable types, its semantics is fully specified, there are very
precise Ada semantics specs, all Ada compilers must pass a very large test
suite, and so on and on). In practice the language is able to catch many bugs
before they happen.

 So if you want to write a program in critical situations, like important
control systems, Ada is a language better than Perl, and probably better than D
too :-)

 Yet, writing programs in Ada is not handy, if you need to write small programs
you need lot of boilerplate code that is useful only in larger programs. And
Ada is a Pascal-like language that many modern programmers don't know/like. Ada
looks designed for larger, low-bug-count, costly (and often well planned out
from the beginning, with no specs that change with time) programs, but it's not
handy to write small programs. Probably Ada is not the best language to write
web code that has to change all the time. Today Ada is not a dead language, but
it smells funny, it's not commonly used.
Certainly it's not just web code that can change all the time. But I'm missing your point here, what does Ada have to do with this?
 Andrei has expressed the desire to use D2 as a language to write script-like
programs too. I think in most cases a language like Python is better than D2 to
write small script-like programs, yet I agree with Andrei that it's good to try
to make D2 language fit to write small script-like programs too, because to
write such programs you need a very handy language, that catches/avoids many
simple common bugs quickly, gives you excellent error messages/stack traces,
and allows you to do common operations on web/text files/images/sounds/etc in
few lines of code. My theory is that later those qualities turn out to be
useful even in large programs. I think such qualities may help D avoid the Ada
fate.

 The ability to write small programs with D is also useful to attract
programmers to D, because if in your language you need to write 30 lines long
programs to write a "hello world" on the screen then newcomers are likely to
stop using that language after their first try.

 Designing a language that is both good for small and large programs is not
easy, but it is a worth goal. D module system must be debugged&  finished& 
improved to improve the usage of D for larger programs. Some features of the
unittesting and design by contract currently missing are very useful if you
want to use D to write large programs. If you want to write large programs
reliability becomes an important concern, so integer overflow tests and some
system to avoid null-related bugs (not-nullable types and more) become useful
or very useful.

 Bye,
 bearophile
Yeah, I actually think D (or any other language under design) can be quite good at both things. Maybe something like 90% of features that are good for large-scale programs are also good for small-scale ones.

One of the earliest useful programs I wrote in D was a two-page bash shell script that I converted to D. Even though it was just about two pages, it was already hard to extend and debug. After converting it to D, with the right shortcut methods and abstractions, the code actually managed to be quite succinct and comparable, I suspect, to code in Python or Perl, or languages like that. (I say suspect because I don't actually know much about Python or Perl, but I simply didn't see many language changes that could have made my D more succinct, barring crazy stuff like dynamic scoping.)

-- 
Bruno Medeiros - Software Engineer
Oct 01 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Bruno Medeiros:

 From my understanding, Scala is a "scalable language" in the sense
 that it easy to add new language features, or something similar to that.
I see. You may be right.
 But I'm missing your point here, what does Ada have to do with this?
Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.
 but I simply didn't see much language 
 changes that could have made my D more succint,
Making a language more succinct is easy; you may take a look at the J or K languages. The hard thing is to design a succinct language that is also readable and not bug-prone.

Python has some features that make the code longer: the obligatory "self." before instance attribute names and the optional usage of argument names at the calling point make the code longer. The ternary operator too in Python is longer, as is the "and" operator, etc. Such things improve readability.

Several Python features help shorten the code, like sequence unpacking syntax and multiple return values:
>>> def foo():
...     return 1, 2
...
>>> a, b = foo()
>>> a
1
>>> b
2

List comprehensions help shorten the code, but I think they also reduce bug count a bit and allow you to think about your code at a bit higher level:

>>> xs = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
>>> ps = [x * x for x in xs if x % 2]
>>> ps
[9, 25, 49, 81, 121, 169]

Python has some other features that help shorten the code, like the significant leading whitespace that avoids some bugs, avoids brace style wars, and removes both some noise and closing-brace code lines.
 barring crazy stuff like dynamic scoping)
I don't know what dynamic scoping is; do you mean that crazy nice thing named dynamic typing? :-)

Bye,
bearophile
Oct 01 2010
next sibling parent Pelle <pelle.mansson gmail.com> writes:
On 10/01/2010 01:54 PM, bearophile wrote:
 Bruno Medeiros:

  From my understanding, Scala is a "scalable language" in the sense
 that it easy to add new language features, or something similar to that.
I see. You may be right.
 But I'm missing your point here, what does Ada have to do with this?
Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.
 but I simply didn't see much language
 changes that could have made my D more succint,
Making a language more succint is easy, you may take a look at J or K languages. The hard thing is to design a succint language that is also readable and not bug-prone. Python has some features that make the code longer, like the obligatory "self." before class instance names and the optional usage of argument names at the calling point make the code longer. The ternary operator too in Python is longer, as the "and" operator, etc. Such things improve readability, etc. Several Python features help shorten the code, like sequence unpacking syntax and multiple return values:
 >>> def foo():
 ...     return 1, 2
 ...
 >>> a, b = foo()
 >>> a
 1
 >>> b
 2

 List comprehensions help shorten the code, but I think they also reduce bug count a bit and allow you to think about your code at a bit higher level:

 >>> xs = [2,3,4,5,6,7,8,9,10,11,12,13]
 >>> ps = [x * x for x in xs if x % 2]
 >>> ps
 [9, 25, 49, 81, 121, 169]

 Python has some other features that help shorten the code, like the significant leading white space that avoids some bugs, avoids brace style wars, and removes both some noise and closing brace code lines.
 barring crazy stuff like dynamic scoping)
I don't know what dynamic scoping is, do you mean that crazy nice thing named dynamic typing? :-) Bye, bearophile
No, dynamic scoping is the crazy thing. Perl code:

$x = 1;

sub p { print "$x\n" }

sub a {
    local $x = 2;
    p;
}

p; a; p

results in:

pp ~/perl% perl wat.pl
1
2
1

Crazy. :-)
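For contrast, here is the analogous program in Python, which is lexically scoped: p() always sees the module-level x, so a()'s local x = 2 never leaks into it, and the output is 1 three times (collected in a list here so the behaviour is easy to check):

```python
results = []
x = 1

def p():
    results.append(x)   # x is resolved in the defining (module) scope

def a():
    x = 2               # a new local binding, invisible to p()
    p()

p(); a(); p()
print(results)   # [1, 1, 1]
```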
Oct 01 2010
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 01/10/2010 12:54, bearophile wrote:
 Bruno Medeiros:

  From my understanding, Scala is a "scalable language" in the sense
 that it easy to add new language features, or something similar to that.
I see. You may be right.
 But I'm missing your point here, what does Ada have to do with this?
Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.
There are a lot of things in a language that, if they make it harder to write small programs, will also make it harder for larger programs (sometimes even much harder).

I'm no expert in Ada, and there are many things that will affect the success of the language, so I can't comment in detail. But from a cursory look at the language, it looks terribly verbose. That "begin" / "end <name of block>" syntax is awful. I already think just "begin" / "end" syntax is bad, but also having to repeat the name of the block/function/procedure/loop at the "end", that's awful. Is it trying to compete with XML? :p
 but I simply didn't see much language
 changes that could have made my D more succint,
Making a language more succint is easy, you may take a look at J or K languages. The hard thing is to design a succint language that is also readable and not bug-prone.
Indeed, I agree. And that was the spirit of that original comment: first of all, I meant succinct not only in character and line count but also in syntactical and semantic constructs. And succinct without changes that would greatly impact the readability or safety of the code (as mentioned in "barring crazy stuff like dynamic scoping").
 barring crazy stuff like dynamic scoping)
I don't know what dynamic scoping is, do you mean that crazy nice thing named dynamic typing? :-)
Like Pelle explained, it's indeed exactly "dynamic scoping" that I meant.

-- 
Bruno Medeiros - Software Engineer
Oct 05 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:i3lb30$26vf$1 digitalmars.com...
 Jonathan M Davis:
 I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any cases, you wanted some idiomatic D2 solutions, so there you go.
Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-)
I can respect that. Personally, though, I find a lot of value in not needing to switch languages for that sort of thing. Too much "context switch" for my brain ;)
Aug 08 2010
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmail.com> writes:
Jonathan M Davis wrote:

 void removeTabs(int tabSize, string fileName)
 {
     auto file = File(fileName);
     string[] output;
 
     foreach(line; file.byLine())
     {
         int lastTab = 0;
 
         while(lastTab != -1)
         {
             const int tab = line.indexOf('\t');
 
             if(tab == -1)
                 break;
 
             const int numSpaces = tabSize - tab % tabSize;
 
             line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
 
             lastTab = tab + numSpaces;
         }
 
         output ~= line.idup;
     }
 
     std.file.write(fileName, output.join("\n"));
 }
Actually, looking at the code again, that while loop really should be while(1) rather than while(lastTab != -1), but it will work the same regardless. - Jonathan M Davis
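The tab-stop arithmetic both versions rely on (tabSize - column % tabSize: a tab advances the column to the next multiple of the tab size) can be sketched in Python; this is an illustrative rewrite, not the D code above:

```python
def detab(line, tab_size=8):
    out, col = [], 0
    for ch in line:
        if ch == '\t':
            n = tab_size - col % tab_size   # spaces to reach the next tab stop
            out.append(' ' * n)
            col += n
        else:
            out.append(ch)
            col += 1
    return ''.join(out)

print(repr(detab("a\tb")))      # 'a' + 7 spaces + 'b'
print(repr(detab("\tx", 4)))    # 4 spaces + 'x'
```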
Aug 07 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/07/2010 11:04 PM, Jonathan M Davis wrote:
 On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:
 A good exercise would be rewriting these tools in idiomatic D2 and
 assess the differences.


 Andrei
I didn't try and worry about multiline string literals, but here are my more idiomatic solutions:

detab:

/* Replace tabs with spaces, and remove trailing whitespace from lines. */

import std.conv;
import std.file;
import std.stdio;
import std.string;

void main(string[] args)
{
    const int tabSize = to!int(args[1]);

    foreach(f; args[2 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    foreach(line; file.byLine())
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const int tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const int numSpaces = tabSize - tab % tabSize;

            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];

            lastTab = tab + numSpaces;
        }

        output ~= line.idup;
    }

    std.file.write(fileName, output.join("\n"));
}
Very nice. Here's how I'd improve removeTabs:

import std.conv;
import std.file;
import std.getopt;
import std.stdio;
import std.string;

void main(string[] args)
{
    uint tabSize = 8;
    getopt(args, "tabsize|t", &tabSize);
    foreach(f; args[1 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string output;
    bool changed;

    foreach(line; file.byLine(File.KeepTerminator.yes))
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const tab = line.indexOf('\t');
            if(tab == -1)
                break;
            const numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
            changed = true;
        }

        output ~= line;
    }

    file.close();
    if (changed)
        std.file.write(fileName, output);
}
 -------------------------------------------

 The three differences between mine and Walter's are that mine takes the tab size as the first argument, it doesn't put a newline at the end of the file, and it writes the file even if it hasn't changed (you could test for that, but when using byLine(), it's a bit harder). Interestingly enough, from the few tests that I ran, mine seems to be somewhat faster. I also happen to think that the code is clearer (it's certainly shorter), though that might be up for debate.

 -------------------------------------------



 tolf:

 /* Replace line endings with LF
    */

 import std.file;
 import std.string;

 void main(string[] args)
 {
      foreach(f; args[1 .. $])
          fixEndLines(f);
 }

 void fixEndLines(string fileName)
 {
      auto fileStr = std.file.readText(fileName);
      auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");

      std.file.write(fileName, result);
 }

 -------------------------------------------

 This version is ludicrously simple. And it was also faster than Walter's in the few tests that I ran. In either case, I think that it is definitely clearer code.
Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix it's always polite to not touch files unless you are actually modifying them.

This makes me think we should have a range that detects and replaces patterns lazily and on the fly. I've always thought that loading entire files in memory and working on them is "cheating" in some sense, and a range would help with replacing patterns in streams.
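A generator gives a rough sketch of such a lazy pattern-replacing range: each line is transformed as it streams past, so the whole file is never in memory. (Illustrative only; a pattern that spans line or chunk boundaries would need extra buffering, which is the hard part and is not handled here.)

```python
def replacing(lines, old, new):
    # Lazily replace a pattern in each element of a stream of lines.
    for line in lines:
        yield line.replace(old, new)

out = list(replacing(iter(["a\tb\n", "c\td\n"]), "\t", "    "))
print(out)   # ['a    b\n', 'c    d\n']
```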
 I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any cases, you wanted some idiomatic D2 solutions, so there you go.
Looking very good, thanks. I think we should feature these and a few others as examples on the website.

Andrei
Aug 07 2010
next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Saturday 07 August 2010 21:59:50 Andrei Alexandrescu wrote:
 Very nice. Here's how I'd improve removeTabs:
 

 import std.conv;
 import std.file;
 import std.getopt;
 import std.stdio;
 import std.string;
 
 void main(string[] args)
 {
      uint tabSize = 8;
      getopt(args, "tabsize|t", &tabSize);
      foreach(f; args[1 .. $])
          removeTabs(tabSize, f);
 }
 
 void removeTabs(int tabSize, string fileName)
 {
      auto file = File(fileName);
      string output;
      bool changed;
 
      foreach(line; file.byLine(File.KeepTerminator.yes))
      {
          int lastTab = 0;
 
          while(lastTab != -1)
          {
              const tab = line.indexOf('\t');
              if(tab == -1)
                  break;
              const numSpaces = tabSize - tab % tabSize;
             line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
             lastTab = tab + numSpaces;
              changed = true;
          }
 
          output ~= line;
      }
 
      file.close();
      if (changed)
          std.file.write(fileName, output);
 }
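(Editorial aside, not part of the original post: the tab-stop arithmetic above, the expression `tabSize - tab % tabSize`, can be sanity-checked in isolation.)

```d
// Check the tab-stop arithmetic: a tab at column `tab` expands to
// `tabSize - tab % tabSize` spaces, landing exactly on the next tab stop.
void main()
{
    enum tabSize = 8;
    foreach (tab; [0, 1, 7, 8, 9, 15])
    {
        const numSpaces = tabSize - tab % tabSize;
        assert(numSpaces >= 1 && numSpaces <= tabSize);  // never zero spaces
        assert((tab + numSpaces) % tabSize == 0);        // next tab stop
    }
}
```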
Ah. I needed to close the file. I pretty much always just use readText(), so I didn't catch that. Also, it does look like detecting whether the file changed was a bit simpler than I thought that it would be. Quite simple really. Thanks.
 Very nice! You may as well guard the write with an if (result !=
 fileStr). With control source etc. in the mix it's always polite to not
 touch files unless you are actually modifying them.
Yes. That would be good. It's the kind of thing that I forget - probably because most of the code that I write generates new files rather than updating pre-existing ones.
 
 This makes me think we should have a range that detects and replaces
 patterns lazily and on the fly. I've always thought that loading entire
 files in memory and working on them is "cheating" in some sense, and a
 range would help with replacing patterns in streams.
It would certainly be nice to have a way to reasonably process files with ranges without having to load the whole thing into memory at once. Most of the time, I wouldn't care too much, but if you start processing large files, having the whole thing in memory could be a problem (especially if you have multiple versions of it which were created along the way as you were manipulating it).

Haskell does lazy loading of files by default and doesn't load the data until you read the appropriate part of the string. It shouldn't be all that hard to do something similar with D and ranges. The hard part would be trying to do all of it in a way that makes it so that all of the processing of the file's data doesn't have to load it all into memory (let alone load it multiple times). I'm not sure that you could do that without explicitly processing a file line by line, writing it to disk after each line is processed, since you could be doing an arbitrary set of operations on the data. It could be interesting to try and find a solution for that though.
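(A sketch of that line-by-line strategy; the helper name, the transform callback, and the temp-file convention are all made up for illustration.)

```d
import std.file : rename;
import std.stdio : File, KeepTerminator;

// Hypothetical streaming filter: read one line at a time, transform it,
// and write it out immediately, so only a single line is ever in memory.
void streamFilter(string fileName, const(char)[] delegate(char[]) transform)
{
    auto tmpName = fileName ~ ".tmp";
    {
        auto inFile = File(fileName);
        auto outFile = File(tmpName, "w");
        foreach (line; inFile.byLine(KeepTerminator.yes))
            outFile.write(transform(line));
    }   // both files are closed here by File's reference counting
    rename(tmpName, fileName);  // replace the original in one step
}
```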
 
 Looking very good, thanks. I think we should have a feature these and a
 few others as examples on the website.
Well, I, for one, much prefer the ability to program in a manner that's closer to telling the computer to do what I want rather than having to tell it how to do what I want (the replace end-of-line character program being a prime example). It makes life much simpler. Ranges certainly help a lot in that regard too. And having good example code of how to program that way could help encourage people to program that way and use std.range and std.algorithm and their ilk rather than trying more low-level solutions which aren't as easy to understand.

- Jonathan M Davis
Aug 07 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Jonathan M Davis wrote:
 It would certainly be nice to have a way to reasonably process with ranges 
 without having to load the whole thing into memory at once.
With asynchronous I/O, being able to start processing and start writing the new file before the old one has finished reading should speed things up.
Aug 07 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:i3ldk4$2ci0$1 digitalmars.com...
 Very nice! You may as well guard the write with an if (result != fileStr). 
 With control source etc. in the mix it's always polite to not touch files 
 unless you are actually modifying them.
I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
Aug 08 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 12:28 PM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:i3ldk4$2ci0$1 digitalmars.com...
 Very nice! You may as well guard the write with an if (result != fileStr).
 With control source etc. in the mix it's always polite to not touch files
 unless you are actually modifying them.
I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
It doesn't, but it still shows them as changed etc. Andrei
Aug 08 2010
parent Leandro Lucarella <luca llucax.com.ar> writes:
Andrei Alexandrescu, el  8 de agosto a las 14:44 me escribiste:
 On 08/08/2010 12:28 PM, Nick Sabalausky wrote:
"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
news:i3ldk4$2ci0$1 digitalmars.com...
Very nice! You may as well guard the write with an if (result != fileStr).
With control source etc. in the mix it's always polite to not touch files
unless you are actually modifying them.
I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
It doesn't, but it still shows them as changed etc.
Nope, not really:

/tmp$ svnadmin create x
/tmp$ svn co file:///tmp/x xwc
Revisión obtenida: 0
/tmp$ cd xwc/
/tmp/xwc$ echo hello > hello
/tmp/xwc$ svn add hello
A         hello
/tmp/xwc$ svn commit -m 'test'
Añadiendo      hello
Transmitiendo contenido de archivos .
Commit de la revisión 1.
/tmp/xwc$ touch hello
/tmp/xwc$ svn status
/tmp/xwc$ echo changed > hello
/tmp/xwc$ svn status
M       hello
/tmp/xwc$

(sorry about the Spanish messages, I saw them after copying the test and I'm too lazy to repeat them changing the LANG environment variable :)

You might want to set the mtime to the same as the original file for build purposes though (you know you're changing the file in a way that doesn't really change its semantics, so you might want to avoid unnecessary recompilation).

-- 
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
... los cuales son susceptibles a una creciente variedad de ataques
previsibles, tales como desbordamiento del tampón, falsificación de
parámetros, ...
	-- Stealth - ISS LLC - Seguridad de IT
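(Editorial aside: a sketch of that mtime trick in D. getTimes and setTimes are real std.file functions; the wrapper name is made up.)

```d
import std.datetime : SysTime;
import std.file : getTimes, setTimes, write;

// Rewrite a file but keep its original modification time, so that
// timestamp-based build tools don't see a spurious change.
void writePreservingMtime(string fileName, const(void)[] newContents)
{
    SysTime accessTime, modificationTime;
    getTimes(fileName, accessTime, modificationTime);  // remember old times
    write(fileName, newContents);
    setTimes(fileName, accessTime, modificationTime);  // put them back
}
```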
Aug 08 2010
prev sibling next sibling parent reply Norbert Nemec <Norbert Nemec-online.de> writes:
I usually do the same thing with a shell pipe
	expand | sed 's/ *$//;s/\r$//;s/\r/\n/'


On 07/08/10 02:34, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
 ---------------------------------------------------------
 /* Replace tabs with spaces, and remove trailing whitespace from lines.
 */

 import std.file;
 import std.path;

 int main(string[] args)
 {
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
 }


 char[] filter(char[] input)
 {
     char[] output;
     size_t j;

     int column;
     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\t':
                 while ((column & 7) != 7)
                 {   output ~= ' ';
                     j++;
                     column++;
                 }
                 c = ' ';
                 column++;
                 break;

             case '\r':
             case '\n':
                 while (j && output[j - 1] == ' ')
                     j--;
                 output = output[0 .. j];
                 column = 0;
                 break;

             default:
                 column++;
                 break;
         }
         output ~= c;
         j++;
     }
     while (j && output[j - 1] == ' ')
         j--;
     return output[0 .. j];
 }
 -----------------------------------------------------
 /* Replace line endings with LF
 */

 import std.file;
 import std.path;

 int main(string[] args)
 {
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
 }


 char[] filter(char[] input)
 {
     char[] output;
     size_t j;

     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\r':
                 c = '\n';
                 break;

             case '\n':
                 if (i && input[i - 1] == '\r')
                     continue;
                 break;

             case 0:
                 continue;

             default:
                 break;
         }
         output ~= c;
         j++;
     }
     return output[0 .. j];
 }
 ------------------------------------------
Aug 08 2010
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Norbert Nemec wrote:
 I usually do the same thing with a shell pipe
     expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
<g>
Aug 08 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Norbert Nemec" <Norbert Nemec-online.de> wrote in message 
news:i3lq17$99u$1 digitalmars.com...
I usually do the same thing with a shell pipe
 expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
Filed under "Why I don't like regex for non-trivial things" ;)
Aug 08 2010
parent reply Leandro Lucarella <luca llucax.com.ar> writes:
Nick Sabalausky, el  8 de agosto a las 13:31 me escribiste:
 "Norbert Nemec" <Norbert Nemec-online.de> wrote in message 
 news:i3lq17$99u$1 digitalmars.com...
I usually do the same thing with a shell pipe
 expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
Filed under "Why I don't like regex for non-trivial things" ;)
Those regex are non-trivial? Maybe you're confusing sed statements with regex; in that sed program, there are 3 trivial regex:

regex   replace with
 *$     (nothing)
\r$     (nothing)
\r      \n

They are the most trivial regex you'd ever find! =)
Aug 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Leandro Lucarella" <luca llucax.com.ar> wrote in message 
news:20100808212859.GL3360 llucax.com.ar...
 Nick Sabalausky, el  8 de agosto a las 13:31 me escribiste:
 "Norbert Nemec" <Norbert Nemec-online.de> wrote in message
 news:i3lq17$99u$1 digitalmars.com...
I usually do the same thing with a shell pipe
 expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
Filed under "Why I don't like regex for non-trivial things" ;)
Those regex are non-trivial?
IMHO, a task has to be REALLY trivial to be trivial in regex ;)
 Maybe you're confusing sed statements with regex, in that sed program,
 there are 3 trivial regex:
Ahh, I see. I'm not familiar with sed, so my eyes got to the part after "sed" and began bleeding, so I figured it had to be one of three things:

- Encrypted data
- Hardware crash
- Regex

;)

Insert other joke about "read-only languages" or "languages that look the same before and after RSA encryption" here.

(I'm not genuinely complaining about regexes. They can be very useful. They just tend to get real ugly real fast.)
Aug 08 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 (I'm not genuinely complaining about regexes. They can be very useful. They 
 just tend to get real ugly real fast.)
Regexes are like flying airplanes. You have to do them often or you get "rusty" real fast. (Flying is not a natural behavior, it's not like riding a bike.)
Aug 08 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 Andrei used to!string() in an early example in TDPL for some line-by-line
 processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.
I have modified the code:

import std.stdio: File, writeln;
import std.conv: to;

int process(string fileName) {
    int total = 0;
    auto file = File(fileName);
    foreach (rawLine; file.byLine()) {
        string line = to!string(rawLine);
        total += line.length;
    }
    file.close();
    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}

The run time is 1.29 seconds, showing this is equivalent to the idup.

Bye,
bearophile
Aug 08 2010
parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
What are you using to time the app? I'm using timeit (from the Windows
Server 2003 Resource Kit). I'm getting similar results to yours. Btw, how do
you use a warm disk cache? Is there a setting somewhere for that?


On Sun, Aug 8, 2010 at 11:54 PM, bearophile <bearophileHUGS lycos.com>wrote:

 Andrej Mitrovic:

 Andrei used to!string() in an early example in TDPL for some line-by-line
 processing. I'm not sure of the advantages/disadvantages of to!type vs
 .dup.

 I have modified the code:

 import std.stdio: File, writeln;
 import std.conv: to;

 int process(string fileName) {
     int total = 0;
     auto file = File(fileName);
     foreach (rawLine; file.byLine()) {
         string line = to!string(rawLine);
         total += line.length;
     }
     file.close();
     return total;
 }

 void main(string[] args) {
     if (args.length == 2)
         writeln("Total: ", process(args[1]));
 }

 The run time is 1.29 seconds, showing this is equivalent to the idup.

 Bye,
 bearophile
Aug 08 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 What are you using to time the app?
A buggy utility that is the Windows port of the GNU time command.
 Btw, how do you use a warm disk cache? Is there a setting somewhere for that?
If you run the benchmark two times and you have enough free RAM (and your system isn't performing disk I/O for other purposes), then the second time Windows keeps essentially the whole file in a cache in RAM. On Linux too there is a disk cache. The HD too has a cache, so the situation is not simple, and those caches are probably not fully under the control of Windows.

Bye,
bearophile
Aug 08 2010
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Andrej Mitrovic wrote:
 Btw, 
 how do you use a warm disk cache? Is there a setting somewhere for that?
Just run it several times until the times stop going down.
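(On a Unix-like system the warm-cache effect can also be produced explicitly; a sketch, where ./process and bigfile.txt are placeholder names:)

```shell
# Read the file once to pull it into the OS page cache, then time the
# real runs; they now measure mostly CPU time rather than disk reads.
cat bigfile.txt > /dev/null
time ./process bigfile.txt
time ./process bigfile.txt
```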
Aug 08 2010