
digitalmars.D - tolf and detab

reply Walter Bright <newshound2 digitalmars.com> writes:
I wrote these two trivial utilities to canonicalize source code before 
checkins, to cope with FreeBSD's inability to handle CRLF line endings, and 
because I can never figure out the right settings to make git do the 
canonicalization.
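[For reference, the git settings in question are roughly these; a sketch, adjust per repository:]

```shell
# Per clone: convert CRLF to LF on commit, leave line endings alone on checkout
git config core.autocrlf input

# Or, committed in the repository root as .gitattributes:
#   * text=auto eol=lf
```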

tolf - converts LF, CR, and CRLF line endings to LF.

detab - converts all tabs to the equivalent number of spaces, assuming 
8-column tab stops. Also removes trailing whitespace from lines.

Posted here just in case someone wonders what they are.
---------------------------------------------------------
/* Replace tabs with spaces, and remove trailing whitespace from lines.
  */

import std.file;
import std.path;

int main(string[] args)
{
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
}


char[] filter(char[] input)
{
     char[] output;
     size_t j;

     int column;
     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\t':
                 while ((column & 7) != 7)
                 {   output ~= ' ';
                     j++;
                     column++;
                 }
                 c = ' ';
                 column++;
                 break;

             case '\r':
             case '\n':
                 while (j && output[j - 1] == ' ')
                     j--;
                 output = output[0 .. j];
                 column = 0;
                 break;

             default:
                 column++;
                 break;
         }
         output ~= c;
         j++;
     }
     while (j && output[j - 1] == ' ')
         j--;
     return output[0 .. j];
}
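[For comparison, here is a sketch of the same tab-expansion logic in Python; the name `detab` is my own, and like the D version above it assumes fixed tab stops and strips trailing whitespace:]

```python
def detab(text, tabsize=8):
    """Expand tabs to spaces and strip trailing whitespace, line by line."""
    out = []
    for line in text.split("\n"):
        col = 0
        buf = []
        for ch in line:
            if ch == "\t":
                n = tabsize - col % tabsize  # spaces needed to reach the next tab stop
                buf.append(" " * n)
                col += n
            else:
                buf.append(ch)
                col += 1
        out.append("".join(buf).rstrip())
    return "\n".join(out)
```

[Python's built-in str.expandtabs does the tab expansion, but not the trailing-whitespace stripping.]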
-----------------------------------------------------
/* Replace line endings with LF
  */

import std.file;
import std.path;

int main(string[] args)
{
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
}


char[] filter(char[] input)
{
     char[] output;
     size_t j;

     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\r':
                 c = '\n';
                 break;

             case '\n':
                 if (i && input[i - 1] == '\r')
                     continue;
                 break;

             case 0:
                 continue;

             default:
                 break;
         }
         output ~= c;
         j++;
     }
     return output[0 .. j];
}
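[The line-ending normalization above can be sketched in Python in a couple of lines; `tolf` here is a hypothetical helper, not part of the original tools, and unlike the D version it does not drop NUL bytes. Note that CRLF must be replaced before lone CR:]

```python
def tolf(text):
    # Replace CRLF first; otherwise the lone-CR pass would split "\r\n" into "\n\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")
```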
------------------------------------------
Aug 06 2010
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/06/2010 08:34 PM, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences. Andrei
Aug 06 2010
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Or improve your google-fu by finding some existing tools that do the job
right. :)

I'm pretty sure Uncrustify is good at most of these issues, not to mention
it's a very nice source-code "prettifier/indenter". There's a front-end
called UniversalIndentGUI, which has about a dozen integrated versions of
source-code prettifiers (including uncrustify, and for many languages). It
has various settings on the left, and a togglable *Live* preview mode which you
can view on the right.

I invite you guys to try it out sometime:

http://universalindent.sourceforge.net/

(+ you can save different settings which is neat when you're coding for
different projects that have different "code design & look" standards)

On Sat, Aug 7, 2010 at 3:50 AM, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 08/06/2010 08:34 PM, Walter Bright wrote:

 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences. Andrei
Aug 06 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Andrej Mitrovic wrote:
 Or improve your google-fu by finding some existing tools that do the job 
 right. :)
Sure, but I suspect it's faster to write the utility! After all, they are trivial.
Aug 06 2010
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Andrei Alexandrescu wrote:
 A good exercise would be rewriting these tools in idiomatic D2 and 
 assess the differences.
Some D2-fu would be cool. Any takers?
Aug 06 2010
prev sibling next sibling parent reply "Yao G." <nospamyao gmail.com> writes:
What does idiomatic D mean?

On Fri, 06 Aug 2010 20:50:52 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 08/06/2010 08:34 PM, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences. Andrei
-- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 06 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/06/2010 09:33 PM, Yao G. wrote:
 What does idiomatic D mean?
At a quick glance - I'm thinking two elements would be using string and possibly byLine. Andrei
Aug 06 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Yao G." <nospamyao gmail.com> wrote in message 
news:op.vg1qpcjfxeuu2f miroslava.gateway.2wire.net...
 What does idiomatic D mean?
"idiomatic D" -> "In typical D style"
Aug 06 2010
prev sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:
 On 08/06/2010 08:34 PM, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.
 
 tolf - converts LF, CR, and CRLF line endings to LF.
 
 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.
 
 Posted here just in case someone wonders what they are.
[snip] Nice, though they don't account for multiline string literals. A good exercise would be rewriting these tools in idiomatic D2 and assess the differences. Andrei
I didn't try and worry about multiline string literals, but here are my more 
idiomatic solutions:

detab:
-------------------------------------------
/* Replace tabs with spaces, and remove trailing whitespace from lines.
 */

import std.conv;
import std.file;
import std.stdio;
import std.string;

void main(string[] args)
{
    const int tabSize = to!int(args[1]);

    foreach(f; args[2 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    foreach(line; file.byLine())
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const int tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const int numSpaces = tabSize - tab % tabSize;

            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
        }

        output ~= line.idup;
    }

    std.file.write(fileName, output.join("\n"));
}
-------------------------------------------
The three differences between mine and Walter's are that mine takes the tab 
size as the first argument, it doesn't put a newline at the end of the file, 
and it writes the file even if nothing changed (you could test for that, but 
when using byLine(), it's a bit harder). Interestingly enough, from the few 
tests that I ran, mine seems to be somewhat faster. I also happen to think 
that the code is clearer (it's certainly shorter), though that might be up 
for debate.

tolf:
-------------------------------------------
/* Replace line endings with LF
 */

import std.file;
import std.string;

void main(string[] args)
{
    foreach(f; args[1 .. $])
        fixEndLines(f);
}

void fixEndLines(string fileName)
{
    auto fileStr = std.file.readText(fileName);
    auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");
    std.file.write(fileName, result);
}
-------------------------------------------
This version is ludicrously simple. And it was also faster than Walter's in 
the few tests that I ran. In either case, I think that it is definitely 
clearer code.

I would have thought that being more idiomatic would have resulted in slower 
code than what Walter did, but interestingly enough, both programs are faster 
with my code. They might take more memory though; I'm not quite sure how to 
check that. In any case, you wanted some idiomatic D2 solutions, so there you 
go.

- Jonathan M Davis
Aug 07 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Jonathan M Davis:
 I would have thought that being more idiomatic would have resulted in
 slower code than what Walter did, but interestingly enough, both programs
 are faster with my code. They might take more memory though. I'm not quite
 sure how to check that. In any case, you wanted some idiomatic D2
 solutions, so there you go.
Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, such as Python :-) In this case a Python version is more readable, shorter, and probably faster too, because reading the lines of a _normal_ text file is faster in Python than in D (Python is more optimized for such purposes; I can show benchmarks on request). On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I would use Python to write them. Bye, bearophile
Aug 07 2010
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/07/2010 11:16 PM, bearophile wrote:
 Jonathan M Davis:
 I would have thought that being more idomatic would have resulted
 in slower code than what Walter did, but interestingly enough, both
 programs are faster with my code. They might take more memory
 though. I'm not quite sure how to check that. In any cases, you
 wanted some idiomatic D2 solutions, so there you go.
Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-) In this case a Python version is more readable, shorter and probably faster too because reading the lines of a _normal_ text file is faster in Python compared to D (because Python is more optimized for such purposes. I can show benchmarks on request). On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I want to use Python to write them.
I think it's worth targeting D2 at tasks that are usually handled by scripting languages. I've done a lot of that, and it beats the hell out of rewriting in D a script that's grown out of control. Andrei
Aug 07 2010
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/07/2010 11:16 PM, bearophile wrote:
 In this case a Python version is more readable, shorter and probably
 faster too because reading the lines of a _normal_ text file is
 faster in Python compared to D (because Python is more optimized for
 such purposes. I can show benchmarks on request).
That would be great so we can tune our approach. Thanks! Andrei
Aug 07 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 This makes me think we should have a range that detects and replaces 
 patterns lazily and on the fly.
In Python there is a helper module: http://docs.python.org/library/fileinput.html
 I think it's worth targeting D2 to tasks that are usually handled by
 scripting languages. I've done a lot of that and it beats the hell out
 of rewriting in D a script that's grown out of control
Dynamic languages are handy but they require some rigour when you program. Python is probably unfit for writing million-line programs, but if you train yourself a little and keep your code clean, you usually become able to write clean largish programs in Python.
 That would be great so we can tune our approach. Thanks!
In my dlibs I have the xio module that reads lines efficiently; it was faster 
than iterating on the lines of a BufferedFile. There are tons of different 
benchmarks that you may use, but a simple one to start with is better, one 
that just iterates over the file lines. See below.

Related: experiments have shown that the (oldish) Java GC improves its 
performance if it is able to keep strings (which are immutable) in a separate 
memory pool, _and_ is able to recognize duplicated strings, of course keeping 
only one string for each equality set. It would be interesting to do a similar 
experiment with the D GC, but first you need applications that use the GC to 
test whether this idea is an improvement :-)

So I have used a minimal benchmark:
--------------------------
# Python code
from sys import argv

def process(file_name):
    total = 0
    for line in open(file_name):
        total += len(line)
    return total

print "Total:", process(argv[1])
--------------------------
// D2 code
import std.stdio: File, writeln;

int process(string fileName) {
    int total = 0;
    auto file = File(fileName);

    foreach (rawLine; file.byLine()) {
        string line = rawLine.idup;
        total += line.length;
    }

    file.close();
    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}
--------------------------
In the D code I have added an idup to make the comparison more fair, because 
in the Python code the "line" is a true newly allocated line; you can safely 
use it as a dictionary key.

I have used Python 2.7 with no Psyco JIT (http://psyco.sourceforge.net/ ) to 
speed up the Python code, because it's not available yet for Python 2.7. The 
D code was compiled with dmd 2.047, optimized build.

As test data I have used a concatenation of all the text files here (they are 
copyrighted, but freely usable):
http://gnosis.cx/TPiP/
The result on Windows is a file of 1_116_552 bytes.

I have appended the file to itself, doubling its length, several times; the 
result is a file of 71_459_328 bytes (this is not a fully realistic case 
because you often have many small files to read instead of one very large 
one).

The timings are taken with a warm disk cache, so the data is essentially read 
from RAM. This is not fully realistic, but if you want to write a benchmark 
you have to do this, because for me it's very hard on Windows to make sure 
the disk cache is fully empty. So it's better to do the opposite and 
benchmark a warm file.

The output of the Python code is:
Total: 69789888
Found in 0.88 seconds (best of 6, the variance is minimal).

The output of the D code is:
Total: 69789888
Found in 1.28 seconds (best of 6, minimal variance).

If in the D2 code I comment out the idup like this:

foreach (rawLine; file.byLine()) {
    total += rawLine.length;
}

then the output of the D code without idup is:
Total: 69789888
Found in 0.75 seconds (best of 6, minimal variance).

As you see it's a matter of GC efficiency too. Besides the GC, the cause of 
the higher performance of the Python code is a tuned design; you can see the 
function getline_via_fgets here:
http://svn.python.org/view/python/trunk/Objects/fileobject.c?revision=81275&view=markup
It uses a "stack buffer" (char buf[MAXBUFSIZE]; where MAXBUFSIZE is 300) too.

Bye, bearophile
Aug 08 2010
parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, because
 in the Python code the "line" is a true newly allocated line, you can safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations. Also, I object in general to this method of making things "more fair". Using a less efficient approach in X because Y cannot use such an approach is not a legitimate comparison.
Aug 08 2010
next sibling parent "Nick Sabalausky" <a a.a> writes:
"Walter Bright" <newshound2 digitalmars.com> wrote in message 
news:i3mpnb$2hcf$1 digitalmars.com...
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, 
 because
 in the Python code the "line" is a true newly allocated line, you can 
 safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
I thought byLine just re-uses the same buffer each time?
Aug 08 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, because
 in the Python code the "line" is a true newly allocated line, you can safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
I think you are wrong twice:

1) byLine() doesn't return a newly allocated line; you can see it with this 
small program:

import std.stdio: File, writeln;

void main(string[] args) {
    char[][] lines;
    auto file = File(args[1]);

    foreach (rawLine; file.byLine()) {
        writeln(rawLine.ptr);
        lines ~= rawLine;
    }

    file.close();
}

Its output shows that all "strings" (char[]) share the same pointer:

14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
...

2) You can't use the lines yielded by byLine() as string keys for an 
associative array, as I have said you can in Python. Currently you can, but 
according to Andrei this is a bug. And if it's not a bug then I'll reopen 
this closed bug 4474:
http://d.puremagic.com/issues/show_bug.cgi?id=4474
 Also, I object in general to this method of making things "more fair". Using a
 less efficient approach in X because Y cannot use such an approach is not a
 legitimate comparison.
I generally agree, but this is not the case. In some situations you indeed don't need a newly allocated string for each loop iteration, because for example you just want to read and process the lines, not change or store them. You can't do this in Python, but this is not what I want to test. As I have explained in bug 4474, this behaviour is useful, but it is acceptable only if explicitly requested by the programmer, and not as the default. The language is safe, as Andrei explains there, because you are supposed to idup the char[] to use it as a key for an associative array (if your associative array is declared as int[char[]] then it can accept such lines as keys, but you can clearly see those aren't strings; this is why I have closed bug 4474). Bye, bearophile
Aug 08 2010
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair,
 because in the Python code the "line" is a true newly allocated line, you
 can safely use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
 I think you are wrong twice:

 1) byLine() doesn't return a newly allocated line; you can see it with this
 small program:

 import std.stdio: File, writeln;

 void main(string[] args) {
     char[][] lines;
     auto file = File(args[1]);
     foreach (rawLine; file.byLine()) {
         writeln(rawLine.ptr);
         lines ~= rawLine;
     }
     file.close();
 }

 Its output shows that all "strings" (char[]) share the same pointer:

 14E5E00
 14E5E00
 14E5E00
 14E5E00
 14E5E00
 14E5E00
 14E5E00
 ...
Eh, you're right. The Phobos documentation for byLine needs to be fixed.
 You can't do this in Python, but this is not what I want to test.
If you want to conclude that Python is better at processing files, you need to show each language doing it in a way well suited to that language, rather than burdening one so it uses the same method as the less powerful one.
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 If you want to conclude that Python is better at processing files, you need
 to show each language doing it in a way well suited to that language,
 rather than burdening one so it uses the same method as the less powerful
 one.
byLine() yields a char[], so if you want to do most kinds of string processing, or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D.

You can of course use the raw char[], but then you lose the advantages advertised when you introduced the safer immutable D2 strings. And in many situations you have to dup the char[] anyway, otherwise you have all kinds of bugs that Python lacks. In D1, to avoid them, I used to use dup more often than necessary. I have explained this in bug 4474.

In this newsgroup my purpose is to show D's faults, suggest improvements, etc. In this case my purpose was just to show that byLine()+idup is slow, and you should be thankful for my benchmarks. In my dlibs1 for D1 I have a xio module that reads files by line and is faster than iterating on a BufferedFile, so it's not a limit of the language; it's Phobos that has a performance bug that can be fixed.

Bye, bearophile
Aug 08 2010
next sibling parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Andrei used to!string() in an early example in TDPL for some line-by-line
processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.

On Sun, Aug 8, 2010 at 11:44 PM, bearophile <bearophileHUGS lycos.com>wrote:

 Walter Bright:
 If you want to conclude that Python is better at processing files, you
need to
 show it using each language doing it a way well suited to that language,
rather
 than burdening one so it uses the same method as the less powerful one.
 byLine() yields a char[], so if you want to do most kinds of string
 processing or you want to store the line (or parts of it), you have to idup
 it. So in this case Python is not significantly less powerful than D.

 You can of course use the raw char[], but then you lose the advantages
 advertised when you introduced the safer immutable D2 strings. And in many
 situations you have to dup the char[] anyway, otherwise you have all kinds
 of bugs that Python lacks. In D1 to avoid it I used to use dup more often
 than necessary. I have explained this in the bug 4474.

 In this newsgroup my purpose is to show D faults, suggest improvements,
 etc. In this case my purpose was just to show that byLine()+idup is slow.
 In my dlibs1 for D1 I have a xio module that reads files by line that is
 faster than iterating on a BufferedFile, so it's not a limit of the
 language; it's Phobos that has a performance bug that can be improved.

 Bye,
 bearophile
Aug 08 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 04:48 PM, Andrej Mitrovic wrote:
 Andrei used to!string() in an early example in TDPL for some
 line-by-line processing. I'm not sure of the advantages/disadvantages of
 to!type vs .dup.
For example, to!string(someString) does not duplicate the string. Andrei
Aug 08 2010
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
 so it's not a limit of the language, it's Phobos that has a performance bug
that can be improved.
I don't know where the performance bug is, maybe it's a matter of GC, not a Phobos performance bug. Bye, bearophile
Aug 08 2010
prev sibling next sibling parent reply "Yao G." <nospamyao gmail.com> writes:
On Sun, 08 Aug 2010 16:44:09 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:

 Walter Bright:
 If you want to conclude that Python is better at processing files, you  
 need to
 show it using each language doing it a way well suited to that  
 language, rather
 than burdening one so it uses the same method as the less powerful one.
byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. [snip] And you have to [be] thankful for my benchmarks. [snip] Bye, bearophile
<g> What's next? Will you demand attribution like the time Andrei presented the ranges design?
Aug 08 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Yao G.:
 <g> What's next? Will you demand attribution like the time Andrei  
 presented the ranges design?
Of course. In the end all D will be mine <evil laugh with echo effects> :-) Bye, bearophile
Aug 08 2010
parent "Yao G." <nospamyao gmail.com> writes:
On Sun, 08 Aug 2010 17:27:04 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:

 Yao G.:
 <g> What's next? Will you demand attribution like the time Andrei
 presented the ranges design?
Of course. In the end all D will be mine <evil laugh with echo effects> :-) Bye, bearophile
:D That was a good comeback. -- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Aug 08 2010
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 05:17 PM, Yao G. wrote:
 On Sun, 08 Aug 2010 16:44:09 -0500, bearophile
 <bearophileHUGS lycos.com> wrote:

 Walter Bright:
 If you want to conclude that Python is better at processing files,
 you need to
 show it using each language doing it a way well suited to that
 language, rather
 than burdening one so it uses the same method as the less powerful one.
byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. [snip] And you have to [be] thankful for my benchmarks. [snip] Bye, bearophile
<g> What's next? Will you demand attribution like the time Andrei presented the ranges design?
Well, I understand his frustration. I asked him for a comparison and he took the time to write one and play with it. I think the proper answer to that is to see what we can do to improve the situation, not to defend the status quo. Whatever weaknesses the benchmark has should be fixed, and then whatever weaknesses the library has should be addressed. Andrei
Aug 08 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 04:44 PM, bearophile wrote:
 Walter Bright:
 If you want to conclude that Python is better at processing files, you need to
 show it using each language doing it a way well suited to that language, rather
 than burdening one so it uses the same method as the less powerful one.
byLine() yields a char[], so if you want to do most kinds of strings processing or you want to store the line (or parts of it), you have to idup it. So in this case Python is not significantly less powerful than D. You can of course use the raw char[], but then you lose the advantages advertised when you have introduced the safer immutable D2 strings. And in many situations you have to dup the char[] anyway, otherwise your have all kinds of bugs, that Python lacks. In D1 to avoid it I used to use dup more often than necessary. I have explained this in the bug 4474. In this newsgroup my purpose it to show D faults, suggest improvements, etc. In this case my purpose was just to show that byLine()+idup is slow. And you have to thankful for my benchmarks. In my dlibs1 for D1 I have a xio module that reads files by line that is faster than iterating on a BufferedFile, so it's not a limit of the language, it's Phobos that has a performance bug that can be improved.
Thanks for your analysis. Where does xio derive its performance advantage from? Andrei
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei:

Where does xio derive its performance advantage from?<
I'd like to give you a good answer, but I can't. dlibs1 (which you can still find online) has a Python licence, so to create xio.xfile() I just translated to D1 the C code of the CPython file-object implementation that I have already linked here.

I think it minimizes heap allocations, and its performance is tuned for a line length found to be the "average" one for normal files. So I presume that if your text file has very short lines (like 5 chars each) or very long ones (like 1000 chars each) it becomes less efficient.

So it's probably a matter of good usage of the C I/O functions, and probably more efficient management by the GC.

Phobos is Boost-licensed, but I don't think the Python devs can get mad if you take a look at how Python reads lines lazily :-) Someone has tried to implement a Python-style associative array in a similar way.

Bye, bearophile
Aug 08 2010
parent reply Kagamin <spam here.lot> writes:
bearophile Wrote:

 I think it minimizes heap allocations, the performance is tuned for a line
length found to be the "average one" for normal files. So I presume if your
text file has very short lines (like 5 chars each) or very long ones (like 1000
chars each) it becomes less efficient.
 
 So it's probably a matter of good usage of the C I/O functions and probably a
more efficient management by the GC.
 
Don't you minimize heap allocations etc. by reading the whole file in one I/O call?
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Kagamin:

 Don't you minimize heap allocations etc. by reading the whole file in one I/O call?
The whole thread was about lazy reading of file lines. If the file is very large it's not wise to load it all in RAM at once. Bye, bearophile
Aug 09 2010
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS lycos.com> said:

 Kagamin:
 
 Don't you minimize heap allocation etc by reading whole file in one io call?
The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once.
For non-huge files that fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice, keeping the slices around (yeah!). The virtual memory system will take care of loading the file's contents as you read from its memory space, so the file isn't loaded all at once. But that's not compatible with the C file I/O functions. Does Python use C file I/O calls when reading from a file? If not, perhaps that's why it's faster. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
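[As a rough illustration of the memory-mapping idea in Python; `demo.txt` is a throwaway file created just so the example is self-contained, and this is a sketch, not what Phobos or CPython actually do:]

```python
import mmap

# Create a small throwaway file so the example is self-contained.
with open("demo.txt", "wb") as f:
    f.write(b"first line\nsecond line\n")

with open("demo.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing the map reads only the pages that are touched;
    # the OS faults them in lazily.
    first = mm[:mm.find(b"\n")]
    mm.close()
```

[Note that in Python a slice of an mmap is copied into a separate bytes object, whereas in D you could keep actual slices into the mapped memory, as Michel describes.]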
Aug 09 2010
parent Jonathan M Davis <jmdavisprog gmail.com> writes:
On Monday, August 09, 2010 05:30:33 Michel Fortin wrote:
 On 2010-08-09 07:12:38 -0400, bearophile <bearophileHUGS lycos.com> said:
 Kagamin:
 Don't you minimize heap allocation etc by reading whole file in one io
 call?
The whole thread was about lazy read of file lines. If the file is very large it's not wise to load it all in RAM at once.
For non-huge files that can fit in the memory space, I'd just memory-map the whole file and treat it as a giant string that I could then slice and keep the slices around (yeah!). The virtual memory system will take care of loading the file content's as you read from its memory space, so the file isn't loaded all at once. But that's not compatible with the C file IO functions. Does Python uses C file IO calls when reading from a file? If not, perhaps that's why it's faster.
Well, you can just read the whole file in as a string with readText(), and any slices to that could stick around, but presumably, that's using the C file I/O calls underneath. - Jonathan M Davis
Aug 09 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 02:32 PM, bearophile wrote:
 Walter Bright:
 bearophile wrote:
 In the D code I have added an idup to make the comparison more fair, because
 in the Python code the "line" is a true newly allocated line, you can safely
 use it as dictionary key.
So it is with byLine, too. You've burdened D with double the amount of allocations.
I think you are wrong two times:

1) byLine() doesn't return a newly allocated line, you can see it with this small program:

import std.stdio: File, writeln;

void main(string[] args) {
    char[][] lines;
    auto file = File(args[1]);
    foreach (rawLine; file.byLine()) {
        writeln(rawLine.ptr);
        lines ~= rawLine;
    }
    file.close();
}

Its output shows that all "strings" (char[]) share the same pointer:

14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
14E5E00
...

2) You can't use the result of rawLine() as a string key for an associative array, as I have said you can in Python. Currently you can, but according to Andrei this is a bug. And if it's not a bug then I'll reopen this closed bug 4474:
http://d.puremagic.com/issues/show_bug.cgi?id=4474
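The buffer-reuse point can be mimicked in Python. This is not Phobos code, just a sketch of a generator that, like byLine(), yields the same buffer object on every iteration, to show why storing the line without copying is a bug (and why the idup-style copy is the fix):

```python
import io

def by_line(f):
    # Mimic D's byLine(): reuse one buffer for every line.
    buf = bytearray()
    for line in f:
        buf[:] = line.encode()   # overwrite the single shared buffer
        yield buf                # the *same* object every iteration

f = io.StringIO("one\ntwo\nthree\n")
kept = [raw for raw in by_line(f)]       # stored without copying...
print([bytes(b) for b in kept])          # every entry shows the *last* line

f.seek(0)
copies = [bytes(raw) for raw in by_line(f)]   # the idup-like fix: copy first
print(copies)                                 # [b'one\n', b'two\n', b'three\n']
```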
 Also, I object in general to this method of making things "more fair". Using a
 less efficient approach in X because Y cannot use such an approach is not a
 legitimate comparison.
I generally agree, but this is not the case. In some situations you indeed don't need a newly allocated string for each loop, because for example you just want to read and process the lines and not change/store them. You can't do this in Python, but this is not what I want to test.

As I have explained in bug 4474 this behaviour is useful, but it is acceptable only if explicitly requested by the programmer, and not as the default one. The language is safe, as Andrei explains there, because you are supposed to idup the char[] to use it as a key for an associative array (if your associative array is declared as int[char[]] then it can accept such rawLine() results as keys, but you can clearly see those aren't strings. This is why I have closed bug 4474).

Bye,
bearophile
I think at the end of the day, regardless of the relative possibilities of file reading in the two languages, we should be faster than Python when allocating one new string per line.

Andrei
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:
 I think at the end of the day, regardless the relative possibilities of 
 file reading in the two languages, we should be faster than Python when 
 allocating one new string per line.
For now I suggest you to aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy. (Later someday I'd also like D AAs to become about as fast as Python dicts.) Bye, bearophile
Aug 08 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 10:29 PM, bearophile wrote:
 Andrei Alexandrescu:
 I think at the end of the day, regardless the relative
 possibilities of file reading in the two languages, we should be
 faster than Python when allocating one new string per line.
For now I suggest you to aim to be just about as fast as Python in this task :-) Beating Python significantly on this task is probably not easy.
Why? Andrei
Aug 08 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 For now I suggest you to aim to be just about as fast as Python in
 this task :-) Beating Python significantly on this task is probably
 not easy.
Why?
Because it's core functionality for Python, so its devs have probably optimized it well; it's written in C, and in this case there is very little interpreter overhead.

Bye,
bearophile
Aug 09 2010
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 
 For now I suggest you to aim to be just about as fast as Python
 in this task :-) Beating Python significantly on this task is
 probably not easy.
Why?
Because it's a core functionality for Python so devs probably have optimized it well, it's written in C, and in this case there is very little interpreter overhead.
Then we can do whatever they've done. It's not like they're using APIs nobody has heard of.

It seems such a comparison of file I/O speed becomes in fact a comparison of garbage collectors. That's fine, but in that case the notion that D offers the possibility to avoid allocation should come back to the table.

Andrei
Aug 09 2010
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Jonathan M Davis:
 I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any cases, you wanted some idiomatic D2 solutions, so there you go.

Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-)

In this case a Python version is more readable, shorter and probably faster too, because reading the lines of a _normal_ text file is faster in Python compared to D (because Python is more optimized for such purposes. I can show benchmarks on request).

On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I want to use Python to write them.

Bye,
bearophile
I disagree completely. D is clearly designed from the "simple things should be simple and complicated things should be possible" point of view. If it doesn't work well for these kinds of short scripts then we've failed at making simple things simple and we're just like every other crappy "large scale, industrial strength" language like Java and C++ that's great for megaprojects but makes simple things complicated.

That said, I think D does a great job in this regard. I actually use Python as my language of second choice for things D isn't good at. Mostly this means needing Python's huge standard library, needing 64-bit support, or needing to share my code with people who don't know D. Needing to write a very short script tends not to be a reason for me to switch over. It's not that rare for me to start with a short script and then end up adding something that needs performance to it (like Monte Carlo simulation of a null probability distribution) and I don't find D substantially harder to use for these cases.
Aug 08 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 08/08/2010 14:31, dsimcha wrote:
 I disagree completely.  D is clearly designed from the "simple things should be
 simple and complicated things should be possible" point of view.  If it doesn't
 work well for these kinds of short scripts then we've failed at making simple
 things simple and we're just like every other crappy "large scale, industrial
 strength" language like Java and C++ that's great for megaprojects but makes
 simple things complicated.
dsimcha wrote: "I hate Java and every programming language where a readable hello world takes more than 3 SLOC"

That may be your preference, but other people here in the community, me at least, very much want D to be a "large scale, industrial strength" language that's great for megaprojects. I think that medium and large scale projects are simply much more important and interesting than small scale ones.

I am hoping this would become an *explicit* point of D design goals, if it isn't already. And I will campaign against (so to speak) people like you who think small scale is more important. No personal animosity intended though.

Note: I am not stating that it is not possible to be good, even great, at both things (small and medium/large scale).

-- 
Bruno Medeiros - Software Engineer
Sep 30 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Bruno Medeiros:

 I think that medium and large
 scale projects are simply much more important and interesting than small
 scale ones.
 I am hoping this would become an *explicit* point of D design goals, if
 it isn't already.
 And I will campaign against (so to speak), people like you who think
 small scale is more important. No personal animosity intended though.
 Note: I am not stating that is is not possible to be good, even great,
 at both things (small and medium/large scale).
This is an interesting topic of practical language design; it's a wide problem and I can't have complete answers. D2 design is mostly done, only small parts may be changed now, so those campaigns probably can't change D2 design much.

The name of the Scala language means that it is meant to be a scalable language; this means it is designed to be useful and usable for both large and quite small programs.

A language like Ada is not a bad language. Programming practice shows that in many situations debug time is the larger percentage of the development of a program, so minimizing debug time is usually a very good thing. Ada tries hard to avoid many common bugs, much more than D (it has ranged integers, integer overflow checks, it defines a portable floating point semantics (though there is a way to use the faster IEEE semantics), it forces you to use clear interfaces between modules (much more explicit ones than D's), it never silently changes variable types, its semantics is fully specified, there are very precise Ada semantics specs, all Ada compilers must pass a very large test suite, and so on and on). In practice the language is able to catch many bugs before they happen. So if you want to write a program in critical situations, like important control systems, Ada is a language better than Perl, and probably better than D too :-)

Yet, writing programs in Ada is not handy: if you need to write small programs you need a lot of boilerplate code that is useful only in larger programs. And Ada is a Pascal-like language that many modern programmers don't know/like. Ada looks designed for larger, low-bug-count, costly (and often well planned out from the beginning, with no specs that change with time) programs, but it's not handy for writing small programs. Probably Ada is not the best language to write web code that has to change all the time. Today Ada is not a dead language, but it smells funny; it's not commonly used.
Andrei has expressed the desire to use D2 as a language to write script-like programs too. I think in most cases a language like Python is better than D2 for writing small script-like programs, yet I agree with Andrei that it's good to try to make the D2 language fit for writing small script-like programs too, because to write such programs you need a very handy language, one that catches/avoids many simple common bugs quickly, gives you excellent error messages/stack traces, and allows you to do common operations on web/text files/images/sounds/etc. in few lines of code. My theory is that later those qualities turn out to be useful even in large programs. I think such qualities may help D avoid the Ada fate.

The ability to write small programs with D is also useful to attract programmers to D, because if in your language you need to write a 30-line program to print a "hello world" on the screen then newcomers are likely to stop using that language after their first try.

Designing a language that is both good for small and large programs is not easy, but it is a worthy goal. The D module system must be debugged & finished & improved to improve the usage of D for larger programs. Some features of unittesting and design by contract currently missing are very useful if you want to use D to write large programs. If you want to write large programs, reliability becomes an important concern, so integer overflow tests and some system to avoid null-related bugs (not-nullable types and more) become useful or very useful.

Bye,
bearophile
Sep 30 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 30/09/2010 19:31, bearophile wrote:
 Bruno Medeiros:

 I think that medium and large
 scale projects are simply much more important and interesting than small
 scale ones.
 I am hoping this would become an *explicit* point of D design goals, if
 it isn't already.
 And I will campaign against (so to speak), people like you who think
 small scale is more important. No personal animosity intended though.
 Note: I am not stating that is is not possible to be good, even great,
 at both things (small and medium/large scale).
This is an interesting topic of practical language design, it's a wide problem and I can't have complete answers. D2 design is mostly done, only small parts may be changed now, so those campaigns probably can't change D2 design much.
I'm not so sure about that. Probably backwards-incompatible changes will be very few, if any. But there can be backwards-compatible changes, or changes to stuff that was not mentioned in TDPL. And there may be a D3 eventually (a long way down the road though).

But my main worry is not language changes; I actually think it's very unlikely Walter and Andrei would make a language change that intentionally would adversely affect medium/large scale programs in favor of small scale programs. My main issue is with the time and thinking resources that are expended here in the NG when people argue for changes (or against other changes) with the intention of favoring small-scale programs. If this were explicit in the D design goals, it would help save us from these discussions (which affect NG readers, not just posters).
 The name of the Scala language means that it is meant to be a scalable
language, this means it is designed to be useful and usable for both large and
quite small programs.
Whoa, wait. From my understanding, Scala is a "scalable language" in the sense that it is easy to add new language features, or something similar to that. But let's be clear: that's not what I'm talking about, and neither is scalability of program data/inputs/performance. I'm talking about scalability of source code, software components, developers, teams, requirements, planning changes, project management issues, etc.
 A language like Ada is not a bad language. Programming practice shows that in
many situations the debug time is the larger percentage of the development of a
program. So minimizing debug time is usually a very good thing. Ada tries hard
to avoid many common bugs, much more than D (it has ranged integers, integer
overflows, it defines a portable floating point semantics (despite there is a
way to use the faster IEEE semantics), it forces to use clear interfaces
between modules (much more explicit ones than D ones), it never silently
changes variable types, its semantics is fully specified, there are very
precise Ada semantics specs, all Ada compilers must pass a very large test
suite, and so on and on). In practice the language is able to catch many bugs
before they happen.

 So if you want to write a program in critical situations, like important
control systems, Ada is a language better than Perl, and probably better than D
too :-)

 Yet, writing programs in Ada is not handy, if you need to write small programs
you need lot of boilerplate code that is useful only in larger programs. And
Ada is a Pascal-like language that many modern programmers don't know/like. Ada
looks designed for larger, low-bug-count, costly (and often well planned out
from the beginning, with no specs that change with time) programs, but it's not
handy to write small programs. Probably Ada is not the best language to write
web code that has to change all the time. Today Ada is not a dead language, but
it smells funny, it's not commonly used.
Certainly it's not just web code that can change all the time. But I'm missing your point here, what does Ada have to do with this?
 Andrei has expressed the desire to use D2 as a language to write script-like
programs too. I think in most cases a language like Python is better than D2 to
write small script-like programs, yet I agree with Andrei that it's good to try
to make D2 language fit to write small script-like programs too, because to
write such programs you need a very handy language, that catches/avoids many
simple common bugs quickly, gives you excellent error messages/stack traces,
and allows you to do common operations on web/text files/images/sounds/etc in
few lines of code. My theory is that later those qualities turn out to be
useful even in large programs. I think such qualities may help D avoid the Ada
fate.

 The ability to write small programs with D is also useful to attract
programmers to D, because if in your language you need to write 30 lines long
programs to write a "hello world" on the screen then newcomers are likely to
stop using that language after their first try.

 Designing a language that is both good for small and large programs is not
easy, but it is a worth goal. D module system must be debugged&  finished& 
improved to improve the usage of D for larger programs. Some features of the
unittesting and design by contract currently missing are very useful if you
want to use D to write large programs. If you want to write large programs
reliability becomes an important concern, so integer overflow tests and some
system to avoid null-related bugs (not-nullable types and more) become useful
or very useful.

 Bye,
 bearophile
Yeah, I actually think D (or any other language under design) can be quite good at both things. Maybe something like 90% of features that are good for large-scale programs are also good for small-scale ones.

One of the earliest useful programs I wrote in D was a two-page bash shell script that I converted to D. Even though it was just about two pages, it was already hard to extend and debug. After converting it to D, with the right shortcut methods and abstractions, the code actually managed to be quite succinct and comparable, I suspect, to code in Python or Perl, or languages like that. (I say suspect because I don't actually know much about Python or Perl, but I simply didn't see many language changes that could have made my D more succinct, barring crazy stuff like dynamic scoping.)

-- 
Bruno Medeiros - Software Engineer
Oct 01 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Bruno Medeiros:

 From my understanding, Scala is a "scalable language" in the sense
 that it easy to add new language features, or something similar to that.
I see. You may be right.
 But I'm missing your point here, what does Ada have to do with this?
Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.
 but I simply didn't see much language 
 changes that could have made my D more succint,
Making a language more succinct is easy; you may take a look at the J or K languages. The hard thing is to design a succinct language that is also readable and not bug-prone.

Python has some features that make the code longer: the obligatory "self." before instance attribute names and the optional usage of argument names at the calling point make the code longer. The ternary operator too in Python is longer, as is the "and" operator, etc. Such things improve readability.

Several Python features help shorten the code, like sequence unpacking syntax and multiple return values:
>>> def foo():
...     return 1, 2
...
>>> a, b = foo()
>>> a
1
>>> b
2

List comprehensions help shorten the code, but I think they also reduce bug count a bit and allow you to think about your code at a bit higher level:

>>> xs = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
>>> ps = [x * x for x in xs if x % 2]
>>> ps
[9, 25, 49, 81, 121, 169]

Python has some other features that help shorten the code, like the significant leading whitespace that avoids some bugs, avoids brace style wars, and removes both some noise and closing-brace code lines.
 barring crazy stuff like dynamic scoping)
I don't know what dynamic scoping is; do you mean that crazy nice thing named dynamic typing? :-)

Bye,
bearophile
Oct 01 2010
next sibling parent Pelle <pelle.mansson gmail.com> writes:
On 10/01/2010 01:54 PM, bearophile wrote:
 Bruno Medeiros:

  From my understanding, Scala is a "scalable language" in the sense
 that it easy to add new language features, or something similar to that.
I see. You may be right.
 But I'm missing your point here, what does Ada have to do with this?
Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.
 but I simply didn't see much language
 changes that could have made my D more succint,
Making a language more succint is easy, you may take a look at J or K languages. The hard thing is to design a succint language that is also readable and not bug-prone. Python has some features that make the code longer, like the obligatory "self." before class instance names and the optional usage of argument names at the calling point make the code longer. The ternary operator too in Python is longer, as the "and" operator, etc. Such things improve readability, etc. Several Python features help shorten the code, like sequence unpacking syntax and multiple return values:
 >>> def foo():
 ...     return 1, 2
 ...
 >>> a, b = foo()
 >>> a
 1
 >>> b
 2

 List comprehensions help shorten the code, but I think they also reduce bug count a bit and allow you to think about your code at a bit higher level:

 >>> xs = [2,3,4,5,6,7,8,9,10,11,12,13]
 >>> ps = [x * x for x in xs if x % 2]
 >>> ps
 [9, 25, 49, 81, 121, 169]

 Python has some other features that help shorten the code, like the significant leading white space that avoids some bugs, avoids brace style wars, and removes both some noise and closing brace code lines.
 barring crazy stuff like dynamic scoping)
I don't know what dynamic scoping is, do you mean that crazy nice thing named dynamic typing? :-) Bye, bearophile
No, dynamic scoping is the crazy thing. Perl code:

$x = 1;

sub p { print "$x\n" }

sub a {
    local $x = 2;
    p;
}

p; a; p

results in:

pp ~/perl% perl wat.pl
1
2
1

Crazy. :-)
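For contrast, here is the analogous program in Python, which is lexically scoped: p() always sees the module-level x, so a()'s local x = 2 never leaks into it, and the output is 1 three times (collected in a list here so the behaviour is easy to check):

```python
results = []
x = 1

def p():
    results.append(x)   # x is resolved in the defining (module) scope

def a():
    x = 2               # a new local binding, invisible to p()
    p()

p(); a(); p()
print(results)   # [1, 1, 1]
```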
Oct 01 2010
prev sibling parent Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 01/10/2010 12:54, bearophile wrote:
 Bruno Medeiros:

  From my understanding, Scala is a "scalable language" in the sense
 that it easy to add new language features, or something similar to that.
I see. You may be right.
 But I'm missing your point here, what does Ada have to do with this?
Ada has essentially died for several reasons, but in my opinion one of them is the amount of code you have to write to do even small things. If you design a language that is not handy to write small programs, you have a higher risk of seeing your language die.
There are a lot of things in a language that, if they make it harder to write small programs, will also make it harder for larger programs (sometimes even much harder).

I'm no expert in Ada, and there are many things that will affect the success of the language, so I can't comment in detail. But from a cursory look at the language, it looks terribly verbose. That "begin" / "end <name of block>" syntax is awful. I already think just "begin" / "end" syntax is bad, but also having to repeat the name of the block/function/procedure/loop at the "end", that's awful. Is it trying to compete with XML? :p
 but I simply didn't see much language
 changes that could have made my D more succint,
Making a language more succint is easy, you may take a look at J or K languages. The hard thing is to design a succint language that is also readable and not bug-prone.
Indeed, I agree. And that was the spirit of that original comment: first of all, I meant succinct not only in character and line count but also in syntactical and semantic constructs. And succinct without changes that would greatly impact the readability or safety of the code (as mentioned in "barring crazy stuff like dynamic scoping").
 barring crazy stuff like dynamic scoping)
I don't know what dynamic scoping is, do you mean that crazy nice thing named dynamic typing? :-)
Like Pelle explained, it's indeed exactly "dynamic scoping" that I meant.

-- 
Bruno Medeiros - Software Engineer
Oct 05 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:i3lb30$26vf$1 digitalmars.com...
 Jonathan M Davis:
 I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any cases, you wanted some idiomatic D2 solutions, so there you go.
Your code looks better. My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, as Python :-)
I can respect that. Personally, though, I find a lot of value in not needing to switch languages for that sort of thing. Too much "context switch" for my brain ;)
Aug 08 2010
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmail.com> writes:
Jonathan M Davis wrote:

 void removeTabs(int tabSize, string fileName)
 {
     auto file = File(fileName);
     string[] output;
 
     foreach(line; file.byLine())
     {
         int lastTab = 0;
 
         while(lastTab != -1)
         {
             const int tab = line.indexOf('\t');
 
             if(tab == -1)
                 break;
 
             const int numSpaces = tabSize - tab % tabSize;
 
             line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
 
             lastTab = tab + numSpaces;
         }
 
         output ~= line.idup;
     }
 
     std.file.write(fileName, output.join("\n"));
 }
Actually, looking at the code again, that while loop really should be while(1) rather than while(lastTab != -1), but it will work the same regardless. - Jonathan M Davis
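The tab-stop arithmetic both versions rely on (tabSize - column % tabSize: a tab advances the column to the next multiple of the tab size) can be sketched in Python; this is an illustrative rewrite, not the D code above:

```python
def detab(line, tab_size=8):
    out, col = [], 0
    for ch in line:
        if ch == '\t':
            n = tab_size - col % tab_size   # spaces to reach the next tab stop
            out.append(' ' * n)
            col += n
        else:
            out.append(ch)
            col += 1
    return ''.join(out)

print(repr(detab("a\tb")))      # 'a' + 7 spaces + 'b'
print(repr(detab("\tx", 4)))    # 4 spaces + 'x'
```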
Aug 07 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/07/2010 11:04 PM, Jonathan M Davis wrote:
 On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:
 A good exercise would be rewriting these tools in idiomatic D2 and
 assess the differences.


 Andrei
I didn't try and worry about multiline string literals, but here are my more idiomatic solutions:

detab:

/* Replace tabs with spaces, and remove trailing whitespace from lines. */

import std.conv;
import std.file;
import std.stdio;
import std.string;

void main(string[] args)
{
    const int tabSize = to!int(args[1]);

    foreach(f; args[2 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    foreach(line; file.byLine())
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const int tab = line.indexOf('\t');

            if(tab == -1)
                break;

            const int numSpaces = tabSize - tab % tabSize;

            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];

            lastTab = tab + numSpaces;
        }

        output ~= line.idup;
    }

    std.file.write(fileName, output.join("\n"));
}
Very nice. Here's how I'd improve removeTabs:

import std.conv;
import std.file;
import std.getopt;
import std.stdio;
import std.string;

void main(string[] args)
{
    uint tabSize = 8;
    getopt(args, "tabsize|t", &tabSize);
    foreach(f; args[1 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string output;
    bool changed;

    foreach(line; file.byLine(File.KeepTerminator.yes))
    {
        int lastTab = 0;

        while(lastTab != -1)
        {
            const tab = line.indexOf('\t');
            if(tab == -1)
                break;
            const numSpaces = tabSize - tab % tabSize;
            line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
            lastTab = tab + numSpaces;
            changed = true;
        }

        output ~= line;
    }

    file.close();
    if (changed)
        std.file.write(fileName, output);
}
 -------------------------------------------

 The three differences between mine and Walter's are that mine takes the tab size as the first argument, it doesn't put a newline at the end of the file, and it writes the file even if it hasn't changed (you could test for that, but when using byLine(), it's a bit harder). Interestingly enough, from the few tests that I ran, mine seems to be somewhat faster. I also happen to think that the code is clearer (it's certainly shorter), though that might be up for debate.

 -------------------------------------------



 tolf:

 /* Replace line endings with LF
    */

 import std.file;
 import std.string;

 void main(string[] args)
 {
      foreach(f; args[1 .. $])
          fixEndLines(f);
 }

 void fixEndLines(string fileName)
 {
      auto fileStr = std.file.readText(fileName);
      auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");

      std.file.write(fileName, result);
 }

 -------------------------------------------

 This version is ludicrously simple. And it was also faster than Walter's in the few tests that I ran. In either case, I think that it is definitely clearer code.
Very nice! You may as well guard the write with an if (result != fileStr). With source control etc. in the mix it's always polite to not touch files unless you are actually modifying them.

This makes me think we should have a range that detects and replaces patterns lazily and on the fly. I've always thought that loading entire files in memory and working on them is "cheating" in some sense, and a range would help with replacing patterns in streams.
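A generator gives a rough sketch of such a lazy pattern-replacing range: each line is transformed as it streams past, so the whole file is never in memory. (Illustrative only; a pattern that spans line or chunk boundaries would need extra buffering, which is the hard part and is not handled here.)

```python
def replacing(lines, old, new):
    # Lazily replace a pattern in each element of a stream of lines.
    for line in lines:
        yield line.replace(old, new)

out = list(replacing(iter(["a\tb\n", "c\td\n"]), "\t", "    "))
print(out)   # ['a    b\n', 'c    d\n']
```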
 I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any cases, you wanted some idiomatic D2 solutions, so there you go.
Looking very good, thanks. I think we should feature these and a few others as examples on the website.

Andrei
Aug 07 2010
next sibling parent reply Jonathan M Davis <jmdavisprog gmail.com> writes:
On Saturday 07 August 2010 21:59:50 Andrei Alexandrescu wrote:
 Very nice. Here's how I'd improve removeTabs:
 

 import std.conv;
 import std.file;
 import std.getopt;
 import std.stdio;
 import std.string;
 
 void main(string[] args)
 {
      uint tabSize = 8;
      getopt(args, "tabsize|t", &tabSize);
      foreach(f; args[1 .. $])
          removeTabs(tabSize, f);
 }
 
 void removeTabs(int tabSize, string fileName)
 {
      auto file = File(fileName);
      string output;
      bool changed;
 
      foreach(line; file.byLine(File.KeepTerminator.yes))
      {
          int lastTab = 0;
 
          while(lastTab != -1)
          {
              const tab = line.indexOf('\t');
              if(tab == -1)
                  break;
              const numSpaces = tabSize - tab % tabSize;
             line = line[0 .. tab] ~ repeat(" ", numSpaces) ~ line[tab + 1 .. $];
             lastTab = tab + numSpaces;
              changed = true;
          }
 
          output ~= line;
      }
 
      file.close();
      if (changed)
          std.file.write(fileName, output);
 }
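(Editorial aside, not part of the original post: the tab-stop arithmetic above, the expression `tabSize - tab % tabSize`, can be sanity-checked in isolation.)

```d
// Check the tab-stop arithmetic: a tab at column `tab` expands to
// `tabSize - tab % tabSize` spaces, landing exactly on the next tab stop.
void main()
{
    enum tabSize = 8;
    foreach (tab; [0, 1, 7, 8, 9, 15])
    {
        const numSpaces = tabSize - tab % tabSize;
        assert(numSpaces >= 1 && numSpaces <= tabSize);  // never zero spaces
        assert((tab + numSpaces) % tabSize == 0);        // next tab stop
    }
}
```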
Ah. I needed to close the file. I pretty much always just use readText(), so I didn't catch that. Also, it does look like detecting whether the file changed was a bit simpler than I thought that it would be. Quite simple really. Thanks.
 Very nice! You may as well guard the write with an if (result !=
 fileStr). With control source etc. in the mix it's always polite to not
 touch files unless you are actually modifying them.
Yes. That would be good. It's the kind of thing that I forget - probably because most of the code that I write generates new files rather than updating pre-existing ones.
 
 This makes me think we should have a range that detects and replaces
 patterns lazily and on the fly. I've always thought that loading entire
 files in memory and working on them is "cheating" in some sense, and a
 range would help with replacing patterns in streams.
It would certainly be nice to have a way to reasonably process files with ranges without having to load the whole thing into memory at once. Most of the time, I wouldn't care too much, but if you start processing large files, having the whole thing in memory could be a problem (especially if you have multiple versions of it which were created along the way as you were manipulating it).

Haskell does lazy loading of files by default and doesn't load the data until you read the appropriate part of the string. It shouldn't be all that hard to do something similar with D and ranges. The hard part would be trying to do all of it in a way that makes it so that all of the processing of the file's data doesn't have to load it all into memory (let alone load it multiple times). I'm not sure that you could do that without explicitly processing a file line by line, writing it to disk after each line is processed, since you could be doing an arbitrary set of operations on the data. It could be interesting to try and find a solution for that though.
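(A sketch of that line-by-line strategy; the helper name, the transform callback, and the temp-file convention are all made up for illustration.)

```d
import std.file : rename;
import std.stdio : File, KeepTerminator;

// Hypothetical streaming filter: read one line at a time, transform it,
// and write it out immediately, so only a single line is ever in memory.
void streamFilter(string fileName, const(char)[] delegate(char[]) transform)
{
    auto tmpName = fileName ~ ".tmp";
    {
        auto inFile = File(fileName);
        auto outFile = File(tmpName, "w");
        foreach (line; inFile.byLine(KeepTerminator.yes))
            outFile.write(transform(line));
    }   // both files are closed here by File's reference counting
    rename(tmpName, fileName);  // replace the original in one step
}
```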
 
 Looking very good, thanks. I think we should have a feature these and a
 few others as examples on the website.
Well, I, for one, much prefer the ability to program in a manner that's closer to telling the computer to do what I want rather than having to tell it how to do what I want (the replace end-of-line character program being a prime example). It makes life much simpler. Ranges certainly help a lot in that regard too. And having good example code of how to program that way could help encourage people to program that way and use std.range and std.algorithm and their ilk rather than trying more low-level solutions which aren't as easy to understand.

- Jonathan M Davis
Aug 07 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Jonathan M Davis wrote:
 It would certainly be nice to have a way to reasonably process with ranges 
 without having to load the whole thing into memory at once.
With asynchronous I/O, being able to start processing and start writing the new file before the old one has finished reading should speed things up.
Aug 07 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:i3ldk4$2ci0$1 digitalmars.com...
 Very nice! You may as well guard the write with an if (result != fileStr). 
 With control source etc. in the mix it's always polite to not touch files 
 unless you are actually modifying them.
I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
Aug 08 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 08/08/2010 12:28 PM, Nick Sabalausky wrote:
 "Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
 news:i3ldk4$2ci0$1 digitalmars.com...
 Very nice! You may as well guard the write with an if (result != fileStr).
 With control source etc. in the mix it's always polite to not touch files
 unless you are actually modifying them.
I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
It doesn't, but it still shows them as changed etc. Andrei
Aug 08 2010
parent Leandro Lucarella <luca llucax.com.ar> writes:
Andrei Alexandrescu, el  8 de agosto a las 14:44 me escribiste:
 On 08/08/2010 12:28 PM, Nick Sabalausky wrote:
"Andrei Alexandrescu"<SeeWebsiteForEmail erdani.org>  wrote in message
news:i3ldk4$2ci0$1 digitalmars.com...
Very nice! You may as well guard the write with an if (result != fileStr).
With control source etc. in the mix it's always polite to not touch files
unless you are actually modifying them.
I'm fairly sure SVN doesn't commit touched files unless there are actual changes. (Or maybe it's TortoiseSVN that adds that intelligence?)
It doesn't, but it still shows them as changed etc.
Nope, not really:

/tmp$ svnadmin create x
/tmp$ svn co file:///tmp/x xwc
Revisión obtenida: 0
/tmp$ cd xwc/
/tmp/xwc$ echo hello > hello
/tmp/xwc$ svn add hello
A         hello
/tmp/xwc$ svn commit -m 'test'
Añadiendo      hello
Transmitiendo contenido de archivos .
Commit de la revisión 1.
/tmp/xwc$ touch hello
/tmp/xwc$ svn status
/tmp/xwc$ echo changed > hello
/tmp/xwc$ svn status
M       hello
/tmp/xwc$

(sorry about the Spanish messages, I saw them after copying the test and I'm too lazy to repeat them changing the LANG environment variable :)

You might want to set the mtime to the same as the original file for build purposes though (you know you're changing the file in a way that doesn't really change its semantics, so you might want to avoid unnecessary recompilation).

-- 
Leandro Lucarella (AKA luca)                     http://llucax.com.ar/
----------------------------------------------------------------------
GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05)
----------------------------------------------------------------------
... los cuales son susceptibles a una creciente variedad de ataques
previsibles, tales como desbordamiento del tampón, falsificación de
parámetros, ...
	-- Stealth - ISS LLC - Seguridad de IT
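(Editorial aside: a sketch of that mtime trick in D. getTimes and setTimes are real std.file functions; the wrapper name is made up.)

```d
import std.datetime : SysTime;
import std.file : getTimes, setTimes, write;

// Rewrite a file but keep its original modification time, so that
// timestamp-based build tools don't see a spurious change.
void writePreservingMtime(string fileName, const(void)[] newContents)
{
    SysTime accessTime, modificationTime;
    getTimes(fileName, accessTime, modificationTime);  // remember old times
    write(fileName, newContents);
    setTimes(fileName, accessTime, modificationTime);  // put them back
}
```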
Aug 08 2010
prev sibling next sibling parent reply Norbert Nemec <Norbert Nemec-online.de> writes:
I usually do the same thing with a shell pipe
	expand | sed 's/ *$//;s/\r$//;s/\r/\n/'


On 07/08/10 02:34, Walter Bright wrote:
 I wrote these two trivial utilities for the purpose of canonicalizing
 source code before checkins and to deal with FreeBSD's inability to deal
 with CRLF line endings, and because I can never figure out the right
 settings for git to make it do the canonicalization.

 tolf - converts LF, CR, and CRLF line endings to LF.

 detab - converts all tabs to the correct number of spaces. Assumes tabs
 are 8 column tabs. Removes trailing whitespace from lines.

 Posted here just in case someone wonders what they are.
 ---------------------------------------------------------
 /* Replace tabs with spaces, and remove trailing whitespace from lines.
 */

 import std.file;
 import std.path;

 int main(string[] args)
 {
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
 }


 char[] filter(char[] input)
 {
     char[] output;
     size_t j;

     int column;
     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\t':
                 while ((column & 7) != 7)
                 {   output ~= ' ';
                     j++;
                     column++;
                 }
                 c = ' ';
                 column++;
                 break;

             case '\r':
             case '\n':
                 while (j && output[j - 1] == ' ')
                     j--;
                 output = output[0 .. j];
                 column = 0;
                 break;

             default:
                 column++;
                 break;
         }
         output ~= c;
         j++;
     }
     while (j && output[j - 1] == ' ')
         j--;
     return output[0 .. j];
 }
 -----------------------------------------------------
 /* Replace line endings with LF
 */

 import std.file;
 import std.path;

 int main(string[] args)
 {
     foreach (f; args[1 .. $])
     {
         auto input = cast(char[]) std.file.read(f);
         auto output = filter(input);
         if (output != input)
             std.file.write(f, output);
     }
     return 0;
 }


 char[] filter(char[] input)
 {
     char[] output;
     size_t j;

     for (size_t i = 0; i < input.length; i++)
     {
         auto c = input[i];

         switch (c)
         {
             case '\r':
                 c = '\n';
                 break;

             case '\n':
                 if (i && input[i - 1] == '\r')
                     continue;
                 break;

             case 0:
                 continue;

             default:
                 break;
         }
         output ~= c;
         j++;
     }
     return output[0 .. j];
 }
 ------------------------------------------
Aug 08 2010
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Norbert Nemec wrote:
 I usually do the same thing with a shell pipe
     expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
<g>
Aug 08 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Norbert Nemec" <Norbert Nemec-online.de> wrote in message 
news:i3lq17$99u$1 digitalmars.com...
I usually do the same thing with a shell pipe
 expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
Filed under "Why I don't like regex for non-trivial things" ;)
Aug 08 2010
parent reply Leandro Lucarella <luca llucax.com.ar> writes:
Nick Sabalausky, el  8 de agosto a las 13:31 me escribiste:
 "Norbert Nemec" <Norbert Nemec-online.de> wrote in message 
 news:i3lq17$99u$1 digitalmars.com...
I usually do the same thing with a shell pipe
 expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
Filed under "Why I don't like regex for non-trivial things" ;)
Those regex are non-trivial? Maybe you're confusing sed statements with regex; in that sed program, there are 3 trivial regex:

regex   replace with
 *$     (nothing)
\r$     (nothing)
\r      \n

They are the most trivial regex you'd ever find! =)
Aug 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Leandro Lucarella" <luca llucax.com.ar> wrote in message 
news:20100808212859.GL3360 llucax.com.ar...
 Nick Sabalausky, el  8 de agosto a las 13:31 me escribiste:
 "Norbert Nemec" <Norbert Nemec-online.de> wrote in message
 news:i3lq17$99u$1 digitalmars.com...
I usually do the same thing with a shell pipe
 expand | sed 's/ *$//;s/\r$//;s/\r/\n/'
Filed under "Why I don't like regex for non-trivial things" ;)
Those regex are non-trivial?
IMHO, a task has to be REALLY trivial to be trivial in regex ;)
 Maybe you're confusing sed statements with regex, in that sed program,
 there are 3 trivial regex:
Ahh, I see. I'm not familiar with sed, so my eyes got to the part after "sed" and began bleeding, so I figured it had to be one of three things:

- Encrypted data
- Hardware crash
- Regex

;)

Insert other joke about "read-only languages" or "languages that look the same before and after RSA encryption" here.

(I'm not genuinely complaining about regexes. They can be very useful. They just tend to get real ugly real fast.)
Aug 08 2010
parent Walter Bright <newshound2 digitalmars.com> writes:
Nick Sabalausky wrote:
 (I'm not genuinely complaining about regexes. They can be very useful. They 
 just tend to get real ugly real fast.)
Regexes are like flying airplanes. You have to do them often or you get "rusty" real fast. (Flying is not a natural behavior, it's not like riding a bike.)
Aug 08 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 Andrei used to!string() in an early example in TDPL for some line-by-line
 processing. I'm not sure of the advantages/disadvantages of to!type vs .dup.
I have modified the code:

import std.stdio: File, writeln;
import std.conv: to;

int process(string fileName) {
    int total = 0;
    auto file = File(fileName);
    foreach (rawLine; file.byLine()) {
        string line = to!string(rawLine);
        total += line.length;
    }
    file.close();
    return total;
}

void main(string[] args) {
    if (args.length == 2)
        writeln("Total: ", process(args[1]));
}

The run time is 1.29 seconds, showing this is equivalent to the idup.

Bye,
bearophile
Aug 08 2010
parent reply Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
What are you using to time the app? I'm using timeit (from the Windows
Server 2003 Resource Kit). I'm getting similar results to yours. Btw, how do
you use a warm disk cache? Is there a setting somewhere for that?


On Sun, Aug 8, 2010 at 11:54 PM, bearophile <bearophileHUGS lycos.com>wrote:

 Andrej Mitrovic:

 Andrei used to!string() in an early example in TDPL for some line-by-line
 processing. I'm not sure of the advantages/disadvantages of to!type vs
 .dup.

 I have modified the code:

 import std.stdio: File, writeln;
 import std.conv: to;

 int process(string fileName) {
     int total = 0;
     auto file = File(fileName);
     foreach (rawLine; file.byLine()) {
         string line = to!string(rawLine);
         total += line.length;
     }
     file.close();
     return total;
 }

 void main(string[] args) {
     if (args.length == 2)
         writeln("Total: ", process(args[1]));
 }

 The run time is 1.29 seconds, showing this is equivalent to the idup.

 Bye,
 bearophile
Aug 08 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Andrej Mitrovic:

 What are you using to time the app?
A buggy utility that is the Windows port of the GNU time command.
 Btw, how do you use a warm disk cache? Is there a setting somewhere for that?
If you run the benchmark two times and you have enough free RAM (and your system isn't performing disk I/O for other purposes), then the second time Windows keeps essentially the whole file in a cache in RAM. On Linux too there is a disk cache. The HD too has a cache, so the situation is not simple, and those caches are probably not fully under the control of Windows.

Bye,
bearophile
Aug 08 2010
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Andrej Mitrovic wrote:
 Btw, 
 how do you use a warm disk cache? Is there a setting somewhere for that?
Just run it several times until the times stop going down.
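(On a Unix-like system the warm-cache effect can also be produced explicitly; a sketch, where ./process and bigfile.txt are placeholder names:)

```shell
# Read the file once to pull it into the OS page cache, then time the
# real runs; they now measure mostly CPU time rather than disk reads.
cat bigfile.txt > /dev/null
time ./process bigfile.txt
time ./process bigfile.txt
```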
Aug 08 2010