digitalmars.D - Java moves to copying for substrings

"Jesse Phillips" <Jesse.K.Phillips+D gmail.com> writes:
Somewhat interesting: Java has changed substring to return a copy 
of the string data rather than a window onto the underlying 
chars.

http://www.reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string/

"reduce the size of String instances. [...] This was the trigger."

"avoid memory leakage caused by retained substrings holding the 
entire character array."

So apparently substrings were considered a common cause of memory 
leaks. I got the impression most of the comments agreed the 
result is good, but that changing the complexity of substring 
(from O(1) to O(n)) is bad.

I'm not advocating such a change for D.
Nov 18 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Tuesday, 19 November 2013 at 05:38:14 UTC, Jesse Phillips
wrote:
 So apparently substrings were considered a common cause of 
 memory leaks.
I think it is pretty important to remember that a slice, while giving you a small view, still keeps the entire array alive. I think there is nothing wrong with adding a ".dup" every now and then, after slicing something. As a matter of fact, I've been playing around with transcoding strings (UTF-8/16/32). I start by allocating a large buffer to write into. When I'm done, I look at the buffer's usage, and if it's too low, I return a dup of the buffer slice, allowing the GC to reclaim the original buffer. Not only does this take up less memory, but overall, I actually get better run-times too (!)
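A minimal sketch of that "dup if the buffer is mostly empty" step. The helper name `shrinkIfWasteful` and the 50% usage threshold are assumptions made up for illustration, not something from the post:

```d
import std.stdio : writeln;

// Hypothetical helper: returns the used part of `buffer`.
// If the used part is a small fraction of the buffer, a right-sized
// copy is returned so the GC is free to reclaim the big buffer once
// nothing else refers to it.
char[] shrinkIfWasteful(char[] buffer, size_t usedLength, double minUsage = 0.5)
{
    char[] used = buffer[0 .. usedLength];
    if (buffer.length == 0 || cast(double) usedLength / buffer.length >= minUsage)
        return used;      // slice & hold: no copy, keeps the whole buffer alive
    return used.dup;      // slice & dup: new allocation of exactly usedLength chars
}

void main()
{
    auto buffer = new char[](4096);   // generous scratch buffer
    buffer[0 .. 5] = "hello";         // pretend the transcoder wrote 5 chars
    auto result = shrinkIfWasteful(buffer, 5);
    // The copy does not alias the scratch buffer any more.
    writeln(result, " aliases buffer: ", result.ptr is buffer.ptr);
}
```

The threshold is a tuning knob: a higher value copies more often and holds less memory, a lower value avoids copies at the cost of retaining larger blocks.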
Nov 18 2013
"deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 19 November 2013 at 07:36:29 UTC, monarch_dodra wrote:
 On Tuesday, 19 November 2013 at 05:38:14 UTC, Jesse Phillips
 wrote:
 So apparently substrings were considered a common cause of 
 memory leaks.
 I think it is pretty important to remember that a slice, while giving you a small view, still keeps the entire array alive. I think there is nothing wrong with adding a ".dup" every now and then, after slicing something. As a matter of fact, I've been playing around with transcoding strings (UTF-8/16/32). I start by allocating a large buffer to write into. When I'm done, I look at the buffer's usage, and if it's too low, I return a dup of the buffer slice, allowing the GC to reclaim the original buffer. Not only does this take up less memory, but overall, I actually get better run-times too (!)
The problem with Java's string is that you don't have enough control to do this. Instead of giving you the extra control, which is obviously necessary, they dumbed the thing down even more.
Nov 19 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 11/18/2013 11:36 PM, monarch_dodra wrote:
 On Tuesday, 19 November 2013 at 05:38:14 UTC, Jesse Phillips
 wrote:
 So apparently substrings were considered a common cause of memory leaks.
 I think it is pretty important to remember that a slice, while giving you a small view, still keeps the entire array alive. I think there is nothing wrong with adding a ".dup" every now and then, after slicing something.
And D gives you that choice. You can "slice & hold" or "slice & dup". Your choice, as the circumstances dictate. With Java, there is no choice, and they are stuck with one size fits all.
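As a rough illustration of the two options (the function names `sliceAndHold` and `sliceAndDup` are hypothetical, and `.idup` is used instead of `.dup` only so that the copy is still a `string`):

```d
import std.stdio : writeln;

// Slice & hold: O(1), the result is a view into s's memory,
// so it keeps the whole underlying block alive.
string sliceAndHold(string s, size_t from, size_t to)
{
    return s[from .. to];
}

// Slice & dup: O(n) copy into a fresh allocation that is
// independent of s.
string sliceAndDup(string s, size_t from, size_t to)
{
    return s[from .. to].idup;
}

void main()
{
    // A literal lives in static data, so nothing is reclaimed here;
    // the example only shows which result aliases the original.
    string text = "hello world";
    auto held   = sliceAndHold(text, 6, 11);
    auto copied = sliceAndDup(text, 6, 11);

    writeln(held, " shares memory: ", held.ptr is text.ptr + 6);
    writeln(copied, " shares memory: ", copied.ptr is text.ptr + 6);
}
```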
Nov 19 2013
"bearophile" <bearophileHUGS lycos.com> writes:
Jesse Phillips:

 Somewhat interesting, Java has chosen to make substring result 
 in a copy of the string data rather than returning a window of 
 the underlying chars.
I presume in Java slices weren't very common, unlike in D. So I think this is the right design choice for Java (also because those Java strings were too large, with four instance fields), but D is better designed as it is. On the other hand, the idea of putting the hash code inside the string in D was not discussed enough :-) From the discussion, DMD associative arrays were designed like this:
 In Java 8 an improved solution devised by Doug Lea is used. In 
 this solution colliding but Comparable Map keys are placed in a 
 tree rather than a linked list. Performance degenerates to 
 O(log n) for the collisions but this is usually small unless 
 someone is creating keys which intentionally collide or has a 
 very, very bad hashcode implementation, ie. "return 3".

 I am reminded of a denial of service attack that used 
 intentionally colliding request field names/values to attack web 
 servers and bring them to their knees.
Bye, bearophile
Nov 19 2013
"bearophile" <bearophileHUGS lycos.com> writes:
 I presume in Java slices weren't very common,
Please ignore this part. Some answers in that Reddit thread show that some people slice a lot in Java too :-) Bye, bearophile
Nov 19 2013
Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, November 19, 2013 06:38:12 Jesse Phillips wrote:
 Somewhat interesting, Java has chosen to make substring result in
 a copy of the string data rather than returning a window of the
 underlying chars.
 
 http://www.reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string/
 
 "reduce the size of String instances. [...] This was the trigger."
 
 "avoid memory leakage caused by retained substrings holding the
 entire character array."
 
 So apparently substrings were considered a common cause of memory
 leaks. I got the impression most of the comments agreed the
 result is good, but changing the complexity is bad.
 
 I'm not advocating such a change for D.
Yikes. Maybe that's a good idea for Java for some reason, but I'd consider slicing strings to be a _huge_ strength of D. Still, Java's situation is rather different, because all of the slicing stuff is more of an implementation detail than a core feature like it is in D. It _is_ true, however, that if you're not careful about it, you can end up with a lot of slices that keep whole blocks of memory from being collected when they don't really need to refer to that memory anymore. So, depending on what profiling shows, some applications may need to make adjustments to avoid having slices keep too much extraneous memory from being collected. It's something to keep in mind, but I definitely don't think that we should be changing our approach at all. - Jonathan M Davis
Nov 19 2013
"Brian Schott" <briancschott gmail.com> writes:
On Tuesday, 19 November 2013 at 10:38:06 UTC, Jonathan M Davis 
wrote:
 It _is_ true, however, that if you're not careful about it, you can end up
 with a lot of slices that keep whole blocks of memory from being collected
 when they don't really need to refer to that memory anymore. So, depending
 on what profiling shows, some applications may need to make adjustments to
 avoid having slices keep too much extraneous memory from being collected.
 It's something to keep in mind, but I definitely don't think that we should
 be changing our approach at all.

 - Jonathan M Davis
I ended up doing this in DCD. The lexing step sliced the source code, so when caching autocompletion information for all of Phobos, Druntime, etc, the memory usage would get fairly large. Adding a few ".dup" calls in the phase that converts the AST to the autocompletion cache structures greatly reduced the memory usage.
Nov 19 2013
Walter Bright <newshound2 digitalmars.com> writes:
On 11/19/2013 2:37 AM, Jonathan M Davis wrote:
 So, it's
 something to keep in mind, but I definitely don't think that we should be
 changing our approach at all.
D doesn't need to change its approach at all because it offers both options - the user can choose. Note that the article says that some Java apps improved with this change, and some degraded. There is no correct answer.
Nov 19 2013
"Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, November 19, 2013 13:54:37 Walter Bright wrote:
 D doesn't need to change its approach at all because it offers both options
 - the user can choose.
Good point. While I wouldn't say that D's string handling is perfect, it's by far the best that I've ever dealt with. I think that it's one of D's great strengths, and it's easily one of the things that I miss the most when I program in C++. - Jonathan M Davis
Nov 19 2013