digitalmars.D.learn - D-ish way to work with strings?

Robert M. Münch <robert.muench saphirion.com> writes:
I want to do all the basic mutating things with strings: append, 
insert, replace.

What is the D-ish way to do that since string is aliased to immutable(char)[]?

Using arrays, using ~ operator, always copying, changing, combining my 
strings into a new one? Does it make sense to think about reducing GC 
pressure?

I'm a bit lost in the possibilities and don't find any "that's the way 
to do it".

-- 
Robert M. Münch
http://www.saphirion.com
smarter | better | faster
Dec 22 2019
Robert M. Münch <robert.muench saphirion.com> writes:
I want to add that I'm talking about Unicode strings.

Wouldn't it make sense to handle everything as UTF-32 so that iteration 
is simple because code-point = code-unit?

And later on, convert to UTF-16 or UTF-8 on demand?

-- 
Robert M. Münch
http://www.saphirion.com
smarter | better | faster
Dec 22 2019
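
The UTF-32 idea from the post above can be sketched as follows: decode to one element per code point, edit by index, and convert back to UTF-8 only when needed. The function name replaceCodePoint is hypothetical and only for illustration; toUTF32 and toUTF8 are from std.utf. Note that this indexes code points, which is not the same as indexing user-perceived characters.

```d
import std.utf : toUTF8, toUTF32;

/// Replaces the code point at position `index` by going through UTF-32.
/// Hypothetical helper: it assumes `index` is in range and does no
/// grapheme handling at all.
string replaceCodePoint(string s, size_t index, dchar replacement)
{
    dchar[] wide = s.toUTF32.dup;   // decode: one array element per code point
    wide[index] = replacement;      // plain indexing works in UTF-32
    return wide.toUTF8;             // re-encode to UTF-8 on demand
}
```

For example, replaceCodePoint("Gr\u00FCsse", 2, 'u') yields "Grusse", even though the 'ü' occupies two code units in the UTF-8 input.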
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via
Digitalmars-d-learn wrote:
 Want to add I'm talking about unicode strings.
 
 Wouldn't it make sense to handle everything as UTF-32 so that
 iteration is simple because code-point = code-unit?
 
 And later on, convert to UTF-16 or UTF-8 on demand?
[...]

Be careful that code point != "character" the way most people understand the word "character". The word you're looking for is "grapheme", which, unfortunately, is rather complex and very slow to handle in Unicode. See std.uni.byGrapheme.

Usually you want to just stick with UTF-8, or UTF-16 for Windows and Java interop. UTF-32 wastes a lot of space and *still* doesn't give you what you think you want, and Grapheme[] is just dog slow because of the amount of decoding/recoding needed to manipulate it.

What are you planning to do with your strings? IME, using ~ occasionally doesn't add *too* much GC pressure, and slicing is usually the idiomatic way of working with strings in D (it can result in faster code than C because you don't have to keep strcpy()'d copies all over the place).

If you're appending strings a LOT, you might want to consider using std.array.appender in your inner loops to alleviate some of the cost of using ~ too much. Or use lazy evaluation and ranges to defer actually constructing the string until the end, when it's ready to be stored.

Still, this all depends on what you're trying to do with your strings. Elaborate a bit more about your use case, and we might be able to give better advice.

T

-- 
Nobody is perfect. I am Nobody. -- pepoluan, GKC forum
Dec 23 2019
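
A minimal sketch of the std.array.appender suggestion above, for the case where many pieces are appended in a loop. The function joinWith is a hypothetical name chosen for this example; appender, put and data are the Phobos calls.

```d
import std.array : appender;

/// Joins `parts` with `sep` using a single Appender instead of repeated ~.
/// Hypothetical helper for illustration only.
string joinWith(string[] parts, string sep)
{
    auto buf = appender!string();   // growable buffer that amortizes reallocations
    foreach (i, part; parts)
    {
        if (i > 0)
            buf.put(sep);
        buf.put(part);
    }
    return buf.data;                // the accumulated string
}
```

The same loop written with ~= on a plain string also works; the appender mainly helps when the loop is hot enough that the append overhead shows up in profiles.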
Robert M. Münch <robert.muench saphirion.com> writes:
On 2019-12-23 15:05:20 +0000, H. S. Teoh said:

 On Sun, Dec 22, 2019 at 06:27:03PM +0100, Robert M. Münch via 
 Digitalmars-d-learn wrote:
 Want to add I'm talking about unicode strings.
 
 Wouldn't it make sense to handle everything as UTF-32 so that
 iteration is simple because code-point = code-unit?
 
 And later on, convert to UTF-16 or UTF-8 on demand?
 [...] Be careful that code point != "character" the way most people understand the word "character".
I know. My point was that in UTF-8, code points (which are not the same as characters) have different sizes, which you need to take into account if you want to iterate by code point.
 The word you're looking for is "grapheme". Which, unfortunately, is 
 rather complex and very slow to handle in
 Unicode. See std.uni.byGrapheme.
Yes, that's when we come to "characters". And a "grapheme" can consist of several code points. Is grapheme handling just slow in D or in general? If it's the latter, well, then that's just how it is.
 Usually you want to just stick with UTF-8 (usually) or UTF-16 (for
 Windows and Java interop). UTF-32 wastes a lot of space, and *still*
 doesn't give you what you think you want, and Grapheme[] is just dog
 slow because of the amount of decoding/recoding needed to manipulate it.
I need to handle graphemes when things are going to be rendered and edited.
 What are you planning to do with your strings?
Pretty simple: have user-editable content that is rendered using different fonts supporting Unicode. So, all editing functions: insert, replace, delete at all locations in the string, supporting all Unicode characters.

Best regards.

-- 
Robert M. Münch
http://www.saphirion.com
smarter | better | faster
Dec 27 2019
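
The three levels discussed in this exchange (code units, code points, graphemes) can be compared on a single string. This is a small sketch; the three function names are hypothetical, while byDchar (std.utf), byGrapheme (std.uni) and walkLength (std.range) are Phobos functions.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

/// Number of UTF-8 code units (bytes) in the string.
size_t codeUnits(string s) { return s.length; }

/// Number of code points, found by decoding the UTF-8 sequence.
size_t codePoints(string s) { return s.byDchar.walkLength; }

/// Number of grapheme clusters, i.e. user-perceived characters.
size_t graphemes(string s) { return s.byGrapheme.walkLength; }
```

For "e\u0301" (an 'e' followed by a combining acute accent) these give 3 code units, 2 code points and 1 grapheme, which is why neither UTF-8 nor UTF-32 indexing lines up with what a user sees as one character.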
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Dec 27, 2019 at 01:23:57PM +0100, Robert M. Münch via
Digitalmars-d-learn wrote:
 On 2019-12-23 15:05:20 +0000, H. S. Teoh said:
[...]
 What are you planning to do with your strings?
 Pretty simple: Have user editable content that is rendered using different fonts supporting unicode. So, all editing functions: insert, replace, delete at all locations in the string supporting all unicode characters.
[...]

Ah, I see. In that case you might want to consider using graphemes by default, since that's what most closely corresponds to how the user will perceive a "character".

For processing outside of editing, though, you might want to consider converting to some other representation for manipulation, since graphemes are slow (the decoding process is complex, and we can't work around that because that's what Unicode requires).

T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Dec 27 2019
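
One possible way to edit at grapheme boundaries while keeping the text stored as UTF-8 is sketched below. This is only an illustration under stated assumptions, not the approach anyone in the thread committed to: the helper names graphemeOffset, deleteGrapheme and insertAtGrapheme are hypothetical, the grapheme index is assumed to be in range, and every call rescans the text from the start. Only graphemeStride is a std.uni function.

```d
import std.uni : graphemeStride;

/// Code-unit offset at which the n-th grapheme (0-based) starts.
/// Assumes `n` does not exceed the number of graphemes in `s`.
size_t graphemeOffset(const(char)[] s, size_t n)
{
    size_t offset = 0;
    foreach (_; 0 .. n)
        offset += graphemeStride(s, offset); // length of this cluster in code units
    return offset;
}

/// Returns a new buffer with the n-th grapheme removed.
char[] deleteGrapheme(char[] s, size_t n)
{
    immutable start = graphemeOffset(s, n);
    immutable len = graphemeStride(s, start);
    return s[0 .. start] ~ s[start + len .. $];
}

/// Returns a new buffer with `text` inserted before the n-th grapheme.
char[] insertAtGrapheme(char[] s, size_t n, const(char)[] text)
{
    immutable start = graphemeOffset(s, n);
    char[] result = s[0 .. start].dup;
    result ~= text;
    result ~= s[start .. $];
    return result;
}
```

An editor that performs many edits would probably cache the boundary offsets rather than recompute them each time; that bookkeeping is left out here.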
Steven Schveighoffer <schveiguy gmail.com> writes:
On 12/22/19 9:15 AM, Robert M. Münch wrote:
 I want to do all the basics mutating things with strings: append, 
 insert, replace
 
 What is the D-ish way to do that since string is aliased 
 to immutable(char)[]?
Switch to using char[]. Unfortunately, there's a lot of code out there that accepts string instead of const(char)[], which is more usable. I think many people don't realize the purpose of the string type. It's meant to be something that is heap-allocated (or a global) and NEVER goes out of scope. Many things are shoehorned into string which shouldn't be.
 Using arrays, using ~ operator, always copying, changing, combining my 
 strings into a new one? Does it make sense to think about reducing GC 
 pressure?
It really depends on your use cases. Strings are great precisely because they don't change. Slicing makes huge sense there.
 I'm a bit lost in the possibilities and don't find any "that's the way 
 to do it".
Again, use char[] if you are going to be rearranging strings. And you have to take care not to cheat and cast to string. Always use idup if you need one.

If you find Phobos functions that unnecessarily take string instead of const(char)[] please post to bugzilla.

-Steve
Dec 22 2019
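
The advice above can be illustrated with a short sketch: take const(char)[] as the parameter so both string and char[] callers work, do the mutation on a char[] buffer, and use idup (not a cast) when an immutable string is needed at the end. The names upcaseAscii and shout are hypothetical and the ASCII-only upper-casing is just a stand-in for any in-place edit.

```d
/// Upper-cases ASCII letters in place; possible because the buffer is mutable.
void upcaseAscii(char[] buf)
{
    foreach (ref c; buf)
        if (c >= 'a' && c <= 'z')
            c -= 'a' - 'A';
}

/// Accepts string and char[] alike through const(char)[].
string shout(const(char)[] text)
{
    char[] buf = text.dup;  // private mutable copy to work on
    upcaseAscii(buf);
    return buf.idup;        // immutable copy; no cast to string
}
```

Because shout works on its own copy and returns an idup'd result, later changes to a caller's char[] cannot alter the returned string, which is the guarantee a cast would silently break.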
Robert M. Münch <robert.muench saphirion.com> writes:
On 2019-12-22 18:45:52 +0000, Steven Schveighoffer said:

 switch to using char[]. Unfortunately, there's a lot of code out there 
 that accepts string instead of const(char)[], which is more usable. I 
 think many people don't realize the purpose of the string type. It's 
 meant to be something that is heap-allocated (or as a global), and 
 NEVER goes out of scope.
Hi Steve, thanks for the feedback. Makes sense to me.
 It really depends on your use cases. strings are great precisely 
 because they don't change. slicing makes huge sense there.
My "strings" change a lot, so not really a good fit to use string.
 Again, use char[] if you are going to be rearranging strings. And you 
 have to take care not to cheat and cast to string. Always use idup if 
 you need one.
Will do.
 If you find Phobos functions that unnecessarily take string instead of 
 const(char)[] please post to bugzilla.
Ok, will keep an eye on it.

-- 
Robert M. Münch
http://www.saphirion.com
smarter | better | faster
Dec 27 2019