digitalmars.D.announce - fixedstring: a @safe, @nogc string type


Moth <postmaster gmail.com> writes:

hi all.

i got fed up with the built-in string type having so many 
features unavailable in `@nogc` code, so i made my own.

introducing fixedstring: a templated fixed-length array of 
`char`s, compatible with `@safe`, `@nogc`, and `nothrow` code.

licenced under the AGPL-3.0 or later, but i'm open to relicensing 
if someone really really wants it.

have fun =]

https://github.com/Moth-Tolias/fixedstring

special thanks to snarwin on the d discord for convincing me to 
post here.

Jan 10 2022

zjh <fqbqrr 163.com> writes:

On Monday, 10 January 2022 at 12:55:28 UTC, Moth wrote:
 hi all.

Good. Let's use `betterC`.

Jan 10 2022

jmh530 <john.michael.hall gmail.com> writes:

On Monday, 10 January 2022 at 12:55:28 UTC, Moth wrote:
 hi all.

 i got fed up with the built-in string type having so many 
 features unavailable in `@nogc` code, so i made my own.

 introducing fixedstring: a templated fixed-length array of 
 `char`s, compatible with `@safe`, `@nogc`, and `nothrow` code.

 licenced under the AGPL-3.0 or later, but i'm open to 
 relicensing if someone really really wants it.

 have fun =]

 https://github.com/Moth-Tolias/fixedstring

 [snip]

You might add some examples to the Readme.md

Jan 10 2022

Moth <postmaster gmail.com> writes:

On Monday, 10 January 2022 at 13:12:13 UTC, jmh530 wrote:

 You might add some examples to the Readme.md

good observation, i'll work on that. in the meantime the examples 
in the unittests should suffice.

Jan 10 2022

Moth <postmaster gmail.com> writes:

On Monday, 10 January 2022 at 14:06:27 UTC, Moth wrote:
 On Monday, 10 January 2022 at 13:12:13 UTC, jmh530 wrote:

 You might add some examples to the Readme.md

 good observation, i'll work on that. in the meantime the 
 examples in the unittests should suffice.

fixed.

for those who don't want to visit the github just to see the 
change, here's the example code:
```d
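import fixedstring; // module name assumed from the repo's source/fixedstring.d

void main() @safe @nogc nothrow
{
	FixedString!14 foo = "clang";
	foo[0] = 'd';
	foo ~= " is cool";
	assert (foo == "dlang is cool");

	foo.length = 9; // truncates to "dlang is "

	auto bar = FixedString!4("neat");
	assert (foo ~ bar == "dlang is neat");
}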
```

Jan 10 2022

Salih Dincer <salihdb hotmail.com> writes:

On Monday, 10 January 2022 at 12:55:28 UTC, Moth wrote:
 have fun =]

 https://github.com/Moth-Tolias/fixedstring

I tried FixedString and, to my great relief, I got the results I 
expected. Thank you, good luck with your work.

So how can this double character issue be fixed?
```d
FixedString!6 sugar = "şeker"; // in Turkish
assert(sugar[0..3] == "şe");

FixedString!5 şeker = "sugar"; // in English
assert(şeker[0..2] == "su");

assert(sugar.length > şeker.length);
```

How about adding this member to FixedString?

```d
public size_t usefulCapacity() const pure @nogc @safe
{
    return size - _length;
}
```

Jan 10 2022

Moth <postmaster gmail.com> writes:

On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:
 [snip]

glad to hear you're finding it useful! =]

hm, i'm not sure how i would go about fixing that double 
character issue. i know there's currently some weirdness with 
wchars / dchars equality that needs to be fixed [shouldn't be too 
much trouble, just need to set aside the time for it], but i 
think being able to tell how many chars there are in a glyph 
requires unicode awareness? i'll look into it.

what's your use case for usefulCapacity()?

Jan 11 2022

vit <vit vit.vit> writes:

On Tuesday, 11 January 2022 at 11:16:13 UTC, Moth wrote:
 On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:
 [snip]

 glad to hear you're finding it useful! =]

 ... i know there's currently some weirdness with wchars / 
 dchars equality that needs to be fixed [shouldn't be too much 
 trouble...

If you try mixing char/wchar/dchar, you need encoding/decoding 
for utf-8, utf-16 and utf-32 (maybe even LE/BE). It becomes 
complicated very fast...

Jan 11 2022

WebFreak001 <d.forum webfreak.org> writes:

On Tuesday, 11 January 2022 at 11:16:13 UTC, Moth wrote:
 On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:
 [snip]

 glad to hear you're finding it useful! =]

 hm, i'm not sure how i would go about fixing that double 
 character issue. i know there's currently some weirdness with 
 wchars / dchars equality that needs to be fixed [shouldn't be 
 too much trouble, just need to set aside the time for it], but 
 i think being able to tell how many chars there are in a glyph 
 requires unicode awareness? i'll look into it.

 [...]

you can relatively easily find out how many bytes a string takes 
up with `std.utf`. You can also iterate by code points or 
graphemes there if you want to translate some kind of character 
index to byte position.

HOWEVER it's not clear what a character is. Sure for the posted 
cases here it's no problem but when it comes to languages based 
on combining glyphs together to form new glyphs it's no longer 
clear what is a character. There are Graphemes (grapheme 
clusters) which are probably the closest to what everybody would 
think a character is, but IIRC there are edge cases there that a 
programmer wouldn't expect, like adding a character not 
increasing the count of characters of the string because it 
merges with the last Grapheme. Additionally there is a 
performance impact on using Graphemes over simpler things like 
codepoints which fit 98% of use-cases with strings. Codepoints in 
D are mapped 1:1 using dchar, take up to 2 wchars or up to 4 
chars. You can use `std.utf` to compute byte lengths for a 
codepoint given a string.
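
A small sketch of the `std.utf` helpers this refers to, using the "şeker" string from earlier in the thread. The choice of functions (`count`, `stride`, `toUTFindex`, `codeLength`) is my own pick for illustration, not something the post names explicitly.

```d
import std.utf : codeLength, count, stride, toUTFindex;

void main()
{
    string s = "şeker"; // 'ş' is U+015F, which UTF-8 encodes as two bytes

    assert(s.length == 6);              // .length counts code units (bytes for char)
    assert(count(s) == 5);              // number of code points
    assert(stride(s, 0) == 2);          // code units used by the code point at index 0
    assert(toUTFindex(s, 1) == 2);      // code point #1 ('e') starts at byte 2

    assert(codeLength!char('ş') == 2);  // chars needed to encode this code point
    assert(codeLength!wchar('ş') == 1); // wchars needed for the same code point
}
```

Grapheme iteration lives in `std.uni` (`byGrapheme`) rather than `std.utf`.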

I would rather suggest you support FixedString with types other 
than `char`. (wchar, dchar, heck users could even use any 
arbitrary type and use this as array class) For languages that 
commonly use more than 1 byte per codepoint or for interop with 

in general, etc. programmers might opt to use FixedString with 
wchar then.

With D's templates that should be quite easy to do (add a 
template parameter to the struct like `struct FixedString(size_t 
maxSize, CharT = char)` and replace all usage of char in your 
code with `CharT` in this case)
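
Roughly what that suggestion could look like, as a minimal sketch. `FixedStringSketch` and its members are hypothetical names for illustration and are not the library's actual code; attributes such as `@safe` and `@nogc` are left to the compiler's inference for template members.

```d
// Hypothetical sketch: a fixed-capacity string parameterised on the code unit type.
struct FixedStringSketch(size_t maxSize, CharT = char)
{
    private CharT[maxSize] data;
    private size_t _length;

    this(in CharT[] s)
    {
        assert(s.length <= maxSize, "source does not fit");
        data[0 .. s.length] = s[];
        _length = s.length;
    }

    // Number of code units currently stored.
    size_t length() const { return _length; }

    // `str[]` returns a view of the stored code units without copying.
    const(CharT)[] opIndex() const { return data[0 .. _length]; }
}

void main()
{
    auto a = FixedStringSketch!8("şeker");          // CharT defaults to char
    auto w = FixedStringSketch!(8, wchar)("şeker"w);

    static assert(is(typeof(a[]) == const(char)[]));
    static assert(is(typeof(w[]) == const(wchar)[]));

    assert(a.length == 6); // 'ş' needs two UTF-8 code units
    assert(w.length == 5); // but only one UTF-16 code unit
    assert(w[] == "şeker"w);
}
```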

Jan 11 2022

Moth <postmaster gmail.com> writes:

On Tuesday, 11 January 2022 at 12:22:36 UTC, WebFreak001 wrote:
 [snip]

 you can relatively easily find out how many bytes a string 
 takes up with `std.utf`. You can also iterate by code points or 
 graphemes there if you want to translate some kind of character 
 index to byte position.

 HOWEVER it's not clear what a character is. Sure for the posted 
 cases here it's no problem but when it comes to languages based 
 on combining glyphs together to form new glyphs it's no longer 
 clear what is a character. There are Graphemes (grapheme 
 clusters) which are probably the closest to what everybody 
 would think a character is, but IIRC there are edge cases there 
 that a programmer wouldn't expect, like adding a character not 
 increasing the count of characters of the string because it 
 merges with the last Grapheme. Additionally there is a 
 performance impact on using Graphemes over simpler things like 
 codepoints which fit 98% of use-cases with strings. Codepoints 
 in D are mapped 1:1 using dchar, take up to 2 wchars or up to 4 
 chars. You can use `std.utf` to compute byte lengths for a 
 codepoint given a string.

aha, i think i might have miscommunicated here - i was talking 
about an error i thought i was having where a fixedstring of 
`"áéíóú"` wasn't equal to a string literal of the same, but as it 
turned out i was misreading the error message [i had been trying 
to assign a literal larger than the fixedstring could take]. to 
tell the truth, unicode awareness is... not something i really 
want to mess with right now, lol. it would be nice to have the 
option at some point in the future though.

 I would rather suggest you support FixedString with types other 
 than `char`. (wchar, dchar, heck users could even use any 
 arbitrary type and use this as array class) For languages that 
 commonly use more than 1 byte per codepoint or for interop with 

 in general, etc. programmers might opt to use FixedString with 
 wchar then.

 With D's templates that should be quite easy to do (add a 
 template parameter to the struct like `struct 
 FixedString(size_t maxSize, CharT = char)` and replace all 
 usage of char in your code with `CharT` in this case)


[i've pushed an update to the repo for 
this!](https://github.com/Moth-Tolias/fixedstring/releases/tag/v1.1.0) =] it
was a bit more complicated than a simple replace all, but not too hard.

Jan 12 2022

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Tue, Jan 11, 2022 at 11:16:13AM +0000, Moth via Digitalmars-d-announce wrote:
 On Tuesday, 11 January 2022 at 03:20:22 UTC, Salih Dincer wrote:
 [snip]

 
 glad to hear you're finding it useful! =]

One minor usability issue I found just glancing over the code: many of
your methods take char[] as argument. Generally, you want const(char)[]
instead, so that it will work with both char[] and immutable(char)[].
No reason why you can't copy some immutable chars into a FixedString,
for example.

Another potential issue is with the range interface. Your .popFront is
implemented by copying the entire buffer 1 char forwards, which can
easily become a hidden performance bottleneck. Iteration over a
FixedString currently is O(N^2), which is a problem if performance is
your concern.

Generally, I'd advise not conflating your containers with ranges over
your containers: I'd make .opSlice return a traditional D slice (i.e.,
const(char)[]) instead of a FixedString, and just require writing `[]`
when you need to iterate over the string as a range:

	FixedString!64 mystr;
	foreach (ch; mystr[]) { // <-- iterates over const(char)[]
		...
	}

This way, no redundant copying of data is done during iteration.
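
To make the cost difference concrete, here is a sketch that counts the copies made by a shifting `popFront` and compares it with advancing a slice. `ShiftingRange` is a hypothetical stand-in written for this illustration; it is not taken from the library.

```d
// Hypothetical range that drops its first element by shifting the buffer left.
struct ShiftingRange
{
    char[64] data;
    size_t len;
    size_t moves; // counts every char copied by popFront

    bool empty() const { return len == 0; }
    char front() const { return data[0]; }

    void popFront()
    {
        foreach (i; 1 .. len)
        {
            data[i - 1] = data[i];
            ++moves;
        }
        --len;
    }
}

void main()
{
    ShiftingRange r;
    r.len = 64;
    while (!r.empty)
        r.popFront();
    assert(r.moves == 64 * 63 / 2); // 2016 copies to walk 64 chars

    char[64] buf = 'x';
    const(char)[] s = buf[];
    size_t steps;
    while (s.length)
    {
        s = s[1 .. $]; // only the slice's pointer and length change
        ++steps;
    }
    assert(steps == 64); // one constant-time step per char, no copying
}
```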

Another issue is the way concatenation is implemented. Since
FixedStrings have compile-time size, this potentially means every time
you concatenate a string in your code you get another instantiation of
FixedString. This can lead to a LOT of template bloat if you're not
careful, which may quickly outweigh any benefits you may have gained
from not using the built-in strings.


 hm, i'm not sure how i would go about fixing that double character
 issue. i know there's currently some weirdness with wchars / dchars
 equality that needs to be fixed [shouldn't be too much trouble, just
 need to set aside the time for it], but i think being able to tell how
 many chars there are in a glyph requires unicode awareness? i'll look
 into it.

[...]

Yes, you will require Unicode-awareness, and no, it will NOT be as
simple as you imagine.

First of all, you have the wide-character issue: if you're dealing with
anything outside of the ASCII range, you will need to deal with code
points (potentially wchar, dchar).  You can either take the lazy way out
(FixedString!(n, wchar), FixedString!(n, dchar)), but that will
exacerbate your template bloat very quickly. Plus, it wastes a lot of
memory, esp. if you start using dchar[] -- 4 bytes per character
potentially makes ASCII strings use up 4x more memory. (And even if you
decide using dchar[] isn't a concern, there's still the issue of
graphemes -- see below, which requires non-trivial decoding anyway.)

Or you can handle UTF-8, which is a better solution in terms of memory
usage. But then you will immediately run into the encoding/decoding
problem. Your .opSlice, for example, will not work correctly unless you
auto-decode. But that will be a performance hit -- this is one of the
design mistakes in hindsight that's still plaguing Phobos today. IMO the
better approach is to iterate over the string *without* decoding, but
just detecting codepoint boundaries.  Regardless, you will need *some*
way of iterating over code points instead of code units in order to deal
with this properly.
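
One way to find code point boundaries without decoding is to look only at the top two bits of each byte: UTF-8 continuation bytes always have the form `10xxxxxx`. The sketch below assumes the input is valid UTF-8; the function names are made up for this example.

```d
/// Counts code points by counting the bytes that are not continuation bytes.
size_t countCodePoints(const(char)[] s) @safe @nogc nothrow pure
{
    size_t n;
    foreach (char c; s) // iterates code units, no decoding
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

/// Moves index i back to the nearest code point boundary at or before it.
size_t alignToBoundary(const(char)[] s, size_t i) @safe @nogc nothrow pure
{
    while (i > 0 && i < s.length && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

void main()
{
    string s = "şeker"; // bytes: C5 9F 65 6B 65 72

    assert(countCodePoints(s) == 5);
    assert(alignToBoundary(s, 1) == 0); // index 1 is inside 'ş'
    assert(alignToBoundary(s, 2) == 2); // index 2 is the start of 'e'
}
```

A slicing operation could use something like `alignToBoundary` to avoid cutting a code point in half, at the price of a few extra byte checks per slice.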

But that's only the beginning of the story. In Unicode, a "code point"
is NOT what most people imagine a "character" is. For most European
languages the two do coincide, but once you go outside of that, you'll
start finding things like accented characters that are composed of
multiple code points.  In Unicode, that's called a Grapheme, and here's
the bad news: the length of a Grapheme is technically unbounded (even
though in practice it's usually 2 or occasionally 3 -- but you *will*
find more on rare occasions). And worst of all, determining the length
of a grapheme requires an expensive, non-trivial algorithm that will
KILL your performance if you blindly do it every time you traverse your
string.

And generally, you don't *want* to do grapheme segmentation anyway --
most code doesn't even care what the graphemes are, it just wants to
treat strings as opaque data that you may occasionally want to segment
into substrings (and substrings don't necessarily require grapheme
segmentation to compute, depending on what the final goal is). But
occasionally you *will* need grapheme segmentation (e.g., if you need to
know how many visual "characters" there are in a string); for that, you
will need std.uni. And no, it's not something you can implement
overnight.  It requires some heavy-duty lookup tables and a (very
careful!) implementation of TR29 (Unicode Text Segmentation).

Because of the foregoing, you have at least 4 different definitions of
the length of the string:

1. The number of code units it occupies, i.e., the number of chars /
wchars / dchars.

2. The number of code points it contains, which, in UTF-8, is a
non-trivial quantity that requires iterating over the entire string to
compute. Or you can just use wchar[] or dchar[], but then your memory
footprint will increase, potentially up to 4x.

3. The number of graphemes it contains, i.e., how many "visual
characters" (the way most people understand the word "character") it
contains. This requires grapheme segmentation, is expensive to compute,
and generally shouldn't be done unless you have some concrete reason why
you want to do this.

4. The rendered width of the string, i.e., how much space it occupies if
displayed on the screen. Even on a monospace-font text terminal, this is
a non-trivial quantity because some Unicode codepoints are double-width
(e.g., East Asian block), and some are *zero*-width (e.g., shy hyphens,
zero-width breaking spaces). And it depends on how your terminal
emulator renders these characters (what Unicode defines as a
double-width may not necessarily be rendered that way).  And of course,
on a GUI application measuring the length of a string requires font
details.
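
The first three definitions can be compared directly with Phobos. The sample string below is my own choice: "noël" written with a combining diaeresis, so the three counts all differ. Rendered width (definition 4) depends on the terminal or font and is not covered here.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

void main()
{
    // n, o, e, U+0308 (combining diaeresis), l
    string s = "noe\u0308l";

    assert(s.length == 6);                // 1. code units: U+0308 takes two bytes
    assert(s.byDchar.walkLength == 5);    // 2. code points
    assert(s.byGrapheme.walkLength == 4); // 3. graphemes: 'e' + U+0308 count as one
}
```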

Welcome to the *cough* wonderful world of Unicode, where everything is
possible but nothing is simple. :-D


T

-- 
This sentence is false.

Jan 11 2022

Paul Backus <snarwin gmail.com> writes:

On Tuesday, 11 January 2022 at 17:55:28 UTC, H. S. Teoh wrote:
 Generally, I'd advise not conflating your containers with 
 ranges over your containers: I'd make .opSlice return a 
 traditional D slice (i.e., const(char)[]) instead of a 
 FixedString, and just require writing `[]` when you need to 
 iterate over the string as a range:

 	FixedString!64 mystr;
 	foreach (ch; mystr[]) { // <-- iterates over const(char)[]
 		...
 	}

 This way, no redundant copying of data is done during iteration.

It already does this. In D2, `[]` is handled by a zero-argument 
`opIndex` overload, not by `opSlice`. [1] FixedString has such an 
overload [2], and it does, in fact, return a slice.

[1] https://dlang.org/spec/operatoroverloading.html#slice
[2] 
https://github.com/Moth-Tolias/fixedstring/blob/v1.0.0/source/fixedstring.d#L105
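
A minimal sketch of that mechanism. `Buffer16` is a hypothetical type for illustration, not the library's code; the point is only that `buf[]` calls the zero-argument `opIndex` and gets back a view into the existing storage.

```d
// Hypothetical fixed buffer exposing its contents through `[]`.
struct Buffer16
{
    private char[16] data;
    private size_t len;

    this(const(char)[] s)
    {
        assert(s.length <= data.length, "source does not fit");
        data[0 .. s.length] = s[];
        len = s.length;
    }

    // `buf[]` is rewritten by the compiler to `buf.opIndex()`.
    const(char)[] opIndex() const { return data[0 .. len]; }
}

void main()
{
    auto buf = Buffer16("dlang");

    size_t n;
    foreach (ch; buf[]) // iterates over a const(char)[] slice
        ++n;
    assert(n == 5);

    // The slice points into the struct's own storage: nothing was copied.
    assert(buf[].ptr == buf.data.ptr);
}
```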

Jan 11 2022

Moth <postmaster gmail.com> writes:

On Tuesday, 11 January 2022 at 17:55:28 UTC, H. S. Teoh wrote:
 [snip]

 One minor usability issue I found just glancing over the code: 
 many of your methods take char[] as argument. Generally, you 
 want const(char)[] instead, so that it will work with both 
 char[] and immutable(char)[]. No reason why you can't copy some 
 immutable chars into a FixedString, for example.

they should all already be `in char[]`? i've added a test to 
confirm it works with both `char[]` and `immutable(char)[]` and 
it compiles fine.
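
For readers following along, a small sketch of why `in char[]` accepts both: an `in` parameter is treated as `const`, and both mutable and immutable arrays convert implicitly to `const(char)[]`. `countSpaces` is a made-up function for this example, not part of the library.

```d
// Hypothetical function taking its argument as `in char[]`.
size_t countSpaces(in char[] s) @safe @nogc nothrow pure
{
    size_t n;
    foreach (char c; s)
        if (c == ' ')
            ++n;
    return n;
}

void main()
{
    char[] mutableText = "a b c".dup; // char[]
    string immutableText = "d e";     // immutable(char)[]

    assert(countSpaces(mutableText) == 2);
    assert(countSpaces(immutableText) == 1);
}
```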

 [snip]
 Another issue is the way concatenation is implemented. Since 
 FixedStrings have compile-time size, this potentially means 
 every time you concatenate a string in your code you get 
 another instantiation of FixedString. This can lead to a LOT of 
 template bloat if you're not careful, which may quickly 
 outweigh any benefits you may have gained from not using the 
 built-in strings.

oh dear, that doesn't sound good. i hadn't considered that at 
all. i'm not sure how to even begin going about fixing that...

Jan 12 2022

Paul Backus <snarwin gmail.com> writes:

On Wednesday, 12 January 2022 at 19:55:41 UTC, Moth wrote:
 [snip]
 Another issue is the way concatenation is implemented. Since 
 FixedStrings have compile-time size, this potentially means 
 every time you concatenate a string in your code you get 
 another instantiation of FixedString. This can lead to a LOT 
 of template bloat if you're not careful, which may quickly 
 outweigh any benefits you may have gained from not using the 
 built-in strings.

 oh dear, that doesn't sound good. i hadn't considered that at 
 all. i'm not sure how to even begin going about fixing that...

One thing you could potentially do is to round the size of the 
result up to, say, a power of two. That way, instead of 
instantiating a new template for every individual string length, 
you only instantiate `FixedString!16`, `FixedString!32`, 
`FixedString!64`, etc.

Of course, doing this will leave you with some wasted memory at 
runtime. So you will probably want to run some benchmarks to 
compare performance before and after.
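
A sketch of the rounding idea, assuming a simplified stand-in type. `Fixed`, `roundUpPow2` and `concat` are hypothetical names for this illustration. Note that the `concat` function template is still instantiated once per pair of input sizes; what shrinks is the number of distinct result types.

```d
// Rounds n up to the next power of two; usable at compile time (CTFE).
size_t roundUpPow2(size_t n) @safe @nogc nothrow pure
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

// Hypothetical stand-in for a fixed-capacity string.
struct Fixed(size_t cap)
{
    char[cap] data;
    size_t len;
}

// The result capacity is rounded up, so many input pairs share one result type.
auto concat(size_t a, size_t b)(Fixed!a lhs, Fixed!b rhs)
{
    Fixed!(roundUpPow2(a + b)) result;
    result.data[0 .. lhs.len] = lhs.data[0 .. lhs.len];
    result.data[lhs.len .. lhs.len + rhs.len] = rhs.data[0 .. rhs.len];
    result.len = lhs.len + rhs.len;
    return result;
}

void main()
{
    Fixed!14 foo;
    foo.data[0 .. 9] = "dlang is ";
    foo.len = 9;

    Fixed!4 bar;
    bar.data[] = "neat";
    bar.len = 4;

    auto r = concat(foo, bar);
    static assert(is(typeof(r) == Fixed!32)); // 14 + 4 = 18, rounded up to 32
    assert(r.data[0 .. r.len] == "dlang is neat");

    // A different pair of sizes lands on the same result type.
    static assert(is(typeof(concat(Fixed!10.init, Fixed!7.init)) == Fixed!32));
}
```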

Jan 12 2022

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, Jan 12, 2022 at 07:55:41PM +0000, Moth via Digitalmars-d-announce wrote:
 On Tuesday, 11 January 2022 at 17:55:28 UTC, H. S. Teoh wrote:

[...]
 One minor usability issue I found just glancing over the code: many
 of your methods take char[] as argument. Generally, you want
 const(char)[] instead, so that it will work with both char[] and
 immutable(char)[]. No reason why you can't copy some immutable chars
 into a FixedString, for example.

 
 they should all already be `in char[]`? i've added a test to confirm
 it works with both `char[]` and `immutable(char)[]` and it compiles
 fine.

[...]

Oh you're right!  I totally missed that.  Sorry, my bad.


T

-- 
Talk is cheap. Whining is actually free. -- Lars Wirzenius

Jan 12 2022

Moth <postmaster gmail.com> writes:

now available on dub: https://code.dlang.org/packages/fixedstring

Jan 12 2022
