digitalmars.D.learn - Thin UTF8 string wrapper

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

Hello folks,

I have a use-case that involves wanting to create a thin struct 
wrapper of underlying string data (the idea is to have a type 
that guarantees that the string has certain desirable properties).

The string is required to be valid UTF-8.  The question is what 
the most useful API is to expose from the wrapper: a sliceable 
random-access range?  A getter plus `alias this` to just treat it 
like a normal string from the reader's point of view?

One factor that I'm not sure how to address w.r.t. a full range 
API is how to handle iterating over elements: presumably they 
should be iterated over as `dchar`, but how to implement a 
`front` given that `std.encoding` gives no way to decode the 
initial element of the string that doesn't also pop it off the 
front?

I'm also slightly disturbed to see that `std.encoding.codePoints` 
requires `immutable(char)[]` input: surely it should operate on 
any range of `char`?

I'm inclining towards the "getter + `alias this`" approach, but I 
thought I'd throw the problem out here to see if anyone has any 
good experience and/or advice.

Thanks in advance for any thoughts!

All the best,

      -- Joe

Dec 06 2019

Jonathan Marler <johnnymarler gmail.com> writes:

On Friday, 6 December 2019 at 16:48:21 UTC, Joseph Rushton 
Wakeling wrote:
> Hello folks,
>
> I have a use-case that involves wanting to create a thin struct 
> wrapper of underlying string data (the idea is to have a type 
> that guarantees that the string has certain desirable 
> properties).
>
> The string is required to be valid UTF-8.  The question is what 
> the most useful API is to expose from the wrapper: a sliceable 
> random-access range?  A getter plus `alias this` to just treat 
> it like a normal string from the reader's point of view?
>
> One factor that I'm not sure how to address w.r.t. a full range 
> API is how to handle iterating over elements: presumably they 
> should be iterated over as `dchar`, but how to implement a 
> `front` given that `std.encoding` gives no way to decode the 
> initial element of the string that doesn't also pop it off the 
> front?
>
> I'm also slightly disturbed to see that 
> `std.encoding.codePoints` requires `immutable(char)[]` input: 
> surely it should operate on any range of `char`?
>
> I'm inclining towards the "getter + `alias this`" approach, but 
> I thought I'd throw the problem out here to see if anyone has 
> any good experience and/or advice.
>
> Thanks in advance for any thoughts!
>
> All the best,
>
>      -- Joe

Good questions. I don't have answers to them all but I hope this 
information is helpful.

I use wrapper structs to represent properties in this way as 
well.  For example my  "mar" library has the SentinelPtr and 
SentinelArray types which guarantee that the underlying pointer 
and/or array is terminated by some value (i.e. like a 
null-terminated C string).

If I'm creating and using these wrapper types inside a 
self-contained program then I don't really care about API 
compatibility, so I would use a simple, powerful mechanism like 
"alias this".  For libraries where the API boundary is important, 
I implement the most limited API I can.  The reason is that this 
lets you see every possible interaction with the type.  
This way, when you need to change the API you know all the 
existing ways it can be interacted with and iterate on the API 
design appropriately.  This is the case for SentinelPtr and 
SentinelArray.  For this case I only implement the operations I 
know are being used, and I made this easy by creating a simple 
module I call "wrap.d" 
(https://github.com/dragon-lang/mar/blob/master/src/mar/wrap.d).

If you have a struct that wraps a string and guarantees it's UTF8 
encoded, wrap.d lets you declare that it's a wrapper type and 
allows you to mixin the operations you want to expose like this:

struct Utf8String
{
     private string str;
     import mar.wrap;

     // this verifies the size of the wrapper struct and the
     // underlying field are the same, and creates the
     // wrappedValueRef method that the other wrapper mixins use
     // to access the underlying wrapped value
     mixin WrapperFor!"str";

     // Now you can mixin different operations, for example
     mixin WrapOpCast;
     mixin WrapOpIndex;
     mixin WrapOpSlice;
}


On the topic of immutable(char)[] vs const(char)[]. If a function 
takes const data, I take it to mean that the function won't 
change the data.  If it takes immutable data, I take it to mean 
that the function won't change it AND the caller must ensure data 
won't change while the function has it.  However, in practice, 
functions that require immutable data still declare their data to 
be "const" instead of "immutable".  I think this is because 
declaring it as immutable would require extra boiler-plate all 
over your code to cast data to immutable all the time.  So most 
functions end up using const even though they require immutable.
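
A small illustration of that difference, as a sketch only: the function names below are made up for this example and are not from mar or Phobos. A `const(char)[]` parameter accepts mutable and immutable data alike, whereas a `string` parameter makes a caller holding a mutable buffer copy it (or cast) first.

```d
// Accepts mutable, const, and immutable character data.
size_t countSpaces(const(char)[] s)
{
    size_t n;
    foreach (char c; s)
        if (c == ' ')
            ++n;
    return n;
}

// Accepts only immutable data (string is an alias for immutable(char)[]).
size_t countSpacesImmutable(string s)
{
    return countSpaces(s);
}

void main()
{
    char[] buffer = "a b c".dup; // mutable copy
    string literal = "a b c";    // immutable

    assert(countSpaces(buffer) == 2);
    assert(countSpaces(literal) == 2);
    assert(countSpacesImmutable(literal) == 2);

    // A mutable buffer does not convert implicitly to immutable;
    // the caller has to copy it (idup) or vouch for it with a cast.
    static assert(!__traits(compiles, countSpacesImmutable(buffer)));
    assert(countSpacesImmutable(buffer.idup) == 2);
}
```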

Dec 06 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Friday, December 6, 2019 9:48:21 AM MST Joseph Rushton Wakeling via 
Digitalmars-d-learn wrote:
> Hello folks,
>
> I have a use-case that involves wanting to create a thin struct
> wrapper of underlying string data (the idea is to have a type
> that guarantees that the string has certain desirable properties).
>
> The string is required to be valid UTF-8.  The question is what
> the most useful API is to expose from the wrapper: a sliceable
> random-access range?  A getter plus `alias this` to just treat it
> like a normal string from the reader's point of view?
>
> One factor that I'm not sure how to address w.r.t. a full range
> API is how to handle iterating over elements: presumably they
> should be iterated over as `dchar`, but how to implement a
> `front` given that `std.encoding` gives no way to decode the
> initial element of the string that doesn't also pop it off the
> front?
>
> I'm also slightly disturbed to see that `std.encoding.codePoints`
> requires `immutable(char)[]` input: surely it should operate on
> any range of `char`?
>
> I'm inclining towards the "getter + `alias this`" approach, but I
> thought I'd throw the problem out here to see if anyone has any
> good experience and/or advice.
>
> Thanks in advance for any thoughts!

The module to look at here is std.utf, not std.encoding. decode and
decodeFront can be used to get a code point if that's what you want, whereas
byCodeUnit and byUTF can be used to get a range over code units or code
points. There's also byCodePoint and byGrapheme in std.uni. std.encoding is
old and arguably needs an overhaul. I don't think that I've ever done
anything with it other than for dealing with BOMs.
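
To make the `front` question concrete, here is a minimal sketch. It assumes the data is already known to be valid UTF-8, and the `CodePoints` struct is a hypothetical name used only for illustration. `std.utf.decode` advances an index rather than the string, so it can peek at the first code point, while `decodeFront` pops what it decodes.

```d
import std.utf : decode, decodeFront, stride;

// Hypothetical range over the code points of a valid UTF-8 string.
struct CodePoints
{
    private string data;

    @property bool empty() { return data.length == 0; }

    // Peek at the first code point: decode only advances the local index.
    @property dchar front()
    {
        size_t index = 0;
        return decode(data, index);
    }

    // Drop the code units that make up the first code point.
    void popFront()
    {
        data = data[stride(data, 0) .. $];
    }
}

void main()
{
    string s = "héllo";

    size_t index = 0;
    assert(decode(s, index) == 'h');
    assert(index == 1 && s == "héllo"); // the string itself is untouched

    string rest = s;
    assert(decodeFront(rest) == 'h');
    assert(rest == "éllo");             // decodeFront consumed the 'h'

    dchar[] seen;
    foreach (dchar c; CodePoints(s))
        seen ~= c;
    assert(seen == "héllo"d);
}
```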

If you provide a range of UTF-8 code units, then it will just work with any
code that's written to work with a range of any character type, whereas if
you specifically need to have it be a range of code points or graphemes,
then using the wrappers from std.utf or std.uni will get you that. And there
really isn't any reason to restrict the operations on a range of char the
way that std.range.primitives does for string. If you're dealing with a
function that was specifically written to operate on any range of
characters, then it's unnecessary, and if it's just a normal range-based
function which isn't specialized for ranges of characters, then it's going
to iterate over whatever the element type of the range is. So, you'll need
to use a wrapper like byUTF, byCodePoint, or byGrapheme to get whatever the
correct behavior is depending on what you're trying to do.
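
As a rough illustration of how those wrappers differ, the sketch below counts the elements of the same string at three levels. The sample string is an assumption chosen for the example: an "e" followed by a combining acute accent.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byUTF;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one user-perceived character, two code points, three UTF-8 code units.
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3);  // range of char
    assert(s.byUTF!dchar.walkLength == 2); // range of dchar (code points)
    assert(s.byGrapheme.walkLength == 1);  // range of Grapheme
}
```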

The main hiccup is that a lot of Phobos is basically written with the idea
that ranges of characters will be ranges of dchar. Some of Phobos has been
fixed so that it doesn't, but plenty of it hasn't been. However, what that
usually means is that the code just operates on the element type and
special-cases for narrow strings, or it's specifically written to operate on
ranges of dchar. For cases like that, byUTF!dchar or byCodePoint will likely
work; alternatively, you can provide a way to access the underlying string
and just have them operate directly on the string, but depending on what
you're trying to do with your wrapper, exposing the underlying string may or
may not be a problem (given that string has immutable elements though, it's
probably fine so long as you don't provide a reference to the string
itself).

In general, I'd strongly advise against using alias this with range-based
code (or really, generic code in general). Depending, it _can_ work, but
it's also an easy source of bugs. Unless the code forces the conversion,
what you can easily get is some of the code operating directly on the type
and some of it doing the implicit conversion to operate on the type. Best
case, that results in compilation errors, but it could also result in subtle
bugs. It's far less error-prone to require that the conversion be done
explicitly.

So, if all you're really trying to do is provide some guarantees about how
the string was constructed but then are looking to essentially just have it
be a string after that, it would probably be simplest to make it so that
your wrapper type doesn't have much in the way of operations and that it
just provides a property to access the underlying string. Then the type
itself isn't a range, and any code that wants to operate on the data can
just use the property to get the underlying string and use it as a string
after that. That approach basically completely sidesteps the issue of how to
treat the data as a range, since you get the normal behavior for strings for
any code that does much more than just pass around the data. You _do_ lose
the knowledge that the wrapper type gave you about the state of the string
once you start actually operating on the data, but once you start operating
on it, that knowledge is probably no longer valid anyway (especially if
you're passing it to a function which is going to return a wrapper range to
mutate the elements in the range rather than something like find which just
looks at the range).
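
A minimal sketch of that approach, assuming validation at construction is the only guarantee needed. The type name `ValidUtf8` and the property name `value` are hypothetical; `validate` is the existing `std.utf` function, which throws `UTFException` on invalid input.

```d
import std.algorithm.searching : canFind;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical wrapper: not a range, just a checked string with a getter.
struct ValidUtf8
{
    private string data;

    this(string s)
    {
        validate(s); // throws UTFException if s is not valid UTF-8
        data = s;
    }

    // Explicit access to the underlying string; no alias this.
    @property string value() const { return data; }
}

void main()
{
    auto greeting = ValidUtf8("grüß dich");

    // Callers convert explicitly, then use normal string operations.
    assert(greeting.value.canFind("dich"));

    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    immutable(ubyte)[] raw = [0xC3, 0x28];
    assertThrown!UTFException(ValidUtf8(cast(string) raw));
}
```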

- Jonathan M Davis

Dec 06 2019

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Saturday, 7 December 2019 at 03:23:00 UTC, Jonathan M Davis 
wrote:
> The module to look at here is std.utf, not std.encoding.

Hmmm, docs may need updating then -- several functions in 
`std.encoding` explicitly state they are replacements for 
`std.utf`.  Did you mean `std.uni`?

It is honestly a bit confusing which of these 3 modules to use, 
especially as they each offer different (and useful) tools.  For 
example, `std.utf.validate` is less useful than 
`std.encoding.isValid`, because it throws rather than returning a 
bool and giving the user the choice of behaviour.  `std.uni` 
doesn't seem to have any equivalent for either.

Thanks in any case for the as-ever characteristically detailed 
and useful advice :-)

Dec 07 2019

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Saturday, December 7, 2019 5:23:30 AM MST Joseph Rushton Wakeling via 
Digitalmars-d-learn wrote:
> On Saturday, 7 December 2019 at 03:23:00 UTC, Jonathan M Davis wrote:
>> The module to look at here is std.utf, not std.encoding.
>
> Hmmm, docs may need updating then -- several functions in
> `std.encoding` explicitly state they are replacements for
> `std.utf`.  Did you mean `std.uni`?
>
> It is honestly a bit confusing which of these 3 modules to use,
> especially as they each offer different (and useful) tools.  For
> example, `std.utf.validate` is less useful than
> `std.encoding.isValid`, because it throws rather than returning a
> bool and giving the user the choice of behaviour.  `std.uni`
> doesn't seem to have any equivalent for either.
>
> Thanks in any case for the as-ever characteristically detailed
> and useful advice :-)

There may have been some tweaks to std.encoding here and there, but for the
most part, it's pretty ancient. Looking at the history, it's Seb who marked
some of it as being a replacement for std.utf, which is just plain wrong.
Phobos in general uses std.utf for dealing with UTF-8, UTF-16, and UTF-32,
not std.encoding. std.encoding is an old module that's had some tweaks done
to it but which probably needs a pretty serious overhaul. The only thing
that I've ever used it for is BOM stuff.

std.utf.validate does need a replacement, but doing so gets pretty
complicated. And looking at std.encoding.isValid, I'm not sure that what it
does is any better than simply wrapping std.utf.validate and returning a
bool based on whether an exception was thrown. Depending on the string, it
would actually be faster to use validate, because std.encoding.isValid
iterates through the entire string regardless. The way it checks validity is
also completely different from what std.utf does. Either way, some of the
std.encoding internals do seem to be an alternate implementation of what
std.utf has, but outside of std.encoding itself, std.utf is what Phobos uses
for UTF-8, UTF-16, and UTF-32, not std.encoding.
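
The wrapping described above could look roughly like the following sketch. The helper name `isValidViaValidate` is hypothetical and is not part of Phobos. Note that it still constructs and throws an exception internally whenever the input is invalid.

```d
import std.utf : UTFException, validate;

// Hypothetical helper: turns validate's exception into a bool.
bool isValidViaValidate(const(char)[] s)
{
    try
    {
        validate(s); // stops at the first invalid sequence by throwing
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidViaValidate("plain ASCII"));
    assert(isValidViaValidate("grüß"));

    // 0xC3 must be followed by a continuation byte; 0x28 is not one.
    immutable(ubyte)[] raw = [0xC3, 0x28];
    assert(!isValidViaValidate(cast(const(char)[]) raw));
}
```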

I did do a PR at one point to add isValidUTF to std.utf so that we could
replace std.utf.validate, but Andrei didn't like the implementation, so it
didn't get merged, and I haven't gotten around to figuring out how to
implement it more cleanly.

- Jonathan M Davis

Dec 07 2019

Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:

On Saturday, 7 December 2019 at 15:57:14 UTC, Jonathan M Davis 
wrote:
> There may have been some tweaks to std.encoding here and there, 
> but for the most part, it's pretty ancient. Looking at the 
> history, it's Seb who marked some of it as being a replacement 
> for std.utf, which is just plain wrong.

Ouch!  I must say it was a surprise to read, precisely because 
std.encoding seemed weird and clunky.  Good to know that it's 
misleading.

Unfortunately that adds to the list I have of weirdly misleading 
docs that seem to have crept in over the last months/years :-(

> std.utf.validate does need a replacement, but doing so gets 
> pretty complicated. And looking at std.encoding.isValid, I'm 
> not sure that what it does is any better than simply wrapping 
> std.utf.validate and returning a bool based on whether an 
> exception was thrown.

Unfortunately I'm dealing with a use case where exception 
throwing (and indeed, anything that generates garbage) is 
preferred to be avoided.  That's why I was looking for a function 
that returned a bool ;-)
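
For what it's worth, a check with no exceptions and no GC allocation can also be written by hand. The following is only a sketch under that assumption: `isValidUtf8NoThrow` is a made-up name, not a Phobos function, and it has not been run against a full conformance suite. It rejects stray or missing continuation bytes, truncated sequences, overlong encodings, surrogates, and values above U+10FFFF.

```d
// Hypothetical hand-rolled check: returns a bool, never throws, never allocates.
bool isValidUtf8NoThrow(const(char)[] s) @safe pure nothrow @nogc
{
    size_t i = 0;
    while (i < s.length)
    {
        immutable uint b0 = s[i];
        if (b0 < 0x80) // single-byte (ASCII) code point
        {
            ++i;
            continue;
        }

        size_t len; // total length of the sequence in code units
        uint cp;    // code point accumulated so far
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false; // stray continuation byte or invalid lead byte

        if (s.length - i < len)
            return false; // sequence is cut off by the end of the data

        foreach (k; 1 .. len)
        {
            immutable uint b = s[i + k];
            if ((b & 0xC0) != 0x80)
                return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Smallest code point that legitimately needs this many bytes.
        immutable uint minCp = len == 2 ? 0x80 : len == 3 ? 0x800 : 0x10000;
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false; // overlong, out of range, or a surrogate

        i += len;
    }
    return true;
}

// Test helper: reinterpret raw bytes as character data.
const(char)[] asChars(const(ubyte)[] raw) { return cast(const(char)[]) raw; }

void main()
{
    assert(isValidUtf8NoThrow(""));
    assert(isValidUtf8NoThrow("grüß dich"));
    assert(isValidUtf8NoThrow("\U0001F600"));              // four-byte sequence

    assert(!isValidUtf8NoThrow(asChars([0x80])));             // stray continuation
    assert(!isValidUtf8NoThrow(asChars([0xC3, 0x28])));       // bad continuation
    assert(!isValidUtf8NoThrow(asChars([0xE2, 0x82])));       // truncated
    assert(!isValidUtf8NoThrow(asChars([0xC0, 0x80])));       // overlong
    assert(!isValidUtf8NoThrow(asChars([0xED, 0xA0, 0x80]))); // surrogate
}
```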

> Depending on the string, it would actually be faster to use 
> validate, because std.encoding.isValid iterates through the 
> entire string regardless. The way it checks validity is also 
> completely different from what std.utf does. Either way, some 
> of the std.encoding internals do seem to be an alternate 
> implementation of what std.utf has, but outside of std.encoding 
> itself, std.utf is what Phobos uses for UTF-8, UTF-16, and 
> UTF-32, not std.encoding.

Thanks -- good to know.

> I did do a PR at one point to add isValidUTF to std.utf so that 
> we could replace std.utf.validate, but Andrei didn't like the 
> implementation, so it didn't get merged, and I haven't gotten 
> around to figuring out how to implement it more cleanly.

Thanks for the attempt, at least!  While I get the reasons it was 
rejected, it feels a bit of a shame -- surely it's easier to do a 
more major under-the-hood rewrite with the public API (and tests) 
already in place ... :-\

Dec 07 2019
