digitalmars.D.learn - Handling arbitrary char ranges

reply Matt Kline <matt bitbashing.io> writes:
I'm doing some work with a REST API, and I wrote a simple utility 
function that sets an HTTP's onSend callback to send a string:

@property outgoingString(ref HTTP request, const(void)[] sendData)
{
	import std.algorithm : min;

	request.contentLength = sendData.length;
	auto remainingData = sendData;
	request.onSend = delegate size_t(void[] buf)
	{
		size_t minLen = min(buf.length, remainingData.length);
		if (minLen == 0) return 0;
		buf[0..minLen] = remainingData[0..minLen];
		remainingData = remainingData[minLen..$];
		return minLen;
	};
}

I then wrote a function that lazily strips some whitespace from 
strings I send. To accommodate this change, I need to modify the 
function above so it takes arbitrary ranges of char elements. I 
assumed this would be a modest task, but it's been an exercise in 
frustration. The closest I got was:

@property void outgoingString(T)(ref HTTP request, T sendData)
	if (isInputRange!T) // #1
{
	static if (isArray!T) {
		import std.algorithm : min;

		request.contentLength = sendData.length;
		request.onSend = delegate size_t(void[] buf)
		{
			size_t minLen = min(buf.length, sendData.length);
			if (minLen == 0) return 0;
			buf[0..minLen] = sendData[0..minLen]; // #2
			sendData = sendData[minLen..$];
			return minLen;
		};
	}
	else {
		auto bcu = byCodeUnit(sendData); // #3

		static assert(is(ElementType!bcu : char), // #4
		              __FUNCTION__ ~ " only takes char ranges, not "
		              ~ typeof(bcu.front).stringof);

		// Length unknown; chunked transfer
		request.contentLength = ulong.max;
		request.onSend = delegate size_t(void[] buf)
		{
			for (size_t i = 0; i < buf.length; ++i) {
				if (bcu.empty) return i;
				buf[i] = bcu.front; // #5
				bcu.popFront();
			}
			return buf.length;
		};
	}
}

To each of the commented lines above,

1. What is the idiomatic way to constrain the function to only 
take char ranges? One might naïvely add `is(ElementType!T : 
char)`, but that falls on its face due to strings "auto-decoding" 
their elements to dchar. (More on that later.)

2. The function fails to compile, issuing, "cannot implicitly 
convert expression (sendData[0..minLen]) of type string to 
void[]" on this line. I assume this has to do with the 
immutability of string elements. Specifying a non-const array of 
const elements is as simple as `const(void)[]`, but how does one 
do this here, with a template argument?

3. Is this needed, or is auto-decoding behavior specific to char 
arrays and not other char ranges?

4. Is this a sane approach to make sure I'm dealing with ranges 
of chars? Do I need to use `Unqual` to deal with const or 
immutable elements?

5. This fails, claiming the right hand side can't be converted to 
type void. Casting to ubyte doesn't help - so how *does* one 
write to an element of a void array?

Am I making things harder than they have to be? Or is dealing 
with arbitrary ranges of chars this complex? I've lost count 
of the times templated code wouldn't compile because dchar was 
sneaking in somewhere... at least I'm in good company. 
(http://forum.dlang.org/post/m01r3d$1frl$1@digitalmars.com)
Apr 20 2016
next sibling parent Alex Parrill <initrd.gz gmail.com> writes:
On Wednesday, 20 April 2016 at 17:09:29 UTC, Matt Kline wrote:
 I'm doing some work with a REST API, and I wrote a simple 
 utility function that sets an HTTP's onSend callback to send a 
 string:

 [...]
IO functions usually work with octets, not characters, so an extra encoding step is needed. To get at the UTF-8/16/32 code units of a character array as plain integers, there's std.string.representation, and std.encoding might have something for arbitrary ranges.

Avoid slicing ranges; not all ranges support it. If you absolutely need it (you don't here), then add hasSlicing to the constraint.

isSomeChar can tell you if a type (like the range's element type) is a character.
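To illustrate the octets-versus-characters point above, here is a minimal sketch of `std.string.representation` viewing a string as its UTF-8 bytes. The sample text is made up for the example, and the usage lives in a `unittest` block (build with `-unittest -main`).

```d
import std.string : representation;

unittest
{
    // "b\u00E4r" is three characters but four UTF-8 code units.
    string s = "b\u00E4r";

    // representation reinterprets the string; it does not copy or transcode.
    immutable(ubyte)[] bytes = s.representation;

    immutable(ubyte)[] expected = [0x62, 0xC3, 0xA4, 0x72];
    assert(bytes.length == 4);
    assert(bytes == expected);
}
```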
Apr 20 2016
prev sibling next sibling parent reply ag0aep6g <anonymous example.com> writes:
On 20.04.2016 19:09, Matt Kline wrote:
 1. What is the idiomatic way to constrain the function to only take char
 ranges? One might naïvely add `is(ElementType!T : char)`, but that falls
 on its face due to strings "auto-decoding" their elements to dchar.
 (More on that later.)
Well, string is not a char range. If you want to accept string, you have to special case it. Rejecting string is an option, though. The caller would then have to make a char range from the string. There's std.utf.byCodeUnit for that.
 2. The function fails to compile, issuing, "cannot implicitly convert
 expression (sendData[0..minLen]) of type string to void[]" on this line.
 I assume this has to do with the immutability of string elements.
 Specifying a non-const array of const elements is as simple as
 `const(void)[]`, but how does one do this here, with a template argument?
Looks like a compiler bug to me. It works when you do it in two steps:
----
string sendData = "foo";
void[] buf = new void[3];
immutable(void)[] voidSendData = sendData;
buf[] = voidSendData[];
----
I've filed an issue: https://issues.dlang.org/show_bug.cgi?id=15942
 3. Is this needed, or is auto-decoding behavior specific to char arrays
 and not other char ranges?
Auto-decoding is specific to arrays.
 4. Is this a sane approach to make sure I'm dealing with ranges of
 chars? Do I need to use `Unqual` to deal with const or immutable elements?
is(Foo : char) also accepts byte, ubyte, bool, and user-defined types with an alias this to a char. You don't need Unqual with `: char`. Since immutable(char) is a value type still, it implicitly converts to char. However, if you want to reject those other types, and only accept char and its qualified variants, then you need Unqual:
----
is(Unqual!(ElementType!bcu) == char)
----
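One way to package the stricter check as a reusable constraint is sketched below. The name `isCharRange` is hypothetical (it is not a Phobos trait), and note that `ElementType` is applied to a type here, hence the `typeof`.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;
import std.utf : byCodeUnit;

/// Hypothetical trait: true only for input ranges whose element type is
/// char, const(char), or immutable(char).
enum isCharRange(R) = isInputRange!R && is(Unqual!(ElementType!R) == char);

// A plain string auto-decodes to dchar, so it is rejected.
static assert(!isCharRange!string);

// Wrapping the string with byCodeUnit yields a genuine char range.
static assert(isCharRange!(typeof("".byCodeUnit)));

// The exact-type check rejects ubyte, which the reply above says
// `: char` would have let through.
static assert(!isCharRange!(ubyte[]));
```

A function would then be declared as `void f(R)(R r) if (isCharRange!R)`, and callers holding a string would pass `s.byCodeUnit`.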
 5. This fails, claiming the right hand side can't be converted to type
 void. Casting to ubyte doesn't help - so how *does* one write to an
 element of a void array?
void[] is a bit of a special case. A single value of type void isn't really a thing. You can't write `void v = 1;`. Maybe use ubyte[] for the buffer type instead. To do it with void[], I guess you'd have to slice things:
----
char c = 'x';
void[] buf = new void[1];
buf[0 .. 1] = (&c)[0 .. 1];
----
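Another option, when the void[] buffer is imposed from outside, is to view it as ubyte[] inside the callback and write element by element. The following is only a sketch under that assumption; `fillFrom` is a hypothetical helper name, not part of Phobos or std.net.curl.

```d
import std.range.primitives : ElementType, isInputRange;
import std.string : representation;
import std.traits : Unqual;
import std.utf : byCodeUnit;

/// Hypothetical helper: copies code units from `src` into `buf` until either
/// one is exhausted, and returns the number of bytes written.
size_t fillFrom(R)(void[] buf, ref R src)
    if (isInputRange!R && is(Unqual!(ElementType!R) == char))
{
    // A single void element cannot be assigned, so view the memory as bytes.
    auto bytes = cast(ubyte[]) buf;
    size_t i = 0;
    for (; i < bytes.length && !src.empty; ++i)
    {
        bytes[i] = cast(ubyte) src.front;
        src.popFront();
    }
    return i;
}

unittest
{
    auto src = "hello".byCodeUnit;
    auto storage = new ubyte[](3);

    // First call fills the whole 3-byte buffer.
    assert(fillFrom(storage, src) == 3);
    assert(storage == "hel".representation);

    // Second call writes the remaining two code units.
    assert(fillFrom(storage, src) == 2);
    assert(storage[0 .. 2] == "lo".representation);

    // Nothing left to send.
    assert(fillFrom(storage, src) == 0);
}
```

A delegate of type `size_t delegate(void[])` could simply forward to such a helper, keeping the range in a captured variable.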
 Am I making things harder than they have to be? Or is dealing with
 arbitrary ranges of chars this complex? I've lost count of the times
 templated code wouldn't compile because dchar was sneaking in
 somewhere... at least I'm in good company.
 (http://forum.dlang.org/post/m01r3d$1frl$1@digitalmars.com)
I think your problems come more from wanting to accept string, which simply isn't a char range, and from using void[] as the buffer type.
Apr 20 2016
parent reply Matt Kline <matt bitbashing.io> writes:
On Wednesday, 20 April 2016 at 19:29:22 UTC, ag0aep6g wrote:
 Maybe use ubyte[] for the buffer type instead.
I don't have an option here, do I? I assume HTTP.onSend doesn't take a `delegate size_t(ubyte[])` instead of a `delegate size_t(void[])`, and that the former isn't implicitly convertible to the latter.
 I think your problems come more from wanting to accept string, 
 which simply isn't a char range
Is this due solely to the "auto-decode" behavior? Generally, (except apparently in this case) don't arrays of type T qualify as InputRanges of type T?
Apr 20 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 20.04.2016 21:48, Matt Kline wrote:
 I don't have an option here, do I? I assume HTTP.onSend doesn't take a
 `delegate size_t(ubyte[])` instead of a `delegate size_t(void[])`, and
 that the former isn't implicitly convertible to the latter.
Maybe I've missed it, but you didn't say where the HTTP type comes from, did you? If it's not under your control, then yeah, I guess you have to deal with void[]. [...]
 Is this due solely to the "auto-decode" behavior? Generally, (except
 apparently in this case) don't arrays of type T qualify as InputRanges
 of type T?
Yep. Generally, T[] is a range with element type T. char[], wchar[], and their qualified variants are the exception. And the reason is auto-decoding to dchar, yes.
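The exception described above can be checked at compile time. This is a small sketch using only `std.range.primitives`, `std.traits`, and `std.utf`; the alias name `CodeUnits` is introduced just for the example.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Ordinary arrays are ranges of their own element type.
static assert(is(ElementType!(int[]) == int));

// Narrow strings are the exception: the range primitives decode them to dchar.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!(wchar[]) == dchar));

// dchar arrays need no decoding, so nothing changes for them.
static assert(is(ElementType!(dchar[]) == dchar));

// byCodeUnit wraps a string in a range that yields the code units unchanged.
alias CodeUnits = typeof("".byCodeUnit);
static assert(isInputRange!CodeUnits);
static assert(is(Unqual!(ElementType!CodeUnits) == char));
```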
Apr 20 2016
parent reply Matt Kline <matt bitbashing.io> writes:
On Wednesday, 20 April 2016 at 20:00:58 UTC, ag0aep6g wrote:

 Maybe I've missed it, but you didn't say where the HTTP type 
 comes from, did you?
std.net.curl: https://dlang.org/phobos/std_net_curl.html#.HTTP (Sorry, I assumed that was a given since it's a standard library type. Poor assumption, perhaps.)

I'd rather not write my own cURL wrapper. Do you think it would be worthwhile starting a PR for Phobos to get it changed to ubyte[]? A reading of https://dlang.org/spec/arrays.html indicates the main difference is that the GC crawls void[], but I would think that wouldn't matter for a short-lived buffer being shoveled into libcurl, which is, by nature, a copy of the same data somewhere else in your program...
Apr 20 2016
parent ag0aep6g <anonymous example.com> writes:
On 20.04.2016 22:09, Matt Kline wrote:
 I'd rather not write my own cURL wrapper. Do you think it would be
 worthwhile starting a PR for Phobos to get it changed to ubyte[]? A
 reading of https://dlang.org/spec/arrays.html indicates the main
 difference is that the GC crawls void[], but I would think that
 wouldn't matter for a short-lived buffer being shoveled into libcurl,
 which is, by nature, a copy of the same data somewhere else in your
 program...
I don't know if a PR would be worthwhile. What you say makes sense to me, but I am by no means an expert here. As you say, void[] is the safer default with regard to the GC. It's also simpler to get a void[] from an arbitrary array, as any array implicitly converts to void[] (given compatible qualifiers). Getting a void[] from an arbitrary range isn't that simple, but getting a ubyte[] from an int[] requires some work, too. void[] is possibly the better option all around.
Apr 20 2016
prev sibling parent reply Alex Parrill <initrd.gz gmail.com> writes:
On Wednesday, 20 April 2016 at 17:09:29 UTC, Matt Kline wrote:
 [...]
First, you can't assign anything to a void[], for the same reason you can't dereference a void*. This includes the slice assignment that you are trying to do in `buf[0..minLen] = remainingData[0..minLen];`. Cast the buffer to a `ubyte[]` buffer first, then you can assign bytes to it.

      auto bytebuf = cast(ubyte[]) buf;
      bytebuf[0] = 123;

Second, don't use slicing on ranges (unless you need it). Not all ranges support it...

      auto buf = [1,2,3];
      auto rng = filter!(x => x != 1)(buf);
      pragma(msg, hasSlicing!(typeof(rng))); // false

... and even ranges that support it don't support assigning to an array by slice:

      auto buf = new int[](3);
      buf[] = only(1,2,3)[]; // cannot implicitly convert expression (only(1, 2, 3).opSlice()) of type OnlyResult!(int, 3u) to int[]

Instead, use a loop (or maybe `put`) to fill the array.

Third, don't treat text as bytes; encode your characters.

      auto schema = EncodingScheme.create("utf-8");
      auto range = chain("hello", " ", "world").map!(ch => cast(char) ch);

      auto buf = new ubyte[](100);
      auto currentPos = buf;
      while(!range.empty && schema.encodedLength(range.front) <= currentPos.length) {
          auto written = schema.encode(range.front, currentPos);
          currentPos = currentPos[written..$];
          range.popFront();
      }
      buf = buf[0..buf.length - currentPos.length];

(PS there ought to be a range in Phobos that encodes each character, something like map maybe)
Apr 20 2016
parent reply ag0aep6g <anonymous example.com> writes:
On 20.04.2016 23:59, Alex Parrill wrote:
 On Wednesday, 20 April 2016 at 17:09:29 UTC, Matt Kline wrote:
 [...]
 First, you can't assign anything to a void[], for the same reason you can't dereference a void*. This includes the slice assignment that you are trying to do in `buf[0..minLen] = remainingData[0..minLen];`.
Not true. You can assign any dynamic array to a void[]. Regarding vector notation, the spec doesn't seem to mention how it interacts with void[], but dmd accepts this no problem:
----
int[] i = [1, 2, 3];
auto v = new void[](3 * int.sizeof);
v[] = i[];
----
[...]
 Second, don't use slicing on ranges (unless you need it). Not all ranges
 support it...
As far as I see, the slicing code is guarded by `static if (isArray!T)`. Arrays support slicing. [...]
 Instead, use a loop (or maybe `put`) to fill the array.
That's what's done in the `else` path, no?
 Third, don't treat text as bytes; encode your characters.

      auto schema = EncodingScheme.create("utf-8");
      auto range = chain("hello", " ", "world").map!(ch => cast(char) ch);

      auto buf = new ubyte[](100);
      auto currentPos = buf;
      while(!range.empty && schema.encodedLength(range.front) <=
 currentPos.length) {
          auto written = schema.encode(range.front, currentPos);
          currentPos = currentPos[written..$];
          range.popFront();
      }
      buf = buf[0..buf.length - currentPos.length];
You're "converting" chars to UTF-8 here, right? That's a nop. char is a UTF-8 code unit already.
 (PS there ought to be a range in Phobos that encodes each character,
 something like map maybe)
std.utf.byChar and friends: https://dlang.org/phobos/std_utf.html#.byChar
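As a rough illustration of what `byChar` does, the sketch below transcodes UTF-16 and UTF-32 input to UTF-8 code units lazily. The sample text is made up for the example.

```d
import std.algorithm.comparison : equal;
import std.utf : byChar, byCodeUnit;

unittest
{
    wstring utf16 = "b\u00E4r"w;
    dstring utf32 = "b\u00E4r"d;

    // The same text as UTF-8 code units, without auto-decoding.
    auto expected = "b\u00E4r".byCodeUnit;

    // byChar yields char (UTF-8) code units regardless of the source encoding.
    assert(utf16.byChar.equal(expected));
    assert(utf32.byChar.equal(expected));
}
```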
Apr 20 2016
parent reply Alex Parrill <initrd.gz gmail.com> writes:
On Wednesday, 20 April 2016 at 22:44:37 UTC, ag0aep6g wrote:
 On 20.04.2016 23:59, Alex Parrill wrote:
 On Wednesday, 20 April 2016 at 17:09:29 UTC, Matt Kline wrote:
 [...]
 First, you can't assign anything to a void[], for the same reason you can't dereference a void*. This includes the slice assignment that you are trying to do in `buf[0..minLen] = remainingData[0..minLen];`.
 Not true. You can assign any dynamic array to a void[].
That's not assigning the elements of a void[]; it's just changing what the slice points to and adjusting the length, like doing `void* ptr = someOtherPtr;`
 Regarding vector notation, the spec doesn't seem to mention how 
 it interacts with void[], but dmd accepts this no problem:
 ----
 int[] i = [1, 2, 3];
 auto v = new void[](3 * int.sizeof);
 v[] = i[];
 ----
It only seems to work on arrays, not arbitrary ranges, sliceable or not. Though see below.
 [...]
 Second, don't use slicing on ranges (unless you need it). Not 
 all ranges
 support it...
 As far as I see, the slicing code is guarded by `static if (isArray!T)`. Arrays support slicing. [...]
 Instead, use a loop (or maybe `put`) to fill the array.
 That's what's done in the `else` path, no?
Yes, I did not see the static if condition, my bad.
 Third, don't treat text as bytes; encode your characters.

      auto schema = EncodingScheme.create("utf-8");
      auto range = chain("hello", " ", "world").map!(ch => 
 cast(char) ch);

      auto buf = new ubyte[](100);
      auto currentPos = buf;
      while(!range.empty && schema.encodedLength(range.front) <=
 currentPos.length) {
          auto written = schema.encode(range.front, currentPos);
          currentPos = currentPos[written..$];
          range.popFront();
      }
      buf = buf[0..buf.length - currentPos.length];
 You're "converting" chars to UTF-8 here, right? That's a nop. char is a UTF-8 code unit already.
It can be either chars, wchars, or dchars.
 (PS there ought to be a range in Phobos that encodes each 
 character,
 something like map maybe)
 std.utf.byChar and friends: https://dlang.org/phobos/std_utf.html#.byChar
byChar would work. byWchar and byDchar might cause endianness issues.
Apr 20 2016
parent ag0aep6g <anonymous example.com> writes:
On 21.04.2016 04:35, Alex Parrill wrote:
 On Wednesday, 20 April 2016 at 22:44:37 UTC, ag0aep6g wrote:
 On 20.04.2016 23:59, Alex Parrill wrote:
[...]
 That's not assigning the elements of a void[]; it's just changing what
 the slice points to and adjusting the length, like doing `void* ptr =
 someOtherPtr;`
True, but assigning elements is possible via slices as shown. [...]
 It only seems to work on arrays, not arbitrary ranges, sliceable or not.
 Though see below.
Yes, assigning slices and more complex vector operations only works with dynamic arrays. [...]
      auto range = chain("hello", " ", "world").map!(ch => cast(char)
 ch);
[...]
          auto written = schema.encode(range.front, currentPos);
[...]
 You're "converting" chars to UTF-8 here, right? That's a nop. char is
 a UTF-8 code unit already.
 It can be either chars, wchars, or dchars.
Your range specifically has element type char, though. Not wchar or dchar. And Matt Kline wants to work on char ranges (and maybe string), not on arbitrary ranges of char/wchar/dchar. [...]
 byChar would work. byWChar and byDChar might cause endian-ness issues.
Easily combined with the endianness functions from std.bitmanip:
----
void main()
{
    import std.algorithm: equal, map;
    import std.bitmanip: nativeToBigEndian, nativeToLittleEndian;
    import std.utf: byWchar;

    string utf8 = "foobär";
    auto utf16le = utf8.byWchar.map!nativeToLittleEndian;
    auto utf16be = utf8.byWchar.map!nativeToBigEndian;
    assert(equal(utf16le, [['f', 0], ['o', 0], ['o', 0], ['b', 0], [0xE4, 0], ['r', 0]]));
    assert(equal(utf16be, [[0, 'f'], [0, 'o'], [0, 'o'], [0, 'b'], [0, 0xE4], [0, 'r']]));
}
----
Apr 20 2016