digitalmars.D.learn - toUTFz and WinAPI GetTextExtentPoint32W

reply Andre <andre s-e-a-p.de> writes:
Hi,

I want something like:

bool test(HDC dc, string str, int len, SIZE* s)
{
wchar[] wstr = toUTFz!(wchar*)str;
GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);
...

I cannot get the wchar[] part to work. I am struggling
with pointers and arrays. Could you give some advice?

Kind regards
Andre
Sep 20 2011
next sibling parent reply Trass3r <un known.com> writes:
 bool test(HDC dc, string str, int len, SIZE* s)
 {
 wchar[] wstr = toUTFz!(wchar*)str;
 GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);

toUTFz returns a wchar*, not a wchar[].
Sep 20 2011
parent reply Andre <andre s-e-a-p.de> writes:
On Tue, 20 Sep 2011 19:27:03 +0200, Trass3r wrote:

 bool test(HDC dc, string str, int len, SIZE* s)
 {
 wchar[] wstr = toUTFz!(wchar*)str;
 GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);

toUTFz returns a wchar*, not a wchar[].

I am not familiar with pointers. I know I have to call toUTFz and pass the pointer and the length of the result to the WinAPI function. Do you have any suggestions for how to make this API call?

Kind regards
Andre
Sep 20 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Are you sure that the call requires the string to be null terminated? I do not know that winapi function, but this might work:

bool test(HDC dc, string str, SIZE* s)
{
    auto wstr = to!(wchar[])str;
    GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
    ...
Sep 20 2011
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/20/2011 08:34 PM, Trass3r wrote:
 Are you sure that the call requires the string to be null terminated?
 I do not know that winapi function, but this might work:

 bool test(HDC dc, string str, SIZE* s)
 {
 auto wstr = to!(wchar[])str;
 GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
 ...

It doesn't need to be null-terminated for that function. Shouldn't you use to!wstring though?!

It has to be copied anyway, so there is no real difference. I just did not know the signature of that function, and if it had been missing the const, wstring would not have worked. But if there is a const, wstring is indeed superior because shorter and clearer.
Sep 20 2011
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 09/20/2011 08:24 PM, Timon Gehr wrote:
 Are you sure that the call requires the string to be null terminated?
 I do not know that winapi function, but this might work:

 bool test(HDC dc, string str, SIZE* s)
 {
 auto wstr = to!(wchar[])str;
 GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
 ...

sry, should have read:

bool test(HDC dc, string str, SIZE* s)
{
    auto wstr = to!(wchar[])(str);
    GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
    ...
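
A minimal sketch of the pointer-plus-length pattern discussed in this exchange, assuming only Phobos. `measure` is a hypothetical stand-in for a C-style function with the same shape as GetTextExtentPoint32W (pointer to UTF-16 data plus a length), so that the snippet compiles on any platform; `measureString` and `measureStringZ` are likewise illustrative names, not part of any library.

```d
import std.conv : to;
import std.utf : toUTFz;

// Hypothetical stand-in for a C API that takes a UTF-16 pointer and a
// length in UTF-16 code units.
int measure(const(wchar)* p, int len)
{
    int units = 0;
    foreach (i; 0 .. len)
        if (p[i] != 0) ++units; // touch every unit the caller promised
    return units;
}

int measureString(string str)
{
    // to!wstring transcodes UTF-8 to UTF-16. The result is a slice,
    // so it carries both .ptr and .length.
    wstring wstr = to!wstring(str);
    return measure(wstr.ptr, cast(int) wstr.length);
}

int measureStringZ(string str)
{
    // toUTFz yields only a null-terminated pointer with no length
    // attached, so this variant scans for the terminator itself.
    const(wchar)* p = toUTFz!(const(wchar)*)(str);
    int len = 0;
    while (p[len] != 0) ++len;
    return measure(p, len);
}
```

The slice-based variant is usually the simpler choice when the callee accepts an explicit length, since no terminator is needed; the toUTFz variant is for callees that expect a null-terminated string.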
Sep 20 2011
parent Andre <andre s-e-a-p.de> writes:
Thanks a lot for your help.

Kind regards
Andre
Sep 20 2011
prev sibling next sibling parent Trass3r <un known.com> writes:
 Are you sure that the call requires the string to be null terminated? I  
 do not know that winapi function, but this might work:

 bool test(HDC dc, string str, SIZE* s)
 {
 auto wstr = to!(wchar[])str;
 GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
 ...

It doesn't need to be null-terminated for that function. Shouldn't you use to!wstring though?!
Sep 20 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
Don't use length, use std.utf.count, à la:

import std.utf;
alias toUTFz!(const(wchar)*, string)  toUTF16z;
GetTextExtentPoint32W(dc, str.toUTF16z, std.utf.count(str), s);

I like to keep that alias for my code since I was already using it beforehand.

I'm pretty sure (ok maybe 80% sure) that GetTextExtentPoint32W asks
for the count of characters and not code units. The WinAPI docs are a
bit fuzzy when it comes to these things, some functions take the
character count, others code-unit count. I've used this function in a
D port of a Neatpad project a while ago.
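
Which of the two counts a given WinAPI function expects is not settled in this post, so the following is only a sketch of how the numbers can differ for the same text. `utf16Units` and `codePoints` are hypothetical helper names; the string is written with escapes so the byte counts are unambiguous.

```d
import std.conv : to;
import std.range : walkLength;

// Number of UTF-16 code units needed for a UTF-8 string.
size_t utf16Units(string s)
{
    return to!wstring(s).length;
}

// Number of code points, which is what walkLength reports for a string.
size_t codePoints(string s)
{
    return walkLength(s);
}

unittest
{
    // 'a', e-acute, euro sign, musical G clef (outside the BMP)
    string s = "a\u00E9\u20AC\U0001D11E";
    assert(s.length == 10);      // UTF-8 code units: 1 + 2 + 3 + 4
    assert(utf16Units(s) == 5);  // the clef needs a surrogate pair
    assert(codePoints(s) == 4);
}
```

For text that stays inside the Basic Multilingual Plane the last two numbers coincide, which is why the distinction is easy to overlook.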
Sep 20 2011
prev sibling next sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, September 20, 2011 14:27 Andrej Mitrovic wrote:
 Don't use length, use std.utf.count, ala:
 
 import std.utf;
 alias toUTFz!(const(wchar)*, string) toUTF16z;
 GetTextExtentPoint32W(dc, str.toUTF16z, std.utf.count(str), s);

Or std.range.walkLength. I don't know why we really have std.utf.count. It just calls walkLength anyway. I suspect that it's a function that predates walkLength and was made to use walkLength after walkLength was introduced. But it's kind of pointless now.

- Jonathan M Davis
Sep 20 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 Or std.range.walkLength. I don't know why we really have std.utf.count. It
 just calls walkLength anyway. I suspect that it's a function that predates
 walkLength and was made to use walkLength after walkLength was introduced.
 But it's kind of pointless now.

 - Jonathan M Davis

I don't think having better-named aliases is a bad thing. Although now I'm seeing it's not just an alias but a function. What exactly is the "static if (E.sizeof < 4)" in there for btw? When would the element type exceed 4 bytes while still passing the isSomeChar contract, and then why not stop compilation at that point instead of return "s.length"?
Sep 20 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
One other thing, count can only take an array which seems too
restrictive since walkLength can take any range at all. So maybe count
should be just an alias to walkLength or it should possibly be removed
(I'm against fully removing it because I already use it in code and I
think the name does make sense).
Sep 20 2011
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, September 20, 2011 14:43 Andrej Mitrovic wrote:
 On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 Or std.range.walkLength. I don't know why we really have std.utf.count. It
 just calls walkLength anyway. I suspect that it's a function that predates
 walkLength and was made to use walkLength after walkLength was
 introduced. But it's kind of pointless now.
 
 - Jonathan M Davis

I don't think having better-named aliases is a bad thing. Although now I'm seeing it's not just an alias but a function.

We specifically avoid having aliases in Phobos simply for having alternate function names. Aliases need to actually be useful, or they shouldn't be there.
 What exactly is the "static if (E.sizeof < 4)" in there for btw? When
 would the element type exceed 4 bytes while still passing the
 isSomeChar contract, and then why not stop compilation at that point
 instead of return "s.length"?

The static if is there to special-case narrow strings. It's unnecessary (though it does eliminate a function call when -inline isn't used). It would have been necessary prior to count just forwarding to walkLength, but it isn't now.
 One other thing, count can only take an array which seems too
 restrictive since walkLength can take any range at all. So maybe count
 should be just an alias to walkLength or it should possibly be removed
 (I'm against fully removing it because I already use it in code and I
 think the name does make sense).

I don't know if we're going to remove std.utf.count or not, but it _is_ the kind of thing that we've been removing. It doesn't add any real value. It's just another function which does exactly the same thing as walkLength except that it's restricted to strings, and we don't generally like having pointless aliases around (or pointless function wrappers, which amounts to pretty much the same thing). So, it wouldn't surprise me at all if it goes away, but if/when it does, it'll go through the proper deprecation cycle rather than just being removed, so if/when we do that, it's not like your code would immediately break. - Jonathan M Davis
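
To illustrate the `static if (E.sizeof < 4)` special case discussed in this post, here is a minimal sketch. It is not the Phobos source; `codePointCount` is a hypothetical helper that only shows how the branch is chosen at compile time from the element size.

```d
import std.range : walkLength;
import std.traits : isSomeChar;

// Hypothetical helper, not the Phobos implementation: counts code points
// and special-cases on the element size at compile time.
size_t codePointCount(E)(const(E)[] s)
    if (isSomeChar!E)
{
    static if (E.sizeof < 4)
        return walkLength(s);   // char/wchar: variable width, must decode
    else
        return s.length;        // dchar: one code unit per code point
}
```

Called with a string, wstring, or dstring holding the same text, all three instantiations return the same count; only the dchar version can answer without walking the data.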
Sep 20 2011
parent reply travert phare.normalesup.org (Christophe) writes:
"Jonathan M Davis" wrote in message (digitalmars.D.learn:29637):
 On Tuesday, September 20, 2011 14:43 Andrej Mitrovic wrote:
 On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 Or std.range.walkLength. I don't know why we really have std.utf.count. It
 just calls walkLength anyway. I suspect that it's a function that predates
 walkLength and was made to use walkLength after walkLength was
 introduced. But it's kind of pointless now.
 
 - Jonathan M Davis

I don't think having better-named aliases is a bad thing. Although now I'm seeing it's not just an alias but a function.


std.utf.count has one advantage: someone looking for the function will find it. The programmer might not look in std.range to find a function about UTF strings, and even if he did, the walkLength documentation does not indicate that it works with (narrow) strings the way it does. To know you can use walkLength, you must know:
- that popFront works differently for strings,
- that hasLength is not true for strings,
- what walkLength is.

So yes, experienced programmers don't need std.utf.count, but newbies do.

Last point: walkLength is not optimized for strings. std.utf.count should be.

This short implementation of count was 3 to 8 times faster than walkLength in a simple benchmark:

size_t myCount(string text)
{
    size_t n = text.length;
    for (uint i=0; i<text.length; ++i)
    {
        auto s = text[i]>>6;
        n -= (s>>1) - ((s+1)>>2);
    }
    return n;
}

(compiled with gdc on 64 bits; the sample text was the introduction of the French Wikipedia UTF-8 article down to the table of contents - http://fr.wikipedia.org/wiki/UTF-8 ).

The reason is that the loop can be unrolled by the compiler.
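
The arithmetic inside myCount is terse, so here is a sketch that isolates it. `isContinuation` and `countByContinuation` are hypothetical names introduced for illustration; the second function is the same algorithm as myCount and, like it, assumes the input is valid UTF-8.

```d
import std.range : walkLength;

// Contribution of one byte in the loop above: 1 for a UTF-8
// continuation byte (top bits 10), 0 for anything else.
int isContinuation(char c)
{
    int s = c >> 6;                    // top two bits: 0, 1, 2 or 3
    return (s >> 1) - ((s + 1) >> 2);  // s=2 gives 1, all others give 0
}

// Start from the byte length and subtract one per continuation byte.
size_t countByContinuation(string text)
{
    size_t n = text.length;
    foreach (char c; text)             // iterate code units, no decoding
        n -= isContinuation(c);
    return n;
}
```

For s = 0 or 1 (ASCII) both terms are 0; for s = 3 (a lead byte) they are 1 and 1 and cancel; only s = 2 leaves a 1 to subtract.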
Sep 20 2011
next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
Very good point, you might want to file an enhancement request. It would make the functionality different enough to prevent count from being removed: walkLength throws on an invalid UTF sequence.
Sep 20 2011
next sibling parent reply travert phare.normalesup.org (Christophe) writes:
I would be glad to do so, but I am quite new here, so I don't know how to. A little pointer could help. -- Christophe
Sep 20 2011
parent Timon Gehr <timon.gehr gmx.ch> writes:
http://d.puremagic.com/issues/

You can tick 'Severity: enhancement request'.

Probably it would be best if it throws if the final result is larger than text.length though.
Sep 21 2011
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 21.09.2011 4:04, Timon Gehr wrote:
Very good point, you might want to file an enhancement request. It would make the functionality different enough to prevent count from being removed: walkLength throws on an invalid UTF sequence.

Actually, I don't buy it. I guess the reason it's faster is that it doesn't check if the codepoint is valid. In fact you can easily get ridiculous overflowed "negative" lengths. Maybe we can put it here as an unsafe but fast version though.

Also check std.utf.stride to see if you can get it better, it's the beast behind narrow string popFront.

-- 
Dmitry Olshansky
Sep 21 2011
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 09/21/2011 12:37 PM, Dmitry Olshansky wrote:
Actually, I don't buy it. I guess the reason it's faster is that it doesn't check if the codepoint is valid. In fact you can easily get ridiculous overflowed "negative" lengths.

Most of these could be caught by a final check. I think having the option of a version that is so much faster would be nice. Chances are pretty high that code actually manipulating the string will throw eventually if it is invalid.
Sep 21 2011
prev sibling parent reply travert phare.normalesup.org (Christophe) writes:
 Actually, I don't buy it. I guess the reason it's faster is that it 
 doesn't check if the codepoint is valid.

Why should it? The documentation of std.utf.count says the string must be validly encoded, not that it will enforce that it is. Checking that a string is valid every time you use it would be very expensive. Actually, std.range.walkLength does not check that the sequence is valid. See this test:

import std.range, std.stdio;

void main()
{
    string text = "aléluyah";
    char[] text2 = text.dup;
    text2[3] = 'a';
    writeln(walkLength(text2)); // outputs: 8
    writeln(text2);             // outputs: al\303aluyah
}

There is probably a way to check that a UTF sequence is valid with an unrollable loop.
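
For comparison, an explicit one-off check is available as std.utf.validate. The sketch below builds the same damaged buffer as the test in this post and shows that validate rejects it. `damaged` is a hypothetical helper name, and `UTFException` is the exception name in current Phobos (the code quoted elsewhere in this thread uses the older spelling `UtfException`).

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Same damage as in the test above: the second byte of the two-byte
// e-acute is overwritten with an ASCII letter.
char[] damaged()
{
    char[] buf = "al\u00E9luyah".dup;
    buf[3] = 'a';
    return buf;
}

unittest
{
    validate("al\u00E9luyah");                       // well-formed: returns normally
    assertThrown!UTFException(validate(damaged()));  // lead byte without continuation
}
```

Validating once at the boundary where data enters the program, rather than on every traversal, is one way to reconcile the cost concern raised here with the wish for fast unchecked loops.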
 In fact you can easily get ridiculous overflowed "negative" lengths. 
 Maybe we can put it here as unsafe and fast version though.

Unless I am mistaken, the minimum length myCount can return is 0 even if the string is invalid.
 Also check std.utf.stride to see if you can get it better, it's the 
 beast behind narrow string popFront.

stride does not do much checking. It can even return 5 or 6, which is not possible for a valid UTF-8 string! The equivalent of myCount for stride would be:

size_t myStride(char c)
{
    // optional:
    // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5))
    //     throw new UtfException("Not the start of the UTF-8 sequence");
    return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
}

That I compared to:

size_t utfLikeStride(char c)
{
    // optional:
    // immutable result = UTF8stride[c];
    // if (result == 0xFF)
    //     throw new UtfException("Not the start of the UTF-8 sequence");
    // return result;
    return UTF8stride[c];
}

One table lookup is replaced by some byte arithmetic in myStride. I also took only one char as input, since stride only looks at the i-th character. Actually, if the stride signature is kept as uint "stride(char[] s, int i)", I did not find any change with -O3.

Average times for "a lot" of calls (compiled with gcc, tested with -O3 and a homogeneous distribution of "valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4'):

myStride no throws:      1112ms
utfLikeStride no throws: 1433ms
utfLikeStride throws:    1868ms (the current implementation)
myStride throws:         8269ms

Removing throws from utfLikeStride makes it about 25% faster. Removing throws from myStride makes it about 7 times faster.

With -O0, myStride is less than 10% slower than utfLikeStride (no throws).

In conclusion, the fastest implementation is myStride without throws, and it beats the current implementation by about 40%. Changing std.utf.stride may be desirable. As I said earlier, the throws do not enforce the validity of the string. Really checking the validity of the string would cost much more, which may not be desirable, so why bother checking at all?

A more serious benchmark could justify changing std.utf.stride. The improvement could be even better in a real situation, because the lookup table of utfLikeStride may not always be at hand - this really depends on what the compiler does.

In any case, this may not improve walkLength by more than a few percent.

-- 
Christophe

now I'll go back to my real work...
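
As a sanity check on the arithmetic, the sketch below copies myStride from this post (without the optional check) and compares it with std.utf.stride on a string that contains one sequence of each length. It checks correctness on valid lead bytes only and says nothing about the timings.

```d
import std.utf : stride;

// Branch-free stride copied from the post above. Each term adds 1 once
// the lead byte reaches the next length class:
// 0xC0 (2 bytes), 0xE0 (3 bytes), 0xF0 (4 bytes).
size_t myStride(char c)
{
    return 1 + (((c >> 6) + 1) >> 2)
             + (((c >> 5) + 1) >> 3)
             + (((c >> 4) + 1) >> 4);
}

unittest
{
    // 'a' (1 byte), e-acute (2), euro sign (3), musical G clef (4)
    string s = "a\u00E9\u20AC\U0001D11E";
    for (size_t i = 0; i < s.length; i += stride(s, i))
        assert(myStride(s[i]) == stride(s, i));
}
```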
Sep 21 2011
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 21.09.2011 18:47, Christophe wrote:
 Actually, I don't buy it. I guess the reason it's faster is that it
 doesn't check if the codepoint is valid.

Why should it ? The documentation of std.utf.count says the string must be validly encoded, not that it will enforce that it is. Checking a string is valid everytime you use it would be very expensive. Actually, std.range.walkLength does not check the sequence is valid. See this test: void main() { string text = "aléluyah"; char[] text2 = text.dup; text2[3] = 'a'; writeln(walkLength(text2)); // outputs: 8 writeln(text2); // outputs: al\303aluyah }

Ouch, the checking is apparently very loose.
 There is probably a way to check an utf sequence is valid with an
 unrollable loop.

 In fact you can easily get ridiculous overflowed "negative" lengths.
 Maybe we can put it here as unsafe and fast version though.

Unless I am mistaken, the minimum length myCount can return is 0 even if the string is invalid.

Yeah, a brain malfunction on my part.
 Also check std.utf.stride to see if you can get it better, it's the
 beast behind narrow string popFront.

stride does not make much checking. It can even return 5 or 6, which is not possible for a valid utf-8 string ! The equivalent of myCount to stride would be: size_t myStride(char c) { // optional: // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5)) // throw new UtfException("Not the start of the UTF-8 sequence"); return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4); } That I compared to: size_t utfLikeStride(char c) { // optional: // immutable result = UTF8stride[c]; // if (result == 0xFF) // throw new UtfException("Not the start of the UTF-8 sequence"); // return result; return UTF8stride[c]; } One table lookup is replaced by byte some arythmetic in myStride. I also took only one char as input, since stride only looked at the i-th character. Actually, if stride signature is kept to uint "stride(char[] s, int i)", I did not find any change with -O3. Average times for "a lot" of calls: (compiled with gcc, tested with -O3 and a homogenous distribution of "valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4') myStride no throws: 1112ms. utfLikeStride no throws: 1433ms. utfLikeStride throws: 1868ms. (the current implementation). myStride throws: 8269ms.

implementation of utfLikeStride. It should amount to cmp vs test, not sure if it matters much.
 Removing throws from utfLikeStride makes it about 25% faster.
 Removing throws from myStride makes it about 7 times faster.

 With -O0, myStride gets less 10% slower than utfLikeStride (no throws).

 In conclusion, the fastest implementation is myStride without throws,
 and it beats the current implementation by about 40%. Changing
 std.utf.stride may be desirable. As I said earlier, the throws do
 not enforce the validity of the string. Really checking the validity of
 the string would cost much more, which may not be desirable, so why
 bother checking at all?

The truth is I'd checked this in the past (though I used some bsr black magic) and if I kept the check in place the end result was always slower than the current one. But since the check is not very accurate anyway, maybe it can be replaced. It's problematic if some code happens to depend on it. (Given the doc, it should not.)
 A more serious benchmark could justify to change
 std.utf.stride. The improvement could be even better in real situation,
 because the lookup table of utfLikeStride may not be always at hand -
 this actually really depends on what the compiler does.

traversing UTF, on decoding - maybe. Generally if you do a lot of calls to stride it's in cache; if not, it doesn't matter much(?). Though I'd prefer a non-tabulated version.
 In any case, this may not improve walkLength by more than a few
 percents.

Then specializing walkLength to do your unrollable version seems like a good idea.

-- 
Dmitry Olshansky
Sep 21 2011
prev sibling parent reply zeljkog <zeljkog private.com> writes:
On 21.09.2011 01:57, Christophe wrote:
 size_t myCount(string text)
 {
    size_t n = text.length;
    for (uint i=0; i<text.length; ++i)
      {
        auto s = text[i]>>6;
        n -= (s>>1) - ((s+1)>>2);
      }
    return n;
 }

Here is a more readable and a bit faster version on dmd windows:

size_t utfCount(string text)
{
    size_t n = 0;
    for (uint i=0; i<text.length; ++i)
        n += ((text[i]>>6)^0b10)? 1: 0;
    return n;
}
Sep 21 2011
parent reply travert phare.normalesup.org (Christophe Travert) writes:
 Here is a more readable and a bit faster version on dmd windows:
 
 size_t utfCount(string text)
 {
      size_t n = 0;
      for (uint i=0; i<text.length; ++i)
           n += ((text[i]>>6)^0b10)? 1: 0;
      return n;
 }

Nice. It is better with gdc on 64-bit Linux too. I wanted to avoid conditional expressions like ?: but it's actually slightly faster that way. And now people can't tell it is dangerous because it could return a fuzzy number.

Even faster, though less readable:

size_t utfLength(string text)
{
    size_t n=0;
    for (size_t i=0; i<text.length; ++i)
        n += (((text[i]>>6)^0b10) != 0);
    return n;
}

Let's see how we can boost std.utf.stride that way...

-- 
Christophe
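
The same test can also be written with a mask: a byte is a continuation byte exactly when `(b & 0xC0) == 0x80`, which is equivalent to `((b >> 6) ^ 0b10) == 0`. The sketch below expresses the count with Phobos range algorithms instead of a hand-written loop. `countLeadBytes` is a hypothetical name, and no claim is made here about how its speed compares with the loops above.

```d
import std.algorithm.searching : count;
import std.string : representation;

// Counts every byte that is not a UTF-8 continuation byte, which equals
// the number of code points when the input is valid UTF-8.
size_t countLeadBytes(string text)
{
    // representation() views the string as immutable(ubyte)[], so the
    // predicate sees raw bytes and no decoding takes place.
    return text.representation.count!(b => (b & 0xC0) != 0x80);
}
```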
Sep 21 2011
parent zeljkog <zeljkog private.com> writes:
On 21.09.2011 19:12, Christophe Travert wrote:
 Nice. It is better with gdc linux 64bits too. I wanted to avoid
 conditional expressions like ?: but it's actually slightly faster that
 way.

It is not compiled to a conditional jump.
Sep 21 2011
prev sibling next sibling parent Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 We specifically avoid having aliases in Phobos simply for having alternate
 function names. Aliases need to actually be useful, or they shouldn't be
 there.

And function names have to be useful to library users. walkLength is an awful name for something that returns the character count. If you ask a GUI developer to look for a function that creates a rectangle path, you can be sure he'll start looking for Rectangle or DrawRect or something similar, and not "ClosedShapePointN!4" or something that generic.
Sep 20 2011
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Tuesday, September 20, 2011 15:10 Andrej Mitrovic wrote:
 On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 We specifically avoid having aliases in Phobos simply for having
 alternate function names. Aliases need to actually be useful, or they
 shouldn't be there.

And function names have to be useful to library users. walkLength is an awful name for something that returns the character count. If you ask a GUI developer to look for a function that creates a rectangle path, you can be sure he'll start looking for Rectangle or DrawRect or something similar, and not "ClosedShapePointN!4" or something that generic.

In this case, if there's a problem it's not how generic the function is, it's the name walkLength. There's nothing special about strings which makes the name count better for them than it is for other ranges. The function is returning the number of elements in the range - be they code points or integers or whatever. The name walkLength works just as well for strings as it does for anything else. So, if there's a problem it's that the name walkLength isn't necessarily all that great. Strings aren't so special that they merit their own function name for the same functionality. So, if count stays, it's simply because it's been around for a while, not because it's inherently better to have a separate count function. - Jonathan M Davis
Sep 20 2011