digitalmars.D.learn - toUTFz and WinAPI GetTextExtentPoint32W


Andre <andre s-e-a-p.de> writes:

Hi,

I want something like:

bool test(HDC dc, string str, int len, SIZE* s)
{
wchar[] wstr = toUTFz!(wchar*)str;
GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);
...

I can't get the wchar[] part to work. I am struggling
with pointers versus arrays. Could you give me some advice?

Kind regards
Andre

Sep 20 2011

Trass3r <un known.com> writes:

 bool test(HDC dc, string str, int len, SIZE* s)
 {
 wchar[] wstr = toUTFz!(wchar*)str;
 GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);

toUTFz returns a wchar*, not a wchar[].
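
A minimal sketch of that distinction, assuming only Phobos' std.utf (the variable names are illustrative, not from the original code): toUTFz hands back a bare null-terminated pointer with no length attached, while toUTF16 hands back an array that carries its length.

```d
import std.utf : toUTF16, toUTFz;

void main()
{
    string str = "h\u00E9llo";   // "héllo": 6 UTF-8 code units, 5 UTF-16 code units

    // toUTFz gives a pointer to a null-terminated copy; a pointer has no .length.
    const(wchar)* p = toUTFz!(const(wchar)*)(str);

    // toUTF16 gives an array (wstring), which has both .ptr and .length.
    wstring w = toUTF16(str);
    assert(w.length == 5);

    // The length behind the pointer can only be found by walking to the terminator.
    size_t n = 0;
    while (p[n] != 0)
        ++n;
    assert(n == w.length);
}
```

So assigning the result of toUTFz to a wchar[] cannot work; either keep the pointer and obtain the length some other way, or convert to an array type.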

Sep 20 2011

Andre <andre s-e-a-p.de> writes:

On Tue, 20 Sep 2011 19:27:03 +0200, Trass3r wrote:

 bool test(HDC dc, string str, int len, SIZE* s)
 {
 wchar[] wstr = toUTFz!(wchar*)str;
 GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);

 
 toUTFz returns a wchar*, not a wchar[].

I am not familiar with pointers. I know I have to
call toUTFz and pass a pointer and a length to the
WinAPI function, both taken from the result.
Do you have any suggestions on how to make this API call?

Kind regards
Andre

Sep 20 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/20/2011 08:07 PM, Andre wrote:
 On Tue, 20 Sep 2011 19:27:03 +0200, Trass3r wrote:

 bool test(HDC dc, string str, int len, SIZE* s)
 {
 wchar[] wstr = toUTFz!(wchar*)str;
 GetTextExtentPoint32W(dc wstr.ptr, wstr.length, s);

 toUTFz returns a wchar*, not a wchar[].

 I am not familiar with pointers. I know I have to
 call toUTFz! and fill pointer value and length value
 of the WinAPI from the result.
 Do you have any suggestions how to achieve this API call?

 Kind regards
 Andre

Are you sure that the call requires the string to be null terminated? I 
do not know that winapi function, but this might work:

bool test(HDC dc, string str, SIZE* s)
{
auto wstr = to!(wchar[])str;
GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
...

Sep 20 2011

Trass3r <un known.com> writes:

 Are you sure that the call requires the string to be null terminated? I  
 do not know that winapi function, but this might work:

 bool test(HDC dc, string str, SIZE* s)
 {
 auto wstr = to!(wchar[])str;
 GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
 ...

It doesn't need to be null-terminated for that function.
Shouldn't you use to!wstring though?!

Sep 20 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/20/2011 08:34 PM, Trass3r wrote:
 Are you sure that the call requires the string to be null terminated?
 I do not know that winapi function, but this might work:

 bool test(HDC dc, string str, SIZE* s)
 {
 auto wstr = to!(wchar[])str;
 GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
 ...

 It doesn't need to be null-terminated for that function.
 Shouldn't you use to!wstring though?!

It has to be copied anyway, so there is no real difference. I just did 
not know the signature of that function, and if it had been missing the 
const, wstring would not have worked. But if there is a const, wstring 
is indeed superior because it is shorter and clearer.

Sep 20 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/20/2011 08:24 PM, Timon Gehr wrote:
 bool test(HDC dc, string str, SIZE* s)
 {
 auto wstr = to!(wchar[])str;
 GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
 ...

Sorry, that should have read:

bool test(HDC dc, string str, SIZE* s)
{
auto wstr = to!(wchar[])(str);
GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
...
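
For reference, a fuller sketch of how the pieces might fit together. It assumes the druntime bindings in core.sys.windows.windows; the wrapper name textExtent is hypothetical, and the Windows-only part is guarded so the file also compiles elsewhere.

```d
import std.conv : to;

version (Windows)
{
    import core.sys.windows.windows;

    // Hypothetical wrapper name; returns true when the API call succeeds.
    bool textExtent(HDC dc, string str, SIZE* s)
    {
        wstring wstr = to!wstring(str);   // UTF-8 -> UTF-16 copy
        // The API takes an int, so the size_t length is cast explicitly.
        return GetTextExtentPoint32W(dc, wstr.ptr, cast(int) wstr.length, s) != 0;
    }
}

void main()
{
    // Platform-independent part: the converted array carries its own length.
    wstring wstr = to!wstring("h\u00E9llo");
    assert(wstr.length == 5);
    assert(wstr.ptr !is null);
}
```

Because the length is passed explicitly, no null terminator is needed, which is why a plain array conversion is enough here.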

Sep 20 2011

Andre <andre s-e-a-p.de> writes:

On Tue, 20 Sep 2011 20:44:40 +0200, Timon Gehr wrote:

 sry, should have read:
 
 bool test(HDC dc, string str, SIZE* s)
 {
 auto wstr = to!(wchar[])(str);
 GetTextExtentPoint32W(dc, wstr.ptr, wstr.length, s);
 ...


Thanks a lot for your help.

Kind regards
Andre

Sep 20 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

Don't use length, use std.utf.count, like so:

import std.utf;
alias toUTFz!(const(wchar)*, string)  toUTF16z;
GetTextExtentPoint32W(dc, str.toUTF16z, std.utf.count(str), s);

I like to keep that alias for my code since I was already using it beforehand.

I'm pretty sure (ok maybe 80% sure) that GetTextExtentPoint32W asks
for the count of characters and not code units. The WinAPI docs are a
bit fuzzy when it comes to these things, some functions take the
character count, others code-unit count. I've used this function in a
D port of a Neatpad project a while ago.
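
To make the distinction concrete, here is a small sketch (not from the original post) showing that the code-unit count and the code-point count differ once a character outside the Basic Multilingual Plane is involved. It does not settle which of the two the API expects; it only shows that the numbers can differ.

```d
import std.conv : to;
import std.range : walkLength;
import std.utf : count;

void main()
{
    string s = "a\U0001D11E";      // 'a' followed by U+1D11E (musical G clef)
    wstring w = to!wstring(s);

    assert(s.length == 5);         // UTF-8 code units: 1 + 4
    assert(w.length == 3);         // UTF-16 code units: 1 + a surrogate pair
    assert(count(s) == 2);         // code points
    assert(walkLength(s) == 2);    // code points as well
}
```

For text that stays within the BMP, w.length and count(s) give the same number, so the difference only shows up with such characters.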

Sep 20 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, September 20, 2011 14:27 Andrej Mitrovic wrote:
 Don't use length, use std.utf.count, like so:
 
 import std.utf;
 alias toUTFz!(const(wchar)*, string) toUTF16z;
 GetTextExtentPoint32W(dc, str.toUTF16z, std.utf.count(str), s);

Or std.range.walkLength. I don't know why we really have std.utf.count. It just 
calls walkLength anyway. I suspect that it's a function that predates 
walkLength and was made to use walkLength after walkLength was introduced. But 
it's kind of pointless now.

- Jonathan M Davis

Sep 20 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 Or std.range.walkLength. I don't know why we really have std.utf.count. It
 just calls walkLength anyway. I suspect that it's a function that predates
 walkLength and was made to use walkLength after walkLength was introduced.
 But it's kind of pointless now.

 - Jonathan M Davis

I don't think having better-named aliases is a bad thing. Although now
I'm seeing it's not just an alias but a function.

What exactly is the "static if (E.sizeof < 4)" in there for btw? When
would the element type exceed 4 bytes while still passing the
isSomeChar contract, and then why not stop compilation at that point
instead of return "s.length"?

Sep 20 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

One other thing, count can only take an array which seems too
restrictive since walkLength can take any range at all. So maybe count
should be just an alias to walkLength or it should possibly be removed
(I'm against fully removing it because I already use it in code and I
think the name does make sense).
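
A short sketch of the "any range" point, using only Phobos calls (the sample string and the filter are illustrative): walkLength can count the elements of a lazily filtered range, which is not an array at all.

```d
import std.algorithm : filter;
import std.range : walkLength;
import std.utf : count;

void main()
{
    string s = "al\u00E9luyah";    // "aléluyah"

    // On a plain string both give the number of code points.
    assert(count(s) == 8);
    assert(walkLength(s) == 8);

    // A lazily filtered range is not an array, but walkLength still counts it.
    auto vowels = s.filter!(c => c == 'a' || c == '\u00E9' || c == 'u');
    assert(walkLength(vowels) == 4);
}
```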

Sep 20 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, September 20, 2011 14:43 Andrej Mitrovic wrote:
 I don't think having better-named aliases is a bad thing. Although now
 I'm seeing it's not just an alias but a function.

We specifically avoid having aliases in Phobos simply for having alternate 
function names. Aliases need to actually be useful, or they shouldn't be 
there.

 What exactly is the "static if (E.sizeof < 4)" in there for btw? When
 would the element type exceed 4 bytes while still passing the
 isSomeChar contract, and then why not stop compilation at that point
 instead of return "s.length"?

The static if is there to special-case narrow strings. It's unnecessary 
(though it does eliminate a function call when -inline isn't used). It would 
have been necessary prior to count just forwarding to walkLength, but it isn't 
now.

 One other thing, count can only take an array which seems too
 restrictive since walkLength can take any range at all. So maybe count
 should be just an alias to walkLength or it should possibly be removed
 (I'm against fully removing it because I already use it in code and I
 think the name does make sense).

I don't know if we're going to remove std.utf.count or not, but it _is_ the 
kind of thing that we've been removing. It doesn't add any real value. It's 
just another function which does exactly the same thing as walkLength except 
that it's restricted to strings, and we don't generally like having pointless 
aliases around (or pointless function wrappers, which amounts to pretty much 
the same thing). So, it wouldn't surprise me at all if it goes away, but 
if/when it does, it'll go through the proper deprecation cycle rather than 
just being removed, so if/when we do that, it's not like your code would
immediately break.

- Jonathan M Davis

Sep 20 2011

travert phare.normalesup.org (Christophe) writes:

"Jonathan M Davis" wrote in message (digitalmars.D.learn:29637):
 On Tuesday, September 20, 2011 14:43 Andrej Mitrovic wrote:
 I don't think having better-named aliases is a bad thing. Although now
 I'm seeing it's not just an alias but a function.

 

std.utf.count has one advantage: someone looking for the function will 
find it. A programmer might not look in std.range to find a function 
about UTF strings, and even if they did, the walkLength documentation 
does not indicate that it works with (narrow) strings the way it does. 
To know you can use walkLength, you must know:
- that popFront works differently for strings,
- that hasLength is not true for strings,
- what walkLength is.

So yes, you experienced programmers don't need std.utf.count, but newbies 
do.

Last point: walkLength is not optimized for strings.
std.utf.count should be.

This short implementation of count was 3 to 8 times faster than 
walkLength in a simple benchmark:

size_t myCount(string text)
{
  size_t n = text.length;
  for (uint i=0; i<text.length; ++i)
    {
      auto s = text[i]>>6;
      n -= (s>>1) - ((s+1)>>2);
    }
  return n;
}

(compiled with gdc on 64 bits; the sample text was the introduction of the 
French Wikipedia article on UTF-8, down to the table of contents - 
http://fr.wikipedia.org/wiki/UTF-8 ).

The reason is that the loop can be unrolled by the compiler.
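
For readers puzzling over the arithmetic, here is a sketch that spells out what the expression does. The function body is the same as above apart from the index type and the comments; the checks in main are an added illustration and assume valid UTF-8 input.

```d
import std.range : walkLength;

size_t myCount(string text)
{
    size_t n = text.length;
    for (size_t i = 0; i < text.length; ++i)
    {
        auto s = text[i] >> 6;              // top two bits of the byte: 0..3
        n -= (s >> 1) - ((s + 1) >> 2);     // 1 only when s == 2 (10xxxxxx)
    }
    return n;
}

void main()
{
    // Truth table: the expression is 1 for continuation bytes, 0 otherwise.
    foreach (s; 0 .. 4)
        assert(((s >> 1) - ((s + 1) >> 2)) == (s == 2 ? 1 : 0));

    // On valid UTF-8 the result matches the code point count.
    foreach (text; ["", "abc", "al\u00E9luyah", "a\U0001D11E"])
        assert(myCount(text) == walkLength(text));
}
```

In other words, the loop starts from the byte count and subtracts one for every continuation byte, with no branch in the loop body.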

Sep 20 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/21/2011 01:57 AM, Christophe wrote:
 The reason is that the loop can be unrolled by the compiler.

Very good point, you might want to file an enhancement request. It would 
make the functionality different enough to prevent count from being 
removed: walkLength throws on an invalid UTF sequence.

Sep 20 2011

travert phare.normalesup.org (Christophe) writes:

Timon Gehr wrote in message (digitalmars.D.learn:29641):
 Very good point, you might want to file an enhancement request. It would 
 make the functionality different enough to prevent count from being 
 removed: walkLength throws on an invalid UTF sequence.

I would be glad to do so, but I am quite new here, so I don't know how 
to. A little pointer could help.

-- 
Christophe

Sep 20 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/21/2011 02:15 AM, Christophe wrote:
 I would be glad to do so, but I am quite new here, so I don't know how
 to. A little pointer could help.

http://d.puremagic.com/issues/

You can tick 'Severity: enhancement request'. Probably it would be best 
if it throws if the final result is larger than text.length though.

Sep 21 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 21.09.2011 4:04, Timon Gehr wrote:
 Very good point, you might want to file an enhancement request. It would
 make the functionality different enough to prevent count from being
 removed: walkLength throws on an invalid UTF sequence.

Actually, I don't buy it. I guess the reason it's faster is that it 
doesn't check if the codepoint is valid. In fact you can easily get 
ridiculous overflowed "negative" lengths. Maybe we can put it here as 
unsafe and fast version though.
Also check std.utf.stride to see if you can get it better, it's the 
beast behind narrow string popFront.

-- 
Dmitry Olshansky

Sep 21 2011

Timon Gehr <timon.gehr gmx.ch> writes:

On 09/21/2011 12:37 PM, Dmitry Olshansky wrote:
 Actually, I don't buy it. I guess the reason it's faster is that it
 doesn't check if the codepoint is valid. In fact you can easily get
 ridiculous overflowed "negative" lengths.

Most of these could be caught by a final check. I think having the 
option of a version that is so much faster would be nice. Chances are 
pretty high that code actually manipulating the string will throw 
eventually if it is invalid.

 Maybe we can put it here as
 unsafe and fast version though.
 Also check std.utf.stride to see if you can get it better, it's the
 beast behind narrow string popFront.

Sep 21 2011

travert phare.normalesup.org (Christophe) writes:

 Actually, I don't buy it. I guess the reason it's faster is that it 
 doesn't check if the codepoint is valid.

Why should it? The documentation of std.utf.count says the string must 
be validly encoded, not that it will enforce that it is.
Checking that a string is valid every time you use it would be very expensive.

Actually, std.range.walkLength does not check that the sequence is valid. See 
this test:

import std.range : walkLength;
import std.stdio : writeln;

void main()
{
  string text = "aléluyah";
  char[] text2 = text.dup;
  text2[3] = 'a';             // overwrite the second byte of 'é'
  writeln(walkLength(text2)); // outputs: 8
  writeln(text2);             // outputs: al\303aluyah
}

There is probably a way to check that a UTF sequence is valid with an 
unrollable loop.
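
As an aside, Phobos already has an explicit (non-unrolled) check in std.utf.validate. A minimal sketch of validating once, rather than on every use, could look like this; the test data mirrors the example above.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    char[] text2 = "al\u00E9luyah".dup;   // "aléluyah"
    validate(text2);                      // valid UTF-8: returns normally

    text2[3] = 'a';                       // clobber the second byte of 'é'
    assertThrown!UTFException(validate(text2));
}
```

This is the slow, thorough kind of check; the point made above is that doing it on every call would be costly.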

 In fact you can easily get ridiculous overflowed "negative" lengths. 
 Maybe we can put it here as unsafe and fast version though.

Unless I am mistaken, the minimum length myCount can return is 0 even 
if the string is invalid.

 Also check std.utf.stride to see if you can get it better, it's the 
 beast behind narrow string popFront.

stride does not do much checking. It can even return 5 or 6, which is 
not possible for a valid UTF-8 string!

The stride equivalent of myCount would be:

size_t myStride(char c)
{
    // optional:
    // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5))
    //     throw new UtfException("Not the start of the UTF-8 sequence");
    return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
}

That I compared to:

size_t utfLikeStride(char c)
{
  // optional:
  // immutable result = UTF8stride[c];
  // if (result == 0xFF)
  // throw new UtfException("Not the start of the UTF-8 sequence");
  // return result;
  return UTF8stride[c];
}

One table lookup is replaced by some byte arithmetic in myStride.

I also took only one char as input, since stride only looks at the i-th 
character. Actually, if the stride signature is kept as "uint stride(char[] 
s, int i)", I did not find any change with -O3.
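
A quick sanity check of the arithmetic, as a sketch: it walks a valid string containing 1-, 2-, 3- and 4-byte sequences and compares myStride (without the optional throw) against std.utf.stride at each lead byte. The test string is illustrative.

```d
import std.utf : stride;

size_t myStride(char c)
{
    // Each term adds 1 once the lead byte reaches 0xC0, 0xE0 and 0xF0 respectively.
    return 1 + (((c >> 6) + 1) >> 2) + (((c >> 5) + 1) >> 3) + (((c >> 4) + 1) >> 4);
}

void main()
{
    // 'a' (1 byte), U+00E9 (2 bytes), U+65E5 (3 bytes), U+1D11E (4 bytes)
    string s = "a\u00E9\u65E5\U0001D11E";
    size_t i = 0;
    while (i < s.length)
    {
        assert(myStride(s[i]) == stride(s, i));
        i += myStride(s[i]);
    }
    assert(i == s.length);   // the walk lands exactly on the end of the string
}
```

Note that for a continuation byte (0x80..0xBF) this version simply returns 1 instead of signalling an error.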

Average times for "a lot" of calls:
(compiled with gcc, tested with -O3 and a homogeneous distribution of 
"valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4')

myStride no throws:      1112ms.
utfLikeStride no throws: 1433ms.
utfLikeStride throws:    1868ms. (the current implementation).
myStride throws:         8269ms.

Removing throws from utfLikeStride makes it about 25% faster.
Removing throws from myStride makes it about 7 times faster.

With -O0, myStride gets less than 10% slower than utfLikeStride (no throws).

In conclusion, the fastest implementation is myStride without throws, 
and it beats the current implementation by about 40%. Changing 
std.utf.stride may be desirable. As I said earlier, the throws do 
not enforce the validity of the string. Really checking the validity of 
the string would cost much more, which may not be desirable, so why 
bother checking at all? A more serious benchmark could justify changing 
std.utf.stride. The improvement could be even better in real situations, 
because the lookup table of utfLikeStride may not always be at hand - 
this really depends on what the compiler does.

In any case, this may not improve walkLength by more than a few 
percent.

-- 
Christophe

now I'll go back to my real work...

Sep 21 2011

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 21.09.2011 18:47, Christophe wrote:
 Actually, I don't buy it. I guess the reason it's faster is that it
 doesn't check if the codepoint is valid.

 Why should it? The documentation of std.utf.count says the string must
 be validly encoded, not that it will enforce that it is.
 Checking that a string is valid every time you use it would be very expensive.

 Actually, std.range.walkLength does not check the sequence is valid. See
 this test:

 void main()
 {
    string text = "aléluyah";
    char[] text2 = text.dup;
    text2[3] = 'a';
    writeln(walkLength(text2)); // outputs: 8
    writeln(text2);             // outputs: al\303aluyah
 }

Ouch, the checking is apparently very loose.

 There is probably a way to check that a UTF sequence is valid with an
 unrollable loop.

 In fact you can easily get ridiculous overflowed "negative" lengths.
 Maybe we can put it here as unsafe and fast version though.

 Unless I am mistaken, the minimum length myCount can return is 0 even
 if the string is invalid.

Yeah, a brain malfunction on my part.

 Also check std.utf.stride to see if you can get it better, it's the
 beast behind narrow string popFront.

 stride does not do much checking. It can even return 5 or 6, which is
 not possible for a valid UTF-8 string!

 The stride equivalent of myCount would be:

 size_t myStride(char c)
 {
      // optional:
      // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5))
      //     throw new UtfException("Not the start of the UTF-8 sequence");
      return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
 }

 That I compared to:

 size_t utfLikeStride(char c)
 {
    // optional:
    // immutable result = UTF8stride[c];
    // if (result == 0xFF)
    // throw new UtfException("Not the start of the UTF-8 sequence");
    // return result;
    return UTF8stride[c];
 }

 One table lookup is replaced by some byte arithmetic in myStride.

 I also took only one char as input, since stride only looks at the i-th
 character. Actually, if the stride signature is kept as "uint stride(char[]
 s, int i)", I did not find any change with -O3.

 Average times for "a lot" of calls:
 (compiled with gcc, tested with -O3 and a homogeneous distribution of
 "valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4')

 myStride no throws:      1112ms.
 utfLikeStride no throws: 1433ms.
 utfLikeStride throws:    1868ms. (the current implementation).
 myStride throws:         8269ms.

I wonder what impact, if any, changing 0xFF to 0x00 in the 
implementation of utfLikeStride would have. It should amount to cmp vs test; 
I'm not sure if it matters much.

 Removing throws from utfLikeStride makes it about 25% faster.
 Removing throws from myStride makes it about 7 times faster.

 With -O0, myStride gets less than 10% slower than utfLikeStride (no throws).

 In conclusion, the fastest implementation is myStride without throws,
 and it beats the current implementation by about 40%. Changing
 std.utf.stride may be desirable. As I said earlier, the throws do
 not enforce the validity of the string. Really checking the validity of
 the string would cost much more, which may not be desirable, so why
 bother checking at all?

The truth is I'd checked this in the past (though I used some bsr black 
magic), and if I kept the check in place the end result was always slower 
than the current one. But since the check is not very accurate anyway, maybe 
it can be replaced. It's problematic if some code happens to depend on it 
(given the doc, it should not).

 A more serious benchmark could justify changing
 std.utf.stride. The improvement could be even better in real situations,
 because the lookup table of utfLikeStride may not always be at hand -
 this really depends on what the compiler does.

Yes and no. I think it would be hard to find an app that bottlenecks on 
traversing UTF; on decoding, maybe. Generally, if you make a lot of calls to 
stride, the table is in cache; if not, it doesn't matter much(?). Though I'd 
prefer a non-tabulated version.

 In any case, this may not improve walkLength by more than a few
 percent.

Then specializing walkLength to do your unrollable version seems like 
a good idea.

-- 
Dmitry Olshansky

Sep 21 2011

zeljkog <zeljkog private.com> writes:

On 21.09.2011 01:57, Christophe wrote:
 size_t myCount(string text)
 {
    size_t n = text.length;
    for (uint i=0; i<text.length; ++i)
      {
        auto s = text[i]>>6;
        n -= (s>>1) - ((s+1)>>2);
      }
    return n;
 }

Here is a more readable and a bit faster version on dmd windows:

size_t utfCount(string text)
{
     size_t n = 0;
     for (uint i=0; i<text.length; ++i)
          n += ((text[i]>>6)^0b10)? 1: 0;
     return n;
}

Sep 21 2011

travert phare.normalesup.org (Christophe Travert) writes:

 Here is a more readable and a bit faster version on dmd windows:
 
 size_t utfCount(string text)
 {
      size_t n = 0;
      for (uint i=0; i<text.length; ++i)
           n += ((text[i]>>6)^0b10)? 1: 0;
      return n;
 }

Nice. It is better with gdc linux 64bits too. I wanted to avoid 
conditional expressions like ?: but it's actually slightly faster that 
way.

And now people can't tell it is dangerous because it could return a 
fuzzy number.

Even faster, though less readable:

size_t utfLength(string text)
{
  size_t n=0;
  for (size_t i=0; i<text.length; ++i)
    n += (((text[i]>>6)^0b10) != 0);
  return n;
}
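
The same idea can also be written with a bit mask, which some may find easier to read. This is only a sketch: the function name countNonContinuation is made up here, no timing claim is attached, and like the versions above it assumes valid UTF-8.

```d
import std.range : walkLength;

size_t countNonContinuation(const(char)[] text)
{
    size_t n = 0;
    foreach (char c; text)           // iterate code units, no decoding
        if ((c & 0xC0) != 0x80)      // skip continuation bytes (10xxxxxx)
            ++n;
    return n;
}

void main()
{
    foreach (text; ["", "abc", "al\u00E9luyah", "\u65E5\u672C\u8A9E", "a\U0001D11E"])
        assert(countNonContinuation(text) == walkLength(text));
}
```

Masking with 0xC0 and comparing against 0x80 tests the same top two bits as `(text[i]>>6)^0b10`.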

Let's see how we can boost std.utf.stride that way...

-- 
Christophe

Sep 21 2011

zeljkog <zeljkog private.com> writes:

On 21.09.2011 19:12, Christophe Travert wrote:
 Nice. It is better with gdc linux 64bits too. I wanted to avoid
 conditional expressions like ?: but it's actually slightly faster that
 way.

It is not compiled as a conditional jump.

Sep 21 2011

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 We specifically avoid having aliases in Phobos simply for having alternate
 function names. Aliases need to actually be useful, or they shouldn't be
 there.

And function names have to be useful to library users. walkLength is
an awful name for something that returns the character count.

If you ask a GUI developer to look for a function that creates a
rectangle path, you can be sure he'll start looking for Rectangle or
DrawRect or something similar, and not "ClosedShapePointN!4" or
something that generic.

Sep 20 2011

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Tuesday, September 20, 2011 15:10 Andrej Mitrovic wrote:
 On 9/20/11, Jonathan M Davis <jmdavisProg gmx.com> wrote:
 We specifically avoid having aliases in Phobos simply for having
 alternate function names. Aliases need to actually be useful, or they
 shouldn't be there.

 
 And function names have to be useful to library users. walkLength is
 an awful name for something that returns the character count.
 
 If you ask a GUI developer to look for a function that creates a
 rectangle path, you can be sure he'll start looking for Rectangle or
 DrawRect or something similar, and not "ClosedShapePointN!4" or
 something that generic.

In this case, if there's a problem it's not how generic the function is, it's 
the name walkLength. There's nothing special about strings which makes the 
name count better for them than it is for other ranges. The function is 
returning the number of elements in the range - be they code points or 
integers or whatever. The name walkLength works just as well for strings as it 
does for anything else. So, if there's a problem it's that the name walkLength 
isn't necessarily all that great. Strings aren't so special that they merit 
their own function name for the same functionality. So, if count stays, it's 
simply because it's been around for a while, not because it's inherently 
better to have a separate count function.

- Jonathan M Davis

Sep 20 2011
