www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Proposed Phobos equivalent of wcswidth()

reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
This past week, while reviewing Phobos PR #6008, I started experimenting
with an optimized D equivalent of wcswidth().

For more details, see:

	https://issues.dlang.org/show_bug.cgi?id=7054
	https://issues.dlang.org/show_bug.cgi?id=17810

as well as the discussion on:

	https://github.com/dlang/phobos/pull/6008

Anyway, the TL;DR summary is this: given a format() spec like "%20s", in
order to insert the correct number of spaces to pad the string to 20
characters (or rather, 20 spaces in the output), we need to compute the
displayed length of the string in monospace font.  Unfortunately, given
the complexities of Unicode, this is far from trivial:

- In C, the C library doesn't even pretend to know Unicode, so the
  padding is just based on the number of bytes the string occupies.
  Obviously, for anything non-ASCII the output will be wrong
  (misaligned).

- In D, in the original naïve implementation, we try to be a little
  smarter by counting the number of dchars. Unfortunately, this is also
  wrong, because of combining diacritics like U+0301 which modify the
  preceding character and do not advance the cursor. 

- In Phobos PR #6008, we improved this to count grapheme clusters
  instead. However, this is *still* wrong, because of the existence of
  zero-width characters (don't you just love Unicode?!), and also
  because of "wide" or "full-width" East Asian block characters as
  specified by Unicode TR11 (and scarily enough, the new Emoji blocks
  are included in this "wide" category), which on a text console
  generally occupies 2 positions per grapheme rather than 1.

Eventually, the solution boils down to implementing the equivalent of
Posix wcswidth().

But a naïve implementation of this is extremely inefficient, because
segmenting a Unicode string by grapheme and *then* computing its width
is non-trivial. So inefficient that it's just too slow to use in
format(), especially if most strings you'd pass to format() are
ASCII-only or mostly ASCII.

Thankfully, std.uni provides (some of) the tools to optimize this. The
basic idea is this: we don't actually care to segment graphemes; all we
want to do is to know, given some string s, how many display positions
it will occupy, so that we can insert the right number of spaces. The
actual grapheme segmentation and typesetting is the terminal's job, and
none of format()'s business.  So we can cut some corners while still
producing the right results.

Basically my current solution consists of:

- Parsing EastAsianWidth.txt published by the Unicode Consortium to
  precompute a table of wide/full-width characters (W and F) -- this is
  not done at runtime or compile-time, but as a separate step to
  generate the source code of the table, since otherwise it's either too
  slow at runtime or would slow down Phobos compilation too much, plus
  it depends on an external file which is not practical;

- Combining this table with Unicode category Grapheme_extend, plus a
  bunch of hand-coded zero-width characters to produce a mapping of
  every dchar to 0, 1, or 2. All characters that extend a grapheme, like
  a combining diacritic, maps to 0. All characters designated as Wide or
  Full-width (excluding grapheme extenders) map to 2. Everything else
  maps to 1.

- Compiling this table into a 3-level Trie (std.uni.Trie) for O(1)
  runtime lookup per dchar.

- Computing the display width, then, is just a matter of iterating over
  dchars in the string and summing the values looked up in the trie.

Of course, no matter how optimized a width lookup is, it's still pretty
slow for an ASCII-only string, which is 90% of the use cases of
format(). So to improve this common case, the additional optimization is
to scan the string for ASCII-only bytes, and just incrementing the width
since we know ASCII characters are always 1 column wide.  Only when we
encounter a non-ASCII byte that we bother with UTF-8 decoding and the
table lookup.

Here's my current implementation:

	https://github.com/quickfur/strwidth

Here's my current benchmark results:

- walkLength is literally passing the string to std.range.walkLength,
  which is basically counting the number of code points in the string.
  As mentioned before, this does not produce the correct width.

- byGraphemeWalk is the next step up, to count the number of graphemes
  using std.uni.byGrapheme.  Unfortunately, this is still not fully
  correct.

- graphemeStrideWalk is a slight optimization of byGraphemeWalk, by not
  actually decoding the grapheme, but just computing the stride. It also
  has the virtue of being usable in CTFE. Performance-wise, it's not
  that much different from byGraphemeWalk.

- width0 is the first "correct" string width computation, but with a
  naïve, slow implementation. It serves as a baseline to compare the
  next implementations.

- width1 is the trie-optimized version of width0. It shows significant
  improvement over width0, but is still very slow for ASCII strings
  compared with walkLength.

- width2 includes the optimization of skipping ASCII-only parts of the
  string, bypassing decoding and trie lookup.

- width3 is a variant of width2 that also skips trie lookup for
  characters below U+300, the first block that contains widths not equal
  to 1 (combining diacritics).  The benchmark results don't show
  significant performance differences with width2, however.


[walkLength] (100000 iterations):
	ASCII strings of 32 bytes:	24 ms, 483 μs, and 4 hnsecs
	Unicode strings of 32 bytes:	16 ms, 829 μs, and 2 hnsecs
	ASCII strings of 128 bytes:	94 ms, 538 μs, and 9 hnsecs
	Unicode strings of 128 bytes:	59 ms, 395 μs, and 6 hnsecs
	ASCII strings of 1024 bytes:	672 ms, 73 μs, and 7 hnsecs
	Unicode strings of 1024 bytes:	376 ms, 634 μs, and 7 hnsecs
[byGraphemeWalk] (100000 iterations):
	ASCII strings of 32 bytes:	953 ms, 353 μs, and 5 hnsecs
	Unicode strings of 32 bytes:	353 ms, 934 μs, and 2 hnsecs
	ASCII strings of 128 bytes:	3 secs, 862 ms, 747 μs, and 7 hnsecs
	Unicode strings of 128 bytes:	1 sec, 302 ms, 128 μs, and 6 hnsecs
	ASCII strings of 1024 bytes:	30 secs, 993 ms, 404 μs, and 1 hnsec
	Unicode strings of 1024 bytes:	9 secs, 900 ms, 579 μs, and 1 hnsec
[graphemeStrideWalk] (100000 iterations):
	ASCII strings of 32 bytes:	816 ms, 190 μs, and 9 hnsecs
	Unicode strings of 32 bytes:	307 ms, 141 μs, and 9 hnsecs
	ASCII strings of 128 bytes:	3 secs, 317 ms, 661 μs, and 7 hnsecs
	Unicode strings of 128 bytes:	1 sec, 138 ms, 172 μs, and 7 hnsecs
	ASCII strings of 1024 bytes:	26 secs, 903 ms, and 898 μs
	Unicode strings of 1024 bytes:	8 secs, 804 ms, 48 μs, and 1 hnsec
[width0] (100000 iterations):
	ASCII strings of 32 bytes:	1 sec, 62 ms, 163 μs, and 5 hnsecs
	Unicode strings of 32 bytes:	397 ms and 517 μs
	ASCII strings of 128 bytes:	4 secs, 228 ms, 269 μs, and 2 hnsecs
	Unicode strings of 128 bytes:	1 sec, 448 ms, 926 μs, and 5 hnsecs
	ASCII strings of 1024 bytes:	33 secs, 797 ms, 690 μs, and 5 hnsecs
	Unicode strings of 1024 bytes:	11 secs, 206 ms, 915 μs, and 8 hnsecs
[width1] (100000 iterations):
	ASCII strings of 32 bytes:	173 ms, 509 μs, and 8 hnsecs
	Unicode strings of 32 bytes:	80 ms, 730 μs, and 9 hnsecs
	ASCII strings of 128 bytes:	692 ms and 117 μs
	Unicode strings of 128 bytes:	296 ms, 89 μs, and 1 hnsec
	ASCII strings of 1024 bytes:	5 secs, 486 ms, 328 μs, and 3 hnsecs
	Unicode strings of 1024 bytes:	2 secs, 208 ms, 846 μs, and 9 hnsecs
[width2] (100000 iterations):
	ASCII strings of 32 bytes:	12 ms, 791 μs, and 9 hnsecs
	Unicode strings of 32 bytes:	69 ms, 51 μs, and 5 hnsecs
	ASCII strings of 128 bytes:	50 ms, 643 μs, and 3 hnsecs
	Unicode strings of 128 bytes:	283 ms, 572 μs, and 3 hnsecs
	ASCII strings of 1024 bytes:	319 ms, 527 μs, and 4 hnsecs
	Unicode strings of 1024 bytes:	2 secs, 200 ms, 473 μs, and 3 hnsecs
[width3] (100000 iterations):
	ASCII strings of 32 bytes:	12 ms, 927 μs, and 5 hnsecs
	Unicode strings of 32 bytes:	67 ms, 952 μs, and 9 hnsecs
	ASCII strings of 128 bytes:	50 ms, 322 μs, and 7 hnsecs
	Unicode strings of 128 bytes:	283 ms, 628 μs, and 6 hnsecs
	ASCII strings of 1024 bytes:	331 ms, 921 μs, and 9 hnsecs
	Unicode strings of 1024 bytes:	2 secs, 239 ms, 415 μs, and 5 hnsecs


Given these results, it appears that width2 is probably the best way to
go.

And now the obligatory bikeshed: what should the Phobos equivalent of
wcswidth be called? The current plan is just width(), but the name is
too simplistic and too likely to collide with user-defined identifiers.
Any suggestions?  Bring on the rainbow bikeshed! :-P


T

-- 
Without geometry, life would be pointless. -- VS
Jan 13
next sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote:
 ...
Thanks for taking the time to do this.
 And now the obligatory bikeshed: what should the Phobos 
 equivalent of wcswidth be called?
std.utf.displayWidth
Jan 15
next sibling parent reply Simen =?UTF-8?B?S2rDpnLDpXM=?= <simen.kjaras gmail.com> writes:
On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
 std.utf.displayWidth
+1 -- Simen
Jan 15
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjrs via Digitalmars-d wrote:
 On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
 std.utf.displayWidth
+1
[...] Why std.utf rather than std.uni, though? T -- ASCII stupid question, getty stupid ANSI.
Jan 15
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:
 On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjærås via 
 Digitalmars-d wrote:
 On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer 
 wrote:
 std.utf.displayWidth
+1
[...] Why std.utf rather than std.uni, though?
The way I understand it is that std.uni is (supposed to be) for functions on individual unicode units (be they code units/points or graphemes) and std.utf is for functions which handle operating on unicode strings. Obviously there are exceptions. I think "they" put graphemeStride in std.uni because Grapheme was defined there and it seemed reasonable at the time. But, generally I think utf stuff should go into std.utf.
Jan 15
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 15, 2018 at 06:20:16PM +0000, Jack Stouffer via Digitalmars-d wrote:
 On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:
 On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjrs via Digitalmars-d
 wrote:
 On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
 std.utf.displayWidth
+1
[...] Why std.utf rather than std.uni, though?
The way I understand it is that std.uni is (supposed to be) for functions on individual unicode units (be they code units/points or graphemes) and std.utf is for functions which handle operating on unicode strings.
Are you sure? I thought std.utf was specifically dealing with UTF-* encodings, i.e., code units and conversions to/from code points, and std.uni was supposed to be for implementing Unicode algorithms and Unicode compliance in general, i.e., stuff that works at the code point level.
 Obviously there are exceptions. I think "they" put graphemeStride in
 std.uni because Grapheme was defined there and it seemed reasonable at
 the time.  But, generally I think utf stuff should go into std.utf.
But displayWidth isn't really directly related to UTF (i.e., the encoding of Unicode code points). It seems to me to be more to do with processing Unicode in general, though, granted, the optimizations I implemented are kinda in a grey zone between dealing with Unicode proper (i.e., with code points) vs. working with code units. T -- Klein bottle for rent ... inquire within. -- Stephen Mulraney
Jan 15
prev sibling parent Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, January 15, 2018 10:37:14 H. S. Teoh via Digitalmars-d wrote:
 On Mon, Jan 15, 2018 at 06:20:16PM +0000, Jack Stouffer via Digitalmars-d 
wrote:
 On Monday, 15 January 2018 at 17:32:40 UTC, H. S. Teoh wrote:
 On Mon, Jan 15, 2018 at 02:14:56PM +0000, Simen Kjrs via
 Digitalmars-d

 wrote:
 On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
 std.utf.displayWidth
+1
[...] Why std.utf rather than std.uni, though?
The way I understand it is that std.uni is (supposed to be) for functions on individual unicode units (be they code units/points or graphemes) and std.utf is for functions which handle operating on unicode strings.
Are you sure? I thought std.utf was specifically dealing with UTF-* encodings, i.e., code units and conversions to/from code points, and std.uni was supposed to be for implementing Unicode algorithms and Unicode compliance in general, i.e., stuff that works at the code point level.
Your understanding of the division more or less matches mine, though I'm not sure that the line is entirely clearcut. I would definitely think that std.uni was the more appropriate place for such a function. - Jonathan M Davis
Jan 15
prev sibling parent WhatMeForget <kheaser gmail.com> writes:
On Monday, 15 January 2018 at 13:34:09 UTC, Jack Stouffer wrote:
 On Saturday, 13 January 2018 at 17:26:52 UTC, H. S. Teoh wrote:
 ...
Thanks for taking the time to do this.
 And now the obligatory bikeshed: what should the Phobos 
 equivalent of wcswidth be called?
std.utf.displayWidth
std.utf.bikeshed Never heard that phrase before. Nice one :)
Jan 15
prev sibling parent reply Kagamin <spam here.lot> writes:
columnWidth as it only makes sense for column-oriented text 
display.
Jan 15
parent Dominikus Dittes Scherkl <dominikus.scherkl continental-corporation.com> writes:
On Monday, 15 January 2018 at 15:08:24 UTC, Kagamin wrote:
 columnWidth as it only makes sense for column-oriented text 
 display.
I think displayWidth is better, because "width" is directly linked to hozizontal direction (else it would be called hight), and setting text in colums would still take additional steps to be set correct. Also "display" indicates that it has nothing to do with the string length, which is good to avoid confusion.
Jan 15