digitalmars.D - UTF8 + SIMD = win

reply deadalnix <deadalnix gmail.com> writes:
http://woboq.com/blog/utf-8-processing-using-simd.html

All in the article. As D includes Unicode as a language feature, I think 
it is interesting to mention here.
Jul 30 2012
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
deadalnix:
 http://woboq.com/blog/utf-8-processing-using-simd.html

So many things to do, so little time to do them :-) Bye, bearophile
Jul 30 2012
prev sibling next sibling parent Guillaume Chatelet <chatelet.guillaume gmail.com> writes:
On 07/30/12 21:13, deadalnix wrote:
 http://woboq.com/blog/utf-8-processing-using-simd.html
 
 All in the article. As D includes Unicode as a language feature, I think
 it is interesting to mention here.

Very interesting, thx for sharing. This NG definitely is a horn of plenty :)
Jul 30 2012
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 7/30/2012 12:13 PM, deadalnix wrote:
 http://woboq.com/blog/utf-8-processing-using-simd.html

 All in the article. As D includes Unicode as a language feature, I think it is
 interesting to mention here.

If someone wants to fix std.utf (http://dlang.org/phobos/std_utf.html) to use SIMD instructions, that would be cool!
Jul 31 2012
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 to use SIMD instructions, that would be cool!

I think in D the most needed UTF operation is UTF8 -> UTF32. Bye, bearophile
Jul 31 2012
parent Walter Bright <newshound2 digitalmars.com> writes:
On 7/31/2012 5:24 AM, Jakob Ovrum wrote:
 On Tuesday, 31 July 2012 at 12:11:25 UTC, bearophile wrote:
 Bernard Helyer:

 Where is UTF-32 actually used?

I think all std.algorithm and std.range yield UTF-32 dchars, when you give them a string in input. Bye, bearophile

In addition, foreach over a string with a dchar loop variable does implicit UTF-8 decoding.

SIMD isn't going to speed things up at all for decoding one character. It is for transcoding a large array.
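
To make the distinction concrete, here is a minimal C++ sketch of bulk UTF-8 to UTF-32 transcoding with a vectorized fast path for runs of ASCII. It is only an illustration under stated assumptions: it is not the algorithm from the linked article and not anything in std.utf, the names `transcode` and `decode_scalar` are made up, validation is minimal, and only pure-ASCII blocks of 16 bytes take the SSE2 path (everything else falls back to one code point at a time).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical scalar fallback: decodes one code point at p[i] and advances i.
// Minimal validation only (no overlong or surrogate checks).
char32_t decode_scalar(const unsigned char* p, std::size_t n, std::size_t& i) {
    assert(i < n);
    const unsigned char b0 = p[i];
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");
    if (i + len > n) throw std::invalid_argument("truncated UTF-8 sequence");
    for (std::size_t k = 1; k < len; ++k) {
        if ((p[i + k] & 0xC0) != 0x80)
            throw std::invalid_argument("invalid continuation byte");
        cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    i += len;
    return cp;
}

// Transcodes a whole UTF-8 buffer to UTF-32.
std::u32string transcode(const std::string& in) {
    std::u32string out;
    out.reserve(in.size());
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
#if defined(__SSE2__)
        // Fast path: widen 16 ASCII bytes to 16 code points per iteration.
        while (i + 16 <= n) {
            const __m128i chunk =
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
            // movemask gathers the top bit of every byte; nonzero means non-ASCII.
            if (_mm_movemask_epi8(chunk) != 0) break;
            const __m128i zero = _mm_setzero_si128();
            const __m128i lo = _mm_unpacklo_epi8(chunk, zero);  // bytes 0..7 as u16
            const __m128i hi = _mm_unpackhi_epi8(chunk, zero);  // bytes 8..15 as u16
            char32_t wide[16];
            _mm_storeu_si128(reinterpret_cast<__m128i*>(wide + 0),  _mm_unpacklo_epi16(lo, zero));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(wide + 4),  _mm_unpackhi_epi16(lo, zero));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(wide + 8),  _mm_unpacklo_epi16(hi, zero));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(wide + 12), _mm_unpackhi_epi16(hi, zero));
            out.append(wide, 16);
            i += 16;
        }
#endif
        if (i == n) break;
        // Slow path: one code point, then try the fast path again.
        out.push_back(decode_scalar(p, n, i));
    }
    return out;
}
```

The fast path only pays off when there are many bytes to process at once; called for a single character, the function does nothing but the scalar work.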
Jul 31 2012
prev sibling next sibling parent "Bernard Helyer" <b.helyer gmail.com> writes:
On Tuesday, 31 July 2012 at 10:57:23 UTC, bearophile wrote:
 Walter Bright:

 to use SIMD instructions, that would be cool!

I think in D the most needed UTF operation is UTF8 -> UTF32. Bye, bearophile

Where is UTF-32 actually used?
Jul 31 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Bernard Helyer:

 Where is UTF-32 actually used?

I think all std.algorithm and std.range yield UTF-32 dchars, when you give them a string in input. Bye, bearophile
Jul 31 2012
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Tuesday, 31 July 2012 at 12:11:25 UTC, bearophile wrote:
 Bernard Helyer:

 Where is UTF-32 actually used?

I think all std.algorithm and std.range yield UTF-32 dchars, when you give them a string in input. Bye, bearophile

In addition, foreach over a string with a dchar loop variable does implicit UTF-8 decoding.
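
As a rough illustration of what that implicit decoding amounts to, the following C++ sketch walks a UTF-8 string one code point at a time, much as `foreach (dchar c; s)` does in D. C++ is used here only to keep the sketch self-contained; `decode_one` and `decode_all` are hypothetical names, not std.utf functions, and validation is minimal (no overlong or surrogate checks).

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes the code point that starts at s[i] and advances i past it.
char32_t decode_one(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    // The lead byte tells us the sequence length and carries the top bits.
    if (b0 < 0x80)                { len = 1; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
    else throw std::invalid_argument("invalid UTF-8 lead byte");
    if (i + len > s.size()) throw std::invalid_argument("truncated UTF-8 sequence");
    // Each continuation byte (10xxxxxx) contributes six more bits.
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
        cp = (cp << 6) | (b & 0x3F);
    }
    i += len;
    return cp;
}

// Equivalent of looping over the string with a dchar loop variable.
std::u32string decode_all(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) out.push_back(decode_one(s, i));
    return out;
}
```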
Jul 31 2012
prev sibling next sibling parent "Jakob Ovrum" <jakobovrum gmail.com> writes:
On Tuesday, 31 July 2012 at 19:28:03 UTC, Walter Bright wrote:
 On 7/31/2012 5:24 AM, Jakob Ovrum wrote:
 On Tuesday, 31 July 2012 at 12:11:25 UTC, bearophile wrote:
 Bernard Helyer:

 Where is UTF-32 actually used?

I think all std.algorithm and std.range yield UTF-32 dchars, when you give them a string in input. Bye, bearophile

In addition, foreach over a string with a dchar loop variable does implicit UTF-8 decoding.

SIMD isn't going to speed things up at all for decoding one character. It is for transcoding a large array.

Duh, good point, I totally forgot the context.
Jul 31 2012
prev sibling next sibling parent "Tobias Pankrath" <tobias pankrath.net> writes:
On Tuesday, 31 July 2012 at 19:28:03 UTC, Walter Bright wrote:
 On 7/31/2012 5:24 AM, Jakob Ovrum wrote:
 On Tuesday, 31 July 2012 at 12:11:25 UTC, bearophile wrote:
 Bernard Helyer:

 Where is UTF-32 actually used?

I think all std.algorithm and std.range yield UTF-32 dchars, when you give them a string in input. Bye, bearophile

In addition, foreach over a string with a dchar loop variable does implicit UTF-8 decoding.

SIMD isn't going to speed things up at all for decoding one character. It is for transcoding a large array.

You could decode them in advance.
Jul 31 2012
prev sibling next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 SIMD isn't going to speed things up at all for decoding one 
 character. It is for transcoding a large array.

Right. Maybe you remember my two or three posts about vectorized laziness and related matters (an idea that was later partly implemented in the half-eager map of std.parallelism). Introducing some vectorized laziness in std.algorithm when the iterable is a UTF-8 (or, rarely, UTF-16) string would allow the use of SIMD and would probably lead to higher performance. Bye, bearophile
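
One possible shape for such a half-eager decoder, sketched in C++ under loud assumptions: `ChunkedDecoder` is a hypothetical class, not part of std.algorithm or std.parallelism, and its refill step uses a plain scalar loop where a SIMD transcoder would go. It hands out one code point per call but decodes up to `chunk` code points ahead each time its buffer runs dry.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical half-eager decoder: lazy for the caller, eager per chunk.
class ChunkedDecoder {
public:
    ChunkedDecoder(const std::string& utf8, std::size_t chunk)
        : src_(utf8), chunk_(chunk) { assert(chunk > 0); }

    // Fetches the next code point; returns false at end of input.
    bool next(char32_t& out) {
        if (head_ == buf_.size() && !refill()) return false;
        out = buf_[head_++];
        return true;
    }

    // Input bytes decoded so far, including ones the caller has not asked for yet.
    std::size_t bytes_decoded() const { return pos_; }

private:
    // Decodes up to chunk_ code points in one go; this is where SIMD would apply.
    bool refill() {
        buf_.clear();
        head_ = 0;
        while (buf_.size() < chunk_ && pos_ < src_.size()) buf_.push_back(decode_one());
        return !buf_.empty();
    }

    // Scalar decode of one code point at pos_ (minimal validation only).
    char32_t decode_one() {
        const unsigned char b0 = static_cast<unsigned char>(src_[pos_]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (pos_ + len > src_.size()) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(src_[pos_ + k]);
            if ((b & 0xC0) != 0x80) throw std::invalid_argument("invalid continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        pos_ += len;
        return cp;
    }

    std::string src_;            // owned copy of the input
    std::size_t chunk_;          // maximum code points decoded per refill
    std::size_t pos_ = 0;        // next undecoded byte in src_
    std::vector<char32_t> buf_;  // decoded but not yet consumed code points
    std::size_t head_ = 0;       // next unread entry in buf_
};
```

The chunk size bounds the wasted work: a caller who stops after the first character has paid for at most one chunk of decoding, not for the whole string.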
Jul 31 2012
prev sibling parent "jerro" <a a.com> writes:
On Tuesday, 31 July 2012 at 19:41:02 UTC, Tobias Pankrath wrote:
 On Tuesday, 31 July 2012 at 19:28:03 UTC, Walter Bright wrote:
 On 7/31/2012 5:24 AM, Jakob Ovrum wrote:
 On Tuesday, 31 July 2012 at 12:11:25 UTC, bearophile wrote:
 Bernard Helyer:

 Where is UTF-32 actually used?

I think all std.algorithm and std.range yield UTF-32 dchars, when you give them a string in input. Bye, bearophile

In addition, foreach over a string with a dchar loop variable does implicit UTF-8 decoding.

SIMD isn't going to speed things up at all for decoding one character. It is for transcoding a large array.

You could decode them in advance.

The problem is you don't know how much you are going to need. This would actually hurt performance in some cases.
Jul 31 2012