
digitalmars.D - non-utf8-decoding regex (for speed)?

Timothee Cour via Digitalmars-d <digitalmars-d puremagic.com> writes:
Is there a way to avoid decoding (as UTF-8) when calling the regex APIs,
or is there a plan to do so?

Use case: speed (no decoding) and avoiding exceptions thrown on invalid UTF-8 sequences.

Ideally this should allow:

---
auto s = cast(ubyte[]) "abcd"; // potentially not a valid UTF-8 sequence
auto r = cast(ubyte[]) `^\d`;
// right now: regex cannot deduce function from argument types !()(ubyte[])
auto m = match(s, r.regex);
---
Apr 05 2016
Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 06-Apr-2016 01:00, Timothee Cour via Digitalmars-d wrote:
 Is there a way to avoid decoding (as UTF-8) when calling the regex APIs,
 or is there a plan to do so?
Custom alphabets - yes, including ASCII.
 Use case: speed (no decoding) and avoiding exceptions thrown on invalid UTF-8 sequences.
The speed gain of ASCII-only matching over Unicode matching with an ASCII special case would be around 0.5% (the time spent on decoding), as my extensive profiling shows. The latest pull request for std.regex did exactly that: it added a special path for ASCII.
 ideally this should allow:

 ---
 auto s = cast(ubyte[]) "abcd"; // potentially not a valid UTF-8 sequence
 auto r = cast(ubyte[]) `^\d`;
 // right now: regex cannot deduce function from argument types !()(ubyte[])
 auto m = match(s, r.regex);
 ---
-- Dmitry Olshansky
Apr 06 2016
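
For readers hitting the same compile error, one possible workaround is sketched below. It is not the custom-alphabet support discussed above, and it does not avoid decoding. It only assumes that std.regex accepts character-based strings, so the bytes are reinterpreted as UTF-8 code units and checked with std.utf.validate before matching, which keeps invalid input from throwing inside the matcher. The helper name matchesBytes is hypothetical and not part of Phobos.

```d
import std.regex : regex, matchFirst;
import std.utf : validate, UTFException;

/// Hypothetical helper (not part of Phobos): returns true if `pattern`
/// matches somewhere in `data`. Returns false if there is no match or if
/// `data` is not valid UTF-8.
bool matchesBytes(const(ubyte)[] data, string pattern)
{
    // Reinterpret the bytes as UTF-8 code units; no copy is made.
    auto text = cast(const(char)[]) data;

    try
    {
        // Throws UTFException on malformed input, so we reject it here
        // instead of letting the regex engine throw mid-match.
        validate(text);
    }
    catch (UTFException)
    {
        return false;
    }

    return !matchFirst(text, regex(pattern)).empty;
}
```

With this sketch, `matchesBytes(cast(const(ubyte)[]) "abcd", `^\d`)` reports no match, since the input does not start with a digit. The up-front validation pass costs one extra scan of the input, so this trades a little speed for not throwing.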