digitalmars.D.learn - isAsciiString in Phobos?

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
I want to pass a string to a C function that expects an ASCII-only
string. What can I use to verify that there are no non-ASCII
characters in a D string? I haven't seen anything in Phobos.

I was thinking of using:

bool isAscii = mystring.all!(a => a <= 0xFF);

Is this safe?

I'm thinking of whether a code point can consist of two code units
such as [C1][C2], where C2 may be in the range 0 - 0xFF. I don't know
if that's possible (not a unicode pro here..).
Oct 07 2013
"Adam D. Ruppe" <destructionator gmail.com> writes:
On Monday, 7 October 2013 at 15:18:06 UTC, Andrej Mitrovic wrote:
 bool isAscii = mystring.all!(a => a <= 0xFF);
If you want strict ASCII, it should be <= 127 rather than 255, because bytes with the high bit set mean different things in different encodings (the first 256 Unicode code points match Latin-1 numerically, I think, but that's different from Windows-1252 or the various non-English extended ASCIIs).

You could also convert UTF-8 to ASCII... sort of... by just stripping out any byte > 127, since bytes higher than that are always part of multibyte sequences in UTF-8.
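To make the point about multibyte sequences concrete, here is a minimal sketch in D. The helper name asciiUnitCount is made up for illustration; representation and count are the only library calls used. It counts how many UTF-8 code units of a string fall below 0x80, which shows that the bytes of a multibyte sequence never land in the ASCII range.

```d
import std.algorithm : count;
import std.string : representation;

/// Hypothetical helper: number of UTF-8 code units in `s` whose high bit
/// is clear, i.e. that look like ASCII when viewed as raw bytes.
size_t asciiUnitCount(string s)
{
    return s.representation.count!(b => b < 0x80);
}

unittest
{
    // "é" is U+00E9, encoded in UTF-8 as the two code units 0xC3 0xA9.
    immutable(ubyte)[] bytes = "é".representation;
    assert(bytes.length == 2);
    assert(bytes[0] == 0xC3 && bytes[1] == 0xA9);

    // Neither code unit falls in the ASCII range.
    assert(asciiUnitCount("é") == 0);
    // Only the plain 'a' counts as an ASCII-looking byte here.
    assert(asciiUnitCount("aé") == 1);
}
```

So a check on raw bytes against 0x7F cannot be fooled by the second or later byte of a longer sequence.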
Oct 07 2013
Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 10/7/13, Adam D. Ruppe <destructionator gmail.com> wrote:
 If you want strict ASCII, it should be <= 127 rather than 255
 because the high bit can be all kinds of different encodings (the
 first 255 of unicode codepoints I think match latin-1
 numerically, but that's different than windows-1252 or various
 non-English extended asciis.)

 You could also convert utf-8 to ascii.... sort of... by just
 stripping out any byte > 127 since bytes higher than that are
 multibyte sequences in utf8.
Thanks. I got some useful info from Jakob on IRC, and ended up with this:

bool isAsciiString(string input)
{
    auto data = cast(const(ubyte)[])input;
    return data.all!(a => a <= 0x7F);
}

The cast is needed to avoid decoding by the "all" function.

Also there's isASCII in std.ascii, which works on a dchar, but I was looking for something that works on entire strings at once. So the above function does the work for me.

Should we put something like this in Phobos?
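For the original use case (handing the string to C), a sketch of how the check might be combined with the conversion to a zero-terminated string. The wrapper name toAsciiZ is hypothetical and not something in Phobos; it uses representation instead of the explicit cast, and strlen appears only to show that the result is usable from C.

```d
import core.stdc.string : strlen;
import std.algorithm : all;
import std.exception : assertThrown, enforce;
import std.string : representation, toStringz;

/// Byte-wise ASCII check, same idea as isAsciiString in the post.
bool isAsciiString(string input)
{
    return input.representation.all!(b => b <= 0x7F);
}

/// Hypothetical wrapper: reject non-ASCII input, then return a
/// zero-terminated pointer that can be passed to a C function.
immutable(char)* toAsciiZ(string input)
{
    enforce(isAsciiString(input), "non-ASCII character in string");
    return input.toStringz;
}

unittest
{
    // The C side sees the same five characters.
    assert(strlen(toAsciiZ("hello")) == 5);
    // A string containing 'é' is rejected before it reaches C.
    assertThrown(toAsciiZ("héllo"));
}
```

Throwing is only one possible policy; returning null or replacing the offending bytes would work just as well, depending on what the C function tolerates.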
Oct 07 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 7 October 2013 at 15:57:15 UTC, Andrej Mitrovic wrote:
 On 10/7/13, Adam D. Ruppe <destructionator gmail.com> wrote:
 If you want strict ASCII, it should be <= 127 rather than 255
 because the high bit can be all kinds of different encodings 
 (the
 first 255 of unicode codepoints I think match latin-1
 numerically, but that's different than windows-1252 or various
 non-English extended asciis.)

 You could also convert utf-8 to ascii.... sort of... by just
 stripping out any byte > 127 since bytes higher than that are
 multibyte sequences in utf8.
 Thanks. I got some useful info from Jakob on IRC, and ended up with this:

 bool isAsciiString(string input)
 {
     auto data = cast(const(ubyte)[])input;
     return data.all!(a => a <= 0x7F);
 }

 The cast is needed to avoid decoding by the "all" function.

 Also there's isASCII in std.ascii, which works on a dchar, but I was looking for something that works on entire strings at once. So the above function does the work for me.
You can use std.string.representation to do the cast for you, and you might as well just use isASCII anyway:

return input.representation().all!isASCII();

If we want even more efficiency, we could iterate over the string, interpreting it as a size_t[]. We mask each of its elements with 0x80808080 (32-bit) or 0x80808080_80808080 (64-bit), and if any of the masked elements is nonzero, then the string isn't ASCII.
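A small sketch of why the mask works, using plain integer arithmetic and no pointer casts. The helper names hasNonAscii and pack are made up for illustration, and the word is fixed at ulong rather than size_t to keep the example the same on 32-bit and 64-bit targets.

```d
import std.string : representation;

/// True if any of the eight bytes packed into `word` has its high bit set.
bool hasNonAscii(ulong word)
{
    enum ulong mask = 0x8080_8080_8080_8080;
    return (word & mask) != 0;
}

/// Hypothetical helper: packs up to eight bytes into one ulong.
/// Byte order does not matter for the mask test.
ulong pack(const(ubyte)[] bytes)
{
    assert(bytes.length <= 8);
    ulong word = 0;
    foreach (b; bytes)
        word = (word << 8) | b;
    return word;
}

unittest
{
    // Eight ASCII bytes: no high bit survives the mask.
    assert(!hasNonAscii(pack("abcdefgh".representation)));
    // 'é' takes two bytes (0xC3 0xA9), so this is also eight bytes,
    // and both of those bytes trip the mask.
    assert(hasNonAscii(pack("abcdéfg".representation)));
}
```

One AND and one comparison cover eight characters at once, which is where the speedup over the byte-wise loop would come from.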
Oct 07 2013
Andrej Mitrovic <andrej.mitrovich gmail.com> writes:
On 10/7/13, monarch_dodra <monarchdodra gmail.com> wrote:
 If we want even more efficiency, we could iterate on the string,
 interpreting it as a size_t[]. We mask each of its elements with
 0x80808080/0x80808080_80808080, and if one of the resulting
 masked elements is not null, then the string isn't ASCII.
Clever! So I think we should definitely try and push it to the library.
Oct 07 2013
"monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 7 October 2013 at 16:23:12 UTC, Andrej Mitrovic wrote:
 On 10/7/13, monarch_dodra <monarchdodra gmail.com> wrote:
 If we want even more efficiency, we could iterate on the 
 string,
 interpreting it as a size_t[]. We mask each of its elements 
 with
 0x80808080/0x80808080_80808080, and if one of the resulting
 masked elements is not null, then the string isn't ASCII.
Clever! So I think we should definitely try and push it to the library.
I wrote this. Only lightly tested.

//--------
bool isASCII(const(char[]) str)
{
    static if (size_t.sizeof == 8)
    {
        enum size = 8;
        enum size_t mask = 0x80808080_80808080;
        enum size_t alignMask = ~cast(size_t)0b111;
    }
    else
    {
        enum size = 4;
        enum size_t mask = 0x80808080;
        enum size_t alignMask = ~cast(size_t)0b11;
    }

    // Strings shorter than one word: check byte by byte.
    if (str.length < size)
    {
        foreach (c; str)
            if (c & 0x80)
                return false;
        return true;
    }

    // start: first word-aligned address strictly after str.ptr.
    // end: last word-aligned address at or before the end of the string.
    immutable start = (cast(size_t)str.ptr & alignMask) + size;
    immutable end = cast(size_t)(str.ptr + str.length) & alignMask;

    // We start with the aligned blocks, because that is faster, and
    // chances are the start is aligned anyway (so we check the leading
    // chars last).
    for (auto p = cast(size_t*)start; p != cast(size_t*)end; ++p)
        if (*p & mask)
            return false;

    // Then the trailing chars.
    for (auto p = cast(char*)end; p != str.ptr + str.length; ++p)
        if (*p & 0x80)
            return false;

    // Finally, the first chars.
    for (auto p = str.ptr; p != cast(char*)start; ++p)
        if (*p & 0x80)
            return false;

    return true;
}
//--------
assert( "hello".isASCII());
assert( "heellohelloellohelloellohelloellohellollohello".isASCII());
assert( "hellellohelloellohelloo"[3 .. $].isASCII());
assert(!"heéppellohelloellohelloellohelloellohelloellohellollo".isASCII());
assert(!"heppellohelloellohelloellohéelloellohelloellohellollo".isASCII());
assert(!"heppellohelloellohelloellohelloellohelloellohellolléo".isASCII());
//--------

What do you think? I have some doubts though:
1. Does x64 require qword alignment for size_t, or is dword enough?
2. Isn't there some built-in that'll give me the wanted alignment, instead of doing it by hand?
3. Are those casts 100% correct?
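Since the function is only lightly tested, one way to gain confidence is to compare it against a byte-at-a-time reference over many slice offsets, so that every combination of leading, block and trailing sections gets exercised. The sketch below is an assumption about how such a check could look. The names isAsciiSimple and agreesWithReference are hypothetical, and the candidate is passed as a template alias so the blocked isASCII from the post could be plugged in.

```d
import std.algorithm : all;
import std.string : representation;

/// Reference implementation: one byte at a time.
bool isAsciiSimple(const(char)[] s)
{
    return s.representation.all!(b => b < 0x80);
}

/// Hypothetical harness: true if `candidate` agrees with the reference
/// for every slice of a 40-byte buffer and every position of a single
/// non-ASCII byte (plus the case with no such byte at all).
bool agreesWithReference(alias candidate)()
{
    enum len = 40;
    foreach (bad; 0 .. len + 1) // bad == len means "no bad byte"
    {
        auto buf = new char[len];
        buf[] = 'a';
        if (bad < len)
            buf[bad] = cast(char) 0xC3;

        // Vary both ends so the slice starts and stops at every
        // possible offset relative to the word boundaries.
        foreach (start; 0 .. len)
            foreach (end; start .. len + 1)
                if (candidate(buf[start .. end]) != isAsciiSimple(buf[start .. end]))
                    return false;
    }
    return true;
}

unittest
{
    // The reference trivially agrees with itself.
    assert(agreesWithReference!isAsciiSimple());
    // A candidate that always answers "yes" is caught.
    assert(!agreesWithReference!((const(char)[] s) => true)());
}
```

With the function from the post in scope, agreesWithReference!isASCII() would be the call to try.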
Oct 07 2013