digitalmars.D.learn - isAsciiString in Phobos?

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

I want to pass a string to a C function that expects an ASCII-only
string. What can I use to verify that a D string contains no non-ASCII
characters? I haven't seen anything in Phobos.

I was thinking of using:

bool isAscii = mystring.all!(a => a <= 0xFF);

Is this safe?

I'm thinking of whether a code point can consist of two code units
such as [C1][C2], where C2 may be in the range 0 - 0xFF. I don't know
if that's possible (not a unicode pro here..).

Oct 07 2013

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Monday, 7 October 2013 at 15:18:06 UTC, Andrej Mitrovic wrote:
 bool isAscii = mystring.all!(a => a <= 0xFF);

If you want strict ASCII, the test should be <= 127 rather than <= 255, 
because bytes with the high bit set mean different things in different 
encodings. (The first 256 Unicode code points match Latin-1 
numerically, I think, but that differs from Windows-1252 and the 
various non-English extended ASCIIs.)

You could also convert UTF-8 to ASCII, sort of, by just 
stripping out every byte > 127, since in UTF-8 those bytes only 
occur as part of multi-byte sequences.
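
A minimal sketch of that point, not taken from the thread (the helper name stripNonAscii and the sample strings are made up for illustration): every code unit of a multi-byte UTF-8 sequence has its high bit set, so a byte in the range 0 to 127 can only be a genuine ASCII character.

```d
import std.algorithm : all, filter;
import std.array : array;
import std.string : representation;

// Hypothetical helper: keep only the bytes that are plain ASCII.
const(char)[] stripNonAscii(string s)
{
    auto kept = s.representation.filter!(b => b <= 0x7F).array;
    return cast(const(char)[]) kept;
}

void main()
{
    // 'é' (U+00E9) is encoded in UTF-8 as two code units, 0xC3 0xA9.
    immutable(ubyte)[] bytes = "é".representation;
    assert(bytes == [0xC3, 0xA9]);

    // Both the lead byte and the continuation byte are above 127.
    assert(bytes.all!(b => b > 0x7F));

    // Dropping every byte above 127 removes the whole multi-byte sequence.
    assert(stripNonAscii("héllo") == "hllo");
}
```

Note that stripping silently drops characters rather than transliterating them, so it is lossy.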

Oct 07 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 10/7/13, Adam D. Ruppe <destructionator gmail.com> wrote:
 If you want strict ASCII, it should be <= 127 rather than 255
 because the high bit can be all kinds of different encodings (the
 first 255 of unicode codepoints I think match latin-1
 numerically, but that's different than windows-1252 or various
 non-English extended asciis.)

 You could also convert utf-8 to ascii.... sort of... by just
 stripping out any byte > 127 since bytes higher than that are
 multibyte sequences in utf8.

Thanks. I got some useful info from Jakob on IRC, and ended up with this:

bool isAsciiString(string input)
{
    auto data = cast(const(ubyte)[])input;
    return data.all!(a => a <= 0x7F);
}

The cast is needed to avoid decoding by the "all" function. Also
there's isASCII that works on a dchar in std.ascii, but I was looking
for something that works on entire strings at once. So the above
function does the work for me.
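
The role of the cast can be sketched as follows. This is only an illustration, assuming Phobos' usual behaviour of presenting a string to range algorithms as decoded dchar code points; the name isAsciiBytes is hypothetical.

```d
import std.algorithm : all;
import std.range : ElementType;

// Hypothetical name; same idea as the function above.
bool isAsciiBytes(string input)
{
    // Viewing the string as bytes lets all() walk raw code units
    // instead of decoding UTF-8 into code points first.
    auto data = cast(const(ubyte)[]) input;
    return data.all!(a => a <= 0x7F);
}

void main()
{
    // Range algorithms treat a string as a range of dchar...
    static assert(is(ElementType!string == dchar));
    // ...whereas the ubyte view is iterated one code unit at a time.
    static assert(is(ElementType!(const(ubyte)[]) == const(ubyte)));

    assert(isAsciiBytes("hello"));
    assert(!isAsciiBytes("héllo"));

    // For valid UTF-8 the decoding version gives the same answer;
    // it just does more work per character.
    assert(!"héllo".all!(c => c <= 0x7F));
}
```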

Should we put something like this in Phobos?

Oct 07 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Monday, 7 October 2013 at 15:57:15 UTC, Andrej Mitrovic wrote:
 On 10/7/13, Adam D. Ruppe <destructionator gmail.com> wrote:
 If you want strict ASCII, it should be <= 127 rather than 255
 because the high bit can be all kinds of different encodings (the
 first 255 of unicode codepoints I think match latin-1
 numerically, but that's different than windows-1252 or various
 non-English extended asciis.)

 You could also convert utf-8 to ascii.... sort of... by just
 stripping out any byte > 127 since bytes higher than that are
 multibyte sequences in utf8.

 Thanks. I got some useful info from Jakob from IRC, and ended up with this:

 bool isAsciiString(string input)
 {
     auto data = cast(const(ubyte)[])input;
     return data.all!(a => a <= 0x7F);
 }

 The cast is needed to avoid decoding by the "all" function. Also
 there's isASCII that works on a dchar in std.ascii, but I was looking
 for something that works on entire strings at once. So the above
 function does the work for me.

You can use std.string.representation to do the cast for you, and 
you might as well just use isASCII anyways.

return input.representation().all!isASCII();
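
Spelled out as a complete program, that suggestion might look like the sketch below. The wrapper name isAsciiString is reused from the earlier post; the test strings are made up.

```d
import std.algorithm : all;
import std.ascii : isASCII;
import std.string : representation;

bool isAsciiString(string input)
{
    // representation() yields immutable(ubyte)[] without copying,
    // so no explicit cast is needed and no decoding takes place.
    return input.representation.all!isASCII;
}

void main()
{
    assert(isAsciiString("plain text"));
    assert(isAsciiString(""));          // an empty string has no non-ASCII bytes
    assert(!isAsciiString("naïve"));
}
```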

If we want even more efficiency, we could iterate over the string 
while interpreting it as a size_t[]. We mask each element with 
0x80808080 (32-bit) or 0x80808080_80808080 (64-bit), and if any 
masked element is nonzero, then the string isn't ASCII.
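
The masking idea on its own, leaving aside how the words are loaded from the string, can be sketched like this (the helper names hasNonAscii32 and hasNonAscii64 are hypothetical):

```d
// True if any of the four bytes packed into `word` has its high bit set.
bool hasNonAscii32(uint word)
{
    return (word & 0x8080_8080) != 0;
}

// Same test for eight bytes at once.
bool hasNonAscii64(ulong word)
{
    return (word & 0x8080_8080_8080_8080) != 0;
}

void main()
{
    // The bytes of "abcd" packed into one word: all are below 0x80.
    assert(!hasNonAscii32(0x61_62_63_64));
    // Swap one byte for 0xC3 (a UTF-8 lead byte) and the mask catches it.
    assert(hasNonAscii32(0x61_C3_63_64));

    assert(!hasNonAscii64(0x61_62_63_64_65_66_67_68));
    assert(hasNonAscii64(0x61_62_63_64_65_66_A9_68));
}
```

One AND and one comparison replace four (or eight) per-byte tests, which is where the speed-up would come from.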

Oct 07 2013

Andrej Mitrovic <andrej.mitrovich gmail.com> writes:

On 10/7/13, monarch_dodra <monarchdodra gmail.com> wrote:
 If we want even more efficiency, we could iterate on the string,
 interpreting it as a size_t[]. We mask each of its elements with
 0x80808080/0x80808080_80808080, and if one of the resulting
 masked elements is not null, then the string isn't ASCII.

Clever! So I think we should definitely try and push it to the library.

Oct 07 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Monday, 7 October 2013 at 16:23:12 UTC, Andrej Mitrovic wrote:
 On 10/7/13, monarch_dodra <monarchdodra gmail.com> wrote:
 If we want even more efficiency, we could iterate on the string,
 interpreting it as a size_t[]. We mask each of its elements with
 0x80808080/0x80808080_80808080, and if one of the resulting
 masked elements is not null, then the string isn't ASCII.

 Clever! So I think we should definitely try and push it to the library.

I wrote this (only lightly tested):

//--------
bool isASCII(const(char[]) str)
{
     static if (size_t.sizeof == 8)
     {
         enum size = 8;
         enum size_t mask  = 0x80808080_80808080;
         enum size_t alignMask = ~cast(size_t)0b111;
     }
     else
     {
         enum size = 4;
         enum size_t mask = 0x80808080;
         enum size_t alignMask = ~cast(size_t)0b11;
     }

     if (str.length < size)
     {
         foreach (c; str)
             if (c & 0x80)
                 return false;
         return true;
     }

     immutable start = (cast(size_t)str.ptr & alignMask) + size;
     immutable end = cast(size_t)(str.ptr + str.length) & alignMask;

     //Scan the aligned word-sized blocks first, because that is the
     //fastest part. The leading chars are checked last (chances are
     //the start is aligned anyway).
     for ( auto p = cast(size_t*)start ; p != cast(size_t*)end ; ++p )
         if (*p & mask)
             return false;

     //Then the trailing chars.
     for ( auto p = cast(char*)end ; p != str.ptr + str.length ; ++p )
         if (*p & 0x80)
             return false;

     //Finally, the first chars.
     for ( auto p = str.ptr ; p != cast(char*)start ; ++p )
         if (*p & 0x80)
             return false;

     return true;
}
//--------
unittest
{
     assert( "hello".isASCII());
     assert( "heellohelloellohelloellohelloellohellollohello".isASCII());
     assert( "hellellohelloellohelloo"[3 .. $].isASCII());
     assert(!"heéppellohelloellohelloellohelloellohelloellohellollo".isASCII());
     assert(!"heppellohelloellohelloellohéelloellohelloellohellollo".isASCII());
     assert(!"heppellohelloellohelloellohelloellohelloellohellolléo".isASCII());
}
//--------

What do you think? I have some doubts though:
1. Does x64 require qword alignment for size_t, or is dword enough?
2. Isn't there some built-in that'll give me the wanted alignment, 
instead of doing it by hand?
3. Are those casts 100% correct?
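
On the second question, one thing that may help is the .alignof property that D provides on every type. The sketch below only shows how it could replace the hard-coded 0b111 / 0b11 masks; it does not settle what alignment the hardware actually requires, and the helper name alignUp is hypothetical.

```d
import std.stdio : writeln;

// Hypothetical helper: round `addr` up to the next multiple of
// `alignment`, which must be a power of two.
size_t alignUp(size_t addr, size_t alignment)
{
    return (addr + alignment - 1) & ~(alignment - 1);
}

void main()
{
    // The alignment the compiler uses for size_t on this target.
    writeln("size_t.alignof = ", size_t.alignof);

    assert(alignUp(13, 8) == 16);
    assert(alignUp(16, 8) == 16);   // already aligned: unchanged
    assert(alignUp(17, 4) == 20);

    // Rounding a string's start address up to a size_t boundary.
    string s = "hello, world";
    immutable addr = cast(size_t) s.ptr;
    assert(alignUp(addr, size_t.alignof) % size_t.alignof == 0);
}
```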

Oct 07 2013
