digitalmars.D - Is there a native function to detect if file is UTF encoding?

MacAsm (4/4) Aug 22 2014 To call decode() from std.encoding I need to make sure it is an

Dejan Lekic (4/8) Aug 22 2014 You may want to take a look at
Kiith-Sa (52/56) Aug 22 2014 This may be simpler for reference:

"MacAsm" <jckj33 gmail.com> writes:

To call decode() from std.encoding I need to make sure the input is 
UTF (it may be ASCII too), otherwise it will skip over ASCII values. 
Is there a native D function for this, or do I need to check the byte 
order mark and write one myself?

Aug 22 2014

"Dejan Lekic" <dejan.lekic gmail.com> writes:

On Friday, 22 August 2014 at 13:53:04 UTC, MacAsm wrote:
 To call decode() from std.encoding I need to make sure the input is 
 UTF (it may be ASCII too), otherwise it will skip over ASCII 
 values. Is there a native D function for this, or do I need to check 
 the byte order mark and write one myself?

You may want to take a look at 

Note that this module is scheduled for deprecation...

Aug 22 2014

"Kiith-Sa" <kiithsacmp gmail.com> writes:

On Friday, 22 August 2014 at 13:53:04 UTC, MacAsm wrote:
 To call decode() from std.encoding I need to make sure the input is 
 UTF (it may be ASCII too), otherwise it will skip over ASCII 
 values. Is there a native D function for this, or do I need to check 
 the byte order mark and write one myself?

This may be simpler for reference:

https://github.com/kiith-sa/tinyendian/blob/master/source/tinyendian.d


Note that you _can't_ reliably differentiate between UTF-8 and 
plain ASCII, because not all UTF-8 files start with a UTF-8 BOM.
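
For reference, checking for a byte order mark by hand takes only a few comparisons. The sketch below is an illustration, not a Phobos API: the `Bom` enum and the `detectBom` function are hypothetical names made up for this example. It recognizes only the standard BOM byte sequences and returns `Bom.none` for everything else, which says nothing about whether BOM-less data is ASCII or UTF-8.

```d
/// Hypothetical result type for this sketch; not part of Phobos.
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// Returns the BOM found at the start of `data`, or Bom.none if there is none.
Bom detectBom(const(ubyte)[] data) @safe pure nothrow @nogc
{
    // UTF-32 is tested before UTF-16 because the UTF-32 LE BOM
    // (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
    // Caveat: UTF-16 LE text whose first character is U+0000 looks the same.
    if (data.length >= 4)
    {
        if (data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return Bom.utf32le;
        if (data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
            return Bom.utf32be;
    }
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom.utf8;
    if (data.length >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE) return Bom.utf16le;
        if (data[0] == 0xFE && data[1] == 0xFF) return Bom.utf16be;
    }
    return Bom.none;
}

void main()
{
    import std.stdio : writeln;

    immutable(ubyte)[] withBom = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    immutable(ubyte)[] noBom = [0x68, 0x69];
    writeln(detectBom(withBom)); // utf8
    writeln(detectBom(noBom));   // none
}
```

A result of `Bom.none` only means that no BOM is present, so the data still has to be validated or inspected by other means.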

However, you can (relatively) quickly determine whether a UTF-8/ASCII 
buffer contains only ASCII characters: every byte of a multi-byte 
UTF-8 sequence has the topmost bit set and ASCII bytes never do, so 
you can use a 64-bit bitmask and check 8 characters at a time.

See 
https://github.com/kiith-sa/D-YAML/blob/master/source/dyaml/reader.d, 
specifically the countASCII() function. It should be easy to 
change it into 'detectNonASCII':


/// Counts the number of ASCII characters in buffer until the first UTF-8 sequence.
///
/// Used to determine how many characters we can process without decoding.
size_t countASCII(const(char)[] buffer) @trusted pure nothrow @nogc
{
     size_t count = 0;
     // The topmost bit in ASCII characters is always 0
     enum ulong Mask8 = 0x7f7f7f7f7f7f7f7f;
     enum uint Mask4 = 0x7f7f7f7f;
     enum ushort Mask2 = 0x7f7f;
     // Start by checking in 8-byte chunks.
     while(buffer.length >= Mask8.sizeof)
     {
         const block = *cast(typeof(Mask8)*)buffer.ptr;
         const masked = Mask8 & block;
         // Masking changed the block, so at least one byte has its top bit set.
         if(masked != block) { break; }
         count += Mask8.sizeof;
         buffer = buffer[Mask8.sizeof .. $];
     }
     // If 8 bytes didn't match, try 4, 2 bytes.
     // (AliasSeq is the newer name for std.typetuple.TypeTuple.)
     import std.meta : AliasSeq;
     foreach(Mask; AliasSeq!(Mask4, Mask2))
     {
         if(buffer.length < Mask.sizeof) { continue; }
         const block = *cast(typeof(Mask)*)buffer.ptr;
         const masked = Mask & block;
         if(masked != block) { continue; }
         count += Mask.sizeof;
         buffer = buffer[Mask.sizeof .. $];
     }
     // If even a 2-byte chunk didn't match, test just one byte.
     if(buffer.length == 0 || buffer[0] >= 0x80) { return count; }
     ++count;
     return count;
}
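
The 'detectNonASCII' variant mentioned above could look roughly like the following. This is only a sketch of one possible adaptation, not code from D-YAML: the exact name, signature and return type are assumptions. Instead of counting, it returns as soon as any byte with the top bit set is found, and it makes the same unaligned 8-byte reads as countASCII().

```d
/// Hypothetical adaptation of countASCII(); name and signature are assumptions.
///
/// Returns true if buffer contains at least one byte with the topmost bit set,
/// i.e. at least one byte that is not plain ASCII.
bool detectNonASCII(const(char)[] buffer) @trusted pure nothrow @nogc
{
    // One top bit per byte; a non-zero AND means some byte is >= 0x80.
    enum ulong HighBits8 = 0x8080808080808080;

    // Check 8 bytes at a time, as countASCII() does.
    while (buffer.length >= ulong.sizeof)
    {
        // Unaligned read, the same assumption the original code makes.
        const block = *cast(const(ulong)*) buffer.ptr;
        if ((block & HighBits8) != 0) { return true; }
        buffer = buffer[ulong.sizeof .. $];
    }
    // Fewer than 8 bytes remain: check them one by one (no decoding).
    foreach (char c; buffer)
    {
        if (c >= 0x80) { return true; }
    }
    return false;
}

void main()
{
    import std.stdio : writeln;

    writeln(detectNonASCII("plain ASCII text")); // false
    writeln(detectNonASCII("naïve"));            // true
}
```

A false result means the buffer is pure ASCII and can be processed without decoding. A true result only means that non-ASCII bytes are present, not that they form valid UTF-8.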

Aug 22 2014
