digitalmars.D - Is there a native function to detect if file is UTF encoding?

"MacAsm" <jckj33 gmail.com> writes:
To call decode() from std.encoding I need to make sure the file is 
UTF-encoded (it may be ASCII too), otherwise it will skip over ASCII 
values. Is there anything native in D for this, or do I need to check 
the byte order mark and write one myself?
Aug 22 2014
"Dejan Lekic" <dejan.lekic gmail.com> writes:
On Friday, 22 August 2014 at 13:53:04 UTC, MacAsm wrote:
 To call decode() from std.encoding I need to make sure the file is 
 UTF-encoded (it may be ASCII too), otherwise it will skip over ASCII 
 values. Is there anything native in D for this, or do I need to check 
 the byte order mark and write one myself?
You may want to take a look at http://dlang.org/phobos/std_stream.html#.EndianStream.readBOM . Note that this module is scheduled for deprecation...
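If you would rather not depend on std.stream, a BOM check is short enough to write by hand. The following is only a minimal sketch, not what readBOM does internally: the `Bom` enum and the `detectBom` function are hypothetical names made up for this example, and only the UTF-8, UTF-16 and UTF-32 marks are covered.

```d
/// Hypothetical result type for this sketch; not part of Phobos.
enum Bom { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// Returns the BOM found at the start of `data`, or Bom.none.
/// Hypothetical helper written for illustration only.
Bom detectBom(const(ubyte)[] data) pure nothrow @nogc @safe
{
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    // True if `data` begins with the bytes in `prefix`.
    static bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
    {
        return data.length >= prefix.length
            && data[0 .. prefix.length] == prefix;
    }

    // UTF-32 LE must be tested before UTF-16 LE, because
    // FF FE 00 00 also starts with the UTF-16 LE mark FF FE.
    if (hasPrefix(data, utf32le)) return Bom.utf32le;
    if (hasPrefix(data, utf32be)) return Bom.utf32be;
    if (hasPrefix(data, utf8))    return Bom.utf8;
    if (hasPrefix(data, utf16le)) return Bom.utf16le;
    if (hasPrefix(data, utf16be)) return Bom.utf16be;

    // No BOM: the data may still be UTF-8 or ASCII, we just can't tell here.
    return Bom.none;
}
```

Keep in mind that Bom.none does not mean "not UTF": plenty of UTF-8 files have no BOM at all. There is also a known ambiguity: UTF-16 LE text whose first character is U+0000 looks the same as a UTF-32 LE BOM.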
Aug 22 2014
"Kiith-Sa" <kiithsacmp gmail.com> writes:
On Friday, 22 August 2014 at 13:53:04 UTC, MacAsm wrote:
 To call decode() from std.encoding I need to make sure the file is 
 UTF-encoded (it may be ASCII too), otherwise it will skip over ASCII 
 values. Is there anything native in D for this, or do I need to check 
 the byte order mark and write one myself?
This may be simpler for reference: https://github.com/kiith-sa/tinyendian/blob/master/source/tinyendian.d

Note that you _can't_ reliably differentiate between UTF-8 and plain ASCII, because not all UTF-8 files start with a UTF-8 BOM. However, you can (relatively) quickly determine whether a UTF-8/ASCII buffer contains only ASCII characters: every byte of a multi-byte UTF-8 sequence has its topmost bit set, and ASCII bytes never do, so you can use a 64-bit bitmask and check 8 characters at a time. See https://github.com/kiith-sa/D-YAML/blob/master/source/dyaml/reader.d, specifically the countASCII() function - it should be easy to change it into 'detectNonASCII':

/// Counts the number of ASCII characters in buffer until the first UTF-8 sequence.
///
/// Used to determine how many characters we can process without decoding.
size_t countASCII(const(char)[] buffer) @trusted pure nothrow @nogc
{
    // Needed for buffer.empty below.
    import std.array : empty;

    size_t count = 0;
    // The topmost bit in ASCII characters is always 0
    enum ulong Mask8 = 0x7f7f7f7f7f7f7f7f;
    enum uint Mask4 = 0x7f7f7f7f;
    enum ushort Mask2 = 0x7f7f;

    // Start by checking in 8-byte chunks.
    while(buffer.length >= Mask8.sizeof)
    {
        const block = *cast(typeof(Mask8)*)buffer.ptr;
        const masked = Mask8 & block;
        if(masked != block) { break; }
        count += Mask8.sizeof;
        buffer = buffer[Mask8.sizeof .. $];
    }

    // If 8 bytes didn't match, try 4, 2 bytes.
    import std.typetuple;
    foreach(Mask; TypeTuple!(Mask4, Mask2))
    {
        if(buffer.length < Mask.sizeof) { continue; }
        const block = *cast(typeof(Mask)*)buffer.ptr;
        const masked = Mask & block;
        if(masked != block) { continue; }
        count += Mask.sizeof;
        buffer = buffer[Mask.sizeof .. $];
    }

    // If even a 2-byte chunk didn't match, test just one byte.
    if(buffer.empty || buffer[0] >= 0x80) { return count; }
    ++count;
    return count;
}
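The 'detectNonASCII' variant mentioned above could look roughly like this. It is a sketch only and is not taken from D-YAML: the name `containsNonASCII` is hypothetical, and it assumes (like countASCII()) that unaligned 8-byte reads are acceptable on the target platform.

```d
/// Returns true if any byte in `buffer` has its topmost bit set,
/// i.e. the buffer is not plain ASCII.
/// Hypothetical adaptation of countASCII(); not part of D-YAML or Phobos.
bool containsNonASCII(const(char)[] buffer) @trusted pure nothrow @nogc
{
    // One high bit per byte; a nonzero AND means some byte is >= 0x80.
    enum ulong highBits = 0x8080_8080_8080_8080UL;

    // Check 8 bytes at a time.
    while (buffer.length >= ulong.sizeof)
    {
        // Unaligned read, same assumption as in countASCII().
        const block = *cast(const(ulong)*) buffer.ptr;
        if ((block & highBits) != 0) { return true; }
        buffer = buffer[ulong.sizeof .. $];
    }

    // Fewer than 8 bytes left: check them one by one.
    // With an inferred element type, foreach yields code units (no decoding).
    foreach (c; buffer)
    {
        if (c >= 0x80) { return true; }
    }
    return false;
}
```

If this returns false the buffer is pure ASCII and needs no decoding. If it returns true the buffer is not ASCII, but that alone does not prove it is valid UTF-8; it could be Latin-1 or some other 8-bit encoding.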
Aug 22 2014