digitalmars.D.learn - Suppressing UTFException / Squashing Bad Codepoints?

John Carter (43/43) Dec 23 2013 This frustrated me in Ruby unicode too....

Brad Anderson (6/34) Dec 23 2013 The encoding schemes in std.encoding support cleaning up input

John Carter (29/68) Dec 23 2013 Eww.

Brad Anderson (3/9) Dec 23 2013 Pull requests with improvements are always welcome. I don't think

Jonathan M Davis (5/18) Dec 23 2013 Andrei has also stated that he thinks that it's a failed experiment. I t...

John Carter <john.carter taitradio.com> writes:

This frustrated me in Ruby unicode too....

Typically I/O is the ultimate in "untrusted and untrustworthy" sources,
coming usually from systems beyond my control.

Likely to be corrupted, or maliciously crafted, or defective...

Unfortunately not all sequences of bytes are valid UTF-8.

Thus inevitably in every collection of inputs there are always going to be
around 1 in a million codepoints resulting in a UTFException being thrown.

Alas, I always have to do Regex matches on the other 999999 valid
codepoints.....

Is there a standard recipe in std.stdio for squashing bad codepoints to some
default?

These days memory is very much larger than most files I want to scan.

So if I were doing this in C, I would typically mmap the file with
PROT_READ | PROT_WRITE and MAP_PRIVATE, run down the file squashing bad
codepoints, and then run down it again matching patterns.
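
The squashing pass described above might look roughly like this in D. This is only a sketch: `squashInPlace` is a made-up name, '?' is an arbitrary choice of default, and it assumes that overwriting one byte at a time is acceptable (it keeps the buffer length unchanged, which is what an in-place pass over a private mapping needs).

```d
import std.utf : decode, UTFException;

/// Hypothetical helper (not part of Phobos): overwrite every byte that does
/// not begin a valid UTF-8 sequence with '?', in place.
/// Returns the number of bytes that were squashed.
size_t squashInPlace(char[] buf)
{
    size_t squashed = 0;
    size_t i = 0;
    while (i < buf.length)
    {
        size_t next = i; // work on a copy so i is untouched if decode throws
        try
        {
            decode(buf, next); // advances next past one code point
            i = next;
        }
        catch (UTFException e)
        {
            buf[i] = '?'; // squash just this byte, then retry at the next one
            ++i;
            ++squashed;
        }
    }
    return squashed;
}

void main()
{
    // 0xC3 0xA9 is a valid "é"; a lone 0xFF is never valid in UTF-8.
    ubyte[] bytes = [0x6F, 0xC3, 0xA9, 0xFF, 0x6B];
    char[] buf = cast(char[]) bytes;
    assert(squashInPlace(buf) == 1);
    assert(buf == "o\u00E9?k");
}
```

For the mmap part, std.mmfile's MmFile with MmFile.Mode.readCopyOnWrite looks like the nearest equivalent of MAP_PRIVATE, and the helper could presumably be run over `cast(char[]) mmf[]`, but I have not measured that.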

In Ruby I have a horridly inefficient utility....
      def IO.read_utf_8(file)
        read(file,:external_encoding=>'ASCII-8BIT').encode('UTF-8',:undef=>:replace)
      end

What is the idiomatic D solution to this conundrum?

-- 
John Carter
Phone : (64)(3) 358 6639
Tait Electronics
PO Box 1645 Christchurch
New Zealand

-- 

------------------------------
This email, including any attachments, is only for the intended recipient. 
It is subject to copyright, is confidential and may be the subject of legal 
or other privilege, none of which is waived or lost by reason of this 
transmission.
If you are not an intended recipient, you may not use, disseminate, 
distribute or reproduce such email, any attachments, or any part thereof. 
If you have received a message in error, please notify the sender 
immediately and erase all copies of the message and any attachments.
Unfortunately, we cannot warrant that the email has not been altered or 
corrupted during transmission nor can we guarantee that any email or any 
attachments are free from computer viruses or other conditions which may 
damage or interfere with recipient data, hardware or software. The 
recipient relies upon its own procedures and assumes all risk of use and of 
opening any attachments.
------------------------------

Dec 23 2013

"Brad Anderson" <eco gnuk.net> writes:

On Monday, 23 December 2013 at 20:48:08 UTC, John Carter wrote:
 This frustrated me in Ruby unicode too....

 Typically i/o is the ultimate in "untrusted and untrustworthy" 
 sources,
 coming usually from systems beyond my control.

 Likely to be corrupted, or maliciously crafted, or defective...

 Unfortunately not all sequences of bytes are valid UTF8.

 Thus inevitably in every collection of inputs there are always 
 going to be
 around 1 in a million codepoints resulting in an UTFException 
 thrown.

 Alas, I always have to do Regex matches on the other 999999 
 valid
 codepoints.....

 Is there a standard recipe in stdio for squashing bad 
 codepoints to some
 default?

 These days memory is very much larger than most files I want to 
 scan.

 So if I was doing this in C I would typically mmap the file 
 PROT_READ |
 PROT_WRITE and MAP_PRIVATE then run down the file squashing bad 
 codepoints
 and then run down it again matching patterns.

 In Ruby I have a horridly inefficient utility....
       def IO.read_utf_8(file)
         read(file,:external_encoding=>'ASCII-8BIT').encode('UTF-8',:undef=>:replace)
       end

 What is the idiomatic D solution to this conundrum?

The encoding schemes in std.encoding support cleaning up input 
using the sanitize function.



It'd be nicer if the API were range based but it seems to do the 
trick in my experience.
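
A minimal sketch of what that looks like, assuming the input is already in memory as raw bytes (for a file, `std.file.read` would give you those). `squash` is just an illustrative name, not something in Phobos.

```d
import std.encoding : isValid, sanitize;

/// Hypothetical helper: treat raw bytes as UTF-8 and let std.encoding
/// replace whatever does not decode.
string squash(immutable(ubyte)[] raw)
{
    // The cast only reinterprets the bytes; no validation happens here.
    return sanitize(cast(string) raw);
}

void main()
{
    // "ok", a space, one stray 0xFF byte, a space, "ok"
    immutable ubyte[] raw = [0x6F, 0x6B, 0x20, 0xFF, 0x20, 0x6F, 0x6B];
    assert(!isValid(cast(string) raw));

    auto clean = squash(raw);
    assert(isValid(clean)); // safe to hand to std.regex and friends now
}
```

Note that sanitize takes and returns an immutable array, so it works on the whole input at once rather than lazily.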

Dec 23 2013

John Carter <john.carter taitradio.com> writes:

Eww.

If I read the source correctly it mallocs a new array and runs down the
original at least three times! (Four if you count peeks)

Not to mention that it is completely unintegrated with stdio.

Sigh! I miss the Good Old Days of 7-bit ASCII! ;-)


-- 
John Carter
Phone : (64)(3) 358 6639
Tait Electronics
PO Box 1645 Christchurch
New Zealand

Dec 23 2013

"Brad Anderson" <eco gnuk.net> writes:

On Monday, 23 December 2013 at 22:41:47 UTC, John Carter wrote:
 Eww.

 If I read the source correctly it mallocs a new array and runs 
 down the
 original at least three times! (Four if you count peeks)

 Not to mention that it is completely unintegrated with stdio.

 Sigh! I miss the Good Old Days of 7-bit ASCII! ;-)

Pull requests with improvements are always welcome. I don't think
std.encoding gets a lot of attention.

Dec 23 2013

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Tuesday, December 24, 2013 02:06:22 Brad Anderson wrote:
 On Monday, 23 December 2013 at 22:41:47 UTC, John Carter wrote:
 Eww.
 
 If I read the source correctly it mallocs a new array and runs
 down the
 original at least three times! (Four if you count peeks)
 
 Not to mention that it is completely unintegrated with stdio.
 
 Sigh! I miss the Good Old Days of 7-bit ASCII! ;-)

 
 Pull requests with improvements are always welcome. I don't think
 std.encoding gets a lot of attention.

Andrei has also stated that he thinks that it's a failed experiment. I think 
that it's one of the modules that could use a redesign (including making it 
range based), but no one has taken the time to do that.
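
For comparison, here is a rough sketch of a range-based approach, assuming a Phobos release recent enough to provide the `byDchar` and `byChar` adapters in std.utf. As far as I know they substitute U+FFFD for anything that does not decode instead of throwing. `squashLazy` is a hypothetical name.

```d
import std.algorithm.searching : startsWith;
import std.array : array;
import std.utf : byChar, byDchar, validate;

/// Hypothetical helper: decode lazily (bad sequences become U+FFFD),
/// re-encode as UTF-8, and only then allocate the result.
string squashLazy(const(char)[] raw)
{
    // byDchar decodes code points; byChar turns them back into UTF-8 units.
    return raw.byDchar.byChar.array.idup;
}

void main()
{
    // "ok", a space, one stray 0xFF byte, a space, "ok"
    immutable ubyte[] bytes = [0x6F, 0x6B, 0x20, 0xFF, 0x20, 0x6F, 0x6B];
    auto clean = squashLazy(cast(const(char)[]) bytes);

    validate(clean); // would throw UTFException if anything bad survived
    assert(clean.startsWith("ok "));
}
```

The decoding range can also be fed straight into other algorithms without building the intermediate string at all.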

- Jonathan M Davis

Dec 23 2013
