digitalmars.D.learn - Reading text (I mean "real" text...)

Denis <noreply noserver.lan> writes:
THE PROBLEM

UTF-8 validation alone is insufficient for ensuring that a file 
contains only human-readable text, because control characters are 
valid UTF-8. Apart from tab, newline, carriage return, and a few 
less commonly used others considered to be whitespace, 
human-readable text files should not normally contain embedded 
control characters.

In the standard library, the read functions for "text" files (as 
opposed to binary files) that I looked at are not actually based 
on "human readable text", but on "characters". For example:

  - In std.stdio, readln accepts everything. Lines are simply 
separated by the occurrence of a newline or user-designated 
character.
  - In std.file, readText accepts all valid UTF-8 characters.

This means, for example, that all of these functions will happily 
try to read an enormous file of zero bytes in its entirety 
(something that should not even be considered "text") into a 
string variable on the very first call to the read function. Not 
good. A function that reads only "human-readable text" should 
instead throw an exception as soon as it encounters a disallowed 
control character or an invalid UTF-8 sequence.
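
To illustrate the point about UTF-8 validation, here is a minimal 
sketch. It assumes that std.utf.validate is the check readText 
relies on (that is my reading of Phobos, not something stated in 
this post), and isValidUtf8 is a hypothetical helper name, not a 
Phobos function.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper (not part of Phobos): true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

With this helper, a string made only of NUL and other control 
characters, such as "\0\0\x01\x1B", counts as valid, while a 
stray continuation byte (0x80) after an ASCII letter does not. 
Validation answers "is this UTF-8?", not "is this text?".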

THE OBJECTIVE

The objective is to read a file one line at a time (reading each 
line into a string), while checking for human-readable text 
character by character. Invalid characters (control and UTF-8) 
should generate an exception.

Unless there's already an existing function that works as 
described, I'd like to write one. I expect that this will require 
combining an existing read-by-UTF8-char or read-by-byte function 
with the additional validation.

Q1: Which existing functions (D or C) would you suggest 
leveraging? For example, there are quite a few variants of "read" 
and in different libraries too. For a newcomer, it can be 
difficult to intuit which one is best suited for what.

Q2: Any source code (D or C) you might suggest I look at, to get 
ideas for how parts of this could be written?

Thanks for your help.
Jun 19 2020
Paul Backus <snarwin gmail.com> writes:
On Saturday, 20 June 2020 at 01:35:56 UTC, Denis wrote:
 THE OBJECTIVE

 The objective is to read a file one line at a time (reading 
 each line into a string), while checking for human-readable 
 text character by character. Invalid characters (control and 
 UTF-8) should generate an exception.

 Unless there's already an existing function that works as 
 described, I'd like to write one. I expect that this will 
 require combining an existing read-by-UTF8-char or read-by-byte 
 function with the additional validation.
It sounds like maybe what you are looking for is Unicode 
character categories:

https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
Jun 19 2020
Denis <noreply noserver.lan> writes:
On Saturday, 20 June 2020 at 01:41:50 UTC, Paul Backus wrote:
 It sounds like maybe what you are looking for is Unicode 
 character categories:

 https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
The character validation step could indeed be expressed using 
Unicode properties:

   Allow Unicode White_Space
   Reject Unicode Control
   Allow everything else
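
Those three rules might be sketched in D roughly as follows. This 
assumes that std.uni.isWhite corresponds to the Unicode 
White_Space property and that std.uni.isControl tests general 
category Cc, which is how I read the std.uni documentation. 
isTextChar is a hypothetical name.

```d
import std.uni : isControl, isWhite;

/// Hypothetical helper: applies the three rules in order.
bool isTextChar(dchar c)
{
    if (isWhite(c))
        return true;    // allow Unicode White_Space (tab, newline, ...)
    if (isControl(c))
        return false;   // reject remaining control characters (Cc)
    return true;        // allow everything else
}
```

Note that "everything else" still lets through categories such as 
Cf (format) and Co (private use), so a stricter definition of 
"human-readable" would need more rules than these three.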
Jun 19 2020
Denis <noreply noserver.lan> writes:
Digging into this a bit further --

POSIX defines a "print" class, which I believe is an exact fit. 
The Unicode spec doesn't define this class, which I presume is 
why D's std.uni library also omits it. But there is an isprint() 
function in libc, which I should be able to use (POSIX here). 
This function refers to the system locale, so it isn't limited to 
ASCII characters (unlike std.ascii:isPrintable).

So that's one down, two to go:

   Loop until newline or EOF
    (1) Read bytes or character             } Possibly
    (2) Decode UTF-8, exception if invalid  } together
    (3) Call isprint(), exception if invalid
   Return line

(This simplified outline obviously doesn't show how to deal with 
the complications arising from using buffers, handling codepoints 
that straddle the end of the buffer, etc.)

Where I'm still stuck is the read or read-and-auto-decode: this 
is where the waters get really muddy for me. Three different 
techniques for reading characters are suggested in this thread 
(iopipe, ranges, rawRead): 
https://forum.dlang.org/thread/cgteipqqfxejngtpgbbt forum.dlang.org

I'd like to stick with standard D or C libraries initially, so 
that rules out iopipe for now. What would really help is some 
details about what one read technique does particularly well vs. 
another. And is there a technique that seems more suited to this 
use case than the rest?

Thanks again
Jun 20 2020