## digitalmars.D.learn - Searching for a string in a text buffer with a regular expression

"maxpat78" <maxpat78 yahoo.it> writes:
While porting a simple Python script to D, I found the following
problem.

I need to read in a few thousand small text files and search
each one for a match with a given regular expression.

Obviously, the program can't (and it should not) be certain about
the encoding of each input file.

I initially used read() and cast the result to char[], but at
some point the regex engine threw an exception: it encountered
a byte sequence it couldn't decode as UTF-8.
That is fair enough, since char[] is not byte[].
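
A minimal sketch of why the cast goes wrong, assuming the buffer holds bytes that are not well-formed UTF-8: the cast only reinterprets the bytes, and the error shows up later, when something tries to decode them. The helper name `isValidUtf8` is hypothetical; `std.utf.validate` is the Phobos function that performs the check.

```d
import std.utf : UTFException, validate;

/// Returns true if `raw` is well-formed UTF-8.
/// (Hypothetical helper, not part of the original script.)
bool isValidUtf8(const(ubyte)[] raw)
{
    // The cast reinterprets the bytes as chars; it checks nothing.
    auto text = cast(const(char)[]) raw;
    try
    {
        validate(text); // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A Latin-1 byte such as 0xE8 followed by a space is rejected here, because 0xE8 announces a multi-byte UTF-8 sequence that never arrives.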

Now I'm casting to Latin1String, since I know this is the
right encoding for the input buffers, and it works fine at
last... but what if I needed to handle a RAW (binary?
unknown encoding?) buffer?

Is there a simple and elegant solution in D for such a case?
Python didn't give me such problems!

Dec 06 2013
"bearophile" <bearophileHUGS lycos.com> writes:
maxpat78:

> Is there a simple and elegant solution in D for such a case?
> Python didn't give me such problems!

Do you mean Python3?

Bye,
bearophile

Dec 06 2013
Shammah Chancellor <anonymous coward.com> writes:
On 2013-12-06 08:53:04 +0000, maxpat78 said:

> While porting a simple Python script to D, I found the following problem.
>
> I need to read in a few thousand small text files and search each
> one for a match with a given regular expression.
>
> Obviously, the program can't (and it should not) be certain about the
> encoding of each input file.
>
> I initially used read() and cast the result to char[], but at some
> point the regex engine threw an exception: it encountered a byte
> sequence it couldn't decode as UTF-8. That is fair enough, since
> char[] is not byte[].
>
> Now I'm casting to Latin1String, since I know this is the right
> encoding for the input buffers, and it works fine at last... but what
> if I needed to handle a RAW (binary? unknown encoding?) buffer?
>
> Is there a simple and elegant solution in D for such a case?
> Python didn't give me such problems!

readText is what you're looking for.
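
A hedged sketch of what using readText could look like. Note that readText validates the file as UTF-8 and throws a UTFException when validation fails, so on its own it does not cover files in other encodings. The function name `loadText` and the Latin-1 fallback are assumptions for illustration, not something proposed in the thread.

```d
import std.encoding : Latin1String, transcode;
import std.file : read, readText;
import std.utf : UTFException;

/// Loads a file as UTF-8, falling back to Latin-1 if validation fails.
/// (Hypothetical helper; the fallback policy is an assumption.)
string loadText(string path)
{
    try
    {
        return readText(path); // reads the file and validates it as UTF-8
    }
    catch (UTFException)
    {
        // Latin-1 assigns a character to every byte value,
        // so the conversion to UTF-8 does not depend on the content.
        auto raw = cast(Latin1String) read(path);
        string result;
        transcode(raw, result);
        return result;
    }
}
```

The fallback only makes sense when Latin-1 is a plausible guess for the non-UTF-8 files, as it is for the inputs described above.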


Dec 06 2013
"maxpat78" <maxpat78 yahoo.it> writes:
I mean a code fragment like this:

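    import core.stdc.stdlib : exit;
    import std.conv : text;
    import std.encoding : Latin1String, transcode;
    import std.file : read;
    import std.regex : ctRegex, match;
    import std.stdio : writeln;
    import std.string : indexOf;

    void main()
    {
        foreach (i; 1 .. 2085)
        {
            // Bugbug: when we read in the buffer, we can't know anything about its encoding.
            // But REGEX could fail if it contained unknown chars!
            Latin1String buf;
            string s;

            try
            {
                // Placeholder file name: the read call is not shown in this fragment.
                buf = cast(Latin1String) read(text(i, ".html"));
                // Convert the Latin-1 bytes to a UTF-8 string.
                transcode(buf, s);
            }
            catch (Exception e)
            {
                writeln("Last record (", i, ") reached.");
                exit(1);
            }

            // Exception "Invalid UTF-8 sequence  index 1" in file 55
            enum rx = ctRegex!(`<p class="aggiornamentoAlbo">.+?</div>`, "gs");
            auto m = match(s, rx);

            if (!m.empty)
            {
                // Print the match only if it contains both search terms.
                if (indexOf(m.captures[0], "xxxxxxxx", 0) > -1 &&
                    indexOf(m.captures[0], "1983", 0) > -1)
                    writeln(m.captures[0]);
            }
        }
    }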

The question is: what kind of cast should I use to safely
(= without conversion exceptions being raised) scan any
kind of textual (or binary) buffer, like in Python 2.7.x?
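
One possible approach, sketched under the assumption that byte-for-byte matching is acceptable: treat every byte as the Latin-1 character with the same value, transcode to UTF-8, and run the regex on the result. Since every byte value maps to a character, the conversion does not depend on the real encoding of the input, which is roughly how Python 2.7 byte strings behave. The function name `findInRaw` is hypothetical.

```d
import std.encoding : Latin1String, transcode;
import std.regex : matchFirst, regex;

/// Returns the first match of `pattern` in `raw`, or null if there is none.
/// (Hypothetical helper, not part of the original script.)
string findInRaw(const(ubyte)[] raw, string pattern)
{
    // Reinterpret each byte as the Latin-1 character with the same value.
    auto latin1 = cast(Latin1String) raw.idup;
    string text;
    transcode(latin1, text); // every byte becomes one code point
    // "s" lets '.' match newlines, as in the fragment above.
    auto m = matchFirst(text, regex(pattern, "s"));
    return m.empty ? null : m.hit;
}
```

The trade-off is that non-ASCII characters in the pattern are compared against Latin-1 code points, so a file that is really UTF-8 would need its multi-byte characters spelled out byte by byte in the pattern.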

Thanks!

Dec 08 2013