digitalmars.D.learn - Searching for a string in a text buffer with a regular expression

"maxpat78" <maxpat78 yahoo.it> writes:
While porting a simple Python script to D, I found the following 
problem.

I need to read in a few thousand small text files and search 
each one for a match with a given regular expression.

Obviously, the program can't (and it should not) be certain about 
the encoding of each input file.

I initially used read() and cast the result with cast(char[]), 
but at some point the regex engine threw an exception: it 
encountered a byte sequence it couldn't decode as UTF-8. 
That is to be expected, since char[] is not byte[].

Now I'm casting to Latin1String, since I know this is the 
right encoding for these input buffers, and it finally works 
fine. But what if I needed to handle a raw buffer (binary, or 
of unknown encoding)?

Is there a simple and elegant solution in D for such a case?
Python didn't give me such problems!
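
For context on the failure described above: cast(char[]) only reinterprets the bytes and performs no validation, so malformed UTF-8 surfaces later, when the text is actually decoded. A minimal sketch of checking a buffer up front with std.utf.validate might look like this; the helper name looksLikeUtf8 is hypothetical and not part of Phobos.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper (not from the thread): report whether a raw
/// buffer is well-formed UTF-8, without modifying or copying it.
bool looksLikeUtf8(const(ubyte)[] raw)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

A buffer that passes this check can be cast to a string type and handed to std.regex; one that fails needs a different decoding, such as the Latin-1 route mentioned in the post.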
Dec 06 2013
"bearophile" <bearophileHUGS lycos.com> writes:
maxpat78:

> Is there a simple and elegant solution in D for such a case?
> Python didn't give me such problems!
Do you mean Python3? Bye, bearophile
Dec 06 2013
Shammah Chancellor <anonymous coward.com> writes:
On 2013-12-06 08:53:04 +0000, maxpat78 said:

> While porting a simple Python script to D, I found the following problem.
>
> I need to read in a few thousand small text files and search each 
> one for a match with a given regular expression.
>
> Obviously, the program can't (and it should not) be certain about the 
> encoding of each input file.
>
> I initially used read() and cast the result with cast(char[]), but at 
> some point the regex engine threw an exception: it encountered a byte 
> sequence it couldn't decode as UTF-8. That is to be expected, since 
> char[] is not byte[].
>
> Now I'm casting to Latin1String, since I know this is the right 
> encoding for these input buffers, and it finally works fine. But what 
> if I needed to handle a raw buffer (binary, or of unknown encoding)?
>
> Is there a simple and elegant solution in D for such a case?
> Python didn't give me such problems!
Why don't you follow one of the file reading examples? readText is what you're looking for.
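
One caveat worth sketching: std.file.readText validates the file as UTF-8 and throws a UTFException when the content is malformed, so by itself it does not cover Latin-1 input. A hedged sketch of a fallback, assuming that anything which is not valid UTF-8 should be treated as Latin-1 (the helper name readTextLenient is hypothetical):

```d
import std.encoding : Latin1String, transcode;
import std.file : read, readText;
import std.utf : UTFException;

/// Hypothetical helper: try strict UTF-8 first, then fall back to Latin-1.
string readTextLenient(string path)
{
    try
    {
        // readText validates the content and throws on malformed UTF-8.
        return readText(path);
    }
    catch (UTFException)
    {
        // Assumption: non-UTF-8 input is Latin-1, where every byte is valid.
        auto raw = cast(Latin1String) read(path);
        string utf8;
        transcode(raw, utf8);
        return utf8;
    }
}
```

The file is read twice on the fallback path, which keeps the sketch short; for many small files that cost is likely acceptable.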
Dec 06 2013
"maxpat78" <maxpat78 yahoo.it> writes:
I mean a code fragment like this:

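	// Imports this fragment relies on (local imports are allowed in D).
	import core.stdc.stdlib : exit;
	import std.encoding : Latin1String, transcode;
	import std.file : read;
	import std.regex : ctRegex, match;
	import std.stdio : writeln;
	import std.string : format, indexOf;

	foreach(i; 1..2085)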
	{
		// Bugbug: when we read in the buffer, we can't know anything about its encoding...
		// But REGEX could fail if it contained unknown chars!
		Latin1String buf;
		string s;

		try
		{
			buf = cast(Latin1String) read(format("psi\\psi%04d.htm", i));
			transcode(buf, s);
		}
		catch (Exception e)
		{
			writeln("Last record (", i, ") reached.");
			exit(1);
		}

		// Exception "Invalid UTF-8 sequence  index 1" in file 55
		enum rx = ctRegex!(`<p class="aggiornamentoAlbo">.+?</div>`, "gs");
		auto m = match(s, rx);

		if (! m.empty())
		{
			if (indexOf(m.captures[0], "xxxxxxxx", 0) > -1 &&
				indexOf(m.captures[0], "1983", 0) > -1)
				writeln(m.captures[0]);
		}
	}

The question is: what kind of cast should I use to safely 
(i.e., without conversion exceptions being raised) scan every 
possible kind of textual (or binary) buffer, like in Python 2.7.x?

Thanks!
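
One possible approach to the question above, offered as a sketch and not as the definitive answer: std.regex works on character strings, not on byte arrays, but Latin-1 maps each of the 256 byte values to exactly one code point, so transcoding a raw buffer from Latin-1 does not fail on any input. That is roughly comparable to matching against a byte string in Python 2. The helper name firstMatchInBytes is hypothetical.

```d
import std.encoding : Latin1String, transcode;
import std.regex : matchFirst, regex;

/// Hypothetical helper: return the first match of `pattern` in a raw
/// buffer, or null when nothing matches.
string firstMatchInBytes(const(ubyte)[] raw, string pattern)
{
    // Treat every byte as one Latin-1 character; this never rejects input.
    string text;
    transcode(cast(Latin1String) raw.idup, text);

    // "s" lets `.` match newlines, as in the pattern from the fragment.
    auto m = matchFirst(text, regex(pattern, "s"));
    return m.empty ? null : m.hit;
}
```

Note that the returned text is UTF-8, so bytes above 0x7F come back as two-byte sequences; patterns that target such bytes should be written with the corresponding code points (for example `\u00E0` for byte 0xE0).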
Dec 08 2013