digitalmars.D - Limitation with current regex API

reply Jerry <jlquinn optonline.net> writes:
Hi all,

In general, I'm enjoying the regex respin.  However, I ran into one
issue that seems to have no clean workaround.

Generally, I want to be able to get the start and end indices of
matches.  For the complete match, this info can be pieced together from
match.pre.length and match.hit.length.  However, I can't do this
with captures.

For an example: I have a string and the regex .*(a).*(b).*(c).*.  I want
to find where a, b, and c are located when I match.  As far as I can
tell, the only way to do this would be to capture every chunk of text,
then iterate to determine the offsets.  That seems wasteful.

If you look at the ICU and Java regex APIs, you'll see that this
information is retrievable.  I believe it's available under the covers
of the D regex library API too.

Can this please be exposed?  It's very helpful for doing text processing
where you need to be able to align the results of multiple
transformations to the input text.

Thanks
Jerry
Jan 16 2012
next sibling parent reply "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to 
 capture every chunk of text, then iterate to determine the 
 offsets.

Not sure if this is what you were referring to, but you can do... m.pre.length + m.captures[1].ptr - m.hit.ptr
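
A minimal sketch of that expression, assuming a recent Phobos in which matchFirst returns the captures of the first match (the thread itself uses the older m.captures spelling). The sample string and variable names are made up for illustration.

```d
import std.regex : matchFirst, regex;

void main()
{
    string s = "xxaxxbxxcxx";
    // c[0] (also c.hit) is the whole match; c[1] .. c[3] are the groups.
    auto c = matchFirst(s, regex(`.*(a).*(b).*(c).*`));
    assert(!c.empty);

    // Start of group 1 = length of the text before the hit
    //                  + distance from the start of the hit to the group.
    immutable start = cast(size_t)(c.pre.length + (c[1].ptr - c.hit.ptr));
    immutable end = start + c[1].length;
    assert(s[start .. end] == "a");
}
```

This relies on the capture being a slice of the hit, which is what makes the pointer difference meaningful.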
Jan 16 2012
parent reply "Nick Sabalausky" <a a.a> writes:
"Vladimir Panteleev" <vladimir thecybershadow.net> wrote in message 
news:klzeekkilpzwmjmkudhh dfeed.kimsufi.thecybershadow.net...
 On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to capture every 
 chunk of text, then iterate to determine the offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)

That wouldn't work in safe mode, would it?
Jan 16 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/17/2012 04:03 AM, Nick Sabalausky wrote:
 "Vladimir Panteleev"<vladimir thecybershadow.net>  wrote in message
 news:klzeekkilpzwmjmkudhh dfeed.kimsufi.thecybershadow.net...
 On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to capture every
 chunk of text, then iterate to determine the offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)

That wouldn't work in safe mode, would it?

There is nothing unsafe about the operation, so I'd actually expect it to work.
Jan 16 2012
parent reply "Nick Sabalausky" <a a.a> writes:
"Timon Gehr" <timon.gehr gmx.ch> wrote in message 
news:jf2p5d$2ria$1 digitalmars.com...
 On 01/17/2012 04:03 AM, Nick Sabalausky wrote:
 "Vladimir Panteleev"<vladimir thecybershadow.net>  wrote in message
 news:klzeekkilpzwmjmkudhh dfeed.kimsufi.thecybershadow.net...
 On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to capture 
 every
 chunk of text, then iterate to determine the offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)

That wouldn't work in safe mode, would it?

There is nothing unsafe about the operation, so I'd actually expect it to work.

I thought pointer arithmetic was forbidden in @safe?
Jan 16 2012
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 01/17/2012 05:00 AM, Nick Sabalausky wrote:
 "Timon Gehr"<timon.gehr gmx.ch>  wrote in message
 news:jf2p5d$2ria$1 digitalmars.com...
 On 01/17/2012 04:03 AM, Nick Sabalausky wrote:
 "Vladimir Panteleev"<vladimir thecybershadow.net>   wrote in message
 news:klzeekkilpzwmjmkudhh dfeed.kimsufi.thecybershadow.net...
 On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to capture
 every
 chunk of text, then iterate to determine the offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)

That wouldn't work in safe mode, would it?

There is nothing unsafe about the operation, so I'd actually expect it to work.

I thought pointer arithmetic was forbidden in @safe?

I don't know exactly, since @safe is neither fully specified nor implemented. In my understanding, in @safe code, operations that may lead to memory corruption are forbidden. Pointer - pointer cannot, other kinds of pointer arithmetic may.
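
Whether a given compiler accepts the raw subtraction inside @safe code is left open here. One way to avoid depending on the answer is to confine the pointer work to a small @trusted helper that checks its own precondition. The sketch below is only an illustration; offsetIn is a hypothetical name, not a Phobos function.

```d
/// Hypothetical helper: index of `part` within `whole`,
/// assuming `part` is a slice of `whole`.
size_t offsetIn(const(char)[] whole, const(char)[] part) @trusted pure nothrow @nogc
{
    // The difference is only meaningful when both pointers come from the same array.
    assert(part.ptr >= whole.ptr
        && part.ptr + part.length <= whole.ptr + whole.length);
    return cast(size_t)(part.ptr - whole.ptr);
}

void main() @safe
{
    string s = "xxaxxbxxcxx";
    auto cap = s[5 .. 6]; // stands in for a regex capture
    assert(offsetIn(s, cap) == 5);
}
```

The caller stays @safe, and the one operation under discussion sits behind an interface that is audited by hand.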
Jan 16 2012
parent reply Don Clugston <dac nospam.com> writes:
On 17/01/12 10:40, Jonathan M Davis wrote:
 On Tuesday, January 17, 2012 05:04:39 Timon Gehr wrote:
 I don't know exactly, since @safe is neither fully specified nor
 implemented. In my understanding, in @safe code, operations that may
 lead to memory corruption are forbidden. Pointer - pointer cannot, other
 kinds of pointer arithmetic may.

Pointer arithmetic is definitely forbidden in @safe, but I'm not sure that that forbids pointer - pointer, since it's not dangerous. It's changing a pointer via arithmetic which is dangerous.
- Jonathan M Davis

My guess is that @safe D is supposed to enforce C pointer semantics. At least, code which is both @safe and pure must do so. The semantics are currently enforced in CTFE. pointer - pointer is undefined behaviour in C, if the pointers come from different arrays. It's OK if they are from the same array, which is true in this case.
Jan 17 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 1/17/12 6:59 AM, Don Clugston wrote:
 On 17/01/12 10:40, Jonathan M Davis wrote:
 On Tuesday, January 17, 2012 05:04:39 Timon Gehr wrote:
 I don't know exactly, since @safe is neither fully specified nor
 implemented. In my understanding, in @safe code, operations that may
 lead to memory corruption are forbidden. Pointer - pointer cannot, other
 kinds of pointer arithmetic may.

Pointer arithmetic is definitely forbidden in @safe, but I'm not sure that that forbids pointer - pointer, since it's not dangerous. It's changing a pointer via arithmetic which is dangerous.
- Jonathan M Davis

My guess is that @safe D is supposed to enforce C pointer semantics. At least, code which is both @safe and pure must do so. The semantics are currently enforced in CTFE. pointer - pointer is undefined behaviour in C, if the pointers come from different arrays. It's OK if they are from the same array, which is true in this case.

Yah, that C rule is there to allow segmented memory architectures to work properly. One possibility for D is to require a flat memory model, in which the difference between any two pointers can be taken.

Andrei
Jan 17 2012
prev sibling next sibling parent "Vladimir Panteleev" <vladimir thecybershadow.net> writes:
On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev 
wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to 
 capture every chunk of text, then iterate to determine the 
 offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)
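
A short sketch of this for all three groups of the regex from the original question, assuming a recent Phobos with matchFirst (c[i] there corresponds to m.captures[i] above). The sample input is invented for illustration.

```d
import std.regex : matchFirst, regex;

void main()
{
    string s = "xxaxxbxxcxx";
    auto c = matchFirst(s, regex(`.*(a).*(b).*(c).*`));
    assert(!c.empty);

    // Each capture is a slice of s, so subtracting s.ptr yields its start index.
    foreach (i; 1 .. c.length)
    {
        immutable start = cast(size_t)(c[i].ptr - s.ptr);
        immutable end = start + c[i].length;
        assert(s[start .. end] == c[i]);
    }
    assert(c[2].ptr - s.ptr == 5); // "b" sits at index 5
}
```

Groups that did not participate in a match are not handled in this sketch.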
Jan 16 2012
prev sibling next sibling parent Jerry <jlquinn optonline.net> writes:
"Vladimir Panteleev" <vladimir thecybershadow.net> writes:

 On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to capture every
 chunk of text, then iterate to determine the offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)

Ah ok, that'll work.
Jan 16 2012
prev sibling next sibling parent Mail Mantis <mail.mantis.88 gmail.com> writes:
2012/1/17 Mail Mantis <mail.mantis.88 gmail.com>:
 Correct me if I'm wrong, but wouldn't this be better:
 (m_captures[1].ptr - s.ptr) / s[0].sizeof;

No, it wouldn't. Somehow, I forgot the rules for pointer arithmetic. Sorry.
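
For reference, a small sketch of why the division is unnecessary: subtracting two typed pointers already yields a count of elements, not bytes. It uses a wstring so that the two counts differ; the values are illustrative only.

```d
void main()
{
    wstring w = "abcdef"w;   // wchar is 2 bytes wide
    auto sub = w[3 .. 5];

    // Typed pointer subtraction counts wchar elements.
    assert(sub.ptr - w.ptr == 3);

    // The byte distance is twice that, visible only after casting to a byte pointer.
    assert(cast(const(ubyte)*) sub.ptr - cast(const(ubyte)*) w.ptr == 6);
}
```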
Jan 16 2012
prev sibling next sibling parent Mail Mantis <mail.mantis.88 gmail.com> writes:
2012/1/17 Vladimir Panteleev <vladimir thecybershadow.net>:
 On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to capture every
 chunk of text, then iterate to determine the offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)

Correct me if I'm wrong, but wouldn't this be better: (m_captures[1].ptr - s.ptr) / s[0].sizeof;
Jan 16 2012
prev sibling next sibling parent Jerry <jlquinn optonline.net> writes:
Mail Mantis <mail.mantis.88 gmail.com> writes:

 2012/1/17 Vladimir Panteleev <vladimir thecybershadow.net>:
 On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
 On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
 As far as I can tell, the only way to do this would be to capture every
 chunk of text, then iterate to determine the offsets.

Not sure if this is what you were referring to, but you can do...

Even simpler: m.captures[1].ptr - s.ptr (s is the string being matched)

Correct me if I'm wrong, but wouldn't this be better: (m_captures[1].ptr - s.ptr) / s[0].sizeof;

I *think* pointer arithmetic handles that. However, this is much uglier than:

    m_captures[1].begin
    m_captures[1].end

Jerry
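
Until something like that is exposed, the pointer trick can at least be hidden behind a begin/end pair. The sketch below assumes a recent Phobos with matchFirst; Span and spanOf are hypothetical names and not part of std.regex.

```d
import std.regex : matchFirst, regex;

/// Hypothetical type, not part of std.regex.
struct Span { size_t begin, end; }

/// Assumes `cap` is a slice of `input`, as the captures of a string match are.
Span spanOf(string input, string cap)
{
    immutable begin = cast(size_t)(cap.ptr - input.ptr);
    return Span(begin, begin + cap.length);
}

void main()
{
    string s = "xxaxxbxxcxx";
    auto c = matchFirst(s, regex(`.*(a).*(b).*(c).*`));
    assert(!c.empty);

    immutable sp = spanOf(s, c[3]);
    assert(sp == Span(8, 9));
    assert(s[sp.begin .. sp.end] == "c");
}
```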
Jan 16 2012
prev sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Tuesday, January 17, 2012 05:04:39 Timon Gehr wrote:
 I don't know exactly, since @safe is neither fully specified nor
 implemented. In my understanding, in @safe code, operations that may
 lead to memory corruption are forbidden. Pointer - pointer cannot, other
 kinds of pointer arithmetic may.

Pointer arithmetic is definitely forbidden in @safe, but I'm not sure that that forbids pointer - pointer, since it's not dangerous. It's changing a pointer via arithmetic which is dangerous.
- Jonathan M Davis
Jan 17 2012