Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Comparing Regular Expressions

Reply
Thread Tools

Comparing Regular Expressions

 
 
Robert Inder
Guest
Posts: n/a
 
      04-06-2005

I'm interested in comparing the "coverage" of regular expressions.

In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match
both, or indeed that will match one and not the other.

So, for instance, suppose I believe that if an object has a
descripiton string matching regexp R1, it is of type A.
Someone now tells me that an object with a
descripiton string matching regexp R2 is of type B. What can I tell
about (the sets of things of type) A and B by looking at R1 and R2?

So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.

Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.

But how do I formalise (i.e. program) this comparison process?

This strikes me as a very general problem/issue and there
just has to be some work on formalising and solving it. But I don't
know how to find it: Google searches for things like "regular
expression comparison" are swamped by algorithms for matching regexps
against strings and so forth, which isn't what I want.

Can anyone point me in the right direction?

Robert.

--
__ To avoid the spam trap, mail me
|_) _ |_ _ ._ |- | _ _| _ ._ at bcs.org.uk, not deadspam.com.
| \(_)|_)(-'| |_ || |(_|(-'| '
Best viewed in Ebriated.
 
Reply With Quote
 
 
 
 
Anno Siegel
Guest
Posts: n/a
 
      04-06-2005
Robert Inder <(E-Mail Removed)> wrote in comp.lang.perl.misc:
>
> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.


[snip]

That will be utterly non-trivial.

Consider the partial problem of finding a string that a given regex
matches. Since there are patterns that don't match anything (/$.^/),
there can't be a general solution. At the very least you'll need a
complete regex parser before you can tackle this problem.

Anno
 
Reply With Quote
 
 
 
 
Brian McCauley
Guest
Posts: n/a
 
      04-06-2005
Robert Inder wrote:

> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.


My instinct says that will be "hard". Not just "hard" in the common
sense, but hard in the CompSci sense of being equivalent to the halting
problem (i.e. insoluble in the general case in finite time).

Note I'm reading this in clp.misc. I suspect the nettizens of
comp.theory will be much more clued-up on this.

 
Reply With Quote
 
Jean-Marc Bourguet
Guest
Posts: n/a
 
      04-06-2005
Robert Inder <(E-Mail Removed)> writes:

> I'm interested in comparing the "coverage" of regular expressions.


For which definition or regular expressions? I see this is crossposted
to comp.lang.perl.misc and comp.theory and the default meaning of
regular expression is quite different in the two groups.

> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match both,
> or indeed that will match one and not the other.


For the comp.theory adapted definition, it is possible to construct
FSM accepting the language described by a regular expressions.

Starting from the FSMs and using a variant of the algorithm used to
minimize them it is easy to construct an FSM which accept the union of
the two languages. As there is a know relation between the states in
the new FSM and the states in the old, it is easy to classify the
accepting states of the new FSM as having correspondance in only one
or both of the original FSM.

Yours,

--
Jean-Marc
 
Reply With Quote
 
Torben Ægidius Mogensen
Guest
Posts: n/a
 
      04-06-2005
Robert Inder <(E-Mail Removed)> writes:

> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.


Regular languages are closed under intersection and set difference, so
you can not only find strings (if any exist) in the intersection and
difference sets, you can construct regular expressions or DFA's that
describe all such strings.

> So, for instance, suppose I believe that if an object has a
> descripiton string matching regexp R1, it is of type A.
> Someone now tells me that an object with a
> descripiton string matching regexp R2 is of type B. What can I tell
> about (the sets of things of type) A and B by looking at R1 and R2?
>
> So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.
>
> Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.


Say what? The set containing only the string "foobar" is not a subset
of the set containing only "foo". Or do you consider a string to
match a regular expression if a prefix of the string does?

> But how do I formalise (i.e. program) this comparison process?
>
> This strikes me as a very general problem/issue and there
> just has to be some work on formalising and solving it. But I don't
> know how to find it: Google searches for things like "regular
> expression comparison" are swamped by algorithms for matching regexps
> against strings and so forth, which isn't what I want.


If you have a DFA D1 with states s_0 ... s_m for A and a DFA D2 with
states t_0 ... t_n for B, you can construct a DFA D3 for the
intersection of A and B in the following way:

1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.

2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.

3. If s_i is accepting in D1 and t_k is accepting in D2, (s_i,t_k) is
accepting in D3.

For A\B (A minus B), the construction is:

0. Add a non-accepting state t_(n+1) to D2. If there is a state t_k
in D2 and a symbol c such that t_k has no transition on c, add a
transition from t_k to t_(n+1) on c. There is a transition from
t_(n+1) to t_(n+1) on all symbols. This way, D2 has transitions
on all symbols from all states.

1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.

2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.

3. If s_i is accepting in D1 and t_k is not accepting in D2,
(s_i,t_k) is accepting in D3.


Torben
 
Reply With Quote
 
Steven N. Hirsch
Guest
Posts: n/a
 
      04-12-2005
Robert Inder wrote:
> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.
>
> So, for instance, suppose I believe that if an object has a
> descripiton string matching regexp R1, it is of type A.
> Someone now tells me that an object with a
> descripiton string matching regexp R2 is of type B. What can I tell
> about (the sets of things of type) A and B by looking at R1 and R2?
>
> So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.
>
> Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.
>
> But how do I formalise (i.e. program) this comparison process?
>
> This strikes me as a very general problem/issue and there
> just has to be some work on formalising and solving it. But I don't
> know how to find it: Google searches for things like "regular
> expression comparison" are swamped by algorithms for matching regexps
> against strings and so forth, which isn't what I want.
>
> Can anyone point me in the right direction?


There's a wonderful treatment of the subject in "Higher Order Perl",
written by Mark Jason Dominus. This is easily one of the most
thought-provoking books on Perl to date (IMHO) and a highly-recommended
read.
 
Reply With Quote
 
Mark Jason Dominus
Guest
Posts: n/a
 
      04-13-2005
In article <d30g15$729$(E-Mail Removed)>,
Brian McCauley <(E-Mail Removed)> wrote:
>My instinct says that will be "hard". Not just "hard" in the common
>sense, but hard in the CompSci sense of being equivalent to the halting
>problem (i.e. insoluble in the general case in finite time).


My recollection is that this is not the case. Certainly equivalence
of recursive languages is undecidable, but regular languages are
highly restricted special cases.

I may be remembering wrong, but I think regex equivalence is
NP-complete. If so, this would mean that there would be an
exponential-time algorithm for solving it. A quick google search
finds various mentions of various versions of the problem being in
PSPACE or being EXP-complete, which means that it is not undecidable.

So the problem is hard, but not impossible.
 
Reply With Quote
 
Jürgen Exner
Guest
Posts: n/a
 
      04-13-2005
Mark Jason Dominus wrote:
> In article <d30g15$729$(E-Mail Removed)>,
> Brian McCauley <(E-Mail Removed)> wrote:
>> My instinct says that will be "hard". Not just "hard" in the common
>> sense, but hard in the CompSci sense of being equivalent to the
>> halting problem (i.e. insoluble in the general case in finite time).

>
> My recollection is that this is not the case. Certainly equivalence
> of recursive languages is undecidable, but regular languages are
> highly restricted special cases.


Except that Perl's REs (as most other REs that are in common use today) are
not"regular" in the Computer Science classification any longer but have been
extended.

jue


 
Reply With Quote
 
Robert Inder
Guest
Posts: n/a
 
      04-14-2005

>>>>> Jürgen Exner writes:

> Subject: Re: Comparing Regular Expressions
> Date: Wed, 13 Apr 2005 23:22:52 GMT


> Mark Jason Dominus wrote:
>> In article <d30g15$729$(E-Mail Removed)>,
>> Brian McCauley <(E-Mail Removed)> wrote:
>>> My instinct says that will be "hard". Not just "hard" in the common
>>> sense, but hard in the CompSci sense of being equivalent to the
>>> halting problem (i.e. insoluble in the general case in finite time).

>>
>> My recollection is that this is not the case. Certainly equivalence
>> of recursive languages is undecidable, but regular languages are
>> highly restricted special cases.


> Except that Perl's REs (as most other REs that are in common use
> today) are not"regular" in the Computer Science classification
> any longer but have been extended.


Right now, I'm just interested in the what can and can't be done.

So if some restrictions on the regexps would let me get a handle on
comparing them, I'd be intrigued.

Even some pointers to work that "merely" dealt with regular RE's
would be a great help.

> jue


Robert.

--
__ To avoid the spam trap, mail me
|_) _ |_ _ ._ |- | _ _| _ ._ at bcs.org.uk, not deadspam.com.
| \(_)|_)(-'| |_ || |(_|(-'| '
Best viewed in Ebriated.
 
Reply With Quote
 
Jean-Marc Bourguet
Guest
Posts: n/a
 
      04-15-2005
Robert Inder <(E-Mail Removed)> writes:

> Even some pointers to work that "merely" dealt with regular RE's
> would be a great help.


I think both Torben Ægidius Mogensen
((E-Mail Removed)) and me
((E-Mail Removed)) have outlined the
algorithms you want. It would help to give you better
responses if you state what you think is missing.

The algorithm for going from a RE to a FSM is probably in
every introduction book on compiler.

Yours,

--
Jean-Marc
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Custom Regular Expressions in ASP.net Jay Douglas ASP .Net 3 11-03-2003 08:09 PM
Regular expressions mark Perl 4 10-28-2003 12:37 PM
perl regular expressions return last matched occurence? Dustin D. Perl 1 08-28-2003 01:51 AM
matching curly braces and regular expressions Dustin D. Perl 0 08-26-2003 11:18 PM
Add custom regular expressions to the validation list of available expressions Jay Douglas ASP .Net 0 08-15-2003 10:19 PM



Advertisments