Robert Inder <(EMail Removed)> writes:
> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.
Regular languages are closed under intersection and set difference, so
you can not only find strings (if any exist) in the intersection and
difference sets, you can construct regular expressions or DFA's that
describe all such strings.
> So, for instance, suppose I believe that if an object has a
> descripiton string matching regexp R1, it is of type A.
> Someone now tells me that an object with a
> descripiton string matching regexp R2 is of type B. What can I tell
> about (the sets of things of type) A and B by looking at R1 and R2?
>
> So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.
>
> Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.
Say what? The set containing only the string "foobar" is not a subset
of the set containing only "foo". Or do you consider a string to
match a regular expression if a prefix of the string does?
> But how do I formalise (i.e. program) this comparison process?
>
> This strikes me as a very general problem/issue and there
> just has to be some work on formalising and solving it. But I don't
> know how to find it: Google searches for things like "regular
> expression comparison" are swamped by algorithms for matching regexps
> against strings and so forth, which isn't what I want.
If you have a DFA D1 with states s_0 ... s_m for A and a DFA D2 with
states t_0 ... t_n for B, you can construct a DFA D3 for the
intersection of A and B in the following way:
1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.
2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.
3. If s_i is accepting in D1 and t_k is accepting in D2, (s_i,t_k) is
accepting in D3.
For A\B (A minus B), the construction is:
0. Add a nonaccepting state t_(n+1) to D2. If there is a state t_k
in D2 and a symbol c such that t_k has no transition on c, add a
transition from t_k to t_(n+1) on c. There is a transition from
t_(n+1) to t_(n+1) on all symbols. This way, D2 has transitions
on all symbols from all states.
1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.
2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.
3. If s_i is accepting in D1 and t_k is not accepting in D2,
(s_i,t_k) is accepting in D3.
Torben
