Velocity Reviews > Perl > Construction of a non-regexable subset of the set of all strings

# Construction of a non-regexable subset of the set of all strings

Dominic van der Zypen
Guest
Posts: n/a

 06-23-2006
Hello,

I have a question which I am afraid might seem "overly scientific" (not
to say nutty), but which I find is quite natural after all.

If we look at the set S of all finite strings, then each regex
represents some subset of S. For instance, h(a|e)ndel is the set
consisting of handel and hendel; and the regex .+ represents the set
of all non-empty strings not containing \n.

I wondered whether there is a "non-regexable" subset N of S (i.e. a
subset N for which there is no regex representing N) and came up with
a (non-constructive) positive answer from set theory: there are
uncountably many subsets of S, but there are only countably many
regular expressions, so there are non-regexable subsets.

What I would like to know is whether one can construct a subset N of S
which is non-regexable (and whether it is hard to prove that N has the
desired property).

Many thanks in advance and best wishes, Dominic

John W. Krahn
Guest
Posts: n/a

 06-23-2006
Dominic van der Zypen wrote:
>
> I have a question which I am afraid might seem "overly scientific" (not
> to say nutty), but which I find is quite natural after all.
>
> If we look at the set S of all finite strings, then each regex
> represents some subset of S. For instance, h(a|e)ndel is the set
> consisting of handel and hendel; and the regex .+ represents the set
> of all non-empty strings not containing \n.
>
> I wondered whether there is a "non-regexable" subset N of S (i.e. a
> subset N for which there is no regex representing N) and came up with
> a (non-constructive) positive answer from set theory: there are
> uncountably many subsets of S, but there are only countably many
> regular expressions, so there are non-regexable subsets.
>
> What I would like to know is whether one can construct a subset N of S
> which is non-regexable (and whether it is hard to prove that N has the
> desired property).

Are you talking about regular expressions in general or only Perl's regular
expressions? If this is a generic regular expression question you may want to

John
--
use Perl;
program
fulfillment

David Squire
Guest
Posts: n/a

 06-23-2006
Glenn Jackman wrote:
> At 2006-06-23 08:45AM, Dominic van der Zypen <(E-Mail Removed)> wrote:
>> Hello,
>>
>> I have a question which I am afraid might seem "overly scientific" (not
>> to say nutty), but which I find is quite natural after all.
>>
>> If we look at the set S of all finite strings, then each regex
>> represents some subset of S. For instance, h(a|e)ndel is the set
>> consisting of handel and hendel; and the regex .+ represents the set
>> of all non-empty strings not containing \n.
>>
>> I wondered whether there is a "non-regexable" subset N of S (i.e. a
>> subset N for which there is no regex representing N) and came up with
>> a (non-constructive) positive answer from set theory: there are
>> uncountably many subsets of S, but there are only countably many
>> regular expressions, so there are non-regexable subsets.
>>
>> What I would like to know is whether one can construct a subset N of S
>> which is non-regexable (and whether it is hard to prove that N has the
>> desired property).

>
> It seems to me that, for every string x in S, one can construct at least
> one regular expression to match it, i.e. ^x\$
>

Yes, but the set of all subsets of S is not the same as the set of
single strings in S (i.e. S).

DS

Ted Zlatanov
Guest
Posts: n/a

 06-23-2006
On 23 Jun 2006, http://www.velocityreviews.com/forums/(E-Mail Removed) wrote:

> I wondered whether there is a "non-regexable" subset N of S (i.e. a
> subset N for which there is no regex representing N) and came up with
> a (non-constructive) positive answer from set theory: there are
> uncountably many subsets of S, but there are only countably many
> regular expressions, so there are non-regexable subsets.

> What I would like to know is whether one can construct a subset N of S
> which is non-regexable (and whether it is hard to prove that N has the
> desired property).

(S: set of all finite strings)

I assume by "regexable" you mean "uniquely regexable", meaning that
the regular expression would not match S-N.

Any finite subset N of S would be regexable, since you can construct a
regular expression that does an OR of all the finite strings in N.

Some infinite subsets (e.g. all the strings that begin with "a") could
be regexable, but generally you can't uniquely match a random infinite
collection of strings with a regular expression, since there's nothing
for the regular expression to "latch" onto. I'm sure there's a more
rigorous analysis, perhaps using the Huffman distance, but I would say
that if a set is infinite and constructed without a specific set of
rules, it's not possible in finite time to *build* a regular
expression to match only that set.

Ted