Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Javascript > RegExp as Finite State Machine

Reply
Thread Tools

RegExp as Finite State Machine

 
 
Georg Jaehnig
Guest
Posts: n/a
 
      12-30-2008
Hi,

I wonder if it's possible to have Finite State Machines (FSM) [1] in
Javascript. Basically, they must be already implemented because a
RegExp is a FSM , both describe a regular language.

But the formalism of FSMs allows much more than RegExps seem to allow.
Intersection for instance:

pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
pattern2 = new RegExp( "(a|b)" ); // contains a,b

// This would be nice:
pattern3 = RegExp.intersect( pattern1, pattern2 )
// pattern3 = /"(a|b)"/

So do you maybe know a library with those functions? If not, is there
any way to access the internal representation of the RegExp object to
implement e.g. an intersection algorithm?

[1] http://en.wikipedia.org/wiki/Finite_state_machine

--
Georg | http://serchilo.net - command the web

 
Reply With Quote
 
 
 
 
Stevo
Guest
Posts: n/a
 
      12-30-2008
Georg Jaehnig wrote:
> I wonder if it's possible to have Finite State Machines (FSM) [1] in
> Javascript. Basically, they must be already implemented because a
> RegExp is a FSM , both describe a regular language.


What ? Which language does a state machine represent ? Greek ? When did
the dictionary-eating types (that like to take a simple concept that can
be described in a paragraph or two, but instead turn it into a 5-page
marketing-speak document) decide to add the word Finite to state
machines just so they can make it a TLA ? It's always been a plain state
machine in my experience. None of the state machines I ever programmed
"described a regular language". They all handled various input events
and made some choices based on the current state, resulting in some
output and sometimes a change in state.

When did RegExp become a state machine ? It's a regular expression
search [and replace] function.

> But the formalism of FSMs allows much more than RegExps seem to allow.


What ? Oh it's the formalism that does it, I never realized it was the
formalism. Is it's RegExps informalism (or is it unformalisticness?)
that lets it down ?

> Intersection for instance:
>
> pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
> pattern2 = new RegExp( "(a|b)" ); // contains a,b
>
> // This would be nice:
> pattern3 = RegExp.intersect( pattern1, pattern2 )
> // pattern3 = /"(a|b)"/
>
> So do you maybe know a library with those functions? If not, is there
> any way to access the internal representation of the RegExp object to
> implement e.g. an intersection algorithm?
>
> [1] http://en.wikipedia.org/wiki/Finite_state_machine


I very much doubt you'll find anything like that.
 
Reply With Quote
 
 
 
 
Thomas 'PointedEars' Lahn
Guest
Posts: n/a
 
      01-01-2009
Georg Jaehnig wrote:
> I wonder if it's possible to have Finite State Machines (FSM) [1] in
> Javascript. Basically, they must be already implemented because a
> RegExp is a FSM , both describe a regular language.


A regular expression *engine* must be an FSM, because the formal
language it accepts is regular -- Regular Expressions.

AFAIK, a RegEx engine is implemented either as a DFA or an NFA. See
also: Friedl, Jeffrey E. F.; Mastering Regular Expressions, Second
Edition, chapter 4 "The Mechanics of Expression Processing", O'Reilly,
1997 (you can read that sample chapter online, for free); maybe it is
also in the Third Edition, O'Reilly, 2006.

The Specification of ECMAScript RegExp processing can be found in the
ECMAScript Language Specification (ECMA-262), Edition 3 Final [ES3F],
about page 160. There are at least two implementations of it that are
Open Source, Mozilla.org SpiderMonkey and Apple JavaScriptCore (of
WebKit).

> But the formalism of FSMs allows much more than RegExps seem to allow.


Of course. The former is an formal automaton, the latter is the
implementation of a formal language.

> Intersection for instance:
>
> pattern1 = new RegExp( "(a|b|c)" ); // contains a,b,c
> pattern2 = new RegExp( "(a|b)" ); // contains a,b
>
> // This would be nice:
> pattern3 = RegExp.intersect( pattern1, pattern2 )
> // pattern3 = /"(a|b)"/
>
> So do you maybe know a library with those functions?


My RegExp library, but it is not online yet.

> If not, is there
> any way to access the internal representation of the RegExp object to
> implement e.g. an intersection algorithm?


(I'll answer anyway if you don't mind.)

An FSM reads a string of characters from an input alphabet. So if you
want to do with RegExp *objects* what you can do with a modified FSM,
you have to use the *string* representation of the object.

If you had cared to read [ES3F], you would have known that all
ECMAScript values have such a representation. RegExp objects, in
particular, inherit a property, ‘source’, which evaluates to it, and a
method, toString(), which returns it. (I am not completely sure as to
the availability of the former yet. It is specified, but I have only
tested it positive from JScript 3.0 [IE/Win 4.0] on, and in Safari on
this iPhone 3G [firmware version 2.2 (5G77)].)

Trimming and concatenating the results, and passing them to the RegExp
factory/constructor is left as an exercise to the reader.


HTH

PointedEars
 
Reply With Quote
 
Stevo
Guest
Posts: n/a
 
      01-01-2009
Thomas 'PointedEars' Lahn wrote:
> Georg Jaehnig wrote:
>> I wonder if it's possible to have Finite State Machines (FSM) [1] in
>> Javascript. Basically, they must be already implemented because a
>> RegExp is a FSM , both describe a regular language.

>
> A regular expression *engine* must be an FSM, because the formal
> language it accepts is regular -- Regular Expressions.


You found a kindred spirit Georg. One that speaks your language.

>> But the formalism of FSMs allows much more than RegExps seem to allow.

>
> Of course. The former is an formal automaton, the latter is the
> implementation of a formal language.
>
> PointedEars


Happy New Year Thomas, You're my hero
 
Reply With Quote
 
Lasse Reichstein Nielsen
Guest
Posts: n/a
 
      01-04-2009
Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:

> Georg Jaehnig wrote:
>> I wonder if it's possible to have Finite State Machines (FSM) [1] in
>> Javascript. Basically, they must be already implemented because a
>> RegExp is a FSM , both describe a regular language.

>
> A regular expression *engine* must be an FSM, because the formal
> language it accepts is regular -- Regular Expressions.


It can be a FSM, but it doesn't have to be one. For the "regular"
expressions in Javascript, it can't be, since they recognize
non-regular languages (but only because of back-references).

Classsical counter-example of regularness:
/^(11+)\1+$/
which recognizes composite numbers (non-primes) in unary notation.

The original definition of Javascript regular expressions was based on
the backtracking implementations in Netscape and IE. The specification
is given specifically in terms of when to backtrack, which makes it
harder to implement the specified semantics without backtracking (e.g.,
if there are more ways to match, the specification is to pick the "first"
one, not the longest one as in other regexp-implementations).

Of the current crop of browsers, AFAIK none of them use a FSM
based implementation of regular expressions.

> The Specification of ECMAScript RegExp processing can be found in the
> ECMAScript Language Specification (ECMA-262), Edition 3 Final [ES3F],
> about page 160. There are at least two implementations of it that are
> Open Source, Mozilla.org SpiderMonkey and Apple JavaScriptCore (of
> WebKit).


The one in JavaScriptCore is currently being rewritten from scratch.
The original version is an adaption of PCRE (Perl-regexps) to Javascript
regexps, and can't really be recommended as an example.
The new version is not feature complete yet.

/L
--
Lasse Reichstein Holst Nielsen
'Javascript frameworks is a disruptive technology'

 
Reply With Quote
 
Thomas 'PointedEars' Lahn
Guest
Posts: n/a
 
      01-04-2009
Lasse Reichstein Nielsen wrote:
> Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:
>> Georg Jaehnig wrote:
>>> I wonder if it's possible to have Finite State Machines (FSM) [1] in
>>> Javascript. Basically, they must be already implemented because a
>>> RegExp is a FSM , both describe a regular language.

>> A regular expression *engine* must be an FSM, because the formal
>> language it accepts is regular -- Regular Expressions.

>
> It can be a FSM, but it doesn't have to be one.


True, it could also be an alternating finite automaton or a read-only Turing
machine. (I'd like to see that!)

> For the "regular" expressions in Javascript, it can't be, since they recognize
> non-regular languages (but only because of back-references).
>
> Classsical counter-example of regularness:
> /^(11+)\1+$/
> which recognizes composite numbers (non-primes) in unary notation.


In which way is that proof that ECMAScript Regular Expressions are not a
regular language?

> Of the current crop of browsers, AFAIK none of them use a FSM
> based implementation of regular expressions.


So sure are you, despite other engines that also support backreferences
(like egrep) have been demonstrated to use FSM implementations?

<http://oreilly.com/catalog/regex/chapter/ch04.html>


PointedEars
 
Reply With Quote
 
Lasse Reichstein Nielsen
Guest
Posts: n/a
 
      01-04-2009
Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:

> Lasse Reichstein Nielsen wrote:


>> Classsical counter-example of regularness:
>> /^(11+)\1+$/
>> which recognizes composite numbers (non-primes) in unary notation.

>
> In which way is that proof that ECMAScript Regular Expressions are not a
> regular language?


The language is not regular (It's composite, the language of "1"'s of
prime length, is not regular. This is failry easily proved by assuming
it's regular, and using pumping to derive a contradition).

The expression is a valid Javascript regular expression that recognizes
a non-regular language, so Javascript regexps are not restricted to the
regular languages (and therefore not possible to implement using only
a finite state machine).

>> Of the current crop of browsers, AFAIK none of them use a FSM
>> based implementation of regular expressions.


> So sure are you, despite other engines that also support backreferences
> (like egrep) have been demonstrated to use FSM implementations?
> <http://oreilly.com/catalog/regex/chapter/ch04.html>


Referring to this?
"You might, however, notice that GNU's version of egrep does support
backreferences. It does so by having two complete engines under the
hood! It first uses a DFA engine to see whether a match is likely,
and then uses an NFA engine (which supports the full flavor,
including backreferences) to confirm the match."

I'm only talking about the current regexp implementations because I
have actually been looking at them.

What I probably /should/ have also said, is that by FSM I mean a
deterministic machine (i.e., a DFA, one that is run linearly over the
input). A nondeterministic machine can use any number of tricks to
implement the nondeterminism, including backtracking.

/L
--
Lasse Reichstein Holst Nielsen
'Javascript frameworks is a disruptive technology'

 
Reply With Quote
 
Thomas 'PointedEars' Lahn
Guest
Posts: n/a
 
      01-04-2009
Lasse Reichstein Nielsen wrote:
> Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:
>> Lasse Reichstein Nielsen wrote:
>>> Classsical counter-example of regularness:
>>> /^(11+)\1+$/
>>> which recognizes composite numbers (non-primes) in unary notation.

>> In which way is that proof that ECMAScript Regular Expressions are not a
>> regular language?

>
> The language is not regular (It's composite, the language of "1"'s of
> prime length, is not regular. This is failry easily proved by assuming
> it's regular, and using pumping to derive a contradition).


Ah, but I think you confuse the language that a regular expression can
match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
language, L_r. In order to prove that the ECMAScript Regular Expression
language is not regular, you have to apply the pumping lemma for regular
languages to the expression itself, not to the strings it could match.

> The expression is a valid Javascript regular expression that recognizes
> a non-regular language, so Javascript regexps are not restricted to the
> regular languages (and therefore not possible to implement using only
> a finite state machine).


Sorry, I don't think you are correct here.

>>> Of the current crop of browsers, AFAIK none of them use a FSM
>>> based implementation of regular expressions.

>>
>> So sure are you, despite other engines that also support backreferences
>> (like egrep) have been demonstrated to use FSM implementations?
>> <http://oreilly.com/catalog/regex/chapter/ch04.html>

>
> Referring to this?
> "You might, however, notice that GNU's version of egrep does support
> backreferences. It does so by having two complete engines under the
> hood! It first uses a DFA engine to see whether a match is likely,
> and then uses an NFA engine (which supports the full flavor,
> including backreferences) to confirm the match."


Yes, exactly.

> I'm only talking about the current regexp implementations because I
> have actually been looking at them.


We'll see.

> What I probably /should/ have also said, is that by FSM I mean a
> deterministic machine (i.e., a DFA, one that is run linearly over the
> input). A nondeterministic machine can use any number of tricks to
> implement the nondeterminism, including backtracking.


But your premise is incorrect. Nondeterministic Finite Automata (NFAs),
also called "nondeterministic finite state machines", are definitely a
subset of Finite State Machines (FSMs).

<http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine>


PointedEars
 
Reply With Quote
 
Stevo
Guest
Posts: n/a
 
      01-05-2009
Thomas 'PointedEars' Lahn wrote:
> Lasse Reichstein Nielsen wrote:
>> Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:
>>> Lasse Reichstein Nielsen wrote:
>>>> Classsical counter-example of regularness:
>>>> /^(11+)\1+$/
>>>> which recognizes composite numbers (non-primes) in unary notation.
>>> In which way is that proof that ECMAScript Regular Expressions are not a
>>> regular language?

>> The language is not regular (It's composite, the language of "1"'s of
>> prime length, is not regular. This is failry easily proved by assuming
>> it's regular, and using pumping to derive a contradition).

>
> Ah, but I think you confuse the language that a regular expression can
> match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
> but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
> language, L_r. In order to prove that the ECMAScript Regular Expression
> language is not regular, you have to apply the pumping lemma for regular
> languages to the expression itself, not to the strings it could match.


Anyone else brave enough to admit this is way over their head? Check out
these two links. The first one might as well be in Japanese for the
amount of sense it makes to me:

http://en.wikipedia.org/wiki/Nondete..._of_NFA-.CE.B5


This one initially looks less complicated, but after reading the whole
section I'm left open-mouthed and saying "Huh?" to myself...

http://en.wikipedia.org/wiki/Regular...age_is_regular


I'm not sure if this newsgroup will consider everything after the # on
those links as part of the links. We'll see.
 
Reply With Quote
 
Lasse Reichstein Nielsen
Guest
Posts: n/a
 
      01-05-2009
Thomas 'PointedEars' Lahn <(E-Mail Removed)> writes:

> Lasse Reichstein Nielsen wrote:


[/^(11+)\1+$/]

> Ah, but I think you confuse the language that a regular expression can
> match, L_m(r) -- he above matched language is context-free (Chomsky Type-2)
> but not regular (Type-3), if I'm not mistaken -- with the Regular Expression
> language, L_r.


Probably. The latter is much less interesting (and definitly context free)

The language recognized by the regexp above is actually not even
context-free. For a language over an alphabet with only one element,
the pumping lemma for context free languages is equivalent to the
one for regular languages. It's probably Type-1.
IIRC, it's known to be in (NP intersect co-NP).

> In order to prove that the ECMAScript Regular Expression
> language is not regular, you have to apply the pumping lemma for regular
> languages to the expression itself, not to the strings it could match.


It's even simpler than that. The regular expression language has
matched parentheses to an arbitrary depth. That alone is enough to
rule out being regular.

>> The expression is a valid Javascript regular expression that recognizes
>> a non-regular language, so Javascript regexps are not restricted to the
>> regular languages (and therefore not possible to implement using only
>> a finite state machine).

>
> Sorry, I don't think you are correct here.


I should be, unless there is a flaw in the logic.

If all Javascript regexps can be implemented using only finite state
machines, then all the languages recognized by them are regular (a FSM
can only recognize a regular language).

Since the language of /^(11+)\1+$/ is not regular, and is a Javascript
regexp, there is at least one language that cannot be implemented using
a FSM.

...

>> What I probably /should/ have also said, is that by FSM I mean a
>> deterministic machine (i.e., a DFA, one that is run linearly over the
>> input). A nondeterministic machine can use any number of tricks to
>> implement the nondeterminism, including backtracking.

>
> But your premise is incorrect. Nondeterministic Finite Automata (NFAs),
> also called "nondeterministic finite state machines", are definitely a
> subset of Finite State Machines (FSMs).


That is definitly true. However, we don't have the hardware yet to
implement NFA's directly, so we have to emulate the nondeterminism
somehow. The typical way is to implement a nondeterministic choice by
trying one branch and setting up a backtrack point to try the other in
case the first fails. This gets you a stack of backtrack points, which
means that the implementation no longer uses a finite amount of
space. The space used can depend on the input, not only the automaton.
This is why I don't think of this implementation as really being a
finite state machine, although it is emulating one at a certain
level of abstraction.

There are other ways to implement a NFA that doesn't use a stack,
but instead determinizes the DFA (possibly lazily during execution).
This can instead lead to exponential space blow-up, which is still
finite, but not always manageable.

/L
--
Lasse Reichstein Holst Nielsen
'Javascript frameworks is a disruptive technology'

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Safe finite state machine design SomeDude VHDL 3 08-14-2006 12:47 PM
Building the finite state machine of a Ruby regexp Gilles Lacroix Ruby 3 08-03-2006 10:16 PM
regexp finite state machine? Walter Roberson Perl Misc 0 03-08-2005 07:21 PM
Gate Level model of a Finite state machine Inderkal VHDL 8 12-09-2004 11:17 PM
HELP PLEASE!! - Finite State Machine - Automaton - Microprogrammed System deejayfred VHDL 0 10-02-2003 01:23 AM



Advertisments