Velocity Reviews - Computer Hardware Reviews

simple regex pattern sought

 
 
markspace
Guest
Posts: n/a
 
      05-26-2012
On 5/26/2012 6:19 AM, Roedy Green wrote:

> exercisePattern( Pattern.compile(
> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
> empty strings
> // (?: ) is a non-capturing group. This is Robert Klemme's
> contribution. I don't understand how it works.



Ah, OK, so here's my contribution to your excellent SSCCE. First this
pattern is basically the same as mine. It uses alternation (the
vertical bar |) to pick a string delimited by either ' or "

Here's his regex string without the extra escapes for Java:

"(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
^^^^^^^^^^^^^^^^

Let's look at just the first half for a moment, without the (?:\\. part.

"[^\"]*"
^^^^^^^^
12     3
Example for the first part:
1. " string starts with double quote
2. [^\"]* doesn't contain a "
3. " ends with double quote

Same for the second half of the string.

Notice he's using * instead of +, which is why his pattern also matches
empty (zero-length) strings.
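
To make the * versus + point concrete, here is a small sketch (not part of the original test program; the class and method names are made up for illustration). It runs the two simple alternation patterns from the test program against one of its inputs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVersusPlus {
    // Hypothetical helper: returns the first match of regex in text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String star = "\"[^\"]*\"|'[^']*'"; // zero or more characters between the quotes
        String plus = "\"[^\"]+\"|'[^']+'"; // one or more characters between the quotes
        String text = "empty: \"\"xx";

        System.out.println(firstMatch(star, text)); // ""
        System.out.println(firstMatch(plus, text)); // null
    }
}
```

With * the empty pair of quotes is a match; with + nothing in that input matches.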

The other part didn't appear in your problem statement, but in HTML/XML
it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
inclusion is very reasonable.

So Robert adds (\\.|[^\"])* to the first part, which is

(\\.|[^\"])*
12 345     6

1. Start a group
2. A backslash. It needs to be escaped for regex, hence \\.
3. . is regex "any character". 2 and 3 together mean "match \ followed
by any character"
4. OR (alternation again)
5. character class, negated (the ^), meant to match anything except \ or ".
I think this is a mistake: as written, \" is just an escaped ", so the \
itself needs to be escaped (\\).
6. zero or more.

Then after that mess, he does the obvious thing and makes the group
non-capturing, (?: ), to make the regex do a little less work.

"(?:\\.|[^\"])*"

Phew! Next, he adds one alternation and does the same for a ' delimited
string.

|'(?:\\.|[^\'])*'

Same thing, just ' instead of ".
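
A minimal sketch of what the escape handling buys, using the 'Bob\'s your uncle.' example from above. It assumes the character classes are written with the backslash excluded ([^\\"] and [^\\']), which is what the pattern appears to intend; the class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeAwareQuotes {
    // Naive version: a quoted string ends at the first matching quote character.
    static final Pattern NAIVE = Pattern.compile("\"[^\"]*\"|'[^']*'");

    // Escape-aware version, with the backslash excluded from each character class
    // (an assumption about the intended pattern, not the string as posted).
    static final Pattern ESCAPE_AWARE =
            Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|'(?:\\\\.|[^\\\\'])*'");

    // Hypothetical helper: first match of p in text, or null.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "escaped: 'Bob\\'s your uncle.'";
        System.out.println(firstMatch(NAIVE, text));        // 'Bob\'
        System.out.println(firstMatch(ESCAPE_AWARE, text)); // 'Bob\'s your uncle.'
    }
}
```

The naive pattern stops at the escaped quote; the escape-aware one consumes the backslash together with the character after it and carries on to the real closing quote.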

Finally I think this could be simplified slightly with Lew's
back-reference idea.

(['"])(?:\\.|[^\1\\])*

(Untested.) This allows empty strings between delimiters; instead of a
* use + for only non-empty strings between the quotes.



My executive summary:

Regex is a great rapid development tool, except when it isn't. You
realize your problem is simple, and you could have hand-coded a parser
to do this much quicker than all these news post exchanges?
 
 
 
 
 
markspace
Guest
Posts: n/a
 
      05-26-2012
On 5/26/2012 7:37 AM, Robert Klemme wrote:
> On 26.05.2012 03:43, markspace wrote:
>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>
>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"

>>

....
>> and I don't think you need to in a regex
>> either (although I didn't check that).

>
> There is also no regexp escaping of single quotes either. The only
> regexp escaping you can see are the \\\\ which translate into \\ in the
> string which is a literal backslash for the regexp engine.



Yes, there is, although I think it's a typo. Both \\\" and \\' get
passed to the regex as \" and \', which means just a single character "
and ' respectively.
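
A quick way to check this is to print the Java string, which shows what the regex engine actually receives, and then try an input where the two classes behave differently. This is only a sketch: the class name, constant names and the test input are mine, not from the thread.

```java
import java.util.regex.Pattern;

class ClassEscapeCheck {
    // As posted: the Java literal [^\\\"] reaches the regex engine as [^\"],
    // which excludes only the double quote.
    static final String AS_POSTED = "\"(?:\\\\.|[^\\\"])*\"";

    // With the missing escape added: the engine sees [^\\"], which excludes
    // both the backslash and the double quote.
    static final String CORRECTED = "\"(?:\\\\.|[^\\\\\"])*\"";

    // Hypothetical helper: true when regex matches somewhere in text.
    static boolean finds(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(AS_POSTED); // what the regex engine receives
        System.out.println(CORRECTED);

        // An unterminated string: the only quote after the opening one is escaped.
        String unterminated = "\"abc\\\"";
        System.out.println(finds(AS_POSTED, unterminated)); // true
        System.out.println(finds(CORRECTED, unterminated)); // false
    }
}
```

With the class as posted, the engine can backtrack, let [^\"] swallow the backslash, and treat the escaped quote as the closing delimiter. Excluding the backslash from the class closes that hole.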

You're right about the rest of it though. With so many \'s floating
around, I have a hard time reading Java regex!


> It's not parenthesis around character classes but around the alternative
> of "match a backslash followed by any char" and "any char which is not
> backslash or the opening quote type of this string variant".



Yup, I totally missed this too. Thanks for pointing it out.

 
 
 
 
 
Robert Klemme
Guest
Posts: n/a
 
      05-26-2012
On 26.05.2012 16:57, markspace wrote:
> On 5/26/2012 6:19 AM, Roedy Green wrote:
>
>> exercisePattern( Pattern.compile(
>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) ); // works, accepts
>> empty strings
>> // (?: ) is a non-capturing group. This is Robert Klemme's
>> contribution. I don't understand how it works.

>
>
> Ah, OK, so here's my contribution to your excellent SSCCE. First this
> pattern is basically the same as mine. It uses alternation (the vertical
> bar |) to pick a string delimited by either ' or "
>
> Here's his regex string without the extra escapes for Java:
>
> "(?:\\.|[^\"])*"|'(?:\\.|[^\'])*'
> ^^^^^^^^^^^^^^^^
>
> Let's look at just the first half for a moment, without the (?:\\. part.
>
> "[^\"]*"
> ^^^^^^^^
> 12     3
> Example for the first part:
> 1. " string starts with double quote
> 2. [^\"]* doesn't contain a "
> 3. " ends with double quote
>
> Same for the second half of the string.
>
> Notice he's using * instead of +'s, which is why his matches 0 width
> strings.
>
> The other part didn't appear in your problem statement, but in HTML/XML
> it's allowed to escape characters. E.g., 'Bob\'s your uncle.' So his
> inclusion is very reasonable.
>
> So Robert adds (\\.|[^\"])* to the first part, which is
>
> (\\.|[^\"])*
> 12 345     6
>
> 1. Start a group
> 2. A slash. It needs to be escaped for regex, hence \\.
> 3. . is regex "any character". 2 and 3 together mean "match \ followed
> by any character"
> 4. OR (alternation again)
> 5. character class, negated (the ^), matches anything except \ or ". I
> think this is a mistake: the \ needs to be quoted.


Oh, right, thanks for finding that!

> 6. zero or more.
>
> Then after that mess, he does the obvious thing and adds non-capturing
> group, to make the regex do a little less work.
>
> "(?:\\.|[^\"])*"
>
> Phew! Next, he adds one alternation and does the same for a ' delimited
> string.
>
> |'(?:\\.|[^\'])*'
>
> Same thing, just ' instead of ".
>
> Finally I think this could be simplified slightly with Lew's
> back-reference idea.
>
> (['"])(?:\\.|[^\1\\])*
>
> (Untested.) This allows empty strings between delimiters; instead of a *
> use + for only non-empty strings between the quotes.


Interesting approach - but it doesn't work. Simple test with
Pattern.compile("(.)[a\\1]"):

Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal/unsupported escape sequence near index 6
(.)[a\1]
^
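
The same check can be wrapped so that it reports the failure instead of terminating. This is a sketch only; the class and method names are hypothetical, and it relies on nothing beyond Pattern.compile and PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BackrefInClass {
    // Returns true when the regex compiles, false when Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.out.println(e.getDescription() + ": " + e.getPattern());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("(.)[a\\1]")); // back-reference inside a class: rejected
        System.out.println(compiles("(.)a\\1"));   // back-reference outside a class: accepted
    }
}
```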

> My executive summary:
>
> Regex is a great rapid development tool, except when it isn't. You
> realize your problem is simple, and you could have hand-coded a parser
> to do this much quicker than all these news post exchanges?


Maybe, maybe not.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
 
Robert Klemme
Guest
Posts: n/a
 
      05-26-2012
On 26.05.2012 17:06, markspace wrote:
> On 5/26/2012 7:37 AM, Robert Klemme wrote:
>> On 26.05.2012 03:43, markspace wrote:
>>> On 5/25/2012 3:12 PM, Robert Klemme wrote:
>>>
>>>> "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'"
>>>

> ...
>>> and I don't think you need to in a regex
>>> either (although I didn't check that).

>>
>> There is also no regexp escaping of single quotes either. The only
>> regexp escaping you can see are the \\\\ which translate into \\ in the
>> string which is a literal backslash for the regexp engine.

>
>
> Yes, there is, although I think it's a typo. Both \\\" and \\' get
> passed to the regex as \" and \', which means just a single character "
> and ' respectively.


Right you are - both times: there is regexp escaping and it was in fact
a typo (missing \\)!

> You're right about the rest of it though. With so many \'s floating
> around, I have a hard time reading Java regex!


That's true for other languages as well - the basic reason is that the
same character is used for

- escaping in strings
- escaping in regular expressions
- escaping in the source text (in this case we could pick another
character)

>> It's not parenthesis around character classes but around the alternative
>> of "match a backslash followed by any char" and "any char which is not
>> backslash or the opening quote type of this string variant".

>
>
> Yup, I totally missed this too. Thanks for pointing it out.


You're welcome! Thank you again for finding the missing escape.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
 
markspace
Guest
Posts: n/a
 
      05-26-2012
On 5/26/2012 8:13 AM, Robert Klemme wrote:
> On 26.05.2012 16:57, markspace wrote:
>> Finally I think this could be simplified slightly with Lew's
>> back-reference idea.
>>
>> (['"])(?:\\.|[^\1\\])*
>>
>> (Untested.) This allows empty strings between delimiters; instead of a *
>> use + for only non-empty strings between the quotes.

>
> Interesting approach - but it doesn't work. Simple test with
> Pattern.compile("(.)[a\\1]"):
>
> Exception in thread "main" java.util.regex.PatternSyntaxException:
> Illegal/unsupported escape sequence near index 6
> (.)[a\1]
> ^



Yup, [] is for characters, and \1 could be a string. Gets rejected. I
think you could use "negative lookahead" to say "not this string" when
parsing. Gets kinda ugly though.

<http://www.regular-expressions.info/conditional.html>

Java:

"(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1"

Regex:

(['"])(?:\\.|(?!\1|\\).)+\1
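
For anyone who wants to try the lookahead version in isolation, here is a small sketch. The pattern string is the Java one above; the class name, helper and sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadQuotes {
    // Group 1 captures the opening quote. Each following character is either an
    // escaped pair (backslash plus any character) or a single character that is
    // neither the opening quote nor a backslash. The closing quote must equal group 1.
    static final Pattern QUOTED =
            Pattern.compile("(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1");

    // Hypothetical helper: first quoted string in text, or null.
    static String firstMatch(String text) {
        Matcher m = QUOTED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\"John's restaurant\""));   // "John's restaurant"
        System.out.println(firstMatch("'Bob\\'s your uncle.'"));   // 'Bob\'s your uncle.'
        System.out.println(firstMatch(" empty: \"\"xx"));          // null, + needs one character
        System.out.println(firstMatch(" 'unbalanced\""));          // null
    }
}
```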

I re-did Roedy's test program to be a bit more clear about what it was
looking for, and the results. This could be even cleaner if it was run
with a JUnit test harness.

At this point though the regex is basically just a mess. Download antlr
and get an XML/HTML grammar from online.



package quicktest;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.lang.System.out;

/**
*
* @author Brenden
*/
public class MindProdRegex {

}

/*
 * [TestRegexFindQuotedString.java]
 *
 * Summary: Finding a quoted String with a regex.
 *
 * Copyright: (c) 2012 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 *          http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 *  1.0 2012-05-25 initial release
 */

/**
* Finding a quoted String with a regex.
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0 2012-05-25 initial release
* @since 2012-05-25
*/
class TestRegexFindQuotedString
{
    // ------------------------------ CONSTANTS ------------------------------

    /** Pairs of strings: the text to search, followed by the expected match. */
    private static final String[] vectors = {
            "Basic: George said \"that's theticket\".",
            "\"that's theticket\"",
            "Nested: Jeb replied '\"ticket?\"what ticket'.",
            "'\"ticket?\"what ticket'",
            "Non-ASCII: \"How na\u00efve!\".",
            "\"How na\u00efve!\"",
            " empty: \"\"xx",
            "\"\"",
            " escaped: 'Bob\\'s your uncle.'",
            "'Bob\\'s your uncle.'",
            " 'unbalanced\"",
            "",
    };

    // -------------------------- STATIC METHODS --------------------------

    /**
     * Exercise the pattern to see what it can find.
     */
    static void exercisePattern( Pattern pattern )
    {
        out.println();
        out.println( "Pattern: " + pattern );
        // vectors holds pairs: the text to search, then the expected match
        for ( int i = 0; i < vectors.length; i += 2 ) {
            String test = vectors[i];
            String result = vectors[i + 1];
            final Matcher m = pattern.matcher( test );
            boolean found = m.find();
            String groupString = found ? m.group() : null;
            // an empty expected result means that no match is expected
            boolean correct = found ? groupString.equals( result ) : result.isEmpty();
            out.println( test + ", found: " + found +
                    ", correct: " + correct + " (" + groupString + ")" );
        }
    }

    // --------------------------- main() method ---------------------------

    /**
     * test harness
     *
     * @param args not used
     */
    public static void main( String[] args )
    {
        // We want to find Strings of the form "xx'xx" or 'xx"xx'
        // We want to avoid the following problems:
        // 1. Works even if String contains foreign languages, even Russian or accented letters.
        // 2. If starts with " must end with ", if starts with ' must end with '.
        // 3. ' is ok inside "...", and " is ok inside '...'
        // 4. We don't worry about how to use ' inside '...'.

        // here are some suggested techniques:

        // fails 1 2 3
        exercisePattern( Pattern.compile( "[\"']\\p{Print}+?[\"']" ) );

        // fails 2 3
        exercisePattern( Pattern.compile( "[\"'][^\"']+[\"']" ) );

        // fails 3, uses a capturing group.
        exercisePattern( Pattern.compile( "([\"'])[^\"']+\\1" ) );

        // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile( "\"[^\"]+\"|'[^']+'" ) );

        // works, rejects empty strings by Mark Space.
        exercisePattern( Pattern.compile( "(['\"])(?:\\\\.|(?!\\1|\\\\).)+\\1" ) );

        // works, accepts empty strings by Robert Klemme.
        exercisePattern( Pattern.compile( "\"[^\"]*\"|'[^']*'" ) );

        // works, accepts empty strings
        // (?: ) is a non-capturing group. This is Robert Klemme's contribution.
        // I don't understand how it works.
        exercisePattern( Pattern.compile( "\"(?:\\\\.|[^\\\"])*\"|'(?:\\\\.|[^\\'])*'" ) );
    }
}
 
 
Lew
Guest
Posts: n/a
 
      05-26-2012
markspace wrote:
> Lew wrote:
>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I don't know.
>>

> This would match "John's restaurant" as "John'.
>
> The first quote matches ", John does not contain either ' or " as specified,
> and the last character class matches the '. Not I think what is wanted.


As I corrected in my very next post.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedi.../c/cf/Friz.jpg
 
 
Roedy Green
Guest
Posts: n/a
 
      05-26-2012
On Sat, 26 May 2012 10:08:58 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>I re-did Roedy's test program to be a bit more clear about what it was
>looking for, and the results. This could be even cleaner if it was run
>with a JUnit test harness.


Thanks Brendan. I have incorporated your suggestions plus a bit more
polishing.

See http://mindprod.com/jgloss/regex.html#FINDQUOTED

for a formatted listing + output.

The next task, probably procrastinated, is to solve it with a little
finite state automaton that decodes \x as well, and a simpler version
without. If a newbie is interested in tackling that, they can look at
my Java snippet parser as part of JPrep/JDisplay and strip it down.
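
As a rough idea of the size of such a hand-written scanner, here is a sketch. It is not Roedy's JPrep/JDisplay code and all names are hypothetical. It only skips over backslash escapes and does not decode them, and it accepts an empty string between the quotes.

```java
class QuotedStringScanner {
    // Returns the first quoted string in text, delimiters included, or null.
    // A backslash makes the scanner skip the character that follows it.
    static String findQuoted(String text) {
        for (int start = 0; start < text.length(); start++) {
            char quote = text.charAt(start);
            if (quote != '"' && quote != '\'') {
                continue; // not an opening delimiter
            }
            int i = start + 1;
            while (i < text.length()) {
                char c = text.charAt(i);
                if (c == '\\') {
                    i += 2; // skip the backslash and the escaped character
                } else if (c == quote) {
                    return text.substring(start, i + 1); // matching closing delimiter
                } else {
                    i++;
                }
            }
            // Unterminated from this opening quote: try the next candidate.
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findQuoted("escaped: 'Bob\\'s your uncle.'"));        // 'Bob\'s your uncle.'
        System.out.println(findQuoted("Nested: Jeb replied '\"ticket?\"what ticket'.")); // '"ticket?"what ticket'
        System.out.println(findQuoted(" 'unbalanced\""));                         // null
    }
}
```

The two states (outside a string, inside a string opened by a given quote) are implicit in the outer and inner loops.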
--
Roedy Green Canadian Mind Products
http://mindprod.com
I would be quite surprised if the NSA (National Security Agency)
did not have a computer program to scan bits of shredded
documents and electronically put them back together like a giant
jigsaw puzzle. This suggests you cannot just shred, you must also burn.
..
 
 
markspace
Guest
Posts: n/a
 
      05-27-2012
On 5/26/2012 2:07 PM, Lew wrote:
> markspace wrote:
>> Lew wrote:
>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
>>> don't know.
>>>

>> This would match "John's restaurant" as "John'.
>>
>> The first quote matches ", John does not contain either ' or " as
>> specified,
>> and the last character class matches the '. Not I think what is wanted.

>
> As I corrected in my very next post.
>



Unfortunately that one doesn't work either. The central part, [^"'],
doesn't allow a match of a ' if the starting delimiter was a ", and that
doesn't match Roedy's spec. "John's restaurant" wouldn't be matched at
all, because the matcher couldn't match past the ' to get to the ".
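
Both failure modes can be seen side by side in a short sketch. The first pattern is Lew's original; the second is the back-reference variant from Roedy's test program, which I am assuming is the correction being referred to; the third is the plain alternation. The class and helper names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MixedDelimiters {
    // Hypothetical helper: returns the first match of regex in text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "\"John's restaurant\"";

        // Any quote may close the string: stops at the apostrophe.
        System.out.println(firstMatch("[\"'][^\"']+[\"']", text));  // "John'

        // Closing quote must equal the opening one, but no quotes allowed inside.
        System.out.println(firstMatch("([\"'])[^\"']+\\1", text));  // null

        // One alternative per delimiter: the other quote is allowed inside.
        System.out.println(firstMatch("\"[^\"]+\"|'[^']+'", text)); // "John's restaurant"
    }
}
```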

I think the easiest is to write out a grammar for the expression, then
translate to regex.

QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING

SQUOTED_STRING := ' NON_S_QUOTE + '

DQUOTED_STRING := " NON_D_QUOTE + "

NON_S_QUOTE := [^']

NON_D_QUOTE := [^"]

At this point the grammar is very clear. (Note I haven't included
Robert's \x escape sequences.) I think it's worth learning to use antlr
rather than regex, which tends to obfuscate more than it helps.
However, a literal translation into regex isn't hard, and a literal
translation avoids mis-optimizations.
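
A literal translation could look something like the sketch below, with one Java constant per grammar rule. The constant names mirror the grammar above; the class name and helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrammarToRegex {
    // One constant per grammar rule, assembled bottom-up.
    static final String NON_S_QUOTE = "[^']";
    static final String NON_D_QUOTE = "[^\"]";
    static final String SQUOTED_STRING = "'" + NON_S_QUOTE + "+'";
    static final String DQUOTED_STRING = "\"" + NON_D_QUOTE + "+\"";
    static final Pattern QUOTED_STRING =
            Pattern.compile(SQUOTED_STRING + "|" + DQUOTED_STRING);

    // Hypothetical helper: first quoted string in text, or null.
    static String firstMatch(String text) {
        Matcher m = QUOTED_STRING.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(QUOTED_STRING.pattern());            // '[^']+'|"[^"]+"
        System.out.println(firstMatch("\"John's restaurant\"")); // "John's restaurant"
    }
}
```

Keeping the rule names in the code makes it easier to see which part of the regex corresponds to which part of the grammar.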


 
 
Lew
Guest
Posts: n/a
 
      05-27-2012
markspace wrote:
> Lew wrote:
>> markspace wrote:
>>> Lew wrote:
>>>> Use a regex like "[\"'][^\"']+[\"']" is one way. The cleanest? I
>>>> don't know.
>>>>
>>> This would match "John's restaurant" as "John'.
>>>
>>> The first quote matches ", John does not contain either ' or " as
>>> specified,
>>> and the last character class matches the '. Not I think what is wanted.

>>
>> As I corrected in my very next post.

>
> Unfortunately that one doesn't work either. The central part, [^"'], doesn't
> allow a match of a ' if the starting delimiter was a ", and that doesn't match
> Roedy's spec. "John's restaurant" wouldn't be matched at all, because the
> matcher couldn't match past the ' to get to the ".
>
> I think the easiest is to write out a grammar for the expression, then
> translate to regex.
>
> QUOTED_STRING := SQUOTED_STRING | DQUOTED_STRING
>
> SQUOTED_STRING := ' NON_S_QUOTE + '
>
> DQUOTED_STRING := " NON_D_QUOTE + "
>
> NON_S_QUOTE := [^']
>
> NON_D_QUOTE := [^"]
>
> At this point the grammar is very clear. (Note I haven't included Robert's \x
> escape sequences.) I think it's worth learning to use antlr rather than regex,
> which tends to obfuscate more than it helps. However, a literal translation
> into regex isn't hard, and a literal translation avoids mis-optimizations.


Very illuminating. Thank you.

--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedi.../c/cf/Friz.jpg
 
 
 
 