Need help with regular expression to parse URLs

 
 
Roedy Green
      08-10-2009
On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>
>I am having trouble figuring out how to write a regular expression to
>parse out parts of a URL.


The URL/URI classes are designed to take URLs apart and put them back
together. You probably don't even have to roll your own regex.

Even if they do not do everything, you can get them to strip out the
piece you need, which you can then process with a simple regex.
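
A minimal sketch of that idea, assuming java.net.URI is the class meant. The URL is a made-up stand-in (the one in this thread is elided), and the class and method names are only illustrative:

```java
import java.net.URI;

public class UriParts {
    /** Returns the path of a URL with any % escapes decoded. */
    static String decodedPath(String url) {
        // URI.create throws IllegalArgumentException on a malformed URL.
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only to show which piece each getter returns.
        String url = "http://example.com/jamm/page/products/Bags-%26-Totes/Backpacks.html";
        URI uri = URI.create(url);
        System.out.println(uri.getScheme());   // scheme, here "http"
        System.out.println(uri.getHost());     // host, here "example.com"
        System.out.println(uri.getRawPath());  // path with % escapes left as written
        System.out.println(decodedPath(url));  // path with %26 decoded to &
    }
}
```

Once the path is isolated like this, the regex only has to deal with the path elements, not the scheme or host.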
--
Roedy Green Canadian Mind Products
http://mindprod.com

"You can have quality software, or you can have pointer arithmetic; but you cannot have both at the same time."
~ Bertrand Meyer (born: 1950 age: 59) 1989, creator of design by contract and the Eiffel language.
 
markspace
      08-10-2009
Wojtek wrote:
> Neil wrote :
>> I am having trouble figuring out how to write a regular expression to
>> parse out parts of a URL.

>
> Not to dis regex, but...
>
> I read this thread and think that I could have written a custom parser
> in less time, and probably with better performance.
>



Seriously? It took me about two minutes of fiddling with the regex
before I felt I had the answer, and some of that included just messing
around to make absolutely sure I was doing what I thought I was doing.

If you can write a custom parser in two minutes, I'd like to see it.

Also, the regex will be more flexible when requirements do inevitably
change.
 
Roedy Green
      08-10-2009
On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>([^/]+/[^/]+)


This sort of thing might be easier to process by extracting a chunk of
the big string, and doing a regex split.

http://mindprod.com/jgloss/regex.html#SPLIT
 
Roedy Green
      08-10-2009
On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>http://jammconsulting.com/jamm/page/...Backpacks.html


Complicated regexes are such a bitch to debug. We need a tool that
shows you just how far it got.
 
Wojtek
      08-10-2009
markspace wrote :
> Wojtek wrote:
>> Neil wrote :
>>> I am having trouble figuring out how to write a regular expression to
>>> parse out parts of a URL.

>>
>> Not to dis regex, but...
>>
>> I read this thread and think that I could have written a custom parser in
>> less time, and probably with better performance.
>>

>
>
> Seriously? It took me about two minutes of fiddling with the regex before I
> felt I had the answer, and some of that included just messing around to make
> absolutely sure I was doing what I thought I was doing.
>
> If you can write a custom parser in two minutes, I'd like to see it.


Well maybe three minutes... or so

For this one, the parse would start at an offset equal to the length of
the base URI "http://jammconsulting.com/jamm/page/products/", then read
through the remainder gathering characters into a StringBuffer. When the
exit point is reached for that "block" (a forward slash), place the
StringBuffer.toString() into an ArrayList and go again. When ".html" is
reached, exit the loop.

Print out the ArrayList. Done.

So :
http://jammconsulting.com/jamm/page/...Backpacks.html

would produce:
Stuff
Bags-%26-Luggage
Bags-%26-Totes
Backpacks
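
The loop described above could be sketched roughly as follows. This is one possible reading of it, not the poster's code. It uses StringBuilder in place of StringBuffer, the class and method names are invented, and the input is a made-up stand-in on example.com because the real URL is elided in this thread:

```java
import java.util.ArrayList;
import java.util.List;

public class HandParser {
    /**
     * Collects the path elements that follow base, stopping at ".html".
     * Returns an empty list if the URL does not start with base.
     */
    static List<String> segments(String url, String base) {
        List<String> parts = new ArrayList<>();
        if (!url.startsWith(base)) {
            return parts;
        }
        StringBuilder current = new StringBuilder();
        for (int i = base.length(); i < url.length(); i++) {
            if (url.startsWith(".html", i)) {
                break;                        // exit point for the whole parse
            }
            char c = url.charAt(i);
            if (c == '/') {                   // exit point for one block
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());    // last block before ".html"
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical base and URL, assembled only for illustration.
        String base = "http://example.com/jamm/page/products/";
        String url = base + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        for (String part : segments(url, base)) {
            System.out.println(part);
        }
    }
}
```

Note that this sketch leaves the % escapes untouched and would stop early if ".html" appeared in the middle of the path.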


> Also, the regex will be more flexible when requirements do inevitably change.


I write a lot of parsers and find them easier than regex, but then I do
not pretend to be a regex master, so creating a regex is almost like a
black art to me. I read through the docs, use a dynamic tester, and
cross my fingers. Both hands...

--
Wojtek


 
 
Roedy Green
      08-10-2009
On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
<(E-Mail Removed)> wrote, quoted or indirectly quoted someone
who said :

>http://jammconsulting.com/jamm/page/...Backpacks.html



try:


"http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+)/([^/]+)/([^.]+)\\.html"


or much easier:

String [] chunks = Pattern.compile( "/" ).split( s );
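
A small self-contained sketch of what the split returns, assuming s holds only the path part of the URL. The path and the class name are made up for illustration:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    // Compiled once so it can be reused for every input.
    private static final Pattern SLASH = Pattern.compile("/");

    static String[] chunks(String s) {
        return SLASH.split(s);
    }

    public static void main(String[] args) {
        // Hypothetical path, not the elided one from the thread.
        String path = "/jamm/page/products/Stuff/Backpacks.html";
        // The first chunk is an empty string because the path starts with "/",
        // and the last chunk still carries the ".html" suffix.
        System.out.println(Arrays.toString(chunks(path)));
    }
}
```

The leading empty chunk and the trailing ".html" both need handling before the chunks are used.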
 
Stefan Ram
      08-10-2009
Neil <(E-Mail Removed)> writes:
>I am having trouble figuring out how to write a regular expression to
>parse out parts of a URL.


http://web.archive.org/web/200707050...erl/url3.regex

 
 
Tom Anderson
      08-10-2009
On Mon, 10 Aug 2009, Roedy Green wrote:

> On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
> <(E-Mail Removed)> wrote, quoted or indirectly quoted someone
> who said :
>
>> http://jammconsulting.com/jamm/page/...Backpacks.html

>
> Complicated regexes are such a bitch to debug. We need a tool that
> shows you just how far it got.


There's a good regexp plugin for Eclipse (doubtless there are others
besides this one):

http://brosinski.com/regex/

It doesn't quite do what you say, but it does live updating of a match
display as you edit the pattern, which goes a long way towards letting you
play with regexps interactively.

tom

--
I do not fear death. I had been dead for billions and billions of years
before I was born. -- Mark Twain
 
 
Tom Anderson
      08-10-2009
On Mon, 10 Aug 2009, markspace wrote:

> Neil wrote:
>
>> I wrote this regular expression:
>> ^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?
>>
>> It seems to be working fine for most urls, but it barfed on this one:
>> http://jammconsulting.com/jamm/page/...Backpacks.html
>>
>> The matcher gives me 1 group with this value: s/Backpacks
>>
>> I don't understand how that could have happened. I was expecting to
>> get two groups:
>> Stuff/Bags-%26-Luggage
>> Bags-%26-Totes/Backpacks
>>
>> Any ideas what went wrong?


You have two problems.

Firstly, the repeated group as written has no way to admit slashes
*between* pairs of path elements. Expand the repetition by hand (three
times, here):

[^/]+/[^/]+[^/]+/[^/]+[^/]+/[^/]+

You get the slash between elements in a pair, but not between pairs. This
explains your results. You need something that expands to:

[^/]+/[^/]+/[^/]+/[^/]+/[^/]+/[^/]+

Like:

^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?
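
To see the difference in action, here is a hedged sketch that runs both patterns. They are adapted to match only the path, and the path is a stand-in assembled from the element names mentioned in this thread, since the real URL is elided. The class and method names are invented:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlashBetweenPairs {
    /** Returns group 1 of the first match of regex in input, or null if nothing matches. */
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical path built for illustration only.
        String path = "/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        // Original shape: no slash allowed between repetitions of the pair group.
        String original = "^/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?";
        // Corrected shape: each repetition starts with its own slash.
        String fixed = "^/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?";
        System.out.println(group1(original, path)); // s/Backpacks
        System.out.println(group1(fixed, path));    // /Bags-%26-Totes/Backpacks
    }
}
```

With the original shape the engine has to backtrack and split an element in two to get past each missing slash, which is where the stray "s" comes from.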

You can get the individual elements with smaller capturing groups (here
making the pair-level group non-capturing):

^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?

Secondly, you get one matching group per occurrence of a capturing group
in the *pattern*, not per occurrence of the subpattern in the match. That
is, if the above pair group matches five times, you'll still only get a
single pair of captured groups (the last ones). That, i think, means
there's no way to use a regular expression to do what you want to do here.
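
A short sketch of that behaviour, using the pair pattern adapted to a path and a made-up input with placeholder elements a to d. The class and method names are only for illustration:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureOnly {
    /** Returns every capturing group of the first match, or an empty array if nothing matches. */
    static String[] captures(String input) {
        Pattern p = Pattern.compile("^/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?");
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        // groupCount() counts groups in the pattern, not repetitions in the match.
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // The pair group repeats twice here (a/b, then c/d).
        System.out.println(Arrays.toString(captures("/jamm/page/products/a/b/c/d.html"))); // [c, d]
    }
}
```

The first pair (a, b) is matched but overwritten, so only the final repetition is visible through group(1) and group(2).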

At least, not directly. What you can do is make a regexp which matches a
single occurrence of a pair of elements, and then use the Matcher's find()
method to loop over all occurrences in the string. Like so:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Split {
    public static void main(String... args) throws URISyntaxException {
        // Matches the whole path; group 1 is everything after the first two elements.
        Pattern whole = Pattern.compile("^/jamm/[^/]+/[^/]+(.*?)\\.html?$");
        // Matches a single pair of path elements.
        Pattern pair = Pattern.compile("([^/]+)/([^/]+)");
        for (String s: args) {
            URI uri = new URI(s);
            String path = uri.getPath();
            Matcher wholeMatch = whole.matcher(path);
            if (wholeMatch.matches()) {
                Matcher pairMatch = pair.matcher(wholeMatch.group(1));
                // find() advances to the next pair on each call.
                while (pairMatch.find()) {
                    String first = pairMatch.group(1);
                    String second = pairMatch.group(2);
                    System.out.println(Integer.toString(pairMatch.start()) + "\t" + first + "\t" + second);
                }
            }
        }
    }
}

Note that rather than matching against the raw URL string, i'm going via
java.net.URI; this saves me having to match the other bits of the URL
explicitly, and also takes care of resolving % escapes.

> I don't understand what the * was at the end of your regex: "*\.html" ?


It's a quantifier on the preceding group - the one which captures the
paired path components like 'Stuff/Bags-%26-Luggage'. It means that there
can be any number of such pairs.

tom

 
Tom Anderson
      08-10-2009
On Mon, 10 Aug 2009, Roedy Green wrote:

> On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
> <(E-Mail Removed)> wrote, quoted or indirectly quoted someone
> who said :
>
>> http://jammconsulting.com/jamm/page/...Backpacks.html

>
> or much easier:
>
> String [] chunks = Pattern.compile( "/" ).split( s );


This is absolutely the right thing to do (yes, i know i've just posted a
completely different solution - split() is better), and i'm shocked that
nobody else has suggested it yet.

Writing a loop to iterate over the elements of the chunks array in pairs
is a pain, but a very minor one.
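
For what it's worth, that loop might look something like the sketch below. It assumes the input is the path only, that the ".html" suffix should be dropped, and that the caller knows how many leading chunks to skip. The path, the class name, and the firstIndex parameter are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ChunkPairs {
    private static final Pattern SLASH = Pattern.compile("/");

    /**
     * Splits path on "/" and joins the chunks from firstIndex onward in pairs.
     * A trailing chunk without a partner is ignored.
     */
    static List<String> pairs(String path, int firstIndex) {
        // Drop a trailing ".htm" or ".html" so it does not end up in the last chunk.
        String trimmed = path.replaceFirst("\\.html?$", "");
        String[] chunks = SLASH.split(trimmed);
        List<String> result = new ArrayList<>();
        for (int i = firstIndex; i + 1 < chunks.length; i += 2) {
            result.add(chunks[i] + "/" + chunks[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical path; chunks 0 to 3 are "", "jamm", "page" and "products".
        String path = "/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        for (String pair : pairs(path, 4)) {
            System.out.println(pair);
        }
    }
}
```

Starting at index 4 skips the empty leading chunk plus the three fixed elements, which is the part most likely to need adjusting for other URL layouts.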

tom
