Keeping the split token in a Java regular expression

 
 
Jim Janney
      03-27-2012
laredotornado <(E-Mail Removed)> writes:

> Hi,
>
> I'm using Java 6. I want to split a Java string on a regular
> expression, but I would like to keep part of the string used to split
> in the results. What I have are Strings like
>
> Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM
>
> What I would like to do is split the expression wherever I have an
> expression matching /(am|pm),?/i . Hopefully I got that right. In
> the above example, I would like the results to be
>
> Fri 7:30 PM
> Sat 2 PM
> Sun 2:30 PM
>
> But with String.split, the split token is not kept within the
> results. How would I write a Java parsing expression to do what I
> want?
>
> Thanks, - Dave


You want to match ,? only when it is preceded by (am|pm). That's what
lookbehind is for:

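public class LookBehind {
  public static void main(String[] args) {

    String data = "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM";
    // (?i) ignores case; (?<=am|pm) is a lookbehind that only allows a
    // match right after "am" or "pm"; ,? then consumes the optional comma.
    String pattern = "(?i)(?<=am|pm),?";

    String[] split = data.split(pattern);
    for (String s : split) {
      // The quotes make any leading or trailing whitespace visible.
      System.out.println("'" + s + "'");
    }
  }
}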

See http://www.regular-expressions.info/lookaround.html for a tutorial.
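
A small follow-up sketch, not part of the original reply: with the pattern above, the second and third entries keep the space that followed the comma. Assuming that space is unwanted, adding \s* after ,? lets the split consume it as well. The class name LookBehindTrim and the helper splitTimes are made up for illustration.

```java
import java.util.Arrays;

public class LookBehindTrim {

    // Hypothetical helper: split after "am"/"pm" (any case), consuming an
    // optional comma and any whitespace that follows it.
    static String[] splitTimes(String data) {
        return data.split("(?i)(?<=am|pm),?\\s*");
    }

    public static void main(String[] args) {
        String data = "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM";
        System.out.println(Arrays.toString(splitTimes(data)));
    }
}
```

Calling trim() on each piece after the original split would work just as well; folding the whitespace into the pattern only saves the extra loop.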

--
Jim Janney

 
laredotornado
      03-27-2012
On Mar 27, 9:15 am, Jim Janney <(E-Mail Removed)> wrote:
> laredotornado <(E-Mail Removed)> writes:
> > Hi,
> >
> > I'm using Java 6. I want to split a Java string on a regular
> > expression, but I would like to keep part of the string used to split
> > in the results. What I have are Strings like
> >
> >     Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM
> >
> > What I would like to do is split the expression wherever I have an
> > expression matching /(am|pm),?/i . Hopefully I got that right. In
> > the above example, I would like the results to be
> >
> >     Fri 7:30 PM
> >     Sat 2 PM
> >     Sun 2:30 PM
> >
> > But with String.split, the split token is not kept within the
> > results. How would I write a Java parsing expression to do what I
> > want?
> >
> > Thanks, - Dave
>
> You want to match ,? only when it is preceded by (am|pm). That's what
> lookbehind is for:
>
> public class LookBehind {
>   public static void main(String[] args) {
>
>     String data = "Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM";
>     String pattern = "(?i)(?<=am|pm),?";
>
>     String[] split = data.split(pattern);
>     for (String s : split) {
>       System.out.println("'" + s + "'");
>     }
>   }
>
> }
>
> See http://www.regular-expressions.info/lookaround.html for a tutorial.
>
> --
> Jim Janney


Jim, That's absolutely brilliant and does exactly what I want in a
short amount of code.

Stefan, thanks for your solution as well. I tried that out first and
it works too. - Dave
 
Jim Janney
      03-27-2012
laredotornado <(E-Mail Removed)> writes:

> On Mar 27, 9:15 am, Jim Janney <(E-Mail Removed)> wrote:
>> laredotornado <(E-Mail Removed)> writes:
>
> Jim, That's absolutely brilliant and does exactly what I want in a
> short amount of code.
>
> Stefan, thanks for your solution as well. I tried that out first and
> it works too. - Dave


It turns out that lookbehind only works with some patterns; the engine
has to be able to determine the length of the match in advance. Not
surprising when you think about it. It's an interesting question and
gave me a reason to learn something new.
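
To make the point concrete, here is a minimal sketch (mine, not from the thread) that asks Pattern.compile whether it accepts a given lookbehind. Alternatives of known length such as am|pm, or a bounded repetition such as \d{1,2}, give the engine a maximum length to work with. An unbounded one such as \d+ was, as far as I know, rejected with a PatternSyntaxException on the Java 6 discussed here, but newer JDKs may behave differently, so the sketch only reports what the running JDK does. LookBehindCheck and compiles are hypothetical names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LookBehindCheck {

    // Returns true if the running JDK accepts the regex, false if it
    // rejects it with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Lookbehind whose alternatives have a known length.
        System.out.println(compiles("(?i)(?<=am|pm),?"));
        // Bounded repetition inside the lookbehind: maximum length is 2.
        System.out.println(compiles("(?<=\\d{1,2}),"));
        // Unbounded repetition: the result depends on the JDK version.
        System.out.println(compiles("(?<=\\d+),"));
    }
}
```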

--
Jim Janney
 
Gene Wirchenko
      03-27-2012
On Mon, 26 Mar 2012 20:40:24 -0700, Knute Johnson
<(E-Mail Removed)> wrote:

>On 3/26/2012 7:07 PM, Lew wrote:
>> Gene Wirchenko wrote:
>>> What about "Sun 9, 11 AM, and 1 PM"?
>>> Or "Sun 9 and 11 AM, and 1 and 3 PM"?
>>>
>>> I think you had better be quite sure of all of the variants. For
>>> that matter, people often omit the comma before "and" which would give
>>> "Sun 9, 11 AM and 1 PM" for my first example. Such people have
>>> probably not seen
>>> http://www.outsidethebeltway.com/oxford-comma-cartoon/
>>> or other such references.
>>
>> The point is that you need a precise, perhaps formal statement of the
>> exact rules to parse the input, and what to do when the input format
>> fails quality checks.
>>
>> Parsing is a Dark Art in programming - not really the hardest of them,
>> but worthy of close attention.
>>
>> It does require a careful, methodical approach.
>
>You've been awfully poetic lately Lew.


I prefer the "new" Lew. He has dropped the antagonism that I
often saw, and it has made his posts much more readable and useful.

Sincerely,

Gene Wirchenko
 
Daniel Pitts
      03-27-2012
On 3/27/12 8:21 AM, Jim Janney wrote:
> laredotornado <(E-Mail Removed)> writes:
>
>> On Mar 27, 9:15 am, Jim Janney <(E-Mail Removed)> wrote:
>>> laredotornado <(E-Mail Removed)> writes:
>>
>> Jim, That's absolutely brilliant and does exactly what I want in a
>> short amount of code.
>>
>> Stefan, thanks for your solution as well. I tried that out first and
>> it works too. - Dave
>
> It turns out that lookbehind only works with some patterns; the engine
> has to be able to determine the length of the match in advance. Not
> surprising when you think about it. It's an interesting question and
> gave me a reason to learn something new.

That's interesting. I've written my own Deterministic FSA to implement a
subset of regex functionality, and arbitrary lookbehind actually would
be an easy feature to add. Easier than zero-width matches (for example
word-boundaries).

Anyway, one thing to point out is that Stefan's is likely to perform
better, and definitely has lower memory overhead for long inputs than
"split".
 
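Stefan's post is not quoted in this part of the thread, so the following is not his code. It is only a sketch of one way to avoid split, assuming each entry ends in AM or PM: walk the input with Matcher.find() and collect each entry as it is matched. FindTimes and findTimes are names invented here, and I have not measured its speed or memory use against split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindTimes {

    // Skip leading whitespace, capture lazily up to the first am/pm,
    // then consume an optional trailing comma.
    private static final Pattern ENTRY =
            Pattern.compile("(?i)\\s*(.*?(?:am|pm)),?");

    static List<String> findTimes(String data) {
        List<String> result = new ArrayList<String>();
        Matcher m = ENTRY.matcher(data);
        while (m.find()) {
            // Group 1 is the entry without the surrounding separator.
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findTimes("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM"));
    }
}
```
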
Lew
      03-27-2012
Gene Wirchenko wrote:
> I prefer the "new" Lew. He has dropped the antagonism that I
> often saw, and it has made his posts much more readable and useful.


I give your preference all the consideration that it is due.

--
Lew
 
Gene Wirchenko
      03-27-2012
On Tue, 27 Mar 2012 11:09:56 -0700 (PDT), Lew <(E-Mail Removed)>
wrote:

>Gene Wirchenko wrote:
>> I prefer the "new" Lew. He has dropped the antagonism that I
>> often saw, and it has made his posts much more readable and useful.
>
>I give your preference all the consideration that it is due.


As manners are a social lubricant and a fairly inexpensive one,
that would be quite a lot. Thank you. If you did not mean that,
consider meaning that. You are quite knowledgeable, and without an
antagonistic curve, your posts are very good indeed. This same
statement applies to many people posting on USENET.

Call my preference the USENET Manners Project if you want.
Disagreeing is one thing; being disagreeable is quite another.
http://xkcd.com/386/
is a good joke but a poor reality.

I look forward to your next politely informative post, Lew. Your
recent one clarifying a sentence of yours was very nice indeed.

Sincerely,

Gene Wirchenko
 
Robert Klemme
      03-27-2012
On 03/27/2012 03:46 AM, Arne Vajhøj wrote:
> On 3/26/2012 4:01 PM, Robert Klemme wrote:
>> On 03/26/2012 09:22 PM, Lew wrote:
>
>>> Based on what you've shown it looks like you could split on the comma
>>> and trim the resulting strings.
>>
>> And one wouldn't even need a regular expression for that.
>> http://docs.oracle.com/javase/6/docs...Tokenizer.html
>
> StringTokenizer is somewhat obsoleted by String split.


I find regular expressions are quite a bit of overhead for splitting at
commas only. (Now we know that the OP has more demanding requirements,
so regexp is probably the tool of choice.)

Hmm... I don't much like those methods in class String that take a
String containing a regular expression, which is then parsed on every
invocation of the method. That might be good for one-off usage, but for
everything else I prefer solutions which at least use a Pattern constant
to avoid parsing overhead per call. Even if it weren't for the runtime
overhead of parsing, I like to have the constant, which can have its own
JavaDoc explaining what's going on, plus I can reuse it and quickly find
all places of usage etc.
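
As an illustration of that preference, a sketch using the lookbehind pattern from earlier in the thread: the regex is compiled once into a documented constant and reused through Pattern.split. The class and member names (ShowTimes, AFTER_MERIDIEM, split) are invented for this example.

```java
import java.util.regex.Pattern;

public class ShowTimes {

    /**
     * Matches the optional comma that directly follows "am" or "pm"
     * (case-insensitive). Compiled once and shared by all callers.
     */
    private static final Pattern AFTER_MERIDIEM =
            Pattern.compile("(?<=am|pm),?", Pattern.CASE_INSENSITIVE);

    static String[] split(String data) {
        // Pattern.split reuses the compiled pattern instead of parsing
        // the regex string on every call.
        return AFTER_MERIDIEM.split(data);
    }

    public static void main(String[] args) {
        for (String s : split("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")) {
            System.out.println("'" + s.trim() + "'");
        }
    }
}
```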

Kind regards

robert

 
Arne Vajhøj
      03-27-2012
On 3/27/2012 5:01 PM, Robert Klemme wrote:
> On 03/27/2012 03:46 AM, Arne Vajhøj wrote:
>> On 3/26/2012 4:01 PM, Robert Klemme wrote:
>>> On 03/26/2012 09:22 PM, Lew wrote:
>
>>>> Based on what you've shown it looks like you could split on the comma
>>>> and trim the resulting strings.
>>>
>>> And one wouldn't even need a regular expression for that.
>>> http://docs.oracle.com/javase/6/docs...Tokenizer.html
>>
>> StringTokenizer is somewhat obsoleted by String split.
>
> I find regular expressions are quite a bit of overhead for splitting at
> commas only. (Now we know that the OP has more demanding requirements so
> regexp is probably the tool of choice.)
>
> Hmm... I don't like those methods in class String that much which use a
> String with a regular expression which is then parsed on every
> invocation of the method. That might be good for one off usage but for
> everything else I prefer solutions which at least use a Pattern constant
> to avoid parsing overhead per call. Even if it wasn't for runtime
> overhead of parsing I like to have the constant which can have it's own
> JavaDoc explaining what's going on plus I can reuse it and quickly find
> all places of usage etc.


Split is the way you do it.

To cut down on overhead a non-regex split should be added.
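
Until something like that exists in the standard library, a plain comma split is easy to hand-roll with indexOf and trim, along the lines Lew suggested. This is only a sketch; CommaSplit and splitOnComma are hypothetical names, and it keeps trailing empty entries, which String.split drops by default.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaSplit {

    // Splits on ',' without any regex and trims each piece.
    static List<String> splitOnComma(String data) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        int comma;
        while ((comma = data.indexOf(',', start)) >= 0) {
            parts.add(data.substring(start, comma).trim());
            start = comma + 1;
        }
        // Whatever follows the last comma is the final piece.
        parts.add(data.substring(start).trim());
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnComma("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM"));
    }
}
```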

Arne

 
Arne Vajhøj
      03-27-2012
On 3/27/2012 12:14 AM, Daniel Pitts wrote:
> On 3/26/12 6:58 PM, Arne Vajhøj wrote:
>> On 3/26/2012 2:54 PM, laredotornado wrote:
>>> I'm using Java 6. I want to split a Java string on a regular
>>> expression, but I would like to keep part of the string used to split
>>> in the results. What I have are Strings like
>>>
>>> Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM
>>>
>>> What I would like to do is split the expression wherever I have an
>>> expression matching /(am|pm),?/i . Hopefully I got that right. In
>>> the above example, I would like the results to be
>>>
>>> Fri 7:30 PM
>>> Sat 2 PM
>>> Sun 2:30 PM
>>>
>>> But with String.split, the split token is not kept within the
>>> results. How would I write a Java parsing expression to do what I
>>> want?
>>
>> A hackish solution:
>>
>> String[] p = s.replaceAll("[AP]M", "$0X$0").split("X[AP]M");
>
> Nice. As far as hackish, using "split" for this purpose at all is
> hackish.


That type of split is the typical way in most modern languages
(though usually in a non regex flavor).
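
For anyone trying the quoted one-liner: it duplicates each AM/PM marker around a sentinel X and then splits on the sentinel plus the second copy, so the first copy stays attached to its entry. As written, the comma and space stay at the front of the later entries. The sketch below also consumes them; it assumes upper-case markers and an input in which "AM"/"PM" appear only as those markers. SentinelSplit and splitTimes are names made up here.

```java
import java.util.Arrays;

public class SentinelSplit {

    static String[] splitTimes(String s) {
        // "PM" becomes "PMXPM"; splitting on "XPM" plus an optional comma
        // and whitespace leaves the first "PM" on the preceding entry.
        return s.replaceAll("[AP]M", "$0X$0").split("X[AP]M,?\\s*");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                splitTimes("Fri 7:30 PM, Sat 2 PM, Sun 2:30 PM")));
    }
}
```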

Arne
 