java String split() does not work for delimiter "|" ?

 
 
Arved Sandstrom
 
      08-09-2013
On 08/09/2013 03:46 AM, Kevin McMurtrie wrote:
> In article <(E-Mail Removed)>,
> Lew <(E-Mail Removed)> wrote:
>
>> (E-Mail Removed) wrote:
>>> You can also do like this :
>>> StringTokenizer tokenizer = new StringTokenizer(content, "|");
>>> while(tokenizer.hasMoreTokens()){
>>> _log.info("tokenizer.nextToken() : "+tokenizer.nextToken());
>>> }

>>
>> "StringTokenizer is a legacy class that is retained for compatibility
>> reasons although its use is discouraged in new code. It is recommended
>> that anyone seeking this functionality use the split method of String
>> or the java.util.regex package instead."
>> http://docs.oracle.com/javase/7/docs...Tokenizer.html

>
> Last time I checked, the performance of String.split() sucked. The
> JavaDoc up to 1.6 even says it sucks. Hopefully they've fixed that
> before calling a simple and effective tool like StringTokenizer "legacy."
>
> Now if there was only a way to revert String.substring()'s performance
> in Java 1.7, I might try Oracle's version of Java.
>
>
>> "Variable names should not start with underscore _ or dollar sign $
>> characters,
>> even though both are allowed."
>> http://www.oracle.com/technetwork/ja...conventions-135099.html#367


I had to check that because I didn't remember ever seeing the
Javadoc for String.split say that the performance sucked. Lo and
behold, I don't see that language.

What's the basis for assessing the suckage of Java String.split? Doing
millions of splits? And if the situation calls for industrial text
processing, why use Java anyway? It's not the first language I'd think
of for that purpose, it's cumbersome. And you can't ramp up your RAM?

I don't mind your comments about Java implementation performance; they
are useful to follow up on. I just wonder what kind of Java programs you
write where you find this kind of detail to be that important. Can't say
I've ever in 15+ years seen a Java SE or EE project be significantly
impacted by these considerations.

AHS
--
When a true genius appears, you can know him by this sign:
that all the dunces are in a confederacy against him.
-- Jonathan Swift
 
 
 
 
 
Eric Sosman
 
      08-09-2013
On 8/8/2013 8:06 AM, (E-Mail Removed) wrote:
> On Saturday, 13 October 2007 02:09:06 UTC+5:30, (E-Mail Removed) wrote:


Couldn't you have waited for its sixth birthday?

--
Eric Sosman
(E-Mail Removed)
 
 
 
 
 
Kevin McMurtrie
 
      08-10-2013
In article <i61Nt.55783$(E-Mail Removed)>,
Arved Sandstrom <(E-Mail Removed)> wrote:

> I had to check that because I didn't remember ever seeing the
> Javadoc for String.split say that the performance sucked. Lo and
> behold, I don't see that language.
>
> What's the basis for assessing the suckage of Java String.split? Doing
> millions of splits? And if the situation calls for industrial text
> processing, why use Java anyway? It's not the first language I'd think
> of for that purpose, it's cumbersome. And you can't ramp up your RAM?
>
> I don't mind your comments about Java implementation performance; they
> are useful to follow up on. I just wonder what kind of Java programs you
> write where you find this kind of detail to be that important. Can't say
> I've ever in 15+ years seen a Java SE or EE project be significantly
> impacted by these considerations.
>
> AHS


String.split() delegates to the Pattern class. The Pattern class
mentions that the form used in String is not efficient because it must
compile the regular expression on each use.

Let me test...

Java 1.6.0_51 on an old Mac gives me these relative times:
splitNanos= 5341045000
tokenizerNanos= 1934390000

I hacked in a copy of 1.7.0_40-ea and got:
splitNanos= 3299753000
tokenizerNanos= 1675745000


It's not HUGE, but I don't think you should deprecate a class that's 2
times faster than the replacement. String.split() is great for utility
use but the core code should use pre-compiled patterns or
StringTokenizer.
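
A minimal sketch of the pre-compiled pattern approach, assuming a pipe-delimited record like the one in the thread title. The class and method names (PipeSplitter, fields) are made up for illustration; Pattern.compile, Pattern.quote and Pattern.split are standard java.util.regex calls. The quoting matters because a bare "|" is the regex alternation operator rather than a literal character.

```java
import java.util.regex.Pattern;

public class PipeSplitter {
    // Compiled once and reused. Pattern.quote turns "|" into a literal
    // instead of the regex alternation operator.
    private static final Pattern PIPE = Pattern.compile(Pattern.quote("|"));

    // Splits one record into fields without recompiling the regex.
    // The limit -1 keeps trailing empty fields.
    static String[] fields(String record) {
        return PIPE.split(record, -1);
    }

    public static void main(String[] args) {
        for (String field : fields("alpha|beta||gamma")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Whether this closes the gap to StringTokenizer would have to be measured; the sketch only removes the per-call compile step described above.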

Last time I checked, Oracle was still targeting big business. Asking to
double the datacenter could get a whole Engineering team fired.



import java.util.Random;
import java.util.StringTokenizer;

// Micro-benchmark: String.split() with a regex vs. StringTokenizer.
// There is no JIT warm-up phase, so read the totals as rough relative times.
public class Str
{
    // Delimiters (tab, newline, semicolon, space) mixed in with alphanumerics.
    final char testChars[]=
        "\t\n;0123456789abcdefghijklmnopqrstuvwxyzABCDEFGH IJKLMNOPQRSTUVWXYZ"
        .toCharArray();
    final Random rnd= new Random();

    public static void main(String[] args)
    {
        final Str str= new Str();

        long splitNanos= 0;
        long tokenizerNanos= 0;

        // 100 random lines, each formatted 10000 times by both methods.
        for (int i= 0; i < 100; ++i)
        {
            final String line= str.randomAlphaNumerics();
            String formatBySplit= null, formatByTokenize= null;

            final long startTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatBySplit= str.formatSplit(line);
            final long midTime= System.nanoTime();
            for (int j= 0; j < 10000; ++j)
                formatByTokenize= str.formatTokenized(line);
            final long endTime= System.nanoTime();

            splitNanos+= midTime - startTime;
            tokenizerNanos+= endTime - midTime;

            // Sanity check: both approaches must produce the same output.
            if (!formatBySplit.equals(formatByTokenize))
                throw new RuntimeException("formatBySplit=" + formatBySplit +
                    " formatByTokenize=" + formatByTokenize);
        }

        System.out.println ("splitNanos= " + splitNanos);
        System.out.println ("tokenizerNanos= " + tokenizerNanos);
    }

    // Splits with a regex and joins the non-empty tokens with '\n'.
    private String formatSplit (String input)
    {
        final String toks[]= input.split("[ \t\n;]+");
        final StringBuilder buf= new StringBuilder (input.length());

        for (String tok : toks)
        {
            // A leading delimiter yields an empty first token; skip it.
            if (tok.length() > 0)
            {
                if (buf.length() > 0)
                    buf.append('\n');
                buf.append(tok);
            }
        }
        return buf.toString();
    }

    // Same output as formatSplit, produced with StringTokenizer.
    private String formatTokenized (String input)
    {
        final StringTokenizer tok= new StringTokenizer(input, " \t\n;", false);
        final StringBuilder buf= new StringBuilder (input.length());

        if (tok.hasMoreElements())
            buf.append(tok.nextElement());

        while (tok.hasMoreElements())
            buf.append('\n').append(tok.nextElement());

        return buf.toString();
    }

    // Builds a random string of 0 to 199 characters drawn from testChars.
    private String randomAlphaNumerics ()
    {
        final char buf[]= new char[rnd.nextInt(200)];
        for (int i= 0; i < buf.length; ++i)
            buf[i]= testChars[rnd.nextInt(testChars.length)];
        return new String (buf);
    }
}
 
 
Michael Jung
 
      08-10-2013
Kevin McMurtrie <(E-Mail Removed)> writes:
> String.split() delegates to the Pattern class. The Pattern class
> mentions that the form used in String is not efficient because it must
> compile the regular expression on each use.
> Let me test...
> Java 1.6.0_51 on an old Mac gives me these relative times:
> splitNanos= 5341045000
> tokenizerNanos= 1934390000
> I hacked in a copy of 1.7.0_40-ea and got:
> splitNanos= 3299753000
> tokenizerNanos= 1675745000
> It's not HUGE, but I don't think you should deprecate a class that's 2
> times faster than the replacement. String.split() is great for utility
> use but the core code should use pre-compiled patterns or
> StringTokenizer.
> Last time I checked, Oracle was still targeting big business. Asking to
> double the datacenter could get a whole Engineering team fired.


I can confirm that this does matter in business code. We got a 10%-20%
performance boost by avoiding split for certain use cases that used it a
lot, not just in micro-optimizing tests. The numbers from Kevin are
about what we had (although I personally wouldn't show that many decimal
places that suggest a higher degree of accuracy than is actually
reasonable).

Michael
 
 
Joerg Meier
 
      08-10-2013
On Fri, 09 Aug 2013 23:25:52 -0700, Kevin McMurtrie wrote:

> String.split() delegates to the Pattern class. The Pattern class
> mentions that the form used in String is not efficient because it must
> compile the regular expression on each use.


There is really no way around that with .split(), short of some convoluted
internal caching system where the last x patterns compiled by .split() are
stored for y time. If you call a method with a String as a parameter twice,
how are you going to avoid compiling the String to a Pattern each time, other
than through that?

The .split syntax is convenient, but slow. There is really no sensible way
to speed it up while keeping the convenient method signature. Of course,
simply using Pattern is not terribly hard at all.

With all that being said: StringTokenizer obviously can only handle very
simple splitting due to the lack of regex support, and thus is naturally
faster, but if your splitting is simple enough not to need regex, it might
be simple enough to use indexOf, which is almost an order of magnitude faster
than even StringTokenizer.
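
A minimal sketch of what an indexOf-based split could look like for a single literal delimiter character. The class and method names (IndexOfSplit, split) are hypothetical, and no timing claim is attached to this sketch; it only shows the technique. It uses the Java 7 diamond syntax.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSplit {
    // Splits on a single literal character using indexOf; no regex involved.
    // Unlike StringTokenizer, empty fields (including a trailing one) are kept.
    static List<String> split(String input, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = input.indexOf(delimiter, start)) >= 0) {
            parts.add(input.substring(start, end));
            start = end + 1;
        }
        parts.add(input.substring(start)); // remainder after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a|b||c", '|')); // prints [a, b, , c]
    }
}
```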

Liebe Gruesse,
Joerg

--
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.
 
 
Arved Sandstrom
 
      08-11-2013
On 08/10/2013 07:37 AM, Michael Jung wrote:
> I can confirm that this does matter in business code. We got a 10%-20%
> performance boost by avoiding split for certain use cases that used it a
> lot, not just in micro-optimizing tests. The numbers from Kevin are
> about what we had (although I personally wouldn't show that many decimal
> places that suggest a higher degree of accuracy than is actually
> reasonable).
>
> Michael
>

I don't doubt that use of String.split is not always the optimal
approach. From the sounds of it it's not often the optimal approach. But
I'll bet that the large majority of the time using it is a "good enough"
approach, because very often that extra 10-20 percent speed bump isn't
actually needed.

Funny thing is, I can think of one ESB application of mine right now
that needs to process a high volume of messages, and each message is
composed of 10-20 lines each one of which may have multiple fields
delimited by slashes...and I've been using String.split without
problems. Having said that, this is a 24/7 "don't fail or **** rains
down from the heavens" application, so I might try swapping out
.split(), since it's not complicated logic and I know exactly what the
delimiter is.
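
One thing worth checking before swapping .split() for StringTokenizer on slash-delimited lines is that the two treat empty fields differently. A small sketch, assuming a made-up message line "MSG//42/"; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyFieldCheck {
    // String.split keeps empty fields in the middle but drops trailing ones.
    static List<String> viaSplit(String line) {
        return Arrays.asList(line.split("/"));
    }

    // StringTokenizer skips every empty field, so field positions can shift.
    static List<String> viaTokenizer(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line, "/");
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "MSG//42/";
        System.out.println(viaSplit(line));     // prints [MSG, , 42]
        System.out.println(viaTokenizer(line)); // prints [MSG, 42]
    }
}
```

If fields are addressed by position, the tokenizer version would read "42" as the second field here rather than the third.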

But I wouldn't eschew String.split as a rule. I doubt most apps care.

AHS

--
When a true genius appears, you can know him by this sign:
that all the dunces are in a confederacy against him.
-- Jonathan Swift
 
 
Michael Jung
 
      08-11-2013
Arved Sandstrom <(E-Mail Removed)> writes:
> On 08/10/2013 07:37 AM, Michael Jung wrote:

[...]
>> I can confirm that this does matter in business code. We got a 10%-20%
>> performance boost by avoiding split for certain use cases that used it a
>> lot, not just in micro-optimizing tests. The numbers from Kevin are
>> about what we had (although I personally wouldn't show that many decimal
>> places that suggest a higher degree of accuracy than is actually
>> reasonable).

> I don't doubt that use of String.split is not always the optimal
> approach. From the sounds of it it's not often the optimal
> approach. But I'll bet that the large majority of the time using it is
> a "good enough" approach, because very often that extra 10-20 percent
> speed bump isn't actually needed.

[...]
> But I wouldn't eschew String.split as a rule. I doubt most apps care.


I use split myself often enough. You can read my response as a case of
optimization surprises. The micro benchmark shows around a 200% boost
(3:10) and the overall gain was 15%, even though the code in question made
up far less than 1% of the (user-level) code run through (big "fat"
EE application).

Michael
 
 
Joerg Meier
 
      08-11-2013
On Sun, 11 Aug 2013 11:12:38 +0200, Michael Jung wrote:

> I use split myself often enough. You can read my response as a case for
> optimization surprises. The micro benchmark shows around a 200% boost
> (3:10), the overall gain was 15%, but the code in question as to the
> amount of (user-level) code run through was far less than 1% (big "fat"
> EE application).


Well, odds are, not many applications spend 25% of their CPU time doing
.split(), so I would say that your application speeding up that much is an
extreme edge case. What on Earth do you do that requires millions of
.split() calls per second, and why did you think that would even remotely
be a representative example?
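
A figure in that range falls out of Amdahl's law under two assumptions about the numbers quoted above: that the "200% boost" means the split-heavy code ran 3 times faster, and that the 15% overall gain means total run time dropped by 15%. If a fraction p of the run time becomes s times faster, the time saved is p * (1 - 1/s), so p = 0.15 / (1 - 1/3), which is roughly 0.225. A short sketch of that arithmetic, with hypothetical names:

```java
public class SplitShare {
    // Amdahl's law solved for p: if a fraction p of the run time becomes
    // `speedup` times faster, the total time saved is p * (1 - 1/speedup).
    static double fractionSpentInSplit(double timeSaved, double speedup) {
        return timeSaved / (1.0 - 1.0 / speedup);
    }

    public static void main(String[] args) {
        // Assumed inputs: 15% of total time saved, split code 3x faster.
        System.out.println(fractionSpentInSplit(0.15, 3.0));
    }
}
```

Reading the 15% as a throughput increase instead gives a somewhat lower share, a little under 0.20, so the estimate depends on which reading is meant.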

Liebe Gruesse,
Joerg

--
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.
 
 
Michael Jung
 
      08-11-2013
Joerg Meier <(E-Mail Removed)> writes:
> On Sun, 11 Aug 2013 11:12:38 +0200, Michael Jung wrote:
>> I use split myself often enough. You can read my response as a case for
>> optimization surprises. The micro benchmark shows around a 200% boost
>> (3:10), the overall gain was 15%, but the code in question as to the
>> amount of (user-level) code run through was far less than 1% (big "fat"
>> EE application).

> Well, odds are, not many applications spend 25% of their CPU time doing
> .split(), so I would say that your application speeding up that much is an
> extreme edge case. What on Earth do you do that requires millions of
> .split() calls per second, and why did you think that would even remotely
> be a representative example ?


Odds are that the rest of the application was already highly
optimized. (I already said this was for certain use cases.) Whether this
is representative of something, I don't know, everybody has to judge for
himself what to do with split. But string manipulation is omnipresent in
many applications these days. This was just meant to shed some light.

Michael

 
 
 
 