Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Help: what is the quickest way to find out whether a string contains another string?

Reply
Thread Tools

Help: what is the quickest way to find out whether a string contains another string?

 
 
tuweiwen@gmail.com
Guest
Posts: n/a
 
      12-05-2005
Thomas thanks for the tip. I went on to download and run Roedy's Boyer
program and it works great. I used quite large strings of Chinese
characters (target string of size 7KB and pattern string of 50
character-long, to avoid it falling back to String.indexOf() . I am
running a Chinese version of WinXP) to test it, and it works too.
Thanks Roedy.

Chris, the strings are downloaded web pages in memory. What my app (a
spider) does is to find out if a page contains one of several pattern
strings, if it is, then it's saved to disk. I would like to know if my
current approach is an efficient one.

 
Reply With Quote
 
 
 
 
Roedy Green
Guest
Posts: n/a
 
      12-05-2005
On 4 Dec 2005 23:50:06 -0800, http://www.velocityreviews.com/forums/(E-Mail Removed) wrote, quoted or
indirectly quoted someone who said :

>
>Result: the speeds of indexOf() and contains() are about the same,
>but matches() is at least ten times slower.
>Is regular expression bad in terms of performance?


Don't you believe your own experiment?

In general the more specialised a tool is to the task at hand the
faster it will be.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
 
Reply With Quote
 
 
 
 
Roedy Green
Guest
Posts: n/a
 
      12-05-2005
On Mon, 5 Dec 2005 10:27:51 -0000, "Chris Uppal"
<(E-Mail Removed)-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>You've already been pointed at better algorithms for searching strings than a
>naive scan. I just wanted to add a few warnings.

see http://mindprod.com/products1.html#BOYER
for a faster indexOf
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
 
Reply With Quote
 
Chris Smith
Guest
Posts: n/a
 
      12-05-2005
Rhino <(E-Mail Removed)> wrote:
> One approach would be to use "regular expressions", often abbreviated
> "regexp". If you look at the Pattern class in the J2SE API, you'll find an
> overview describing how to use them.


Why would anyone want to use regular expressions to look for a plain
String? Regular expressions are a tool for deciding whether a String or
some subset thereof matches a defined regular language. Although string
searching is a trivial degeneration of the problem addressed by regular
expressions, there's no need to introduce the added complexity.

If you are set on using regular expressions, converting the arbitrary
input string into a pattern that matches only itself is the first task.
Prior to Java 1.5, this was very difficult. In 1.5, it's encapsulated
in a method called Pattern.quote(String). Once that's done, you can
search for that pattern. It's far easier, though, to use String.indexOf
(prior to 1.5) or String.contains (as of 1.5).

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Reply With Quote
 
Chris Smith
Guest
Posts: n/a
 
      12-05-2005
<(E-Mail Removed)> wrote:
> Chris, the strings are downloaded web pages in memory. What my app (a
> spider) does is to find out if a page contains one of several pattern
> strings, if it is, then it's saved to disk. I would like to know if my
> current approach is an efficient one.


(Realizing your addressing your comment to Chris Uppal, but I'll respond
anyway. Hope you don't mind.)

In this case, you can be practically guaranteed that String matching is
not your bottleneck. It will be much faster than retrieving those same
strings over a network connection.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Reply With Quote
 
Rhino
Guest
Posts: n/a
 
      12-05-2005

"Chris Smith" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed).. .
> Rhino <(E-Mail Removed)> wrote:
>> One approach would be to use "regular expressions", often abbreviated
>> "regexp". If you look at the Pattern class in the J2SE API, you'll find
>> an
>> overview describing how to use them.

>
> Why would anyone want to use regular expressions to look for a plain
> String? Regular expressions are a tool for deciding whether a String or
> some subset thereof matches a defined regular language. Although string
> searching is a trivial degeneration of the problem addressed by regular
> expressions, there's no need to introduce the added complexity.
>
> If you are set on using regular expressions, converting the arbitrary
> input string into a pattern that matches only itself is the first task.
> Prior to Java 1.5, this was very difficult. In 1.5, it's encapsulated
> in a method called Pattern.quote(String). Once that's done, you can
> search for that pattern. It's far easier, though, to use String.indexOf
> (prior to 1.5) or String.contains (as of 1.5).
>

I didn't say regular expressions were the _best_ way, just that they were
an option.

The original poster mentioned that he was searching very long strings for
other strings. I don't know if regular contains() or similar methods have
length limitations and I've never looked into whether the more traditional
methods are more or less efficient than techniques based on regular
expressions. Offhand, I suspected that regular expressions might be more
efficient on longer strings but the other replies to this thread indicate
that regular expressions are actually _less_ efficient that contains() or
its brethren.

Therefore, I learned something along the way, too.

Rhino



 
Reply With Quote
 
Alan Moore
Guest
Posts: n/a
 
      12-06-2005
I'd like to add a note about the inefficiency of regexes. The regex
".*is not.*" is inherently slow, and not a fair test of regex speed.
Adding the dot-stars before and after the real search string is a hack
for when you can only use String#matches(). When using the find()
method, you can get rid of them and the search will go much faster.
When I ran the OP's second test program with the regex "is not", it
took 1052ms (compared to 4486ms for ".*is not.*"). That was still
three times as long as the contains() or indexOf() approaches, so I'm
not pushing for regex use here. Just setting the record straight.

 
Reply With Quote
 
Hiran Chaudhuri
Guest
Posts: n/a
 
      12-06-2005

"Thomas Weidenfeller" <(E-Mail Removed)> schrieb im Newsbeitrag
news:dn0u58$ika$(E-Mail Removed)...
> (E-Mail Removed) wrote:
>> What's best way to handle this to achieve best performance?

>
> "best" always depends on the context, the word as such is rather
> meaningless. I assume you mean fast. Then, contrary to other suggestions,
> regexps are not a good choice at all. We have discussed this in the past.


Agree, best is ambiguous. But the message subject line contains 'quickest',
so the question must be runtime.

Hiran


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Quickest way to search and replace in a query string? laredotornado Javascript 3 11-12-2009 09:06 AM
RE: How to find if a string contains another string Borse, Ganesh Python 2 10-30-2007 05:34 AM
How to find if a string contains another string Borse, Ganesh Python 0 10-30-2007 04:01 AM
Quickest way to do this string op. eastcoastcoder@gmail.com Ruby 3 03-13-2006 09:14 PM
Quickest way to find the rank of a <map> Steve Edwards C++ 13 03-01-2006 12:33 AM



Advertisments