Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > Fastest way to remove common substrings from a list of strings

Reply
Thread Tools

Fastest way to remove common substrings from a list of strings

 
 
freesoft12@gmail.com
Guest
Posts: n/a
 
      07-15-2008
Hi,

Given a list containing various UNIX directory paths:

@list = qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp )

I want to trim this list to contain:

@list = qw(
/usr/lib/
/home/bart/foo/xyz/live.cpp )

Is there a better way than this possibly n^2 algorithm below.

my @to_be_removed_indices = ();
for (my $index=0; $index < #list; ++$index) {
my $current_item = $list[$index];
++$index;
for (my $j = $index; $j < $#list; ++$j) {
if ($current_item =~ m/$list[$j]/) { # is the current item a
substring of another
push (@to_be_removed_indices, $index); # schedule its removal
last; # break
}
}
# Remove matched items

Regards
John
 
Reply With Quote
 
 
 
 
Mirco Wahab
Guest
Posts: n/a
 
      07-15-2008
wrote:
> Hi,
>
> Given a list containing various UNIX directory paths:
>
> @list = qw(
> /home/bart
> /home/bart/foo
> /usr/lib/
> /home/bart/foo/xyz
> /home/bart/foo/xyz/live.cpp )
>
> I want to trim this list to contain:
>
> @list = qw(
> /usr/lib/
> /home/bart/foo/xyz/live.cpp )
>
> Is there a better way than this possibly n^2 algorithm below.


Still pseudo O(2), but much shorter:

...
foreach my $k (@list) {
$k = undef if grep defined && /^$k./, @list;
}

print join "\n", grep defined, @list;
...

Regards

M.

 
Reply With Quote
 
 
 
 
Dave B
Guest
Posts: n/a
 
      07-15-2008
wrote:
> Hi,
>
> Given a list containing various UNIX directory paths:
>
> @list = qw(
> /home/bart
> /home/bart/foo
> /usr/lib/
> /home/bart/foo/xyz
> /home/bart/foo/xyz/live.cpp )
>
> I want to trim this list to contain:
>
> @list = qw(
> /usr/lib/
> /home/bart/foo/xyz/live.cpp )
>
> Is there a better way than this possibly n^2 algorithm below.
>
> my @to_be_removed_indices = ();
> for (my $index=0; $index < #list; ++$index) {
> my $current_item = $list[$index];
> ++$index;
> for (my $j = $index; $j < $#list; ++$j) {
> if ($current_item =~ m/$list[$j]/) { # is the current item a
> substring of another
> push (@to_be_removed_indices, $index); # schedule its removal
> last; # break
> }
> }
> # Remove matched items


If your condition is "remove a string if it's a substring of another,
matching *at the beginning*", then the following code seems to work:

my @list = sort qw(
/home/bart
/home/bart/foo
/usr/lib/
/home/bart/foo/xyz
/home/bart/foo/xyz/live.cpp );

my $s=$list[0];
my @newlist;

for (@list, substr($list[-1],1)) {
push(@newlist,$s) unless /$s/;
$s=$_;
}

print "@newlist\n";

# now @newlist has the values that are not
# substring of anything else

I hope I understood the problem correctly.

--
D.
 
Reply With Quote
 
Dave B
Guest
Posts: n/a
 
      07-15-2008
Dave B wrote:

> push(@newlist,$s) unless /$s/;


This is probably better and safer:

push(@newlist,$s) unless /^\Q$s\E/;

--
D.
 
Reply With Quote
 
Ted Zlatanov
Guest
Posts: n/a
 
      07-15-2008
On Tue, 15 Jul 2008 06:02:38 -0700 (PDT) "" <> wrote:

fc> Given a list containing various UNIX directory paths:

fc> @list = qw(
fc> /home/bart
fc> /home/bart/foo
fc> /usr/lib/
fc> /home/bart/foo/xyz
fc> /home/bart/foo/xyz/live.cpp )

fc> I want to trim this list to contain:

fc> @list = qw(
fc> /usr/lib/
fc> /home/bart/foo/xyz/live.cpp )

fc> Is there a better way than this possibly n^2 algorithm below.

fc> my @to_be_removed_indices = ();
fc> for (my $index=0; $index < #list; ++$index) {
fc> my $current_item = $list[$index];
fc> ++$index;
fc> for (my $j = $index; $j < $#list; ++$j) {
fc> if ($current_item =~ m/$list[$j]/) { # is the current item a
fc> substring of another
fc> push (@to_be_removed_indices, $index); # schedule its removal
fc> last; # break
fc> }
fc> }
fc> # Remove matched items

If @to_be_removed_indices has less than 5 elements, don't worry about
it, your complexity is too low to matter. Even 10 elements won't make a
big difference. Complexity measurements matter when you get into
thousands and millions of items ($disclaimers_go_here).

Otherwise, you can do many optimizations. First one I see is that you
should use index() instead of a match if you want to check for
substrings. Another is to find the ASCII characters not in
@to_be_removed_indices, let's say they are qw/q w p/, and then exclude
any string that doesn't have those characters. You can also do a trie
of your search strings, or build an regex alternation of them and hope
the regex engine can optimize it for you. The general idea is to
pre-process as much as possible, and let Perl use index() instead of
regular expressions when possible.

Ted
 
Reply With Quote
 
Mirco Wahab
Guest
Posts: n/a
 
      07-15-2008
Dave B wrote:
> Dave B wrote:
>
>> push(@newlist,$s) unless /$s/;

>
> This is probably better and safer:
>
> push(@newlist,$s) unless /^\Q$s\E/;


You won't need all that if you start from an already *sorted* list,
a single map will do:

my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

Regards

M.


 
Reply With Quote
 
Dave B
Guest
Posts: n/a
 
      07-15-2008
Mirco Wahab wrote:

>>> push(@newlist,$s) unless /$s/;

>> This is probably better and safer:
>>
>> push(@newlist,$s) unless /^\Q$s\E/;

>
> You won't need all that if you start from an already *sorted* list,
> a single map will do:
>
> my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;


I agree that using map is more Perl-style, but the above statement alone
doesn't work. You need at least this:

my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
push @result, $list[-1];

--
D.
 
Reply With Quote
 
xhoster@gmail.com
Guest
Posts: n/a
 
      07-15-2008
"" <> wrote:
> Hi,
>
> Given a list containing various UNIX directory paths:
>
> @list = qw(
> /home/bart
> /home/bart/foo
> /usr/lib/
> /home/bart/foo/xyz
> /home/bart/foo/xyz/live.cpp )
>
> I want to trim this list to contain:
>
> @list = qw(
> /usr/lib/
> /home/bart/foo/xyz/live.cpp )



That isn't a complete enough specification. What if there is a
/foo/home/bart? How about a /home/bartzilla?


>
> Is there a better way than this possibly n^2 algorithm below.


Almost anything would be better than the below. It doesn't compile.
And once #list is turned to $#list, it does compile but leaves
@to_be_removed_indices empty, so it doesn't what you say you want done.
If you are happy with a wrong answer, then there are much better ways than
n^2 to achieve that.

>
> my @to_be_removed_indices = ();
> for (my $index=0; $index < #list; ++$index) {
> my $current_item = $list[$index];
> ++$index;


You are incrementing ++$index twice, so only even numbered indexes are
being "processed".

etc.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
Reply With Quote
 
Mirco Wahab
Guest
Posts: n/a
 
      07-15-2008
Dave B wrote:
> Mirco Wahab wrote:
>> my @result = map { $list[$_]=~/^\Q$list[$_-1]\E./ ? () : $list[$_] } 1..$#list;

>
> I agree that using map is more Perl-style, but the above statement alone
> doesn't work. You need at least this:
>
> my @result=map{$list[$_]=~/^\Q$list[$_-1]\E./ ?():$list[$_-1] } 1..$#list;
> push @result, $list[-1];
>


Yes, you are right. I saw the mistake and corrected it (tried to):

my @result =
(
(map { $list[$_+1] =~ /^\Q$list[$_]\E./ ? () : $list[$_] } 0..$#list-1),
$list[-1]
);

but you have been faster. The last element was missing.

Thanks & Regards

M.
 
Reply With Quote
 
Dave B
Guest
Posts: n/a
 
      07-15-2008
Leon Timmermans wrote:

>> push(@newlist,$s) unless /$s/;

>
> You should really quotemeta $s or else weird things can happen if it
> contains regexp meta characters. Better yet I would write that line as:
>
> push(@newlist, $s) if substr($_, 0, length $s) ne $s.
>
> Since that doesn't waste time looking for a substring if it isn't at the
> start.


I fixed both issues in a subsequent post.

--
D.
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Fastest way to detect a non-ASCII character in a list of strings. Dun Peal Python 2 10-18-2010 02:23 AM
fastest way to remove this strange character Junkone Ruby 2 12-26-2007 11:33 PM
replacing substrings within strings amadain Python 11 02-14-2007 02:31 PM
Char strings, pointers and substrings. Lawrie C Programming 8 04-07-2005 06:38 PM
What the fastest way to copy strings from file to a vector? SIgnOff C++ 7 03-21-2005 07:01 AM



Advertisments