Vector Space Search Engine

 
 
babydoe@mailinator.com
      10-11-2005
These are notes to myself, and anyone else
having trouble with the article at
'http://www.perl.com/pub/a/2003/02/19/engine.html'.

No mass public feels at ease with electronic privacy.
But marketing is very much at ease invading our
privacy, and marketing has no particular concern with
truth. So we ought to be using privacy to address
lovers, postmen, children and pets. We are not, so far.

Privacy concerns with Google Desktop, which is just not
de rigueur, made me look for a replacement: I found
Perl, and in particular the 'ActiveState' Perl
distribution, 'http://www.activestate.com'.

With Perl installed, you can roll your own search
engine, and unlike Mr Creepy ****in Google's search
engine, this engine does not go online to index
anything. It does exactly what it should do (or what it
should do if it worked, because like everything with
Perl, things almost work, but not quite).

You will need extra Perl modules for your search
engine: 'Lingua-Stem', which you can install from the
'ActiveState' central repository by running the command
'c:\>ppm i Lingua-Stem', and also the 'pdl win32
binaries', named 'PDL-2.4.1-win32-4.zip'. Links to the
binaries are available from 'http://pdl.perl.org/'.
Unzip these files and run the batch file
'install-pdl.bat'.

Download the sample code, 'Listing 1, VectorSpace.pm', from
http://www.perl.com/2003/02/19/examples/VectorSpace.pm
and install it in the directory 'c:/Perl/Lib/Search'. The
'VectorSpace.pm' does not work because of the way Perl
handles record separators. You need to comment out the
subroutine 'load_stop_list' in 'VectorSpace.pm', and
replace it with the following subroutine.

--%<-----%<----first patch for VectorSpace.pm-----%<--
=item load_stop_list

Hacked by me, because, as written, with record separator
$\ = undef, the entire stop list was slurped up into
one key. Now the hash performs as it should, with each
stop_word being a separate record

=cut

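sub load_stop_list {
    # Slurp all of DATA whatever the caller's $/ is set to,
    # without clobbering the global $_.
    my $text = do { local $/; <DATA> };
    my %stop_words;
    # Splitting on whitespace makes each stop word its own key.
    $stop_words{$_}++ for split ' ', $text;
    return \%stop_words;
}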
--%<----%<-----%<----%<-----%<----%<-----%<----%<------

And there is one other error (blatant thank goodness),
which will give the following warning, "Use of
uninitialized value in subroutine entry at
c:/perl/lib//Search/VectorSpace.pm line 175, <DATA>
chunk 1." You need to hack into 'VectorSpace.pm' and
search for the line:

@lookup{@sorted_words} = (1..$#sorted_words );

Funny, but to me, that error sticks out like a
foreskin at a Jewish wedding; of course, replace it
with the line:

@lookup{@sorted_words} = (0..$#sorted_words );
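Why the one-character change matters can be shown with a small stand-alone sketch. The word list here is made up for illustration and is not taken from 'VectorSpace.pm'. A hash slice with N keys but only N-1 values leaves the last key undefined, which is presumably where the "uninitialized value" warning comes from.

```perl
use strict;
use warnings;

# Hypothetical word list standing in for the module's @sorted_words.
my @sorted_words = qw(apple banana cherry);

# Buggy form: three keys but only two values (1, 2),
# so the last key is left with an undefined value.
my %buggy;
@buggy{@sorted_words} = (1 .. $#sorted_words);

# Corrected form: three keys, three zero-based indices (0, 1, 2).
my %lookup;
@lookup{@sorted_words} = (0 .. $#sorted_words);

for my $word (@sorted_words) {
    my $bad = defined $buggy{$word} ? $buggy{$word} : 'undef';
    print "$word: buggy=$bad fixed=$lookup{$word}\n";
}
```

With the zero-based range every word gets an index that matches its position in the sorted list.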

'http://www.perl.com/pub/a/2003/02/19/engine.html,' is
incomplete, in that it never explains how to run it.
(Why are scientists like that? always leaving it to the
candy man, in his bow chicka bow bow purple velvet pimp
suit and hat, to make a practical application.)

For my practical application, I have been given an
orphaned quote, "The taxi moves off slowly, the man
still not having said a word to the driver," and I
want to find which document, in my 'eBooks' directory,
the quote came from. With the 'VectorSpace' Perl module,
now I can.
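For anyone wondering what "closest match" means here, the general idea of a vector-space search can be sketched in plain Perl. This is only an illustration of the technique as commonly described (word-count vectors compared by cosine similarity), not a copy of 'VectorSpace.pm'; the helper names and the two sample documents are invented, and there is no stemming or stop list.

```perl
use strict;
use warnings;

# Turn text into a bag-of-words vector: word => count.
# No stemming or stop list here, unlike the article's module.
sub word_vector {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Cosine of the angle between two sparse vectors (1 = same direction).
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $len_u, $len_v) = (0, 0, 0);
    $dot   += $u->{$_} * ($v->{$_} || 0) for keys %$u;
    $len_u += $_ ** 2 for values %$u;
    $len_v += $_ ** 2 for values %$v;
    return 0 unless $len_u && $len_v;
    return $dot / sqrt($len_u * $len_v);
}

# Made-up documents for illustration only.
my %docs = (
    'taxi.txt' => 'The taxi moves off slowly through the rain',
    'cats.txt' => 'Cats sleep most of the day',
);
my $query = word_vector('the taxi moves off');

# Print each document name with its similarity to the query.
for my $name (sort keys %docs) {
    printf "%s %.2f\n", $name, cosine($query, word_vector($docs{$name}));
}
```

The document sharing the most words with the query scores highest, which is the same ranking idea the search engine applies to whole ebooks.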

I type the sentence into a file 'Quote.txt' and save it
to my desktop.

-----%<----%<-----%<----Quote.txt-----%<----%<----%<--
The taxi moves off slowly, the man still not having said
a word to the driver.
-----%<----%<-----%<----%<-----%<----%<-----%<----%<--

I also save the following script to my desktop;
Arrrrrrrrgh, even to me, who wrote this script, it is a
mess, with Perl magic everywhere. The theory is simple;
I am drilling into my eBooks directory and finding the
closest match to the words in 'Quote.txt.'

-----%<----%<-----%<----searchBooks.pl-----%<----%<---
#!perl
#
use warnings;
use strict;
use File::Glob ':glob';
use Search::VectorSpace;
use File::Temp qw/ tempfile tempdir /;
#
local $/ = undef;
my $homedir = $ENV{'USERPROFILE'}."/My Documents/eBooks";
my @files = <$homedir/*>;
@files = grep -f, @files;
my @docs;
for ( 0 .. $#files ) {
    open my $fh, "$files[$_]"
        or die "cannot open file $files[$_]: $!";
    $docs[$_] = <$fh>;
}
#
my $engine = Search::VectorSpace->new( docs => \@docs,
                                       threshold => 0.04 );
$engine->build_index();
#
my $query = <>;
my %results = $engine->search($query);
my ($fh, $filename) = tempfile(SUFFIX => '.html');
foreach my $result (
    sort { $results{$b} <=> $results{$a} }
    keys %results
)
{
    print "Relevance: ", $results{$result}, "\n";
    print $fh $result, "\n\n"; close $fh;
    exec $filename;
}
-----%<----%<-----%<----%<-----%<----%<-----%<----%<--

From the command line I type

c:\Documents and Settings\Nomen Nescio\Desktop>perl searchBooks.pl Quote.txt

And Ta-Dum, displayed in my Internet browser is the
ebook "The Story of O, by Pauline Reage," an ebook,
incidentally, I accidentally downloaded from the #ebooks
channel on IRC. (Note to self: better put encryption on
that book.)

 
A. Sinan Unur
      10-11-2005
(E-Mail Removed) wrote in
news:1129018494.679490.52800@g44g2000cwa.googlegroups.com:

> With Perl installed, you can roll your own search
> engine, and unlike Mr Creepy ****in Google's search


Upon reading this, I went ahead and added you to my killfile. However, I
had already started commenting on your code, so, here they are:

> The 'VectorSpace.pm' does not work because of the way Perl
> handles record separators. You need to comment out the
> subroutine 'load_stop_list' in 'VectorySpace.pm', and
> replace with the following subroutine.


This is misleading.

> --%<-----%<----first patch for VectorSpace.pl-----%<--
> =item load_stop_list
>
> Hacked by me, because, as written, with record separator
> $\ = undef, the entire stop list was slurped up into
> one key.


The problem stems from you slapping a

local $\;

at the top of your program. (You also set it to undef, indicating you do
not understand how local works).

You should restrict changes from default behavior to the smallest
possible scope.
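
A minimal sketch of that advice, assuming the aim is to read each file whole. The helper name slurp_file and the throwaway test file are made up for illustration, and the sketch localizes the input record separator $/ (the variable that actually controls how much a read returns) inside a do block only.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# slurp_file is a hypothetical helper, not part of the original script.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open file $path: $!";
    # $/ is undef only inside this do block, so <$fh> returns the whole file.
    my $content = do { local $/; <$fh> };
    close $fh;
    return $content;
}

# Demonstration with a throwaway two-line file.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} "first line\nsecond line\n";
close $tmp;

my $text = slurp_file($name);
print length($text), " characters read in one go\n";

# Outside the helper, $/ still has its default value of "\n".
print "separator unchanged\n" if $/ eq "\n";
```

Because the change to $/ ends with the do block, code elsewhere (such as a module reading its own DATA section line by line) is unaffected.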

[ more drivel laced with profanity snipped ]

> -----%<----%<-----%<----searchBooks.pl-----%<----%<---
> #!perl
> #
> use warnings;
> use strict;
> use File::Glob ':glob';
> use Search::VectorSpace;
> use File::Temp qw/ tempfile tempdir /;
> #
> local $/ = undef;


This should not be here.

> my $homedir = $ENV{'USERPROFILE'}."/My Documents/eBooks";
> my @files = <$homedir/*>;
> @files = grep -f, @files;
> my @docs;
> for ( 0 .. $#files ) {
> open my $fh, "$files[$_]"


Useless use of quotes.

> or die "cannot open file $files[$_]: $!";


local $\;

> $docs[$_] = <$fh>;
> }


You should put the local statement in the body of the for loop.

....

> my ($fh, $filename) = tempfile(SUFFIX => '.html');


....

> print "Relevance: ", $results{$result}, "\n";
> print $fh $result, "\n\n"; close $fh;


You are writing plain text to an html file. Newlines won't help you
display it the way you seem to want.
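
One possible way around that, sketched under the assumption that the matched document is plain text: escape the characters HTML treats specially and wrap the result in a pre element so the browser keeps the line breaks. The helper names escape_html and text_to_html are invented for this example and cover only the basic cases.

```perl
use strict;
use warnings;

# escape_html is a hypothetical helper; it covers only the basic characters.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Wrap the text in <pre> so the browser keeps the original line breaks.
sub text_to_html {
    my ($text) = @_;
    return "<html><body><pre>" . escape_html($text) . "</pre></body></html>\n";
}

print text_to_html("Tom & Jerry\n<second line>\n");
```

Writing to a temporary file with a '.txt' suffix instead would avoid the need for any markup at all.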

Bye.

Sinan
--
A. Sinan Unur <(E-Mail Removed)>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
 
Tad McClellan
      10-11-2005
(E-Mail Removed) <(E-Mail Removed)> wrote:

> Mr Creepy ****in Google's search
> engine,



This is a family newsgroup.

Please attempt to develop a richer vocabulary so you won't have
to resort to vulgarity as a placeholder for something meaningful.


> Hacked by me, because, as written, with record separator
> $\ = undef, the entire stop list was slurped up into
> one key.



I seriously doubt that the *output* record separator has
an effect on *input* ...


> And there is one other error (blatant thank goodness),
> which will give the following warning, "Use of
> uninitialized value



That is not an error message. It is a warning message.


> open my $fh, "$files[$_]"



perldoc -q vars


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
A. Sinan Unur
      10-11-2005
"A. Sinan Unur" <(E-Mail Removed)> wrote in
news:Xns96EC5120845A8asu1cornelledu@127.0.0.1:

> local $\;


This should have been:

local $/;

as pointed out by Tad.

Arrrgh!

Sinan

--
A. Sinan Unur <(E-Mail Removed)>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/cl...uidelines.html
 
babydoe@mailinator.com
      10-11-2005
A. Sinan Unur writes:
> I wrote:


> Upon reading this, I went ahead and added you to my
> killfile. However, I had already started commenting
> on your code, so, here they are:


I was expecting no commentary on my post, but thank
you anyway. Though our meeting was brief, I will
always have the images of you peering with
lugubriously feigned interest at the boilerplated
buttocks of my code.

>> -----%<----%<-----%<----searchBooks.pl-----%<----
>> #!perl
>> #
>> use warnings;
>> use strict;
>> use File::Glob ':glob';
>> use Search::VectorSpace;
>> use File::Temp qw/ tempfile tempdir /;
>> #
>> local $/ = undef;

>
>
>This should not be here.
>
>
>> my $homedir = $ENV{'USERPROFILE'} .
>> "/My Documents/eBooks";
>> my @files = <$homedir/*>;
>> @files = grep -f, @files;
>> my @docs;
>> for ( 0 .. $#files ) {
>> open my $fh, "$files[$_]"

>
>
>Useless use of quotes.
>
>
>> or die "cannot open file $files[$_]: $!";

>
>
> local $\;
>
>
>> $docs[$_] = <$fh>;
>> }


Code changed as per your suggestions.

Bye

 