document ID tracking

 
 
slash
      07-24-2003
Hi,
I am trying to write a script that will allow me to manipulate words
in a certain way and also keep track of the documents those words
came from. In other words, let's say my corpus consisted of
these three documents with the following contents.

DocID 1.TXT
Compose your message

DocID 2.TXT
Use this form to post your message

DocID 3.TXT
Remember that it can be viewed by millions

Now, when I do my processing for all files, I want to be able to see
that "message" is a word that appears in both DocID 1.TXT and DocID
2.TXT

How can I do this in Perl? Is this what an inverted index is, minus the
term frequencies, etc.? I am under pressure and wanted to know whether
I could get this code, or at least the pseudocode, from somewhere else.
I would certainly appreciate any help.

Thanks,
Slash
 
A. Sinan Unur
      07-24-2003
(E-Mail Removed) (slash) wrote in news:30fe9f1e.0307240405.7908ae70@posting.google.com:

> Hi,
> I am trying to write a script that will allow me to manipulate words
> in a certain way and also keep track of the documents from which those
> words came from. In other words, let's say my corpus consisted of
> htese three documents with the following contents.
>
> DocID 1.TXT
> Compose your message
>
> DocID 2.TXT
> Use this form to post your message
>
> DocID 3.TXT
> Remember that it can be viewed by millions
>
> Now, when I do my processing for all files, I want to be able to see
> that "message" is a word that appears in both DocID 1.TXT and DocID
> 2.TXT


I am sure there is a better way to do this, but you can use a hash keyed
on the words. My quick hack is below. (BTW, I do hope this is not
homework).

# cw: Common Word
# Script to list each word with the files it appears in,
# for the files passed on the command line

use diagnostics;
use strict;
use warnings;

die "$0: file1 ... fileN\n" unless scalar @ARGV;

my %word_to_files;

while (<ARGV>) {
    chomp;
    # split ' ' ignores leading whitespace, so no empty "word" is produced
    my @words = split ' ';
    foreach my $word (@words) {
        if (exists $word_to_files{$word}) {
            # Compare names with eq; /$ARGV/ would treat "." as a wildcard
            unless (grep { $_ eq $ARGV } @{$word_to_files{$word}}) {
                push @{$word_to_files{$word}}, ($ARGV);
            }
        } else {
            $word_to_files{$word} = [$ARGV];
        }
    }
}

foreach (sort keys %word_to_files) {
    print "$_: @{$word_to_files{$_}}\n";
}

__END__

C:\develop\perl\misc>cat file?.txt
Compose your message

Use this form to post your message

Remember that it can be viewed by millions

C:\develop\perl\misc>cw.pl file1.txt file2.txt file3.txt
Compose: file1.txt
Remember: file3.txt
Use: file2.txt
be: file3.txt
by: file3.txt
can: file3.txt
form: file2.txt
it: file3.txt
message: file1.txt file2.txt
millions: file3.txt
post: file2.txt
that: file3.txt
this: file2.txt
to: file2.txt
viewed: file3.txt
your: file1.txt file2.txt

--
A. Sinan Unur
(E-Mail Removed)
Remove dashes for address
Spam bait: (E-Mail Removed)
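
To list only the words shared by several documents, which is what the original question asks about "message", a word-to-files hash like the one built above can be filtered by how many files each entry holds. This is a minimal sketch, not part of the posted script: the helper name shared_words and the hard-coded sample data are assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed sample data standing in for the hash the script builds:
# word => list of files containing it.
my %word_to_files = (
    message => [ 'file1.txt', 'file2.txt' ],
    your    => [ 'file1.txt', 'file2.txt' ],
    Compose => [ 'file1.txt' ],
    form    => [ 'file2.txt' ],
);

# Hypothetical helper: return, in sorted order, the words that occur
# in at least $min files.
sub shared_words {
    my ($index, $min) = @_;
    my @shared = sort grep { @{ $index->{$_} } >= $min } keys %$index;
    return @shared;
}

print "$_: @{ $word_to_files{$_} }\n" for shared_words(\%word_to_files, 2);
```

With the sample data this prints the entries for "message" and "your" only, since those are the two words found in both file1.txt and file2.txt.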
 
Steve in NY
      07-24-2003
On 24 Jul 2003 12:53:47 -0700, (E-Mail Removed) (slash) wrote:

>Hi,
>I am trying to write a script that will allow me to manipulate words
>in a certain way and also keep track of the documents from which those
>words came from. In other words, let's say my corpus consisted of
>htese three documents with the following contents.
>
>DocID 1.TXT
>Compose your message
>
>DocID 2.TXT
>Use this form to post your message
>
>DocID 3.TXT
>Remember that it can be viewed by millions
>
>Now, when I do my processing for all files, I want to be able to see
>that "message" is a word that appears in both DocID 1.TXT and DocID
>2.TXT
>
>How can I do this in Perl? Is this what an inverted index is minus the
>term frequencies, etc.? I am under pressure and wanted to know if
>there was any way I could perhaps get this code from somewhere else or
>perhaps the pseudocode.
>I would certainly appreciate any help.
>
>Thanks,
>Slash


The script below doesn't count frequencies; it only checks whether the
word occurs in each file. To count frequencies, I would suggest first
splitting each line into words (on whitespace), and then using a hash with
the word as the key and the number of times it appears as the value.

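#!/usr/bin/perl
use strict;
use warnings;

my $word = "message";

my @files = qw(DocID_1.TXT
               DocID_2.TXT
               DocID_3.TXT);

for my $file (@files) {
    # Lexical filehandle, three-argument open, and an error check
    open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
    while (<$fh>) {
        # \b matches at word boundaries; \Q...\E quotes regex metacharacters
        if (/\b\Q$word\E\b/) {
            print "$file contains the word $word.\n";
            last;    # report each file only once
        }
    }
    close $fh;
}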
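
The frequency count suggested above (split on whitespace, then a hash with the word as key and the count as value) could look roughly like this. It is only a sketch: the helper name word_counts and the sample string are assumptions, and a real script would feed it the lines read from each file.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each whitespace-separated word
# occurs in a string. Returns a reference to a hash of word => count.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split ' ', $text;
    return \%count;
}

# Assumed sample text; "message" is repeated on purpose.
my $counts = word_counts("Use this form to post your message message");
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Keeping one such hash per file, or keying a single hash on both word and file name, gives the per-document frequencies.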






 
Eric J. Roode
      07-25-2003
"A. Sinan Unur" <(E-Mail Removed)> wrote in
news:Xns93C2A80F838A4asu1cornelledu@132.236.56.8:

> if(exists $word_to_files{$word}) {
> unless(grep /$ARGV/, @{$word_to_files{$word}}) {
> push @{$word_to_files{$word}}, ($ARGV);
> }
> } else {
> $word_to_files{$word} = [$ARGV];
> }


Why use an array as the second-level data structure -- why not a hash?

$word_to_files{$word}{$ARGV} = 1;

--
Eric
$_ = reverse sort qw p ekca lre Js reh ts
p, $/.r, map $_.$", qw e p h tona e; print

 
A. Sinan Unur
      07-25-2003
"Eric J. Roode" <(E-Mail Removed)> wrote in
news:Xns93C33D640AD40sdn.comcast@206.127.4.25:

> "A. Sinan Unur" <(E-Mail Removed)> wrote in
> news:Xns93C2A80F838A4asu1cornelledu@132.236.56.8:
>
>> if(exists $word_to_files{$word}) {
>> unless(grep /$ARGV/, @{$word_to_files{$word}}) {
>> push @{$word_to_files{$word}}, ($ARGV);
>> }
>> } else {
>> $word_to_files{$word} = [$ARGV];
>> }

>
> Why use an array as the second-level data structure -- why not a hash?
>
> $word_to_files{$word}{$ARGV} = 1;


Muddled thinking I guess. And I do remember making a mental note of this
when you pointed out the same thing in another thread, but it looks like
I regarded that as just another deadline reminder.

Is this better?

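# cw: Common Word
# Script to list each word with the files it appears in,
# for the files passed on the command line

use diagnostics;
use strict;
use warnings;

die "$0: file1 ... fileN\n" unless scalar @ARGV;

my %word_to_files;

while (<ARGV>) {
    chomp;
    # split ' ' ignores leading whitespace, so no empty "word" is produced
    my @words = split ' ';
    foreach (@words) {
        # The inner hash acts as a set of file names
        $word_to_files{$_}{$ARGV} = 1;
    }
}

foreach (sort keys %word_to_files) {
    # Sort the file names: hash keys come back in no particular order
    print "$_: ", join(" ", sort keys %{$word_to_files{$_}}), "\n";
}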

__END__



--
A. Sinan Unur
(E-Mail Removed)
Remove dashes for address
Spam bait: (E-Mail Removed)
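
On the original question about an inverted index "minus the term frequencies": with a hash of hashes, storing a count instead of a constant 1 keeps the frequencies as well. The sketch below illustrates that idea under stated assumptions. The helper name build_index is made up, and the documents are held in memory as a hash of file name => contents rather than read from disk.

```perl
use strict;
use warnings;

# Hypothetical helper: build word => { file => count } from a hash of
# file name => contents (in-memory stand-ins for the real files).
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $file (sort keys %$docs) {
        $index{$_}{$file}++ for split ' ', $docs->{$file};
    }
    return \%index;
}

# Sample documents taken from the first post in the thread.
my $index = build_index({
    '1.TXT' => 'Compose your message',
    '2.TXT' => 'Use this form to post your message',
    '3.TXT' => 'Remember that it can be viewed by millions',
});

for my $word (sort keys %$index) {
    my $files = $index->{$word};
    print "$word: ",
        join(' ', map { "$_($files->{$_})" } sort keys %$files), "\n";
}
```

For "message" this prints "message: 1.TXT(1) 2.TXT(1)". Dropping the counts and keeping only the keys of the inner hash gives back the plain word-to-documents mapping.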
 
A. Sinan Unur
      07-26-2003
(E-Mail Removed) (slash) wrote in
news:(E-Mail Removed) om:

> Thanks so much for all the helpful responses. (Sinan, this is not a HW
> problem!) I didn't get any helpful responses in another related
> posting so I am adding this as a followup here hoping that it will
> get reviewed.


You'll need to post something that can be run locally (make sure some
sample input data are included).

...

> undef $/;


Are you sure you want to do this here?

> my @words = split /\W+/, <> ;
> my $line_number = 2;
> my $n;
> my $line_num = 2;


What is the difference between $line_number and $line_num and what purpose
do they serve?

> my $n_cols = 5;
> my $col = { align => 'left'}; # no title, left alignment
> my $tb = Text::Table->new( ( $col) x $n_cols);
> my @stack = ( '*' ) x $n_cols;
> foreach $word ( @words ) {
> shift @stack;
> push @stack, $word;
> $tb->add(@stack);
> }


What on earth is going on in here?

> my @lines = $tb->add("$stack[-4]", "$stack[-3]", "$stack[-2]",
> "$stack[-1]", "*");
> my @lines = $tb->add("$stack[-3]", "$stack[-2]", "$stack[-1]","*",
> "*");
> my @lines = $tb->table($line_number, $n);


Why do you keep redeclaring and redefining @lines before you do anything
with it?

> #print @lines;
> my $t1 = $tb->select(2, {is_sep => 2, body => " "}, 1,0,
> {is_sep => 2, body => "\n"},
> 2, {is_sep =>2, body => " "}, 3,4);
> #foreach $textID (@textID) {
> #$t1 = $t1->add($ARGV); }#adds one data line at the end of ngrams not
> a col


I do not understand this comment. Is it supposed to do something else? Did
you read the docs for Text::Table?

add()
adds a data line to the table, returns the table.

> my $input = $t1->table($line_num, $n);
> print $input;

...
> To recap, I don't know if I really need an inverted index. Perhaps an
> array of arrays might help instead of the table module. Where I can
> have @lines and $ARGV. Would that work? In other words, an array
> consisting of the following:
> (First line of ngram, $ARGV)
> (Second line of ngram, $ARGV)
> .
> .
> .
> (Last line of ngram, $ARGV)
> And perhaps I could put this into a table and do the select statements
> over them to display the desired output. Is this possible or am I just
> dreaming?


No, you are just rambling. The way this works is, you post a specific
problem, and people try to help you solve it. We cannot figure out for you
your requirements etc because we do not have the information you have
regarding the overall picture.

So, I do not know why you decided the previous solutions we posted to the
problem of associating each word with the file(s) it came from were
inadequate. Before people can help you, you have to clearly communicate
what problem you are trying to solve.

> Any suggestions on how to achieve this would be very much appreciated.


I do not know what you mean by "this". But, would the following help?

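# fubar.pl

use strict;
use warnings;

use Text::Table;

my $cols  = 5;
my $col   = { 'align' => 'left' };
my $table = Text::Table->new(($col) x $cols);

{
    local $/;    # slurp mode: each read returns one whole file
    # Test with defined() so an empty file does not end the loop early
    while (defined(my $text = <ARGV>)) {
        my @words = grep { length } split /\W+/, $text;
        while (@words) {
            # Up to four words per row; the last column holds the file name
            my @row = splice(@words, 0, $cols - 1);
            if (@row < $cols - 1) {
                push @row, ('') x ($cols - @row - 1);
            }
            push @row, ($ARGV);
            $table->add(@row);
        }
    }
}

print($table->body());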
__END__

C:\Home> cat file1
file1a file1b
file1c
file1d
file1e file1f
file1g

C:\Home>cat file2
file2a
file2b
file2c
file2d
file2e
file2f

C:\Home>fubar.pl file1 file2
file1a file1b file1c file1d file1
file1e file1f file1g file1
file2a file2b file2c file2d file2
file2e file2f file2

--
A. Sinan Unur
(E-Mail Removed)
Remove dashes for address
Spam bait: (E-Mail Removed)
 
slash
      07-27-2003
Thanks for the help again and sorry for the confusion. I really
appreciate the time people are taking out of their busy schedules to
help me out here. I just don't get references well enough to be able
to figure out how I can get what I want.

What I was trying to do (so ineffectively) was to have a 5gram first
for all the words and then the filenames next to them so that I can
select specific columns for display.
I will try to explain this with more detail:

My input is essentially a bunch of text files that I am currently
passing in to the script as: perl -n script.pl ./*.TXT

fox.txt
quick brown fox jumped over lazy dog tripped over resting fox

My badly written program was trying to achieve three things:
1. produce complete 5grams,
2. select specific columns from those 5grams, and
3. track the document each row came from, which I just can't seem to
get working.

First part: 5grams
=================
. . . . quick
. . . quick brown
. . quick brown fox
. quick brown fox jumped
quick brown fox jumped over
brown fox jumped over lazy
fox jumped over lazy dog
jumped over lazy dog tripped
over lazy dog tripped over
lazy dog tripped over resting
dog tripped over resting fox
tripped over resting fox
over resting fox


2nd part: select columns
=====================
quick brown fox
brown quick
brown fox jumped
fox brown quick
fox jumped over
fox resting over
jumped fox brown
jumped over lazy
over jumped fox
over lazy dog
lazy over jumped
lazy dog tripped
dog lazy over
dog tripped over
tripped dog lazy
tripped over resting
over tripped dog
over resting fox
resting tripped over
resting fox

Third part: filenames
===================
quick brown fox fox.txt
brown quick fox.txt
brown fox jumped fox.txt
fox brown quick fox.txt
fox jumped over fox.txt
fox resting over fox.txt
jumped fox brown fox.txt
jumped over lazy fox.txt
over jumped fox fox.txt
over lazy dog fox.txt
lazy over jumped fox.txt
lazy dog tripped fox.txt
dog lazy over fox.txt
dog tripped over fox.txt
tripped dog lazy fox.txt
tripped over resting fox.txt
over tripped dog fox.txt
over resting fox fox.txt
resting tripped over fox.txt
resting fox fox.txt

Any help you could give me to move this forward would be greatly
appreciated.

many thanks,
slash
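
One way to read the three parts described above is: slide a five-word window over the text so that every word sits in the middle column once, then emit that word with its two left neighbours and with its two right neighbours, each row ending in the file name. The following is only a sketch of that reading, using plain arrays instead of Text::Table. The helper name context_rows is hypothetical, and the padding shown as "." in the post is represented by empty strings that are dropped when printing.

```perl
use strict;
use warnings;

# Hypothetical helper: for each word, return two rows. The first holds the
# word and its two left neighbours (nearest first), the second the word and
# its two right neighbours. Each row ends with the file name. Missing
# neighbours at the edges are empty strings.
sub context_rows {
    my ($file, @words) = @_;
    my @padded = ('', '', @words, '', '');    # two pad slots on each side
    my @rows;
    for my $i (2 .. $#padded - 2) {
        my @w = @padded[$i - 2 .. $i + 2];     # 5-gram centred on word $i
        push @rows, [ @w[2, 1, 0], $file ];    # word, then left context
        push @rows, [ @w[2, 3, 4], $file ];    # word, then right context
    }
    return @rows;
}

# Sample text from the post; in a real script the words would come from
# each input file and $ARGV would supply the file name.
my @words = split ' ',
    'quick brown fox jumped over lazy dog tripped over resting fox';

for my $row (context_rows('fox.txt', @words)) {
    print join(' ', grep { length } @$row), "\n";
}
```

Because every row is an array reference with the file name in the last slot, the rows can later be filtered, sorted, or handed to a table module without losing track of the source document.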

 