ranking texts against a white list

 
 
Mario Protto
      03-04-2005
Hi all,

I have many small texts (200-1000 chars) and a white list (100 words), and I
have to score each text for its relevance to the word list.
Right now I'm using a very simple algorithm:
_______________________
does the text contain at least 1 word from the list?
yes --> rank = 1
no --> rank = 0
_______________________

But I'd like the rank to be a real number between 0 and 1. I have thought of
something like counting how many different words there are in the text and
normalizing to 1, but perhaps there is some other, more intelligent way to do
that. Any suggestions?

thx

Mario
www.mario-online.com
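
The "count the different words and normalize to 1" idea from the post above can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function name `coverage_score`, the sample whitelist, and the `\w+` tokenization (lowercased, no stemming) are all assumptions. The score is the fraction of distinct whitelist words that occur at least once in the text.

```python
import re

def coverage_score(text, whitelist):
    """Fraction of distinct whitelist words found in the text (0.0 to 1.0)."""
    # Assumption: case-insensitive matching on whole word tokens.
    wanted = {w.lower() for w in whitelist}
    if not wanted:
        return 0.0
    tokens = set(re.findall(r"\w+", text.lower()))
    found = wanted & tokens
    return len(found) / len(wanted)

# Hypothetical whitelist and text, for illustration only.
whitelist = ["perl", "regex", "hash", "module"]
print(coverage_score("A Perl hash maps keys to values; Perl has many.", whitelist))
```

With a 100-word list and texts of 200-1000 chars, this score will usually stay well below 1, so it may be more useful for ordering texts than as an absolute value.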


 
Arndt Jonasson
      03-04-2005

"Mario Protto" <(E-Mail Removed)> writes:
>
> I have many small texts (200-1000 chars) and a white list (100 words), and I
> have to score each text for its relevance to the word list.
> Right now I'm using a very simple algorithm:
> _______________________
> does the text contain at least 1 word from the list?
> yes --> rank = 1
> no --> rank = 0
> _______________________
>
> But I'd like the rank to be a real number between 0 and 1. I have thought of
> something like counting how many different words there are in the text and
> normalizing to 1, but perhaps there is some other, more intelligent way to do
> that. Any suggestions?


This question doesn't have anything to do with Perl, until there is
a particular implementation problem you want help with, so this is
not the proper news group for it.

If you don't know what the meaning of the relevancy number is, how
can anyone else? It's easy to start speculating, but before even doing
that I would want to know how the number is to be used.

If you search with google using some of the words "rank text white list",
you may find more information. Another source of ideas is documentation
(and source) of existing text search and ranking tools. 'Glimpse' comes
to mind, but there are probably many.

There's probably a proper news group dealing with such questions, but
I don't know what it might be called.
 
Mario Protto
      03-04-2005
>> I have many small texts (200-1000 chars) and a white list (100 words),
>> and I have to score each text for its relevance to the word list.
>> Right now I'm using a very simple algorithm:
>> _______________________
>> does the text contain at least 1 word from the list?
>> yes --> rank = 1
>> no --> rank = 0
>> _______________________
>>
>> But I'd like the rank to be a real number between 0 and 1. I have
>> thought of something like counting how many different words there are
>> in the text and normalizing to 1, but perhaps there is some other,
>> more intelligent way to do that. Any suggestions?

>
> This question doesn't have anything to do with Perl, until there is
> a particular implementation problem you want help with, so this is
> not the proper news group for it.


Ehm... sorry, I forgot to mention that this function is embedded in a Perl
project that fetches text in various ways, stores it in a PostgreSQL db and,
via a PHP front-end, lets human operators filter and view the contents.

> If you don't know what the meaning of the relevancy number is, how
> can anyone else? It's easy to start speculating, but before even doing
> that I would want to know how the number is to be used.


Well, the relevancy number could be something like "how much this document
talks about my terms". I know it is an almost theoretical question, but it
seems to me a common need for Perl programmers managing text... isn't it?

> If you search with google using some of the words "rank text white list",
> you may find more information. Another source of ideas is documentation
> (and source) of existing text search and ranking tools. 'Glimpse' comes
> to mind, but there are probably many.


Of course I did some CPAN and Google searching before posting. Also (for
anyone interested), in the Italian Perl newsgroup Stefano Rodighiero
suggested a very interesting article:
* "Building a Vector Space Search Engine in Perl"
http://www.perl.com/pub/a/2003/02/19/engine.html

> There's probably a proper news group dealing with such questions, but
> I don't know what it might be called.


Me neither...

Mario
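
One possible reading of "how much this document talks about my terms" is the share of the text's words that belong to the white list. The sketch below is a Python illustration under that assumption, not anything proposed in the thread; the name `density_score`, the sample data, and the simple `\w+` tokenization are hypothetical choices.

```python
import re

def density_score(text, whitelist):
    """Share of the text's word tokens that are whitelist words (0.0 to 1.0)."""
    wanted = {w.lower() for w in whitelist}
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    # Repeated occurrences count each time, unlike a distinct-word count.
    hits = sum(1 for t in tokens if t in wanted)
    return hits / len(tokens)

# Hypothetical data: 4 of the 5 tokens are whitelist words.
print(density_score("Perl hash and Perl regex", ["perl", "regex", "hash"]))  # 0.8
```

Unlike counting distinct list words, this rewards repetition and penalizes long texts that mention the terms only in passing, which may or may not be what the human operators want.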


 
Mark Clements
      03-04-2005
Mario Protto wrote:
> Hi all,
>
> I have many small texts (200-1000 chars) and a white list (100 words), and I
> have to score each text for its relevance to the word list.
> Right now I'm using a very simple algorithm:
> _______________________
> does the text contain at least 1 word from the list?
> yes --> rank = 1
> no --> rank = 0
> _______________________
>
> But I'd like the rank to be a real number between 0 and 1. I have thought of
> something like counting how many different words there are in the text and
> normalizing to 1, but perhaps there is some other, more intelligent way to do
> that. Any suggestions?
>

Hi

check out

http://www.perl.com/pub/a/2003/02/19/engine.html

It's an article on building vector-space searches. It may be what you are after.

Mark
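
For readers unfamiliar with the vector-space idea, here is a minimal sketch in Python. It is not taken from the linked article; it only shows the core step, assuming the text is turned into a term-frequency vector, the white list is treated as a query vector with weight 1 per word, and the rank is the cosine of the angle between them. The function name `cosine_score` and the tokenization are hypothetical. Because all counts are non-negative, the result falls between 0 and 1.

```python
import math
import re
from collections import Counter

def cosine_score(text, whitelist):
    """Cosine similarity between the text's term counts and the whitelist."""
    text_vec = Counter(re.findall(r"\w+", text.lower()))
    query_vec = Counter(w.lower() for w in whitelist)
    # Dot product only needs the whitelist terms; missing terms count as 0.
    dot = sum(text_vec[w] * weight for w, weight in query_vec.items())
    norm_text = math.sqrt(sum(c * c for c in text_vec.values()))
    norm_query = math.sqrt(sum(c * c for c in query_vec.values()))
    if norm_text == 0 or norm_query == 0:
        return 0.0
    return dot / (norm_text * norm_query)

# Hypothetical data, for illustration only.
print(round(cosine_score("perl", ["perl", "hash"]), 4))  # 0.7071
```

A text reaches 1 only if it contains every whitelist word in equal proportion and nothing else, so with a 100-word list real scores will be small; they are mainly meaningful relative to each other.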
 
A. Sinan Unur
      03-04-2005
"Mario Protto" <(E-Mail Removed)> wrote in
news:d09m58$e3c$(E-Mail Removed):

>>> I have many small texts (200-1000 chars) and a white list (100
>>> words), and I have to score each text for its relevance

....

>>> But I'd like the rank to be a real number between 0 and 1. I have
>>> thought of something like counting how many different words there
>>> are in the text and

....

>> This question doesn't have anything to do with Perl, until there is
>> a particular implementation problem you want help with, so this is
>> not the proper news group for it.

>
> Ehm... sorry, I forgot to mention that this function is embedded in a
> Perl project that fetches text in various ways,


Still irrelevant.

To get a better idea of what types of topics are relevant here, you should
read the posting guidelines for this group. They are posted here regularly
or you can Google for them on the web.

Sinan
 
David K. Wall
      03-04-2005
A. Sinan Unur <(E-Mail Removed)> wrote:

> "Mario Protto" <(E-Mail Removed)> wrote in
> news:d09m58$e3c$(E-Mail Removed):
>
>>>> I have many small texts (200-1000 chars) and a white list
>>>> (100 words), and I have to score each text for its relevance

> ...
>
>>>> But I'd like the rank to be a real number between 0 and 1. I
>>>> have thought of something like counting how many different
>>>> words there are in the text and

> ...
>
>>> This question doesn't have anything to do with Perl, until there
>>> is a particular implementation problem you want help with, so
>>> this is not the proper news group for it.

>>
>> Ehm... sorry, I forgot to mention that this function is embedded
>> in a Perl project that fetches text in various ways,

>
> Still irrelevant.


Maybe comp.programming? It seems like it might be a better place to
discuss an algorithm without caring about what language it's
implemented in.


> To get a better idea of what types of topics are relevant here,
> you should read the posting guidelines for this group. They are
> posted here regularly or you can Google for them on the web.


I bet Google hates the use of their trademarked name as a generic
verb....

--
David Wall
 
Tad McClellan
      03-04-2005
David K. Wall <(E-Mail Removed)> wrote:
> A. Sinan Unur <(E-Mail Removed)> wrote:


>> you can Google for them on the web.

>
> I bet Google hates the use of their trademarked name as a generic
> verb....



I hope the smiley means you mean just the opposite...?

I would think they _love_ it.


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
Chris Mattern
      03-04-2005
Tad McClellan wrote:

> David K. Wall <(E-Mail Removed)> wrote:
>> A. Sinan Unur <(E-Mail Removed)> wrote:

>
>>> you can Google for them on the web.

>>
>> I bet Google hates the use of their trademarked name as a generic
>> verb....

>
>
> I hope the smiley means you mean just the opposite...?
>
> I would think they _love_ it.
>

Er, no. Because that's how you lose trademarks. Ask Bayer,
for whom aspirin used to be a trademark. Also escalator,
linoleum, zipper and yo-yo, all of which used to be brand
names, and were lost to their owners because they became
generic terms.

--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"
 