Velocity Reviews


Heather Brown 03-08-2011 12:39 PM

Re: Finding keywords
 
On 03/08/2011, Cross wrote:
> Hello
>
> I have got a project in which I have to extract keywords given a URL. I
> would like to know methods for extracting keywords. Frequency of
> occurrence is one, but it seems naive. I would prefer something more
> robust. Please suggest.
>
> Regards
> Cross
>


The keywords are in the content attribute of a <meta name="keywords">
tag, in the <head> section. Are you having trouble parsing the xhtml to
that point?
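
For instance, a minimal sketch with the Python 2 stdlib HTMLParser
(matching the urllib2-era code elsewhere in this thread; the class name
and the URL are just placeholders) that collects those keywords:

import urllib2
from HTMLParser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collect the content attribute of any <meta name="keywords"> tag."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "keywords":
            # the keyword list is conventionally comma-separated
            self.keywords.extend(
                kw.strip() for kw in attrs.get("content", "").split(","))

parser = MetaKeywordParser()
parser.feed(urllib2.urlopen("http://example.com/").read().decode("utf-8", "ignore"))
print parser.keywords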

Be more specific in your question, and somebody is likely to chime in,
though I'm not the one to help if it's a question of parsing the xhtml.

DaveA

Matt Chaput 03-08-2011 07:00 PM

Re: Finding keywords
 
On 08/03/2011 8:58 AM, Cross wrote:
> I know meta tags contain keywords, but they are not always reliable. I
> can parse xhtml to obtain keywords from meta tags, but how do I verify
> them? To obtain reliable keywords, I have to parse the plain text
> obtained from the URL.


I think maybe what the OP is asking about is extracting key words from a
text, i.e. a short list of words that characterize the text. This is an
information retrieval problem, not really a Python problem.

One simple way to do this is to calculate word frequency histograms for
each document in your corpus, and then for a given document, select
words that are frequent in that document but infrequent in the corpus as
a whole. Whoosh does this. There are different ways of calculating the
importance of words, and stemming and conflating synonyms can give you
better results as well.
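
As a rough sketch of that idea (this is plain tf-idf, not Whoosh's
actual code; the function and its arguments are invented for the
example):

from __future__ import division
import math
from collections import Counter  # Python 2.7+

def keyword_scores(doc_words, corpus_doc_freq, n_docs):
    """Rank words by tf-idf: frequent in this document, rare in the corpus.

    doc_words:       list of tokens from the document of interest
    corpus_doc_freq: dict mapping word -> number of corpus docs containing it
    n_docs:          total number of documents in the corpus
    """
    tf = Counter(doc_words)
    scored = [(word, count * math.log(n_docs / (1 + corpus_doc_freq.get(word, 0))))
              for word, count in tf.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

Something like keyword_scores(tokens, doc_freqs, 1000)[:10] would then
give the ten strongest candidates.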

A more sophisticated method uses "part of speech" tagging. See the
Python Natural Language Toolkit (NLTK) and topia.termextract for more
information.

http://pypi.python.org/pypi/topia.termextract/
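
For instance, a small NLTK snippet (the sentence is a made-up example;
the tokenizer and tagger models must be installed via nltk.download())
that keeps nouns as keyword candidates:

import nltk

text = "Python makes the source code of a program readable."
tagged = nltk.pos_tag(nltk.word_tokenize(text))  # [(word, POS tag), ...]
# nouns (Penn Treebank tags starting with "NN") are often the best candidates
print [word for word, tag in tagged if tag.startswith("NN")]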

Yahoo has a web service for key word extraction:

http://developer.yahoo.com/search/co...xtraction.html

You might want to investigate these resources, try Google searches for,
e.g., "extracting key terms from documents", and then come back if you
have a question about the Python implementation.

Cheers,

Matt

Vlastimil Brom 03-08-2011 07:51 PM

Re: Finding keywords
 
2011/3/8 Cross <X@x.tv>:
> On 03/08/2011 06:09 PM, Heather Brown wrote:
>>
>> The keywords are in the content attribute of a <meta name="keywords">
>> tag, in the <head> section. Are you having trouble parsing the xhtml to
>> that point?
>>
>> Be more specific in your question, and somebody is likely to chime in,
>> though I'm not the one to help if it's a question of parsing the xhtml.
>>
>> DaveA

>
> I know meta tags contain keywords, but they are not always reliable. I can
> parse xhtml to obtain keywords from meta tags, but how do I verify them? To
> obtain reliable keywords, I have to parse the plain text obtained from the
> URL.
>
> Cross
>


Hi,
if you need to extract meaningful keywords in the data-mining sense,
using natural language processing, it can become quite a complex
task, depending on the requirements; the NLTK toolkit may help with
some approaches [ http://www.nltk.org/ ].
One possibility would be to filter out frequent but less meaningful
words ("stopwords") and extract the more frequent words from the
remainder, e.g. (with some simplifications/hacks in the interactive
mode):

>>> import re, urllib2, nltk
>>> page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
>>> page_plain = nltk.clean_html(page_src).lower()
>>> stopwords = set(nltk.corpus.stopwords.words("english"))
>>> txt_filtered = nltk.Text(word for word in re.findall(r"(?u)\w+", page_plain) if word not in stopwords)
>>> frequency_dist = nltk.FreqDist(txt_filtered)
>>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]
>>> [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]

[(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7),
(u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5),
(u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4),
(u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4),
(u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3),
(u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help',
3), (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3),
(u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability',
3), (u'readable', 3), (u'write', 3)]
>>>


Another possibility would be to extract certain parts of speech (e.g.
nouns, adjectives, verbs) using e.g. nltk.pos_tag(input_txt) etc.;
for more convoluted html code, e.g. BeautifulSoup might be used, and
there are likely many other options.
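
For illustration, a minimal plain-text extraction sketch with
BeautifulSoup (using the newer bs4 package's API; the 2011-era
BeautifulSoup 3 spells the import differently), which could stand in
for the nltk.clean_html() step above:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read()
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):  # drop markup whose text is never visible
    tag.decompose()
page_plain = soup.get_text(" ", strip=True).lower()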

hth,
vbr

Terry Reedy 03-08-2011 09:00 PM

Re: Finding keywords
 
On 3/8/2011 2:00 PM, Matt Chaput wrote:
> On 08/03/2011 8:58 AM, Cross wrote:
>> I know meta tags contain keywords, but they are not always reliable. I
>> can parse xhtml to obtain keywords from meta tags, but how do I verify
>> them? To obtain reliable keywords, I have to parse the plain text
>> obtained from the URL.


This, of course, is a problem for all search engines, especially given
'search optimization' games.

> I think maybe what the OP is asking about is extracting key words from a
> text, i.e. a short list of words that characterize the text. This is an
> information retrieval problem, not really a Python problem.
>
> One simple way to do this is to calculate word frequency histograms for
> each document in your corpus, and then for a given document, select
> words that are frequent in that document but infrequent in the corpus as
> a whole. Whoosh does this.


I believe Google does something like this also. I have seen a claim that
Google only looks at the first x words, hence the advice 'Make sure your
target keywords are in the first x words.' You, of course, can and
should process entire docs.


--
Terry Jan Reedy


ramkrishan.bhatt@gmail.com 12-05-2013 01:39 PM

Re: Finding keywords
 
Hi, if you found a solution, please let me know as well. I have to implement this ASAP.
On Wednesday, 9 March 2011 23:43:26 UTC+5:30, Cross wrote:
> On 03/09/2011 01:21 AM, Vlastimil Brom wrote:
> > [snip]

> I had considered nltk. That is why I said that straightforward frequency
> calculation of words would be naive. I have to look into this BeautifulSoup thing.



