Velocity Reviews - Computer Hardware Reviews

Extracting "true" words

 
 
candide
      04-01-2011
Back again with my study of regular expressions. There is a special
sequence for matching alphanumeric characters, \w (BTW, what does the
letter 'w' refer to?). But it can't be used to extract true words; by
"true" I mean words composed only of _alphabetic_ letters (no digits,
no underscores).


So I was wondering: what pattern extracts (or matches) _true_ words?
Of course, I'm not restricting myself to the ASCII universe, so the
pattern [a-zA-Z]+ doesn't meet my needs.
 
 
 
 
 
Chris Rebert
      04-01-2011
On Fri, Apr 1, 2011 at 1:55 PM, candide <(E-Mail Removed)> wrote:
> Back again with my study of regular expressions There exists a special
> character allowing alphanumeric extraction, the special character \w (BTW,
> what the letter 'w' refers to?).


"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.

> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).


Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?

> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.


AFAICT, there doesn't appear to be a nice way to do this in Python
using the std lib `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode/locale-sensitive, but includes digits and the
underscore, as you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.

Cheers,
Chris
--
http://blog.rebertia.com
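
A workaround often used with the standard `re` module is a negated class: start from \w and subtract digits and the underscore. The Python 3 sketch below is not from the posts in this thread, and the name `true_words` is made up for illustration. Two caveats: \d covers only decimal digits, so a few other numeric characters (superscript digits, for example) may still be matched, and combining marks in decomposed text will split a word.

```python
import re

# [^\W\d_] reads "not a non-word character, not a digit, not an underscore",
# which leaves the letters that \w would otherwise match.
TRUE_WORD = re.compile(r"[^\W\d_]+")


def true_words(text):
    """Return the runs of letters in text (str patterns are Unicode-aware in Python 3)."""
    return TRUE_WORD.findall(text)


print(true_words("Été 2011: naïve_word café42"))
```

Digits and underscores act as separators here, so "naïve_word" comes back as two words.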
 
 
 
 
 
MRAB
      04-01-2011
On 01/04/2011 21:55, candide wrote:
> Back again with my study of regular expressions There exists a
> special character allowing alphanumeric extraction, the special
> character \w (BTW, what the letter 'w' refers to?). But this feature
> doesn't permit to extract true words; by "true" I mean word composed
> only of _alphabetic_ letters (not digit nor underscore).
>

The 'w' refers to a 'word' character, although in regex it refers to
letters, digits and the underscore character '_' due to its use in
computer languages (basically, the characters of an identifier or name).
>
> So I was wondering what is the pattern to extract (or to match) _true_
> words ? Of course, I don't restrict myself to the ascii universe so that
> the pattern [a-zA-Z]+ doesn't meet my needs.
>

Using the re module, you would have to create a character class out of
all the possible letters, something like this:

letter_class = u"[" + u"".join(unichr(c) for c in range(0x10000) if
unichr(c).isalpha()) + u"]"
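
The line above is Python 2. A Python 3 rendering of the same idea might look like the sketch below: `chr()` replaces `unichr()`, the `u` prefixes are no longer needed, and the range is kept at the Basic Multilingual Plane as in the original. The name `letter_run` is an illustrative addition.

```python
import re

# Every BMP code point that str.isalpha() accepts goes into one big class.
# None of these characters is special inside [...], so no escaping is needed.
letters = "".join(chr(c) for c in range(0x10000) if chr(c).isalpha())
letter_class = "[" + letters + "]"

# Compile once and reuse; building the class is the expensive part.
letter_run = re.compile(letter_class + "+")

print(letter_run.findall("café r2d2 naïve_x"))
```

Extending the range to `sys.maxunicode + 1` would also cover letters outside the BMP, at the cost of a larger class.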

Alternatively, you could try the new regex implementation here:

http://pypi.python.org/pypi/regex

which adds support for Unicode properties, and do something like this:

words = regex.findall(ur"\p{Letter}+", unicode_text)
 
 
John Nagle
      04-02-2011
On 4/1/2011 4:10 PM, Chris Rebert wrote:
> On Fri, Apr 1, 2011 at 1:55 PM, candide<(E-Mail Removed)> wrote:
>> Back again with my study of regular expressions There exists a special
>> character allowing alphanumeric extraction, the special character \w (BTW,
>> what the letter 'w' refers to?).

>
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.
>
>> But this feature doesn't permit to extract
>> true words; by "true" I mean word composed only of _alphabetic_ letters (not
>> digit nor underscore).

>
> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
> And what of hyphenated terms (e.g. "re-lock")?


It's an interesting parsing problem to find word breaks in mixed
language text. It's quite common to find English and Japanese text
mixed. (See "http://www.dokidoki6.com/00_index1.html". Caution,
excessively cute.) Each ideograph is a "word", of course.

Parse this into words:

★12/25/2009★
6%DOKIDOKI VISUAL FILE vol.4を公開しました。
アルバムの上部で再生操作、下部でサムネイルがご覧いただけます。

John Nagle
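
To see why this is hard, one rough experiment is to cut the text into runs of letters that share a script. The sketch below guesses the script from the first word of each character's Unicode name, which is an assumption made for illustration (the standard library has no script property), and the helper names are hypothetical. The runs it returns are not real word boundaries: proper Japanese segmentation needs a dictionary-based tokenizer.

```python
import itertools
import unicodedata


def script_of(ch):
    """Crude script guess: first word of the Unicode name, or None for non-letters."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_by_script(text):
    """Group consecutive letters of the same script; drop everything else."""
    return [
        (script, "".join(run))
        for script, run in itertools.groupby(text, key=script_of)
        if script is not None
    ]


for script, run in split_by_script("VISUAL FILE vol.4を公開しました。"):
    print(script, run)
```

On the sample line this separates the Latin words from the kana and the ideographs, but "しました" stays a single run even though it is an inflected verb ending rather than a word of its own.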
 
 
candide
      04-02-2011
On 02/04/2011 01:10, Chris Rebert wrote:

> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.


OK


> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?


Yes, CJK ideographs don't belong to the locale I'm working with.


> And what of hyphenated terms (e.g. "re-lock")?



I'm only interested in ASCII letters and ASCII letters with diacritics.


Thanks for your response.
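
With that narrower requirement, the character-class approach can be restricted to Latin letters using only the standard library. The sketch below assumes that "ASCII letters with diacritics" can be approximated by "letters whose Unicode name starts with LATIN"; that is a guess at the intent, and it also admits letters such as ß or æ. The names `is_latin_letter` and `latin_words` are illustrative.

```python
import re
import unicodedata


def is_latin_letter(ch):
    # Assumption: a Latin letter is an alphabetic character whose
    # Unicode name begins with "LATIN" (e.g. LATIN SMALL LETTER E WITH ACUTE).
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN ")


# Built once over the Basic Multilingual Plane only.
latin_letters = "".join(chr(c) for c in range(0x10000) if is_latin_letter(chr(c)))
LATIN_WORD = re.compile("[" + latin_letters + "]+")


def latin_words(text):
    # NFC normalization turns "e" + combining accent into one precomposed letter,
    # so decomposed input is not split at the accent.
    return LATIN_WORD.findall(unicodedata.normalize("NFC", text))


print(latin_words("café 公開 naïve αβ abc123"))
```

CJK and Greek runs are skipped entirely, and digits still act as separators.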

 