Reg Exp and sentences

 
 
kjhjhjhjadsasda@urbanhabit.com
Guest
Posts: n/a
 
      09-30-2005
Hi

I'm trying to get a solid regular expression that identifies sentences
from a text chunk and that throws away anything that isn't.

Example:

pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
1223 sd dskj() sdkjas | asd| |sdasda sadkjasd

Would result in:

This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
sdfklj sdflkjsdf lksdfj.

E.g. something that looks for a length of more than, say, 5 words, that starts
with an upper case letter, can include ,()- and space, and ends with one of
.!?

Thanks
M

 
 
Sherm Pendley
Guest
Posts: n/a
 
      09-30-2005
kjhjhjhjadsasda@urbanhabit.com writes:

> I'm trying to get a solid regular expression that identifies sentences
> from a text chunk and that throws away anything that isn't.
>
> Example:
>
> pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
> right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
> Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
> 1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
>
> Would result in:
>
> This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
> Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
> sdfklj sdflkjsdf lksdfj.
>
> E.g. something that looks for a length of more than, say, 5 words, that starts
> with an upper case letter, can include ,()- and space, and ends with one of
> .!?


What have you tried so far?

If you need help getting started, try <http://learn.perl.org> for lots of
useful tutorials, book suggestions, and so forth.

Oh, and don't forget to read this group's guidelines, if you haven't yet
done so - lots of tips and useful links there too.

sherm--

--
Cocoa programming in Perl: http://camelbones.sourceforge.net
Hire me! My resume: http://www.dot-app.org
 
 
Dr.Ruud
Guest
Posts: n/a
 
      09-30-2005
kjhjhjhjadsasda@urbanhabit.com wrote:

> I'm trying to get a solid regular expression that identifies sentences
> from a text chunk and that throws away anything that isn't.


The sed mailing list on yahoogroups is a nice place to get free regexes.

That list is available on gmane too:
news://news.gmane.org/gmane.editors.sed.user

--
Affijn, Ruud

"Gewoon is een tijger."



 
 
Scott Bryce
Guest
Posts: n/a
 
      09-30-2005
kjhjhjhjadsasda@urbanhabit.com wrote:

> E.g. something that looks for a length of more than, say, 5 words, that starts
> with an upper case letter, can include ,()- and space, and ends with one of
> .!?


Hey... Would this work? I don't know. Let me think. No. I guess not.

You may wind up tossing out complete sentences that have fewer than 5 words.

"Besides," he said, "Not all sentences end with a period." (At least I
don't think so.)

 
 
kjhjhjhjadsasda@urbanhabit.com
Guest
Posts: n/a
 
      09-30-2005
It's actually fine if it excludes some sentences "by mistake"; it's hard
to make it bulletproof, I guess.

 
 
Matt Garrish
Guest
Posts: n/a
 
      10-01-2005

kjhjhjhjadsasda@urbanhabit.com wrote:
> Hi
>
> I'm trying to get a solid regular expression that identifies sentences
> from a text chunk and that throws away anything that isn't.
>
> Example:
>
> pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
> right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
> Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
> 1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
>
> Would result in:
>
> This is a proper sentence, right here. Hejrkjlekk werkwe wer werjlkj!
> Wedkljew erewrkjkj? Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf,
> sdfklj sdflkjsdf lksdfj.
>


Think of how you do that as a person. You cognitively determine whether each
word is a word and whether those words when strung together form a sentence
that makes sense to you as a speaker of that language. Regular expressions,
as you're hopefully aware, are not cognitive.

Regular expressions are for matching patterns, and you do not have a pattern
to match. You might use a regular expression to break up the sentences on
punctuation, but you're never going to write a regular expression to
determine what is and what isn't a "proper" sentence.
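
For what it's worth, here is a minimal sketch of that narrower job: breaking text on terminal punctuation without judging whether each chunk is a sentence. The sub name split_on_punctuation is made up for illustration, and trailing text with no terminator is simply dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into chunks that end in ., ! or ?
# It makes no attempt to decide whether a chunk is a "proper" sentence.
sub split_on_punctuation {
    my ($text) = @_;
    # Shortest run of characters followed by one or more terminators;
    # /s lets . match across newlines.
    my @chunks = $text =~ /(.+?[.!?]+)/gs;
    # Collapse runs of whitespace, then trim both ends of each chunk.
    for (@chunks) { s/\s+/ /g; s/^ //; s/ $//; }
    return @chunks;
}

print "$_\n" for split_on_punctuation("This is one. Is this two? Yes!");
```

Everything interesting (abbreviations, quotes, gibberish that happens to end in a period) is still left to whatever looks at the chunks afterwards.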

Matt



 
 
kjhjhjhjadsasda@urbanhabit.com
Guest
Posts: n/a
 
      10-01-2005
> Regular expressions are for matching patterns, and you do not have a pattern
> to match. You might use a regular expression to break up the sentences on
> punctuation, but you're never going to write a regular expression to
> determine what is and what isn't a "proper" sentence.
>
> Matt


Thanks all for the input.

Surely, though, there must be a regular expression saying $whatever
starts with A-Z, has whatever in the middle and ends with .
(punctuation) ?

M

 
 
Matt Garrish
Guest
Posts: n/a
 
      10-01-2005

kjhjhjhjadsasda@urbanhabit.com wrote:
>> Regular expressions are for matching patterns, and you do not have a
>> pattern
>> to match. You might use a regular expression to break up the sentences on
>> punctuation, but you're never going to write a regular expression to
>> determine what is and what isn't a "proper" sentence.
>>

>
> Surely, though, there must be a regular expression saying $whatever
> starts with A-Z, has whatever in the middle and ends with .
> (punctuation) ?
>


I hesitate to even write this, but...

use strict;
use warnings;

my $text = <<TEXT;
I suppose this is a sentence. THisdsa askhwerjjk.vfklanf.,,dsf,, .
"I quote, this is going to fail you in ways you may not expect!?!<<<"
But that's not dkalkg ghdsklgklg askl my problem. Dskjdskjfn!
99 bottles of beer in my stomach... oops where'd my sentence go?
TEXT

# Grab the shortest run that starts with a capital letter or digit and
# ends at the first ., ! or ? (the /s flag lets . match across newlines).
foreach my $sentence ($text =~ /([A-Z0-9].*?[.!?])/gs) {
    print $sentence, "\n";
}

Hopefully the above will give you some ideas as to what you're up against,
though.
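
One way to poke at those failure modes is a rough sketch that adds two guesses to the pattern above: the terminator must be followed by whitespace or the end of the text, and a candidate needs at least five words (roughly the threshold suggested earlier in the thread). The sub name candidate_sentences and both rules are assumptions for illustration, not a real definition of a sentence, and this version only accepts a capital letter as the start.

```perl
use strict;
use warnings;

# Hypothetical helper: return candidate "sentences" with at least
# $min_words whitespace-separated words (default 5, an assumed threshold).
sub candidate_sentences {
    my ($text, $min_words) = @_;
    $min_words = 5 unless defined $min_words;
    my @keep;
    # Start at a capital letter and stop at the first ., ! or ? that is
    # followed by whitespace or end of text, so "a.b" is not cut in half.
    while ($text =~ /([A-Z].*?[.!?])(?=\s|\z)/gs) {
        my @words = split ' ', $1;
        # Rejoin with single spaces so line breaks inside a match vanish.
        push @keep, join(' ', @words) if @words >= $min_words;
    }
    return @keep;
}

print "$_\n" for candidate_sentences("Hi there. This sentence has five words.");
```

Gibberish that happens to fit the shape still gets through, and short real sentences are thrown away, so this only narrows the problem.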

Matt


 
 
William James
Guest
Posts: n/a
 
      10-02-2005

kjhjhjhjadsasda@urbanhabit.com wrote:
> > Regular expressions are for matching patterns, and you do not have a pattern
> > to match. You might use a regular expression to break up the sentences on
> > punctuation, but you're never going to write a regular expression to
> > determine what is and what isn't a "proper" sentence.
> >
> > Matt

>
> Thanks all for the input.
>
> Surely, though, there must be a regular expression saying $whatever
> starts with A-Z, has whatever in the middle and ends with .
> (punctuation) ?
>
> M


A starting point (in Ruby):

# Will match multiple contiguous sentences.
re = /(?: ^ | \s )
      (
        (?:
          ["('`] *
          [A-Z]
          [- a-z \s ,;: () '`"]+
          [.?!]
          [")'`] *
          (?: \s+ | $ )
        ) +
      )
     /xm
s = DATA.read
s.scan( re ){ |x| x = x.first.strip
  if x.split.size > 4
    puts x
  end
}

__END__
pjkoqwe () asdkj() asdasd...... dasdkasjk ** This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
1223 sd dskj() sdkjas | asd| |sdasda sadkjasd
"I suppose this is a sentence," he said. THisdsa
askhwerjjk.vfklanf.,,dsf,, .
(A "sentence" at the very end.)


Output:

This is a proper sentence,
right here. Hejrkjlekk werkwe wer werjlkj! Wedkljew erewrkjkj?
Wwlkjfdskjsdflk sdlkfjsdsd sdflkjsd sdfkjsdf, sdfklj sdflkjsdf lksdfj.
"I suppose this is a sentence," he said.
(A "sentence" at the very end.)

 