Velocity Reviews - Computer Hardware Reviews

separating attribution, quoted text, and sigs from the body of a post

 
 
Art Merkel
      01-17-2007
I wonder if anyone would be willing to share some code for pulling out
the "meat" of the body of an e-mail or usenet post? I mean given the
example

=====begin example
On 01/16/07 Fred Smith wrote:
> blah blah


Foo bar! Foo foo bar!

> blah blah blah


That's all I have to say

--
Here's my witty sig.
=====end example

just to return this:

Foo bar! Foo foo bar!
That's all I have to say


I'm thinking of something involving while and the .. operator, but I'm
not sure how to get rid of the "...wrote:"-type line without screwing
up on posts that don't have one, or what pattern to use to catch the
common ones.


 
usenet@DavidFilmer.com
      01-17-2007
Art Merkel wrote:
> I wonder if anyone would be willing to share some code for pulling out
> the "meat" of the body of an e-mail or usenet post?


You won't be able to do this 100% of the time because the behavior of
replies is different (and can be customized) in different newsreaders.
Usenet posts are plain text, and lack the context tagging of XML, etc.
But you can probably get pretty close to what you want.

You can probably exclude 90%+ of attribution lines by excluding
/wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
that assumes English-language newsgroups. Some folks try to be cute
with attribution lines like:
When Art Merkel finally sobered up, he blundered:
Nuthin you can do about attribution lines like that, unless you
hard-code distinctive strings for prolific posters.

You can probably exclude 90%+ of context quotes by excluding /^>/.

A usenet sig (if it's properly configured) follows a cutline which is
two dashes and a space. It's easy to identify such a cutline and
ignore everything which follows. But many posters don't use a proper
cutline.
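
A minimal sketch of the three heuristics above, offered as an illustration rather than as code from anyone in this thread. The helper name strip_simple and the sample post are made up, and dropping blank lines is an extra assumption added to match the output asked for in the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines that look like new content.
sub strip_simple {
    my @kept;
    for my $line (@_) {
        last if $line =~ /^-- $/;      # sig cutline: ignore everything after it
        next if $line =~ /^>/;         # quoted context
        next if $line =~ /wrote:$/;    # common English attribution line
        next if $line !~ /\S/;         # blank line (assumption, see above)
        push @kept, $line;
    }
    return @kept;
}

# Sample data modelled on the example in the question.
my @post = (
    'On 01/16/07 Fred Smith wrote:',
    '> blah blah',
    '',
    'Foo bar! Foo foo bar!',
    '',
    '> blah blah blah',
    '',
    "That's all I have to say",
    '',
    '-- ',
    "Here's my witty sig.",
);

print "$_\n" for strip_simple(@post);
```

Because /wrote:$/ is tested on every line, a line of new text that happens to end in "wrote:" is dropped too, and an attribution that wraps onto a second line is only partly removed.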

--
The best way to get a good answer is to ask a good question.
David Filmer (http://DavidFilmer.com)

 
Art Merkel
      01-18-2007
(E-Mail Removed) wrote:

> You can probably exclude 90%+ of attribution lines by excluding
> /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
> that assumes English-language newsgroups. Some folks try to be cute
> with attribution lines like:
> When Art Merkel finally sobered up, he blundered:
> Nuthin you can do about attribution lines like that, unless you
> hard-code distinctive strings for prolific posters.


How about storing lines (some people's attribution lines wrap) that
don't match /^>/ until

(1) I hit one that does match, and I discard what I've already got
or
(2) I hit the sig cutline or end of the message, in this case I keep
everything I've already got since it's probably an OP?

Not sure what to do about top-posting (b*st*rds) though!


> You can probably exclude 90%+ of context quotes by excluding /^>/.


Of course.

> A usenet sig (if it's properly configured) follows a cutline which is
> two dashes and a space. It's easy to identify such a cutline and
> ignore everything which follows. But many posters don't use a proper
> cutline.


Right: when I hit /^-- $/, stop there.
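
A rough sketch of the buffering idea, assuming the body has already been split into chomped lines. The name leading_block, its return convention, and the sample address are hypothetical. It only classifies the unquoted block at the top of the body, so a wrapped attribution is caught without needing a pattern for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: decide what the unquoted lines at the top of a body are.
# Returns a kind ('attribution' or 'original') followed by the buffered lines.
sub leading_block {
    my @buffer;
    for my $line (@_) {
        # case (1): a quoted line means the buffer held attribution lines
        return ('attribution', @buffer) if $line =~ /^>/;
        # case (2): the sig cutline ends the scan
        last if $line =~ /^-- $/;
        push @buffer, $line;
    }
    # case (2): reached the cutline or the end without seeing a quote
    return ('original', @buffer);
}

# Made-up sample with an attribution that wraps onto a second line.
my ($kind, @lines) = leading_block(
    'On Tuesday the 16th of January 2007, Fred Smith',
    '<fred@example.com> wrote in comp.lang.perl.misc:',
    '> blah blah',
    'Foo bar!',
);
print "$kind: $_\n" for @lines;
```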


 
Art Merkel
      01-19-2007
(E-Mail Removed) wrote:

> You won't be able to do this 100% of the time because the behavior of
> replies is different (and can be customized) in different newsreaders.
> Usenet posts are plain text, and lack the context tagging of XML, etc.
> But you can probably get pretty close to what you want.
>
> You can probably exclude 90%+ of attribution lines by excluding
> /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
> that assumes English-language newsgroups. Some folks try to be cute
> with attribution lines like:
> When Art Merkel finally sobered up, he blundered:
> Nuthin you can do about attribution lines like that, unless you
> hard-code distinctive strings for prolific posters.
>
> You can probably exclude 90%+ of context quotes by excluding /^>/.


I'm thinking of something "stateful" in which I scan lines until

(1) I hit a line that starts with '>', in which case I discard
everything I have so far (attribution). Then I keep going, ignoring
/^>/ lines (quoted) but keeping other lines until I hit the cutline or
the end.

(2) I hit the cutline or the end, in which case I keep everything so
far (an OP).
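
The two cases above could be sketched in a single pass like this. It is only an illustration: extract_new_text is a made-up name, the input is assumed to be a list of chomped body lines, and skipping blank lines is an added assumption so the result matches the output asked for at the top of the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper implementing cases (1) and (2) in one pass.
sub extract_new_text {
    my @kept;
    my $seen_quote = 0;
    for my $line (@_) {
        last if $line =~ /^-- $/;            # cutline: the sig follows, stop
        if ($line =~ /^>/) {
            # case (1): at the first quoted line, whatever was held so far
            # is treated as attribution and thrown away
            @kept = () unless $seen_quote;
            $seen_quote = 1;
            next;                            # quoted lines are never kept
        }
        push @kept, $line if $line =~ /\S/;  # keep non-blank unquoted lines
    }
    return @kept;                            # case (2) when no quote was seen
}

# Sample data modelled on the example in the question.
my @post = (
    'On 01/16/07 Fred Smith wrote:',
    '> blah blah',
    '',
    'Foo bar! Foo foo bar!',
    '',
    '> blah blah blah',
    '',
    "That's all I have to say",
    '',
    '-- ',
    "Here's my witty sig.",
);

print "$_\n" for extract_new_text(@post);
```

With a top-posted reply, the new text sits before the first quoted line, so this sketch would discard it along with the attribution.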


> A usenet sig (if it's properly configured) follows a cutline which is
> two dashes and a space. It's easy to identify such a cutline and
> ignore everything which follows. But many posters don't use a proper
> cutline.


No way to deal with top-posting, is there?


 
Adam Funk
      02-06-2007
On 2007-01-17, (E-Mail Removed) wrote:

> You won't be able to do this 100% of the time because the behavior of
> replies is different (and can be customized) in different newsreaders.
> Usenet posts are plain text, and lack the context tagging of XML, etc.
> But you can probably get pretty close to what you want.
>
> You can probably exclude 90%+ of attribution lines by excluding
> /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
> that assumes English-language newsgroups. Some folks try to be cute
> with attribution lines like:
> When Art Merkel finally sobered up, he blundered:
> Nuthin you can do about attribution lines like that, unless you
> hard-code distinctive strings for prolific posters.


Here's something I've tinkered with, which assumes that either the
body is all original (no m/^>/ lines) or that all unquoted lines
before the first quoted one are attribution lines (I think this is
almost always the case for inline/bottom-posting).

Comments, suggestions?

Of course it doesn't handle top-posting!


##################################################
#!/usr/bin/perl

use strict;
use warnings;
use News::Article;

# Classify the body lines of every article file named on the command line.
for my $filename (@ARGV) {
    my $in_art = News::Article->new($filename);

    print("*****\n$filename\n");

    process_body($in_art->body());
}


# Print each body line prefixed with a one-character tag:
#   a = attribution, q = quoted, n = new content, - = sig separator, s = sig
sub process_body {
    my @input = @_;
    my $op = 1;
    my $not_sig = 1;

    # $op is true iff this is an original post (no quoted line before the sig)
    foreach my $line (@input) {
        if ($line =~ /^>/) {
            $op = 0;
            last;
        }
        elsif ($line =~ /^-- $/) {
            last;
        }
    }

    if ($op) {
        print("original\n");
    }
    else {
        print("quoting\n");
    }

    # In a reply, every line before the first quoted one is attribution.
    # Stop before that quoted line so the next loop tags it as quoted.
    if (! $op) {
        while (@input && $input[0] !~ /^>/) {
            my $line = shift(@input);
            print(" a $line\n");    # attribution
        }
    }

    while (@input && $not_sig) {
        my $line = shift(@input);
        if ($line =~ /^-- $/) {
            $not_sig = 0;
            print(" - ");           # sig separator
        }
        elsif ($line !~ /^>/) {
            print(" n ");           # new content
        }
        else {
            print(" q ");           # quoted
        }
        print($line, "\n");
    }

    # Everything after the separator is the sig.
    while (@input) {
        my $line = shift(@input);
        print(" s $line\n");        # sig
    }
}
 