
Recursively Parsing through multipart messages use Mail::Box::Manager;

 
 
Bloch
      12-21-2005
I've written a little script that uses Mail::Box::Manager to parse an mbox
file, strip off most of the headers, decode the body, and eventually print
the data that is encoded as text/plain. It works fine for messages that
are flat (i.e., multipart/alternative at the top level), where it can just
grab the plaintext part from one level down.

I run into problems when I hit multipart/mixed messages and have to
descend another level. I've been reading through the groups.google.com
archives and the man pages of these modules and see that applying
these items recursively is tricky for inexperienced programmers -- which
I claim to be. Can someone recommend a better way of getting to my desired
endpoint, or help me sort out how to get there using my existing approach?

I've attached the relevant portion of my code and the output of
printStructure to give a better idea of the problem domain.

#!/usr/local/bin/perl

use strict;
use warnings;
use Mail::Box::Manager;
use Date::Parse;

my $mgr = Mail::Box::Manager->new;
#my $folder_file = "/home/salvador/mail/releases";
my $folder_file = "/home/salvador/mail/releases.old";
my $folder = $mgr->open(folder => $folder_file)
    or die "Could not open folder $!\n";

my (@subject, @sender, @body, @time);
my $x = 0;
for ($folder->messages) {
    # Keep only the headers of interest for each message.
    $subject[$x] = $_->subject;
    $sender[$x]  = $_->sender->address;
    $time[$x]    = $_->get('Date');
    #$body[$x] = $_->decoded;
    #$_->printStructure;

    if ($_->isMultipart) {
        # Only looks at the parts one level below the top.
        foreach my $part ($_->body->parts) {
            my $attached_head = $part->head;
            my $attached_body = $part->decoded;
            if ($attached_head =~ /text\/plain/) {
                # print "$attached_head\n";
                # print "OK\n";
            } elsif ($attached_head =~ /multipart\/alternative/i) {
                # A nested multipart: its own parts are not visited here.
                print "$attached_head\n";
                print "Crap! How do I parse the next batch of headers?\n";
                print "$attached_body";
            }
        }
    }
    $x++;
}

PARTIAL OUTPUT OF MESSAGE STRUCTURES:

OK:
multipart/alternative: KENNEDY: AMERICANS DESERVE BETTER THAN A REPUBLICAN BUDGET THAT LEAVES THEM BEHIND (111850 bytes)
   text/plain (47689 bytes)
   text/html (62436 bytes)

OK:
multipart/alternative: Boxer Asks Legal Scholars on Dean's 'Impeachable Offense' Comment (10116 bytes)
   text/plain (2647 bytes)
   text/html (5495 bytes)

OK:
multipart/alternative: Sen. Jeffords' Statement on ANWR/Defense Spending Bill (8876 bytes)
   text/plain (1030 bytes)
   text/html (5864 bytes)

FAILS TO PARSE PROPERLY:
multipart/mixed: KENNEDY: REPUBLICANS BLOCK INTELLIGENCE BILL TO AVOID THE TRUTH OF THE WAR (202224 bytes)
   multipart/alternative (146945 bytes)
      text/plain (54877 bytes)
      text/html (91778 bytes)
   application/msexcel (53598 bytes)

....
 
Bloch
      12-21-2005
GEEEEEEYYYYYAAAARGH!!!

foreach my $part($_->body->parts('RECURSE'))

was the option that I was looking for. Missed it in the documentation
(several times, I might add).
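
For anyone who lands here with the same problem, here is a minimal sketch of how that option might be used to pull out only the text/plain leaves. The sample message (addresses, boundaries, attachment) and the helper name plain_text_parts are made up for illustration. Mail::Message->read, contentType and decoded->string are used as I understand them from the Mail::Box documentation, so check them against your installed version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mail::Message;    # ships with the Mail::Box / Mail::Message distribution

# Hypothetical sample: a multipart/mixed message wrapping a
# multipart/alternative plus one binary attachment, mirroring the
# structure that failed to parse above.
my $raw = <<'EOM';
From: sender@example.com
To: reader@example.com
Subject: nested example
Date: Wed, 21 Dec 2005 01:06:35 +0000
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain; charset="us-ascii"

Plain text body.
--inner
Content-Type: text/html; charset="us-ascii"

<p>HTML body.</p>
--inner--

--outer
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64

AAEC
--outer--
EOM

# Hypothetical helper: return the decoded text of every text/plain leaf.
sub plain_text_parts {
    my ($msg) = @_;

    # A flat message has nothing to descend into; treat it as its own leaf.
    my @leaves = $msg->isMultipart
               ? $msg->body->parts('RECURSE')    # descends into nested multiparts
               : ($msg);

    return map  { $_->decoded->string }
           grep { lc($_->contentType) eq 'text/plain' } @leaves;
}

my $msg = Mail::Message->read($raw);

# List the leaf content types, then print only the plain-text bodies.
my @types = map { lc $_->contentType } $msg->body->parts('RECURSE');
print "leaf parts: @types\n";
print for plain_text_parts($msg);
```

In the original loop the same idea would amount to swapping parts for parts('RECURSE') and dropping the multipart/alternative branch, since the nested containers should no longer show up as parts themselves.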

For what it's worth, I place the blame entirely on Mark Overmeer, who
spent godknowshowlong writing and documenting this excellent module.
Mark, if you hadn't been so thorough, I would never have missed such an
important, easily-spotted detail. No, no, this has nothing to do with the
fact that I'm an American, weaned on television and raised in the age of
instant gratification. Nor with the fact that my IQ is roughly 200 points
lower than a sponge -- and not one of those real sponges either, I'm
talking a sponge made by 3M or Dow or someone. No, it's your fault.

And that goes for the lot of you Perl mongers who have contributed to
developing Perl, and in so doing, have helped to build the modern
internet, or rather, "internets" as our President so eloquently puts it.
You owe me something. I could be using Smalltalk, or Eiffel, or Scheme or
Visual Basic or something, but I chose Perl. Okay, admittedly, Perl
*might* be slightly better than those languages for the problem domains
that I usually look at -- parsing textfiles and playing around with *nixy
stuff and so on. But many of my former CS professors *insist* that it's
ugly -- so it must be true -- so, again, you owe me for giving me such a
cool language to play with for free -- as in free beer and free
speech.




 