YARQ - Yet another regex question

 
 
sjp
 
      03-29-2005
Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

I'd like to remove the '=' and the EOL that follows it, but 'Sline =~ s/\=\n//g;' fails with
the error "Can't modify constant item in substitution (s///) at
/usr/local/bin/mailparse line 18, near "s/\=\n//g;"".

What is the proper way to do it?

Thanks,

SJP
 
 
 
 
 
A. Sinan Unur
 
      03-29-2005
sjp <(E-Mail Removed)> wrote in
news:2gh2e.15366$Go4.14046@trnddc05:

> I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails


Is that supposed to be $line?

> What is the proper way to do it?


One way would be to read the error message, then fix the error in the
given location, instead of asking hundreds of people to guess what your
script looks like.

Sinan.
 
 
 
 
 
Paul Lalli
 
      03-29-2005
sjp wrote:
> Hi folks,
>
> I'm parsing through a series of delimited records. Some of the records
> use '\t' for the delimiter, and others use '=09' as the delimiter. My
> program handles the tab-delimited records fine, but records that use '=09'
> have erroneous line breaks after '=' signs, like so:
>
> 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
> 71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
> RECREATION DIVISION=09
>
> I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
> an "Can't modify constant item in substitution (s///) at
> /usr/local/bin/mailparse line 18, near "s/\=\n//g;" error


What the heck is 'Sline'? Are you sure you don't mean $line?
Conceivably, perl thinks that 'Sline' is some sort of constant item.

You are enabling strict and warnings, right?

Also, = is not special in a regexp. There's no reason to escape it.

Beyond that, I don't understand what your actual issue is. How does
delimiting the records with '=09' relate to the records having \n
characters after some = characters?

Paul Lalli
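
For readers following along, here is a minimal sketch of the corrected statement. The sample string is a made-up fragment modelled on the OP's data, and the variable names are only illustrative. A bareword such as Sline is apparently treated as a string constant, which would be consistent with the "Can't modify constant item" message; with the '$' sigil and an unescaped '=', the substitution runs:

```perl
use strict;
use warnings;

# Hypothetical record fragment: "=" followed by a newline marks a broken line.
my $line = "PARKS & =\nRECREATION DIVISION=09";

# Note the '$' sigil (not 'S'); '=' is not special in a regex, so no backslash.
(my $joined = $line) =~ s/=\n//g;

print "$joined\n";
```

This only joins the broken lines; the '=09' delimiters are left untouched.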

 
 
John Bokma
 
      03-29-2005
Paul Lalli wrote:

> sjp wrote:
>> Hi folks,
>>
>> I'm parsing through a series of delimited records. Some of the
>> records use '\t' for the delimiter, and others use '=09' as the
>> delimiter. My program handles the tab-delimited records fine, but
>> records that use '=09' have erroneous line breaks after '=' signs,
>> like so:
>>
>> 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
>> AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
>> RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09
>>
>> I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails
>> with
>> an "Can't modify constant item in substitution (s///) at
>> /usr/local/bin/mailparse line 18, near "s/\=\n//g;" error
>
> What the heck is 'Sline'? Are you sure you don't mean $line?
> Conceivably, perl thinks that 'Sline' is some sort of constant item.
>
> You are enabling strict and warnings, right?
>
> Also, = is not special in a regexp. There's no reason to escape it.
>
> Beyond that, I don't understand what your actual issue is. How does
> the records being delimited by '=09' relate to the records having \n
> characters after some = characters?


=
09

is not

=09

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.

--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html

 
 
A. Sinan Unur
 
      03-29-2005
John Bokma <(E-Mail Removed)> wrote in news:Xns9628860D818B6castleamber@130.133.1.4:

> Paul Lalli wrote:
>
>> sjp wrote:
>>> Hi folks,
>>>
>>> I'm parsing through a series of delimited records. Some of the
>>> records use '\t' for the delimiter, and others use '=09' as the
>>> delimiter. My program handles the tab-delimited records fine, but
>>> records that use '=09' have erroneous line breaks after '=' signs,
>>> like so:
>>>
>>> 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
>>> AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
>>> RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09
>>>


....

>> Beyond that, I don't understand what your actual issue is. How does
>> the records being delimited by '=09' relate to the records having \n
>> characters after some = characters?

>
> =
> 09
>
> is not
>
> =09


But there are no such cases in the data the OP posted.

> the =xx encoding is used in email (I forgot the name), I would *fix*
> that first, and then do the parsing.


Base64. The CPAN module MIME::Base64 allows one to convert
Base64 encoded strings. On the other hand, I am not sure
if the data the OP posted really is Base64.

The following seems to satisfy the OP's requirements:

#! perl

use strict;
use warnings;

my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09};

$d =~ s/=09/\t/g;
$d =~ s/=\n//g;

print $d;
__END__
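
A note on the encoding name: '=XX' hex escapes combined with a trailing '=' as a soft line break look like MIME quoted-printable rather than Base64, which matches what the OP reports later in the thread. A minimal sketch using MIME::QuotedPrint, which is distributed alongside MIME::Base64 and, as far as I know, ships with core Perl. The sample string is a shortened, made-up variant of the OP's record:

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Shortened sample modelled on the OP's record; "=\n" is a soft line break.
my $encoded = "NEWBERG=09OR=099=\n"
            . "71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09PARKS & =\n"
            . "RECREATION DIVISION=09\n";

# decode_qp removes soft line breaks and turns every =XX escape into its byte.
my $decoded = decode_qp($encoded);

# After decoding, the record is ordinary tab-delimited text.
# The -1 limit keeps empty trailing fields.
my @fields = split /\t/, $decoded, -1;

print join('|', @fields);
```

Once decoded, the same tab-splitting code can handle both kinds of record.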

 
 
Paul Lalli
 
      03-29-2005
John Bokma wrote:
>
> Paul Lalli wrote:
>>
>>sjp wrote:
>>>
>>>My program handles the tab-delimited records fine, but
>>> records that use '=09' have erroneous line breaks after '=' signs,
>>> like so:
>>>93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
>>>AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
>>>RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09

>>
>>Beyond that, I don't understand what your actual issue is. How does
>>the records being delimited by '=09' relate to the records having \n
>>characters after some = characters?

>
> =
> 09
>
> is not
>
> =09
>
> the =xx encoding is used in email (I forgot the name), I would *fix*
> that first, and then do the parsing.



There is no instance of
=
09

anywhere in the OP's data. The way it sounds to me is that the OP is
concerned about \n's after *any* = character.

I admit, of course, that I could be quite wrong. But in fact, there is
no instance of any "=\n" anywhere in the OP's data, so I don't think we
can really know what the OP is talking about until the OP himself clarifies.

Paul Lalli
 
 
sjp
 
      03-29-2005
On Tue, 29 Mar 2005 19:10:40 +0000, John Bokma wrote:

> Paul Lalli wrote:
>
>> sjp wrote:
>>> Hi folks,
>>>
>>> I'm parsing through a series of delimited records. Some of the
>>> records use '\t' for the delimiter, and others use '=09' as the
>>> delimiter. My program handles the tab-delimited records fine, but
>>> records that use '=09' have erroneous line breaks after '=' signs,
>>> like so:
>>>
>>> 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
>>> AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
>>> RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09
>>>
>>> I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails
>>> with
>>> an "Can't modify constant item in substitution (s///) at
>>> /usr/local/bin/mailparse line 18, near "s/\=\n//g;" error
>>
>> What the heck is 'Sline'? Are you sure you don't mean $line?
>> Conceivably, perl thinks that 'Sline' is some sort of constant item.
>>
>> You are enabling strict and warnings, right?
>>
>> Also, = is not special in a regexp. There's no reason to escape it.
>>
>> Beyond that, I don't understand what your actual issue is. How does
>> the records being delimited by '=09' relate to the records having \n
>> characters after some = characters?

>
> =
> 09
>
> is not
>
> =09
>
> the =xx encoding is used in email (I forgot the name), I would *fix*
> that first, and then do the parsing.


You're right, John. I'm parsing a very large email archive file and an
indeterminate number of attachments in the file are encoded
"quoted-printable". So the real issue, I suppose is how to properly
decode an indeterminate number of quoted-printable records from a mail
archive before processing the records contained in that archive.

Thanks for helping me to frame the problem.
 
 
John Bokma
 
      03-29-2005
A. Sinan Unur wrote:

> But there are no such cases in the data the OP posted.


Yup, classical bad post / wrong example

I think "we" see this every day here?

> The following seems to satisfy the OP's requirements:
>
> #! perl
>
> use strict;
> use warnings;
>
> my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
> AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
> RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09};
>
> $d =~ s/=09/\t/g;
> $d =~ s/=\n//g;


If you swap those two, yes.
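
A rough sketch of that order, generalized from '=09' to any '=XX' escape: drop the soft line breaks first, then translate the escapes. The helper name is made up for illustration, and this is a hand-rolled approximation, not a full quoted-printable decoder:

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: join lines first, then decode.
sub join_then_decode {
    my ($text) = @_;
    $text =~ s/=\r?\n//g;                          # 1. remove "=" + EOL
    $text =~ s/=([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # 2. =09 -> tab, etc.
    return $text;
}

# Made-up fragment modelled on the quoted data.
my $d   = "AVE=09=09NEWBERG=09OR=099=\n71320000=098.33";
my $out = join_then_decode($d);

print "$out\n";
```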


 
 
John Bokma
 
      03-29-2005
Paul Lalli wrote:

> There is no instance of
> =
> 09
>
> anywhere in the OP's data.


Of course not, because the OP posted a wrong example.

Does that never happen here?

> The way it sounds to me is that the OP is
> concerned about \n's after *any* = character.
>
> I admit, of course, that I could be quite wrong. But in fact, there
> is no instance of any "=\n" anywhere in the OP's data, so I don't
> think we can really know what the OP is talking about until the OP
> himself clarifies.


My best guess:

=
xx

should become

=xx

and then if xx = 09 it should be replaced with \t

I would have the decoding be handled by a dedicated Perl module.


 
 
John Bokma
 
      03-29-2005
sjp wrote:

> On Tue, 29 Mar 2005 19:10:40 +0000, John Bokma wrote:


[ snip ]

>> the =xx encoding is used in email (I forgot the name), I would *fix*
>> that first, and then do the parsing.

>
> You're right, John. I'm parsing a very large email archive file and an
> indeterminate number of attachments in the file are encoded
> "quoted-printable".


Yup, that's the one

> So the real issue, I suppose is how to properly
> decode an indeterminate number of quoted-printable records from a mail
> archive before processing the records contained in that archive.


I am really sure that there are Perl modules that handle this.

<http://search.cpan.org/~gaas/MIME-Base64-Perl-1.00/lib/MIME/QuotedPrint/Perl.pm>
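
As a rough illustration of decoding only the parts that need it, the sketch below checks each part's Content-Transfer-Encoding header before calling decode_qp. The helper name and the sample parts are hypothetical, and it assumes the archive has already been split into individual MIME parts (headers, blank line, body). It does not handle multipart boundaries or folded headers, so a real mail archive would likely call for a proper MIME parsing module:

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: return the body of one MIME part, decoded only
# when the part's headers declare quoted-printable.
sub decoded_body {
    my ($part) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $part, 2;
    $body = '' unless defined $body;
    if ($head =~ /^Content-Transfer-Encoding:\s*quoted-printable\s*$/mi) {
        return decode_qp($body);
    }
    return $body;    # already plain (e.g. real tabs): leave untouched
}

# Two made-up parts: one quoted-printable, one plain tab-delimited.
my $qp_part = "Content-Type: text/plain\n"
            . "Content-Transfer-Encoding: quoted-printable\n"
            . "\n"
            . "HARNEY=09JAMES=09808 SITKA =\nAVE\n";
my $plain_part = "Content-Type: text/plain\n\nHARNEY\tJAMES\n";

print decoded_body($qp_part);
print decoded_body($plain_part);
```

Checking the header first avoids mangling tab-delimited records that happen to contain a literal '=' followed by two hex digits.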

> Thanks for helping me to frame the problem.


You're welcome.


 
 
 
 