Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > I don't know what's wrong here !

Reply
Thread Tools

I don't know what's wrong here !

 
 
Wes Groleau
Guest
Posts: n/a
 
      12-26-2003
I have a script to process a certain file format.
It was working at one time, but it doesn't now.
Obviously I changed something important, but
I have no memory of doing so, nor can I see anything wrong.

The input format: Each line has up to four parts:
Level, ID, Tag, and Text; separated by white space.

White space is optional _before_ the level.

ID is optional; if present, it is the second part.
It is printable characters, starting and ending
with '@'

Tag is always there. It is a single "word" of
letters, no white space. For simplicity, I am
going for any non-whitespace characters.

Text is optional and is everything after the
tag.

Here is what is not working:

@lines = <STDIN>; # Read all lines into array @lines
foreach (@lines) # Each line is placed into predefined scalar $_
{
# Get rid of CR/LF chars for originating platform.
# (So this platform's style can be put back on later.)
chomp;
s/[\r\n]//;

($Level, $ID, $Tag, $XRef, $Text, $Comment) = ("", "", "", "", "", "");

# Get the component parts of the line.
# Can UTF-8 input break this regexp???
#($Level, $ID, $Tag, $XRef, $Text, $Comment) =
# --- ----- --- ----- -- --
#/^\s*(\d+)\s+(@\S+@)?\s*(\S+)s+(@\S+@)?\s+(.*)?({{. *}})?/;
# 1 2 3 4 5 6

# Let's just use four (standard GEDCOM) for now.
# 1 2 3 4
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

# Get level of subrecord
if ( ! defined ( $1 ) )
{
$Level = 0;
$LineEOL = "NO LEVEL: $_\n";
}
else
{
$Level = $1;
}
print $Level . "\n"; # This part (Level) seems to work

# Save the label (if specified)
if ( ! defined ( $2 ) )
{
$ID = "";
}
else
{
$ID = $2;
}
print $ID . "\n";

# Uppercase the tag
if ( ! defined ( $3 ) )
{
$Tag = "";
$LineEOL = "NO TAG: $_\n";
}
else
{
$Tag = $3;
}
print $Tag . "\n"; # When input line is "0 HEAD" this should be "HEAD"
# but it is blank instead and the diagnostic "NO TAG"
# is added.

# Save everything else
if ( ! defined ( $4 ) )
{
$Text = "";
}
else
{
$Text = $4;
}
print $Text . "\n";

----------

In the form above, ID and Text are unknown, since the script
wants to open and write to a file with name derived from $Tag.
So it blows up on the first line "0 HEAD"

--
Wes Groleau
"Lewis's case for the existence of God is fallacious."
"You mean like circular reasoning?"
"He believes in God. Therefore, he's fallacious."

 
Reply With Quote
 
 
 
 
Ragnar Hafsta
Guest
Posts: n/a
 
      12-26-2003
"Wes Groleau" <> wrote in message
news:...
>

snipped verbose description of problem

in case like this you should make a short script only containing the part
that seems to be failing
and usually the problem will become clear

> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;


look at this a bit:
(your input is '0 HEAD')
^\s* zero or more whitespace. ok
(\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
\s+ one or more ws. matches ' ', rest is 'HEAD'
(@\S+@)? maches ''
\s* matches ''
(\S+) matches 'HEAD' rest is ''
\s+ does not match . oooooops
(.*)/;

maybe your previous input contained space at the end of the '0 HEAD' line

gnari



 
Reply With Quote
 
 
 
 
John W. Krahn
Guest
Posts: n/a
 
      12-27-2003
Wes Groleau wrote:
>
> I have a script to process a certain file format.
> It was working at one time, but it doesn't now.
> Obviously I changed something important, but
> I have no memory of doing so, nor can I see anything wrong.
>
> The input format: Each line has up to four parts:
> Level, ID, Tag, and Text; separated by white space.
>
> White space is optional _before_ the level.
>
> ID is optional; if present, it is the second part.
> It is printable characters, starting and ending
> with '@'
>
> Tag is always there. It is a single "word" of
> letters, no white space. For simplicity, I am
> going for any non-whitespace characters.
>
> Text is optional and is everything after the
> tag.
>
> Here is what is not working:


use warnings;
use strict;

> @lines = <STDIN>; # Read all lines into array @lines
> foreach (@lines) # Each line is placed into predefined scalar $_
> {


Do you really need to slurp the entire input into an array?


> # Get rid of CR/LF chars for originating platform.
> # (So this platform's style can be put back on later.)
> chomp;
> s/[\r\n]//;


chomp() removes the contents of $/ from the end of $_. s/[\r\n]//
removes the FIRST occurrence of either \r or \n from $_. If you want to
remove any trailing whitespace including \r and \n then:

s/\s+\Z//;


> ($Level, $ID, $Tag, $XRef, $Text, $Comment) = ("", "", "", "", "", "");
>
> # Get the component parts of the line.
> # Can UTF-8 input break this regexp???
> #($Level, $ID, $Tag, $XRef, $Text, $Comment) =
> # --- ----- --- ----- -- --
> #/^\s*(\d+)\s+(@\S+@)?\s*(\S+)s+(@\S+@)?\s+(.*)?({{. *}})?/;
> # 1 2 3 4 5 6
>
> # Let's just use four (standard GEDCOM) for now.
> # 1 2 3 4
> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;


Why not just assign directly to your variables? That way you can
eliminate all the following else clauses. Also, the @ characters have
to be escaped or perl will try to interpolate them as arrays.

my ( $Level, $ID, $Tag, $Text ) =
/^\s*(\d+)\s*(\@\S+\@)?\s*(\S+)\s*(.*)/;

# Get level of subrecord
if ( not defined $Level )
{
$Level = 0;
$LineEOL = "NO LEVEL: $_\n";
}
print "$Level\n"; # This part (Level) seems to work

# Save the label (if specified)
if ( not defined $ID )
{
$ID = '';
}
print "$ID\n";

# Uppercase the tag
if ( not defined $Tag )
{
$Tag = '';
$LineEOL = "NO TAG: $_\n";
}
print "$Tag\n"; # When input line is "0 HEAD" this should be "HEAD"
# but it is blank instead and the diagnostic "NO TAG"
# is added.

# Save everything else
if ( not defined $Text )
{
$Text = '';
}
print "$Text\n";



John
--
use Perl;
program
fulfillment
 
Reply With Quote
 
Gunnar Hjalmarsson
Guest
Posts: n/a
 
      12-27-2003
Wes Groleau wrote:
>
> Text is optional and is everything after the tag.


That condition is not reflected in this regex, which requires one or
more whitespace characters after Tag, or else it won't capture the
data as expected:

> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

---------------------------------^^^

It's better written as:

/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
---------------------------------^^^-------^^

Other comments:

You should declare your variables and run your script with strictures
and warnings enabled.

use strict;
use warnings;

> Here is what is not working:


That's a pointless description of your problem, isn't it? You'd better
explain what you expect the script to output, what it actually
outputs, and which error and warning messages you receive (if any).

> @lines = <STDIN>; # Read all lines into array @lines


Are you really using the STDIN filehandle for reading the file? Since
STDIN is a special filehandle, you should use some other name.

Basically I think your code can be shortened to:

while (<FILE>) {
chomp;
my ($Level, $ID, $Tag, $Text) =
/^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
$Level ||= 0;
$ID ||= '';
$Text ||= '';
print "$Level\n$ID\n$Tag\n$Text\n";
}

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 
Reply With Quote
 
Wes Groleau
Guest
Posts: n/a
 
      12-27-2003
"Ragnar Hafsta�������������� �����" wrote:
>> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

>
> look at this a bit:
> (your input is '0 HEAD')
> ^\s* zero or more whitespace. ok
> (\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
> \s+ one or more ws. matches ' ', rest is 'HEAD'
> (@\S+@)? maches ''
> \s* matches ''
> (\S+) matches 'HEAD' rest is ''
> \s+ does not match . oooooops
> (.*)/;
>
> maybe your previous input contained space at the end of the '0 HEAD' line


OK, I see that. Thanks. Maybe there was a space.

However, if the regexp doesn't match, why was
it able to get the $Level and leave the other
parts undefined?

--
Wes Groleau
Alive and Well
http://freepages.religions.rootsweb.com/~wgroleau/

 
Reply With Quote
 
Wes Groleau
Guest
Posts: n/a
 
      12-27-2003
John W. Krahn wrote:
> Do you really need to slurp the entire input into an array?


Not really. I have used foreach (<>) before
I got this from someone else, and just kept that part.
But the original regexp didn't work. I made it work
but now I've broken it again.

> chomp() removes the contents of $/ from the end of $_. s/[\r\n]//
> removes the FIRST occurrence of either \r or \n from $_. If you want to
> remove any trailing whitespace including \r and \n then:
>
> s/\s+\Z//;


OK, thanks. That wasn't a problem in this particular
environment, but it's nice to know.

>> # 1 2 3 4
>> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

>
> Why not just assign directly to your variables? That way you can
> eliminate all the following else clauses. Also, the @ characters have
> to be escaped or perl will try to interpolate them as arrays.
>
> my ( $Level, $ID, $Tag, $Text ) =
> /^\s*(\d+)\s*(\@\S+\@)?\s*(\S+)\s*(.*)/;


That was tried, but there was some problem I don't remember.
However, the '@' has always worked OK. But escaping them
won't hurt--I remember the error messages from that in other
contexts.

So you have helped me quite a bit (thanks), but I still
don't understand how the regexp as written was able to
define the $Level but not the $Tag

--
Wes Groleau
-----------
Daily Hoax: http://www.snopes2.com/cgi-bin/random/random.asp

 
Reply With Quote
 
Wes Groleau
Guest
Posts: n/a
 
      12-27-2003
Gunnar Hjalmarsson wrote:
>> Text is optional and is everything after the tag.

>
> That condition is not reflected in this regex, which requires one or
> more whitespace characters after Tag, or else it won't capture the
> data as expected:
>
>> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

>
> ---------------------------------^^^


Sorry, I was unclear. The only way to determine
the end of the Tag is by whitespace. But as someone
else pointed out, if there is no text, then there may
be a line-end instead of white space. Perhaps this
is what you are saying?

> It's better written as:
>
> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
> ---------------------------------^^^-------^^


So here, $4 matches the _inner_ parentheses?
I don't know what ?: means in this context.


> That's a pointless description of your problem, isn't it? You'd better
> explain what you expect the script to output, what it actually
> outputs, and which error and warning messages you receive (if any).


Well, I explained that $1 or $Level gets defined,
$2 ($ID) and $4 ($Text) do not (but weren't expected to
on the first line), and $3 ($Tag) did not which was a
surprise and prevented reading any other lines.
After finding the fatal error of $Tag undefined,
it seemed pointless to quote the remaining two pages
of the script.

>> @lines = <STDIN>; # Read all lines into array @lines

>
> Are you really using the STDIN filehandle for reading the file? Since
> STDIN is a special filehandle, you should use some other name.


Yes, I pipe cat (file) into the script. I figured
I should get the output right before improving the UI.

> Basically I think your code can be shortened to:
>
> while (<FILE>) {
> chomp;


Oh, I definitely need to convert line ends on files
from other platforms. (Mac/Win/Unix) But as John
pointed out, I did it wrong.

> my ($Level, $ID, $Tag, $Text) =
> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;


I'll give this a try, along with the other two guys' suggestions.

> $Level ||= 0;
> $ID ||= '';
> $Text ||= '';
> print "$Level\n$ID\n$Tag\n$Text\n";


OK, but I definitely want to keep the error reporting
if Level or Tag is missing.

Thanks to all three responders for the additional understanding.

For the curious, what I am doing is reformatting a GEDCOM
file, AND splitting it into multiple files. Have to
open a new file every time a Level = 0 line comes in,
with the filename determined by parts of the line.

But all of that is later in the script, and is meaningless
if Tag is undefined or missing.

I will experiment with what you three guys have offered,
and I appreciate it very much. Thanks.

--
Wes Groleau

Nobody believes a theoretical analysis -- except the guy who did it.
Everybody believes an experimental analysis -- except the guy who did it.
-- Unknown

 
Reply With Quote
 
Jay Tilton
Guest
Posts: n/a
 
      12-27-2003
Wes Groleau <> wrote:
: "Ragnar Hafsta" wrote:
: >> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;
: >
: > look at this a bit:
: > (your input is '0 HEAD')
: > ^\s* zero or more whitespace. ok
: > (\d+) one ore more digits. ok matches '0', the rest is ' HEAD'
: > \s+ one or more ws. matches ' ', rest is 'HEAD'
: > (@\S+@)? maches ''
: > \s* matches ''
: > (\S+) matches 'HEAD' rest is ''
: > \s+ does not match . oooooops
: > (.*)/;
: >
: > maybe your previous input contained space at the end of the '0 HEAD' line
:
: OK, I see that. Thanks. Maybe there was a space.
:
: However, if the regexp doesn't match, why was
: it able to get the $Level and leave the other
: parts undefined?

You are probably seeing a stale value that was captured from a prior
iteration. The values in $1, $2, etc. are not reset when a match fails,
and the code does not verify that the match is successful before using
those variables.

 
Reply With Quote
 
Gunnar Hjalmarsson
Guest
Posts: n/a
 
      12-27-2003
Wes Groleau wrote:
> Gunnar Hjalmarsson wrote:
>> Wes Groleau wrote:
>>>
>>> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)\s+(.*)/;

>> --------------------------------^^^

>
> Sorry, I was unclear. The only way to determine the end of the Tag
> is by whitespace. But as someone else pointed out, if there is no
> text, then there may be a line-end instead of white space. Perhaps
> this is what you are saying?


Yes, that is what I mean.

>> It's better written as:
>>
>> /^\s*(\d+)\s+(@\S+@)?\s*(\S+)(?:\s+(.+))?/;
>> ---------------------------------^^^-------^^

>
> So here, $4 matches the _inner_ parentheses?


Yes.

> I don't know what ?: means in this context.


http://www.perldoc.com/perl5.8.0/pod...28--pattern%29

>> Basically I think your code can be shortened to:
>>
>> while (<FILE>) {
>> chomp;

>
> Oh, I definitely need to convert line ends on files from other
> platforms. (Mac/Win/Unix) But as John pointed out, I did it
> wrong.


I just thought that that is better handled when transferring between
file systems (using "ASCII" or "Text" transfer mode). But maybe it's
wise to explicitly remove all trailing whitespace as well...

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

 
Reply With Quote
 
Lāʻie Techie
Guest
Posts: n/a
 
      12-27-2003
On Fri, 26 Dec 2003 23:48:24 -0500, Wes Groleau wrote:

> For the curious, what I am doing is reformatting a GEDCOM file, AND
> splitting it into multiple files. Have to open a new file every time a
> Level = 0 line comes in, with the filename determined by parts of the
> line.


That's what I thought when I read your original post. BTW, why reinvent
the wheel? Download the Gedcom module from CPAN. I'm sure that module
does most of what you want.

Hau'oli Makahiki Hou!
La'ie Techie

 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
FAQ Topic - I have a question that is not answered in here or in any of the resources mentioned here but I'm sure it has been answered in clj. Where are the clj archives located? FAQ server Javascript 0 10-06-2006 11:00 PM
FAQ Topic - I have a question that is not answered in here or in any of the resources mentioned here but I'm sure it has been answered in clj. Where are the clj archives located? FAQ server Javascript 0 08-09-2006 11:00 PM
routing problems with multiple T1s - I know whats wrong but dont know how to fix it Jason Cisco 5 03-10-2006 01:03 AM
Try over here likely more to the point here George Hester Javascript 0 09-30-2004 03:14 AM
I know, I know, I don't know Andries Perl Misc 3 04-23-2004 02:17 AM



Advertisments