Velocity Reviews


Madhu Ramachandran 12-15-2005 06:15 PM

newbie ques
 
i want to look for xml tokens.. like <token> .. the pattern matching doesnt
seem to work when i have < and >.. i tried escaping < and > with a \ , but
no use..

$inp=<STDIN>;
if ($inp =~ /\b\<[a-z]+\>\b/) {
print ("SUCCESS inp has <sometoken> \n");
}

even if i enter <a> as input, i never get SUCCESS printed

Thanks in advance



Paul Lalli 12-15-2005 06:34 PM

Re: newbie ques
 
Madhu Ramachandran wrote:
> Subject: newbie ques


Please put the subject of your post in the Subject of your post. Your
post is not about newbie questions. Your post is about parsing text
which contains < and >. Choose wiser subjects to get better help.

> i want to look for xml tokens.. like <token> .. the pattern matching doesnt
> seem to work when i have < and >.. i tried escaping < and > with a \ , but
> no use..


You've come to a false conclusion. < and > are not special in regular
expressions, and do not need to be escaped.

> $inp=<STDIN>;
> if ($inp =~ /\b\<[a-z]+\>\b/) {


What do you believe those \b markers are doing? I'm willing to bet
that's the root cause of your problem. \b matches a slot between a \w
character and a \W character. < and > are \W characters. Therefore,
in order to match your expression, the thing before the < and after the
> would have to be \w characters (that is, letters, numbers, or underscores). Are they?


> print ("SUCCESS inp has <sometoken> \n");
> }
>
> even if i enter <a> as input, i never get SUCCESS printed


Ah, so the thing before < is a beginning of string, and the thing after
> is an end of string. Neither of those match \w characters. What made you think you wanted or needed to put \b tokens in there to begin with?


#!/usr/bin/perl
use strict;
use warnings;

if ('<a>' =~ /<[a-z]+>/){
print "(1) Match\n";
}

if ('<a>' =~ /\b<[a-z]+>\b/){
print "(2) Match\n";
}

__END__
(1) Match
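
To see the \b behaviour described above on a few more inputs, here is a minimal sketch. The helper name matches_with_b and the sample strings are made up for illustration; only the pattern comes from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 if the
# \b-wrapped pattern from the question matches, 0 otherwise.
sub matches_with_b {
    my ($text) = @_;
    return $text =~ /\b<[a-z]+>\b/ ? 1 : 0;
}

# \b needs a \w character on one side and a \W character (or the
# string edge) on the other. < and > are \W, so the neighbours of
# the brackets must be \w for the pattern to match.
for my $text ('<a>', ' <a> ', 'x<a>y', 'tok<en>s') {
    printf "%-12s %s\n", "'$text'",
        matches_with_b($text) ? 'match' : 'no match';
}
```

Only 'x<a>y' and 'tok<en>s' should report a match, because those are the only inputs with word characters hugging both brackets.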


Please note that for parsing XML-like data, you should not be using
regular expressions at all. Instead, use a module designed to parse
XML. Go to http://search.cpan.org and search for XML::Parser and its
friends.

Paul Lalli


BZ 12-15-2005 06:42 PM

Re: newbie ques
 
Madhu Ramachandran wrote in comp.lang.perl.misc:
> i want to look for xml tokens.. like <token> .. the pattern matching doesnt
> seem to work when i have < and >.. i tried escaping < and > with a \ , but
> no use..
> $inp=<STDIN>;
> if ($inp =~ /\b\<[a-z]+\>\b/) {
> print ("SUCCESS inp has <sometoken> \n");
> }


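In some regex flavors (GNU grep, for instance) \< and \> match word
boundaries, not < and > characters. In Perl, though, they match the
literal characters, so the \b anchors are the likelier culprit.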
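
For Perl specifically, a quick check along these lines suggests that a backslash in front of < or > simply yields the literal character. This is a sketch; the helper name matches_escaped is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the question's pattern with the \b anchors
# removed but the backslashes kept.
sub matches_escaped {
    my ($text) = @_;
    return $text =~ /\<[a-z]+\>/ ? 1 : 0;
}

# In Perl a backslash before a non-word character makes it literal,
# so \< and \> behave exactly like < and > here.
for my $text ('<a>', 'a', 'say <hello> now') {
    printf "%-18s %s\n", "'$text'",
        matches_escaped($text) ? 'match' : 'no match';
}
```

The bare string 'a' should be the only input that fails, which would not be the case if \< and \> were zero-width word-boundary assertions.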

You should know though, that xml parsing by means of regular
expressions can't be done (or at least, not very easily). You really
should be using a real XML parser.

--
BZ

usenet@DavidFilmer.com 12-15-2005 06:57 PM

Re: newbie ques
 
Madhu Ramachandran wrote:
> i want to look for xml tokens..


Then why not use a reliable and robust module that has already been
written for that purpose?

http://search.cpan.org/~podmaster/XM.../TokeParser.pm

Or one of the MANY other XML parsing modules which are freely
available.


Madhu Ramachandran 12-15-2005 09:40 PM

parsing < and > using word boundary pattern anchors
 
I apologize about the subject heading. changed it now and will keep that in
mind in my future posts.

1. About \b,
The book iam using says \b is a "Word-Boundary Pattern Anchor".
eg:
/\bdef/ matches def and defghi, but will not match abcdef
and /\bdef\b/ will match exactly def, not abcdef or defghi

I was able to test the foll and it works..
when i tested $inp =~ /\bget[a-z]*lost\b/

only getlost or getAnyLetterHerelost works.. but agetlost or getlostb does
not work.. so i want to recognize tokens which start with < and end with >,
and ignore any embedded < or >
ie. match <token>, but dont match tok<en>

but can't seem to get it working with /\b<[a-z]+>\b/

2. Regarding not using standard xml parser.. That was my first aim. but
unfortunately iam using a load (OS) where i can't install any packages. its
long story.. cuz the disks used to install this load (is strictly monitored)
and comes with just perl, and they wont let me add any external modules to
the disk. These disks are the ones sent to customer and getting anything
added in there, would need to waddle thru lot of red tape. so, iam just
stuck with core modules of perl.

Iam just playing around to see if i can write my own simplistic xml parser
(non validating) in perl. Worst case i would just use a .properties file
with name value pairs instead of xml. :(


-----------------
"Paul Lalli" <mritty@gmail.com> wrote in message
news:1134671666.121475.180200@g44g2000cwa.googlegroups.com...



robic0@yahoo.com 12-16-2005 12:16 AM

Re: parsing < and > using word boundary pattern anchors
 

Madhu Ramachandran wrote:
> I apologize about the subject heading. changed it now and will keep that in
> mind in my future posts.
>
> 1. About \b,
> The book iam using says \b is a "Word-Boundary Pattern Anchor".
> eg:
> /\bdef/ matches def and defghi, but will not match abcdef
> and /\bdef\b/ will match exactly def, not abcdef or defghi
>
> I was able to test the foll and it works..
> when i tested $inp =~ /\bget[a-z]*lost\b/
>
> only getlost or getAnyLetterHerelost works.. but agetlost or getlostb does
> not work.. so i want to recognize tokens which start with < and end with >,
> and ignore any embedded < or >
> ie. match <token>, but dont match tok<en>
>
> but can't seem to get it working with /\b<[a-z]+>\b/
>
> 2. Regarding not using standard xml parser.. That was my first aim. but
> unfortunately iam using a load (OS) where i can't install any packages. its
> long story.. cuz the disks used to install this load (is strictly monitored)
> and comes with just perl, and they wont let me add any external modules to
> the disk. These disks are the ones sent to customer and getting anything
> added in there, would need to waddle thru lot of red tape. so, iam just
> stuck with core modules of perl.
>
> Iam just playing around to see if i can write my own simplistic xml parser
> (non validating) in perl. Worst case i would just use a .properties file
> with name value pairs instead of xml. :(
>
>
> -----------------
> "Paul Lalli" <mritty@gmail.com> wrote in message
> news:1134671666.121475.180200@g44g2000cwa.googlegroups.com...


I don't think regex have the capabilities to handle nesting too well
especially when it comes to xml. Either way xml has to be streamed.
Just because it can be done to a degree, doesn't mean you can assign
any organization and structure to it within a regex.

Consider:

use strict;
use warnings;

my $cnt = 1;

$_ =
"<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";
print $_,"\n";
while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}

__END__


1 = oftheemergency
2 = thisisatest[1]asdfsaf
3 = maybenot
4 = asd[3]fasdf
5 = thisshou[4]ldprint
6 = outer[2][5]root


Matt Garrish 12-16-2005 12:35 AM

Re: parsing < and > using word boundary pattern anchors
 

<robic0@yahoo.com> wrote in message
news:1134692174.761339.166980@g49g2000cwa.googlegroups.com...
>
> I don't think regex have the capabilities to handle nesting too well
> especially when it comes to xml. Either way xml has to be streamed.
> Just because it can be done to a degree, doesn't mean you can assign
> any organization and structure to it within a regex.
>
> Consider:
>
> use strict;
> use warnings;
>
> my $cnt = 1;
>
> $_ =
> "<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";
> print $_,"\n";
> while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}
>


Please go back and learn some more about xml before posting. Where did you
get the idea that it is possible to nest tags within tags in xml? You can
use CDATA sections if you don't want to use entities for < in text nodes,
but that still leaves your above example as nothing but a lot of garbage.
You couldn't even get away with nesting < inside attributes. XML is smarter
than you.

Matt



Tad McClellan 12-16-2005 01:17 AM

Re: parsing < and > using word boundary pattern anchors
 
Madhu Ramachandran <madhuram@nortel.com> wrote:

> I apologize about the subject heading.



Have you seen the Posting Guidelines that are posted here frequently?


> 1. About \b,


> I was able to test the foll and it works..



Does "foll" mean "following"?

Please use proper spelling, not doing so is inconsiderate of
those of us who don't have English as a first language.


> so i want to recognize tokens which start with < and end with >,
> and ignore any embedded < or >



Those characters are not allowed to be so embedded in XML.

If you change your requirement to: start with < and end with >,
and disallow embedded < and >, then:

/<[^<>]+>/
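
As a rough illustration of that pattern in use, the sketch below collects every bracketed run from a string. The helper name find_tags and the sample input are assumptions made for the example, not something from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every substring that starts with <,
# ends with >, and has no < or > in between.
sub find_tags {
    my ($text) = @_;
    my @tags = $text =~ /<[^<>]+>/g;   # /g in list context returns all matches
    return @tags;
}

my @found = find_tags('a <token> b <other tag="x"> tok<en> </token>');
print "$_\n" for @found;
```

Note that this pattern on its own still picks up <en> out of tok<en>; it only rules out brackets nested inside brackets, not a word character in front of the opening bracket.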


> ie. match <token>, but dont match tok<en>



So you want to match a \W character before <, and a \W
character after >.


> but can't seem to get it working with /\b<[a-z]+>\b/



You need the "anti \b", which matches between \W\W or
between \w\w:

/\B<[a-z]+>\B/
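
A small sketch of how that \B version behaves on the inputs from the question. The helper name matches_nonboundary is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the \B-wrapped pattern matches.
sub matches_nonboundary {
    my ($text) = @_;
    return $text =~ /\B<[a-z]+>\B/ ? 1 : 0;
}

# \B succeeds where \b would fail: between two \W characters, between
# two \w characters, or between a \W character and the string edge.
for my $text ('<token>', ' <token> ', 'tok<en>', '<token>s') {
    printf "%-12s %s\n", "'$text'",
        matches_nonboundary($text) ? 'match' : 'no match';
}
```

Here '<token>' and ' <token> ' should match, while 'tok<en>' and '<token>s' should not, since each has a word character directly against one of the brackets.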


> 2. Regarding not using standard xml parser..


[snip]

> so, iam just
> stuck with core modules of perl.



Don't call your data "XML" if it does not comply with the
specifications that define XML.

"XML-like" would be more accurate.

If you call it "XML" then your code must do the Right Thing when
it encounters data like this:

<!-- <not_a_tag> -->

for example.
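
One way an "angle brackety parser" might cope with that particular case is to drop comments before looking for tags. This is only a sketch under that assumption: the helper name tags_outside_comments is invented, and CDATA sections, processing instructions, and attribute values containing > are not handled at all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: strip <!-- ... --> comments, then collect
# whatever bracketed runs remain.
sub tags_outside_comments {
    my ($text) = @_;
    # /s lets . cross newlines; .*? stops at the first closing -->
    (my $stripped = $text) =~ s/<!--.*?-->//gs;
    my @tags = $stripped =~ /<[^<>]+>/g;
    return @tags;
}

my @found = tags_outside_comments("<a><!-- <not_a_tag> --><b>\n");
print "$_\n" for @found;
```

Without the stripping step, the same tag pattern would report <not_a_tag> as a tag, which is exactly the wrong answer for real XML.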


> Iam just playing around to see if i can write my own simplistic xml parser



Stop calling it XML if it isn't really XML.

So, you want to make your own "angle brackety parser". :-)


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

Tad McClellan 12-16-2005 04:34 AM

Re: parsing < and > using word boundary pattern anchors
 
robic0@yahoo.com <robic0@yahoo.com> wrote:


> I don't think regex have the capabilities to handle nesting too well
> especially when it comes to xml.



> "<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";



That is not XML.


--
Tad McClellan SGML consulting
tadmc@augustmail.com Perl programming
Fort Worth, Texas

robic0@yahoo.com 12-16-2005 11:00 PM

Re: parsing < and > using word boundary pattern anchors
 

Tad McClellan wrote:
> robic0@yahoo.com <robic0@yahoo.com> wrote:
>
>
> > I don't think regex have the capabilities to handle nesting too well
> > especially when it comes to xml.

>
>
> > "<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";

>
>
> That is not XML.
>
>
> --
> Tad McClellan SGML consulting
> tadmc@augustmail.com Perl programming
> Fort Worth, Texas


Sure it is, you just didn't do the substitution. Do I have to do
everything for you?

use strict;
use warnings;

my $cnt = 1;

$_ =
"<<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>>";
print $_,"\n";
while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}


my $gabage1 =
"<T7><T6>outer<T2>thisisatest<T1>oftheemergency</T1>asdfsaf</T2><T5><T0></T0>this
shou<T4>asd<T3>maybenot</T3>fasdf</T4>ldprint<Z0/></T5>root</T6></T7>";

my $gabage2 =
"<outer>asdf<in1><in2>jjjj</in2><in3>asbefas</in3></in1>asdfb</outer>";

my @xml_ary = ($gabage1, $gabage2);

for (@xml_ary) {
$cnt = 1;
print "\n$_\n\n";
while (s/<([0-9a-zA-Z]+)\/>/[$cnt]/) { print "$cnt <$1> = \n"; $cnt++}
while (s/<([\[\]0-9a-zA-Z]+)>([^<]*)<\/\1>/[$cnt]/) { print "$cnt <$1> = $2\n"; $cnt++}
}

__END__


<<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>>
1 = oftheemergency
2 = thisisatest[1]asdfsaf
3 = maybenot
4 = asd[3]fasdf
5 = thisshou[4]ldprint
6 = outer[2][5]root
7 = [6]

<T7><T6>outer<T2>thisisatest<T1>oftheemergency</T1>asdfsaf</T2><T5><T0></T0>this
shou<T4>asd<T3>maybenot</T3>fasdf</T4>ldprint<Z0/></T5>root</T6></T7>

1 <Z0> =
2 <T1> = oftheemergency
3 <T2> = thisisatest[2]asdfsaf
4 <T0> =
5 <T3> = maybenot
6 <T4> = asd[5]fasdf
7 <T5> = [4]this
shou[6]ldprint[1]
8 <T6> = outer[3][7]root
9 <T7> = [8]

<outer>asdf<in1><in2>jjjj</in2><in3>asbefas</in3></in1>asdfb</outer>

1 <in2> = jjjj
2 <in3> = asbefas
3 <in1> = [1][2]
4 <outer> = asdf[3]asdfb


