
newbie ques

 
 
Madhu Ramachandran
Guest
Posts: n/a
 
      12-15-2005
i want to look for xml tokens.. like <token> .. the pattern matching doesnt
seem to work when i have < and >.. i tried escaping < and > with a \ , but
no use..

$inp=<STDIN>;
if ($inp =~ /\b\<[a-z]+\>\b/) {
print ("SUCCESS inp has <sometoken> \n");
}

even if i enter <a> as input, i never get SUCCESS printed

Thanks in advance


 
 
 
 
 
Paul Lalli
Guest
Posts: n/a
 
      12-15-2005
Madhu Ramachandran wrote:
> Subject: newbie ques


Please put the subject of your post in the Subject of your post. Your
post is not about newbie questions. Your post is about parsing text
which contains < and >. Choose wiser subjects to get better help.

> i want to look for xml tokens.. like <token> .. the pattern matching doesnt
> seem to work when i have < and >.. i tried escaping < and > with a \ , but
> no use..


You've come to a false conclusion. < and > are not special in regular
expressions, and do not need to be escaped.

> $inp=<STDIN>;
> if ($inp =~ /\b\<[a-z]+\>\b/) {


What do you believe those \b markers are doing? I'm willing to bet
that's the root cause of your problem. \b matches a slot between a \w
character and a \W character. < and > are \W characters. Therefore,
in order to match your expression, the thing before the < and the thing
after the > would have to be \w characters (that is, letters, numbers,
or underscores). Are they?


> print ("SUCCESS inp has <sometoken> \n");
> }
>
> even if i enter <a> as input, i never get SUCCESS printed


Ah, so the thing before < is the beginning of the string, and the thing
after > is the end of the string. Neither of those is a \w character.
What made you think you wanted or needed to put \b tokens in there to
begin with?


#!/usr/bin/perl
use strict;
use warnings;

if ('<a>' =~ /<[a-z]+>/){
print "(1) Match\n";
}

if ('<a>' =~ /\b<[a-z]+>\b/){
print "(2) Match\n";
}

__END__
(1) Match
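
To see why (2) fails, it can help to list the offsets at which \b and \B actually succeed inside '<a>'. The following is only an illustrative sketch; the helper name assertion_offsets is made up for this example and is not a Perl built-in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every offset in $str at which the zero-width assertion $re succeeds.
# (assertion_offsets is a hypothetical helper written for this illustration.)
sub assertion_offsets {
    my ($str, $re) = @_;
    my @offsets;
    for my $pos (0 .. length $str) {
        # Skip exactly $pos characters from the start, then test the assertion.
        push @offsets, $pos if $str =~ /\A.{$pos}$re/s;
    }
    return @offsets;
}

print "\\b in '<a>': @{[ assertion_offsets('<a>', qr/\b/) ]}\n";
print "\\B in '<a>': @{[ assertion_offsets('<a>', qr/\B/) ]}\n";
```

On '<a>' this reports \b at offsets 1 and 2 (either side of the a) and \B at offsets 0 and 3, so a \b placed before the < or after the > has nowhere to match.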


Please note that for parsing XML-like data, you should not be using
regular expressions at all. Instead, use a module designed to parse
XML. Go to http://search.cpan.org and search for XML::Parser and its
friends.

Paul Lalli

 
 
 
 
 
BZ
Guest
Posts: n/a
 
      12-15-2005
Madhu Ramachandran wrote in comp.lang.perl.misc:
> i want to look for xml tokens.. like <token> .. the pattern matching doesnt
> seem to work when i have < and >.. i tried escaping < and > with a \ , but
> no use..
> $inp=<STDIN>;
> if ($inp =~ /\b\<[a-z]+\>\b/) {
> print ("SUCCESS inp has <sometoken> \n");
> }


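In some regex flavors (GNU grep, for example) \< and \> match word
boundaries, not < and > characters. In Perl they are just literal
< and > characters, so it is the \b anchors that break your match.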

You should know though, that xml parsing by means of regular
expressions can't be done (or at least, not very easily). You really
should be using a real XML parser.

--
BZ
 
 
usenet@DavidFilmer.com
Guest
Posts: n/a
 
      12-15-2005
Madhu Ramachandran wrote:
> i want to look for xml tokens..


Then why not use a reliable and robust module that has already been
written for that purpose?

http://search.cpan.org/~podmaster/XM.../TokeParser.pm

Or one of the MANY other XML parsing modules which are freely
available.

 
 
Madhu Ramachandran
Guest
Posts: n/a
 
      12-15-2005
I apologize about the subject heading. changed it now and will keep that in
mind in my future posts.

1. About \b,
The book iam using says \b is a "Word-Boundary Pattern Anchor".
eg:
/\bdef/ matches def and defghi, but will not match abcdef
and /\bdef\b/ will match exactly def, not abcdef or defghi

I was able to test the foll and it works..
when i tested $inp =~ /\bget[a-z]*lost\b/

only getlost or getAnyLetterHerelost works.. but agetlost or getlostb does
not work.. so i want to recognize tokens which start with < and end with >,
and ignore any embedded < or >
ie. match <token>, but dont match tok<en>

but can't seem to get it working with /\b<[a-z]+>\b/

2. Regarding not using standard xml parser.. That was my first aim. but
unfortunately iam using a load (OS) where i can't install any packages. its
long story.. cuz the disks used to install this load (is strictly monitored)
and comes with just perl, and they wont let me add any external modules to
the disk. These disks are the ones sent to customer and getting anything
added in there, would need to waddle thru lot of red tape. so, iam just
stuck with core modules of perl.

Iam just playing around to see if i can write my own simplistic xml parser
(non validating) in perl. Worst case i would just use a .properties file
with name value pairs instead of xml.
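
For the name/value fallback, a core-only reader could look roughly like the sketch below. It assumes a deliberately simplified format (one name=value per line, # for comments), which is much less than what real .properties files allow; parse_properties is a hypothetical name chosen for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse simplified "name=value" text into a hash.
# Assumption: no line continuations, no escapes, no ':' separators.
sub parse_properties {
    my ($text) = @_;
    my %prop;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        if ($line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/) {
            $prop{$1} = $2;                # later duplicates overwrite earlier ones
        }
    }
    return %prop;
}

my %prop = parse_properties("# sample\nhost = example.org\nport=8080\n");
print "$_ = $prop{$_}\n" for sort keys %prop;
```

Lines that do not look like name=value are silently ignored here; a stricter version might warn about them instead.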


-----------------
"Paul Lalli" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...


 
 
robic0@yahoo.com
Guest
Posts: n/a
 
      12-16-2005

Madhu Ramachandran wrote:
> I apologize about the subject heading. changed it now and will keep that in
> mind in my future posts.
>
> 1. About \b,
> The book iam using says \b is a "Word-Boundary Pattern Anchor".
> eg:
> /\bdef/ matches def and and defghi, but will not match abcdef
> and /\bdef\b/ will match exactly def, not abcdef or defghi
>
> I was able to test the foll and it works..
> when i tested $inp =~ /\bget[a-z]*lost\b/
>
> only getlost or getAnyLetterHerelost works.. but agetlost or getlostb does
> not work.. so i want to recognize tokens which start with < and end with >,
> and ignore any embedded < or >
> ie. match <token>, but dont match tok<en>
>
> but can't seem to get it working with /\b<[a-z]+>\b/
>
> 2. Regarding not using standard xml parser.. That was my first aim. but
> unfortunately iam using a load (OS) where i can't install any packages. its
> long story.. cuz the disks used to install this load (is strictly monitored)
> and comes with just perl, and they wont let me add any external modules to
> the disk. These disks are the ones sent to customer and getting anything
> added in there, would need to waddle thru lot of red tape. so, iam just
> stuck with core modules of perl.
>
> Iam just playing around to see if i can write my own simplistic xml parser
> (non validating) in perl. Worst case i would just use a .properties file
> with name value pairs instead of xml.
>
>
> -----------------
> "Paul Lalli" <(E-Mail Removed)> wrote in message
> news:(E-Mail Removed) oups.com...


I don't think regex have the capabilities to handle nesting too well
especially when it comes to xml. Either way xml has to be streamed.
Just because it can be done to a degree, doesn't mean you can assign
any organization and structure to it within a regex.

Consider:

use strict;
use warnings;

my $cnt = 1;

$_ =
"<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";
print $_,"\n";
while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}

__END__


1 = oftheemergency
2 = thisisatest[1]asdfsaf
3 = maybenot
4 = asd[3]fasdf
5 = thisshou[4]ldprint
6 = outer[2][5]root

 
 
Matt Garrish
Guest
Posts: n/a
 
      12-16-2005

<(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
>
> I don't think regex have the capabilities to handle nesting too well
> especially when it comes to xml. Either way xml has to be streamed.
> Just because it can be done to a degree, doesen't mean you can assign
> any organization and structure to it within a regex.
>
> Consider:
>
> use strict;
> use warnings;
>
> my $cnt = 1;
>
> $_ =
> "<outer<thisisatest<oftheemergency>asdfsaf><thissh ou<asd<maybenot>fasdf>ldprint>root>";
> print $_,"\n";
> while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}
>


Please go back and learn some more about xml before posting. Where did you
get the idea that it is possible to nest tags within tags in xml? You can
use CDATA sections if you don't want to use entities for < in text nodes,
but that still leaves your above example as nothing but a lot of garbage.
You couldn't even get away with nesting < inside attributes. XML is smarter
than you.

Matt


 
 
Tad McClellan
Guest
Posts: n/a
 
      12-16-2005
Madhu Ramachandran <(E-Mail Removed)> wrote:

> I apologize about the subject heading.



Have you seen the Posting Guidelines that are posted here frequently?


> 1. About \b,


> I was able to test the foll and it works..



Does "foll" mean "following"?

Please use proper spelling; not doing so is inconsiderate of
those of us who don't have English as a first language.


> so i want to recognize tokens which start with < and end with >,
> and ignore any embedded < or >



Those characters are not allowed to be so embedded in XML.

If you change your requirement to: start with < and end with >,
and disallow embedded < and >, then:

/<[^<>]+>/
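
As a rough sketch of how that pattern behaves when run over a whole line (the sub name angle_tokens and the sample line are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the contents of every <...> run that has no embedded < or >.
# Intended to be called in list context.
sub angle_tokens {
    my ($line) = @_;
    return $line =~ /<([^<>]+)>/g;
}

my $line = 'pre <token> mid <a b="1"> tok<en> <> end';
print "$_\n" for angle_tokens($line);
```

Note that this still picks up the en in tok<en>, since nothing in the pattern looks at what precedes the <, and that the empty <> is skipped because + needs at least one character.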


> ie. match <token>, but dont match tok<en>



So you want to match a \W character before <, and a \W
character after >.


> but can't seem to get it working with /\b<[a-z]+>\b/



You need the "anti \b", which matches between \W\W or
between \w\w:

/\B<[a-z]+>\B/
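
A quick way to check that pattern against a few inputs, as a sketch only (is_standalone_token is a made-up name, and the test strings are just examples):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True (1) if $s contains <lowercase-letters> with no word character
# directly before the < or directly after the >.
sub is_standalone_token {
    my ($s) = @_;
    return $s =~ /\B<[a-z]+>\B/ ? 1 : 0;
}

for my $s ('<token>', 'tok<en>', '<token>x', 'say <token> now') {
    printf "%-16s %s\n", $s, is_standalone_token($s) ? 'match' : 'no match';
}
```

Of these, '<token>' and 'say <token> now' match while 'tok<en>' and '<token>x' do not. Keep in mind that \B only rules out adjacent \w characters, so something like -<token>- would still match.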


> 2. Regarding not using standard xml parser..


[snip]

> so, iam just
> stuck with core modules of perl.



Don't call your data "XML" if it does not comply with the
specifications that define XML.

"XML-like" would be more accurate.

If you call it "XML" then your code must do the Right Thing when
it encounters data like this:

<!-- <not_a_tag> -->

for example.
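
One hedged way to cope with that particular case in an "angle brackety" scanner is to drop comments before looking for tags. This is a sketch under that single assumption, not an XML parser; tags_outside_comments is a hypothetical name, and CDATA sections, processing instructions, and > inside attribute values are all still unhandled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove <!-- ... --> comments, then return the contents of each
# remaining <...> run. Works on a copy; the caller's string is untouched.
sub tags_outside_comments {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//gs;    # non-greedy; /s lets a comment span lines
    return $text =~ /<([^<>]+)>/g;
}

my $doc = "<a><!-- <not_a_tag> --></a>";
print "$_\n" for tags_outside_comments($doc);
```

With the sample string above only a and /a are reported; not_a_tag is gone along with the comment.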


> Iam just playing around to see if i can write my own simplistic xml parser



Stop calling it XML if it isn't really XML.

So, you want to make your own "angle brackety parser".


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
Tad McClellan
Guest
Posts: n/a
 
      12-16-2005
(E-Mail Removed) <(E-Mail Removed)> wrote:


> I don't think regex have the capabilities to handle nesting too well
> especially when it comes to xml.



> "<outer<thisisatest<oftheemergency>asdfsaf><thissh ou<asd<maybenot>fasdf>ldprint>root>";



That is not XML.


--
Tad McClellan SGML consulting
(E-Mail Removed) Perl programming
Fort Worth, Texas
 
 
robic0@yahoo.com
Guest
Posts: n/a
 
      12-16-2005

Tad McClellan wrote:
> (E-Mail Removed) <(E-Mail Removed)> wrote:
>
>
> > I don't think regex have the capabilities to handle nesting too well
> > especially when it comes to xml.

>
>
> > "<outer<thisisatest<oftheemergency>asdfsaf><thissh ou<asd<maybenot>fasdf>ldprint>root>";

>
>
> That is not XML.
>
>
> --
> Tad McClellan SGML consulting
> (E-Mail Removed) Perl programming
> Fort Worth, Texas


Sure it is, you just didn't do the substitution. Do I have to do
everything for you?

use strict;
use warnings;

my $cnt = 1;

$_ =
"<<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>>";
print $_,"\n";
while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}


my $gabage1 =
"<T7><T6>outer<T2>thisisatest<T1>oftheemergency</T1>asdfsaf</T2><T5><T0></T0>this
shou<T4>asd<T3>maybenot</T3>fasdf</T4>ldprint<Z0/></T5>root</T6></T7>";

my $gabage2 =
"<outer>asdf<in1><in2>jjjj</in2><in3>asbefas</in3></in1>asdfb</outer>";

my @xml_ary = ($gabage1, $gabage2);

for (@xml_ary) {
$cnt = 1;
print "\n$_\n\n";
while (s/<([0-9a-zA-Z]+)\/>/[$cnt]/) { print "$cnt <$1> = \n"; $cnt++}
while (s/<([\[\]0-9a-zA-Z]+)>([^<]*)<\/\1>/[$cnt]/) { print "$cnt <$1> = $2\n"; $cnt++}
}

__END__


<<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>>
1 = oftheemergency
2 = thisisatest[1]asdfsaf
3 = maybenot
4 = asd[3]fasdf
5 = thisshou[4]ldprint
6 = outer[2][5]root
7 = [6]

<T7><T6>outer<T2>thisisatest<T1>oftheemergency</T1>asdfsaf</T2><T5><T0></T0>this
shou<T4>asd<T3>maybenot</T3>fasdf</T4>ldprint<Z0/></T5>root</T6></T7>

1 <Z0> =
2 <T1> = oftheemergency
3 <T2> = thisisatest[2]asdfsaf
4 <T0> =
5 <T3> = maybenot
6 <T4> = asd[5]fasdf
7 <T5> = [4]this
shou[6]ldprint[1]
8 <T6> = outer[3][7]root
9 <T7> = [8]

<outer>asdf<in1><in2>jjjj</in2><in3>asbefas</in3></in1>asdfb</outer>

1 <in2> = jjjj
2 <in3> = asbefas
3 <in1> = [1][2]
4 <outer> = asdf[3]asdfb

 