newbie ques
i want to look for xml tokens.. like <token> .. the pattern matching doesnt
seem to work when i have < and >.. i tried escaping < and > with a \ , but
no use..

$inp=<STDIN>;
if ($inp =~ /\b\<[a-z]+\>\b/) {
    print ("SUCCESS inp has <sometoken> \n");
}

even if i enter <a> as input, i never get SUCCESS printed

Thanks in advance
Re: newbie ques
Madhu Ramachandran wrote:
> Subject: newbie ques

Please put the subject of your post in the Subject of your post. Your post is not about newbie questions. Your post is about parsing text which contains < and >. Choose wiser subjects to get better help.

> i want to look for xml tokens.. like <token> .. the pattern matching doesnt
> seem to work when i have < and >.. i tried escaping < and > with a \ , but
> no use..

You've come to a false conclusion. < and > are not special in regular expressions, and do not need to be escaped.

> $inp=<STDIN>;
> if ($inp =~ /\b\<[a-z]+\>\b/) {

What do you believe those \b markers are doing? I'm willing to bet that's the root cause of your problem. \b matches a slot between a \w character and a \W character. < and > are \W characters. Therefore, in order to match your expression, the thing before the < and after the > would have to be \w characters (that is, letters, numbers, or underscores). Are they?

> print ("SUCCESS inp has <sometoken> \n");
> }
>
> even if i enter <a> as input, i never get SUCCESS printed

Ah, so the thing before < is a beginning of string, and the thing after > is an end of string. Neither of those match \w characters. What made you think you wanted or needed to put \b tokens in there to begin with?

#!/usr/bin/perl
use strict;
use warnings;

if ('<a>' =~ /<[a-z]+>/){
    print "(1) Match\n";
}
if ('<a>' =~ /\b<[a-z]+>\b/){
    print "(2) Match\n";
}
__END__
(1) Match

Please note that for parsing XML-like data, you should not be using regular expressions at all. Instead, use a module designed to parse XML. Go to http://search.cpan.org and search for XML::Parser and its friends.

Paul Lalli
Re: newbie ques
Madhu Ramachandran wrote in comp.lang.perl.misc:
> i want to look for xml tokens.. like <token> .. the pattern matching doesnt
> seem to work when i have < and >.. i tried escaping < and > with a \ , but
> no use..
> $inp=<STDIN>;
> if ($inp =~ /\b\<[a-z]+\>\b/) {
> print ("SUCCESS inp has <sometoken> \n");
> }

That's because \< and \> match word boundaries, not < and > characters.

You should know though, that xml parsing by means of regular expressions can't be done (or at least, not very easily). You really should be using a real XML parser.

--
BZ
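A quick check of the \< and \> claim, offered as a sketch rather than a ruling: \< and \> are word-boundary anchors in tools such as GNU grep and Vim, but in Perl a backslash before a punctuation character just makes it literal. The variable names below are illustrative only.

```perl
use strict;
use warnings;

# In Perl, a backslash before a punctuation character makes it literal,
# so \< and \> should behave exactly like < and >.
my $escaped   = '<a>' =~ /\<[a-z]+\>/     ? 1 : 0;
my $unescaped = '<a>' =~ /<[a-z]+>/       ? 1 : 0;

# The original pattern: same brackets, plus \b on both sides.
my $anchored  = '<a>' =~ /\b\<[a-z]+\>\b/ ? 1 : 0;

print "escaped: $escaped, unescaped: $unescaped, with \\b: $anchored\n";
```

If the escaped and unescaped forms both report 1 and only the \b version reports 0, the anchors are what break the match, not the escaping.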
Re: newbie ques
Madhu Ramachandran wrote:
> i want to look for xml tokens..

Then why not use a reliable and robust module that has already been written for that purpose?

http://search.cpan.org/~podmaster/XM.../TokeParser.pm

Or one of the MANY other XML parsing modules which are freely available.
parsing < and > using word boundary pattern anchors
I apologize about the subject heading. changed it now and will keep that in
mind in my future posts.

1. About \b,
The book iam using says \b is a "Word-Boundary Pattern Anchor".
eg:
/\bdef/ matches def and and defghi, but will not match abcdef
and /\bdef\b/ will match exactly def, not abcdef or defghi

I was able to test the foll and it works..
when i tested $inp =~ /\bget[a-z]*lost\b/

only getlost or getAnyLetterHerelost works.. but agetlost or getlostb does
not work.. so i want to recognize tokens which start with < and end with >,
and ignore any embedded < or >
ie. match <token>, but dont match tok<en>

but can't seem to get it working with /\b<[a-z]+>\b/

2. Regarding not using standard xml parser.. That was my first aim. but
unfortunately iam using a load (OS) where i can't install any packages. its
long story.. cuz the disks used to install this load (is strictly monitored)
and comes with just perl, and they wont let me add any external modules to
the disk. These disks are the ones sent to customer and getting anything
added in there, would need to waddle thru lot of red tape. so, iam just
stuck with core modules of perl.

Iam just playing around to see if i can write my own simplistic xml parser
(non validating) in perl. Worst case i would just use a .properties file
with name value pairs instead of xml. :(

-----------------
"Paul Lalli" <mritty@gmail.com> wrote in message
news:1134671666.121475.180200@g44g2000cwa.googlegroups.com...
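The book's \bdef examples can be checked directly. The sketch below is only an illustration (the %result hash and $inside variable are made-up names); it also tests one extra pattern that is not from the thread, with \b placed inside the angle brackets, to show where the word boundaries in '<a>' actually sit.

```perl
use strict;
use warnings;

my %result;
for my $word (qw(def defghi abcdef)) {
    $result{$word} = {
        start_only => ($word =~ /\bdef/   ? 1 : 0),   # boundary before "def"
        both_ends  => ($word =~ /\bdef\b/ ? 1 : 0),   # boundary on both sides
    };
    print "${word}: /\\bdef/ = $result{$word}{start_only}, ",
          "/\\bdef\\b/ = $result{$word}{both_ends}\n";
}

# Extra pattern (an assumption for illustration, not from the thread):
# in '<a>' the boundaries lie between '<' and 'a' and between 'a' and '>',
# that is, inside the brackets rather than outside them.
my $inside = '<a>' =~ /<\b[a-z]+\b>/ ? 1 : 0;
print "boundaries inside the brackets: $inside\n";
```

The \b anchor needs a word character on exactly one side of the position, which is why it behaves as the book describes for def but not when placed next to < or > at the edge of the string.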
Re: parsing < and > using word boundary pattern anchors
Madhu Ramachandran wrote:
> I apologize about the subject heading. changed it now and will keep that in
> mind in my future posts.
>
> 1. About \b,
> The book iam using says \b is a "Word-Boundary Pattern Anchor".
> eg:
> /\bdef/ matches def and and defghi, but will not match abcdef
> and /\bdef\b/ will match exactly def, not abcdef or defghi
>
> I was able to test the foll and it works..
> when i tested $inp =~ /\bget[a-z]*lost\b/
>
> only getlost or getAnyLetterHerelost works.. but agetlost or getlostb does
> not work.. so i want to recognize tokens which start with < and end with >,
> and ignore any embedded < or >
> ie. match <token>, but dont match tok<en>
>
> but can't seem to get it working with /\b<[a-z]+>\b/
>
> 2. Regarding not using standard xml parser.. That was my first aim. but
> unfortunately iam using a load (OS) where i can't install any packages. its
> long story.. cuz the disks used to install this load (is strictly monitored)
> and comes with just perl, and they wont let me add any external modules to
> the disk. These disks are the ones sent to customer and getting anything
> added in there, would need to waddle thru lot of red tape. so, iam just
> stuck with core modules of perl.
>
> Iam just playing around to see if i can write my own simplistic xml parser
> (non validating) in perl. Worst case i would just use a .properties file
> with name value pairs instead of xml. :(
>
>
> -----------------
> "Paul Lalli" <mritty@gmail.com> wrote in message
> news:1134671666.121475.180200@g44g2000cwa.googlegroups.com...

I don't think regex have the capabilities to handle nesting too well especially when it comes to xml. Either way xml has to be streamed. Just because it can be done to a degree, doesn't mean you can assign any organization and structure to it within a regex.

Consider:

use strict;
use warnings;

my $cnt = 1;

$_ = "<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";
print $_,"\n";
while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}

__END__

1 = oftheemergency
2 = thisisatest[1]asdfsaf
3 = maybenot
4 = asd[3]fasdf
5 = thisshou[4]ldprint
6 = outer[2][5]root
Re: parsing < and > using word boundary pattern anchors
<robic0@yahoo.com> wrote in message
news:1134692174.761339.166980@g49g2000cwa.googlegroups.com...
>
> I don't think regex have the capabilities to handle nesting too well
> especially when it comes to xml. Either way xml has to be streamed.
> Just because it can be done to a degree, doesn't mean you can assign
> any organization and structure to it within a regex.
>
> Consider:
>
> use strict;
> use warnings;
>
> my $cnt = 1;
>
> $_ =
> "<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";
> print $_,"\n";
> while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}
>

Please go back and learn some more about xml before posting. Where did you get the idea that it is possible to nest tags within tags in xml? You can use CDATA sections if you don't want to use entities for < in text nodes, but that still leaves your above example as nothing but a lot of garbage. You couldn't even get away with nesting < inside attributes.

XML is smarter than you.

Matt
Re: parsing < and > using word boundary pattern anchors
Madhu Ramachandran <madhuram@nortel.com> wrote:
> I apologize about the subject heading.

Have you seen the Posting Guidelines that are posted here frequently?

> 1. About \b,

> I was able to test the foll and it works..

Does "foll" mean "following"? Please use proper spelling, not doing so is inconsiderate of those of us who don't have English as a first language.

> so i want to recognize tokens which start with < and end with >,
> and ignore any embedded < or >

Those characters are not allowed to be so embedded in XML.

If you change your requirement to: start with < and end with >, and disallow embedded < and >, then:

    /<[^<>]+>/

> ie. match <token>, but dont match tok<en>

So you want to match a \W character before <, and a \W character after >.

> but can't seem to get it working with /\b<[a-z]+>\b/

You need the "anti \b", which matches between \W\W or between \w\w:

    /\B<[a-z]+>\B/

> 2. Regarding not using standard xml parser..

[snip]

> so, iam just
> stuck with core modules of perl.

Don't call your data "XML" if it does not comply with the specifications that define XML. "XML-like" would be more accurate.

If you call it "XML" then your code must do the Right Thing when it encounters data like this:

    <!-- <not_a_tag> -->

for example.

> Iam just playing around to see if i can write my own simplistic xml parser

Stop calling it XML if it isn't really XML. So, you want to make your own "angle brackety parser". :-)

--
Tad McClellan                  SGML consulting
tadmc@augustmail.com           Perl programming
Fort Worth, Texas
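To see the three anchoring variants from this thread side by side, a small test harness along these lines may help. It is a sketch only: the @patterns table, its labels, and $found are illustrative names, and the inputs are the two cases the original poster gave.

```perl
use strict;
use warnings;

# The three anchoring variants discussed in the thread.
my @patterns = (
    [ 'no anchors' => qr/<[a-z]+>/     ],
    [ 'with \b'    => qr/\b<[a-z]+>\b/ ],
    [ 'with \B'    => qr/\B<[a-z]+>\B/ ],
);

for my $input ('<token>', 'tok<en>') {
    for my $p (@patterns) {
        my ($name, $re) = @$p;
        printf "%-8s %-11s %s\n", $input, $name,
            $input =~ $re ? 'match' : 'no match';
    }
}

# The comment caveat: a pattern that knows nothing about comments
# still "finds" a tag inside one.
my ($found) = '<!-- <not_a_tag> -->' =~ /(<[^<>]+>)/;
print "inside a comment: $found\n";
```

Of the three, only the \B variant should accept <token> while rejecting tok<en>. The last two lines illustrate why a bare pattern is not enough for real XML: the text inside the comment is captured as if it were a tag.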
Re: parsing < and > using word boundary pattern anchors
robic0@yahoo.com <robic0@yahoo.com> wrote:
> I don't think regex have the capabilities to handle nesting too well
> especially when it comes to xml.

> "<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";

That is not XML.

--
Tad McClellan                  SGML consulting
tadmc@augustmail.com           Perl programming
Fort Worth, Texas
Re: parsing < and > using word boundary pattern anchors
Tad McClellan wrote:
> robic0@yahoo.com <robic0@yahoo.com> wrote:
>
> > I don't think regex have the capabilities to handle nesting too well
> > especially when it comes to xml.
>
> > "<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>";
>
> That is not XML.
>
> --
> Tad McClellan                  SGML consulting
> tadmc@augustmail.com           Perl programming
> Fort Worth, Texas

Sure it is, you just didn't do the substitution. Do I have to do everything for you?

use strict;
use warnings;

my $cnt = 1;

$_ = "<<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>>";
print $_,"\n";
while (s/<([\[\]0-9a-z]+)>/[$cnt]/) { print $cnt," = ",$1,"\n"; $cnt++}

my $gabage1 = "<T7><T6>outer<T2>thisisatest<T1>oftheemergency</T1>asdfsaf</T2><T5><T0></T0>this shou<T4>asd<T3>maybenot</T3>fasdf</T4>ldprint<Z0/></T5>root</T6></T7>";
my $gabage2 = "<outer>asdf<in1><in2>jjjj</in2><in3>asbefas</in3></in1>asdfb</outer>";
my @xml_ary = ($gabage1, $gabage2);

for (@xml_ary)
{
    $cnt = 1;
    print "\n$_\n\n";
    while (s/<([0-9a-zA-Z]+)\/>/[$cnt]/) { print "$cnt <$1> = \n"; $cnt++}
    while (s/<([\[\]0-9a-zA-Z]+)>([^<]*)<\/\1>/[$cnt]/) { print "$cnt <$1> = $2\n"; $cnt++}
}

__END__

<<outer<thisisatest<oftheemergency>asdfsaf><thisshou<asd<maybenot>fasdf>ldprint>root>>
1 = oftheemergency
2 = thisisatest[1]asdfsaf
3 = maybenot
4 = asd[3]fasdf
5 = thisshou[4]ldprint
6 = outer[2][5]root
7 = [6]

<T7><T6>outer<T2>thisisatest<T1>oftheemergency</T1>asdfsaf</T2><T5><T0></T0>this shou<T4>asd<T3>maybenot</T3>fasdf</T4>ldprint<Z0/></T5>root</T6></T7>

1 <Z0> =
2 <T1> = oftheemergency
3 <T2> = thisisatest[2]asdfsaf
4 <T0> =
5 <T3> = maybenot
6 <T4> = asd[5]fasdf
7 <T5> = [4]this shou[6]ldprint[1]
8 <T6> = outer[3][7]root
9 <T7> = [8]

<outer>asdf<in1><in2>jjjj</in2><in3>asbefas</in3></in1>asdfb</outer>

1 <in2> = jjjj
2 <in3> = asbefas
3 <in1> = [1][2]
4 <outer> = asdf[3]asdfb