FAQ 6.4 How do I match XML, HTML, or other nasty, ugly things with a regex?

PerlFAQ Server
This is an excerpt from the latest version of perlfaq6.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at .


6.4: How do I match XML, HTML, or other nasty, ugly things with a regex?

(contributed by brian d foy)

If you just want to get work done, use a module and forget about the
regular expressions. The "XML::Parser" and "HTML::Parser" modules are
good starts, although each namespace has other parsing modules
specialized for certain tasks and different ways of doing it. Start at
CPAN Search ( ) and wonder at all the work people
have done for you already!
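
As a rough illustration of the module route, the sketch below pulls link
targets out of a fragment with "HTML::Parser". It assumes the module is
installed from CPAN (it is not part of the core distribution); the
"extract_links" helper and the sample markup are made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser ();    # from CPAN; not part of the core distribution

# Hypothetical helper: collect the href of every <a> tag in a string.
sub extract_links {
    my ($html) = @_;
    my @links;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each start tag with the lowercased tag name and a
        # hash reference of its attributes.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                push @links, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed
    return @links;
}

my @found = extract_links(
    q{<p><A HREF="a.html">one</A><br/><a href='b.html'>two</a></p>}
);
print "$_\n" for @found;
```

The parser copes with the mixed case, the two quoting styles, and the
empty tag without any of them appearing in your own code.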

The problem with things such as XML is that they have balanced text
containing multiple levels of balanced text, but sometimes it isn't
balanced text, as in an empty tag ("<br/>", for instance). Even then,
things can occur out-of-order. Just when you think you've got a pattern
that matches your input, someone throws you a curveball.
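
To make the curveball concrete, here is a small sketch of a naive pattern
that works on flat input and quietly returns the wrong text once an
element is nested. The "<div>" markup and the "naive_content" helper are
invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: grab the content of the first <div> with a
# non-greedy match, the way a quick one-off script might.
sub naive_content {
    my ($markup) = @_;
    return $markup =~ m{<div>(.*?)</div>}s ? $1 : undef;
}

# Flat input: the pattern does what you expect.
my $flat = naive_content('<div>one</div>');

# Nested input: the non-greedy match stops at the first </div>, so the
# captured text is cut off in the middle of the outer element.
my $nested = naive_content('<div>outer <div>inner</div> tail</div>');

print "flat:   $flat\n";      # flat:   one
print "nested: $nested\n";    # nested: outer <div>inner
```

Making the match greedy does not help either: it then runs to the last
"</div>" in the input, which is wrong as soon as two elements sit side by
side.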

If you'd like to do it the hard way, scratching and clawing your way
toward a right answer but constantly being disappointed, besieged by bug
reports, and weary from the inordinate amount of time you have to spend
reinventing a triangular wheel, then there are several things you can
try before you give up in frustration:

* Solve the balanced text problem from another question in perlfaq6

* Try the recursive regex features in Perl 5.10 and later. See perlre.

* Try defining a grammar using Perl 5.10's "(?(DEFINE)...)" feature.

* Break the problem down into sub-problems instead of trying to use a
single regex

* Convince everyone not to use XML or HTML in the first place
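
For the recursive-regex and grammar suggestions, a minimal sketch might
look like the following. It only handles a toy subset (element names,
text, nesting, and empty tags; no attributes, comments, entities, or
CDATA), and the rule names and the "is_well_formed" helper are
assumptions made for this example, not anything perlre prescribes.

```perl
use strict;
use warnings;
use 5.010;    # recursion and (?(DEFINE)...) need Perl 5.10 or later

# A toy grammar for elements without attributes, comments, or CDATA.
my $document_re = qr{
    \A (?&element) \z

    (?(DEFINE)
        (?<element>
            < (?<tag> (?&name) ) \s* >          # opening tag
            (?: (?&text) | (?&element) )*       # text or nested elements
            </ \k<tag> \s* >                    # closing tag, same name
          |
            < (?&name) \s* />                   # empty tag such as <br/>
        )
        (?<name> [A-Za-z] [\w.-]* )
        (?<text> [^<]++ )
    )
}x;

# Hypothetical helper: 1 if the whole string is one balanced element.
sub is_well_formed {
    my ($markup) = @_;
    return $markup =~ $document_re ? 1 : 0;
}

say is_well_formed('<p>hi <b>there</b><br/></p>');    # 1
say is_well_formed('<p><b>oops</p></b>');             # 0
```

Even this much leaves out most of what real XML or HTML allows, which is
the point of the advice to use a parser module.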

Good luck!


The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in