Understood. To argue my case only slightly more: I'm not parsing arbitrary
HTML. I'm looking for a single tag called "localize", whose contents I then
replace with the contents of an XML entry from a resource file. So
there's never a case where > appears in an attribute of that tag, UNLESS
it's inside an ASP block (<% %>). The attributes of the localize tag are
very restricted, true/false type things, except for the fact that somebody
may need to "bind" one of these true/false values to a function call.
So my latest is:
<localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>
which fixes my original problem, but it's true that it won't handle
<localize visible="<%# x > 5%>">foo</localize>
That seems fixable and "final", though, in that it's the only case that
could occur given the allowable values of the tag...
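
For what it's worth, here is a rough sketch of one way that remaining case might be handled: treat each <% ... %> block as a single unit inside the tag, and otherwise allow only characters that are neither < nor >. It is written in Python purely for illustration. The names ORIGINAL, ASP_AWARE, body and the IsVisible() call are made up for the example, and the ASP_AWARE pattern is an assumption rather than something tested against real pages. It would still break if %> appeared inside a string literal within the ASP block.

```python
import re

# The pattern from above, unchanged.
ORIGINAL = re.compile(
    r'<localize((?:[^>]*%>[^>])*[^>]*)>(.*?)</localize>', re.DOTALL)

# Hypothetical variant: inside the tag, consume a whole <% ... %> block
# as one unit; otherwise accept any character that is neither < nor >.
ASP_AWARE = re.compile(
    r'<localize((?:<%.*?%>|[^<>])*)>(.*?)</localize>', re.DOTALL)

# IsVisible() is a made-up function name standing in for a bound call.
simple = '<localize visible="<%# IsVisible() %>">foo</localize>'
tricky = '<localize visible="<%# x > 5%>">foo</localize>'

def body(pattern, text):
    """Return the captured tag contents, or None if nothing matched."""
    m = pattern.search(text)
    return m.group(2) if m else None

for name, pattern in (("original", ORIGINAL), ("asp-aware", ASP_AWARE)):
    for sample in (simple, tricky):
        print(name, repr(body(pattern, sample)))
```

With the original pattern the tricky sample still matches, but the attribute capture stops at the > inside the ASP block, so the rest of the attribute leaks into the captured contents. The variant keeps the whole block in the attribute capture.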
The problem with most HTML parsers is that (shocker) they don't handle
ASP.Net (which isn't HTML)... So rather than modding something big I was
hoping to keep it simple, even if that means constraining the user of the
tag somewhat.
"Tad McClellan" <> wrote in message
news:...
> Max Metral <> wrote:
>
> > Subject: HTML regex challenge
>
>
> Parsing arbitrary HTML with a regex is nearly impossible.
>
> You need a Real Parser that knows the HTML grammar.
>
>
> > The expression fragment I want is "match everything except right
> > bracket, unless there was a % before the right bracket"...
>
>
> Your problem description will not do the Right Thing for this HTML:
>
> <img src="cool.jpg" alt=">>Cool pic!<<">
>
> after you fix the regex for that case, post it here and we
> will show some other HTML that breaks it.
>
> Then after you fix the regex for _that_ case, post the regex
> and we'll do it again.
>
> Lather, rinse, repeat.
>
> We can keep that up longer than you can. 
>
>
> --
> Tad McClellan SGML consulting
> Perl programming
> Fort Worth, Texas