Velocity Reviews - Computer Hardware Reviews

regex for replacing plain text within html string...

 
 
Tim_Mac
 
      01-20-2006
hi,
i have a tricky problem and my regex expertise has reached its limit.
i have read other posts on this newsgroup that pull out the plain text
from a html string, but that won't work for me because i want to
preserve the html, and replace some of the plain text.

i basically want to show the user's search terms highlighted in the
page, like google does, but i want to do this server side (i have the
mechanics of intercepting the html sorted out, by overriding the
Page.Render method). i can use a simple regex pattern like (keyword)
and replace with <span class='highlight'>$1</span> but this causes
problems because the keyword may appear in markup tags or attribute
values, which the above example will also replace, screwing up the html
structure.

what i want to express is: match the keyword, where it is not contained
inside a html tag, i.e. between a < and > character

my most obvious attempt is too simplistic and doesn't work:
[^<]*(keyword)[^>]*
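
One way to express "not inside a tag" is a negative lookahead placed after the keyword: reject the match if a `>` turns up before any other angle bracket. The sketch below uses Python's `re` module rather than .NET (the lookahead syntax is the same in both); `highlight` and `page` are made-up names for illustration. It assumes the text contains no unescaped `>` and does not treat comments or script/style blocks specially.

```python
import re

def highlight(html: str, keyword: str) -> str:
    # (?![^<>]*>) fails when the next angle bracket after the keyword is '>',
    # which is the case when the keyword sits inside <...>.
    pattern = re.compile(re.escape(keyword) + r"(?![^<>]*>)", re.IGNORECASE)
    return pattern.sub(
        lambda m: f"<span class='highlight'>{m.group(0)}</span>", html
    )

page = "<a href='keyword.html' title='keyword'>a keyword, another keyword</a>"

# The simple (keyword) replace also rewrites the attribute values.
naive = re.sub(r"(keyword)", r"<span class='highlight'>\1</span>", page)

# The lookahead version leaves the tag alone and wraps both text occurrences.
safe = highlight(page, "keyword")
print(naive)
print(safe)
```

Because each keyword is matched on its own, multiple occurrences in the same text run are all replaced.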

i did come up with another regex which i am almost embarrassed to show

it essentially matches the keyword inside the inner text of a html tag
set. but the problem is that it misses subsequent occurrences of the
keyword in the same match.

here is the pattern:
<(?<tag>\w+)([^>]*>[^<]*)(?<innerText>KeyWord)([^<]*</\k<tag>>)
and the replace: <$3$1<span class='highlight'>$4</span>$2
it actually works, but as i mentioned it does miss multiple occurrences
inside the same tag, and requires all the text to be within an open +
close html tag.
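
The missed occurrences can be reproduced in a few lines. This is a sketch in Python, so the pattern is respelled with `(?P<name>...)` and `(?P=name)`, and the group numbers differ: .NET numbers the unnamed groups first and the named ones after them (which is why `$3` is the tag name above), while Python numbers all groups left to right. The sample `html` string is made up for illustration.

```python
import re

# Python spelling of the pattern from the post.
pattern = re.compile(
    r"<(?P<tag>\w+)([^>]*>[^<]*)(?P<innerText>KeyWord)([^<]*</(?P=tag)>)"
)
# Same replacement, using Python's group numbering (tag=1, innerText=3).
replacement = r"<\g<tag>\2<span class='highlight'>\g<innerText></span>\4"

html = "<p>KeyWord and KeyWord again</p>"
result, count = pattern.subn(replacement, html)
print(count)
print(result)
```

The greedy `[^<]*` before the keyword runs to the last occurrence, and the match then consumes the whole element up to its closing tag, so the earlier occurrence is never visited again. That is why only one keyword per element gets wrapped.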

i would be really grateful if anyone had a suggestion
thanks
tim

 
Tom Anderson
 
      01-20-2006
Your best bet with this type of replacement would be to first regex the text
between the html tags (i.e. between > and <), then do a standard string
replace on the keyword(s).
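
A minimal sketch of that idea, in Python rather than .NET: split the page on tags, keeping the tags in the result, then run a plain string replace on the text pieces only. The names `TAG` and `highlight_text_runs` are made up for illustration, and the sketch assumes tags never contain a `>` inside an attribute value.

```python
import re

# The capturing group makes re.split keep the tags in the result list.
TAG = re.compile(r"(<[^>]*>)")

def highlight_text_runs(html: str, keyword: str) -> str:
    parts = TAG.split(html)
    # With one capturing group, even indexes hold the text between tags
    # and odd indexes hold the tags themselves.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(
            keyword, f"<span class='highlight'>{keyword}</span>"
        )
    # Joining the pieces in order rebuilds the page with its markup intact.
    return "".join(parts)

sample = "<p class='keyword'>keyword <b>keyword</b></p>"
print(highlight_text_runs(sample, "keyword"))
```

The plain string replace is case-sensitive, so this version only finds the keyword exactly as typed.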


 
Tim_Mac
 
      01-21-2006
hi tom. thanks for the reply.
yes but the problem i mentioned is that if you get text between > and <
characters, it could contain more tags inside it, so your
String.Replace method could still replace mark-up then.
by the way, how would you pull out the text by regex, then use
String.Replace and keep the structure of the page html all together? i
don't see how you would use the two approaches together...
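
One possible way to combine the two, sketched in Python: match every piece of the page as either a whole tag or a run of text, and let a callback decide what to do with each piece. Tags are returned unchanged, so the page structure is preserved, and a text run matched by `[^<]+` cannot itself contain a tag. In .NET the same shape should be possible with the `Regex.Replace` overload that takes a `MatchEvaluator`. The names `TOKEN` and `highlight_terms` are made up for illustration.

```python
import re

# Each match is either a complete tag or a run of text up to the next '<'.
TOKEN = re.compile(r"<[^>]*>|[^<]+")

def highlight_terms(html: str, terms: list[str]) -> str:
    if not terms:
        return html  # nothing to highlight
    # One alternation for all search terms, matched case-insensitively.
    words = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    def rewrite(match: re.Match) -> str:
        chunk = match.group(0)
        if chunk.startswith("<"):
            return chunk  # a tag: pass it through untouched
        # A text run: wrap every term, keeping the casing found in the page.
        return words.sub(
            lambda w: f"<span class='highlight'>{w.group(0)}</span>", chunk
        )

    return TOKEN.sub(rewrite, html)

sample = "<div id='Regex'>regex and HTML <i>Regex</i></div>"
print(highlight_terms(sample, ["regex", "html"]))
```

This is still not a real html parser: comments, script and style blocks, and attribute values containing `>` would need separate handling.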

 