Stripping html tags from text

 
 
Spondishy
03-06-2006
Hi,

I'm looking for help with a regular expression in C#.

I want to remove all tags from a piece of html except the following.

<a>
<b>
<h1>
<h2>
<h3>

Also, <a> could be <a href="aa">aaa</a> etc.

Help would be appreciated, along with an explanation of the regular
expression created.

Thanks.

 
Reply With Quote
 
 
 
 
Kevin Spencer
03-06-2006
HTML is complex. It is easier to turn the problem around and say that you
want to *retrieve* *only* the following tags. That way, they are the only
tags the Regular Expression will have to look for.

The following will do this:

(?i)<\s*(a|b|h1|h2|h3)\b[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?

Note: Grouping is used in this Regular Expression. It groups the tag names
into Group 1, and the InnerText into Group 2, in case you need either of
these.
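
To illustrate how the two groups might be read from C#, here is a minimal sketch. It assumes the pattern is written exactly as in the string below: `b` in the tag list (to match the original question), a `\b` after the tag name so that `<body>` is not mistaken for `<b>`, and the InnerText captured as Group 2 inside an optional non-capturing group. The names `TagExtractor` and `Extract` are made up for this example and do not come from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Assumed reading of the pattern:
    //   (?i)                 case-insensitive matching
    //   <\s*(a|b|h1|h2|h3)\b Group 1: one of the wanted tag names
    //   [^>]*>               any attributes, up to the end of the start tag
    //   (?:([^<\r\n]+)...)?  Group 2 (optional): text up to the matching end tag or a line break
    private static readonly Regex WantedTag = new Regex(
        @"(?i)<\s*(a|b|h1|h2|h3)\b[^>]*>(?:([^<\r\n]+)(?=(?:<\/\1)|(?:\r?\n)))?");

    // Returns (tag name, inner text) pairs; the inner text is "" when Group 2 did not match.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var found = new List<KeyValuePair<string, string>>();
        foreach (Match m in WantedTag.Matches(html))
        {
            found.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return found;
    }
}
```

For the input `<p>Intro</p><a href="aa">aaa</a><h1>Title</h1>` this should yield two pairs, `a`/`aaa` and `h1`/`Title`. Note that Group 2 stays empty when the wanted tag contains nested markup rather than plain text.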

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer

Presuming that God is "only an idea" -
Ideas exist.
Therefore, God exists.
 
m.posseth
03-06-2006


I use this in VB:

Private Function stripHTML(ByVal strHTML As String) As String
    ' Lazily match "<", then any characters (including newlines), up to the next ">"
    Dim objRegExp As New System.Text.RegularExpressions.Regex("<(.|\n)+?>")
    Return objRegExp.Replace(strHTML, "")
End Function

So the regex "<(.|\n)+?>" does the trick.

In C# it would be (I am a VB coder, so don't shoot me):

private string stripHTML(string strHTML)
{
    // Same pattern as the VB version; the verbatim string keeps the backslash intact
    System.Text.RegularExpressions.Regex objRegExp =
        new System.Text.RegularExpressions.Regex(@"<(.|\n)+?>");
    return objRegExp.Replace(strHTML, "");
}

regards

Michel Posseth [MCP]
 
Kevin Spencer
03-06-2006
The problem with that Regular Expression (in this case) is that it simply
matches every tag in the page. It doesn't match the InnerText, as he
requested, and it matches end tags as separate matches. It is excellent for
stripping all HTML tags from a page, for example, but not for his requirements.
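
For the original request (remove every tag except a, b, h1, h2 and h3), one possible approach is to keep the replace-based method but exclude the allowed names with a negative lookahead. The following is only a rough sketch under that assumption, not something posted in this thread; `TagWhitelist` and `StripAllExceptAllowed` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagWhitelist
{
    // <                            start of a tag
    // (?!/?(?:a|b|h1|h2|h3)\b)     fail if the tag name, with or without a leading "/",
    //                              is one of the allowed names; \b keeps <body> or <abbr>
    //                              from being treated as <b> or <a>
    // [^>]*>                       the rest of the tag, up to and including ">"
    private static readonly Regex DisallowedTag = new Regex(
        @"<(?!/?(?:a|b|h1|h2|h3)\b)[^>]*>",
        RegexOptions.IgnoreCase);

    // Removes every tag that is not on the allowed list; text and allowed tags are kept as-is.
    public static string StripAllExceptAllowed(string html)
    {
        return DisallowedTag.Replace(html, "");
    }
}
```

With this sketch, `<div><a href="aa">aaa</a> <i>x</i><h1>T</h1></div>` should become `<a href="aa">aaa</a> x<h1>T</h1>`. It keeps the attributes of allowed tags untouched and, like any regex-based approach, can be confused by a stray `<` in the text, comments, or script blocks.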

 
m.posseth
03-07-2006
Oops,

I only read "Stripping html tags from text" and missed the exclusion part:

>>>except the following.
>>>
>>> <a>
>>> <b>
>>> <h1>
>>> <h2>
>>> <h3>
>>>
>>> Also, <a> could be <a href="aa">aaa</a> etc.


My code strips every tag, so it will convert
<html>
<head>
<body>
<table>
<tr><td>bla bla </td></tr>
</table>
</body>
</head>
</html>

into

bla bla


regards

Michel