Velocity Reviews - Computer Hardware Reviews


re pattern for matching JS/CSS

 
 
i80and
 
      12-15-2006
I'm working on a program to remove tags from an HTML document, leaving
just the content, and I want to keep it simple. I've finished the part
that removes simple tags, but I also want all CSS and JS removed. What
re pattern could I use to do that?

I've tried
'<script[\S\s]*/script>'
but that didn't work properly. I'm fairly new to Python and still
learning re. What pattern would work?

 
ina
 
      12-15-2006

i80and wrote:
> I'm working on a program to remove tags from an HTML document, leaving
> just the content, but I want to do it simply. I've finished a system
> to remove simple tags, but I want all CSS and JS to be removed. What
> re pattern could I use to do that?
>
> I've tried
> '<script[\S\s]*/script>'
> but that didn't work properly. I'm fairly basic in my knowledge of
> Python, so I'm still trying to learn re.
> What pattern would work?


I use re.compile("<script.*?</script>", re.DOTALL)
for scripts. I strip these out first, since my tag-stripping re would
otherwise remove the script tags themselves and leave the code between
them behind. Hope this helps.
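
As a rough illustration of the difference between the two patterns in this thread, here is a minimal sketch. The sample strings are made up, and the combined script/style pattern at the end (with re.IGNORECASE and a backreference) is an assumed extension for the CSS half of the question, not something posted above.

```python
import re

# Made-up sample page: two script blocks with real content between them.
html = "<script>a()</script><p>keep me</p><script>b()</script>"

# Greedy: [\S\s]* runs from the first "<script" to the LAST "/script>",
# so the paragraph between the two blocks is deleted as well.
greedy = re.compile(r"<script[\S\s]*/script>")
greedy_result = greedy.sub("", html)

# Non-greedy: .*? stops at the first "</script>"; re.DOTALL lets "."
# match newlines, so multi-line scripts are covered too.
lazy = re.compile(r"<script.*?</script>", re.DOTALL)
lazy_result = lazy.sub("", html)

# Assumed extension: handle <style> the same way, case-insensitively.
# \1 requires the closing tag name to match the opening one.
script_or_style = re.compile(
    r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE
)
css_sample = "<STYLE>p {color: red}</STYLE><p>text</p><script>c()</script>"
both_result = script_or_style.sub("", css_sample)

print(repr(greedy_result))  # ''
print(repr(lazy_result))    # '<p>keep me</p>'
print(repr(both_result))    # '<p>text</p>'
```

This is fine for tidy, trusted input; it makes no attempt to cope with deliberately malformed markup.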

 
Tim Chase
 
      12-15-2006
>> I've tried
>> '<script[\S\s]*/script>'
>> but that didn't work properly. I'm fairly basic in my knowledge of
>> Python, so I'm still trying to learn re.
>> What pattern would work?

>
> I use re.compile("<script.*?</script>",re.DOTALL)
> for scripts. I strip this out first since my tag stripping re will
> strip out script tags as well. Hope this was of help.


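This won't catch variations such as

<
script
>
doEvil()
<
/
script
>

Whitespace before the closing > of a tag is valid html/xhtml; the
whitespace right after < is not strictly valid, but it shows how many
spellings a pattern would have to anticipate.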

For less valid html that someone might still attempt, one might find
something like

<scrip<script>hah</script>t>doEvil()</script>

which, once you delete whatever your pattern matches, leaves the valid
but unwanted

<script>doEvil()</script>
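
Both failure modes are easy to reproduce. The sketch below uses the non-greedy pattern from earlier in the thread; the helper strip_until_stable is a hypothetical addition, included only to show that repeating the substitution fixes the nested case but not the whitespace one.

```python
import re

# The non-greedy pattern suggested earlier in the thread.
script_re = re.compile(r"<script.*?</script>", re.DOTALL)

# Whitespace before the closing ">" of the end tag: no literal
# "</script>" appears, so the pattern finds nothing to remove.
spaced = "<script\n>doEvil()</script\n>"
spaced_match = script_re.search(spaced)

# Nested case: removing the inner match splices the outer pieces
# back together into a working script element.
nested = "<scrip<script>hah</script>t>doEvil()</script>"
once = script_re.sub("", nested)


def strip_until_stable(text):
    """Hypothetical helper: re-apply the pattern until nothing changes."""
    previous = None
    while previous != text:
        previous, text = text, script_re.sub("", text)
    return text


print(spaced_match)                        # None
print(once)                                # <script>doEvil()</script>
print(repr(strip_until_stable(nested)))    # ''
print(strip_until_stable(spaced) == spaced)  # True
```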

I'd propose that it's better to use something such as
BeautifulSoup that actually parses the HTML, and then skim
through it whitelisting the tags you plan to allow, and skipping
the emission of any tags that don't make the whitelist.

-tkc
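
A minimal sketch of the whitelist idea follows. To keep it runnable without third-party packages it swaps in the standard library's html.parser for BeautifulSoup, so it is not the BeautifulSoup API. The names ALLOWED_TAGS, DROP_CONTENT, WhitelistFilter and sanitize, and the particular tags in the whitelist, are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; anything not listed is never emitted.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}
# Elements whose text content is discarded along with the tags.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Attributes are dropped on purpose (no onclick=, style=, ...).
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Escape text so stray "<" can never form a new tag.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize("<p>Hello <b>world</b></p><script>doEvil()</script>"))
# <p>Hello <b>world</b></p>
```

Because the output is built only from whitelisted tag names and escaped text, a malformed input such as the nested example above cannot reassemble into a script element, whatever the parser makes of it.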




 