Velocity Reviews - Computer Hardware Reviews


Help with Regexps wanted

 
 
Spartanicus
Guest
Posts: n/a
 
      10-21-2004
I could use some examples of how to use regexps to filter html, I
haven't been able to grasp it using the tutorials on the net.

Functions I'm after:

1) Remove all tags except img, object, a and embed.
2) Remove all blank lines (they may contain spaces and/or tabs).
3) Remove html comments

My regexp parser (Homesite) is a bit limited, it doesn't support
functions and shortcuts.

--
Spartanicus
 
 
 
 
 
Andrew Urquhart
Guest
Posts: n/a
 
      10-21-2004
*Spartanicus* wrote:
> I could use some examples of how to use regexps to filter html, I
> haven't been able to grasp it using the tutorials on the net.
>
> Functions I'm after:
>
> 1) Remove all tags except img, object, a and embed.


I might be getting close with:

<(?!\/|(?:img|object|a|embed)\b)[^>]*?>|<\/(?!(?:img|object|a|embed)\b)[^>]*?>

For example (javascript/ecmascript/jscript):

str = str.replace(/\s*<(?!\/|(?:img|object|a|embed)\b)[^>]*?>\s*|\s*<\/(?!(?:img|object|a|embed)\b)[^>]*?>\s*/igm, " ");

It's not terribly nice and I gave up trying to remove the inefficient OR
in the middle :/
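
One way the OR could be folded away, as a sketch only: make the slash optional inside a single lookahead, so one pattern covers both opening and closing tags. The function name stripTagsExcept is made up for illustration, and the pattern assumes no tag contains a literal ">" inside an attribute value.

```javascript
// Hypothetical helper; the tag whitelist is the one from the question.
function stripTagsExcept(html) {
  // \/? lets the same lookahead reject both <a ...> and </a>.
  // \b stops "a" from also exempting <abbr>, <address>, <area>, ...
  const unwanted = /<(?!\/?(?:img|object|a|embed)\b)[^>]*>/gi;
  return html.replace(unwanted, "");
}
```

Because anything that starts with "<" and is not on the whitelist gets removed up to the next ">", a stray "<" in running text would be eaten as well, so this only suits reasonably clean markup.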

> 2) Remove all blank lines (they may contain spaces and/or tabs).


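To match a blank line:

/^\s*$/

To remove blank lines the line break has to go as well, otherwise the
empty line is still there afterwards. E.g.:

str = str.replace(/(^|\n)(?:[ \t]*\r?\n)+/g, "$1");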
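
A line-based alternative, as a minimal sketch: split the text into lines, keep only the lines that contain something other than spaces and tabs, and join them again. The name removeBlankLines is hypothetical. Note the assumptions: the result always uses LF line ends, and a trailing line break at the end of the text is dropped.

```javascript
// Hypothetical helper; drops lines that are empty or hold only spaces/tabs.
function removeBlankLines(text) {
  return text
    .split(/\r?\n/)                         // break into lines (LF or CRLF)
    .filter((line) => /[^ \t]/.test(line))  // keep lines with a non-space, non-tab character
    .join("\n");                            // rejoin with LF
}
```

This sidesteps matching the line breaks in the pattern at all, at the cost of normalising the line ends.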


> 3) Remove html comments


http://groups.google.co.uk/groups?th=4b9c59a6279b9620
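
For the simple case, a sketch in the same javascript style (this is not taken from the linked thread, and removeComments is a made-up name): match from "<!--" to the nearest "-->", allowing line breaks in between.

```javascript
// Hypothetical helper; strips <!-- ... --> comments, including multi-line ones.
function removeComments(html) {
  // [\s\S] matches any character including line breaks;
  // the lazy *? stops at the first "-->" instead of the last one.
  return html.replace(/<!--[\s\S]*?-->/g, "");
}
```

It treats anything between those two markers as a comment, so comment-like text inside a script or style block is removed too, and the full SGML comment syntax is not covered.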
--
Andrew Urquhart
- FAQ: http://www.html-faq.com/
- Archive: http://groups.google.com/groups?group=alt.html
- Reply: http://andrewu.co.uk/contact/
 
 
 
 
 
William Park
Guest
Posts: n/a
 
      10-24-2004
Spartanicus <(E-Mail Removed)> wrote:
> I could use some examples of how to use regexps to filter html, I
> haven't been able to grasp it using the tutorials on the net.
>
> Functions I'm after:
>
> 1) Remove all tags except img, object, a and embed.


If you don't care about the relative order of those tags, then run the
following extractions separately:
- extract all text between '<a ' and '</a>',
- extract all text between '<img ' and '>',
- extract all text between '<object>' and '</object>',
- extract all text between '<embed>' and '</embed>',
using Python, Perl, or a (patched) Bash shell. Essentially, read the
whole file into a string, and then cut/slice.
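
The cut/slice idea above could be sketched like this in javascript, to stay with the language already used in this thread. The helper name sliceAll and the sample string are made up for illustration. It is deliberately naive: it is case-sensitive and simply takes the first closing delimiter it finds.

```javascript
// Hypothetical helper: collect every substring that starts with `open`
// and ends with the next `close` found after it.
function sliceAll(text, open, close) {
  const found = [];
  let from = 0;
  while (true) {
    const start = text.indexOf(open, from);
    if (start === -1) break;                 // no further opening delimiter
    const end = text.indexOf(close, start + open.length);
    if (end === -1) break;                   // unterminated, give up
    found.push(text.slice(start, end + close.length));
    from = end + close.length;               // continue after this piece
  }
  return found;
}

// Made-up sample input.
const page = '<p><a href="a.html">A</a> <img src="x.png" alt="x"></p>';
const links = sliceAll(page, "<a ", "</a>");
const images = sliceAll(page, "<img ", ">");
```

Each tag type is extracted in its own pass, which is why the relative order of the pieces is lost.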

If you like a shell solution, you can use
http://freshmeat.net/projects/bashdiff/
which has "string" cut/splicing.

> 2) Remove all blank lines (they may contain spaces and/or tabs).
> 3) Remove html comments
>
> My regexp parser (Homesite) is a bit limited, it doesn't support
> functions and shortcuts.



--
William Park <(E-Mail Removed)>
Open Geometry Consulting, Toronto, Canada
 
 
Eric B. Bednarz
Guest
Posts: n/a
 
      10-25-2004
William Park <(E-Mail Removed)> writes:

> Spartanicus <(E-Mail Removed)> wrote:


>> 1) Remove all tags except img, object, a and embed.


I'll just pick a simple one:

> - extract all text between '<img ' and '>',


<img src="tagc.png" alt=">">


Doing such things is usually as trivial as writing your own SGML parser
from scratch. The upshot: there's a difference between parsing a private
subset of *applied* syntax (your own, or simply that of currently
available applications) and conforming to a generic set of defined
syntactical rules. The former is only fairly easy as long as you don't
forget about your policies, the involved applications don't unexpectedly
change, and you are the only user to start with.
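
To make the img example concrete, here is a small javascript sketch comparing a naive pattern with one that skips over quoted attribute values. Both patterns are illustrations only. The second handles this one case, but it is still nowhere near a parser (comments, script content and malformed quoting are not dealt with).

```javascript
// Naive: stops at the first ">" it sees, even inside a quoted attribute value.
const naiveImg = /<img\b[^>]*>/gi;

// Quote-aware: the tag body is a run of double-quoted strings,
// single-quoted strings, or characters that are neither quotes nor ">".
const quotedImg = /<img\b(?:"[^"]*"|'[^']*'|[^'">])*>/gi;

const sample = '<img src="tagc.png" alt=">">';

const naiveHit = sample.match(naiveImg)[0];   // '<img src="tagc.png" alt=">'
const quotedHit = sample.match(quotedImg)[0]; // the whole tag
```

The three alternatives in the quote-aware pattern each start with a different character, so the engine never has two ways to consume the same text.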


--
| ) Più Cabernet,
-( meno Internet.
| ) http://bednarz.nl/
 
 
William Park
Guest
Posts: n/a
 
      10-25-2004
Eric B. Bednarz <(E-Mail Removed)> wrote:
> William Park <(E-Mail Removed)> writes:
>
> > Spartanicus <(E-Mail Removed)> wrote:

>
> >> 1) Remove all tags except img, object, a and embed.

>
> I'll just pick a simple one:
>
> > - extract all text between '<img ' and '>',

>
> <img src="tagc.png" alt=">">


Good one. I guess the OP can turn the HTML into XML syntax and use an
XML parser.

--
William Park <(E-Mail Removed)>
Open Geometry Consulting, Toronto, Canada
 
 
 
 