Spartanicus <> wrote:
> I could use some examples of how to use regexps to filter html, I
> haven't been able to grasp it using the tutorials on the net.
>
> Functions I'm after:
>
> 1) Remove all tags except img, object, a and embed.
If you don't care about the relative order of those tags, then do each
of the following separately, using Python, Perl, or a (patched) Bash
shell:
- extract all text between '<a ' and '</a>',
- extract all text between '<img ' and '>',
- extract all text between '<object>' and '</object>',
- extract all text between '<embed>' and '</embed>'.
Essentially, read the whole file into a string, and then cut/slice.
If you prefer a shell solution, you can use
http://freshmeat.net/projects/bashdiff/
which provides "string" cutting/splicing.
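
In Python, the cut/slice approach might look roughly like the sketch
below. The patterns and the name extract_tags are my own assumptions for
illustration; they keep the delimiters, also accept attributes on
<object> and <embed>, and will be fooled by a '>' inside an attribute
value, by nested tags, and by other HTML corner cases.

```python
import re

# One non-greedy pattern per element to keep. DOTALL lets '.' span
# newlines, since the whole file is read into a single string.
# These patterns are illustrative assumptions, not a full HTML parser.
FLAGS = re.IGNORECASE | re.DOTALL
PATTERNS = {
    "a": re.compile(r"<a\s.*?</a>", FLAGS),
    "img": re.compile(r"<img\s.*?>", FLAGS),
    "object": re.compile(r"<object\b.*?</object>", FLAGS),
    "embed": re.compile(r"<embed\b.*?</embed>", FLAGS),
}


def extract_tags(html):
    """Return a dict mapping each tag name to the chunks found for it.

    Each kind of tag is extracted separately, so the relative order
    between different kinds of tags is lost.
    """
    return {name: pattern.findall(html) for name, pattern in PATTERNS.items()}
```

To use it, read the file in one go (for example with open(...).read())
and pass the resulting string to extract_tags().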
> 2) Remove all blank lines (they may contain spaces and/or tabs).
> 3) Remove html comments
>
> My regexp parser (Homesite) is a bit limited, it doesn't support
> functions and shortcuts.
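For 2) and 3), a minimal Python sketch follows, assuming the whole file
is already in one string. The function names are hypothetical, and the
comment pattern is a simplification: it will also remove anything that
looks like a comment inside <script> or <style> blocks.

```python
import re

# An HTML comment: '<!--' up to the nearest '-->', possibly across lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# A blank line: only spaces and/or tabs before the line ending.
# MULTILINE makes '^' match at the start of every line.
BLANK_LINE = re.compile(r"^[ \t]*\r?\n", re.MULTILINE)


def strip_comments(html):
    """Remove HTML comments from the string."""
    return COMMENT.sub("", html)


def strip_blank_lines(text):
    """Remove lines that are empty or contain only spaces and tabs."""
    return BLANK_LINE.sub("", text)
```

Strip the comments first and the blank lines second, since removing a
comment that sat on its own line leaves a blank line behind. A final
blank line with no trailing newline is left alone by this pattern.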
--
William Park <>
Open Geometry Consulting, Toronto, Canada