Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Perl > Perl Misc > reuse code inquiry

Reply
Thread Tools

reuse code inquiry

 
 
a
Guest
Posts: n/a
 
      12-05-2007
Dear all, I am a perl beginner and I am suggested to parse HTML by using
other codes but not re-inventing the wheel.

The following code is from Internet Search but what i find is a lot of
subroutines. When I fed it with an HTM file, nothing is generated or printed
out. Would anybody tell me where all the TD elements it store?

-----------------------------------

# HTML parser
# Jim Davis, July 15 1994

# This is an HTML parser not an SGML parser. It does not parse a DTD,
# The DTD is implicit in the code, and specific to HTML.
# The processing of the HTML can be customized by the user by
# 1) Defining routines to be called for various tags (see Begin and End
arrays)
# 2) Defining routines html_content and html_whitespace

# This is not a validating parser. It does not check the content model
# eg you can use DT outside a DL and it won't know. It is too liberal in
# what tags are allowed to minimize what other tags.

# Bugs - can't parse the prolog or whatever you call it
# <!DOCTYPE HTML [
# <!entity % HTML.Minimal "INCLUDE">
# <!-- Include standard HTML DTD -->
# <!ENTITY % html PUBLIC "-//connolly hal.com//DTD WWW HTML 1.8//EN">
# %html;
# ]>

# modified 3 Aug to add a bunch of HTML 2.0 tags
# modified 3 Sept to print HTML stack to STDERR not STDOUT, to add new
# routines html_begin_doc and html_end_doc for application specific cleanup
# and to break parse_html into two pieces.
# modified 30 Sept 94. parse_attributes now handles tag attributes that
# don't have values. thanks to Bill Simpson-Young
<(E-Mail Removed)>
# for the code.
# modified 17 Apr 95 to support FORMS tags.
# modified 8 Dec 95 by Daniel LaLiberte to centralize STDERR output
# so it may be switched off more easily.

$debug = 0;

$whitespace_significant = 0;

# global variables:
# $line_buffer is line buffer
# $line_count is input line number.

$line_buffer = "";
$line_count = 0;

sub printErr {
# All errors should be printed through here, so they may be turned off.
print STDERR @_ if $debug;
}


sub parse_html {
local ($file) = @_;
open (HTML, $file) || die "Could not open $file: $!\nStopped";
&parse_html_stream ();
close (HTML);}

# Global input HTML is the handle to the stream of HTML
sub parse_html_stream {
local ($token, $new);

## initialization
@stack=();
$line_count = 0;
$line_buffer = "";

## application specific initialization
&html_begin_doc();
main:
while (1) {

# if whitespace does not matter, trim any leading space.
if (! $whitespace_significant) {
$line_buffer =~ s/^\s+//;}

# now dispatch on the type of token

if ($line_buffer =~ /^(\s+)/) {
$token = $1;
$line_buffer = $';
&html_whitespace ($token);}

# This will lose if there is more than one comment on the line!
elsif ($line_buffer =~ /^(\<!--.*-->)/) {
$token = $1;
$line_buffer = $';
&html_comment ($token);}

elsif ($line_buffer =~ /^(\<![^-][^\>]*\>)/) {
$token = $1;
$line_buffer = $';
&html_comment ($token);}

elsif ($line_buffer =~ /^(\<\/[^\>]*\>)/) {
$token = $1;
$line_buffer = $';
&html_etag ($token);}

elsif ($line_buffer =~ /^(\<[^!\/][^\>]*\>)/) {
$token = $1;
$line_buffer = $';
&html_tag ($token);}

elsif ($line_buffer =~ /^([^\s<]+)/) {
$token = $1;
$line_buffer = $';
$token = &substitute_entities($token); # not enough anyway.
&html_content ($token); }

else {
# No valid token in buffer. Maybe it's empty, or maybe there's an
# incomplete tag. So get some more data.
$new = <HTML>;
if (! defined ($new)) {last main;}
# if we're trying to find a match for a tag, then get rid of embedded
newline
# this is, I think, a kludge
if ($line_buffer =~ /^\</ && $line_buffer =~ /\n$/) {
chop $line_buffer;
$line_buffer .= " ";}
$line_buffer .= $new;
$line_count++;}
}

## cleanup
&html_end_doc();

if ($#stack > -1) {
&printErr ("Stack not empty at end of document\n");
&print_html_stack();}
}


sub html_tag {
local ($tag) = @_;
local ($element) = &tag_element ($tag);
local (%attributes) = &tag_attributes ($tag);

# the tag might minimize (be an implicit end) for the previous tag
local ($prev_element);
while (&Minimizes(&stack_top_element(), $element)) {
$prev_element = &stack_pop_element ();
if ($debug) {
&printErr ("MINIMIZING $prev_element with $element on $line_count\n");}
&html_end ($prev_element, 0);}

push (@stack, $tag);

&html_begin ($element, $tag, *attributes);

if (&Empty($element)) {
pop(@stack);
&html_end ($element, 0);}
}

sub html_etag {
local ($tag) = @_;
local ($element) = &tag_element ($tag);

# pop stack until find matching tag. This is probably a bad idea,
# or at least too general.
local ( $prev_element) = &stack_pop_element();
until ($prev_element eq $element) {
if ($debug) {
&printErr ("MINIMIZING $prev_element with /$element on $line_count
\n");}
&html_end ($prev_element, 0);

if ($#stack == -1) {
&printErr ("No match found for /$element. You will lose\n");
last;}
$prev_element = &stack_pop_element();}

&html_end ($element, 1);
}


# For each element, the names of elements which minimize it.
# This is of course totally HTML dependent and probably I have it wrong too
$Minimize{"DT"} = "DTD";
$Minimize{"DD"} = "DT";
$Minimize{"LI"} = "LI";
$Minimize{"P"} = "PT:LI:H1:H2:H3:H4:BLOCKQUOTE:UL:OLL";

# Does element E2 minimize E1?
sub Minimizes {
local ($e1, $e2) = @_;
local ($value) = 0;
foreach $elt (split (":", $Minimize{$e1})) {
if ($elt eq $e2) {$value = 1;}}
$value;}

$Empty{"BASE"} = 1;
$Empty{"BR"} = 1;
$Empty{"HR"} = 1;
$Empty{"IMG"} = 1;
$Empty{"ISINDEX"} = 1;
$Empty{"LINK"} = 1;
$Empty{"META"} = 1;
$Empty{"NEXTID"} = 1;
$Empty{"INPUT"} = 1;

# Empty tags have no content and hence no end tags
sub Empty {
local ($element) = @_;
$Empty{$element};}


sub print_html_stack {
&printErr ("\n ==\n");
foreach $elt (reverse @stack) {&printErr (" $elt\n");}
&printErr (" ==========\n");}

# The element on top of stack, if any.
sub stack_top_element {
if ($#stack >= 0) { &tag_element ($stack[$#stack]);}}

sub stack_pop_element {
&tag_element (pop (@stack));}

# The element from the tag, normalized.
sub tag_element {
local ($tag) = @_;
$tag =~ /<\/?([^\s>]+)/;
local ($element) = $1;
$element =~ tr/a-z/A-Z/;
$element;}

# associative array of the attributes of a tag.
sub tag_attributes {
local ($tag) = @_;
$tag =~ /^<[A-Za-z]+ +(.*)>$/;
&parse_attributes($1);}

# string should be something like
# KEY="value" KEY2="longer value" KEY3="tags o doom"
# output is an associative array (like a lisp property list)
# attributes names are not case sensitive, do I downcase them
# Maybe (probably) I should substitute for entities when parsing attributes.

sub parse_attributes {
local ($string) = @_;
local (%attributes);
local ($name, $val);
get: while (1) {
if ($string =~ /^ *([A-Za-z]+)\s*=\s*\"([^\"]*)\"/) {
$name = $1;
$val = $2;
$string = $';
$name =~ tr/A-Z/a-z/;
$attributes{$name} = $val; }
elsif ($string =~ /^ *([A-Za-z]+)\s*=\s*(\S*)/) {
$name = $1;
$val = $2;
$string = $';
$name =~ tr/A-Z/a-z/;
$attributes{$name} = $val;}
elsif ($string =~ /^ *([A-Za-z]+)/) {
$name = $1;
$val = "";
$string = $';
$name =~ tr/A-Z/a-z/;
$attributes{$name} = $val;}
else {last;}}
%attributes;}

sub substitute_entities {
local ($string) = @_;
$string =~ s/&lt;/</g;
$string =~ s/&gt;/>/g;
$string =~ s/&quot;/\"/g;
$string =~ s/&nbsp;/ /g;
# Other entities.

$string =~ s/&amp;/&/g; # Do this last.
$string;}


@HTML_elements = (
"A",
"ADDRESS",
"B",
"BASE",
"BLINK", # Netscape addition
"BLOCKQUOTE",
"BODY",
"BR",
"CITE",
"CENTER", # Netscape addition
"CODE",
"DD",
"DIR",
"DFN",
"DL",
"DT",
"EM",
"FORM",
"H1", "H2", "H3", "H4", "H5", "H6",
"HEAD",
"HR",
"HTML",
"I",
"ISINDEX",
"IMG",
"INPUT",
"KBD",
"LI",
"LINK",
"MENU",
"META",
"NEXTID",
"OL",
"OPTION",
"P",
"PRE",
"SAMP",
"SELECT",
"STRIKE",
"STRONG",
"TITLE",
"TEXTAREA",
"TT",
"UL",
"VAR",
);

sub define_element {
local ($element) = @_;
$Begin{$element} = "Noop";
$End{$element} = "Noop";}

foreach $element (@HTML_elements) {&define_element($element);}

# do nothing
sub Noop {
local ($element, $xxx) = @_;}

# called when a tag begins. Dispatches using Begin
sub html_begin {
local ($element, $tag, *attributes) = @_;

local ($routine) = $Begin{$element};
if ($routine eq "") {
&printErr ("Unknown HTML element $element ($tag) on line $line_count\n");}
else {eval "&$routine;"}}

# called when a tag ends. Explicit is 0 if tag end is because of
minimization
# not that you should care.
sub html_end {
local ($element, $explicit) = @_;
local ($routine) = $End{$element};
if ($routine eq "") {
&printErr ("Unknown HTML element \"$element\" (END $explicit) on line
$line_count\n");}
else {eval "&$routine(\"$element\", $explicit)";}}

sub html_content {
local ($word) = @_;
}

sub html_whitespace {
local ($whitespace) = @_;}

sub html_comment {
local ($tag) = @_;}

# redefine these for application-specific initialization and cleanup

sub html_begin_doc {}

sub html_end_doc {}

# return a "true value" when loaded by perl.
1;



 
Reply With Quote
 
 
 
 
Joost Diepenmaat
Guest
Posts: n/a
 
      12-05-2007
On Wed, 05 Dec 2007 22:34:39 +0800, a wrote:

> Dear all, I am a perl beginner and I am suggested to parse HTML by using
> other codes but not re-inventing the wheel.


If you're looking for existing perl libraries, the best place to start is

http://search.cpan.org/

As for the HTML parser, try this one:
http://search.cpan.org/~gaas/HTML-Parser-3.56/Parser.pm

Joost.

 
Reply With Quote
 
 
 
 
Ben Morrow
Guest
Posts: n/a
 
      12-05-2007

Quoth "a" <(E-Mail Removed)>:
> Dear all, I am a perl beginner and I am suggested to parse HTML by using
> other codes but not re-inventing the wheel.


This is good advice. One of Perl's strengths is the large amount of
good-quality code that is available for reuse.

> The following code is from Internet Search but what i find is a lot of
> subroutines. When I fed it with an HTM file, nothing is generated or printed
> out. Would anybody tell me where all the TD elements it store?


Nowhere. Did you read the comments? The code calls subs, which you have
to define, whenever it encounters a tag. If you want to store them
somewhere, you have to do it yourself. Also, the file is a Perl 4-style
library, not a complete script. If you attempt to simply run it it will
do nothing at all.

It seems to me that you need to read a good beginners book on Perl
before you go much further; 'Learning Perl' by Randal Schwartz et al. is
recommended, or see perldoc -q book for more.

> # HTML parser
> # Jim Davis, July 15 1994


This looks (from a brief check) like basically decent code, but it is
*very* old. It was clearly written for Perl 4, and only supports HTML
3.2, both of which are extinct nowadays. Get the HTML:arser module
from CPAN and use that instead.

In general searching CPAN http://search.cpan.org is a better place to
start when looking for Perl code than searching the whole web. There's
an awful lot of really bad Perl out there.

Ben

 
Reply With Quote
 
Michele Dondi
Guest
Posts: n/a
 
      12-05-2007
On Wed, 5 Dec 2007 22:34:39 +0800, "a" <(E-Mail Removed)> wrote:

>Dear all, I am a perl beginner and I am suggested to parse HTML by using
>other codes but not re-inventing the wheel.


Generally people are suggested more specifically to use good HTML
parsing modules out of CPAN. The code you pasted is not a module, and
doesn't look very good. Also, its release date -namely 1994- should
ring a bell.


Michele
--
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
To reuse or not to reuse jacob navia C Programming 19 12-18-2006 07:22 AM
code reuse and design reuse sailor.gu@gmail.com C Programming 16 02-12-2006 09:09 PM
Code reuse, code behind, and can't inherit since already inheritsPage Randall Parker ASP .Net 2 11-01-2005 06:59 PM
Reuse paramter list and reuse connection tshad ASP .Net 5 05-17-2005 12:33 AM
To reuse or not to reuse.... Hylander Java 0 02-26-2004 12:00 AM



Advertisments