Velocity Reviews - Computer Hardware Reviews


HTML info extraction utility

 
 
MaggieMagill
 
      03-03-2005
Is there any utility that can gather info such as a list of images, fonts
used, links used, etc? Something that could start at "index.html" and run
thru all other html (local) files that are referenced along the way?
 
 
 
 
 
Richard
 
      03-03-2005
On Thu, 03 Mar 2005 04:43:58 GMT MaggieMagill wrote:

> Is there any utility that can gather info such as a list of images, fonts
> used, links used, etc? Something that could start at "index.html" and run
> thru all other html (local) files that are referenced along the way?


Perhaps with a JavaScript routine.
It is not directly possible with pure HTML.


 
 
 
 
 
Andy Dingley
 
      03-03-2005
It was somewhere outside Barstow when MaggieMagill wrote:

>Is there any utility that can gather info such as a list of images, fonts
>used, links used, etc?


Any number of them. They're usually written in Perl, because it has
usable parsing modules available off the shelf.

 
 
data64
 
      03-04-2005
MaggieMagill wrote in news:iCwVd.25521$7z6.66@lakeread04:

> Is there any utility that can gather info such as a list of images, fonts
> used, links used, etc? Something that could start at "index.html" and run
> thru all other html (local) files that are referenced along the way?


As Andy replied, this can be put together in short order using Perl. I think
Dreamweaver also has some such capabilities (from what little I have used
it). You can run reports on local sites that would give you this information.

data64
 
 
MaggieMagill
 
      03-04-2005
Andy Dingley wrote:

> It was somewhere outside Barstow when MaggieMagill wrote:
>
>>Is there any utility that can gather info such as a list of images,
>>fonts used, links used, etc?
>
> Any number of them. They're usually written in Perl, because it has
> usable parsing modules available off the shelf.


Could you direct me to where I would find these types of utilities? I'm not
quite sure what search terms I would use.

I would be using them on a local machine that now has Apache 1.3.33 running
and a bunch of 8-9 year old HTML pages (no styles used) that I need to sift
thru. Images, fonts & links are really the only data I need to extract.

I was thinking of breaking out the Pascal until I realized that 15 years
of not using it might have dulled my minimal skills.
 
 
Andy Dingley
 
      03-04-2005
It was somewhere outside Barstow when MaggieMagill wrote:

>Could you direct me to where I would find these types of utilities? I'm not
>quite sure what search terms I would use.


Googling for "HTML analysis" or some such ought to turn up tools that
meet your immediate needs, immediately.

If you want to write some Perl (which is a worthy goal, but probably
overkill for this) then look at the LWP module and the HTML::Parser
class (HTML::TokeParser is sometimes easier to use for people less
familiar with Perl). This will spit every tag back at you and a
simple switch statement can recognise the tag types and analyse
accordingly. A couple of hashes (associative arrays) to store things
in and off you go.

You could probably do this task with anything, and I'm not smart
enough to know what the best tool is. But when I do it, I re-write the
nasty hacky Perl I used last time and change a handful of lines for my
specific need.
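
For anyone who would rather not write Perl, the same idea (a parser that
hands back every tag, a branch on the tag name, and a few hash-like
containers) can be sketched with Python's standard-library html.parser,
swapped in here for HTML::Parser. This is only a rough sketch under
assumptions: the names TagCollector and scan_site are made up for
illustration, "fonts" is taken to mean the face attribute of <font> tags,
the pages are assumed to be readable as Latin-1, and only relative links
ending in .html or .htm are followed.

```python
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urlparse


class TagCollector(HTMLParser):
    """Collects image sources, font faces and link targets from one page."""

    def __init__(self):
        super().__init__()
        # Sets keyed by category stand in for the Perl hashes.
        self.found = defaultdict(set)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.found["images"].add(attrs["src"])
        elif tag == "font" and attrs.get("face"):
            self.found["fonts"].add(attrs["face"])
        elif tag == "a" and attrs.get("href"):
            self.found["links"].add(attrs["href"])


def scan_site(start):
    """Walk the local HTML files reachable from `start` and merge the results."""
    totals = defaultdict(set)
    queue = [Path(start).resolve()]
    seen = set()
    while queue:
        page = queue.pop()
        if page in seen or not page.is_file():
            continue
        seen.add(page)

        collector = TagCollector()
        # Assumption: old pages decode acceptably as Latin-1.
        collector.feed(page.read_text(encoding="latin-1"))
        for kind, values in collector.found.items():
            totals[kind] |= values

        for href in collector.found["links"]:
            target, _fragment = urldefrag(href)
            parsed = urlparse(target)
            # Skip empty targets and anything with a scheme or host
            # (http:, mailto:, //server/...), which is not a local file.
            if not target or parsed.scheme or parsed.netloc:
                continue
            if target.lower().endswith((".html", ".htm")):
                queue.append((page.parent / target).resolve())
    return totals
```

Calling scan_site("index.html") from the site's root folder returns a
dictionary of sets under the keys "images", "fonts" and "links". Only
<a href> links are followed, so frames or image maps would need extra
branches in handle_starttag.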
 
 
 
 