ASP Question: Parse HTML file?

 
 
Rob Meade
05-18-2006
Hi all,

I'm working on a project where there are just under 1300 course files, these
are HTML files - my problem is that I need to do more with the content of
these pages - and the thought of writing 1300 asp pages to deal with this
doesn't thrill me.

The HTML pages are provided by a training company. They seem to be
"structured" to some degree, but I'm not sure how easy its going to be to
parse the page.

Typically there are the following "sections" of each page:

Title
Summary
Topics
Technical Requirements
Copyright Information
Terms Of Use

I need to get the content for the Title, Summary, Topics and Technical
Requirements, and lose the Copyright and Terms of Use. In addition, I need to
squeeze in a new section which will display pricing information and an
"Add to cart" link, etc.

My "plan" (if you can call it that) was to have 1 asp page which can parse
the appropriate HTML file based on the asp page being passed a code in the
querystring - the code will match the filename of the HTML page (the first
part prior to the dot).
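
The loading side I can probably manage with something along these lines
(untested, just to show what I mean - the page name, querystring parameter
and .html extension are only for illustration):

<%
Dim objFSO, objFile, strCode, strHTML

'Course code arrives on the querystring, e.g. course.asp?code=560c04
strCode = Request.QueryString("code")
'(I'd need to validate strCode here so nobody can pass in "../something")

Set objFSO = Server.CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile(Server.MapPath("courses/" & strCode & ".html"), 1)
strHTML = objFile.ReadAll()
objFile.Close
%>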

What I then need to do is go through the content of the HTML....this is
where I am currently stuck....

I have pasted an example of one of these pages below - if anyone can suggest
how I might achieve this I would be most grateful. In addition, if anyone can
explain the XML namespace stuff in there, that would be handy too - I figure
this is just a normal HTML page, as there is no declaration or anything at
the top?

Any information/suggestions would be most appreciated.

Thanks in advance for your help,

Regards

Rob


Example file:

<html>
<head>
<title>Novell 560 CNE Series: File System</title>
<meta name="Description" content="">
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
</head>
<body class="MlCatPage">
<table class="Header" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="Logo" colspan="2">
<img class="Logo" src="../images/logo.gif">
</td>
</tr>
<tr>
<td class="Title">
<div class="ProductTitle">
<span class="CoCat">Novell 560 CNE Series: File System</span>
</div>
<div class="ProductDetails">
<span class="SmallText">
<span class="BoldText"> Product Code: </span>
560c04<span class="BoldText"> Time: </span>
4.0 hour(s)<span class="BoldText"> CEUs: </span>
Available</span>
</div>
</td>
<td class="Back">
<div class="BackButton">
<a href="javascript:history.back()">
<img src="../images/back.gif" align="right" border="0">
</a>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="HighLevel" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="sectiontext">Summary:</h3>
</td>
</tr>
<tr>
<td class="Overview">
<div class="ProductSummary">This course provides an introduction
to NetWare 5 file system concepts and management procedures.</div>
<br>
<h3 class="Sectiontext">Objectives:</h3>
<div class="FreeText">After completing this course, students will
be able to: </div>
<div class="ObjectiveList">
<ul class="listing">
<li class="ObjectiveItem">Explain the relationship of the file
system and login scripts</li>
<li class="ObjectiveItem">Create login scripts</li>
<li class="ObjectiveItem">Manage file system directories and
files</li>
<li class="ObjectiveItem">Map network drives</li>
</ul>
</div>
<br></br>
<h3 class="Sectiontext">Topics:</h3>
<div class="OutlineList">
<ul class="listing">
<li class="OutlineItem">Managing the File System</li>
<li class="OutlineItem">Volume Space</li>
<li class="OutlineItem">Examining Login Scripts</li>
<li class="OutlineItem">Creating and Executing Login
Scripts</li>
<li class="OutlineItem">Drive Mappings</li>
<li class="OutlineItem">Login Scripts and Resources</li>
</ul>
</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Details" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Technical Requirements:</h3>
</td>
</tr>
<tr>
<td class="Details">
<div class="ProductRequirements">200MHz Pentium with 32MB Ram. 800
x 600 minimum screen resolution. Windows 98, NT, 2000, or XP. 56K minimum
connection speed, broadband (256 kbps or greater) connection recommended.
Internet Explorer 5.0 or higher required. Flash Player 7.0 or higher
required. JavaScript must be enabled. Netscape, Firefox and AOL browsers not
supported.</div>
</td>
</tr>
</table>
<br xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<table class="Legal" xmlns:fo="http://www.w3.org/1999/XSL/Format"
xmlns:fn="http://www.w3.org/2005/xpath-functions">
<tr>
<td class="BlockHeader">
<h3 class="Sectiontext">Copyright Information:</h3>
</td>
</tr>
<tr>
<td class="Copyright">
<div class="ProductRequirements">Product names mentioned in this
catalog may be trademarks/servicemarks or registered trademarks/servicemarks
of their respective companies and are hereby acknowledged. All product
names that are known to be trademarks or service marks have been
appropriately capitalized. Use of a name in this catalog is for
identification purposes only, and should not be regarded as affecting the
validity of any trademark or service mark, or as suggesting any affiliation
between MindLeaders.com, Inc. and the trademark/servicemark
proprietor.</div>
<br>
<h3 class="Sectiontext">Terms of Use:</h3>
<div class="ProductUsenote"></div>
</td>
</tr>
</table>
<p align="center">
<span class="SmallText">Copyright &copy; 2006 MindLeaders. All rights
reserved.</span>
</p>
</body>
</html>


 
Mike Brind
05-18-2006

Rob Meade wrote:
> Hi all,
>
> I'm working on a project where there are just under 1300 course files, these
> are HTML files - my problem is that I need to do more with the content of
> these pages - and the thought of writing 1300 asp pages to deal with this
> doesn't thrill me.
>
> The HTML pages are provided by a training company. They seem to be
> "structured" to some degree, but I'm not sure how easy its going to be to
> parse the page.
>
> Typically there are the following "sections" of each page:
>
> Title
> Summary
> Topics
> Technical Requirements
> Copyright Information
> Terms Of Use


If you can identify the specific divs that hold this information (and
they are consistent across pages), you could use regex to parse the
files and pop the relevant bits into a database.
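
For example, something roughly like this (untested; the class name is taken
from your sample page, and strHTML is assumed to hold the file contents
you've already read in):

Dim objRE, colMatches, strSummary
Set objRE = New RegExp
objRE.IgnoreCase = True
objRE.Global = False

'Grab everything between the ProductSummary div and its closing tag
'(the divs in your sample aren't nested, so a non-greedy match is enough)
objRE.Pattern = "<div class=""ProductSummary"">([\s\S]*?)</div>"

Set colMatches = objRE.Execute(strHTML)
If colMatches.Count > 0 Then
    strSummary = colMatches(0).SubMatches(0)
End If

Swap the class name for ProductTitle, ObjectiveList, OutlineList and so on to
pick up the other sections, and write each bit to your database as you go.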

--
Mike Brind

 
Anthony Jones
05-18-2006

>
> I have pasted an example of one of these pages below - if anyone can
> suggest how I might achieve this I would be most grateful. In addition,
> if anyone can explain the XML namespace stuff in there, that would be
> handy too - I figure this is just a normal HTML page, as there is no
> declaration or anything at the top?
>


These pages will have been generated via an XSLT transform, and the transform
will have made use of these namespaces. However, unless told otherwise, XSLT
outputs the xmlns declarations for those namespaces even when no element
belonging to them is output - which is the case here.

That's a long-winded way of saying they don't do anything - ignore them.

It's a pity they didn't go the whole hog and output the whole page as XML; it
would be a lot easier to do what you need. Still, it's a good sign that the
content of the other 1299 pages is likely to be consistent, so Mike's idea of
scanning with RegExp should work.

Anthony.


 
Rob Meade
05-19-2006
"McKirahan" wrote ...

> Consider displaying their page inside of an <iframe>
> inside of a page that has your content.


Hi McKirahan,

Thanks for your reply - alas I need "bits" of their pages, with "bits" of my
stuff inserted in between, so including their whole page as-is unfortunately
is no good for me.

Regards

Rob


 
Rob Meade
05-19-2006
"Mike Brind" wrote ...

> If you can identify the specific divs that hold this information (and
> they are consistent across pages), you could use regex to parse the
> files and pop the relevant bits into a database.


Hi Mike,

Thanks for your reply.

I don't suppose you might have an example that would get me started with
that approach, would you? It sounds like it could well work.

Regards

Rob


 
Rob Meade
05-19-2006
"Anthony Jones" wrote ...

> These pages will have been generated via an XSLT transform, and the
> transform will have made use of these namespaces. However, unless told
> otherwise, XSLT outputs the xmlns declarations for those namespaces even
> when no element belonging to them is output - which is the case here.
>
> That's a long-winded way of saying they don't do anything - ignore them.
>
> It's a pity they didn't go the whole hog and output the whole page as XML;
> it would be a lot easier to do what you need. Still, it's a good sign that
> the content of the other 1299 pages is likely to be consistent, so Mike's
> idea of scanning with RegExp should work.


Hi Anthony,

Thanks for the reply.

I especially appreciate the explanation of why they are there - I tried
googling it last night and found some stuff about XSLT 2.0, but it didn't
really get me anywhere. I would agree that it's a shame the pages aren't
output as XML - that would have been nice!

Cheers

Rob


 
McKirahan
05-19-2006
"Mike Brind" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) oups.com...
>
> Rob Meade wrote:
> > Hi all,
> >
> > I'm working on a project where there are just under 1300 course files,

these
> > are HTML files - my problem is that I need to do more with the content

of
> > these pages - and the thought of writing 1300 asp pages to deal with

this
> > doesn't thrill me.
> >
> > The HTML pages are provided by a training company. They seem to be
> > "structured" to some degree, but I'm not sure how easy its going to be

to
> > parse the page.
> >
> > Typically there are the following "sections" of each page:
> >
> > Title
> > Summary
> > Topics
> > Technical Requirements
> > Copyright Information
> > Terms Of Use

>
> If you can identify the specific divs that hold this information (and
> they are consistent across pages), you could use regex to parse the
> files and pop the relevant bits into a database.
>
> --
> Mike Brind
>


It would have been nice if each div class were unique.
This one is repeated:
<div class="ProductRequirements">
It's not wrong, just (potentially) inconvenient.

<td class="Details">
<div class="ProductRequirements">200MHz Pentium ...

<td class="Copyright">
<div class="ProductRequirements">Product names ...

Which div's are you interested in?


Here's a script that will extract all the div's into a new file:

Option Explicit
'*
Const cVBS = "Novell.vbs"
Const cOT1 = "Novell.htm" '= Input filename
Const cOT2 = "Novell.txt" '= Output filename
Const cDIV = "</div>"
'*
'* Declare Variables
'*
Dim intBEG
intBEG = 1
Dim arrDIV(9)
arrDIV(0) = "<div class=" & Chr(34) & "?" & Chr(34) & ">"
arrDIV(1) = "ProductTitle"
arrDIV(2) = "ProductDetails"
arrDIV(3) = "ProductSummary"
arrDIV(4) = "FreeText"
arrDIV(5) = "ObjectiveList"
arrDIV(6) = "OutlineList"
arrDIV(7) = "ProductRequirements"
arrDIV(8) = "ProductRequirements"
arrDIV(9) = "ProductUsenote"
Dim intDIV
Dim strDIV
Dim arrOT1
Dim intOT1
Dim strOT1
Dim strOT2
Dim intPOS
'*
'* Declare Objects
'*
Dim objFSO
Set objFSO = CreateObject("Scripting.FileSystemObject")
Dim objOT1
Set objOT1 = objFSO.OpenTextFile(cOT1,1)
Dim objOT2
Set objOT2 = objFSO.OpenTextFile(cOT2,2,True)
'*
'* Read File, Extract "div", Write Line
'*
strOT1 = objOT1.ReadAll()
For intDIV = 1 To UBound(arrDIV)
    strOT2 = Mid(strOT1,intBEG)                     'search from where we left off
    strDIV = Replace(arrDIV(0),"?",arrDIV(intDIV))  'e.g. <div class="ProductTitle">
    intPOS = InStr(strOT2,strDIV)
    If intPOS > 0 Then
        intBEG = intBEG + intPOS - 1                'absolute start of this div in strOT1
        strOT2 = Mid(strOT2,intPOS)
        intPOS = InStr(strOT2,cDIV)
        strOT2 = Left(strOT2,intPOS+Len(cDIV)-1)    'keep up to and including </div>
        objOT2.WriteLine(strOT2 & vbCrLf)
        intBEG = intBEG + intPOS + Len(cDIV) - 1    'continue after the closing </div>
    End If
Next
'*
'* Destroy Objects
'*
Set objOT1 = Nothing
Set objOT2 = Nothing
Set objFSO = Nothing
'*
'* Done!
'*
MsgBox "Done!",vbInformation,cVBS

You could modify it to loop through a list or folder of files.
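
For example, something along these lines (just a sketch - the folder path is
made up):

Dim objFolder, objFile
Set objFolder = objFSO.GetFolder("C:\Courses")   'wherever the source .htm files live
For Each objFile In objFolder.Files
    If LCase(objFSO.GetExtensionName(objFile.Name)) = "htm" Then
        'Open objFile.Path instead of cOT1, run the same extraction,
        'and write the results to a matching file in an output folder.
    End If
Next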

Note that each "class=" is in the stylesheet:
<link rel="stylesheet" href="../resource/mlcatstyle.css"
type="text/css">
which you should refer to when using their div's.


 
Rob Meade
05-19-2006
"McKirahan" wrote ...

Hi McKirahan, thank you again for your reply and example.

I should add that I won't be writing these out to another file; instead it'll
need to happen on the fly, i.e. take the original source page based on the
code passed in the URL, read in the appropriate parts, and then spit out my
own layout and extra parts.

With the example you posted (below) - does it extract what's between the DIV
tags, i.e. the <tr>s and <td>s as well, or just the actual "text"?

Thanks again

Rob
PS: The copyright one can be excluded.
PPS: When I say it's going to happen on the fly, this would obviously depend
on how quick and efficient it is - if it turns out to be a bit too slow
because of the number of hits the site in question gets, then I might have
to have some kind of "import" process, which obviously would make more sense
anyway; this could then create new pages, or perhaps store the information
in the database.
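
If I do end up going the import route, I imagine the database side would be
something roughly like this (just a sketch - the table, column and variable
names are made up, and strConnect would be my connection string):

Const adOpenKeyset = 1, adLockOptimistic = 3, adCmdTable = 2
Dim objRS
Set objRS = CreateObject("ADODB.Recordset")
objRS.Open "Course", strConnect, adOpenKeyset, adLockOptimistic, adCmdTable
objRS.AddNew
objRS("CourseCode") = strCode     'e.g. 560c04 - the filename prefix
objRS("Title") = strTitle         'pulled from the ProductTitle div
objRS("Summary") = strSummary     'pulled from the ProductSummary div
objRS.Update
objRS.Close
Set objRS = Nothing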

> It would have been nice if each div class were unique.
> This one is repeated:
> <div class="ProductRequirements">
> It's not wrong, just (potentially) inconvenient.
>
> <td class="Details">
> <div class="ProductRequirements">200MHz Pentium ...
>
> <td class="Copyright">
> <div class="ProductRequirements">Product names ...
>
> Which div's are you interested in?
>
>
> Here's a script that will extract all the div's into a new file:
>
> Option Explicit
> '*
> Const cVBS = "Novell.vbs"
> Const cOT1 = "Novell.htm" '= Input filename
> Const cOT2 = "Novell.txt" '= Output filename
> Const cDIV = "</div>"
> '*
> '* Declare Variables
> '*
> Dim intBEG
> intBEG = 1
> Dim arrDIV(9)
> arrDIV(0) = "<div class=" & Chr(34) & "?" & Chr(34) & ">"
> arrDIV(1) = "ProductTitle"
> arrDIV(2) = "ProductDetails"
> arrDIV(3) = "ProductSummary"
> arrDIV(4) = "FreeText"
> arrDIV(5) = "ObjectiveList"
> arrDIV(6) = "OutlineList"
> arrDIV(7) = "ProductRequirements"
> arrDIV(8) = "ProductRequirements"
> arrDIV(9) = "ProductUsenote"
> Dim intDIV
> Dim strDIV
> Dim arrOT1
> Dim intOT1
> Dim strOT1
> Dim strOT2
> Dim intPOS
> '*
> '* Declare Objects
> '*
> Dim objFSO
> Set objFSO = CreateObject("Scripting.FileSystemObject")
> Dim objOT1
> Set objOT1 = objFSO.OpenTextFile(cOT1,1)
> Dim objOT2
> Set objOT2 = objFSO.OpenTextFile(cOT2,2,True)
> '*
> '* Read File, Extract "div", Write Line
> '*
> strOT1 = objOT1.ReadAll()
> For intDIV = 1 To UBound(arrDIV)
>     strOT2 = Mid(strOT1,intBEG)                     'search from where we left off
>     strDIV = Replace(arrDIV(0),"?",arrDIV(intDIV))  'e.g. <div class="ProductTitle">
>     intPOS = InStr(strOT2,strDIV)
>     If intPOS > 0 Then
>         intBEG = intBEG + intPOS - 1                'absolute start of this div in strOT1
>         strOT2 = Mid(strOT2,intPOS)
>         intPOS = InStr(strOT2,cDIV)
>         strOT2 = Left(strOT2,intPOS+Len(cDIV)-1)    'keep up to and including </div>
>         objOT2.WriteLine(strOT2 & vbCrLf)
>         intBEG = intBEG + intPOS + Len(cDIV) - 1    'continue after the closing </div>
>     End If
> Next
> '*
> '* Destroy Objects
> '*
> Set objOT1 = Nothing
> Set objOT2 = Nothing
> Set objFSO = Nothing
> '*
> '* Done!
> '*
> MsgBox "Done!",vbInformation,cVBS
>
> You could modify it to loop through a list or folder of files.
>
> Note that each "class=" is in the stylesheet:
> <link rel="stylesheet" href="../resource/mlcatstyle.css"
> type="text/css">
> which you should refer to when using their div's.



 
McKirahan
05-19-2006
"Rob Meade" <(E-Mail Removed)> wrote in message
news:e3WJh$(E-Mail Removed)...
> "McKirahan" wrote ...
>
> Hi McKirahan, thank you again for your reply and example.
>
> I should add that I wont be writing these out to another file, instead

it'll
> need to do it on the fly, ie, take the original source page by the code
> passed in the URL, read in the appropriate parts, and then spit out my own
> layout and extra parts.
>
> With the example you posted (below) - does it extract whats between the

DIV
> tags, ie the <tr>'s and <td's> as well, or just the actually "text"?
>
> Thanks again
>
> Rob
> PS: The copyright one can be excluded..
> PPS: When I say its going to happen on the fly, this would obviously

depend
> on how quick and efficient it is - if it turns out that because of the
> number of hits they get on the site in question its a bit too slow, then I
> might have to have some kind of "import" process which obviously would

make
> more sense anyway, this could then create new pages, or perhaps store the
> information in the database.
>


Did you try it as-is to see what you get?

I would probably put all 1300 files (pages) in a single folder.
Then run a process against each to generate 1300 new files in
a different folder. These would be posted for quick access.

Prior to posting, they could be reviewed for accuracy.

Also, instead of extracting out the div's you could just identify
where you want your stuff inserted.
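
For instance, something like this would splice your block in just ahead of
the copyright section (untested - strPricing stands for whatever HTML you
want to add, and strHTML for the course page you've read in):

Dim intIns
intIns = InStr(strHTML, "<table class=""Legal""")   'start of the Copyright table
If intIns > 0 Then
    strHTML = Left(strHTML, intIns - 1) & strPricing & Mid(strHTML, intIns)
End If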


 
Rob Meade
05-20-2006
"McKirahan" wrote ...

> Did you try it as-is to see what you get?


Hi McKirahan, thanks for your reply.

Not as of yet, no - but I'm home this weekend so will be giving it a go :-)

> I would probably put all 1300 files (pages) in a single folder.


They come in a /courses directory

> Then run a process against each to generate 1300 new files in
> a different folder. These would be posted for quick access.


I think I might have to change the process a bit but the idea is the same -
the content provider has other bits that link to these files, so they'd
still need to be in a /courses directory, but I could put them somewhere
else first, "mangle" them and then spit them out to the /courses directory :-)

> Prior to posting, they could be reviewed for accuracy.


I might check a couple - but not all 1300 - I don't wanna go mental... :-D

> Also, instead of extracting out the div's you could just identify
> where you want your stuff inserted.


Yeah, but there were bits I needed to lose, i.e. the copyright section etc.

I seem to remember a discussion a long time back about transforming pages - I
think it might have been done in an ISAPI filter or something, not sure. From
what I remember, the requested page would get grabbed, actions would happen,
and then it could be spat out as a different page. I wonder if this is what
the previous company that did this adopted, because I find it hard to believe
they would have created 1300 ASP files, and yet all of the links on the
original site were <course-code>.asp as opposed to the real file
<course-code>.html - if you see what I mean...

Regards

Rob


 