Velocity Reviews - Computer Hardware Reviews

HTML parsing using Java and Xerces

 
 
Camk
      03-19-2007
Hey, is it possible to do the following?

1. Enter a search term in ask.com (manually) and hit search.
2. Once the results page is shown, view the page source and save it to
the hard disk (manually).
3. Use a Java program with an embedded HTML parser to extract the
returned URLs.
4. Once the URLs are extracted, store them automatically in
a MySQL database.
The database has a single table with the following columns:
Query - Stores a string of the search query used.
SearchEngine - Stores a string naming the search engine (e.g. Ask).
ReturnedURL - Stores a string of the returned URL (taken from
the parsed page source).
URLNo - Stores an int giving the position of the returned URL (i.e. the first
URL is number 1 and so on).

 
Chris
      03-20-2007
Camk wrote:
> Hey, is it possible to do the following?
>
> 1. Enter a search term in ask.com (manually) and hit search.
> 2. Once the results page is shown, view the page source and save it to
> the hard disk (manually).
> 3. Use a Java program with an embedded HTML parser to extract the
> returned URLs.
> 4. Once the URLs are extracted, store them automatically in
> a MySQL database.
> The database has a single table with the following columns:
> Query - Stores a string of the search query used.
> SearchEngine - Stores a string naming the search engine (e.g. Ask).
> ReturnedURL - Stores a string of the returned URL (taken from
> the parsed page source).
> URLNo - Stores an int giving the position of the returned URL (i.e. the first
> URL is number 1 and so on).
>


Yes, it is possible, and there are lots of ways to do it. The trick is to
find a reliable way to recognize the various entities in the page.

I would start by reading the page into a String or char array, and then
see whether I could write regular expressions to recognize the URLs. See
java.util.regex.
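
A rough sketch of that idea, not a finished solution: it reads the saved
page into a String and pulls out every absolute http/https link found in an
href attribute. The class name UrlExtractor and the pattern are my own
assumptions. The pattern knows nothing about how Ask marks up its result
links, so it will also pick up navigation and ad links; you would have to
narrow it after looking at the saved source. It also leaves entities such
as &amp; in the URLs undecoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class UrlExtractor {

    // Matches href="..." or href='...' whose value is an absolute
    // http or https URL. Group 1 is the quote character, group 2 the URL.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(https?://[^\"'\\s>]+)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the matched URLs in the order they appear in the page.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    // Usage: java UrlExtractor saved-results.html
    // Assumes the saved file is UTF-8; adjust the charset if it is not.
    public static void main(String[] args) throws IOException {
        String html = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<String> urls = extractUrls(html);
        for (int i = 0; i < urls.size(); i++) {
            // Position starts at 1, as URLNo does in the question.
            System.out.println((i + 1) + "\t" + urls.get(i));
        }
    }
}
```

The list index plus one gives you the URLNo value directly, since the
matches come back in page order.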

Don't use Xerces. It is an XML parser, and it will choke on any ill-formed
HTML.
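
For step 4, plain JDBC with a PreparedStatement should be enough. This is
only a sketch under assumptions: the table name SearchResults and the class
name ResultStore are made up (the question names the columns but not the
table), and you would still need to open the Connection yourself, for
example through DriverManager with the MySQL JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper class; the name is illustrative only.
class ResultStore {

    // Column names come from the question; the table name is an assumption.
    static final String INSERT_SQL =
            "INSERT INTO SearchResults (Query, SearchEngine, ReturnedURL, URLNo) "
            + "VALUES (?, ?, ?, ?)";

    // Queues one row per URL and sends them as a single batch.
    // Returns the number of rows that were queued and executed.
    static int store(Connection conn, String query, String engine,
                     List<String> urls) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (int i = 0; i < urls.size(); i++) {
                ps.setString(1, query);
                ps.setString(2, engine);
                ps.setString(3, urls.get(i));
                ps.setInt(4, i + 1); // URLNo: first URL is number 1
                ps.addBatch();
            }
            return ps.executeBatch().length;
        }
    }
}
```

Using parameters instead of building the SQL string by hand also keeps odd
characters in the URLs or the query from breaking the statement.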
 