JavaScript and Screenscraping

 
 
Roedy Green
      03-30-2011
I am working on a screenscraping project that is turning out to be much
more time-consuming than I thought it would be. I am trying to gather
a database of information about all the motherboards sold by major
manufacturers. The idea is to eventually create a comparison shopper
to help you narrow down models that fit your needs.

Oddly, motherboard manufacturers don't use a database to generate
their specification pages. These are all hand-compiled, with a theme and
a dozen variations on every field. This I can handle.

However, Asus decided to obfuscate their web pages with JavaScript.
There are no data on them.

I wondered if there exists a tool that is like a browser in that it will
read a page and render the JavaScript, but unlike a browser, it would
not show the information on the screen, just dump the generated HTML
or raw text and accept a script of pages to analyse.

--
Roedy Green Canadian Mind Products
http://mindprod.com
There are only two industries that refer to their customers as "users".
~ Edward Tufte

 
Michal Kleczek
      03-30-2011
Roedy Green wrote:

> I am working on a screenscraping project that is turning out to be much
> more time-consuming than I thought it would be. I am trying to gather
> a database of information about all the motherboards sold by major
> manufacturers. The idea is to eventually create a comparison shopper
> to help you narrow down models that fit your needs.
>
> Oddly, motherboard manufacturers don't use a database to generate
> their specification pages. These are all hand-compiled, with a theme and
> a dozen variations on every field. This I can handle.
>
> However, Asus decided to obfuscate their web pages with JavaScript.
> There are no data on them.
>
> I wondered if there exists a tool that is like a browser in that it will
> read a page and render the JavaScript, but unlike a browser, it would
> not show the information on the screen, just dump the generated HTML
> or raw text and accept a script of pages to analyse.
>


http://htmlunit.sourceforge.net/

--
Michal
 
Tom Anderson
      03-30-2011
On Wed, 30 Mar 2011, Michal Kleczek wrote:

> Roedy Green wrote:
>
>> I am working on a screenscraping project that is turning out to be much
>> more time-consuming than I thought it would be. I am trying to gather
>> a database of information about all the motherboards sold by major
>> manufacturers. The idea is to eventually create a comparison shopper
>> to help you narrow down models that fit your needs.
>>
>> Oddly, motherboard manufacturers don't use a database to generate
>> their specification pages. These are all hand-compiled, with a theme and
>> a dozen variations on every field. This I can handle.
>>
>> However, Asus decided to obfuscate their web pages with JavaScript.
>> There are no data on them.
>>
>> I wondered if there exists a tool that is like a browser in that it will
>> read a page and render the JavaScript, but unlike a browser, it would
>> not show the information on the screen, just dump the generated HTML
>> or raw text and accept a script of pages to analyse.

>
> http://htmlunit.sourceforge.net/


Finally, someone else who knows about it!

tom

--
For the first few years I ate lunch with the mathematicians. I soon found
that they were more interested in fun and games than in serious work,
so I shifted to eating with the physics table. There I stayed for a
number of years until the Nobel Prize, promotions, and offers from
other companies, removed most of the interesting people. So I shifted
to the corresponding chemistry table where I had a friend. At first I
asked what were the important problems in chemistry, then what important
problems they were working on, or problems that might lead to important
results. One day I asked, "if what they were working on was not important,
and was not likely to lead to important things, then why were they working
on them?" After that I had to eat with the engineers! -- R. W. Hamming
 
 
Roedy Green
      03-31-2011
On Wed, 30 Mar 2011 07:40:32 -0700, Peter Duniho wrote, quoted or
indirectly quoted someone who said:

>Already done. For example:
>http://www.newegg.com/Store/SubCateg...?SubCategory=2


What I am doing is similar.

I want to track price information from multiple sources, and track all
MBs I can find, not just ones sold by one particular vendor. It is a
comparison shopper, though it could be used by a retailer. I have
different categories, leaning more toward those you would use to
eliminate some motherboards from consideration, rather than categorise
branding info. I wrote to MB companies asking for computer-friendly
sources of info. They have not been forthcoming.
Perhaps they will if the thing catches on.

It is just amazing how many goofy things vendors do that
interfere with scraping.

Here is my current database schema:

/** presume database mother pre-existing, create with dbcreate if
necessary */

DROP TABLE IF EXISTS mboards;
DROP TABLE IF EXISTS sellers;
DROP TABLE IF EXISTS prices;

CREATE TABLE mboards (
    /* no cache, no slots */
    manufacturer       numeric(2) NOT NULL, /* enum */
    model              varchar(30) NOT NULL,
    manufacturerPartNo varchar(30),
    revision           varchar(8),
    formFactor         numeric(2), /* enum */
    widthInCm          numeric(3, 1),
    heightInCm         numeric(3, 1),
    socket             numeric(2), /* enum */
    video              varchar(40),
    memoryType         numeric(2), /* enum */
    maxGig             numeric(3),
    ramSpeedMhz        numeric(4),
    usb2               numeric(2),
    usb2Internal       numeric(2),
    usb2Rear           numeric(2),
    usb3               numeric(2),
    usb3Internal       numeric(2),
    usb3Rear           numeric(2),
    sata2              numeric(2),
    sata3              numeric(2),
    watts              numeric(4),
    theatreSound       boolean,
    lastUpdated        numeric(7)
);

CREATE TABLE prices (
    seller       numeric(2) NOT NULL, /* enum */
    manufacturer numeric(2) NOT NULL,
    model        varchar(30) NOT NULL,
    sellerPartNo varchar(50),
    currency     char(3),
    price        numeric(6, 2),
    lastUpdated  numeric(7)
);
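
The columns marked /* enum */ hold small numeric codes. As a rough
sketch of how such a code might map to a Java enum on the scraper side:
the class name MboardCodes, the FormFactor constants, and the code
numbers below are all assumptions for illustration, not part of the
schema.

```java
import java.util.HashMap;
import java.util.Map;

final class MboardCodes {

    /** Hypothetical constants and codes for the formFactor numeric(2) column. */
    enum FormFactor {
        ATX(1),
        MICRO_ATX(2),
        MINI_ITX(3);

        // Reverse index from stored code to constant, built once at class load.
        private static final Map<Integer, FormFactor> BY_CODE = new HashMap<>();

        static {
            for (FormFactor f : values()) {
                BY_CODE.put(f.code, f);
            }
        }

        private final int code;

        FormFactor(int code) {
            this.code = code;
        }

        /** The value to store in the numeric column. */
        public int code() {
            return code;
        }

        /** Maps a stored code back to its constant; rejects unknown codes. */
        public static FormFactor fromCode(int code) {
            FormFactor f = BY_CODE.get(code);
            if (f == null) {
                throw new IllegalArgumentException("Unknown formFactor code: " + code);
            }
            return f;
        }
    }

    public static void main(String[] args) {
        FormFactor stored = FormFactor.fromCode(2);
        System.out.println(stored + " is stored as " + stored.code());
    }
}
```

Keeping an explicit code per constant, rather than relying on
ordinal(), means the stored numbers would not shift if constants were
later reordered.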


--
Roedy Green Canadian Mind Products
http://mindprod.com
There are only two industries that refer to their customers as "users".
~ Edward Tufte

 
 
Dr J R Stockton
      04-01-2011
In comp.lang.java.programmer message <rvc6p6toumdlevjb48ohjnlf1gur128eqe
@4ax.com>, Wed, 30 Mar 2011 06:51:29, Roedy Green <see_website@mindprod.
com.invalid> posted:

>I wondered if there exists a tool that is like a browser in that it will
>read a page and render the JavaScript, but unlike a browser, it would
>not show the information on the screen, just dump the generated HTML
>or raw text and accept a script of pages to analyse.


A JavaScript newsgroup might know.

But JavaScript used as you describe does not necessarily generate HTML,
but can manipulate the DOM tree directly.

Or are you thinking of server-side scripting with .php?
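
On the DOM point, here is a rough analogy written in Java rather than
JavaScript, using the JDK's org.w3c.dom classes. It builds a small tree
node by node, so no markup text exists until the finished tree is
serialized. The class name DomDump and the "ATX" cell value are made up
for illustration. A scraping tool would similarly have to dump the tree
after the page's scripts have run, rather than read the markup that was
downloaded.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomDump {

    /** Builds a tiny tree node by node, then serializes it to markup text. */
    static String buildAndSerialize() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // Node-by-node construction: no markup string exists at this point.
            Element table = doc.createElement("table");
            Element row = doc.createElement("tr");
            Element cell = doc.createElement("td");
            cell.setTextContent("ATX"); // hypothetical spec value
            row.appendChild(cell);
            table.appendChild(row);
            doc.appendChild(table);

            // Markup only appears when the finished tree is serialized.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndSerialize());
    }
}
```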

--
(c) John Stockton, nr London UK. ?@merlyn.demon.co.uk IE8 FF3 Op10 Sf5 Cr7
news:comp.lang.javascript FAQ <http://www.jibbering.com/faq/index.html>.
<http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
<http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.
 
 
Roedy Green
      04-02-2011
On Fri, 1 Apr 2011 23:39:32 +0100, Dr J R Stockton wrote, quoted or
indirectly quoted someone who said:

>But JavaScript used as you describe does not necessarily generate HTML,
>but can manipulate the DOM tree directly.
>
>Or are you thinking of server-side scripting with .php?


I am just trying to go to motherboard manufacturer websites and
collect specs from the webpages. The webpages often contain a lot of
JavaScript. The data does not appear in any form. Presumably the
JavaScript loads more JavaScript or resources, then formats it.
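
One rough way to check that presumption is to list the external scripts
the raw page refers to, which hints at where the data might be fetched
from. This is only a sketch using a regular expression, not a real HTML
parser. The class name ScriptSourceLister and the sample markup and
URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptSourceLister {

    // Matches <script ... src="..."> or src='...'. A heuristic only:
    // it can be fooled by attributes such as data-src or by odd quoting.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the src values of script tags, in document order. */
    static List<String> scriptSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Hypothetical page fragment: two external scripts and one inline script.
        String html = "<html><head>"
                + "<script type=\"text/javascript\" src=\"/js/specs.js\"></script>"
                + "<script>var inline = 1;</script>"
                + "<script src='http://example.com/loader.js'></script>"
                + "</head></html>";
        System.out.println(scriptSources(html));
        // prints [/js/specs.js, http://example.com/loader.js]
    }
}
```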
--
Roedy Green Canadian Mind Products
http://mindprod.com
Doing what the user expects with respect to navigation is absurdly important for user satisfaction.
~ anonymous Google Android developer

 
 
Dr J R Stockton
      04-03-2011
In comp.lang.java.programmer message <t64dp61er3n5cbkpuippmpji0dlaijbsnm
@4ax.com>, Fri, 1 Apr 2011 20:00:27, Roedy Green posted:

>On Fri, 1 Apr 2011 23:39:32 +0100, Dr J R Stockton wrote, quoted or
>indirectly quoted someone who said:
>
>>But JavaScript used as you describe does not necessarily generate HTML,
>>but can manipulate the DOM tree directly.
>>
>>Or are you thinking of server-side scripting with .php?

>
>I am just trying to go to motherboard manufacturer websites and
>collect specs from the webpages. The webpages often contain a lot of
>JavaScript. The data does not appear in any form. Presumably the
>JavaScript loads more JavaScript or resources, then formats it.


Probably but not entirely presumably; if using an iframe, there could be
no need for reformatting.

Given a URL or two as examples, and a clear indication of what is to be
scraped, one might be able to understand the situation better.

--
(c) John Stockton, nr London, UK. ?@merlyn.demon.co.uk Turnpike v6.05.
Website <http://www.merlyn.demon.co.uk/> - w. FAQish topics, links, acronyms
PAS EXE etc. : <http://www.merlyn.demon.co.uk/programs/> - see in 00index.htm
Dates - miscdate.htm estrdate.htm js-dates.htm pas-time.htm critdate.htm etc.
 
 
RedGrittyBrick
      04-05-2011
On 30/03/2011 14:51, Roedy Green wrote:
> I wondered if there exists a tool that is like a browser in that it will
> read a page and render the JavaScript, but unlike a browser, it would
> not show the information on the screen, just dump the generated HTML
> or raw text and accept a script of pages to analyse.
>


http://links.twibright.com/features.php:

"Links runs in text mode (mouse optional) on UN*X console, ssh/telnet
virtual terminal, vt100 terminal, xterm, and virtually any other text
terminal. "

Links2 supports JavaScript.

I haven't used it, but it seems to have command-line options. Maybe, as
with Lynx, some of them allow you to save the HTML to a file?

Open Source, so if the GPL is usable for your project, you can probably
repurpose it.

--
RGB
 