Efficient way to rip html

 
 
Arthur Rhodes
      10-03-2006
I'm building a web store and I have to create a large number of
product descriptions. The distributors do not provide spec sheets
or marketing materials to me in html format. Instead, they advise
me to simply copy the descriptions from their web sites.

The problem is that the descriptions I need to copy are embedded
in complex pages, with nested tables, etc. Simply copying the
page source doesn't seem to be that useful. I end up having to
cut out lots of table code, etc., and usually make mistakes that
are time consuming to figure out and fix.

The other alternative is to copy the text and then recreate the html
formatting from scratch.

Is there an easier way?

Right now, I'm just writing HTML by hand in a text editor. Would
this be any easier if I used a web editor like Dreamweaver?


 
Ben C
      10-03-2006
On 2006-10-03, Arthur Rhodes <(E-Mail Removed)> wrote:
> I'm building a web store and I have to create a large number of
> product descriptions. The distributors do not provide spec sheets
> or marketing materials to me in html format. Instead, they advise
> me to simply copy the descriptions from their web sites.
>
> The problem is that the descriptions I need to copy are embedded
> in complex pages, with nested tables, etc. Simply copying the
> page source doesn't seem to be that useful. I end up having to
> cut out lots of table code, etc., and usually make mistakes that
> are time consuming to figure out and fix.
>
> The other alternative is to copy the text and then recreate the html
> formatting from scratch.
>
> Is there an easier way?


Python, and Beautiful Soup.

http://www.crummy.com/software/BeautifulSoup/
 
Nikita the Spider
      10-03-2006
In article <(E-Mail Removed)>,
Ben C <(E-Mail Removed)> wrote:

> On 2006-10-03, Arthur Rhodes <(E-Mail Removed)> wrote:
> > I'm building a web store and I have to create a large number of
> > product descriptions. The distributors do not provide spec sheets
> > or marketing materials to me in html format. Instead, they advise
> > me to simply copy the descriptions from their web sites.
> >
> > The problem is that the descriptions I need to copy are embedded
> > in complex pages, with nested tables, etc. Simply copying the
> > page source doesn't seem to be that useful. I end up having to
> > cut out lots of table code, etc., and usually make mistakes that
> > are time consuming to figure out and fix.
> >
> > The other alternative is to copy the text and then recreate the html
> > formatting from scratch.
> >
> > Is there an easier way?

>
> Python, and Beautiful Soup.
>
> http://www.crummy.com/software/BeautifulSoup/


Seconded. If you're willing to go the Python programming route, Connelly
Barnes' htmldata might also prove helpful:
http://oregonstate.edu/~barnesc/htmldata/

Last but not least you could use command-line Spyce (HTML templates with
the dynamic bits written in Python) to build your Web pages:
http://spyce.sourceforge.net/

Good luck

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
 
dorayme
      10-03-2006
In article <(E-Mail Removed)>,
Arthur Rhodes <(E-Mail Removed)> wrote:

> I'm building a web store and I have to create a large number of
> product descriptions. The distributors do not provide spec sheets
> or marketing materials to me in html format. Instead, they advise
> me to simply copy the descriptions from their web sites.
>
> The problem is that the descriptions I need to copy are embedded
> in complex pages, with nested tables, etc. Simply copying the
> page source doesn't seem to be that useful. I end up having to
> cut out lots of table code, etc., and usually make mistakes that
> are time consuming to figure out and fix.
>
> The other alternative is to copy the text and then recreate the html
> formatting from scratch.
>
> Is there an easier way?
>
> Right now, I'm just writing HTML by hand in a text editor. Would
> this be any easier if I used a web editor like Dreamweaver?


It depends on how well you know Dreamweaver (or any other
software). I have a friend who would go this way and well. I
would grab the product descriptions and work hard and use a text
editor because it would take me less time. You are in the middle
of a job. Can you risk finding out? If you know what you are
doing with the text grabs, just do it and get it done and charge
the client. As you get going, you will find it going quicker and
quicker because you will be building patterns in your hand work.
Products are products, and if they are all in tables to show off
proper tabular specs, you will simply copy and paste a few table
types you have constructed, most data will fit in one or other of
them with little mods.

--
dorayme
 
ato_zee@hotmail.com
      10-03-2006

On 3-Oct-2006, dorayme <(E-Mail Removed)> wrote:

> Instead, they advise
> me to simply copy the descriptions from their web sites.


With images on websites you can usually right click, then
copy, and then paste the gif or jpg into a folder (or, for
some applications, directly into the application of your choice).

Likewise with text, with a bit of practice you can right click
then wipe to highlight, release, right click again on
highlighted text, copy, then paste (selecting paste option
Unformatted Text) or paste into notepad which reduces
everything to unformatted text.
Paste options depend on application, sometimes you have
to start from Edit menu to find the Unformatted Text option.
With Dreamweaver I think that there is an unformatted text
option to paste long runs of text in the code window.
But then I mostly edit/build in Wordpad because it opens
and saves html without asking what format you want to
save in.
DW8 can produce non-validating code without warning you, though
it has some merit in the early stages of design. With Wordpad
I can save and immediately see the effect by refreshing
the browser.
Sometimes you can select tables or highlight cells, copy, and
paste into Excel, which gives you further options for
manipulating/parsing the data.
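
If the copy-and-paste into Excel gets tedious, the same table-to-spreadsheet step can be scripted. What follows is only a rough sketch using Python's standard library: the sample table is made up, the class and function names are hypothetical, and it assumes a simple table with no tables nested inside its cells.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row.

    Assumption: the table is flat (no tables nested inside cells).
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html):
    """Return the table's cell text as CSV, one line per <tr>."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up spec table for illustration.
sample = """
<table>
  <tr><th>Spec</th><th>Value</th></tr>
  <tr><td>Weight</td><td>2 kg</td></tr>
  <tr><td>Colour</td><td>Red, <b>matte</b></td></tr>
</table>
"""
print(table_to_csv(sample))
```

The CSV text can be saved to a file and opened in Excel or any other spreadsheet. Inline markup inside a cell (the `<b>` above) is dropped and only the text is kept.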
 
mbstevens
      10-03-2006
On Tue, 03 Oct 2006 11:22:24 -0600, Arthur Rhodes wrote:

> The problem is that the descriptions I need to copy are embedded in
> complex pages, with nested tables, etc. Simply copying the page source
> doesn't seem to be that useful. I end up having to cut out lots of table
> code, etc., and usually make mistakes that are time consuming to figure
> out and fix.



Perl's HTML::Parser module will divide an HTML document into its various
parts (including text) with just a few lines of code. In the more
structured Python world, sgmllib, htmllib, or HTMLParser are the modules
to look into.
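
As a rough illustration of the Python side: in Python 3, sgmllib and htmllib are gone and HTMLParser lives on as the standard-library module `html.parser`. The sketch below assumes that modern module. The class name, function name, and sample markup are made up for the example.

```python
from html.parser import HTMLParser


class PartsCollector(HTMLParser):
    """Record each start tag, end tag, and text run in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(("start", tag))

    def handle_endtag(self, tag):
        self.parts.append(("end", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.parts.append(("text", text))


def split_parts(html):
    """Return a list of (kind, value) pairs for the given markup."""
    collector = PartsCollector()
    collector.feed(html)
    collector.close()
    return collector.parts


# Made-up fragment of the kind buried in a nested-table page.
sample = "<table><tr><td><p>Weight: <b>2 kg</b></p></td></tr></table>"
for kind, value in split_parts(sample):
    print(kind, value)
```

Keeping only the `"text"` entries gives the bare description. Keeping selected tags as well (say `p` and `b`) gives a description with the layout markup stripped but the basic formatting intact.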
--
mbstevens
http://www.mbstevens.com/




 
Arthur Rhodes
      10-04-2006
On Tue, 03 Oct 2006 13:25:02 -0500, Ben C wrote:

>> Is there an easier way?

>
> Python, and Beautiful Soup.
>
> http://www.crummy.com/software/BeautifulSoup/


Looks good. You don't know of any ready made gui for it,
do you? I'm thinking it would be nice to have a tree
pane representing the structure of the document, and when
you click on a node a text pane shows the corresponding part
of the document.
 
Andy Dingley
      10-04-2006

dorayme wrote:

> It depends on how well you know Dreamweaver (or any other
> software). I have a friend who would go this way and well. I
> would grab the product descriptions and work hard and use a text
> editor


Twice a day, for two thousand products ?

 
Ben C
      10-04-2006
On 2006-10-04, Arthur Rhodes <(E-Mail Removed)> wrote:
> On Tue, 03 Oct 2006 13:25:02 -0500, Ben C wrote:
>
>>> Is there an easier way?

>>
>> Python, and Beautiful Soup.
>>
>> http://www.crummy.com/software/BeautifulSoup/

>
> Looks good. You don't know of any ready made gui for it,
> do you? I'm thinking it would be nice to have a tree
> pane representing the structure of the document, and when
> you click on a node a text pane shows the corresponding part
> of the document.


I don't know of one, but it wouldn't be hard to do. Someone may have
done one.

But Firefox can do exactly what you're describing, if you install the
"DOM Inspector" extension. You can click on something in the tree
representation in the DOM Inspector window and it flashes red on the
page, or you can point to part of the page, click, and the corresponding
part of the tree representation gets highlighted.

Having found your way around the document with this DOM Inspector, you
can then write the python/BeautifulSoup script to pull out the bits
you're interested in.
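
A minimal sketch of such a script follows. It assumes the current `bs4` package (installed with `pip install beautifulsoup4`), not the 2006 release of Beautiful Soup. The sample page and the `id="description"` cell are invented for the example; on a real distributor page you would substitute whatever id, class, or tag path the DOM Inspector shows you.

```python
from bs4 import BeautifulSoup

# Hypothetical distributor page: the description sits in a nested table cell.
PAGE = """
<html><body>
<table><tr><td>
  <table class="layout"><tr>
    <td class="nav">Home | Products</td>
    <td id="description"><h2>Widget 3000</h2><p>A <b>sturdy</b> widget.</p></td>
  </tr></table>
</td></tr></table>
</body></html>
"""


def extract_description(html, element_id="description"):
    """Return the inner HTML of the element with the given id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find(id=element_id)
    if cell is None:
        return None
    # decode_contents() serialises the children without the enclosing tag.
    return cell.decode_contents().strip()


print(extract_description(PAGE))
```

This keeps the description's own markup (headings, paragraphs, bold) and leaves the surrounding layout tables behind. Looping the function over a list of saved pages would cover the whole catalogue.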
 
dorayme
      10-04-2006
In article
<(E-Mail Removed)>,
"Andy Dingley" <(E-Mail Removed)> wrote:

> dorayme wrote:
>
> > It depends on how well you know Dreamweaver (or any other
> > software). I have a friend who would go this way and well. I
> > would grab the product descriptions and work hard and use a text
> > editor

>
> Twice a day, for two thousand products ?


No, well, if it were on this scale, I would fire up Dreamweaver
or even the 98 version of Word and export to HTML and see how it
renders a table of product specs. I would then see what I could
do to clean up crap via Search and Replace, using extra GREP if
need be, and shape it all how I wanted. But my point was this: be
sure the scale of the job is big enough to embark on anything
more than simple hard work with a text editor, entering, cutting
and pasting where possible etc.
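
For the search-and-replace cleanup, something along these lines might do as a starting point. It is only a sketch: the list of wrapper tags and attributes to strip is an assumption about what a Word or Dreamweaver export tends to contain, the names are hypothetical, and regular expressions like these can also hit look-alike text in the page body, so check the output by eye.

```python
import re

# Assumed clutter: <font>/<span> wrappers and presentational attributes.
WRAPPER_TAGS = re.compile(r"</?(?:font|span)\b[^>]*>", re.IGNORECASE)

ATTR_NAMES = (
    "style|class|lang|width|height|valign|align|bgcolor|border|"
    "cellpadding|cellspacing"
)
# Matches name=value where the value is double-quoted, single-quoted, or bare.
PRESENTATIONAL_ATTRS = re.compile(
    r"""\s+(?:%s)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""" % ATTR_NAMES,
    re.IGNORECASE,
)


def clean_export(html):
    """Strip wrapper tags and presentational attributes from exported HTML."""
    html = WRAPPER_TAGS.sub("", html)
    html = PRESENTATIONAL_ATTRS.sub("", html)
    return html


# Made-up cell in the style of a word-processor export.
sample = """<td width="120" class=MsoNormal style='color:red'><font face="Arial"><span lang=EN-AU>2 kg</span></font></td>"""
print(clean_export(sample))
```

Run over a whole exported file, this leaves the table structure and text in place for shaping by hand.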

You get these figures from?

Truth is this, I have found many earthlings think hard rote work
beneath their human dignity. I happen to think humans have no
real dignity, it is all a pretence and they should get a better
perspective of their place in evolution. They are machines and
should stop trying to distance themselves from lower and more
mechanical forms.


[btw. Alan Flavell has a philosophy behind the idea of hard rote
work, that it offends against human dignity... It is a point of
view. I am not saying it is unintelligent. But imo, much evil has
come from ideas like this. I don't suppose anyone wants to know
more? ]

--
dorayme
 