Spidering the web to find RDF

 
 
Mark Watson
 
      10-02-2003
Last year, I ran an experiment: I let a very polite
web spider run for a few days, trying to find RDF markup
embedded in web pages. I found close to zero RDF - not
encouraging!

In a recent post, I complained about not being able to
embed RDF in XHTML (at least there is no standard way to do it
and still pass the W3C XHTML validator). Another poster
(Jeen Broekstra) provided a good example of simply
linking to an RDF file at the same site.

I was concerned about spiders being able to find
links to RDF because there is no standard for this.
Then, a few minutes ago, I had one of those "Duh!" experiences:

A spider looking for RDF can look for embedded RDF
in HTML and also examine every link that is on the
same site and see if the file extension (if there is one)
ends in ".rdf". If such a link is found, assume that
it describes the page linking to it.
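
A minimal sketch of that heuristic, assuming Python and only its standard library. The names HrefCollector and same_site_rdf_links are made up for illustration, and "same site" is taken to mean "same host", which is an assumption rather than anything the post specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> and <link> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_rdf_links(page_url, html):
    """Return absolute same-host URLs whose path ends in '.rdf'."""
    collector = HrefCollector()
    collector.feed(html)
    page_host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        # Compare only the path, so a query string does not hide the extension.
        if parts.netloc == page_host and parts.path.lower().endswith(".rdf"):
            found.append(absolute)
    return found
```

Each URL returned would then be treated as a candidate description of page_url; whether it really is one can only be decided after fetching and parsing it.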

Anyway, I will try my experiment again (when I have
time to set it up) and report the results. I hope that
lots of people link to separate RDF files on their sites
and my results will be better than last year when I
only looked for embedded RDF.

-Mark
 
 
 
 
 
Nick Kew
 
      10-03-2003
In article <(E-Mail Removed) >, one of infinite monkeys
at the keyboard of (E-Mail Removed) (Mark Watson) wrote:

> A spider looking for RDF can look for embedded RDF
> in HTML and also examine every link that is on the
> same site and see if the file extension (if there is one)
> ends in ".rdf".


Ahem ... the last few characters of a URL have absolutely no significance
except by convention. A spider that did that would be broken.

It could, however, look for links with the type="application/rdf+xml"
attribute. It would find a couple in my pages, for instance.
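
A rough sketch of a spider-side check for that attribute, assuming Python's standard html.parser module. TypedLinkFinder and find_rdf_typed_links are illustrative names, and accepting the attribute on <a> as well as <link> is an assumption.

```python
from html.parser import HTMLParser

RDF_MIME = "application/rdf+xml"


class TypedLinkFinder(HTMLParser):
    """Collect hrefs of <link> and <a> elements whose type attribute is RDF/XML."""

    def __init__(self):
        super().__init__()
        self.rdf_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("link", "a"):
            return
        attributes = dict(attrs)
        # A valueless attribute is reported as None, hence the "or".
        declared = (attributes.get("type") or "").strip().lower()
        href = attributes.get("href")
        if declared == RDF_MIME and href:
            self.rdf_hrefs.append(href)


def find_rdf_typed_links(html):
    """Return the href of every link that declares type="application/rdf+xml"."""
    finder = TypedLinkFinder()
    finder.feed(html)
    return finder.rdf_hrefs
```

Note that the check looks only at what the linking page declares; it says nothing about the target URL's spelling or about the Content-Type the server eventually sends.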

> If such a link is found, assume that
> it describes the page linking to it.


Wouldn't it be better to believe the RDF concerning its own subject?

> only looked for embedded RDF.


I played with embedding RDF (for automatically-generated reports),
but abandoned the idea as a nonstarter.

--
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
 
 
 
 
 
Jeen Broekstra
 
      10-03-2003
Nick Kew wrote:

> In article <(E-Mail Removed) >,
> one of infinite monkeys at the keyboard of
> (E-Mail Removed) (Mark Watson) wrote:
>
> > A spider looking for RDF can look for embedded RDF
> > in HTML and also examine every link that is on the
> > same site and see if the file extension (if there is one)
> > ends in ".rdf".

>
> Ahem ... the last few characters of a URL have absolutely no
> significance except by convention. A spider that did that
> would be broken.
>
> It could, however, look for links with the
> type="application/rdf+xml" attribute. It would find a couple
> in my pages, for instance.


That would, however, only work if the web server from which the
file is hosted is aware of this mime type. I don't know if Apache
comes preconfigured with it these days but I'll bet that older
versions won't spot it (for example, my rdf file would not be
found since the department web server serves it as text/plain).

You're right that this is the correct way of processing it, but
for now, being slightly more opportunistic and looking for
extensions (as well as trying to parse text/xml files) would
probably give much better results.
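
The "try to parse it" half of that could look something like the sketch below, assuming Python's xml.etree.ElementTree. The function name looks_like_rdf_xml is hypothetical, and treating any document that contains an rdf:RDF element as RDF is a deliberately loose assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def looks_like_rdf_xml(body):
    """Return True if body parses as XML and has an rdf:RDF element (root or nested)."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML (for example plain text or tag-soup HTML).
        return False
    rdf_tag = "{%s}RDF" % RDF_NS
    return root.tag == rdf_tag or root.find(".//" + rdf_tag) is not None
```

This ignores whatever MIME type the server claimed, so a file served as text/plain would still be recognised. For documents fetched from arbitrary sites, a parser hardened against malicious XML would be a safer choice than the plain standard-library one.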

Jeen
--
Jeen Broekstra http://www.cs.vu.nl/~jbroeks/

New York is real. The rest is done with mirrors.
 
 
Nick Kew
 
      10-03-2003
In article <(E-Mail Removed)>, one of infinite monkeys
at the keyboard of Jeen Broekstra <(E-Mail Removed)> wrote:

>> It could, however, look for links with the
>> type="application/rdf+xml" attribute. It would find a couple
>> in my pages, for instance.

>
> That would, however, only work if the web server from which the
> file is hosted is aware of this mime type.



Nope. I said attribute.
<link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">

> I don't know if Apache
> comes preconfigured with it these days but I'll bet that older


Neither do I; in any case it wouldn't do anything for the above example,
which I deliberately (and perfectly legitimately) ended with .html.
The server should of course serve it with the correct MIME type,
but that's another issue.

> You're right that this is the correct way of processing it, but
> for now, being slightly more opportunistic and looking for
> extensions (as well as trying to parse text/xml files) would
> probably give much better results.


Even if .rdf gets something, it'll miss out on lots of .cgi, .php,
.xml and other things. It's simply broken.

Relying on the attribute will also miss out on many instances.
It is simply a more correct thing than ".rdf" to look for
in (X)HTML links.

--
Nick Kew

In urgent need of paying work - see http://www.webthing.com/~nick/cv.html
 
 
Mark Watson
 
      10-03-2003
Jeen Broekstra <(E-Mail Removed)> wrote in message news:<(E-Mail Removed)>...
> You're right that this is the correct way of processing it, but
> for now, being slightly more opportunistic and looking for
> extensions (as well as trying to parse text/xml files) would
> probably give much better results.


It sounds like what I need to do is to roll all the ideas for spidering
RDF together and be as opportunistic as possible in collecting RDF.

So, I will use both Nick's and Jeen's ideas.

Thanks,
Mark
 
 
Jeen Broekstra
 
      10-03-2003
Nick Kew wrote:
> In article <(E-Mail Removed)>, one of
> infinite monkeys at the keyboard of Jeen Broekstra
> <(E-Mail Removed)> wrote:
>
> >> It could, however, look for links with the
> >> type="application/rdf+xml" attribute. It would find a
> >> couple in my pages, for instance.

> >
> > That would, however, only work if the web server from which the
> > file is hosted is aware of this mime type.

>
>
> Nope. I said attribute.
> <link rel="metadata" type="application/rdf+xml" href="metadata-for-page.html">
>


Blimey. My bad, I completely misread your post.

Jeen
--
Jeen Broekstra http://www.cs.vu.nl/~jbroeks/

Write a wise saying and your name will live forever.
-- Anonymous
 
 
Nick Kew
 
      10-03-2003
In article <(E-Mail Removed) >, one of infinite monkeys
at the keyboard of (E-Mail Removed) (Mark Watson) wrote:

> It sounds like what I need to do is to roll all the ideas for spidering
> RDF together and be as opportunistic as possible in collecting RDF.


My previous post was just a correction to something you said, which I
felt called for correction because it so often leads to confusion.

My *practical* suggestion would be to send HEAD requests from the spider
to ascertain the type of any URL before actually fetching it. Then fetch
HTML and XHTML pages to spider for more links, and RDF pages for your
collection.
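
As a rough illustration of that HEAD-first approach, here is a sketch assuming Python's urllib.request. The helper names and the exact sets of media types are assumptions for the example, not a description of any existing spider.

```python
import urllib.request

HTML_TYPES = {"text/html", "application/xhtml+xml"}
RDF_TYPES = {"application/rdf+xml"}


def classify_content_type(header_value):
    """Map a Content-Type header value to 'html', 'rdf', or 'other'."""
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = (header_value or "").split(";", 1)[0].strip().lower()
    if media_type in HTML_TYPES:
        return "html"
    if media_type in RDF_TYPES:
        return "rdf"
    return "other"


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

A spider would call head_content_type(url), pass the result to classify_content_type, then fetch "html" URLs to look for more links and "rdf" URLs for the collection, skipping everything else.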

I happen to have spidering software that'll do all that, among other
things. Though I have the feeling you may not have the budget for it,
given the experimental nature of your task.

--
Nick Kew
 