crawling the net...

 
 
ask josephsen
      04-29-2004
Hi NG

I'm making a program to crawl the internet. It works by retrieving all links
in a page, downloading the page of each link and again retrieving all the
links. (If there are better ways, I'd like to hear them.)

My problem is relative links (like "../../wohoo.asp"). What is the smartest
way to get the full url (http://www.xyz.com/wohoo.asp)? Do I have to parse
the relative link in relation to the url where the relative link was found
and then concatenate it? Does anyone know how other search engines/crawlers
walk the net?
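
The crawl loop described in this post can be sketched roughly as below. This is a minimal Python sketch, not code from the thread: `LinkExtractor`, `crawl`, the `fetch` callback, and `max_pages` are all names invented for illustration, and the download step is left to the caller. Relative links are resolved against the page they were found on with the standard library's `urljoin`, and a `seen` set keeps the crawler from queueing the same page twice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` must return HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve the link against the page it was found on and drop
            # any "#fragment" so the same page is not queued twice.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing `fetch` in as a parameter is only a convenience for trying the loop against a dictionary of canned pages; in a real crawler it would wrap an HTTP client.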


Thanks

../ask


 
JKop
      04-29-2004
ask josephsen posted:

> Hi NG
>
> I'm making a program to crawl the internet. It works by retrieving all
> links in a page, downloading the page of each link and again retrieving
> all the links. (If there is better ways I'd like to hear)
>
> My problem is relative links (like "../../wohoo.asp"). What is the
> smartest way to get the full url (http://www.xyz.com/wohoo.asp)? Do I
> have to parse the relative link in relation to the url where the
> relative link was found and then concatenate it? Does anyone know how
> other search-engines/ crawlers walk the net?
>
>
> Thanks
>
> ./ask


You should have posted this on:

alt.sports.gymnastics


It would've been more on-topic _there_.

-JKop
 
Morten Wennevik
      04-29-2004
Hi Ask,

You could try using the features of Path.GetFullPath which collapses /../
and /./ and returns the proper path. However, it insists on adding the
application path so you will need to do something like

string newUrl =
    Path.GetFullPath(url).Substring(Application.StartupPath.Length + 1);

It will switch the / to \ though. Oh, and remove the http:// from the url
first.
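
For comparison, most URL libraries ship a resolver for exactly this case, so the file-path trick is not strictly necessary. A minimal sketch in Python (not the C# used in this thread), using the standard library's `urllib.parse.urljoin`; the base URL `http://www.xyz.com/a/b/page.asp` is a made-up example of the page on which the link was found:

```python
from urllib.parse import urljoin

# Hypothetical page on which the relative link was found.
base_url = "http://www.xyz.com/a/b/page.asp"

# urljoin resolves the link against the base, collapsing "../" and "./".
full_url = urljoin(base_url, "../../wohoo.asp")
same_dir = urljoin(base_url, "./other.asp")

# A link that is already absolute is returned unchanged.
absolute = urljoin(base_url, "http://example.org/x.html")

print(full_url)   # http://www.xyz.com/wohoo.asp
print(same_dir)   # http://www.xyz.com/a/b/other.asp
print(absolute)   # http://example.org/x.html
```

In .NET, the `System.Uri(Uri, string)` constructor plays the same role, and it keeps the scheme and the forward slashes intact.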

There are plenty of web crawlers; just do a web search for "web crawler" and
"web bot".


Happy coding!
Morten Wennevik [C# MVP]
 
mortb
      04-29-2004
I'm not developing webcrawlers, but a quick thought of mine is

string link = "../../wohoo.asp";
string thisPageURL = "http://www.xyz.com/wohoo.asp";
// The verbatim string keeps the backslashes, so the regex sees \x2E (a literal dot).
string[] linkParts = System.Text.RegularExpressions.Regex.Split(link,
    @"\x2E\x2E/"); // split on ../
string[] URLParts = System.Text.RegularExpressions.Regex.Split(thisPageURL,
    "/");

linkParts.Length - 1 now gives the number of "../" steps up the directory
tree, and the last element of linkParts is the wanted page. The URL of the
new page is then built by concatenating the elements of URLParts, excluding
its last linkParts.Length elements, and appending the last element of
linkParts.
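
The split-and-count idea above could look roughly like this. It is a hedged Python sketch rather than the C# from the post, and `resolve_by_splitting` is a name invented here. It assumes the link consists only of leading "../" segments followed by a page name, and that the page URL carries no query string; a link such as "a/../b.asp" would need a proper resolver.

```python
def resolve_by_splitting(page_url: str, link: str) -> str:
    """Resolve a link made of leading "../" segments by counting them."""
    link_parts = link.split("../")   # "../../wohoo.asp" -> ["", "", "wohoo.asp"]
    ups = len(link_parts) - 1        # number of directory levels to climb
    target = link_parts[-1]          # the wanted page

    scheme, rest = page_url.split("://", 1)
    url_parts = rest.split("/")      # ["www.xyz.com", "a", "b", "page.asp"]
    host, path_parts = url_parts[0], url_parts[1:]

    # Drop the page's own file name, then one directory per "../".
    # Climbing past the site root is clamped to the root.
    dirs = path_parts[:-1]
    dirs = dirs[:max(len(dirs) - ups, 0)]

    return scheme + "://" + "/".join([host] + dirs + [target])


print(resolve_by_splitting("http://www.xyz.com/a/b/page.asp", "../../wohoo.asp"))
# http://www.xyz.com/wohoo.asp
```

Keeping the host separate from the path parts means the "http://" prefix never has to be stripped and re-added by hand.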

Just a quick shot at a solution...

/mortb


"ask josephsen" <jaj(((a)))oticon.dk> wrote in message
news:4090c8a4$0$1118$(E-Mail Removed)...
> Hi NG
>
> I'm making a program to crawl the internet. It works by retrieving all links
> in a page, downloading the page of each link and again retrieving all the
> links. (If there is better ways I'd like to hear)
>
> My problem is relative links (like "../../wohoo.asp"). What is the smartest
> way to get the full url (http://www.xyz.com/wohoo.asp)? Do I have to parse
> the relative link in relation to the url where the relative link was found
> and then concatenate it? Does anyone know how other search-engines/ crawlers
> walk the net?
>
>
> Thanks
>
> ./ask



 
Christopher Benson-Manica
      04-29-2004
ask josephsen <jaj(((a)))oticon.dk> spoke thus:

> I'm making a program to crawl the internet. It works by retrieving all links
> in a page, downloading the page of each link and again retrieving all the
> links. (If there is better ways I'd like to hear)


(You could look at how wget is implemented. Or, better, just USE wget.)

Your post is off-topic for comp.lang.c++. Please visit

http://www.slack.net/~shiva/welcome.txt
http://www.parashift.com/c++-faq-lite/

for posting guidelines and frequently asked questions. Thank you.

--
Christopher Benson-Manica | I *should* know what I'm talking about - if I
ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
 