Distinguish text URLs from non-text URLs?

 
 
Kaidi
      01-01-2004
Hello,
I have a question about writing a crawler-like program in Java.
Since I only need the text (HTML) files, I am wondering whether anyone
knows a good way to distinguish text URLs (files such as .html, .htm, etc.)
from non-text URLs.

What I want is: given a URL as a String, how can I decide whether it
points to a text file (.htm, etc.) or not? Text pages usually have URLs
ending with .htm, .html, etc., but many dynamic pages do not. For example,
http://www.amazon.com/exec/obidos/AS...149236-2409652
points to an HTML page, yet the URL has no file extension.

I have tried getContentType() from the URLConnection class, but it
works badly for me: it even reports many .pdf files as text.

Does anyone have any ideas?
Thanks, and happy new year!
 
Tony Morris
      01-01-2004
The content type of the response is a web server setting.
If the server is responding with "Content-Type: text/plain" from a PDF file,
then it has not been configured correctly.

Server configuration is *usually* done with a mapping between file extension
and content type, so if you were to duplicate this functionality on the
client side, you may be prone to problems.

I'd configure the server correctly and use the getContentType() call.

--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software

Tor Iver Wilhelmsen
      01-01-2004
Kaidi writes:

> What I want is: given a String type url, how can I decide whether this
> URL
> points a text file (.htm, etc) or not?


Connect to the URL, do a HEAD request, and check the content type.

> We know text pages usually have URLs ending with .htm, .html, etc.


Not necessarily.

> I have tried to use the getContentType() from Class URLConnection,
> but it works so bad and it even consider many .pdf files as text.
>


No, it's not the method URLConnection.getContentType() that is bad,
it's the web server sending the wrong content type. The API cannot fix
outside errors.
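
A minimal sketch of the HEAD approach described above, assuming plain http/https URLs. The class name, method names, timeout values, and the choice to count application/xhtml+xml as text are illustrative assumptions, not something from this thread. The result is only as reliable as the header the server sends.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

class ContentTypeCheck {

    /** Returns true if a Content-Type header value names a text media type. */
    static boolean isTextContentType(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // Treating XHTML as text is an assumption made for this sketch.
        return mediaType.startsWith("text/") || mediaType.equals("application/xhtml+xml");
    }

    /** Sends a HEAD request, so only headers are transferred, never the body. */
    static boolean looksLikeText(String url) throws IOException {
        // The cast assumes an http or https URL; other schemes would fail here.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // assumed timeouts, in milliseconds
            conn.setReadTimeout(5000);
            return isTextContentType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```

A crawler would call ContentTypeCheck.looksLikeText(url) before deciding to download the body. Splitting out isTextContentType keeps the header parsing testable without a network connection.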
 
Daniel Sjöblom
      01-02-2004
Kaidi wrote:
> Hello,
> I have a question when trying to use Java to program a crawler like
> program.
> As I only need the text (html) files, I am wondering whether anyone
> know a
> good way to distinguish text URLs (files such as html, htm, etc) from
> non-text URLs?


You could try 'sniffing' the first few bytes of the files. If they start
with <!DOC or <html you can be pretty sure they're html files. Of
course, this isn't foolproof.
--
Daniel Sjöblom
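
The sniffing idea above could be sketched roughly as follows. The class name, method names, and the 64-byte sample size are assumptions made for illustration. As the post says, this is a heuristic and not foolproof: a page that opens with a comment or a byte-order mark, for example, would be missed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

class HtmlPrefixSniffer {

    static final int SNIFF_BYTES = 64; // assumed sample size, not from the thread

    /** Reads up to SNIFF_BYTES from the start of a response body. */
    static byte[] readPrefix(InputStream in) throws IOException {
        byte[] buf = new byte[SNIFF_BYTES];
        int total = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                break; // end of stream before the buffer filled
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    /** Checks whether the sampled bytes begin with "<!DOC" or "<html", ignoring case. */
    static boolean startsLikeHtml(byte[] prefix) {
        // ISO-8859-1 maps every byte to one char, so binary data cannot break decoding.
        String head = new String(prefix, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        return head.startsWith("<!doc") || head.startsWith("<html");
    }
}
```

In a crawler, the stream passed to readPrefix would typically come from URLConnection.getInputStream(), and the connection can be dropped as soon as startsLikeHtml returns false.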

 
Kaidi
      01-03-2004
Thanks, friends.
I will try the HEAD request as suggested above.
Currently, I 'sniff' the bytes to see whether they contain any HTML tags
such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although
it is somewhat troublesome and rather "heuristic".

Andrew Thompson
      01-04-2004
"Kaidi" wrote:
> Thanks, friends.
> I will try the HEAD request as suggested above.
> Currently, I 'sniff' the bytes to see whether they contain any HTML tags
> such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although
> it is somewhat troublesome and rather "heuristic".


Properly formed HTML documents have a
string like this at the very top:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

So if the file _starts_ with '<!DOCTYPE HTML'
you can tell early that this is an HTML document.
[ Unfortunately, very few pages _are_ properly
formed. ]

Otherwise I would recommend searching for the strings
you mentioned, but with the opening '<', such as
'<head' or '<html'.

That reminds me of something else: make sure
you match them case-insensitively, as upper and
lower case are both valid.

HTH

--
Andrew Thompson
* http://www.PhySci.org/ PhySci software suite
* http://www.1point1C.org/ 1.1C - Superluminal!
* http://www.AThompson.info/andrew/ personal site
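
The advice above (search for tag names with their opening '<', ignoring case) might look like this in Java. The class name, method name, and exact marker list are assumptions chosen for illustration, drawn from the tags mentioned in this thread.

```java
import java.util.Locale;

class HtmlTagHeuristic {

    // Lower-case markers, each including the opening '<' as suggested above.
    private static final String[] MARKERS = {
        "<!doctype html", "<html", "<head", "<body", "<meta"
    };

    /** Returns true if the sample contains any marker, in upper, lower, or mixed case. */
    static boolean containsHtmlMarker(String sample) {
        // Lower-casing once makes every comparison case-insensitive.
        String lower = sample.toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

Requiring the '<' keeps ordinary text that merely mentions the word "html" from matching, though a binary file could still contain one of these byte sequences by chance.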


 