Velocity Reviews - Computer Hardware Reviews

Creating a canonicalized url

 
 
Dan Cuddeford
      01-24-2008
Hello there guys,

I'm trying to track down an easy way to canonicalize a URL from within
Ruby. I've been looking around for this, but all I can find are some
procedural hacks such as:

# canonicalize the url
if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }

which isn't going to take into account everything according to RFC 2396:

* Remove all leading and trailing dots
* Replace consecutive dots with a single dot.
* If the hostname can be parsed as an IP address, it should be
normalized to 4 dot-separated decimal values. The client should handle
any legal IP address encoding, including octal, hex, and fewer than 4
components.
* Lowercase the whole string.


# The sequences "/../" and "/./" in the path should be resolved, by
replacing "/./" with "/", and removing "/../" along with the preceding
path component.
# Runs of consecutive slashes should be replaced with a single slash
character.

So is there a method out there for this?
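
For what it's worth, the hostname rules listed above can be sketched in plain Ruby. This is only a rough illustration: `clean_hostname` and `normalize_ip` are hypothetical helper names, not part of the standard library, and the IP handling assumes inet_aton-style parsing (a leading 0x means hex, a leading 0 means octal, and the last component fills whatever bytes remain).

```ruby
# Hypothetical helper: lowercase, collapse runs of dots, then strip
# leading and trailing dots.
def clean_hostname(host)
  host.downcase.gsub(/\.{2,}/, '.').gsub(/\A\.+|\.+\z/, '')
end

# Hypothetical helper: returns a dotted-quad string if +host+ looks like
# an IPv4 address in decimal, octal or hex with 1 to 4 components,
# otherwise nil.
def normalize_ip(host)
  parts = host.split('.', -1)
  return nil unless (1..4).cover?(parts.size)

  nums = parts.map do |part|
    case part
    when /\A0x[0-9a-f]+\z/i then part.hex
    when /\A0[0-7]*\z/      then part.oct
    when /\A[1-9][0-9]*\z/  then part.to_i
    else return nil # not a number, so not an IP address
    end
  end

  *head, last = nums
  return nil if head.any? { |n| n > 255 }
  return nil if last >= 256**(4 - head.size)

  # The last component supplies all bytes not covered by the others.
  tail = (3 - head.size).downto(0).map { |i| (last >> (8 * i)) & 0xff }
  (head + tail).join('.')
end

clean_hostname('..www..Ruby-Lang..ORG.')  # => "www.ruby-lang.org"
normalize_ip('0x7f.1')                    # => "127.0.0.1"
normalize_ip('3232235777')                # => "192.168.1.1"
normalize_ip('www.ruby-lang.org')         # => nil
```

The two helpers are kept separate so the dot cleanup can run first and the IP check can be applied to its result.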
--
Posted via http://www.ruby-forum.com/.

 
Rob Biedenharn
      01-24-2008

On Jan 24, 2008, at 7:14 AM, Dan Cuddeford wrote:

> Hello there guys,
>
> I'm trying to track down an easy way to canonicalize a URL from within
> Ruby. I've been looking around for this, but all I can find are some
> procedural hacks such as:
>
> # canonicalize the url
> if ($url -notmatch "^[a-z]+://") { $url = "http://$url" }
>
> which isn't going to take into account everything according to RFC
> 2396:
>
> * Remove all leading and trailing dots
> * Replace consecutive dots with a single dot.
> * If the hostname can be parsed as an IP address, it should be
> normalized to 4 dot-separated decimal values. The client should handle
> any legal IP address encoding, including octal, hex, and fewer than 4
> components.
> * Lowercase the whole string.
>
>
> # The sequences "/../" and "/./" in the path should be resolved, by
> replacing "/./" with "/", and removing "/../" along with the preceding
> path component.
> # Runs of consecutive slashes should be replaced with a single slash
> character.
>
> So is there a method out there for this?


I'd start looking at URI, in particular, URI#parse.

$ fri URI#parse
------------------------------------------------------------- URI::parse
     URI::parse(uri)
------------------------------------------------------------------------
     Synopsis
       URI::parse(uri_str)

     Args
       +uri_str+: String with URI.

     Description
       Creates one of the URI's subclasses instance from the string.

     Raises
       URI::InvalidURIError

       Raised if URI given is not a correct one.

     Usage
       require 'uri'

       uri = URI.parse("http://www.ruby-lang.org/")
       p uri
       # => #<URI::HTTP:0x202281be URL:http://www.ruby-lang.org/>
       p uri.scheme
       # => "http"
       p uri.host
       # => "www.ruby-lang.org"

As for the "Lowercase the whole string" part, only the domain is
required to be case-insensitive. It is possible for the underlying
web server to ignore case when finding a path, but the URI is not
necessarily a reference to the same resource if the case is altered.

-Rob

Rob Biedenharn http://agileconsultingllc.com


 
Jean-François Trân
      01-24-2008
2008/1/24, Rob Biedenharn:

> As for the "Lowercase the whole string" part, only the domain is
> required to be case-insensitive. It is possible for the underlying
> web server to ignore case when finding a path, but the URI is not
> necessarily a reference to the same resource if the case is altered.


There's URI#normalize and URI#normalize! to downcase the host
part of the url.

-- Jean-François.

 
Dan Cuddeford
      01-24-2008
Thanks for your help - I'll let you know how I get on

--
Posted via http://www.ruby-forum.com/.

 
Dan Cuddeford
      01-24-2008
So it seems using the two together


require 'uri'

uri = URI.parse("http://www.ruBy-lang.org/ARSE")

can = uri.normalize
p can

p can.host

p can.path


means the path keeps its case but the host is normalized.

I think that's it - however,

try it with ruby-lang..org and

/usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
does not accept registry part: www.ruBy-lang..org (or bad hostname?)
(URI::InvalidURIError)
from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
from canon.rb:3

So I guess it needs a bit of error checking beforehand.
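
One way that error checking might look is to rescue `URI::InvalidURIError` and treat the input as unusable. This is just a sketch: `safe_normalize` is a made-up name, and exactly which strings the parser rejects depends on the Ruby version (newer parsers may well accept the double-dot host above), so the failing input below uses a space in the host instead.

```ruby
require 'uri'

# Hypothetical helper: returns the normalized URI as a string, or nil
# when URI.parse refuses the input.
def safe_normalize(str)
  URI.parse(str).normalize.to_s
rescue URI::InvalidURIError
  nil
end

safe_normalize('http://www.ruBy-lang.org/ARSE')  # => "http://www.ruby-lang.org/ARSE"
safe_normalize('http://exa mple.com/')           # => nil
```

Returning nil keeps the caller in charge of what to do with a bad URL (skip it, log it, try to repair the host and parse again).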
--
Posted via http://www.ruby-forum.com/.

 
Rob Biedenharn
      01-24-2008

On Jan 24, 2008, at 9:23 AM, Dan Cuddeford wrote:

> So it seems using the two together
>
>
> require 'uri'
>
> uri = URI.parse("http://www.ruBy-lang.org/ARSE")
>
> can = uri.normalize
> p can
>
> p can.host
>
> p can.path
>
>
> means the path keeps its case but the host is normalized.
>
> I think that's it - however,
>
> try it with ruby-lang..org and
>
> /usr/lib/ruby/1.8/uri/generic.rb:195:in `initialize': the scheme http
> does not accept registry part: www.ruBy-lang..org (or bad hostname?)
> (URI::InvalidURIError)
> from /usr/lib/ruby/1.8/uri/http.rb:78:in `initialize'
> from /usr/lib/ruby/1.8/uri/common.rb:488:in `new'
> from /usr/lib/ruby/1.8/uri/common.rb:488:in `parse'
> from canon.rb:3
>
> So I guess it needs a bit of error checking beforehand.


require 'uri'

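def canonicalize(uri)
  # Accept either a URI object or anything that converts to a string.
  u = uri.kind_of?(URI) ? uri : URI.parse(uri.to_s)
  u.normalize!   # downcases the scheme and host
  newpath = u.path
  # Repeatedly drop each "segment/../" pair until none is left.
  while newpath.gsub!(%r{([^/]+)/\.\./?}) { |match|
      $1 == '..' ? match : ''
    } do end
  # Collapse "/./" and a trailing "/." into "/".
  newpath = newpath.gsub(%r{/\./}, '/').sub(%r{/\.\z}, '/')
  u.path = newpath
  u.to_s
end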

canonicalize('http://www.Ruby-Lang.ORG/ARSE/done/../../rear/./end/.')
=> "http://www.ruby-lang.org/rear/end/"

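An alternative sketch of the same path cleanup walks the segments with a stack instead of looping over a regex. `resolve_path` and `canonical_url` are hypothetical names, and the code assumes an absolute path such as URI.parse returns for an http URL. The regex loop above appears not to terminate on a path like "/../../x", since gsub! still reports a substitution when the block hands back the match unchanged; the stack version simply ignores a ".." that has nothing left to remove. It also squeezes runs of slashes, as asked for in the original post.

```ruby
require 'uri'

# Hypothetical helper: resolve "." and ".." segments and collapse
# repeated slashes in an absolute path.
def resolve_path(path)
  segments = path.squeeze('/').split('/', -1)
  out = []
  segments.each do |seg|
    case seg
    when '.'
      # a "." segment refers to the current directory, so drop it
    when '..'
      # never pop the leading empty segment that stands for the root "/"
      out.pop if out.size > 1
    else
      out << seg
    end
  end
  # A trailing "." or ".." leaves a directory, so keep the final slash.
  out << '' if ['.', '..'].include?(segments.last)
  out.join('/')
end

# Hypothetical helper combining URI#normalize with the path cleanup.
def canonical_url(str)
  u = URI.parse(str).normalize
  u.path = resolve_path(u.path)
  u.to_s
end

canonical_url('http://www.Ruby-Lang.ORG/ARSE/done/../../rear/./end/.')
# => "http://www.ruby-lang.org/rear/end/"
resolve_path('/../../x')    # => "/x"
resolve_path('/a//b/../c')  # => "/a/c"
```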
-Rob

Rob Biedenharn http://agileconsultingllc.com
(E-Mail Removed)


 
Dan Cuddeford
      01-24-2008
Wow - thanks for the answer mate!
--
Posted via http://www.ruby-forum.com/.

 
Jörg W Mittag
      01-24-2008
Dan Cuddeford wrote:
> Wow - thanks for the answer mate!


There's also the Addressable Gem: <http://Addressable.RubyForge.Org/>.

It's intended as a standards compliant replacement for the stdlib's
URI library. Take a look into the test directory of that sucker: over
440 Unit Tests (actually, Object Examples) for a frickin' URI parser!
(See: <http://Addressable.RubyForge.Org/specdoc/>) That guy is nuts!
That code's gotta be as rock-solid as it gets.

Oh, and back to the topic at hand: it has a normalize method built in:

begin
  require 'rubygems'
  gem 'addressable'
rescue LoadError; end
require 'addressable/uri'
uri = Addressable::URI.heuristic_parse('www.Ruby-Lang..ORG/ARSE/done/../../r e a r/./end/.#exit')
uri.normalize!
puts uri.display_uri # => http://www.ruby-lang..org/r%20e%20a%20r/end/#exit

jwm
 
Dan Cuddeford
      01-25-2008
Jörg W Mittag wrote:
> puts uri.display_uri # =>
> http://www.ruby-lang..org/r%20e%20a%20r/end/#exit
>
> jwm


Nice but shouldn't it go to ruby-lang.org?
--
Posted via http://www.ruby-forum.com/.

 
Jörg W Mittag
      01-26-2008
Dan Cuddeford wrote:
> Jörg W Mittag wrote:
>> puts uri.display_uri # =>
>> http://www.ruby-lang..org/r%20e%20a%20r/end/#exit

> Nice but shouldn't it go to ruby-lang.org?


I'm not sure. I just scanned RFC 3986 and RFC 1034 and I'm not even sure
that's a valid URI host part to begin with. *If* it's invalid, then
there's not much a URI normalizer can do, right?

However, I could be wrong. Reading RFCs is not exactly my specialty.
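
For illustration, a rough check along those lines could test each dot-separated label on its own. `plausible_hostname?` is a hypothetical name, and the rule it encodes (labels of 1 to 63 letters, digits or hyphens, not starting or ending with a hyphen) is an assumption based on the usual hostname conventions rather than a full reading of either RFC. Under that rule an empty label, as in "ruby-lang..org", is rejected.

```ruby
# Hypothetical helper: true when every label of +host+ is non-empty and
# made of letters, digits and interior hyphens only.
def plausible_hostname?(host)
  label = /\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i
  # Allow one trailing dot (the DNS root), then check each label.
  labels = host.chomp('.').split('.', -1)
  !labels.empty? && labels.all? { |l| l.match?(label) }
end

plausible_hostname?('www.ruby-lang.org')   # => true
plausible_hostname?('www.ruby-lang..org')  # => false
```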

jwm
 