Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Python > Generate unique ID for URL

Reply
Thread Tools

Generate unique ID for URL

 
 
Steve Howell
Guest
Posts: n/a
 
      11-14-2012
On Nov 13, 6:04*pm, Steven D'Aprano <steve
+comp.lang.pyt...@pearwood.info> wrote:
> On Tue, 13 Nov 2012 16:13:58 -0800, Miki Tebeka wrote:
> >> I want to create a URL-safe unique ID for URL's. What do you recommend?
> >> - Compression?

> > You can use base62 with a running counter, but then you'll need a (semi)
> > centralized entity to come up with the next id.

>
> > You can see one implementation athttp://bit.ly/PSJkHS(AppEngine
> > environment).

>
> Perhaps this is a silly question, but if you're using a running counter,
> why bother with base64? Decimal or hex digits are URL safe. If there are
> no concerns about predictability, why not just use the counter directly?
>
> You can encode a billion IDs in 8 hex digits compared to 16 base64
> characters:
>
> py> base64.urlsafe_b64encode('1000000000')
> 'MTAwMDAwMDAwMA=='
> py> "%x" % 1000000000
> '3b9aca00'
>
> Short and sweet and easy: no base64 calculation, no hash function, no
> database lookup, just a trivial int to string conversion.
>
> --
> Steven


If you're dealing entirely with integers, then this works too:

import base64

def encode(n):
s = ''
while n > 0:
s += chr(n % 256)
n //= 256
return base64.urlsafe_b64encode(s)

def test():
seen = set()
for i in range(999900000, 1000000000):
s = encode(i)
if s in seen:
raise Exception('non-unique encoding')
seen.add(s)
print encode(1000000000)

test()

It prints this for 1000000000:

AMqaOw==

 
Reply With Quote
 
 
 
 
Richard
Guest
Posts: n/a
 
      11-14-2012
I am dealing with URL's rather than integers
 
Reply With Quote
 
 
 
 
Richard
Guest
Posts: n/a
 
      11-14-2012
So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL.
I can't store the files in a single directory because of OS limitations so have been using a sub folder structure.
For example to store data at URL "abc": a/b/c/index.html
This data is also viewed locally through a web app.

If you can suggest a better approach I would welcome it.
 
Reply With Quote
 
Richard
Guest
Posts: n/a
 
      11-14-2012
> The next step is to reduce the number of bits you are encoding. You
>
> said in another post that "1 collision in 10 million hashes would be
>
> tolerable". So you need:
>
>
>
> >>> math.log(10*1000*1000, 2)

>
> 23.25349666421154



I think a difficulty would be finding a hash algorithm that maps evenly across those bits.
 
Reply With Quote
 
Roy Smith
Guest
Posts: n/a
 
      11-14-2012
In article <1ce88f36-bfc7-4a55-89f8->,
Richard <> wrote:

> So the use case - I'm storing webpages on disk and want a quick retrieval
> system based on URL.
> I can't store the files in a single directory because of OS limitations so
> have been using a sub folder structure.
> For example to store data at URL "abc": a/b/c/index.html
> This data is also viewed locally through a web app.
>
> If you can suggest a better approach I would welcome it.


Ah, so basically, you're reinventing Varnish?

Maybe do what Varnish (and MongoDB, and a few other things) do? Bypass
the file system entirely. Juar mmap() a chunk of memory large enough to
hold everything and let the OS figure out how to page things to disk.
 
Reply With Quote
 
Richard
Guest
Posts: n/a
 
      11-14-2012
thanks for pointer to Varnish.

I found MongoDB had a lot of size overhead so that it ended up using 4x the data stored.
 
Reply With Quote
 
Chris Angelico
Guest
Posts: n/a
 
      11-14-2012
On Wed, Nov 14, 2012 at 2:25 PM, Richard <> wrote:
> So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL.
> I can't store the files in a single directory because of OS limitations so have been using a sub folder structure.
> For example to store data at URL "abc": a/b/c/index.html
> This data is also viewed locally through a web app.
>
> If you can suggest a better approach I would welcome it.


The cost of a crypto hash on the URL will be completely dwarfed by the
cost of storing/retrieving on disk. You could probably do some
arithmetic and figure out exactly how many URLs (at an average length
of, say, 100 bytes) you can hash in the time of one disk seek.

ChrisA
 
Reply With Quote
 
Richard
Guest
Posts: n/a
 
      11-14-2012
yeah good point - I have gone with md5 for now.


On Wednesday, November 14, 2012 3:06:18 PM UTC+11, Chris Angelico wrote:
> On Wed, Nov 14, 2012 at 2:25 PM, Richard <> wrote:
>
> > So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL.

>
> > I can't store the files in a single directory because of OS limitations so have been using a sub folder structure.

>
> > For example to store data at URL "abc": a/b/c/index.html

>
> > This data is also viewed locally through a web app.

>
> >

>
> > If you can suggest a better approach I would welcome it.

>
>
>
> The cost of a crypto hash on the URL will be completely dwarfed by the
>
> cost of storing/retrieving on disk. You could probably do some
>
> arithmetic and figure out exactly how many URLs (at an average length
>
> of, say, 100 bytes) you can hash in the time of one disk seek.
>
>
>
> ChrisA


 
Reply With Quote
 
Richard
Guest
Posts: n/a
 
      11-14-2012
yeah good point - I have gone with md5 for now.


On Wednesday, November 14, 2012 3:06:18 PM UTC+11, Chris Angelico wrote:
> On Wed, Nov 14, 2012 at 2:25 PM, Richard <> wrote:
>
> > So the use case - I'm storing webpages on disk and want a quick retrieval system based on URL.

>
> > I can't store the files in a single directory because of OS limitations so have been using a sub folder structure.

>
> > For example to store data at URL "abc": a/b/c/index.html

>
> > This data is also viewed locally through a web app.

>
> >

>
> > If you can suggest a better approach I would welcome it.

>
>
>
> The cost of a crypto hash on the URL will be completely dwarfed by the
>
> cost of storing/retrieving on disk. You could probably do some
>
> arithmetic and figure out exactly how many URLs (at an average length
>
> of, say, 100 bytes) you can hash in the time of one disk seek.
>
>
>
> ChrisA


 
Reply With Quote
 
Johannes Bauer
Guest
Posts: n/a
 
      11-14-2012
On 14.11.2012 01:41, Richard Baron Penman wrote:
> I found the MD5 and SHA hashes slow to calculate.


Slow? For URLs? Are you kidding? How many URLs per second do you want to
calculate?

> The builtin hash is fast but I was concerned about collisions. What
> rate of collisions could I expect?


MD5 has 16 bytes (128 bit), SHA1 has 20 bytes (160 bit). Utilizing the
birthday paradox and some approximations, I can tell you that when using
the full MD5 you'd need around 2.609e16 hashes in the same namespace to
get a one in a million chance of a collision. That is, 26090000000000000
filenames.

For SHA1 This number rises even further and you'd need around 1.71e21 or
1710000000000000000000 hashes in one namespace for the one-in-a-million.

I really have no clue about how many URLs you want to hash, and it seems
to be LOTS since the speed of MD5 seems to be an issue for you. Let me
estimate that you'd want to calculate a million hashes per second then
when you use MD5, you'd have about 827 years to fill the namespace up
enough to get a one-in-a-million.

If you need even more hashes (say a million million per second), I'd
suggest you go with SHA-1, giving you 54 years to get the one-in-a-million.

Then again, if you went for a million million hashes per second, Python
would probably not be the language of your choice.

Best regards,
Johannes

--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

> Zumindest nicht öffentlich!

Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Is there a unique method in python to unique a list? Token Type Python 9 09-09-2012 02:13 PM
list question... unique values in all possible unique spots ToshiBoy Python 6 08-12-2008 05:01 AM
how to generate unique value from java and mysql Mullin Java 3 04-08-2005 05:23 PM
Can we generate Unique ID from Java? Max Java 5 02-28-2004 03:00 PM
generate own unique sessionid instead standard asp.net 120bit sessionid Ronald ASP .Net 6 02-23-2004 08:03 AM



Advertisments