![]() |
Generate unique ID for URL
Hello,
I want to create a URL-safe unique ID for URL's. Currently I use: url_id = base64.urlsafe_b64encode(url) >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s' I would prefer more concise ID's. What do you recommend? - Compression? Richard |
Re: Generate unique ID for URL
In <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com> Richard <richardbp@gmail.com> writes:
> I want to create a URL-safe unique ID for URL's. > Currently I use: > url_id = base64.urlsafe_b64encode(url) > >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') > 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s' > I would prefer more concise ID's. > What do you recommend? - Compression? Does the ID need to contain all the information necessary to recreate the original URL? -- John Gordon A is for Amy, who fell down the stairs gordon@panix.com B is for Basil, assaulted by bears -- Edward Gorey, "The Gashlycrumb Tinies" |
Re: Generate unique ID for URL
Good point - one way encoding would be fine.
Also this is performed millions of times so ideally efficient. On Wednesday, November 14, 2012 10:34:03 AM UTC+11, John Gordon wrote: > In <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com> Richard <richardbp@gmail.com> writes: > > > > > I want to create a URL-safe unique ID for URL's. > > > Currently I use: > > > url_id = base64.urlsafe_b64encode(url) > > > > > >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') > > > 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s' > > > > > I would prefer more concise ID's. > > > What do you recommend? - Compression? > > > > Does the ID need to contain all the information necessary to recreate the > > original URL? > > > > -- > > John Gordon A is for Amy, who fell down the stairs > > gordon@panix.com B is for Basil, assaulted by bears > > -- Edward Gorey, "The Gashlycrumb Tinies" |
Re: Generate unique ID for URL
> I want to create a URL-safe unique ID for URL's.
> What do you recommend? - Compression? You can use base62 with a running counter, but then you'll need a (semi) centralized entity to come up with the next id. You can see one implementation at http://bit.ly/PSJkHS (AppEngine environment). |
Re: Generate unique ID for URL
One option would be using a hash. Python's built-in hash, a 32-bit
CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist, depending on the needs. Higher bit counts will reduce the odds of accidental collisions; cryptographically secure ones if outside attacks matter. In such a case, you'd have to roll your own means of converting the hash back into the string if you ever need it for debugging, and there is always the possibility of collisions. A similar solution would be using a pseudo-random GUID using the url as the seed. You could use a counter if all IDs are generated by a single process (and even in other cases with some work). If you want to be able to go both ways, using base64 encoding is probably your best bet, though you might get benefits by using compression. Chris On Tue, Nov 13, 2012 at 3:56 PM, Richard <richardbp@gmail.com> wrote: > Good point - one way encoding would be fine. > > Also this is performed millions of times so ideally efficient. > > > On Wednesday, November 14, 2012 10:34:03 AM UTC+11, John Gordon wrote: >> In <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com> Richard <richardbp@gmail.com> writes: >> >> >> >> > I want to create a URL-safe unique ID for URL's. >> >> > Currently I use: >> >> > url_id = base64.urlsafe_b64encode(url) >> >> >> >> > >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') >> >> > 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s' >> >> >> >> > I would prefer more concise ID's. >> >> > What do you recommend? - Compression? >> >> >> >> Does the ID need to contain all the information necessary to recreate the >> >> original URL? >> >> >> >> -- >> >> John Gordon A is for Amy, who fell down the stairs >> >> gordon@panix.com B is for Basil, assaulted by bears >> >> -- Edward Gorey, "The Gashlycrumb Tinies" > > -- > http://mail.python.org/mailman/listinfo/python-list |
Re: Generate unique ID for URL
I found the MD5 and SHA hashes slow to calculate.
The builtin hash is fast but I was concerned about collisions. What rate of collisions could I expect? Outside attacks not an issue and multiple processes would be used. On Wed, Nov 14, 2012 at 11:26 AM, Chris Kaynor <ckaynor@zindagigames.com> wrote: > One option would be using a hash. Python's built-in hash, a 32-bit > CRC, 128-bit MD5, 256-bit SHA or one of the many others that exist, > depending on the needs. Higher bit counts will reduce the odds of > accidental collisions; cryptographically secure ones if outside > attacks matter. In such a case, you'd have to roll your own means of > converting the hash back into the string if you ever need it for > debugging, and there is always the possibility of collisions. A > similar solution would be using a pseudo-random GUID using the url as > the seed. > > You could use a counter if all IDs are generated by a single process > (and even in other cases with some work). > > If you want to be able to go both ways, using base64 encoding is > probably your best bet, though you might get benefits by using > compression. > Chris > > > On Tue, Nov 13, 2012 at 3:56 PM, Richard <richardbp@gmail.com> wrote: >> Good point - one way encoding would be fine. >> >> Also this is performed millions of times so ideally efficient. >> >> >> On Wednesday, November 14, 2012 10:34:03 AM UTC+11, John Gordon wrote: >>> In <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com> Richard <richardbp@gmail.com> writes: >>> >>> >>> >>> > I want to create a URL-safe unique ID for URL's. >>> >>> > Currently I use: >>> >>> > url_id = base64.urlsafe_b64encode(url) >>> >>> >>> >>> > >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') >>> >>> > 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s' >>> >>> >>> >>> > I would prefer more concise ID's. >>> >>> > What do you recommend? - Compression? >>> >>> >>> >>> Does the ID need to contain all the information necessary to recreate the >>> >>> original URL? >>> >>> >>> >>> -- >>> >>> John Gordon A is for Amy, who fell down the stairs >>> >>> gordon@panix.com B is for Basil, assaulted by bears >>> >>> -- Edward Gorey, "The Gashlycrumb Tinies" >> >> -- >> http://mail.python.org/mailman/listinfo/python-list |
Re: Generate unique ID for URL
These URL ID's would just be used internally for quick lookups, not exposed publicly in a web application.
Ideally I would want to avoid collisions altogether. But if that means significant extra CPU time then 1 collision in 10 million hashes would be tolerable. |
Re: Generate unique ID for URL
I found md5 / sha 4-5 times slower than hash. And base64 a lot slower.
No database or else I would just use their ID. On Wednesday, November 14, 2012 11:59:55 AM UTC+11, Christian Heimes wrote: > Am 14.11.2012 01:41, schrieb Richard Baron Penman: > > > I found the MD5 and SHA hashes slow to calculate. > > > The builtin hash is fast but I was concerned about collisions. What > > > rate of collisions could I expect? > > > > Seriously? It takes about 1-5msec to sha1() one MB of data on a modern > > CPU, 1.5 on my box. The openssl variants of Python's hash code release > > the GIL so you use the power of all cores. |
Re: Generate unique ID for URL
In article <0692e6a2-343c-4eb0-be57-fe5c815efb99@googlegroups.com>,
Richard <richardbp@gmail.com> wrote: > Hello, > > I want to create a URL-safe unique ID for URL's. > Currently I use: > url_id = base64.urlsafe_b64encode(url) > > >>> base64.urlsafe_b64encode('docs.python.org/library/uuid.html') > 'ZG9jcy5weXRob24ub3JnL2xpYnJhcnkvdXVpZC5odG1s' > > I would prefer more concise ID's. > What do you recommend? - Compression? If you're generating random id strings, there's only two ways to make them shorter. Either encode fewer bits of information, or encode them more compactly. Let's start with the second one. You're already using base64, so you're getting 6 bits per character. You can do a little better than that, but not much. The set of URL-safe characters is the 96-ish printable ascii set, minus a few pieces of punctuation. Maybe you could get it up to 6.3 or 6.4 bits per character, but that's about it. For the complexity this would add it's probably not worth it. The next step is to reduce the number of bits you are encoding. You said in another post that "1 collision in 10 million hashes would be tolerable". So you need: >>> math.log(10*1000*1000, 2) 23.25349666421154 24 bits worth of key. Base64 encoded, that's only 4 characters. Actually, I probably just proved that I don't really understand how probabilities work, so maybe what you really need is 32 or 48 or 64 bits. Certainly not the 264 bits you're encoding with your example above. So, something like: hash = md5.md5('docs.python.org/library/uuid.html').digest() hash64 = base64.urlsafe_b64encode(hash) id = hash64[:8] # or 12, or whatever But, I still don't really understand your use case. You've already mentioned the following requirements: "just be used internally for quick lookups, not exposed publicly" "URL-safe" "unique" "1 collision in 10 million hashes would be tolerable" "one way encoding would be fine" "performed millions of times so ideally efficient" but haven't really explained what it is that you're trying to do. If they're not going to be exposed publicly, why do you care if they're URL-safe? What's wrong with just using the URLs directly as dictionary keys and not worrying about it until you've got some hard data showing that this is not sufficient? |
Re: Generate unique ID for URL
On Tue, 13 Nov 2012 16:13:58 -0800, Miki Tebeka wrote:
>> I want to create a URL-safe unique ID for URL's. What do you recommend? >> - Compression? > You can use base62 with a running counter, but then you'll need a (semi) > centralized entity to come up with the next id. > > You can see one implementation at http://bit.ly/PSJkHS (AppEngine > environment). Perhaps this is a silly question, but if you're using a running counter, why bother with base64? Decimal or hex digits are URL safe. If there are no concerns about predictability, why not just use the counter directly? You can encode a billion IDs in 8 hex digits compared to 16 base64 characters: py> base64.urlsafe_b64encode('1000000000') 'MTAwMDAwMDAwMA==' py> "%x" % 1000000000 '3b9aca00' Short and sweet and easy: no base64 calculation, no hash function, no database lookup, just a trivial int to string conversion. -- Steven |
| All times are GMT. The time now is 08:07 AM. |
Powered by vBulletin®. Copyright ©2000 - 2013, vBulletin Solutions, Inc.
SEO by vBSEO ©2010, Crawlability, Inc.