Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Storing large strings for future equality checks

Reply
Thread Tools

Storing large strings for future equality checks

 
 
Lothar Kimmeringer
Guest
Posts: n/a
 
      06-08-2011
David Kerber wrote:

> In article <iso8cm$a80$>,
> says...
>>
>> A small application that I'm making requires me to store very long
>> strings (>1000 characters) in a database.

>
> Unless you're storing millions of these strings or using Access, I'd say
> just store and use the string. I think you'll find that the speed
> penalty is nearly unnoticeable.


Depends on the database. Some of them force you to use CLOBS for
text-columns with more than 255 characters. CLOBS are a PITA in
terms of indexing.


Regards, Lothar
--
Lothar Kimmeringer E-Mail:
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
Reply With Quote
 
 
 
 
Robert Klemme
Guest
Posts: n/a
 
      06-08-2011
On 08.06.2011 18:35, Abu Yahya wrote:
> A small application that I'm making requires me to store very long
> strings (>1000 characters) in a database.
>
> I will need to use these strings later to compare for equality to
> incoming strings from another application. I will also want to add some
> of the incoming strings to the storage, if they meet certain criteria.
>
> For my application, I get a feeling that storing these strings in my
> table will be a waste of space, and will impact performance due to
> retrieval and storage times, as well as comparison times.
>
> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).
>
> What would be the best approach?


Just out of curiosity: what do those strings represent? What do you do
with them?

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Reply With Quote
 
 
 
 
Martin Gregorie
Guest
Posts: n/a
 
      06-08-2011
On Wed, 08 Jun 2011 20:28:11 +0200, Lothar Kimmeringer wrote:

> If you write seldom and read often, why not using two columns:
>
> string_hashcode
> sha1_hashcode
>

You haven't given us a maximum size for the string or the name of the
target database, so its possible that implementation constraints will
prevent the strings from fitting into a character() column and you'll
have to use a character varying, text, clob etc. instead.

If this type isn't allowed as the table's primary key, you can use the
hash code as the primary key. Since it doesn't matter if some strings
have clashing hash codes this can't cause a problem.


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
 
Reply With Quote
 
Tom Anderson
Guest
Posts: n/a
 
      06-08-2011
On Wed, 8 Jun 2011, Abu Yahya wrote:

> A small application that I'm making requires me to store very long
> strings (>1000 characters) in a database.


So >1000, but by any chance <2000? <4000?

> I will need to use these strings later to compare for equality to
> incoming strings from another application. I will also want to add some
> of the incoming strings to the storage, if they meet certain criteria.
>
> For my application, I get a feeling that storing these strings in my
> table will be a waste of space, and will impact performance due to
> retrieval and storage times, as well as comparison times.


I assume by 'table' you mean an in-memory hashtable. I wouldn't be so
quick to discount this - how many strings do you have? If you had 25 000
2000-character strings, that would be 100 MB; that's not an exorbitant
amount. Access would be pretty quick.

> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my
> application).


If you're using a database, don't bother with a hash, just put the whole
string in, index the column, and then do equality queries on it. A B-tree
index will deal with this kind of query pretty efficiently.

If you wanted to pursue in-memory approaches, i'd suggest constructing a
trie of some sort. Tries are very fast at retrieval, don't need to access
the whole key to identify a miss, and only need to access the whole key
once to identify a hit - you always walk through the key from beginning to
end (perhaps skipping some characters in some kinds of tree), stopping as
soon as the key doesn't match, and only reaching the end if it does match.
I haven't found a good overview of the current state of the art in tries,
but look up Patricia tries, Judy tries, burst tries, fusion tries, and HAT
tries. You could consider only storing a prefix of the strings - the first
100 characters, say - in the trie, to save memory, with the leaves having
pointers to where the complete strings are stored on disk.

tom

--
The sun just came out, I can't believe it
 
Reply With Quote
 
Harry Tuttle
Guest
Posts: n/a
 
      06-09-2011
Lothar Kimmeringer, 08.06.2011 20:31:
> Depends on the database. Some of them force you to use CLOBS for
> text-columns with more than 255 characters. CLOBS are a PITA in
> terms of indexing.


No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
 
Reply With Quote
 
Hallvard B Furuseth
Guest
Posts: n/a
 
      06-09-2011
Gene Wirchenko writes:
>On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <>
>wrote:
>>For my application, I get a feeling that storing these strings in my
>>table will be a waste of space, and will impact performance due to
>>retrieval and storage times, as well as comparison times.

>
> Your feeling is irrelevant. You should benchmark. If you do
> not, you may end up jumping through hoops for something that is
> unneeded (though you may never find out it that it is unneeded).


Indeed. No point in using a lot of time to solve a non-problem. But
after the benchmark, the decision can depend on who the application is
for. If it scales poorly, that can bite other users with different
input data.

OTOH if he'll be the only user and he finds that full strings and SHA
are both too slow: Another approach would be to use a faster hash, count
hash collisions, and don't bother with more if the count is acceptable.

Or try tries, as Tom Anderson suggested.

--
Hallvard
 
Reply With Quote
 
Michael Wojcik
Guest
Posts: n/a
 
      06-09-2011
Gene Wirchenko wrote:
> On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <>
> wrote:
>
>> I considered using an SHA-512 hash of these strings and storing them in
>> the database. However, while these will save on storage space, it will
>> take time to do the hashing before comparing an incoming string. So I'm
>> still wasting time. (Collisions due to hashing will not be a problem,
>> since an occasional false positive will not be fatal for my application).

>
> What does "occasional" mean?


Assuming the application doesn't accidentally run afoul of a
hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it
means "once in every N/2**256 uses", where N is the current number of
hashes in the database. (2**256 because of the Birthday Paradox; we're
interested in the probability of two arbitrary colliding pre-images.)

Or, in other words, "probably not before the heat death of the
universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.)

Worrying about the time to hash the string is silly. It's linear in
the length of the string, so it's roughly equivalent to the time taken
to do a few comparisons (where "a few" depends on how long, on
average, the matching prefix of the new string and the candidates is).

However, as various others have pointed out, this is premature
optimization. There's no reason to use any design other than the
obvious until a problem is demonstrated.

--
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University
 
Reply With Quote
 
Joshua Maurice
Guest
Posts: n/a
 
      06-09-2011
On Jun 8, 9:35*am, Abu Yahya <abu_ya...@invalid.com> wrote:
> A small application that I'm making requires me to store very long
> strings (>1000 characters) in a database.
>
> I will need to use these strings later to compare for equality to
> incoming strings from another application. I will also want to add some
> of the incoming strings to the storage, if they meet certain criteria.
>
> For my application, I get a feeling that storing these strings in my
> table will be a waste of space, and will impact performance due to
> retrieval and storage times, as well as comparison times.
>
> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).
>
> What would be the best approach?


If it's that relevant that you're asking, measure first to see if it's
a problem. If you're that concerned that it will be, then code a
number of reasonable alternatives and measure.

Presumably you need to do a Map lookup on the incoming strings. I
thought about some itern scheme, but that won't work if you're
receiving a lot of incoming new strings. Storing hashs could work. Do
you need to store the strings in a database? If you can store them
locally, maybe a trie?
http://en.wikipedia.org/wiki/Trie
I somewhat doubt (maybe?) that you're going to get much better lookup
performance than a trie (but of course I would measure too).
 
Reply With Quote
 
Harry Tuttle
Guest
Posts: n/a
 
      06-10-2011
Lothar Kimmeringer, 08.06.2011 20:31:
> Depends on the database. Some of them force you to use CLOBS for
> text-columns with more than 255 characters. CLOBS are a PITA in
> terms of indexing.


No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
matching strings in a large set of strings Karin Lagesen Python 13 05-03-2010 03:53 PM
External Hashing [was Re: matching strings in a large set of strings] Helmut Jarausch Python 3 04-30-2010 08:44 PM
Strings, Strings and Damned Strings Ben C Programming 14 06-24-2006 05:09 AM
Storing variable arguments for future use in C? prasanthag@gmail.com C Programming 3 05-30-2006 12:45 PM
storing variable arguments for future use matevz bradac C Programming 6 04-17-2004 06:29 AM



Advertisments