Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Storing large strings for future equality checks

Reply
Thread Tools

Storing large strings for future equality checks

 
 
Abu Yahya
Guest
Posts: n/a
 
      06-08-2011
A small application that I'm making requires me to store very long
strings (>1000 characters) in a database.

I will need to use these strings later to compare for equality to
incoming strings from another application. I will also want to add some
of the incoming strings to the storage, if they meet certain criteria.

For my application, I get a feeling that storing these strings in my
table will be a waste of space, and will impact performance due to
retrieval and storage times, as well as comparison times.

I considered using an SHA-512 hash of these strings and storing them in
the database. However, while these will save on storage space, it will
take time to do the hashing before comparing an incoming string. So I'm
still wasting time. (Collisions due to hashing will not be a problem,
since an occasional false positive will not be fatal for my application).

What would be the best approach?

 
Reply With Quote
 
 
 
 
markspace
Guest
Posts: n/a
 
      06-08-2011
On 6/8/2011 9:35 AM, Abu Yahya wrote:

> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).



You have to store the whole string. Even if the SHA-512 hash codes are
equal, it could be that the strings are different. You'll have to
eventually compare the raw string, even if the SHA is used as a
quick-out case.

No one can really tell what is "faster" or "wasting time" until you can
better characterize the usage patterns. How big can these strings get?
How often will you get an actual duplicate? What's the penalty when
you need to add a new string? You'll need to implement a few
algorithms, profile them and then make a decision based on actual data.

For Java, I'd store the strings in a WeakHashMap or similar to allow
them to be cached, but tossed away when more storage is needed. Also
you should look into getting some DB caching library, much easier than
implementing this yourself (sorry I can't personally recommend any).
 
Reply With Quote
 
 
 
 
David Kerber
Guest
Posts: n/a
 
      06-08-2011
[This followup was posted to comp.lang.java.databases and a copy was
sent to the cited author.]

In article <iso8cm$a80$(E-Mail Removed)>, http://www.velocityreviews.com/forums/(E-Mail Removed)
says...
>
> A small application that I'm making requires me to store very long
> strings (>1000 characters) in a database.


Unless you're storing millions of these strings or using Access, I'd say
just store and use the string. I think you'll find that the speed
penalty is nearly unnoticeable.

D
 
Reply With Quote
 
Willem
Guest
Posts: n/a
 
      06-08-2011
markspace wrote:
) On 6/8/2011 9:35 AM, Abu Yahya wrote:
)> I considered using an SHA-512 hash of these strings and storing them in
)> the database. However, while these will save on storage space, it will
)> take time to do the hashing before comparing an incoming string. So I'm
)> still wasting time. (Collisions due to hashing will not be a problem,
)> since an occasional false positive will not be fatal for my application).
)
) You have to store the whole string. Even if the SHA-512 hash codes are
) equal, it could be that the strings are different. You'll have to
) eventually compare the raw string, even if the SHA is used as a
) quick-out case.

No he doesn't. Read again. Especially the last bit between parentheses.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
Reply With Quote
 
Gene Wirchenko
Guest
Posts: n/a
 
      06-08-2011
On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <(E-Mail Removed)>
wrote:

>A small application that I'm making requires me to store very long
>strings (>1000 characters) in a database.
>
>I will need to use these strings later to compare for equality to
>incoming strings from another application. I will also want to add some
>of the incoming strings to the storage, if they meet certain criteria.
>
>For my application, I get a feeling that storing these strings in my
>table will be a waste of space, and will impact performance due to
>retrieval and storage times, as well as comparison times.


Your feeling is irrelevant. You should benchmark. If you do
not, you may end up jumping through hoops for something that is
unneeded (though you may never find out it that it is unneeded).

>I considered using an SHA-512 hash of these strings and storing them in
>the database. However, while these will save on storage space, it will
>take time to do the hashing before comparing an incoming string. So I'm
>still wasting time. (Collisions due to hashing will not be a problem,
>since an occasional false positive will not be fatal for my application).


What does "occasional" mean?

>What would be the best approach?


If it is a small app, why are you worrying about it so much?
Create a straightforward design, code it, and see if it is fast
enough. If it is, keep it. If it is not, then and only then, concern
yourself with how to speed it up.

Quit wasting time posting here, and benchmark something. <BEG>

Sincerely,

Gene Wirchenko
 
Reply With Quote
 
Abu Yahya
Guest
Posts: n/a
 
      06-08-2011
On 6/8/2011 10:19 PM, markspace wrote:
> On 6/8/2011 9:35 AM, Abu Yahya wrote:
>
>> I considered using an SHA-512 hash of these strings and storing them in
>> the database. However, while these will save on storage space, it will
>> take time to do the hashing before comparing an incoming string. So I'm
>> still wasting time. (Collisions due to hashing will not be a problem,
>> since an occasional false positive will not be fatal for my application).

>
>
> You have to store the whole string. Even if the SHA-512 hash codes are
> equal, it could be that the strings are different. You'll have to
> eventually compare the raw string, even if the SHA is used as a
> quick-out case.


You're right about comparing the whole strings anyway, but, for this
application I'm creating, I wouldn't mind very very rare incorrect results.

>
> No one can really tell what is "faster" or "wasting time" until you can
> better characterize the usage patterns. How big can these strings get?
> How often will you get an actual duplicate? What's the penalty when you
> need to add a new string? You'll need to implement a few algorithms,
> profile them and then make a decision based on actual data.


That makes sense. I'd need to analyze my input data and then run some
empirical tests.

>
> For Java, I'd store the strings in a WeakHashMap or similar to allow
> them to be cached, but tossed away when more storage is needed. Also you
> should look into getting some DB caching library, much easier than
> implementing this yourself (sorry I can't personally recommend any).


The WeakHashMap idea looks useful. As for a DB caching library, would
you recommend this as a replacement for the WeakHashMap, or as a complement?

Thanks for the useful pointers.
 
Reply With Quote
 
Abu Yahya
Guest
Posts: n/a
 
      06-08-2011
On 6/8/2011 10:58 PM, Willem wrote:
> markspace wrote:
> ) On 6/8/2011 9:35 AM, Abu Yahya wrote:
> )> I considered using an SHA-512 hash of these strings and storing them in
> )> the database. However, while these will save on storage space, it will
> )> take time to do the hashing before comparing an incoming string. So I'm
> )> still wasting time. (Collisions due to hashing will not be a problem,
> )> since an occasional false positive will not be fatal for my application).
> )
> ) You have to store the whole string. Even if the SHA-512 hash codes are
> ) equal, it could be that the strings are different. You'll have to
> ) eventually compare the raw string, even if the SHA is used as a
> ) quick-out case.
>
> No he doesn't. Read again. Especially the last bit between parentheses.
>


Yeah, it seems he missed that last part or read it incorrectly.
 
Reply With Quote
 
Abu Yahya
Guest
Posts: n/a
 
      06-08-2011
On 6/8/2011 10:28 PM, David Kerber wrote:
> [This followup was posted to comp.lang.java.databases and a copy was
> sent to the cited author.]
>
> In article<iso8cm$a80$(E-Mail Removed)>, (E-Mail Removed)
> says...
>>
>> A small application that I'm making requires me to store very long
>> strings (>1000 characters) in a database.

>
> Unless you're storing millions of these strings or using Access, I'd say
> just store and use the string. I think you'll find that the speed
> penalty is nearly unnoticeable.


Thanks - simply storing them the way they are does seem to be the best
way forward.
 
Reply With Quote
 
Lothar Kimmeringer
Guest
Posts: n/a
 
      06-08-2011
F'up to cljp

Abu Yahya wrote:

> I considered using an SHA-512 hash of these strings and storing them in
> the database. However, while these will save on storage space, it will
> take time to do the hashing before comparing an incoming string. So I'm
> still wasting time. (Collisions due to hashing will not be a problem,
> since an occasional false positive will not be fatal for my application).


If you write seldom and read often, why not using two columns:

string_hashcode
sha1_hashcode

If the first is equal, you can calculate the sha1-hash for the string
to be checked and if that is equal as well, you can consider the
string as equal. That both hashes collide I expect to be very
very unlikely (which is why I changed the other alg to sha-1, that
should be considerably more performant than sha512).

So calculation of the more complex algorithm is only done while
storing to the database and when checking a string that is already
in the database. If you have that case very often you still might
get a better performance with String.hashcode and SHA1 than with
just SHA512.

> What would be the best approach?


There is no single best approach, only an optimal one. Which
one it is dependend on what defines one way to be better than
the other (in terms of performance, storage-space, collision-
rates, etc).


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (E-Mail Removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
Reply With Quote
 
Abu Yahya
Guest
Posts: n/a
 
      06-08-2011
On 6/8/2011 11:37 PM, Gene Wirchenko wrote:
> On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya<(E-Mail Removed)>
> wrote:
>
>> I considered using an SHA-512 hash of these strings and storing them in
>> the database. However, while these will save on storage space, it will
>> take time to do the hashing before comparing an incoming string. So I'm
>> still wasting time. (Collisions due to hashing will not be a problem,
>> since an occasional false positive will not be fatal for my application).

>
> What does "occasional" mean?
>


The application will be doing some probabilistic evaluation using the
data it is storing, and needs to be fairly close to the actual /most/ of
the time. Because it would anyway be impossible to be correct 100% of
the time, using a wrong value for a particular piece of input data will
not matter, provided it does not happen too often. (In my text above,
"rare" would have been a better choice of word than "occasional")
 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
matching strings in a large set of strings Karin Lagesen Python 13 05-03-2010 03:53 PM
External Hashing [was Re: matching strings in a large set of strings] Helmut Jarausch Python 3 04-30-2010 08:44 PM
Strings, Strings and Damned Strings Ben C Programming 14 06-24-2006 05:09 AM
Storing variable arguments for future use in C? prasanthag@gmail.com C Programming 3 05-30-2006 12:45 PM
storing variable arguments for future use matevz bradac C Programming 6 04-17-2004 06:29 AM



Advertisments