Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > compressing or encrypting a String

Reply
Thread Tools

compressing or encrypting a String

 
 
Wendy S
Guest
Posts: n/a
 
      12-25-2004
I'm sorting JPG files into directories based on an IPTC field. I need a way
to create a reasonably short directory name based on a longer (up to 64
chars) String. For example, I might have 100 images, each having one of
these three IPTC Headlines:

Mary Smith | Zorro | Some Event
John Brown | Patrick | Some Event
Alice Jones | Jack | Some Event

I need to generate three short strings that I can use for directory names
that will be the same every time a certain String is processed. I do NOT
need to be able to decode the original value from the shortened one. I only
care that I get the same directory name for the same long string every time
I "compress" it.

After some searching around, I realized that String.hashCode() does
_exactly_ what I'm asking for.

However, vanity requires (or at least wishes) that the directory name be a
little "prettier" and alphanumeric. Is there anything I can do with an
integer like 1263925117 to get something like 4xr52z (and still be
reasonably sure I won't get duplicates, even when the second half of all of
the Strings is likely to be the same)?

Or does anyone know another way to compress the original String into
something shorter?

(If anyone comes upon this later looking for a Java API to read IPTC or EXIF
data from images, look here: http://www.drewnoakes.com/code/exif/ )

Thanks for any ideas,
--
Wendy S


 
Reply With Quote
 
 
 
 
Chris
Guest
Posts: n/a
 
      12-25-2004
> However, vanity requires (or at least wishes) that the directory name be a
> little "prettier" and alphanumeric. Is there anything I can do with an
> integer like 1263925117 to get something like 4xr52z (and still be
> reasonably sure I won't get duplicates, even when the second half of all

of
> the Strings is likely to be the same)?


String dirName = Integer.toHexString("foo".hashCode());

By encoding the directory name using hex digits instead of decimal digits,
the name will never be longer than 8 characters.


 
Reply With Quote
 
 
 
 
Wendy S
Guest
Posts: n/a
 
      12-26-2004
"Chris" <(E-Mail Removed)> wrote
> String dirName = Integer.toHexString("foo".hashCode());
> By encoding the directory name using hex digits instead of decimal digits,
> the name will never be longer than 8 characters.


Thank you, that's exactly what I needed.

--
Wendy


 
Reply With Quote
 
David Zimmerman
Guest
Posts: n/a
 
      12-30-2004


Wendy S wrote:
> "Chris" <(E-Mail Removed)> wrote
>
>>String dirName = Integer.toHexString("foo".hashCode());
>>By encoding the directory name using hex digits instead of decimal digits,
>>the name will never be longer than 8 characters.

>
>
> Thank you, that's exactly what I needed.
>

hashcodes of strings are not neccessarily unique. Do you mind collisions?
 
Reply With Quote
 
Wendy S
Guest
Posts: n/a
 
      12-30-2004
"David Zimmerman" <(E-Mail Removed)> wrote:

> hashcodes of strings are not neccessarily unique. Do you mind collisions?


How often do two Strings compare as equal when they're actually not? I
would think it's fairly rare, but in this case it wouldn't be fatal. All
I'm doing is sorting images into subdirectories to make web galleries, so
the worst that happens is I lose a sale if someone can't find their pictures
to place an order.

--
Wendy Smoak


 
Reply With Quote
 
James Kanze
Guest
Posts: n/a
 
      12-30-2004
Chris Uppal wrote:
> Wendy S wrote:


>>How often do two Strings compare as equal when they're
>>actually not?


> [I'm assuming you mean, "how often to they have equal hashes
> when they're not equal?"]


> If you are producing 32-bit hash codes, and if your hashing
> algorithms produces more-or-less random values, then you'd
> expect to find at least 1 pair of strings with the same hash
> if you have at least 2^16 input strings. 65,000 files isn't
> really /that/ many, so I wouldn't recommend using a 32-bit
> hash for this kind of purpose unless you know that your
> application is going to be handling a lot less than 65K files.


Note that 1) Java's hash codes are actually only 31 bits (no
unsigned 32 bit type), and 2) the last time I looked, the Java
hash code was not particularly good for certain types of
strings -- the multiplier is too small.

> Also that is on the assumption that String.hash() is of
> reasonable quality. It doesn't look spectacularly good to me
> (though I haven't attempted to validate it).


It is fairly trivial to create test cases where it doesn't work
well -- say all possible two character strings. Whether this is
a problem with real strings depends on what you are doing.

> In any case, if you want your hashes to be reliable over a
> release of a new JVM implementation, then you can't base your
> file names on the String.hash() function since it is possible
> that the next version of the JVM will use a different hash
> method (it has already changed once, so maybe it'll change
> again).


The original version was a disaster. The current version is a
linear congruent hash using a Mersenne prime as a multiplier --
in my experience, these generally make pretty good hash
functions. On the other hand, the multiplier is small, which if
nothing else means that you won't get large numbers from very
short strings.

> I think you'd be better off using a hash function that:


> a) produces a lot more than 32 bits of output


> b) is defined independently of the JVM implementation.


Agreed on both counts.

> One excellent source of such hashes is to use a crypto hash
> like SHA-1 (160 bit?) or MD5 (128 bit). I've never had to do
> that in Java, but I presume one would start with an instance
> of MessageDigest.


CRC codes are another good source -- done correctly, they can be
a lot faster than MD5. (I've never implemented SHA-1.)

> However the actual crypto-graphic quality is unimportant here,
> so you might find it easier and quicker to code your own
> 64-bit (say) hash function. Here's what I would try first if
> I needed to do this, it's translated directly from 'C' so
> it'll probably not compile...


> /**
> * a String hash implemented as a perturbed 64-bit linear

congruential
> pseudo-random
> * number generator.
> */
> long
> longLinearCongruentialHash(String input)
> {
> long hash = 0L;
> for (int i = 0; i < input.length(); i++)
> {
> hash += input.charAt(i);
> hash *= 6364136223846793005L;
> }


> return hash;
> }


I'd have a look at http://www.isthe.com/chongo/tech/comp/fnv/.
As far as I know, I invented the use of a Mersenne prime for the
multiplicator, but I currently use FNV whenever I need a hash
code -- it's been more or less proven good. About the only time
I might use something else would be on a machine where
multiplication is extremely slow. What made me look at Mersenne
primes in the first place is that the multiplication can be
implemented by means of a single shift and a subtraction -- on
an 8086, this made a whale of a difference, but on a modern
processor?

Note that getting it both correct and rapid without unsigned
arithmetic isn't necessarily trivial. On the other hand, it's
possible (or even probable?) that the errors introduced by
Java's signedness don't really make it that much worse.

--
James Kanze home: www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 pl. Pierre Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34
 
Reply With Quote
 
Filip Larsen
Guest
Posts: n/a
 
      12-30-2004
James Kanze wrote

> CRC codes are another good source -- done correctly, they can be
> a lot faster than MD5. (I've never implemented SHA-1.)


I haven't measured their performance, but the build-in
java.util.zip.CRC32 and Adler32 have served me well in the past.


Regards,
--
Filip Larsen


 
Reply With Quote
 
Chris Uppal
Guest
Posts: n/a
 
      01-03-2005
James Kanze wrote:

> > it is possible
> > that the next version of the JVM will use a different hash
> > method (it has already changed once, so maybe it'll change
> > again).

>
> The original version was a disaster. The current version is a
> linear congruent hash using a Mersenne prime as a multiplier --


I hadn't twigged that the multiplier (31) was Mersenne prime. I wonder if the
JVM does actually jit the multiplication into a shift and decrement ?


> I'd have a look at http://www.isthe.com/chongo/tech/comp/fnv/.


Thanks for the link. I shall add it/them to my toolbox of hashing algorithms
(though I'm most likely to use a diffusion hash these days).

-- chris



 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
compressing text in std::vector <std::string> Lynn McGuire C++ 13 07-12-2010 04:25 PM
Compressing a string using Deflater dave@yamoo.com Java 1 05-04-2005 02:39 PM
Compressing and encrypting traffic (+) Pavel Aronovich Computer Security 3 02-09-2004 08:43 AM
Compressing a String? sven abels Java 5 12-03-2003 12:33 PM
Re: Compressing and encrypting XML Toby A Inkster HTML 0 07-08-2003 10:13 PM



Advertisments