Velocity Reviews - Computer Hardware Reviews

Velocity Reviews > Newsgroups > Programming > Java > Slow parsing of byte array to String

Reply
Thread Tools

Slow parsing of byte array to String

 
 
Eric Razny
Guest
Posts: n/a
 
      01-24-2006
Hi.
I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string)
as String.

A part of the code is as :

byte buf[]=new byte[L_LENGTH];
//[snap] ... buf is filled with raw datas, some parts are UTF-16BE strings
String S=new String(buf, 20, 54, "UTF-16BE");


Of course it works well but it's painfully slow and usenet and web search
was unsuccessfull.

Does someone has a trick to do the job faster?

Thanks,

Eric
 
Reply With Quote
 
 
 
 
John C. Bollinger
Guest
Posts: n/a
 
      01-24-2006
Eric Razny wrote:
> I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string)
> as String.
>
> A part of the code is as :
>
> byte buf[]=new byte[L_LENGTH];
> //[snap] ... buf is filled with raw datas, some parts are UTF-16BE strings
> String S=new String(buf, 20, 54, "UTF-16BE");
>
>
> Of course it works well but it's painfully slow and usenet and web search
> was unsuccessfull.
>
> Does someone has a trick to do the job faster?


It depends on how you define "the job". I have trouble believing that
one invocation of the constructor you are using takes a
human-discernible amount of time, so you must be doing many such
decodings. That being the case, it is unclear how you are certain that
your problem is in that constructor, as opposed to anywhere else in the
data processing procedure.

Have you profiled the application? It is not worth your time or mine to
speculate on how to speed it up without knowing precisely where all the
cycles are going. If you *have* profiled it then we can guide you
better based on the full results, whether or not it turns out to be the
String constructor that is eating all the time.

--
John Bollinger
http://www.velocityreviews.com/forums/(E-Mail Removed)
 
Reply With Quote
 
 
 
 
RiCaRdO
Guest
Posts: n/a
 
      01-24-2006
Are you using the + operator on Strings?
Lots of string concatenation causes the Java garbage collector to work
over time.
try using a StringBuffer instead?


John C. Bollinger wrote:
> Eric Razny wrote:
> > I've got a problem parsing pre-saved datas (bytes storing UTF-16BE string)
> > as String.
> >
> > A part of the code is as :
> >
> > byte buf[]=new byte[L_LENGTH];
> > //[snap] ... buf is filled with raw datas, some parts are UTF-16BE strings
> > String S=new String(buf, 20, 54, "UTF-16BE");
> >
> >
> > Of course it works well but it's painfully slow and usenet and web search
> > was unsuccessfull.
> >
> > Does someone has a trick to do the job faster?

>
> It depends on how you define "the job". I have trouble believing that
> one invocation of the constructor you are using takes a
> human-discernible amount of time, so you must be doing many such
> decodings. That being the case, it is unclear how you are certain that
> your problem is in that constructor, as opposed to anywhere else in the
> data processing procedure.
>
> Have you profiled the application? It is not worth your time or mine to
> speculate on how to speed it up without knowing precisely where all the
> cycles are going. If you *have* profiled it then we can guide you
> better based on the full results, whether or not it turns out to be the
> String constructor that is eating all the time.
>
> --
> John Bollinger
> (E-Mail Removed)


 
Reply With Quote
 
Chris Uppal
Guest
Posts: n/a
 
      01-24-2006
John C. Bollinger wrote:

> It depends on how you define "the job". I have trouble believing that
> one invocation of the constructor you are using takes a
> human-discernible amount of time, [...]


Especially since the conversion from UTF16 (BE or LE) is as near trivial as
charset decoding can possibly get.

I second John's request for more information. If you don't have profiling data
then you must have some other timing information. What is it ?

-- chris


 
Reply With Quote
 
Eric Razny
Guest
Posts: n/a
 
      01-24-2006
Le Tue, 24 Jan 2006 09:30:27 +0000, Chris Uppal a écrit*:

> John C. Bollinger wrote:
>
>> It depends on how you define "the job". I have trouble believing that
>> one invocation of the constructor you are using takes a
>> human-discernible amount of time, [...]

>
> Especially since the conversion from UTF16 (BE or LE) is as near trivial
> as charset decoding can possibly get.
>
> I second John's request for more information. If you don't have profiling
> data then you must have some other timing information. What is it ?


John, Chris, thanks for your replies.

I have parsed the application the more basically I can do : difference
between getTimeInMillis() from end to begin of prog.

I stripped out all the real stuff to keep only the parts I suspected to
be the bottleneck of the application. The remaining code is :

/////// CODE //////

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.GregorianCalendar;

public class Test{
public static void main(String args[]){

int sum = 0;
String fileName="/tmp/test.bin"; //file size is 22MB : 22645840
GregorianCalendar start=new GregorianCalendar();
try {


FileInputStream fs = new FileInputStream(fileName);
FileChannel fchan = null;
fchan = fs.getChannel();
int sz = (int)fchan.size();
MappedByteBuffer mbuf = null;
mbuf =
fchan.map(FileChannel.MapMode.READ_ONLY,
0, sz);

byte tpb = 0;
ByteBuffer ibuf = mbuf.asReadOnlyBuffer();
byte tmpBuf[]=new byte[212];
while (ibuf.hasRemaining()) {
ibuf.get(tmpBuf);
String Part1=new String(tmpBuf, 20, 54, "UTF-16BE"); // comment this line out for profiling
String Part2=new String(tmpBuf, 74, 132, "UTF-16BE"); // comment this line out for profiling
}

}catch(Exception e){e.printStackTrace();}
System.out.println("Duration in ms : "+(new GregorianCalendar().getTimeInMillis()-start.getTimeInMillis()));
}
}
/////// END OF CODE //////

Platform :
java version "1.5.0_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_01-b0
Java HotSpot(TM) Client VM (build 1.5.0_01-b08, mixed mode, sharing)
Linux 2.6, debian sarge, via processor & chipset.

If i launch the prog the result is :
Duration in ms : 4618 (no significant differences between tests)

Then, when i comment out the two lines flagged above the result is :
Duration in ms : 383 (no significant differences between tests)

The 106820 loops are processed faster on... faster computers but I must
keep this platform. Since the 4+ seconds must be added to uncompressible
other ones and a human is waiting for a result i'll appreciate any
help to reduce this bottleneck.

Eric

PS : sorry for the 3 lines of more than 80 chars.

 
Reply With Quote
 
Chris Uppal
Guest
Posts: n/a
 
      01-24-2006
Eric Razny wrote:

> If i launch the prog the result is :
> Duration in ms : 4618 (no significant differences between tests)
>
> Then, when i comment out the two lines flagged above the result is :
> Duration in ms : 383 (no significant differences between tests)


That's odd. I've just created a 22,645,840 byte test file which is just the
string 0123456789 repeated enough times (as UTF16BE) to fill the file (since
the conversion from UTF16BE is trivial it shouldn't matter what the actual
characters are).

With that, and using JDK 1.5.0, the test runs on this 1.5 GHz Win XP machine in
about 1.1 seconds. If I comment out the string creation lines then it runs in
about 0.46 seconds. (Or, if I run under JDK 1.4.2, in about 0.2 seconds)

Trying again using JDK 1.4.2 on a 2.4 GHz SuSE 8 box takes around 1.3 and 0.2
seconds respectively. Pretty much the same as 1..4.2 on the Windows box.

The performance I'm seeing seems quite reasonable to me. I have no idea at all
why the decoding is taking so much longer on your machine.

If the choice of character encoding cannot change, and if the data is under
your control (so you know it is always valid without checking) then you /might/
be able to avoid the strange bottleneck by hardcoding your own UTF16-BE
decoder. Something like the following:

// hardwired UTF16-BE decoding
// Note:
// This DOES NOT CHECK for invalid UTF16 sequences
// and so may allow security breaches.
// This does not check that byteCount makes sense.
// This HAS NOT BEEN TESTED on non-ASCII input.
private String
decodeUTF16BE(byte[] bytes, int startByte, int byteCount)
{
int charCount = byteCount / 2;
StringBuilder builder = new StringBuilder(charCount);
for (int i = 0; i < charCount; i++)
{
byte low = bytes[startByte++];
byte high = bytes[startByte++];
int ch = ((low & 0xFF) << 8 ) | (high & 0xFF);
builder.append((char)ch);
}

return builder.toString();
}

For me, that's about 2x faster than going through the system-supplied code (in
part, I assume, because of the missing checks, and possibly also because I've
got it wrong).

Oh, BTW, you can get the current time in milliseconds straight from
System.currentTimeMillis() without having to mess around with
GregorianCalendars. You can also (since 1.5) get a finer-grained timer with
System.nanoTime()

-- chris


 
Reply With Quote
 
Eric Razny
Guest
Posts: n/a
 
      01-24-2006
Le Tue, 24 Jan 2006 15:25:15 +0000, Chris Uppal a écrit*:

[Snap Speed is better on your box than on mine]

The box has a Via chipset at 600 Mhz
Yours is simply faster, but the ratio still sounds bad to me.


> If the choice of character encoding cannot change, and if the data is
> under your control (so you know it is always valid without checking)


That's the case. I guess the checking burns out most of the time.

> then
> you /might/ be able to avoid the strange bottleneck by hardcoding your own
> UTF16-BE decoder. Something like the following:
>
> // hardwired UTF16-BE decoding
> // Note:
> // This DOES NOT CHECK for invalid UTF16 sequences // and so may allow
> security breaches. // This does not check that byteCount makes sense. //
> This HAS NOT BEEN TESTED on non-ASCII input. private String
> decodeUTF16BE(byte[] bytes, int startByte, int byteCount) {
> int charCount = byteCount / 2;
> StringBuilder builder = new StringBuilder(charCount); for (int i = 0; i
> < charCount; i++)
> {
> byte low = bytes[startByte++];
> byte high = bytes[startByte++];
> int ch = ((low & 0xFF) << 8 ) | (high & 0xFF);
> builder.append((char)ch);
> }
> }
> return builder.toString();
> }
> }
> For me, that's about 2x faster than going through the system-supplied code
> (in part, I assume, because of the missing checks, and possibly also
> because I've got it wrong).


Argh! This is the kind of thing I'm searching for... but it must still be
usable on 1.4.2 JVM

Many thanks anyway.


> Oh, BTW, you can get the current time in milliseconds straight from
> System.currentTimeMillis() without having to mess around with
> GregorianCalendars. You can also (since 1.5) get a finer-grained timer
> with System.nanoTime()


Yep, still a sequel of an old, old test with an automatic paste from my
editor

Eric.
 
Reply With Quote
 
Chris Uppal
Guest
Posts: n/a
 
      01-24-2006
Eric Razny wrote:

> Argh! This is the kind of thing I'm searching for... but it must still be
> usable on 1.4.2 JVM


Well, you could just change StringBuilder to StringBuffer, but -- at least for
my quick test -- the synchronised append() method is slow enough to loose most
of the gains from do-it-yourself UTF16 decoding.

Probably the easiest thing, at least to try, is to build up the data in a
char[] array instead:

...
int charCount = byteCount / 2;
char[] buffer = new char[charCount];
for (int i = 0; i < charCount; i++)
{
byte low = bytes[startByte++];
byte high = bytes[startByte++];
int ch = ((low & 0xFF) << 8 ) | (high & 0xFF);
buffer[i] = (char)ch;
}

return new String(buffer);

Somewhat to my surprise, that works (on my machine) rather faster than the
version which uses a StringBuilder -- despite the extra array allocation and
copy. I also tried using a statically allocated buffer, to avoid the extra
allocation (but not the extra copy), but there didn't seem to be any
performance edge over the simple version (and mutable static variables are not
a good idea anyway, unless you are willing to synchronise all access to them).

I don't know whether that depends (for it's performance) on optimisations that
only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth a quick
test. BTW, are you using the -server flag ?

-- chris




 
Reply With Quote
 
Eric Razny
Guest
Posts: n/a
 
      01-24-2006
Le Tue, 24 Jan 2006 19:38:08 +0000, Chris Uppal a écrit*:

> Well, you could just change StringBuilder to StringBuffer, but -- at least
> for my quick test -- the synchronised append() method is slow enough to
> loose most of the gains from do-it-yourself UTF16 decoding.


I lost more than that in my case! The resulting program was slower than
the original one.


> Probably the easiest thing, at least to try, is to build up the data in a
> char[] array instead:
>
> ...
> int charCount = byteCount / 2;
> char[] buffer = new char[charCount];
> for (int i = 0; i < charCount; i++)
> {
> byte low = bytes[startByte++];
> byte high = bytes[startByte++];
> int ch = ((low & 0xFF) << 8 ) | (high & 0xFF); buffer[i] = (char)ch;
> }
> }
> return new String(buffer);


Great! Wonderfull! Marvel.. ahem, ok, it works!

Total time is now 1395, not anymore 4618. If we substract the fixed 383 ms
(Yes, i know, there's a 100ms resolution timer) this give :
1012 vs 4235. A 4x gain.

> Somewhat to my surprise, that works (on my machine) rather faster than the
> version which uses a StringBuilder -- despite the extra array allocation
> and copy. I also tried using a statically allocated buffer, to avoid the
> extra allocation (but not the extra copy), but there didn't seem to be any
> performance edge over the simple version


Tried the same thing on my computer with same result, little if any
improvment. I didn't try side effects with "uncommon" UTF-16 chars but
it's perfect for me as I know what can be stored in the file.

> (and mutable static variables are
> not a good idea anyway, unless you are willing to synchronise all access
> to them).


And loose even more time?

> I don't know whether that depends (for it's performance) on optimisations
> that only a 1.5 JVM performs, but it'll work on 1.4 and is probably worth
> a quick test. BTW, are you using the -server flag ?


For some distribution reasons no. But for this short code the -server
flag results in a 100ms penality! It seems I need to "unlearn" some
obvious things about java

Many Thanks for you help.

Eric


 
Reply With Quote
 
 
 
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
convert form byte[4] to Int32 while retaining the binary value of the byte array jeff@foundrymusic.com C++ 20 09-07-2009 08:54 PM
Re: how to extend a byte[] array with a null byte? Tom McGlynn Java 4 04-18-2008 11:49 PM
Re: how to extend a byte[] array with a null byte? Patricia Shanahan Java 0 04-17-2008 06:47 PM
Converting a Primative byte array to a Byte array object Kirby Java 3 10-08-2004 03:01 AM
Appending byte[] to another byte[] array Bharat Bhushan Java 15 08-05-2003 07:52 PM



Advertisments