On 15 Sep 2003 21:11:56 GMT, Harald Hein <(E-Mail Removed)> wrote or quoted :
>The whole API is a stupid hack done in a hurry to use the info-zip
>library from within Java. It was just hacked to add JARs to Java.
>The API only contains the rudimentary stuff. It was for sure never
>intended to be published. I guess Sun just had to publish it when
>they recognized that people might want to play with JARs and ZIPs.
Part of the problem was they wanted to make ZipOutputStream a true
OutputStream, even though creating the file structure properly requires
random access and buffering.
If I were re-inventing jar files, they would have an alphabetical
index at the HEAD of the file with absolute offsets into the file
where to find the data. There might be a little indexing added to
speed searching for a particular name, e.g. class file loading. There
would be no embedded headers. That index itself would be optionally
compressed too. The names of the elements would be in UTF-8 encoding.
You could open a ZIP, add elements, delete elements, merge other zips,
and when you closed, then it would do a flurry of copying to create
the new zip. There would be no need to uncompress and recompress to
merge two zip files.
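A toy sketch of what such an index might look like (the layout is entirely hypothetical, invented for illustration, not any real format): an entry count, then alphabetically sorted UTF-8 names, each with an absolute offset and length.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class ToyIndex {
    // Write an alphabetical index: entry count first, then for each entry
    // a UTF-8 name, an absolute offset into the file, and a length.
    // The long[] value holds {offset, length}.
    static byte[] writeIndex(TreeMap<String, long[]> entries) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(entries.size());
        for (Map.Entry<String, long[]> e : entries.entrySet()) {
            out.writeUTF(e.getKey());        // names in UTF-8, sorted by TreeMap
            out.writeLong(e.getValue()[0]);  // absolute offset of the data
            out.writeLong(e.getValue()[1]);  // length of the data
        }
        return bos.toByteArray();
    }
}
```

Because the names are sorted, a reader could binary-search the decompressed index for class loading instead of scanning every entry.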
We have no way to update a ZIP now, only create a new one from scratch.
I'd also like to add convenience methods so you could just say which
files you wanted added, and it would fetch them, dates and all, and when
it unpacked them it would automatically do the necessary
f.getParentFile().mkdirs().
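For what it's worth, the unpacking half of that convenience can be sketched with the existing java.util.zip API (a minimal sketch: the class and method names here are made up, and there is no path-traversal check on entry names):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class Unpack {
    // Unpack a zip, creating parent directories automatically --
    // the convenience the standard API leaves to the caller.
    static void unpack(File zip, File destDir) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream(zip))) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                File f = new File(destDir, entry.getName());
                if (entry.isDirectory()) { f.mkdirs(); continue; }
                File parent = f.getParentFile();
                if (parent != null) parent.mkdirs();  // the mkdirs the API makes you do
                try (FileOutputStream fos = new FileOutputStream(f)) {
                    int n;
                    while ((n = zis.read(buf)) > 0) fos.write(buf, 0, n);
                }
            }
        }
    }
}
```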
> The other thing that is odd about them is they use native methods
> that use long handles.
This is because of the underlying info-zip library. At some places in
the Java jar/zip API the layer around the library is very thin and you
see the library implementation shining through. If you grab the library
from the net you will see the similarities.
It gets even better when you try to figure out stuff like the dictionary
in the Inflater/Deflater. That magic byte array goes directly into the
corresponding calls of the underlying library.
And for the record, if someone googles for the dictionary stuff: that
array of bytes is supposed to contain a sequence of C-style null-
terminated strings. Don't ask about the encoding; we are back in C "a
char is a byte" land. Disgusting.
"Roedy Green" <(E-Mail Removed)> wrote in message
> On Mon, 15 Sep 2003 21:50:55 +0200, "Luke Tulkas"
> <(E-Mail Removed)> wrote or quoted :
> > //Read from zis until you get -1.
> > //If you haven't kept track of the number of bytes you read from
> >zis, you can ask the entry for size now & be surprised.
> The documentation on this really stinks.
Not only the documentation. The whole API is, as you noticed, badly
designed.
Roedy Green wrote:
> If I were re-inventing jar files, they would have an alphabetical
> index at the HEAD of the file with absolute offsets into the file
> where to find the data. There might be a little indexing added to
> speed searching for a particular name, e.g. class file loading. There
> would be no embedded headers. That index itself would be optionally
> compressed too. The names of the elements would be in UTF-8 encoding.
Wouldn't compression of the index just exacerbate the
problem Harald Hein mentioned concerning the MANIFEST file?
Actually, I think it makes the problem insoluble: You don't
know the file offsets until you know the size of the compressed
index, but you can't compress the index until you know the offset
values it contains, and if the offset values change the index may
compress to a different size, ... I imagine many .rgjar files
would settle down to a steady state after one or two passes,
but there's the nagging possibility of an eternal oscillation.
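The loop can be demonstrated directly (a sketch: the index format here is invented, and the iteration is capped at ten passes to allow for the oscillation case):

```java
import java.util.zip.Deflater;

public class IndexLoop {
    // The feedback loop: data offsets depend on the compressed index
    // size, which depends on the offsets. Iterate until the compressed
    // size stops changing -- with no guarantee that it ever does.
    static int settle(int[] dataSizes) {
        int indexSize = 0;
        for (int pass = 0; pass < 10; pass++) {
            byte[] index = buildIndex(dataSizes, indexSize);
            int compressed = compressedSize(index);
            if (compressed == indexSize) return indexSize;  // steady state
            indexSize = compressed;
        }
        return -1;  // oscillated, never settled
    }

    // Invented index format: one decimal offset per line.
    static byte[] buildIndex(int[] sizes, int indexSize) {
        StringBuilder sb = new StringBuilder();
        long off = indexSize;  // data starts right after the compressed index
        for (int s : sizes) { sb.append(off).append('\n'); off += s; }
        return sb.toString().getBytes();
    }

    static int compressedSize(byte[] b) {
        Deflater d = new Deflater();
        d.setInput(b);
        d.finish();
        byte[] out = new byte[b.length + 64];
        int n = d.deflate(out);
        d.end();
        return n;
    }
}
```

In practice the offsets' printed widths rarely change between passes, so a steady state usually arrives after a pass or two, as suggested above.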
Roedy Green wrote:
> On Tue, 16 Sep 2003 11:20:42 -0400, Eric Sosman <(E-Mail Removed)>
> wrote or quoted :
> >Wouldn't compression of the index just exacerbate the
> >problem Harald Hein mentioned concerning the MANIFEST file?
> >Actually, I think it makes the problem insoluble:
> For simplicity, you would put the length of the index uncompressed
> followed by the index.
Perhaps I didn't explain the problem clearly (or perhaps
I've just imagined the whole thing ...).
Your suggestion, if I understood correctly, was to put a
compressed index at the beginning of the .rgjar file. The index
would contain (among other things) the offsets of the various
content files. The offset of any particular content file is
the sum of the sizes of all things that appear before it, and
one of these things is the index. Thus, the values recorded in
the index depend on the size of the compressed index. But the
values also (potentially) influence the size of the compressed
index; change the values and you get a different compressed size.
Looks like a feedback loop to me.
You could avoid the loop by storing just the file sizes
instead of their offsets, along with a sequence number (or other
ordering information) to allow the offsets to be computed from
the decompressed index. But this is exactly Harald Hein's
problem: You'd now need to compress all the files *before*
creating the index, then write the index at the beginning of
the .rgjar file, then write all the compressed files. Byte code
isn't too voluminous and could probably be kept around in memory
between compression time and writing time, but if the .rgjar
archive also carries images, sounds, video clips, and the entire
database of RIAA lawsuits you're probably stuck with two complete
compression passes.
On Tue, 16 Sep 2003 16:24:49 -0400, Eric Sosman <(E-Mail Removed)>
wrote or quoted :
> Your suggestion, if I understood correctly, was to put a
>compressed index at the beginning of the .rgjar file. The index
>would contain (among other things) the offsets of the various
>content files. The offset of any particular content file is
>the sum of the sizes of all things that appear before it, and
>one of these things is the index. Thus, the values recorded in
>the index depend on the size of the compressed index. But the
>values also (potentially) influence the size of the compressed
>index; change the values and you get a different compressed size.
You have to build the index and the file separately then glue them
together at the last minute. The offsets in the compressed index are
relative to the end of the index, as if the index and the data were
two separate files.
If you tried to make them absolute offsets, you would get into your
chicken-and-egg loop.
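A sketch of that scheme (again a hypothetical format, not PKZip): record each entry's offset relative to the end of the index while writing the data section, then glue index and data together at the last minute.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class RelativeOffsets {
    // Build the data section first, recording offsets relative to the END
    // of the index; the index size no longer feeds back into the offsets.
    // (A real writer would compress the index before gluing; omitted here.)
    static byte[] assemble(byte[][] files) throws IOException {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        StringBuilder index = new StringBuilder();
        for (byte[] f : files) {
            index.append(data.size()).append('\n');  // offset from end of index
            data.write(f);
        }
        byte[] ix = index.toString().getBytes();
        ByteArrayOutputStream archive = new ByteArrayOutputStream();
        archive.write(ix);       // index first...
        data.writeTo(archive);   // ...then the data; a reader adds the index
        return archive.toByteArray();  //    length to every offset
    }
}
```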
I notice now we are going for directories nested 10 deep with great
long names containing spaces. The NAMES of the files themselves are
sometimes just as big as the contents. There is plenty of opportunity
there for compressing.
On rethinking it may make more sense to tack the index on the end, so
long as in the very last bytes of the file is a pointer to the
beginning of the index. PKZip format lacks this. You must find the
start by wending your way back field by field.
This way you can append to the file more efficiently. You can tack new
data on the end, and then write a new index on the end, without
necessarily copying the entire front section. This is a more dangerous
way to live, but putting the index on the end would at least leave the
previous contents intact. Putting it on the front, however, makes it
easier to sample a zip without downloading the whole thing.
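A reader for such a trailing pointer might look like this (hypothetical layout: the archive's final 8 bytes hold the absolute offset of the index):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class TrailingPointer {
    // Seek straight to the index via the last 8 bytes of the file,
    // instead of wending backwards field by field as with PKZip's
    // end-of-central-directory record.
    static long indexOffset(RandomAccessFile raf) throws IOException {
        raf.seek(raf.length() - 8);
        return raf.readLong();
    }
}
```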
Is there any way to create a zip file using ZipOutputStream which sets
the sizes correctly? Using ZipFile at decompression is not an option for
me because we have a lot of clients in the field which do not use
ZipFile and cannot be changed.
I have tried several libraries, but have yet to find a satisfying one...
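One workaround that does work with the standard API: use STORED entries and compute the size and CRC-32 yourself before putNextEntry(), so the sizes land in the local header where ZipInputStream can see them. (If you need DEFLATED entries you would have to compress the data once in advance just to learn the compressed size.) A sketch:

```java
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class StoredZip {
    // ZipOutputStream typically records DEFLATED sizes in a trailing data
    // descriptor, which is why ZipInputStream reports getSize() == -1.
    // For STORED entries the sizes go in the local header, but then YOU
    // must supply size, compressed size, and CRC-32 up front, or
    // putNextEntry() throws a ZipException.
    static void addStored(ZipOutputStream zos, String name, byte[] data)
            throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data);
        ZipEntry e = new ZipEntry(name);
        e.setMethod(ZipEntry.STORED);
        e.setSize(data.length);
        e.setCompressedSize(data.length);  // STORED: identical by definition
        e.setCrc(crc.getValue());
        zos.putNextEntry(e);
        zos.write(data);
        zos.closeEntry();
    }
}
```

The tradeoff is obvious: STORED entries are not compressed at all, so this only suits data that is small or already compressed.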