ZipFile - file adding API incomplete?

 
 
Glenn Maynard
      11-17-2009
I want to do something fairly simple: read files from one ZIP and add
them to another, so I can remove and replace files. This led me to a
couple things that seem to be missing from the API.

The simple approach would be to open each file in the source ZIP, and
hand it off to newzip.write(). There's a missing piece, though:
there's no API that lets me pass in a file-like object and a ZipInfo,
to preserve metadata. zip.write() only takes the filename and
compression method, not a ZipInfo; writestr takes a ZipInfo but only
accepts a string, not a file. Is there an API call I'm missing?
(This seems like the fundamental API for adding files, that write and
writestr should be calling.)

The correct approach is to copy the data directly, so it's not
recompressed. This would need two new API calls: rawopen(), acting
like open() but returning a direct file slice and not decompressing
data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
the compression method in zinfo matches the compression type used.

I was surprised that I couldn't find the former. The latter is an
advanced one, important for implementing any tool that modifies large
ZIPs. Short-term, at least, I'll probably implement these externally.

--
Glenn Maynard
 
Diez B. Roggisch
      11-17-2009
Glenn Maynard wrote:
> I want to do something fairly simple: read files from one ZIP and add
> them to another, so I can remove and replace files. This led me to a
> couple things that seem to be missing from the API.
>
> The simple approach would be to open each file in the source ZIP, and
> hand it off to newzip.write(). There's a missing piece, though:
> there's no API that lets me pass in a file-like object and a ZipInfo,
> to preserve metadata. zip.write() only takes the filename and
> compression method, not a ZipInfo; writestr takes a ZipInfo but only
> accepts a string, not a file. Is there an API call I'm missing?
> (This seems like the fundamental API for adding files, that write and
> writestr should be calling.)
>
> The correct approach is to copy the data directly, so it's not
> recompressed. This would need two new API calls: rawopen(), acting
> like open() but returning a direct file slice and not decompressing
> data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
> the compression method in zinfo matches the compression type used.
>
> I was surprised that I couldn't find the former. The latter is an
> advanced one, important for implementing any tool that modifies large
> ZIPs. Short-term, at least, I'll probably implement these externally.


No idea why the write doesn't accept an open file - OTOH, as passing a
string is just


writestr(info, in_file.read())


I don't think that's *that* much of an inconvenience..
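
For what it's worth, here is a minimal sketch of that approach in Python 3, copying every member into a new archive while keeping each ZipInfo. The function name `copy_members` and the `skip` parameter are made up for illustration. Note that it decompresses and recompresses every member and holds each one fully in memory, which is exactly the cost the original post wants to avoid for large ZIPs.

```python
import io
import zipfile


def copy_members(src_bytes, skip=()):
    """Copy the members of a ZIP into a new in-memory ZIP, keeping each ZipInfo.

    Hypothetical helper: members whose names are in `skip` are left out.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            if info.filename in skip:
                continue
            # Passing the ZipInfo (rather than a name) keeps the filename,
            # date_time, compress_type and similar metadata.
            dst.writestr(info, src.read(info))
    return out.getvalue()
```

Removing a file is then just a matter of listing it in `skip`; replacing one would mean skipping it and writing the new data afterwards.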

And regarding your second idea: can that really work? Intuitively, I
would have thought that compression is adaptive, and based on prior
additions to the file. I might be wrong with this though.

Diez
 
Dave Angel
      11-17-2009


Diez B. Roggisch wrote:
> Glenn Maynard wrote:
>> I want to do something fairly simple: read files from one ZIP and add
>> them to another, so I can remove and replace files. This led me to a
>> couple things that seem to be missing from the API.
>>
>> <snip>
>>
>> The correct approach is to copy the data directly, so it's not
>> recompressed. This would need two new API calls: rawopen(), acting
>> like open() but returning a direct file slice and not decompressing
>> data; and rawwrite(zinfo, file), to pass in pre-compressed data, where
>> the compression method in zinfo matches the compression type used.
>>
>> I was surprised that I couldn't find the former. The latter is an
>> advanced one, important for implementing any tool that modifies large
>> ZIPs. Short-term, at least, I'll probably implement these externally.

>
> <snip>
>
> And regarding your second idea: can that really work? Intuitively, I
> would have thought that compression is adaptive, and based on prior
> additions to the file. I might be wrong with this though.
>
>

I'm pretty sure that the ZIP format uses independent compression for
each contained file (member). You can add and remove members from an
existing ZIP, and use several different compression methods within the
same file. So the adaptive tables start over for each new member.
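
A small sketch that illustrates the per-member point, assuming Python 3 and a made-up payload and file names: the same data is written twice into one archive, once stored and once deflated, and each member records its own method and compressed size.

```python
import io
import zipfile

buf = io.BytesIO()
payload = b"spam and eggs " * 200  # illustrative data, highly compressible

with zipfile.ZipFile(buf, "w") as zf:
    # Each member carries its own compression method.
    zf.writestr("stored.txt", payload, compress_type=zipfile.ZIP_STORED)
    zf.writestr("deflated.txt", payload, compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(buf) as zf:
    methods = {i.filename: i.compress_type for i in zf.infolist()}
    sizes = {i.filename: i.compress_size for i in zf.infolist()}
    contents = {name: zf.read(name) for name in zf.namelist()}
```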

What isn't so convenient is that the sizes are apparently at the end.
So if you're trying to unzip "over the wire" you can't readily do it
without somehow seeking to the end. That same feature is a good thing
when it comes to spanning zip files across multiple disks.

The zip file format is documented on the net, but I haven't read the
spec in at least 15 years.

DaveA

 
Glenn Maynard
      11-18-2009
On Tue, Nov 17, 2009 at 9:28 AM, Dave Angel wrote:
> I'm pretty sure that the ZIP format uses independent compression for each
> contained file (member). You can add and remove members from an existing
> ZIP, and use several different compression methods within the same file. So
> the adaptive tables start over for each new member.


This is correct. It doesn't do solid compression, which is what you
get with .tar.gz (and RARs, optionally).

> What isn't so convenient is that the sizes are apparently at the end. So if
> you're trying to unzip "over the wire" you can't readily do it without
> somehow seeking to the end. That same feature is a good thing when it comes
> to spanning zip files across multiple disks.


Actually, there are two copies of the headers: one immediately before
the file data (the local file header), and one at the end (the central
directory); both contain copies of the compressed and uncompressed
file size. Very few programs actually use the local file headers, but
it's very nice to have the option. It also helps make ZIPs very
recoverable. If you've ever run a ZIP recovery tool, they're usually
just reconstructing the central directory from the local file headers
(and probably recomputing the CRCs).

(This is no longer true if bit 3 of the bitflags is set, which puts
the CRC and filesizes after the data. In that case, it's not possible
to stream data--largely defeating the benefit of the local headers.)
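
To make the two-copies point concrete, here is a rough sketch (Python 3, standard library only) that reads the fixed 30-byte part of a member's local file header and returns the CRC and sizes found there, so they can be compared with what the central directory reports through ZipInfo. The helper name `read_local_header` and the returned dict keys are invented for this example.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header, little-endian:
# signature, version needed, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_local_header(fileobj, info):
    """Parse the local file header that precedes the data of member `info`."""
    fileobj.seek(info.header_offset)
    (sig, _version, flags, method, _mtime, _mdate,
     crc, csize, usize, name_len, _extra_len) = LOCAL_HEADER.unpack(
        fileobj.read(LOCAL_HEADER.size))
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    return {
        "name": fileobj.read(name_len),
        "method": method,
        "crc": crc,
        "compress_size": csize,
        "file_size": usize,
        # Bit 3 set means the CRC and sizes were written after the data.
        "sizes_follow_data": bool(flags & 0x08),
    }
```

When `sizes_follow_data` is true, the CRC and size fields in the local header cannot be relied on and the central directory is the place to look.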

> Define a call to read _portions_ of the raw (compressed, encrypted, whatever) data.


I think the clean way is to return a file-like object for a specified file, eg.:

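# Read raw bytes 1024-1152 from each file in the ZIP.
# rawopen() is the proposed call; it does not exist in zipfile today.
from zipfile import ZipFile

zip = ZipFile("file.zip", "r")
for info in zip.infolist():
    f = zip.rawopen(info)  # or a filename
    f.seek(1024)
    f.read(128)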

> Define a call that locks the ZipFile object and returns a write handle for a single new file.


I'd use a file-like object here, too, for probably obvious
reasons--you can pass it to anything expecting a file object to write
data to (eg. shutil.copyfileobj).

> Only on successful close of the "write handle" is the new directory written.


Rather, when the new file is closed, its directory entry is saved to
ZipFile.filelist. The new directory on disk should be written when
the zip's own close() method is called, just as when writing files
with the other methods. Otherwise, writing lots of files in this way
would write and overwrite the central directory repeatedly.

Any thoughts about this rough API outline:

ZipFile.rawopen(zinfo_or_arcname)
Same definition as open(), but returns the raw data. No mode (no
newline translation for raw files); no pwd (raw files aren't
decrypted).

ZipFile.writefile(zinfo[, raw])
Definition like ZipFile.writestr. Relax writestr()'s "at least the
filename, date, and time must be given" rule: if not specified, use
the current date and time. Returns a file-like object (ZipWriteFile)
which file data is written to. If raw is True, no actual compression
is performed, and the file data should already be compressed with the
specified compression type (no checking is performed). If raw is
False (the default), the data will be compressed before being written.
When finished writing data, the file must be closed. Only one
ZipWriteFile may be open for each ZipFile at a time. Calls to
ZipFile.writefile while a ZipWriteFile is already open will result in
ValueError[1].
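
For readers coming to this later: the non-raw half of this outline is close to what Python 3.6 and newer offer through ZipFile.open() with mode="w", which accepts a ZipInfo and returns a writable file-like object. The sketch below assumes such a version; the helper name `add_stream` is hypothetical, and nothing here covers the raw (pre-compressed) case.

```python
import io
import shutil
import zipfile


def add_stream(zf, zinfo, source):
    """Stream a readable file-like object into `zf` under the given ZipInfo.

    Hypothetical helper; relies on ZipFile.open(..., mode="w"), Python 3.6+.
    """
    with zf.open(zinfo, mode="w") as dest:
        # Data is compressed chunk by chunk as it is copied.
        shutil.copyfileobj(source, dest)
```

Usage would look like `add_stream(zf, zipfile.ZipInfo("name.bin", date_time=(2009, 11, 17, 9, 28, 0)), open_file)`, with the compression method set on the ZipInfo beforehand.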

Another detail: is the CRC recomputed when writing in raw mode? No.
If I delete a file from a ZIP (causing me to rewrite the ZIP) and
another file in the ZIP is corrupt, it should just move the file
as-is, invalid CRC and all; it should not rewrite the file with a new
CRC (masking the corruption) or throw an error (I should not get
errors about file X being corrupt if I'm deleting file Y). When
writing in raw mode, if zinfo.CRC is already specified (not None), it
should be used as-is.

I don't like how this results in three different APIs for adding data
(write, writestr, writefile), but trying to squeeze the APIs together
feels unnatural--the parameters don't really line up too well. I'd
expect the other two to become thin wrappers around
ZipFile.writefile(). This never opens files directly like
ZipFile.write, so it only takes a zinfo and not a filename (set the
filename through the ZipInfo).

Now you can stream data into a ZIP, specify all metadata for the file,
and you can stream in compressed data from another ZIP (for deleting
files and other cases) without recompressing. This also means you can
do all of these things to encrypted files without the password, and to
files compressed with unknown methods, which is currently impossible.
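
Until something like rawopen() exists, the read side can be approximated externally. The following is only a sketch under stated assumptions (Python 3, a seekable file object, no encryption handling, the name `read_raw_member` is made up): it skips the member's local header and returns the compressed bytes exactly as stored, taking the length from the central directory entry.

```python
import io
import struct
import zipfile
import zlib

# Fixed 30-byte part of a local file header (see the ZIP format spec).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_raw_member(fileobj, info):
    """Return the still-compressed bytes of member `info` from `fileobj`."""
    fileobj.seek(info.header_offset)
    fields = LOCAL_HEADER.unpack(fileobj.read(LOCAL_HEADER.size))
    if fields[0] != b"PK\x03\x04":
        raise ValueError("not a local file header")
    name_len, extra_len = fields[9], fields[10]
    # Skip the variable-length name and extra field to reach the data.
    fileobj.seek(name_len + extra_len, io.SEEK_CUR)
    # Use the central directory's size; the local copy may be zero
    # when bit 3 of the flags is set.
    return fileobj.read(info.compress_size)
```

For a deflated member the result is a raw deflate stream, so `zlib.decompress(raw, -15)` should give back the original data; writing such bytes into a new archive without recompressing is the part that still needs the proposed raw write call.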

> and I realize that the big flaw in this design is that from the moment you
> start overwriting the existing master directory until you write a new master
> at the end, you do not have a valid zip file.

The same is true when appending to a ZIP with ZipFile.write(); until
it finishes, the file on disk isn't a valid ZIP. That's unavoidable.
Files in the ZIP can still be opened by the existing ZipFile object,
since it keeps the central directory in memory.

For what it's worth, I've written ZIP parsing code several times over
the years (https://svn.stepmania.com/svn/trunk/...DriverZip.cpp),
so I'm familiar with the more widely-used parts of the file format,
but I haven't dealt with ZIP writing very much. I'm not sure if I'll
have time to get to this soon, but I'll keep thinking about it.

[1] seems odd, but mimicking
http://docs.python.org/library/stdtypes.html#file.close

--
Glenn Maynard
 