Skip to content

add create_zip and open_zip#25

Closed
samoconnor wants to merge 4 commits intofhs:masterfrom
samoconnor:master
Closed

add create_zip and open_zip#25
samoconnor wants to merge 4 commits intofhs:masterfrom
samoconnor:master

Conversation

@samoconnor
Copy link
Copy Markdown
Contributor

@fhs
Copy link
Copy Markdown
Owner

fhs commented Jan 5, 2016

This implementation is quiet different from your original proposal. You basically have a wrapper around Reader/Writer constructor and add some new methods to it. I think now it makes more sense to add the new functionality directly to Reader/Writer.

So, I propose renaming open_zip to Reader and create_zip to Writer, and bringing them back into top-level module.

That said, the following two function are kind of inconsistent:

Writer(Dict) -> data # no io or filename provided
Writer(io, Dict) -> nothing # io or filename provided

I'd change them to something like this:

encode(Dict) -> data # no io or filename provided
addfiles(io, Dict) -> nothing # io or filename provided

The name addfiles makes sense because we already have an addfile function. I'm not sure about encode. Maybe it should be createzip or fromdict or something else.

The documentation needs to be regenerated -- run doc/jldoc.py. But really we should be using docstrings introduced in julia 0.4.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is NOT provided

Many new test cases.

open_zip now supports read and write.

create_zip now only does 1-line zip creation from Pairs, Dict etc...

open_zip now returns ZipArchive <: Associative.
This wraps Reader and Writer.
ZipArchive has as cache to support reading and writing an archive.
e.g. if the user opens an existing archive, then adds a file, the
existing files are read in to the cache, then the reader is closed
and a writer is created to write everyting out to the output archive.
If the user only reads or only writes then the cache is not used and
the previous memory-efficient behaviour is retained where one file
is handled at a time. If reads and writes are mixed, the cache is
activated and the whole content is stored in ram. This could be made
more efficient later by stroring compressed data in the cache to avoid
decompressing/recompressing.

ZipFile.jl

type Reader and Writer:
Add _close_io::Bool flag to tell close() if it is allowed to close
the underlying stream (sometimes the caller wants it left open).

Add rewind(f::ReadableFile) to allow a ReadableFile to be read more
than once (not exported).

Fix bug in write(f::WritableFile, p::Ptr, nb::Integer) where a zero
length write would cause an error inside _zio.
@samoconnor
Copy link
Copy Markdown
Contributor Author

@fhs, My intention with this API design was to follow the patterns used in Base.
I hoped that this would make the API easy to understand for people who are familiar with Base.

My first thought was that Base should already have a good abstraction for reading and writing to an archive. Associative (and by implication iterable) seems like a good fit for an archive that identifies data by strings.

The next decision was type names. ZipFile has Reader and Writer, but "doer" data type names always seem wrong to me (Base only has 1 documented "doer": ClusterManager). ZipArchive seems like a better type name, but ZipFile.jl does not natively support updating zip archives, so it seemed best to stick with two types.

I agree that create_file() doing two different things isn't great (open file for writing vs one-shot archive creation).

In https://github.com/samoconnor/ZipFile.jl/commit/f275ea73c929063c8d8dd7687360c53e9fa4e955 I've refactored to avoid this. While I was at it I addressed the Reader/Writer issue and added a ZipArchive type that wraps Reader and Writer. open_zip() now returns ZipArchive. You can now add files, update files, read files, iterate over files etc all with one interface object. create_zip() is now reserved for one-shot creation of an archive from function arguments. See the commit log for full explanation.

Also new is this for creating a zip file from other files...

function create_zip(io::IO, files::Array)
    create_zip(io::IO, files, [open(readbytes, f) for f in files])
end

e.g.

create_zip("foo.zip", ["file1.csv", "file2.csv", "subdir/file3.csv"])

(prompted by a question from @nalimilan in JuliaLang/julia#14546 (comment))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fhs, bug fix for zero-length file written to archive.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what's the bug. If nb is 0, write function wouldn't write anything anyway.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was crashing somewhere inside zlib when I wrote an empty file

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a zlib or Zlib.jl bug. Maybe switching to Libz.jl will fix this (see #22).

For now, you can send a separate PR with a comment saying this is a workaround zlib/Zlib.jl bug.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. I just merged the fix. Thanks.

@fhs
Copy link
Copy Markdown
Owner

fhs commented Jan 8, 2016

I don't see why we need caching and ZipArchive. It seems like the library is trying to do too much. Caching is not an easy problem. You don't need want to read large files from a zip archive and keep it in memory.

I liked the previous implementation where it just added few more convenience functions to Reader/Writer. BTW, I called them Reader/Writer because this library was inspired by Go's zip package (https://golang.org/pkg/archive/zip/) and even in Go -er names are usually reserved for interfaces, not structs. So, I don't know if they're good names or not, but it's too late to change them.

Please discuss the new API before you implement it, so we can come to an agreement and you're not wasting your time ;-)

@samoconnor
Copy link
Copy Markdown
Contributor Author

Caching is necessary if you want to update an existing archive.
The library can either provide it, or leave it up to the user, but it can't be avoided.

A single ZipArchive type to replace Reader and Writer makes things much easier for the user who wants to read and write to the same archive.

My implementation is careful to avoid the cache for use patterns where it is not needed, e.g. pure writes to a new archive, or pure reads from an existing archive.

The test cases cover a number of combinations of read-write-read, write-read-write, etc...

You are right that for large files there would be a large memory overhead.

I've done a quick implementation of the same interface using Info-ZIP as the backend.
This should be much more efficient for large files.
My guess is that the Info-ZIP implementation is faster in most cases, so I've made a seperate PR for it: See JuliaLang/METADATA.jl#4382

I'll leave it up to you to decide how much of this interface is appropriate for ZipFile.

@samoconnor
Copy link
Copy Markdown
Contributor Author

Hi @fhs, the create_zip and open_zip with Info-ZIP backend is now in a registered package over here: https://github.com/samoconnor/InfoZIP.jl

Although your pure-Julia ZipFile implementation is not as full featured as Info-ZIP I imagine that there are use cases where avoiding the external dependancy on Info-ZIP is desirable.

Do you have any interest in:

  • merging the create_zip and open_zip interface in this PR into ZipFile, or
  • me providing an option in the Info-ZIP package to use ZipFile as a backend in cases where Info-ZIP is not installed.

It seems to me that there is benefit to end users in having a consistent interface.

@fhs
Copy link
Copy Markdown
Owner

fhs commented Jan 21, 2016

The latter option seems good to me. I would like to keep ZipFile.jl as pure julia.

I'm closing this PR for now. It seems to me that the use-case for this is dealing with lots of small files in a zip file. I think it'll introduce performance issues for large files in a zip file. Also, there are more important things to work on besides a higher-level alternate API, such as

  • Better compression/depression performance (I looked at replacing Zlib.jl with Libz.jl but Libz.jl is missing stuff we need)
  • ZIP64 support
  • In general, support for more types of ZIP files, benchmarking and performance tuning.

@samoconnor
Copy link
Copy Markdown
Contributor Author

Hi @fhs, sounds good to me.
See: samoconnor/InfoZIP.jl@c1c1419

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants