refactor for Onda Format v0.5.0 #59

Merged: jrevels merged 81 commits into master from jr/arrow on Feb 20, 2021

Conversation


@jrevels jrevels commented Jan 4, 2021

ref beacon-biosignals/OndaFormat#28

This is a ridiculously breaking set of changes; I'm not even going to attempt backwards compatibility, except for a convenience function for upgrading old datasets.

I still need to refactor the examples/docs/tests, and potentially reintroduce some higher-level convenience functionality, depending on which manipulations end up being naturally replaced by DataFrames-y one-liners. (done)

This PR also requires that Arrow.jl have a tagged release, plus a corresponding compat bound bump here. (done)

One additional nice-to-have that is mostly orthogonal to this PR (i.e. this PR is mergeable without it): an upstream Arrow.jl PR to make its path handling a bit more type-agnostic, so that read_signals etc. would automagically work with S3Paths. (this would still be nice to have, but we already work around it here)

@jrevels jrevels marked this pull request as ready for review February 15, 2021 06:05
@jrevels jrevels requested a review from ararslan February 15, 2021 06:18
- 1.0
- 1.3
- 1.5
- nightly
jrevels (Member Author) commented:

This is really dumb, but it seems that if I only have a single stage listed here, Travis only performs the "Documentation" job and not the actual tests themselves. I'm not sure what the underlying problem is, but this works around it for now.


jrevels commented Feb 15, 2021

Hmmm. Codecov is claiming that certain lines in https://codecov.io/gh/beacon-biosignals/Onda.jl/pull/59/src/src/samples.jl are not covered, but I'm like 99.5% sure that a few of them are. Both of the cases I'm looking at are ones where the call to the function f is clearly covered, but happens through a call to broadcast!(f, ...)...
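A minimal illustration of the pattern in question (hypothetical code, not from this PR): the body of `f` clearly executes, but only ever via `broadcast!`, which line-based coverage tracking can miss.

```julia
# Hypothetical example: `f` is only ever invoked through broadcast!, yet its
# body definitely runs; line coverage may still mark the definition as
# unexecuted.
f(x) = x + 1
out = zeros(Int, 3)
broadcast!(f, out, [1, 2, 3])  # fills `out` in place with f applied elementwise
```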


jrevels commented Feb 15, 2021

This is now ready for review.

Given that this is essentially a package-wide rewrite, I strongly suggest any reviewers review the package fresh and only reference the diff where useful (rather than starting with the diff).

IMO examples/tour.jl is the best entrypoint for understanding the intended new direction here.

@ericphanson (Member) commented:

If we add push_preview = true to deploydocs, we should get per-PR docs builds to preview the new docs (https://juliadocs.github.io/Documenter.jl/stable/lib/public/#Documenter.deploydocs)


jrevels commented Feb 16, 2021

> If we add push_preview = true to deploydocs, we should get per-PR docs builds to preview the new docs

Done!
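For reference, the change being described is roughly the following tweak to `docs/make.jl` (a sketch: the `repo` value is the obvious one for this package, and the other `deploydocs` arguments are omitted):

```julia
using Documenter

# Enabling per-PR preview builds; Documenter pushes previews to the gh-pages
# branch under previews/PR<number>.
deploydocs(; repo="github.com/beacon-biosignals/Onda.jl.git",
           push_preview=true)
```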

#####

function zstd_compress(bytes::Vector{UInt8}, level=3)
compressor = ZstdCompressor(; level=level)
Reviewer (Member) commented:

This is like the one untouched bit of Onda, but I noticed Arrow.jl re-uses ZstdCompressors: https://github.com/JuliaData/Arrow.jl/blob/a113edd934a1efa667b3ffb3d11b135f746322ab/src/Arrow.jl#L94-L107; maybe this could have perf benefits for us here too?

jrevels (Member Author) commented:

Ah, probably not a bad idea. I wonder what the perf difference is between multithreading on the Julia side vs. zstd's built-in multithreading (not exposed via CodecZstd, IIRC; for all I know it might be the same thing under the hood anyway).

Reviewer (Member) commented:

One possible advantage of doing it on the Julia side is composing with other threaded Julia functions, if it's nested or such. (Though I'm not totally sure how well tuned that stuff is yet anyway.)

end
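A rough sketch of the reviewer's suggestion, i.e. reusing `ZstdCompressor` instances rather than constructing one per call, in the spirit of the linked Arrow.jl code. The names and structure here are illustrative assumptions, not this PR's actual implementation:

```julia
using CodecZstd

# One long-lived, pre-initialized compressor per thread, so repeated calls
# avoid re-allocating the underlying zstd context every time. (Assumes tasks
# don't migrate across threads mid-call.)
const ZSTD_COMPRESSORS = map(1:Threads.nthreads()) do _
    compressor = ZstdCompressor(; level=3)
    CodecZstd.TranscodingStreams.initialize(compressor)  # as Arrow.jl does
    return compressor
end

function zstd_compress_reusing(bytes::Vector{UInt8})
    return transcode(ZSTD_COMPRESSORS[Threads.threadid()], bytes)
end
```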

# It would be better if Arrow.jl supported a generic API for nonstandard path-like types so that
# we can avoid potential intermediate copies here, but its documentation is explicit that it only
Reviewer (Member) commented:

this is something I've been wondering about; I think read_arrow_table(read(path)) does not usually make extra copies, right? It reads the whole file into memory as a `Vector{UInt8}`, then Arrow.jl re-uses that byte buffer and basically exposes a lazy tabular view on top?

But this API choice of Arrow.jl means we can't handle larger-than-memory tables or use mmapping, since we always need to read the whole table into memory.
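The two read paths under discussion, sketched with Arrow.jl's own entry points (the file name is hypothetical, and `read_arrow_table` is presumably a thin wrapper over something like case 1):

```julia
using Arrow

# (1) Generic path handling: slurp the bytes, then wrap them. Arrow.Table
#     reuses the byte buffer as a lazy tabular view rather than copying
#     columns out of it, so the only full copy is the initial read.
bytes = read("dataset.arrow")
table_from_bytes = Arrow.Table(bytes)

# (2) Local filesystem path: Arrow.Table mmaps the file, so nothing is read
#     eagerly, but this only works for real files on disk.
table_mmapped = Arrow.Table("dataset.arrow")
```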

jrevels (Member Author) commented:

> I think read_arrow_table(read(path)) does not usually make extra copies, right? it reads the whole file into memory as a `Vector{UInt8}`, then Arrow.jl re-uses that byte buffer and basically exposes a lazy tabular view on top?

Yup. The part that's a bit more annoying IMO is the write_arrow_table workaround. I think it currently involves an unnecessary extra buffer...in theory it shouldn't need to, though. What I really want is a method that, given a Table, returns an <:IO object that directly references the table's memory and when read yields bytes in the caller's preferred Arrow file/IPC format (i.e. the file=true option for write).

There might already be a good way to achieve that with the Arrow.jl API, and I just missed it lol

> this API choice of Arrow.jl means we can't handle larger-than-memory tables or use mmapping, since we always need to read the whole table into memory.

Well, we still get mmapping for local filesystem paths, but it would be really cool if e.g. AWSS3.S3Paths supported an mmap-like method that used byte-range reads under the hood 😎

Generally, though, you'd split a larger-than-memory Arrow table into manageably chunked objects upon write, and then treat the objects as one logical table/dataset via an API like https://arrow.apache.org/docs/python/dataset.html. In Julia land, I wonder how much we can get away with just using e.g. Tables.partition for things lol
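The write_arrow_table workaround described above looks roughly like the following (a hedged sketch, not the package's actual code): serialize into an intermediate `IOBuffer`, then hand those bytes to whatever path type is being targeted. The `IOBuffer` is the "unnecessary extra buffer" in question, and it's exactly what an IO-view-over-the-table API would eliminate.

```julia
using Arrow

# Hypothetical helper: materialize a table's Arrow file-format bytes so they
# can be written to a non-filesystem path (e.g. an S3Path).
function arrow_file_bytes(table)
    io = IOBuffer()
    Arrow.write(io, table; file=true)  # file=true selects the Arrow file/IPC format
    return take!(io)
end

# e.g. write(some_s3_path, arrow_file_bytes(table))
```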

Reviewer (Member) commented:

> The part that's a bit more annoying IMO is the write_arrow_table workaround. I think it currently involves an unnecessary extra buffer...in theory it shouldn't need to, though.

I see, yeah I've been using the same workaround and have also been annoyed by the need for it. I think maybe one fix is rofinn/FilePathsBase.jl#113.

> I wonder how much we can get away with just using e.g. Tables.partition

That seems cool; I don't know how much has been figured out for using that kind of tool for out-of-memory data, as opposed to just lazily chaining tables that are already in memory. But I haven't used it much yet.
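On the chunking idea: the relevant hook appears to be `Tables.partitioner` (the comments above say "Tables.partition"), which makes `Arrow.write` emit one record batch per partition. A made-up illustration, with tiny in-memory chunks standing in for pieces of a table too big to materialize at once:

```julia
using Arrow, Tables

# Three small row-table "chunks"; contents are made up for illustration.
chunks = ([(x=i, y=2i) for i in lo:lo+2] for lo in (1, 4, 7))

# Each partition becomes its own record batch in the output file, so chunks
# can be produced lazily at write time...
Arrow.write("chunked.arrow", Tables.partitioner(chunks))

# ...and read back as one logical table spanning all the batches.
table = Arrow.Table("chunked.arrow")
```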

@ericphanson (Member) commented:

> > If we add push_preview = true to deploydocs, we should get per-PR docs builds to preview the new docs
>
> Done!

Cool! The preview docs are here: https://beacon-biosignals.github.io/Onda.jl/previews/PR59/

jrevels and others added 3 commits February 16, 2021 16:40
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
@ararslan ararslan left a comment


Cursory review for now, but overall things are looking nice.

jrevels and others added 5 commits February 18, 2021 15:59
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>