Conversation
```yaml
- 1.0
- 1.3
- 1.5
- nightly
```
This is really dumb, but it seems like if I only have a single stage listed here, Travis only performs the "Documentation" job and not the actual tests themselves... not sure what the underlying problem is, but I worked around it this way for now.
Hmmm. Codecov is claiming certain lines in https://codecov.io/gh/beacon-biosignals/Onda.jl/pull/59/src/src/samples.jl are not covered, but I'm like 99.5% sure that a few of these are... both of the cases I'm looking at are ones where the call to the function
This is now ready for review. Given that this is essentially a package-wide rewrite, I strongly suggest any reviewers review the package fresh and only reference the diff where useful (rather than starting with the diff). IMO
> If we add …

Done!
```julia
function zstd_compress(bytes::Vector{UInt8}, level=3)
    compressor = ZstdCompressor(; level=level)
```
This is like the one untouched bit of Onda, but I noticed Arrow.jl re-uses `ZstdCompressor`s: https://github.com/JuliaData/Arrow.jl/blob/a113edd934a1efa667b3ffb3d11b135f746322ab/src/Arrow.jl#L94-L107; maybe this could have perf benefits for us here too?
Ah, probably not a bad idea. I wonder what the perf difference is between multithreading on the Julia side vs. zstd's built-in multithreading (not exposed via CodecZstd, IIRC; for all I know it might be the same thing under the hood anyway).
One possible advantage of doing it Julia-side is composing with other threaded Julia functions, e.g. if the compression is nested inside other threaded work. (Though I'm not totally sure how well tuned that stuff is yet anyway.)
```julia
end
```
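To make the Julia-side threading idea above concrete, here's a minimal sketch (not Onda's actual code; the function name and chunking are illustrative): each task gets its own `ZstdCompressor` instance, since compressors are stateful and shouldn't be shared across tasks.

```julia
using CodecZstd, TranscodingStreams

# Hedged sketch: compress independent byte chunks on Julia threads, one
# ZstdCompressor per task. `threaded_zstd_compress` is a hypothetical name.
function threaded_zstd_compress(chunks::Vector{Vector{UInt8}}, level::Integer=3)
    results = Vector{Vector{UInt8}}(undef, length(chunks))
    @sync for (i, chunk) in enumerate(chunks)
        Threads.@spawn begin
            compressor = ZstdCompressor(; level=level)
            TranscodingStreams.initialize(compressor)
            try
                results[i] = transcode(compressor, chunk)
            finally
                TranscodingStreams.finalize(compressor)
            end
        end
    end
    return results
end
```

Whether this beats zstd's built-in multithreading would need benchmarking, as noted above.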
```julia
# It would be better if Arrow.jl supported a generic API for nonstandard path-like types so that
# we can avoid potential intermediate copies here, but its documentation is explicit that it only
```
This is something I've been wondering about; I think `read_arrow_table(read(path))` does not usually make extra copies, right? It reads the whole file into memory as a `Vector{UInt8}`, then Arrow.jl re-uses that byte buffer and basically exposes a lazy tabular view on top?

But this API choice of Arrow.jl means we can't handle larger-than-memory tables or use mmapping, since we always need to read the whole table into memory.
> I think `read_arrow_table(read(path))` does not usually make extra copies, right? It reads the whole file into memory as a `Vector{UInt8}`, then Arrow.jl re-uses that byte buffer and basically exposes a lazy tabular view on top?
Yup. The part that's a bit more annoying IMO is the `write_arrow_table` workaround. I think it currently involves an unnecessary extra buffer... in theory it shouldn't need to, though. What I really want is a method that, given a `Table`, returns an `<:IO` object that directly references the table's memory and, when read, yields bytes in the caller's preferred Arrow file/IPC format (i.e. the `file=true` option for `write`).

There might already be a good way to achieve that with the Arrow.jl API, and I just missed it lol
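For reference, the kind of workaround being discussed looks roughly like this (a hedged sketch, not Onda's actual code; `write_arrow_file` is a hypothetical name): the table is serialized into an in-memory buffer in Arrow's file format, and then those bytes are written out. The `IOBuffer` is the unnecessary extra copy mentioned above.

```julia
using Arrow

# Hedged sketch of the buffer-based write workaround. The intermediate
# IOBuffer holds a full serialized copy of the table before it hits disk.
function write_arrow_file(path, table)
    io = IOBuffer()
    Arrow.write(io, table; file=true)  # file=true => Arrow file format, not the IPC stream format
    write(path, take!(io))
    return path
end
```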
> This API choice of Arrow.jl means we can't handle larger-than-memory tables or use mmapping, since we always need to read the whole table into memory.
Well, we still get mmapping for local filesystem paths, but it would be really cool if e.g. `AWSS3.S3Path` supported an mmap-like method that used byte-range reads under the hood 😎
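The local-filesystem mmapping mentioned here can be sketched like so (illustrative names, not Onda's API): hand Arrow.jl an mmapped byte vector so table data is paged in lazily rather than read fully into memory up front.

```julia
using Arrow, Mmap

# Hedged sketch: mmap a local Arrow file and expose a lazy tabular view
# over the mapped bytes. `read_arrow_mmapped` is a hypothetical name.
function read_arrow_mmapped(path::AbstractString)
    bytes = Mmap.mmap(path)   # Vector{UInt8} backed by the mapped file, paged in on demand
    return Arrow.Table(bytes) # lazy columnar view over those bytes
end
```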
Generally, though, you'd split a larger-than-memory Arrow table into manageably chunked objects upon write, and then treat the objects as one logical table/dataset via an API like https://arrow.apache.org/docs/python/dataset.html. In Julia land, I wonder how much we can get away with just using e.g. `Tables.partition` for things lol
> The part that's a bit more annoying IMO is the `write_arrow_table` workaround. I think it currently involves an unnecessary extra buffer... in theory it shouldn't need to, though.
I see, yeah I've been using the same workaround and have also been annoyed by the need for it. I think maybe one fix is rofinn/FilePathsBase.jl#113.
> I wonder how much we can get away with just using e.g. Tables.partition
That seems cool; I don't know how much has been figured out for using that kind of tool for out-of-memory data, as opposed to just lazily chaining tables that are already in memory. But I haven't used it much yet.
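For what it's worth, Tables.jl's partitioning hook (via `Tables.partitioner`) already composes with Arrow.jl at write time: each partition is written as its own record batch, which is one way to chunk a large logical table as discussed above. A minimal sketch with tiny in-memory tables (illustrative data only; this doesn't by itself solve the out-of-memory case):

```julia
using Arrow, Tables

# Hedged sketch: write two partitions as separate record batches, then read
# them back as one logical table.
parts = Tables.partitioner(((a = [1, 2],), (a = [3, 4],)))
io = IOBuffer()
Arrow.write(io, parts; file=true)
tbl = Arrow.Table(take!(io))  # columns span both record batches
```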
Cool! The preview docs are here: https://beacon-biosignals.github.io/Onda.jl/previews/PR59/
Co-authored-by: Eric Hanson <5846501+ericphanson@users.noreply.github.com>
ararslan left a comment:
Cursory review for now, but overall things are looking nice.
Co-authored-by: Alex Arslan <ararslan@comcast.net>
ref beacon-biosignals/OndaFormat#28
This is a ridiculously breaking set of changes - I'm not even going to attempt backwards compatibility, except for a convenience function for upgrading old datasets.
I still need to refactor examples/docs/tests, as well as potentially reintroduce some higher-level convenience functionality depending on which manipulations end up being naturally replaced by DataFrames-y one-liners. (done)

This PR also requires that Arrow.jl have a tagged release + a corresponding compat bound bump here. (done)

One additional nice-to-have that is mostly orthogonal to this PR (i.e. this PR is mergeable without it) would be an upstream Arrow.jl PR to make paths-handling there a bit more type-agnostic, so that `read_signals` etc. would automagically work with `S3Path`s. (this would still be nice to have, but we work around it here already)