Add data/rand-many-types #42
paleolimbot
left a comment
This will be very helpful!
There are a number of similar files in arrow-testing (sorted vaguely by type category) here: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/1.0.0-littleendian . Is it worth using those for any part of the experiment?
There are also a number of places where we generate random arrow arrays, although they are squirrelled away in various places (e.g., Archery, Arrow C++'s testing component) and that shouldn't block this file from existing in arrow-experiments.
The only thing I would request for the here-and-now would be to ensure there's a seed set so that the next person who runs this gets the same result. I think the recommended way to do that is something like:
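import numpy as np
rng = np.random.default_rng(12345)  # seeded Generator, so reruns give the same values
rng.random(10)  # or rng.integers(...), rng.normal(...), etc.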
Ref: https://numpy.org/doc/stable/reference/random/generator.html
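To illustrate why the seed matters, here is a minimal sketch showing that two generators created with the same seed produce identical draws. The helper name `sample` and the specific ranges and sizes are hypothetical and are not taken from the script in this PR.

```python
import numpy as np


def sample(seed):
    # Hypothetical helper: draw a few values from a freshly seeded Generator.
    rng = np.random.default_rng(seed)
    ints = rng.integers(0, 100, size=5)  # integers in [0, 100)
    floats = rng.random(5)               # floats in [0.0, 1.0)
    return ints, floats


a_ints, a_floats = sample(12345)
b_ints, b_floats = sample(12345)

# Same seed, same sequence of draws, so both comparisons hold.
same = np.array_equal(a_ints, b_ints) and np.array_equal(a_floats, b_floats)
print(same)
```

Note that the order of calls on the generator also affects the values, so a generation script is only reproducible if both the seed and the sequence of draws stay fixed.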
If this were intended for long-term use or if it required significant effort to create, then I would have looked at reusing an existing asset instead of creating a new one here. But as is, I think the convenience of having this in the same repo and the ability to easily use this to test type-related issues with Arrow-over-HTTP outweigh the benefits of reuse.
Done in e1e43ca. Thanks!
This adds a file in Arrow IPC stream format containing 100,000 rows x 20 columns of random data exercising many different Arrow data types, and the Python script used to generate the file. I will use this in some new http examples.