Add data/rand-many-types #42

Merged: ianmcook merged 2 commits into apache:main from ianmcook:rand-many-types on Nov 29, 2024

@ianmcook ianmcook commented Nov 28, 2024

Member

This adds a file in Arrow IPC stream format containing 100,000 rows x 20 columns of random data exercising many different Arrow data types, and the Python script used to generate the file. I will use this in some new http examples.

@ianmcook ianmcook requested a review from paleolimbot November 28, 2024 18:33

@paleolimbot paleolimbot left a comment

Member

This will be very helpful!

There are a number of similar files in arrow-testing (sorted vaguely by type category) here: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/1.0.0-littleendian . Is it worth using those for any part of the experiment?

There are also a number of places where we generate random Arrow arrays, although they are squirrelled away in various places (e.g., Archery, Arrow C++'s testing component), but that shouldn't block this file from existing in arrow-experiments.

The only thing I would request for the here-and-now would be to ensure there's a seed set so that the next person who runs this gets the same result. I think the recommended way to do that is something like:

rng = np.random.default_rng(12345)
rng.(random|integers|etc)(...)

Ref: https://numpy.org/doc/stable/reference/random/generator.html
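To make the suggestion concrete, here is a minimal sketch of seeding a NumPy `Generator` so that repeated runs produce identical values. The helper `make_columns`, its column names, and the row count are hypothetical and are not taken from the script in this PR; only `default_rng` and the `Generator` methods are real NumPy API.

```python
import numpy as np

SEED = 12345  # any fixed integer works; this one mirrors the snippet above


def make_columns(seed: int, n_rows: int = 5) -> dict:
    """Generate a few random columns from a freshly seeded Generator."""
    rng = np.random.default_rng(seed)
    return {
        "ints": rng.integers(0, 100, size=n_rows),  # Generator uses integers(), not randint()
        "floats": rng.random(n_rows),
        "normals": rng.normal(size=n_rows),
    }


# Two independent runs with the same seed yield the same arrays.
first = make_columns(SEED)
second = make_columns(SEED)
same = all(np.array_equal(first[k], second[k]) for k in first)
```

Creating the generator once and drawing every column from it keeps the whole file reproducible, as long as the order of the draws does not change.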

@ianmcook

Member Author

> There are a number of similar files in arrow-testing (sorted vaguely by type category) here: https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/1.0.0-littleendian . Is it worth using those for any part of the experiment?

If this were intended for long-term use or if it required significant effort to create, then I would have looked at reusing an existing asset instead of creating a new one here. But as is, I think the convenience of having this in the same repo and the ability to easily use this to test type-related issues with Arrow-over-HTTP outweigh the benefits of reuse.

> The only thing I would request for the here-and-now would be to ensure there's a seed set so that the next person who runs this gets the same result.

Done in e1e43ca. Thanks!

@ianmcook ianmcook merged commit 4884e4b into apache:main Nov 29, 2024
@ianmcook ianmcook deleted the rand-many-types branch November 29, 2024 17:17