[WIP] Avoid Copies in IOStream.write#1691
Conversation
|
In practice this falls down because It does work if we switch |
|
Looks like it's not recommended for non-blocking sockets. http://stackoverflow.com/questions/6240737/python-socket-sendall-function |
|
So I don't see a clean way to avoid copies. Does anyone else? Thoughts on the first commit in this PR? Is this an acceptable API change? |
|
With For the first commit, why would someone call In the second commit, I'd rather make windows the special case instead of linux and darwin. FreeBSD and other posixy platforms also have a |
|
Oh, The benefits to calling write with a list are mostly to avoid the latter half of calling the My messages have somewhere between two and several frames. At peak I try to process a few thousand messages a second. I'm somewhat concerned about hitting the IOLoop with code that looks like the following: for frame in frames:
yield stream.write(frame)Perhaps there is a nicer way around this though or perhaps I'm optimizing prematurely. It looks like the IOLoop can operate at tens of thousands of cycles per second. In [1]: from tornado import gen
In [2]: from tornado.ioloop import IOLoop
In [3]: @gen.coroutine
...: def f():
...: for i in range(10000):
...: yield []
...:
In [4]: loop = IOLoop()
In [5]: %time loop.run_sync(f)
CPU times: user 164 ms, sys: 124 µs, total: 164 ms
Wall time: 164 msAdding the extra list does add a bit of time. It's very small however. In [1]: L = []
In [2]: b = b'12345'
In [3]: def f():
for i in range(10000):
for bb in [b]:
L.append(bb)
...:
In [4]: def g():
for i in range(10000):
L.append(b)
...:
In [5]: %time f()
CPU times: user 8.74 ms, sys: 0 ns, total: 8.74 ms
Wall time: 8.59 ms
In [6]: %time g()
CPU times: user 2.53 ms, sys: 143 µs, total: 2.68 ms
Wall time: 2.74 ms
In [7]: %time f()
CPU times: user 8.84 ms, sys: 56 µs, total: 8.89 ms
Wall time: 9.04 ms
In [8]: %time g()
CPU times: user 2.01 ms, sys: 0 ns, total: 2.01 ms
Wall time: 1.85 ms |
Previously we would break apart large bytestrings to avoid a Windows issue. This negatively impacted performance on Linix and OSX systems unnecessarily. Now we only perform this if we're on Windows.
|
I apologize for the very long delay. I've removed the first commit supporting lists of bytestrings and have changed the Windows behavior to blacklist windows rather than whitelist linux and darwin. |
|
This now also uses memoryviews and avoids copies in |
| try: | ||
| loc = self._read_buffer[0].find(self._read_delimiter) | ||
| except AttributeError: # might be a memoryview | ||
| loc = self._read_buffer[0].tobytes().find(self._read_delimiter) |
There was a problem hiding this comment.
This bit is unfortunate. There are times when we need the object to be a bytes object. There might be ways to get the bytes object without doing the explicit conversion.
There was a problem hiding this comment.
Are there alternatives to bytes.find() that we could use instead? It looks like both bytes and memoryview support the in operator (although I haven't verified that it's efficient and has the semantics we want)
|
You may be aware already that the python interface to ZeroMQ successfully implemented minimal copying. Memory-view should provide the same foundations necessary for that. |
|
Yes, I am aware. ZeroMQ also brings along a lot of stuff that I don't particularly want or need though. |
This uses the memoryview object to avoid creating new bytestrings when slicing through bytes objects. This also creates slightly leaner but fairly redundant functions pop_prefix and remove_prefix to complement _merge_prefix
|
@bdarnell can I get a quick reaction from you on this PR? Is something like this reasonable to get in (after tests pass)? It should be relatively quick to quickly glance over (60 line changes.) |
|
How you tried instead to make the write buffer a bytearray, and then use memoryview() to slice off the chunks being given to |
bdarnell
left a comment
There was a problem hiding this comment.
Yes, the approach seems good once the tests are passing.
| try: | ||
| loc = self._read_buffer[0].find(self._read_delimiter) | ||
| except AttributeError: # might be a memoryview | ||
| loc = self._read_buffer[0].tobytes().find(self._read_delimiter) |
There was a problem hiding this comment.
Are there alternatives to bytes.find() that we could use instead? It looks like both bytes and memoryview support the in operator (although I haven't verified that it's efficient and has the semantics we want)
| # as we slice off pieces to send to the socket. | ||
| WRITE_BUFFER_CHUNK_SIZE = 128 * 1024 | ||
| for i in range(0, len(data), WRITE_BUFFER_CHUNK_SIZE): | ||
| self._write_buffer.append(data[i:i + WRITE_BUFFER_CHUNK_SIZE]) |
There was a problem hiding this comment.
Does socket.send on windows support memoryview objects? If so, we may be able to avoid this splitting and just give send() a memoryview.
|
It looks like we can complement In [1]: b = b'0' * 1000000 + b'123' + b'0' * 1000000
...: mv = memoryview(b)
...:
In [2]: %time b'123' in b
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.27 ms
Out[2]: True
In [3]: %time b.find(b'123')
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.65 ms
Out[3]: 1000000
In [4]: b'123' in mv
Out[4]: False
In [5]: %time bytearray(mv).find(b'123')
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 6.35 ms
Out[5]: 1000000
In [6]: %time bytearray(mv).find(b'123')
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 5.97 ms
Out[6]: 1000000
In [7]: %time b'123' in bytearray(mv)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.99 ms
Out[7]: True
In [8]: %time b'123' in bytearray(mv[:1000])
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 36.7 µs
Out[8]: False
In [9]: %time b'123' in bytearray(mv[1000:])
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.96 ms
Out[9]: True |
|
Ah, actually it looks like bytearray is making a copy. It's just that the find operation is significantly faster so that I didn't realize at first. In [15]: b = b'123' + b'0' * 2000000
...: mv = memoryview(b)
...:
In [16]: %time b'123' in bytearray(mv)
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 553 µs
Out[16]: True@pitrou any ideas? |
|
Perhaps the read and write buffers should be bytearrays, instead of lists of chunks? Extending a bytearray is efficient. As for deleting from the front, you can do that lazily e.g. when that would halve the size (or on Python 3.4+, rely on the fact that bytearray does it: https://mail.python.org/pipermail/python-dev/2016-October/146673.html). |
|
I've submitted a bytearray-based solution in #1873 |
|
Closing this since I've merged #1873 instead. |
|
For what it's worth I think that this issue is still relevant. While bytearrays are an improvement they do not avoid all copies. I may reopen this PR in the next few months when I expect it to become a higher priority for Dask work. |
This removes unnecessary sharding and copying of large bytestrings when using
IOStream.write.General development questions
nosetests/py.test tornado/test/iostream_test.pyI get some odd errors.Fixes #1685