Skip to content

[WIP] Avoid Copies in IOStream.write#1691

Closed
mrocklin wants to merge 2 commits into
tornadoweb:masterfrom
mrocklin:no-copy
Closed

[WIP] Avoid Copies in IOStream.write#1691
mrocklin wants to merge 2 commits into
tornadoweb:masterfrom
mrocklin:no-copy

Conversation

@mrocklin

@mrocklin mrocklin commented Apr 14, 2016

Copy link
Copy Markdown
Contributor

This removes unnecessary sharding and copying of large bytestrings when using IOStream.write.

General development questions

  1. How do I run tests locally? Naively running nosetests/py.test tornado/test/iostream_test.py I get some odd errors.
  2. Thoughts on how to test no-copy? Ideally I would write some large message and the test object equality on something within the write buffer. Perhaps I need to subclass IOStream to do this well?

Fixes #1685

@mrocklin mrocklin changed the title Avoid Copies in IOStream.write [WIP] Avoid Copies in IOStream.write Apr 14, 2016
@mrocklin

mrocklin commented Apr 14, 2016

Copy link
Copy Markdown
Contributor Author

In practice this falls down because socket.send is only sending a bit at a time, causing many expensive calls to _merge_prefix.

It does work if we switch socket.send with socket.sendall, I'm not sure if this change would have other negative side effects however.

@mrocklin

Copy link
Copy Markdown
Contributor Author

Looks like it's not recommended for non-blocking sockets. http://stackoverflow.com/questions/6240737/python-socket-sendall-function

@mrocklin

Copy link
Copy Markdown
Contributor Author

So I don't see a clean way to avoid copies. Does anyone else?

Thoughts on the first commit in this PR? Is this an acceptable API change?

@bdarnell

Copy link
Copy Markdown
Member

With str and bytes types, copies are inevitable. But what about bytearray and memoryview? I think those can be used to make minimal-copy writes (I'm not sure if zero-copy is possible while staying in pure python with clean interfaces, but those two types will get you closer than str and bytes)

For the first commit, why would someone call write() with a list instead of calling write() multiple times with the individual values? Is there a case in which that avoids a significant amount of overhead? This change introduces overhead in the non-list case that I'd like to avoid (allocating a new list and looping). I'd be OK from an API perspective with accepting lists of bytes in this function, but I wouldn't want to do it if it adds overhead to the common case of non-lists.

In the second commit, I'd rather make windows the special case instead of linux and darwin. FreeBSD and other posixy platforms also have a send() system call that behaves as expected; I believe windows is unique here.

@mrocklin

Copy link
Copy Markdown
Contributor Author

Oh, bytearray and memoryview are nice ideas. I'll take a look.

The benefits to calling write with a list are mostly to avoid the latter half of calling the write_to_fd function multiple times. In particular my original intent was to avoid sending two TCP packets rather than just one when sending multiple frames. I'm not sure that this is possible though. I'll admit though that I haven't yet done sufficient benchmarking to really validate this change.

My messages have somewhere between two and several frames. At peak I try to process a few thousand messages a second. I'm somewhat concerned about hitting the IOLoop with code that looks like the following:

for frame in frames:
    yield stream.write(frame)

Perhaps there is a nicer way around this though or perhaps I'm optimizing prematurely. It looks like the IOLoop can operate at tens of thousands of cycles per second.

In [1]: from tornado import gen
In [2]: from tornado.ioloop import IOLoop
In [3]: @gen.coroutine
   ...: def f():
   ...:     for i in range(10000):
   ...:         yield []
   ...:         

In [4]: loop = IOLoop()

In [5]: %time loop.run_sync(f)
CPU times: user 164 ms, sys: 124 µs, total: 164 ms
Wall time: 164 ms

Adding the extra list does add a bit of time. It's very small however.

In [1]: L = []
In [2]: b = b'12345'
In [3]: def f():
    for i in range(10000):
        for bb in [b]:
            L.append(bb)
   ...:             

In [4]: def g():
    for i in range(10000):
        L.append(b)
   ...:         

In [5]: %time f()
CPU times: user 8.74 ms, sys: 0 ns, total: 8.74 ms
Wall time: 8.59 ms

In [6]: %time g()
CPU times: user 2.53 ms, sys: 143 µs, total: 2.68 ms
Wall time: 2.74 ms

In [7]: %time f()
CPU times: user 8.84 ms, sys: 56 µs, total: 8.89 ms
Wall time: 9.04 ms

In [8]: %time g()
CPU times: user 2.01 ms, sys: 0 ns, total: 2.01 ms
Wall time: 1.85 ms

Previously we would break apart large bytestrings to avoid a Windows
issue.  This negatively impacted performance on Linix and OSX systems
unnecessarily.  Now we only perform this if we're on Windows.
@mrocklin

Copy link
Copy Markdown
Contributor Author

I apologize for the very long delay. I've removed the first commit supporting lists of bytestrings and have changed the Windows behavior to blacklist windows rather than whitelist linux and darwin.

@mrocklin

Copy link
Copy Markdown
Contributor Author

This now also uses memoryviews and avoids copies in _merge_prefix.

Comment thread tornado/iostream.py
try:
loc = self._read_buffer[0].find(self._read_delimiter)
except AttributeError: # might be a memoryview
loc = self._read_buffer[0].tobytes().find(self._read_delimiter)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit is unfortunate. There are times when we need the object to be a bytes object. There might be ways to get the bytes object without doing the explicit conversion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there alternatives to bytes.find() that we could use instead? It looks like both bytes and memoryview support the in operator (although I haven't verified that it's efficient and has the semantics we want)

@teoliphant

Copy link
Copy Markdown

You may be aware already that the python interface to ZeroMQ successfully implemented minimal copying. Memory-view should provide the same foundations necessary for that.

@mrocklin

Copy link
Copy Markdown
Contributor Author

Yes, I am aware. ZeroMQ also brings along a lot of stuff that I don't particularly want or need though.

This uses the memoryview object to avoid creating new bytestrings when
slicing through bytes objects.  This also creates slightly leaner but
fairly redundant functions pop_prefix and remove_prefix to complement
_merge_prefix
@mrocklin

mrocklin commented Oct 28, 2016

Copy link
Copy Markdown
Contributor Author

@bdarnell can I get a quick reaction from you on this PR? Is something like this reasonable to get in (after tests pass)?

It should be relatively quick to quickly glance over (60 line changes.)

@pitrou

pitrou commented Oct 31, 2016

Copy link
Copy Markdown
Contributor

How you tried instead to make the write buffer a bytearray, and then use memoryview() to slice off the chunks being given to write_to_fd()?

@bdarnell bdarnell left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the approach seems good once the tests are passing.

Comment thread tornado/iostream.py
try:
loc = self._read_buffer[0].find(self._read_delimiter)
except AttributeError: # might be a memoryview
loc = self._read_buffer[0].tobytes().find(self._read_delimiter)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there alternatives to bytes.find() that we could use instead? It looks like both bytes and memoryview support the in operator (although I haven't verified that it's efficient and has the semantics we want)

Comment thread tornado/iostream.py
# as we slice off pieces to send to the socket.
WRITE_BUFFER_CHUNK_SIZE = 128 * 1024
for i in range(0, len(data), WRITE_BUFFER_CHUNK_SIZE):
self._write_buffer.append(data[i:i + WRITE_BUFFER_CHUNK_SIZE])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does socket.send on windows support memoryview objects? If so, we may be able to avoid this splitting and just give send() a memoryview.

@mrocklin

mrocklin commented Nov 1, 2016

Copy link
Copy Markdown
Contributor Author

It looks like we can complement memoryview with bytearray

In [1]: b = b'0' * 1000000 + b'123' + b'0' * 1000000
   ...: mv = memoryview(b)
   ...: 

In [2]: %time b'123' in b
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.27 ms
Out[2]: True

In [3]: %time b.find(b'123')
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.65 ms
Out[3]: 1000000

In [4]: b'123' in mv
Out[4]: False

In [5]: %time bytearray(mv).find(b'123')
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 6.35 ms
Out[5]: 1000000

In [6]: %time bytearray(mv).find(b'123')
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 5.97 ms
Out[6]: 1000000

In [7]: %time b'123' in bytearray(mv)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.99 ms
Out[7]: True

In [8]: %time b'123' in bytearray(mv[:1000])
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 36.7 µs
Out[8]: False

In [9]: %time b'123' in bytearray(mv[1000:])
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.96 ms
Out[9]: True

@mrocklin

mrocklin commented Nov 1, 2016

Copy link
Copy Markdown
Contributor Author

Ah, actually it looks like bytearray is making a copy. It's just that the find operation is significantly faster so that I didn't realize at first.

In [15]: b = b'123' + b'0' * 2000000
    ...: mv = memoryview(b)
    ...: 

In [16]: %time b'123' in bytearray(mv)
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 553 µs
Out[16]: True

@pitrou any ideas?

@pitrou

pitrou commented Nov 1, 2016

Copy link
Copy Markdown
Contributor

Perhaps the read and write buffers should be bytearrays, instead of lists of chunks? Extending a bytearray is efficient. As for deleting from the front, you can do that lazily e.g. when that would halve the size (or on Python 3.4+, rely on the fact that bytearray does it: https://mail.python.org/pipermail/python-dev/2016-October/146673.html).

@pitrou

pitrou commented Nov 1, 2016

Copy link
Copy Markdown
Contributor

I've submitted a bytearray-based solution in #1873

@bdarnell

Copy link
Copy Markdown
Member

Closing this since I've merged #1873 instead.

@bdarnell bdarnell closed this Feb 21, 2017
@mrocklin

Copy link
Copy Markdown
Contributor Author

For what it's worth I think that this issue is still relevant. While bytearrays are an improvement they do not avoid all copies. I may reopen this PR in the next few months when I expect it to become a higher priority for Dask work.

@mrocklin mrocklin mentioned this pull request Sep 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants