[WIP] Avoid Copies in IOStream.write by mrocklin · Pull Request #1691 · tornadoweb/tornado

mrocklin · 2016-04-14T19:43:53Z

This removes unnecessary sharding and copying of large bytestrings when using IOStream.write.

General development questions

How do I run tests locally? Naively running nosetests/py.test tornado/test/iostream_test.py I get some odd errors.
Thoughts on how to test no-copy? Ideally I would write some large message and the test object equality on something within the write buffer. Perhaps I need to subclass IOStream to do this well?

mrocklin · 2016-04-14T19:53:47Z

In practice this falls down because socket.send is only sending a bit at a time, causing many expensive calls to _merge_prefix.

It does work if we switch socket.send with socket.sendall, I'm not sure if this change would have other negative side effects however.

mrocklin · 2016-04-14T19:54:50Z

Looks like it's not recommended for non-blocking sockets. http://stackoverflow.com/questions/6240737/python-socket-sendall-function

mrocklin · 2016-04-16T00:36:13Z

So I don't see a clean way to avoid copies. Does anyone else?

Thoughts on the first commit in this PR? Is this an acceptable API change?

bdarnell · 2016-04-16T01:22:02Z

With str and bytes types, copies are inevitable. But what about bytearray and memoryview? I think those can be used to make minimal-copy writes (I'm not sure if zero-copy is possible while staying in pure python with clean interfaces, but those two types will get you closer than str and bytes)

For the first commit, why would someone call write() with a list instead of calling write() multiple times with the individual values? Is there a case in which that avoids a significant amount of overhead? This change introduces overhead in the non-list case that I'd like to avoid (allocating a new list and looping). I'd be OK from an API perspective with accepting lists of bytes in this function, but I wouldn't want to do it if it adds overhead to the common case of non-lists.

In the second commit, I'd rather make windows the special case instead of linux and darwin. FreeBSD and other posixy platforms also have a send() system call that behaves as expected; I believe windows is unique here.

mrocklin · 2016-04-16T02:10:08Z

Oh, bytearray and memoryview are nice ideas. I'll take a look.

The benefits to calling write with a list are mostly to avoid the latter half of calling the write_to_fd function multiple times. In particular my original intent was to avoid sending two TCP packets rather than just one when sending multiple frames. I'm not sure that this is possible though. I'll admit though that I haven't yet done sufficient benchmarking to really validate this change.

My messages have somewhere between two and several frames. At peak I try to process a few thousand messages a second. I'm somewhat concerned about hitting the IOLoop with code that looks like the following:

for frame in frames:
    yield stream.write(frame)

Perhaps there is a nicer way around this though or perhaps I'm optimizing prematurely. It looks like the IOLoop can operate at tens of thousands of cycles per second.

In [1]: from tornado import gen
In [2]: from tornado.ioloop import IOLoop
In [3]: @gen.coroutine
   ...: def f():
   ...:     for i in range(10000):
   ...:         yield []
   ...:         

In [4]: loop = IOLoop()

In [5]: %time loop.run_sync(f)
CPU times: user 164 ms, sys: 124 µs, total: 164 ms
Wall time: 164 ms

Adding the extra list does add a bit of time. It's very small however.

In [1]: L = []
In [2]: b = b'12345'
In [3]: def f():
    for i in range(10000):
        for bb in [b]:
            L.append(bb)
   ...:             

In [4]: def g():
    for i in range(10000):
        L.append(b)
   ...:         

In [5]: %time f()
CPU times: user 8.74 ms, sys: 0 ns, total: 8.74 ms
Wall time: 8.59 ms

In [6]: %time g()
CPU times: user 2.53 ms, sys: 143 µs, total: 2.68 ms
Wall time: 2.74 ms

In [7]: %time f()
CPU times: user 8.84 ms, sys: 56 µs, total: 8.89 ms
Wall time: 9.04 ms

In [8]: %time g()
CPU times: user 2.01 ms, sys: 0 ns, total: 2.01 ms
Wall time: 1.85 ms

Previously we would break apart large bytestrings to avoid a Windows issue. This negatively impacted performance on Linix and OSX systems unnecessarily. Now we only perform this if we're on Windows.

mrocklin · 2016-10-24T19:21:59Z

I apologize for the very long delay. I've removed the first commit supporting lists of bytestrings and have changed the Windows behavior to blacklist windows rather than whitelist linux and darwin.

mrocklin · 2016-10-24T20:14:52Z

This now also uses memoryviews and avoids copies in _merge_prefix.

mrocklin · 2016-10-24T20:19:00Z

+                    try:
+                        loc = self._read_buffer[0].find(self._read_delimiter)
+                    except AttributeError:  # might be a memoryview
+                        loc = self._read_buffer[0].tobytes().find(self._read_delimiter)


This bit is unfortunate. There are times when we need the object to be a bytes object. There might be ways to get the bytes object without doing the explicit conversion.

Are there alternatives to bytes.find() that we could use instead? It looks like both bytes and memoryview support the in operator (although I haven't verified that it's efficient and has the semantics we want)

teoliphant · 2016-10-24T23:43:56Z

You may be aware already that the python interface to ZeroMQ successfully implemented minimal copying. Memory-view should provide the same foundations necessary for that.

mrocklin · 2016-10-25T00:55:11Z

Yes, I am aware. ZeroMQ also brings along a lot of stuff that I don't particularly want or need though.

This uses the memoryview object to avoid creating new bytestrings when slicing through bytes objects. This also creates slightly leaner but fairly redundant functions pop_prefix and remove_prefix to complement _merge_prefix

mrocklin · 2016-10-28T22:52:46Z

@bdarnell can I get a quick reaction from you on this PR? Is something like this reasonable to get in (after tests pass)?

It should be relatively quick to quickly glance over (60 line changes.)

pitrou · 2016-10-31T13:34:41Z

How you tried instead to make the write buffer a bytearray, and then use memoryview() to slice off the chunks being given to write_to_fd()?

bdarnell

Yes, the approach seems good once the tests are passing.

bdarnell · 2016-11-01T12:56:01Z

+                    try:
+                        loc = self._read_buffer[0].find(self._read_delimiter)
+                    except AttributeError:  # might be a memoryview
+                        loc = self._read_buffer[0].tobytes().find(self._read_delimiter)


Are there alternatives to bytes.find() that we could use instead? It looks like both bytes and memoryview support the in operator (although I haven't verified that it's efficient and has the semantics we want)

bdarnell · 2016-11-01T13:02:05Z

+                # as we slice off pieces to send to the socket.
+                WRITE_BUFFER_CHUNK_SIZE = 128 * 1024
+                for i in range(0, len(data), WRITE_BUFFER_CHUNK_SIZE):
+                    self._write_buffer.append(data[i:i + WRITE_BUFFER_CHUNK_SIZE])


Does socket.send on windows support memoryview objects? If so, we may be able to avoid this splitting and just give send() a memoryview.

mrocklin · 2016-11-01T13:12:22Z

It looks like we can complement memoryview with bytearray

In [1]: b = b'0' * 1000000 + b'123' + b'0' * 1000000
   ...: mv = memoryview(b)
   ...: 

In [2]: %time b'123' in b
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.27 ms
Out[2]: True

In [3]: %time b.find(b'123')
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.65 ms
Out[3]: 1000000

In [4]: b'123' in mv
Out[4]: False

In [5]: %time bytearray(mv).find(b'123')
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 6.35 ms
Out[5]: 1000000

In [6]: %time bytearray(mv).find(b'123')
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 5.97 ms
Out[6]: 1000000

In [7]: %time b'123' in bytearray(mv)
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.99 ms
Out[7]: True

In [8]: %time b'123' in bytearray(mv[:1000])
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 36.7 µs
Out[8]: False

In [9]: %time b'123' in bytearray(mv[1000:])
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 3.96 ms
Out[9]: True

mrocklin · 2016-11-01T13:16:16Z

Ah, actually it looks like bytearray is making a copy. It's just that the find operation is significantly faster so that I didn't realize at first.

In [15]: b = b'123' + b'0' * 2000000
    ...: mv = memoryview(b)
    ...: 

In [16]: %time b'123' in bytearray(mv)
CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 553 µs
Out[16]: True

@pitrou any ideas?

pitrou · 2016-11-01T13:23:36Z

Perhaps the read and write buffers should be bytearrays, instead of lists of chunks? Extending a bytearray is efficient. As for deleting from the front, you can do that lazily e.g. when that would halve the size (or on Python 3.4+, rely on the fact that bytearray does it: https://mail.python.org/pipermail/python-dev/2016-October/146673.html).

pitrou · 2016-11-01T20:33:04Z

I've submitted a bytearray-based solution in #1873

bdarnell · 2017-02-21T03:28:27Z

Closing this since I've merged #1873 instead.

mrocklin · 2017-07-14T03:52:25Z

For what it's worth I think that this issue is still relevant. While bytearrays are an improvement they do not avoid all copies. I may reopen this PR in the next few months when I expect it to become a higher priority for Dask work.

mrocklin changed the title ~~Avoid Copies in IOStream.write~~ [WIP] Avoid Copies in IOStream.write Apr 14, 2016

Avoid breaking apart large bytestrings on write

0c994bf

Previously we would break apart large bytestrings to avoid a Windows issue. This negatively impacted performance on Linix and OSX systems unnecessarily. Now we only perform this if we're on Windows.

mrocklin force-pushed the no-copy branch from 99c6568 to 0c994bf Compare October 24, 2016 19:21

mrocklin commented Oct 24, 2016

View reviewed changes

Use memoryview to avoid memory copies

ed46698

This uses the memoryview object to avoid creating new bytestrings when slicing through bytes objects. This also creates slightly leaner but fairly redundant functions pop_prefix and remove_prefix to complement _merge_prefix

mrocklin force-pushed the no-copy branch from 0c62fb1 to ed46698 Compare October 28, 2016 19:43

mrocklin mentioned this pull request Oct 28, 2016

Support custom serialization dask/distributed#606

Merged

bdarnell reviewed Nov 1, 2016

View reviewed changes

bdarnell mentioned this pull request Nov 4, 2016

Use bytearray buffers in IOStream #1873

Merged

bdarnell closed this Feb 21, 2017

mrocklin mentioned this pull request Sep 6, 2017

Zero copy send #2147

Closed

Uh oh!

Conversation

mrocklin commented Apr 14, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mrocklin commented Apr 14, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mrocklin commented Apr 14, 2016

Uh oh!

mrocklin commented Apr 16, 2016

Uh oh!

bdarnell commented Apr 16, 2016

Uh oh!

mrocklin commented Apr 16, 2016

Uh oh!

mrocklin commented Oct 24, 2016

Uh oh!

mrocklin commented Oct 24, 2016

Uh oh!

mrocklin Oct 24, 2016

Choose a reason for hiding this comment

Uh oh!

bdarnell Nov 1, 2016

Choose a reason for hiding this comment

Uh oh!

teoliphant commented Oct 24, 2016

Uh oh!

mrocklin commented Oct 25, 2016

Uh oh!

mrocklin commented Oct 28, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pitrou commented Oct 31, 2016

Uh oh!

bdarnell left a comment

Choose a reason for hiding this comment

Uh oh!

bdarnell Nov 1, 2016

Choose a reason for hiding this comment

Uh oh!

bdarnell Nov 1, 2016

Choose a reason for hiding this comment

Uh oh!

mrocklin commented Nov 1, 2016

Uh oh!

mrocklin commented Nov 1, 2016

Uh oh!

pitrou commented Nov 1, 2016

Uh oh!

pitrou commented Nov 1, 2016

Uh oh!

bdarnell commented Feb 21, 2017

Uh oh!

mrocklin commented Jul 14, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mrocklin commented Apr 14, 2016 •

edited

Loading

mrocklin commented Apr 14, 2016 •

edited

Loading

mrocklin commented Oct 28, 2016 •

edited

Loading