
[Python] Verify that terminating sequences in binary IPC stream do not interfere with HTTP/1.1 transport #40581

Description

@ianmcook

Chunked transfer encoding is commonly used in HTTP/1.1. In chunked transfer encoding, each chunk is preceded by its size in hexadecimal followed by CRLF (\r\n), the chunk data is followed by another CRLF, and a zero-length chunk (0\r\n\r\n) terminates the body. But what happens if those byte sequences occur inside binary Arrow IPC data, for example in a binary or string array?

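I am almost certain (based on an understanding of how HTTP/1.1 clients work) that this will not cause any problems, since a client reads each chunk by its declared length rather than scanning the data for delimiters, but we should test to be fully certain.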
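The reasoning above can be sketched with the standard library alone. This is a minimal illustration, not how any particular HTTP client is implemented: `encode_chunked` and `decode_chunked` are hypothetical helper names, and the decoder ignores trailers and handles chunk extensions only by discarding them. It shows that a decoder driven by the declared chunk sizes returns the payload unchanged even when the payload itself looks like a complete chunked body.

```python
def encode_chunked(payload: bytes, chunk_size: int = 16) -> bytes:
    """Frame payload as an HTTP/1.1 chunked body (no trailers, no extensions)."""
    out = bytearray()
    for i in range(0, len(payload), chunk_size):
        chunk = payload[i:i + chunk_size]
        # Size line in hexadecimal, CRLF, chunk data, CRLF.
        out += f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # zero-length chunk terminates the body
    return bytes(out)


def decode_chunked(body: bytes) -> bytes:
    """Decode by honoring each declared size; chunk data is never searched."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)  # end of the size line only
        size = int(body[pos:eol].split(b";")[0], 16)
        pos = eol + 2
        if size == 0:
            break  # terminating chunk reached
        out += body[pos:pos + size]
        if body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("malformed chunk: missing CRLF after data")
        pos += size + 2
    return bytes(out)


# The payload embeds a full chunked body, including a fake terminating chunk.
payload = b"4\r\nWiki\r\n7\r\npedia i\r\nB\r\nn \r\nchunks.\r\n0\r\n\r\nabcdefg"
framed = encode_chunked(payload, chunk_size=16)
assert decode_chunked(framed) == payload
```

Because the decoder only looks for CRLF at the position where a size line is expected, the embedded 0\r\n\r\n is consumed as ordinary chunk data. This does not replace the end-to-end test with a real client and server.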

To test, we could for example use the simple Python GET example, replacing the schema and the definition of GetPutData with the following:

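schema = pa.schema([('a', pa.binary())])

def GetPutData():
    # One binary value that embeds a complete chunked body, including the
    # terminating chunk (0\r\n\r\n), followed by extra bytes.
    value = b'4\r\nWiki\r\n7\r\npedia i\r\nB\r\nn \r\nchunks.\r\n0\r\n\r\nabcdefg'
    # pa.array expects a sequence of values, so wrap the value in a list.
    arrays = [pa.array([value], type=pa.binary())]
    batches = [pa.record_batch(arrays, schema), pa.record_batch(arrays, schema)]
    return batches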

Or this similar version which creates the buffers manually:

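schema = pa.schema([('a', pa.binary())])

def GetPutData():
    # 49-byte value that embeds a complete chunked body and its terminator.
    bytestr = '4\r\nWiki\r\n7\r\npedia i\r\nB\r\nn \r\nchunks.\r\n0\r\n\r\nabcdefg'.encode('ascii')
    data = [bytestr, bytestr]
    # Offsets are little-endian int32 values: start of each value plus the final end.
    offsets_buffer = pa.py_buffer(b''.join([n.to_bytes(4, 'little') for n in [0, 49, 98]]))
    # The values buffer is the two values concatenated back to back.
    values_buffer = pa.py_buffer(b''.join(data))
    # Buffers are [validity bitmap (None: no nulls), offsets, values].
    array = pa.BinaryArray.from_buffers(pa.binary(), 2, [None, offsets_buffer, values_buffer])
    arrays = [array]
    batches = [pa.record_batch(arrays, schema), pa.record_batch(arrays, schema)]
    return batches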
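For a quick check that does not depend on the example repository, the same idea can be sketched with only the Python standard library. The assumptions are stated plainly: the raw byte string stands in for an Arrow IPC stream (no pyarrow involved), `ChunkedHandler` and `fetch_once` are hypothetical names, and the 10-byte chunk size is arbitrary. The server sends the payload with chunked transfer encoding and the `http.client` client must return it byte for byte.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for IPC data: two copies of a value that embeds a chunked body.
PAYLOAD = b"4\r\nWiki\r\n7\r\npedia i\r\nB\r\nn \r\nchunks.\r\n0\r\n\r\nabcdefg" * 2


class ChunkedHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # chunked encoding requires HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Transfer-Encoding", "chunked")
        self.end_headers()
        # Split the payload into small chunks so framing and data interleave.
        for i in range(0, len(PAYLOAD), 10):
            chunk = PAYLOAD[i:i + 10]
            self.wfile.write(f"{len(chunk):X}\r\n".encode("ascii") + chunk + b"\r\n")
        self.wfile.write(b"0\r\n\r\n")  # real terminating chunk

    def log_message(self, format, *args):
        pass  # silence per-request logging


def fetch_once() -> bytes:
    """Start a local server, GET the payload once, and shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), ChunkedHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    try:
        conn.request("GET", "/")
        return conn.getresponse().read()  # http.client removes the chunk framing
    finally:
        conn.close()
        server.shutdown()
        server.server_close()


received = fetch_once()
assert received == PAYLOAD
```

This only exercises Python's own client. Other HTTP/1.1 clients, and real IPC streams produced by GetPutData, would still need to be tested separately.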

Component(s)

Python
