Perf: export Polars String columns directly from Arrow LargeUtf8 buffers #1
Merged
ed028f2 to f65150e
Polars stores strings as Arrow LargeUtf8: a flat UTF-8 bytes buffer plus an int64 offsets buffer. Previously, export went through series.to_numpy() (one Python str object per row) and then the C helper re-encoded each string to UTF-8 via PyObject_Str + str.encode().

This commit adds _export_extract_string_obj_arrow() in sbdf_helpers.c, which reads the raw UTF-8 bytes and offsets directly -- no Python API calls in the inner loop. The Cython side obtains raw pointers via PyArray_DATA() on zero-copy numpy views of the Arrow buffers.

The dispatch path (polars_exporter_id = _POL_EXP_STRING = 5) mirrors the existing temporal fast paths. Categorical and Enum columns are cast to Utf8 before the Arrow path is taken. A guard asserts the Arrow type is large_string (int64 offsets) and raises SBDFError if not.

Benchmarked at 100k rows, string no-nulls (psutil, 7 reps):
  pandas baseline:            58ms
  old polars (via pandas):    71ms
  new polars (Arrow direct):  26ms  (-56% vs pandas, -64% vs old polars)

The remaining time is dominated by sbdf_str_create_len (one malloc + memcpy per string), which is unavoidable in the current SBDF format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
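To make the buffer layout concrete, here is a minimal plain-Python model of the LargeUtf8 idea described in the commit message above: one flat UTF-8 bytes buffer, an int64 offsets buffer with one more entry than there are rows, and a separate invalids mask. The function names `pack_large_utf8` and `slice_strings` are hypothetical and exist only for this illustration; the real work happens in the C helper and does not go through Python objects.

```python
from array import array


def pack_large_utf8(strings):
    """Pack strings into (values, offsets, invalids), mimicking the LargeUtf8 layout.

    Hypothetical helper for illustration only; Polars/Arrow build these buffers natively.
    """
    values = bytearray()
    offsets = array("q", [0])  # signed 64-bit offsets, n + 1 entries for n rows
    invalids = []
    for s in strings:
        invalids.append(s is None)
        if s is not None:
            values += s.encode("utf-8")
        # A null row contributes zero bytes, so its start and end offsets are equal.
        offsets.append(len(values))
    return bytes(values), offsets, invalids


def slice_strings(values, offsets, invalids):
    """Recover each row as a byte slice values[offsets[i]:offsets[i + 1]]."""
    view = memoryview(values)
    rows = []
    for i, invalid in enumerate(invalids):
        if invalid:
            rows.append(None)  # masked rows are skipped, their bytes are never read
            continue
        start, end = offsets[i], offsets[i + 1]
        rows.append(bytes(view[start:end]))
    return rows


values, offsets, invalids = pack_large_utf8(["ab", None, "héllo", ""])
rows = slice_strings(values, offsets, invalids)
```

Each row is located purely by pointer arithmetic on the two buffers, which is why the C loop can hand `values + offsets[i]` and a length straight to the string constructor without re-encoding.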
series.to_arrow() requires pyarrow. CI test environments install spotfire[polars] without pyarrow, causing ModuleNotFoundError on all Polars string export tests.

Wrap the Arrow fast path in try/except ImportError so it degrades gracefully to the existing to_numpy() path when pyarrow is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
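The fallback pattern in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `export_with_fallback`, `missing_dependency_path`, and `object_path` are hypothetical names, and the fast path here simply simulates a missing optional dependency. It relies on the fact that `ModuleNotFoundError` is a subclass of `ImportError`, so catching `ImportError` covers the failure seen in CI.

```python
def export_with_fallback(values, fast_path, slow_path):
    """Try the optional-dependency fast path; fall back if its import fails."""
    try:
        return fast_path(values)
    except ImportError:
        # ModuleNotFoundError subclasses ImportError, so a missing module lands here.
        return slow_path(values)


def missing_dependency_path(values):
    """Stand-in for a fast path whose optional dependency is not installed."""
    raise ModuleNotFoundError("No module named 'pyarrow'")


def object_path(values):
    """Stand-in for the slower per-row path: encode each string individually."""
    return [None if v is None else v.encode("utf-8") for v in values]


result = export_with_fallback(["a", None, "bc"], missing_dependency_path, object_path)
```

One design consequence worth noting: any unrelated `ImportError` raised inside the fast path would also be swallowed, so keeping the guarded region small limits what the fallback can mask.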
…hars

pylint line-too-long (C0301) flagged lines 98-99 after the type: ignore annotations were added. Split the assertEqual calls to keep each line within the 120-character limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
E302: add second blank line before OutputFormat class and _ExportContext decorator.
E127: align continuation lines with opening parenthesis in set_arrow_string, _export_polars_series_to_numpy, _export_vt_polars_string, and the sbdf_helpers.pxi extern declaration.
E115/E117: fix comment indentation inside except blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rk profiles

Temporal Polars columns with nulls were being cast to float64 (nan for nulls) instead of int64 before passing to the C exporter, which read the buffer as long long* and got garbage values. Fix: call fill_null(0) after the int cast so to_numpy() always returns the expected integer dtype; the invalids mask already records which positions are null so the sentinel is never read.

Adds temporal_nulls (datetime/date/duration/time, ~10% nulls) and binary / binary_nulls profiles to benchmark.py to cover remaining SBDF value types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
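The failure mode in this commit can be reproduced without Polars or numpy. The sketch below is a standard-library illustration only: `reinterpret_float64_as_int64` and `as_int64_buffer` are hypothetical names, and plain integers stand in for temporal values. It shows what a reader sees when it treats a float64 buffer as `long long*`, and how filling nulls with 0 keeps the buffer integer-typed while a separate mask carries the null positions.

```python
import math
from array import array


def reinterpret_float64_as_int64(values):
    """Model the bug: nulls force a float64 buffer, which is then read as int64."""
    floats = array("d", [math.nan if v is None else float(v) for v in values])
    ints = array("q")
    ints.frombytes(floats.tobytes())  # same bytes, wrong interpretation
    return ints


def as_int64_buffer(values):
    """Model the fix: fill nulls with 0 so the buffer stays int64.

    The invalids mask records the nulls, so the 0 sentinel is never used as data.
    """
    invalids = [v is None for v in values]
    filled = array("q", [0 if v is None else v for v in values])
    return filled, invalids


ticks = [1, None, 3]
garbage = reinterpret_float64_as_int64(ticks)
filled, invalids = as_int64_buffer(ticks)
```

In the buggy model even the non-null rows come out wrong, because the IEEE 754 bit pattern of `1.0` is not the integer `1`; in the fixed model the non-null rows round-trip and the mask alone decides which rows are null.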
8615934 to c210ffa
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
- `_export_extract_string_obj_arrow()` in `sbdf_helpers.c`: reads raw UTF-8 bytes directly from the Arrow LargeUtf8 values buffer using the int64 offsets buffer, with no Python API calls in the hot loop
- `_export_vt_polars_string` Cython exporter (id=5) that gets raw pointers via `PyArray_DATA()` on zero-copy numpy views of the Arrow buffers
- `set_arrow_string()` on `_ExportContext` to hold the Arrow buffer views alongside the invalids mask
- `_export_polars_setup_arrays` intercepts `Utf8`/`String`/`Categorical`/`Enum` dtypes and takes the new fast path instead of falling through to `to_numpy()`
- Guard: asserts the Arrow type is `large_string` (int64 offsets); raises `SBDFError` with a clear message if not

Why this is faster
Previously, export of a Polars string column did:
1. `series.to_numpy()` → one Python `str` object allocated per row
2. `PyObject_Str()` + `str.encode("utf-8")` per element, redundant since Arrow already stores UTF-8

Now it does:
1. `series.to_arrow()` → zero-copy Arrow array (Polars shares memory)
2. `np.frombuffer(buf)` → zero-copy numpy views of offsets and bytes buffers
3. `sbdf_str_create_len(values_buf + offsets[i], len)`: direct slice, no Python objects

Benchmark (psutil RSS, 7 reps, warmup discarded)
100,000 rows, String no nulls

| Path | Time |
|---|---|
| pandas baseline | 58ms |
| old polars (via `to_pandas()`) | 71ms |
| new polars (Arrow direct) | 26ms |

100,000 rows, String ~10% nulls

[table rows not preserved]

The Arrow direct path is ~56% faster than the pandas baseline and ~64% faster than the old polars workaround. The remaining time is dominated by `sbdf_str_create_len` (one `malloc` + `memcpy` per string, unavoidable in the current SBDF format).

Edge cases
- Categorical/Enum columns: cast to `Utf8` before `to_arrow()`, same as existing path
- `bufs[2]` (the values buffer) may be `None`; falls back to `np.empty(0, uint8)`, and the C loop skips all rows via the invalids mask
- `combine_chunks()` called if `to_arrow()` returns a chunked result

Test plan
- `pytest spotfire/test/test_sbdf.py`
- `test_write_polars_nulls`

🤖 Generated with Claude Code