
Perf: export Polars String columns directly from Arrow LargeUtf8 buffers #1

Merged: stewjb merged 6 commits into main from perf/polars-string-arrow-export on Apr 4, 2026

@stewjb stewjb commented Apr 4, 2026

Summary

  • Adds _export_extract_string_obj_arrow() in sbdf_helpers.c: reads raw UTF-8 bytes directly from the Arrow LargeUtf8 values buffer using the int64 offsets buffer — no Python API calls in the hot loop
  • Adds _export_vt_polars_string Cython exporter (id=5) that gets raw pointers via PyArray_DATA() on zero-copy numpy views of the Arrow buffers
  • Adds set_arrow_string() on _ExportContext to hold the Arrow buffer views alongside the invalids mask
  • _export_polars_setup_arrays intercepts Utf8/String/Categorical/Enum dtypes and takes the new fast path instead of falling through to to_numpy()
  • Guards that the Arrow type is large_string (int64 offsets); raises SBDFError with a clear message if not

Why this is faster

Previously, export of a Polars string column did:

  1. series.to_numpy() → one Python str object allocated per row
  2. C helper: PyObject_Str() + str.encode("utf-8") per element — redundant since Arrow already stores UTF-8

Now it does:

  1. series.to_arrow() → zero-copy Arrow array (Polars shares memory)
  2. np.frombuffer(buf) → zero-copy numpy views of offsets and bytes buffers
  3. C function: sbdf_str_create_len(values_buf + offsets[i], len) — direct slice, no Python objects
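
The offsets-plus-bytes slicing in step 3 can be modelled in pure Python. This is a minimal sketch of the LargeUtf8 layout only, not the PR's C or Cython code: `build_large_utf8` and `slice_rows` are hypothetical names, and a stdlib `array("q")` stands in for the int64 offsets buffer that the real path views through numpy.

```python
from array import array


def build_large_utf8(strings):
    """Pack str/None items into (offsets, values, invalids), LargeUtf8-style."""
    values = bytearray()
    offsets = array("q", [0])        # int64 offsets, n + 1 entries
    invalids = []
    for s in strings:
        if s is not None:
            values += s.encode("utf-8")
        offsets.append(len(values))  # a null row repeats the previous offset
        invalids.append(s is None)
    return offsets, bytes(values), invalids


def slice_rows(offsets, values, invalids):
    """Model of the hot loop: one slice per valid row, no str objects created."""
    view = memoryview(values)
    rows = []
    for i, is_null in enumerate(invalids):
        if is_null:
            rows.append(None)        # skipped via the invalids mask
            continue
        start, end = offsets[i], offsets[i + 1]
        rows.append(bytes(view[start:end]))  # values_buf + offsets[i], len = end - start
    return rows


offsets, values, invalids = build_large_utf8(["héllo", None, "", "sbdf"])
rows = slice_rows(offsets, values, invalids)
print(list(offsets), rows)
```

Row `i` occupies bytes `offsets[i]` up to `offsets[i + 1]`, so a null row and an empty string both have length zero and only the invalids mask tells them apart.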

Benchmark (psutil RSS, 7 reps, warmup discarded)

100,000 rows, String no nulls

Export path                                  time (ms)
Export: pandas df                                 58.3
Export: polars df (old: via to_pandas())          71.4
Export: polars df (new: Arrow direct)             25.7

100,000 rows, String ~10% nulls

Export path                                  time (ms)
Export: pandas df                                 88.4
Export: polars df (old: via to_pandas())         105.2
Export: polars df (new: Arrow direct)             36.6

In the no-nulls case, the Arrow direct path takes ~56% less time than the pandas baseline and ~64% less than the old polars workaround. The remaining time is dominated by sbdf_str_create_len (one malloc + memcpy per string, which is unavoidable in the current SBDF format).
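
The percentages can be rechecked from the tables above. The sketch below only redoes that arithmetic on the reported timings; `reduction_pct` is a hypothetical helper and nothing here re-runs the benchmark.

```python
def reduction_pct(new_ms, old_ms):
    """Percent less wall time taken by new_ms relative to old_ms."""
    return (1.0 - new_ms / old_ms) * 100.0


# Timings in ms, copied from the benchmark tables in this PR.
timings = {
    "no nulls": {"pandas": 58.3, "polars_old": 71.4, "polars_new": 25.7},
    "~10% nulls": {"pandas": 88.4, "polars_old": 105.2, "polars_new": 36.6},
}

for profile, t in timings.items():
    vs_pandas = reduction_pct(t["polars_new"], t["pandas"])
    vs_old = reduction_pct(t["polars_new"], t["polars_old"])
    print(f"{profile}: {vs_pandas:.0f}% less than pandas, {vs_old:.0f}% less than old polars")
```

By the same arithmetic, the ~10% nulls profile comes out at roughly 59% and 65%.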

Edge cases

  • Categorical/Enum: cast to Utf8 before to_arrow(), same as existing path
  • All-null or empty series: bufs[2] may be None; falls back to np.empty(0, uint8) — C loop skips all rows via invalids mask
  • ChunkedArray (older Polars): combine_chunks() called if to_arrow() returns a chunked result
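
The all-null and empty cases can be modelled the same way. This is a sketch under assumptions: `values_or_empty` and `export_rows` are hypothetical names, `b""` plays the role of `np.empty(0, uint8)`, and the real fallback lives in the Cython setup code rather than in pure Python.

```python
from array import array


def values_or_empty(values_buf):
    """Fall back to an empty bytes buffer when the Arrow values buffer is None."""
    return b"" if values_buf is None else values_buf


def export_rows(offsets, values_buf, invalids):
    """Slice each valid row; null rows never index into the values buffer."""
    values = memoryview(values_or_empty(values_buf))
    out = []
    for i, is_null in enumerate(invalids):
        if is_null:
            out.append(None)
        else:
            out.append(bytes(values[offsets[i]:offsets[i + 1]]))
    return out


# All-null series: three rows, every offset is 0, no values buffer at all.
all_null = export_rows(array("q", [0, 0, 0, 0]), None, [True, True, True])
# Empty series: a single offset entry and no rows.
empty = export_rows(array("q", [0]), None, [])
print(all_null, empty)
```

Because every row is masked out, the loop never dereferences the placeholder buffer, so its contents do not matter.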

Test plan

  • All 53 existing SBDF tests pass (pytest spotfire/test/test_sbdf.py)
  • Run benchmark and confirm string export time decreases vs prior baseline
  • Verify Categorical/Enum string columns export correctly (covered by existing polars write tests)
  • Verify null string columns export correctly (covered by test_write_polars_nulls)

🤖 Generated with Claude Code

@stewjb stewjb force-pushed the perf/polars-string-arrow-export branch from ed028f2 to f65150e Compare April 4, 2026 02:28
stewjb and others added 5 commits April 3, 2026 22:03
Polars stores strings as Arrow LargeUtf8: a flat UTF-8 bytes buffer plus
an int64 offsets buffer. Previously, export went through
series.to_numpy() (one Python str object per row) and then the C helper
re-encoded each string to UTF-8 via PyObject_Str + str.encode().

This commit adds _export_extract_string_obj_arrow() in sbdf_helpers.c,
which reads the raw UTF-8 bytes and offsets directly -- no Python API
calls in the inner loop. The Cython side obtains raw pointers via
PyArray_DATA() on zero-copy numpy views of the Arrow buffers.

The dispatch path (polars_exporter_id = _POL_EXP_STRING = 5) mirrors
the existing temporal fast paths. Categorical and Enum columns are cast
to Utf8 before the Arrow path is taken. A guard asserts the Arrow type
is large_string (int64 offsets) and raises SBDFError if not.

Benchmarked at 100k rows, string no-nulls (psutil, 7 reps):
  pandas baseline:          58ms
  old polars (via pandas):  71ms
  new polars (Arrow direct): 26ms  (-56% vs pandas, -64% vs old polars)

The remaining time is dominated by sbdf_str_create_len (one malloc +
memcpy per string), which is unavoidable in the current SBDF format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

series.to_arrow() requires pyarrow. CI test environments install
spotfire[polars] without pyarrow, causing ModuleNotFoundError on all
Polars string export tests. Wrap the Arrow fast path in try/except
ImportError so it degrades gracefully to the existing to_numpy() path
when pyarrow is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
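
The graceful-degradation pattern from this commit can be sketched as an optional-import probe. The function name and the returned labels are hypothetical; the actual change wraps the Arrow fast path itself in try/except ImportError rather than selecting a path up front.

```python
import importlib


def pick_string_export_path(module_name="pyarrow"):
    """Return "arrow" if the optional dependency imports, else "to_numpy"."""
    try:
        importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return "to_numpy"
    return "arrow"


# A module that cannot exist forces the fallback; a stdlib module takes the fast path.
print(pick_string_export_path("definitely_not_installed_xyz"))
print(pick_string_export_path("json"))
```

Catching ImportError rather than ModuleNotFoundError also covers an installed but broken pyarrow whose import fails partway through.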
…hars

pylint line-too-long (C0301) flagged lines 98-99 after the type: ignore
annotations were added. Split the assertEqual calls to keep each line
within the 120-character limit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

E302: add second blank line before OutputFormat class and _ExportContext
decorator. E127: align continuation lines with opening parenthesis in
set_arrow_string, _export_polars_series_to_numpy, _export_vt_polars_string,
and the sbdf_helpers.pxi extern declaration. E115/E117: fix comment
indentation inside except blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rk profiles

Temporal Polars columns with nulls were being cast to float64 (nan for nulls)
instead of int64 before passing to the C exporter, which read the buffer as
long long* and got garbage values.  Fix: call fill_null(0) after the int cast
so to_numpy() always returns the expected integer dtype; the invalids mask
already records which positions are null so the sentinel is never read.

Adds temporal_nulls (datetime/date/duration/time, ~10% nulls) and binary /
binary_nulls profiles to benchmark.py to cover remaining SBDF value types.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
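
The temporal-nulls bug fixed in the commit above comes down to reading float64 bytes as int64. The sketch below reproduces that reinterpretation with the stdlib `struct` module; `reinterpret_f64_as_i64` is a hypothetical helper standing in for the C exporter's `long long*` read, and the sample values are made up.

```python
import struct


def reinterpret_f64_as_i64(values):
    """Pack values as float64, then read the same bytes back as int64."""
    raw = struct.pack(f"={len(values)}d", *values)
    return list(struct.unpack(f"={len(values)}q", raw))


# Before the fix: nulls force a float64 array (nan in the null slot).
as_float = [1.0, float("nan"), 3.0]
garbage = reinterpret_f64_as_i64(as_float)

# After the fix: fill_null(0) keeps an integer dtype; the mask marks the null.
as_int = [1, 0, 3]
invalids = [False, True, False]
raw = struct.pack(f"={len(as_int)}q", *as_int)
recovered = list(struct.unpack(f"={len(as_int)}q", raw))

print(garbage[0], recovered)
```

The integer 1 stored as the float 1.0 reads back as 0x3FF0000000000000, which is why the exported values looked like garbage. The 0 sentinel is harmless because the invalids mask keeps that slot from ever being read.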
@stewjb stewjb force-pushed the perf/polars-string-arrow-export branch from 8615934 to c210ffa Compare April 4, 2026 03:06
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@stewjb stewjb merged commit 651df7e into main Apr 4, 2026
44 checks passed