Perf: export Polars String columns directly from Arrow LargeUtf8 buffers by stewjb · Pull Request #1 · stewjb/spotfire-python

stewjb · 2026-04-04T02:17:45Z

Summary

Adds _export_extract_string_obj_arrow() in sbdf_helpers.c: reads raw UTF-8 bytes directly from the Arrow LargeUtf8 values buffer using the int64 offsets buffer — no Python API calls in the hot loop
Adds _export_vt_polars_string Cython exporter (id=5) that gets raw pointers via PyArray_DATA() on zero-copy numpy views of the Arrow buffers
Adds set_arrow_string() on _ExportContext to hold the Arrow buffer views alongside the invalids mask
_export_polars_setup_arrays intercepts Utf8/String/Categorical/Enum dtypes and takes the new fast path instead of falling through to to_numpy()
Guards that the Arrow type is large_string (int64 offsets); raises SBDFError with a clear message if not

Why this is faster

Previously, export of a Polars string column did:

series.to_numpy() → one Python str object allocated per row
C helper: PyObject_Str() + str.encode("utf-8") per element — redundant since Arrow already stores UTF-8

Now it does:

series.to_arrow() → zero-copy Arrow array (Polars shares memory)
np.frombuffer(buf) → zero-copy numpy views of offsets and bytes buffers
C function: sbdf_str_create_len(values_buf + offsets[i], len) — direct slice, no Python objects
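The direct-slice step can be illustrated in pure Python. This is a minimal sketch of the LargeUtf8 layout, not the C helper from this PR: the buffers are built by hand with the standard library instead of coming from Polars, and `build_large_utf8` and `iter_strings` are hypothetical names used only for illustration.

```python
from array import array


def build_large_utf8(strings):
    """Pack strings (or None) into LargeUtf8-style buffers.

    Returns (offsets, values, invalids): int64 offsets with len(strings) + 1
    entries, one flat UTF-8 bytes buffer, and a per-row null mask.
    """
    offsets = array("q", [0])          # "q" = signed 64-bit, like Arrow int64 offsets
    values = bytearray()
    invalids = []
    for s in strings:
        if s is not None:
            values += s.encode("utf-8")
        invalids.append(s is None)
        offsets.append(len(values))    # a null row repeats the previous offset
    return offsets, bytes(values), invalids


def iter_strings(offsets, values, invalids):
    """Slice each row's bytes out of the values buffer using the offsets."""
    view = memoryview(values)          # slicing a memoryview does not copy
    for i, is_null in enumerate(invalids):
        if is_null:
            yield None                 # null rows are skipped via the mask
            continue
        start, end = offsets[i], offsets[i + 1]
        yield bytes(view[start:end])   # the one copy per row, analogous to memcpy


offsets, values, invalids = build_large_utf8(["spot", None, "fire", "é"])
rows = list(iter_strings(offsets, values, invalids))
```

Row `i` is always `values[offsets[i]:offsets[i + 1]]`, so no per-row Python `str` is needed to find the bytes; lengths are in bytes, not characters, which is why the two-byte "é" advances the offset by 2.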

Benchmark (psutil RSS, 7 reps, warmup discarded)

100,000 rows, String no nulls

| Export path | time (ms) |
| --- | --- |
| Export: pandas df | 58.3 |
| Export: polars df (old: via `to_pandas()`) | 71.4 |
| Export: polars df (new: Arrow direct) | 25.7 |

100,000 rows, String ~10% nulls

| Export path | time (ms) |
| --- | --- |
| Export: pandas df | 88.4 |
| Export: polars df (old: via `to_pandas()`) | 105.2 |
| Export: polars df (new: Arrow direct) | 36.6 |

In the no-nulls case, the Arrow direct path takes ~56% less time than the pandas baseline and ~64% less than the old polars workaround (~59% and ~65% respectively with ~10% nulls). The remaining time is dominated by sbdf_str_create_len (one malloc + memcpy per string), which is unavoidable in the current SBDF format.
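The percentages follow directly from the timing tables. A small sketch of the arithmetic, using only the numbers reported in this PR (the helper name `reduction_pct` is illustrative):

```python
def reduction_pct(baseline_ms, new_ms):
    """Percentage of wall time saved relative to a baseline."""
    return 100.0 * (1.0 - new_ms / baseline_ms)


# Timings in ms copied from the benchmark tables (100,000 rows).
no_nulls = {"pandas": 58.3, "polars_old": 71.4, "polars_new": 25.7}
with_nulls = {"pandas": 88.4, "polars_old": 105.2, "polars_new": 36.6}

for label, t in (("no nulls", no_nulls), ("~10% nulls", with_nulls)):
    vs_pandas = reduction_pct(t["pandas"], t["polars_new"])
    vs_old = reduction_pct(t["polars_old"], t["polars_new"])
    print(f"{label}: {vs_pandas:.0f}% less time than pandas, "
          f"{vs_old:.0f}% less than old polars")
```

Note that these are reductions in elapsed time; expressed as a throughput ratio, the no-nulls case is roughly 58.3 / 25.7 ≈ 2.3x the pandas baseline.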

Edge cases

Categorical/Enum: cast to Utf8 before to_arrow(), same as existing path
All-null or empty series: bufs[2] may be None; falls back to np.empty(0, uint8) — C loop skips all rows via invalids mask
ChunkedArray (older Polars): combine_chunks() called if to_arrow() returns a chunked result
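The type guard and the missing-values-buffer fallback might look roughly like the following. This is a sketch under stated assumptions, not the code merged in this PR: `normalize_string_buffers` is a hypothetical helper, `SBDFError` is redefined locally so the example runs on its own, and plain `bytes` stands in for the `np.empty(0, uint8)` fallback.

```python
class SBDFError(Exception):
    """Stand-in for the library's SBDFError so this sketch is self-contained."""


def normalize_string_buffers(arrow_type_name, bufs):
    """Return (offsets, values) from a [validity, offsets, values] buffer list.

    `bufs` mimics the buffer order of an Arrow string array; the values
    entry may be None for all-null or empty columns.
    """
    if arrow_type_name != "large_string":
        # An int32-offset "string" array would be misread by an int64 loop.
        raise SBDFError(
            f"expected Arrow large_string (int64 offsets), got {arrow_type_name}"
        )
    offsets = bufs[1]
    # Fall back to an empty buffer; the null mask keeps the loop from reading it.
    values = bufs[2] if bufs[2] is not None else b""
    return offsets, values


# All-null column with two rows: three zero offsets, no values buffer.
offsets, values = normalize_string_buffers(
    "large_string", [b"\x00", b"\x00" * 24, None]
)
```

Failing loudly on a non-`large_string` type is the safer choice here, because reading int32 offsets as int64 would produce out-of-range slices rather than an obvious error.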

Test plan

All 53 existing SBDF tests pass (pytest spotfire/test/test_sbdf.py)
Run benchmark and confirm string export time decreases vs prior baseline
Verify Categorical/Enum string columns export correctly (covered by existing polars write tests)
Verify null string columns export correctly (covered by test_write_polars_nulls)

🤖 Generated with Claude Code

Polars stores strings as Arrow LargeUtf8: a flat UTF-8 bytes buffer plus an int64 offsets buffer. Previously, export went through series.to_numpy() (one Python str object per row) and then the C helper re-encoded each string to UTF-8 via PyObject_Str + str.encode().

This commit adds _export_extract_string_obj_arrow() in sbdf_helpers.c, which reads the raw UTF-8 bytes and offsets directly -- no Python API calls in the inner loop. The Cython side obtains raw pointers via PyArray_DATA() on zero-copy numpy views of the Arrow buffers. The dispatch path (polars_exporter_id = _POL_EXP_STRING = 5) mirrors the existing temporal fast paths.

Categorical and Enum columns are cast to Utf8 before the Arrow path is taken. A guard asserts the Arrow type is large_string (int64 offsets) and raises SBDFError if not.

Benchmarked at 100k rows, string no-nulls (psutil, 7 reps):

pandas baseline: 58ms
old polars (via pandas): 71ms
new polars (Arrow direct): 26ms (-56% vs pandas, -64% vs old polars)

The remaining time is dominated by sbdf_str_create_len (one malloc + memcpy per string), which is unavoidable in the current SBDF format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

series.to_arrow() requires pyarrow. CI test environments install spotfire[polars] without pyarrow, causing ModuleNotFoundError on all Polars string export tests. Wrap the Arrow fast path in try/except ImportError so it degrades gracefully to the existing to_numpy() path when pyarrow is absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
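The graceful-degradation pattern described in this commit can be sketched as follows. This only illustrates the try/except ImportError shape; `pick_string_export_path` is a hypothetical function, and the real change wraps the Arrow fast path itself rather than returning a label.

```python
import importlib


def pick_string_export_path(arrow_module="pyarrow"):
    """Return "arrow" when the Arrow dependency imports, else "numpy"."""
    try:
        importlib.import_module(arrow_module)
    except ImportError:
        # Dependency missing: degrade to the slower per-object path.
        return "numpy"
    return "arrow"


# Result depends on whether pyarrow is installed in this environment.
print(pick_string_export_path())
# A module that does not exist always takes the fallback branch.
print(pick_string_export_path("module_that_is_absent_xyz"))
```

Catching `ImportError` also covers `ModuleNotFoundError`, which is a subclass of it, so the CI failure mentioned above is handled by the same branch.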

…hars pylint line-too-long (C0301) flagged lines 98-99 after the type: ignore annotations were added. Split the assertEqual calls to keep each line within the 120-character limit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

E302: add second blank line before OutputFormat class and _ExportContext decorator. E127: align continuation lines with opening parenthesis in set_arrow_string, _export_polars_series_to_numpy, _export_vt_polars_string, and the sbdf_helpers.pxi extern declaration. E115/E117: fix comment indentation inside except blocks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rk profiles Temporal Polars columns with nulls were being cast to float64 (nan for nulls) instead of int64 before passing to the C exporter, which read the buffer as long long* and got garbage values. Fix: call fill_null(0) after the int cast so to_numpy() always returns the expected integer dtype; the invalids mask already records which positions are null so the sentinel is never read. Adds temporal_nulls (datetime/date/duration/time, ~10% nulls) and binary / binary_nulls profiles to benchmark.py to cover remaining SBDF value types. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
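The failure mode is easy to reproduce without Polars. The sketch below uses the standard library's `struct` module to mimic a consumer that reads float64 bytes as `long long`, then shows the sentinel-plus-mask approach; the sample values are made up for illustration.

```python
import struct

NAN = float("nan")

# One null among int64 temporal values, modelled as None.
values = [1_700_000_000, None, 1_700_000_060]
invalids = [v is None for v in values]

# Buggy route: nulls become NaN, so the buffer holds float64 ("d") items.
as_float = struct.pack("3d", *[NAN if v is None else float(v) for v in values])
# A consumer expecting long long* reinterprets the same bytes as int64 ("q").
misread = struct.unpack("3q", as_float)

# Fixed route: fill nulls with a 0 sentinel so the buffer stays int64.
as_int = struct.pack("3q", *[0 if v is None else v for v in values])
read_back = struct.unpack("3q", as_int)

# The sentinel is never consumed because the mask marks that row as null.
exported = [None if bad else v for v, bad in zip(read_back, invalids)]
```

Every element of `misread` is wrong, not just the null row, because the IEEE 754 bit pattern of a float never matches the two's-complement encoding of the same nonzero integer.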

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stewjb force-pushed the perf/polars-string-arrow-export branch from ed028f2 to f65150e on April 4, 2026 02:28

stewjb and others added 5 commits April 3, 2026 22:03

stewjb force-pushed the perf/polars-string-arrow-export branch from 8615934 to c210ffa on April 4, 2026 03:06

Fix: remove unused cdef declarations flagged by cython-lint

c52db09

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stewjb merged commit 651df7e into main Apr 4, 2026
44 checks passed

stewjb merged 6 commits into main from perf/polars-string-arrow-export
