feat: add Apple platform support #85

Open
ralph-e-boy wants to merge 8 commits into TA-Lib:main from ralph-e-boy:apple-accellerate-framework
Conversation

@ralph-e-boy

I wonder if this will all be moot if you're really porting the whole thing to Rust, but anyway, I did this before seeing the migration plan doc in the repo.

Apple Platform Support: iOS/macOS Toolchain + Accelerate Framework SIMD Optimizations

This PR adds first-class Apple platform support to TA-Lib: cross-compilation for iOS/Simulator/macOS via a CMake toolchain, plus vectorized implementations of 16 TA functions using Apple's Accelerate framework (vDSP/vForce) for up to 6x throughput on Apple Silicon.


iOS / macOS Build Support

  • CMake toolchain file (cmake/ios.toolchain.cmake) for iOS, Simulator, and macOS cross-compilation
  • Platform detection (IOS/MACOS variables) with automatic dev tool disabling for iOS
  • XCFramework packaging script (build-xcframework.sh) for creating universal static libraries
  • Build script (scripts/build-ios.sh) for all three slices in one command
  • Xcode deployment targets: iOS 12.0, macOS 10.15
```sh
# Build all platforms
scripts/build-ios.sh

# Package into XCFramework
./build-xcframework.sh
```
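For reference, the scripts wrap configure/build invocations roughly along these lines. This is a sketch using CMake's built-in iOS cross-compilation support (`CMAKE_SYSTEM_NAME=iOS`, `CMAKE_OSX_SYSROOT`, Xcode generator, as a later commit in this PR adopts); the exact cache variables the scripts pass may differ:

```sh
# Device slice: CMake's built-in iOS support selects the SDK and target triple
cmake -B build-ios -G Xcode \
  -DCMAKE_SYSTEM_NAME=iOS \
  -DCMAKE_OSX_DEPLOYMENT_TARGET=12.0
cmake --build build-ios --config Release

# Simulator slice: same, but against the simulator SDK
cmake -B build-sim -G Xcode \
  -DCMAKE_SYSTEM_NAME=iOS \
  -DCMAKE_OSX_SYSROOT=iphonesimulator \
  -DCMAKE_OSX_DEPLOYMENT_TARGET=12.0
cmake --build build-sim --config Release
```

The resulting static libraries are what `build-xcframework.sh` bundles into the universal XCFramework.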

Accelerate Framework Optimizations

16 TA functions are vectorized using vecLib's vDSP (SIMD arithmetic) and vForce (SIMD transcendentals) when building on Apple platforms, controlled by the TA_USE_ACCELERATE CMake option (default ON).

vForce-optimized (single-input transcendentals):
ACOS, ASIN, ATAN, COS, COSH, EXP, LN, LOG10, SIN, SINH, SQRT, TAN, TANH

vDSP-optimized (dual-input arithmetic):
ADD, SUB, MULT

All optimizations are guarded by #if defined(TA_USE_ACCELERATE) && !defined(USE_SINGLE_PRECISION_INPUT) and live in hand-written sections that survive gen_code re-runs. The double-precision path is optimized; float variants fall back to scalar code.

Dispatch macros in ta_veclib.h keep each call site to one line:

```c
ACCEL_VFORCE_1IN(vvacos)        // 13 vForce functions
ACCEL_VDSP_2IN(vDSP_vaddD)      // ADD, MULT
ACCEL_VDSP_2IN_SWAP(vDSP_vsubD) // SUB (swapped operands: vDSP_vsubD computes B-A)
```

Benchmark Results

All benchmarks were run on Apple Silicon (M-series), with 10,000 iterations per function per input size (1,140,000 total samples).

Speedup at 10,000 elements (mean, n=10000)

| Function | Scalar (µs) | Accelerate (µs) | Speedup | StdDev (S / A) |
|---|---|---|---|---|
| SIN | 65.90 | 11.72 | 5.6x | ±4.87 / ±1.00 |
| TAN | 62.96 | 12.38 | 5.1x | ±3.63 / ±0.30 |
| ASIN | 63.44 | 13.07 | 4.9x | ±4.39 / ±0.71 |
| ACOS | 81.67 | 18.07 | 4.5x | ±5.68 / ±0.29 |
| COS | 57.09 | 12.61 | 4.5x | ±2.47 / ±1.33 |
| ATAN | 65.55 | 15.23 | 4.3x | ±1.97 / ±0.64 |
| EXP | 51.10 | 13.07 | 3.9x | ±2.99 / ±0.30 |
| TANH | 62.10 | 17.22 | 3.6x | ±3.93 / ±0.25 |
| COSH | 41.10 | 13.11 | 3.1x | ±1.33 / ±0.42 |
| SINH | 43.79 | 14.07 | 3.1x | ±1.62 / ±1.31 |
| LOG10 | 30.08 | 15.06 | 2.0x | ±0.82 / ±0.78 |
| LN | 27.07 | 15.36 | 1.8x | ±0.84 / ±1.89 |
| ADD | 2.58 | 2.42 | 1.1x | ±0.54 / ±1.24 |
| MULT | 2.57 | 2.29 | 1.1x | ±0.75 / ±0.79 |
| SUB | 2.52 | 2.55 | 1.0x | ±0.18 / ±0.34 |
| SQRT | 3.26 | 3.42 | 1.0x | ±0.40 / ±1.57 |

Trig and transcendental functions see 1.8x-5.6x speedups via vForce NEON SIMD. ADD, SUB, MULT, and SQRT are neutral -- the compiler's auto-vectorization already matches vDSP for simple arithmetic. They're kept on the Accelerate path because they show no regression.

Methodology

Output was compared from two benchmark binaries built from the same source, one with TA_USE_ACCELERATE enabled and one without. Each binary ran all 18 functions at 100/1,000/10,000 elements with 10 warmup iterations followed by N timed iterations.

```sh
scripts/run_bench.sh 10000  # writes bench_results.db
```

What we tested and held back

The following functions did not benchmark as improvements, so they stay on the scalar path. Details on why:

| Candidate | Approach | Outcome | Reason |
|---|---|---|---|
| DIV | vDSP_vdivD | 0.6x at 1k (n=1000) | vDSP_vdivD's operand-swap overhead plus function-call cost exceeds the compiler's tight NEON fdiv loop |
| CEIL | vvceil | 0.8x at 10k (n=10000) | Compiler emits a single frintp instruction; vForce call overhead makes it slower |
| FLOOR | vvfloor | 0.9x at 10k (n=10000) | Compiler emits a single frintm instruction; same overhead issue as CEIL |
| LINEARREG (5 funcs) | vDSP_sveD + vDSP_dotprD per window | Slower (12.9 vs 17.2µs) | Two vDSP calls per window position; call overhead exceeds SIMD savings for typical periods (14-30) |
| STDDEV | vvsqrt over output array | Slower (4.5 vs 5.5µs) | Extra clamping pass before vectorized sqrt adds overhead the branch-predicted scalar loop avoids |
| CORREL | vDSP_dotprD for initial sums | Slower (5.7 vs 6.2µs) | Same per-window call-overhead pattern |
| HT_SINE/TRENDMODE/DCPHASE | __sincos() for paired sin/cos | No improvement | Clang already fuses paired sin/cos at -O2 on Apple Silicon |

Key insight: Accelerate wins when replacing an entire O(n) loop with a single O(n) vForce/vDSP call. It loses when adding O(1) function calls inside an O(n) outer loop.


Profiling Infrastructure Fixes

The ta_regtest -p profiling mode was broken on Apple platforms with Accelerate enabled. The root cause: clock() has roughly microsecond granularity, and vForce-optimized functions on small inputs (100 elements) complete faster than one clock tick, producing clockDelta == 0, which was treated as fatal error 612.

Fix (the symptom was a message that the clock wasn't precise enough for benchmarking when running ta_regtest -p):

  • Replaced clock() with mach_absolute_time() on Apple (nanosecond precision)
  • Extracted platform timer logic into macros (TIMER_DECL, TIMER_START, TIMER_STOP, TIMER_TICKS_TO_MS, etc.) in ta_test_priv.h, eliminating 8 duplicated #ifdef blocks across 3 files
  • Changed zero-delta handling from fatal error to graceful flag (matching test_util.c's existing pattern)

@mrjbq7
Member

mrjbq7 commented Apr 5, 2026

Neat idea, looks to be minimal maintenance overhead. Had some comments above. Thanks!

@mrjbq7
Member

mrjbq7 commented Apr 5, 2026

I would also like to see the impact on smaller (100) and larger (100k, 1mm) inputs.

@ralph-e-boy
Author

Here you go on test sizes. Not sure I saw the other comments above besides the 'neat idea' one; were there more than that?

100 to 1mm Elements

1,000 iterations per function per input size (190,000 total samples)

Trig / Transcendental Functions (vForce)

| Function | 100 | 1,000 | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|---|---|
| SIN | 2.4x | 2.5x | 5.5x | 6.5x | 6.4x |
| COS | 2.1x | 2.2x | 4.5x | 6.0x | 6.0x |
| TAN | 2.8x | 3.1x | 5.1x | 6.0x | 5.9x |
| ASIN | 2.7x | 3.0x | 4.8x | 5.9x | 6.0x |
| ACOS | 1.8x | 2.2x | 5.2x | 6.2x | 5.8x |
| ATAN | 2.6x | 3.0x | 4.5x | 5.0x | 5.0x |
| EXP | 1.6x | 1.9x | 4.1x | 4.8x | 4.7x |
| TANH | 2.2x | 2.5x | 3.4x | 4.1x | 4.1x |
| COSH | 2.4x | 2.8x | 3.2x | 3.5x | 3.5x |
| SINH | 2.3x | 2.8x | 3.3x | 3.4x | 3.4x |
| LOG10 | 1.8x | 2.1x | 2.1x | 2.1x | 2.1x |
| LN | 1.7x | 1.8x | 1.8x | 1.9x | 1.9x |

Key observations:

  • SIN/COS/TAN hit 6x at 100k+ elements and hold steady through 1M
  • Gains are already measurable at 100 elements (1.6x-2.8x), meaning even small tick-by-tick calls benefit
  • Speedup plateaus around 100k — the pipeline is fully saturated at that point

Absolute Times at 1M Elements

...in wall-clock terms, on an M1 laptop (2019, 16 GB)

| Function | Scalar | Accelerate | Saved |
|---|---|---|---|
| ACOS | 10.4ms | 1.8ms | 8.6ms |
| SIN | 7.4ms | 1.2ms | 6.3ms |
| COS | 7.4ms | 1.2ms | 6.2ms |
| ASIN | 7.7ms | 1.3ms | 6.4ms |
| TAN | 7.3ms | 1.2ms | 6.0ms |
| ATAN | 7.4ms | 1.5ms | 5.9ms |
| TANH | 7.0ms | 1.7ms | 5.3ms |
| EXP | 6.0ms | 1.3ms | 4.8ms |
| SINH | 4.7ms | 1.4ms | 3.3ms |
| COSH | 4.5ms | 1.3ms | 3.2ms |
| LOG10 | 3.1ms | 1.5ms | 1.6ms |
| LN | 2.8ms | 1.5ms | 1.3ms |

Arithmetic Functions (vDSP)

These are neutral-to-marginal across all sizes. The compiler's auto-vectorization matches vDSP for simple operations.

| Function | 100 | 1,000 | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|---|---|
| ADD | 1.0x | 1.0x | 1.0x | 1.2x | 1.0x |
| SUB | ~1x | 1.0x | 1.0x | 1.0x | 1.0x |
| MULT | 1.1x | 1.0x | 1.0x | 1.0x | 1.2x |
| SQRT | ~1x | 1.0x | 1.0x | 1.0x | 1.0x |

I left this benchmarking stuff in for you to see how I did it, but I don't really mean to commit that stuff!

@mrjbq7
Member

mrjbq7 commented Apr 5, 2026

The comments above were inline code-review comments.

…oolchain

Replace ios.toolchain.cmake with CMake's built-in iOS/simulator cross-compilation
(CMAKE_SYSTEM_NAME=iOS, CMAKE_OSX_SYSROOT, Xcode generator). Update xcframework
script for Xcode generator output paths. Remove ta_bench target.
Compare ErrorNumber return value against TA_TEST_PASS instead of
TA_SUCCESS to avoid -Wenum-compare between different enumeration types.
