feat: add Apple platform support #85

Open
ralph-e-boy wants to merge 8 commits into TA-Lib:main from ralph-e-boy:apple-accellerate-framework
Conversation

@ralph-e-boy

I wonder if this will all be moot if you're really porting the whole thing to Rust, but anyway, I did this before seeing the migration plan doc in the repo.

Apple Platform Support: iOS/macOS Toolchain + Accelerate Framework SIMD Optimizations

This PR adds first-class Apple platform support to TA-Lib: cross-compilation for iOS/Simulator/macOS via a CMake toolchain, plus vectorized implementations of 16 TA functions using Apple's Accelerate framework (vDSP/vForce) for up to 6x throughput on Apple Silicon.


iOS / macOS Build Support

  • CMake toolchain file (cmake/ios.toolchain.cmake) for iOS, Simulator, and macOS cross-compilation
  • Platform detection (IOS/MACOS variables) with automatic dev tool disabling for iOS
  • XCFramework packaging script (build-xcframework.sh) for creating universal static libraries
  • Build script (scripts/build-ios.sh) for all three slices in one command
  • Xcode deployment targets: iOS 12.0, macOS 10.15
```sh
# Build all platforms
scripts/build-ios.sh

# Package into XCFramework
./build-xcframework.sh
```
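For reference, the scripts wrap configure/build invocations roughly along these lines. This is a sketch using CMake's built-in iOS cross-compilation support (`CMAKE_SYSTEM_NAME=iOS`, `CMAKE_OSX_SYSROOT`, Xcode generator, as a later commit in this PR adopts); the exact cache variables the scripts pass may differ:

```sh
# Device slice: CMake's built-in iOS support selects the SDK and target triple
cmake -B build-ios -G Xcode \
  -DCMAKE_SYSTEM_NAME=iOS \
  -DCMAKE_OSX_DEPLOYMENT_TARGET=12.0
cmake --build build-ios --config Release

# Simulator slice: same, but against the simulator SDK
cmake -B build-sim -G Xcode \
  -DCMAKE_SYSTEM_NAME=iOS \
  -DCMAKE_OSX_SYSROOT=iphonesimulator \
  -DCMAKE_OSX_DEPLOYMENT_TARGET=12.0
cmake --build build-sim --config Release
```

The resulting static libraries are what `build-xcframework.sh` bundles into the universal XCFramework.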

Accelerate Framework Optimizations

16 TA functions are vectorized using vecLib's vDSP (SIMD arithmetic) and vForce (SIMD transcendentals) when building on Apple platforms, controlled by the TA_USE_ACCELERATE CMake option (default ON).

vForce-optimized (single-input transcendentals):
ACOS, ASIN, ATAN, COS, COSH, EXP, LN, LOG10, SIN, SINH, SQRT, TAN, TANH

vDSP-optimized (dual-input arithmetic):
ADD, SUB, MULT

All optimizations are guarded by #if defined(TA_USE_ACCELERATE) && !defined(USE_SINGLE_PRECISION_INPUT) and live in hand-written sections that survive gen_code re-runs. The double-precision path is optimized; float variants fall back to scalar code.

Dispatch macros in ta_veclib.h keep each call site to one line:

```c
ACCEL_VFORCE_1IN(vvacos)        // 13 vForce functions
ACCEL_VDSP_2IN(vDSP_vaddD)      // ADD, MULT
ACCEL_VDSP_2IN_SWAP(vDSP_vsubD) // SUB (swapped operands: vDSP_vsubD computes B-A)
```

Benchmark Results

All benchmarks were run on Apple Silicon (M-series), with 10,000 iterations per function per input size (1,140,000 total samples).

Speedup at 10,000 elements (mean, n=10000)

| Function | Scalar (µs) | Accelerate (µs) | Speedup | StdDev (S / A) |
|---|---|---|---|---|
| SIN | 65.90 | 11.72 | 5.6x | ±4.87 / ±1.00 |
| TAN | 62.96 | 12.38 | 5.1x | ±3.63 / ±0.30 |
| ASIN | 63.44 | 13.07 | 4.9x | ±4.39 / ±0.71 |
| ACOS | 81.67 | 18.07 | 4.5x | ±5.68 / ±0.29 |
| COS | 57.09 | 12.61 | 4.5x | ±2.47 / ±1.33 |
| ATAN | 65.55 | 15.23 | 4.3x | ±1.97 / ±0.64 |
| EXP | 51.10 | 13.07 | 3.9x | ±2.99 / ±0.30 |
| TANH | 62.10 | 17.22 | 3.6x | ±3.93 / ±0.25 |
| COSH | 41.10 | 13.11 | 3.1x | ±1.33 / ±0.42 |
| SINH | 43.79 | 14.07 | 3.1x | ±1.62 / ±1.31 |
| LOG10 | 30.08 | 15.06 | 2.0x | ±0.82 / ±0.78 |
| LN | 27.07 | 15.36 | 1.8x | ±0.84 / ±1.89 |
| ADD | 2.58 | 2.42 | 1.1x | ±0.54 / ±1.24 |
| MULT | 2.57 | 2.29 | 1.1x | ±0.75 / ±0.79 |
| SUB | 2.52 | 2.55 | 1.0x | ±0.18 / ±0.34 |
| SQRT | 3.26 | 3.42 | 1.0x | ±0.40 / ±1.57 |

Trig and transcendental functions see 1.8x-5.6x speedups via vForce NEON SIMD. ADD, SUB, MULT, and SQRT are neutral -- the compiler's auto-vectorization already matches vDSP for simple arithmetic. They're kept on the Accelerate path because they show no regression.

Methodology

Output was compared from two benchmark binaries built from the same source, one with TA_USE_ACCELERATE enabled and one without. Each binary ran all 18 functions at 100/1,000/10,000 elements with 10 warmup iterations followed by N timed iterations.

```sh
scripts/run_bench.sh 10000  # writes bench_results.db
```

What we tested and held back

The following functions did not benchmark as improvements, so they stay on the scalar path. Details on why:

| Candidate | Approach | Outcome | Reason |
|---|---|---|---|
| DIV | vDSP_vdivD | 0.6x at 1k (n=1000) | vDSP_vdivD's operand-swap overhead plus function-call cost exceeds the compiler's tight NEON fdiv loop |
| CEIL | vvceil | 0.8x at 10k (n=10000) | Compiler emits a single frintp instruction; vForce call overhead makes it slower |
| FLOOR | vvfloor | 0.9x at 10k (n=10000) | Compiler emits a single frintm instruction; same overhead issue as CEIL |
| LINEARREG (5 funcs) | vDSP_sveD + vDSP_dotprD per window | Slower (12.9 vs 17.2µs) | Two vDSP calls per window position; call overhead exceeds SIMD savings for typical periods (14-30) |
| STDDEV | vvsqrt over output array | Slower (4.5 vs 5.5µs) | Extra clamping pass before vectorized sqrt adds overhead the branch-predicted scalar loop avoids |
| CORREL | vDSP_dotprD for initial sums | Slower (5.7 vs 6.2µs) | Same per-window call-overhead pattern |
| HT_SINE/TRENDMODE/DCPHASE | __sincos() for paired sin/cos | No improvement | Clang already fuses paired sin/cos at -O2 on Apple Silicon |

Key insight: Accelerate wins when replacing an entire O(n) loop with a single O(n) vForce/vDSP call. It loses when adding O(1) function calls inside an O(n) outer loop.


Profiling Infrastructure Fixes

The ta_regtest -p profiling mode was broken on Apple platforms with Accelerate enabled. The root cause: clock() has roughly microsecond granularity, and vForce-optimized functions on small inputs (100 elements) complete faster than one clock tick, producing clockDelta == 0, which was treated as fatal error 612.

Fix (the symptom was a message that the clock wasn't precise enough for benchmarking when running ta_regtest -p):

  • Replaced clock() with mach_absolute_time() on Apple (nanosecond precision)
  • Extracted platform timer logic into macros (TIMER_DECL, TIMER_START, TIMER_STOP, TIMER_TICKS_TO_MS, etc.) in ta_test_priv.h, eliminating 8 duplicated #ifdef blocks across 3 files
  • Changed zero-delta handling from fatal error to graceful flag (matching test_util.c's existing pattern)

@mrjbq7
Member

mrjbq7 commented Apr 5, 2026

Neat idea, looks to be minimal maintenance overhead. Had some comments above. Thanks!

@mrjbq7
Member

mrjbq7 commented Apr 5, 2026

I would also like to see the impact on smaller (100) and larger (100k, 1mm) inputs.

@ralph-e-boy
Author

Here you go on test sizes. Not sure I saw the other comments above besides the 'neat idea' one; were there more than that?

100 to 1mm Elements

1,000 iterations per function per input size (190,000 total samples)

Trig / Transcendental Functions (vForce)

| Function | 100 | 1,000 | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|---|---|
| SIN | 2.4x | 2.5x | 5.5x | 6.5x | 6.4x |
| COS | 2.1x | 2.2x | 4.5x | 6.0x | 6.0x |
| TAN | 2.8x | 3.1x | 5.1x | 6.0x | 5.9x |
| ASIN | 2.7x | 3.0x | 4.8x | 5.9x | 6.0x |
| ACOS | 1.8x | 2.2x | 5.2x | 6.2x | 5.8x |
| ATAN | 2.6x | 3.0x | 4.5x | 5.0x | 5.0x |
| EXP | 1.6x | 1.9x | 4.1x | 4.8x | 4.7x |
| TANH | 2.2x | 2.5x | 3.4x | 4.1x | 4.1x |
| COSH | 2.4x | 2.8x | 3.2x | 3.5x | 3.5x |
| SINH | 2.3x | 2.8x | 3.3x | 3.4x | 3.4x |
| LOG10 | 1.8x | 2.1x | 2.1x | 2.1x | 2.1x |
| LN | 1.7x | 1.8x | 1.8x | 1.9x | 1.9x |

Key observations:

  • SIN/COS/TAN hit 6x at 100k+ elements and hold steady through 1M
  • Gains are already measurable at 100 elements (1.6x-2.8x), meaning even small tick-by-tick calls benefit
  • Speedup plateaus around 100k — the pipeline is fully saturated at that point

Absolute Times at 1M Elements

...in wall-clock terms, on an M1 laptop (2019, 16 GB)

| Function | Scalar | Accelerate | Saved |
|---|---|---|---|
| ACOS | 10.4ms | 1.8ms | 8.6ms |
| SIN | 7.4ms | 1.2ms | 6.3ms |
| COS | 7.4ms | 1.2ms | 6.2ms |
| ASIN | 7.7ms | 1.3ms | 6.4ms |
| TAN | 7.3ms | 1.2ms | 6.0ms |
| ATAN | 7.4ms | 1.5ms | 5.9ms |
| TANH | 7.0ms | 1.7ms | 5.3ms |
| EXP | 6.0ms | 1.3ms | 4.8ms |
| SINH | 4.7ms | 1.4ms | 3.3ms |
| COSH | 4.5ms | 1.3ms | 3.2ms |
| LOG10 | 3.1ms | 1.5ms | 1.6ms |
| LN | 2.8ms | 1.5ms | 1.3ms |

Arithmetic Functions (vDSP)

These are neutral-to-marginal across all sizes. The compiler's auto-vectorization matches vDSP for simple operations.

| Function | 100 | 1,000 | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|---|---|
| ADD | 1.0x | 1.0x | 1.0x | 1.2x | 1.0x |
| SUB | ~1x | 1.0x | 1.0x | 1.0x | 1.0x |
| MULT | 1.1x | 1.0x | 1.0x | 1.0x | 1.2x |
| SQRT | ~1x | 1.0x | 1.0x | 1.0x | 1.0x |

I left this benchmarking stuff in for you to see how I did it, but I don't really mean to commit that stuff!

@mrjbq7
Member

mrjbq7 commented Apr 5, 2026

The comments above were inline code-review comments.

…oolchain

Replace ios.toolchain.cmake with CMake's built-in iOS/simulator cross-compilation
(CMAKE_SYSTEM_NAME=iOS, CMAKE_OSX_SYSROOT, Xcode generator). Update xcframework
script for Xcode generator output paths. Remove ta_bench target.
Compare ErrorNumber return value against TA_TEST_PASS instead of
TA_SUCCESS to avoid -Wenum-compare between different enumeration types.
