feat: add Apple platform support #85
Open
ralph-e-boy wants to merge 8 commits into
Member
Neat idea, looks to be minimal maintenance overhead. Had some comments above. Thanks!
Member
I would also like to see the impact on smaller (100) and larger (100k, 1M) inputs.
Author
Here you go on test sizes. I'm not sure I saw the other comments above besides the 'neat idea' one; were there more than that?
100 to 1M elements, 1000 iterations per function per input size, 190,000 total samples
Trig / Transcendental Functions (vForce)
Key observations:
Absolute Times at 1M Elements, in wall-clock terms, on an M1 laptop (2019, 16 GB)
Arithmetic Functions (vDSP)
These are neutral-to-marginal across all sizes. The compiler's auto-vectorization matches vDSP for simple operations.
I left this benchmarking stuff in for you to see how I did it, but I don't really mean to commit that stuff!
Member
The comments above were inline code review comments.
…oolchain Replace ios.toolchain.cmake with CMake's built-in iOS/simulator cross-compilation (CMAKE_SYSTEM_NAME=iOS, CMAKE_OSX_SYSROOT, Xcode generator). Update xcframework script for Xcode generator output paths. Remove ta_bench target.
Compare ErrorNumber return value against TA_TEST_PASS instead of TA_SUCCESS to avoid -Wenum-compare between different enumeration types.
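To illustrate the kind of mismatch this commit describes, here is a minimal sketch. The enum and function names (`ret_code`, `test_error`, `run_check`, `check_passed`) are hypothetical stand-ins, not the actual TA-Lib declarations; the only point is that two distinct enum types can share a zero "success" value and still warn when compared against each other.

```c
#include <assert.h>

/* Hypothetical stand-in for a library return-code enum. */
typedef enum { RC_SUCCESS = 0, RC_BAD_PARAM = 2 } ret_code;

/* Hypothetical stand-in for a test-harness error enum. */
typedef enum { TEST_PASS = 0, TEST_FAIL = 1 } test_error;

/* Pretend test routine: reports its outcome in the test-harness enum. */
static test_error run_check(int ok)
{
    return ok ? TEST_PASS : TEST_FAIL;
}

/* Returns 1 when the check passed, 0 otherwise. */
static int check_passed(int ok)
{
    test_error err = run_check(ok);

    /* Writing `err == RC_SUCCESS` would compile and, in this sketch, give
     * the same answer because both constants are 0, but it mixes two enum
     * types and can trigger -Wenum-compare. Comparing against a constant of
     * the variable's own type keeps the build warning-free. */
    return err == TEST_PASS;
}
```

Comparing against the matching enum's constant also keeps the check correct if the two enums ever stop sharing the same numeric value.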
I wonder if this will all be moot if you're really porting the whole thing to Rust, but anyway, I did this before seeing the migration plan doc in the repo.
Apple Platform Support: iOS/macOS Toolchain + Accelerate Framework SIMD Optimizations
This PR adds first-class Apple platform support to TA-Lib: cross-compilation for iOS/Simulator/macOS via a CMake toolchain, and vectorized implementations of 16 TA functions using Apple's Accelerate framework (vDSP/vForce), for up to 6x throughput on Apple Silicon.
iOS / macOS Build Support
- `cmake/ios.toolchain.cmake`: iOS, Simulator, and macOS cross-compilation
- `IOS`/`MACOS` variables, with automatic dev tool disabling for iOS
- `build-xcframework.sh`: creates universal static libraries
- `scripts/build-ios.sh`: builds all three slices in one command
Accelerate Framework Optimizations
16 TA functions are vectorized using vecLib's vDSP (SIMD arithmetic) and vForce (SIMD transcendentals) when building on Apple platforms. This is controlled by the `TA_USE_ACCELERATE` CMake option (default ON).
vForce-optimized (single-input transcendentals):
ACOS, ASIN, ATAN, COS, COSH, EXP, LN, LOG10, SIN, SINH, SQRT, TAN, TANH
vDSP-optimized (dual-input arithmetic):
ADD, SUB, MULT
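As a rough illustration of how one of these functions might route to Accelerate while keeping a scalar fallback, here is a minimal sketch. The helper names `sketch_add` and `sketch_mult` are hypothetical, and the extra `__APPLE__` check is an assumption added here so the sketch also compiles on other platforms; this is not the PR's actual `ta_veclib.h` code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: __APPLE__ is checked here only so the sketch builds elsewhere. */
#if defined(__APPLE__) && defined(TA_USE_ACCELERATE)
#include <Accelerate/Accelerate.h>
#define SKETCH_HAVE_ACCELERATE 1
#else
#define SKETCH_HAVE_ACCELERATE 0
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
static void sketch_add(const double *a, const double *b, double *out, int n)
{
    assert(a != NULL && b != NULL && out != NULL && n >= 0);
#if SKETCH_HAVE_ACCELERATE
    /* One vDSP call covers the whole array (unit strides). */
    vDSP_vaddD(a, 1, b, 1, out, 1, (vDSP_Length)n);
#else
    /* Scalar fallback; compilers can often auto-vectorize this loop. */
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
#endif
}

/* Hypothetical helper: out[i] = a[i] * b[i] for i in [0, n). */
static void sketch_mult(const double *a, const double *b, double *out, int n)
{
    assert(a != NULL && b != NULL && out != NULL && n >= 0);
#if SKETCH_HAVE_ACCELERATE
    vDSP_vmulD(a, 1, b, 1, out, 1, (vDSP_Length)n);
#else
    for (int i = 0; i < n; i++)
        out[i] = a[i] * b[i];
#endif
}
```

A single-input vForce function would follow the same shape, for example calling `vvsin(out, in, &n)` in the Accelerate branch and looping over `sin()` in the fallback.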
All optimizations are guarded by `#if defined(TA_USE_ACCELERATE) && !defined(USE_SINGLE_PRECISION_INPUT)` and live in hand-written sections that survive `gen_code` re-runs. The double-precision path is optimized; float variants fall back to scalar code. Dispatch macros in `ta_veclib.h` keep each call site to one line.
Benchmark Results
All benchmarks were run on Apple Silicon (M-series), 10,000 iterations per function per input size (1,140,000 total samples).
Speedup at 10,000 elements (mean, n=10000)
Trig and transcendental functions see 1.8x to 5.6x speedups via vForce NEON SIMD. ADD, SUB, MULT, and SQRT are neutral: the compiler's auto-vectorization already matches Accelerate for these simple operations. They're kept on the Accelerate path because they show no regression.
Methodology
Results compare the output of two benchmark binaries built from the same source, one with `TA_USE_ACCELERATE` enabled and one without. Each binary ran all 18 functions at 100/1000/10000 elements with 10 warmup iterations followed by N timed iterations.
    scripts/run_bench.sh 10000 # writes bench_results.db
What we tested and held back
These functions did not benchmark as improvements, so they stay off the Accelerate path. Details on why:
- `vDSP_vdivD`: … `fdiv` loop
- `vvceil`: … `frintp` instruction; vForce call overhead makes it slower
- `vvfloor`: … `frintm` instruction; same overhead issue as CEIL
- `vDSP_sveD` + `vDSP_dotprD` per window
- `vvsqrt` over output array
- `vDSP_dotprD` for initial sums
- `__sincos()` for paired sin/cos
Key insight: Accelerate wins when replacing an entire O(n) loop with a single O(n) vForce/vDSP call. It loses when adding O(1) function calls inside an O(n) outer loop.
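The two call patterns in the key insight can be sketched as follows. Everything here is hypothetical: `kernel_scale` and `kernel_sum` are plain-C stand-ins for a library routine, and the counter only shows how many times the "library" is entered, which is where per-call overhead accumulates.

```c
#include <assert.h>
#include <stddef.h>

/* Counts entries into the stand-in kernels (each models one library call). */
static long g_kernel_calls = 0;

/* Stand-in for a whole-array vectorized routine: out[i] = k * in[i]. */
static void kernel_scale(const double *in, double *out, int n, double k)
{
    g_kernel_calls++;
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Stand-in for a vectorized reduction: sum of n doubles. */
static double kernel_sum(const double *in, int n)
{
    double s = 0.0;
    g_kernel_calls++;
    for (int i = 0; i < n; i++)
        s += in[i];
    return s;
}

/* Pattern that tends to win: one call replaces the whole O(n) loop.
 * Returns the number of kernel calls made. */
static long whole_array_scale(const double *in, double *out, int n)
{
    long before = g_kernel_calls;
    kernel_scale(in, out, n, 2.0);
    return g_kernel_calls - before;
}

/* Pattern that tends to lose: a small call per window inside an O(n) loop.
 * Writes n - period + 1 window sums; returns the number of kernel calls. */
static long rolling_sum(const double *in, double *out, int n, int period)
{
    long before = g_kernel_calls;
    assert(period > 0 && period <= n);
    for (int i = 0; i + period <= n; i++)
        out[i] = kernel_sum(in + i, period);
    return g_kernel_calls - before;
}
```

With 8 elements and a period of 3, the rolling version enters the kernel 6 times against 1 for the whole-array version, so any fixed per-call cost is paid once per output element instead of once per array.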
Profiling Infrastructure Fixes
The `ta_regtest -p` profiling mode was broken on Apple platforms with Accelerate enabled. The root cause: `clock()` has ~microsecond granularity, and vForce-optimized functions on small inputs (100 elements) complete faster than one clock tick, producing `clockDelta == 0`, which is treated as fatal error 612. When running `ta_regtest` with `-p`, I got the message that the clock wasn't precise enough for benchmarking.
Fix:
- Replace `clock()` with `mach_absolute_time()` on Apple (nanosecond precision)
- Timer macros (`TIMER_DECL`, `TIMER_START`, `TIMER_STOP`, `TIMER_TICKS_TO_MS`, etc.) in `ta_test_priv.h`, eliminating 8 duplicated `#ifdef` blocks across 3 files
- … (… `test_util.c`'s existing pattern)
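The timer change and the warmup-then-timed methodology can be sketched together as below. This is a minimal illustration under stated assumptions, not the PR's `TIMER_*` macros: `now_ns`, `bench_mean_ns`, `sum_ctx`, and `sum_workload` are hypothetical names, the non-Apple branch uses `clock_gettime(CLOCK_MONOTONIC)` as a stand-in, and the sketch times the whole batch rather than recording one sample per iteration.

```c
/* Platform block comes first so the feature-test macro takes effect. */
#if defined(__APPLE__)
#include <mach/mach_time.h>
#else
#define _POSIX_C_SOURCE 199309L
#include <time.h>
#endif

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
#if defined(__APPLE__)
    static mach_timebase_info_data_t tb;
    if (tb.denom == 0)
        mach_timebase_info(&tb);      /* ticks -> ns conversion factors */
    return mach_absolute_time() * tb.numer / tb.denom;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

typedef void (*bench_fn)(void *ctx);

/* Runs `warmup` untimed calls, then `iters` timed calls, and returns the
 * mean nanoseconds per call. Timing the batch as a whole avoids a zero
 * delta when a single call is shorter than the clock resolution. */
static double bench_mean_ns(bench_fn fn, void *ctx, int warmup, int iters)
{
    assert(fn != NULL && warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; i++)
        fn(ctx);
    uint64_t start = now_ns();
    for (int i = 0; i < iters; i++)
        fn(ctx);
    uint64_t stop = now_ns();
    return (double)(stop - start) / (double)iters;
}

/* Hypothetical workload: sums a small array into a volatile sink. */
typedef struct {
    const double *data;
    int n;
    volatile double sink;
} sum_ctx;

static void sum_workload(void *p)
{
    sum_ctx *c = (sum_ctx *)p;
    double s = 0.0;
    for (int i = 0; i < c->n; i++)
        s += c->data[i];
    c->sink = s;
}
```

A call such as `bench_mean_ns(sum_workload, &ctx, 10, 10000)` mirrors the 10 warmup iterations and N timed iterations described under Methodology.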