To understand the difference in tool-calling performance between vendors, and to choose a vendor with more confidence, could you explain the potential reasons for it?
Is there internal nondeterminism that leads to vendor-specific differences in the output logits even when the same model weights are used? Are there differences in sampling or schema validation? Or do different vendors serve different model weights or quantizations?
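
One way to narrow this down from the user side might be to bucket each vendor's tool calls by failure mode. Below is a minimal sketch, assuming a hypothetical tool with two required arguments, `city` (string) and `days` (integer). The names `REQUIRED_ARGS`, `classify_tool_call`, and `summarize` are illustrative only and are not part of any vendor's API.

```python
import json
from collections import Counter

# Hypothetical tool schema, for illustration only:
# required argument names mapped to the Python type expected after JSON parsing.
REQUIRED_ARGS = {"city": str, "days": int}


def classify_tool_call(raw_arguments: str) -> str:
    """Bucket one tool call's raw argument string by failure mode."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        # The model (or the vendor's decoding) produced text that is not JSON.
        return "invalid_json"
    if not isinstance(args, dict):
        # Valid JSON, but not an object of named arguments.
        return "not_an_object"
    for name, expected_type in REQUIRED_ARGS.items():
        if name not in args:
            return "missing_argument"
        if not isinstance(args[name], expected_type):
            return "wrong_type"
    return "valid"


def summarize(raw_calls: list[str]) -> Counter:
    """Count how many calls fall into each bucket for one vendor."""
    return Counter(classify_tool_call(raw) for raw in raw_calls)
```

Running `summarize` over the raw argument strings returned by each vendor for the same prompts gives one histogram per vendor. If one vendor's failures were mostly `invalid_json`, that might suggest a decoding or sampling difference; mostly `missing_argument` or `wrong_type` might suggest that schema validation is not enforced. This does not by itself distinguish quantization from nondeterminism.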