I found that MLAPreprocess accelerates w8a8_int8 very well. Will this operator support bf16/fp16 in the future? Some MoE models may not quantize the weights in their attention layers.
https://github.com/sgl-project/sglang/blob/a2423052f6673256e3f7e2d8a946893f3653cb5d/python/sglang/srt/layers/attention/npu_ops/mla_preprocess.py#L386-L393
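For context, here is a minimal sketch of the kind of dtype gating I have in mind. The function name and dispatch labels are hypothetical illustrations, not the actual sglang code behind the link:

```python
import torch

def select_mla_preprocess_path(attn_weight: torch.Tensor) -> str:
    """Hypothetical dispatch: the fused MLAPreprocess kernel is only taken
    when attention weights are int8-quantized (w8a8_int8); unquantized
    bf16/fp16 weights would fall back to an unfused path."""
    if attn_weight.dtype == torch.int8:
        return "fused_mla_preprocess"   # current fast path
    if attn_weight.dtype in (torch.bfloat16, torch.float16):
        return "unfused_fallback"       # the case this request asks to accelerate
    raise ValueError(f"unsupported weight dtype: {attn_weight.dtype}")


# Example: an unquantized bf16 attention weight misses the fused kernel today.
print(select_mla_preprocess_path(torch.zeros(4, 4, dtype=torch.bfloat16)))
# -> "unfused_fallback"
```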