It seems your method first filters the "top-k" samples by reward and then computes the advantages. Would the order of these two steps (filtering before versus after computing advantages) affect training?
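
To make the concern concrete, here is a minimal sketch of why the order could matter. It assumes the advantages are mean/std-normalized within a group of samples, which is my assumption and may not match the paper's actual estimator. The function names and reward values are hypothetical.

```python
from statistics import mean, pstdev


def normalize(values):
    """Mean/std-normalized advantages over a group (assumed form, not taken from the paper)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + 1e-8) for v in values]


def top_k_indices(rewards, k):
    """Indices of the k highest-reward samples."""
    return sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)[:k]


def filter_then_advantage(rewards, k):
    # Keep the top-k samples first, then normalize within the kept subset only.
    top = top_k_indices(rewards, k)
    adv = normalize([rewards[i] for i in top])
    return dict(zip(top, adv))


def advantage_then_filter(rewards, k):
    # Normalize over the full group first, then keep the top-k samples.
    adv = normalize(rewards)
    return {i: adv[i] for i in top_k_indices(rewards, k)}


rewards = [0.0, 1.0, 2.0, 3.0]  # hypothetical rewards for one prompt's samples
a = filter_then_advantage(rewards, k=2)
b = advantage_then_filter(rewards, k=2)
print(a)  # sample 2 is below the mean of the kept subset, so its advantage is negative
print(b)  # sample 2 is above the mean of the full group, so its advantage is positive
```

Under this assumed normalization, the second-best sample is pushed down when filtering comes first and pushed up when it comes second, so the two orderings would give gradients of opposite sign for that sample. It would help if the authors stated which ordering is used and whether the alternative was tried.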