[Question]: MInference pre-filling is slower than the original vLLM version #18

@junior-zsy

Describe the issue

Code:

# Copyright (c) 2024 Microsoft
# Licensed under The MIT License [see LICENSE for details]

import logging  # required by read_content_from_file below
import time

from vllm import LLM, SamplingParams

from minference import MInference

def read_content_from_file(file_path, num_chars=5000):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read(num_chars)
        return content
    except FileNotFoundError:
        logging.error(f"File {file_path} not found.")
        return ""
    except Exception as e:
        logging.error(f"An error occurred while reading the file: {e}")
        return ""

# The appended instruction is Chinese for ", please summarize the plot of the story above."
content = read_content_from_file("./question.txt", 12000) + ",请总结上面的故事梗概。"

# Send the same prompt 50 times.
prompts = [content] * 50

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1,
)
model_name = "/xxx/model/Qwen2-7B-Instruct"
llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
    tensor_parallel_size=1,
    max_model_len=128000,
)



start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"vllm Generating text took {elapsed_time:.2f} seconds.")



# Patch MInference Module
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

start_time = time.time()
outputs = llm.generate(prompts, sampling_params)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"minference Generating text took {elapsed_time:.2f} seconds.")

Results of execution:

Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:41<00:00, 1.22it/s]
vllm Generating text took 41.57 seconds.
Patched model for minference with vLLM..
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [01:34<00:00, 1.90s/it]
minference Generating text took 95.37 seconds.
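
For reference, the two totals above can be converted into an average per-prompt latency and a slowdown ratio. The snippet below is a minimal sketch that only restates the arithmetic on the reported numbers; the helper names are illustrative and are not part of vLLM or MInference.

```python
# Illustrative helpers (hypothetical names, not part of vLLM or MInference).

def per_prompt_latency(total_seconds: float, num_prompts: int) -> float:
    """Average wall-clock seconds spent per prompt."""
    if num_prompts <= 0:
        raise ValueError("num_prompts must be positive")
    return total_seconds / num_prompts


def slowdown(baseline_seconds: float, patched_seconds: float) -> float:
    """How many times slower the patched run is than the baseline run."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return patched_seconds / baseline_seconds


# Values copied from the output above.
NUM_PROMPTS = 50
VLLM_SECONDS = 41.57
MINFERENCE_SECONDS = 95.37

print(f"vLLM:       {per_prompt_latency(VLLM_SECONDS, NUM_PROMPTS):.2f} s/prompt")
print(f"MInference: {per_prompt_latency(MINFERENCE_SECONDS, NUM_PROMPTS):.2f} s/prompt")
print(f"Slowdown:   {slowdown(VLLM_SECONDS, MINFERENCE_SECONDS):.2f}x")
```

With the numbers reported here, this works out to roughly 0.83 s per prompt for vLLM and 1.91 s per prompt after patching, so the patched run is about 2.3 times slower.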

Why is MInference slower than the original vLLM?