Return the last token's logprobs, logits and last_hidden_states if include_stop_str_in_output is requested (#4000)
Merged
lvhan028 merged 1 commit into InternLM:main on Sep 22, 2025
Conversation
Serve the model with raw logprobs enabled:

```shell
lmdeploy serve api_server Qwen/Qwen3-8B --backend pytorch --logprobs-mode raw_logprobs
```

Then query it with the OpenAI client:

```python
from openai import OpenAI

client = OpenAI(api_key='11', base_url='http://0.0.0.0:23333/v1/')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': "Hello!",
    }],
    temperature=0.8,
    top_p=0.8,
    logprobs=True,
    top_logprobs=1,
    stream=False,
    extra_body={
        "include_stop_str_in_output": True,
        "return_token_ids": True,
    })
logprobs = []
for item in response.choices[0].logprobs.content:
    logprobs.append(item.logprob)
print(len(logprobs), logprobs)
print(len(response.choices[0].message.gen_tokens), response.choices[0].message.gen_tokens)
print(response)
```

The logprobs of |
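Since this PR makes the server return the stop token's logprob when `include_stop_str_in_output` is set, the `gen_tokens` list and the collected logprobs should line up one-to-one. A minimal offline sketch of that pairing check (the helper name and the dummy token ids/values below are illustrative assumptions, not part of lmdeploy):

```python
def pair_tokens_with_logprobs(gen_tokens, logprobs):
    # With include_stop_str_in_output=True, the final stop token's
    # logprob is returned as well, so both lists should align 1:1.
    if len(gen_tokens) != len(logprobs):
        raise ValueError(f'length mismatch: {len(gen_tokens)} token ids '
                         f'vs {len(logprobs)} logprobs')
    return list(zip(gen_tokens, logprobs))

# Dummy values standing in for response.choices[0].message.gen_tokens
# and the per-token logprobs gathered from the response above.
pairs = pair_tokens_with_logprobs([9707, 0, 151645], [-0.12, -0.05, -0.8])
```

Without this fix, the stop token would be present in `gen_tokens` but absent from the logprobs, and a check like the one above would raise.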
irexyc pushed a commit to irexyc/lmdeploy that referenced this pull request on Sep 23, 2025
lvhan028 added a commit that referenced this pull request on Nov 19, 2025
* use driver flag
* update
* accurate mask iter
* use fast divmod
* remove cp_O
* remove unused
* return the last token's logprobs if include_stop_str_in_output is requested (#4000)
* [Fix] device args in chat cli when using pytorch engine (#3999)
* [Fix] device args in chat cli when using pytorch engine
* [Fix] change device into device_type in chat cli
* fix NULL raw data
* add attn_cp_size to cli
* build cutlass::FastDivmod on host
* use single buffer
* update comm
* use two stage reduce
* remove unused
* better AllreduceResidualRMSnorm
* fix max_session_len
* update docs
* fix embedding/lm_head split
* use same split_k on different cp_rank
* always use separate reduce for cp
* add cp configuration parameter
* remove redundant parameters
* remove redundant parameters
* fix build
* fix xgrammar build
* update docs
* remove unused
* fix test_attention
* unify attn split_k reduction w/ w/o cp
* fix nccl found
* update reduce
* fix windows build
* remove print
* revert is_driver_
* prevent create new allocator
* use Store to write partial_ML
* use expressive names
* use cdiv
* remove separate_reduce
* apply attention sink on cp_rank0
* move cp_utils.* to kernels/attention
* update cli description

Co-authored-by: Lyu Han <lvhan_028@163.com>
Co-authored-by: CyCle1024 <chenchiyu@pjlab.org.cn>