Merged
Changes from 2 commits
2 changes: 1 addition & 1 deletion common/arg.cpp
@@ -2520,7 +2520,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
));
add_opt(common_arg(
{"-a", "--alias"}, "STRING",
"set alias for model name (to be used by REST API)",
"set model name alias, comma-separated for multiple aliases (to be used by API)",
[](common_params & params, const std::string & value) {
params.model_alias = value;
}
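For illustration, the comma-separated behavior this flag change implies can be sketched in standalone C++. The `strip` and `parse_alias` helpers below are hypothetical stand-ins for llama.cpp's `string_strip`/`string_split`; this is a sketch of the parsing rule (first entry is the canonical name, remaining non-empty entries are aliases), not the actual implementation:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Trim leading/trailing whitespace (stand-in for string_strip).
static std::string strip(const std::string & s) {
    const char * ws = " \t\r\n";
    size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Split a comma-separated --alias value: the first entry becomes the
// canonical model name, remaining non-empty entries become extra aliases.
static void parse_alias(const std::string & value,
                        std::string & model_name,
                        std::vector<std::string> & model_aliases) {
    std::vector<std::string> parts;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        parts.push_back(item);
    }
    model_name = parts.empty() ? "" : strip(parts[0]);
    for (size_t i = 1; i < parts.size(); i++) {
        std::string alias = strip(parts[i]);
        if (!alias.empty()) {
            model_aliases.push_back(alias);
        }
    }
}
```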
4 changes: 2 additions & 2 deletions tools/cli/README.md
@@ -57,8 +57,8 @@
| `-dt, --defrag-thold N` | KV cache defragmentation threshold (DEPRECATED)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
| `-np, --parallel N` | number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
| `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
| `--mmap, --no-mmap` | whether to memory-map model. Explicitly enabling mmap disables direct-io. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
| `-dio, --direct-io, -ndio, --no-direct-io` | use DirectIO if available. Takes precedence over --mmap (default: enabled)<br/>(env: LLAMA_ARG_DIO) |
| `--mmap, --no-mmap` | whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
| `-dio, --direct-io, -ndio, --no-direct-io` | use DirectIO if available. (default: disabled)<br/>(env: LLAMA_ARG_DIO) |
| `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
| `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
| `--list-devices` | print list of available devices and exit |
4 changes: 2 additions & 2 deletions tools/completion/README.md
@@ -140,8 +140,8 @@ llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
| `-dt, --defrag-thold N` | KV cache defragmentation threshold (DEPRECATED)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
| `-np, --parallel N` | number of parallel sequences to decode (default: 1)<br/>(env: LLAMA_ARG_N_PARALLEL) |
| `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
| `--mmap, --no-mmap` | whether to memory-map model. Explicitly enabling mmap disables direct-io. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
| `-dio, --direct-io, -ndio, --no-direct-io` | use DirectIO if available. Takes precedence over --mmap (default: enabled)<br/>(env: LLAMA_ARG_DIO) |
| `--mmap, --no-mmap` | whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
| `-dio, --direct-io, -ndio, --no-direct-io` | use DirectIO if available. (default: disabled)<br/>(env: LLAMA_ARG_DIO) |
| `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
| `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
| `--list-devices` | print list of available devices and exit |
12 changes: 9 additions & 3 deletions tools/server/README.md
@@ -74,8 +74,8 @@ For the full list of features, please refer to [server's changelog](https://gith
| `-ctv, --cache-type-v TYPE` | KV cache data type for V<br/>allowed values: f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1<br/>(default: f16)<br/>(env: LLAMA_ARG_CACHE_TYPE_V) |
| `-dt, --defrag-thold N` | KV cache defragmentation threshold (DEPRECATED)<br/>(env: LLAMA_ARG_DEFRAG_THOLD) |
| `--mlock` | force system to keep model in RAM rather than swapping or compressing<br/>(env: LLAMA_ARG_MLOCK) |
| `--mmap, --no-mmap` | whether to memory-map model. Explicitly enabling mmap disables direct-io. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
| `-dio, --direct-io, -ndio, --no-direct-io` | use DirectIO if available. Takes precedence over --mmap (default: enabled)<br/>(env: LLAMA_ARG_DIO) |
| `--mmap, --no-mmap` | whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock) (default: enabled)<br/>(env: LLAMA_ARG_MMAP) |
| `-dio, --direct-io, -ndio, --no-direct-io` | use DirectIO if available. (default: disabled)<br/>(env: LLAMA_ARG_DIO) |
| `--numa TYPE` | attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
| `-dev, --device <dev1,dev2,..>` | comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
| `--list-devices` | print list of available devices and exit |
@@ -162,9 +162,11 @@ For the full list of features, please refer to [server's changelog](https://gith

| Argument | Explanation |
| -------- | ----------- |
| `-lcs, --lookup-cache-static FNAME` | path to static lookup cache to use for lookup decoding (not updated by generation) |
| `-lcd, --lookup-cache-dynamic FNAME` | path to dynamic lookup cache to use for lookup decoding (updated by generation) |
| `--ctx-checkpoints, --swa-checkpoints N` | max number of context checkpoints to create per slot (default: 8)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)<br/>(env: LLAMA_ARG_CTX_CHECKPOINTS) |
| `-cram, --cache-ram N` | set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)<br/>(env: LLAMA_ARG_CACHE_RAM) |
| `-kvu, --kv-unified` | use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)<br/>(env: LLAMA_ARG_KV_UNIFIED) |
| `-kvu, --kv-unified, -no-kvu, --no-kv-unified` | use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)<br/>(env: LLAMA_ARG_KV_UNIFIED) |
| `--context-shift, --no-context-shift` | whether to use context shift on infinite text generation (default: disabled)<br/>(env: LLAMA_ARG_CONTEXT_SHIFT) |
| `-r, --reverse-prompt PROMPT` | halt generation at PROMPT, return control in interactive mode |
| `-sp, --special` | special tokens output enabled (default: false) |
@@ -229,6 +231,10 @@ For the full list of features, please refer to [server's changelog](https://gith
| `-ngld, --gpu-layers-draft, --n-gpu-layers-draft N` | max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)<br/>(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) |
| `-md, --model-draft FNAME` | draft model for speculative decoding (default: unused)<br/>(env: LLAMA_ARG_MODEL_DRAFT) |
| `--spec-replace TARGET DRAFT` | translate the string in TARGET into DRAFT if the draft model and main model are not compatible |
| `--spec-type [none\|ngram-cache\|ngram-simple\|ngram-map-k\|ngram-map-k4v\|ngram-mod]` | type of speculative decoding to use when no draft model is provided (default: none) |
| `--spec-ngram-size-n N` | ngram size N for ngram-simple/ngram-map speculative decoding, length of lookup n-gram (default: 12) |
| `--spec-ngram-size-m N` | ngram size M for ngram-simple/ngram-map speculative decoding, length of draft m-gram (default: 48) |
| `--spec-ngram-min-hits N` | minimum hits for ngram-map speculative decoding (default: 1) |
| `-mv, --model-vocoder FNAME` | vocoder model for audio generation (default: unused) |
| `--tts-use-guide-tokens` | Use guide tokens to improve TTS word recall |
| `--embd-gemma-default` | use default EmbeddingGemma model (note: can download weights from the internet) |
14 changes: 12 additions & 2 deletions tools/server/server-context.cpp
@@ -580,6 +580,7 @@ struct server_context_impl {
float slot_prompt_similarity = 0.0f;

std::string model_name; // name of the loaded model, to be used by API
std::vector<std::string> model_aliases; // additional names for the model

bool sleeping = false;

@@ -813,8 +814,15 @@ struct server_context_impl {
SRV_WRN("%s", "for more info see https://github.com/ggml-org/llama.cpp/pull/16391\n");

if (!params_base.model_alias.empty()) {
// user explicitly specified model name
model_name = params_base.model_alias;
// user explicitly specified model name (may include comma-separated aliases)
auto aliases = string_split<std::string>(params_base.model_alias, ',');
model_name = string_strip(aliases[0]);
for (size_t i = 1; i < aliases.size(); i++) {
auto alias = string_strip(aliases[i]);
if (!alias.empty()) {
model_aliases.push_back(alias);
}
}
} else if (!params_base.model.name.empty()) {
// use model name in registry format (for models in cache)
model_name = params_base.model.name;
@@ -2892,6 +2900,7 @@ server_context_meta server_context::get_meta() const {
return server_context_meta {
/* build_info */ build_info,
/* model_name */ impl->model_name,
/* model_aliases */ impl->model_aliases,
/* model_path */ impl->params_base.model.path,
/* has_mtmd */ impl->mctx != nullptr,
/* has_inp_image */ impl->chat_params.allow_image,
@@ -3688,6 +3697,7 @@ void server_routes::init_routes() {
{"data", {
{
{"id", meta->model_name},
{"aliases", meta->model_aliases},
{"object", "model"},
{"created", std::time(0)},
{"owned_by", "llamacpp"},
1 change: 1 addition & 0 deletions tools/server/server-context.h
@@ -12,6 +12,7 @@ struct server_context_impl; // private implementation
struct server_context_meta {
std::string build_info;
std::string model_name;
std::vector<std::string> model_aliases;
std::string model_path;
bool has_mtmd;
bool has_inp_image;
78 changes: 73 additions & 5 deletions tools/server/server-models.cpp
@@ -184,6 +184,32 @@ void server_models::add_model(server_model_meta && meta) {
if (mapping.find(meta.name) != mapping.end()) {
throw std::runtime_error(string_format("model '%s' appears multiple times", meta.name.c_str()));
}
if (name_index.find(meta.name) != name_index.end()) {
throw std::runtime_error(string_format("model name '%s' conflicts with an existing alias", meta.name.c_str()));
}

// parse aliases from preset's --alias option (comma-separated)
std::string alias_str;
if (meta.preset.get_option("LLAMA_ARG_ALIAS", alias_str) && !alias_str.empty()) {
for (auto & alias : string_split<std::string>(alias_str, ',')) {
alias = string_strip(alias);
if (alias.empty()) {
continue;
}
if (name_index.find(alias) != name_index.end()) {
throw std::runtime_error(string_format("alias '%s' for model '%s' conflicts with an existing name or alias",
alias.c_str(), meta.name.c_str()));
}
meta.aliases.push_back(alias);
}
}

// index canonical name + all aliases
name_index[meta.name] = meta.name;
for (const auto & alias : meta.aliases) {
name_index[alias] = meta.name;
}

meta.update_args(ctx_preset, bin_path); // render args
std::string name = meta.name;
mapping[name] = instance_t{
@@ -249,6 +275,7 @@ void server_models::load_models() {
server_model_meta meta{
/* preset */ preset.second,
/* name */ preset.first,
/* aliases */ {},
/* port */ 0,
/* status */ SERVER_MODEL_STATUS_UNLOADED,
/* last_used */ 0,
@@ -268,7 +295,18 @@
SRV_INF("Available models (%zu) (*: custom preset)\n", mapping.size());
for (const auto & [name, inst] : mapping) {
bool has_custom = custom_names.find(name) != custom_names.end();
SRV_INF(" %c %s\n", has_custom ? '*' : ' ', name.c_str());
if (inst.meta.aliases.empty()) {
SRV_INF(" %c %s\n", has_custom ? '*' : ' ', name.c_str());
} else {
std::string alias_list;
for (const auto & a : inst.meta.aliases) {
if (!alias_list.empty()) {
alias_list += ", ";
}
alias_list += a;
}
SRV_INF(" %c %s (aliases: %s)\n", has_custom ? '*' : ' ', name.c_str(), alias_list.c_str());
}
}
}

@@ -316,16 +354,25 @@ void server_models::update_meta(const std::string & name, const server_model_met
cv.notify_all(); // notify wait_until_loaded
}

std::string server_models::resolve_name(const std::string & name) {
std::lock_guard<std::mutex> lk(mutex);
auto it = name_index.find(name);
if (it != name_index.end()) {
return it->second;
}
return "";
}

bool server_models::has_model(const std::string & name) {
std::lock_guard<std::mutex> lk(mutex);
return mapping.find(name) != mapping.end();
return name_index.find(name) != name_index.end();
}

std::optional<server_model_meta> server_models::get_meta(const std::string & name) {
std::lock_guard<std::mutex> lk(mutex);
auto it = mapping.find(name);
if (it != mapping.end()) {
return it->second.meta;
auto it = name_index.find(name);
if (it != name_index.end()) {
return mapping[it->second].meta;
}
return std::nullopt;
}
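The `name_index` lookups above can be modeled with a small standalone sketch. The `model_index` struct below is hypothetical (the real `server_models` also holds a mutex, validates conflicts, and returns full `server_model_meta`); it only mirrors the indexing and resolution pattern this hunk introduces:

```cpp
#include <map>
#include <string>
#include <vector>

// Minimal model of the alias index introduced in this PR: the canonical
// name maps to itself, and every alias maps to the canonical name.
struct model_index {
    std::map<std::string, std::string> name_index;

    // mirrors the indexing step in server_models::add_model
    void add(const std::string & name, const std::vector<std::string> & aliases) {
        name_index[name] = name;
        for (const auto & alias : aliases) {
            name_index[alias] = name;
        }
    }

    // mirrors server_models::resolve_name: empty string when unknown
    std::string resolve(const std::string & name) const {
        auto it = name_index.find(name);
        return it != name_index.end() ? it->second : "";
    }
};
```

With this shape, a route handler can resolve the request's `model` field first and keep the original string when `resolve` returns empty, which is the fallback pattern the proxy handlers later in this diff follow.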
@@ -821,6 +868,11 @@ void server_models_routes::init_routes() {
this->proxy_get = [this](const server_http_req & req) {
std::string method = "GET";
std::string name = req.get_param("model");
// resolve alias to canonical model name
std::string resolved = models.resolve_name(name);
if (!resolved.empty()) {
name = resolved;
}
bool autoload = is_autoload(params, req);
auto error_res = std::make_unique<server_http_res>();
if (!router_validate_model(name, models, autoload, error_res)) {
@@ -833,6 +885,11 @@
std::string method = "POST";
json body = json::parse(req.body);
std::string name = json_value(body, "model", std::string());
// resolve alias to canonical model name
std::string resolved = models.resolve_name(name);
if (!resolved.empty()) {
name = resolved;
}
bool autoload = is_autoload(params, req);
auto error_res = std::make_unique<server_http_res>();
if (!router_validate_model(name, models, autoload, error_res)) {
@@ -845,6 +902,11 @@
auto res = std::make_unique<server_http_res>();
json body = json::parse(req.body);
std::string name = json_value(body, "model", std::string());
// resolve alias to canonical model name
std::string resolved = models.resolve_name(name);
if (!resolved.empty()) {
name = resolved;
}
auto model = models.get_meta(name);
if (!model.has_value()) {
res_err(res, format_error_response("model is not found", ERROR_TYPE_NOT_FOUND));
@@ -883,6 +945,7 @@
}
models_json.push_back(json {
{"id", meta.name},
{"aliases", meta.aliases},
{"object", "model"}, // for OAI-compat
{"owned_by", "llamacpp"}, // for OAI-compat
{"created", t}, // for OAI-compat
@@ -901,6 +964,11 @@
auto res = std::make_unique<server_http_res>();
json body = json::parse(req.body);
std::string name = json_value(body, "model", std::string());
// resolve alias to canonical model name
std::string resolved = models.resolve_name(name);
if (!resolved.empty()) {
name = resolved;
}
auto model = models.get_meta(name);
if (!model.has_value()) {
res_err(res, format_error_response("model is not found", ERROR_TYPE_INVALID_REQUEST));
5 changes: 5 additions & 0 deletions tools/server/server-models.h
@@ -52,6 +52,7 @@ static std::string server_model_status_to_string(server_model_status status) {
struct server_model_meta {
common_preset preset;
std::string name;
std::vector<std::string> aliases; // additional names that resolve to this model
int port = 0;
server_model_status status = SERVER_MODEL_STATUS_UNLOADED;
int64_t last_used = 0; // for LRU unloading
@@ -84,6 +85,7 @@ struct server_models {
std::mutex mutex;
std::condition_variable cv;
std::map<std::string, instance_t> mapping;
std::map<std::string, std::string> name_index; // alias/name -> canonical name

// for stopping models
std::condition_variable cv_stop;
@@ -112,6 +114,9 @@ struct server_models {
// check if a model instance exists (thread-safe)
bool has_model(const std::string & name);

// resolve alias/name to canonical model name, returns empty string if not found (thread-safe)
std::string resolve_name(const std::string & name);

// return a copy of model metadata (thread-safe)
std::optional<server_model_meta> get_meta(const std::string & name);
