* Introduce evaluation API
Signed-off-by: Michal Bien <mbien@nvidia.com>
---------
Signed-off-by: Michal Bien <mbien@nvidia.com>
Signed-off-by: Glorf <Glorf@users.noreply.github.com>
Co-authored-by: Glorf <Glorf@users.noreply.github.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
-        nemo_checkpoint_path (Path): Path for the nemo 2.0 checkpoint. This is used to get the tokenizer from the
-            ckpt, which is required to tokenize the evaluation input and output prompts.
-        url (str): grpc service url that was used in the deploy method above,
-            in the format: grpc://{grpc_service_ip}:{grpc_port}.
-        triton_http_port (int): HTTP port that was used for the PyTriton server in the deploy method. Default: 8000.
-            Please pass the triton_http_port if using a custom port in the deploy method.
-        model_name (str): Name of the model that is deployed on the PyTriton server. It should be the same as the
-            triton_model_name passed to the deploy method above to be able to launch evaluation. Default: "triton_model".
-        eval_task (str): task to be evaluated on, e.g. "gsm8k", "gsm8k_cot", "mmlu", "lambada". Default: "gsm8k".
-            These are the tasks that are currently supported. Any other task of type generate_until or loglikelihood from
-            lm-evaluation-harness can be run, but only the above-mentioned ones are tested. Tasks of type
-            loglikelihood_rolling are not supported yet.
-        num_fewshot (int): number of examples in the few-shot context. Default: None.
-        limit (Union[int, float]): Limit the number of examples per task. If < 1 (i.e. a float value between 0 and 1),
-            the limit is a percentage of the total number of examples. If an int x, evaluation runs on only x samples
-            from the eval dataset. Default: None, which means eval is run on the entire dataset.
-        bootstrap_iters (int): Number of iterations for bootstrap statistics, used when calculating stderrs. Set to 0
-            for no stderr calculations to be performed. Default: 100000.
-        # inference params
-        temperature (Optional[float]): float value between 0 and 1. A temperature of 0 indicates greedy decoding, where
-            the token with the highest probability is chosen. Temperature can't currently be set to 0.0, due to a bug
-            with TRTLLM (# TODO to be investigated). Hence a very small value is used as the default. Default: 0.000000001.
-        top_p (Optional[float]): float value between 0 and 1 that limits sampling to the top tokens within a certain
-            cumulative probability. top_p=0 means the model will only consider the single most likely token for the
-            next prediction. Default: 0.0.
-        top_k (Optional[int]): limits sampling to a certain number (K) of the top tokens to consider. top_k=1 means the
-            model will only consider the single most likely token for the next prediction. Default: 1.
-        add_bos (Optional[bool]): whether a special token representing the beginning of a sequence should be added when
-            encoding a string. Default: False, since it's typically set to False for CausalLM. If needed, set add_bos to True.
+        target_cfg (EvaluationTarget): target of the evaluation. Providing nemo_checkpoint_path, model_id and url in
+            EvaluationTarget.api_endpoint is required to run evaluations.
+        eval_cfg (EvaluationConfig): configuration for evaluations. Default type (task): gsm8k.
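The diff above collapses the flat keyword arguments into two config objects, `EvaluationTarget` and `EvaluationConfig`. The sketch below illustrates that shape with hypothetical dataclasses: the class and field names follow the docstring, but the exact definitions here are illustrative assumptions, not NeMo's actual implementation.

```python
# Hypothetical sketch of the new two-config evaluation call. The field
# names mirror the docstring diff; the dataclass definitions themselves
# are assumptions for illustration, not NeMo's real classes.
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ApiEndpoint:
    # Matches the fields the diff says EvaluationTarget.api_endpoint requires.
    nemo_checkpoint_path: str
    model_id: str = "triton_model"
    url: str = "grpc://0.0.0.0:8001"


@dataclass
class EvaluationTarget:
    api_endpoint: ApiEndpoint


@dataclass
class EvaluationConfig:
    # "type" carries the eval task; default matches the docstring ("gsm8k").
    type: str = "gsm8k"
    num_fewshot: Optional[int] = None
    limit: Optional[Union[int, float]] = None


# Old style: evaluate(nemo_checkpoint_path=..., url=..., eval_task=..., limit=...)
# New style: both settings travel in config objects instead.
target_cfg = EvaluationTarget(
    api_endpoint=ApiEndpoint(nemo_checkpoint_path="/ckpts/model.nemo")
)
eval_cfg = EvaluationConfig(type="gsm8k", limit=0.1)  # limit < 1 -> fraction of dataset
```

Grouping the deployment target and the evaluation settings into separate objects keeps the `evaluate` signature stable as inference parameters (temperature, top_p, top_k, add_bos) evolve.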