Conversation
        unit="batch",
        disable=not gptq_conf.show_progress,
    ):
        device = next(model.parameters()).device
(minor) The device lookup is recomputed inside the inner loop. It could be moved outside the loop to avoid redundant work and improve readability.
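A minimal sketch of the suggested change, with `model` and `batches` standing in for the actual objects in the PR:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in for the real model
batches = [torch.randn(2, 4) for _ in range(3)]

# before: device looked up on every iteration
# for batch in batches:
#     device = next(model.parameters()).device
#     out = model(batch.to(device))

# after: hoist the lookup out of the loop
device = next(model.parameters()).device
for batch in batches:
    out = model(batch.to(device))
```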
                self.cache_kwargs[k].append(
                    v.cpu()
                    if isinstance(v, torch.Tensor)
                    else (v[0].cpu(), v[1].cpu())
                    if isinstance(v, tuple)
                    else None
                )
We used to preserve kwargs as-is, but this patch converts only tensors and 2-element tuples and replaces everything else with None. That can change the input semantics.
I think this should be implemented with a recursive “move tensors only” helper that preserves the original container structure.
# let's add this function to tico/utils/utils.py
def move_to_device(obj, device):
    """
    Recursively move tensors inside a nested structure to the given device.
    Non-tensor objects are preserved as-is.
    """
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    elif isinstance(obj, tuple):
        return tuple(move_to_device(x, device) for x in obj)
    elif isinstance(obj, list):
        return [move_to_device(x, device) for x in obj]
    elif isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    # preserve everything else (bool, int, None, custom objects, etc.)
    return obj


def move_to_cpu(obj):
    return move_to_device(obj, "cpu")

Then, we can just call it like this.
# after
self.cache_kwargs[k].append(move_to_cpu(v))

cache_kwargs_batch = gather_single_batch_from_dict(self.cache_kwargs, batch_idx)
cache_kwargs_batch = move_to_device(cache_kwargs_batch, device)
cache_args_batch = gather_single_batch_from_list(self.cache_args, batch_idx)
cache_args_batch = move_to_device(cache_args_batch, device)
Ok. Understood. Thank you!
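To illustrate why the recursive helper is safer: it leaves non-tensor values untouched and keeps the container structure intact. A quick standalone check (a sketch, not part of the PR; the kwargs keys below are made-up examples of typical transformer inputs):

```python
import torch


def move_to_device(obj, device):
    """Recursively move tensors to `device`, preserving structure."""
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    elif isinstance(obj, tuple):
        return tuple(move_to_device(x, device) for x in obj)
    elif isinstance(obj, list):
        return [move_to_device(x, device) for x in obj]
    elif isinstance(obj, dict):
        return {k: move_to_device(v, device) for k, v in obj.items()}
    return obj


def move_to_cpu(obj):
    return move_to_device(obj, "cpu")


# hypothetical kwargs mixing tensors, tuples, and plain Python values
kwargs = {
    "past_key_value": (torch.ones(2, 2), torch.zeros(2, 2)),
    "use_cache": True,       # non-tensor: preserved as-is
    "attention_mask": None,  # None: preserved, not replaced
}
moved = move_to_cpu(kwargs)
```

With the old tensor/tuple-only conversion, `use_cache` would have been replaced by `None`; here it survives unchanged.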
    return padding


def move_to_device(obj, device):
How about moving this to tico/utils/utils.py? This function seems to be used in other places as well.
This PR uses GPU memory in the GPTQ algorithm only for inference, reducing GPU memory usage.
This makes it possible to use a large number of samples on a GPU with constrained memory.
Sample run with 256 samples for TinyLlama/TinyLlama-1.1B-Chat-v1.0 on an 8 GB GPU:
./ccex test --include-internal -k quantization.algorithm.test_gptq
TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com