LLM Inference Optimisation on a Local MacBook

The code is largely from https://learn.deeplearning.ai/courses/efficiently-serving-llms; this is my implementation of it on a MacBook. The key change is moving everything to MPS. Some torch initialisations occur natively on the CPU, and without explicitly moving those tensors to the MPS device you get thrown a SIGBUS error. Also, while debugging this, Claude failed to identify the problem over multiple exchanges while ChatGPT (Think Longer) one-shotted it.
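Below is a minimal sketch of the device handling in question. It is an illustration, not code from this repo; the model name and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Prefer MPS when available, fall back to CPU otherwise.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

model_name = "gpt2"  # placeholder; any HF causal LM follows the same pattern
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Tokenizer outputs land on the CPU by default; every tensor the model
# touches has to be moved explicitly, otherwise mixing CPU tensors with an
# MPS model can surface as the SIGBUS described above.
inputs = tokenizer("Hello", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Currently in the repo: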

  • Vanilla Generation
  • KV Caching - Doesn't work well for llama-style models, since they have much more stringent requirements for how past_key_values is passed (a cached decode loop is sketched below this list).
  • Continuous Batching - Currently fixing an error that causes KV caching to fail at the last token generated in the first batch (a toy scheduling sketch follows the ToDo note).
  • Nice Graphs and Conceptual Comments
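To make the KV caching item concrete, here is a minimal greedy decode loop that reuses past_key_values across steps instead of recomputing attention over the whole prefix. This is a sketch against the generic HuggingFace causal-LM interface, not code from this repo; recent transformers versions wrap llama-style caches in Cache objects, but the loop shape is the same.

```python
import torch

@torch.no_grad()
def generate_with_kv_cache(model, tokenizer, prompt, max_new_tokens=20, device="mps"):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    generated = input_ids
    next_input = input_ids
    past_key_values = None
    for _ in range(max_new_tokens):
        out = model(input_ids=next_input, past_key_values=past_key_values, use_cache=True)
        # Reuse the cached attention keys/values instead of re-encoding the prefix.
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy decode
        generated = torch.cat([generated, next_token], dim=-1)
        next_input = next_token  # after the first step, only the new token is fed in
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```

Vanilla generation is the same loop with past_key_values left as None and the full generated sequence fed back in at every step, so each forward pass recomputes the prefix from scratch.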

ToDo: Quantization and LoRA.
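Continuous batching, for reference, reduces to a scheduling loop in which finished sequences leave the running batch and waiting requests are admitted between decode steps, rather than waiting for the whole batch to drain. The skeleton below is a toy illustration with a stubbed model step; all names are hypothetical, and it elides real concerns like per-sequence KV caches and padding.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one batched forward pass: a real implementation runs the
    # model once over the whole batch and appends one token per sequence.
    for req in batch:
        req.tokens.append("<tok>")

def continuous_batching(requests, max_batch_size=4):
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit new requests as soon as slots free up (the key difference
        # from static batching, which waits for the whole batch to finish).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        still_running = []
        for req in running:
            (finished if len(req.tokens) >= req.max_new_tokens else still_running).append(req)
        running = still_running
    return finished

# e.g. continuous_batching([Request(f"p{i}", 3 + i) for i in range(6)])
```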
