An LLM (Large Language Model) is the basis for most current AI tools. Fortunately, there are lots of open source models that can be run on your own hardware (if it is powerful enough). However, I have asked myself more than once: what exactly is powerful enough?
Hardware requirements
While experimenting with LLMs, one of the first questions you need to answer is: what kind of hardware is required to run a model? So I’ll try to give a short introduction to the things you need to know.
The most important thing to know is the number of parameters a model contains. Most models include that size in their name, something like “7B”, “72B” or even “405B”. The “B” stands for billion, so a “7B” model uses 7 billion parameters. The default data type of such a parameter is FP16 (a 16 bit floating point number), so each parameter requires 2 bytes (= 16 / 8 bits) of (V)RAM. A “7B” model therefore requires at least 2 * 7 * 10^9 bytes, that’s 14 GB of (V)RAM, just to store the parameters. On top of that, some extra memory is required for additional (input) data; some explanations I found assumed about 20% of extra memory for that purpose.
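To make that calculation concrete, here is a minimal sketch of the back-of-the-envelope estimate described above. The 20% overhead factor is just the rough figure mentioned, not an exact value; real memory usage also depends on context length and other runtime buffers.

```python
def estimate_memory_gb(num_params_billion: float,
                       bytes_per_param: float = 2.0,   # FP16 = 2 bytes per parameter
                       overhead: float = 0.20) -> float:
    """Rough (V)RAM estimate in GB for loading a model's weights plus ~20% overhead."""
    weight_bytes = num_params_billion * 1e9 * bytes_per_param
    return weight_bytes * (1 + overhead) / 1e9

print(estimate_memory_gb(7))    # ~16.8 GB for a 7B FP16 model
print(estimate_memory_gb(72))   # ~172.8 GB for a 72B FP16 model
```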
Quantization
However, there are so-called quantized models that promise much lower memory consumption. How is that possible?
One way to reduce the amount of memory required is to lower the precision of the data type: instead of using FP16 (2 bytes per parameter), the precision can be reduced to 8 bits (1 byte per parameter), 4 bits (2 parameters per byte) or even lower.
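The effect on the weight size is easy to estimate with the same kind of calculation as before; the numbers below only cover the parameters themselves, without any runtime overhead.

```python
# Bytes per parameter at different precisions: FP16 = 2.0, 8-bit = 1.0, 4-bit = 0.5
for label, bytes_per_param in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    weights_gb = 7e9 * bytes_per_param / 1e9
    print(f"7B model, {label}: ~{weights_gb:.1f} GB for the weights alone")
# FP16: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```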