What’s Perplexity in AI?
Perplexity measures how well an AI model such as ChatGPT can predict the next word. If it assigns high probability to the words that actually appear, its perplexity is low; if it is frequently surprised, its perplexity is high. Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k words.
In general, AI models with lower perplexity are better at predicting the words in the test set.
In practical terms, this means:
- For each word in the test set, calculate the probability of that word given the previous words, as predicted by the model.
- Take the log (base 2) of each of these probabilities.
- Sum up these log probabilities.
- Divide the sum by the total number of words in the test set.
- The perplexity is then 2 raised to the power of the negative average log probability (see the sketch after this list).
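In other words, PPL = 2^(-(1/N) * sum of log2 p(w_i | w_1 ... w_(i-1))) over the N words of the test set. To make the steps concrete, here is a minimal sketch in Python; the perplexity function and its inputs are illustrative, not part of any library:

```python
import math

def perplexity(token_probs):
    """Compute perplexity from per-word probabilities.

    token_probs holds p(w_i | w_1 ... w_(i-1)) for each word in the
    test set, as assigned by the model.
    """
    # Average of the base-2 log probabilities over the test set
    avg_log_prob = sum(math.log2(p) for p in token_probs) / len(token_probs)
    # 2 raised to the power of the negative average log probability
    return 2 ** (-avg_log_prob)

# Sanity check: a model that gives every word probability 0.25 is exactly
# as uncertain as a uniform choice among 4 words, so PPL = 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```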
An Example
In the following, we calculate the perplexity of mixtral-8x7b-instruct on wiki.test.raw:
./perplexity -m mistral-7B-v0.1/mixtral-8x7b-instruct-v0.1.Q4_0.gguf -f wiki.test.raw
The results are shown below:
perplexity: tokenizing the input ..
perplexity: tokenization took 550.513 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 3.81 seconds per pass - ETA 40.72 minutes
[1]3.2840,[2]3.9565,[3]4.5948,[4]4.7974,[5]4.8613,[6]4.8332,[7]4.9531,
...
[640]5.5644,[641]5.5566,[642]5.5517,[642]4.5087,
Final estimate: PPL = 4.5087 +/- 0.02428
The perplexity score PPL = ~4.51 is pretty good for a 4-bit quantized model with the default context size of 512 on the wiki.test.raw dataset.
Comparatively, the perplexity of the llama-2-13b model is ~5.38 (see llama.cpp). In my own test, llama-2-70b-chat with 4-bit quantization has PPL = ~5.04:
Final estimate: PPL = 5.0426 +/- 0.03032
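Perplexity can also be computed directly in Python. The sketch below, using Hugging Face transformers, is a simplification: it scores only the first 512-token chunk of the file rather than sliding over all chunks the way the perplexity tool above does, and it uses the small gpt2 model as a stand-in, since the models above are too large to evaluate this way on most machines:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = open("wiki.test.raw").read()
# Keep only the first 512 tokens; a full evaluation would slide a
# window over the entire file, chunk by chunk.
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    # With labels set, the model returns the mean cross-entropy of its
    # next-token predictions (in natural-log units).
    loss = model(enc.input_ids, labels=enc.input_ids).loss

# exp of the mean natural-log loss equals 2 to the mean log2 loss,
# so this matches the base-2 definition above.
print(f"PPL = {torch.exp(loss).item():.4f}")
```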