What’s Perplexity in AI?

Michael Humor · Dev Genius · Dec 20, 2023

Perplexity measures how well an AI model such as ChatGPT can predict the next word. If the model guesses correctly most of the time, its perplexity is low; if it is often wrong, its perplexity is high.

In general, AI models with lower perplexity are better (at predicting the words in the test set).
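
Intuitively, a perplexity of N means the model is, on average, as uncertain as if it had to choose uniformly among N equally likely next words. For example, if a model assigned probability 0.25 to every correct word, the average base-2 log probability would be log2(0.25) = -2, giving a perplexity of 2^2 = 4.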

In practical terms, this means:

  1. For each word in the test set, calculate the probability of that word given the previous words, as predicted by the model.
  2. Take the log (base 2) of each of these probabilities.
  3. Sum up these log probabilities.
  4. Divide the sum by the total number of words in the test set.
  5. The perplexity is then 2 raised to the power of the negative average log probability.
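
As a minimal sketch, assuming we already have the model's probability for each actual next word (obtaining those probabilities is the model-specific part, and the names below are just for illustration), the five steps reduce to a few lines of Python:

import math

def perplexity(token_probs):
    """Perplexity from per-word probabilities (steps 1-5 above)."""
    # Steps 2-3: take the base-2 log of each probability and sum them.
    total_log2 = sum(math.log2(p) for p in token_probs)
    # Step 4: divide by the number of words to get the average.
    avg_log2 = total_log2 / len(token_probs)
    # Step 5: 2 raised to the power of the negative average.
    return 2 ** -avg_log2

probs = [0.5, 0.25, 0.1, 0.05]  # toy per-word probabilities
print(perplexity(probs))        # ~6.32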

An Example

In the following, we calculate the perplexity of mixtral-8x7b-instruct on wiki.test.raw:

./perplexity -m mistral-7B-v0.1/mixtral-8x7b-instruct-v0.1.Q4_0.gguf -f wiki.test.raw
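
(Here, perplexity is the example program that ships with llama.cpp, and wiki.test.raw is the test split of the WikiText-2 corpus, a dataset commonly used for perplexity benchmarks; the model path reflects my local directory layout.)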

The results are shown below:

perplexity: tokenizing the input ..
perplexity: tokenization took 550.513 ms
perplexity: calculating perplexity over 642 chunks, batch_size=512
perplexity: 3.81 seconds per pass - ETA 40.72 minutes
[1]3.2840,[2]3.9565,[3]4.5948,[4]4.7974,[5]4.8613,[6]4.8332,[7]4.9531,
...
[640]5.5644,[641]5.5566,[642]5.5517,[642]4.5087,

Final estimate: PPL = 4.5087 +/- 0.02428

The perplexity score PPL = ~4.51, which is pretty good for a 4-bit quantized model with the default context size of 512 on the wiki.test.raw dataset.

Comparatively, the perplexity of the llama-2-13b model is ~5.38 (see llama.cpp).

In my own test, llama-2-70b-chat with 4-bit quantization has PPL = ~5.04.

Final estimate: PPL = 5.0426 +/- 0.03032
