The exact questions are almost certainly not in the training data: extra words are added to each puzzle, and I don't publish these alongside the original words. There is still a slight chance that model providers used my previous API requests for training.

To guard against potential training data contamination, I separately calculate the score using only the newest 100 puzzles. Grok 4 still leads.
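
A minimal sketch of this kind of recency check, assuming each puzzle result carries a publication date and a per-puzzle score between 0 and 1. The `PuzzleResult` type, its field names, and both function names are hypothetical and chosen for illustration; this is not the benchmark's actual scoring code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PuzzleResult:
    puzzle_date: date  # publication date of the puzzle (assumed field)
    score: float       # per-puzzle score in [0, 1] (assumed scale)


def mean_score(results):
    """Average the per-puzzle scores of the given results."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.score for r in results) / len(results)


def newest_subset_score(results, n=100):
    """Average score over only the n most recently published puzzles."""
    # Sort newest first, then keep at most n results.
    newest = sorted(results, key=lambda r: r.puzzle_date, reverse=True)[:n]
    return mean_score(newest)
```

Comparing `mean_score(results)` with `newest_subset_score(results)` for each model shows whether its ranking depends on older puzzles that are more likely to have leaked into training data.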