Good question! I did a small experiment: trained a small logistic regression fr...

kadushka · 2025-07-11T18:10:06 1752257406

Thank you! I guess if there's enough spelling-related text in the dataset, a model is forced to learn some info about token composition in order to predict such text.

I wonder if it would help to explicitly insert this info into the embedding vector, similar to how we encode word position info. For example, allocate the first 20 vector elements to represent the ASCII codes of the token's characters (in some normalized way).
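A minimal sketch of that idea, assuming codes are scaled by 127, short tokens are zero-padded, and non-ASCII characters are clamped. The helper names `char_code_features` and `inject_char_features` are hypothetical, and overwriting the leading slots is only one of several ways the features could be combined with a learned embedding:

```python
def char_code_features(token, width=20):
    """Map the first `width` characters of `token` to values in [0, 1]."""
    feats = [0.0] * width  # unused positions stay 0.0 (padding)
    for i, ch in enumerate(token[:width]):
        # Assumption: anything above 127 is clamped, since the proposal
        # only mentions ASCII codes.
        feats[i] = min(ord(ch), 127) / 127.0
    return feats


def inject_char_features(embedding, token, width=20):
    """Return a copy of `embedding` with its first `width` slots replaced."""
    if len(embedding) < width:
        raise ValueError("embedding is shorter than the reserved width")
    return char_code_features(token, width) + list(embedding[width:])


if __name__ == "__main__":
    learned = [0.5] * 32  # stand-in for a learned embedding vector
    print(inject_char_features(learned, "straw")[:6])
```

Whether a model would actually benefit from such reserved slots is an open question in this thread; the sketch only shows the encoding itself.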

boroboro4 · 2025-07-11T20:55:18 1752267318

Ok, bonus content #2.

I took the Qwen3 1.7B model and did the same, but rather than using the embedding vector I used the vector after the 1st/etc. layer. Below are the accuracies for the 1st letter position:

- embeddings: 0.855

- 1st: 0.913

- 2nd: 0.870

- 3rd: 0.671

- 16th: 0.676

- 20th: 0.683

And now mega bonus content: the same but with prefix "count letters in ":

- 1st: 0.922

- 2nd: 0.924

- 3rd: 0.920

- 16th: 0.877

- 20th: 0.895

And for 2nd letter:

- embeddings: 0.686

- 1st: 0.679

- 2nd: 0.682

- 3rd: 0.674

- 16th: 0.572
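For anyone who wants to try this kind of per-layer probe, here is a rough sketch of the probing step only: a multinomial logistic regression trained on fixed vectors to predict a class label (standing in for "which letter is at position k"). It is not the code behind the numbers above. The vectors are synthetic toy data rather than Qwen3 hidden states, and all function names are hypothetical:

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def scores(W, b, x):
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_probe(X, y, n_classes, epochs=200, lr=0.5, seed=0):
    """Fit a linear probe (multinomial logistic regression) with plain SGD."""
    rng = random.Random(seed)
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            probs = softmax(scores(W, b, X[i]))
            for c in range(n_classes):
                # Gradient of cross-entropy w.r.t. the class-c logit.
                grad = probs[c] - (1.0 if c == y[i] else 0.0)
                b[c] -= lr * grad
                for j in range(dim):
                    W[c][j] -= lr * grad * X[i][j]
    return W, b


def accuracy(W, b, X, y):
    hits = 0
    for x, target in zip(X, y):
        s = scores(W, b, x)
        hits += max(range(len(s)), key=s.__getitem__) == target
    return hits / len(X)


def toy_hidden_states(n_per_class, n_classes, dim, seed):
    """Synthetic stand-in for per-token vectors: class c gets a bump on dim c."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            v = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            v[c] += 1.0
            X.append(v)
            y.append(c)
    return X, y


if __name__ == "__main__":
    X_train, y_train = toy_hidden_states(30, 3, 8, seed=1)
    X_test, y_test = toy_hidden_states(10, 3, 8, seed=2)
    W, b = train_probe(X_train, y_train, n_classes=3)
    print("held-out probe accuracy:", accuracy(W, b, X_test, y_test))
```

With a real model, `X` would be the vectors taken from the embedding table or from a given layer's output, one probe per layer, and `y` the letter at the chosen position of each token.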

boroboro4 · 2025-07-11T18:29:46 1752258586

One way here is to use a one-hot encoding in the first (token length * alphabet length) dimensions.
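A small sketch of that layout, assuming a 26-letter lowercase alphabet and a fixed maximum token length (both are illustrative choices, and `one_hot_spelling` is a hypothetical helper name):

```python
import string

# Assumption: only lowercase Latin letters are encoded.
ALPHABET = string.ascii_lowercase


def one_hot_spelling(token, max_len=4, alphabet=ALPHABET):
    """Return max_len * len(alphabet) values, one block per character position."""
    block = len(alphabet)
    feats = [0.0] * (max_len * block)
    for pos, ch in enumerate(token.lower()[:max_len]):
        idx = alphabet.find(ch)
        if idx >= 0:
            feats[pos * block + idx] = 1.0
        # Characters outside the alphabet leave their block all-zero.
    return feats


if __name__ == "__main__":
    v = one_hot_spelling("ab", max_len=3)
    print(len(v), [i for i, x in enumerate(v) if x == 1.0])
```

Each position gets its own block of `len(alphabet)` slots, so the reserved width grows with both the maximum token length and the alphabet size.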

But to be frank, I don’t think it’s really needed; I bet the model learns everything it really needs by itself. If I had time I would’ve tried it though :)

Bonus content, accuracies for other models (notice DeepSeek!):

- Qwen3-32B: 0.873 / 0.585 / 0.467

- Qwen3-235B-A22B: 0.857 / 0.607 / 0.502

- DeepSeek-V3: 0.869 / 0.738 / 0.624