The fact that the word ends up being 1 token doesn’t mean the model can’t track individua...

brookst · 2025-07-10T21:13:11 1752181991

No, the vector is in a semantic embedding space. That's the magic.

So "the sky is blue" converts to the tokens [1820, 13180, 374, 6437]

And "le ciel est bleu" converts to the tokens [273, 12088, 301, 1826, 12704, 84]

The embedding vectors created from these are then very similar, despite the letters having very little in common.
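"Very similar" here is usually measured with cosine similarity. A minimal sketch of that comparison follows; the three vectors are made-up 4-dimensional toys for illustration, not output from any real model, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, for illustration only (not real model output).
english = [0.9, 0.1, 0.4, 0.2]     # stands in for "the sky is blue"
french = [0.8, 0.2, 0.5, 0.1]      # stands in for "le ciel est bleu"
unrelated = [-0.3, 0.9, -0.2, 0.1]  # stands in for an unrelated sentence

print(cosine_similarity(english, french))
print(cosine_similarity(english, unrelated))
```

With real sentence embeddings, the translated pair would be expected to score noticeably higher than the unrelated pair, as it does for these toy values.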

boroboro4 · 2025-07-11T12:47:58 1752238078

The character in 1st/2nd/3rd position is part of the semantic space, in the generic sense of the word. I ran experiments (below) that seem to support my hypothesis.

kadushka · 2025-07-10T20:10:43 1752178243

Is there any evidence to support your hypothesis?

boroboro4 · 2025-07-11T12:42:47 1752237767

Good question! I did a small experiment: I trained a small logistic regression from embedding vectors to the 1st/2nd/3rd character of each token: https://chatgpt.com/share/6871061a-7948-8007-ab53-5b0b697e90...

I got accuracies of 0.863 (1st) / 0.559 (2nd) / 0.447 (3rd) for the Qwen 3 8B model's embeddings. Note that the code is hacky and might be wrong in places, and in reality transformers know more, because here I use only the embedding layer. However, it does show that embedding vectors carry very clear signals about the characters in each token.
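The linked code is not reproduced here, so the following is only a rough sketch of what such a probe looks like: a multinomial logistic regression trained to predict the character at a given position from a vector. Everything in it is an assumption for illustration: the helper names, the 3-letter alphabet, and above all the synthetic "embeddings", which are random noise with a planted first-character signal rather than real Qwen weights. In a real run, `X` would be rows of the model's embedding matrix and `tokens` the decoded vocabulary.

```python
import math
import random

def char_labels(tokens, position):
    """Return (index, char) pairs for tokens long enough to have `position`."""
    return [(i, t[position]) for i, t in enumerate(tokens) if len(t) > position]

def scores(W, b, x):
    """Linear class scores W @ x + b."""
    return [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]

def train_softmax_probe(X, y, n_classes, epochs=100, lr=0.5):
    """Fit multinomial logistic regression with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, label in zip(X, y):
            logits = scores(W, b, x)
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            for c in range(n_classes):
                err = exps[c] / z - (1.0 if c == label else 0.0)
                gb[c] += err
                for j in range(dim):
                    gW[c][j] += err * x[j]
        for c in range(n_classes):
            b[c] -= lr * gb[c] / n
            for j in range(dim):
                W[c][j] -= lr * gW[c][j] / n
    return W, b

def accuracy(W, b, X, y):
    """Fraction of rows whose highest-scoring class matches the label."""
    hits = 0
    for x, label in zip(X, y):
        s = scores(W, b, x)
        hits += s.index(max(s)) == label
    return hits / len(y)

# --- Synthetic stand-in data (NOT real model embeddings) ---
rng = random.Random(0)
alphabet = "abc"
tokens = ["".join(rng.choice(alphabet) for _ in range(rng.randint(1, 4)))
          for _ in range(200)]

def toy_embedding(token, dim=8):
    vec = [rng.gauss(0.0, 0.5) for _ in range(dim)]
    vec[alphabet.index(token[0])] += 2.0  # planted first-character signal
    return vec

embeddings = [toy_embedding(t) for t in tokens]

pairs = char_labels(tokens, 0)  # position 0 = first character
X = [embeddings[i] for i, _ in pairs]
y = [alphabet.index(ch) for _, ch in pairs]

W, b = train_softmax_probe(X[:150], y[:150], n_classes=len(alphabet))
test_accuracy = accuracy(W, b, X[150:], y[150:])
print(f"held-out first-character accuracy: {test_accuracy:.3f}")
```

Passing `position=1` or `position=2` to `char_labels` gives the labels for the 2nd and 3rd character; tokens that are too short are simply dropped. Scoring on a held-out split matters here, since a probe evaluated on its own training rows can look better than it is.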

kadushka · 2025-07-11T18:10:06 1752257406

Thank you! I guess if there's enough spelling-related text in the dataset, a model is forced to learn some info about token composition in order to predict such text.

I wonder if it would help to explicitly insert this info into the embedding vector, similar to how we encode word position info. For example, allocate the first 20 vector elements to represent the ASCII codes of the token's characters (in some normalized way).
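As a sketch of what that could look like: the two helpers below are hypothetical, and several choices are assumptions rather than anything stated above, namely dividing by 127 as the normalization, padding short tokens with 0.0, mapping non-ASCII characters to 0.0, and overwriting the first slots instead of adding to them.

```python
def ascii_features(token, n_slots=20):
    """Map a token to n_slots floats: ASCII code / 127 per character.

    Assumptions: non-ASCII characters become 0.0, short tokens are padded
    with 0.0, and tokens longer than n_slots are truncated.
    """
    codes = [ord(ch) / 127.0 if ord(ch) < 128 else 0.0 for ch in token[:n_slots]]
    return codes + [0.0] * (n_slots - len(codes))

def inject_ascii(embedding, token, n_slots=20):
    """Return a copy of `embedding` with its first n_slots elements replaced."""
    if len(embedding) < n_slots:
        raise ValueError("embedding is shorter than the reserved slots")
    return ascii_features(token, n_slots) + list(embedding[n_slots:])

# Toy 32-dimensional embedding, for illustration only.
vec = inject_ascii([1.0] * 32, "sky")
print(vec[:4])
```

Whether such a fixed, hand-built signal helps or just takes capacity away from the learned dimensions is exactly the open question raised here.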

boroboro4 · 2025-07-11T20:55:18 1752267318

Ok, bonus content #2.

I took the Qwen3 1.7B model and did the same, but rather than using the embedding vector I used the vector after the 1st, 2nd, etc. layer. Accuracies for the 1st character are below:

- embeddings: 0.855

- 1st: 0.913

- 2nd: 0.870

- 3rd: 0.671

- 16th: 0.676

- 20th: 0.683

And now mega bonus content: the same, but with the prefix "count letters in ":

- 1st: 0.922

- 2nd: 0.924

- 3rd: 0.920

- 16th: 0.877

- 20th: 0.895

And for the 2nd letter:

- embeddings: 0.686

- 1st: 0.679

- 2nd: 0.682

- 3rd: 0.674

- 16th: 0.572

boroboro4 · 2025-07-11T18:29:46 1752258586

One way to do this is to use a one-hot encoding in the first (token length * alphabet length) dimensions.
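A minimal sketch of that one-hot layout, assuming a lowercase a-z alphabet and a maximum token length of 4 (both arbitrary choices for illustration; `one_hot_chars` is a hypothetical helper):

```python
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet: a-z only

def one_hot_chars(token, max_len=4, alphabet=ALPHABET):
    """Flat one-hot vector of size max_len * len(alphabet).

    Slot (pos * len(alphabet) + idx) is 1.0 when the character at `pos`
    is alphabet[idx]. Characters outside the alphabet leave their block
    all zeros; the token is lowercased and truncated to max_len.
    """
    vec = [0.0] * (max_len * len(alphabet))
    for pos, ch in enumerate(token.lower()[:max_len]):
        idx = alphabet.find(ch)
        if idx >= 0:
            vec[pos * len(alphabet) + idx] = 1.0
    return vec

v = one_hot_chars("sky")
print(len(v), sum(v))
```

With these settings the encoding takes 4 * 26 = 104 dimensions, so the cost grows quickly with longer tokens or a larger character set.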

But to be frank, I don’t think it’s really needed; I bet the model learns everything it really needs by itself. If I had time I would’ve tried it, though :)

Bonus content, accuracies for other models (1st / 2nd / 3rd character; notice DeepSeek!):

- Qwen3-32B: 0.873 / 0.585 / 0.467

- Qwen3-235B-A22B: 0.857 / 0.607 / 0.502

- DeepSeek-V3: 0.869 / 0.738 / 0.624