The fact the word ends up being 1 token doesn’t mean model can’t track individual characters in it. The model transforms token into vector (of multiple thousands dimensionality), and I’m pretty sure there are dimensions corresponding to things like “if 1st character an ‘a’, 1st is ‘b’, 2nd is ‘a’ etc.
Character on 1st/2nd/3rd place is part of semantic space in generic meaning of the word. I ran experiments which seemingly ~support my hypothesis below.
I got 0.863 (for 1st)/0.559 (for 2nd)/0.447 (for 3rd) accuracy for Qwen 3 8B model embeddings. Note the code is hacky and might be wrong in ways + in reality transformers do know more because here I utilize only embedding layer.
However it does show there are very clear signals on characters in tokens in embedding vectors.
Thank you! I guess if there's enough spelling related text in the dataset, a model is forced to learn some info about token composition in order to predict such texts.
I wonder if it would help to explicitly insert this info into an embedding vector, similar to how we encode word position info. For example, allocate the first 20 vector elements to represent ASCII codes of token's characters (in some normalized way).
So tokens aren’t as important.