> 0: p=0.40 1: p=0.60 which suggests that 1 is the next bit and leads to a suboptimal starting point for predicting the bit after that. The error is even more prominent with longer sequences, as the joint probability distribution becomes harder to factorize into marginal distributions (as I would expect of any minimal algorithmic description of real-world data).
Can someone explain this part a bit more? I'm not seeing the issue. From what I see, if the first token (t1) output is a zero, then the next token (t2) would have probabilities 0:p=.90 and 1:p=.10. (And t2 0/1:p= .50/.50 if t1=1)
Mathematically, those line up with the initial distribution, so what's the concern? That's how conditional probability works.
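The disagreement is easier to see with concrete numbers. Here's a minimal sketch of a joint distribution over two bits that is consistent with the figures in both comments (the exact joint values are my reconstruction, not from the article). The conditionals are indeed correct, as the reply says; the issue is that greedily committing to the argmax of the first bit's marginal locks you out of the single most probable sequence:

```python
# Hypothetical joint distribution over two bits, reverse-engineered to match
# the thread's numbers: P(t1=0)=0.40, P(t2=0|t1=0)=0.90, P(t2=0|t1=1)=0.50.
joint = {"00": 0.36, "01": 0.04, "10": 0.30, "11": 0.30}

# Marginal over the first bit: 0.40 vs 0.60, so argmax picks t1=1.
p_t1 = {"0": joint["00"] + joint["01"], "1": joint["10"] + joint["11"]}
greedy_t1 = max(p_t1, key=p_t1.get)

# Having committed to t1=1, the best greedy decoding can reach is p=0.30...
greedy_seq = max((s for s in joint if s[0] == greedy_t1), key=joint.get)

# ...but the single most likely sequence starts with 0.
best_seq = max(joint, key=joint.get)

print(greedy_t1)                  # "1"
print(best_seq, joint[best_seq])  # "00" 0.36
print(joint[greedy_seq])          # 0.30
```

So both comments can be right at once: the conditional probabilities line up with the joint exactly as conditional probability demands, yet a decoder that takes the marginal-argmax bit first ends up on a branch whose best completion (0.30) is worse than the globally most likely sequence (0.36).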