
Aren't stacked layers basically doing n^k attention? A single attention block is n^2 because it allows one number per input/output token pair, but nothing prevents you from stacking these blocks on top of each other and getting a k-th order of "attentionness", with each layer encoding a different order.
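
For concreteness, here's a rough NumPy sketch of what I mean (toy sizes, single head, random weights, no residuals or normalization, so just an illustration of the shapes, not any particular model):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention_layer(X, Wq, Wk, Wv):
        # X: (n, d) token representations
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        # scores is the n x n matrix: one scalar per input/output token pair,
        # which is where the n^2 cost of a single block comes from
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        return softmax(scores, axis=-1) @ V

    rng = np.random.default_rng(0)
    n, d, k = 8, 16, 3            # 8 tokens, 16 dims, 3 stacked layers
    X = rng.standard_normal((n, d))
    for _ in range(k):
        Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
        # each layer computes pairwise scores over representations that
        # already mix pairs from the previous layer
        X = attention_layer(X, Wq, Wk, Wv)

The point being that layer k's n x n score matrix is computed over tokens that already aggregate pairwise interactions from layer k-1, so composition across depth is where any higher-order mixing would have to come from.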




