Interesting that both de-novo and porting seems to have worked.
I do not understand why GGML is written this way, though. So much duplication, one variant per instruction set. Our Gemma.cpp only requires a single backend written using Highway's portable intrinsics, and last I checked for decode on SKX+Zen4, is also faster.
I do not understand why GGML is written this way, though. So much duplication, one variant per instruction set. Our Gemma.cpp only requires a single backend written using Highway's portable intrinsics, and last I checked for decode on SKX+Zen4, is also faster.