Architectural limitations of open-source models cause problems.
Aman Sanger: Llama and many recent open-source models have a significant architectural limitation
They use multi-head attention instead of multi-query attention (which is used by PaLM and probably Claude 100K)
This can result in slowdowns of up to 30x
Here's the math behind why (1/n)
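
A rough sketch of where the gap likely comes from (this is not the thread's own calculation): during decoding, each generated token has to read the cached keys and values for all earlier tokens, and multi-head attention stores a separate key/value pair per head, while multi-query attention shares one pair across all heads. The dimensions below (32 layers, 32 heads, head size 128, fp16) are assumed Llama-7B-like values, and `kv_cache_bytes_per_token` is a hypothetical helper written for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored per token of context.

    The factor of 2 covers keys and values; bytes_per_value=2 assumes fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed Llama-7B-like shape: 32 layers, 32 attention heads, head_dim 128.
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

# Multi-head attention: one K/V pair per head.
mha_bytes = kv_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)

# Multi-query attention: a single K/V pair shared by all heads.
mqa_bytes = kv_cache_bytes_per_token(N_LAYERS, 1, HEAD_DIM)

print(f"MHA: {mha_bytes / 1024:.0f} KiB per token")
print(f"MQA: {mqa_bytes / 1024:.0f} KiB per token")
print(f"Cache size ratio: {mha_bytes // mqa_bytes}x")
```

Under these assumptions the cache is 32x larger with multi-head attention. That is a memory ratio, not a measured slowdown, but it is in the same range as the "up to 30x" figure when decoding is limited by memory bandwidth.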