New open weight model: Tiny Aya #963
rasbt announced in Announcements
I have been reading through the technical reports of the recent wave of open-weight LLM releases. Since many of you asked about multilingual support in the past, I thought Tiny Aya by Cohere was an interesting one. As far as I know, this is the strongest multilingual model in its size class.
I just did a from-scratch implementation here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/15_tiny-aya/standalone-tiny-aya-plus-kv-cache.ipynb
Architecture-wise, Tiny Aya is a classic decoder-style transformer with a few noteworthy modifications (besides the obvious ones like SwiGLU and Grouped Query Attention):
Parallel transformer blocks. A parallel transformer block computes attention and MLP from the same normalized input, then adds both to the residual in one step. I assume this is meant to reduce serial dependencies within a layer and improve computational throughput.
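To make the difference concrete, here is a minimal sketch (in NumPy, with stand-in `attn`/`mlp` functions rather than real attention and feed-forward layers) contrasting the serial GPT-2-style residual update with the parallel update described above. The normalization and the stand-in functions are illustrative placeholders, not Tiny Aya's actual layers:

```python
import numpy as np

def norm(x):
    # Placeholder normalization (RMS-style, for illustration only)
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)

def attn(x):
    # Stand-in for self-attention
    return 0.5 * x

def mlp(x):
    # Stand-in for the feed-forward (MLP) block
    return 2.0 * x

x = np.ones((2, 4))

# Serial (classic) block: two norms, two sequential residual adds
x_serial = x + attn(norm(x))
x_serial = x_serial + mlp(norm(x_serial))

# Parallel block: attn and mlp share one normalized input,
# and both are added to the residual in a single step
y = norm(x)
x_parallel = x + attn(y) + mlp(y)
```

Because `attn(y)` and `mlp(y)` depend only on the shared `y`, the two sub-blocks can in principle be computed concurrently, which is the throughput argument above.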
Sliding window attention. Specifically, it uses a 3:1 local:global ratio similar to Arcee Trinity and Olmo 3, with a window size of 4096. Also similar to Arcee, the sliding-window layers use RoPE, whereas the full-attention layers use NoPE (no positional encoding).
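A quick sketch of what the 3:1 layer pattern and a sliding-window causal mask look like. The exact layer indexing (which layer in each group of four is the global one) is an assumption here, and the window is shrunk to 4 so the mask is easy to inspect:

```python
import numpy as np

def is_local_layer(layer_idx, ratio=4):
    # 3 sliding-window (local) layers per 1 full-attention (global) layer;
    # assumes every 4th layer is global -- the actual placement may differ
    return (layer_idx + 1) % ratio != 0

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal (j <= i)
    # and within the last `window` positions (j > i - window)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=4)
```

In the real model the window would be 4096, so for sequences shorter than 4096 tokens the local layers behave identically to full causal attention.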
LayerNorm. Most architectures have moved to RMSNorm, as it is computationally a bit cheaper and performs well. Tiny Aya keeps it more classic with a modified version of LayerNorm: standard LayerNorm but without the shift (i.e., bias) parameter.
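A minimal NumPy sketch of this bias-free LayerNorm variant: unlike RMSNorm, it still subtracts the mean, but it only applies a learned scale and drops the shift parameter:

```python
import numpy as np

def layernorm_no_bias(x, scale, eps=1e-5):
    # Standard LayerNorm statistics: center by the mean and
    # divide by the standard deviation over the last axis
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # Learned scale only -- no shift/bias term, per the variant above
    return (x - mean) / np.sqrt(var + eps) * scale

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layernorm_no_bias(x, scale=np.ones(4))
```

Note that PyTorch's `nn.LayerNorm` can express this directly by disabling its bias, so the "modification" is a small one in practice.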