New open weight model: Tiny Aya #963
rasbt announced in Announcements
I have been reading through the technical reports of the recent wave of open-weight LLM releases. Since many of you asked about multilingual support in the past, I thought Tiny Aya by Cohere was an interesting one. As far as I know, this is the strongest multilingual model in its size class.
I just did a from-scratch implementation here: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/15_tiny-aya/standalone-tiny-aya-plus-kv-cache.ipynb
Architecture-wise, Tiny Aya is a classic decoder-style transformer with a few noteworthy modifications (besides the obvious ones like SwiGLU and Grouped Query Attention):
Parallel transformer blocks. A parallel transformer block computes attention and MLP from the same normalized input, then adds both to the residual in one step. I assume this is meant to reduce serial dependencies within a layer and improve computational throughput.
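To make the difference concrete, here is a minimal sketch (in NumPy, with stand-in `attn`/`mlp` functions rather than real attention and feed-forward layers) contrasting the serial GPT-2-style residual update with the parallel update described above. The normalization and the stand-in functions are illustrative placeholders, not Tiny Aya's actual layers:

```python
import numpy as np

def norm(x):
    # Placeholder normalization (RMS-style, for illustration only)
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-6)

def attn(x):
    # Stand-in for self-attention
    return 0.5 * x

def mlp(x):
    # Stand-in for the feed-forward (MLP) block
    return 2.0 * x

x = np.ones((2, 4))

# Serial (classic) block: two norms, two sequential residual adds
x_serial = x + attn(norm(x))
x_serial = x_serial + mlp(norm(x_serial))

# Parallel block: attn and mlp share one normalized input,
# and both are added to the residual in a single step
y = norm(x)
x_parallel = x + attn(y) + mlp(y)
```

Because `attn(y)` and `mlp(y)` depend only on the shared `y`, the two sub-blocks can in principle be computed concurrently, which is the throughput argument above.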
Sliding window attention. Specifically, it uses a 3:1 local:global ratio similar to Arcee Trinity and Olmo 3, with a window size of 4096. Also similar to Arcee, the sliding-window layers use RoPE, whereas the full-attention layers use NoPE (no positional encoding).
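A quick sketch of what the 3:1 layer pattern and a sliding-window causal mask look like. The exact layer indexing (which layer in each group of four is the global one) is an assumption here, and the window is shrunk to 4 so the mask is easy to inspect:

```python
import numpy as np

def is_local_layer(layer_idx, ratio=4):
    # 3 sliding-window (local) layers per 1 full-attention (global) layer;
    # assumes every 4th layer is global -- the actual placement may differ
    return (layer_idx + 1) % ratio != 0

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal (j <= i)
    # and within the last `window` positions (j > i - window)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=4)
```

In the real model the window would be 4096, so for sequences shorter than 4096 tokens the local layers behave identically to full causal attention.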
LayerNorm. Most architectures have moved to RMSNorm, as it is computationally a bit cheaper and performs well. Tiny Aya keeps it more classic with a modified version of LayerNorm: standard LayerNorm but without the shift (i.e., bias) parameter.
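A minimal NumPy sketch of this bias-free LayerNorm variant: unlike RMSNorm, it still subtracts the mean, but it only applies a learned scale and drops the shift parameter:

```python
import numpy as np

def layernorm_no_bias(x, scale, eps=1e-5):
    # Standard LayerNorm statistics: center by the mean and
    # divide by the standard deviation over the last axis
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    # Learned scale only -- no shift/bias term, per the variant above
    return (x - mean) / np.sqrt(var + eps) * scale

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layernorm_no_bias(x, scale=np.ones(4))
```

Note that PyTorch's `nn.LayerNorm` can express this directly by disabling its bias, so the "modification" is a small one in practice.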