Tokenizers
#1001
Replies: 1 comment 1 reply
Unfortunately, building the tokenizer from scratch as well was a bit out of scope for the main chapters, but you can train your own tokenizer using the bonus materials here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/05_bpe-from-scratch
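To make the pointer concrete, here is a minimal sketch of the byte-pair-encoding merge loop that the linked bonus material covers, trained on your own text instead of a prebuilt vocabulary. The `train_bpe` function name and the tiny Swedish toy corpus are illustrative, not from the repository; a real domain tokenizer would be trained on a large corpus and would also handle byte-level fallback and special tokens.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from raw text: repeatedly fuse the most frequent
    adjacent symbol pair. Returns the learned merges in order."""
    # Represent each word as a tuple of characters; count word frequencies.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Toy example: the learned merges adapt to whatever text you feed in,
# including non-English characters such as Swedish å/ä/ö.
merges = train_bpe("näs näs näsa", 3)
print(merges)
```

Because the merges are learned from frequency statistics of your own corpus, a tokenizer trained this way on Swedish industry text will produce domain-appropriate subwords rather than the English-centric splits of a standard GPT-2 vocabulary.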
When I bought the book, I expected to use my own data for the vocabulary part. Here, standardized tokenizers are used, which is rather useless for my use case (I need a model specialized in the Swedish language for a specific industry). Isn't it a bit misleading to talk about an LLM from scratch if we are using a lot of pre-built components, e.g. the vocabulary and the weights?