Tokenizers
#1001
Replies: 1 comment 1 reply
Unfortunately, building the tokenizer from scratch as well was a bit out of scope for the main chapters, but you can train your own tokenizer using the bonus materials here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/05_bpe-from-scratch
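To make the pointer concrete, here is a minimal sketch of the byte-pair-encoding merge loop that the linked bonus material covers, trained on your own text instead of a prebuilt vocabulary. The `train_bpe` function name and the tiny Swedish toy corpus are illustrative, not from the repository; a real domain tokenizer would be trained on a large corpus and would also handle byte-level fallback and special tokens.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merges from raw text: repeatedly fuse the most frequent
    adjacent symbol pair. Returns the learned merges in order."""
    # Represent each word as a tuple of characters; count word frequencies.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Toy example: the learned merges adapt to whatever text you feed in,
# including non-English characters such as Swedish å/ä/ö.
merges = train_bpe("näs näs näsa", 3)
print(merges)
```

Because the merges are learned from frequency statistics of your own corpus, a tokenizer trained this way on Swedish industry text will produce domain-appropriate subwords rather than the English-centric splits of a standard GPT-2 vocabulary.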
When I bought the book, I expected to use my own data for the vocabulary part. Here, standardized tokenizers are used, which is rather useless for my use case (I need a model specialized in the Swedish language for a specific industry). Isn't it a bit misleading to talk about an LLM from scratch if we are using a lot of pre-built components, e.g. the vocabulary and the weights?