Feature request
Hi there,
This is a small detail, really, but I noticed that your simple quickstart:
from trl import SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()
uses around 25GB of GPU memory on an A100 in Colab. Unfortunately, that still requires Colab Pro, which is not ideal for students. To my surprise, this is also not impossibly far from fitting in a T4's memory (nominally 16GB, in practice 14-15GB).
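For context, here is a rough back-of-envelope estimate (my own, not from the TRL docs) of where that memory goes: full-precision training with Adam keeps four fp32 copies per parameter (weights, gradients, and the two Adam moment estimates), and the rest is activations.

```python
def static_training_memory_gb(n_params, bytes_per_param=4, n_adam_states=2):
    """Rough fp32 footprint of weights + gradients + Adam moments, in GB."""
    copies = 2 + n_adam_states  # weights, gradients, Adam m and v
    return n_params * bytes_per_param * copies / 1e9

# Qwen3-0.6B has roughly 0.6e9 parameters
print(static_training_memory_gb(0.6e9))  # ~9.6 GB before activations
```

On that estimate, the bulk of the remaining memory is activations, which scale with batch size and sequence length, so those are the natural knobs to turn.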
I'm going to look into the quickest way to reduce the memory footprint while keeping this example minimal (probably max_length?). But since the free hardware usually available to students worldwide remains Colab's T4, would you consider keeping that GPU in mind for these introductory examples? That would hugely help with preparing educational material, and would make the library more quickly and widely adopted as well!
What do you think? It might be as easy as adding one note in the docs saying "to run on a T4, add this option"... Thanks in advance for reading!
Motivation
Make the quickstart examples more accessible (especially for students).
Your contribution
- Pointing out the memory issue in Colab (free version).
- Some small tests:
A. 11GB VRAM
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=SFTConfig(per_device_train_batch_size=1),  # instead of the default of 8
    train_dataset=dataset,
)
trainer.train()
B. 13.6GB VRAM
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=SFTConfig(max_length=256),  # instead of the default of 1024
    train_dataset=dataset,
)
trainer.train()
C. 12.4GB VRAM
# reduced from https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora
import torch

# Check if the GPU benefits from bfloat16 (compute capability >= 8, i.e. Ampere or newer)
if torch.cuda.get_device_capability()[0] >= 8:
    torch_dtype = torch.bfloat16
else:
    torch_dtype = torch.float16

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=SFTConfig(
        # default max_length of 1024
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        gradient_checkpointing=True,
        fp16=(torch_dtype == torch.float16),
        bf16=(torch_dtype == torch.bfloat16),
    ),
    train_dataset=dataset,
)
trainer.train()
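The rule of thumb behind these knobs (a simplification of mine, ignoring the attention-matrix term and any fixed overhead): activation memory scales roughly linearly with per_device_train_batch_size × max_length, so either knob alone already cuts it by a large factor.

```python
def activation_scale(batch_size, seq_len, ref_batch=8, ref_seq=1024):
    """Activation memory relative to the defaults (batch size 8, max_length 1024)."""
    return (batch_size * seq_len) / (ref_batch * ref_seq)

print(activation_scale(1, 1024))  # test A: 0.125 of the default activations
print(activation_scale(8, 256))   # test B: 0.25
```

Note that the measured totals (11GB and 13.6GB) don't drop proportionally, because the static weights/gradients/optimizer footprint stays the same regardless of batch size or sequence length.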