Transformer Reinforcement Learning (TRL)

TRL is a full-stack library that provides a set of tools to train transformer language models with methods such as Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), Reward Modeling, and more. The library is integrated with 🤗 Transformers. In addition to using W&B to record your training metrics, you can integrate Weave with your TRL workflows to gain observability into how your model performs during training. Weave records inputs, outputs, and timestamps for each evaluation step so you can inspect the quality of the responses your model generates. This guide shows you how to use TRL with Weave and W&B.

Getting started

Install TRL with uv or pip. Choose one of the options below based on your setup.
# Using uv
uv pip install trl

# Using pip
pip install trl
Then install Weave and W&B using one of the options below, again based on your setup:
# Using uv
uv pip install weave wandb

# Using pip
pip install weave wandb
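
If you have not authenticated with W&B on this machine before, you can log in from the command line before running any scripts. (The example script below also calls wandb.login() programmatically, so either approach works.)
# Log in to W&B (prompts for your API key)
wandb login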

Training models with TRL and logging traces/completions using Weave

Once you have installed the necessary libraries, you can use TRL's built-in WeaveCallback to log traces and completions at each evaluation step. Because the callback logs data during evaluation phases, you must pass an evaluation dataset to the trainer. The following example script shows how to train a model with the GRPOTrainer, run evaluation, and log the results to Weave. Run the example and inspect the results in Weave:

import os

# Set your API keys. The OpenAI key is not used directly in this example;
# include it only if your workflow also calls OpenAI models.
os.environ["WANDB_API_KEY"] = "<YOUR-WANDB-API-KEY>"
os.environ["OPENAI_API_KEY"] = "<YOUR-OPENAI-API-KEY>"

import wandb
import weave
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer, WeaveCallback

# Log in to W&B
wandb.login()

# Load the datasets
train_dataset, eval_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split=['train[:5%]', 'test[:5%]'])

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
# Add the Weave callback to the trainer (additional optional arguments omitted)
weave_callback = WeaveCallback(trainer=trainer)
trainer.add_callback(weave_callback)

trainer.train()
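
How often the callback logs depends on how often the trainer evaluates. If you want completions logged periodically during training, you can enable step-based evaluation in GRPOConfig, which inherits the standard Hugging Face TrainingArguments fields. The sketch below is illustrative only; the interval and batch size are arbitrary values, not recommendations:
# Illustrative sketch: trigger evaluation (and Weave logging) every 50 steps.
# eval_strategy and eval_steps are inherited from transformers.TrainingArguments.
training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    eval_strategy="steps",          # evaluate at a fixed step interval
    eval_steps=50,                  # number of training steps between evaluations
    per_device_eval_batch_size=8,   # batch size used during evaluation
)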

Resources

Here are some resources you can use to learn more about integrating Weave with TRL workflows:
  1. WeaveCallback documentation
  2. Curated examples that show how to use Weave with TRL for different algorithms.