
Gauntlet 4K RLVR

An RLVR dataset curated from Gauntlet, delivered in Harbor format for reinforcement learning.

Erik Quintanilla
5 min read
An RLVR dataset for teams training agentic coding models with reinforcement learning.

While supervised fine-tuning gets you far, agentic reinforcement learning with verifiable rewards (RLVR) is how you push models to actually solve problems. We've curated Gauntlet into a focused 4,000-example RLVR dataset, delivered in Harbor format for seamless integration with agentic RL training pipelines.

We trained EssentialAI's rnj-1-instruct, an 8B agentic coding model, using the Harbor framework, with rollouts pushed to SkyRL. The result punches well above its weight class: a 3x improvement on Terminal Bench 2.0, an out-of-distribution benchmark the model never saw during training.

01

What's in Gauntlet 4K RLVR

Gauntlet 4K RLVR is a carefully curated subset of the original Gauntlet dataset, optimized for reinforcement learning workflows. Each example provides a verifiable reward signal through pytest tests.

- 4,000 examples: curated for RL training efficiency
- Harbor format: ready for RL pipelines
- Pytest verification: binary reward signals
- Gauntlet source: real developer workflows
Harbor Format
Delivered in Harbor format, the dataset integrates directly with modern RL training frameworks. Each example includes the task prompt, verification tests, and pre-configured Docker environments.
Verifiable Rewards
Every example produces a binary reward signal: the generated code either passes all tests (reward = 1) or it doesn't (reward = 0). No fuzzy metrics, no LLM-as-judge. Just ground truth.
02

Training Setup

We trained rnj-1-instruct 8B on the Gauntlet 4K RLVR dataset using the terminus-2 harness. After GRPO showed instability during training, we switched to DAPO, which converged consistently.

- Base model: rnj-1-instruct (8B parameters)
- Training time: 80 hours on 8x A100 GPUs
- Method: LoRA (rank 64)
- Algorithm: DAPO (stable RL optimization)
Component            Details
Harness              terminus-2
Dataset              Gauntlet 4K RLVR
Hardware             8x NVIDIA A100
Training duration    80 hours
LoRA rank            64
Algorithm            DAPO (GRPO was unstable)
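The move from GRPO to DAPO is largely a change in how a group of rollouts for the same prompt is turned into advantages. A minimal sketch of the common structure, under our reading of the two algorithms and assuming binary rewards (this is illustrative, not our training code): both normalize rewards within each group, and DAPO's dynamic sampling additionally drops degenerate groups where every rollout earned the same reward, since their normalized advantages are near zero and contribute no gradient.

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def filter_degenerate_groups(groups):
    """DAPO-style dynamic sampling: discard all-pass or all-fail groups,
    which carry no learning signal under group normalization."""
    return [g for g in groups if len(set(g)) > 1]
```

With binary pytest rewards, all-pass and all-fail groups become common as training progresses, which is one reason a DAPO-style filter can stabilize optimization.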
03

Terminal Bench 2.0 Results

We evaluated on Terminal Bench 2.0, an out-of-distribution benchmark the model never saw during training. This tests whether the model learned generalizable coding skills, not just pattern matching on the training data.

The trained 8B model shows outsized performance, achieving results comparable to models with 20B+ parameters. Below we compare against other models evaluated on the same terminus-2 harness.

- Problems solved: 3 → 9 (3x)
- Pass@1 rate: 3.4% → 10.1% (+197%)
Terminal Bench 2.0 Comparison

Model                                               Score            Source
GPT-OSS-20B*                                        3.1% ± 1.5       Leaderboard
Essential AI rnj-1-instruct 8B (Baseline)           3.4% (pass@1)    Our run
GPT-5-Nano*                                         7.9% ± 1.9       Leaderboard
Essential AI rnj-1-instruct 8B + Gauntlet 4K RLVR   10.1% (pass@1)   Our run
Grok Code Fast 1*                                   14.2% ± 2.5      Leaderboard

*Results pulled from the Terminal Bench 2.0 leaderboard.

Note: Terminal Bench 2.0 is an out-of-distribution benchmark with 89 terminal-based coding tasks. The model was never exposed to these tasks during training, demonstrating genuine skill transfer from the Gauntlet RLVR training.
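The headline numbers line up with the 89-task benchmark: solving 3 vs. 9 of 89 tasks reproduces the reported pass@1 rates and the 3x improvement. A quick sanity check (not part of the evaluation harness):

```python
TOTAL_TASKS = 89  # Terminal Bench 2.0 task count

baseline_solved, trained_solved = 3, 9
baseline_pass1 = baseline_solved / TOTAL_TASKS
trained_pass1 = trained_solved / TOTAL_TASKS

print(f"baseline pass@1: {baseline_pass1:.1%}")  # 3.4%
print(f"trained pass@1:  {trained_pass1:.1%}")   # 10.1%
print(f"improvement: {trained_solved / baseline_solved:.0f}x")  # 3x
```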

04

Summary

Key Numbers
Dataset size                                     4,000 curated examples
Format                                           Harbor
Verification                                     Pytest binary rewards
Terminal Bench 2.0 pass@1 (rnj-1-instruct 8B)    3.4% → 10.1% (3x)
Training                                         80h on 8x A100, DAPO + SkyRL + Harbor

RLVR works. With just 4,000 carefully curated examples and verifiable rewards, we achieved a 3x improvement on an out-of-distribution benchmark. The key is quality over quantity, and rewards you can trust.

Interested in training on Gauntlet 4K RLVR? Book a call.