
Gauntlet 4K RLVR

An RLVR dataset curated from Gauntlet, delivered in Harbor format for reinforcement learning.

Erik Quintanilla
5 min read
An RLVR dataset for teams training agentic coding models with reinforcement learning.

While supervised fine-tuning gets you far, agentic reinforcement learning with verifiable rewards (RLVR) is how you push models to actually solve problems. We've curated Gauntlet into a focused 4,000-example RLVR dataset, delivered in Harbor format for seamless integration with agentic RL training pipelines.

We trained EssentialAI's rnj-1-instruct, an 8B agentic coding model, using the Harbor framework, with rollouts pushed to SkyRL. The result punches well above its weight class: a 3x improvement on Terminal Bench 2.0, an out-of-distribution benchmark the model never saw during training.

01

What's in Gauntlet 4K RLVR

Gauntlet 4K RLVR is a carefully curated subset of the original Gauntlet dataset, optimized for reinforcement learning workflows. Each example provides a verifiable reward signal through pytest tests.

- 4,000 examples: curated for RL training efficiency
- Harbor format: ready for RL pipelines
- Pytest verification: binary reward signals
- Gauntlet source: real developer workflows
Harbor Format
Delivered in Harbor format, the dataset integrates directly with modern RL training frameworks. Each example includes the task prompt, verification tests, and pre-configured Docker environments.
Verifiable Rewards
Every example produces a binary reward signal: the generated code either passes all tests (reward = 1) or it doesn't (reward = 0). No fuzzy metrics, no LLM-as-judge. Just ground truth.
02

Training Setup

We trained rnj-1-instruct 8B on the Gauntlet 4K RLVR dataset using the terminus-2 harness. After GRPO showed instability during training, we switched to DAPO, which converged consistently.

- Base model: rnj-1-instruct (8B parameters)
- Training time: 80 hours on 8x A100 GPUs
- Method: LoRA (rank 64)
- Algorithm: DAPO (stable RL optimization)
Component            Details
Harness              terminus-2
Dataset              Gauntlet 4K RLVR
Hardware             8x NVIDIA A100
Training duration    80 hours
LoRA rank            64
Algorithm            DAPO (GRPO was unstable)
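The move from GRPO to DAPO is largely a change in how a group of rollouts for the same prompt is turned into advantages. A minimal sketch of the common structure, under our reading of the two algorithms and assuming binary rewards (this is illustrative, not our training code): both normalize rewards within each group, and DAPO's dynamic sampling additionally drops degenerate groups where every rollout earned the same reward, since their normalized advantages are near zero and contribute no gradient.

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def filter_degenerate_groups(groups):
    """DAPO-style dynamic sampling: discard all-pass or all-fail groups,
    which carry no learning signal under group normalization."""
    return [g for g in groups if len(set(g)) > 1]
```

With binary pytest rewards, all-pass and all-fail groups become common as training progresses, which is one reason a DAPO-style filter can stabilize optimization.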
03

Terminal Bench 2.0 Results

We evaluated on Terminal Bench 2.0, an out-of-distribution benchmark the model never saw during training. This tests whether the model learned generalizable coding skills, not just pattern matching on the training data.

The trained 8B model shows outsized performance, achieving results comparable to models with 20B+ parameters. Below we compare against other models evaluated on the same terminus-2 harness.

- Problems solved: 3 → 9 (3x)
- Pass@1 rate: 3.4% → 10.1% (+197%)
Terminal Bench 2.0 Comparison

Model                                               Score            Source
GPT-OSS-20B*                                        3.1% ± 1.5       Leaderboard
Essential AI rnj-1-instruct 8B (Baseline)           3.4% (pass@1)    Our run
GPT-5-Nano*                                         7.9% ± 1.9       Leaderboard
Essential AI rnj-1-instruct 8B + Gauntlet 4K RLVR   10.1% (pass@1)   Our run
Grok Code Fast 1*                                   14.2% ± 2.5      Leaderboard

*Results pulled from the Terminal Bench 2.0 leaderboard.

Note: Terminal Bench 2.0 is an out-of-distribution benchmark with 89 terminal-based coding tasks. The model was never exposed to these tasks during training, demonstrating genuine skill transfer from the Gauntlet RLVR training.
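The headline numbers line up with the 89-task benchmark: solving 3 vs. 9 of 89 tasks reproduces the reported pass@1 rates and the 3x improvement. A quick sanity check (not part of the evaluation harness):

```python
TOTAL_TASKS = 89  # Terminal Bench 2.0 task count

baseline_solved, trained_solved = 3, 9
baseline_pass1 = baseline_solved / TOTAL_TASKS
trained_pass1 = trained_solved / TOTAL_TASKS

print(f"baseline pass@1: {baseline_pass1:.1%}")  # 3.4%
print(f"trained pass@1:  {trained_pass1:.1%}")   # 10.1%
print(f"improvement: {trained_solved / baseline_solved:.0f}x")  # 3x
```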

04

Summary

Key Numbers
Dataset size                                     4,000 curated examples
Format                                           Harbor
Verification                                     Pytest binary rewards
Terminal Bench 2.0 pass@1 (rnj-1-instruct 8B)    3.4% → 10.1% (3x)
Training                                         80h on 8x A100, DAPO + SkyRL + Harbor

RLVR works. With just 4,000 carefully curated examples and verifiable rewards, we achieved a 3x improvement on an out-of-distribution benchmark. The key is quality over quantity, and rewards you can trust.

Interested in training on Gauntlet 4K RLVR? Book a call.