Long-Horizon Agent Evaluation

ThriftSim: A Simulated Retail Universe for Agent Evaluation

The first agent evaluation environment where inventory is unstructured, valuation is uncertain, and decisions require human judgment rather than SKU lookup.

The Problem with Current Benchmarks

Most agent benchmarks ask: Did the agent complete the task?

In real operational environments, task completion is not enough. What matters is the business outcome: revenue, waste reduction, processing speed, pricing accuracy, and operational efficiency.

Secondhand retail exposes this gap. Inventory changes continuously. Metadata is incomplete. Condition grading is subjective. Valuation requires human judgment. Synthetic benchmarks rarely reproduce these conditions.

ThriftSim: The Simulated Universe

The real asset isn't the benchmark. It's the world. ThriftSim is a complete simulated thrift store universe containing inventory, employees, customers, communication channels, pricing history, donation streams, and operational constraints.

Live Operational Data

Real inventory from 6+ consignment partners. Actual customer interactions. Human expert baselines from experienced thrift workers completing intake, sorting, pricing, and merchandising tasks.

Simulated Environment

Donation streams, customer emails, supplier messages, Slack channels, inventory overflow, seasonal demand patterns, and operational constraints that mirror real thrift store dynamics.

Automatic Benchmark Generation

Once the world exists, new benchmarks can be generated automatically. ThriftBench and ThriftEval become outputs of the simulation infrastructure, not standalone artifacts.
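As a rough illustration of how a benchmark could fall out of the simulated world, the sketch below samples items from a world state and turns each into a pricing task with a held-out reference price. Everything here is hypothetical: the `Item` and `PricingTask` types, the `generate_pricing_tasks` function, and the idea of using a recorded sale price as the reference are assumptions for illustration, not ThriftSim's actual interface.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical slice of world state for one inventory item."""
    item_id: str
    description: str
    sold_price: float  # assumed to come from the simulated pricing history


@dataclass(frozen=True)
class PricingTask:
    """One generated benchmark task (hypothetical schema)."""
    task_id: str
    prompt: str
    reference_price: float  # held out from the agent, used only for scoring


def generate_pricing_tasks(items: list[Item], n: int, seed: int) -> list[PricingTask]:
    """Sample up to n items and wrap each one as a pricing task.

    A fixed seed makes the generated benchmark reproducible.
    """
    rng = random.Random(seed)
    chosen = rng.sample(items, k=min(n, len(items)))
    return [
        PricingTask(
            task_id=f"pricing-{seed}-{i}",
            prompt=f"Set a shelf price for: {item.description}",
            reference_price=item.sold_price,
        )
        for i, item in enumerate(chosen)
    ]
```

Because the sampler is seeded, the same world state and seed yield the same task set, so results from different agents stay comparable.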

ThriftEval: Business-Grounded Evaluation

Composite Scoring Framework

Revenue Accuracy: Did the agent maximize revenue?

Sell-Through Rate: Did inventory actually sell?

Waste Diversion: How much stayed out of landfill?

Processing Speed: How quickly was inventory listed?

Pricing Accuracy: How close to human expert judgment?

Operational Efficiency: How many actions were required?
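One plausible way to combine the six metrics above is a weighted mean over values normalized to the range 0 to 1. The sketch below assumes that form. The weights, the metric keys, and the `composite_score` function are illustrative assumptions; the source does not state how the composite is actually computed.

```python
# Hypothetical weights: the six metric names come from the framework above,
# but how they are weighted is an assumption made for this sketch.
ASSUMED_WEIGHTS = {
    "revenue_accuracy": 0.25,
    "sell_through_rate": 0.20,
    "waste_diversion": 0.15,
    "processing_speed": 0.15,
    "pricing_accuracy": 0.15,
    "operational_efficiency": 0.10,
}


def composite_score(metrics: dict[str, float],
                    weights: dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted mean of metric values, each already normalized to [0, 1]."""
    missing = set(weights) - set(metrics)
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= metrics[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {metrics[name]}")
    total_weight = sum(weights.values())
    # Dividing by the total keeps the result in [0, 1] even if the
    # weights do not sum exactly to 1.
    return sum(weights[name] * metrics[name] for name in weights) / total_weight
```

A weighted mean is easy to explain but lets a strong metric mask a weak one; a stricter variant might also require a minimum on each metric.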

Human Baselines

Experienced thrift workers complete tasks (intake, sorting, pricing, merchandising). We record their actions, decisions, and rationale. These recordings become the gold standard for agent evaluation.
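To make the recorded baseline concrete, here is a minimal sketch of what one logged step might look like, with the action, decision, and rationale stored together and serialized as JSON lines. The `BaselineStep` schema, its field names, and the `to_jsonl` helper are hypothetical; only the four task types and the three recorded elements come from the description above.

```python
import json
from dataclasses import asdict, dataclass

# Task types named in the description above.
TASKS = {"intake", "sorting", "pricing", "merchandising"}


@dataclass(frozen=True)
class BaselineStep:
    """One recorded step from a human expert (hypothetical schema)."""
    worker_id: str
    task: str
    item_id: str
    action: str     # what the worker did
    decision: str   # the outcome they chose
    rationale: str  # why, in the worker's own words

    def __post_init__(self) -> None:
        if self.task not in TASKS:
            raise ValueError(f"unknown task type: {self.task!r}")


def to_jsonl(steps: list[BaselineStep]) -> str:
    """Serialize steps as one JSON object per line."""
    return "\n".join(json.dumps(asdict(step), sort_keys=True) for step in steps)
```

Keeping the rationale next to the decision makes it possible to compare not only what an agent chose but also whether its stated reasoning resembles the expert's.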

Evaluation Rubrics

Agents are scored against human expert performance on pricing accuracy, category classification, repair decisions, and operational efficiency. Composite scores reflect real business value.
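As one example of scoring against expert performance, pricing accuracy could be measured as one minus the relative error between the agent's price and the expert's price, floored at zero and averaged over items. This definition and the `pricing_accuracy` function are assumptions for illustration; the source does not specify the rubric's formula.

```python
def pricing_accuracy(agent_prices: list[float], expert_prices: list[float]) -> float:
    """Mean of (1 - relative error) per item, floored at 0.

    Assumed definition: 1.0 means the agent matched the expert exactly,
    and 0.0 means it was off by 100% or more of the expert price.
    """
    if not agent_prices or len(agent_prices) != len(expert_prices):
        raise ValueError("price lists must be non-empty and of equal length")
    scores = []
    for agent, expert in zip(agent_prices, expert_prices):
        if expert <= 0:
            raise ValueError("expert prices must be positive")
        relative_error = abs(agent - expert) / expert
        scores.append(max(0.0, 1.0 - relative_error))
    return sum(scores) / len(scores)
```

For instance, an agent that prices an item at 12.00 where the expert chose 10.00 has a relative error of 0.2 and scores 0.8 on that item under this definition.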

Collaborate on ThriftSim

We're building the first long-horizon agent evaluation environment for operational retail. Interested in research collaboration, benchmark access, or evaluation partnerships?

Contact Research Team