Simulation Evaluation
Always evaluate in simulation first, even if you have a real robot. Sim evaluation is fast, safe, and gives you a reproducible baseline number you can compare to after retraining.
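The shape of a sim evaluation loop is simple: reset, roll the policy out, record success, repeat. A minimal sketch, assuming nothing about your simulator — the `StubEnv` and `policy_act` below are hypothetical stand-ins, not any real simulator or policy API:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for a real simulator and learned policy.
@dataclass
class StubEnv:
    target: float = 0.5
    def reset(self):
        self.pos = 0.0
        return self.pos
    def step(self, action):
        self.pos += action
        success = abs(self.pos - self.target) < 0.05
        return self.pos, success

def policy_act(obs, target=0.5):
    # Toy proportional controller standing in for a trained policy.
    return 0.5 * (target - obs)

def evaluate(n_episodes=20, max_steps=50):
    env = StubEnv()
    successes = 0
    for _ in range(n_episodes):
        obs = env.reset()
        for _ in range(max_steps):
            obs, success = env.step(policy_act(obs))
            if success:
                successes += 1
                break
    return successes / n_episodes

print(evaluate())  # deterministic stub, so every episode succeeds -> 1.0
```

The number this loop prints is your baseline: rerun the same loop with the same seed and episode count after every retrain, and compare.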
Real Robot Safety Checklist
If you are evaluating on a real robot, run through this checklist before your first rollout. An untested policy can move in unexpected ways.
- Clear the workspace of any objects not part of the task. The policy learned to act in a specific visual context — unexpected objects can cause erratic behavior.
- Stay at the emergency stop (E-stop) or be ready to press Ctrl+C for the entire evaluation session. Do not walk away from a running policy.
- Start with the robot's speed limited to 50% of maximum. Drop to 30% if the first trial looks jerky or imprecise.
- Position objects to match your training workspace setup exactly. Use the same camera angle, same lighting, same object colors. Distribution shift is the most common cause of zero real-world success rate.
- Never command motions beyond the physical joint limits of your robot. Check these limits in your robot config before the first run.
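The speed and joint-limit items on this checklist can also be enforced in software, between the policy's output and the robot. A minimal sketch, assuming position-commanded joints — the limit values, the 50% scale, and the function name are illustrative, not from any real robot config:

```python
# Hypothetical safety wrapper: clamp commanded joint positions to hard
# limits and cap per-step motion at a fraction of maximum speed. All
# numeric values here are illustrative assumptions.
JOINT_LIMITS = [(-1.57, 1.57), (-1.0, 1.0), (0.0, 2.0)]  # radians, per joint
SPEED_SCALE = 0.5   # start at 50% of maximum, per the checklist
MAX_STEP = 0.1      # max per-timestep joint delta at full speed

def safe_command(current, target):
    cmd = []
    for cur, tgt, (lo, hi) in zip(current, target, JOINT_LIMITS):
        delta = tgt - cur
        # Cap per-step motion, then clamp to the joint's hard limits.
        step = MAX_STEP * SPEED_SCALE
        delta = max(-step, min(step, delta))
        cmd.append(max(lo, min(hi, cur + delta)))
    return cmd

print(safe_command([0.0, 0.0, 1.9], [2.0, 0.02, 3.0]))
# -> [0.05, 0.02, 1.95]: large jumps are slowed, limits are respected
```

A software clamp is a complement to the E-stop, not a replacement for it: keep your hand on the physical stop regardless.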
Real Robot Evaluation Protocol
Run exactly 20 trials. This is enough for a coarse but comparable success-rate estimate — at a 50% success rate, the 95% confidence interval spans roughly ±20 percentage points — while keeping the session short enough to score every trial carefully. Record each trial on video — you will need the footage to diagnose failure modes.
After each trial, manually score it: 1 for complete task success, 0 for any failure (partial grasps, drops, misses). Your success rate is the sum divided by 20.
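The scoring above can be turned into a success rate with a confidence interval in a few lines of standard-library Python. This sketch uses the Wilson score interval, which behaves better than the naive normal approximation at small n; the example scores are made up:

```python
import math

def success_rate_ci(scores, z=1.96):
    """Success rate plus a Wilson 95% confidence interval."""
    n = len(scores)
    p = sum(scores) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - margin, center + margin

# 20 binary trial scores (illustrative data): 14 successes, 6 failures.
scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]
p, lo, hi = success_rate_ci(scores)
print(round(p, 2), round(lo, 2), round(hi, 2))  # -> 0.7 0.48 0.85
```

Note how wide the interval is at n = 20: a 70% measured success rate is statistically compatible with anything from roughly 48% to 85%. Treat small differences between runs accordingly.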
Diagnosing Failure Modes
Watch your video recordings and categorize failures. Most failures fall into one of three categories:
Inconsistent approach trajectory — the arm never fully commits to the grasp
The policy is averaging across multiple grasp strategies in your training data. This happens when some demonstrations approach from the left and others from the right, or when the timing of the gripper close varies between demonstrations. Fix: re-record with a single, deliberate strategy used consistently in every demonstration.
Trajectory looks reasonable but precision is off by 1–2cm consistently
The model is learning the right behavior but lacks the capacity to execute it precisely. This happens when chunk_size is too short (not enough planning horizon) or when dim_feedforward is too small. Fix: increase chunk_size (e.g. to 150) and retrain, or add more diverse demonstrations to regularize the network.
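As a sketch of applying that fix, assuming your training script accepts a dict of policy hyperparameter overrides — the override mechanism and the dim_feedforward value below are assumptions; only chunk_size = 150 comes from the text above, and the exact config interface depends on your framework:

```python
# Hypothetical hyperparameter overrides for the precision-failure fix.
# chunk_size = 150 is the value suggested above; dim_feedforward = 3200
# is an illustrative capacity bump, not a recommendation.
policy_overrides = {
    "chunk_size": 150,        # longer action chunk -> more planning horizon
    "dim_feedforward": 3200,  # wider feedforward layers -> more capacity
}

def apply_overrides(base_config, overrides):
    # Shallow merge: override values win over the base config.
    merged = dict(base_config)
    merged.update(overrides)
    return merged

base = {"chunk_size": 100, "dim_feedforward": 1600, "lr": 1e-5}
print(apply_overrides(base, policy_overrides))
```

Change one hyperparameter at a time and re-run the same 20-trial protocol, so you can attribute any change in success rate to the change you made.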
Works perfectly in some positions, fails completely in others
The object positions during evaluation are outside the distribution of your training data. The policy has not seen those positions before. Fix: collect more demonstrations with more diverse object positions, or constrain your evaluation to positions that are well-represented in your training data.
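The per-trial labels from your video review can be tallied in a few lines to identify the primary failure mode. The category names below simply mirror the three failure modes above; the label data is made up:

```python
from collections import Counter

# One label per failed trial, assigned while reviewing the videos.
# "averaging" = inconsistent approach, "precision" = consistent 1-2 cm
# offset, "ood" = out-of-distribution object position. Illustrative data.
failure_labels = ["ood", "averaging", "ood", "precision", "ood", "ood"]

counts = Counter(failure_labels)
primary = counts.most_common(1)[0][0]
print(counts)   # tally per failure mode
print(primary)  # -> "ood": the dominant failure, which guides the fix
```

Whatever label dominates is the diagnosis you carry into Unit 6: "averaging" points to data quality, "precision" to model capacity, and "ood" to distribution shift.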
Unit 5 Complete When...
You have run 20 evaluation trials (in sim or on your real robot) and measured a success rate. You have watched all failure-mode videos and identified whether the primary failure is data quality, model capacity, or distribution shift. You have this diagnosis written down — you will use it to guide your data collection in Unit 6.