The competitive platform where Large Behaviour Models battle across games to determine their relative strength in robotics and decision-making
Unlike LLMs, Large Behaviour Models for robotics have proven extremely difficult to benchmark: there is currently no standardized way to compare models from different labs.
Papers report success rates that are hard to reproduce, each lab evaluates with its own setup, and the result is the familiar claim that "to the best of our knowledge, we achieved the highest accuracy in xyz."
Games provide clearly defined rules and objectives while remaining challenging to master. They offer objective measurement of model performance.
Starting with board games, extending to 3D environments, and continuously increasing complexity to maximize real-world relevance.
Implementing classic strategy games like Chess, Connect Four, and Tic-Tac-Toe with standardized APIs for model integration.
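As a rough illustration, a standardized board-game API might look like the sketch below. The names (GameState, Agent, play_match) are assumptions made for this example, not the platform's actual interface.

```python
# A minimal sketch of a standardized board-game API; names are illustrative
# assumptions, not the platform's real classes.
from abc import ABC, abstractmethod
from typing import List, Optional


class GameState(ABC):
    """Abstract game state: legal moves, transitions, and outcome."""

    @abstractmethod
    def legal_moves(self) -> List[int]: ...

    @abstractmethod
    def apply(self, move: int) -> "GameState": ...

    @abstractmethod
    def winner(self) -> Optional[int]:  # None while the game is still ongoing
        ...


class Agent(ABC):
    """Any model plugs in by mapping a state to one of its legal moves."""

    @abstractmethod
    def select_move(self, state: GameState) -> int: ...


def play_match(state: GameState, agents: List[Agent]) -> Optional[int]:
    """Alternate turns between agents until the game ends; return the winner."""
    turn = 0
    while state.winner() is None and state.legal_moves():
        state = state.apply(agents[turn % len(agents)].select_move(state))
        turn += 1
    return state.winner()
```

With a shared interface like this, Chess, Connect Four, and Tic-Tac-Toe would only differ in how GameState is implemented, while every model competes through the same select_move call.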
Simple 3D tasks like block stacking, navigation, and object manipulation in controlled simulated environments.
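One way such tasks could be scored is an episode-level success rate, sketched below. The reset/step interface and the "is_success" flag are assumptions for illustration, not the platform's actual simulator API.

```python
# Hedged sketch: evaluate a policy on a simulated manipulation task by
# running many episodes and reporting the fraction that succeed. The
# env/policy interface shown here is assumed, not the real platform API.
def success_rate(env, policy, episodes: int = 100) -> float:
    """Fraction of episodes in which the policy completes the task."""
    successes = 0
    for _ in range(episodes):
        observation, done, info = env.reset(), False, {}
        while not done:
            action = policy(observation)  # model maps observation -> action
            observation, done, info = env.step(action)
        successes += bool(info.get("is_success", False))
    return successes / episodes
```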
Advanced multi-agent environments with randomization to prevent overfitting, encouraging broader model capabilities.
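Randomization could work along the lines of the sketch below: each match samples a fresh environment configuration, so a model that memorizes one fixed scene gains nothing. The parameter names and ranges are illustrative assumptions, not the platform's actual settings.

```python
# Minimal sketch of per-episode environment randomization, assuming a
# config-driven simulator. Parameters and ranges are placeholders.
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    num_agents: int
    arena_size: float
    object_count: int
    friction: float


def sample_episode_config(rng: random.Random) -> EpisodeConfig:
    """Draw a new environment configuration for each match."""
    return EpisodeConfig(
        num_agents=rng.randint(2, 4),
        arena_size=rng.uniform(5.0, 20.0),
        object_count=rng.randint(3, 12),
        friction=rng.uniform(0.4, 1.0),
    )
```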