Abstract
Self-driving laboratories (SDLs) promise to revolutionize materials research, yet they remain limited by rigid protocols. We present AILA (Artificially Intelligent Lab Assistant), a framework that automates atomic force microscopy (AFM) through LLM-driven agents. Our AFMBench evaluation suite comprehensively tests AI agents across the scientific workflow, from experiment design to analysis. We benchmark several leading LLMs (Claude, GPT, Gemini, LLaMa) against human performance, revealing significant limitations. Our assessment demonstrates that current language models struggle with basic tasks such as documentation retrieval and show concerning tendencies to deviate from instructions. These challenges intensify in multi-agent scenarios, highlighting critical safety alignment concerns. We showcase AILA's capabilities through increasingly complex AFM applications, including calibration, feature detection, and property measurement. Our findings underscore the need for rigorous benchmarking before deploying AI laboratory assistants in scientific research.