IMMERSIVE COMMONS · THE SIGNALISSUE 10 · 14-20 JUN 2026
OPEN INTELLIGENCE · ISSUE 10

THE SIGNAL
14-20 JUN 2026
FRONTIER TOWER
10

The Week The Chokepoint Failed

Washington pulled the most powerful public model three days after it shipped, and the open-weight world filled the gap before Anthropic could restore service. Underneath, the proofs kept coming up short: the frontier leader cleared 3% of real knowledge work, a single web page turned a browsing agent into code execution on the host, and the now-public SpaceX spent $60 billion of fresh stock to buy the coding funnel. The state reached for a chokepoint that no longer exists.

BEATS 07
DISPATCHES 14
CHAIN MYTHOS × 03
PUBLISHED 2026-06-20
I.

THE MODEL THEY RECALLED

Three days after it shipped, the government ordered the most powerful public model shut off.

125FIELD REPORTMYTHOS · CHAIN

Washington Recalls Fable 5. The Charge Was A Jailbreak.

Three days after the most powerful public model shipped, the government took it back.

Anthropic notice that Claude Fable 5 and Mythos 5 access has been suspended
IMAGETechCrunch

On Friday June 12 at 5:21 p.m. ET, the Commerce Department ordered Anthropic to cut off all foreign access to Claude Fable 5 and Claude Mythos 5, framing the move as an export-control action over a claimed jailbreak. Anthropic complied by shutting both models down entirely, rather than selectively blocking foreign users, and spent the rest of the week working to restore access. Fable 5, the most capable model ever released to the public, had shipped three days earlier.

The government's stated concern was a narrow, non-universal jailbreak: prompting Fable 5 to read a codebase and identify its software flaws, a capability Anthropic notes already exists in OpenAI's GPT-5.5. Reporting tied the order to conversations between Amazon CEO Andy Jassy and US officials, after Amazon researchers coaxed the model into cyberattack-relevant output. Mythos 5, the ungated parent, had been confined since April to roughly 50 vetted organizations under Project Glaswing.

For twelve issues this dispatch tracked Mythos as the model too dangerous to ship. It shipped, behind a price and a reroute, and the state recalled it inside a week, on verbal evidence, the same month Anthropic asked to be regulated. The lab set the real stakes itself: "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people," it wrote, warning the standard would "essentially halt all new model deployments for all frontier model providers."

AnthropicTechCrunchCyberScoopNextgov/FCW
II.

THE VOID FILLED IN DAYS

The open-weight pack absorbed the displaced demand before the recall could bite.

126FIELD REPORT

A Downloadable Model Just Matched The Frontier

GLM-5.2 ships open under MIT and ties GPT-5.5 on real economic work, the same week Washington tried to wall the frontier in.

GLM-5.2 open-weights launch from Z.ai, MoE architecture graphic
IMAGEVentureBeat

On June 16th Z.ai released GLM-5.2, a 753B-parameter mixture-of-experts model with roughly 40B active per token, a 1M-token context window, and an MIT license that permits commercial use and self-hosting. Within a day Artificial Analysis named it the leading open-weights model on its Intelligence Index v4.1 at a score of 51. On GDPval-AA, a benchmark of paid economic tasks, it posted an Elo of 1524, sitting level with proprietary GPT-5.5 at 1514. The weights are on disk; the gap is gone.

The thing that makes this load-bearing is not the leaderboard slot, it is the price of the slot. GLM-5.2 runs at $1.40 per million input tokens and $4.40 per million output tokens, a fraction of frontier closed rates, and the same weights can be pulled down and served on your own hardware (currently mirrored across three inference providers). GDPval-AA is the detail to sit with: it scores models on actual paid knowledge work rather than puzzle benchmarks, which is precisely the territory export controls were drawn to protect. A model you can download just drew level on the eval that matters to the economy.

For the builder, the calculus inverts. A frontier-parity model under MIT means the question stops being which API you rent and becomes which weights you host, with the privacy, latency, and per-token economics that follow. For the field, the timing is the story: a recall fence works only if the thing behind it cannot be copied, and a 753B file that matches GPT-5.5 on real economic work is a thing that has already been copied. The chokepoint failed because there was nothing left to choke.

Artificial AnalysisVentureBeatLLM Stats
127FIELD REPORT

Moonshot Ships Kimi K2.7-Code Open. The Fight Moved to Cost Per Finished Task.

A one-trillion-parameter coding model under Modified MIT, cheaper per completed agentic task than the closed APIs it backstops.

Moonshot AI Kimi K2.7-Code release graphic showing the open-weight coding model launch
IMAGEMarkTechPost

On June 12, Moonshot AI released **Kimi K2.7-Code**, an open-weight coding model under a Modified MIT license, with weights on Hugging Face and a hosted API priced at $0.95 per million input tokens. It is a 1 trillion parameter mixture-of-experts model with 32 billion active per token and a 256K context window, tuned for one job. Moonshot's framing is blunt: K2.7-Code targets long-horizon software engineering rather than general chat. It plans, edits, runs tools, and debugs across many steps.

The load-bearing number is not the headline accuracy gain. Moonshot reports a +21.8 percent jump on Kimi Code Bench v2 over K2.6, rising from 50.9 to 62.0, but it pairs that with roughly 30 percent fewer reasoning tokens per task. On a thinking-first model that does not support a non-thinking mode, reasoning tokens are the meter, and fewer tokens per solved task is a cut to the real unit cost of agentic work. The 256K context holds a multi-file repo and a long tool-call trace in one window, and the MoE design (a sparse model that routes each token to a few of 384 experts so only 32B of the trillion fire at once) is what keeps the per-token price near a dollar.

The axis that matters here is cost per completed task, not benchmark rank, and that is precisely the axis a builder cut off from a recalled or deprecated closed model cares about. An open-weight pack you can self-host on vLLM, SGLang, or KTransformers is a fallback that cannot be revoked, priced low enough to run a coding agent in a loop without watching the meter. The frontier still wins the hard evals; the open tier is no longer trying to. It is competing on the price of a finished pull request, and that is the number that decides what a small team can actually afford to automate.

MarkTechPostLLM-StatsNerova
III.

THE NUMBERS STILL DON'T CLEAR

The week the frontier got recalled, the clean benchmarks refused to call it capable.

128FIELD REPORT

The Recalled Model Finishes Three Jobs In A Hundred.

Artificial Analysis built a benchmark out of real multi-week work. The frontier leader clears 3%.

Artificial Analysis AA-Briefcase long-horizon knowledge-work benchmark results
IMAGEArtificial Analysis

On June 18th Artificial Analysis released **AA-Briefcase**, a long-horizon benchmark that stops pretending knowledge work is a single prompt. It is 91 tasks drawn from four multi-week business projects (data science, product management, banking operations, and heavy industry strategy), each fed by nearly 2,000 fragmented source files. The headline result: Claude Fable 5, the most capable public model on the planet, satisfies every criterion on just 3% of the tasks.

The number is brutal because the test is honest. Most evals reward a clean answer to a clean question; AA-Briefcase scores whether a model can hold a coherent project across weeks, reconcile thousands of half-relevant inputs, and ship the actual artifact a banker or a PM would sign off on. The lab describes it as moving "beyond single, disconnected prompts by evaluating models across a coherent long-horizon project." On 31 of the 91 tasks no model scores above 50% at all. Long-horizon coherence, the thing that separates a tool from a coworker, is where the frontier collapses.

Read this against the chokepoint week. The same Fable 5 that topped the GDPval-AA leaderboard at 1,783 Elo, the model whose capability triggered a government recall, cannot finish three real jobs in a hundred. The deployment gap is not a safety story; it is a competence story wearing a safety costume. For the builder, the lesson is to stop benchmarking against the chatbot and start benchmarking against the briefcase. Capability that wins single-turn arenas still drowns the moment the work has a second week.

Artificial Analysis — AA-BriefcaseArtificial Analysis — GDPval-AA leaderboard
129FIELD REPORT

The Index Stops Asking What A Model Knows. It Asks If The Job Gets Done.

Artificial Analysis rebuilt its Intelligence Index around paid work. The top of the field now sits at 60.

Artificial Analysis Intelligence Index v4.1 component weights and leaderboard
IMAGEArtificial Analysis

On June 16th Artificial Analysis shipped **Intelligence Index v4.1**, and the change is not a tuning pass; it is a redefinition of what the number measures. The single highest-weighted component is now **GDPval-AA** at 20%, an agentic test of real economically valuable work drawn from 220 tasks across 44 occupations and nine GDP-contributing industries. The lab upgraded Terminal-Bench to 2.1 and swapped tau-squared-Bench Telecom for tau-cubed-Bench Banking, dropped IFBench outright, and pinned the leaders to a recalibrated band where Claude Fable 5 lands at 60, Claude Opus 4.8 at 56, and GPT-5.5 at 55.

The mechanism is in why IFBench got cut. Artificial Analysis pulled it because, in their words, "the benchmark no longer distinguishes frontier models sufficiently, so we have removed it from the Intelligence Index." That is saturation: a test every frontier model aces stops being a measurement and becomes noise. The replacement logic runs the other way. GDPval-AA grades models in an agentic loop with shell access and web browsing, scoring deliverables (documents, slides, spreadsheets) head-to-head into Elo, and v4.1 now reports cost, time, and **tokens per task** alongside the score. The index stopped asking whether a model can answer and started asking what the answer costs to produce.

For the builder, this is the eval layer admitting what the chokepoint week already proved: knowledge tests are spent currency. When the field's most-cited index reweights itself around finished work and prices every task in dollars, time, and tokens, the question shifts from "which model is smartest" to "which model closes the job cheapest." A 60 on a scale anchored to paid labor is a far more honest ceiling than a 95 on a quiz the whole frontier has memorized. Pick your model on the work, not the trivia, because the people writing the benchmarks just did.

Artificial Analysis, Intelligence Index v4.1Artificial Analysis, GDPval-AA
IV.

THE LANDLORD AND THE LENDER

The newly public lab bought the funnel; the chip vendor borrowed to build the floor.

130FIELD REPORT

The Launch Company Buys The Coding Funnel. Sixty Billion, All Stock.

Four days public, SpaceX converts a $1.77 trillion market cap into ownership of where developers meet models.

SpaceX SPCX Nasdaq IPO debut, June 2026
IMAGETechCrunch

On June 16th, SpaceX filed an 8-K formalizing a $60 billion all-stock merger: subsidiary X67 Inc. folds into Anysphere, the company behind Cursor, the dominant AI-coding editor. The filing lands four days after SPCX priced its IPO at $135 a share, raised $86 billion against a $75 billion target, and printed a $1.77 trillion valuation, the largest offering in market history. By Tuesday the stock had run above $200 in pre-market. A newly public lab-and-launch conglomerate spent its first week buying the dominant AI-coding distribution channel outright.

The consideration is paid in SpaceX Class A stock, priced off the volume-weighted average of the seven trading days before close, with closing pegged to Q3 2026 and explicitly conditioned on regulatory approval. That structure is the point. SpaceX merged with xAI in February, so the freshly minted equity is the currency, and Cursor is the distribution layer xAI lacked: the editor where developers actually choose which model writes their code. A $10 billion break fee marks how badly the acquirer wants the funnel, not just the model.

The chokepoint this week was supposed to be capital, the thing that gated who could build frontier infrastructure. SPCX proved a lab can price and hold a trillion-dollar float; this deal proves it can then convert that float into control of the surface where models get adopted. For a builder, the warning is that your default coding environment is becoming a balance-sheet asset of a single launch-and-compute giant, and the next contest is not over who trains the best model but over who owns the editor where you reach for one.

TechCrunchSEC 8-K (StockTitan)SpaceX IPO (Wikipedia)
131FIELD REPORT

Nvidia Borrowed The Buildout. $25 Billion, Out To 2056.

The cash-richest vendor in tech just put AI infrastructure on the bond market's books for thirty years.

Nvidia $25 billion investment-grade bond offering coverage
IMAGETechTimes

On June 15th Nvidia priced its first bond deal since 2021 and by June 18th had closed $25 billion across seven tranches, upsized from a roughly $20 billion target after the book swelled to about $85 billion in orders, more than three times the size of the deal. The last time Nvidia tapped the market was June 2021, when it raised $5 billion. The new structure runs from two-year paper out to a 30-year tranche maturing in 2056, the kind of maturity a company sells only when it believes its demand curve outlives a full generation of hardware.

The mechanism is the irony. Nvidia is the picks-and-shovels vendor of the AI economy, the one company that gets paid no matter which lab wins, and it carries the cash to match. Yet it ran a "quick-build" issuance with no investor roadshow and let the order book do the pricing. Demand was so heavy that the 2056 tranche tightened from initial guidance of roughly 0.9 points over Treasuries to a final spread of 65 basis points over Treasuries (a basis point is one-hundredth of a percentage point). The stated purpose is general corporate uses, refinancing, and establishing a liquid benchmark for Nvidia's investment-grade credit. A benchmark is the tell. You do not build a yield curve for a one-time raise; you build one to borrow again, at scale, for years.

What lands in the room is that the bond market just underwrote the buildout. For most of this cycle the AI capital stack was equity and hyperscaler balance sheets; the open question was whether fixed-income investors would lend against demand that depreciates every eighteen months. The answer arrived oversubscribed, with industry AI capex expected to clear $700 billion in 2026. When the vendor that prints cash chooses leverage anyway, and lenders fight to fund a note that comes due the year a newborn turns thirty, the trade is no longer "is AI real." It is how many decades of debt the buildout can carry, and Nvidia just set the first coupon.

TechTimesYahoo FinanceINDmoney
132FIELD REPORT

Cohere Buys Canadian Rails. Keeps The Floor Off US Clouds.

A frontier lab just put its inference on a national carrier's sovereign compute, not on anyone's hyperscaler.

HIVE Digital BUZZ HPC sovereign AI GPU contract with Bell AI Fabric for Cohere
IMAGEYahoo Finance

Merritt, British Columbia, June 18th. HIVE Digital's BUZZ HPC closed a three-year, USD $220 million GPU-cloud contract with Bell AI Fabric to run Cohere's models on 2,304 NVIDIA Grace Blackwell GPUs. The cluster goes live late 2026 into early 2027. A frontier lab, a national telco, and a clean-energy compute operator, all Canadian, stacked into one funnel.

The GPUs land as GB200 NVL72 rack-scale systems, the same hardware the hyperscalers fight over, but the jurisdiction is the load-bearing detail. Sovereign AI means the weights, the inference, and the customer data sit on compute that a single country controls end to end, outside the reach of another government's export rules or subpoena. Bell owns the distribution rail, HIVE owns the floor, Cohere owns the model. None of it routes through a US cloud. HIVE says the deal pushes its contracted HPC revenue past $100 million ARR.

This is the week's quiet counter-move to Washington's compute chokepoint. While export controls try to meter who gets the chips, a lab can simply build the whole stack somewhere the order does not reach, and a carrier with a captive national customer base hands it the demand. "Convert clean energy into intelligence at scale and make Canada one of the most important sovereign AI jurisdictions on Earth," is the thesis HIVE's CEO put on record, and it reads as a distribution play, not a power-bill press release. For builders, the moat is shifting from who has the best model to who controls the rails it ships on.

HIVE DigitalYahoo FinanceStockTitan
V.

THE RUNTIME WON'T HOLD

OWASP called it permanent last week. This week three teams turned the agent into the attacker.

133FIELD REPORT

One Web Page. One Browsing Agent. Code Runs On Your Machine.

OWASP called the flaw permanent last week. This week it was a working host compromise.

Diagram of the AutoJack exploit chain from a web page through a browsing agent to host code execution
IMAGEThe Hacker News

On June 19 Microsoft researchers disclosed AutoJack, an attack in which a single web page, rendered by a browsing agent built on AutoGen Studio, executes arbitrary code on the developer's machine. A demo "Web Content Summarizer" agent, pointed at an attacker URL, spawned calc.exe on the desktop under the agent's own process account. The page never touched the user. The agent did the work.

The chain stacks three ordinary mistakes. The local MCP WebSocket trusted localhost, so the browsing agent on the box inherited that trust; the authentication middleware skipped MCP paths, assuming the handler would check tokens, which it never did; and the endpoint took commands straight from request parameters with no allowlist. Two pre-release builds carrying the flaw, 0.4.3.dev1 and 0.4.3.dev2, remain on PyPI unyanked; the stable 0.4.2.2 had no MCP route and was never exposed.

A week after OWASP called prompt injection a permanent architectural flaw, AutoJack turns the thesis into a host you no longer own. The lesson is not a patch level. An agent that can browse is a process on your machine that reads untrusted input and then acts, and localhost stops being a trust boundary the moment something on localhost takes orders from the open web.

The Hacker NewsMicrosoft SecurityTechRadar Pro
134FIELD REPORT

Two Teams Broke OpenClaw The Same Week. One Hole Has No Patch.

The agent trusts what reaches it, and its access becomes the attacker's.

Illustration of an AI agent ingesting a malicious contact card and leaking credentials
IMAGEThe Hacker News

Two security teams broke **OpenClaw**, a popular self-hosted AI agent, in the same window. Imperva hid instructions inside shared contacts, vCards, and location-pin labels; OpenClaw flattened each object into its prompt with no marker for untrusted data, then downloaded and ran a script from a server the researchers controlled. Varonis ran a separate attack and reached the same room through a different door.

Imperva's payload rode a vCard's full-name field, which WhatsApp supports natively, and tested clean against Gemini 3.1 Pro; OpenClaw patched it in version 2026.4.23 by moving those fields into a separate untrusted-metadata channel. Varonis built a test agent with Gmail access and, under an ordinary phishing email, watched it forward mock AWS keys, database connection strings, and SSH credentials in plaintext, then ship a synthetic dataset of 247 customers. That second failure, Varonis noted, is not something a patch fixes.

One hole was a coding bug; the other is the design. An agent with memory on by default and live credentials in reach treats any text that arrives as instruction, which means a single widely-shared contact card can carry a command to every agent that ingests it. The patchable case got patched. The architectural one is the product working exactly as built.

The Hacker NewsImperva
135FIELD REPORT

1Password Stopped Storing The Secret. It Now Hands It Over One Job At A Time.

The vault became a broker, and the long-lived key this week's exploits fed on lost its reason to exist.

1Password Credential Broker just-in-time credential delivery for humans, machines, and AI agents
IMAGE1Password

On June 15th **1Password** launched Credential Broker into private beta and, the same day, acquired Apono, the just-in-time access company whose CEO calls standing access "the quiet liability inside almost every company." The pairing is the point. A vault that holds 1.5 billion-plus credentials for 180,000 businesses is no longer just a place to keep a secret; it is a gate that decides which human, workload, or agent gets one, for which task, for how long. After two stories this issue about RCE chains and credential sprawl, this is the remediation beat: the same week the chokepoint failed, the company that owns it changed what the chokepoint does.

The mechanism is what makes it load-bearing. Instead of injecting a long-lived token into a pipeline or an environment file, Credential Broker delivers the credential at runtime against Workload Identity Federation standards: the requester proves identity with a signed, platform-issued credential, 1Password checks it against a trust policy, and the workload never gets standing access to the vault. As 1Password puts it, "it only gets the credential it needs, at the moment it needs it." The first shipped integration is GitHub Actions, with job-scoped access windows; the AI-agent layer, which fronts existing hooks into Codex, Cursor, and Perplexity's Comet, issues short-lived tokens that expire when the task ends. The secret never sits at rest in the place an attacker would look for it.

For a builder, this redraws where the blast radius lives. A leaked agent key that auto-revokes on task completion is worth far less than one that lives forever in a `.env`, and an audit trail that says exactly which workload acted, from where, and on whose behalf turns an incident from a guessing game into a query. Credential Broker is still private beta, with general availability targeted for late 2026, and intent-based brokering is only as strong as the trust policy behind it. But it is the first major credential-infrastructure play built for agent identities rather than retrofitted to them, and it points the field at the right target: not better vaults for long-lived secrets, but fewer long-lived secrets to vault.

1Password (Credential Broker)1Password (Apono acquisition)SiliconANGLE
VI.

THE MATTER

Autonomy crossed into a funded production line, and the brains for the bodies shipped open from China.

136FIELD REPORTMATTER

The Loyal Wingman Just Got A Production Line.

The Air Force funded autonomous fighter drones at a third of an F-35, and the autonomy ships separately.

Anduril YFQ-44A Fury and General Atomics YFQ-42A collaborative combat aircraft
IMAGEBreaking Defense

On June 17th the US Air Force moved its Collaborative Combat Aircraft program from prototype to production, awarding Anduril and General Atomics the first contracts to build operational uncrewed fighters. The two aircraft, Anduril's FQ-44A Fury and General Atomics' FQ-42A, drop their prototype "Y" prefix and head for a fleet of over 150 by the end of the decade. The fiscal-year-2027 request behind them runs roughly $1.4 billion in development plus about $1 billion in procurement.

The load-bearing number is the price. Each aircraft targets under $30 million per unit, about one third of an F-35A, against an $82.5 million reference for the latest F-35A lot. Cheap enough to risk, the wingman is designed to be attrited. The other structural break is that the brain is bought apart from the airframe. The Air Force ran a separate autonomy-as-software competition, advancing Anduril, Shield AI, and RTX's Collins Aerospace into six-month performance periods, with the pilot logic required to run on an open government reference architecture so it can be swapped per warfighter feedback. The airframe is the razor; the autonomy is the blade.

For the builder, this is the moment combat autonomy became a line item with a vendor and a unit cost. The fight that matters now is not who welds the fastest drone but whose Hivemind or Lattice flies it, because a performance-licensed autonomy stack riding an open architecture is the highest-margin, longest-lived asset in the program. Hardware gets attrited; the model gets versioned. Whoever owns the wingman's policy owns the recurring revenue, and the next decade of air power just got priced like software.

Breaking Defense — CCA production selectionAnduril — production contract winDefense One — first drone wingmen
137FIELD REPORTMATTER

Alibaba Open-Sources the Brains for the Robots It Already Builds

China took the top of an independent real-robot leaderboard, and the winning score still cannot do a day of work.

Alibaba Qwen-Robot embodied AI foundation model suite
IMAGETechNode

On June 16th Alibaba's Tongyi Lab released Qwen-Robot, three open foundation models for embodied AI: RobotNav for navigation, RobotManip for manipulation, and RobotWorld for world modeling. RobotManip was trained on more than 38,100 hours of open robot interaction and human demonstration video, and the lab reports it took first place on the RoboChallenge Table30 v1 Generalist Track at 45 percent task success. The suite has already entered pilot testing with selected Alibaba Cloud enterprise clients. No US-built model has held the top of an independent real-robot benchmark since January, when China's Spirit AI last claimed it.

RoboChallenge is run on physical hardware by Dexmal and Hugging Face, which is what makes a leaderboard number load-bearing and what bounds it. The board uses a remote-robot paradigm: the submitting team executes the evaluation on RoboChallenge rigs, so the protocol is independent but the run is team-administered. The number itself, as Qwen reports it, is the honest part. Forty-five percent across thirty tasks on four robot platforms is a leaderboard win and, the lab says, a twenty-point margin over the runner-up, and it is also a coin flip, and worse, as a hit rate on chores a five-year-old clears. The training thesis is that diversity beats purity. Both Spirit and now Qwen win by feeding the policy open-ended goal-driven data instead of scripted clean demonstrations.

The stack is closing. China shipped roughly ninety percent of the world's humanoid units in 2025, and the brains layer that drives them is now Chinese, atop an independent board, and free to download. For a builder, the takeaway is not the flag. It is the gap between the leaderboard and the loading dock: the best openly published embodied model in the world fails the majority of real tasks, which means the chokepoint in robotics was never the policy weights and never the hardware. It is the long, unglamorous middle where 45 percent has to become 95.

TechNodeSCMPQwen blogPRNewswire (Spirit AI)TMTPOST
VII.

FROM THE FLOOR

While the frontier got recalled, the research cluster published the map of where capability actually lives.

138FIELD REPORT

From The Floor: We Read 1,800 Papers To Map How An Agent Actually Works.

The week the frontier got recalled, the research cluster published the map of where capability actually lives.

Editorial schematic cover for The Agentic Stack: the agent loop and a numbered ledger of twelve layers
IMAGEImmersive Commons

FRONTIER TOWER, June 20. The Immersive Commons research cluster published The Agentic Stack, a builder's map drawn from roughly 1,800 recent papers on how to build agents. It is the first public intelligence from the cluster, and it answers one question end to end: when an agent works, what is actually doing the work.

The thesis is blunt. An agent is a loop, not a model with a clever prompt, and almost all of its real capability comes from the engineering around that loop rather than the weights. The map lays the loop out as twelve layers wrapped by four planes, runtime, observability, evaluation, and security, and for each part it names the state of the art, how to build it, and the trap to avoid. Every recommendation links to the work behind it (AI Harness Engineering, WildClawBench), and the cluster regenerates the whole map monthly from a growing corpus, versioned and archived.

Read against this week, the map is the argument. The government recalled the most capable public model, a clean benchmark put it at 3% on real work, and a single web page turned a browsing agent into a shell on the host. None of that touches the lesson the floor pulled from 1,800 papers: the model was never the product. The harness is, and the harness is something you can read, audit, and build.

Immersive CommonsAI Harness EngineeringWildClawBench