June 2026 · 8 min read · Agentic development · Draft

When the Agent Changes Under Your Feet

The strange thing about building with agents is not that they sometimes fail. It is that, after weeks of photo gallery, iOS, and Mac frontend work in Codex and Cursor, the same tool can feel brilliant one day and oddly unfamiliar the next.

Collage of personal software and agent-built tools
2 editors · iOS frontend · Mac frontend · Drift: the tax

The weirdest part of agentic development is not when an agent fails. It is when the same product, the same apparent model tier, and the same kind of task feel brilliant on Monday and strangely dull on Wednesday. Over the last few weeks, while pushing hard on my photo gallery project and its iOS and Mac frontends, that unevenness has become the thing I notice most.

Some sessions with Codex and Cursor have felt like working with a sharp senior engineer who has read the repo, understands the product, notices the platform constraint, and keeps the work moving. Other sessions have felt like a different person came back wearing the same badge. Same UI. Same model label. Different judgment.

That is a strange problem to have because the ceiling is now high enough to matter. I have been using these tools on real personal software: the family photo archive, the local face-recognition pipeline, the photo gallery, the iOS frontend, the Mac frontend, Clear Skies, Backyard Sky, the Astro gallery, and the Adventure OS pilot. These are not toy prompts. They have project history, design taste, privacy assumptions, deployment paths, platform conventions, image assets, test suites, and the usual little pile of decisions that make software feel like itself.

On the best days, agentic development compresses that whole loop. On the worst days, the loop is still fast, but it is no longer stable.

The question is no longer "can an agent code?" It is "which agent did I get this time?"

The unevenness is the product now

The visible product is the editor, the chat pane, the terminal, the diff, the task runner. But the product I actually depend on is consistency: does the agent read the surrounding files, respect local style, preserve my changes, understand when to stop, and carry intent across a longer build?

That consistency has been surprisingly uneven. Sometimes Codex will calmly inspect the repo, notice the blog already has a publishing manifest, avoid unrelated dirty files, and create exactly the right patch. Sometimes Cursor will take a SwiftUI layout change and carry it through with taste and useful platform instincts. Then another session will miss a local convention, overfit to the last instruction, drift into a refactor I did not ask for, or forget the product constraint that made the feature worth building in the first place.

The photo gallery work has made this more visible. A web page can hide a lot of sins. iOS and Mac frontends are less forgiving. Navigation structure, image loading, sidebar behavior, inspector state, toolbar placement, window sizing, selection models, and platform-specific polish all expose whether the agent is actually holding the shape of the app in its head.

Good agents feel situated

The best sessions have a particular texture. The agent reads before editing. It notices that a view is part of a larger flow. It understands that a photo gallery is not just a grid of thumbnails, but a memory system with privacy boundaries, search expectations, face recognition, albums, years, originals, and family context.

When that version shows up, the work feels almost unfairly good. I can ask for a Mac sidebar improvement and get something that respects the rest of the window. I can ask for an iOS refinement and get a change that feels native instead of web-shaped. I can ask for a gallery behavior and get code that follows the archive's actual logic instead of inventing a generic media app.

Those sessions are why I keep using these tools. They are also why the weaker sessions feel so jarring. Once you have seen the agent behave like a real collaborator, it is hard to shrug off the days when it becomes a fast autocomplete with a confident tone.

Bad agents still move fast

The dangerous version is not useless. It can still produce code, pass some tests, and sound plausible. That is what makes it expensive. A bad agent that obviously fails is easy to stop. A mediocre agent that moves quickly can leave behind a trail of almost-right decisions: a SwiftUI view that technically works but fights the platform, a state model that solves today's interaction but makes tomorrow's harder, a gallery tweak that looks polished but cuts across the privacy or archive model.

This is where agentic development differs from normal tool frustration. I am not just evaluating output. I am evaluating judgment under uncertainty. Does the agent know when to inspect more? Does it know when the existing code is teaching it something? Does it know when the smallest change is better than the ambitious one?

That is the part that feels uneven. Not raw coding ability, but judgment stability.

The cost question hiding underneath

I do not know what Codex, Cursor, OpenAI, Anthropic, or anyone else is doing internally on a given day. I am not claiming secret knowledge. But as a user paying attention to the shape of the work, I think it is reasonable to ask whether agent quality is being actively managed in ways we cannot see.

There are obvious business pressures. Agents are expensive. They read a lot of context, call tools, run tests, recover from failures, and sometimes spend minutes doing what a normal chat model would finish in seconds. If you are serving that at scale, every routing decision matters. Quantized models, cheaper fallback models, dynamic compute budgets, shorter reasoning paths, smaller context allocations, cached summaries, request shaping, hidden tool limits, and product-tier throttles are all plausible levers.

None of those levers is inherently bad. In fact, a well-designed system should use them. The problem is opacity. If the model badge says one thing but the agent behavior varies wildly, users end up debugging a ghost layer between their prompt and the actual worker. That is especially rough in software development, where repeatability is not a nice-to-have. It is the difference between a tool you can build a workflow around and a tool you have to continually re-audition.

This is worth saying carefully: the concern is not that every cost-control mechanism is suspicious. The concern is that hidden routing and hidden compute controls can change the developer experience even when the label in the UI stays the same.

The hidden beta problem

There is another possibility that feels just as plausible from the outside: the agent is not being nerfed; it is being experimented on.

I would not be shocked if some sessions are effectively hidden betas: new agent policies, new planner loops, new tool-use heuristics, new model snapshots, new context compaction strategies, or new orchestration layers presented under an existing model name. Again, I do not know that this is happening in any specific product. But it would be a very normal way to improve an agent platform quickly.

The trouble is that agentic development is not like A/B testing a button color. If the agent that edits my repo today is materially different from the agent that edited it yesterday, I need to know that. New agent versions can be better in the aggregate and still worse for a particular project, framework, or workflow. They can also improve benchmark performance while making the lived development loop feel more brittle.

The model name is becoming too blunt an instrument. For agentic work, I want build IDs, routing transparency, stable channels, changelogs, and maybe even a way to pin the agent runtime for a project. I do not need every internal detail. I do need enough to know whether I am comparing my prompt, my repo, or the agent itself.
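To make the pinning idea concrete, here is a minimal sketch of what a per-project check could look like if a product ever reported its runtime details. Everything in it is an assumption for illustration: the `AgentRuntime` fields (`model_label`, `build_id`, `channel`), the example values, and the idea that a session would report them at all. No current tool is known to expose this metadata.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AgentRuntime:
    """Hypothetical runtime metadata; not an API any current product is known to expose."""
    model_label: str  # the name shown in the UI
    build_id: str     # assumed identifier for the agent build
    channel: str      # assumed release channel, e.g. "stable" or "beta"


def runtime_drift(pinned: AgentRuntime, reported: AgentRuntime) -> dict[str, tuple[str, str]]:
    """Return {field: (pinned_value, reported_value)} for every field that differs."""
    pinned_fields = asdict(pinned)
    reported_fields = asdict(reported)
    return {
        name: (pinned_fields[name], reported_fields[name])
        for name in pinned_fields
        if pinned_fields[name] != reported_fields[name]
    }


# Made-up values: the label is unchanged, but the assumed build ID has moved.
pinned = AgentRuntime(model_label="example-model", build_id="build-100", channel="stable")
reported = AgentRuntime(model_label="example-model", build_id="build-101", channel="stable")

drift = runtime_drift(pinned, reported)
for name, (want, got) in drift.items():
    print(f"{name}: pinned {want!r}, session reported {got!r}")
```

The point of the sketch is the shape of the answer: an unchanged model label with a changed build ID is exactly the case the UI badge cannot show today.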

What my projects taught me

The family photo archive taught me that the agent has to respect the domain. It is not enough to build "photo search." The system has originals, albums, years, face matching, reviewed labels, private family context, and a strong bias toward not publishing a guess just because a model has a score.

The iOS and Mac frontend work taught me that platform taste matters. An agent can wire up a view, but native software is full of small expectations: where state lives, how navigation feels, how selection behaves, how a window should open, and which controls belong in a toolbar instead of a custom panel.

Clear Skies taught me that agents are excellent when the product constraint is sharp. The app only needed to answer whether the backyard rig should go outside. When the agent stayed oriented around that decision, it helped. When it drifted toward generic weather-app thinking, I had to pull it back.

The Adventure OS pilot taught me that orchestration is its own product. Generating messages, headers, schedules, variants, and feedback loops is not just "ask an agent to do the thing." It is a system of constraints, review points, memory, and measurement.

Agents make software feel fluid. Production still wants something closer to repeatability.

Where I have landed for now

I am still bullish. Maybe more bullish than before, because the unevenness is only painful when the good version is good enough to miss. A bad tool is easy to dismiss. A brilliant tool that sometimes becomes mediocre is much more interesting and much more frustrating.

My current workflow is becoming more conservative. Use agents aggressively, but keep the loops smaller. Ask for plans. Make the agent read the repo. Keep diffs scoped. Preserve logs and prompts when a session is unusually good or bad, because the variance itself is now useful data.
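One lightweight way to preserve that variance is a plain session journal. The sketch below is an assumption about how that might be done, not a description of an existing setup: the `log_session` and `rating_spread` helpers, the JSON Lines file, the 1 to 5 rating scale, and the tool names are all hypothetical, and the sample entries are made up.

```python
import json
import statistics
import tempfile
from collections import defaultdict
from pathlib import Path


def log_session(journal: Path, tool: str, prompt: str, rating: int, note: str = "") -> None:
    """Append one session record as a JSON line. Rating is a personal 1-5 score."""
    record = {"tool": tool, "prompt": prompt, "rating": rating, "note": note}
    with journal.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def rating_spread(journal: Path) -> dict[str, tuple[float, float]]:
    """Return {tool: (mean rating, population standard deviation)} across logged sessions."""
    ratings = defaultdict(list)
    with journal.open(encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            ratings[record["tool"]].append(record["rating"])
    return {
        tool: (statistics.mean(values), statistics.pstdev(values))
        for tool, values in ratings.items()
    }


# Illustrative entries only; these are made-up ratings, not measurements.
journal = Path(tempfile.mkdtemp()) / "agent_sessions.jsonl"
log_session(journal, "tool-a", "Refine the Mac sidebar", 5, "read the repo first")
log_session(journal, "tool-a", "Refine the Mac sidebar", 2, "unrequested refactor")
log_session(journal, "tool-b", "Adjust iOS grid spacing", 4)

spread = rating_spread(journal)
for tool, (mean, stdev) in spread.items():
    print(f"{tool}: mean {mean:.1f}, spread {stdev:.1f}")
```

A high spread on the same kind of prompt is the interesting signal here. It separates "this tool is weak at this task" from "this tool is inconsistent at this task", which is the distinction the model label alone cannot make.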

I also want the products to grow up around this reality. Give me stable agent channels. Tell me when the agent runtime changes. Let me pin a project to a known behavior profile. If cost controls are active, fine. Just do not make me infer them from a sudden change in judgment.

Agentic development is moving from magic trick to infrastructure. Infrastructure has to be observable. It has to be repeatable. And it has to respect the fact that developers are not just buying intelligence; we are buying a loop we can trust.