Mary Rose Cook

Using encapsulated development to code on my phone

I’m lucky enough to have a wife, two young children and a job as an AI Engineer at Notion. I’m also lucky enough to have a side project.

Some weeks, I’ll get half an hour on my laptop to work on this side project. Most weeks, I won’t. Yet, I’ve averaged two commits to the project per day for the last three months.

How? By building on my phone. But, really, by making it possible to build on my phone.

The broad approach: encapsulated development. Each prompt leaves the repo in a known, stable, verified, stateless condition that permits the next change.

Known

My mental model of the project is functional enough that I can make future changes. This doesn’t mean I understand everything. I may not know the architecture of some parts of the system. I may not know how some implementations work. But I can understand the behavior of those parts of the system as black boxes. For example, I don’t know exactly how the particle system stores particles. But I do know that, when it spawns particles, it doesn’t allocate new objects.

Stable

The quality of the software must be high enough for the next change to be successful. This means well factored, working code.

Sometimes, the project will start to crumble with the accumulation of low quality code. I do a string of refactors to get things back on track, then start building features again.

Verified

Every change must be correct. I don’t have time to manually test changes. The kettle has just finished boiling.

Incorrect changes build into a wobbly tower. Each prompt must include how it will be verified as correct. This usually means end-to-end tests.

For a web app, this might mean having Chrome DevTools actually click through the UI to check it works. Or spinning up the API and checking that inputs produce the expected outputs. Or going full StrongDM and implementing the external services the product interacts with.

In my case, I’m working on a tool for making video games. So I built a headless version of my engine. Test code can initialize it with a set of game objects, supply synthetic user inputs, then view screenshots to verify the behavior.

Stateless

Every change should go uninterrupted from prompt to production. At the end of my session - probably 30 seconds or a minute - the project is in a new, known, stable state.

It’s somewhat acceptable to set aside a prompt partway through writing it. To resume, all I need to do is read the text. Quite bad is having to set aside a PR. When I return I have to figure out what still needs to be built, what still needs to be fixed. On a phone. This is difficult because a phone has a low input/output bandwidth. It’s effortful to draw together the state of an artifact. And effortful to edit the artifact at multiple points.

Worst of all is letting the project go off the rails. About two months ago, my project was in a pickle. I’d charged forwards at the beginning, and, now, I had a web app written in plain JS, HTML and raw Node. I knew I wanted to refactor it to TypeScript and Next. And I knew it would be too high a mental load to do that refactor across dozens of commits on my phone. I needed my laptop to be able to research the code, write the plan and have enough visibility to steward the refactor. So, one night, I stayed up way too late and got the project back on the rails.

The worst part about these stateful, interrupted sessions is that, the more chaotic the state the project is in, the fewer opportunities I have to make the next bit of progress. If I’m in a known, stable state, I only need to dictate a prompt. And I can do that while I wait for the MUNI. But, if I’m halfway through a PR, I’ll need to do some reading and scrolling and I might have got nothing done when the K arrives and wrecks my train of thought.

Oral literacy

Many of these measures are useful for projects not built on a phone. A well-understood, stable architecture. Robust verification of correctness. Stateless changes. But, on the phone, something is lost. It’s good to plan. It’s good to design.

It’s like an oral culture. If I want to put together a more complex feature, I have to keep it aloft in my brain. Maybe jot down a few notes in Bear. But, no sketchbook, no Figma, no design doc. No divergence, no exploration. This hampers the quality of what I build.

Still, it feels silly to focus too much on this loss. Without LLMs and phones, the project wouldn’t have existed in the first place.

Specifics

I focused this essay on the traits that won’t change. But, in case it’s useful, here’s my setup as of May 25, 2026.

I issue a prompt in the Codex iOS app to GPT 5.5 extra high. The Codex harness makes a commit and the Codex cloud env runs the tests. I push a PR from my phone and a GitHub Action automatically merges it. Another GitHub Action detects the merge and deploys the code to my Digital Ocean VPS and restarts the server. Done.

 

Code generation that just works

About nine months ago, my son said he wanted to make a video game. He said it was called Exploding Kitties. We made it together on my computer. He described the gameplay and drew the graphics. I vibed the code. The game basically worked. But we had to build it in small pieces. And, periodically, I had to spend time in the guts of the code, getting it back on the rails. Unifying two ways of doing the same thing. Fixing gnarlier bugs. Disentangling the game code from the engine.

Today, I know that we’d be able to one-shot Exploding Kitties. The first reason: models and agent harnesses produce much higher intelligence. But, the second reason, the one I want to talk about, is the supporting techniques and environment.

I’ve built a new game-making tool called Fountain. You can one-shot any simple mobile arcade game. Or, you can iterate your way to a more complex game and the code stays on the rails over many turns. Here’s why it works -

A framework that supplies decisions and built-ins

Every game is given a game framework upfront.

This framework encodes many decisions. That game entities have a certain data shape. That behavior abstraction is done with prototypal inheritance. That the coordinates of an entity represent its top left.

This keeps the code generation aligned.
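
A minimal sketch of what decisions like these might look like in TypeScript. This is not Fountain’s actual code: the Entity shape, baseEntity, extend and center are all names assumed for illustration.

```typescript
// Hypothetical entity shape. By convention, x/y are the TOP-LEFT corner.
interface Entity {
  x: number;
  y: number;
  width: number;
  height: number;
  update(dt: number): void;
}

// Base prototype that every entity ultimately delegates to.
const baseEntity: Entity = {
  x: 0,
  y: 0,
  width: 10,
  height: 10,
  update(_dt: number): void {
    // Default behavior: do nothing.
  },
};

// Prototypal inheritance: create an object whose prototype is `proto`,
// then layer the overrides on top as own properties.
function extend<P extends object, O extends object>(proto: P, overrides: O): P & O {
  return Object.assign(Object.create(proto), overrides);
}

// A behavior abstraction: anything extending `mover` drifts right.
const mover = extend(baseEntity, {
  vx: 20,
  update(this: Entity & { vx: number }, dt: number): void {
    this.x += this.vx * dt;
  },
});

// Because x/y mean top-left, the center has to be derived.
function center(e: Entity): { x: number; y: number } {
  return { x: e.x + e.width / 2, y: e.y + e.height / 2 };
}

// Usage: a boat inherits size from baseEntity and movement from mover.
const boat = extend(mover, { x: 100, y: 50 });
boat.update(0.5); // boat.x is now 110
```

With the conventions fixed upfront, generated code only has to supply overrides like the last two lines.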

And this framework includes generally useful built-ins. An update/event/draw loop. A WebGL canvas render surface. A collision detection and resolution system. A particle system. A system to detect input.

This reduces the amount of code that must be generated.

A manipulable artifact

Prompting can be tiresome. Language is ambiguous. The model can interpret a prompt in a way the game designer didn’t intend, and make the wrong change. Language is clumsy. It’s hard to precisely indicate any point in a continuum. A color. A point. An amount.

But there’s a solution. Do it the old way. Give the game developer a user interface through which to express their intent. A color picker to choose the color of the water. A slider to select the density of the spray ejected by a speedboat. A drag and drop interaction to move game entities into the right place.

Some of these are framework built-ins. Every game needs to position entities. But many are generated dynamically by the model, based on the exact needs of a game.

A constrained domain

An arcade game with vector graphics on a browser running on a mobile touch device in portrait at a 16x9 aspect ratio. That’s my target domain.

Even within these tight constraints, there are infinite possibilities for expression. But, now, the chances of success are much higher. The framework can make decisions upfront so I can reduce model drift whilst imposing minimal limits on the possibilities. And the framework can supply built-ins that are highly likely to be useful.

Authentic verification

When generating code, it’s becoming common practice to build a means of verification into each prompt. Telling the model to write tests. Telling the model to drive the browser to try a new feature. Naturally, for code generation that just works, we also require verification of each change.

The framework has a little test harness that takes a stream of inputs and some frame numbers indicating when screenshots should be captured. The game is then run headlessly - several seconds of gameplay in just a few tens of milliseconds - and the screens are captured. Then the model can do what it does best: make an interpretation of whether some fuzzy criteria have been achieved.

The model can, for example, build a speedboat that sprays water behind it, run the game, send input to make the speedboat go, grab a screenshot and decide, Does that look like a boat spraying water?
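
A rough sketch of how a harness like this could be shaped, in TypeScript. Everything here is an assumption made for illustration, not the framework’s real API: SyntheticInput, HeadlessGame, runHeadless and the toy boat game are hypothetical names, and a string stands in for a screenshot.

```typescript
// Hypothetical input record: which frame it arrives on, and where the tap lands.
interface SyntheticInput {
  frame: number;
  type: "tap";
  x: number;
  y: number;
}

// Hypothetical headless game. In a real engine, render() would return an image.
interface HeadlessGame<Snapshot> {
  handleInput(input: SyntheticInput): void;
  update(dtSeconds: number): void;
  render(): Snapshot;
}

// Step the game with a fixed timestep, never waiting on a real clock,
// and capture a snapshot on each requested frame.
function runHeadless<S>(
  game: HeadlessGame<S>,
  inputs: SyntheticInput[],
  captureFrames: number[],
  totalFrames: number,
  fps: number = 60,
): Array<{ frame: number; snapshot: S }> {
  const captures: Array<{ frame: number; snapshot: S }> = [];
  const dt = 1 / fps;
  for (let frame = 0; frame < totalFrames; frame++) {
    for (const input of inputs) {
      if (input.frame === frame) game.handleInput(input);
    }
    game.update(dt);
    if (captureFrames.indexOf(frame) !== -1) {
      captures.push({ frame, snapshot: game.render() });
    }
  }
  return captures;
}

// Toy game for illustration: a boat that starts moving right once tapped.
function makeBoatGame(): HeadlessGame<string> {
  let x = 0;
  let moving = false;
  return {
    handleInput(): void {
      moving = true;
    },
    update(dt: number): void {
      if (moving) x += 60 * dt; // 60 units per second
    },
    render(): string {
      return `boat at x=${Math.round(x)}`;
    },
  };
}

// Two seconds of gameplay, a tap on frame 10, captures on frames 0 and 69.
const captures = runHeadless(
  makeBoatGame(),
  [{ frame: 10, type: "tap", x: 0, y: 0 }],
  [0, 69],
  120,
);
// captures[0].snapshot is "boat at x=0"; captures[1].snapshot is "boat at x=60"
```

Because the loop uses a fixed timestep instead of a real clock, the run is deterministic and finishes as fast as the updates can execute.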

So, better models and harnesses. Yes. But, also, an environment and some techniques that increase alignment to maximize the chance of success. Because of this, I can prompt a game at Castro station, and play it by Montgomery.

 

Should I multi-task?

LLMs take time to generate code. I’ve set things up so I can switch to another task while I wait. But, surprisingly, I’ve found this is usually the wrong idea.

First, if I switch, the context I had on the first task drains away. When I return, I’ll need to load that context back.

Second, if the first task has a high cognitive load, I won’t be able to think coherently about anything else.

Third, if the first task is my main task, I’m mostly doing other things besides generation. Drawing diagrams, thinking, reading code, composing prompts. So switching would parallelize just a fraction of my time.

However, there are some cases where parallel generation is worthwhile.

First, a generation that will take a long time. For example, implementing a spec.md I’ve created. Or a task where I have an end-to-end process where the agent can self-verify to a correct solution. Or my colleague, Simon, pasting in a to-do list of items and then going to lunch.

Second, a generation for the same task I’m already working on. For example, sending an agent off to research a question about the code base.

Third, fire and forget ideas with a low cost of failure. For example, giving an agent a link to a bug report it might be able to fix autonomously. Or sending the agent off to try implementing an idea I had for a new tool.

Parallelizing these things works well, and is manageable. But the hectic mode of keeping several plates spinning isn’t worth it.

 

Pressure to change

At Notion, we’ve been doing a quality sprint to increase our test coverage. Friday was the last day, and I wanted to get some more tests written. Time was short, which forced me to break my usual workflow.

Not that I really have a usual workflow, these days. New AI-augmented programming tools and techniques come out every day. Everything is changing so fast that you can frequently become 10% more productive, forever, with a few minutes or a few hours invested. So, I push myself to try new things.

But, inevitably, workflows are sticky. It’s hard to change a habit. Extra cognitive load to monitor and refine the technique. Extra willpower to overcome the inertia of the familiar. Extra gumption to risk wasted time on something that isn’t helpful. So I have my mega list of stuff to try and every few days I’ll pluck something off it to try.

But Friday was eye-opening. With time short, I wanted to get as much done as possible. And failure would mean only a few hours lost.

We already had a bunch of things in place to speed things up.

A Claude Code skill that my colleague, Jimmy, wrote. It laid out a careful, thorough process for writing tests. It included looking at our testing guide, tips on what to mock and an entreaty to look at surrounding test coverage.

I pointed Codex at the Notion doc listing functions that needed coverage. I told it to find functions that are core parts of the system, or that have complex logic. This way, we could prioritize our time towards testing code that was important or gnarly.

And, on Friday, here’s the new stuff I tried -

I’m rushing. Jimmy’s skill is written for Claude, but I use Codex. What if I just point Codex at the skill directory in Claude’s config?

I’m rushing, so I have to get out of the loop. I need a process that can autonomously go from function name to PR. So I wrote a prompt with these steps: read the guidance on writing tests, write tests, create a branch, commit, review the code, refine the code, put up a PR.

I’m rushing, so it’s going to be harder to review every line. So I unleashed a “Final Review Before Pushing Straight to Production” prompt. This presents very high stakes to the model. And it lists a bunch of things the model regularly gets wrong. It lists every (human) comment on every PR I’ve landed (auto-fetched). It lists all the redirect prompts I’ve given the model (also auto-fetched).

I’m rushing, and I only have four git work trees. I can’t do one test-suite per tree. It’ll take too long. What if I give Codex four functions and tell it to go from function -> PR for each one?

I’m rushing, and now my PRs are getting reviewed by my human colleagues. What if I paste the PR URL into Codex and tell it to do fixes for the comments, then push a new commit?

Surprisingly, these almost all worked. Only the four-PRs-in-one failed. Three of the PRs had lint errors and the setup made it harder to iterate on them.

Dozens of new tests and four new techniques to carry into the future. Or, rather, to carry until they’re superseded next week.

 

The cinch

When generating code with an LLM, sometimes a task is so laborious to specify that you may as well do it manually. But, sometimes, you can find just the right information to cinch together to enable the model to do the work.

Here’s an example. At Notion, I had built some UI for a new feature. Ken, my designer colleague, reviewed the working software and updated his Figma mocks with some refinements he wanted. I needed to implement those refinements.

The Figma mocks provided all the necessary information about how the UI should look and work. And the existing code represented the current state. But I couldn’t just point the LLM at the mocks and tell it to implement the differences. The comparison between code and mocks was too noisy. The mocks included things we were planning for the farther future, things that were out of date, things that another engineer was implementing. But, it wasn’t worth the effort of directing the LLM to do each change, one by one.

Which brings me to the cinch: I realized I could combine the mocks and the current UI code with just a little bit of extra context: a terse bullet point list of the revisions. The mocks provided the full context of each change, but the bullets directed the model’s attention to the relevant information. This cinch took me maybe fifteen minutes to compile, but saved hours of writing code.

Seeing how to draw together the crucial information to let an LLM understand what to do. The cinch.

 

Making the unknown known

Cosmos, the book by Carl Sagan, does something remarkable. It starts in a distant part of the universe. It does a slow zoom, through desolate space, through groups of galaxies, through the Milky Way, through a remote arm of the Milky Way, through the solar system, past the most distant planets, finally into Earth. It shows us as a tiny mote of dust in an obscure part of the universe.

Then, it moves to one of the early civilizations, in Alexandria. To Alexander’s ideals of learning, his great library. It shows how, at that time, Earth was vast, unknown, many parts a mental blank. And it traces the change from that blankness to continents being connected within a human life span. Civilizations becoming known to one another. Until, finally, there are no unknown parts of Earth. No unknown continents or peoples. The rest of Earth was once other, but now it’s us.

The book returns to the question of space. Vast, unknown. Just like Earth once was.

 

Making a game with my son

One morning, my son woke up and came downstairs, deep in thought. He looked up at me and said, “Can we make a game, Mummy?” He’s seven and he’s called Jacob.

He told me his game was called Exploding Kitties. He described the mechanics. Bad guys patrol up and down. If they see the player - a kitty - they laser them with their eyes and the kitty explodes. If the kitty can sneak behind a bad guy, it can scratch and kill him.

I had a little game making kit already. A mobile web app. An update and render loop, game objects.

I showed Jacob how to add game objects to the level. He added some red squares for the baddies and a blue square for the kitty.

I can’t tell you how magic it was to see him use something I made.

I said, shall we make the kitty move? He said yes. I prompted Cursor, “Make it so when the player taps the screen the blue square gradually moves to where they tapped.” Cursor generated the code and applied it. The mobile app, served on localhost and made available over WiFi, refreshed on my phone. Jacob tried tapping the screen and the kitty moved to where he’d tapped.

In the past when we’d made games together, the programming had been too slow for Jacob to stay engaged. Now, with code gen, the feedback loop was fast enough to keep his attention.

I told Cursor to add a prompt input box to the game itself. I wired up a little backend route that could receive the prompt and pipe it through for Claude Code to implement.

The UI for modifying the game was now built into the game itself. Jacob and I could both work on the phone. A shared headspace through a shared device.

Jacob said he wanted to draw proper pictures for the kitty and the bad guys. I typed into the phone, “Create a pixel editor on the game object properties screen. Store the pixel art on the game objects.” Two minutes later, Jacob was poring over the phone, drawing the kitty in the pixel editor, enthralled. It reminded me of when my Dad and I would make icons in ResEdit on the Macintosh.

After Jacob had added the kitty I said I had an idea. I typed in, “Get rid of this prompt box and replace it with a button that records audio. Use OpenAI to transcribe the audio. Send the transcription to the Claude Code backend route as before.”

The record button appeared. I asked Jacob what he wanted to change about the game next. He said, “I want to do bigger drawings.” I said, “Go ahead and tell it yourself”. He tapped the record button and said, “Make the drawings bigger,” and I added, “Like 10 by 10, right?” And he said, “Yeah,” and tapped the button to stop the recording. Half a minute later, he had a bigger canvas, and started drawing the bad guys.

Jacob can type, but slowly. Now he could speak instead of typing and build the game himself.

A fast feedback loop. Software with the edit controls built right in. A shared device. An accessible medium of expression. My son and I, in the same headspace, making something together. Magic.

 

I can teach you to program with AI

tl;dr: I’m offering coaching sessions where I teach professional engineers a smooth, stay-in-flow technique for AI-augmented programming.

All the nitty gritty tips and setup were very helpful.

— Andrew J.

Email me to sign up!

Let the computer make you more productive

My first job after university was working at a software company on their huge Java desktop application. The architecture, complex and winding, made the code very hard to follow. Layer upon layer of indirection meant that trying to follow the flow of execution led to cognitive overload.

Fortunately, the company, though almost unflaggingly tight-fisted (bring your own cake on your birthday), bought IntelliJ for every developer. It had a feature, go to definition, where you could click on a method call and jump to the implementation. This made it possible to understand the byzantine code.

Which brings me to AI.

As programmers, we feel comfortable using tools to make us more productive. Generating code with AI is a natural next step in letting the computer help us.

Three times more productive

With code generation, building features goes much faster. I can be declarative (“add a button…”). I can get an implementation of a stock algorithm (“implement A* with this contract…”). I can zoom in when I need to (“wait, don’t duplicate that state…”).

Programming is a craft. Getting better is the slow process of accreting little techniques and intuition. But there are some core techniques you use all the time. For example, moving in small steps to keep the code compiling.

AI-augmented programming is also a craft. And there are also some core techniques. But they’re different. That one I described above isn’t even really a thing for AI-augmented programming. It’s too low level.

One core technique for AI-augmented programming, maybe the core technique - describe a feature, attach relevant code context, skim the code as it’s generated, let the agent fix lints and type errors, try out your new feature.

This isn’t a cobbled together set of manual steps. It’s an absorbing process where you stay in flow. And it’s a single technique that you’ll use all the time.

When building features, I estimate I’m three times more productive than I was a year ago.

I can teach you

I’m offering ninety minute sessions where I teach this feature-building technique. It’s one-on-one coaching for full-time, professional engineers. At the end of the session, you’ll have built a feature using this technique. And you’ll be set up to keep cranking out features forever.

As a bonus, I’ll teach you the literal one weird trick that makes programming with AI even faster. It’s there in plain sight, but hardly anyone does it. Here’s a hint - talking is faster than typing, but reading is faster than hearing.

Book a session

To book a session, email me at mary@maryrosecook.com.

The first person who signs up for a slot will pay the super-duper introductory price of $0. The second person will pay the merely super introductory price of $100. After that, the price is $300.

To help us get the most out of the time, include in your email -

  • Which code editor and terminal you use.
  • A project you’re working on and a feature you’d like to add to it in our session.

Talk soon!

 

Using AI to build a tactical shooter

Enemy AI

My latest side project is a 2D shooter where the enemies plan their attacks. I’m using a technique called Goal Oriented Action Planning. This approach was used in an old game from the 2000s called F.E.A.R. It was a sort of spooky tactical shooter. Think Rainbow Six but with that creepy girl from The Ring hanging about the place. In F.E.A.R., the enemies could flank the player and provide suppressing fire. They could stay in cover and coordinate with each other.
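
For a flavor of the idea, here is a minimal Goal Oriented Action Planning sketch in TypeScript. It is an illustration under assumptions, not the code from this project or from F.E.A.R.: the action names and world-state flags are invented, and a plain breadth-first search stands in for whatever search a real planner would use.

```typescript
// World state as named boolean facts. A missing key counts as false.
type WorldState = Record<string, boolean>;

// Hypothetical action: applicable when preconditions hold, then effects are applied.
interface GoapAction {
  name: string;
  preconditions: WorldState;
  effects: WorldState;
}

function satisfies(state: WorldState, conditions: WorldState): boolean {
  return Object.keys(conditions).every(
    (key) => (state[key] ?? false) === conditions[key],
  );
}

function applyEffects(state: WorldState, effects: WorldState): WorldState {
  return { ...state, ...effects };
}

// Canonical key for a state, used to avoid revisiting it.
function stateKey(state: WorldState): string {
  return Object.keys(state).filter((key) => state[key]).sort().join(",");
}

// Breadth-first search over world states: returns the shortest list of
// action names that reaches the goal, or null if no plan exists.
function plan(start: WorldState, goal: WorldState, actions: GoapAction[]): string[] | null {
  const queue: Array<{ state: WorldState; steps: string[] }> = [{ state: start, steps: [] }];
  const seen: Record<string, boolean> = { [stateKey(start)]: true };
  while (queue.length > 0) {
    const current = queue.shift();
    if (current === undefined) break;
    if (satisfies(current.state, goal)) return current.steps;
    for (const action of actions) {
      if (!satisfies(current.state, action.preconditions)) continue;
      const next = applyEffects(current.state, action.effects);
      const key = stateKey(next);
      if (seen[key]) continue;
      seen[key] = true;
      queue.push({ state: next, steps: current.steps.concat(action.name) });
    }
  }
  return null;
}

// Invented enemy actions for illustration.
const enemyActions: GoapAction[] = [
  { name: "moveToCover", preconditions: {}, effects: { inCover: true } },
  { name: "reload", preconditions: { inCover: true }, effects: { weaponLoaded: true } },
  {
    name: "suppressPlayer",
    preconditions: { inCover: true, weaponLoaded: true },
    effects: { playerSuppressed: true },
  },
];

const attackPlan = plan({}, { playerSuppressed: true }, enemyActions);
// attackPlan is ["moveToCover", "reload", "suppressPlayer"]
```

The enemy is never told the sequence. It is given a goal and a menu of actions, and the ordering falls out of the search.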

More side projects with AI-augmented programming

Why am I making this? It seemed like it would be fun to try a structurally simple 2D game with tricky enemy AI.

In the age of programming with AI, it’s much easier to follow this kind of whimsy. I’m more productive and I can get to the interesting stuff more quickly.

Productivity hack

You know that film with Bradley Cooper*, where he takes a drug that makes him super focused and productive, but he ends up ruining his life? Well, I’ve found something similar.

Livestreaming.

If you want to trade some of your lifespan and peace of mind for some productivity, just record yourself working. It’s quite stressful. You’re worried about making blunders in front of other people. You can’t take breaks. You definitely can’t start scrolling X.

But you will get a lot done.

Game tape

Everyone’s eternally wanking on about Camp 4. I wasn’t there, but I think X might have it bested. It’s awash in scenius. The field or tradecraft of AI-augmented programming is proceeding so unbelievably fast. And the best place to learn about it is in ephemera and asides crammed into tiny boxes dispensed by a misfiring slot machine.

So, here is a contribution to the effluvial stream. A video of me working on the 2D shooter. You can see me plan out the project and generate the code that lays out the level, implements player movement, and implements collision detection. Pretty good for an hour and fifteen minutes.

Though extemporaneous, the video outlines a powerful AI-augmented workflow for writing software -

  1. Plan and iterate on the plan§ with AI, solving many design problems at the spec stage.
  2. Get the AI to implement the first milestone (often as a one-shot).
  3. Check off the milestone and move to the next one.

Some of the techniques I demonstrate -

  • Using voice-to-text to prompt the LLM. Much faster than typing.
  • Staying in flow by using voice and Cursor Agent mode. One UI that lets me plan, refine and generate code. No stitching together tools or copy/pasting.
  • Using the AI as a rubber duck to think through problems.
  • Also using the AI as a thought partner to come up with better solutions.
  • Asking the AI technical questions (e.g. on ECS architecture idioms).
  • Keeping the spec short and dense for easy manipulation and scanning.
  • Avoiding unnecessary abstraction, but also defining a robust architecture to keep the project extensible.
  • Using popular technical approaches (ECS, SAT collisions) to ensure a robust approach and also make it easier for the AI to one-shot correct implementations.
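
As an example of why a popular approach is easy to one-shot, here is a compact separating axis test for convex polygons in TypeScript. It is a generic textbook-style sketch, not the code generated in the video. The names Vec, project, edgeNormals and collides are chosen for illustration, and shapes that only touch are treated as colliding.

```typescript
interface Vec {
  x: number;
  y: number;
}

// Project every vertex onto an axis and return the interval it covers.
function project(polygon: Vec[], axis: Vec): { min: number; max: number } {
  let min = Infinity;
  let max = -Infinity;
  for (const p of polygon) {
    const d = p.x * axis.x + p.y * axis.y;
    if (d < min) min = d;
    if (d > max) max = d;
  }
  return { min, max };
}

// The edge normals of a convex polygon are the candidate separating axes.
function edgeNormals(polygon: Vec[]): Vec[] {
  return polygon.map((p, i) => {
    // The `?? p` fallback is never hit; it only keeps strict index checks happy.
    const q = polygon[(i + 1) % polygon.length] ?? p;
    return { x: -(q.y - p.y), y: q.x - p.x };
  });
}

// Two convex polygons collide unless some axis shows a gap between
// their projections.
function collides(a: Vec[], b: Vec[]): boolean {
  for (const axis of edgeNormals(a).concat(edgeNormals(b))) {
    const pa = project(a, axis);
    const pb = project(b, axis);
    if (pa.max < pb.min || pb.max < pa.min) return false; // gap found
  }
  return true;
}

const wall: Vec[] = [{ x: 0, y: 0 }, { x: 10, y: 0 }, { x: 10, y: 10 }, { x: 0, y: 10 }];
const player: Vec[] = [{ x: 5, y: 5 }, { x: 15, y: 5 }, { x: 15, y: 15 }, { x: 5, y: 15 }];
const farCrate: Vec[] = [{ x: 20, y: 0 }, { x: 30, y: 0 }, { x: 30, y: 10 }, { x: 20, y: 10 }];

const hitsWall = collides(wall, player); // true: the squares overlap
const hitsCrate = collides(wall, farCrate); // false: separated along x
```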


* Limitless is not a very good film. But if you like Bradley and like good films, definitely watch The Place Beyond the Pines. My Dad and I saw it a continent apart - him in England and me living in New York - and we still talk about it.

Rock climbing scenius at Camp 4.

§ Thanks to Geoffrey for teaching me this!

 

Explore, expand, exploit

A few months ago, I started sleeping badly.

I had been excited about AI since ChatGPT came out. I’d loved using Cursor to help me program since Jay had told me about it over the phone as I walked from Eureka Heights back home to Noe Valley.

But, in January, something changed. The proximate cause was a flood of new AI releases. o3-mini, Deep Research, Lightpage. Every week, more intelligence dropping from heaven into my lap.

But the bigger change was that I was getting more productive, faster.

Type in a few sentences, get a hundred lines of code. A feeling of vertigo.

More than that, I could learn a new technique in an hour and become significantly more productive.

This was in stark contrast to the previous twenty years I’d spent learning to program. That was a slow, accretive grind. A new technique for encapsulation. A more refined understanding of what it means to “repeat yourself”. Learning that you could step-debug a production web app.

My friend, Sam, has this model of learning as building a graph. Each node is a piece of information or a skill or a behavior. They’re interconnected. Acquiring a new node of knowledge isn’t too hard. It’s a bit harder to elaborate it. Which is to say, to connect it to the existing nodes in your graph.

But the real fucker is when you have to unmake a part of your graph. You get cognitive dissonance because some of the nodes contradict each other or need to be pried apart or replaced. It’s very painful to disassemble the graph and remake it. Learning to program was a lot of that.

Learning to build software with AI feels completely different. It’s much closer to learning a new discipline. Certainly, the old way of programming is relevant. But all the power comes from the new techniques in this new field that doesn’t even really have a name.

Further, a lot of the new techniques involve a new workflow. Copy code from your editor into GPT, make a request, get code back, paste it into your editor. No, don’t do that any more. Instead, start by selecting code and then pressing ⌘-L. No, wait, stop. Just press ⌘-L, make a request, get code, press the Apply button. No. New move. Press ⌘-I, make a request, scan the code as it’s added to the repo, run and check the behavior. No, wait, this is the killer. Iterate on a PRD first, then tell the LLM to write the code in one shot.

Adopting a new workflow induces cognitive strain because it requires extra supervision. And it requires willpower to not just do things the old safe way. It’s exhausting.

So the destabilization is three things. First, the pure sensation of doing something with a new, startling ease. Second, the knowledge that I can put in ten minutes or an hour or two and get significantly faster at building software. Third, my methods and working life are changing every day.

With gains like that to be had, why wouldn’t I do it all the time? Stay up later. Get up earlier. Watch AI programming game tape with lunch. Spend an hour learning rather than working, then make up the loss in the same day.

There’s also a problem. There is so much information. Every day is a deluge of new tools, techniques and streams. Which ones are worthwhile? I have a long list of stuff I’ve been meaning to try, that seems promising, that worked and I feel I should be doing more. Sometimes I’ll watch a forty-five minute YouTube video of someone vibe coding and it’ll be useless. Other times, I’ll skip forward and hit a nugget so juicy that I’ll become terrified I’m missing these sorts of things all over the place. Sometimes a technique will seem unpromising but, several hours in, will click.

For this, at least, I’ve found something of a solution. When you’re in a fast-changing environment with high uncertainty, what do you do? Well, Civilization, the video game, is a good guide. You explore, expand, exploit.

You spend a good deal of time exploring widely. Things are changing rapidly and there’s a lot to learn. Many explorations will be fruitless and that’s totally fine.

When you find something good, you expand your technique to incorporate it. Did you hear about an obviously useful tool? Install it! Have you been refactoring your code to make it easier to change? Do it!

And when something’s working, you exploit it as much as possible. It’s easy to learn a cool new technique then forget to use it. Don’t. Instead, bring it back.

I keep a list with these three categories. Explore, expand, exploit. And I spend some time on each.

It’s something to cling to.