An AI-coded prototype takes an afternoon. An AI-coded thing real people can rely on for a year takes the same nine decisions any production app has always taken — none of them about the model. The roadmap below is a maturity rubric: a set of lines separating "I built something this weekend" from "this is in production." For each one the question is the same. Have you crossed it, and if not, what is the cheapest move that gets you across?
1. The "what is this?" line — you can name the user, the problem, and the finish in three sentences
Vibe-coding starts with a vague gesture and ends with a 3,000-line codebase the model has been improvising on for six hours. The cheapest fix is also the most boring one: write down three things before any model touches a keyboard.
| Question | A passing answer | A failing answer |
|---|---|---|
| What problem does this solve? | "Freelancers lose two hours a week reconciling invoices across three tools." | "Help people with their finances." |
| Who is the user? | "Solo freelancers billing 5–20 clients a month, mostly designers." | "Anyone who wants to track money." |
| What does done look like? | "Ten freelancers can connect Stripe and a bank, run a month of activity, and produce a PDF." | "When it feels good." |
If the answers don't fit on a notecard, the model will fill the gap with whatever it feels like building, and you will spend the next three days deleting it.
When not to use it: weekend toys and learning projects. The whole point of those is to wander.
2. The spec line — the plan exists outside the chat
Every chat-only spec dies the moment the context window rotates. The bar to clear is a SPEC.md (or whatever you want to call it) checked into the repo before the first feature commit, containing: feature list, non-goals, stack choices, the deploy story. Treat it the way a drafted PR description used to be treated — a real artifact with reviewers, not scrollback.
The spec is also where you negotiate stack questions back-and-forth with a model. Forty-five minutes of "ask me adversarial questions about this stack choice" produces a saner Postgres-vs-Supabase decision than an hour of vibes. The artifact is the markdown, not the conversation.
When not to use it: if the project is small enough that the spec is "do the obvious thing," skip it. The ceremony only pays off when there's a real stack decision under the hood.
3. The timeline line — there are dates, and the dates are short
A timeline isn't a Gantt chart; it is a forcing function for cutting scope. The shape that holds up:
| Day | What exists |
|---|---|
| Day 1 | Spec and stack decisions, no code. |
| Day 2–3 | One end-to-end happy path runs on your laptop. Nothing else. |
| Week 1 | The MVP feature list works. Tests cover auth and the hot rendering paths. |
| Week 2–4 | Hardening, monitoring, deploy pipeline, ten real users. |
What this rules out: month-long "I'm still building" stretches with no demo-able artifact. If day three doesn't have something that runs end-to-end, the spec was wrong, not the schedule.
When not to use it: real research projects, where the timeline lives in the question, not the calendar.
4. The build-mode line — no-code and code-with-AI are different products, and you've picked one on purpose
The decision rule, before any of the platforms try to sell you on themselves:
| If the app has… | The right mode is | The thing you'll regret |
|---|---|---|
| No auth, no data, mostly content | Hosted no-code | Building a Next.js stack from scratch for a landing page |
| Auth, a database, light backend logic | Either, with eyes open | Picking no-code then needing webhooks and queues |
| Multi-tenant, background jobs, integrations | Code-with-AI in your own repo | Discovering at week three that the no-code platform has no escape hatch |
| You need to hand it to a dev team in six months | Code-with-AI in your own repo | Hosted no-code with no export |
Hosted no-code platforms get you to "shippable" fastest and to "evolvable" slowest. The crossover point — where the cost of the platform's limits outpaces the cost of running your own stack — is real, and it lands earlier than the marketing suggests. Pick the mode for where the app needs to be in six months, not where it needs to be on Friday.
5. The environment line — version control, runtime, agent, and ground rules exist before feature one
If you went the code-with-AI route, the table to clear:
| Component | Cleared when |
|---|---|
| Version control | git init, remote configured, first commit lands. |
| Runtime | Node, Python, or whatever installs cleanly on a fresh machine via one script. |
| Agentic editor | One picked, one budget understood, one fallback in mind for when the wallet says no. |
| Long-term memory | A CLAUDE.md, AGENTS.md, or cursor-rules file with stack, conventions, and the don't-do-this list. |
| MCP integrations | Only the ones you'll actually use this week. Not "all of them, just in case." |
The agent-rules file is doing the most load-bearing work in this list. It's the difference between re-explaining your stack every session and the model just knowing it. Write it once, edit it as the project evolves, and treat it as code, not as documentation.
When not to use it: if you're prototyping on a no-code platform, most of this is the platform's job. Don't recreate it for no reason.
6. The model-mix line — you've matched models to tasks, not picked one and prayed
There is no single best model. There is a distribution, and the bar is having a conscious mix:
| Model class | Where it earns its cost |
|---|---|
| Frontier coding (Opus-tier) | New project scaffolding, novel features, anything where creative leaps matter. |
| Mid-tier reasoning (Sonnet, GPT-5-class Codex) | Bulk feature implementation, debugging, writing tests, refactors. |
| Cheap-and-fast (Haiku, mid-tier open-weight) | Lint-shaped fixes, doc updates, small mechanical edits. |
The failure mode is using a frontier model for everything until the wallet hits a wall, then panic-switching mid-feature. The other failure mode is using a cheap model for the architecture phase and discovering at week two that the foundations are subtly wrong. Pick a default for each task class before you start spending.
For hosted no-code platforms, the same rule applies to credits: read the deploy and usage pricing before you're three weeks in, not after.
7. The MVP line — one feature is fully working before two features exist
Scope creep is the largest single killer of AI-built projects. The five-step rule that holds:
- One feature works end-to-end before the next one starts.
- Every meaningful change lands in a commit.
- Every feature is tested — by hand at minimum, against the obvious edge cases — before you move on.
- Tasks given to the model are small and scoped. Never "build the whole spec."
- You read the diff. Not write it. Read it.
The least amount of code that solves the problem is the goal, every time. Models are happy to generate 800 lines where 80 would do, and 80 lines you can read at 11pm beat 800 you can't.
When not to use it: never. There is no project where adding more code first is the right move.
8. The CI/CD line — tests and deploys run themselves, or the project is a hobby
This is the line most vibe-coded projects never cross. The minimum:
| Layer | What "crossed" looks like |
|---|---|
| Tests | Auth flow, the two or three hot UI paths, and any payment or destructive action have automated tests. The model can write them; you read and approve them. |
| CI | Every push runs the tests. Every merge to main runs them again. Failed tests block the merge. |
| CD | Merging to main (or a release branch) deploys. No human runs 30 commands in a terminal. |
| Rollback | One command, or one click, gets you back to the previous green deploy. |
The line between hobby and production sits exactly here. If you're hand-deploying because "the pipeline is a project for later," you have a hobby. The setup is a one-time afternoon; the cost of skipping it compounds for the life of the project.
When not to use it: internal scripts, one-off tools, and anything where you are the only user and you're fine with breakage.
9. The monitoring line — you can answer "did anything break?" without asking a user
The day after deploy, the question isn't "does it work" but "would I know if it didn't." Three signals, all live before any external user touches the app:
- Error tracking. Exceptions surface to a place you actually look. Sentry, equivalent, or rolled-your-own — doesn't matter, as long as it pages.
- Request and usage logs. Enough to reconstruct what a user did when they hit a bug. Aggregated, searchable, retained for at least a week.
- A simple usage dashboard. Daily actives, top errors, and the one or two product-level metrics that mean "the app is working." A hosted analytics tool, a Postgres view, a Looker board — the bar is that someone looks at it.
This is also where the "AI builds it, software engineers maintain it" line lives. Maintenance isn't a model problem. It's a "what broke, when, and for whom" problem, and it requires the boring observability stack the LLM didn't write for you.
The cumulative cost of skipping
| Line | Median cost of skipping |
|---|---|
| 1. Define user, problem, done | Two days deleting features no one asked for. |
| 2. Spec checked into the repo | A week of "what were we even building?" debates. |
| 3. Realistic timeline | A month-long stretch with nothing demo-able. |
| 4. Right build-mode | A platform migration you can't afford. |
| 5. Environment and agent rules | Re-explaining your stack on every session, forever. |
| 6. Matched model mix | A surprise bill, or a subtly wrong foundation. |
| 7. MVP discipline | A 4,000-line app you cannot debug. |
| 8. CI/CD | A hobby project that breaks every time you touch it. |
| 9. Monitoring | Users telling you when it broke. |
None of these are about AI. They're the same nine lines production software has always had to clear. The only thing AI changed is how fast you reach the first line — which is exactly why the next eight feel optional, and exactly why skipping them is so much more expensive than it used to be.
The model can write the code in an afternoon. It cannot decide whether the code should exist, who it's for, when it's done, how it's tested, or whether anyone notices when it breaks. Those nine decisions are still yours.