From Vibe to Verified: Building Production Software with AI

Context
Over three days, from October 30 to November 1, 2025, I built an intranet application template using FastAPI and React, then reverse-engineered it into 417 BDD scenarios and 278 formal requirements with complete traceability.
This wasn't about seeing how fast AI could generate code. It was about validating a process: Can AI-assisted development produce a well-architected application with the documentation and testing rigor needed for enterprise systems?
The short answer: yes, but only if you approach it as engineering, not magic.
The Experiment Design
I started with a typical enterprise need: an intranet web application with authentication, user management, API access controls, and a sample CRUD entity. Nothing exotic - the kind of system you'd build for internal tools or departmental applications.
The constraint was deliberate: work with AI as a development partner, but maintain engineering discipline. That meant:
Start with a detailed plan, not ad-hoc prompting
Build iteratively with verification at each step
Write tests, not just features
Generate documentation as you go, not after
Reverse-engineer to BDD and requirements for traceability
The tech stack - FastAPI (Python), React with TypeScript, Docker - wasn't chosen for novelty. These are production-grade tools with mature ecosystems. The goal wasn't to prove AI could write toy demos. It was to see if AI could contribute to the kind of software that survives code review, security audits, and multi-year maintenance cycles.
Day 1: Foundations That Matter
The first move matters. I didn't ask AI to "build me an app." I gave it a project overview and asked it to generate an implementation plan broken into phases with effort estimates, dependencies, and success criteria.
That plan became the contract. It forced clarity on scope, architecture decisions, and what "done" actually meant. Too often, AI development devolves into prompt whack-a-mole - asking for features without understanding how they connect. A plan changes that dynamic.
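To make "plan as contract" concrete, here is a minimal sketch of a plan held as data, with phases ordered by their dependencies. The phase names, effort figures, and success criteria are illustrative assumptions, not the actual plan from this project.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Phase:
    """One phase of an implementation plan (all values below are illustrative)."""
    name: str
    effort_hours: float
    depends_on: list = field(default_factory=list)
    success_criteria: list = field(default_factory=list)


def build_order(phases):
    """Return phase names in an order that respects declared dependencies."""
    graph = {p.name: set(p.depends_on) for p in phases}
    # static_order() raises graphlib.CycleError if the plan is circular.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical plan, deliberately listed out of order.
plan = [
    Phase("auth", 6, ["scaffold"], ["users can register and log in"]),
    Phase("scaffold", 3, [], ["backend and frontend start locally"]),
    Phase("user-management", 5, ["auth"], ["admins can create and edit users"]),
]

order = build_order(plan)
total_effort = sum(p.effort_hours for p in plan)
print(order, total_effort)
```

Keeping dependencies explicit means a circular or missing prerequisite surfaces before any code is generated, which is the point of planning first.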
With the plan in hand, Day 1 became straightforward:
Set up Python and Node environments
Scaffold backend and frontend structures
Implement JWT-based authentication with Argon2 password hashing
Build login and registration pages
Wire up protected routes
The interesting part wasn't that AI could generate this code. The interesting part was when it hit Pydantic v2 compatibility issues with FastAPI. AI didn't just error out. It researched the migration guide, understood the breaking changes in FieldInfo, and fixed the problem. That's not code generation. That's troubleshooting.
By evening, authentication worked end-to-end. Users could register, log in, and access protected routes. More importantly, I had working tests verifying the behavior.
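For readers unfamiliar with what a JWT check involves, the sketch below signs and verifies an HS256 token using only the Python standard library. It is an illustration of the mechanism, not the code from this project: a real application would normally rely on a maintained JWT library, and the function names and the 15-minute lifetime here are assumptions. Password hashing with Argon2 is left out because it needs a third-party package.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(subject: str, secret: bytes, ttl_seconds: int = 900) -> str:
    """Create a signed token with a subject and an expiry claim."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": subject, "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes) -> Optional[dict]:
    """Return the payload if the signature is valid and unexpired, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        expected = hmac.new(
            secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
        ).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        payload = json.loads(_b64url_decode(payload_b64))
    except (ValueError, TypeError):
        return None
    if payload.get("exp", 0) < time.time():
        return None
    return payload
```

A protected route would call something like `verify_token` on the bearer token and reject the request when it returns `None`.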
Day 2: Building on Solid Ground
Day 2 was about expanding features while maintaining quality: user management, personal access tokens, and a sample CRUD entity (items).
But here's where most AI-generated projects fall apart: consistency. It's easy to get five features that look like they were written by five different people with five different UI patterns.
The fix: establish shared components early. A reusable UserModal for create/edit operations. A Layout component with sidebar navigation. A toast notification system. CSS variables for theming. FormField components with built-in validation.
AI excelled at this once the pattern was established. "Create a PAT management page following the same patterns as the user management page" produced consistent, predictable results. The navigation worked the same way. The forms validated the same way. Error handling looked the same.
Personal Access Tokens (PATs) showcased something important: scope-based authorization. Users could create API keys with read, write, or admin permissions. Tokens were hashed before storage. They displayed only once after creation. The UX included security warnings about token handling.
This wasn't just CRUD. It was thinking through the security model and user workflow. AI contributed ideas here—suggesting one-time token display, implementing secure hashing, recommending scope validation middleware.
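The token rules described above (hash before storage, show the plaintext once, check scope on use) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `pat_` prefix, the function names, and the idea that scopes form a hierarchy (admin implies write implies read) are mine, not confirmed details of the project.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass

# Assumed ordering: each scope implies the ones below it.
SCOPE_LEVELS = {"read": 1, "write": 2, "admin": 3}


@dataclass(frozen=True)
class StoredToken:
    name: str
    scope: str
    digest: str  # SHA-256 hex digest; the plaintext is never stored


def create_token(name: str, scope: str):
    """Return (plaintext, record). The plaintext is shown to the user once."""
    if scope not in SCOPE_LEVELS:
        raise ValueError(f"unknown scope: {scope}")
    plaintext = "pat_" + secrets.token_urlsafe(32)
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return plaintext, StoredToken(name, scope, digest)


def authorize(presented: str, stored: StoredToken, required_scope: str) -> bool:
    """Check the presented token against the stored digest, then the scope."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    if not hmac.compare_digest(digest, stored.digest):
        return False
    return SCOPE_LEVELS[stored.scope] >= SCOPE_LEVELS[required_scope]


plaintext, record = create_token("API Reader", "read")
print(authorize(plaintext, record, "read"), authorize(plaintext, record, "write"))
```

A fast hash such as SHA-256 is a reasonable choice here because the token is long and random; that reasoning does not carry over to user-chosen passwords, which is why those go through Argon2.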
By evening, the application had enough features to be useful. But "works on my machine" isn't production-ready.
Day 3 Morning: Quality Is a Feature
Day 3 started with testing infrastructure. Not aspirational testing, not "we'll add tests later," but actual tests running in a CI-ready format.
Backend: 52 PyTest cases covering authentication, user management, CRUD operations, and token lifecycle. All passing. Frontend: 21 Vitest component tests with mocking and assertions. Security: 19 dedicated tests for SQL injection prevention, XSS sanitization, password hashing verification, and access control enforcement.
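To give a flavor of what a dedicated security test looks like, here is a pytest-style sketch of a SQL injection check. It uses the standard-library `sqlite3` module as a stand-in; the project itself used SQLModel, and the table layout and `find_user` helper are hypothetical.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str):
    # The ? placeholder makes sqlite3 bind username as data, never as SQL.
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchone()


def test_injection_payload_matches_no_rows():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute("INSERT INTO users (username) VALUES ('alice')")

    assert find_user(conn, "alice") == (1, "alice")
    # A classic payload is treated as a literal (and nonexistent) username.
    assert find_user(conn, "' OR '1'='1") is None
    # A destructive payload is also just data; the table survives.
    assert find_user(conn, "alice'; DROP TABLE users; --") is None
    assert conn.execute("SELECT COUNT(*) FROM users").fetchone() == (1,)
```

The test asserts the negative case explicitly: a malicious input must return nothing and must leave the data untouched.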
Then came UI polish that actually mattered:
Modern top navigation bar with active page highlighting
Dropdown menus for user actions
Consistent styling via CSS variables
Accessibility improvements (ARIA labels, keyboard navigation, reduced motion support)
Toast notifications that worked cross-browser
The key wasn't that AI could write this code. It was that I could ask AI to verify it worked. "Use Puppeteer to test the login flow and screenshot the result." AI would spin up the app, run the browser automation, capture screenshots, and report findings.
That verification loop is critical. It's the difference between "I think this works" and "I confirmed this works."
Day 3 Midday: Enterprise Authentication
LDAP integration is where enterprise tools live or die. Nobody wants to manage two sets of credentials. Users expect single sign-on with their Active Directory accounts.
This is complex territory: LDAP connection pooling, group-based role assignment, fallback to local authentication when LDAP is unavailable, secure credential handling, troubleshooting tools for admins.
AI researched LDAP best practices, studied the ldap3 library documentation, and generated a 330-line service with:
Dual authentication (LDAP first, local fallback)
Group-based admin role assignment (AD groups → app roles)
Health monitoring endpoints
Comprehensive error handling and logging
22 unit tests
A 650-line configuration guide
The documentation is important. LDAP configuration is notoriously finicky. The guide walks through connection strings, search filters, group mappings, troubleshooting steps, and security considerations. This isn't generated fluff - it's operationally useful.
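The dual-authentication flow can be sketched without any LDAP library by passing the two backends in as plain callables. Everything here is a hypothetical illustration: the exception name, the callable signature, and the choice to fall back only when the directory is unreachable (not when LDAP rejects the password) are assumptions about how such a service might be structured.

```python
from typing import Callable, Optional


class DirectoryUnavailable(Exception):
    """Raised by the (hypothetical) LDAP backend when the server cannot be reached."""


# An authenticator returns the user's group names on success, None on bad credentials.
Authenticator = Callable[[str, str], Optional[set]]


def authenticate(
    username: str,
    password: str,
    ldap_auth: Authenticator,
    local_auth: Authenticator,
    admin_groups: set,
) -> Optional[dict]:
    try:
        groups = ldap_auth(username, password)
        source = "ldap"
    except DirectoryUnavailable:
        # Fall back only when the directory is down, not on a rejected password.
        groups = local_auth(username, password)
        source = "local"
    if groups is None:
        return None
    # Group-based role assignment: any overlap with the admin groups grants admin.
    role = "admin" if groups & admin_groups else "user"
    return {"username": username, "role": role, "source": source}
```

Injecting the backends keeps the decision logic unit-testable with simple stubs, without a live directory server.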
Day 3 Afternoon: Deployment Reality
Docker isn't optional for modern deployment. The application needed to run the same way in development, testing, and production.
Multi-stage Docker builds reduced image sizes. Non-root users improved container security. Health check endpoints enabled orchestration monitoring. Podman compatibility meant no vendor lock-in.
AI wrote the Dockerfiles, docker-compose configuration, and automated deployment tests. Then it ran them. Spun up containers, verified endpoints, captured logs, validated behavior.
This is where "AI can code" becomes "AI can engineer." It's not just writing syntax. It's understanding deployment concerns, testing assumptions, and documenting operational procedures.
The Reverse Engineering Phase
By late Day 3, I had a working application. Tested, documented, deployable. But enterprise systems need more than code - they need traceability.
Could we reverse-engineer this codebase into BDD features and formal requirements? Not just describe what it does, but document it in a form that satisfies compliance, supports long-term maintenance, and serves as living documentation?
Phase 9: BDD Feature Documentation
AI analyzed the implemented features and generated 23 Gherkin feature files organized into 7 categories:
Authentication & Authorization (5 files, 64 scenarios)
User Management (3 files, 49 scenarios)
Personal Access Tokens (3 files, 60 scenarios)
Items Management (2 files, 39 scenarios)
Security (4 files, 84 scenarios)
UI/UX (4 files, 79 scenarios)
System Administration (2 files, 42 scenarios)
Total: 417 scenarios written in declarative Gherkin format following industry best practices.
Each scenario describes behavior from a user perspective:
Scenario: User creates personal access token with read scope
  Given the user is authenticated
  When the user creates a token named "API Reader" with read scope
  Then the token is generated successfully
  And the token displays only once
  And the user receives a security warning about token handling
This isn't code documentation. This is behavior specification that non-technical stakeholders can read and verify.
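Because Gherkin is line-oriented, scenario counts like the ones above are easy to check mechanically. The sketch below is a deliberately small parser, not a replacement for a real Gherkin tool: it ignores tags, backgrounds, and example tables, and the `Feature:` title in the sample is an assumption.

```python
import re

STEP_KEYWORDS = {"Given", "When", "Then", "And", "But"}
SCENARIO_RE = re.compile(r"Scenario(?: Outline)?:\s*(.+)")

# Sample text: the scenario is the one quoted above; the feature title is assumed.
FEATURE = """\
Feature: Personal access tokens

  Scenario: User creates personal access token with read scope
    Given the user is authenticated
    When the user creates a token named "API Reader" with read scope
    Then the token is generated successfully
    And the token displays only once
    And the user receives a security warning about token handling
"""


def parse_scenarios(feature_text: str) -> dict:
    """Map each scenario title to its list of step lines."""
    scenarios = {}
    current = None
    for raw in feature_text.splitlines():
        line = raw.strip()
        match = SCENARIO_RE.match(line)
        if match:
            current = match.group(1)
            scenarios[current] = []
        elif current is not None and line.split(" ", 1)[0] in STEP_KEYWORDS:
            scenarios[current].append(line)
    return scenarios


scenarios = parse_scenarios(FEATURE)
for title, steps in scenarios.items():
    print(f"{title}: {len(steps)} steps")
```

Run over a whole `features/` directory, a script like this gives an independent check on totals that AI-generated summaries report.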
Phase 10: Requirements Generation
From those BDD features, AI extracted formal requirements using EARS (Easy Approach to Requirements Syntax), following the ISO/IEC/IEEE 29148:2018 standard.
Eight requirements documents with 278 total requirements:
REQ-001: Authentication (50 requirements)
REQ-002: User Management (32 requirements)
REQ-003: Items Management (38 requirements)
REQ-004: Personal Access Tokens (54 requirements)
REQ-005: Security (28 requirements)
REQ-006: UI/UX (26 requirements)
REQ-007: System Administration (24 requirements)
REQ-008: Quality Attributes (26 requirements)
Each requirement follows EARS patterns:
REQ-AUTH-FUNC-002: WHEN a user submits the registration form
THEN the system shall validate that the password meets complexity
requirements (minimum 8 characters, uppercase, lowercase, number,
special character).
Priority: CRITICAL
Category: Functional
Source: features/authentication/01-user-registration.feature
Test Cases: TC-AUTH-002, TC-AUTH-003
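A requirement in this shape is structured enough to lint. The sketch below models the record above as a dataclass and checks it against the WHEN ... THEN the system shall ... wording used in this article. The field names and the `problems` method are assumptions. Note also that canonical EARS phrases event-driven requirements as "When <trigger>, the <system> shall <response>", so the regular expression here matches this article's variant, not the general syntax.

```python
import re
from dataclasses import dataclass, field

# Assumed shape, matching the wording of the example requirement above.
EVENT_DRIVEN = re.compile(
    r"^WHEN\s+.+?\s+THEN\s+the system shall\s+.+", re.IGNORECASE | re.DOTALL
)


@dataclass
class Requirement:
    req_id: str
    text: str
    priority: str
    category: str
    source: str
    test_cases: list = field(default_factory=list)

    def problems(self) -> list:
        """Return human-readable issues; an empty list means the record looks sound."""
        issues = []
        if not EVENT_DRIVEN.match(self.text):
            issues.append("text does not match WHEN ... THEN the system shall ...")
        if not self.test_cases:
            issues.append("no test cases linked")
        return issues


req = Requirement(
    req_id="REQ-AUTH-FUNC-002",
    text=(
        "WHEN a user submits the registration form THEN the system shall "
        "validate that the password meets complexity requirements (minimum 8 "
        "characters, uppercase, lowercase, number, special character)."
    ),
    priority="CRITICAL",
    category="Functional",
    source="features/authentication/01-user-registration.feature",
    test_cases=["TC-AUTH-002", "TC-AUTH-003"],
)
print(req.problems())
```

A check like this will not judge whether a requirement is correct, but it does catch records that drifted from the agreed pattern or lost their test links.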
Finally, AI generated a complete traceability matrix mapping:
- Feature files → Scenarios → Requirements → Test cases → Implementation
Forward traceability: "Which code implements this requirement?"
Backward traceability: "Which requirements does this code satisfy?"
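Both directions of traceability fall out of the same list of links. In the sketch below, the first requirement ID and its test case IDs come from the example requirement earlier in this article; the file paths and the untested `REQ-AUTH-FUNC-999` entry are made up for illustration.

```python
from collections import defaultdict

# (requirement, test case or None, implementation path). Paths are hypothetical.
LINKS = [
    ("REQ-AUTH-FUNC-002", "TC-AUTH-002", "backend/app/auth/validators.py"),
    ("REQ-AUTH-FUNC-002", "TC-AUTH-003", "backend/app/auth/validators.py"),
    ("REQ-AUTH-FUNC-999", None, "backend/app/auth/routes.py"),  # hypothetical gap
]


def build_indexes(links):
    """Build forward (requirement -> code) and backward (code -> requirement) maps."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for requirement, _test_case, code_path in links:
        forward[requirement].add(code_path)
        backward[code_path].add(requirement)
    return forward, backward


def untested(links):
    """Requirements that appear in the matrix with no linked test case."""
    all_reqs = {req for req, _, _ in links}
    covered = {req for req, test_case, _ in links if test_case}
    return sorted(all_reqs - covered)


forward, backward = build_indexes(LINKS)
print(sorted(forward["REQ-AUTH-FUNC-002"]))
print(sorted(backward["backend/app/auth/routes.py"]))
print(untested(LINKS))
```

The `untested` query is where a matrix earns its keep: "complete" traceability is a claim that this list is empty.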
Output Screenshots
[Screenshot: Login and Main Dashboard View]
[Screenshot: Item View and Management]
[Screenshot: API Token Management]
[Screenshot: User Management]
Takeaways
Core Argument
The process demonstrated here isn't about AI replacing developers. It's about AI enabling a development workflow that produces better artifacts.
Consider what this three-day sprint created:
32,000+ lines of code and documentation (backend, frontend, tests, docs)
92 automated tests (100% passing)
417 BDD scenarios documenting every behavior
278 formal requirements with full traceability
Deployment-ready Docker configuration
Comprehensive user and admin documentation
A traditional development team would need weeks, maybe months, to produce this scope with equivalent documentation quality. Not because developers are slow, but because documentation usually happens last, if at all.
The AI-assisted approach inverted that. Documentation wasn't an afterthought - it was continuous. The BDD features and requirements weren't retrofitted to match code - they were extracted from a working implementation and verified against it.
Technical and Practical Insights
What works:
Plan-first development: Generate detailed implementation plans before coding. Plans provide context that improves every subsequent AI interaction.
Iterative verification: Don't assume AI output is correct. Test it. AI can write the tests, run them, interpret failures, and fix issues. Use that capability.
Pattern establishment: Define architectural patterns early (component structure, API conventions, error handling). AI excels at applying consistent patterns.
Automated validation: AI can run linters, security scans, browser automation, and deployment tests. Leverage this for rapid feedback loops.
Documentation as process: Generate docs alongside code, not after. BDD features, API guides, and configuration documentation should evolve with implementation.
Reverse engineering works: You can build a "vibe-based" prototype, then formalize it into rigorous specifications. The traceability isn't fake—it accurately reflects what exists.
What doesn't work:
Ad-hoc prompting: "Build me X" without structure produces inconsistent results. Context matters.
Blind trust: AI makes mistakes. It hallucinates APIs that don't exist. It misunderstands requirements. Verification is mandatory.
Ignoring fundamentals: AI won't save a bad architecture. If you don't understand authentication security, LDAP integration, or database transactions, AI will happily generate plausible-looking broken code.
Documentation substitution: Living documentation is valuable, but it doesn't replace understanding. You still need engineers who can debug production incidents.
Key heuristics:
Ask AI to create verification tools (tests, scripts, automation), not just features
Provide reference documents for specialized domains (BDD best practices, EARS patterns, deployment guides)
Request detailed plans with dependencies and success criteria before implementation
Use markdown for planning and tracking - AI reads it effectively
Break large tasks into phases; review output before proceeding
Broader Implications
This experiment suggests several shifts in how we should think about software development with AI assistance:
From Code Generation to Artifact Generation
AI's value isn't writing code - it's producing the full set of artifacts professional software requires. Code, tests, documentation, deployment configuration, requirements specifications. The velocity gain comes from generating everything in parallel, not just the code.
From Documentation Debt to Living Specifications
Documentation typically lags implementation by weeks or months. With AI assistance, documentation generation keeps pace with coding. Better: you can reverse-engineer implementations into formal specifications that maintain traceability to source.
From "Move Fast and Break Things" to "Move Fast and Document Things"
Startup culture glorified shipping features over writing documentation. AI changes that tradeoff. You can move fast and maintain rigorous documentation. The constraint isn't time—it's discipline.
From Individual Coding to System Engineering
AI shifts the developer role from "writing functions" toward "architecting systems." You spend more time on:
Defining architecture and patterns
Reviewing AI-generated code for correctness
Writing verification criteria
Ensuring consistency across artifacts
Making tradeoff decisions
Less time on:
Boilerplate implementation
Routine refactoring
Manual test writing
Documentation formatting
Configuration file generation
This is a good shift. It moves human effort toward higher-leverage activities.
Closing Thought
This project wasn't about proving AI can code. That's solved. It was about proving AI can contribute to engineering - the discipline of building systems that work reliably, are documented clearly, and evolve maintainably over years.
The answer is yes, but only if you treat AI as a capable junior engineer who needs clear direction, thorough review, and systematic verification. Give it a detailed plan. Establish architectural patterns. Verify everything. Generate documentation continuously. Reverse-engineer to formal specifications when needed.
What matters most isn't the AI's capability - it's your engineering process. AI amplifies whatever process you give it. A sloppy process produces sloppy results faster. A rigorous process produces rigorous results faster.
Three days, more than 32,000 lines of code and documentation, 417 BDD scenarios, 278 requirements, full traceability. Not because AI is magic, but because structured engineering with AI assistance is efficient.
The template is proven. The workflow works. The artifacts meet enterprise standards.
Now the real work begins: adapting this process to production systems, compliance frameworks, and multi-year maintenance cycles. But that's a solvable problem—because we have a foundation that's both fast and rigorous.
That's the mindset we need: not "how fast can AI write code" but "how can AI help us build better systems."
Appendix: Metrics Summary
Development Time: 3 days (Oct 30 - Nov 1, 2025)
Code Generated:
Backend: ~8,500 lines (Python)
Frontend: ~7,200 lines (TypeScript/React)
Tests: ~2,200 lines
Total Code: ~17,900 lines
Documentation Generated:
BDD Features: ~6,000 lines (Gherkin)
Requirements: ~8,500 lines (EARS format)
User/Admin/API Guides: ~1,850 lines
Implementation Plans: ~2,400 lines
Total Documentation: ~18,750 lines
Testing Coverage:
Backend Tests: 52 (PyTest)
Frontend Tests: 21 (Vitest)
Security Tests: 19
BDD Scenarios: 417
Total Automated Tests: 92 (100% passing)
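The totals in this appendix can be re-derived from the per-category figures. The short tally below uses only the numbers listed above (the line counts are approximate in the source) and shows how the code, documentation, and test totals add up.

```python
# Figures copied from the lists above; line counts are approximate.
code_lines = {"backend": 8_500, "frontend": 7_200, "tests": 2_200}
doc_lines = {"bdd_features": 6_000, "requirements": 8_500, "guides": 1_850, "plans": 2_400}
automated_tests = {"backend": 52, "frontend": 21, "security": 19}

total_code = sum(code_lines.values())
total_docs = sum(doc_lines.values())
total_tests = sum(automated_tests.values())

print(f"code: {total_code:,}  docs: {total_docs:,}  combined: {total_code + total_docs:,}")
print(f"automated tests: {total_tests}")
```

The combined figure of code plus documentation is what the "32,000+ lines" headline refers to; code alone is a little under 18,000 lines.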
Traceability:
Feature Files: 23
Requirements: 278
Test Cases: 87 unique mappings
Forward and backward traceability: Complete
Technologies:
Backend: FastAPI, Python 3.13, SQLModel, Argon2, ldap3
Frontend: React 18, TypeScript, Vite, React Router
Testing: PyTest, Vitest, Puppeteer
Deployment: Docker, Podman, multi-stage builds
Documentation: Gherkin BDD, EARS requirements, Markdown
Process Innovation:
Forward development: Requirements → Implementation
Reverse engineering: Implementation → BDD → Requirements
Continuous verification: AI-driven testing throughout
Living documentation: Maintained alongside code


