
The app that worked until it didn't

Live fintech, names withheld | Production Readiness Audit

It worked. That was the problem. The same fintech review that surfaced the security gaps in the previous case study had a second job: not just "is this safe from attackers," but "will this hold up as real traffic arrives." The app was live and functioning. Users were transacting. Nothing was on fire. But a product that works at today's traffic and a product that survives ten times that traffic are two different products, and the gap between them is invisible right up until the moment it isn't.

The freeze waiting to happen

The single most dangerous thing we found wasn't a bug. It was a default nobody had changed.

The app's database connection pool was sitting on its out-of-the-box default of five. That means at most five queries can hold a database connection at once; the sixth waits. Under light traffic, you never notice. But the day a marketing push, a launch, or simple growth sends a few dozen requests through at once, the queue backs up, requests time out, and the app effectively freezes. Nothing broke; the app was simply never told it could use more than five connections.

This is the most treacherous class of production risk: the app passes every test, demos perfectly, runs fine for weeks, and then falls over precisely when success arrives. We sized the pool to the workload and tuned the connection settings, so growth becomes something the system absorbs instead of something that takes it down.
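The queueing arithmetic behind that freeze can be sketched in a few lines. This is a deliberately simplified model, not a measurement from the audited system: it assumes every request arrives at the same instant and holds one connection for the same fixed time. The function name and the numbers are illustrative.

```python
import math


def worst_case_wait_ms(concurrent_requests: int, pool_size: int, query_ms: int) -> int:
    """Time the last request spends waiting for a free connection.

    Simplified model (an assumption): all requests arrive at once and each
    holds a connection for exactly query_ms.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    if concurrent_requests <= 0:
        return 0
    # Requests are served in "waves" of pool_size; the last wave has to wait
    # for every earlier wave to finish.
    waves = math.ceil(concurrent_requests / pool_size)
    return (waves - 1) * query_ms


if __name__ == "__main__":
    # Hypothetical burst: 40 simultaneous requests, 200 ms per query.
    for pool in (5, 20, 40):
        print(pool, worst_case_wait_ms(40, pool, 200))
```

Under these assumptions, a pool of five leaves the last of 40 requests waiting 1.4 seconds before its own query even starts, while a pool of 40 removes the wait entirely. Real sizing also has to respect what the database server itself can handle, which this sketch ignores.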

And the slow fuses underneath

Once we were looking at how the system behaves over time and under load rather than just whether it functions, a pattern of slow-burning issues emerged, each harmless today, each guaranteed to bite later.

Every database query was scanning the entire table

The core tables (wallets, transactions, withdrawals, and payments) had no indexes on the columns the app searches by, so every lookup on those columns reads the whole table top to bottom. Right now, with modest data, that is cheap and queries are fine. As the tables grow past a few hundred thousand rows, the same full scans make the app progressively, then catastrophically, slower. We specified the exact indexes to add before that wall arrives instead of after.
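The difference an index makes is visible in the query plan. The sketch below uses SQLite only because it ships with Python; the audited app's database is not named in this case study, and the table layout, the `wallet_id` column, and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's plan for a query as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable step.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    " id INTEGER PRIMARY KEY,"
    " wallet_id INTEGER NOT NULL,"
    " amount_cents INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO transactions (wallet_id, amount_cents) VALUES (?, ?)",
    [(i % 100, i) for i in range(1000)],
)

lookup = "SELECT id, amount_cents FROM transactions WHERE wallet_id = ?"

# Without an index, the only option is to scan every row.
plan_before = query_plan(conn, lookup, (42,))

conn.execute("CREATE INDEX idx_transactions_wallet_id ON transactions (wallet_id)")

# With the index, the engine can jump straight to the matching rows.
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies by SQLite version, but the first plan reports a scan of the table and the second reports a search using the new index. Other databases expose the same information through their own `EXPLAIN` output.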

The job queue grew forever and would quietly run up the bill

Completed background jobs were never cleared from memory. Every payment, every withdrawal, every export left a record behind that was never removed, so memory use, and the cost of the managed service holding it, climbed without bound. Nothing alerts you to this; you simply get a larger invoice each month and an eventual ceiling. We set the jobs to clean up after themselves.
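One common shape for that cleanup is a retention limit: keep only the most recent completed jobs and drop the rest. The toy queue below is an illustration under that assumption, not the queue library the app uses (which is not named here); `JobQueue` and `keep_completed` are hypothetical names.

```python
from collections import deque


class JobQueue:
    """Toy in-memory queue that retains only the newest completed jobs."""

    def __init__(self, keep_completed: int = 100):
        self.pending = deque()
        # deque(maxlen=N) silently discards the oldest entry once N is reached,
        # so the completed history can never grow without bound.
        self.completed = deque(maxlen=keep_completed)

    def enqueue(self, job_id: str) -> None:
        self.pending.append(job_id)

    def run_next(self) -> str:
        job_id = self.pending.popleft()
        # ... the real work would happen here ...
        self.completed.append(job_id)
        return job_id


queue = JobQueue(keep_completed=3)
for n in range(10):
    queue.enqueue(f"payment-{n}")
    queue.run_next()

print(list(queue.completed))  # ['payment-7', 'payment-8', 'payment-9']
```

Ten jobs ran, but only the last three records remain. A production queue would more likely apply a count or age limit through its own configuration than through hand-written code like this.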

Simultaneous withdrawals created phantom failures

When several withdrawal requests hit at the same moment, the system checked the balance before locking the account, so all of them passed the check, all created pending records, one succeeded, and the rest were flipped to "failed." No money was lost, but the user's history filled with confusing failed withdrawals, and the audit trail, the thing a fintech most needs to be trustworthy, got polluted. We recommended making the requests idempotent so a duplicate returns the original result instead of manufacturing a ghost.
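A minimal sketch of the idempotency idea, assuming each withdrawal request carries a client-supplied key. The `Wallet` class, the key format, and the in-process lock are all stand-ins: a real system would typically use a database row lock and a unique constraint on the key. Doing the balance check inside the lock is also an assumption added here for illustration.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class WithdrawalResult:
    idempotency_key: str
    amount_cents: int
    status: str  # "succeeded" or "rejected"


class Wallet:
    def __init__(self, balance_cents: int):
        self.balance_cents = balance_cents
        self._lock = threading.Lock()
        self._results = {}  # idempotency_key -> WithdrawalResult

    def withdraw(self, idempotency_key: str, amount_cents: int) -> WithdrawalResult:
        # Take the lock first, then look at the balance, so concurrent
        # requests cannot all pass the check on the same stale value.
        with self._lock:
            previous = self._results.get(idempotency_key)
            if previous is not None:
                # A duplicate returns the original result; no new record.
                return previous
            if amount_cents <= self.balance_cents:
                self.balance_cents -= amount_cents
                status = "succeeded"
            else:
                status = "rejected"
            result = WithdrawalResult(idempotency_key, amount_cents, status)
            self._results[idempotency_key] = result
            return result


wallet = Wallet(balance_cents=10_000)
results = []


def submit() -> None:
    results.append(wallet.withdraw("withdrawal-abc", 7_000))


# Five copies of the same request arriving at once.
threads = [threading.Thread(target=submit) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(wallet.balance_cents, {r.status for r in results})
```

All five callers receive the same successful result, the balance is debited once, and only one record exists, so the history shows a single withdrawal instead of one success and four ghosts.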

Failures that no one would ever hear about

Several background operations, including a notification queue and a virtual-account update, caught their own errors, logged them quietly to the console, and reported success anyway. When they failed, nothing upstream knew. A user could be silently left without a needed update and no alert would ever fire. In production, a failure you can't see is worse than one you can; we flagged each so failures surface instead of vanishing.
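The pattern, and one way out of it, can be shown side by side. This is a sketch with hypothetical names (`send_notification`, `run_job`), not the app's code: the first handler swallows the error and reports success, the second logs it and re-raises so the job runner can mark the job as failed.

```python
import logging

logger = logging.getLogger("background")


def send_notification(user_id: str) -> None:
    # Stand-in for a real side effect that can fail.
    raise ConnectionError("notification provider unreachable")


def notify_swallowing(user_id: str) -> bool:
    """Anti-pattern: log the error, then report success anyway."""
    try:
        send_notification(user_id)
    except Exception as exc:
        logger.warning("notification failed: %s", exc)
    return True


def notify_surfacing(user_id: str) -> bool:
    """Log the error with its traceback, then let it propagate."""
    try:
        send_notification(user_id)
    except Exception:
        logger.exception("notification failed for user %s", user_id)
        raise
    return True


def run_job(job, *args) -> str:
    """Hypothetical job runner: records the status an alert could key on."""
    try:
        job(*args)
        return "completed"
    except Exception:
        return "failed"


print(run_job(notify_swallowing, "user-1"))  # completed
print(run_job(notify_surfacing, "user-1"))   # failed
```

Both versions hit the same underlying error. Only the second leaves a "failed" status behind that monitoring, retries, or an alert can act on.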

The outcome

The pool size and connection settings have been corrected; the freeze-at-success scenario is closed. The rest came back as a prioritized roadmap: the indexes to add and when, the queue cleanup to set, the idempotency and error-surfacing to build, each tagged with how much it matters and how soon. Not a pile of tickets, but an ordered answer to the only question that matters at this stage: what will hurt first, and what can wait.

Why this happens

A prototype built fast, especially with AI tools, is optimized to work, not to last. The defaults stay at their defaults. The cleanup nobody needed yet never gets written. The query that's instant on a thousand rows is left to meet its first million in production. None of it shows up while you're building or demoing. It shows up when you scale, raise, or simply succeed: the worst possible moment to discover it.

That's what a production readiness review is for: finding the failures that are scheduled rather than present, and rescheduling them out of existence before your traffic finds them for you.
