Architecture · Jun 2025

When the right solution isn't fixing the bugs — it's diagnosing why the bugs exist

The problem

When I joined Ajari, Wizpr was already in development. It was meant to be an internal communication platform: messaging, calls, the works. But it wasn't finished, and it was full of bugs.

The instinct in that situation is to start fixing bugs. I didn't do that.

Instead, I spent time understanding why the bugs existed. And what I found was that the symptoms weren't the problem — the architecture was.

The server hung with as few as 5 concurrent users, and the database deadlocked under minimal load.

Messages were unreliable. The chat was simple CRUD — send a message, POST to server, fetch messages, display. No offline support, no queue, no retry logic. If the network hiccupped, messages were lost.

The mobile app had no offline capability. Every action required a server round-trip.

These weren't bugs. They were the predictable consequences of an architecture that wasn't designed for a real-time communication product.


The decisions

Propose a full refactor, not a bug fix sprint

This was a judgment call that required some conviction. Bug fixes are faster to justify — you show progress quickly. A refactor is a harder sell because it looks like you're moving backwards before moving forwards.

My argument: patching these bugs would be treating symptoms. The server deadlock would resurface as usage grew. The unreliable messaging would keep generating bugs. The right investment was fixing the architecture.

The proposal was accepted. We refactored both mobile and backend.

Event-based architecture on the backend — inspired by Matrix

The original backend was CRUD: create, read, update, delete records. For a chat application under concurrent load, this creates contention: multiple transactions compete to update the same rows, and that competition is what produces deadlocks.

I redesigned the chat backend around an event-based architecture, inspired by the Matrix protocol:

  • Every action (send message, edit, delete, react, mark as read, typing indicator) becomes an immutable event stored in an append-only log
  • No records are mutated — new events are appended
  • The current state of a conversation is derived from its event history

This eliminates the deadlock problem: an append only adds a new row, so writers no longer queue for locks on the same existing rows the way update and delete operations do.
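A minimal sketch of the idea, in Python for brevity. The event types, field names, and the `derive_messages` helper are illustrative assumptions, not Wizpr's actual schema. It shows an edit and a delete recorded as new events while the visible state is computed by replaying the log.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Event:
    seq: int          # position in the log, assigned on append
    room_id: str
    type: str         # hypothetical types: "message", "edit", "delete"
    sender: str
    payload: dict


class EventLog:
    """Append-only: events are added, never updated or removed."""

    def __init__(self):
        self._events = []
        self._next_seq = count(1)

    def append(self, room_id, type, sender, payload):
        event = Event(next(self._next_seq), room_id, type, sender, payload)
        self._events.append(event)
        return event

    def for_room(self, room_id):
        return [e for e in self._events if e.room_id == room_id]


def derive_messages(events):
    """Fold a room's events, in order, into the currently visible messages."""
    messages = {}  # seq of the original message -> {"sender": ..., "text": ...}
    for e in events:
        if e.type == "message":
            messages[e.seq] = {"sender": e.sender, "text": e.payload["text"]}
        elif e.type == "edit" and e.payload["target"] in messages:
            messages[e.payload["target"]]["text"] = e.payload["text"]
        elif e.type == "delete":
            messages.pop(e.payload["target"], None)
    return list(messages.values())


log = EventLog()
first = log.append("room-1", "message", "ana", {"text": "helo"})
second = log.append("room-1", "message", "ben", {"text": "hi"})
log.append("room-1", "edit", "ana", {"target": first.seq, "text": "hello"})
log.append("room-1", "delete", "ben", {"target": second.seq})

state = derive_messages(log.for_room("room-1"))
```

After these four appends the log still holds all four events, while `state` contains only Ana's corrected message. Nothing in the log was rewritten to get there.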

Initial sync + incremental sync

With an event-based backend, the sync model becomes straightforward:

  • Initial sync: On first launch, the client requests all events. Rooms, participants, message history — everything is loaded from the event log.
  • Incremental sync: On subsequent launches, the client sends its last sync token and receives only events since then.

The mobile client stores everything locally, so the app works fully offline. Actions taken while offline go into a local outbound queue. When connectivity returns, the queue sends the pending operations to the server.
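The two sync modes and the outbound queue can be sketched as follows. This is a simplified in-memory model, not the real client: `SyncServer`, `Client`, and the use of a sequence number as the sync token are assumptions made for illustration.

```python
class SyncServer:
    """In-memory stand-in for the backend event log."""

    def __init__(self):
        self.events = []  # append-only list of (seq, payload)

    def append(self, payload):
        seq = len(self.events) + 1
        self.events.append((seq, payload))
        return seq

    def sync(self, since=None):
        """Return events after `since` plus the token for the next call."""
        start = since or 0
        new_events = [e for e in self.events if e[0] > start]
        next_token = self.events[-1][0] if self.events else start
        return new_events, next_token


class Client:
    def __init__(self, server):
        self.server = server
        self.local_events = []   # local copy used for offline reads
        self.sync_token = None   # None means "never synced"
        self.outbox = []         # pending sends, kept while offline
        self.online = True

    def sync(self):
        events, self.sync_token = self.server.sync(self.sync_token)
        self.local_events.extend(events)
        return len(events)

    def send(self, payload):
        self.outbox.append(payload)
        if self.online:
            self.flush()

    def flush(self):
        while self.outbox:
            self.server.append(self.outbox[0])
            self.outbox.pop(0)  # dequeue only after the server accepted it


server = SyncServer()
server.append("welcome")
server.append("agenda")

client = Client(server)
initial = client.sync()         # initial sync: no token, everything comes back
server.append("reminder")
incremental = client.sync()     # incremental sync: only events after the token

client.online = False
client.send("written offline")  # stays in the outbox
queued = len(client.outbox)
client.online = True
client.flush()                  # connectivity is back: drain the queue
```

The sketch leaves out what a production queue would also need, such as persisting the outbox to disk and attaching an idempotency key to each send so that a retry after a lost acknowledgement does not create a duplicate message.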

Custom biometric authentication — FIDO2 from scratch

Wizpr needed strong authentication. Rather than using a third-party auth service, I designed and implemented a challenge-response authentication system using ECDSA P-256, modeled on the public-key challenge-response flow used by FIDO2/WebAuthn.

On enrollment: the device generates a keypair. The private key lives in hardware-backed secure storage (Android Keystore / iOS Secure Enclave) — it can only be accessed after a successful biometric unlock. The public key is registered with the backend.

On login: the backend sends a 32-byte random challenge. The app prompts biometrics. On success, the private key signs the challenge. The signature is verified server-side. If valid, an access token is issued.

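The private key never travels over the network. A captured signature cannot be replayed, because each challenge is single-use with a 5-minute TTL. The server holds only public keys, so a server compromise does not expose anything that can produce a valid signature.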
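The server side of this flow can be sketched with the standard library alone. Everything here is an assumption for illustration: the class and method names are hypothetical, and signature verification is passed in as a callable, where a real deployment would plug in ECDSA P-256 verification from a vetted cryptography library.

```python
import secrets
import time

CHALLENGE_BYTES = 32
CHALLENGE_TTL_SECONDS = 5 * 60


class ChallengeAuthenticator:
    def __init__(self, verify_signature, clock=time.monotonic):
        # verify_signature(public_key, challenge, signature) -> bool is
        # injected; in production it would be an ECDSA P-256 verification.
        self._verify = verify_signature
        self._clock = clock
        self._public_keys = {}   # user id -> registered public key
        self._pending = {}       # user id -> (challenge, expires_at)

    def enroll(self, user_id, public_key):
        self._public_keys[user_id] = public_key

    def issue_challenge(self, user_id):
        challenge = secrets.token_bytes(CHALLENGE_BYTES)
        expires_at = self._clock() + CHALLENGE_TTL_SECONDS
        self._pending[user_id] = (challenge, expires_at)
        return challenge

    def login(self, user_id, signature):
        # pop() makes the challenge single-use, whether or not it verifies
        entry = self._pending.pop(user_id, None)
        public_key = self._public_keys.get(user_id)
        if entry is None or public_key is None:
            return None
        challenge, expires_at = entry
        if self._clock() > expires_at:
            return None
        if not self._verify(public_key, challenge, signature):
            return None
        return secrets.token_urlsafe(32)  # placeholder access token
```

Removing the challenge before verifying it is the detail that matters here: a second login attempt with the same signature finds no pending challenge and fails, even inside the TTL window.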

Extract the chat logic as a standalone SDK

Rather than building the chat functionality directly into Wizpr, I extracted it into a separate Flutter package with a clean public API. The ChatAdapter interface decouples the SDK from any specific backend — any backend can be connected by implementing the adapter.

This meant another product, LearnExpert, could integrate the same chat foundation. It also meant the chat logic was independently testable and maintainable.
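The adapter pattern itself is language-independent, so here is a sketch of its shape in Python. The actual SDK is a Flutter package, and apart from the `ChatAdapter` name every class and method below is hypothetical.

```python
from abc import ABC, abstractmethod


class ChatAdapter(ABC):
    """Everything the SDK needs from a backend. Method names are hypothetical."""

    @abstractmethod
    def send_event(self, room_id: str, event: dict) -> None: ...

    @abstractmethod
    def fetch_events(self, room_id: str, since: int) -> list: ...


class InMemoryAdapter(ChatAdapter):
    """Test double: keeps events in a dict instead of calling a server."""

    def __init__(self):
        self._rooms = {}

    def send_event(self, room_id, event):
        self._rooms.setdefault(room_id, []).append(event)

    def fetch_events(self, room_id, since):
        return self._rooms.get(room_id, [])[since:]


class ChatClient:
    """SDK entry point: depends only on the adapter interface."""

    def __init__(self, adapter: ChatAdapter):
        self._adapter = adapter

    def send_text(self, room_id, sender, text):
        event = {"type": "message", "sender": sender, "text": text}
        self._adapter.send_event(room_id, event)

    def history(self, room_id):
        return [e["text"] for e in self._adapter.fetch_events(room_id, 0)]


chat = ChatClient(InMemoryAdapter())
chat.send_text("room-1", "ana", "hello")
chat.send_text("room-1", "ben", "hi")
texts = chat.history("room-1")
```

Because `ChatClient` only ever talks to the interface, a second product can supply its own adapter, and tests can use an in-memory one like the above without a server.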


The trade-offs

The refactor cost time. Proposing and executing a full architectural refactor on a product that already had development time invested is a hard decision. There's a period where you've dismantled the old thing and the new thing isn't ready yet. The team had to trust the direction.

Event-based architecture is more complex to reason about than CRUD for developers unfamiliar with it. The learning curve was real. I invested time in documenting the architecture and walking the team through the mental model.

The VoIP integration on iOS was harder than expected. Apple's requirements for VoIP push notifications and CallKit are strict: since iOS 13, the app must report an incoming call to CallKit as soon as it receives a VoIP push. If it fails to, the system terminates the app, and repeated failures can cause iOS to stop delivering VoIP pushes to it altogether. Getting this right required careful implementation and testing.


The outcome

  • Server went from hanging at 5 users to stable at 40 concurrent users, no deadlock incidents
  • Offline-first with outbound queue — messages are never lost, even without connectivity
  • Hardware-backed biometric auth with challenge-response — no passwords stored, no credentials transmitted
  • Chat SDK reused across multiple products
  • Voice and video calls with native OS call experience on both Android and iOS, including wake-up calling when the app is closed

The platform now has an architecture that can grow. The same problems won't resurface as usage increases.
