E2EE for Chat — When Simple Isn't Simple Enough (Part 1)
The problem
The app had a chat module — a reusable Flutter package, backend-agnostic by design. It could plug into any app without knowing what server it talked to. That flexibility was intentional. It would also make encryption harder than expected.
Sodium was already integrated. Key generation worked. Public keys went to the server. Basic encryption existed, just disconnected from the chat flow.
The requirement: implement end-to-end encryption using Signal Protocol. Full X3DH key exchange for session establishment. Double Ratchet for forward secrecy, so compromising today's keys doesn't expose yesterday's messages. Sender Keys for groups. The same protocol WhatsApp and Signal use.
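To make the forward-secrecy claim concrete, here is a minimal sketch of the symmetric-key half of the Double Ratchet, written in Python for brevity rather than in the module's Dart. It assumes HMAC-SHA256 as the chain KDF with the single-byte constants the Double Ratchet specification suggests. The starting secret and the name `ratchet_step` are hypothetical, and the Diffie-Hellman ratchet that provides break-in recovery is left out entirely.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Derive (next_chain_key, message_key) from the current chain key."""
    # Separate constants keep the two outputs independent of each other.
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# Hypothetical starting chain key. In the real protocol this comes out of
# X3DH and the Diffie-Hellman ratchet, never from a hard-coded value.
chain_key = hashlib.sha256(b"demo shared secret").digest()

message_keys = []
for _ in range(3):
    # The old chain key is overwritten on every step, so it cannot be reused.
    chain_key, message_key = ratchet_step(chain_key)
    message_keys.append(message_key)
```

Each step is one-way: someone who steals the current `chain_key` can derive later message keys in this chain, but cannot run HMAC backwards to recover the keys that protected earlier messages.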
Then the real shape of the problem emerged.
The chat module couldn't know about Signal Protocol — that would break the backend-agnostic design. But Signal Protocol needs context: who am I encrypting for, what session, what room. The existing encryption interface was too simple. Bytes in, bytes out. No recipient. No session. No context.
The protocol itself is complex. Not the math — the edge cases. Session corruption. Prekey depletion. Decryption failures with no obvious cause. WhatsApp and Signal spent years on these. We'd hit them too.
Groups made it worse. Signal Protocol for 1:1 is manageable. For groups, every operation multiplies. Sender Keys need distribution. Members join, keys get sent. Members leave, keys rotate and get redistributed to everyone else. Somewhere in there, groups became 60% of the complexity for 40% of the feature.
The implementation path wasn't obvious either. Build Signal Protocol from scratch using Sodium — 3-4 weeks, security audit risk, potential bugs in the implementation. Use a Dart library that implements the protocol — battle-tested, actively maintained, but GPL-3.0 licensed. Proprietary app, potential blocker.
None of these were separate. Change the interface and it affects error handling. Change the scope and the complexity budget shifts. Choose the wrong library and you own a licensing problem right before shipping. Every decision was downstream of another one.
This wasn't "add encryption." It was architectural surgery.
The decisions
Define the contract in the module, implement it in the app
Three options:
- Put everything in the chat module — Signal Protocol, session management, crypto
- Put everything in the app — the module just calls it
- Split: the module defines the interface, the app implements the protocol
Option 1 destroyed the backend-agnostic design. Option 2 made the module a thin wrapper with no real shape. Option 3 kept the separation.
The chat module defines what encryption looks like: an interface that takes bytes and returns bytes. The app implements Signal Protocol behind it. The module stays clean. The complexity is the app's problem, not the module's.
Pass context through an optional map
Signal Protocol needs context to work: recipient, session, room. Adding those as explicit fields would tie the interface to one protocol permanently.
The solution: an optional key-value map. The interface takes bytes and a map. No-op implementations ignore the map. Signal Protocol reads what it needs from it. The module passes it through without knowing what's in it.
Any future encryption scheme defines its own context. The module doesn't care.
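A rough sketch of what such a contract could look like, in Python for brevity (the real module is Dart). Every name here is hypothetical: `MessageCipher`, `NoOpCipher`, `RecipientTaggingCipher`, `send_message`, and the `recipient_id` key are illustrations, not the module's actual API. The tagging cipher is a toy stand-in for the app-side implementation and performs no real encryption.

```python
from typing import Mapping, Optional, Protocol

Context = Optional[Mapping[str, str]]


class MessageCipher(Protocol):
    """The contract the chat module would own: bytes in, bytes out, opaque map."""

    def encrypt(self, plaintext: bytes, context: Context = None) -> bytes: ...

    def decrypt(self, ciphertext: bytes, context: Context = None) -> bytes: ...


class NoOpCipher:
    """Default implementation: ignores the context map entirely."""

    def encrypt(self, plaintext: bytes, context: Context = None) -> bytes:
        return plaintext

    def decrypt(self, ciphertext: bytes, context: Context = None) -> bytes:
        return ciphertext


class RecipientTaggingCipher:
    """Toy stand-in for the app-side implementation. NOT real encryption.

    It only demonstrates that an implementation can require keys from the
    context map that the chat module itself never inspects.
    """

    def encrypt(self, plaintext: bytes, context: Context = None) -> bytes:
        if not context or "recipient_id" not in context:
            raise ValueError("recipient_id is required in context")
        tag = context["recipient_id"].encode("utf-8")
        # One length byte, the recipient tag, then the reversed payload.
        return len(tag).to_bytes(1, "big") + tag + plaintext[::-1]

    def decrypt(self, ciphertext: bytes, context: Context = None) -> bytes:
        tag_len = ciphertext[0]
        return ciphertext[1 + tag_len:][::-1]


def send_message(cipher: MessageCipher, text: str, context: Context = None) -> bytes:
    """What the chat module does: forward the map without reading it."""
    return cipher.encrypt(text.encode("utf-8"), context)
```

The module-side function never names a recipient, a session, or a room. Swapping `NoOpCipher` for another implementation changes what the map must contain, but not the module's code.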
Device as identity — a product constraint, not an architecture choice
Signal Protocol needs a stable identity anchor — a cryptographic root that says "this is me." The app had two identity concepts: the account and the hardware device.
The product had already answered which one mattered. One account binds to one device, permanently. You can't move an account to a different device. This wasn't an engineering decision. It was a product requirement.
That constraint simplified the protocol significantly. No multi-device sessions. No cross-device key sync. One identity key pair, anchored to the hardware. The device plays the role a phone number plays in Signal: the one stable identifier that sessions are addressed to.
The harsh implication was already baked into the product model: if the device is lost, the account is gone. No recovery. The encryption design just had to acknowledge that honestly rather than paper over it.
Library over custom implementation
Two realistic paths: build Signal Protocol from scratch using Sodium, or use an existing Dart implementation.
From scratch: 3-4 weeks minimum, possible implementation bugs, security audit before shipping. The library: same protocol, already implemented and tested by others.
The call: use the library. The protocol is complex enough that a proven implementation is worth the dependency. The real concern was the license — GPL-3.0. That question got deferred.
1:1 first, groups later
Signal Protocol group encryption works through Sender Keys: each member generates their own symmetric sender key and shares it with every other member over the existing 1:1 sessions. Every time someone joins, they need those keys. Every time someone leaves, the keys rotate and get redistributed to everyone else.
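A small bookkeeping sketch shows how fast the distribution work grows. It is Python rather than Dart, and it rests on stated assumptions: each member holds their own sender key, keys are distributed eagerly on every membership change (real clients may defer this until the next send), and random bytes stand in for real key material. The `Group` class and its fields are hypothetical.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Group:
    # member id -> that member's current sender key (random bytes as a stand-in)
    sender_keys: dict[str, bytes] = field(default_factory=dict)
    # number of pairwise-encrypted key deliveries made so far
    distributions_sent: int = 0

    def join(self, member: str) -> None:
        existing = list(self.sender_keys)
        self.sender_keys[member] = os.urandom(32)
        # Every existing member sends its key to the newcomer,
        # and the newcomer sends its own key to each of them.
        self.distributions_sent += 2 * len(existing)

    def leave(self, member: str) -> None:
        del self.sender_keys[member]
        remaining = list(self.sender_keys)
        # The departed member knew every key, so every key is replaced.
        for m in remaining:
            self.sender_keys[m] = os.urandom(32)
        # Each remaining member redistributes to all the others.
        self.distributions_sent += len(remaining) * (len(remaining) - 1)


group = Group()
for name in ["alice", "bob", "carol", "dave"]:
    group.join(name)
deliveries_after_joins = group.distributions_sent   # 12
group.leave("dave")
deliveries_after_leave = group.distributions_sent   # 18
```

Four joins and one departure already cost 18 key deliveries under these assumptions, and every one of them is a message that can be delayed, dropped, or processed out of order.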
The edge cases compound from there. What if the admin is offline when a new member joins? What if rotation fails halfway through? What if two members leave at the same time? What if the distribution message never arrives?
Each question opens more questions. Groups were 60% of the complexity for 40% of the feature — before we'd shipped any of it.
The call: implement 1:1 E2EE first. Ship it. Learn what actually breaks in production. Add groups later with real failure data instead of design-doc speculation.
Forward secrecy, break-in recovery, encrypted direct messages. Groups can wait.
The trade-offs
Signal Protocol requires the server to store and distribute prekey bundles — public key material that lets two devices start an encrypted session without being online at the same time. Three new endpoints: upload bundles, fetch by device, replenish one-time prekeys when they run out. The server never sees plaintext, but it's still backend work that needs coordination with another team.
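The three endpoints can be sketched as an in-memory directory. This is a Python illustration under assumptions, not the backend's design: `PrekeyDirectory`, `DeviceRecord`, and the method names are hypothetical, and authentication, persistence, and signature verification are omitted. The one behavior taken from X3DH itself is that a one-time prekey is handed out at most once, and a bundle is still served without one when the supply runs dry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeviceRecord:
    identity_key: bytes
    signed_prekey: bytes
    signed_prekey_signature: bytes
    one_time_prekeys: list[bytes] = field(default_factory=list)


class PrekeyDirectory:
    """Stores public key material only. It never sees plaintext or private keys."""

    def __init__(self) -> None:
        self._devices: dict[str, DeviceRecord] = {}

    def upload_bundle(self, device_id: str, record: DeviceRecord) -> None:
        self._devices[device_id] = record

    def fetch_bundle(self, device_id: str) -> dict:
        record = self._devices[device_id]
        # Hand out each one-time prekey once, then forget it.
        one_time: Optional[bytes] = (
            record.one_time_prekeys.pop(0) if record.one_time_prekeys else None
        )
        return {
            "identity_key": record.identity_key,
            "signed_prekey": record.signed_prekey,
            "signed_prekey_signature": record.signed_prekey_signature,
            "one_time_prekey": one_time,  # None once the supply is exhausted
        }

    def replenish(self, device_id: str, new_prekeys: list[bytes]) -> int:
        record = self._devices[device_id]
        record.one_time_prekeys.extend(new_prekeys)
        return len(record.one_time_prekeys)

    def remaining(self, device_id: str) -> int:
        return len(self._devices[device_id].one_time_prekeys)
```

A client could poll something like `remaining` to decide when to upload a fresh batch. Where that threshold sits is one of the open questions, not something this sketch answers.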
The edge cases are real and mostly deferred. Session corruption: delete it, fetch a new prekey bundle, reinitialize. Show "Re-establishing secure connection" to the user. Decryption failure: log it, show a placeholder. Prekey depletion: background task generates and uploads a fresh batch, exponential backoff on failure. These are starting points. What actually matters will show up in production, not in a design doc.
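Two of those starting points, sketched in Python under assumptions. The delay values, the placeholder copy, `DecryptionError`, and all function names are hypothetical. The sleep function is injected so the retry loop can be exercised without actually waiting.

```python
from typing import Callable

PLACEHOLDER = "Message could not be decrypted"


class DecryptionError(Exception):
    """Hypothetical error type the app-side cipher would raise."""


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 5) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def replenish_with_backoff(
    upload: Callable[[], None],
    sleep: Callable[[float], None],
    attempts: int = 5,
) -> bool:
    """Try to upload a fresh prekey batch, waiting longer after each failure."""
    for delay in backoff_delays(attempts=attempts):
        try:
            upload()
            return True
        except ConnectionError:
            sleep(delay)
    return False  # give up for now; a later background run can try again


def render_incoming(
    decrypt: Callable[[bytes], bytes],
    ciphertext: bytes,
    log: list[str],
) -> str:
    """Decryption failure: log it and show a placeholder instead of crashing."""
    try:
        return decrypt(ciphertext).decode("utf-8")
    except DecryptionError as exc:
        log.append(f"decryption failed: {exc}")
        return PLACEHOLDER
```

Neither function decides the harder questions, such as whether a failed message is retried automatically or when a corrupted session is torn down. They only pin down the default behavior so the UI has something defined to show.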
The GPL-3.0 question didn't get answered. If the app is proprietary, the library is a blocker. The interface supports swapping implementations without touching the chat module, so the door is open — but that's not the same as having a plan. Options are: open source the app, build Signal Protocol from scratch, or find a commercial license. None of those are quick.
Deferring group E2EE means shipping an inconsistent product. Some conversations encrypted, some not. Users will notice that. Shipping both at once would mean solving Sender Key distribution, join/leave logic, key rotation, and admin permissions before we've even validated that the 1:1 path works correctly. The option that looks worse on paper, a visibly inconsistent product, felt like the more honest one.
The outcome
What the design settled on:
- Chat module defines the interface, app layer handles Signal Protocol
- Encryption contract accepts an optional context map — generic, protocol-agnostic
- Device as identity anchor — no multi-device, no recovery
- Proven Signal Protocol library over custom implementation
- 1:1 E2EE first, groups in v2
Still open: GPL-3.0 compatibility, edge case handling in detail, decryption failure UX, backend API design, prekey rotation timing.
The architecture holds. But this is part 1 — the design conversation. Part 2 is the implementation plan: what gets built, in what order, with what definition of done.
What I'd do differently
Confirm licensing before evaluating the library. Time went into understanding what the library could do before checking whether GPL-3.0 was compatible with the app's licensing model. If the app stays proprietary, that path may be closed before it starts. Should have been question one.
Prototype session establishment first. The design assumes the library integrates cleanly with the existing architecture. That's an assumption. A 2-hour prototype — generate keys, build a session, encrypt one message — would have tested it before the full design was committed.
Define "done" precisely upfront. "1:1 E2EE" is scope, not a definition of done. Prekey rotation? Session corruption recovery? What does the user see when decryption fails — message gone, placeholder shown, automatic retry? These decisions touch UI, copy, and error handling architecture. Defer them and the implementation stops mid-build to ask the same questions.