60fps animation and real-time audio — both fighting for the same thread
When the UI and the audio fight each other
The problem
The requirement sounded straightforward: build a speech-to-speech AI interface. User speaks, AI responds — in voice, in real time. While the AI is responding, show a Lottie animation to indicate it's thinking and speaking.
Three things happening simultaneously:
- Recording the user's voice and streaming it to the AI via WebSocket
- Receiving PCM audio chunks back from the AI and playing them
- Rendering a complex Lottie animation at 60fps
In isolation, each of these is manageable. Together, they created a problem we didn't fully anticipate until we had a working prototype: the UI and the audio pipeline were competing for the same thread.
The result was visible immediately. The Lottie animation would jank — dropping frames, stuttering — whenever audio processing was happening. And the audio itself had a more subtle problem: chunks from the AI's TTS response were playing back inconsistently. Sometimes the audio finished too fast, cutting off mid-sentence. Sometimes chunks arrived out of order or got duplicated. The experience felt broken in a way that was hard to describe but immediately obvious to anyone using it.
These weren't separate bugs. They were two symptoms of the same root cause: too much happening on the main thread.
The decisions
Move audio playback off the main thread
The first step was isolating audio processing from the UI thread entirely. The original implementation used a standard Flutter audio package that ran on the main isolate, which meant every pass of audio processing competed directly with the rendering pipeline.
I switched to taudio, which runs audio playback on a background thread by design. This alone resolved the Lottie jank. Once audio processing moved off the main thread, the animation had the CPU headroom it needed to maintain 60fps.
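Here's a minimal sketch of what the streaming playback setup looks like, assuming a flutter_sound-style API (FlutterSoundPlayer, openPlayer, startPlayerFromStream, feedFromStream). taudio's exact surface may differ, and the codec and sample rate below are placeholders, not the production values:

```dart
import 'dart:typed_data';

// Assumed import: taudio exposes a flutter_sound-style API surface.
import 'package:flutter_sound/flutter_sound.dart';

final FlutterSoundPlayer _player = FlutterSoundPlayer();

Future<void> startStreamingPlayback() async {
  await _player.openPlayer();
  // Playback runs on the plugin's background thread, not the main isolate,
  // so feeding audio no longer competes with Flutter's rendering pipeline.
  await _player.startPlayerFromStream(
    codec: Codec.pcm16,  // raw PCM chunks from the TTS stream
    numChannels: 1,
    sampleRate: 24000,   // placeholder; must match what the backend sends
  );
}

Future<void> playChunk(Uint8List pcm) {
  // feedFromStream completes once the player has accepted the chunk,
  // which is what the chunk serialization further down relies on.
  return _player.feedFromStream(pcm);
}
```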
The second part of the UI fix was adding RepaintBoundary around the Lottie widget. This tells Flutter's rendering engine to treat the animation as an isolated layer — changes inside it don't trigger repaints in the parent tree, and changes in the parent tree don't interrupt it. Small change, meaningful impact.
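In widget terms it's a one-line wrapper; the widget name and asset path here are illustrative:

```dart
import 'package:flutter/material.dart';
import 'package:lottie/lottie.dart';

class SpeakingIndicator extends StatelessWidget {
  const SpeakingIndicator({super.key});

  @override
  Widget build(BuildContext context) {
    // RepaintBoundary gives the animation its own layer: its frames don't
    // force repaints of the surrounding tree, and parent rebuilds don't
    // interrupt it.
    return RepaintBoundary(
      child: Lottie.asset('assets/animations/ai_speaking.json'),
    );
  }
}
```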
After both: smooth animation, consistent 60fps, even during active audio processing.
Fix chunk synchronization with async locks
The audio playback issue — chunks finishing too fast, content getting cut off — was a synchronization problem.
The AI's TTS response arrives as a stream of PCM audio chunks over WebSocket. Each chunk has an index. The playback pipeline receives these chunks and feeds them to the audio player as they arrive. The problem: without any ordering guarantee, chunks could be fed to the player before the previous one had finished processing. The player would rush through them, and the final chunks — which the player had already moved past — would get dropped.
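To make that concrete, the receive side looks roughly like this. The envelope fields (index, audio) are hypothetical, since the real wire format isn't shown here:

```dart
import 'dart:convert';
import 'dart:typed_data';

import 'package:web_socket_channel/web_socket_channel.dart';

/// Hypothetical chunk envelope: an index plus base64-encoded PCM bytes.
class AudioChunk {
  AudioChunk(this.index, this.pcm);

  factory AudioChunk.fromJson(Map<String, dynamic> json) => AudioChunk(
        json['index'] as int,
        base64Decode(json['audio'] as String),
      );

  final int index;
  final Uint8List pcm;
}

void listenForChunks(
  WebSocketChannel channel,
  void Function(AudioChunk chunk) onChunk,
) {
  channel.stream.listen((message) {
    final decoded = jsonDecode(message as String) as Map<String, dynamic>;
    onChunk(AudioChunk.fromJson(decoded));
  });
}
```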
The fix was a mutex lock wrapping every chunk feed operation. This serializes all chunk processing — each chunk must complete before the next one begins feeding into the player. No more racing.
We also added indexed deduplication: if a chunk we've already processed arrives again because of a network retransmission, it's discarded rather than played a second time.
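A self-contained sketch of the shape those two fixes take together. The production code wraps the feed in a lock from the async_locks package; here the same ordering guarantee comes from chaining each feed onto the previous one, so the example doesn't depend on that package's exact API:

```dart
import 'dart:typed_data';

/// Serializes and deduplicates TTS chunks on their way to the player.
class ChunkPipeline {
  ChunkPipeline(this._feedToPlayer);

  final Future<void> Function(Uint8List pcm) _feedToPlayer;
  final Set<int> _playedIndexes = <int>{};
  Future<void> _previousFeed = Future<void>.value();

  Future<void> handleChunk(int index, Uint8List pcm) {
    // Indexed deduplication: a retransmitted chunk is dropped, not replayed.
    if (!_playedIndexes.add(index)) {
      return Future<void>.value();
    }
    // Serialization: this chunk starts feeding only after the previous one
    // has finished, so the player can't race ahead and drop the tail.
    _previousFeed = _previousFeed
        .then((_) => _feedToPlayer(pcm))
        .catchError((Object error) {
      // Keep the chain alive if a single feed fails; real code would log it.
    });
    return _previousFeed;
  }
}
```

Wired together, the receive loop hands each chunk to handleChunk, which in turn calls something like the playChunk helper from the player sketch above.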
After this: audio plays back at natural speed, full sentences, no cutoffs.
Accept the WebSocket constraint
WebRTC would have been the more technically correct choice for this use case — lower latency, better handling of packet loss, built for real-time bidirectional audio. But the AI backend was already built on WebSocket. Switching the transport layer would have meant renegotiating with the AI team and delaying the feature significantly.
The practical decision: work within the WebSocket constraint, and solve the reliability problems at the application layer with proper buffering and synchronization — which is what the async_locks approach did.
This is a trade-off I'm honest about. WebSocket is not ideal for real-time audio. But "ideal" and "right for this situation" are different things.
The trade-offs
The initial audio implementation was built by a colleague. The issues surfaced when we were testing together, and the solutions — the package switch to taudio, the RepaintBoundary, the async_locks synchronization — came out of working through it together. I contributed the diagnosis and the specific fixes, but this was a collaborative solve, not a solo one.
Staying on WebSocket meant accepting higher latency than WebRTC would have given us. For a conversational AI interface, this is a real cost. The latency is perceptible — not terrible, but not invisible either. If we rebuild the transport layer in the future, WebRTC is the right direction.
There's a deliberate 100ms pause before the player stops after the final chunk. Without it, the player closes before the last chunk has fully played out. It works, but it's a hack — the right fix would be a proper drain mechanism that waits for the audio buffer to empty before closing the player. On slower devices, 100ms might not always be enough.
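What that workaround looks like in code, again using assumed flutter_sound-style method names rather than the exact production calls:

```dart
// Assumed import: taudio exposes a flutter_sound-style API surface.
import 'package:flutter_sound/flutter_sound.dart';

Future<void> finishPlayback(FlutterSoundPlayer player) async {
  // Workaround: give the last chunk time to play out before tearing down.
  // A fixed 100ms delay is fragile on slower devices; a proper drain would
  // wait for the player's buffer to empty instead.
  await Future<void>.delayed(const Duration(milliseconds: 100));
  await player.stopPlayer();
  await player.closePlayer();
}
```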
The outcome
After the changes: Lottie runs at consistent 60fps during audio playback, no jank. Full sentences play back at natural speed, no cutoffs, no duplicated chunks. Indexed deduplication prevents double-play from WebSocket retransmissions. The audio BLoC is isolated — state management, threading, and synchronization are self-contained and independently testable.
The speech-to-speech experience in Nisa AI works. Users speak, the AI responds in voice, the animation plays smoothly. The implementation has rough edges I'd improve given more time, and I've tried to be honest here about what those edges are.
What I'd do differently
Two things.
First, specify WebRTC from the start. If the transport layer decision had been made with the audio experience in mind rather than backend convenience, the synchronization problems would have been smaller — WebRTC handles packet ordering and loss natively.
Second, implement a proper audio drain mechanism instead of the 100ms fixed delay. The delay is fragile — on slower devices, 100ms might not be enough. A drain that waits for buffer empty would be more correct and more reliable across hardware.