Why your iOS voice agent still hears itself: a field report on VoiceProcessingIO, AEC tail, and why 300ms isn't enough
In real-time voice-agent applications running on iOS, acoustic echo cancellation is challenging. You build the usual pipeline — microphone capture, a streaming STT, an LLM, a streaming TTS, and playback — turn on setVoiceProcessingEnabled(true) on the input node, and expect Apple to handle the rest. It doesn’t. Your agent picks up the tail of its own TTS, transcribes it as user speech, starts responding to itself, and the conversation collapses into a feedback loop within two turns.
Then you go do what every voice-agent dev eventually does: you add a post-playback mic gate — when the TTS playback buffer empties, suppress the microphone for a few hundred milliseconds to ride out the tail. You pick 300ms because that seems reasonable. You ship. Your agent still leaks. You stare at the Xcode console at 2am wondering what Apple’s echo cancellation is actually doing.
I spent two weeks in this hole while building Uttero, a voice calling MCP server that lets agents make and receive real-time voice calls. This post is the field report I wish had existed when I started. The short version: Apple’s VoiceProcessingIO is not what most of us think it is, the initialization order matters in a way Apple doesn’t document, and 300ms is the wrong answer. 800ms is closer.
What VPIO actually does (and doesn’t do) #
The first thing worth getting straight: setVoiceProcessingEnabled(true) on AVAudioEngine.inputNode does not turn on traditional acoustic echo cancellation. It activates Apple’s VoiceProcessingIO audio unit under the hood, and VPIO is closer to output subtraction than to a full AEC. An Apple DTS engineer confirmed this on the Developer Forums (thread 97679, worth reading in full): VPIO subtracts the output signal from the input signal. There are no delay parameters. There is no adaptive room modeling you can tune. There is no API to inspect the echo reference path or validate that it is working.
What this means in practice:
- In a normal room with ambient noise, subtraction works well enough. You don’t notice.
- In a quiet room (say, recording a demo video at midnight), the residual is clearly audible. Your agent hears the tail of its own TTS, especially the breathy end of a sentence.
- There is no knob to turn. VPIO is a black box. If it isn’t working for your use case, you cannot tune it into working — you can only work around it.
This last point is the one that took me the longest to internalize. I kept searching for “the right VPIO configuration” as if one existed. It doesn’t. Apple explicitly designed it as a one-size-fits-most subtraction layer, and the DTS response makes clear there is no tuning surface.
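To make the subtraction-vs-cancellation distinction concrete, here is a toy numeric sketch (not Apple's implementation, and all sample values are made up): plain sample-wise subtraction only removes echo when the echo path has zero delay and unit gain, which a real speaker-to-mic path never has.

```swift
import Foundation

// What the speaker plays (arbitrary toy samples).
let speaker: [Double] = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0, 0.0, 0.0]

// Echo as the mic hears it: attenuated and delayed by 2 samples
// by the acoustic path. Both numbers are illustrative assumptions.
let delay = 2
let gain = 0.6
var micEcho = [Double](repeating: 0, count: speaker.count)
for i in delay..<speaker.count {
    micEcho[i] = gain * speaker[i - delay]
}

// Naive "output subtraction" with no delay or gain estimate:
let residual = (0..<speaker.count).map { micEcho[$0] - speaker[$0] }

// The residual is not near zero: the un-modeled delay leaks through,
// and here the subtraction actually adds energy instead of removing it.
let residualEnergy = residual.reduce(0) { $0 + $1 * $1 }
let echoEnergy = micEcho.reduce(0) { $0 + $1 * $1 }
print(residualEnergy > echoEnergy)  // true
```

A real AEC estimates the delay and the room's impulse response before subtracting; a fixed subtraction layer like VPIO's leaves exactly the kind of residual tail described above.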
And that’s just the behavioral story. Before you even get there, VPIO has a set of silent failure modes that make it look like nothing is happening at all. Which brings us to the part that actually cost me a week.
Three silent killers of iOS VPIO #
These are the configurations where you think voice processing is active, Xcode doesn’t complain, your app launches fine, and the AEC is quietly doing nothing.
Killer 1: initialization order #
This is the big one. Apple does not document it, and the default AVAudioEngine examples you’ll find do not follow it.
The rule: attach your playback graph before you enable voice processing on the input node. Not after. Not “somewhere in start().” Before.
Why? VPIO uses a two-bus architecture — bus 0 for output, bus 1 for input. The AEC reference signal is obtained from the output bus, which receives your mixed audio via the playout callback. If you enable voice processing before the playback graph is attached, VPIO activates with no output bus connected, and its internal notion of “what the speaker is playing” is empty. The AEC runs, but it has nothing to subtract from. Mic input comes in untouched. The symptom: echo, and sometimes severely quiet output (Apple Developer Forum, also Twilio’s AVAudioEngineDevice reference).
The fix, in my start():
```swift
configurePlayback()                                   // 1. attach playback graph first
try engine.inputNode.setVoiceProcessingEnabled(true)  // 2. now enable voice processing
try configureCapture()                                // 3. install capture tap in the VP-enabled format
engine.prepare()
try engine.start()
```
The reverse order (setVoiceProcessingEnabled → configurePlayback) is what most tutorials show. It silently breaks AEC. I had written the code that way for two weeks before I found the right order buried in a forum reply.
Killer 2: speaker routing in .default mode #
Most voice-agent implementations want hands-free operation, so they route to the speaker. The intuitive way — setting the audio session mode to .default with .defaultToSpeaker — disables VPIO’s echo cancellation. Again, silently. The AEC still runs, but speaker playback at full volume with .default mode creates an acoustic path that VPIO was not designed to cancel.
The fix is to use .voiceChat mode, not .default:
```swift
try session.setCategory(
    .playAndRecord,
    mode: .voiceChat,
    options: [.defaultToSpeaker, .allowBluetooth, .allowBluetoothA2DP]
)
try session.setActive(true)
```
.voiceChat is the mode VPIO is tuned for. .videoChat behaves similarly. Any other combination — .default, .measurement, .spokenAudio — you are on your own.
Killer 3: manual rendering mode #
If your architecture uses AVAudioEngine in manual rendering mode (for custom mixing, offline processing, or feeding samples into another audio unit), VPIO is simply unavailable. setVoiceProcessingEnabled(true) requires device rendering. This is documented, but only if you read the small print.
The prevalent workaround? Drop one layer down to CoreAudio and construct the VoiceProcessingIO audio unit directly with kAudioUnitType_Output / kAudioUnitSubType_VoiceProcessingIO. Twilio’s ExampleAVAudioEngineDevice.m is the canonical reference. It’s an enormous amount of work compared to the one-line setVoiceProcessingEnabled call, and for most voice agents you don’t need it — you can live inside device rendering. But if you’re trying to pre-mix TTS with background music or notification tones into the same output stream, be aware the high-level API is not an option.
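For orientation, the first step of that lower-level route looks roughly like this. This is a hedged, Darwin-only sketch with error handling elided, not a working audio device; Twilio's ExampleAVAudioEngineDevice.m shows everything else that has to surround it (render callbacks, format negotiation, lifecycle).

```swift
import AudioToolbox

// Describe the VoiceProcessingIO audio unit.
var desc = AudioComponentDescription(
    componentType: kAudioUnitType_Output,
    componentSubType: kAudioUnitSubType_VoiceProcessingIO,
    componentManufacturer: kAudioUnitManufacturer_Apple,
    componentFlags: 0,
    componentFlagsMask: 0
)

// Find and instantiate it.
var vpioUnit: AudioUnit?
if let component = AudioComponentFindNext(nil, &desc) {
    AudioComponentInstanceNew(component, &vpioUnit)
}

// Enable capture on bus 1; playback on bus 0 is enabled by default.
var enable: UInt32 = 1
if let unit = vpioUnit {
    AudioUnitSetProperty(unit,
                         kAudioOutputUnitProperty_EnableIO,
                         kAudioUnitScope_Input,
                         1,  // bus 1 = input
                         &enable,
                         UInt32(MemoryLayout<UInt32>.size))
}
```

From here you still owe the unit stream formats, render callbacks on both buses, and initialization, which is where the bulk of Twilio's reference file goes.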
Why you still need an application-layer mic gate #
Say you’ve navigated all three killers. Voice processing is initialized in the correct order, your session mode is .voiceChat, you are rendering to device. VPIO is now doing what it can. You will still hear residual echo at the end of TTS utterances, especially in quiet rooms. This is not a bug. Remember: VPIO is subtraction, not adaptive cancellation, and the acoustic tail of a speaker has real physical latency that subtraction cannot handle perfectly.
The fix is an application-layer mic gate: when TTS playback finishes, suppress the microphone for a short grace window so the speaker’s physical flush and VPIO’s residual tail cannot leak back as “user speech.”
How long should the gate be? This is where I got it wrong the first time.
Picking the gate duration: why 300ms leaked #
My first implementation used a 300ms mic gate. I picked 300ms because it is the upper bound of the typical room reverberation tail documented for WebRTC’s AEC3 (100–300ms). The logic: cover the worst-case typical room, move on. It still leaked. Users (me, testing at 2am in a quiet room) heard the agent’s TTS tail re-enter as a half-word transcript that triggered a hallucinated response.
The reason 300ms was not enough: the room reverberation tail is only one of the three things the gate has to cover. The full acoustic chain after your playback buffer drains is:
- Software-to-DAC gap. Zero-ish, but not exactly zero.
- Hardware output latency. iOS reports this via `AVAudioSession.outputLatency` — typically 10–60ms, occasionally higher on Bluetooth.
- Room reverberation tail. 100–300ms per WebRTC AEC3.
- VPIO convergence / residual. The AEC adapts to the acoustic environment, and there is a residual during the adaptation window.
Summed, the real post-playback “echo risk window” is 200–500ms in typical rooms, and longer with Bluetooth output or reflective rooms. A 300ms gate sits inside that range rather than past its upper end.
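The arithmetic is trivial but worth writing down, because each component is easy to forget on its own. The sketch below sums representative values from the ranges above; none of these numbers are measurements, and `EchoWindow` is a name invented for this illustration.

```swift
import Foundation

// Back-of-envelope echo-risk window. All values are representative
// assumptions drawn from the component ranges discussed above.
struct EchoWindow {
    var dacGap: TimeInterval         // software-to-DAC gap, near zero
    var outputLatency: TimeInterval  // AVAudioSession.outputLatency
    var reverbTail: TimeInterval     // room tail per WebRTC AEC3
    var vpioResidual: TimeInterval   // AEC adaptation residual
    var total: TimeInterval { dacGap + outputLatency + reverbTail + vpioResidual }
}

let quietRoomWired = EchoWindow(dacGap: 0.005, outputLatency: 0.020,
                                reverbTail: 0.300, vpioResidual: 0.100)
let bluetoothRoom  = EchoWindow(dacGap: 0.005, outputLatency: 0.150,
                                reverbTail: 0.300, vpioResidual: 0.100)

print(quietRoomWired.total)  // ~0.425s: already past a 300ms gate
print(bluetoothRoom.total)   // ~0.555s with a plausible Bluetooth latency
```

Even the wired quiet-room case lands past 300ms before you add any safety margin, which is the whole argument in four additions.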
When I researched what production voice agents actually use, the numbers were consistently higher. Pipecat’s default VAD stop_secs sits at 800ms. One production two-tier RMS gate implementation I studied uses a 1.5-second cooldown with an elevated RMS threshold. Nobody is running a 300ms gate in production.
I moved the gate to 800ms. The bug went away.
Is 800ms the “right” answer? It is a trade-off. Shorter gates let the user barge in faster; longer gates are safer but hurt conversational responsiveness. 800ms matches the Pipecat default, which has been empirically tuned against a large user base, so I took it as a reasonable floor. If your environment is noisier than a quiet midnight room, you can likely go lower. If you ship on Android with Bluetooth headphones across thousands of devices, you will probably go higher. The important thing is that the gate exists and that its duration is chosen from acoustic reality, not intuition.
Disclaimer: I am a voice-agent builder, not an acoustics researcher. The numbers above come from product documentation (WebRTC AEC3, Pipecat), Apple Developer Forum threads, and my own testing on iOS 18 devices. They should be taken as a starting point, not absolute truth. If you are building a voice product you care about, measure on your target devices in your target rooms.
The implementation #
Here is the core of what landed in Uttero’s CallAudioEngine.swift. It is a single Swift file that owns the entire in-call audio graph: VPIO-enabled capture, a lock-protected ring buffer for TTS playback, and the post-playback mic gate.
Full start():
```swift
private func start(sampleRate: Double, isIncoming: Bool) throws {
    guard !isRunning else { return }
    targetSampleRate = sampleRate

    let session = AVAudioSession.sharedInstance()
    if isIncoming {
        // CallKit has already activated + configured the session in
        // didActivateAudioSession. Don't re-activate — just attach.
        activatedSession = false
    } else {
        try session.setCategory(
            .playAndRecord,
            mode: .voiceChat,
            options: [.defaultToSpeaker, .allowBluetooth, .allowBluetoothA2DP]
        )
        try session.setActive(true)
        activatedSession = true
    }

    // Init order matters: attach the playback graph BEFORE enabling voice
    // processing so VPIO sees the output bus as its AEC reference signal,
    // then install the capture tap in the VP-enabled input format.
    configurePlayback()
    try engine.inputNode.setVoiceProcessingEnabled(true)
    try configureCapture()

    engine.prepare()
    try engine.start()
    isRunning = true
}
```
Arming the gate on drain. This runs on the audio-render thread, so it must not allocate or acquire multiple locks:
```swift
@discardableResult
private func drainPlayback(into dst: UnsafeMutablePointer<UInt8>, wanted: Int) -> Int {
    os_unfair_lock_lock(&playbackLock)
    let take = min(wanted, playbackFill)
    if take > 0 {
        // ... copy bytes from ring buffer into dst ...
        playbackFill -= take
        if playbackFill == 0 {
            // Buffer just emptied — arm the mic gate so the speaker's physical
            // flush + VPIO residual tail isn't transcribed as user speech.
            micGateDeadline = CFAbsoluteTimeGetCurrent() + micGateDuration
        }
    }
    os_unfair_lock_unlock(&playbackLock)
    // zero-fill any remainder so the caller always gets `wanted` valid bytes
    return take
}

private let micGateDuration: CFAbsoluteTime = 0.8
```
Disarming the gate when new TTS arrives, so barge-in on the next utterance is not blocked:
```swift
private func appendPlayback(_ bytes: Data) {
    os_unfair_lock_lock(&playbackLock)
    defer { os_unfair_lock_unlock(&playbackLock) }

    // New TTS arriving — disarm any stale post-drain gate so the next
    // utterance isn't blocked. The gate will re-arm when this chunk (and
    // any that follow) finish draining.
    micGateDeadline = 0

    // ... copy bytes into ring buffer ...
}
```
The capture tap checks the gate before forwarding frames:
```swift
input.installTap(onBus: 0, bufferSize: 1024, format: inputFormat) { [weak self] buffer, _ in
    guard let self = self, !self.isMuted else { return }
    if self.isPostPlaybackGateActive() { return }
    self.convertAndForward(buffer: buffer, converter: converter, target: targetFormat)
}
```
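The gate check itself is small enough to sketch in full. This is an illustrative stand-alone version, not the exact Uttero code: the real implementation shares one os_unfair_lock with the ring buffer, while this sketch uses NSLock and an injected clock so it runs (and is testable) anywhere. `MicGate` and its method names are invented for the sketch.

```swift
import Foundation

// Minimal arm / disarm / check state machine for the post-playback gate.
final class MicGate {
    private let lock = NSLock()
    private var deadline: TimeInterval = 0  // a CFAbsoluteTime in the real code

    /// Called when the playback ring buffer drains to empty.
    func arm(for duration: TimeInterval, now: TimeInterval) {
        lock.lock(); defer { lock.unlock() }
        deadline = now + duration
    }

    /// Called when new TTS bytes arrive, so barge-in is not blocked.
    func disarm() {
        lock.lock(); defer { lock.unlock() }
        deadline = 0
    }

    /// True while gated: the capture tap drops frames.
    func isActive(now: TimeInterval) -> Bool {
        lock.lock(); defer { lock.unlock() }
        return now < deadline
    }
}

let gate = MicGate()
gate.arm(for: 0.8, now: 100.0)
print(gate.isActive(now: 100.5))  // true: still inside the 800ms window
print(gate.isActive(now: 101.0))  // false: window elapsed
gate.arm(for: 0.8, now: 200.0)
gate.disarm()                     // new TTS arrived
print(gate.isActive(now: 200.1))  // false: barge-in allowed
```

The injected `now` parameter is only there to make the logic deterministic; on device you would read the clock inside the methods, as the drainPlayback snippet does.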
A few implementation notes worth flagging:
- The gate deadline is stored as a `CFAbsoluteTime` protected by the same `os_unfair_lock` as the ring buffer, because it is armed from the audio-render thread and read from the capture-tap thread. Anything simpler (an atomic `Bool`, for instance) races in practice.
- `setVoiceProcessingEnabled(false)` must be called before `engine.stop()`. On iOS 18+, doing it in the reverse order can crash in `AURemoteIO` teardown. Learned the hard way.
- Don’t try to deactivate the `AVAudioSession` synchronously after stopping playback — `AVSpeechSynthesizer` and similar system components hold references that cause the deactivation to fail until they release. A ~500ms delayed deactivation is the workaround most production apps converge on.
CallKit interaction #
One quick gotcha for anyone plugging this into CallKit: on incoming calls, CallKit activates the AVAudioSession for you, via provider(_:didActivate:) in your CXProviderDelegate. If you then call setActive(true) yourself from your engine’s start(), you get a state mismatch that manifests as either silent audio or a call route that refuses to switch to speaker.
The pattern that works for me: pass an isIncoming: Bool flag into start(), and skip the session-activate path when true. The session is already active; the engine just needs to attach. See the start() snippet above.
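In delegate form, the shape is roughly the following. This is a Darwin-only sketch; `CallManager`, `startIncoming()`, and `stop()` are hypothetical names wrapping the start(sampleRate:isIncoming:) function shown earlier — only the CXProviderDelegate callback signatures come from CallKit.

```swift
import AVFoundation
import CallKit

final class CallManager: NSObject, CXProviderDelegate {
    let audioEngine = CallAudioEngine()

    func providerDidReset(_ provider: CXProvider) {
        audioEngine.stop()  // hypothetical teardown
    }

    func provider(_ provider: CXProvider, didActivate audioSession: AVAudioSession) {
        // CallKit has already activated and configured the session.
        // Attach the engine without ever calling setActive(true) here.
        try? audioEngine.startIncoming()  // wraps start(..., isIncoming: true)
    }
}
```

The symmetric hook, provider(_:didDeactivate:), is where you would tear down the engine for the session CallKit is taking away.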
Demo source code #
Uttero is not fully open source yet, but the CallAudioEngine pattern above is small enough to lift and adapt. I am planning to extract it into a standalone public sample repo (minimal Swift Package, no Flutter plumbing, works with any AVAudioEngine-based voice agent). If you want an early copy, reach out.
Conclusion #
iOS voice-agent echo is one of those problems that looks like a 10-line fix and turns out to be a two-week rabbit hole. The short checklist:
- Use `.voiceChat` mode, not `.default`.
- Attach the playback graph before you enable voice processing.
- Reset voice processing before stopping the engine.
- Expect residual echo even when everything is configured correctly — VPIO is subtraction, not cancellation.
- Add a post-playback mic gate of at least 500–800ms, not 300ms.
- Disarm the gate when new TTS arrives, so barge-in still works.
None of this is obviously discoverable from Apple’s documentation. Most of what I learned came from the Apple Developer Forums, Twilio’s open-source audio device, the WebRTC AEC3 documentation, and reading Pipecat’s defaults. The ecosystem is small and fragmented, and the best knowledge currently lives in forum replies and production code that nobody has time to write up.
If you are building a voice agent on iOS and hitting echo problems, I am happy to trade notes. Find me on X at @barockok_ or email hi@barock.dev. And if you find a part of this that is wrong — in particular if you have a cleaner way to get the AEC reference signal on iOS than the forum answers suggest — I would love to hear it.