Audio Latency: Measuring Microphone and Speaker Delay

Published 2026-04-13 · 8 min read

Summary (TL;DR)

Recording an acoustic guitar take through my old USB interface used to land somewhere around 35 ms of round-trip latency in the monitoring headphones — enough that any rhythmic phrase felt like wading through wet sand. Switching to a MOTU M2 with a properly configured ASIO driver brought that down into the 5 ms range, and the same take immediately went from “unplayable” to “natural.” The phenomenon is not a bug; it is a measurable, physical reality called round-trip latency (RTT). RTT is the sum of five stages: the microphone signal filling an input buffer, the host software processing that buffer, the processed signal filling an output buffer, the DAC converting it back to analog, and the final sound propagating through the air to your ear (about 1 ms for every 34 cm). The tunable part lives mostly in buffer size and driver model; ASIO, WASAPI Exclusive, Core Audio, and JACK each have different typical minimums and platform limits. This guide explains how those components add up, why the sample-rate/buffer-size trade-off is not as intuitive as it seems, and how to measure real end-to-end latency with a loopback cable, a test tone, and an audio editor — rather than trusting marketing’s “1 ms” claims.

Background

Between a microphone and a speaker, a computer does more than most people realize. An ADC samples the microphone signal at a chosen sample rate (commonly 48 kHz) and collects samples into an input buffer. Once the buffer fills — say, 128 samples at 48 kHz, about 2.67 ms of audio — the driver hands it to the application.

The application (a DAW, a communication client, a browser) processes the buffer, applies effects or encodes for network transmission, and writes the result to an output buffer. When the output buffer fills, the DAC converts it to an analog signal that leaves the headphone or line output.

Then there is the world outside the computer. Sound travels at roughly 343 m/s at room temperature, which means about 1 ms per 34 cm. Headphones add essentially none of this; a monitor speaker across the room adds a few milliseconds on top of everything else. A pair of monitors at 1.7 m of seating distance contributes roughly 5 ms of pure propagation — invisible on the spec sheet but present in your ears.

So round-trip latency = input buffer + processing + output buffer + DAC + propagation. With the sample rate fixed, smaller buffers reduce latency but raise CPU load, because the audio thread has to wake up more often. If buffer size is specified in samples rather than in time, raising the sample rate shortens the same “N samples” buffer in milliseconds, but higher sample rates also raise throughput and stress drivers differently, so this is not a free way to reduce delay.
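That sum can be sketched as a simple budget calculation. The processing and DAC figures below are illustrative assumptions, not measurements of any particular device; only the buffer and propagation terms follow directly from the numbers above.

```python
def buffer_ms(samples, sample_rate):
    """Duration of a buffer of `samples` frames at `sample_rate` Hz, in ms."""
    return samples / sample_rate * 1000.0

def round_trip_ms(in_buf, out_buf, sample_rate,
                  processing_ms=1.0, dac_ms=0.5, distance_m=0.0):
    """Illustrative RTT budget: input buffer + processing + output buffer
    + DAC + acoustic propagation. Defaults are assumptions for the sketch."""
    propagation_ms = distance_m / 343.0 * 1000.0  # speed of sound ~343 m/s
    return (buffer_ms(in_buf, sample_rate) + processing_ms
            + buffer_ms(out_buf, sample_rate) + dac_ms + propagation_ms)

print(round(buffer_ms(128, 48000), 2))  # 2.67
```

Plugging in a 128-sample buffer on both sides at 48 kHz with monitors 1.7 m away shows how the propagation term alone can rival the entire digital path.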

The driver model sits between the application and the audio hardware and has its own latency characteristics. On Windows, the classic MME and DirectSound paths add substantial buffering for compatibility; WASAPI Shared is lower but still mediated by the mixer; WASAPI Exclusive and ASIO bypass much of that overhead. ASIO4ALL is a generic ASIO wrapper that helps when a vendor ASIO driver is unavailable, but it is not a substitute for a true vendor ASIO driver — measured RTT on the same Focusrite Scarlett 2i2 3rd Gen typically lands meaningfully lower with the official driver than with ASIO4ALL on top of WASAPI. On macOS, Core Audio is already tuned for low latency as the platform default. On Linux, the JACK and PipeWire stacks can reach very low latency once configured, but they require real-time kernel scheduling to be consistent.

Data / Comparison

| Criterion | ASIO | WASAPI Exclusive | Core Audio | JACK |
| --- | --- | --- | --- | --- |
| Platform | Mostly Windows | Windows only | macOS only | Linux-centric; cross-platform ports exist |
| Typical lowest latency | Very low, commonly in the low-milliseconds range | Low when used in exclusive mode | Low and stable in the low-milliseconds range | Very low when well-configured, with more setup complexity |
| Sharing model | Driver-dependent (typically exclusive) | Shared or exclusive; exclusive is lower latency | Well-tuned shared mode is the default | Routing-matrix based, multi-client |

Exact millisecond numbers depend heavily on hardware, driver, and OS version, so rather than trusting the “1 ms latency” line on a product page, measure your own system once to establish a baseline. Affordable interfaces like the MOTU M2 and Focusrite Scarlett 2i2 3rd Gen can reach single-digit-millisecond round-trip when paired with a proper ASIO or Core Audio driver, but the same hardware over a Windows shared-mode WASAPI path or wrapped through ASIO4ALL typically performs noticeably worse. Your real RTT is the joint result of three variables — interface, driver, buffer size — and assuming any one of them in isolation is rarely accurate.

Real-world Scenarios

Scenario 1 — Live performance and in-ear monitoring. A singer listening to their own voice through in-ear monitors will perceive even a few milliseconds of delay as a timing problem. The target here is RTT under 10 ms, which usually requires low buffer sizes, a low-latency driver (ASIO on Windows, Core Audio on macOS), and — where possible — the interface’s hardware direct monitoring, which bypasses the software path entirely for the monitoring signal. The MOTU M2’s front-panel monitor knob solves this in one twist, and the Scarlett 2i2 3rd Gen’s “Direct” switch does the same.

Scenario 2 — Podcast recording. Unless several hosts are monitoring each other live, 50 to 100 ms of RTT is usually fine during the recording session. Post-production aligns tracks anyway, and a larger buffer improves CPU headroom, which in turn reduces the risk of dropouts in a long recording. Optimizing for lowest latency here often hurts rather than helps.

Scenario 3 — Video conferencing. Browser-based WebRTC calls typically operate in a range from roughly 100 to 200 ms end-to-end because of network, encoding, and jitter buffering on top of local audio latency. That sits near the boundary where conversational turn-taking feels natural; adding Bluetooth earbuds with an older codec can push it past that threshold, which is why participants start talking over each other.

Scenario 4 — Online music collaboration. Attempting to play synchronized music with a remote partner over the internet is the unforgiving end of this spectrum. Beyond roughly 25 to 30 ms of one-way delay, keeping a stable tempo becomes difficult, and typical home internet plus consumer audio software regularly lands well above that. Dedicated low-latency collaboration platforms exist specifically to shave every possible millisecond, and even then they generally require wired ethernet, careful driver tuning, and physical proximity on the internet backbone. Recognizing that a consumer video-call stack is the wrong tool for this job is usually the first step.

Common Misconceptions

“Bluetooth audio is always high-latency.” Classic A2DP streaming does add significant delay. But newer LE Audio (LC3) and aptX Low Latency codecs bring this down considerably, which is a category change in felt performance. Two headphones that both say “Bluetooth” can differ dramatically depending on the codec they and the source support. For monitoring a recording session, though, wired remains the safer choice.

“USB audio is always better than analog.” USB audio beats onboard analog when the interface has a good DAC, clean preamps, and stable drivers. Cheap USB DACs can actually be worse due to jitter, noise, and driver issues. “USB” alone does not imply quality; component choice does.

“192 kHz sample rate is lower latency.” Counter-intuitive but important: if the buffer is specified in time (ms), raising the sample rate does not reduce physical latency. Only when the buffer is fixed in samples does a higher sample rate mean fewer milliseconds per buffer — at the cost of more data per second, more CPU, and more driver stress. Low latency primarily comes from smaller buffers and a better driver model, not from a higher sample rate.
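The distinction is easiest to see numerically. A minimal sketch of the two cases — buffer fixed in samples versus buffer fixed in time:

```python
def buffer_latency_ms(buffer_samples, sample_rate):
    """One buffer's worth of audio, expressed in milliseconds."""
    return buffer_samples / sample_rate * 1000.0

# Case 1: buffer fixed in samples — a higher rate shortens the buffer in ms.
print(buffer_latency_ms(128, 48000))   # ~2.67 ms
print(buffer_latency_ms(128, 192000))  # ~0.67 ms

# Case 2: buffer fixed in time — the latency is identical; the higher rate
# just means four times as many samples to process in the same 5 ms window.
samples_48k  = int(0.005 * 48000)    # 240 samples per 5 ms buffer
samples_192k = int(0.005 * 192000)   # 960 samples per 5 ms buffer
```

In case 2 the milliseconds do not change at all; only the per-buffer workload quadruples, which is exactly the “more data, more CPU, more driver stress” cost described above.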

“Software monitoring and hardware monitoring feel the same.” They do not. Software monitoring routes the microphone signal through the DAW path — input buffer, plugin processing, output buffer — and accumulates RTT along the way. Hardware monitoring on an audio interface sends the mic signal directly to the headphone output inside the interface, bypassing the computer entirely for the monitoring path. The performer hears themselves with effectively zero latency, while the computer still records the track for later production. Many setups mix the two up and blame buffer size for a delay that hardware monitoring would eliminate in one setting change.

Checklist

  1. Define the RTT target for your use case. Live performance: below 10 ms; recording: 50 to 100 ms is fine; conferencing: 100 to 200 ms is realistic.
  2. Choose the driver model. On Windows, ASIO or WASAPI Exclusive; on macOS, Core Audio; on Linux, JACK or PipeWire. Legacy and shared-mode paths (MME, DirectSound, shared-mode WASAPI) usually carry the highest latency.
  3. Lower buffer size incrementally until just before dropouts start. That is your system’s practical floor.
  4. Do a loopback measurement. Connect the interface’s output to its input with a cable, play a test tone, and read the delay from the captured waveform in an audio editor.
  5. Match the sample rate to the project need. 48 kHz is typical for video work; 44.1/48 kHz for music. Go above only with a specific reason.
  6. Use the monitoring path wisely. Where possible, use the interface’s direct monitoring feature for the performer’s ears, which makes monitoring RTT effectively zero even if the recording path has several milliseconds of latency.
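The loopback measurement in step 4 boils down to finding the offset at which the recorded signal best matches the played one. A minimal pure-Python sketch of that idea, using simulated audio rather than a real capture (in practice you would play and record through the interface, then locate the delay in an editor or via cross-correlation):

```python
def measure_loopback_delay_ms(played, recorded, sample_rate):
    """Naive cross-correlation: return the lag (in ms) at which
    `recorded` best matches `played`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(recorded) - len(played) + 1):
        score = sum(p * recorded[lag + i] for i, p in enumerate(played))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag * 1000.0 / sample_rate

# Simulated loopback: the test burst reappears 240 samples later,
# which is 5 ms at 48 kHz.
rate = 48000
tone = [1.0, 0.8, -0.6, -0.9, 0.4]          # a short, distinctive test burst
recorded = [0.0] * 240 + tone + [0.0] * 100
print(measure_loopback_delay_ms(tone, recorded, rate))  # 5.0
```

With a real capture, whatever delay this reports is your true end-to-end RTT for that buffer size and driver — the number to compare against the product page.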

The Patrache Studio audio latency tool estimates round-trip latency in the browser, which makes it easy to get a ballpark reading without buying hardware. For the wider “input latency” picture in a gaming context, see Keyboard NKRO and Input Lag for Gaming. If you are tuning a video call setup where audio delay affects A/V sync, pair this guide with Webcam Diagnostics: Frame Rate, Resolution, and Lighting to understand the camera-side latency that stacks on top of audio.
