Engineering · 6 min read

We don't touch your audio. Here's how the architecture works.

Live mock interview audio streams browser ↔ Gemini directly. Our backend mints an ephemeral token and steps out of the way.

The Cruto team · February 18, 2026

In the most common voice-AI architecture, the browser sends audio to your server, the server forwards it to the model, the model responds, and the server forwards the response back. The server sits on the audio path.

That works, but it has two costs we didn't want to pay. First, it doubles the audio bandwidth bill — we'd be paying for both ingress (browser → us) and egress (us → Gemini). Second, every audio frame transits our servers, which means we're now in the chain of custody for the candidate's voice.

The alternative

Our backend doesn't touch the audio. When a candidate starts a mock interview, the backend mints an ephemeral token (a short-lived auth grant scoped to that one session) and hands it to the browser. The browser opens a direct WebSocket to Gemini Live with the token. Audio frames flow browser ↔ Gemini, never through us.
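A minimal sketch of the minting step, to show how little the backend has to do. Everything named here is an assumption for illustration: `issue_token` stands in for whatever call the model provider exposes for ephemeral tokens, and the `TokenGrant` fields, the single-use limit, and the 60-second lifetime are placeholder choices, not our production values or Gemini's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional
import secrets


@dataclass(frozen=True)
class TokenGrant:
    """What the backend asks the provider for (hypothetical shape)."""
    session_id: str       # the one interview session this grant covers
    expires_at: datetime  # hard expiry; the token is useless afterwards
    uses: int             # how many connections the token may open


def mint_ephemeral_token(
    session_id: str,
    issue_token: Callable[[TokenGrant], str],
    now: Optional[datetime] = None,
    ttl_seconds: int = 60,
) -> dict:
    """Build a short-lived, single-session grant and return what the browser needs."""
    now = now or datetime.now(timezone.utc)
    grant = TokenGrant(
        session_id=session_id,
        expires_at=now + timedelta(seconds=ttl_seconds),
        uses=1,
    )
    token = issue_token(grant)  # the only call that leaves our backend
    # The browser receives the token and its expiry, never a long-lived API key.
    return {"token": token, "expires_at": grant.expires_at.isoformat()}


def fake_issuer(grant: TokenGrant) -> str:
    """Stand-in for the provider call so the sketch runs offline."""
    return "tok_" + secrets.token_urlsafe(16)
```

The browser then uses the returned token to open its own connection; the backend holds no socket and sees no audio frame.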

The transcript is the one thing that does reach us: Gemini posts the text to a webhook on our backend, and we persist it to Postgres. That's the only thing we store. Raw audio never lands on our infrastructure at all.
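A rough sketch of what the webhook handler could look like. The payload fields (`session_id`, `seq`, `speaker`, `text`) and the table layout are assumptions made up for this example, and `sqlite3` is used only so the snippet runs without a database server; against Postgres the same idea would be an `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """In-memory stand-in for the transcript table (Postgres in production)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE transcript_chunks (
            session_id TEXT    NOT NULL,
            seq        INTEGER NOT NULL,
            speaker    TEXT    NOT NULL,
            text       TEXT    NOT NULL,
            PRIMARY KEY (session_id, seq)
        )
        """
    )
    return conn


def handle_transcript_webhook(conn: sqlite3.Connection, payload: dict) -> bool:
    """Persist one text chunk. Returns True if stored, False if it was a duplicate.

    Only text fields are read from the payload; a missing field raises KeyError
    so malformed deliveries fail loudly instead of being stored half-empty.
    """
    row = (
        payload["session_id"],
        int(payload["seq"]),
        payload["speaker"],
        payload["text"],
    )
    # Webhooks are commonly retried, so the insert is idempotent on (session_id, seq).
    cur = conn.execute(
        "INSERT OR IGNORE INTO transcript_chunks VALUES (?, ?, ?, ?)", row
    )
    conn.commit()
    return cur.rowcount == 1
```

Keying on a per-session sequence number is a design choice of this sketch: it makes retried deliveries harmless and gives later steps an ordering to work with.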

What the candidate sees

  • The waveform indicator — that's the browser, doing local volume detection on the mic stream before it sends.
  • The transcript — that's Gemini, returning text via WebSocket as the candidate speaks.
  • The "session ended" debrief — that's our backend running a separate scoring pass over the transcript.
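The local volume detection behind the waveform is just arithmetic on the mic samples. The sketch below shows one common way to do it, a root-mean-square level over a frame of samples in the range -1.0 to 1.0. It is written in Python to illustrate the calculation; in the browser the same math would run over the audio buffer in JavaScript. The function names and the 0.02 threshold are assumptions for this example.

```python
import math
from typing import Sequence


def rms_level(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of one audio frame (0.0 for an empty frame)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speaking(samples: Sequence[float], threshold: float = 0.02) -> bool:
    """True when the frame is loud enough to animate the waveform indicator."""
    return rms_level(samples) >= threshold
```

Because this runs on the device before anything is sent, the indicator keeps working even when the network is slow.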

What it costs us

About 20% less bandwidth than the proxied architecture. Zero audio storage, because we don't store any. About a 90% reduction in PII surface: the only PII we ever see is the text transcript, and that's encrypted at rest.

About one annoying edge case: when the candidate's network drops mid-session, we have to detect the gap from missing transcript chunks rather than from a server-side disconnect. Worth it.
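One way to sketch that gap detection, assuming the backend records an arrival time (in seconds) for each transcript chunk. The function name and the 10-second threshold are assumptions for this example, and a real detector would also need to tell a dropped network apart from a candidate who is simply thinking quietly.

```python
from typing import List, Tuple


def find_gaps(
    arrival_times: List[float],
    now: float,
    max_silence: float = 10.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where no transcript chunk arrived for too long.

    The current time is appended as a final checkpoint so that a session which
    has gone quiet right now is reported, not only gaps that already closed.
    """
    checkpoints = sorted(arrival_times) + [now]
    gaps = []
    for earlier, later in zip(checkpoints, checkpoints[1:]):
        if later - earlier > max_silence:
            gaps.append((earlier, later))
    return gaps
```

For example, `find_gaps([0.0, 2.0, 4.0, 30.0], now=33.0)` reports the single gap `(4.0, 30.0)`, while a session whose last chunk arrived at 2.0 and is checked at 20.0 reports `(2.0, 20.0)`.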

