Protocol Deep-Dive: Voice Intents, Numeric Confirmation, Human Fallback
Published 21 April 2026 · 12 min read
Why voice is harder than it looks
A text interface lets the user correct mistakes with the delete key. A voice interface does not. A misheard number becomes a wrong booking, a missed negation becomes a wrong consent, a cut-off accent becomes a failed transaction. The protocol has to assume every ASR output is partially wrong and design a flow that is safe despite that.
Stage 1: the intent schema
The NLU layer does not emit free-form text. It emits a typed intent against a published schema. A GP-booking intent:
```json
{
  "intent": "book_appointment",
  "vertical": "geraclinic",
  "lang_detected": "hi-IN",
  "slots": {
    "location": { "value": "Leicester LE1", "confidence": 0.91 },
    "language_pref": { "value": "hi", "confidence": 0.96 },
    "time_window": { "value": "this-week", "confidence": 0.83 },
    "max_price_gbp": { "value": 100, "confidence": 0.88 }
  },
  "overall_confidence": 0.87
}
```

Every slot has its own confidence. The aggregate threshold for auto-progress is high (>0.85 per slot, >0.9 overall); anything lower routes to the confirmation stage with explicit re-reads of the shaky slots.
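The gating rule above can be sketched as a small routing function. This is an illustrative sketch, not the production NLU layer; the threshold constants come from the text, and the function name is an assumption.

```python
# Illustrative sketch of the confidence gate described above.
# Thresholds are from the text; the function name is an assumption.

PER_SLOT_MIN = 0.85   # each slot must exceed this
OVERALL_MIN = 0.90    # aggregate must exceed this

def route_intent(intent: dict) -> tuple[str, list[str]]:
    """Return ("auto", []) to auto-progress, or ("confirm", shaky_slots)
    to route to the confirmation stage with explicit re-reads."""
    shaky = [
        name for name, slot in intent["slots"].items()
        if slot["confidence"] <= PER_SLOT_MIN
    ]
    if not shaky and intent["overall_confidence"] > OVERALL_MIN:
        return ("auto", [])
    return ("confirm", shaky)
```

On the example intent, `time_window` (0.83) falls below the per-slot floor and the overall score (0.87) is below 0.9, so the call routes to confirmation with `time_window` re-read explicitly.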
Stage 2: numeric confirmation
Before money moves, the user confirms by pressing a digit (or saying it). The prompt is short, slow, and lists only the fields that matter:
“GP booking, Leicester, Hindi-speaking, this week, up to 100 pounds. Press 1 to confirm, 2 to change.”
Key design choices:
- Press a digit rather than saying "yes". ASR confuses "yes", "yeah", "no", and silence; a keypad tone is unambiguous.
- Read back numbers digit-by-digit. "One-zero-zero pounds", not "one hundred pounds": it cuts the chance the user is parsing the wrong magnitude.
- Keep prompts short. Aim for under eight seconds; users on 2G connections and cheap handsets pay real latency and attention costs for every extra second.
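The readback rules above can be sketched in a few lines. This is a hedged sketch of the prompt construction, assuming English output; the function names are illustrative, not the GeraVoice API.

```python
# Sketch of digit-by-digit readback and the short confirmation prompt.
# Function names are assumptions; production prompts are localised.

_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def digits(n: int) -> str:
    """Read a number digit-by-digit: 100 -> 'one-zero-zero'."""
    return "-".join(_WORDS[d] for d in str(n))

def confirmation_prompt(fields: list[str], max_price_gbp: int) -> str:
    """Build the short confirm prompt: only the fields that matter,
    price read digit-by-digit, digit-press confirmation."""
    summary = ", ".join(fields)
    return (f"{summary}, up to {digits(max_price_gbp)} pounds. "
            "Press 1 to confirm, 2 to change.")
```

For the example booking this yields "GP booking, Leicester, Hindi-speaking, this week, up to one-zero-zero pounds. Press 1 to confirm, 2 to change."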
Multi-stage confirmation for anything above a threshold
Transactions above a user-configurable cap trigger a second confirmation with a specific magnitude reading and a “this is more than usual, are you sure?” line. The cap is inherited from the user’s GeraMind preferences so they do not re-configure per product.
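The cap rule reduces to a one-line decision. A minimal sketch, assuming the GeraMind preference lookup is a plain dict and the default cap value is illustrative:

```python
# Sketch of the two-stage confirmation rule. The preference key and the
# default cap are assumptions, not the real GeraMind schema.

def confirmations_required(amount_gbp: float, prefs: dict) -> int:
    """One confirmation under the user's cap, two above it
    (the second adds the 'this is more than usual' line)."""
    cap = prefs.get("voice_txn_cap_gbp", 50)  # illustrative default
    return 2 if amount_gbp > cap else 1
```

Because the cap lives in the user's shared preferences, the same rule applies across products without per-product configuration.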
Stage 3: human fallback
The model refuses to proceed and routes to a trained human operator when any of the following occur:
- Overall confidence < 0.6.
- Two consecutive failed confirmations on the same field.
- Distress signals (crying, panic words, silence after a payment prompt).
- Medical-emergency language.
- User says “help” or the local-language equivalent.
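The escalation triggers above combine as a simple disjunction. This sketch is English-only for brevity; distress-signal names, the emergency keyword, and the "help" check stand in for the localised detectors the text implies.

```python
# Hedged sketch of the human-fallback triggers. Signal names and keyword
# checks are illustrative; production uses localised, trained detectors.

DISTRESS = {"crying", "panic", "silence_after_payment"}

def should_escalate(overall_conf: float, failed_confirms: int,
                    signals: set[str], transcript_tail: str) -> bool:
    words = transcript_tail.lower().split()
    return (
        overall_conf < 0.6
        or failed_confirms >= 2          # two failures on the same field
        or bool(signals & DISTRESS)
        or "emergency" in words          # stands in for medical-emergency NLU
        or "help" in words               # or local-language equivalent
    )
```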
The operator joins with full context: the intent slots so far, the caller's vault preferences (consent-scoped), and the transcript. The operator does not replay the conversation; they continue it. This is the single largest cost in the GeraVoice business model, and it is non-negotiable: a voice service without human fallback is a voice service that fails the people who most need it.
Audit: every stage signs a log entry
Each intent, each confirmation, each fallback produces a signed log entry with transcripts (opt-in, short-retention) and structured slots (long-retention, minimal). The user can request the log for any interaction; disputed charges open the log to an arbitrator under GeraNexus semantics.
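A signed entry can be sketched with a plain HMAC over the structured slots. The post does not specify the actual GeraNexus signature scheme, so HMAC-SHA256 with a service key is an assumption here; key management and the retention split are out of scope.

```python
# Minimal sketch of a signed log entry. HMAC-SHA256 is an assumption,
# not the documented GeraNexus scheme; retention policy is not modelled.
import hashlib
import hmac
import json
import time

def signed_log_entry(stage: str, slots: dict, key: bytes) -> dict:
    entry = {"stage": stage, "slots": slots, "ts": int(time.time())}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["sig"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return entry

def verify(entry: dict, key: bytes) -> bool:
    body = {k: v for k, v in entry.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["sig"], expected)
```

An arbitrator holding the key can then check that a disputed entry was produced by the service and not altered afterwards.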
What we refuse to do
- No voice-biometric authentication by default. Deepfakes make it unsafe as a primary factor.
- No auto-debit without a numeric confirmation on every transaction above the micro-tier.
- No aggressive up-sell or script injection. The prompt library is reviewed quarterly for dark patterns.
What we are still designing
Code-switching between languages mid-utterance (common in Armenia, Georgia, Uganda, India) is the hardest open problem. ASR models that handle Hindi-English switching well cost more than the margin on a £5 transaction. We are researching a per-call language-detection gate with an escalation to a bilingual human for ambiguous sessions.
Help build voice-first commerce.