Protocol Deep-Dive: Voice Intents, Numeric Confirmation, Human Fallback
Published 21 April 2026 · 12 min read
Why voice is harder than it looks
A text interface lets the user correct mistakes with the delete key. A voice interface does not. A misheard number becomes a wrong booking, a missed negation becomes a wrong consent, a cut-off accent becomes a failed transaction. The protocol has to assume every ASR output is partially wrong and design a flow that is safe despite that.
Stage 1: the intent schema
The NLU layer does not emit free-form text. It emits a typed intent against a published schema. A GP-booking intent:
```json
{
  "intent": "book_appointment",
  "vertical": "geraclinic",
  "lang_detected": "hi-IN",
  "slots": {
    "location": { "value": "Leicester LE1", "confidence": 0.91 },
    "language_pref": { "value": "hi", "confidence": 0.96 },
    "time_window": { "value": "this-week", "confidence": 0.83 },
    "max_price_gbp": { "value": 100, "confidence": 0.88 }
  },
  "overall_confidence": 0.87
}
```

Every slot has its own confidence. The aggregate threshold for auto-progress is high (>0.85 per slot, >0.9 overall); anything lower routes to the confirmation stage with explicit re-reads of the shaky slots.
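The gating rule above can be sketched as a small routing function. This is an illustrative sketch, not the production NLU layer; the threshold constants come from the text, and the function name is an assumption.

```python
# Illustrative sketch of the confidence gate described above.
# Thresholds are from the text; the function name is an assumption.

PER_SLOT_MIN = 0.85   # each slot must exceed this
OVERALL_MIN = 0.90    # aggregate must exceed this

def route_intent(intent: dict) -> tuple[str, list[str]]:
    """Return ("auto", []) to auto-progress, or ("confirm", shaky_slots)
    to route to the confirmation stage with explicit re-reads."""
    shaky = [
        name for name, slot in intent["slots"].items()
        if slot["confidence"] <= PER_SLOT_MIN
    ]
    if not shaky and intent["overall_confidence"] > OVERALL_MIN:
        return ("auto", [])
    return ("confirm", shaky)
```

On the example intent, `time_window` (0.83) falls below the per-slot floor and the overall score (0.87) is below 0.9, so the call routes to confirmation with `time_window` re-read explicitly.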
Stage 2: numeric confirmation
Before money moves, the user confirms by pressing a digit (or saying it). The prompt is short, slow, and lists only the fields that matter:
“GP booking, Leicester, Hindi-speaking, this week, up to 100 pounds. Press 1 to confirm, 2 to change.”
Key design choices:
- Press a digit rather than saying "yes". ASR confuses "yes", "yeah", "no", and silence; a keypad tone is unambiguous.
- Read back numbers digit-by-digit. "One-zero-zero pounds", not "one hundred pounds": it cuts the chance the user is parsing the wrong magnitude.
- Keep prompts short. Aim for under eight seconds; users on 2G connections and cheap handsets pay real latency and attention costs for every extra second.
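The readback rules above can be sketched in a few lines. This is a hedged sketch of the prompt construction, assuming English output; the function names are illustrative, not the GeraVoice API.

```python
# Sketch of digit-by-digit readback and the short confirmation prompt.
# Function names are assumptions; production prompts are localised.

_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def digits(n: int) -> str:
    """Read a number digit-by-digit: 100 -> 'one-zero-zero'."""
    return "-".join(_WORDS[d] for d in str(n))

def confirmation_prompt(fields: list[str], max_price_gbp: int) -> str:
    """Build the short confirm prompt: only the fields that matter,
    price read digit-by-digit, digit-press confirmation."""
    summary = ", ".join(fields)
    return (f"{summary}, up to {digits(max_price_gbp)} pounds. "
            "Press 1 to confirm, 2 to change.")
```

For the example booking this yields "GP booking, Leicester, Hindi-speaking, this week, up to one-zero-zero pounds. Press 1 to confirm, 2 to change."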
Multi-stage confirmation for anything above a threshold
Transactions above a user-configurable cap trigger a second confirmation with a specific magnitude reading and a “this is more than usual, are you sure?” line. The cap is inherited from the user’s GeraMind preferences so they do not re-configure per product.
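The cap rule reduces to a one-line decision. A minimal sketch, assuming the GeraMind preference lookup is a plain dict and the default cap value is illustrative:

```python
# Sketch of the two-stage confirmation rule. The preference key and the
# default cap are assumptions, not the real GeraMind schema.

def confirmations_required(amount_gbp: float, prefs: dict) -> int:
    """One confirmation under the user's cap, two above it
    (the second adds the 'this is more than usual' line)."""
    cap = prefs.get("voice_txn_cap_gbp", 50)  # illustrative default
    return 2 if amount_gbp > cap else 1
```

Because the cap lives in the user's shared preferences, the same rule applies across products without per-product configuration.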
Stage 3: human fallback
The model refuses to proceed and routes to a trained human operator when any of the following occur:
- Overall confidence < 0.6.
- Two consecutive failed confirmations on the same field.
- Distress signals (crying, panic words, silence after a payment prompt).
- Medical-emergency language.
- User says “help” or the local-language equivalent.
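The escalation triggers above combine as a simple disjunction. This sketch is English-only for brevity; distress-signal names, the emergency keyword, and the "help" check stand in for the localised detectors the text implies.

```python
# Hedged sketch of the human-fallback triggers. Signal names and keyword
# checks are illustrative; production uses localised, trained detectors.

DISTRESS = {"crying", "panic", "silence_after_payment"}

def should_escalate(overall_conf: float, failed_confirms: int,
                    signals: set[str], transcript_tail: str) -> bool:
    words = transcript_tail.lower().split()
    return (
        overall_conf < 0.6
        or failed_confirms >= 2          # two failures on the same field
        or bool(signals & DISTRESS)
        or "emergency" in words          # stands in for medical-emergency NLU
        or "help" in words               # or local-language equivalent
    )
```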
The operator joins with full context: the intent slots so far, the caller's vault preferences (consent-scoped), and the transcript. The operator does not replay the conversation; they continue it. This is the single largest cost in the GeraVoice business model, and it is non-negotiable: a voice service without human fallback is a voice service that fails the people who most need it.
Audit: every stage signs a log entry
Each intent, each confirmation, each fallback produces a signed log entry with transcripts (opt-in, short-retention) and structured slots (long-retention, minimal). The user can request the log for any interaction; disputed charges open the log to an arbitrator under GeraNexus semantics.
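A signed entry can be sketched with a plain HMAC over the structured slots. The post does not specify the actual GeraNexus signature scheme, so HMAC-SHA256 with a service key is an assumption here; key management and the retention split are out of scope.

```python
# Minimal sketch of a signed log entry. HMAC-SHA256 is an assumption,
# not the documented GeraNexus scheme; retention policy is not modelled.
import hashlib
import hmac
import json
import time

def signed_log_entry(stage: str, slots: dict, key: bytes) -> dict:
    entry = {"stage": stage, "slots": slots, "ts": int(time.time())}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["sig"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return entry

def verify(entry: dict, key: bytes) -> bool:
    body = {k: v for k, v in entry.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["sig"], expected)
```

An arbitrator holding the key can then check that a disputed entry was produced by the service and not altered afterwards.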
What we refuse to do
- No voice-biometric authentication by default. Deepfakes make it unsafe as a primary factor.
- No auto-debit without a numeric confirmation on every transaction above the micro-tier.
- No aggressive up-sell or script injection. The prompt library is reviewed quarterly for dark patterns.
What we are still designing
Code-switching between languages mid-utterance (common in Armenia, Georgia, Uganda, India) is the hardest open problem. ASR models that handle Hindi-English switching well cost more than the margin on a £5 transaction. We are researching a per-call language-detection gate with an escalation to a bilingual human for ambiguous sessions.
Help build voice-first commerce.