FAQ: Common Objections to Voice Commerce, Answered
Published 21 April 2026 · 6 min read
1. “The model will mishear a number.”
Yes. We require digit-by-digit readback and press-a-key confirmation for any number that affects the transaction (price, quantity, phone number, date). Number confusion is the single largest failure class and it is designed around, not designed past.
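The readback-and-confirm step above can be sketched roughly as follows. This is an illustrative sketch, not the production code: the word mapping and the "press 1 to confirm" convention are assumptions for the example.

```python
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
               ".": "point"}

def readback_digits(value: str) -> str:
    """Render a transaction-critical number digit by digit for TTS,
    so '42.50' is spoken as 'four two point five zero', never 'forty-two fifty'."""
    return " ".join(DIGIT_WORDS[ch] for ch in value if ch in DIGIT_WORDS)

def confirmed(keypress: str) -> bool:
    """A '1' keypress confirms; any other input aborts the transaction."""
    return keypress == "1"
```

The point of the digit-by-digit form is that a single misheard digit changes the readback audibly, so the user catches it before pressing a key.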
2. “Deepfakes will defeat voice auth.”
Agreed. We refuse voice biometrics as a primary authentication factor. Identity is established via the user’s phone number + a vault-issued consent token; voice itself is never the trust anchor.
3. “Voice in public leaks private information.”
Real concern. Two mitigations: sensitive fields (amounts, account numbers) are read back in code rather than plain (“your balance ends in four-two”), and a discreet-mode setting replaces spoken confirmation with DTMF input where possible.
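The coded readback can be illustrated with a short sketch. The two-digit tail and the wording are assumptions for the example; the real system presumably exposes both as configuration.

```python
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def masked_readback(sensitive_value: str, visible: int = 2) -> str:
    """Speak only the trailing digits of a sensitive value, for use in
    public settings: '£1,342' -> 'ends in four-two'."""
    digits = [c for c in sensitive_value if c.isdigit()]
    tail = digits[-visible:]
    return "ends in " + "-".join(DIGIT_WORDS[d] for d in tail)
```

An eavesdropper hears enough for the user to verify the value, but not the value itself.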
4. “My users code-switch between languages.”
Known. Mid-utterance code-switching (e.g. Hindi to English, or Swahili to English) is partly supported today via bilingual ASR models; ambiguous sessions route to a bilingual human. This is an active research area.
5. “2G latency makes voice AI unusable.”
The end-to-end target is a 1.2-second response budget on 2G. We hit it with streaming ASR, pre-compiled TTS for common responses, and aggressive turn-taking design. Not perfect, but good enough for short transactional flows.
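One way to keep a budget like this honest is to track it per stage. The stage names and the millisecond split below are invented for illustration; only the 1.2-second total comes from the text.

```python
BUDGET_MS = 1200  # end-to-end response target on 2G (from the text)

# Hypothetical allocation across the pipeline stages (assumed split).
STAGE_BUDGET_MS = {
    "network_uplink": 250,
    "streaming_asr": 300,      # partial transcripts arrive while the user speaks
    "dialogue_logic": 150,
    "tts_cache_lookup": 100,   # pre-compiled audio for common responses
    "network_downlink": 250,
    "playback_start": 150,
}

def within_budget(stages: dict, budget_ms: int = BUDGET_MS) -> bool:
    """True if the per-stage allocations fit inside the end-to-end budget."""
    return sum(stages.values()) <= budget_ms
```

The value of the exercise is that every new feature has to buy its milliseconds from an existing stage rather than silently inflating the total.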
6. “ASR is biased against African and South Asian accents.”
Known and measured. Our per-language error-rate dashboards are shared with operators. Where the gap is above a threshold, we either invest in targeted training data, route to a human faster, or refuse the transaction. We publish the numbers.
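The three-way policy described above (invest, route faster, or refuse) can be sketched as a threshold decision. The word-error-rate thresholds here are illustrative placeholders, not the published numbers.

```python
def route_by_error_rate(wer: float, threshold: float = 0.15) -> str:
    """Decision sketch for a per-language/accent error-rate dashboard:
    - at or below the threshold: the automated flow is acceptable
    - moderately above: route to a human sooner
    - far above: refuse the automated transaction entirely
    Thresholds are assumptions for this example."""
    if wer <= threshold:
        return "automated"
    if wer <= 2 * threshold:
        return "human_fast_path"
    return "refuse"
```

Keeping the rule this simple is deliberate: operators reading the shared dashboards can predict exactly how the system will behave for their user base.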
7. “How do you get consent without a screen?”
Short, scripted, read-back confirmations with a digit press. For sensitive scopes we send an SMS to the number on record with a one-tap consent link; the voice call is the prompt, the SMS is the audit-quality consent.
8. “Voice minutes are expensive; the margin is gone.”
True for long conversations. The flow design keeps transactions under 90 seconds median; complex cases escalate to a human who costs more per second but resolves faster. Net margin is tight but positive for transactions above ~£2. Below that, the service is subsidised strategically.
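The break-even claim can be made concrete with toy unit economics. The take rate and per-minute voice cost below are invented to illustrate the shape of the calculation; only the ~£2 break-even and the sub-90-second median come from the text.

```python
def transaction_margin(value_gbp: float, seconds: float,
                       take_rate: float = 0.015,
                       voice_cost_per_min_gbp: float = 0.02) -> float:
    """Illustrative margin: percentage take on the transaction value,
    minus metered voice minutes. All rates are assumptions."""
    revenue = value_gbp * take_rate
    cost = (seconds / 60.0) * voice_cost_per_min_gbp
    return revenue - cost
```

With these assumed rates, a 90-second call breaks even at exactly £2, which is why everything below that line has to be subsidised and everything above it depends on keeping calls short.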
9. “What if the model is wrong and someone is harmed?”
The protocol is designed to refuse rather than risk. Low confidence routes to a human; distress language routes to a human; medical-emergency words route to a human. Liability on transactions is covered via escrowed payments and the dispute flow from GeraNexus.
10. “Elderly users have voices the ASR cannot read.”
Known. Model fine-tuning for older-speaker acoustic profiles is an active commitment. Where ASR fails, the human fallback is faster and kinder. We consider over-eager escalation to a human a feature, not a bug, for this cohort.
11. “When does the AI refuse?”
Published list: medical emergency, safeguarding signals, fraud patterns, unusually large transactions, contested re-reads, user distress. All escalate to a human or a clinician. Refusal is never silent — the user is told and routed.
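The published refusal list maps naturally onto an escalation table. The trigger keys and the handoff wording are assumptions for this sketch; the categories themselves come from the list above.

```python
# Escalation targets for each published refusal trigger.
ESCALATION_RULES = {
    "medical_emergency": "clinician",
    "safeguarding_signal": "human",
    "fraud_pattern": "human",
    "large_transaction": "human",
    "contested_reread": "human",
    "user_distress": "human",
}

def refuse_and_route(trigger: str) -> str:
    """Refusal is never silent: the user always hears why the flow is
    stopping and who they are being handed to."""
    target = ESCALATION_RULES.get(trigger, "human")  # default: a human
    return f"I can't continue with this by voice. Connecting you to a {target} now."
```

Defaulting unknown triggers to a human keeps the fail-safe property: a new risk category that hasn't been classified yet still escalates rather than proceeding.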
12. “Is this better than the app I already built?”
For users who can and will use the app, no. For the 2 billion who cannot or will not, yes. Voice is not a replacement for touch — it is the complement that closes the access gap.
Help build voice-first commerce.
Join the waitlist