
Voice-First UI for Low-Literacy Populations: Our Design Approach

Published 21 April 2026 · 10 min read


Quick answer. Designing for low-literacy voice users means respecting six constraints: short turns (<10 seconds), repeat-on-demand at any point, numeric confirmation in local language, graceful code-switching, explicit human fallback, and zero reliance on reading. This post walks through our current design language and where we still have unknowns.

Why this matters

A voice UI that works for literate smartphone users does not automatically work for a user booking a doctor on a feature phone in their third language. The assumptions are different. This post is the design language we are writing down — openly, so others can improve on it.

Principle 1: keep turns short

Our working rule: a system-generated utterance longer than about 10 seconds loses half its listeners. Spoken output is serial and cannot be skimmed, so attention decays quickly. Our design language therefore forces every system turn to two clauses or fewer before inviting a response.

Bad: "You can book a consultation with Dr Grigoryan on Monday at 2pm, Tuesday at 10am, or Wednesday at 4pm. Which would you like?"

Better: "I have three times. Monday 2pm, Tuesday 10am, Wednesday 4pm. Which?"
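A turn-length rule is easy to state and easy to drift from, so it helps to lint system prompts before they reach TTS. A minimal sketch, assuming a rough speaking rate of 2.5 words per second and a hypothetical 12-word cap per clause (both thresholds are illustrative, not measured):

```python
import re

WORDS_PER_SECOND = 2.5     # rough speaking-rate assumption, not a measured constant
MAX_SECONDS = 10
MAX_CLAUSES = 2
MAX_WORDS_PER_CLAUSE = 12  # hypothetical cap that catches list-stuffed sentences

def lint_turn(utterance: str) -> bool:
    """Return True if a system turn obeys the short-turn rule."""
    est_seconds = len(utterance.split()) / WORDS_PER_SECOND
    # Split on sentence-final punctuation; a trailing question is the
    # invitation to respond, so it does not count as a content clause.
    clauses = [c for c in re.split(r"(?<=[.!?])\s+", utterance.strip()) if c]
    if clauses and clauses[-1].endswith("?"):
        clauses = clauses[:-1]
    return (est_seconds <= MAX_SECONDS
            and len(clauses) <= MAX_CLAUSES
            and all(len(c.split()) <= MAX_WORDS_PER_CLAUSE for c in clauses))
```

Under these thresholds the "Bad" example fails on its 19-word clause while the "Better" one passes.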

Principle 2: repeat is a first-class action

"Say again" (in every language) always works, at any point, no matter what else the system was doing. The system repeats the most recent user-facing utterance — not a summary, not a rephrasing, the actual last thing said. Low-literacy users frequently miss a word the first time; this is not a failure case, it is a normal flow.
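The mechanics are deliberately dumb: store the exact string sent to TTS and replay it on demand. A sketch (the trigger phrases, including the transliterated Armenian entry, are illustrative, not a real keyword list):

```python
# Per-language repeat triggers; illustrative, not exhaustive.
REPEAT_TRIGGERS = {"say again", "repeat", "norits asa"}

class DialogState:
    def __init__(self):
        self.last_utterance = ""

    def speak(self, text: str) -> str:
        # Store exactly what the user heard -- not a summary or rephrasing.
        self.last_utterance = text
        return text

    def handle(self, user_input: str):
        """Return the verbatim replay for a repeat request, else None so
        the caller routes the input to normal understanding."""
        if user_input.strip().lower() in REPEAT_TRIGGERS:
            return self.last_utterance
        return None
```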

Principle 3: numbers in local language

Prices, times and addresses are the highest-information content. These must be rendered in the user’s language with culturally appropriate formatting. "Two thousand dram" not "2000 AMD". "Four in the afternoon" not "16:00". Code-switching is OK for product names ("GeraClinic"); numbers are never code-switched.
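The spelling-out has to happen in text, before TTS, so the voice reads words rather than digits. A minimal English-only sketch; a production system would use a locale-aware number library per target language, and `render_price` / `render_time` are hypothetical helper names:

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

def spell(n: int) -> str:
    """Spell 0..999999 in English words."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rem = divmod(n, 10)
        return TENS[tens - 2] + ("-" + ONES[rem] if rem else "")
    if n < 1000:
        hundreds, rem = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + spell(rem) if rem else "")
    thousands, rem = divmod(n, 1000)
    return spell(thousands) + " thousand" + (" " + spell(rem) if rem else "")

def render_price(amount: int, currency_word: str = "dram") -> str:
    return f"{spell(amount)} {currency_word}"      # "two thousand dram", never "2000 AMD"

def render_time(hour24: int) -> str:
    period = ("in the morning" if hour24 < 12
              else "in the afternoon" if hour24 < 18 else "in the evening")
    return f"{spell(hour24 % 12 or 12)} {period}"  # 16 -> "four in the afternoon"
```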

Principle 4: numeric confirmation

Before any committing action (book, pay, cancel), the system asks for a numeric confirmation in the user’s language. "To confirm the booking, say the number three." This resists both mishearing and prompt-injection-style attacks via social-engineered speech.
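A sketch of the challenge-response, assuming the confirmation digit is generated fresh per transaction and matched against either the numeral or its word form (English mapping shown; a real deployment needs one map per language):

```python
import random

DIGIT_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9}

def make_challenge(rng: random.Random) -> tuple[int, str]:
    """Pick a fresh digit per committing action; a fixed digit would be
    trivially replayable by an attacker."""
    digit = rng.randint(1, 9)
    return digit, f"To confirm the booking, say the number {digit}."

def verify(expected: int, heard: str) -> bool:
    heard = heard.strip().lower()
    return heard == str(expected) or DIGIT_WORDS.get(heard) == expected
```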

Principle 5: graceful code-switching

Users in multilingual regions switch languages mid-sentence: "I want to book a doctor, hashu sirun e" begins in English and finishes in Armenian. The system should handle this without panicking. Our approach: the ASR and the LLM both operate over multilingual input without a language lock, and the TTS responds in whichever language the user most recently spoke. The user can lock the system to a specific language with a voice command.
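The response-language choice reduces to a small piece of state: the last language the ASR detected, overridable by an explicit lock. A sketch with hypothetical names:

```python
class LanguagePolicy:
    def __init__(self, default: str = "en"):
        self.lock = None            # set only by an explicit voice command
        self.last_heard = default   # updated on every recognized utterance

    def observe(self, detected_lang: str) -> None:
        # Fed by the ASR's per-utterance language guess, e.g. "hy" or "en".
        self.last_heard = detected_lang

    def set_lock(self, lang) -> None:
        # "Speak Armenian" -> set_lock("hy"); "unlock language" -> set_lock(None)
        self.lock = lang

    def tts_language(self) -> str:
        # Mirror the user's most recent language unless a lock is set.
        return self.lock or self.last_heard
```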

Principle 6: explicit human fallback

When the system gets stuck, the escalation is "say ‘human’ to speak to a person." This works in every language, at any point. The user is never trapped. Escalation latency matters — we target under 60 seconds to a human agent during business hours.
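Escalation has two triggers: the explicit keyword in any language, and repeated failures to understand. A sketch (the keyword set is illustrative, "mard" being transliterated Armenian for "person", and the three-strikes threshold is an assumed policy, not a stated one):

```python
HUMAN_KEYWORDS = {"human", "operator", "mard"}   # illustrative per-language triggers

class Escalator:
    def __init__(self, max_failures: int = 3):
        self.failures = 0
        self.max_failures = max_failures

    def should_escalate(self, user_input: str, understood: bool) -> bool:
        # The explicit keyword always wins, at any point in the dialog.
        if user_input.strip().lower() in HUMAN_KEYWORDS:
            return True
        # Otherwise escalate after consecutive misunderstandings, so the
        # user is never trapped in a retry loop.
        self.failures = 0 if understood else self.failures + 1
        return self.failures >= self.max_failures
```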

What we learned from IVR

Traditional IVR ("press 1 for support, press 2 for sales") taught three useful lessons and failed on everything else. The lessons: menus need no more than three options, users should never have to remember an option past the end of the turn, and repeats must be fast. What IVR did wrong: tree-shaped menus that could not handle free-form input; brittle grammars that broke on accent; no graceful fallback.

What we still don’t know

  • Best verification flow when the user has no literacy and no screen — voice-only proof of identity.
  • How to signal security-relevant state ("this is a secure payment step") audibly without falling back to tone patterns that will be mimicked by fraudsters.
  • How to accommodate users with hearing impairment alongside low-literacy users — these constraints genuinely conflict.

The underlying stack

Speech-to-text: multilingual Whisper-class models with fine-tuning on target languages. Text-to-speech: open-weight models where quality permits, commercial licences where they don’t. Reasoning: whatever LLM we are pointed at — GeraVoice is not model-locked. Commerce: downstream Gera verticals via GeraNexus.
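Because the reasoning layer is not model-locked, the stack reduces to swappable configuration. A hypothetical sketch of how the components above might be declared — every key and value here is illustrative, not GeraVoice's actual config:

```python
# Hypothetical stack configuration; all values are illustrative.
STACK = {
    "asr":      {"family": "whisper-multilingual", "finetune_langs": ["hy", "en"]},
    "tts":      {"prefer": "open-weight", "fallback": "commercial"},
    "llm":      {"provider": "pluggable"},   # not model-locked by design
    "commerce": {"gateway": "GeraNexus"},
}

def swap_llm(stack: dict, provider: str) -> dict:
    """Swapping the reasoning model touches one key and nothing else."""
    out = dict(stack)
    out["llm"] = {"provider": provider}
    return out
```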

Related

GeraClinic is the highest-priority integration for voice — booking and triage are naturally voice-driven. Pilot design drafts live at /research. Feedback welcome.

Help build voice-first commerce.

Join the waitlist