Why publish these

Voice commerce in emerging markets is a space where wrong design choices do not just inconvenience users; they exclude them. We would rather draft in public than quietly pick the wrong defaults.

1. Accent robustness

ASR accuracy tracks how well a language and accent are represented in the training data. A Kenyan Swahili speaker typically gets worse recognition than a British English speaker, even though the commercial stakes of a misrecognition are high: a mis-transcribed price is a mis-transaction.

Current thinking: fine-tune the base ASR on target-language corpora that are collected with consent and paid for. For languages with thin public datasets, we are considering a small paid data-collection programme in pilot markets.

Want input on: ethics and methodology for collecting target-language speech data without exploitation. Researchers in this space, we want to talk.

2. Code-switching without language lock

Users switch languages mid-sentence. The ASR has to handle it. The LLM has to handle it. The TTS has to choose which language to reply in. Getting this wrong is the single biggest irritation in multilingual voice assistants today.

Current thinking: no language lock by default; the system responds in the language of the most recent clear user utterance. Users can lock a language with a voice command if they prefer.
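
A minimal sketch of that rule, assuming the ASR or a language-ID step returns a language tag and a confidence score for each utterance. The names `Utterance` and `ReplyLanguagePolicy` and the 0.8 threshold are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    text: str
    language: str      # language tag from language ID, e.g. "sw" or "en" (assumed input)
    confidence: float  # language-ID confidence in [0, 1] (assumed input)


class ReplyLanguagePolicy:
    """Reply in the language of the most recent clear utterance, unless locked."""

    def __init__(self, default_language: str, clear_threshold: float = 0.8):
        self.current = default_language
        self.locked: Optional[str] = None
        self.clear_threshold = clear_threshold  # illustrative value, not tuned

    def lock(self, language: str) -> None:
        """Called when the user asks, by voice command, to stay in one language."""
        self.locked = language

    def unlock(self) -> None:
        self.locked = None

    def observe(self, utterance: Utterance) -> str:
        """Update state from one user utterance and return the reply language."""
        if self.locked is not None:
            return self.locked
        # Only a clear utterance moves the reply language; an ambiguous one
        # (for example a single borrowed word) leaves it where it was.
        if utterance.confidence >= self.clear_threshold:
            self.current = utterance.language
        return self.current
```

In this sketch an ambiguous utterance never flips the reply language, which is one possible answer to the borrowed-word problem, not a settled one.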

Want input on: principled ways to handle ambiguity — "which language is this borrowed word in?" is sometimes unanswerable.

3. Voice-based fraud resistance

Voice cloning is cheap. A convincing imitation of a family member asking for money takes seconds to generate. Even if we do not fall for the attack ourselves, we must not enable the attacker.

Current thinking: numeric confirmations in-band, out-of-band payment-link confirmations for non-trivial amounts, hard rate-limits on amount-per-day, merchant-categorised risk scoring. No single one of these is enough on its own.
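
One way these layers might combine into a single decision is sketched below. Everything here is an assumption for illustration: the `PaymentPolicy` and `decide` names, the thresholds, and the idea that merchant risk arrives as a score between 0 and 1. Amounts are integers in minor currency units to avoid floating-point rounding.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    IN_BAND_CONFIRM = "in_band_confirm"          # user repeats a numeric code on the call
    OUT_OF_BAND_CONFIRM = "out_of_band_confirm"  # payment link sent outside the call
    REJECT = "reject"


@dataclass
class PaymentPolicy:
    # All thresholds are placeholders, not recommended values.
    out_of_band_above: int = 2_000   # minor units; above this, confirm out of band
    daily_limit: int = 10_000        # minor units; hard cap on spend per day
    high_risk_score: float = 0.7     # merchant risk at or above this escalates


def decide(policy: PaymentPolicy, amount: int, spent_today: int,
           merchant_risk: float) -> Decision:
    """Apply the hard daily cap first, then pick a confirmation channel."""
    if amount <= 0 or spent_today + amount > policy.daily_limit:
        return Decision.REJECT
    if amount > policy.out_of_band_above or merchant_risk >= policy.high_risk_score:
        return Decision.OUT_OF_BAND_CONFIRM
    return Decision.IN_BAND_CONFIRM
```

The daily cap is checked first so that no confirmation step, however strong, can lift the ceiling on what a cloned voice could move in a day.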

Want input on: fraud teams operating voice-authenticated systems in banking. We would like to compare notes.

4. Meaningful consent without a screen

GDPR and equivalent regulations assume a screen. Informed consent is fundamentally harder when there is no visual contract to read.

Current thinking: recorded consent flows with user-repeat-back, plain-language scripts, post-call delivery of a written summary (SMS where literacy supports it, a phone callback otherwise). Revocation flows that are as easy as the original consent.
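
As a rough illustration of what such a flow might need to store, here is a sketch of a consent record with a single-call revocation path. The field names, the `recording_ref` pointer and the `choose_summary_channel` helper are hypothetical. How to judge whether SMS suits a given user is left open, so it appears here as a plain boolean input.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


def choose_summary_channel(can_read_sms: bool) -> str:
    """SMS where literacy supports it, a phone callback otherwise."""
    return "sms" if can_read_sms else "callback"


@dataclass
class ConsentRecord:
    user_id: str
    purpose: str          # the plain-language purpose that was read to the user
    recording_ref: str    # pointer to the recorded consent audio (hypothetical)
    summary_channel: str  # "sms" or "callback"
    granted_at: datetime = field(default_factory=_now)
    revoked_at: Optional[datetime] = None

    @property
    def active(self) -> bool:
        return self.revoked_at is None

    def revoke(self) -> None:
        """Revocation is one call, mirroring how easy granting was."""
        if self.revoked_at is None:
            self.revoked_at = _now()
```

Keeping the first revocation timestamp, rather than overwriting it on repeat requests, preserves an accurate audit trail.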

Want input on: regulators or privacy lawyers who’ve looked specifically at voice-first consent shapes.

5. Silence, noise and interruption

Real calls happen in noisy environments with variable silence. The system has to know when the user has stopped speaking without cutting them off, and when the user is listening versus distracted.

Current thinking: voice-activity detection tuned per-environment, pause tolerances that adapt, explicit "are you still there" prompts after long silences.
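
A minimal sketch of an adapting pause tolerance, assuming an upstream voice-activity detector that emits one speech/silence decision per fixed-length frame. The `AdaptiveEndpointer` class and all of its default values are hypothetical and untuned. The idea is that whenever the user resumes after a noticeable pause, the tolerance drifts toward a multiple of that pause.

```python
class AdaptiveEndpointer:
    """Decides end-of-turn from per-frame VAD output with an adaptive pause tolerance."""

    def __init__(self, frame_ms=20, initial_tolerance_ms=700, min_ms=400,
                 max_ms=2000, margin=1.5, alpha=0.3, min_pause_ms=200):
        self.frame_ms = frame_ms
        self.tolerance_ms = initial_tolerance_ms
        self.min_ms, self.max_ms = min_ms, max_ms
        self.margin = margin              # tolerance target = margin * observed pause
        self.alpha = alpha                # smoothing factor for the moving average
        self.min_pause_ms = min_pause_ms  # shorter gaps are treated as gaps between words
        self.silence_ms = 0
        self.heard_speech = False

    def reset(self):
        """Start a new turn; the learned tolerance is kept."""
        self.silence_ms = 0
        self.heard_speech = False

    def push(self, is_speech: bool) -> bool:
        """Feed one VAD frame; return True once the turn is considered finished."""
        if is_speech:
            if self.heard_speech and self.silence_ms >= self.min_pause_ms:
                # The user resumed after a real pause: move the tolerance toward it.
                target = self.margin * self.silence_ms
                blended = (1 - self.alpha) * self.tolerance_ms + self.alpha * target
                self.tolerance_ms = min(self.max_ms, max(self.min_ms, blended))
            self.heard_speech = True
            self.silence_ms = 0
            return False
        if not self.heard_speech:
            return False  # leading silence is not the end of a turn
        self.silence_ms += self.frame_ms
        return self.silence_ms >= self.tolerance_ms
```

A caller would invoke `reset()` after each completed turn and could layer the "are you still there" prompt on top by watching for long leading silence.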

Want input on: turn-taking research groups, especially those who’ve studied cross-cultural differences in conversational pacing. Pauses mean different things in different cultures.

6. Human fallback at scale

Every voice system needs a human fallback. In a pilot market of 10,000 users, two human agents suffice. In a scaled market of 10 million, the economics are different: at the same ratio of one agent per 5,000 users, that would mean about 2,000 agents.

Current thinking: tiered fallback — AI primary, AI with human-in-the-loop for low-confidence turns, full human only when needed. Training agents to continue mid-conversation without losing state.
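
The tiers could be routed per turn roughly as follows. This is a sketch under stated assumptions: the system exposes a per-turn confidence score in [0, 1], the two thresholds are placeholders, and "only when needed" is read here as "the user asks for a person or confidence is very low". `HandoffState` is a hypothetical container for what an agent would need to pick up mid-conversation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Tier(Enum):
    AI_ONLY = "ai_only"
    AI_WITH_HUMAN_REVIEW = "ai_with_human_review"
    HUMAN = "human"


def route_turn(confidence: float, user_asked_for_human: bool = False,
               review_below: float = 0.75, human_below: float = 0.40) -> Tier:
    """Pick the cheapest tier that the turn's confidence allows."""
    if user_asked_for_human or confidence < human_below:
        return Tier.HUMAN
    if confidence < review_below:
        return Tier.AI_WITH_HUMAN_REVIEW
    return Tier.AI_ONLY


@dataclass
class HandoffState:
    language: str
    transcript: List[str] = field(default_factory=list)  # user turns so far
    slots: Dict[str, str] = field(default_factory=dict)  # details already captured

    def briefing(self) -> str:
        """One-line summary so the agent does not make the user start over."""
        filled = ", ".join(f"{k}={v}" for k, v in self.slots.items()) or "nothing captured yet"
        last = self.transcript[-1] if self.transcript else "(no turns yet)"
        return f"[{self.language}] known so far: {filled}; last user turn: {last}"
```

Where the thresholds sit determines how many turns reach a person, and therefore how many agents a market of a given size needs.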

How to help

Research drafts are at /research. The waitlist is open. If you operate voice-first products in Africa, South Asia or Latin America, we would especially like to hear what you have learned.