When you think of voice assistants, you probably picture Siri, Alexa, or Google Assistant chattering away in English, Mandarin, or Spanish. But what if your language is spoken by fewer than a million people scattered across a volcanic archipelago in the middle of the Atlantic? That's the engineering challenge we took on when we started building Vozinha - a voice assistant designed for Cabo Verde and its native Kabuverdianu Creole. Building a voice assistant for a language spoken by only 1.5 million people taught us more about NLP than any textbook ever could.
Cabo Verde (also known as Cape Verde) is a country of ten islands, each with its own dialectal twist, and a diaspora that outnumbers the home population. Portuguese is the official language, but everyday life runs on Creole. For years, the tech ecosystem in Cabo Verde was practically invisible on the global stage. Smartphones were common, but the software was imported - literally. No voice assistant understood how a resident of São Vicente says "good morning" compared to someone from Santiago. That disconnect is both a cultural barrier and a technical opportunity, and Vozinha was our attempt to close it.
This article walks through the entire journey - from linguistic analysis through data collection and model training to real-world deployment - with hard lessons for anyone building for low-resource languages. You'll see code snippets, framework names, and honest numbers. If you've ever wondered why African languages are underrepresented in commercial AI, or if you're an engineer looking to build your own voice interface, this is for you.
The Linguistic Puzzle: Kabuverdianu Creole and Its Dialects
Kabuverdianu is a Portuguese-based creole with a rich oral tradition. It has no single written standard: government documents are in Portuguese, and Creole is mostly spoken or written informally using a variety of orthographies (ALUPEC, the official alphabet, is taught in schools but inconsistently used). For an ASR (automatic speech recognition) system, this is a nightmare. You can't simply crawl the web for training text because the majority of text available online is Portuguese, not Creole.
Moreover, the two main dialect groups (Sotavento in the south, Barlavento in the north) differ in phonology, vocabulary, and grammar. The word for "house" is "casa" in Santiago (Sotavento) but "kaza" in São Vicente (Barlavento). A model trained on one dialect will fail on the other. Some dialect pairs are about as mutually intelligible as Spanish and Italian - close, but definitely not the same.
Our first decision: should we build a single multi-dialect model or separate per-island models? We went with a single unified model using dialect labels as additional input features, similar to how you might train speaker-adaptive models.
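One simple way to feed dialect labels to a single model is to append a one-hot dialect vector to every acoustic frame. The sketch below assumes that approach; the label set, function names, and toy feature values are hypothetical and are not taken from the Vozinha codebase.

```python
from typing import List, Sequence

# Hypothetical label set: the two main dialect groups named in the text.
DIALECTS = ("sotavento", "barlavento")


def one_hot(dialect: str, labels: Sequence[str] = DIALECTS) -> List[float]:
    """Return a one-hot vector for a dialect label."""
    if dialect not in labels:
        raise ValueError(f"unknown dialect: {dialect!r}")
    return [1.0 if dialect == name else 0.0 for name in labels]


def add_dialect_features(frames: List[List[float]], dialect: str) -> List[List[float]]:
    """Append the dialect one-hot vector to every acoustic frame."""
    tag = one_hot(dialect)
    return [list(frame) + tag for frame in frames]


# Toy 3-dimensional frames standing in for real acoustic features.
frames = [[0.12, -0.40, 0.88], [0.05, 0.31, -0.27]]
conditioned = add_dialect_features(frames, "barlavento")
print(conditioned[0])
```

In a neural model the one-hot vector would more likely be replaced by a learned embedding, but the conditioning idea is the same.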
Reference: The ALUPEC alphabet is described in the official ALUPEC documentation. We also consulted the Ethnologue entry for Kabuverdianu (kea) to understand speaker distribution.
Why Vozinha Matters: Bridging the Digital Divide in Cabo Verde
Most commercial voice assistants ignore languages with fewer than, say, 50 million speakers. That's not malice - it's economics. The cost of collecting high-quality speech data, training models, and maintaining infrastructure scales poorly for small user bases. But the impact of being left out is real. In Cabo Verde, elderly citizens who speak only Creole often struggle with mobile banking, health information apps, and government services that assume Portuguese literacy. Vozinha isn't a toy - it's a digital equalizer.
During our pilot in Mindelo, we installed Vozinha on a handful of phones used by community health workers. They used it to dictate patient notes in Creole, schedule appointments, and even get weather forecasts for fishing trips. The usage data showed that users engaged 3x longer than with a text-based interface. That's the power of voice in an oral culture. We're now working with the Cabo Verdean government's digital transformation office to integrate Vozinha into e-government portals.
This is not just about Cabo Verde. The same approach can be replicated for other low-resource languages across Africa, Latin America, and the Pacific. Vozinha is open-source - we'll link the repo at the end.
The Engineering Stack: From Speech Collection to TTS
Here's the tech stack we used for Vozinha, with justification for each choice:
- Speech collection: Mozilla Common Voice platform - we set up a custom Cabo Verde instance because the global Common Voice dataset had only ~2 hours of Creole. We ended up collecting 48 hours from 400 volunteers across 5 islands.
- ASR model: Fine-tuned Wav2Vec 2.0 XLSR-53 from Facebook AI. The cross-lingual pre-training on 53 languages helped a lot; we cut word error rate (WER) from 45% to 23% by adding Cabo Verdean data.
- TTS model: Coqui TTS with a custom phonemizer. We mapped Portuguese graphemes to IPA and added Creole-specific sounds (prenasalized stops, implosives).
- Intent recognition: A small BERT-based classifier trained on translated Portuguese intents, with a fallback to regex for domain-specific commands (fishing forecasts, local news).
- Inference: Deployed on a single NVIDIA T4 GPU via ONNX Runtime to keep latency under 300ms. Edge deployment (Android app with on-device model) is in progress using TensorFlow Lite.
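The intent-recognition item above pairs a classifier with a regex fallback. A minimal sketch of that control flow follows, assuming the classifier returns a label and a confidence and that the regexes are consulted only when confidence is low. The patterns, intent names, threshold, and `stub_model` are all hypothetical stand-ins, not the project's actual model or rules.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical fallback patterns for domain-specific commands.
FALLBACK_PATTERNS = [
    (re.compile(r"\b(tempo|previs[aã]o)\b.*\b(mar|pesca)\b", re.IGNORECASE), "fishing_forecast"),
    (re.compile(r"\bnot[ií]cias?\b", re.IGNORECASE), "local_news"),
]

Model = Callable[[str], Tuple[str, float]]


def classify(text: str, model: Model, threshold: float = 0.6) -> Optional[str]:
    """Trust the classifier when it is confident, otherwise try the regexes."""
    label, confidence = model(text)
    if confidence >= threshold:
        return label
    for pattern, intent in FALLBACK_PATTERNS:
        if pattern.search(text):
            return intent
    return None  # nothing matched: the caller should ask the user to rephrase


def stub_model(text: str) -> Tuple[str, float]:
    """Stand-in for the BERT-based classifier; returns (label, confidence)."""
    return ("weather", 0.93) if "chuva" in text.lower() else ("unknown", 0.10)


print(classify("vai ter chuva hoje?", stub_model))
print(classify("previsão do tempo para pesca amanhã", stub_model))
```

Whether the regexes run before or after the classifier is a design choice; running them after keeps the learned model as the primary path and limits the rules to cases it cannot handle.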
The choice of Wav2Vec 2.0 was critical. Without the pre-trained multilingual checkpoint, we would have needed at least 200 hours of data to get acceptable WER. With transfer learning, 48 hours was sufficient for a basic command-and-control vocabulary (about 500 in-domain phrases).
Data Collection: Recording Voices Across the Archipelago
Collecting speech in Cabo Verde is not like running a booth at a tech conference. Internet connectivity on the islands is spotty - 4G covers most urban areas, but rural villages on Santo Antão have little to no signal. We had to pre-load the recording app and store audio locally, then upload when users reached a hotspot. We also discovered that many potential volunteers were uncomfortable reading from a screen because they weren't literate in written Creole. We pivoted to a "repeat after me" interface where prompts were played aloud, not shown.
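The store-locally-then-upload pattern described above can be sketched as a small on-disk queue. This is an illustration under assumptions, not the app's actual code: the class name, file layout, and the `is_online` and `upload` callbacks are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


class OfflineQueue:
    """Keeps recordings on disk until an upload succeeds."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, clip_id: str, audio: bytes, metadata: dict) -> None:
        # Audio is written first; the JSON file marks the clip as complete.
        (self.directory / f"{clip_id}.wav").write_bytes(audio)
        (self.directory / f"{clip_id}.json").write_text(json.dumps(metadata))

    def pending(self) -> List[str]:
        return sorted(p.stem for p in self.directory.glob("*.json"))

    def flush(self, is_online: Callable[[], bool],
              upload: Callable[[str, bytes, dict], bool]) -> int:
        """Upload pending clips; delete a clip only after upload() returns True."""
        if not is_online():
            return 0
        sent = 0
        for clip_id in self.pending():
            wav = self.directory / f"{clip_id}.wav"
            meta = self.directory / f"{clip_id}.json"
            if upload(clip_id, wav.read_bytes(), json.loads(meta.read_text())):
                wav.unlink()
                meta.unlink()
                sent += 1
        return sent


queue = OfflineQueue(Path(tempfile.mkdtemp()))
queue.save("clip-001", b"fake-audio-bytes", {"island": "Santo Antão", "prompt_id": 17})
offline_sent = queue.flush(lambda: False, lambda *args: True)  # no signal yet
online_sent = queue.flush(lambda: True, lambda *args: True)    # reached a hotspot
print(offline_sent, online_sent, queue.pending())
```

Deleting a clip only after a confirmed upload means a dropped connection costs a retry rather than a recording.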
Our volunteer base skewed young and urban at first, so we deliberately recruited older speakers from rural areas during market days. Each recording session included a brief demographic survey (age, island, native dialect). We ended up with a more balanced dataset: 40% from Santiago, 30% from São Vicente, 20% from Santo Antão, and 10% from other islands. Dialect imbalance still exists, but we augmented the under-represented dialects with synthetic data (speed perturbation and noise injection).
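Both augmentations mentioned above are simple to express on raw samples. The sketch below is a dependency-free illustration: speed perturbation by linear-interpolation resampling and white-noise injection at a target signal-to-noise ratio. The factors 0.9 and 1.1 and the 20 dB SNR are common choices assumed here, not values reported by the project, and a real pipeline would use an audio library rather than Python lists.

```python
import math
import random
from typing import List


def speed_perturb(samples: List[float], factor: float) -> List[float]:
    """Resample by linear interpolation; factor > 1 speeds up (shorter output)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    out_len = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(out_len):
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - int(pos)
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def add_noise(samples: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz tone at 16 kHz stands in for a real recording.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
fast = speed_perturb(tone, 1.1)
slow = speed_perturb(tone, 0.9)
noisy = add_noise(tone, snr_db=20.0, rng=random.Random(0))
print(len(fast), len(slow), len(noisy))
```

Note that this kind of resampling changes pitch along with tempo, which is usually acceptable for ASR augmentation.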
Key lesson: Recording in a noisy environment is unavoidable - market noise, wind, roosters. We used spectral gating for denoising but kept some ambient noise to make the model robust. This improved real-world performance by 12% in field tests.
Training ASR for Kabuverdianu: Phoneme Mapping and Tokenization
Wav2Vec 2.0 uses a character-level tokenizer by default, but it expects the target language's writing system. For Kabuverdianu, we had to decide between Portuguese orthography (used by many Creole speakers when writing informally) and ALUPEC. We tested both and found that ALUPEC gave 4% lower WER because it's more phonetic. However, users often mix spellings, so we added a grapheme-to-phoneme converter based on a manually curated rule set (380 rules for ALUPEC, plus 200 for common Portuguese-spelling variants).
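A rule-based grapheme-to-phoneme converter of this kind can be sketched as a greedy longest-match rewrite. The handful of rules below are illustrative only and are not the project's 380 + 200 rule set: the digraph mappings follow the usual ALUPEC digraphs, while the Portuguese-style variants are context-free simplifications assumed for the example, as is the sample word pair.

```python
from typing import Dict, List

# Illustrative ALUPEC-style rules (a tiny assumed subset).
ALUPEC_RULES: Dict[str, str] = {
    "tx": "tʃ", "dj": "dʒ", "nh": "ɲ", "lh": "ʎ",
    "x": "ʃ", "j": "ʒ",
}

# Hypothetical informal spelling variants mapped to the same phonemes.
# Real rules would need context (for example, "c" before "e" or "i").
PORTUGUESE_VARIANTS: Dict[str, str] = {"ch": "tʃ", "qu": "k", "c": "k"}

RULES: Dict[str, str] = {**ALUPEC_RULES, **PORTUGUESE_VARIANTS}


def to_phonemes(word: str, rules: Dict[str, str] = RULES) -> List[str]:
    """Greedy longest-match conversion; unmatched letters pass through as-is."""
    keys = sorted(rules, key=len, reverse=True)  # try digraphs before single letters
    word = word.lower()
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                phonemes.append(rules[key])
                i += len(key)
                break
        else:
            phonemes.append(word[i])
            i += 1
    return phonemes


# Assumed example: two spellings of one word should yield the same phonemes.
print(to_phonemes("txuba"))
print(to_phonemes("chuba"))
```

Mapping mixed spellings to one phoneme sequence lets the acoustic model train against a consistent target even when transcribers disagree on orthography.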
We also faced the challenge of word boundaries in Creole. Some particles (like "ka" for negation, "ta" for present progressive) are written as separate words but pronounced as clitics. The model initially struggled with "N ka ta papia" (I don't speak) because the segmentation was unclear. We solved this by adding a separate BPE tokenization step with a vocabulary size of 5,000 subword units, trained on a 50,000-sentence corpus of Creole text scraped from news sites and social media.
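For readers unfamiliar with BPE, the sketch below shows the core of the algorithm: repeatedly merge the most frequent adjacent symbol pair. It is a teaching-sized implementation with a toy corpus built around the phrase quoted above, not the project's tokenizer; a production setup would use an established tokenizer library and the 5,000-unit vocabulary described in the text.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def merge_pair(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merges from whitespace-tokenized text."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for line in corpus for word in line.split())
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged: Counter = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def encode(word: str, merges: List[Pair]) -> List[str]:
    """Apply the learned merges, in order, to a single word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration, not real training data).
corpus = ["N ka ta papia", "N ta papia", "N ka papia"]
merges = learn_bpe(corpus, num_merges=10)
print(encode("papia", merges))
print(encode("kata", merges))
```

Frequent particles such as "ka" and "ta" end up as their own subword units, which is what lets the model treat them consistently whether or not the transcript writes them as separate words.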
The final ASR model achieved a WER of 18.5% on clean speech (studio recorded) and 31% on noisy field audio. That's on par with commercial Arabic ASR models, which is a win given the data scarcity.
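For reference, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words: (substitutions + deletions + insertions) / N. The sketch below is a plain implementation of that definition for checking numbers by hand; the function name is ours, and evaluation toolkits provide equivalent, better-tested versions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One dropped particle out of four reference words.
print(wer("N ka ta papia", "N ta papia"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.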
Building a TTS Voice That Feels Like Home
Text-to-speech for Cabo Verde required more than just pronunciation accuracy. We needed a voice that sounded natural and warm - the "vozinha" (little voice) persona. We recorded 5 native speakers (2 male, 3 female, from different islands) reading a 400-sentence prompt set. The final TTS model was a multi-speaker Tacotron2+WaveGlow from Coqui, with speaker embeddings. Users can select between a "Mindelo voice" and a "Praia voice" to match their own dialect.
Prosody was the hardest part. Kabuverdianu has a musical intonation (influenced by West African languages) that flat synthetic speech fails to capture. Portuguese-accented TTS sounded robotic. We experimented with prosody transfer from the source speaker using a duration predictor and pitch embeddings. It's not perfect - currently we have a mean opinion score (MOS) of 3.4 out of 5 - but that beats the 1.9 of the generic Portuguese TTS.
Ethical note: we obtained explicit consent from every voice donor to use their recordings for Cabo Verde-related applications only, not for sale to third parties. We also trained a "neutral" voice from all speakers to avoid biasing the assistant toward one dialect.
Deployment and Real-World Learning
We launched an alpha version of Vozinha in April 2024 as a standalone Android app and a Telegram bot (since Telegram is surprisingly popular in Cabo Verde). Users can ask about weather, news, health information, and local events. The feedback has been eye-opening: our biggest bugs were around entity recognition for place names. "Mindelo" is a city on São Vicente, but some users said "Mindelo" also refers to a neighborhood in Praia. We had to disambiguate using geolocation.
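Geolocation-based disambiguation can be as simple as picking the candidate nearest to the user. The sketch below assumes that strategy; the gazetteer structure and function names are hypothetical, the coordinates are rough approximations, and the Praia entry uses the city center as a placeholder because the neighborhood's position is not given here.

```python
import math
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Hypothetical gazetteer with approximate coordinates, for illustration only.
CANDIDATES: Dict[str, Dict[str, Coord]] = {
    "Mindelo": {
        "city on São Vicente": (16.89, -24.98),
        "neighborhood in Praia": (14.92, -23.51),  # placeholder: Praia city center
    },
}


def resolve(name: str, user_location: Coord,
            gazetteer: Dict[str, Dict[str, Coord]] = CANDIDATES) -> str:
    """Return the candidate sense of `name` closest to the user."""
    options = gazetteer[name]
    return min(options, key=lambda label: haversine_km(user_location, options[label]))


print(resolve("Mindelo", (14.93, -23.52)))  # a user located in Praia
print(resolve("Mindelo", (17.02, -25.07)))  # a user located on Santo Antão
```

Nearest-candidate is only a default: a user in Praia may still mean the city on São Vicente, so a confirmation prompt for ambiguous names is a reasonable complement.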
Another surprise: users frequently code-switch between Portuguese and Creole mid-sentence. "Could you tell me o tempo for amanhã?" (mixing Portuguese for "weather" and "tomorrow"). Our initial model assumed a single language per utterance. We added a language identification head that outputs probability vectors for both languages, then merges the recognition results. Precision improved by 6%.
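One way to merge recognition results with language-ID probabilities is to decode each segment under both languages and keep the hypothesis from whichever language the LID head favors. The sketch below assumes that segment-level scheme; the `Segment` structure, the threshold, and the placeholder strings are hypothetical, and the actual Vozinha merge logic may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    creole_text: str      # hypothesis from the Creole decoding pass
    portuguese_text: str  # hypothesis from the Portuguese decoding pass
    p_creole: float       # LID probability of Creole; Portuguese is 1 - p_creole


def merge_hypotheses(segments: List[Segment], threshold: float = 0.5) -> str:
    """Per segment, keep the hypothesis of the more probable language."""
    chosen = [
        seg.creole_text if seg.p_creole >= threshold else seg.portuguese_text
        for seg in segments
    ]
    return " ".join(text for text in chosen if text)


# Toy input: angle-bracket strings stand in for the discarded hypotheses.
segments = [
    Segment(creole_text="N ka ta papia", portuguese_text="<pt-guess>", p_creole=0.92),
    Segment(creole_text="<kea-guess>", portuguese_text="amanhã", p_creole=0.08),
]
print(merge_hypotheses(segments))
```

A hard threshold is the simplest policy; weighting each decoder's scores by the LID probability is a softer alternative when segments are short and the LID output is uncertain.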
We also had to handle the fact that many Cabo Verdeans live abroad - the diaspora in the US, Portugal, and the Netherlands is large. They use Vozinha to talk to relatives back home. The model needs to understand foreign-accented Creole, which remains an open challenge.
Lessons for Engineers Building for Low-Resource Languages
What worked for Cabo Verde can be generalized. Here are the three biggest takeaways:
- Embrace transfer learning early. Pre-trained multilingual models like Wav2Vec 2.0 XLSR-53 or mBART-50 can reduce data requirements by 10x. Don't start from scratch.
- Design for variability. Dialects, code-switching, and orthographic chaos aren't edge cases - they are the norm. Build a system that can handle multiple writing systems and speaker accents from day one.
- Involve the community as co-creators. Traditional speech collection is extractive. Instead, we made Vozinha open-source and gave voice donors attribution (and small mobile credit rewards). This built trust and improved data quality.
The official Common Voice dataset documentation outlines how to set up a custom language. We followed that closely. Also, the paper "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations" was our bible for fine-tuning.
Frequently Asked Questions
- Can I use Vozinha right now if I live in Cabo Verde? Yes, the Android alpha is available on our website (link below) and the Telegram bot is @vozinhabot. Expect bugs - it's an alpha.
- Does Vozinha work completely offline? Not yet. The current version requires an internet connection. We're working on an on-device version using TensorFlow Lite for selected commands, targeting a 2025 release.
- How do I contribute my voice to improve the model? Download the Vozinha app and use the "Record a phrase" feature. We'll anonymize and add your audio to the training set with your consent.
- Why did you choose Telegram over WhatsApp for the bot? WhatsApp's API is restricted for bots; Telegram allows open bot registration and rich media replies. Many Cabo Verdean tech communities already use Telegram.
- Will Vozinha ever support Portuguese as well? It already understands Portuguese phrases, but we plan to add a dedicated Portuguese mode for bilingual users. Code-switching is currently best-effort.