If you don’t want to be dependent on a web page to generate your language model and pronunciation dictionary, you can use the CMU-Cambridge Statistical Language Modeling Toolkit.
First, you will need to download a dictionary. I used the “cmu07a.dic” that comes with pocketsphinx. You will find it in
Next, you will need to download the source for the toolkit and build and install it on your Raspberry Pi:
# wget -c http://hivelocity.dl.sourceforge.net/project/cmusphinx/cmuclmtk/0.7/cmuclmtk-0.7.tar.gz
# tar zxvf cmuclmtk-0.7.tar.gz
# pushd cmuclmtk-0.7
# ./configure
# make
# sudo make install
# popd
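Before going further, it can help to confirm that the four toolkit programs used by the language model script actually ended up on your PATH. The following is only a minimal sketch; the function name check_tools is my own invention and is not part of the toolkit.

```shell
# check_tools: hypothetical helper (not part of cmuclmtk).
# Prints each named program that is not on the PATH and
# returns 1 if at least one of them is missing, 0 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

After installing, you could call it as: check_tools text2wfreq wfreq2vocab text2idngram idngram2lm. If it prints nothing, all four programs were found.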
Once you have installed the toolkit, you can use a script like this to create the files you will need for use with pocketsphinx_continuous:
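#!/bin/bash
# Assumption: the caller exports CORPUS (corpus text file), NAME (output prefix) and VOCAB_TYPE; "$NAME.css" is the context cues file that idngram2lm reads.
: "${CORPUS:?}" "${NAME:?}" "${VOCAB_TYPE:?}"
text2wfreq < "$CORPUS" | wfreq2vocab > "$NAME.vocab"
text2idngram -vocab "$NAME.vocab" -idngram "$NAME.idngram" < "$CORPUS"
idngram2lm -vocab_type "$VOCAB_TYPE" -context "$NAME.css" -idngram "$NAME.idngram" -vocab "$NAME.vocab" -arpa "$NAME.arpa"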
Name the script “mklm”. You can use the “.vocab” file to create your pronunciation dictionary using a script like this:
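#!/bin/bash
# Assumption: the caller exports NAME (output prefix) and DICT (path to the pronunciation dictionary).
: "${NAME:?}" "${DICT:?}"
echo -n "" > "$NAME.dic"
while read -r line ; do
    if [[ ! $line =~ ^# ]] ; then
        printf "Searching for %-20s" "$line..."
        if grep -E "(^$line[[:space:]]|^$line\([0-9]\)[[:space:]])" "$DICT" >> "$NAME.dic" ; then
            echo " [FOUND]"
        else
            echo " [NOT FOUND]"  # assumption: report misses so every line ends cleanly
        fi
    fi
done < "$NAME.vocab"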
Name this script "mkdict".
When you run the "mkdict" script, it will show you all the words that can be found in the pronunciation dictionary. You may find that some of the words you need are not in the dictionary. Fortunately, it is relatively easy to add words to the dictionary. I created the word "reboot" by looking at other words that started with "re" and adding the pronunciation for "boot". The result looks like this:
reboot	R IY B UW T
IMPORTANT: You MUST put a TAB between the word and the pronunciation. If you use a space, pocketsphinx_continuous will not be able to use the word.
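Because a TAB and a space look identical in most editors, it may be safer to let the shell write the entry and then check the file. The sketch below shows one way to do that; add_entry and lines_without_tab are hypothetical helper names of my own, not part of pocketsphinx.

```shell
# add_entry and lines_without_tab are hypothetical helpers,
# not part of pocketsphinx.

# Append "WORD<TAB>PHONES" to a dictionary file.
add_entry() {
    dict_file="$1"
    word="$2"
    phones="$3"
    printf '%s\t%s\n' "$word" "$phones" >> "$dict_file"
}

# Print every non-empty line of a dictionary file that contains no TAB.
lines_without_tab() {
    tab="$(printf '\t')"
    grep -v -e "$tab" -e '^$' "$1"
}
```

For example, add_entry my.dic reboot "R IY B UW T" appends the entry shown above to a file called my.dic (a made-up name), and lines_without_tab my.dic lists any entries that were typed with a space instead.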
Once you have a dictionary that contains all of the words you need, you can run your "mklm" and "mkdict" scripts to generate your language model and dictionary files.
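To check that the dictionary really does contain every word, you could compare the ".vocab" file against it. This is only a sketch under a few assumptions: the vocabulary has one word per line with "#" comment lines, dictionary entries have the form WORD, TAB, pronunciation, and missing_words is a hypothetical helper name of my own.

```shell
# missing_words: hypothetical helper, not part of cmuclmtk or pocketsphinx.
# Prints each word in the vocabulary file ($1) that has no entry in the
# dictionary file ($2).
missing_words() {
    vocab_file="$1"
    dict_file="$2"
    # Skip comment lines and empty lines in the vocabulary.
    grep -v -e '^#' -e '^$' "$vocab_file" | while IFS= read -r word; do
        # Compare the first TAB-separated field as a literal string, so
        # regex characters inside a word cannot cause false matches.
        if ! awk -F '\t' -v w="$word" \
            '($1 "") == (w "") { found = 1 } END { exit !found }' "$dict_file"; then
            echo "$word"
        fi
    done
}
```

If the function prints nothing, every vocabulary word has a pronunciation, and the ".arpa" and ".dic" files should be ready to pass to pocketsphinx_continuous through its -lm and -dict options.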
Of course, you can still use the "Sphinx Knowledge Base Tool" mentioned in the previous article to generate the files you need. However, there are plenty of situations in which a user might not want to submit a corpus to a publicly available web page.