Kuba Lopatka (Intel Corporation) and Tobias Bocklet (Technische Hochschule Nürnberg)
We propose a new training method to improve HMM-based keyword spotting. The loss function is based on a score computed with the keyword/filler model over the entire input sequence. It is equivalent to max/attention pooling but is grounded in prior acoustic knowledge. We compare our model to a baseline trained with framewise cross-entropy, with and without per-class weighting. We employ a low-footprint TDNN for acoustic modeling. The proposed training yields significant and consistent improvements over the baseline in adverse noise conditions: the FRR on cafeteria noise is reduced from 13.07% to 5.28% at 9 dB SNR, and from 37.44% to 6.78% at 5 dB SNR. We obtain strong results with only 600 unique training keyword samples. The training method is independent of the frontend and of the acoustic model topology.
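To make the contrast with framewise cross-entropy concrete, here is a minimal, hypothetical sketch of a sequence-level max-pooling loss: the utterance score is the maximum per-frame keyword posterior, and the loss is a binary cross-entropy on that single pooled score rather than on every aligned frame. The function name, inputs, and scoring rule are illustrative assumptions, not the paper's exact formulation (which pools a keyword/filler model score instead of a raw posterior).

```python
import math

def max_pool_loss(frame_scores, is_keyword):
    """Sequence-level loss via max pooling (illustrative sketch).

    frame_scores: per-frame keyword posteriors in (0, 1) for one utterance.
    is_keyword:   True if the utterance contains the keyword.
    """
    # Pool the whole sequence into one score: the best-scoring frame.
    s = max(frame_scores)
    # Binary cross-entropy on the pooled score: a positive utterance only
    # needs ONE confident frame; a negative utterance must have NO
    # confident frame. No frame-level alignment is required.
    return -math.log(s) if is_keyword else -math.log(1.0 - s)

# A positive utterance with a single confident frame incurs a small loss,
# even though most frames score low (which framewise CE would penalize).
pos_loss = max_pool_loss([0.05, 0.10, 0.90, 0.15], is_keyword=True)
# A negative utterance with uniformly low scores also incurs a small loss.
neg_loss = max_pool_loss([0.05, 0.10, 0.20, 0.15], is_keyword=False)
```

This sketch shows why such pooling suits keyword spotting: the supervision signal is per-utterance, so training does not depend on exact frame alignments of the keyword.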