Evaluation of a Silent Speech Interface based on Magnetic Sensing and Deep Learning for a Phonetically Rich Vocabulary
Authors: Jose A. Gonzalez, Lam A. Cheah, Phil D. Green, James M. Gilbert, Stephen R. Ell, Roger K. Moore and Ed Holdsworth
Comparison of mapping techniques
Speech samples generated by the following techniques are provided (see Table 2 and Fig. 1 in the paper):
- Original: speech signals recorded by the subjects.
- GMM: 128-mixture GMMs with full covariance matrices. Segmental features are computed over segments spanning 21 frames (i.e. 105 ms) of sensor data.
- DNN: DNNs with 4 hidden layers and 400 ReLUs per layer. Segmental features are also computed from 21-frame length windows of sensor data.
- RNN: fixed-lag recurrent neural network with 4 hidden layers and 150 GRUs in each layer. The look-ahead window contains 10 sensor frames in the future.
- BiRNN: bidirectional RNN with 4 bidirectional hidden layers and 105 GRUs per layer.
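The GMM and DNN mappings above both operate on segmental features built from 21-frame windows of sensor data. A minimal sketch of that windowing step is given below, assuming a current frame plus 10 frames of context on each side, with edge replication at the utterance boundaries; the function name and padding choice are illustrative, not taken from the paper.

```python
import numpy as np

def stack_segmental_features(frames, context=10):
    """Stack each sensor frame with +/-`context` neighbouring frames
    (21 frames total, ~105 ms at a 5 ms frame shift) into one long
    segmental feature vector per frame.

    frames: (T, D) array of T sensor frames with D channels each.
    Returns a (T, D * (2 * context + 1)) array.
    """
    T, D = frames.shape
    # Replicate the first/last frame so every position has full context.
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])
```

With `context=10` each output row concatenates 21 consecutive frames, matching the 21-frame segments used for the GMM and DNN systems.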
Comparison of input features
Next, we present examples of speech synthesised by the RNN-based mapping above when trained with the following types of input features (see Table 3 in the paper):
- PMA: features extracted from the PMA data (as in the RNN-based mapping above). This is our baseline system.
- Linguistic features: either phone or senone (i.e. HMM-state) labels. The linguistic features are automatically extracted from the force-aligned phonetic transcriptions of the original audio files and encoded as one-hot vectors. This system can be seen as a very basic TTS system.
- PMA + linguistic features: the PMA features concatenated with either phone labels (PMA+Phones) or senone labels (PMA+Senones).
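The combined feature types above amount to one-hot encoding the frame-level phone (or senone) labels and concatenating them with the PMA features. A minimal sketch, with a hypothetical phone inventory and function names not taken from the paper:

```python
import numpy as np

def one_hot_phones(labels, phone_set):
    """Encode a sequence of frame-level phone labels as one-hot vectors.

    labels: length-T list of phone symbols.
    phone_set: ordered list of all phone symbols in the inventory.
    Returns a (T, len(phone_set)) array with exactly one 1 per row.
    """
    idx = {p: i for i, p in enumerate(phone_set)}
    out = np.zeros((len(labels), len(phone_set)))
    for t, lab in enumerate(labels):
        out[t, idx[lab]] = 1.0
    return out

def pma_plus_phones(pma, labels, phone_set):
    """Concatenate PMA features with one-hot phone labels per frame,
    as in the PMA+Phones input configuration."""
    return np.concatenate([pma, one_hot_phones(labels, phone_set)], axis=1)
```

The PMA+Senones variant would be identical in form, with the phone inventory replaced by the (much larger) set of senone labels from the forced alignment.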