Examples of speech predicted from articulator movement

Evaluation of a Silent Speech Interface based on Magnetic Sensing and Deep Learning for a Phonetically Rich Vocabulary

Authors: Jose A. Gonzalez, Lam A. Cheah, Phil D. Green, James M. Gilbert, Stephen R. Ell, Roger K. Moore and Ed Holdsworth


Comparison of mapping techniques

Speech samples generated by the following techniques are provided (see Table 2 and Fig. 1 in the paper):

  • Original: speech signals recorded by the subjects.
  • GMM: 128-mixture GMMs with full covariance matrices. Segmental features are computed over segments spanning 21 frames (i.e. 105 ms) of sensor data.
  • DNN: DNNs with 4 hidden layers and 400 ReLUs per layer. Segmental features are also computed over 21-frame windows of sensor data.
  • RNN: fixed-lag recurrent neural network with 4 hidden layers and 150 GRUs in each layer. The look-ahead window covers 10 future sensor frames.
  • BiRNN: bidirectional RNN with 4 bidirectional hidden layers and 105 GRUs per layer.
Subject Original GMM DNN RNN BiRNN
F1 wav wav wav wav wav
F1 wav wav wav wav wav
F2 wav wav wav wav wav
F2 wav wav wav wav wav
M3 wav wav wav wav wav
M3 wav wav wav wav wav
M4 wav wav wav wav wav
M4 wav wav wav wav wav
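
The segmental features used by the GMM and DNN systems, and the look-ahead used by the fixed-lag RNN, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `stack_context` and `fixed_lag_schedule` are hypothetical, the window is assumed to be centred on the current frame, and the edges are assumed to be padded by repeating the first and last frames. Any further processing of the stacked segments (e.g. dimensionality reduction) is not shown. Note that 21 frames spanning 105 ms corresponds to 5 ms per frame.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def stack_context(frames: Sequence[Frame], width: int = 21) -> List[Frame]:
    """Concatenate each frame with its neighbours in a centred window.

    Assumption: the window is symmetric and edges repeat the first/last frame.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    if not frames:
        return []
    half = width // 2
    last = len(frames) - 1
    stacked: List[Frame] = []
    for t in range(len(frames)):
        segment: Frame = []
        for k in range(t - half, t + half + 1):
            # Clamp the index so out-of-range positions reuse the edge frame.
            segment.extend(frames[min(max(k, 0), last)])
        stacked.append(segment)
    return stacked


def fixed_lag_schedule(n_frames: int, lookahead: int = 10) -> List[Tuple[int, int]]:
    """Pair each output frame t with the last input frame it may depend on.

    Assumption: the output for frame t is produced once frame t + lookahead
    has been observed (or the utterance has ended).
    """
    return [(t, min(t + lookahead, n_frames - 1)) for t in range(n_frames)]


# Toy data: 30 frames with 2 channels each (values are arbitrary).
sensor = [[float(t), float(t) + 0.5] for t in range(30)]
segments = stack_context(sensor)
print(len(segments), len(segments[0]))  # 30 segments of 21 * 2 = 42 values
```

With a 21-frame window each segment is 21 times as long as a single sensor frame, whereas the fixed-lag schedule only delays the output: under the assumptions above, frame t is emitted after frame t + 10 has arrived.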

 

Comparison of input features

The examples below were synthesised by the RNN-based mapping described above, trained on each of the following types of input features (see Table 3 in the paper):

  • PMA: features extracted from the PMA data (as in the RNN-based mapping above). This is our baseline system.
  • Linguistic features: either phone or senone (i.e. HMM state) labels. The linguistic features are automatically extracted from the force-aligned phonetic transcriptions of the original audio files and encoded as one-hot vectors. This system can be seen as a very basic TTS system.
  • PMA + linguistic features: either PMA+Phones or PMA+Senone labels.
Subject Original PMA Phones Senones PMA+Phones PMA+Senones
F1 wav wav wav wav wav wav
F1 wav wav wav wav wav wav
F2 wav wav wav wav wav wav
F2 wav wav wav wav wav wav
M3 wav wav wav wav wav wav
M3 wav wav wav wav wav wav
M4 wav wav wav wav wav wav
M4 wav wav wav wav wav wav
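
The one-hot encoding of the linguistic features, and one way they could be combined with the PMA features, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names `one_hot` and `combine` are hypothetical, the four-symbol phone inventory is a toy example, and frame-wise concatenation is assumed for the PMA + linguistic systems, since the text above does not specify how the two streams are merged.

```python
from typing import Dict, List, Sequence


def one_hot(labels: Sequence[str], inventory: Sequence[str]) -> List[List[float]]:
    """Encode one label per frame as a one-hot vector over the inventory."""
    index: Dict[str, int] = {symbol: i for i, symbol in enumerate(inventory)}
    vectors: List[List[float]] = []
    for label in labels:
        vec = [0.0] * len(inventory)
        vec[index[label]] = 1.0  # raises KeyError for labels outside the inventory
        vectors.append(vec)
    return vectors


def combine(pma: Sequence[Sequence[float]],
            linguistic: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate the two feature streams frame by frame (assumed scheme)."""
    if len(pma) != len(linguistic):
        raise ValueError("both streams must have the same number of frames")
    return [list(p) + list(l) for p, l in zip(pma, linguistic)]


# Toy example: frame-level phone labels, as a forced alignment would provide.
inventory = ["sil", "k", "ae", "t"]               # hypothetical phone set
phones = ["sil", "k", "ae", "ae", "t"]            # one label per frame
pma_frames = [[0.1 * t, 0.2 * t, 0.3 * t] for t in range(len(phones))]

linguistic = one_hot(phones, inventory)
combined = combine(pma_frames, linguistic)
print(len(combined), len(combined[0]))  # 5 frames of 3 + 4 = 7 values
```

The same encoding applies to senone labels; only the inventory grows, since each phone is split into several HMM states.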