Speech samples predicted from PMA data

Direct Speech Reconstruction from Articulatory Sensor Data by Machine Learning

Authors: Jose A. Gonzalez, Lam A. Cheah, Angel M. Gomez, Phil D. Green, James M. Gilbert, Stephen R. Ell, Roger K. Moore and Ed Holdsworth


Description

Below are several speech samples generated by the mapping techniques studied in the paper. For reference, speech resynthesised with the STRAIGHT vocoder is also provided.

Conditions

  • GMM-based mapping techniques: 128-mixture GMMs with full covariance matrices. Segmental features are computed over segments spanning 21 frames (i.e. 105 ms) of sensor data.
  • DNN-based mapping techniques: DNNs with 4 hidden layers and 426 sigmoid units per layer. As in the GMM-based techniques, segmental features are computed over segments spanning 21 frames (i.e. 105 ms) of sensor data.
  • Fixed-lag RNN: 4 hidden layers with 164 GRU units per layer. A look-ahead window of 10 sensor frames (i.e. 50 ms) is used.
  • Bidirectional RNN (BiRNN) mapping technique: 4 bidirectional hidden layers with 164 GRU units per layer. An illustrative code sketch of the segmental feature stacking and the BiRNN configuration follows this list.
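
To make these configurations concrete, below is a minimal sketch in Python/PyTorch of (a) the 21-frame segmental feature stacking used as input to the GMM- and DNN-based techniques and (b) a BiRNN mapper matching the configuration above. This is not the authors' implementation: the sensor and vocoder feature dimensionalities (sensor_dim, vocoder_dim) are placeholders, since the actual values depend on the PMA sensor setup and the STRAIGHT parameterisation.

```python
import numpy as np
import torch
import torch.nn as nn

def stack_segmental(frames: np.ndarray, context: int = 10) -> np.ndarray:
    """Stack each sensor frame with its +/- `context` neighbours
    (2*10 + 1 = 21 frames, i.e. 105 ms at a 5 ms frame shift) into a
    single segmental feature vector, as used for the GMM/DNN inputs.

    frames: (n_frames, sensor_dim) -> returns (n_frames, 21 * sensor_dim)
    """
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * context + 1].ravel()
                     for t in range(len(frames))])

class BiRNNMapper(nn.Module):
    """4 bidirectional GRU layers with 164 units per layer and direction,
    followed by a linear layer that regresses the vocoder parameters."""
    def __init__(self, sensor_dim: int, vocoder_dim: int,
                 hidden: int = 164, layers: int = 4):
        super().__init__()
        self.rnn = nn.GRU(sensor_dim, hidden, num_layers=layers,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocoder_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, sensor_dim) -> (batch, n_frames, vocoder_dim)
        h, _ = self.rnn(x)
        return self.out(h)

# Example with hypothetical dimensionalities:
# model = BiRNNMapper(sensor_dim=9, vocoder_dim=60)
# params = model(torch.randn(1, 200, 9))  # 200 sensor frames -> 200 vocoder frames
```

At synthesis time, the predicted frame-by-frame vocoder parameters would be passed to STRAIGHT to generate the waveform. The fixed-lag RNN variant would instead use a unidirectional GRU whose prediction for frame t is emitted after a 10-frame (50 ms) look-ahead.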

Speech samples

Speaker | STRAIGHT | GMM-MMSE | GMM-MLE | GMM-MLE+GV | DNN-MMSE | DNN-MLE | DNN-MLE+GV | RNN | BiRNN
M1      | wav      | wav      | wav     | wav        | wav      | wav     | wav        | wav | wav
M3      | wav      | wav      | wav     | wav        | wav      | wav     | wav        | wav | wav
M4      | wav      | wav      | wav     | wav        | wav      | wav     | wav        | wav | wav

Note: please use Google Chrome or Safari to play the speech samples (Firefox does not fully support WAV playback).