Direct Speech Reconstruction from Articulatory Sensor Data by Machine Learning
Authors: Jose A. Gonzalez, Lam A. Cheah, Angel M. Gomez, Phil D. Green, James M. Gilbert, Stephen R. Ell, Roger K. Moore and Ed Holdsworth
Below are several speech samples generated by the mapping techniques studied in the paper. Speech resynthesised with the STRAIGHT vocoder is also provided as a reference.
- GMM-based mapping techniques: 128-mixture GMMs with full covariance matrices. Segmental features are computed over segments spanning 21 frames (i.e. 105 ms) of sensor data. A sketch of this mapping is given after this list.
- DNN-based mapping techniques: DNNs with 4 hidden layers and 426 sigmoid units per layer. As in the GMM-based techniques, segmental features are computed over segments spanning 21 frames (i.e. 105 ms) of sensor data; see the DNN sketch after this list.
- Fixed-lag RNN: 4 hidden layers with 164 GRU units per layer. A look-ahead window of 10 sensor frames (i.e. 50 ms) is used.
- Bidirectional RNN (BiRNN) mapping technique: 4 bidirectional hidden layers with 164 GRU units per layer. A single sketch covering both RNN variants is given below.
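
For readers who want to see what the GMM-based mapping looks like in code, here is a minimal sketch of joint-density GMM regression with frame-wise MMSE conversion, a standard formulation for this family of techniques. The function names and feature dimensions are placeholders, and the frame-wise MMSE variant is an assumption for illustration; the paper's exact conversion rule (e.g. maximum-likelihood conversion with dynamic features) may differ.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=128, seed=0):
    """Fit a full-covariance GMM on joint [sensor, speech] feature frames.

    X: (n_frames, dx) segmental sensor features
    Y: (n_frames, dy) vocoder (speech) parameters
    """
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='full',
                          random_state=seed)
    gmm.fit(Z)
    return gmm

def gmm_mmse_map(gmm, X, dx):
    """Frame-wise MMSE mapping E[y | x] under the joint GMM."""
    mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
    M, d = mu.shape
    dy = d - dx
    mu_x, mu_y = mu[:, :dx], mu[:, dx:]

    # Per-mixture regression matrices A_m = S_yx S_xx^{-1}.
    A = np.empty((M, dy, dx))
    for m in range(M):
        A[m] = S[m, dx:, :dx] @ np.linalg.inv(S[m, :dx, :dx])

    # Responsibilities p(m | x) from the marginal GMM over x.
    logp = np.stack([multivariate_normal.logpdf(X, mu_x[m], S[m, :dx, :dx])
                     for m in range(M)], axis=1) + np.log(w)
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)

    # Weighted sum of per-component conditional means.
    Y_hat = np.zeros((X.shape[0], dy))
    for m in range(M):
        Y_hat += post[:, [m]] * (mu_y[m] + (X - mu_x[m]) @ A[m].T)
    return Y_hat
```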
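The DNN mapping described above can be sketched as a feed-forward regressor: 4 hidden layers of 426 sigmoid units followed by a linear output layer, trained with a regression loss such as MSE. The sketch below uses PyTorch; the input/output dimensions are placeholders, since they depend on the sensor and vocoder parameterisations.

```python
import torch
import torch.nn as nn

class DNNMapper(nn.Module):
    """4 hidden layers x 426 sigmoid units mapping segmental sensor
    features to vocoder parameters; output layer is linear (regression)."""
    def __init__(self, in_dim, out_dim, hidden=426, n_layers=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(d, hidden), nn.Sigmoid()]
            d = hidden
        layers.append(nn.Linear(d, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, in_dim)
        return self.net(x)

# Dummy dimensions for illustration only.
model = DNNMapper(in_dim=30 * 21, out_dim=25)
```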
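Both RNN variants can be expressed with one module: with bidirectional=False plus a 10-frame delay between inputs and targets it becomes the fixed-lag network (the output for frame t is produced only after frame t+10, i.e. 50 ms, has been seen), while bidirectional=True gives the BiRNN, which reads the whole utterance in both directions. The target-shifting trick is one common way to realise a fixed look-ahead and is an assumption here, as are all dimensions.

```python
import torch
import torch.nn as nn

class RNNMapper(nn.Module):
    """GRU regression network: 4 layers x 164 units per direction."""
    def __init__(self, in_dim, out_dim, hidden=164, n_layers=4,
                 bidirectional=False):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=n_layers,
                          batch_first=True, bidirectional=bidirectional)
        self.out = nn.Linear(hidden * (2 if bidirectional else 1), out_dim)

    def forward(self, x):          # x: (batch, frames, in_dim)
        h, _ = self.gru(x)
        return self.out(h)

def lookahead_align(pred, Y, lag=10):
    """Pair the GRU output at frame t+lag with target frame t, giving the
    unidirectional network a lag-frame (here 10 frames = 50 ms) look-ahead."""
    return pred[:, lag:, :], Y[:, :-lag, :]

# Dummy data for illustration only: (batch, frames, feature dim).
X = torch.randn(2, 100, 30)
Y = torch.randn(2, 100, 25)

fixed = RNNMapper(in_dim=30, out_dim=25)               # fixed-lag RNN
pred, tgt = lookahead_align(fixed(X), Y, lag=10)
loss = nn.functional.mse_loss(pred, tgt)

birnn = RNNMapper(in_dim=30, out_dim=25,
                  bidirectional=True)                  # BiRNN
```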
Note: please use Google Chrome or Safari to play the speech samples (Firefox does not fully support WAV files).