Using Highway Connections to Enable Deep Small-footprint LSTM-RNNs for Speech Recognition

CHENG Gaofeng; LI Xin; YAN Yonghong

doi:10.1049/cje.2018.11.008

Volume 28 Issue 1


CHENG Gaofeng, LI Xin, YAN Yonghong, “Using Highway Connections to Enable Deep Small-footprint LSTM-RNNs for Speech Recognition,” Chinese Journal of Electronics, vol. 28, no. 1, pp. 107-112, 2019, doi: 10.1049/cje.2018.11.008


1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China

Funds: This work is supported by the National Key Research and Development Program (No. 2016YFB0801203, No. 2016YFB0801200) and the National Natural Science Foundation of China (No. 11590774, No. 11590770).

Received Date: 2016-11-11
Rev Recd Date: 2018-08-02
Publish Date: 2019-01-10

Abstract

Long short-term memory RNNs (LSTM-RNNs) have shown great success in automatic speech recognition (ASR) and have become the state-of-the-art acoustic models for time-sequence modeling tasks. However, it is still difficult to train deep LSTM-RNNs while keeping the number of parameters small. We use highway connections between the memory cells of adjacent layers to train small-footprint highway LSTM-RNNs (HLSTM-RNNs), which are deeper and thinner than conventional LSTM-RNNs. Experiments on the Switchboard (SWBD) corpus indicate that thinner and deeper HLSTM-RNNs can be trained with fewer parameters than a conventional 3-layer LSTM-RNN while achieving a lower word error rate (WER). Compared with their small-footprint LSTM-RNN counterparts, the small-footprint HLSTM-RNNs show a greater reduction in WER.
Keywords:
- Long short-term memory
- Highway connections
- Small-footprint
- Speech recognition
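
The abstract describes highway connections between the memory cells of adjacent layers but does not give the update equations. The sketch below is one plausible reading, assuming the commonly used highway LSTM formulation (Zhang et al., 2016), in which a depth gate carries the lower layer's cell state into the current layer's cell. The function names, the diagonal (element-wise) gate weights, and the precomputed input projection `x_proj` are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function used for all gate activations."""
    return 1.0 / (1.0 + math.exp(-x))


def depth_gate(x_proj, c_prev, c_lower, w_c, w_l, bias):
    """Depth (carry) gate, one value per memory cell.

    x_proj  : projection of this layer's input (assumed precomputed)
    c_prev  : this layer's cell state at time t-1
    c_lower : the lower layer's cell state at time t
    w_c, w_l, bias : hypothetical element-wise weights and bias
    """
    return [
        sigmoid(b + xp + wc * cp + wl * cl)
        for xp, cp, cl, wc, wl, b in zip(x_proj, c_prev, c_lower, w_c, w_l, bias)
    ]


def highway_cell_update(c_prev, c_lower, i_gate, f_gate, d_gate, g_cand):
    """Cell update with a highway term: c = d*c_lower + f*c_prev + i*g.

    With d_gate set to zero this reduces to the ordinary LSTM cell update.
    The highway term requires both layers to have the same number of cells.
    """
    if len(c_prev) != len(c_lower):
        raise ValueError("adjacent layers must have the same number of cells")
    return [
        d * cl + f * cp + i * g
        for cp, cl, i, f, d, g in zip(c_prev, c_lower, i_gate, f_gate, d_gate, g_cand)
    ]
```

Because the depth gate gives the cell state a direct, gated path from layer to layer, gradients have a shorter route through a deep stack, which is the usual argument for why thinner layers can be stacked more deeply.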


