In this paper, we present a multi-domain speech conversion technique, the Multi-domain Speech Conversion Network (MSpeC-Net), for the less-explored task of Non-Audible Murmur-to-SPeeCH (NAM2SPCH) conversion. The murmur produced by a speaker and captured by a NAM microphone suffers from degraded speech quality. Hence, NAM2SPCH conversion is a necessary yet challenging task for improving the intelligibility of the NAM signal. MSpeC-Net contains three domain-specific autoencoders. The encoder-decoder pairs are aligned using a latent consistency loss in such a way that the desired conversion is achieved using only the source encoder and the target decoder. We perform zero-pair NAM2SPCH conversion through the interaction between the source encoder and the target decoder. We evaluate the proposed method using both objective and subjective measures. With average Mean Opinion Scores of 3.26 and 3.12 for direct NAM2SPCH and indirect NAM2SPCH (i.e., NAM-to-whisper-to-speech) conversion, respectively, MSpeC-Net achieves a perceptually significant improvement for the NAM2SPCH conversion system.
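The encoder-decoder arrangement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear maps stand in for trained networks, the feature and latent sizes are hypothetical, and the mean-squared form of the latent consistency loss is an assumption, since the abstract does not specify it. The sketch only shows how one autoencoder per domain (NAM, WHSP, SPCH) allows a conversion to be composed from a source encoder and a target decoder, either directly or via the whisper domain.

```python
import random

DOMAINS = ("NAM", "WHSP", "SPCH")
FEAT_DIM, LATENT_DIM = 4, 2  # hypothetical sizes, chosen only for illustration


def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows (stand-in for a trained network)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def apply(matrix, vec):
    """Matrix-vector product; a linear layer stands in for an encoder or decoder."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def build_autoencoders(seed=0):
    """One encoder/decoder pair per domain, as in the three-autoencoder design."""
    rng = random.Random(seed)
    return {
        d: {
            "enc": make_matrix(LATENT_DIM, FEAT_DIM, rng),
            "dec": make_matrix(FEAT_DIM, LATENT_DIM, rng),
        }
        for d in DOMAINS
    }


def latent_consistency_loss(models, dom_a, x_a, dom_b, x_b):
    """Mean squared distance between the latents of two frames (assumed form)."""
    z_a = apply(models[dom_a]["enc"], x_a)
    z_b = apply(models[dom_b]["enc"], x_b)
    return sum((a - b) ** 2 for a, b in zip(z_a, z_b)) / len(z_a)


def convert(models, source, target, x):
    """Encode with the source encoder, then decode with the target decoder."""
    return apply(models[target]["dec"], apply(models[source]["enc"], x))


models = build_autoencoders()
nam_frame = [0.1, -0.2, 0.3, 0.0]  # made-up feature frame

# Direct NAM2SPCH: NAM encoder followed by SPCH decoder.
direct = convert(models, "NAM", "SPCH", nam_frame)

# Indirect NAM2SPCH: NAM-to-whisper, then whisper-to-speech.
whisper_frame = convert(models, "NAM", "WHSP", nam_frame)
indirect = convert(models, "WHSP", "SPCH", whisper_frame)
```

In this toy setup, minimizing `latent_consistency_loss` over corresponding frames from two domains would pull their latent codes together, which is what lets any encoder be paired with any decoder at conversion time.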
Index | Conversion Type | Input | MMSE-DiscoGAN (Baseline) | MSpeC-Net (Proposed) |
---|---|---|---|---|
(1) | NAM2WHSP | | | |
(2) | WHSP2SPCH | | | |
(3) | NAM2SPCH | | | |
@INPROCEEDINGS{9052966,
  author={H. {Malaviya} and J. {Shah} and M. {Patel} and J. {Munshi} and H. A. {Patil}},
  booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={{MSpeC-Net}: Multi-Domain Speech Conversion Network},
  year={2020},
  volume={},
  number={},
  pages={7764-7768}
}