PACycleGAN-VC

Non-parallel voice conversion (VC) is a technique that uses a non-parallel corpus to convert source speech into target speech while keeping the semantic information unchanged. Cycle-consistent adversarial network-based VC with filling in frames (MaskCycleGAN-VC) [1] was recently proposed and is generally accepted as the current benchmark method. Although it addresses the consistency of time-frequency structures, its conversion quality is still unsatisfactory, especially for inter-gender VC: a large gap remains between target and converted speech in terms of naturalness and similarity. In addition, the performance of MaskCycleGAN-VC deteriorates severely when the amount of training data is limited. To address these problems, we propose Pyramid Attention CycleGAN (PACycleGAN) for voice conversion, which integrates a pyramid structure and an attention mechanism. We also use Differentiable Augmentation, a method that improves the data efficiency of GANs and stabilizes training by applying different types of differentiable augmentations to both real and fake speech samples. We evaluate PACycleGAN on inter-gender and intra-gender non-parallel VC. Subjective and objective evaluations of naturalness and speaker similarity show that PACycleGAN-VC outperforms MaskCycleGAN-VC for every VC pair.
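The role of Differentiable Augmentation described above can be illustrated with a minimal sketch. This is not the training code behind the samples on this page: the two augmentations (a time shift and a time cutout), their ranges, and all function names are assumptions chosen for illustration, and a spectrogram is represented as a plain list of mel-bin rows. Both operations are linear in the input (a shift and a multiplication by a 0/1 mask), which is why, in an autodiff framework, gradients could still flow through them to the generator.

```python
import random


def translate_time(spec, shift):
    """Shift every row by `shift` frames along time, padding with zeros."""
    n_frames = len(spec[0])
    out = []
    for row in spec:
        new_row = [0.0] * n_frames
        for t, value in enumerate(row):
            if 0 <= t + shift < n_frames:
                new_row[t + shift] = value
        out.append(new_row)
    return out


def cutout_time(spec, start, width):
    """Zero out frames in [start, start + width), i.e. multiply by a 0/1 mask."""
    return [
        [0.0 if start <= t < start + width else value for t, value in enumerate(row)]
        for row in spec
    ]


def diff_augment(spec, rng, max_shift=2, max_width=2):
    """Apply a random shift followed by a random cutout (hypothetical policy)."""
    n_frames = len(spec[0])
    shift = rng.randint(-max_shift, max_shift)
    width = rng.randint(0, max_width)
    start = rng.randint(0, n_frames - width)
    return cutout_time(translate_time(spec, shift), start, width)


def discriminator_inputs(real, fake, seed=0):
    """Augment BOTH real and generated samples before they reach the discriminator."""
    rng = random.Random(seed)
    return diff_augment(real, rng), diff_augment(fake, rng)
```

The key point of the sketch is in `discriminator_inputs`: the same augmentation policy is applied to real and fake samples alike, so the discriminator never sees an unaugmented sample of either kind.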

Figure 1. Network architecture of the generator, including the proposed Pyramid Attention (PA) module. Top-left: Attention-Gated ResBlock with different atrous rates. Top-right: The operation of the attention fusion component between two feature levels in the pyramid.
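As a reminder of what the atrous (dilation) rates in Figure 1 control, the following sketch computes the effective span of a dilated convolution kernel with the standard formula k_eff = k + (k - 1)(r - 1). The kernel size of 3 and the rates 1, 2, and 4 are illustrative assumptions only; this page does not state the values used in the PA module.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a dilated kernel: k + (k - 1) * (rate - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def pyramid_kernel_sizes(kernel_size, rates):
    """Map each atrous rate to the effective kernel span it yields."""
    return {rate: effective_kernel_size(kernel_size, rate) for rate in rates}


# Hypothetical configuration: a 3-tap kernel at three atrous rates.
print(pyramid_kernel_sizes(3, [1, 2, 4]))  # {1: 3, 2: 5, 4: 9}
```

Larger rates widen the context seen by each block without adding parameters, which is the usual motivation for combining several rates in one pyramid.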

Conversion samples

Recommended browsers: Safari, Chrome, Firefox, and Opera.

Experimental conditions

We evaluated our method on the Spoke (i.e., non-parallel VC) task of the Voice Conversion Challenge 2018 (VCC 2018) [2].
For each speaker, 81 sentences (approximately 5 min in length, which is relatively short for VC) were used for training.
The training set contains no overlapping utterances between the source and target speakers; therefore, we need to learn a converter in a fully non-parallel setting.
We used MelGAN [3] as a vocoder.
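To make the conditions above concrete, the sketch below shows what a fully non-parallel setting means for data sampling, together with the average utterance length implied by 81 sentences in roughly 5 minutes. The utterance identifiers and function names are hypothetical placeholders, not the actual VCC 2018 file names or our data loader.

```python
import random


def average_utterance_seconds(total_seconds, n_utterances):
    """Average utterance duration implied by the corpus size."""
    return total_seconds / n_utterances


def sample_unpaired(source_utts, target_utts, rng):
    """Draw source and target independently: no utterance-level alignment exists."""
    return rng.choice(source_utts), rng.choice(target_utts)


# Hypothetical identifiers for the 81 training utterances of each speaker.
source_utts = [f"src_{i:03d}" for i in range(81)]
target_utts = [f"tgt_{i:03d}" for i in range(81)]

rng = random.Random(0)
src, tgt = sample_unpaired(source_utts, target_utts, rng)
avg = average_utterance_seconds(5 * 60, 81)  # roughly 3.7 s per utterance
```

Because the two draws are independent, a sampled source utterance and target utterance generally contain different sentences, so the converter cannot rely on frame-level alignment between them.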

Compared models

Mask: MaskCycleGAN-VC [1]
PA: PACycleGAN-VC

Female (VCC2SF3) → Male (VCC2TM1)

	Source	Target	Mask	PA
Sample 1
Sample 2
Sample 3

Male (VCC2SM3) → Female (VCC2TF1)

	Source	Target	Mask	PA
Sample 1
Sample 2
Sample 3

Female (VCC2SF3) → Female (VCC2TF1)

	Source	Target	Mask	PA
Sample 1
Sample 2
Sample 3

Male (VCC2SM3) → Male (VCC2TM1)

	Source	Target	Mask	PA
Sample 1
Sample 2
Sample 3

References

[1] T. Kaneko, H. Kameoka, K. Tanaka, N. Hojo. MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames. arXiv:2102.12841, Feb. 2021. ICASSP, 2021. [Paper]

[2] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, Z. Ling. The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods. Odyssey, 2018. [Paper] [Dataset]

[3] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, A. Courville. MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. NeurIPS, 2019. [Paper]