Retrieval-based Voice Conversion (RVC)
Retrieval-based voice conversion (RVC) is a voice cloning method that uses a pre-trained model to retrieve and combine segments of speech from a target speaker in order to convert a source speaker's speech into the target speaker's voice. Unlike other voice cloning methods, RVC does not require a large amount of training data from the target speaker. This makes it a more versatile and efficient approach to voice cloning.
Traditional voice conversion techniques typically rely on statistical models, such as hidden Markov models (HMMs) or Gaussian mixture models (GMMs), to learn the mapping between the source and target voices. These models are trained on a large corpus of parallel data, which consists of pairs of audio recordings from the source and target speakers.
RVC, on the other hand, does not require parallel data. Instead, it uses a retrieval-based approach to find the most similar audio segments from the target speaker’s voice and then uses these segments to synthesize the converted voice. This approach is more efficient and less data-intensive than traditional voice conversion techniques.
In addition to the differences in data requirements and efficiency, RVC offers several other advantages over traditional voice conversion techniques. For example, RVC is more robust to noise and distortions in the source voice, and it is more versatile, converting voices to a wider range of target styles.
Advantages of RVC
Here are some of the specific advantages of RVC:
Noise robustness: RVC is less sensitive to noise in the source voice than traditional voice conversion techniques. This is because RVC does not rely on statistical models, which can be easily corrupted by noise.
Style transfer: RVC can be used to convert voices to a wider range of target styles than traditional voice conversion techniques. This is because RVC can leverage the entire range of styles represented in the target speaker’s voice.
Expressiveness: RVC can capture the expressiveness of the source speaker’s voice, including emotions and intonation. This is because RVC is based on retrieving the most similar audio segments from the target speaker’s voice, which preserves the natural expressiveness of the target speaker.
RVC offers several advantages over other voice cloning methods:
Less training data required: RVC can be trained with a relatively small amount of data from the target speaker, typically less than 10 minutes of speech.
Efficient and versatile: RVC can be used to convert a wide range of voices, including those with accents or speech impediments.
High quality results: RVC can produce synthesized speech that is very similar to the target speaker’s original voice.
However, RVC also has some limitations:
Tone leakage: In some cases, the synthesized speech may retain some of the source speaker’s voice characteristics.
Limited expressiveness: RVC may not be able to capture the full expressiveness of the target speaker’s voice.
Overview of how RVC works
In retrieval-based voice conversion (RVC), the model learns to clone a voice by retrieving and combining audio segments from the target speaker’s voice. This process involves several steps:
Pre-training: A large corpus of speech data from a variety of speakers is used to train a neural network model. This model learns to encode speech into a high-dimensional representation that captures the acoustic characteristics of each speaker’s voice. See Pre-training for more info.
Feature Extraction: The model extracts acoustic features from both the source and target speaker’s voice. These features represent the characteristics of the speech signal, such as pitch, loudness, and spectral content.
Speaker Embeddings: The model learns speaker embeddings, which are vector representations that capture the unique characteristics of each speaker’s voice. These embeddings are used to identify the most similar audio segments from the target speaker’s voice.
Segment Retrieval: For each utterance in the source voice, the model retrieves the most similar audio segments from the target speaker’s voice. The similarity is determined based on the acoustic features and speaker embeddings.
Segment Combination: The retrieved segments are then combined using a waveform synthesis technique, such as overlap-add or phase vocoder, to generate the converted voice. This process ensures that the converted voice maintains the overall intonation and rhythm of the source speaker while adopting the target speaker’s voice characteristics.
Fine-tuning: The model is fine-tuned using a loss function that measures the similarity between the converted voice and the target speaker’s voice. This fine-tuning process helps to improve the quality of the converted voice.
Through this process, the model learns to adjust the acoustic features and speaker embeddings of the source speaker’s voice to match those of the target speaker. This allows the model to effectively clone the target speaker’s voice.
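To make the retrieval and combination steps more concrete, here is a minimal, self-contained sketch of how a converted frame can be built from its nearest target-speaker frames. The function name, the inverse-distance weighting, and the index_rate mixing parameter are illustrative choices for this sketch, not the exact RVC implementation.

```python
import numpy as np

def retrieve_and_blend(source_feats, target_bank, k=4, index_rate=0.75):
    """Replace each source frame with a weighted mix of its k nearest
    target-speaker frames (illustrative sketch, not the real RVC code)."""
    converted = np.empty_like(source_feats)
    for i, frame in enumerate(source_feats):
        # L2 distance from this source frame to every frame in the target bank
        dists = np.linalg.norm(target_bank - frame, axis=1)
        nearest = np.argsort(dists)[:k]
        # Weight closer neighbors more heavily (inverse-distance weighting)
        weights = 1.0 / (dists[nearest] + 1e-8)
        weights /= weights.sum()
        retrieved = (weights[:, None] * target_bank[nearest]).sum(axis=0)
        # index_rate controls how much retrieved content replaces the source frame
        converted[i] = index_rate * retrieved + (1.0 - index_rate) * frame
    return converted

# Example: 100 source frames against a bank of 5000 target frames (768-dim features)
source_feats = np.random.randn(100, 768).astype(np.float32)
target_bank = np.random.randn(5000, 768).astype(np.float32)
print(retrieve_and_blend(source_feats, target_bank).shape)  # (100, 768)
```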
Tutorial
Here is a tutorial on using Retrieval-based-Voice-Conversion-WebUI to train a voice conversion model. It comes with a pre-trained model built from nearly 50 hours of high-quality audio from the open-source VCTK dataset.
Learning Time
In deep learning, the dataset is split into batches and learning proceeds incrementally. In one model update (step), batch_size samples are retrieved, predictions are made, and the errors are corrected. Doing this once over the whole dataset counts as one epoch.
Therefore, the total learning time is roughly the learning time per step × (the number of samples in the dataset ÷ the batch size) × the number of epochs. In general, a larger batch size makes learning more stable and reduces the learning time per sample (learning time per step ÷ batch size), but it uses more GPU memory. GPU RAM usage can be checked with the nvidia-smi command. Training can be completed more quickly by increasing the batch size as much as the machine in the execution environment allows.
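As a quick illustration of that formula, the snippet below estimates total training time; all of the numbers are made up for the example.

```python
# Hypothetical numbers for illustration only
seconds_per_step = 0.5          # measured time for one model update
dataset_size = 2000             # number of training samples
batch_size = 16
epochs = 200

steps_per_epoch = dataset_size / batch_size
total_seconds = seconds_per_step * steps_per_epoch * epochs
print(f"{steps_per_epoch:.0f} steps per epoch, "
      f"~{total_seconds / 3600:.1f} hours of training in total")
```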
Model
In the context of generative adversarial networks (GANs), the “G” and “D” files represent the generator and discriminator models, respectively. These two models are the core components of a GAN and play crucial roles in generating and evaluating synthetic data.
Generator (G): The generator model is responsible for creating new data samples that resemble the real data distribution. It takes a random noise vector as input and transforms it into a synthetic data sample. The generator’s objective is to produce data that is indistinguishable from the real data, making it difficult for the discriminator to differentiate between them.
Discriminator (D): The discriminator model is tasked with distinguishing between real and synthetic data samples. It takes a data sample as input and outputs a probability indicating how likely it is that the sample is real. The discriminator’s objective is to accurately classify real and synthetic data, helping to improve the generator’s performance.
The G and D models are trained in an adversarial manner, where the generator tries to fool the discriminator, and the discriminator tries to catch the generator. This adversarial process drives the generator to produce increasingly realistic synthetic data.
The specific implementation and training details of the G and D models can vary depending on the specific GAN architecture and application. However, the fundamental roles of these two models remain the same: the generator creates synthetic data, and the discriminator evaluates its quality.
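The adversarial training described above can be sketched as a generic alternating update. This is a minimal PyTorch illustration using cross-entropy losses, not RVC's actual training loop.

```python
import torch
import torch.nn.functional as F

def adversarial_step(net_g, net_d, real, noise, opt_g, opt_d):
    """One generic GAN update: D learns to separate real from synthetic data,
    then G learns to fool D (illustrative sketch only)."""
    # Discriminator step: real samples should score as real, generated ones as fake.
    fake = net_g(noise).detach()          # no gradient flows into G here
    d_real, d_fake = net_d(real), net_d(fake)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: G is rewarded when D scores its output as real.
    fake = net_g(noise)
    d_fake = net_d(fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```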
Pretrained model
RVC starts training the model from pretrained weights instead of from 0, so it can be trained with a small dataset.
By default it loads rvc-location/pretrained/f0G40k.pth and rvc-location/pretrained/f0D40k.pth.
When learning, model parameters are saved in logs/{experiment name}/G_{}.pth and logs/{experiment name}/D_{}.pth every save_every_epoch epochs.
You can restart or resume training from model weights learned in a different experiment by pointing the path to that model instead of using the default pretrained model.
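If you ever need to inspect or load such a checkpoint manually, a minimal PyTorch sketch might look like the following. The "model" key and the strict=False behaviour are assumptions about the checkpoint layout, so verify them against the actual files.

```python
import torch

def load_pretrained(net_g, ckpt_path="pretrained/f0G40k.pth"):
    """Load pretrained generator weights into an existing model.
    The "model" key and strict=False are assumptions about the checkpoint layout."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state_dict = checkpoint.get("model", checkpoint)  # fall back to a flat state dict
    missing, unexpected = net_g.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return net_g
```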
Learning index
RVC saves the HuBERT feature values used during training, and during inference it searches for feature values that are similar to those seen during training. To perform this search at high speed, the index is learned in advance. For index learning, RVC uses the approximate nearest neighbor search library faiss. It reads the feature values from /logs/{experiment name}/3_feature768, saves the combined feature values as /logs/{experiment name}/total_fea.npy, and uses them to learn the index /logs/{experiment name}/added_XXX.index.
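As an illustration of this step, here is a minimal sketch of building and querying a faiss index over saved feature values. The exact IndexFlatL2 index type, the experiment name, and the output file name are simplifications for the example; RVC itself may build an approximate IVF index.

```python
import faiss
import numpy as np

# Load the combined feature values saved during training
# ("my-experiment" is a placeholder experiment name).
features = np.load("logs/my-experiment/total_fea.npy").astype(np.float32)

# Build a simple exact L2 index over the feature vectors.
index = faiss.IndexFlatL2(int(features.shape[1]))
index.add(features)
faiss.write_index(index, "logs/my-experiment/added.index")

# At inference time, look up the nearest stored features for each query frame.
query = features[:10]                   # stand-in for HuBERT features of new audio
distances, ids = index.search(query, 4)
print(distances.shape, ids.shape)       # (10, 4) (10, 4)
```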
Loss Terms
Loss terms provide valuable feedback during the training process of GANs, helping to improve the quality and realism of the generated synthetic data.
Monitoring these loss terms together provides a comprehensive picture of the training progress. A successful training run would typically exhibit a balance between decreasing generator and discriminator losses, along with decreasing feature matching, mel spectrogram, and KL divergence losses. This indicates that the generator is producing increasingly realistic synthetic data while maintaining a similar underlying structure and distribution to real data.
The optimal values will vary depending on the specific task and dataset. However, by monitoring the trends in these loss terms, you can gain valuable insights into the progress of the training and identify any potential issues that may arise.
Discriminator Loss (loss_disc):
The discriminator loss measures the discriminator’s ability to distinguish between real and synthetic data samples. It is typically formulated as a cross-entropy loss, where the discriminator is penalized for misclassifying real and synthetic samples.
A decreasing discriminator loss indicates that the discriminator is becoming better at distinguishing between real and synthetic data. This is generally a good sign, as it suggests that the generator is producing increasingly realistic synthetic data. However, if the discriminator loss becomes too low, it may indicate that the generator is simply memorizing real data samples rather than learning to capture the underlying patterns in the data.
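As a minimal sketch, a cross-entropy discriminator loss of the kind described above can be written as follows, assuming the discriminator returns raw logits (some VITS-style implementations use a least-squares variant instead).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(real_logits, fake_logits):
    # Real samples are pushed toward a "real" label of 1, synthetic samples toward 0.
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake
```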
Generator Loss (loss_gen):
The generator loss measures the generator’s ability to produce synthetic data samples that are indistinguishable from real data. It is typically formulated as an adversarial loss, where the generator is penalized for producing data that is easily classified as synthetic by the discriminator.
A decreasing generator loss indicates that the generator is successfully learning to produce realistic synthetic data.
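A matching sketch of the adversarial generator loss, again assuming the discriminator returns raw logits:

```python
import torch
import torch.nn.functional as F

def generator_adversarial_loss(fake_logits):
    # The generator is rewarded when the discriminator scores its output as real (label 1).
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```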
Feature Matching Loss (loss_fm):
The feature matching loss encourages the generator to produce synthetic data that has similar intermediate feature representations to those of real data. This is achieved by comparing the intermediate activations of a feature extractor applied to both real and synthetic data.
A decreasing feature matching loss indicates that the generator is producing synthetic data with similar intermediate feature representations to those of real data. This suggests that the generator is learning to capture the underlying structure of real data, which contributes to the realism of the synthetic data.
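A sketch of a feature matching loss that averages the L1 distance between intermediate activations for real and synthetic inputs; the list-of-tensors interface is an assumption about how those activations are exposed.

```python
import torch

def feature_matching_loss(real_feats, fake_feats):
    """real_feats and fake_feats are lists of intermediate activation tensors,
    one per feature-extractor layer, for real and synthetic inputs respectively."""
    loss = 0.0
    for real, fake in zip(real_feats, fake_feats):
        loss = loss + torch.mean(torch.abs(real.detach() - fake))
    return loss
```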
Mel Spectrogram Loss (loss_mel):
The mel spectrogram loss compares the mel spectrograms of real and synthetic data samples. Mel spectrograms are a way of representing the frequency content of an audio signal, and this loss encourages the generator to produce synthetic data that sounds similar to real data.
A decreasing mel spectrogram loss indicates that the generator is producing synthetic data with a similar spectral distribution to real data. This is particularly important for audio generation tasks, as it ensures that the synthetic audio sounds similar to real audio.
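The comparison is often an L1 distance between log-mel spectrograms. The following sketch uses torchaudio's MelSpectrogram transform with arbitrary settings; both the library choice and the parameters are illustrative assumptions, not RVC's exact configuration.

```python
import torch
import torchaudio

# Illustrative settings; real systems choose these to match their vocoder.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=40000, n_fft=1024, hop_length=256, n_mels=80)

def mel_spectrogram_loss(real_audio, fake_audio):
    # Compare log-mel spectrograms of real and synthetic audio with an L1 distance.
    eps = 1e-5
    real_mel = torch.log(mel_transform(real_audio) + eps)
    fake_mel = torch.log(mel_transform(fake_audio) + eps)
    return torch.nn.functional.l1_loss(fake_mel, real_mel)
```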
KL Divergence Loss (loss_kl):
The KL divergence loss encourages the generator to produce synthetic data that has a similar distribution of latent variables to real data. Latent variables are a representation of the underlying factors that generate the data, and this loss ensures that the generator is not simply memorizing real data samples but is learning to capture the underlying patterns in the data.
A decreasing KL divergence loss indicates that the generator is producing synthetic data with a similar distribution of latent variables to real data. This suggests that the generator is not simply memorizing real data samples but is learning to capture the underlying patterns in the data, which leads to more generalizable synthetic data.
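For models with Gaussian latent variables (as in VITS-style architectures), the KL term between a diagonal Gaussian posterior and prior has a closed form. The sketch below is a generic version of that computation, not RVC's exact code.

```python
import torch

def kl_divergence_loss(m_q, logs_q, m_p, logs_p):
    """Closed-form KL divergence between two diagonal Gaussians:
    a posterior N(m_q, exp(logs_q)^2) and a prior N(m_p, exp(logs_p)^2)."""
    kl = logs_p - logs_q - 0.5
    kl = kl + 0.5 * (torch.exp(2.0 * logs_q) + (m_q - m_p) ** 2) * torch.exp(-2.0 * logs_p)
    return torch.mean(kl)
```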