Voice Changing with Machine Learning

Tools

The quality of your microphone and cables can affect the real-time voice conversion result.

VC Client

DirectML or CUDA?

TL;DR

If you have an NVIDIA GPU, go for CUDA. If you have an AMD or Intel Arc GPU, go for DirectML. For macOS, there is only one version.

DirectML and CUDA are both programming interfaces for running machine learning workloads on GPUs. However, they differ in several ways:

  • DirectML is a Microsoft-developed API that is part of the DirectX 12 graphics API. It is specifically designed for machine learning workloads and is optimized for DirectX-compatible GPUs.

  • CUDA is an NVIDIA-developed API that is designed for general-purpose computing on NVIDIA GPUs. It is also widely used for machine learning workloads, but it is not as tightly integrated with DirectX as DirectML is.

In general, DirectML is a good choice for machine learning workloads running on DirectX-compatible GPUs. It is also a good choice for workloads built with frameworks that offer DirectML backends, such as TensorFlow and PyTorch (via their DirectML plugins).

CUDA is a more versatile API that can be used for a wider range of workloads. It is also a good choice for workloads built with frameworks that have first-class CUDA support, such as TensorFlow and PyTorch, as well as other frameworks such as MXNet and PaddlePaddle.
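
If you want to check which path your machine actually supports, one quick diagnostic is to ask ONNX Runtime for its available execution providers. This is only a sketch: it assumes a Python environment with the appropriate onnxruntime wheel installed, which is separate from the runtime the VC Client bundles.

```python
# Sketch: list which GPU execution providers ONNX Runtime can see.
# Assumes `pip install onnxruntime-directml` (Windows/DirectML) or
# `onnxruntime-gpu` (CUDA); the plain `onnxruntime` wheel is CPU-only.
import onnxruntime as ort

providers = ort.get_available_providers()
print(providers)  # e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']

if "CUDAExecutionProvider" in providers:
    print("NVIDIA GPU path available: prefer the CUDA build")
elif "DmlExecutionProvider" in providers:
    print("DirectML path available: prefer the DirectML build")
else:
    print("No GPU provider found: conversion would fall back to CPU")
```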

Advanced Setting

  • Protocol: Sio or Rest (try both and see what you prefer)
  • Crossfade: overlap 4096 start 0.2 end 0.8
  • Trancate: 300
  • SilenceFront: Off
  • Protect: 0.5
  • RVC Quality: Low (switching to High sharply increases GPU and CPU usage for basically no audible difference).
  • Skip Pass through confirmation: No
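
For reference, here are those settings collected in one place as a Python dict. The key names are invented for readability; they are not the VC Client’s actual configuration schema.

```python
# Illustrative only: the advanced settings above, with made-up key names.
ADVANCED_SETTINGS = {
    "protocol": "sio",          # or "rest"; try both and see what you prefer
    "crossfade_overlap": 4096,
    "crossfade_start": 0.2,
    "crossfade_end": 0.8,
    "trancate": 300,            # as listed above
    "silence_front": False,
    "protect": 0.5,
    "rvc_quality": "low",       # "high" costs far more GPU/CPU for little gain
    "skip_passthrough_confirmation": False,
}
```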

Audio Setup

S. Threshold

The minimum input level required before the client starts converting audio. Anything below it is considered silence, hence the name Silence Threshold. Note that if you raise IN Gain at the top, the boosted signal can sit above the threshold at all times, in which case nothing you set here will matter.
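
As a rough mental model, a silence gate of this kind compares each incoming block’s level against the threshold. The sketch below assumes an RMS comparison; the client’s exact units may differ.

```python
import numpy as np

def is_silence(block: np.ndarray, threshold: float) -> bool:
    """Treat an audio block as silence if its RMS level is below threshold.

    `block` is a float array of samples in [-1.0, 1.0]; `threshold` plays
    the role of S. Threshold. Raising input gain scales `block` up, which
    is why a high IN Gain can push everything past the gate.
    """
    rms = np.sqrt(np.mean(block ** 2))
    return rms < threshold

# Blocks that fail the gate would simply be skipped instead of converted.
```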

Explanations

Chunk

Decides how much audio is cut off and converted in one pass. The higher the value, the more efficient the conversion, but the larger the buf value, the longer the maximum wait before the conversion starts.[1]

The lower the Chunk, the lower the latency before the converted voice comes out, but the lower the quality becomes. Find the smallest value that produces no audio glitches, i.e. stuttering, repeated words, cutting in and out, or lag.[2]
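
As a back-of-the-envelope check, if a Chunk corresponds to a buffer of N samples at the device sample rate, the minimum added latency is N / rate. The sketch below assumes 48 kHz and that Chunk maps directly to samples; the client may scale its Chunk units differently.

```python
def chunk_latency_ms(chunk_samples: int, sample_rate: int = 48_000) -> float:
    """Minimum buffering delay added by a chunk of `chunk_samples` samples."""
    return 1000.0 * chunk_samples / sample_rate

print(chunk_latency_ms(4096))   # ~85.3 ms at 48 kHz
print(chunk_latency_ms(16384))  # ~341.3 ms: more efficient, but noticeably laggy
```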

Extra

Determines how much past audio to include in the input when converting. The more past audio is included, the better the accuracy of the conversion, but the longer the computation takes.[2]

For Extra, start with 4096 while testing your Chunk values. The larger this value, the more CPU resources it uses. There is probably zero point in going higher than 32768, though it can make your voice slightly clearer.[2]
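
Conceptually, Extra prepends a tail of past audio to each chunk before conversion and discards that portion afterwards, which is why a larger value buys accuracy at the cost of compute. A minimal sketch, assuming sample-unit values and a hypothetical run_model call:

```python
import numpy as np

def convert_with_extra(chunk: np.ndarray, history: np.ndarray,
                       extra: int = 4096) -> np.ndarray:
    """Sketch of why Extra costs compute but improves accuracy.

    The model sees `extra` samples of past audio plus the new chunk, so a
    larger Extra means a longer input and a slower conversion, but only
    the part corresponding to the new chunk is kept.
    """
    context = history[-extra:] if extra > 0 else history[:0]
    model_input = np.concatenate([context, chunk])
    converted = run_model(model_input)  # hypothetical model call
    return converted[-len(chunk):]      # drop the context portion
```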

Tune

Tune is voice dependent. For female to male, you want a NEGATIVE tune, usually -12. For female to female, you ideally don’t have to change anything, though you might depending on how soft your voice is in comparison. For male to female, you want a POSITIVE tune, usually +12.[2]
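
Tune is measured in semitones: a shift of n semitones multiplies the fundamental frequency by 2^(n/12), so ±12 is exactly one octave.

```python
def pitch_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` semitones."""
    return 2.0 ** (semitones / 12.0)

print(pitch_ratio(+12))  # 2.0 -> one octave up (typical male-to-female)
print(pitch_ratio(-12))  # 0.5 -> one octave down (typical female-to-male)
# Example: a 120 Hz voice shifted +12 lands at 240 Hz.
```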

Index

Index is only really beneficial if your accent is HEAVY or DOESN’T MATCH the accent of the voice you want. But the cost is CPU usage, 300% more usage to be exact. It is recommended not to use this and to just speak naturally.[2]
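
Under the hood, an RVC-style index pulls each frame of your voice’s phonetic features toward its nearest neighbour in the training set, blended by an index rate. The real pipeline uses a faiss index rather than the brute-force search sketched below, but the brute-force version shows where the extra CPU cost comes from.

```python
import numpy as np

def apply_index(frames: np.ndarray, train_feats: np.ndarray,
                index_rate: float = 0.5) -> np.ndarray:
    """Hedged sketch of what an RVC-style feature index does.

    For each input frame (shape: n x d), find the closest frame in the
    training-set features (m x d) and blend toward it. Searching the whole
    training set per frame is the reason Index costs so much extra CPU.
    """
    # Pairwise squared distances: (n, m)
    d = ((frames[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nearest = train_feats[d.argmin(axis=1)]
    return index_rate * nearest + (1.0 - index_rate) * frames
```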

How to train your own voice model

Download Retrieval-based-Voice-Conversion-WebUI, then open the go-web.bat file.

According to the “Guide for W-Okada’s RealTimeVoiceChangerClient” by Raven:

A dataset of around 10 minutes to 50 minutes is recommended.

Total training epochs (total_epoch)

According to the FAQ from RVC-Project/Retrieval-based-Voice-Conversion-WebUI on GitHub[3]:

If the training dataset’s audio quality is poor and the noise floor is high, 20-30 epochs are sufficient. Setting it too high won’t improve the audio quality of your low-quality training set.

If the training set audio quality is high, the noise floor is low, and there is sufficient duration, you can increase it. 200 is acceptable (since training is fast, and if you’re able to prepare a high-quality training set, your GPU likely can handle a longer training duration without issue).

References


  1. “Realtime Voice Changer Client for RVC Tutorial (v.1.5.3.13)”. w-okada/voice-changer, GitHub. Retrieved January 4, 2024.

  2. Raven (December 29, 2023). “Guide for W-Okada’s RealTimeVoiceChangerClient”. Rentry.co. Archived from the original on January 4, 2024. Retrieved January 4, 2024.

  3. RVC-Project/Retrieval-based-Voice-Conversion-WebUI (August 31, 2023). “FAQ (Frequently Asked Questions)”. GitHub. Archived from the original on January 4, 2024. Retrieved January 4, 2024.