# How to fine-tune the existing checkpoints

## Vocoder matching
Vocoder (i.e. your spec-to-wav model) matching is an important part of the TTS pipeline. Because a spec-to-wav model is trained on ground-truth Mel spectrograms computed from your audio, there is a mismatch between the Mel spectrograms created by your text-to-spec model and the ones the pre-trained vocoders have seen during training. For that reason, it can be helpful to fine-tune your spec-to-wav model on the Mel spectrograms generated by your text-to-spec model.
You can use a pre-trained EveryVoice spec-to-wav model for this process and do not have to train your own spec-to-wav model. You can find EveryVoice pre-trained checkpoints here.
**Note:** Vocoder matching will only help with the metallic artefacts that sometimes occur when synthesizing speech. If your model is not intelligible or has other types of errors, like mispronunciations, vocoder matching will not solve them. In these cases, the problem is likely with your text-to-spec model, probably due to noisy data (noisy recordings, mistranscriptions, etc.), too little data, or data that is too varied (many different speakers). Please refer to TODO: troubleshooting for more information.
To fine-tune your spec-to-wav model (a.k.a. 'vocoder') with Mel spectrograms from your text-to-spec model, you need to have a pre-trained text-to-spec model and a pre-trained spec-to-wav model ready. This process is also sometimes called 'vocoder matching'. You also need access to some parallel text/audio data (the same or similar data that you used to train your text-to-spec model).
Then you:
- Generate a folder full of Mel spectrograms from your text-to-spec model (repeat this process for both your training and validation filelists):

    Training filelist:

    ```bash
    everyvoice synthesize from-text <path-to-your-text-to-spec.ckpt> -O spec --filelist <path-to-your-training-filelist.psv> --teacher-forcing-directory <path-to-your-preprocessed-directory> --output-dir <path-to-your-preprocessed-directory>
    ```

    Validation filelist:

    ```bash
    everyvoice synthesize from-text <path-to-your-text-to-spec.ckpt> -O spec --filelist <path-to-your-validation-filelist.psv> --teacher-forcing-directory <path-to-your-preprocessed-directory> --output-dir <path-to-your-preprocessed-directory>
    ```
    **Note:** For vocoder matching to work, the size of the generated Mel spectrograms has to be the same as the ground-truth Mel spectrograms calculated from the audio, so you have to use 'teacher forcing' to force the text-to-spec model to output spectrograms of a specific size. To do this, we add the `--teacher-forcing-directory` option and point it at the project `preprocessed` directory with the processed files from our filelist. This will write a `synthesized_spec` folder within your `preprocessed` directory (see the directory sketch after this list), which you can use instead of the ground-truth `spec` data by setting `finetune` to True as described in the next step.
- Set the `finetune_checkpoint` value in the `everyvoice-spec-to-wav.yaml` configuration file to the vocoder checkpoint that you want to fine-tune (see the configuration sketch after this list). This can be either a model you have trained yourself or a pre-trained checkpoint.
- Lower the learning rate (we suggest starting at 0.00001).

- Train the vocoder again with `finetune` set to True (train for at least 25,000 steps, but preferably 100,000 steps):

    ```bash
    everyvoice train spec-to-wav config/everyvoice-spec-to-wav.yaml -c training.finetune=True
    ```
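For reference, after running the synthesis commands in step 1, your `preprocessed` directory should contain both the ground-truth and the generated spectrograms, roughly like the sketch below. Only the `spec` and `synthesized_spec` folder names come from this guide; anything else in your `preprocessed` directory is left untouched:

```
preprocessed/
├── spec/               # ground-truth Mel spectrograms computed from your audio
└── synthesized_spec/   # Mel spectrograms generated by your text-to-spec model
```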
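And here is a minimal sketch of the relevant training settings in `config/everyvoice-spec-to-wav.yaml`. The `training.finetune` flag matches the training command in step 4, but the exact placement of `finetune_checkpoint`, the learning-rate key, and the checkpoint path shown here are assumptions, so check your own generated configuration file for the exact layout:

```yaml
training:
  # Use the synthesized_spec folder instead of the ground-truth spec data (step 4)
  finetune: true
  # Vocoder checkpoint to fine-tune: your own model or a pre-trained EveryVoice
  # checkpoint (step 2); the path below is hypothetical
  finetune_checkpoint: logs_and_checkpoints/VocoderExperiment/base/checkpoints/last.ckpt
  # Lowered learning rate for fine-tuning (step 3); the key name is an assumption
  learning_rate: 0.00001
```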