Configuration¶
Each model has a statically typed configuration model. Each configuration comes with default settings, which are instantiated when the model is instantiated. For example, to create a default preprocessing configuration, you would write:
from everyvoice.config.preprocessing_config import PreprocessingConfig
preprocessing_config = PreprocessingConfig()
Static typing means that misconfiguration errors should occur as soon as the configuration is instantiated instead of producing downstream runtime errors. It also means that intellisense is available in your code editor when working with a configuration class.
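The fail-early behaviour described here can be illustrated with a minimal sketch that uses only the Python standard library. ToyAudioConfig and its fields are hypothetical stand-ins and are not EveryVoice classes; the point is only that a bad value is rejected when the object is constructed, not later during training.

```python
from dataclasses import dataclass


@dataclass
class ToyAudioConfig:
    """Hypothetical stand-in for a typed configuration; not an EveryVoice class."""

    sampling_rate: int = 22050
    n_mels: int = 80

    def __post_init__(self) -> None:
        # Validate eagerly so a bad value fails at construction time,
        # not later in a training loop.
        for name in ("sampling_rate", "n_mels"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{name} must be an int, got {type(value).__name__}")
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")


default_config = ToyAudioConfig()  # the defaults pass validation

try:
    ToyAudioConfig(sampling_rate="22050")  # wrong type: rejected immediately
    error_message = None
except TypeError as err:
    error_message = str(err)
```

The real configuration classes do their own validation; this sketch only mirrors the idea of catching misconfiguration at instantiation.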
Sharing Configurations¶
The Text and Preprocessing configurations should only be defined once per dataset and shared between your models to ensure each model makes the same assumptions about your data. To achieve that, each model configuration can also be defined as a path to a configuration file. So, a configuration for an aligner that uses separately defined text and audio preprocessing configurations might look like this:
model:
  lstm_dim: 512
  conv_dim: 512
  ...
training:
  batch_size: 32
  ...
preprocessing: "./config/default/everyvoice-shared-data.yaml"
text: "./config/default/everyvoice-shared-text.yaml"
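As a rough sketch of the idea, a loader can accept either an inline section or a path and resolve both to the same kind of object. The helper resolve_section, the file name shared-text.json, and the choice to resolve relative paths against the parent configuration's directory are all assumptions made for illustration; JSON is used instead of YAML only to keep the example within the standard library. This is not EveryVoice's own loading code.

```python
import json
import tempfile
from pathlib import Path


def resolve_section(value, base_dir: Path) -> dict:
    """Return a config section, loading it from disk if it is given as a path.

    Hypothetical helper for illustration; not part of EveryVoice.
    """
    if isinstance(value, dict):
        return value  # the section was defined inline
    path = Path(value)
    if not path.is_absolute():
        path = base_dir / path  # assumption: relative to the parent config
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # One shared text configuration, written once per dataset.
    (base / "shared-text.json").write_text(
        json.dumps({"symbols": {"letters": ["a", "b", "c"]}}), encoding="utf-8"
    )
    # Two model configurations point at the same file instead of copying it.
    aligner = {"training": {"batch_size": 32}, "text": "shared-text.json"}
    vocoder = {"training": {"batch_size": 16}, "text": "shared-text.json"}

    aligner_text = resolve_section(aligner["text"], base)
    vocoder_text = resolve_section(vocoder["text"], base)
```

Because both models read the same file, a change to the shared text settings cannot silently diverge between them.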
Serialization¶
By default, configuration objects are serialized as dictionaries, which works as expected with integers, floats, lists, booleans, dicts, etc. But there are some cases where you need to specify a Callable in your configuration. For example, the TextConfig has a cleaners field that takes a list of Callables to apply, in order, to raw text.
By default, these functions turn raw text to lowercase, collapse whitespace, and normalize using Unicode NFC normalization. In Python, we could instantiate this by passing the callables directly like so:
from everyvoice.config.text_config import TextConfig
from everyvoice.utils import collapse_whitespace, lower, nfc_normalize
text_config = TextConfig(cleaners=[lower, collapse_whitespace, nfc_normalize])
For YAML or JSON configuration, however, we need to serialize these functions. To do so, EveryVoice turns each callable into module dot-notation. That is, your configuration will look like this in YAML:
cleaners:
- everyvoice.utils.lower
- everyvoice.utils.collapse_whitespace
- everyvoice.utils.nfc_normalize
This will then be de-serialized upon instantiation of your configuration.
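The round trip between a callable and its dot-notation string can be sketched with the standard library. The helpers to_dotted and from_dotted are hypothetical names, and textwrap.dedent stands in for a cleaner function; EveryVoice's actual serialization code may differ.

```python
import importlib
import textwrap
from typing import Callable


def to_dotted(fn: Callable) -> str:
    """Serialize a module-level function as 'package.module.name'."""
    return f"{fn.__module__}.{fn.__qualname__}"


def from_dotted(path: str) -> Callable:
    """Import the module part of the path and look up the function by name."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


# Serialize a function to a string, then recover the same function object.
serialized = to_dotted(textwrap.dedent)
restored = from_dotted(serialized)
```

This approach only works for functions that can be imported by name, so lambdas and functions defined inside other functions cannot be round-tripped this way.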
Text Configuration¶
The TextConfig is where you define the symbol set for your data and any cleaners used to clean your raw text into the form needed for your data. You can share the TextConfig with any models that need it; you only need one text configuration per dataset (and possibly only one per language).
TextConfig¶
everyvoice.config.text_config.TextConfig¶
Bases: ConfigModel
Source code in everyvoice/config/text_config.py
Symbols¶
Your symbol set is created by taking the union of all values defined. For example:
symbols:
  dataset_0_characters: ['a', 'b', 'c']
  dataset_1_characters: ['b', 'c', 'd']
This will create a symbol set equal to {'a', 'b', 'c', 'd'} (i.e. the union of the values under both keys). This allows you to train models with data from different languages, for example.
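In plain Python, the same union can be computed as follows. This is only an illustration of the set arithmetic on an ordinary dict, not the code EveryVoice runs.

```python
symbols = {
    "dataset_0_characters": ["a", "b", "c"],
    "dataset_1_characters": ["b", "c", "d"],
}

# Union of every list of symbols, regardless of which key it came from.
symbol_set = set().union(*symbols.values())
```

Symbols shared between datasets, such as 'b' and 'c' here, appear only once in the result.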
Important
You should always manually inspect your configuration here to make sure it makes sense with respect to your data. Is there a symbol that shouldn't be there? Is there a symbol that's defined as 'punctuation' but is used as non-punctuation in your language? Please inspect these and update the configuration accordingly.
everyvoice.config.text_config.Symbols¶
Bases: BaseModel
Source code in everyvoice/config/text_config.py
all_except_punctuation property¶
Returns the set containing all characters except punctuation.
cannot_have_punctuation_in_symbol_set()¶
You cannot have the same symbol defined in punctuation as elsewhere.
Raises:
Type | Description |
---|---|
ValueError | raised if a symbol from punctuation is found elsewhere |
Returns:
Name | Type | Description |
---|---|---|
Symbols | Symbols | The validated symbol set |
Source code in everyvoice/config/text_config.py
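The rule this validator enforces can be sketched on a plain dictionary. The helper check_no_punctuation_overlap is hypothetical, and treating punctuation as a flat list of strings is an assumption made for illustration; the actual Symbols model may structure punctuation differently.

```python
def check_no_punctuation_overlap(symbols: dict) -> dict:
    """Raise ValueError if a punctuation symbol also appears under another key.

    Hypothetical helper; `symbols` is a plain dict with a "punctuation" list.
    """
    punctuation = set(symbols.get("punctuation", []))
    for key, members in symbols.items():
        if key == "punctuation":
            continue
        overlap = punctuation & set(members)
        if overlap:
            raise ValueError(
                f"{sorted(overlap)} defined in both 'punctuation' and '{key}'"
            )
    return symbols  # unchanged, now known to be consistent


valid = check_no_punctuation_overlap(
    {"punctuation": [".", ","], "letters": ["a", "b"]}
)

try:
    # The apostrophe is listed as punctuation and as a letter: rejected.
    check_no_punctuation_overlap({"punctuation": ["'"], "letters": ["a", "'"]})
    overlap_error = None
except ValueError as err:
    overlap_error = str(err)
```

A case like the apostrophe is worth checking by hand, since in some languages it is a letter and should be removed from punctuation instead.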
member_must_be_list_of_strings()¶
Except for punctuation & pad, all user-defined member variables have to be a list of strings.
Source code in everyvoice/config/text_config.py
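A minimal sketch of this rule on a plain dictionary follows. The helper check_members_are_string_lists and the EXEMPT_KEYS constant are hypothetical names, and the single-character pad value is an assumption; the sketch only mirrors the description, not the library's implementation.

```python
EXEMPT_KEYS = {"punctuation", "pad"}  # assumed exempt, per the description


def check_members_are_string_lists(symbols: dict) -> dict:
    """Raise ValueError unless every non-exempt member is a list of strings."""
    for key, members in symbols.items():
        if key in EXEMPT_KEYS:
            continue
        is_list_of_str = isinstance(members, list) and all(
            isinstance(m, str) for m in members
        )
        if not is_list_of_str:
            raise ValueError(f"'{key}' must be a list of strings, got {members!r}")
    return symbols


checked = check_members_are_string_lists({"pad": "_", "letters": ["a", "b"]})

try:
    # A bare string is not a list of strings, even though it is iterable.
    check_members_are_string_lists({"letters": "abc"})
    member_error = None
except ValueError as err:
    member_error = str(err)
```

Rejecting a bare string matters because iterating over "abc" would otherwise silently yield the symbols 'a', 'b', and 'c'.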