spec2vec.model_building module¶
This module contains functions that help users train a word2vec model through gensim.
- spec2vec.model_building.learning_rates_to_gensim_style(num_of_epochs, **settings)[source]¶
Convert “learning_rate_initial” and “learning_rate_decay” to gensim “alpha” and “min_alpha”.
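As an illustration of this mapping, the conversion can be pictured as popping the Spec2Vec-style keys from the settings and replacing them with gensim's "alpha" and "min_alpha" (a sketch based on the documented defaults, not the library source; the function name below is illustrative):

```python
def learning_rates_to_gensim_style_sketch(num_of_epochs, **settings):
    # Hedged sketch: convert Spec2Vec-style learning-rate settings into the
    # "alpha"/"min_alpha" pair expected by gensim's Word2Vec. Defaults mirror
    # the documented values (0.025 and 0.00025).
    alpha = settings.pop("learning_rate_initial", 0.025)
    decay = settings.pop("learning_rate_decay", 0.00025)
    settings["alpha"] = alpha
    # Lower the rate by `decay` after every epoch; never go below zero.
    settings["min_alpha"] = max(alpha - num_of_epochs * decay, 0.0)
    return settings

settings = learning_rates_to_gensim_style_sketch(15, workers=4)
# settings now holds "alpha", "min_alpha", and the untouched "workers" key
```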
- spec2vec.model_building.set_learning_rate_decay(learning_rate_initial: float, learning_rate_decay: float, num_of_epochs: int) Tuple[float, float] [source]¶
The learning rate in Gensim model training is defined by an initial rate (alpha) and a final rate (min_alpha), which can be unintuitive. Here those parameters are set based on the given values for learning_rate_initial, num_of_epochs, and learning_rate_decay.
- Parameters
learning_rate_initial – Set initial learning rate.
learning_rate_decay – After every epoch, the learning rate will be lowered by learning_rate_decay.
num_of_epochs – Total number of epochs for training.
- Returns
alpha – Initial learning rate.
min_alpha – Final learning rate.
- spec2vec.model_building.set_spec2vec_defaults(**settings)[source]¶
Set spec2vec default argument values (where no user input is given).
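The default handling can be pictured as a simple merge in which user-supplied settings take precedence (a sketch only; the dictionary below restates defaults documented on this page, and the function name is illustrative):

```python
# Spec2Vec defaults as documented on this page.
SPEC2VEC_DEFAULTS = {
    "sg": 0,
    "negative": 5,
    "size": 300,
    "window": 500,
    "min_count": 1,
    "workers": 4,
    "learning_rate_initial": 0.025,
    "learning_rate_decay": 0.00025,
}

def set_spec2vec_defaults_sketch(**settings):
    # Fill in a default only where the user gave no value.
    for key, value in SPEC2VEC_DEFAULTS.items():
        settings.setdefault(key, value)
    return settings

settings = set_spec2vec_defaults_sketch(size=200, workers=1)
# user values win: size=200, workers=1; everything else gets the default
```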
- spec2vec.model_building.train_new_word2vec_model(documents: List, iterations: Union[List[int], int], filename: Optional[str] = None, progress_logger: bool = True, **settings) Word2Vec [source]¶
Train a new Word2Vec model (using gensim). Save to file if filename is given.
Example code on how to train a word2vec model on a corpus (= a list of documents) derived from a given set of spectrums (list of matchms.Spectrum instances):

from spec2vec import SpectrumDocument
from spec2vec.model_building import train_new_word2vec_model

documents = [SpectrumDocument(s, n_decimals=1) for s in spectrums]
model = train_new_word2vec_model(documents, iterations=20, size=200,
                                 workers=1, progress_logger=False)
- Parameters
documents – List of documents, each document being a list of words (strings).
iterations – Specifies the number of training iterations. This can be done by setting iterations to the total number of training epochs (e.g. “iterations=15”), or by passing a list of iterations (e.g. “iterations=[5,10,15]”), which will also lead to a total training of max(iterations) epochs, but will save the model for every iteration in the list. Temporary models will be saved using the name: filename_TEMP_{#iteration}epoch.model.
filename (str) – Filename to save the model. Default is None, which means no model will be saved. If a list of iterations is passed (e.g. “iterations=[5,10,15]”), then intermediate models will be saved during training (here after 5 and 10 iterations) using the pattern: filename_TEMP_{#iteration}epoch.model
learning_rate_initial – Set initial learning rate. Default is 0.025.
learning_rate_decay – After every epoch the learning rate will be lowered by the learning_rate_decay. Default is 0.00025.
progress_logger – If True, the training progress will be printed every epoch. Default is True.
**settings – All other named arguments will be passed to the gensim.models.word2vec.Word2Vec constructor.
sg (int (0,1)) – For sg = 0 a CBOW model is used, for sg = 1 a skip-gram model (see Gensim documentation). Default for Spec2Vec is 0.
negative (int) – from Gensim: If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used. Default for Spec2Vec is 5.
size (int) – Dimensions of the word vectors. Default is 300.
window (int) – Window size for context words (small for local context, larger for global context). Spec2Vec expects large windows. Default is 500.
min_count (int) – Only consider words that occur at least min_count times in the corpus. Default is 1.
workers (int) – Number of threads to run the training on (should not be more than the number of cores/threads). Default is 4.
- Returns
Gensim word2vec model.
- Return type
word2vec_model
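When a list of iterations is passed, the checkpoint pattern described above produces intermediate filenames like the following (a sketch of the naming only; the helper is hypothetical and no model is trained here):

```python
def checkpoint_names(filename, iterations):
    # Accept a single int or a list, as train_new_word2vec_model does.
    if isinstance(iterations, int):
        iterations = [iterations]
    total_epochs = max(iterations)
    # Intermediate models use the documented pattern; the final model is
    # saved under `filename` itself.
    intermediates = [f"{filename}_TEMP_{i}epoch.model"
                     for i in sorted(iterations) if i < total_epochs]
    return total_epochs, intermediates

total_epochs, names = checkpoint_names("mymodel", [5, 10, 15])
# total_epochs is 15; intermediate saves after epoch 5 and epoch 10
```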