spec2vec.model_building module

This module contains functions that help users train a word2vec model with gensim.

spec2vec.model_building.learning_rates_to_gensim_style(num_of_epochs, **settings)[source]

Convert “learning_rate_initial” and “learning_rate_decay” to gensim “alpha” and “min_alpha”.

spec2vec.model_building.set_learning_rate_decay(learning_rate_initial: float, learning_rate_decay: float, num_of_epochs: int) Tuple[float, float][source]

The learning rate in Gensim model training is defined by an initial rate (alpha) and a final rate (min_alpha), which can be unintuitive. Here, those parameters are set based on the given values for learning_rate_initial, num_of_epochs, and learning_rate_decay.

Parameters
  • learning_rate_initial – Set initial learning rate.

  • learning_rate_decay – After every epoch, the learning rate will be lowered by the learning_rate_decay.

  • num_of_epochs – Total number of epochs for training.

Returns
  • alpha – Initial learning rate.

  • min_alpha – Final learning rate.
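The relationship between these values can be illustrated with a minimal sketch. It assumes that the decay is applied linearly once per epoch and that the final rate is clamped at zero. The function name sketch_learning_rate_decay is hypothetical and this is not the library's own implementation.

```python
from typing import Tuple


def sketch_learning_rate_decay(learning_rate_initial: float,
                               learning_rate_decay: float,
                               num_of_epochs: int) -> Tuple[float, float]:
    """Illustrative only: derive gensim-style (alpha, min_alpha)."""
    alpha = learning_rate_initial
    # Assumption: the rate drops by learning_rate_decay after every epoch.
    min_alpha = learning_rate_initial - num_of_epochs * learning_rate_decay
    # Assumption: a negative final rate is clamped to zero.
    min_alpha = max(min_alpha, 0.0)
    return alpha, min_alpha


# Documented defaults (0.025 and 0.00025) over 20 epochs.
alpha, min_alpha = sketch_learning_rate_decay(0.025, 0.00025, 20)
print(alpha, round(min_alpha, 6))
```

With the documented defaults and 20 epochs, the final rate comes out at roughly 0.02, i.e. the initial rate minus 20 times the decay.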

spec2vec.model_building.set_spec2vec_defaults(**settings)[source]

Set spec2vec default argument values (where no user input is given).
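One plausible way to picture this behaviour is a dictionary merge in which user-supplied values take precedence. The sketch below is an assumption, not the library's code: the name sketch_set_defaults is hypothetical, and the default values are the ones listed for train_new_word2vec_model in this document. The actual function may set further defaults.

```python
# Defaults as documented for train_new_word2vec_model (gensim 3.x argument names).
DOCUMENTED_DEFAULTS = {
    "sg": 0,
    "negative": 5,
    "size": 300,
    "window": 500,
    "min_count": 1,
    "workers": 4,
    "learning_rate_initial": 0.025,
    "learning_rate_decay": 0.00025,
}


def sketch_set_defaults(**settings):
    """Illustrative only: fill in defaults where the user gave no value."""
    merged = dict(DOCUMENTED_DEFAULTS)  # copy, so the defaults stay untouched
    merged.update(settings)             # user input overrides defaults
    return merged


settings = sketch_set_defaults(size=200, workers=1)
print(settings["size"], settings["window"])
```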

spec2vec.model_building.train_new_word2vec_model(documents: List, iterations: Union[List[int], int], filename: Optional[str] = None, progress_logger: bool = True, **settings) Word2Vec[source]

Train a new Word2Vec model (using gensim). Save to file if filename is given.

Example code on how to train a word2vec model on a corpus (=list of documents) that is derived from a given set of spectrums (list of matchms.Spectrum instances):

from spec2vec import SpectrumDocument
from spec2vec.model_building import train_new_word2vec_model

# spectrums: list of matchms.Spectrum instances
documents = [SpectrumDocument(s, n_decimals=1) for s in spectrums]
model = train_new_word2vec_model(documents, iterations=20, size=200,
                                 workers=1, progress_logger=False)

Parameters
  • documents – List of documents, each document being a list of words (strings).

  • iterations – Specifies the number of training iterations. Either set iterations to the total number of training epochs (e.g. “iterations=15”), or pass a list of iterations (e.g. “iterations=[5,10,15]”), which will also lead to a total training of max(iterations) epochs but will save the model at every iteration in the list. Temporary models will be saved using the name “filename_TEMP_{#iteration}epoch.model”.

  • filename (str) – Filename to save the model. Default is None, which means no model will be saved. If a list of iterations is passed (e.g. “iterations=[5,10,15]”), intermediate models will be saved during training (here after 5 and 10 iterations) using the pattern filename_TEMP_{#iteration}epoch.model.

  • learning_rate_initial – Set initial learning rate. Default is 0.025.

  • learning_rate_decay – After every epoch the learning rate will be lowered by the learning_rate_decay. Default is 0.00025.

  • progress_logger – If True, the training progress will be printed every epoch. Default is True.

  • **settings – All other named arguments will be passed to the gensim.models.word2vec.Word2Vec constructor.

  • sg (int (0,1)) – sg = 0 selects the CBOW model, sg = 1 the skip-gram model (see Gensim documentation). Default for Spec2Vec is 0.

  • negative (int) – From Gensim: if > 0, negative sampling will be used; the int for negative specifies how many “noise words” should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used. Default for Spec2Vec is 5.

  • size (int) – Dimensions of word vectors. Default is 300.

  • window (int) – Window size for context words (small for local context, larger for global context). Spec2Vec expects large windows. Default is 500.

  • min_count (int) – Only consider words that occur at least min_count times in the corpus. Default is 1.

  • workers (int) – Number of threads to run the training on (should not be more than the number of cores/threads). Default is 4.

Returns

Gensim word2vec model.

Return type

word2vec_model
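The interplay between iterations and filename described above can be sketched without running any training. The helper below is hypothetical (sketch_checkpoint_plan is not part of spec2vec). It assumes that a trailing “.model” is stripped from filename before the “_TEMP_…” suffix is appended, and that only iterations below max(iterations) produce intermediate files.

```python
from typing import List, Tuple, Union


def sketch_checkpoint_plan(iterations: Union[List[int], int],
                           filename: str) -> Tuple[int, List[str]]:
    """Illustrative only: total epochs and intermediate model filenames."""
    if isinstance(iterations, int):
        iterations = [iterations]
    iterations = sorted(iterations)
    total_epochs = max(iterations)
    # Assumption: the ".model" suffix is removed before building temp names.
    suffix = ".model"
    stem = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    # Assumption: the last iteration is the final model, not a temp file.
    intermediate = [f"{stem}_TEMP_{n}epoch.model"
                    for n in iterations if n < total_epochs]
    return total_epochs, intermediate


total, temp_files = sketch_checkpoint_plan([5, 10, 15], "my_model.model")
print(total)
print(temp_files)
```

Under these assumptions, “iterations=[5,10,15]” trains for 15 epochs in total and writes two intermediate files, after epochs 5 and 10.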