spec2vec.Spec2Vec module
- class spec2vec.Spec2Vec.Spec2Vec(model: Union[Word2Vec, Word2VecLight], intensity_weighting_power: Union[float, int] = 0, allowed_missing_percentage: Union[float, int] = 10, progress_bar: bool = False)
Bases: BaseSimilarity
Calculate spec2vec similarity scores between a reference and a query.
Using a trained model, spectrum documents will be converted into spectrum vectors. The spec2vec similarity is then the cosine similarity score between two spectrum vectors.
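As a minimal illustration of the last step, the sketch below computes the cosine similarity between two vectors with NumPy. The function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions made for this example, not a description of the library's internal code.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0:
        # Assumption: an empty (all-zero) vector gives a similarity of 0.
        return 0.0
    return float(np.dot(u, v) / norm_product)


# Two toy "spectrum vectors" pointing in similar directions.
score = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.1])
```

Because the cosine depends only on direction, scaling either vector by a positive constant leaves the score unchanged.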
The following code example shows how to calculate spec2vec similarities between query and reference spectrums. It uses a dummy model that can be found at ../integration-tests/test_user_workflow_spec2vec.model and a small test dataset that can be found at ../tests/pesticides.mgf.

    import os
    import gensim
    from matchms import calculate_scores
    from matchms.filtering import add_losses
    from matchms.filtering import default_filters
    from matchms.filtering import normalize_intensities
    from matchms.filtering import require_minimum_number_of_peaks
    from matchms.filtering import select_by_intensity
    from matchms.filtering import select_by_mz
    from matchms.importing import load_from_mgf
    from spec2vec import Spec2Vec

    def spectrum_processing(s):
        '''This is how a user would typically design his own pre- and post-
        processing pipeline.'''
        s = default_filters(s)
        s = normalize_intensities(s)
        s = select_by_mz(s, mz_from=0, mz_to=1000)
        s = select_by_intensity(s, intensity_from=0.01)
        s = add_losses(s, loss_mz_from=10.0, loss_mz_to=200.0)
        s = require_minimum_number_of_peaks(s, n_required=5)
        return s

    spectrums_file = os.path.join(os.getcwd(), "..", "tests", "data", "pesticides.mgf")

    # Load data and apply the above defined filters to the data
    spectrums = [spectrum_processing(s) for s in load_from_mgf(spectrums_file)]

    # Omit spectrums that didn't qualify for analysis
    spectrums = [s for s in spectrums if s is not None]

    # Load pretrained model (here dummy model)
    model_file = os.path.join(os.getcwd(), "..", "integration-tests", "test_user_workflow_spec2vec.model")
    model = gensim.models.Word2Vec.load(model_file)

    # Define similarity_function
    spec2vec = Spec2Vec(model=model, intensity_weighting_power=0.5)

    # Calculate scores on all combinations of references and queries
    scores = calculate_scores(spectrums[10:], spectrums[:10], spec2vec)

    # Select top-10 candidates for first query spectrum
    spectrum0_top10 = scores.scores_by_query(spectrums[0], sort=True)[:10]

    # Display spectrum IDs for top-10 matches (only works if metadata contains "spectrum_id" field)
    print([s[0].metadata['spectrum_id'] for s in spectrum0_top10])
Should output
['CCMSLIB00001058300', 'CCMSLIB00001058289', 'CCMSLIB00001058303', ...
- __init__(model: Union[Word2Vec, Word2VecLight], intensity_weighting_power: Union[float, int] = 0, allowed_missing_percentage: Union[float, int] = 10, progress_bar: bool = False)
- Parameters
model – Expected input is a gensim word2vec model that has been trained on the desired set of spectrum documents.
intensity_weighting_power – Spectrum vectors are a weighted sum of the word vectors. The given word intensities will be raised to the given power. The default is 0, which means that no weighting will be done.
allowed_missing_percentage – Set the maximum allowed percentage of the document that may be missing from the input model. This is measured as percentage of the weighted, missing words compared to all word vectors of the document. Default is 10, which means up to 10% missing words are allowed. If more words are missing from the model, an empty embedding will be returned (leading to similarities of 0) and a warning is raised.
progress_bar – Set to True to monitor the embedding creation with a progress bar. Default is False.
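The interplay of intensity_weighting_power and allowed_missing_percentage can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: the function `weighted_document_vector`, its `word_vectors` dictionary, and the `vector_size` argument are hypothetical names, the missing percentage is assumed to be the share of total weight carried by words absent from the model, and the warning mentioned above is omitted for brevity.

```python
import numpy as np


def weighted_document_vector(words, intensities, word_vectors, vector_size,
                             intensity_weighting_power=0,
                             allowed_missing_percentage=10):
    """Weighted sum of word vectors for one document (hypothetical sketch).

    word_vectors is assumed to be a dict mapping word -> 1-D array.
    """
    # Raise each word intensity to the given power; a power of 0 gives
    # every word a weight of 1, i.e. no weighting.
    weights = np.asarray(intensities, dtype=float) ** intensity_weighting_power
    known = np.array([w in word_vectors for w in words], dtype=bool)

    total = weights.sum()
    # Assumption: share of the total weight that belongs to unknown words.
    missing = 100.0 * weights[~known].sum() / total if total > 0 else 100.0
    if missing > allowed_missing_percentage:
        # Empty embedding, which leads to a similarity of 0.
        return np.zeros(vector_size)

    vector = np.zeros(vector_size)
    for word, weight, is_known in zip(words, weights, known):
        if is_known:
            vector += weight * np.asarray(word_vectors[word], dtype=float)
    return vector


# Toy vocabulary with two known "peak" words (made-up values).
toy_vectors = {"peak@100.00": np.array([1.0, 0.0]),
               "peak@200.00": np.array([0.0, 1.0])}
embedding = weighted_document_vector(["peak@100.00", "peak@200.00"],
                                     [1.0, 0.25], toy_vectors, vector_size=2,
                                     intensity_weighting_power=0.5)
```

With a power of 0.5, the weaker peak (intensity 0.25) contributes with weight 0.5 instead of 0.25, which reduces the dominance of the strongest peaks.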
- matrix(references: Union[List[SpectrumDocument], List[Spectrum]], queries: Union[List[SpectrumDocument], List[Spectrum]], array_type: str = 'numpy', is_symmetric: bool = False) -> ndarray
Calculate the spec2vec similarities between all references and queries.
- Parameters
references – Reference spectrums or spectrum documents.
queries – Query spectrums or spectrum documents.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Currently, only “numpy” is supported and will return a numpy array. Future versions will include “sparse” as option to return a COO-sparse array.
is_symmetric – Set to True if references == queries to speed up calculation about 2x. Uses the fact that in this case score[i, j] = score[j, i]. Default is False.
- Returns
Array of spec2vec similarity scores.
- Return type
spec2vec_similarity
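To illustrate why is_symmetric can roughly halve the work, the sketch below fills a cosine score matrix from precomputed vectors and, in the symmetric case, computes only the upper triangle and mirrors it. The function names and the explicit double loop are assumptions chosen for readability; the library's actual implementation may differ.

```python
import numpy as np


def _cosine(u, v):
    """Cosine similarity; returns 0 for all-zero vectors (assumption)."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm_product) if norm_product > 0 else 0.0


def cosine_score_matrix(reference_vectors, query_vectors, is_symmetric=False):
    """All-vs-all cosine scores (hypothetical sketch, not the library code)."""
    refs = np.asarray(reference_vectors, dtype=float)
    queries = np.asarray(query_vectors, dtype=float)
    scores = np.zeros((len(refs), len(queries)))
    for i in range(len(refs)):
        # In the symmetric case, start at the diagonal and mirror each value,
        # using score[i, j] == score[j, i].
        start = i if is_symmetric else 0
        for j in range(start, len(queries)):
            scores[i, j] = _cosine(refs[i], queries[j])
            if is_symmetric:
                scores[j, i] = scores[i, j]
    return scores


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = cosine_score_matrix(toy, toy)
mirrored = cosine_score_matrix(toy, toy, is_symmetric=True)
```

Both calls produce the same matrix; the symmetric variant simply evaluates about half as many pairs.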
- pair(reference: Union[SpectrumDocument, Spectrum], query: Union[SpectrumDocument, Spectrum]) -> float
Calculate the spec2vec similarity between a reference and a query.
- Parameters
reference – Reference spectrum or spectrum document.
query – Query spectrum or spectrum document.
- Returns
Spec2vec similarity score.
- Return type
spec2vec_similarity