spec2vec package

class spec2vec.Document(obj)[source]

Bases: object

Parent class for documents as required by spec2vec.

Use this as parent class to build your own document class. An example used for mass spectra is SpectrumDocument.

__init__(obj)[source]
Parameters

obj – Input object of desired class.

class spec2vec.Spec2Vec(model: gensim.models.word2vec.Word2Vec, intensity_weighting_power: Union[float, int] = 0, allowed_missing_percentage: Union[float, int] = 10, progress_bar: bool = False)[source]

Bases: matchms.similarity.BaseSimilarity.BaseSimilarity

Calculate spec2vec similarity scores between a reference and a query.

Using a trained model, spectrum documents will be converted into spectrum vectors. The spec2vec similarity is then the cosine similarity score between two spectrum vectors.
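As a reminder of what the final score computes, here is a minimal cosine-similarity sketch (plain Python for illustration; this is not the spec2vec internals):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two spectrum vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Two (toy) spectrum vectors pointing in the same direction score 1.0:
print(cosine_similarity([1.0, 0.0, 2.0], [0.5, 0.0, 1.0]))  # -> 1.0
```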

The following code example shows how to calculate spec2vec similarities between query and reference spectrums. It uses a dummy model that can be found at ../integration-tests/test_user_workflow_spec2vec.model and a small test dataset that can be found at ../tests/pesticides.mgf.

import os
import gensim
from matchms import calculate_scores
from matchms.filtering import add_losses
from matchms.filtering import default_filters
from matchms.filtering import normalize_intensities
from matchms.filtering import require_minimum_number_of_peaks
from matchms.filtering import select_by_intensity
from matchms.filtering import select_by_mz
from matchms.importing import load_from_mgf
from spec2vec import Spec2Vec

def spectrum_processing(s):
    '''This is how a user would typically design their own pre- and
    post-processing pipeline.'''
    s = default_filters(s)
    s = normalize_intensities(s)
    s = select_by_mz(s, mz_from=0, mz_to=1000)
    s = select_by_intensity(s, intensity_from=0.01)
    s = add_losses(s, loss_mz_from=10.0, loss_mz_to=200.0)
    s = require_minimum_number_of_peaks(s, n_required=5)
    return s

spectrums_file = os.path.join(os.getcwd(), "..", "tests", "pesticides.mgf")

# Load data and apply the above defined filters to the data
spectrums = [spectrum_processing(s) for s in load_from_mgf(spectrums_file)]

# Omit spectrums that didn't qualify for analysis
spectrums = [s for s in spectrums if s is not None]

# Load pretrained model (here dummy model)
model_file = os.path.join(os.getcwd(), "..", "integration-tests", "test_user_workflow_spec2vec.model")
model = gensim.models.Word2Vec.load(model_file)

# Define similarity_function
spec2vec = Spec2Vec(model=model, intensity_weighting_power=0.5)

# Calculate scores on all combinations of references and queries
scores = calculate_scores(spectrums[10:], spectrums[:10], spec2vec)

# Select top-10 candidates for first query spectrum
spectrum0_top10 = scores.scores_by_query(spectrums[0], sort=True)[:10]

# Display spectrum IDs for top-10 matches
print([s[0].metadata['spectrumid'] for s in spectrum0_top10])

Should output

['CCMSLIB00001058300', 'CCMSLIB00001058289', 'CCMSLIB00001058303', ...
__init__(model: gensim.models.word2vec.Word2Vec, intensity_weighting_power: Union[float, int] = 0, allowed_missing_percentage: Union[float, int] = 10, progress_bar: bool = False)[source]
Parameters
  • model – Expected input is a gensim word2vec model that has been trained on the desired set of spectrum documents.

  • intensity_weighting_power – Spectrum vectors are a weighted sum of the word vectors. The given word intensities will be raised to the given power. The default is 0, which means that no weighting will be applied.

  • allowed_missing_percentage – Set the maximum allowed percentage of the document that may be missing from the input model. This is measured as percentage of the weighted, missing words compared to all word vectors of the document. Default is 10, which means up to 10% missing words are allowed. If more words are missing from the model, an empty embedding will be returned (leading to similarities of 0) and a warning is raised.

  • progress_bar – Set to True to monitor the embedding creation with a progress bar. Default is False.
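The effect of intensity_weighting_power can be illustrated with a short sketch (plain Python; the values are illustrative):

```python
intensities = [0.7, 0.2, 0.1]

# Power 0: every weight becomes 1.0, i.e. an unweighted sum of word vectors.
print([i ** 0 for i in intensities])  # -> [1.0, 1.0, 1.0]

# Power 0.5: intense peaks still dominate, but less strongly
# than with the raw intensities.
print([round(i ** 0.5, 3) for i in intensities])  # -> [0.837, 0.447, 0.316]
```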

matrix(references: Union[List[spec2vec.SpectrumDocument.SpectrumDocument], List[matchms.Spectrum.Spectrum]], queries: Union[List[spec2vec.SpectrumDocument.SpectrumDocument], List[matchms.Spectrum.Spectrum]], is_symmetric: bool = False) → numpy.ndarray[source]

Calculate the spec2vec similarities between all references and queries.

Parameters
  • references – Reference spectrums or spectrum documents.

  • queries – Query spectrums or spectrum documents.

  • is_symmetric – Set to True if references == queries to speed up the calculation by roughly 2x. This exploits the fact that in this case score[i, j] = score[j, i]. Default is False.

Returns

Array of spec2vec similarity scores.

Return type

spec2vec_similarity
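The is_symmetric shortcut relies on score[i, j] == score[j, i]; the idea can be sketched as follows (plain Python, with an arbitrary symmetric score function — this is not the spec2vec internals):

```python
def symmetric_score_matrix(items, score):
    """Compute only the upper triangle, then mirror it -- roughly half the work."""
    n = len(items)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            scores[i][j] = scores[j][i] = score(items[i], items[j])
    return scores

matrix = symmetric_score_matrix([1.0, 2.0, 4.0], lambda a, b: min(a, b) / max(a, b))
print(matrix[0][2], matrix[2][0])  # -> 0.25 0.25 (equal by construction)
```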

pair(reference: Union[spec2vec.SpectrumDocument.SpectrumDocument, matchms.Spectrum.Spectrum], query: Union[spec2vec.SpectrumDocument.SpectrumDocument, matchms.Spectrum.Spectrum]) → float[source]

Calculate the spec2vec similarity between a reference and a query.

Parameters
  • reference – Reference spectrum or spectrum document.

  • query – Query spectrum or spectrum document.

Returns

Spec2vec similarity score.

Return type

spec2vec_similarity

class spec2vec.SpectrumDocument(spectrum, n_decimals: int = 2)[source]

Bases: spec2vec.Document.Document

Create documents from spectra.

Every peak (and loss) position (m/z value) will be converted into a string “word”. The entire list of all peak words forms a spectrum document. Peak words have the form “peak@100.32” (for n_decimals=2), and losses have the form “loss@100.32”. Peaks with identical resulting strings will not be merged, so the same word can occur multiple times in a document (e.g. peaks at 100.31 and 100.29 would each lead to the word “peak@100.3” when using n_decimals=1).

For example:

import numpy as np
from matchms import Spectrum
from spec2vec import SpectrumDocument

spectrum = Spectrum(mz=np.array([100.0, 150.0, 200.51]),
                    intensities=np.array([0.7, 0.2, 0.1]),
                    metadata={'compound_name': 'substance1'})
spectrum_document = SpectrumDocument(spectrum, n_decimals=1)

print(spectrum_document.words)
print(spectrum_document.peaks.mz)
print(spectrum_document.get("compound_name"))

Should output

['peak@100.0', 'peak@150.0', 'peak@200.5']
[100.   150.   200.51]
substance1
__init__(spectrum, n_decimals: int = 2)[source]
Parameters
  • spectrum (SpectrumType) – Input spectrum.

  • n_decimals – Peak positions are converted to strings with n_decimals decimals. The default is 2, which would convert a peak at 100.387 into the word “peak@100.39”.
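A quick sketch of the documented rounding behavior (the helper below is hypothetical, not part of the SpectrumDocument API):

```python
def peak_word(mz, n_decimals=2):
    # Hypothetical helper mirroring the documented word format.
    return f"peak@{mz:.{n_decimals}f}"

print(peak_word(100.387))                # -> peak@100.39
print(peak_word(100.387, n_decimals=1))  # -> peak@100.4
```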

get(key: str, default=None)[source]

Retrieve value from Spectrum metadata dict. Shorthand for

val = self._obj.metadata[key]
property losses

Return losses of original spectrum.

property metadata

Return metadata of original spectrum.

property peaks

Return peaks of original spectrum.

spec2vec.calc_vector(model: gensim.models.basemodel.BaseTopicModel, document: spec2vec.Document.Document, intensity_weighting_power: Union[float, int] = 0, allowed_missing_percentage: Union[float, int] = 10) → numpy.ndarray[source]

Compute document vector as a (weighted) sum of individual word vectors.

Parameters
  • model – Pretrained word2vec model to convert words into vectors.

  • document – Document containing document.words and document.weights.

  • intensity_weighting_power – Specify to what power weights should be raised. The default is 0, which means that no weighting will be applied.

  • allowed_missing_percentage – Set the maximum allowed percentage of the document that may be missing from the input model. This is measured as percentage of the weighted, missing words compared to all word vectors of the document. Default is 10, which means up to 10% missing words are allowed. If more words are missing from the model, an empty embedding will be returned (leading to similarities of 0) and a warning is raised.

Returns

Vector representing the input document in latent space. If the missing percentage of the document in the model exceeds allowed_missing_percentage, an empty (all-zero) vector is returned instead.

Return type

vector
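The (weighted) sum and the missing-word check described above can be sketched in plain Python; the dict below stands in for the trained model's word vectors, and all names are illustrative, not the spec2vec internals:

```python
def calc_vector_sketch(word_vectors, words, weights,
                       intensity_weighting_power=0.0,
                       allowed_missing_percentage=10.0):
    """Weighted sum of word vectors; an all-zero vector if too much
    of the (weighted) document is missing from the model."""
    dim = len(next(iter(word_vectors.values())))
    weighted = [w ** intensity_weighting_power for w in weights]
    missing = sum(wt for word, wt in zip(words, weighted)
                  if word not in word_vectors)
    if 100.0 * missing / sum(weighted) > allowed_missing_percentage:
        return [0.0] * dim  # "empty" embedding -> similarities of 0
    vector = [0.0] * dim
    for word, wt in zip(words, weighted):
        if word in word_vectors:
            vector = [v + wt * x for v, x in zip(vector, word_vectors[word])]
    return vector

word_vectors = {"peak@100.00": [1.0, 0.0], "peak@150.00": [0.0, 1.0]}
print(calc_vector_sketch(word_vectors, ["peak@100.00", "peak@150.00"], [0.7, 0.2]))
# power 0 -> unweighted sum: [1.0, 1.0]
```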