Glossary

A

Amide I Band

Raman band near 1650 cm⁻¹ arising mainly from C=O stretching vibrations of the protein backbone. Its position and shape are sensitive to secondary structure, and it is a strong marker for protein content.

Amide II Band

Raman band near 1550 cm⁻¹ from N-H bending coupled with C-N stretching in proteins. Usually weak in conventional (non-resonant) Raman spectra. Used for protein structure analysis.

Asymmetric Least Squares (AsLS)

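Baseline correction algorithm that fits a smooth curve by penalized least squares with asymmetric weights: points lying below the current baseline estimate are weighted heavily, while points above it (likely peaks) are weighted lightly. Fast and effective for smooth baselines.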
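
As a rough illustration, the sketch below runs the AsLS iteration with dense NumPy matrices. The function name `asls_baseline` and the defaults for `lam`, `p`, and `n_iter` are assumptions made for this example, not the application's API; real implementations normally use sparse matrices for speed.

```python
import numpy as np

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a smooth baseline by asymmetric least squares (dense sketch)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-order difference matrix D with shape (n - 2, n)
    D = np.diff(np.eye(n), n=2, axis=0)
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        # Solve (W + lam * D^T D) z = W y for the current weights
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # Points above the baseline (likely peaks) get the small weight p
        w = np.where(y > z, p, 1.0 - p)
    return z

# Synthetic spectrum: sloped background plus one narrow peak
x = np.linspace(0, 1, 200)
true_base = 2 + 3 * x
y = true_base + 5 * np.exp(-((x - 0.5) / 0.01) ** 2)
corrected = y - asls_baseline(y, lam=1e4, p=0.01)
```

Larger `lam` values stiffen the baseline, and smaller `p` values make it hug the lower envelope of the spectrum more closely.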

API

Application Programming Interface - the set of functions and classes that developers can use to extend or interact with the application programmatically.


B

Baseline Correction

Preprocessing step that removes additive background fluorescence from Raman spectra, revealing the true Raman peaks.

Batch Processing

Processing multiple datasets or spectra simultaneously with the same pipeline or analysis method.

Biomarker

A measurable indicator (e.g., specific Raman peak or ratio) that characterizes biological state or disease condition.


C

Calibration

Process of correcting spectral axis (wavenumber) or intensity axis using known standards to ensure measurement accuracy.

Class Imbalance

When one class has many more samples than another (e.g., 90% healthy, 10% disease), requiring special handling in machine learning.

Confusion Matrix

Table showing model predictions vs actual labels, used to evaluate classification performance (True Positive, False Positive, etc.).
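
A minimal sketch of how such a table can be counted for a two-class problem follows. The labels and the helper name `confusion_matrix` are made up for illustration (0 = healthy, 1 = disease).

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        cm[actual, predicted] += 1
    return cm

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred, 2)

# For the binary case the four cells have conventional names
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
```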

Cosmic Ray

Sharp, narrow intensity spike caused by a high-energy cosmic ray hitting the detector. Must be removed before analysis.

Cross-Validation

Technique for evaluating model performance by splitting data into training/test sets multiple times and averaging results.


D

Data Leakage

When information from test set inadvertently influences training, causing overoptimistic performance estimates. Critical to avoid.
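
One common source of leakage is computing scaling statistics on the full dataset. The sketch below, on random stand-in data, shows the safer pattern: estimate the mean and standard deviation on the training rows only, then reuse them for the test rows. The array shapes and the split point are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(20, 4))  # 20 spectra, 4 features
train, test = X[:15], X[15:]

# Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
# The test set is transformed with the training statistics, never its own
test_scaled = (test - mean) / std
```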

Dataset

Collection of Raman spectra imported and managed together. Can contain multiple groups.

Dimensionality Reduction

Techniques (PCA, UMAP, t-SNE) that reduce high-dimensional spectral data (1000+ wavenumbers) to 2-3 dimensions for visualization.


E

Effect Size

Magnitude of difference between groups, independent of sample size. Cohen’s d is a common effect size measure.
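
For reference, Cohen's d with a pooled standard deviation can be sketched as below. The function name `cohens_d` and the sample values are illustrative only.

```python
import numpy as np

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Peak intensities for two hypothetical groups
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```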

Endmember

Pure component spectrum in spectral unmixing (MCR-ALS). Mixture spectra are combinations of endmembers.


F

False Discovery Rate (FDR)

Expected proportion of false positives among all positive results. FDR correction (Benjamini-Hochberg) controls this rate.
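
The Benjamini-Hochberg adjustment can be sketched in a few lines of NumPy. The function name `benjamini_hochberg` is an assumption for this example, and the p-values are invented.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = adjusted < 0.05
```

A test is declared significant at FDR level q when its adjusted p-value falls below q.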

Feature

In machine learning context, each wavenumber intensity value is a feature. Raman spectra have ~1000-2000 features.

Feature Engineering

Creating new features from raw data (e.g., peak ratios, derivatives) to improve model performance.

Feature Importance

Measure of how much each feature (wavenumber) contributes to model predictions. SHAP and permutation importance are common methods.

Fluorescence

Broad background signal in Raman spectra caused by sample autofluorescence. Removed by baseline correction.


G

Group

User-defined collection of spectra (e.g., “Healthy”, “Disease”) used for comparative analysis and machine learning.

GroupKFold

Cross-validation strategy ensuring all spectra from same patient/sample stay in same fold, preventing data leakage.
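
The idea can be sketched without any ML library: folds are assigned per group, not per spectrum. The generator `group_k_fold` below is a simplified stand-in written for this glossary (it assigns groups round-robin and does not balance fold sizes the way a library implementation might); the patient IDs are invented.

```python
import numpy as np

def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) so that no group appears in both."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    for fold in range(n_splits):
        # Every n_splits-th group goes into this fold's test set
        test_groups = unique[fold::n_splits]
        test_mask = np.isin(groups, test_groups)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

# One entry per spectrum: the patient it was measured from
groups = np.array(["P1", "P1", "P2", "P2", "P3", "P3", "P3", "P4"])
folds = list(group_k_fold(groups, n_splits=2))
```

Setting `n_splits` to the number of distinct patients gives leave-one-patient-out cross-validation.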


H

Hyperparameter

Model parameter set by user (not learned from data), such as number of trees in Random Forest or C in SVM.

Hyperparameter Tuning

Process of finding optimal hyperparameter values using grid search, random search, or optimization algorithms.


I

Intensity

Raman scattering signal strength, proportional to molecular concentration and scattering cross-section.

Interpolation

Process of estimating spectrum values at new wavenumber points based on existing data. Used to align spectra with different sampling.
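
A small example using NumPy's linear interpolation, `np.interp`, which expects the original wavenumber axis in increasing order. The axis values and intensities are invented.

```python
import numpy as np

# Original sampling: every 2 cm^-1
wn_old = np.array([400.0, 402.0, 404.0, 406.0])
intensity = np.array([10.0, 14.0, 12.0, 20.0])

# Target axis shared by all spectra in the dataset
wn_new = np.array([401.0, 403.0, 405.0])

# Linear interpolation onto the new axis
resampled = np.interp(wn_new, wn_old, intensity)
```

Linear interpolation is the simplest choice; spline interpolation may preserve narrow peak shapes better when the original sampling is coarse.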


K

K-Fold Cross-Validation

Splitting data into K parts, training on K-1 parts and testing on the remaining part, repeated K times so that each part serves as the test set exactly once.


L

Lambda (λ)

Smoothness parameter in baseline correction algorithms. Higher λ = smoother baseline.

Leave-One-Patient-Out Cross-Validation (LOPOCV)

Cross-validation where each patient’s data serves as the test set once while all other patients form the training set, ensuring patient-level separation.

Loading

In PCA, the contribution of each original variable (wavenumber) to a principal component. Loadings plot shows which peaks drive PCs.


M

Machine Learning (ML)

Using algorithms to learn patterns from data and make predictions on new data without explicit programming.

MCR-ALS

Multivariate Curve Resolution - Alternating Least Squares. Method for decomposing mixture spectra into pure component spectra.

MGUS

Monoclonal Gammopathy of Undetermined Significance. Pre-cancerous condition that may progress to multiple myeloma (MM).

Multiple Testing Correction

Statistical adjustment applied when many hypothesis tests are performed simultaneously, controlling the family-wise error rate (e.g., Bonferroni) or the false discovery rate (e.g., Benjamini-Hochberg).


N

Normalization

Scaling spectra to remove multiplicative intensity variations, enabling fair comparison. Common methods: Vector, SNV, Min-Max, Area.
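
Three of these methods are simple enough to sketch per spectrum. The function names are chosen for this example, and the use of the sample standard deviation (`ddof=1`) in SNV is an assumption; some implementations use the population standard deviation instead.

```python
import numpy as np

def vector_normalize(y):
    """Scale to unit Euclidean length."""
    y = np.asarray(y, dtype=float)
    return y / np.linalg.norm(y)

def snv(y):
    """Standard Normal Variate: subtract the mean, divide by the std."""
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std(ddof=1)

def min_max(y):
    """Rescale to the range [0, 1]."""
    y = np.asarray(y, dtype=float)
    return (y - y.min()) / (y.max() - y.min())

spectrum = np.array([1.0, 2.0, 3.0, 4.0])
```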

NumPy

Python library for numerical computing, providing arrays and mathematical functions. Core dependency for this application.


O

Outlier

Spectrum significantly different from others, possibly due to measurement error or unusual sample. Should be identified and often removed.

Overfitting

When model learns training data too well, including noise, causing poor performance on new data.


P

P-value

Probability of observing data at least as extreme as those measured, assuming the null hypothesis is true. p < 0.05 is traditionally considered significant.

PCA (Principal Component Analysis)

Unsupervised dimensionality reduction finding orthogonal directions of maximum variance. Often used as a first exploratory step before other analyses.
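
A compact way to see how scores, loadings, and explained variance relate is PCA via the singular value decomposition. In this sketch the function name `pca` is illustrative, the data are random stand-ins, and "loadings" means the unit-length principal directions (some texts scale them by the singular values).

```python
import numpy as np

def pca(X, n_components):
    """PCA of a (n_spectra, n_wavenumbers) matrix via SVD."""
    Xc = X - X.mean(axis=0)                  # mean-center each wavenumber
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components]             # one row per principal component
    explained = S**2 / np.sum(S**2)          # fraction of variance per PC
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
scores, loadings, explained = pca(X, 3)
```

Plotting `explained` against component number gives a scree plot, and plotting a row of `loadings` against wavenumber shows which bands drive that component.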

Peak

Local maximum in Raman spectrum corresponding to specific molecular vibration.

Pipeline

Sequence of preprocessing steps applied in order (e.g., Baseline → Smooth → Normalize).
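
Conceptually, a pipeline is an ordered list of functions applied one after another. The sketch below uses deliberately simple stand-in steps (subtracting the minimum is not a real baseline correction); every name here is hypothetical and unrelated to the application's own pipeline classes.

```python
import numpy as np

def subtract_minimum(y):
    # Placeholder for a real baseline correction step
    return y - y.min()

def moving_average(y, window=3):
    # Placeholder smoothing step
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")

def vector_normalize(y):
    return y / np.linalg.norm(y)

def run_pipeline(spectrum, steps):
    """Apply each (name, function) step in order."""
    result = np.asarray(spectrum, dtype=float)
    for name, step in steps:
        result = step(result)
    return result

steps = [
    ("baseline", subtract_minimum),
    ("smooth", moving_average),
    ("normalize", vector_normalize),
]
processed = run_pipeline([5.0, 6.0, 9.0, 6.0, 5.0], steps)
```

The order matters: normalizing before baseline correction, for example, would scale the spectrum by its background instead of its Raman signal.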

PLS-DA (Partial Least Squares Discriminant Analysis)

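Supervised dimensionality reduction that projects spectra onto latent variables maximizing covariance with known group labels, which tends to separate the groups. Prone to overfitting, so results should be cross-validated.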

Preprocessing

Data transformation steps applied before analysis (baseline correction, smoothing, normalization, etc.).


Q

Quality Control (QC)

Procedures ensuring data quality, including outlier detection, cosmic ray removal, and baseline verification.

Quantile Normalization

Advanced normalization making intensity distributions identical across spectra.


R

Raman Scattering

Inelastic scattering of photons by molecules, providing fingerprint of molecular structure.

Raman Shift

Energy difference between incident and scattered photons, measured in wavenumbers (cm⁻¹).
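
When wavelengths are given in nanometres, the shift in cm⁻¹ follows from the difference of reciprocal wavelengths. The helper name `raman_shift_cm1` and the example wavelengths below are illustrative.

```python
def raman_shift_cm1(excitation_nm, scattered_nm):
    """Raman shift in cm^-1 from wavelengths in nm (1 cm = 1e7 nm)."""
    return 1e7 * (1.0 / excitation_nm - 1.0 / scattered_nm)

# Example: 785 nm laser, scattered light detected at 850 nm
shift = raman_shift_cm1(785.0, 850.0)
```

A positive value corresponds to Stokes scattering (scattered photon at longer wavelength than the laser); a negative value corresponds to anti-Stokes scattering.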

Random Forest (RF)

Ensemble machine learning algorithm that averages the predictions of many decision trees. Robust to noise, and provides feature importance estimates that aid interpretation.

Regularization

Adding penalty to model complexity to prevent overfitting. L1 (Lasso) and L2 (Ridge) are common types.

ROC Curve

Receiver Operating Characteristic curve plotting True Positive Rate vs False Positive Rate. ROC-AUC measures classification performance.
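
The curve can be built by sweeping a threshold over the sorted scores, as in this sketch. The function names are illustrative, and for simplicity the code assumes there are no tied scores.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays; assumes binary labels and no tied scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)               # true positives so far
    fps = np.cumsum(y_sorted == 0)               # false positives so far
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
area = auc(fpr, tpr)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.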


S

Savitzky-Golay Filter

Smoothing method fitting polynomials to local windows. Preserves peak shapes better than simple averaging.
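
The filter reduces to a fixed set of convolution weights obtained from a local least-squares polynomial fit. The NumPy-only sketch below derives those weights; the function names and the edge-padding choice are assumptions for illustration, and library implementations offer more options.

```python
import numpy as np

def savgol_coeffs(window, polyorder):
    """Weights that give the fitted polynomial's value at the window center."""
    half = window // 2
    x = np.arange(-half, half + 1)
    A = np.vander(x, polyorder + 1, increasing=True)  # columns 1, x, x^2, ...
    # Row 0 of the pseudo-inverse maps window samples to the fit at x = 0
    return np.linalg.pinv(A)[0]

def savgol_smooth(y, window=5, polyorder=2):
    """Smooth y with an odd-length window; edges are padded by repetition."""
    y = np.asarray(y, dtype=float)
    half = window // 2
    padded = np.pad(y, half, mode="edge")
    c = savgol_coeffs(window, polyorder)
    # Convolving with the reversed weights is a sliding dot product
    return np.convolve(padded, c[::-1], mode="valid")

x = np.arange(10.0)
smoothed = savgol_smooth(x**2, window=5, polyorder=2)
```

Because the local fit is a polynomial, any signal that is itself a polynomial of that order passes through unchanged away from the edges, which is why peak shapes survive better than with a plain moving average.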

Scree Plot

Plot of explained variance vs principal component number. Used to select number of PCs to keep.

SHAP (SHapley Additive exPlanations)

Method for interpreting machine learning models by assigning importance value to each feature for each prediction.

Smoothing

Reducing noise by averaging neighboring points. Savitzky-Golay and Gaussian are common methods.

SNV (Standard Normal Variate)

Normalization method that subtracts each spectrum’s own mean and divides by its own standard deviation, so every spectrum is treated independently. Robust for biological samples.

Spectrum (plural: Spectra)

Plot of Raman intensity vs wavenumber for a single measurement.

Stratified Sampling

Splitting data while maintaining class proportions in train and test sets. Important for imbalanced data.
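
A sketch of a stratified split: the test set is drawn separately from each class so the class proportions carry over. The function name `stratified_split`, the 90/10 label mix, and the test fraction are assumptions for this example.

```python
import numpy as np

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with per-class sampling."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    test_idx = []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        # Take the same fraction from every class (at least one sample)
        n_test = max(1, int(round(test_fraction * idx.size)))
        test_idx.extend(idx[:n_test])
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(labels.size), test_idx)
    return train_idx, test_idx

labels = np.array([0] * 90 + [1] * 10)  # 90% healthy, 10% disease
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
```

Note that this splits individual spectra; when several spectra come from the same patient, the split should also respect patient grouping to avoid data leakage.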

SVM (Support Vector Machine)

Machine learning algorithm finding optimal separating hyperplane between classes. Effective for high-dimensional data.


T

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Non-linear dimensionality reduction emphasizing local structure. Good for visualizing clusters.

Test Set

Data held out during training, used only for final performance evaluation.

Training Set

Data used to train machine learning model. Model learns patterns from training set.


U

UMAP (Uniform Manifold Approximation and Projection)

Non-linear dimensionality reduction that aims to preserve local structure while retaining more global structure than t-SNE. Typically faster than t-SNE on large datasets.

Underfitting

When model is too simple to capture data patterns, causing poor performance on both training and test data.


V

Validation Set

Data used during training for hyperparameter tuning and model selection (separate from test set).

Vector Normalization

Scaling each spectrum to unit length (Euclidean norm = 1). A widely used normalization for Raman data.


W

Wavenumber

Reciprocal of wavelength, expressed in cm⁻¹ (inverse centimeters); the quantity in which Raman shift is reported. Proportional to vibrational energy.

Whittaker Smoothing

Smoothing method using penalized least squares. Controlled by lambda parameter.


X

XGBoost (eXtreme Gradient Boosting)

Gradient boosting machine learning algorithm that builds decision trees sequentially, each correcting the errors of the previous ones. Often achieves high accuracy but requires careful hyperparameter tuning.


Japanese Terms / 日本語用語

未病 (Mibyō)

Pre-disease state - condition before full disease manifestation. Focus of early detection research.

臨床光情報工学研究室 (Rinsho Hikari Jōhō Kōgaku Kenkyūshitsu)

Laboratory for Clinical Photonics and Information Engineering at University of Toyama.


Abbreviations

| Abbreviation | Full Name |
|---|---|
| AI | Artificial Intelligence |
| ALS | Alternating Least Squares |
| ANOVA | Analysis of Variance |
| API | Application Programming Interface |
| AsLS | Asymmetric Least Squares |
| AUC | Area Under the Curve |
| CDAE | Convolutional Denoising Autoencoder |
| CI | Confidence Interval |
| CSV | Comma-Separated Values |
| DPI | Dots Per Inch |
| EM | Expectation-Maximization |
| FABC | Fully Automatic Baseline Correction |
| FDR | False Discovery Rate |
| FN | False Negative |
| FP | False Positive |
| GUI | Graphical User Interface |
| ICA | Independent Component Analysis |
| KNN | K-Nearest Neighbors |
| LOPOCV | Leave-One-Patient-Out Cross-Validation |
| MAT | MATLAB File Format (not currently supported) |
| MCR | Multivariate Curve Resolution |
| MGUS | Monoclonal Gammopathy of Undetermined Significance |
| ML | Machine Learning |
| MM | Multiple Myeloma |
| MSC | Multiplicative Scatter Correction |
| NMF | Non-negative Matrix Factorization |
| PC | Principal Component |
| PCA | Principal Component Analysis |
| PLS | Partial Least Squares |
| PLS-DA | Partial Least Squares Discriminant Analysis |
| PQN | Probabilistic Quotient Normalization |
| QC | Quality Control |
| RF | Random Forest |
| ROC | Receiver Operating Characteristic |
| SMOTE | Synthetic Minority Over-sampling Technique |
| SNV | Standard Normal Variate |
| SVM | Support Vector Machine |
| t-SNE | t-Distributed Stochastic Neighbor Embedding |
| TN | True Negative |
| TP | True Positive |
| UMAP | Uniform Manifold Approximation and Projection |
| UV | uv package manager |


Contributing

Found a term that should be added?

  1. Fork the repository

  2. Edit docs/glossary.md

  3. Add term in alphabetical order with clear definition

  4. Submit pull request

Please include:

  • Term in bold

  • Clear, concise definition

  • Related terms or examples where helpful