Glossary
A
- Amide I Band
Raman peak around 1650 cm⁻¹ corresponding to C=O stretching vibrations in protein backbones. Strong marker for protein content.
- Amide II Band
Raman peak around 1550 cm⁻¹ from N-H bending and C-N stretching in proteins. Used for protein structure analysis.
- Asymmetric Least Squares (AsLS)
Baseline correction algorithm that fits asymmetric polynomials, giving more weight to points below the spectrum. Fast and effective for smooth baselines.
- API
Application Programming Interface - the set of functions and classes that developers can use to extend or interact with the application programmatically.
B
- Baseline Correction
Preprocessing step that removes additive background fluorescence from Raman spectra, revealing the true Raman peaks.
- Batch Processing
Processing multiple datasets or spectra simultaneously with the same pipeline or analysis method.
- Biomarker
A measurable indicator (e.g., specific Raman peak or ratio) that characterizes biological state or disease condition.
C
- Calibration
Process of correcting spectral axis (wavenumber) or intensity axis using known standards to ensure measurement accuracy.
- Class Imbalance
When one class has many more samples than another (e.g., 90% healthy, 10% disease), requiring special handling in machine learning.
- Confusion Matrix
Table showing model predictions vs actual labels, used to evaluate classification performance (True Positive, False Positive, etc.).
- Cosmic Ray
Sharp intensity spike caused by high-energy cosmic ray hitting the detector. Must be removed before analysis.
- Cross-Validation
Technique for evaluating model performance by splitting data into training/test sets multiple times and averaging results.
D
- Data Leakage
When information from test set inadvertently influences training, causing overoptimistic performance estimates. Critical to avoid.
- Dataset
Collection of Raman spectra imported and managed together. Can contain multiple groups.
- Dimensionality Reduction
Techniques (PCA, UMAP, t-SNE) that reduce high-dimensional spectral data (1000+ wavenumbers) to 2-3 dimensions for visualization.
E
- Effect Size
Magnitude of difference between groups, independent of sample size. Cohen’s d is common effect size measure.
- Endmember
Pure component spectrum in spectral unmixing (MCR-ALS). Mixture spectra are combinations of endmembers.
F
- False Discovery Rate (FDR)
Expected proportion of false positives among all positive results. FDR correction (Benjamini-Hochberg) controls this rate.
- Feature
In machine learning context, each wavenumber intensity value is a feature. Raman spectra have ~1000-2000 features.
- Feature Engineering
Creating new features from raw data (e.g., peak ratios, derivatives) to improve model performance.
- Feature Importance
Measure of how much each feature (wavenumber) contributes to model predictions. SHAP and permutation importance are common methods.
- Fluorescence
Broad background signal in Raman spectra caused by sample autofluorescence. Removed by baseline correction.
G
- Group
User-defined collection of spectra (e.g., “Healthy”, “Disease”) used for comparative analysis and machine learning.
- GroupKFold
Cross-validation strategy ensuring all spectra from same patient/sample stay in same fold, preventing data leakage.
H
- Hyperparameter
Model parameter set by user (not learned from data), such as number of trees in Random Forest or C in SVM.
- Hyperparameter Tuning
Process of finding optimal hyperparameter values using grid search, random search, or optimization algorithms.
I
- Intensity
Raman scattering signal strength, proportional to molecular concentration and scattering cross-section.
- Interpolation
Process of estimating spectrum values at new wavenumber points based on existing data. Used to align spectra with different sampling.
K
- K-Fold Cross-Validation
Splitting data into K parts, training on K-1 parts and testing on remaining part, repeated K times.
L
- Lambda (λ)
Smoothness parameter in baseline correction algorithms. Higher λ = smoother baseline.
- Leave-One-Patient-Out Cross-Validation (LOPOCV)
Cross-validation where each patient’s data is test set once, ensuring patient-level separation.
- Loading
In PCA, the contribution of each original variable (wavenumber) to a principal component. Loadings plot shows which peaks drive PCs.
M
- Machine Learning (ML)
Using algorithms to learn patterns from data and make predictions on new data without explicit programming.
- MCR-ALS
Multivariate Curve Resolution - Alternating Least Squares. Method for decomposing mixture spectra into pure component spectra.
- MGUS
Monoclonal Gammopathy of Undetermined Significance. Pre-cancerous condition that may progress to multiple myeloma (MM).
- Multiple Testing Correction
Statistical adjustment for performing many hypothesis tests simultaneously, controlling false positive rate.
N
- Normalization
Scaling spectra to remove multiplicative intensity variations, enabling fair comparison. Common methods: Vector, SNV, Min-Max, Area.
- NumPy
Python library for numerical computing, providing arrays and mathematical functions. Core dependency for this application.
O
- Outlier
Spectrum significantly different from others, possibly due to measurement error or unusual sample. Should be identified and often removed.
- Overfitting
When model learns training data too well, including noise, causing poor performance on new data.
P
- P-value
Probability of observing data as extreme as measured, assuming null hypothesis is true. p < 0.05 traditionally considered significant.
- PCA (Principal Component Analysis)
Unsupervised dimensionality reduction finding directions of maximum variance. First step for most analyses.
- Peak
Local maximum in Raman spectrum corresponding to specific molecular vibration.
- Pipeline
Sequence of preprocessing steps applied in order (e.g., Baseline → Smooth → Normalize).
- PLS-DA (Partial Least Squares Discriminant Analysis)
Supervised dimensionality reduction maximizing separation between known groups.
- Preprocessing
Data transformation steps applied before analysis (baseline correction, smoothing, normalization, etc.).
Q
- Quality Control (QC)
Procedures ensuring data quality, including outlier detection, cosmic ray removal, and baseline verification.
- Quantile Normalization
Advanced normalization making intensity distributions identical across spectra.
R
- Raman Scattering
Inelastic scattering of photons by molecules, providing fingerprint of molecular structure.
- Raman Shift
Energy difference between incident and scattered photons, measured in wavenumbers (cm⁻¹).
- Random Forest (RF)
Ensemble machine learning algorithm using multiple decision trees. Robust and interpretable.
- Regularization
Adding penalty to model complexity to prevent overfitting. L1 (Lasso) and L2 (Ridge) are common types.
- ROC Curve
Receiver Operating Characteristic curve plotting True Positive Rate vs False Positive Rate. ROC-AUC measures classification performance.
S
- Savitzky-Golay Filter
Smoothing method fitting polynomials to local windows. Preserves peak shapes better than simple averaging.
- Scree Plot
Plot of explained variance vs principal component number. Used to select number of PCs to keep.
- SHAP (SHapley Additive exPlanations)
Method for interpreting machine learning models by assigning importance value to each feature for each prediction.
- Smoothing
Reducing noise by averaging neighboring points. Savitzky-Golay and Gaussian are common methods.
- SNV (Standard Normal Variate)
Normalization method centering and scaling each spectrum independently. Robust for biological samples.
- Spectrum (plural: Spectra)
Plot of Raman intensity vs wavenumber for a single measurement.
- Stratified Sampling
Splitting data while maintaining class proportions in train and test sets. Important for imbalanced data.
- SVM (Support Vector Machine)
Machine learning algorithm finding optimal separating hyperplane between classes. Effective for high-dimensional data.
T
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
Non-linear dimensionality reduction emphasizing local structure. Good for visualizing clusters.
- Test Set
Data held out during training, used only for final performance evaluation.
- Training Set
Data used to train machine learning model. Model learns patterns from training set.
U
- UMAP (Uniform Manifold Approximation and Projection)
Non-linear dimensionality reduction preserving both local and global structure. Faster than t-SNE.
- Underfitting
When model is too simple to capture data patterns, causing poor performance on both training and test data.
V
- Validation Set
Data used during training for hyperparameter tuning and model selection (separate from test set).
- Vector Normalization
Scaling each spectrum to unit length (Euclidean norm = 1). Most common normalization for Raman data.
W
- Wavenumber
Unit for Raman shift, measured in cm⁻¹ (inverse centimeters). Proportional to vibrational energy.
- Whittaker Smoothing
Smoothing method using penalized least squares. Controlled by lambda parameter.
X
- XGBoost (eXtreme Gradient Boosting)
Advanced gradient boosting machine learning algorithm. Often achieves highest accuracy but requires careful tuning.
Japanese Terms / 日本語用語
- 未病 (Mibyō)
Pre-disease state - condition before full disease manifestation. Focus of early detection research.
- 臨床光情報工学研究室 (Rinsho Hikari Jōhō Kōgaku Kenkyūshitsu)
Laboratory for Clinical Photonics and Information Engineering at University of Toyama.
Abbreviations
Abbreviation |
Full Name |
|---|---|
AI |
Artificial Intelligence |
ALS |
Alternating Least Squares |
ANOVA |
Analysis of Variance |
API |
Application Programming Interface |
AsLS |
Asymmetric Least Squares |
AUC |
Area Under Curve |
CDAE |
Convolutional Denoising Autoencoder |
CI |
Confidence Interval |
CSV |
Comma-Separated Values |
DPI |
Dots Per Inch |
EM |
Expectation-Maximization |
FABC |
Fixed-Anchor Baseline Correction |
FDR |
False Discovery Rate |
FN |
False Negative |
FP |
False Positive |
GUI |
Graphical User Interface |
ICA |
Independent Component Analysis |
KNN |
K-Nearest Neighbors |
LOPOCV |
Leave-One-Patient-Out Cross-Validation |
MAT |
MATLAB File Format (not currently supported) |
MCR |
Multivariate Curve Resolution |
MGUS |
Monoclonal Gammopathy of Undetermined Significance |
ML |
Machine Learning |
MM |
Multiple Myeloma |
MSC |
Multiplicative Scatter Correction |
NMF |
Non-negative Matrix Factorization |
PC |
Principal Component |
PCA |
Principal Component Analysis |
PLS |
Partial Least Squares |
PLS-DA |
Partial Least Squares Discriminant Analysis |
PQN |
Probabilistic Quotient Normalization |
QC |
Quality Control |
RF |
Random Forest |
ROC |
Receiver Operating Characteristic |
SMOTE |
Synthetic Minority Over-sampling Technique |
SNV |
Standard Normal Variate |
SVM |
Support Vector Machine |
t-SNE |
t-Distributed Stochastic Neighbor Embedding |
TN |
True Negative |
TP |
True Positive |
UMAP |
Uniform Manifold Approximation and Projection |
UV |
uv package manager |
Contributing
Found a term that should be added?
Fork the repository
Edit
docs/glossary.mdAdd term in alphabetical order with clear definition
Submit pull request
Please include:
Term in bold
Clear, concise definition
Related terms or examples where helpful