# Glossary

## A

**Amide I Band**
: Raman peak around 1650 cm⁻¹ corresponding to C=O stretching vibrations in protein backbones. Strong marker for protein content.

**Amide II Band**
: Raman peak around 1550 cm⁻¹ from N-H bending and C-N stretching in proteins. Used for protein structure analysis.

**Asymmetric Least Squares (AsLS)**
: Baseline correction algorithm that fits a smooth baseline by penalized least squares with asymmetric weights, so that points below the fitted baseline count more than points above it (the peaks). Fast and effective for smooth baselines.

**API**
: Application Programming Interface - the set of functions and classes that developers can use to extend or interact with the application programmatically.

---

## B

**Baseline Correction**
: Preprocessing step that removes additive background fluorescence from Raman spectra, revealing the true Raman peaks.

**Batch Processing**
: Processing multiple datasets or spectra simultaneously with the same pipeline or analysis method.

**Biomarker**
: A measurable indicator (e.g., specific Raman peak or ratio) that characterizes biological state or disease condition.

---

## C

**Calibration**
: Process of correcting the spectral axis (wavenumber) or intensity axis using known standards to ensure measurement accuracy.

**Class Imbalance**
: When one class has many more samples than another (e.g., 90% healthy, 10% disease), requiring special handling in machine learning.

**Confusion Matrix**
: Table showing model predictions vs actual labels, used to evaluate classification performance (True Positive, False Positive, etc.).

**Cosmic Ray**
: Sharp intensity spike caused by a high-energy cosmic ray hitting the detector. Must be removed before analysis.

**Cross-Validation**
: Technique for evaluating model performance by splitting data into training/test sets multiple times and averaging results.

---

## D

**Data Leakage**
: When information from the test set inadvertently influences training, causing overoptimistic performance estimates. Critical to avoid.

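
A common source of data leakage in Raman studies is placing spectra from the same patient on both sides of a train/test split. The sketch below, in plain Python, assigns whole patients to folds so that never happens. The function names and the example patient IDs are illustrative assumptions, not part of the application's API.

```python
from collections import defaultdict


def grouped_folds(patient_ids, n_folds):
    """Assign spectrum indices to folds so that no patient spans two folds."""
    # Collect the spectrum indices that belong to each patient.
    by_patient = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        by_patient[patient].append(index)

    # Place the largest patients first, always into the currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_patient, key=lambda p: (-len(by_patient[p]), str(p)))
    for patient in ordered:
        smallest = min(folds, key=len)
        smallest.extend(by_patient[patient])
    return folds


def train_test_splits(patient_ids, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = grouped_folds(patient_ids, n_folds)
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(test)


# Hypothetical example: three patients with several spectra each.
patient_ids = ["P1", "P1", "P2", "P2", "P2", "P3"]
for train, test in train_test_splits(patient_ids, n_folds=3):
    train_patients = {patient_ids[i] for i in train}
    test_patients = {patient_ids[i] for i in test}
    # No patient contributes spectra to both sides of the split.
    assert train_patients.isdisjoint(test_patients)
```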
**Dataset**
: Collection of Raman spectra imported and managed together. Can contain multiple groups.

**Dimensionality Reduction**
: Techniques (PCA, UMAP, t-SNE) that reduce high-dimensional spectral data (1000+ wavenumbers) to 2-3 dimensions for visualization.

---

## E

**Effect Size**
: Magnitude of difference between groups, independent of sample size. Cohen's d is a common effect size measure.

**Endmember**
: Pure component spectrum in spectral unmixing (MCR-ALS). Mixture spectra are combinations of endmembers.

---

## F

**False Discovery Rate (FDR)**
: Expected proportion of false positives among all positive results. FDR correction (Benjamini-Hochberg) controls this rate.

**Feature**
: In machine learning context, each wavenumber intensity value is a feature. Raman spectra have ~1000-2000 features.

**Feature Engineering**
: Creating new features from raw data (e.g., peak ratios, derivatives) to improve model performance.

**Feature Importance**
: Measure of how much each feature (wavenumber) contributes to model predictions. SHAP and permutation importance are common methods.

**Fluorescence**
: Broad background signal in Raman spectra caused by sample autofluorescence. Removed by baseline correction.

---

## G

**Group**
: User-defined collection of spectra (e.g., "Healthy", "Disease") used for comparative analysis and machine learning.

**GroupKFold**
: Cross-validation strategy ensuring all spectra from the same patient/sample stay in the same fold, preventing data leakage.

---

## H

**Hyperparameter**
: Model parameter set by the user (not learned from data), such as the number of trees in Random Forest or C in SVM.

**Hyperparameter Tuning**
: Process of finding optimal hyperparameter values using grid search, random search, or optimization algorithms.

---

## I

**Intensity**
: Raman scattering signal strength, proportional to molecular concentration and scattering cross-section.

**Interpolation**
: Process of estimating spectrum values at new wavenumber points based on existing data. Used to align spectra with different sampling.

---

## K

**K-Fold Cross-Validation**
: Splitting data into K parts, training on K-1 parts and testing on the remaining part, repeated K times.

---

## L

**Lambda (λ)**
: Smoothness parameter in baseline correction algorithms. Higher λ = smoother baseline.

**Leave-One-Patient-Out Cross-Validation (LOPOCV)**
: Cross-validation where each patient's data serves as the test set once, ensuring patient-level separation.

**Loading**
: In PCA, the contribution of each original variable (wavenumber) to a principal component. Loadings plot shows which peaks drive PCs.

---

## M

**Machine Learning (ML)**
: Using algorithms to learn patterns from data and make predictions on new data without explicit programming.

**MCR-ALS**
: Multivariate Curve Resolution - Alternating Least Squares. Method for decomposing mixture spectra into pure component spectra.

**MGUS**
: Monoclonal Gammopathy of Undetermined Significance. Pre-cancerous condition that may progress to multiple myeloma (MM).

**Multiple Testing Correction**
: Statistical adjustment for performing many hypothesis tests simultaneously, controlling false positive rate.

---

## N

**Normalization**
: Scaling spectra to remove multiplicative intensity variations, enabling fair comparison. Common methods: Vector, SNV, Min-Max, Area.

**NumPy**
: Python library for numerical computing, providing arrays and mathematical functions. Core dependency for this application.

---

## O

**Outlier**
: Spectrum significantly different from others, possibly due to measurement error or unusual sample. Should be identified and often removed.

**Overfitting**
: When model learns training data too well, including noise, causing poor performance on new data.

---

## P

**P-value**
: Probability of observing data at least as extreme as those measured, assuming the null hypothesis is true. p < 0.05 is traditionally considered significant.

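
When a p-value is computed for many peaks or wavenumbers at once, the 0.05 threshold no longer controls false positives on its own. The sketch below shows the Benjamini-Hochberg adjustment named under False Discovery Rate, written in plain Python. The function name and the example p-values are illustrative assumptions, not part of the application's API. In this example four raw p-values fall below 0.05, but only the first two remain significant after adjustment.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, e.g. one test per candidate peak.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in q_values]
```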
**PCA (Principal Component Analysis)**
: Unsupervised dimensionality reduction finding directions of maximum variance. First step for most analyses.

**Peak**
: Local maximum in Raman spectrum corresponding to specific molecular vibration.

**Pipeline**
: Sequence of preprocessing steps applied in order (e.g., Baseline → Smooth → Normalize).

**PLS-DA (Partial Least Squares Discriminant Analysis)**
: Supervised dimensionality reduction maximizing separation between known groups.

**Preprocessing**
: Data transformation steps applied before analysis (baseline correction, smoothing, normalization, etc.).

---

## Q

**Quality Control (QC)**
: Procedures ensuring data quality, including outlier detection, cosmic ray removal, and baseline verification.

**Quantile Normalization**
: Advanced normalization making intensity distributions identical across spectra.

---

## R

**Raman Scattering**
: Inelastic scattering of photons by molecules, providing fingerprint of molecular structure.

**Raman Shift**
: Energy difference between incident and scattered photons, measured in wavenumbers (cm⁻¹).

**Random Forest (RF)**
: Ensemble machine learning algorithm using multiple decision trees. Robust and interpretable.

**Regularization**
: Adding penalty to model complexity to prevent overfitting. L1 (Lasso) and L2 (Ridge) are common types.

**ROC Curve**
: Receiver Operating Characteristic curve plotting True Positive Rate vs False Positive Rate. ROC-AUC measures classification performance.

---

## S

**Savitzky-Golay Filter**
: Smoothing method fitting polynomials to local windows. Preserves peak shapes better than simple averaging.

**Scree Plot**
: Plot of explained variance vs principal component number. Used to select number of PCs to keep.

**SHAP (SHapley Additive exPlanations)**
: Method for interpreting machine learning models by assigning importance value to each feature for each prediction.

**Smoothing**
: Reducing noise by averaging neighboring points.
Savitzky-Golay and Gaussian are common methods.

**SNV (Standard Normal Variate)**
: Normalization method centering and scaling each spectrum independently. Robust for biological samples.

**Spectrum (plural: Spectra)**
: Plot of Raman intensity vs wavenumber for a single measurement.

**Stratified Sampling**
: Splitting data while maintaining class proportions in train and test sets. Important for imbalanced data.

**SVM (Support Vector Machine)**
: Machine learning algorithm finding optimal separating hyperplane between classes. Effective for high-dimensional data.

---

## T

**t-SNE (t-Distributed Stochastic Neighbor Embedding)**
: Non-linear dimensionality reduction emphasizing local structure. Good for visualizing clusters.

**Test Set**
: Data held out during training, used only for final performance evaluation.

**Training Set**
: Data used to train machine learning model. Model learns patterns from training set.

---

## U

**UMAP (Uniform Manifold Approximation and Projection)**
: Non-linear dimensionality reduction preserving both local and global structure. Typically faster than t-SNE.

**Underfitting**
: When model is too simple to capture data patterns, causing poor performance on both training and test data.

---

## V

**Validation Set**
: Data used during training for hyperparameter tuning and model selection (separate from test set).

**Vector Normalization**
: Scaling each spectrum to unit length (Euclidean norm = 1). Most common normalization for Raman data.

---

## W

**Wavenumber**
: Unit for Raman shift, measured in cm⁻¹ (inverse centimeters). Proportional to vibrational energy.

**Whittaker Smoothing**
: Smoothing method using penalized least squares. Controlled by lambda parameter.

---

## X

**XGBoost (eXtreme Gradient Boosting)**
: Advanced gradient boosting machine learning algorithm. Often achieves highest accuracy but requires careful tuning.

---

## Japanese Terms / 日本語用語

**未病 (Mibyō)**
: Pre-disease state - condition before full disease manifestation.
Focus of early detection research.

**臨床光情報工学研究室 (Rinsho Hikari Jōhō Kōgaku Kenkyūshitsu)**
: Laboratory for Clinical Photonics and Information Engineering at University of Toyama.

---

## Abbreviations

| Abbreviation | Full Name |
| ------------ | --------- |
| AI | Artificial Intelligence |
| ALS | Alternating Least Squares |
| ANOVA | Analysis of Variance |
| API | Application Programming Interface |
| AsLS | Asymmetric Least Squares |
| AUC | Area Under Curve |
| CDAE | Convolutional Denoising Autoencoder |
| CI | Confidence Interval |
| CSV | Comma-Separated Values |
| DPI | Dots Per Inch |
| EM | Expectation-Maximization |
| FABC | Fixed-Anchor Baseline Correction |
| FDR | False Discovery Rate |
| FN | False Negative |
| FP | False Positive |
| GUI | Graphical User Interface |
| ICA | Independent Component Analysis |
| KNN | K-Nearest Neighbors |
| LOPOCV | Leave-One-Patient-Out Cross-Validation |
| MAT | MATLAB File Format (not currently supported) |
| MCR | Multivariate Curve Resolution |
| MGUS | Monoclonal Gammopathy of Undetermined Significance |
| ML | Machine Learning |
| MM | Multiple Myeloma |
| MSC | Multiplicative Scatter Correction |
| NMF | Non-negative Matrix Factorization |
| PC | Principal Component |
| PCA | Principal Component Analysis |
| PLS | Partial Least Squares |
| PLS-DA | Partial Least Squares Discriminant Analysis |
| PQN | Probabilistic Quotient Normalization |
| QC | Quality Control |
| RF | Random Forest |
| ROC | Receiver Operating Characteristic |
| SMOTE | Synthetic Minority Over-sampling Technique |
| SNV | Standard Normal Variate |
| SVM | Support Vector Machine |
| t-SNE | t-Distributed Stochastic Neighbor Embedding |
| TN | True Negative |
| TP | True Positive |
| UMAP | Uniform Manifold Approximation and Projection |
| UV | uv package manager |

---

## Contributing

Found a term that should be added?

1. Fork the [repository](https://github.com/zerozedsc/Raman-Spectroscopy-Analysis-Application)
2. Edit `docs/glossary.md`
3. Add term in alphabetical order with clear definition
4. Submit pull request

Please include:

- **Term** in bold
- Clear, concise definition
- Related terms or examples where helpful