Machine Learning Guide
Complete guide to building, training, evaluating, and deploying machine learning models for Raman spectroscopy classification and regression.
Overview
Modeling & Classification Page
The ML page provides a complete workflow for supervised learning:

Figure: Machine Learning page showing dataset selector (left), algorithm/validation configuration (center-right), and results dashboard (bottom)
Note: The ML page layout includes:
Left Panel: Training data selection and group assignment
Center-Right: Algorithm selection, validation strategy, and hyperparameter optimization settings
Bottom: Results dashboard with performance metrics, ROC curves, confusion matrix, and feature importance
Key Features:
Configure algorithms, validation, and hyperparameters in one place
Review metrics and plots (AUC/accuracy, ROC curve, confusion matrix, feature importance)
Export and report actions: View Detailed Results, Export Model, Save Report
When to Use ML
Use Cases:
✅ Classification: Diagnose disease presence/absence
✅ Multi-class: Classify multiple disease types
✅ Regression: Predict continuous values (concentration, biomarker levels)
✅ Feature selection: Identify most important wavenumbers
Requirements:
Labeled data (group assignments or values)
Sufficient samples (>50 per class recommended)
Preprocessed spectra
Clear research question
Not suitable for:
❌ Exploratory analysis (use PCA, UMAP instead)
❌ Unlabeled data (use clustering)
❌ Very small datasets (<20 samples)
❌ When interpretability is the sole goal (use statistical tests)
ML Workflow
Complete Workflow
graph TD
A[Preprocessed Data] --> B[Data Splitting]
B --> C[Training Set 70%]
B --> D[Test Set 30%]
C --> E[Algorithm Selection]
E --> F[Hyperparameter Tuning]
F --> G[Cross-Validation]
G --> H[Best Model]
H --> I{Performance OK?}
I -->|No| E
I -->|Yes| J[Test Set Evaluation]
J --> K[Feature Importance]
K --> L[Model Export]
Step-by-Step Guide
Step 1: Prepare Data
Requirements:
Preprocessed spectra
Group labels assigned
No missing values
Balanced classes (if possible)
Check:
Data Summary:
- Healthy Control: 75 samples
- Disease Group: 68 samples
- Total: 143 samples
- Features: 1000 wavenumbers
- Class balance: 52% / 48% ✓
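The class-balance figures in the summary above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the `labels` list is a hypothetical stand-in for the group assignments made in the app, not something the app exports in this form.

```python
from collections import Counter

# Hypothetical group labels matching the counts in the data summary
labels = ["Healthy Control"] * 75 + ["Disease Group"] * 68

counts = Counter(labels)
total = sum(counts.values())

# Share of each class as a percentage of all samples
balance = {name: 100 * n / total for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} samples ({balance[name]:.0f}%)")
print(f"Total: {total} samples")
```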
Step 2: Split Data
Strategy: Separate test set before any ML
Recommended Split:
Total: 143 samples
Training Set: 100 samples (70%)
- Used for: Model training, hyperparameter tuning, CV
- Healthy: 52 samples
- Disease: 48 samples
Test Set: 43 samples (30%)
- Used for: Final evaluation only
- Healthy: 23 samples
- Disease: 20 samples
Implementation:
Click [Split Dataset] button
Choose split strategy:
Patient-level: Keep all spectra from one patient together ✓ (Recommended)
Spectrum-level: Random split (risk of data leakage)
Set split ratio: 70/30
Click [Split]
Result: Two new datasets created
training_set.csv
test_set.csv
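Outside the app, a patient-level 70/30 split can be sketched with scikit-learn's `GroupShuffleSplit`. The data below is synthetic and the variable names (`patient_ids`, `X`, `y`) are illustrative assumptions; how the app implements its own split is not documented here. Note that this splitter keeps patients together but does not stratify by class.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 patients x 5 spectra each, 100 wavenumbers
n_patients, spectra_per_patient, n_features = 20, 5, 100
patient_ids = np.repeat(np.arange(n_patients), spectra_per_patient)
X = rng.normal(size=(patient_ids.size, n_features))
y = patient_ids % 2  # hypothetical labels: one class per patient

# One split in which each patient lands entirely in train OR test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

train_patients = set(patient_ids[train_idx].tolist())
test_patients = set(patient_ids[test_idx].tolist())
print(f"{len(train_patients)} train patients, {len(test_patients)} test patients")
```

Because the split is made on patient IDs rather than on rows, no spectrum from a test patient can end up in the training set.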
Step 3: Select Training Data
On the ML page, select the training set only
Verify groups are correct
Check for class imbalance
Step 4: Choose Algorithm
See Algorithm Selection section for details
Quick Recommendations:
Start with: Random Forest (robust, fast, interpretable)
If RF works well: Try XGBoost (often better performance)
For linear separation: Logistic Regression or SVM (linear kernel)
For complex patterns: SVM (RBF kernel) or XGBoost
Step 5: Configure Validation
GroupKFold (Recommended):
Ensures patient-level splitting
Prevents data leakage
n=5 or n=10 folds
Why this matters:
Wrong: Spectrum-level split
Patient 1: Spectra A, B, C → Some in train, some in test
Result: Overly optimistic performance (data leakage)
Correct: Patient-level split
Patient 1: All spectra → All in train OR all in test
Result: Realistic generalization estimate
Step 6: Set Hyperparameters
Option A: Use Defaults (Quick Start)
Good for: Initial exploration, baseline performance
Option B: Grid Search (Thorough)
Good for: Optimizing performance, publication
Time: Longer (5-30 minutes depending on data size)
Option C: Random Search (Efficient)
Good for: Large parameter spaces, time constraints
Time: Medium (2-10 minutes)
Step 7: Train Model
Click [Train Model]
Progress bar shows:
Current fold (1/5, 2/5, …)
Current hyperparameter combination
Estimated time remaining
Wait for training to complete
Training Time:
Random Forest: ~10-60 seconds
SVM: ~30-120 seconds
XGBoost: ~20 seconds to 5 minutes (depending on data size)
SVM (RBF kernel): ~30 seconds to 5 minutes
Step 8: Evaluate on Test Set
After training:
Model trained on full training set with best hyperparameters
Automatically evaluated on test set
Results shown in dashboard
Never:
Use test set during hyperparameter tuning
Use test set during feature selection
Train on test set
Look at test set before finalizing model
Algorithm Selection
Support Vector Machine (SVM)
Principle: Find optimal hyperplane separating classes
Strengths:
✓ High-dimensional data (many wavenumbers)
✓ Small to medium datasets
✓ Good generalization
✓ Kernel trick for non-linear separation
Weaknesses:
✗ Slow on large datasets (>5000 samples)
✗ Sensitive to scaling (auto-scaled in app)
✗ Black-box (hard to interpret)
Hyperparameters:
C: Regularization parameter (0.1 - 100, default: 1.0)
- Lower → More regularization, simpler decision boundary
- Higher → Less regularization, more complex boundary
Kernel: Transformation function
- 'linear': For linearly separable data
- 'rbf': Most common, for non-linear data
- 'poly': Polynomial kernel
gamma (for RBF): Kernel coefficient (0.001 - 1, default: 'scale')
- Lower → Smooth decision boundary
- Higher → Complex decision boundary
When to use:
Binary classification
Non-linear decision boundaries (use RBF kernel)
Feature dimensions > sample size
Need probabilistic outputs (available through an extra calibration step such as Platt scaling, not native to SVM)
Random Forest (RF)
Principle: Ensemble of decision trees voting on prediction
Strengths:
✓ Fast training and prediction
✓ Handles non-linear relationships
✓ Feature importance built-in
✓ Robust to outliers
✓ No scaling required
✓ Good baseline performance
Weaknesses:
✗ Can overfit on noisy data
✗ Often less accurate than gradient boosting
✗ Biased toward high-cardinality features
Hyperparameters:
n_estimators: Number of trees (50 - 500, default: 100)
- More trees → Better performance, slower
- Diminishing returns after ~200
max_depth: Maximum tree depth (5 - 50, default: None)
- Limit depth to prevent overfitting
- None = unlimited (risky)
min_samples_split: Min samples to split node (2 - 20, default: 2)
- Higher → More regularization
max_features: Features per split ('sqrt', 'log2', default: 'sqrt')
- 'sqrt': Good default for classification
- 'log2': Alternative
When to use:
First choice for most problems
Baseline model
Interpretability via feature importance
Fast training required
XGBoost (Extreme Gradient Boosting)
Principle: Sequentially build trees, each correcting previous errors
Strengths:
✓ State-of-the-art performance
✓ Built-in regularization
✓ Handles missing values
✓ Feature importance
✓ Efficient implementation
Weaknesses:
✗ More hyperparameters to tune
✗ Slower than Random Forest
✗ Can overfit if not regularized
Hyperparameters:
n_estimators: Number of boosting rounds (50 - 500, default: 100)
learning_rate: Step size (0.01 - 0.3, default: 0.1)
- Lower → Often better generalization, but needs more estimators
- Higher → Faster training, risk of overfitting
max_depth: Tree depth (3 - 10, default: 6)
- Deeper → More complex, higher risk of overfit
subsample: Row sampling (0.5 - 1.0, default: 1.0)
- < 1.0 → Regularization via sampling
colsample_bytree: Column sampling (0.5 - 1.0, default: 1.0)
- < 1.0 → Regularization, faster training
When to use:
After Random Forest shows promise
Need best possible performance
Kaggle-style competitions
Publication-worthy results
Logistic Regression (LR)
Principle: Linear model with sigmoid activation
Strengths:
✓ Fast training
✓ Interpretable coefficients
✓ Probabilistic outputs
✓ Works well for linearly separable data
✓ Little hyperparameter tuning needed (mainly C and penalty)
Weaknesses:
✗ Limited to linear decision boundaries
✗ May underfit complex data
Hyperparameters:
C: Inverse regularization (0.01 - 100, default: 1.0)
- Lower → More regularization
- Higher → Less regularization
penalty: Regularization type ('l1', 'l2', default: 'l2')
- 'l2': Ridge (most common)
- 'l1': Lasso (sparse solutions, feature selection)
When to use:
Baseline model
Need interpretable coefficients
Linearly separable classes
Fast training required
Probabilistic predictions needed
Algorithm Comparison
| Algorithm | Speed | Accuracy | Interpretability | Hyperparameters | Recommended For |
|---|---|---|---|---|---|
| Logistic Regression | ⚡⚡⚡ | ⭐⭐ | ⭐⭐⭐ | ⚙️ | Baseline, linear problems |
| Random Forest | ⚡⚡ | ⭐⭐⭐ | ⭐⭐ | ⚙️⚙️ | First choice, most cases |
| XGBoost | ⚡ | ⭐⭐⭐⭐ | ⭐⭐ | ⚙️⚙️⚙️ | Best performance |
| SVM | ⚡ | ⭐⭐⭐ | ⭐ | ⚙️⚙️ | High-dimensional, kernel methods |
Note: Neural networks (MLP) are planned for future releases.
Training and Validation
Validation Strategies
GroupKFold (Recommended)
Purpose: Patient-level cross-validation
How it works:
Fold 1: Patients 1-4 (train) | Patient 5 (test)
Fold 2: Patients 1-3, 5 (train) | Patient 4 (test)
Fold 3: Patients 1-2, 4-5 (train) | Patient 3 (test)
...
Advantages:
✓ Prevents data leakage
✓ Realistic generalization estimate
✓ Tests on unseen patients
✓ Required for patient-level spectra
When to use: Always when you have patient-level data
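The fold layout described above can be reproduced with scikit-learn's `GroupKFold`. This is a minimal sketch on placeholder data; the five patients, the three spectra per patient, and the labels are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in: 5 patients x 3 spectra each
patient_ids = np.repeat(np.arange(1, 6), 3)
X = np.zeros((patient_ids.size, 10))  # placeholder spectra
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])

folds = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=patient_ids):
    train_p = sorted(set(patient_ids[train_idx].tolist()))
    test_p = sorted(set(patient_ids[test_idx].tolist()))
    folds.append((train_p, test_p))
    print(f"train patients {train_p} | test patients {test_p}")
```

Each patient appears in the test position of exactly one fold, and never in the training and test parts of the same fold.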
Stratified K-Fold
Purpose: Maintain class proportions in each fold
How it works:
Fold 1: 80% train (balanced) | 20% test (balanced)
Fold 2: Different 80%/20% split, still balanced
...
When to use:
No patient-level structure
Independent samples
Class imbalance present
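To see what "balanced" folds means in practice, the sketch below runs scikit-learn's `StratifiedKFold` on synthetic labels with an 80/20 class ratio (an assumed ratio chosen for illustration) and records the disease fraction in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 80 healthy (0), 20 disease (1)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((y.size, 10))  # placeholder spectra

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_disease_fractions = []
for train_idx, test_idx in skf.split(X, y):
    # Fraction of disease samples in this fold's test part
    test_disease_fractions.append(float(y[test_idx].mean()))

print(test_disease_fractions)
```

Every test fold keeps the overall 20% disease share, which a plain random split does not guarantee.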
Leave-One-Patient-Out (LOPOCV)
Purpose: Maximum cross-validation rigor
How it works:
Iteration 1: Train on patients 2-N | Test on patient 1
Iteration 2: Train on patients 1, 3-N | Test on patient 2
...
Advantages:
✓ Most rigorous
✓ Uses all data
✓ Patient-level validation
Disadvantages:
✗ Computationally expensive
✗ High variance in estimates
When to use: Small datasets (<50 patients), need maximum rigor
Hold-out Test Set
Purpose: Single train/test split
Advantages:
✓ Fast
✓ Simple
✓ Easy to understand
Disadvantages:
✗ Single estimate (no confidence interval)
✗ Result depends on split
✗ Uses less training data
When to use: Initial exploration, very large datasets
Hyperparameter Optimization
Grid Search
Method: Try all combinations
Example:
# Random Forest Grid Search
n_estimators = [50, 100, 200]
max_depth = [10, 20, None]
min_samples_split = [2, 5, 10]
# Total combinations: 3 × 3 × 3 = 27
# With 5-fold CV: 27 × 5 = 135 model fits
Advantages:
✓ Exhaustive search
✓ Finds the best combination within the grid
✓ Reproducible
Disadvantages:
✗ Slow for large grids
✗ Wastes time on bad regions
When to use: Small parameter spaces, need best performance
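The combination count from the example can be checked directly, and the same idea can be run with scikit-learn's `GridSearchCV`. This is a sketch under stated assumptions: the data is synthetic, and the search itself uses a reduced grid (`small_grid`, a hypothetical name) so that it finishes in seconds. It is not the app's internal implementation.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# The full grid from the example above
full_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
}
n_combinations = len(list(product(*full_grid.values())))
n_fits = n_combinations * 5  # 5-fold CV
print(n_combinations, n_fits)

# Synthetic stand-in data: 12 patients x 5 spectra, 20 wavenumbers
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(12), 5)
y = groups % 2
X = rng.normal(size=(groups.size, 20))
X[y == 1, 3] += 2.0  # one informative feature

# Reduced grid so the sketch runs quickly
small_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    small_grid,
    cv=GroupKFold(n_splits=3),  # patient-level folds
    scoring="accuracy",
)
search.fit(X, y, groups=groups)
print(search.best_params_, round(search.best_score_, 3))
```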
Random Search
Method: Sample random combinations
Example:
# Sample 20 random combinations
# Faster than grid search
# Often finds good solutions
Advantages:
✓ Faster than grid search
✓ Can cover large spaces
✓ Often finds good solutions quickly
When to use: Large parameter spaces, time constraints
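Sampling a fixed number of combinations can be sketched with scikit-learn's `ParameterSampler`; `RandomizedSearchCV` pairs the same sampling with cross-validation. The candidate values below are assumptions picked from the hyperparameter ranges listed earlier in this guide.

```python
from sklearn.model_selection import ParameterSampler

# Candidate values (hypothetical choices within the documented ranges)
param_space = {
    "n_estimators": [50, 100, 200, 300, 500],
    "max_depth": [5, 10, 20, 50, None],
    "min_samples_split": [2, 5, 10, 20],
}

# Draw 20 of the 5 x 5 x 4 = 100 possible combinations
sampled = list(ParameterSampler(param_space, n_iter=20, random_state=0))

for params in sampled[:3]:
    print(params)
print(f"{len(sampled)} combinations sampled")
```

With 5-fold cross-validation this would mean 100 model fits instead of the 500 a full grid over the same values would need.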
Bayesian Optimization
Method: Intelligently sample based on previous results
Advantages:
✓ Most efficient
✓ Fewer iterations needed
✓ Focuses on promising regions
Disadvantages:
✗ More complex
✗ May get stuck in local optima
When to use: Expensive model training, limited time
Preventing Overfitting
Strategies:
More data: Collect more samples
Regularization: Increase regularization parameters
Simpler model: Reduce complexity
Feature selection: Use fewer wavenumbers
Cross-validation: Ensure good validation strategy
Early stopping: For iteratively trained models (boosting rounds in XGBoost; neural networks once available)
Warning Signs:
Training Accuracy: 99.5%
Test Accuracy: 72.3%
→ Severe overfitting!
Training Accuracy: 92.1%
Test Accuracy: 91.3%
→ Good generalization ✓
Model Evaluation
Classification Metrics
Accuracy
Formula: (TP + TN) / (TP + TN + FP + FN)
When sufficient: Balanced classes
Limitations: Misleading for imbalanced data
Example:
Dataset: 95% Healthy, 5% Disease
Model: Predicts all as Healthy
Accuracy: 95% (but useless!)
Precision, Recall, F1-Score
Precision: Of predicted positives, how many are correct?
Precision = TP / (TP + FP)
Use: When false positives are costly
Recall (Sensitivity): Of actual positives, how many did we find?
Recall = TP / (TP + FN)
Use: When false negatives are costly
F1-Score: Harmonic mean of precision and recall
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Use: Balance precision and recall
ROC Curve and AUC
ROC (Receiver Operating Characteristic):
Plot: True Positive Rate vs False Positive Rate
Shows trade-off at different thresholds
AUC (Area Under Curve):
AUC = 1.0: Perfect classifier
AUC = 0.9-1.0: Excellent
AUC = 0.8-0.9: Good
AUC = 0.7-0.8: Fair
AUC = 0.5: Random guessing
Example:
Figure: Example ROC curve, TPR (y-axis) vs FPR (x-axis), AUC = 0.92
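AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; the labels and scores are made up for illustration, and the function name `auc_from_scores` is hypothetical.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical predicted probabilities for the Disease class
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
print(auc_from_scores(y_true, scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a small test set; library routines use a sorted-rank formulation instead.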
Confusion Matrix
Structure:
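| | Predicted Healthy | Predicted Disease | Per-class rate |
|---|---|---|---|
| Actual Healthy | 42 | 3 | Specificity: 93% |
| Actual Disease | 2 | 18 | Recall: 90% |
| Precision | 95% | 86% | |

Overall accuracy: (42 + 18) / 65 = 92%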
Interpretation:
Diagonal: Correct predictions
Off-diagonal: Errors
False Positives (FP): Healthy predicted as Disease (3)
False Negatives (FN): Disease predicted as Healthy (2)
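The metric formulas from this section can be applied to the four counts of the confusion matrix above. A minimal sketch, treating Disease as the positive class:

```python
# Counts from the confusion matrix, with Disease as the positive class
tp, fn = 18, 2  # actual Disease: predicted Disease / predicted Healthy
tn, fp = 42, 3  # actual Healthy: predicted Healthy / predicted Disease

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:    {accuracy:.3f}")
print(f"Precision:   {precision:.3f}")
print(f"Recall:      {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1-score:    {f1:.3f}")
```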
Multi-Class Metrics
Macro-Average: Average metrics across classes (equal weight)
Micro-Average: Aggregate TP, FP, FN then calculate (weighted by class size)
Weighted-Average: Weight by class size
Example:
3-Class Problem:
Class A: Precision=0.95, Recall=0.92 (100 samples)
Class B: Precision=0.88, Recall=0.85 (50 samples)
Class C: Precision=0.91, Recall=0.89 (30 samples)
Macro Precision = (0.95 + 0.88 + 0.91) / 3 = 0.913
Weighted Precision = (0.95×100 + 0.88×50 + 0.91×30) / 180 = 0.924
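The two averages in the 3-class example can be recomputed from the per-class values. This is a short sketch in plain Python; micro-averaging is left out because it needs the raw TP and FP counts, which the example does not give.

```python
# Per-class precision and sample counts from the 3-class example
precision = {"A": 0.95, "B": 0.88, "C": 0.91}
support = {"A": 100, "B": 50, "C": 30}

# Macro: every class counts equally
macro = sum(precision.values()) / len(precision)

# Weighted: every class counts in proportion to its sample count
weighted = sum(precision[c] * support[c] for c in precision) / sum(support.values())

print(f"Macro precision:    {macro:.3f}")
print(f"Weighted precision: {weighted:.3f}")
```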
Regression Metrics
MAE (Mean Absolute Error)
MAE = mean(|predicted - actual|)
Units: Same as target variable
Interpretation: Average prediction error
RMSE (Root Mean Squared Error)
RMSE = sqrt(mean((predicted - actual)²))
Units: Same as target variable
Interpretation: Penalizes large errors more
R² (Coefficient of Determination)
R² = 1 - (SS_residual / SS_total)
Range: -∞ to 1
R² = 1: Perfect predictions
R² = 0: No better than mean
R² < 0: Worse than mean
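The three regression metrics follow directly from their definitions. The sketch below implements them with the standard library only; the `actual` and `predicted` values are invented for illustration and the function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    n = len(actual)
    errors = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_residual = sum(e * e for e in errors)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_residual / ss_total
    return mae, rmse, r2


# Hypothetical concentrations (same units as the target variable)
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.5]

mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

RMSE is never smaller than MAE; the gap grows when a few predictions are far off, which is what "penalizes large errors more" means in practice.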
Model Interpretation
Feature Importance
Purpose: Identify most important wavenumbers
Random Forest Feature Importance
Method: Mean decrease in impurity
Interpretation:
Figure: Importance vs Wavenumber bar plot (x-axis: wavenumber, 400 to 1600 cm⁻¹; y-axis: importance)
Key peaks: 1655, 1450, 1200 cm⁻¹
Use: Identify biomarker regions
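Impurity-based importances are exposed by scikit-learn's `RandomForestClassifier` as `feature_importances_`. The sketch below uses synthetic spectra in which only the 1650 cm⁻¹ channel differs between classes; the wavenumber grid and the planted band are assumptions made for illustration, not real Raman data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical wavenumber axis: 400 to 1850 cm^-1 in steps of 50
wavenumbers = np.arange(400, 1900, 50)

# Synthetic spectra: pure noise, plus a class difference at 1650 cm^-1
y = np.repeat([0, 1], 40)
X = rng.normal(size=(y.size, wavenumbers.size))
band = int(np.where(wavenumbers == 1650)[0][0])
X[y == 1, band] += 3.0

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # mean decrease in impurity

# Rank wavenumbers from most to least important
order = np.argsort(importances)[::-1]
for idx in order[:3]:
    print(f"{wavenumbers[idx]} cm^-1: {importances[idx]:.3f}")
```

The importances sum to 1, so each value can be read as a share of the total impurity reduction.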
SHAP Values (SHapley Additive exPlanations)
Purpose: Explain an individual prediction by attributing the model output to Raman-shift features.
How SHAP works in this app:
Train a model on the ML page.
Click SHAP (no retraining).
In the SHAP parameter dialog, choose:
Explain set (Train/Test)
Dataset and spectrum index
Background samples / Max evals / Top-k (when available)
A modal SHAP Result dialog opens and shows progress.
What you will see (tabs):
Spectrum: the selected spectrum (with contributor markers when available)
SHAP: a per-feature contribution bar plot across the Raman shift axis
Red = positive contribution (pushes prediction toward the predicted class)
Blue = negative contribution (pushes prediction away)
Summary / Report: predicted class name + probability and a ranked contributor table
Provenance: dataset/sample provenance (best-effort)
Export: Use Export to write a SHAP bundle (plots + contributors CSV + raw JSON).
Performance notes:
SHAP can be slow on large feature vectors; Background samples and Max evals have the biggest impact.
Use Stop to cancel a long run (cancellation is best-effort/cooperative).
Permutation Importance
Method: Shuffle feature, measure performance drop
Advantages:
✓ Model-agnostic
✓ Reflects real predictive power
✓ Accounts for interactions
Use: Alternative to impurity-based feature importance
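The shuffle-and-measure idea is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below applies it to a logistic regression on held-out synthetic data; the planted informative feature (index 5) and the 80/40 split are assumptions for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: 120 samples, 20 features, only feature 5 is informative
y = np.tile([0, 1], 60)
X = rng.normal(size=(y.size, 20))
X[y == 1, 5] += 3.0

# Fit on the first 80 samples, measure importance on the remaining 40
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Mean accuracy drop when each feature is shuffled
top = int(np.argmax(result.importances_mean))
print(f"Most important feature: {top} "
      f"(mean drop {result.importances_mean[top]:.3f})")
```

Measuring the drop on held-out data rather than on the training set keeps the importance estimate tied to generalization.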
Model Transparency
For Clinicians and Reviewers:
Report:
Training procedure
Validation strategy (GroupKFold with patient-level splitting)
Hyperparameters used
Performance metrics with confidence intervals
Visualize:
ROC curves
Confusion matrices
Feature importance plots
Example predictions with explanations
Validate:
Independent test set
External validation (different hospital, cohort)
Temporal validation (future patients)
Model Export and Deployment
Saving Trained Models
Formats:
1. Pickle (PKL) - Python native
# Saves entire scikit-learn model
File: trained_model.pkl
Size: 5-50 MB (depends on model)
Use: Python deployment
2. ONNX (Open Neural Network Exchange)
# Cross-platform format
File: trained_model.onnx
Supports: Python, C++, Java, C#
Use: Production deployment, embedded systems
3. JSON (Metadata Only)
# Model architecture and parameters
File: model_config.json
Use: Documentation, reproduction
Export Steps:
Train and validate model
Click [Export Model]
Select format (PKL or ONNX)
Choose location
Save
Loading Models for Prediction
In Application:
Go to ML page
Click [Load Model]
Select model file (.pkl or .onnx)
Click [Predict]
Select new data
View predictions
External Use (Python):
import pickle
import numpy as np

# Load model
with open('trained_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Load new data
new_spectra = np.loadtxt('new_data.csv', delimiter=',')

# Preprocess (use SAME pipeline as training!)
# ... apply baseline correction, smoothing, normalization ...

# Predict
predictions = model.predict(new_spectra)
probabilities = model.predict_proba(new_spectra)
print(f"Predicted class: {predictions[0]}")
print(f"Probability: {probabilities[0]}")
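One way to keep fitted preprocessing and the classifier in step is to bundle them in a scikit-learn `Pipeline` and pickle that single object. This is a sketch on synthetic data, not a description of what the app's .pkl export contains; only scaling is bundled here, and spectral steps such as baseline correction would still have to be applied separately.

```python
import pickle

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic training spectra with one informative channel
y_train = np.tile([0, 1], 30)
X_train = rng.normal(size=(y_train.size, 50))
X_train[y_train == 1, 10] += 2.0

# Scaler and classifier are fitted and stored as one object
pipeline = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", probability=True, random_state=0),
)
pipeline.fit(X_train, y_train)

# Round trip through pickle (in memory here; a file works the same way)
restored = pickle.loads(pickle.dumps(pipeline))

X_new = rng.normal(size=(5, 50))
print(restored.predict(X_new))
print(restored.predict_proba(X_new).round(3))
```

Because the scaler's fitted statistics travel with the model, new spectra are scaled exactly as the training spectra were.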
Model Deployment Checklist
Before Deployment:
Model validated on independent test set
Preprocessing pipeline saved and documented
Performance metrics meet requirements
Feature importance analyzed and validated
Model interpretability verified
Documentation completed
Code reviewed
Ethical approval obtained (if clinical use)
Deployment Considerations:
Data Quality: Ensure new data matches training data quality
Preprocessing: Apply EXACT same pipeline
Model Updates: Plan for retraining with new data
Monitoring: Track prediction confidence and errors
Human Oversight: Always include expert review for critical decisions
Troubleshooting
Poor Performance (Accuracy <70%)
Causes:
Insufficient preprocessing
Classes not actually separable
Too few samples
Inappropriate algorithm
Solutions:
Improve preprocessing (try different methods)
Increase sample size
Try different algorithms
Check if problem is solvable (run PCA first)
Verify labels are correct
Overfitting (Train >> Test)
Symptoms:
Training Accuracy: 98%
Test Accuracy: 68%
Difference: 30 percentage points → Overfitting!
Solutions:
Reduce model complexity (lower max_depth, fewer estimators)
Increase regularization (lower C, higher alpha)
Use more training data
Feature selection (reduce number of wavenumbers)
Stronger cross-validation
Class Imbalance
Problem:
Healthy: 300 samples
Disease: 30 samples (10:1 ratio)
Model predicts all as Healthy → ~91% accuracy but useless
Solutions:
Collect more data (best solution)
Use class_weight='balanced' in model
SMOTE (Synthetic Minority Over-sampling)
Undersampling majority class
Adjust decision threshold
Focus on F1-score instead of accuracy
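The effect of `class_weight='balanced'` can be worked out by hand: scikit-learn documents the heuristic as n_samples / (n_classes × class_count). The sketch below applies it to the counts from the problem statement above and also computes the accuracy of the all-Healthy baseline.

```python
# Class counts from the imbalance example
counts = {"Healthy": 300, "Disease": 30}
n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: weight = n_samples / (n_classes * class_count)
weights = {name: n_samples / (n_classes * n) for name, n in counts.items()}

# Accuracy of a model that predicts Healthy for every sample
baseline_accuracy = counts["Healthy"] / n_samples

for name, w in weights.items():
    print(f"{name}: weight {w:.2f}")
print(f"All-Healthy baseline accuracy: {baseline_accuracy:.1%}")
```

Each Disease sample ends up weighted ten times as heavily as a Healthy sample, so both classes contribute equally to the training loss.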
Data Leakage
Problem: Information from test set leaks into training
Common Causes:
Fitting dataset-level normalization or scaling before splitting
Feature selection on full dataset
Using test set during hyperparameter tuning
Spectrum-level split with patient-level replicates
Prevention:
Split data FIRST (patient-level)
Fit preprocessing ONLY on training set
NEVER look at test set during model development
Use GroupKFold for patient-level data
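The prevention rules above can be combined in a few lines: putting the scaler inside a scikit-learn `Pipeline` means it is refitted on the training part of every fold, and `GroupKFold` keeps each patient on one side of the split. The data is synthetic and the model choice is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 patients x 4 spectra, 30 wavenumbers
groups = np.repeat(np.arange(20), 4)
y = groups % 2  # hypothetical labels: one class per patient
X = rng.normal(size=(groups.size, 30))
X[y == 1, 7] += 1.5  # one informative feature

# The scaler is fitted inside each training fold, never on test data
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(
    model, X, y, groups=groups, cv=GroupKFold(n_splits=5)
)
print(scores.round(3), f"mean={scores.mean():.3f}")
```

Scaling the whole matrix once before calling `cross_val_score` would let test-fold statistics influence training, which is the leakage pattern listed under Common Causes.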
See Also
Analysis Guide - Previous step: Exploratory analysis
Machine Learning Methods Reference - Detailed algorithm documentation
Best Practices - ML best practices
FAQ - Machine Learning - Common questions