Analysis Methods Reference

This comprehensive reference documents all analysis methods available in the Raman Spectroscopy Analysis Application, including preprocessing algorithms, exploratory analysis, statistical tests, and machine learning methods.

Purpose of This Reference

Each method is documented with:

Purpose - What the method does and when to use it
Theory - Brief explanation of the underlying algorithm
Parameters - Complete parameter reference with recommendations
Interpretation Guide - How to understand and report results
Examples - Practical usage examples
Common Issues - Troubleshooting and limitations
References - Primary literature citations

Method Categories

Preprocessing Methods 

40+ methods for spectral preprocessing:

Category	Methods	Purpose
Baseline Correction	AsLS, AirPLS, Polynomial, Whittaker, FABC, Butterworth	Remove fluorescence background
Smoothing	Savitzky-Golay, Gaussian, Moving Average, Median Filter	Reduce noise
Normalization	Vector, Min-Max, Area, SNV, MSC, Quantile, PQN	Scale spectra
Derivatives	1st, 2nd order Savitzky-Golay	Enhance peaks
Feature Engineering	Peak Ratio, Wavelet Transform, Rank Transform	Extract features
Advanced	CDAE (Deep Learning), MCR-ALS, NMF	Denoising and decomposition

→ Complete Preprocessing Reference

Exploratory Analysis 

Unsupervised methods for data exploration:

Method	Purpose	When to Use
PCA	Dimensionality reduction, variance visualization	First step for all analyses
UMAP	Non-linear manifold learning	Complex high-dimensional data
t-SNE	Non-linear clustering visualization	Identifying subgroups
Hierarchical Clustering	Dendrogram-based grouping	Unknown group structure
K-means Clustering	Partition-based clustering	Known number of clusters

→ Complete Exploratory Analysis Reference
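As an illustration of PCA as a first exploratory step, the sketch below runs scikit-learn's `PCA` on synthetic spectra. The data, the band positions, and all variable names are assumptions made for this example; it does not show the application's own interface.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
wavenumbers = np.linspace(400, 1800, 300)

def peak(center, width=15.0):
    """Gaussian band on the shared wavenumber axis."""
    return np.exp(-0.5 * ((wavenumbers - center) / width) ** 2)

# Two synthetic groups (20 spectra each) that differ only in the 1450 band
group_a = peak(1000) + 0.5 * peak(1450) + rng.normal(0, 0.02, (20, 300))
group_b = peak(1000) + 1.0 * peak(1450) + rng.normal(0, 0.02, (20, 300))
X = np.vstack([group_a, group_b])
labels = np.array([0] * 20 + [1] * 20)

pca = PCA(n_components=3)
scores = pca.fit_transform(X)  # PCA mean-centers X internally

print("Explained variance ratio:", pca.explained_variance_ratio_)

# The PC1 loading with the largest magnitude points at the band
# that drives the separation between the groups.
top_band = wavenumbers[np.argmax(np.abs(pca.components_[0]))]
print(f"Largest PC1 loading near {top_band:.0f} cm^-1")
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by `labels`, gives the usual score plot; the loadings in `pca.components_` indicate which spectral regions contribute to each component.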

Statistical Methods 

Hypothesis testing and correlation analysis:

Method	Purpose	Tests / Options
Pairwise Tests	Compare two groups	t-test (normally distributed data), Mann-Whitney U (non-parametric)
ANOVA	Compare multiple groups	F-test (normally distributed data), Kruskal-Wallis (non-parametric)
Correlation Analysis	Find relationships	Pearson, Spearman, Kendall
Band Ratio Analysis	Calculate biochemical ratios	Protein/Lipid, Amide I/II, etc.
Peak Detection	Identify significant peaks	Automatic peak finding

→ Complete Statistical Methods Reference
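To make the band ratio and pairwise test rows concrete, here is a minimal sketch using NumPy and `scipy.stats` on synthetic spectra. The integration windows (around 1655 and 1445 cm^-1), the group names, and the helper functions `band_area` and `make_group` are illustrative assumptions, not values or functions taken from the application.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
wavenumbers = np.linspace(400, 1800, 700)

def band_area(spectra, lo, hi):
    """Sum intensities within [lo, hi] cm^-1 for each spectrum (row)."""
    mask = (wavenumbers >= lo) & (wavenumbers <= hi)
    return spectra[:, mask].sum(axis=1)

def make_group(lipid_scale, n=25):
    """Synthetic spectra: one protein-like and one lipid-like band plus noise."""
    protein = np.exp(-0.5 * ((wavenumbers - 1655) / 15) ** 2)
    lipid = lipid_scale * np.exp(-0.5 * ((wavenumbers - 1445) / 15) ** 2)
    return protein + lipid + rng.normal(0, 0.02, (n, wavenumbers.size))

control, disease = make_group(1.0), make_group(0.8)

# Ratio of the protein-like band to the lipid-like band, per spectrum
ratio_c = band_area(control, 1630, 1680) / band_area(control, 1420, 1470)
ratio_d = band_area(disease, 1630, 1680) / band_area(disease, 1420, 1470)

# Parametric (Welch's t-test) and non-parametric (Mann-Whitney U) comparison
t_stat, p_t = stats.ttest_ind(ratio_c, ratio_d, equal_var=False)
u_stat, p_u = stats.mannwhitneyu(ratio_c, ratio_d, alternative="two-sided")

# Cohen's d with a pooled standard deviation (equal group sizes)
pooled_sd = np.sqrt((ratio_c.var(ddof=1) + ratio_d.var(ddof=1)) / 2)
cohens_d = (ratio_d.mean() - ratio_c.mean()) / pooled_sd

print(f"Welch t-test p={p_t:.2e}, Mann-Whitney p={p_u:.2e}, d={cohens_d:.1f}")
```

Reporting the effect size next to the p-value shows how large the difference is, not only whether it is detectable.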

Machine Learning 

Supervised classification algorithms:

Algorithm	Type	Best For
Support Vector Machine (SVM)	Margin-based	High-dimensional, small sample
Random Forest	Ensemble tree	Robust, interpretable
XGBoost	Gradient boosting	High accuracy, large datasets
Logistic Regression	Linear	Baseline, interpretability

Note: Multi-Layer Perceptron (MLP) and PLS-DA are planned for future releases.

Validation Methods:

Stratified K-Fold
Leave-One-Patient-Out Cross-Validation (LOPOCV)
Time-series splits

Interpretability:

SHAP values (feature importance)
Permutation importance
Partial dependence plots
Decision boundary visualization

→ Complete Machine Learning Reference
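Of the interpretability options listed above, permutation importance is the simplest to sketch with scikit-learn alone (SHAP needs the separate `shap` package and is not shown). The example assumes synthetic data in which a single feature, at the hypothetical index `informative`, determines the class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
informative = 7  # hypothetical index of the only class-dependent feature
y = (X[:, informative] > 0).astype(int)

# Hold out a stratified test set so importance is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time and record the drop in test accuracy
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
print("Most important feature index:", ranking[0])
```

For real spectra, each feature index corresponds to a wavenumber, so the ranking can be mapped back to candidate bands.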

Quick Method Selector

By Research Question

“Are my groups different?”

Start with: PCA (visual)
Confirm with: Statistical Tests (quantitative)

“What are the key differentiating features?”

Use: Band Ratio Analysis
Or: SHAP Values from ML models

“Can I predict group membership?”

Use: Machine Learning (classification)
Validate with: GroupKFold

“What are the pure components in my mixture?”

Use: MCR-ALS (spectral unmixing)

“How many clusters exist in my data?”

Use: Hierarchical Clustering (dendrogram)
Or: Elbow Method with K-means
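One way to read a cluster count off a dendrogram is to look for the largest jump between consecutive merge heights. The sketch below does this with SciPy's hierarchical clustering on synthetic two-dimensional data; the data and the largest-gap heuristic are assumptions for illustration, and the K-means elbow method is not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
# Three well-separated synthetic clusters of 15 points each
X = np.vstack([c + rng.normal(0, 0.3, (15, 2)) for c in centers])

Z = linkage(X, method="ward")  # Ward linkage on Euclidean distances

# Z[:, 2] holds the merge heights in increasing order. A large jump
# between consecutive heights suggests a natural place to cut the tree.
heights = Z[:, 2]
gaps = np.diff(heights)
# Cutting after merge j (0-based) leaves len(X) - (j + 1) clusters
n_clusters = len(X) - (np.argmax(gaps) + 1)

labels = fcluster(Z, t=n_clusters, criterion="maxclust")
print("Suggested number of clusters:", n_clusters)
```

On real spectra the gaps are rarely this clear, so the suggested count is best treated as a starting point and checked against the dendrogram itself.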

By Data Characteristics

Small sample size (n < 30 per group)

Avoid: Deep learning, complex models
Use: Random Forest, SVM with careful validation
Validate with: LOPOCV

Large sample size (n > 100 per group)

Can use: All methods, including deep learning
Recommended: XGBoost (neural-network classifiers such as MLP are planned for a future release)
Validate with: Hold-out test set + cross-validation

High class imbalance (e.g., 90% healthy, 10% disease)

Use: Stratified sampling, SMOTE
Metrics: ROC-AUC, F1-score (not accuracy)
See: Handling Imbalance
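The reason accuracy is discouraged for imbalanced data can be seen with a classifier that always predicts the majority class. The sketch below uses scikit-learn metrics on a synthetic 90/10 split; SMOTE (from the separate `imbalanced-learn` package) is not shown.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# 90 healthy (0) and 10 disease (1) samples
y_true = np.array([0] * 90 + [1] * 10)

# A useless model: always predicts "healthy" with the same score
y_pred = np.zeros_like(y_true)
y_score = np.zeros(y_true.size, dtype=float)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)  # F1 for the disease class
auc = roc_auc_score(y_true, y_score)

print(f"Accuracy: {acc:.2f}")  # 0.90, looks good but is meaningless
print(f"F1-score: {f1:.2f}")   # 0.00, no disease case is detected
print(f"ROC-AUC:  {auc:.2f}")  # 0.50, no better than chance
```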

Multiple groups (>2)

Exploratory: PCA, UMAP with color-coded groups
Statistical: ANOVA with post-hoc tests
ML: Multi-class classification
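For the statistical step with more than two groups, a minimal sketch with `scipy.stats` is given below. It assumes one scalar feature per spectrum (for example a band ratio) and uses Bonferroni-corrected pairwise Welch t-tests as the post-hoc procedure; the group names and values are synthetic, and other post-hoc tests may be preferable in practice.

```python
import itertools

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# One scalar feature per spectrum for three synthetic groups
groups = {
    "control": rng.normal(1.00, 0.05, 20),
    "mild": rng.normal(1.02, 0.05, 20),
    "severe": rng.normal(1.20, 0.05, 20),
}

# Omnibus tests: is at least one group different?
f_stat, p_anova = stats.f_oneway(*groups.values())
h_stat, p_kruskal = stats.kruskal(*groups.values())

# Post-hoc: pairwise Welch t-tests with Bonferroni correction
pairs = list(itertools.combinations(groups, 2))
posthoc = {}
for a, b in pairs:
    _, p = stats.ttest_ind(groups[a], groups[b], equal_var=False)
    posthoc[(a, b)] = min(p * len(pairs), 1.0)

print(f"ANOVA p={p_anova:.2e}, Kruskal-Wallis p={p_kruskal:.2e}")
for pair, p_adj in posthoc.items():
    print(pair, f"adjusted p={p_adj:.3g}")
```

The omnibus test only says that some difference exists; the adjusted pairwise p-values indicate which groups differ.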

Method Selection Flowchart

graph TD
    A[Start] --> B{Research Goal?}
    B -->|Explore| C{Known Groups?}
    B -->|Test Hypothesis| D{How Many Groups?}
    B -->|Predict| E{Labeled Data?}
    
    C -->|Yes| F[PCA with Groups]
    C -->|No| G[Hierarchical Clustering]
    
    D -->|2 groups| H[t-test or Mann-Whitney]
    D -->|>2 groups| I[ANOVA or Kruskal-Wallis]
    
    E -->|Yes| J{Sample Size?}
    E -->|No| K[Unsupervised Learning]
    
    J -->|Small| L[Random Forest + LOPOCV]
    J -->|Large| M[XGBoost + Hold-out Test]
    
    F --> N[Examine Loadings]
    G --> O[Cut Dendrogram]
    H --> P[Calculate Effect Size]
    I --> P
    L --> Q[SHAP Interpretation]
    M --> Q
    K --> F

Preprocessing Guidelines

Recommended Pipeline for Raman Data

Minimum pipeline:

Baseline Correction (AsLS or AirPLS)
Smoothing (Savitzky-Golay)
Normalization (Vector or SNV)

For noisy data:

Cosmic Ray Removal
Baseline Correction (AirPLS)
Smoothing (Savitzky-Golay, window=11)
Normalization (SNV)
Outlier Detection & Removal

For classification:

Baseline Correction (AsLS)
Smoothing (Savitzky-Golay, window=7)
Normalization (Vector)
(Optional) Peak Ratio Features

For biomarker discovery:

Baseline Correction (AirPLS)
Minimal Smoothing (window=5)
Normalization (SNV)
No derivatives (keeps peak positions and shapes directly interpretable)

See Preprocessing: Recommended Pipelines for detailed guidance.
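The minimum pipeline (baseline correction, smoothing, normalization) can be sketched with NumPy and SciPy as follows. The `asls_baseline` function is a compact version of the asymmetric least squares algorithm by Eilers and Boelens written for this example; it is not the application's implementation, and the parameter values (`lam`, `p`, window length) and the synthetic spectrum are assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.signal import savgol_filter
from scipy.sparse.linalg import spsolve

def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least squares baseline estimate (illustrative sketch)."""
    n = y.size
    # Second-order difference operator penalizes baseline curvature
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        W = sparse.diags(w, 0)
        z = spsolve((W + penalty).tocsc(), w * y)
        # Points above the baseline (peaks) get the small weight p
        w = np.where(y > z, p, 1.0 - p)
    return z

# Synthetic spectrum: one peak on a sloping background plus noise
rng = np.random.default_rng(5)
x = np.linspace(400, 1800, 700)
peak = np.exp(-0.5 * ((x - 1000) / 10) ** 2)
background = 2.0 + 1e-3 * (x - 400)
raw = peak + background + rng.normal(0, 0.01, x.size)

# 1. Baseline correction
corrected = raw - asls_baseline(raw)
# 2. Smoothing (Savitzky-Golay, window=7, cubic polynomial)
smoothed = savgol_filter(corrected, window_length=7, polyorder=3)
# 3. Vector (L2) normalization
processed = smoothed / np.linalg.norm(smoothed)

print(f"Peak located at {x[np.argmax(processed)]:.0f} cm^-1")
```

The order matters: normalizing before baseline removal would scale each spectrum by its fluorescence background rather than by its Raman signal.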

Validation Best Practices

Critical Rules

Never fit preprocessing on test data
- Fit preprocessing on training data only
- Transform test data using training parameters
- See: Avoiding Data Leakage
Use patient-level splitting
- If multiple spectra per patient, keep patient’s spectra together
- Use GroupKFold with patient IDs
- See: GroupKFold Guide
Report all metrics
- Accuracy, Precision, Recall, F1, ROC-AUC
- Confusion matrix
- Confidence intervals or standard deviations
- See: Reporting Results
Validate preprocessing choices
- Test multiple preprocessing pipelines
- Use cross-validation to select pipeline
- Document final choices
- See: Preprocessing Validation
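The first two rules can be combined in a few lines of scikit-learn: placing the preprocessing inside a `Pipeline` means it is refit on the training portion of every fold, and `GroupKFold` keeps all spectra from one patient in the same fold. The synthetic data, the choice of `StandardScaler` and a linear `SVC`, and the variable names are assumptions for this sketch.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(6)
n_patients, spectra_per_patient, n_features = 20, 5, 20

# Every patient contributes several spectra; the class is set per patient
patient_ids = np.repeat(np.arange(n_patients), spectra_per_patient)
y = patient_ids % 2
X = rng.normal(size=(patient_ids.size, n_features))
X[:, :5] += 1.5 * y[:, None]  # class-dependent shift in five features

# The scaler is fitted on the training folds only, inside each CV split
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
cv = GroupKFold(n_splits=5)

# Sanity check: no patient appears in both training and test folds
for train_idx, test_idx in cv.split(X, y, groups=patient_ids):
    assert not set(patient_ids[train_idx]) & set(patient_ids[test_idx])

scores = cross_val_score(pipeline, X, y, groups=patient_ids, cv=cv)
print(f"Accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Scaling the full dataset before splitting, or splitting at the spectrum level, would both leak information and tend to inflate the reported accuracy.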

Parameter Selection Guides

Each method page includes:

Parameter Tables

Example from PCA:

Parameter	Type	Default	Range	Recommendation
n_components	int	3	2-10	Start with 2-3 for visualization, 5-10 for detailed analysis
scaling	str	StandardScaler	StandardScaler, MinMaxScaler, None	Always use StandardScaler for Raman data

Visual Parameter Guides

Example effects of changing parameters, with figures showing:

Before/after comparisons
Parameter sweep results
Optimal parameter ranges highlighted

Decision Trees

For complex methods, decision trees guide parameter selection:

Is your data noisy?
├─ Yes → Use higher lambda in baseline correction
└─ No → Use default lambda

Are peaks sharp or broad?
├─ Sharp → Use smaller smoothing window (5-7)
└─ Broad → Use larger smoothing window (11-15)
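The advice on smoothing windows can be checked numerically: on a sharp peak, wider Savitzky-Golay windows flatten the peak more. The sketch below uses `scipy.signal.savgol_filter` on a noise-free synthetic peak; the peak width and polynomial order are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.arange(201)
# Sharp synthetic peak: Gaussian with a standard deviation of 2 points
sharp_peak = np.exp(-0.5 * ((x - 100) / 2.0) ** 2)

heights = {}
for window in (5, 7, 11, 15):
    smoothed = savgol_filter(sharp_peak, window_length=window, polyorder=2)
    heights[window] = smoothed.max()
    print(f"window={window:2d}  peak height after smoothing={heights[window]:.3f}")
```

The original peak height is 1.0, so the printed values show how much intensity each window removes. Broad peaks tolerate the larger windows with far less distortion.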

Interpretation Guides

Every method includes:

Example Results

Annotated figures showing typical outputs
Good vs problematic results
Edge cases and how to handle them

What to Report

Required information for publications
Suggested figure panels
Statistical reporting guidelines

Common Misinterpretations

Frequent mistakes
How to avoid overinterpretation
Validity checks

Citations and References

Using This Documentation in Publications

If you use these methods, cite:

This software:

Rozain, M. H. (2025). Raman Spectroscopy Analysis Application.
University of Toyama. https://github.com/zerozedsc/Raman-Spectroscopy-Analysis-Application

Original method papers (provided for each method)
Key libraries:
- scikit-learn: Pedregosa et al., 2011
- RamanSPy: Georgiev et al., 2023
- pybaselines: Erb, 2023

Bibliography

Complete bibliography with DOIs is provided in References.

Contributing Method Documentation

Want to add a new method or improve existing documentation?

Follow the Contributing Guide
Include all required sections (Purpose, Theory, Parameters, etc.)
Provide at least one working example
Cite primary literature
Submit via pull request

See Contributing Guide for details.

Support

For method-specific questions:

Check the FAQ
Search GitHub Discussions
Review Troubleshooting Guide

For method requests or bug reports:

Open an issue on GitHub