Analysis Methods Reference

This comprehensive reference documents all analysis methods available in the Raman Spectroscopy Analysis Application, including preprocessing algorithms, exploratory analysis, statistical tests, and machine learning methods.

Purpose of This Reference

Each method is documented with:

  • Purpose - What the method does and when to use it

  • Theory - Brief explanation of the underlying algorithm

  • Parameters - Complete parameter reference with recommendations

  • Interpretation Guide - How to understand and report results

  • Examples - Practical usage examples

  • Common Issues - Troubleshooting and limitations

  • References - Primary literature citations

Method Categories

Preprocessing Methods

40+ methods for spectral preprocessing:

| Category | Methods | Purpose |
|---|---|---|
| Baseline Correction | AsLS, AirPLS, Polynomial, Whittaker, FABC, Butterworth | Remove fluorescence background |
| Smoothing | Savitzky-Golay, Gaussian, Moving Average, Median Filter | Reduce noise |
| Normalization | Vector, Min-Max, Area, SNV, MSC, Quantile, PQN | Scale spectra |
| Derivatives | 1st, 2nd order Savitzky-Golay | Enhance peaks |
| Feature Engineering | Peak Ratio, Wavelet Transform, Rank Transform | Extract features |
| Advanced | CDAE (Deep Learning), MCR-ALS, NMF | Denoising and decomposition |
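
As a concrete illustration of one normalization entry, the sketch below implements SNV (Standard Normal Variate) with NumPy: each spectrum is centered on its own mean and divided by its own standard deviation. The function name `snv` and the synthetic spectra are assumptions made for this example and are not the application's API.

```python
import numpy as np


def snv(spectra: np.ndarray) -> np.ndarray:
    """Standard Normal Variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std


# Two synthetic spectra with the same shape but a different offset and gain
# (hypothetical data, standing in for scattering or intensity differences).
base = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
spectra = np.vstack([base, 3.0 * base + 10.0])

normalized = snv(spectra)
print(normalized)
```

Because SNV removes a per-spectrum offset and gain, the two rows of `normalized` coincide even though the raw spectra differ in scale.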

→ Complete Preprocessing Reference

Exploratory Analysis

Unsupervised methods for data exploration:

| Method | Purpose | When to Use |
|---|---|---|
| PCA | Dimensionality reduction, variance visualization | First step for all analyses |
| UMAP | Non-linear manifold learning | Complex high-dimensional data |
| t-SNE | Non-linear clustering visualization | Identifying subgroups |
| Hierarchical Clustering | Dendrogram-based grouping | Unknown group structure |
| K-means Clustering | Partition-based clustering | Known number of clusters |

→ Complete Exploratory Analysis Reference

Statistical Methods

Hypothesis testing and correlation analysis:

| Method | Purpose | Tests / Options |
|---|---|---|
| Pairwise Tests | Compare two groups | t-test (normal), Mann-Whitney (non-parametric) |
| ANOVA | Compare multiple groups | F-test (normal), Kruskal-Wallis (non-parametric) |
| Correlation Analysis | Find relationships | Pearson, Spearman, Kendall |
| Band Ratio Analysis | Calculate biochemical ratios | Protein/Lipid, Amide I/II, etc. |
| Peak Detection | Identify significant peaks | Automatic peak finding |
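
To make the band ratio entry concrete, the following sketch integrates a spectrum over two wavenumber windows with the trapezoidal rule and divides the areas. The helper names (`band_area`, `band_ratio`), the synthetic two-band spectrum, and the window limits are all assumptions for illustration; they are placeholders rather than validated band assignments or the application's own routine.

```python
import numpy as np


def band_area(wavenumbers, intensities, lo, hi):
    """Integrate intensity over [lo, hi] cm^-1 using the trapezoidal rule."""
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(intensities, dtype=float)
    mask = (x >= lo) & (x <= hi)
    xs, ys = x[mask], y[mask]
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))


def band_ratio(wavenumbers, intensities, band_a, band_b):
    """Ratio of the integrated area of band_a to that of band_b."""
    return band_area(wavenumbers, intensities, *band_a) / band_area(
        wavenumbers, intensities, *band_b
    )


# Synthetic spectrum: two Gaussian bands of equal width, heights 2 and 1.
x = np.arange(1300.0, 1801.0, 1.0)


def gaussian(center, height, width=10.0):
    return height * np.exp(-0.5 * ((x - center) / width) ** 2)


spectrum = gaussian(1655.0, 2.0) + gaussian(1445.0, 1.0)

# Window limits are illustrative placeholders.
ratio = band_ratio(x, spectrum, (1600.0, 1710.0), (1400.0, 1490.0))
print(f"band ratio: {ratio:.3f}")
```

On real data, the ratio is sensitive to baseline correction and to the chosen window limits, so both are worth reporting alongside the value.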

→ Complete Statistical Methods Reference

Machine Learning

Supervised classification algorithms:

| Algorithm | Type | Best For |
|---|---|---|
| Support Vector Machine (SVM) | Margin-based | High-dimensional, small sample |
| Random Forest | Ensemble tree | Robust, interpretable |
| XGBoost | Gradient boosting | High accuracy, large datasets |
| Logistic Regression | Linear | Baseline, interpretability |

Note: Multi-Layer Perceptron (MLP) and PLS-DA are planned for future releases.

Validation Methods:

  • Stratified K-Fold

  • Leave-One-Patient-Out (LOPOCV)

  • Time-series splits

Interpretability:

  • SHAP values (feature importance)

  • Permutation importance

  • Partial dependence plots

  • Decision boundary visualization
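
Of the interpretability tools listed, permutation importance is the simplest to sketch: shuffle one feature at a time and record how much the accuracy drops. The example below is a minimal NumPy version built around a nearest-centroid classifier, which is a stand-in chosen for brevity and not one of the application's models; all function names and the synthetic data are assumptions.

```python
import numpy as np


def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(model, X):
    classes, centroids = model
    # Euclidean distance from every sample to every class centroid.
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean accuracy drop when each feature column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(model, X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(model, X_perm) == y))
        importances[j] = np.mean(drops)
    return importances


# Synthetic data: only feature 0 carries class information.
rng = np.random.default_rng(42)
n = 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 3))
X[:, 0] += 4.0 * y

model = fit_centroids(X, y)
importances = permutation_importance(model, X, y)
print(importances)
```

For brevity the importances are computed on the training data; computing them on held-out spectra is the usual choice when the goal is to explain generalization.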

→ Complete Machine Learning Reference

Quick Method Selector

By Research Question

“Are my groups different?”

“What are the key differentiating features?”

“Can I predict group membership?”

“What are the pure components in my mixture?”

“How many clusters exist in my data?”

By Data Characteristics

Small sample size (n < 30 per group)

  • Avoid: Deep learning, complex models

  • Use: Random Forest, SVM with careful validation

  • Validate with: LOPOCV
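
Leave-One-Patient-Out cross-validation holds out every spectrum from one patient per fold. The sketch below shows the splitting logic in plain Python; the function name `leave_one_patient_out` and the patient IDs are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices), one fold per patient."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    for pid, test_idx in by_patient.items():
        train_idx = [i for i, p in enumerate(patient_ids) if p != pid]
        yield pid, train_idx, test_idx


# Hypothetical patient IDs, one entry per spectrum.
patient_ids = ["P1", "P1", "P2", "P2", "P2", "P3"]

folds = list(leave_one_patient_out(patient_ids))
for pid, train_idx, test_idx in folds:
    print(pid, "train:", train_idx, "test:", test_idx)
```

The number of folds equals the number of patients, so with few patients the per-fold scores are noisy and are best summarized together.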

Large sample size (n > 100 per group)

  • Can use: All methods, including deep learning

  • Recommended: XGBoost, Neural Networks

  • Validate with: Hold-out test set + cross-validation

High class imbalance (e.g., 90% healthy, 10% disease)

  • Use: Stratified sampling, SMOTE

  • Metrics: ROC-AUC, F1-score (not accuracy)

  • See: Handling Imbalance
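
The reason accuracy misleads under imbalance can be shown with a small worked example. The sketch below scores a trivial classifier that always predicts "healthy" on the 90%/10% split mentioned above; the helper functions are written out by hand for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """F1-score for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# 90 healthy (0) and 10 disease (1) samples; the classifier always says "healthy".
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
f1_disease = f1(y_true, y_pred, positive=1)
print(f"accuracy = {acc:.2f}, F1 (disease) = {f1_disease:.2f}")
# prints "accuracy = 0.90, F1 (disease) = 0.00"
```

The classifier never detects a single disease case, yet its accuracy is 0.90; the F1-score for the disease class is 0.00 and exposes the failure.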

Multiple groups (>2)

  • Exploratory: PCA, UMAP with color-coded groups

  • Statistical: ANOVA with post-hoc tests

  • ML: Multi-class classification

Method Selection Flowchart

    graph TD
    A[Start] --> B{Research Goal?}
    B -->|Explore| C{Known Groups?}
    B -->|Test Hypothesis| D{How Many Groups?}
    B -->|Predict| E{Labeled Data?}
    
    C -->|Yes| F[PCA with Groups]
    C -->|No| G[Hierarchical Clustering]
    
    D -->|2 groups| H[t-test or Mann-Whitney]
    D -->|>2 groups| I[ANOVA or Kruskal-Wallis]
    
    E -->|Yes| J{Sample Size?}
    E -->|No| K[Unsupervised Learning]
    
    J -->|Small| L[Random Forest + LOPOCV]
    J -->|Large| M[XGBoost + Hold-out Test]
    
    F --> N[Examine Loadings]
    G --> O[Cut Dendrogram]
    H --> P[Calculate Effect Size]
    I --> P
    L --> Q[SHAP Interpretation]
    M --> Q
    K --> F
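
The same decision logic can be written as a small function, which is sometimes easier to reuse in scripts than a diagram. The sketch below mirrors the flowchart node by node; the function name `select_method` and its argument names are assumptions for this example and do not correspond to an application API.

```python
def select_method(goal, known_groups=False, n_groups=2, labeled=False, small_sample=True):
    """Return the ordered steps the flowchart suggests for the given answers."""
    if goal == "explore":
        if known_groups:
            return ["PCA with Groups", "Examine Loadings"]
        return ["Hierarchical Clustering", "Cut Dendrogram"]

    if goal == "test":
        if n_groups < 2:
            raise ValueError("hypothesis testing needs at least two groups")
        test = "t-test or Mann-Whitney" if n_groups == 2 else "ANOVA or Kruskal-Wallis"
        return [test, "Calculate Effect Size"]

    if goal == "predict":
        if not labeled:
            # The flowchart routes unlabeled data back to exploratory PCA.
            return ["Unsupervised Learning", "PCA with Groups", "Examine Loadings"]
        model = "Random Forest + LOPOCV" if small_sample else "XGBoost + Hold-out Test"
        return [model, "SHAP Interpretation"]

    raise ValueError(f"unknown goal: {goal!r} (expected 'explore', 'test', or 'predict')")


print(select_method("test", n_groups=3))
print(select_method("predict", labeled=True, small_sample=False))
```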
    

Preprocessing Guidelines

Validation Best Practices

Critical Rules

  1. Never use test data for preprocessing

    • Fit preprocessing on training data only

    • Transform test data using training parameters

    • See: Avoiding Data Leakage

  2. Use patient-level splitting

    • If multiple spectra per patient, keep patient’s spectra together

    • Use GroupKFold with patient IDs

    • See: GroupKFold Guide

  3. Report all metrics

    • Accuracy, Precision, Recall, F1, ROC-AUC

    • Confusion matrix

    • Confidence intervals or standard deviations

    • See: Reporting Results

  4. Validate preprocessing choices

    • Test multiple preprocessing pipelines

    • Use cross-validation to select pipeline

    • Document final choices

    • See: Preprocessing Validation
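
Rules 1 and 2 can be combined in a few lines of scikit-learn. In the sketch below, wrapping the scaler and classifier in a pipeline means the scaler is refit on the training portion of every fold, and `GroupKFold` keeps each patient's spectra on one side of the split. The synthetic data, the choice of `LogisticRegression`, and the fold count are assumptions for illustration, not the application's defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 patients, 5 spectra each, 20 features (hypothetical sizes).
rng = np.random.default_rng(0)
n_patients, spectra_per_patient, n_features = 20, 5, 20
patient_ids = np.repeat(np.arange(n_patients), spectra_per_patient)
labels = np.repeat(np.arange(n_patients) % 2, spectra_per_patient)
X = rng.normal(size=(patient_ids.size, n_features))
X[:, :5] += 1.5 * labels[:, None]  # the first five features carry class signal

# Rule 1: preprocessing lives inside the pipeline, so it is fit on training folds only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Rule 2: split by patient so no patient appears in both train and test.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    pipeline, X, labels, groups=patient_ids, cv=cv, scoring="accuracy"
)
print(f"accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")

# Sanity check: train and test patients never overlap.
for train_idx, test_idx in cv.split(X, labels, groups=patient_ids):
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
```

Reporting the mean together with the spread across folds also covers the confidence-interval part of rule 3.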

Parameter Selection Guides

Each method page includes:

Parameter Tables

Example from PCA:

| Parameter | Type | Default | Range | Recommendation |
|---|---|---|---|---|
| n_components | int | 3 | 2-10 | Start with 2-3 for visualization, 5-10 for detailed analysis |
| scaling | str | StandardScaler | StandardScaler, MinMaxScaler, None | Always use StandardScaler for Raman data |
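
To show how these two parameters act, here is a minimal NumPy sketch of PCA via the singular value decomposition. The argument names follow the table, but the function `pca` itself is hypothetical and is not the application's implementation; the scaling options are reimplemented by hand rather than taken from scikit-learn.

```python
import numpy as np


def pca(X, n_components=3, scaling="StandardScaler"):
    """Return (scores, loadings, explained_variance_ratio) for the leading components."""
    X = np.asarray(X, dtype=float)
    if scaling == "StandardScaler":
        X = (X - X.mean(axis=0)) / X.std(axis=0)
    elif scaling == "MinMaxScaler":
        X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    elif scaling is not None:
        raise ValueError(f"unknown scaling: {scaling!r}")

    Xc = X - X.mean(axis=0)  # PCA requires mean-centered columns
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, loadings, explained[:n_components]


# Synthetic data: 60 samples, 8 features driven by one latent factor plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(60, 1))
direction = np.linspace(0.5, 2.0, 8)[None, :]
X = latent @ direction + 0.1 * rng.normal(size=(60, 8))

scores, loadings, explained = pca(X, n_components=3, scaling="StandardScaler")
print("explained variance ratio:", np.round(explained, 3))
```

The explained variance ratio is a practical guide for `n_components`: components that add little variance rarely help visualization.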

Visual Parameter Guides

Example effects of changing parameters, with figures showing:

  • Before/after comparisons

  • Parameter sweep results

  • Optimal parameter ranges highlighted

Decision Trees

For complex methods, decision trees guide parameter selection:

Is your data noisy?
├─ Yes → Use higher lambda in baseline correction
└─ No → Use default lambda

Are peaks sharp or broad?
├─ Sharp → Use smaller smoothing window (5-7)
└─ Broad → Use larger smoothing window (11-15)
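
The window-size guidance can be checked numerically. The sketch below uses a plain moving average, chosen here as a simple stand-in for the smoothing filters listed earlier, and compares how much peak height survives for a sharp and a broad synthetic peak. The peak widths and window sizes are assumptions for illustration.

```python
import numpy as np


def moving_average(y, window):
    """Smooth y with a centered moving average of odd length `window`."""
    if window % 2 == 0:
        raise ValueError("window must be odd so the filter is centered")
    kernel = np.ones(window) / window
    return np.convolve(y, kernel, mode="same")


# Synthetic unit-height Gaussian peaks: one sharp, one broad.
x = np.arange(200)
sharp = np.exp(-0.5 * ((x - 100) / 2.0) ** 2)
broad = np.exp(-0.5 * ((x - 100) / 15.0) ** 2)

for window in (5, 15):
    sharp_height = moving_average(sharp, window).max()
    broad_height = moving_average(broad, window).max()
    print(f"window={window:2d}  sharp peak: {sharp_height:.2f}  broad peak: {broad_height:.2f}")
```

The large window flattens the sharp peak far more than the small one, while the broad peak keeps most of its height with either window, which is the trade-off the decision tree encodes.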

Interpretation Guides

Every method includes:

Example Results

  • Annotated figures showing typical outputs

  • Good vs problematic results

  • Edge cases and how to handle them

What to Report

  • Required information for publications

  • Suggested figure panels

  • Statistical reporting guidelines

Common Misinterpretations

  • Frequent mistakes

  • How to avoid overinterpretation

  • Validity checks

Citations and References

Using This Documentation in Publications

If you use these methods, cite:

  1. This software:

    Rozain, M. H. (2025). Raman Spectroscopy Analysis Application.
    University of Toyama. https://github.com/zerozedsc/Raman-Spectroscopy-Analysis-Application
    
  2. Original method papers (provided for each method)

  3. Key libraries:

    • scikit-learn: Pedregosa et al., 2011

    • RamanSPy: Stevens et al., 2023

    • pybaselines: Erb et al., 2023

Bibliography

Complete bibliography with DOIs is provided in References.

Contributing Method Documentation

Want to add a new method or improve existing documentation?

  1. Follow the Contributing Guide

  2. Include all required sections (Purpose, Theory, Parameters, etc.)

  3. Provide at least one working example

  4. Cite primary literature

  5. Submit via pull request

See Contributing Guide for details.

Support

For method-specific questions:

For method requests or bug reports: