Analysis Guide

Complete guide to exploratory and statistical analysis of Raman spectroscopy data.

Overview

Analysis Workflow

After preprocessing, analysis helps you:

  1. Explore: Visualize data structure and patterns

  2. Compare: Test differences between groups

  3. Discover: Identify biomarkers and correlations

  4. Validate: Assess the statistical significance of findings

graph LR
    A[Preprocessed Data] --> B{Analysis Type}
    B -->|Unsupervised| C[Exploratory]
    B -->|Supervised| D[Statistical]
    C --> E[PCA/UMAP/tSNE]
    C --> F[Clustering]
    D --> G[Group Comparisons]
    D --> H[Correlation]
    E --> I[Interpretation]
    F --> I
    G --> I
    H --> I
    I --> J[Reporting]

Exploratory Analysis Page Interface

Analysis Page Layout

Figure: Analysis page showing dataset selector (left), method selection tabs (top-center), parameter panel (center), and results panel (right/bottom)

Note: The analysis page features:

  • Left Panel: Input data selection with grouped/ungrouped mode

  • Top-Center: Method category tabs (Exploratory, Statistical, Visualization)

  • Center: Method selection and parameter configuration

  • Right/Bottom: Results panel with interactive plots and summary statistics

Key Features:

  • Stop [⏹]: Cancel current analysis

  • Grouped Mode: Analyze by sample groups

  • Multiple Tabs: Results organized by analysis type

  • Interactive Plots: Zoom, pan, and export


Exploratory Analysis

Principal Component Analysis (PCA)

Purpose: Reduce dimensionality while preserving variance

When to use:

  • Visualize high-dimensional data in 2D/3D

  • Identify major sources of variation

  • Detect outliers

  • Reduce noise

  • Prepare data for ML

Running PCA

Steps:

  1. Select preprocessed datasets

  2. Choose grouped or ungrouped mode

  3. Navigate to Exploratory tab

  4. Select PCA

  5. Configure parameters:

    • N Components: Number of PCs (2-10, default: 3)

    • Scaling: StandardScaler (recommended)

  6. Click [Run Analysis]
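For readers who want to see what these steps compute, here is a minimal sketch of standardization followed by PCA, written with NumPy's SVD. It is an illustration under stated assumptions, not the application's own code: the function name `pca_scores` and the synthetic data are hypothetical, and the application may differ in details such as component sign.

```python
import numpy as np

def pca_scores(spectra, n_components=3):
    """Standardize each wavenumber column, then project onto the top PCs."""
    X = np.asarray(spectra, dtype=float)
    # StandardScaler-style scaling: zero mean, unit variance per wavenumber.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid dividing by zero
    Z = (X - mean) / std
    # SVD of the standardized matrix; rows of Vt are the loading vectors.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # explained variance ratio per PC
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], explained[:n_components]

rng = np.random.default_rng(0)
# Hypothetical data: 20 spectra x 50 wavenumbers, second group offset in one band.
spectra = rng.normal(size=(20, 50))
spectra[10:, 20:25] += 3.0
scores, loadings, explained = pca_scores(spectra, n_components=3)
print(scores.shape, loadings.shape)
print(np.round(explained * 100, 1))  # percent variance for PC1..PC3
```

The returned `scores` feed the scores plot, `loadings` the loadings plot, and `explained` the scree plot described below.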

Results

Three Views:

1. Scores Plot (2D or 3D)

PC2 (25.3%)
    ↑
    │    ○ ○ ○     ● ● ●
    │  ○ ○ ○ ○   ● ● ● ●
    │    ○ ○       ● ● ●
    └──────────────────────→ PC1 (60.0%)
    
    ○ Healthy Control
    ● Disease Group

Interpretation:

  • Separation: Good if groups don’t overlap

  • Clusters: Indicate distinct spectral patterns

  • Outliers: Points far from cluster centers

  • PC1 variance: The direction of greatest variance in the data, which is not always the direction that best separates the groups

2. Loadings Plot

Loading values vs Wavenumber

Shows which wavenumbers contribute most to each PC
- Large positive loading → Intensity at this wavenumber rises as the PC score increases
- Large negative loading → Intensity at this wavenumber falls as the PC score increases
- Near-zero → Little contribution to this PC

Use: Identify biomarker peaks
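One way to shortlist candidate peaks is to rank wavenumbers by the absolute value of their loading, since the sign only gives the direction of the contribution. The sketch below assumes the loadings for one PC are available as a plain list; the helper name `top_loading_peaks` and the numbers are hypothetical.

```python
def top_loading_peaks(wavenumbers, loadings, n_top=3):
    """Return the n_top (wavenumber, loading) pairs with the largest |loading|."""
    ranked = sorted(zip(wavenumbers, loadings),
                    key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:n_top]

# Hypothetical PC1 loadings at a handful of wavenumbers (cm^-1).
wavenumbers = [1003, 1250, 1445, 1655, 2930]
pc1_loadings = [0.05, -0.41, 0.30, 0.62, -0.02]
peaks = top_loading_peaks(wavenumbers, pc1_loadings, n_top=3)
print(peaks)  # [(1655, 0.62), (1250, -0.41), (1445, 0.3)]
```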

3. Scree Plot

Explained Variance (%)
    ↑
 60 │ █
    │ ▐
 40 │ ▐ █
    │ ▐ ▐
 20 │ ▐ ▐ ▌ ▌ ▌
    │ ▐ ▐ ▐ ▐ ▐ ▍▍
  0 └─────────────────→
    PC1 2  3  4  5  6 ...

Use: Determine how many PCs to keep

  • Elbow point: Where variance drops sharply

  • Typically keep enough PCs to explain 80-95% of the cumulative variance
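The 80-95% guideline can be applied programmatically by accumulating the per-PC explained variance until a target is reached. This is a small sketch with hypothetical variance values; `components_for_variance` is an illustrative helper, not an application function.

```python
from itertools import accumulate

def components_for_variance(explained_pct, target_pct=90.0):
    """Smallest number of leading PCs whose cumulative variance reaches target_pct."""
    for n, cumulative in enumerate(accumulate(explained_pct), start=1):
        if cumulative >= target_pct:
            return n
    return len(explained_pct)  # target not reached: keep everything

# Hypothetical per-PC explained variance (%), largest first.
explained_pct = [60.0, 25.3, 6.1, 3.2, 2.0, 1.4]
n_keep = components_for_variance(explained_pct, target_pct=90.0)
print(n_keep)  # 3
```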

Interpretation Tips

Good Separation:

  • Groups form distinct clusters

  • Minimal overlap

  • PC1 explains >50% variance

  • Clear biomarker peaks in loadings

Poor Separation:

  • Groups overlap completely

  • Random distribution

  • PC1 explains <30% variance

  • May need better preprocessing or more samples

Outliers:

  • Identify by visual inspection

  • Check original spectrum

  • Determine if:

    • Real biological variability

    • Technical artifact

    • Data quality issue

UMAP (Uniform Manifold Approximation and Projection)

Purpose: Non-linear dimensionality reduction

When to use:

  • PCA shows poor separation

  • Non-linear relationships suspected

  • Better cluster visualization

  • Exploratory analysis

Running UMAP

Parameters:

  • n_neighbors: Local neighborhood size (5-50, default: 15)

    • Lower → Fine structure

    • Higher → Global structure

  • min_dist: Minimum distance between points (0.0-1.0, default: 0.1)

    • Lower → Tight clusters

    • Higher → Looser clusters

  • n_components: Dimensions (2 or 3, default: 2)

Steps:

  1. Select UMAP from Exploratory methods

  2. Configure parameters

  3. Click [Run Analysis]

  4. View 2D/3D embedding

UMAP vs PCA

Aspect            PCA                   UMAP
Type              Linear                Non-linear
Speed             Fast                  Slower
Reproducibility   Deterministic         Stochastic
Interpretation    Loadings available    No loadings
Use Case          General exploration   Complex patterns

Recommendation: Start with PCA, use UMAP if PCA separation is poor

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Purpose: Visualize local structure and clusters

When to use:

  • Cluster visualization

  • Complex, non-linear relationships

  • Publication figures

Parameters:

  • perplexity: Balance local/global structure (5-50, default: 30)

  • n_iter: Iterations (250-5000, default: 1000)

  • learning_rate: Step size (10-1000, default: 200)

Warning:

  • Results vary between runs (stochastic)

  • Distances between clusters not meaningful

  • Cannot embed new data (no transform)

Clustering Analysis

Hierarchical Clustering

Purpose: Create dendrogram showing sample relationships

Steps:

  1. Select Hierarchical Clustering

  2. Parameters:

    • Linkage: Ward, Average, Complete (default: Ward)

    • Distance: Euclidean, Correlation (default: Euclidean; Ward linkage is defined for Euclidean distance)

  3. Click [Run Analysis]

Result: Dendrogram

Height
  ↑
  │         ┌─────────┐
  │    ┌────┤         ├────┐
  │ ┌──┤    │         │    ├──┐
  │ │  │    └─────────┘    │  │
  └─┴──┴──────────────────┴──┴───→
    Samples (colored by group)

Interpretation:

  • Tree branches: Similar samples cluster together

  • Height: Dissimilarity (higher = more different)

  • Cut height: Determines number of clusters

K-Means Clustering

Purpose: Partition data into k clusters

Steps:

  1. Select K-Means Clustering

  2. Parameters:

    • N Clusters: Number of clusters (2-20)

    • N Init: Random initializations (10-100, default: 10)

    • Max Iter: Maximum iterations (100-1000, default: 300)

  3. Use Elbow Method to determine optimal k

  4. Click [Run Analysis]

Elbow Plot:

Inertia
  ↑
  │ ●
  │  ╲
  │   ●
  │    ╲●
  │      ●─●─●─●
  └──────────────→
     2 3 4 5 6 7 8
     Number of Clusters

Optimal k: Elbow point (e.g., k=3 in above plot)
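To make the elbow method concrete, the sketch below computes the inertia (within-cluster sum of squared distances) for several values of k using a plain NumPy implementation of Lloyd's algorithm. It is a simplified illustration, not the application's clustering code: initialization is by random sample selection, the data are synthetic, and `n_init` is raised to 30 only to keep the toy result stable.

```python
import numpy as np

def kmeans_inertia(X, k, n_init=10, max_iter=300, seed=0):
    """Lowest within-cluster sum of squares found over n_init random starts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        # Initialize centers at k distinct, randomly chosen samples.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _step in range(max_iter):
            # Assign each sample to its nearest center (squared Euclidean distance).
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute centers; keep the old center if a cluster is empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

rng = np.random.default_rng(1)
# Hypothetical 2D scores: three well-separated groups of 30 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 0], [0, 5])])
inertias = {k: kmeans_inertia(X, k, n_init=30) for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(float(inertia), 1))
```

For data like this, the inertia falls steeply up to k=3 and then flattens, which is the elbow the plot above illustrates.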


Statistical Analysis

Pairwise Group Comparisons

t-Test (Parametric)

Purpose: Compare means of two groups

Assumptions:

  • Normal distribution

  • Equal variances (or use Welch’s t-test)

  • Independent samples

When to use:

  • Two groups

  • Normally distributed data

  • Sufficient sample size (n≥30 per group)

Steps:

  1. Select Statistical tab

  2. Choose t-Test

  3. Select two groups to compare

  4. Parameters:

    • Type: Student’s or Welch’s (default: Welch’s)

    • Alpha: Significance level (0.01, 0.05, default: 0.05)

  5. Click [Run Analysis]

Results:

  • p-values: Per wavenumber

  • Effect sizes: Cohen’s d

  • Volcano plot: -log10(p) vs effect size

  • Significant regions: p < 0.05 highlighted

Interpretation:

p < 0.001: Highly significant (***) 
p < 0.01:  Very significant (**)
p < 0.05:  Significant (*)
p ≥ 0.05:  Not significant (ns)
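As a rough sketch of what is computed at each wavenumber, the following calculates Welch's t statistic and the Welch-Satterthwaite degrees of freedom with the standard library. The intensity values are hypothetical, and the p-value lookup is omitted; in practice a routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    # Squared standard error of each group mean (sample variance / n).
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical intensities at one wavenumber for two groups.
healthy = [1.20, 1.15, 1.30, 1.25, 1.10]
disease = [0.95, 0.90, 1.05, 0.85, 1.00]
t, df = welch_t(healthy, disease)
print(round(t, 2), round(df, 1))  # t is about 5.0 with about 8 degrees of freedom
```

Repeating this for every wavenumber column yields the per-wavenumber results listed above, which is why multiple testing correction matters later in this guide.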

Mann-Whitney U Test (Non-Parametric)

Purpose: Non-parametric alternative to t-test

When to use:

  • Non-normal distribution

  • Small sample sizes

  • Ordinal data

  • Outliers present

Interpretation: Same as the t-test, but the test compares rank distributions rather than means
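The rank-based idea can be illustrated by computing the U statistic directly: count the pairs in which a value from one group exceeds a value from the other, with ties counting one half. This is a sketch on hypothetical data; the p-value is not computed here (a library routine such as `scipy.stats.mannwhitneyu` provides it).

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: each pair with a > b counts 1, each tie 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

healthy = [1.20, 1.15, 1.30, 1.25, 1.10]
disease = [0.95, 0.90, 1.05, 0.85, 1.00]
u_healthy = mann_whitney_u(healthy, disease)
u_disease = mann_whitney_u(disease, healthy)
print(u_healthy, u_disease)  # 25.0 0.0 (every healthy value exceeds every disease value)
```

The two U values always sum to the number of pairs (here 5 × 5 = 25), so a value near 0 or near the maximum indicates strong separation.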

Multi-Group Comparisons

ANOVA (Analysis of Variance)

Purpose: Compare means across multiple groups (≥3)

Assumptions:

  • Normal distribution in each group

  • Equal variances (homoscedasticity)

  • Independent samples

Steps:

  1. Select ANOVA

  2. Choose 3+ groups

  3. Parameters:

    • Alpha: 0.01, 0.05 (default: 0.05)

    • Post-hoc: Tukey HSD, Bonferroni (default: Tukey)

  4. Click [Run Analysis]

Results:

  • F-statistic per wavenumber

  • p-values

  • Post-hoc pairwise comparisons

  • Effect sizes (eta-squared)

Post-hoc Tests: Identify which groups differ

Tukey HSD results:
Healthy vs Disease A: p = 0.002 (**)
Healthy vs Disease B: p = 0.123 (ns)
Disease A vs Disease B: p = 0.045 (*)

Note: ANOVA is currently disabled in the application UI. For comparing two groups, use pairwise statistical tests instead.

Correlation Analysis

Purpose: Find relationships between wavenumbers

Pearson Correlation

Formula: Linear correlation coefficient

r = cov(X, Y) / (σ_X * σ_Y)
Range: -1 to +1

Interpretation:

  • r > 0.7: Strong positive correlation

  • 0.4 < r ≤ 0.7: Moderate positive correlation

  • r ≈ 0: No linear correlation

  • -0.7 ≤ r < -0.4: Moderate negative correlation

  • r < -0.7: Strong negative correlation

When to use: Linear relationships

Spearman Correlation

Non-parametric: Based on rank correlation

When to use:

  • Non-linear monotonic relationships

  • Outliers present

  • Ordinal data
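The difference between the two coefficients shows up clearly on a monotonic but non-linear relationship. The sketch below implements both from their definitions using only the standard library; the helper names and the data are illustrative and not taken from the application.

```python
from math import sqrt

def pearson(x, y):
    """Pearson r = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]  # monotonic but non-linear (y = x**3)
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))  # Pearson below 1, Spearman equal to 1
```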

Steps:

  1. Select Correlation Analysis

  2. Choose correlation type (Pearson or Spearman)

  3. Optional: Select wavenumber region

  4. Click [Run Analysis]

Results: Correlation matrix heatmap

        400  600  800  1000 1200 ...
  400  [1.0  0.8  0.3  0.1  -0.2]
  600  [0.8  1.0  0.5  0.2  -0.1]
  800  [0.3  0.5  1.0  0.7   0.3]
 1000  [0.1  0.2  0.7  1.0   0.5]
 1200  [-0.2 -0.1 0.3  0.5   1.0]
  ...
  
  Color scale: Red (positive) to Blue (negative)

Use: Identify correlated spectral regions

Band Ratio Analysis

Purpose: Calculate ratios of specific peaks

When to use:

  • Known biomarker ratios

  • Normalize one peak by another

  • Create simple interpretable features

Steps:

  1. Select Band Ratio Analysis

  2. Define Peak 1 range: [1000-1010 cm⁻¹]

  3. Define Peak 2 range: [1200-1210 cm⁻¹]

  4. Click [Calculate Ratio]

Results:

  • Ratio values per spectrum

  • Box plot by group

  • Statistical test of group differences

Example:

I₁₆₅₅/I₁₄₄₅ ratio (Amide I / CH₂)
- Healthy: 1.2 ± 0.1
- Disease: 0.9 ± 0.1
- p = 0.003 (**)
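A band ratio can be sketched as the mean intensity in one wavenumber window divided by the mean intensity in another. Whether the application uses the mean, the maximum, or the integrated area within each range is not stated in this guide, so the mean used below is an assumption, and the spectrum values are hypothetical.

```python
def band_mean(wavenumbers, intensities, low, high):
    """Mean intensity over wavenumbers in [low, high] (cm^-1)."""
    band = [y for x, y in zip(wavenumbers, intensities) if low <= x <= high]
    if not band:
        raise ValueError(f"no data points between {low} and {high} cm^-1")
    return sum(band) / len(band)

def band_ratio(wavenumbers, intensities, band1, band2):
    """Ratio of mean intensities: band1 / band2."""
    return (band_mean(wavenumbers, intensities, *band1)
            / band_mean(wavenumbers, intensities, *band2))

# Hypothetical spectrum sampled every 5 cm^-1 around the two bands.
wavenumbers = [1440, 1445, 1450, 1650, 1655, 1660]
intensities = [0.9, 1.0, 1.1, 1.1, 1.2, 1.3]
ratio = band_ratio(wavenumbers, intensities,
                   band1=(1650, 1660), band2=(1440, 1450))
print(round(ratio, 2))  # 1.2
```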

Visualization Methods

Interactive Heatmap

Purpose: Visualize all spectra as color-coded intensity map

Features:

  • Hierarchical clustering of samples (rows)

  • Dendrogram showing sample relationships

  • Group coloring

  • Zoom and pan

Use: Identify spectral patterns and outliers

Waterfall Plot

Purpose: 3D-style stacked spectra visualization

Features:

  • Offset spectra for visibility

  • Color by group

  • Interactive rotation (3D mode)

  • Export for publication

Use: Publication figures, presentation

Overlaid Spectra

Purpose: Plot multiple spectra on same axes

Features:

  • Mean ± standard deviation by group

  • Individual spectrum overlay (up to 100)

  • Group coloring

  • Legend management

Use: Visual comparison of groups

Peak Scatter Plot

Purpose: Plot peak intensity at two wavenumbers

Example:

Peak @ 1655 cm⁻¹
    ↑
    │    ○ ○
    │  ○ ○ ○ ○
    │        ● ● ●
    │      ● ● ● ●
    │        ● ●
    └──────────────→
      Peak @ 1445 cm⁻¹
      
  ○ Healthy
  ● Disease

Use: Visualize peak ratio separation

Correlation Matrix

Purpose: Heatmap of correlation coefficients

Features:

  • Hierarchical clustering of wavenumbers

  • Color scale (red = positive, blue = negative)

  • Interactive tooltips

  • Export as image

Use: Identify correlated spectral regions


Results Interpretation

Statistical Significance

Multiple Testing Correction:

When testing thousands of wavenumbers, apply correction:

Methods:

  1. Bonferroni: Most conservative

    • Adjusted α = 0.05 / n_tests

    • Example: 1000 tests → α = 0.00005

  2. FDR (False Discovery Rate): Recommended

    • Benjamini-Hochberg procedure

    • Controls the expected proportion of false positives among the results declared significant

    • Less conservative than Bonferroni

  3. Permutation tests: Data-driven

    • Randomly shuffle group labels

    • Re-compute test statistic

    • p-value = proportion of shuffles with a test statistic at least as extreme as the observed one
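The permutation procedure in item 3 can be sketched for a single wavenumber as follows, using the absolute difference in group means as the test statistic. The choice of statistic, the number of permutations, and the data are assumptions made for illustration.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

healthy = [1.20, 1.15, 1.30, 1.25, 1.10]
disease = [0.95, 0.90, 1.05, 0.85, 1.00]
p = permutation_p_value(healthy, disease)
print(p)
```

With only five samples per group there are just 252 distinct label assignments, so the smallest achievable p-value is limited by sample size regardless of how many shuffles are run.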

Application: Check “Apply FDR correction” in analysis settings
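For reference, the Bonferroni and Benjamini-Hochberg adjustments can be written in a few lines. This is a sketch of the standard procedures on hypothetical p-values, not the code behind the application's checkbox; libraries such as statsmodels provide tested implementations.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
bh = benjamini_hochberg(p_values)
bonf = bonferroni(p_values)
for p, q_bh, q_bonf in zip(p_values, bh, bonf):
    print(p, round(q_bh, 4), round(q_bonf, 4))
```

Comparing the adjusted values against α = 0.05, Bonferroni keeps two of the five tests significant here, as does Benjamini-Hochberg; with thousands of wavenumbers the gap between the two methods typically widens.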

Effect Size

Why it matters: Statistical significance ≠ practical importance

Cohen’s d (for t-tests):

d = (mean1 - mean2) / pooled_std

Interpretation:
|d| ≈ 0.2: Small effect
|d| ≈ 0.5: Medium effect
|d| ≥ 0.8: Large effect
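The formula above can be evaluated directly, taking `pooled_std` as the pooled sample standard deviation of the two groups. The sketch below uses hypothetical intensities; `cohens_d` is an illustrative helper name.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (((na - 1) * variance(group_a) + (nb - 1) * variance(group_b))
                  / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

healthy = [1.20, 1.15, 1.30, 1.25, 1.10]
disease = [0.95, 0.90, 1.05, 0.85, 1.00]
d = cohens_d(healthy, disease)
print(round(d, 2))  # about 3.16, a very large effect
```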

Eta-squared (η²) (for ANOVA):

η² = SS_between / SS_total

Interpretation:
η² ≈ 0.01: Small effect
η² ≈ 0.06: Medium effect
η² ≥ 0.14: Large effect
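As a worked illustration of the η² formula, the sketch below computes the between-group and total sums of squares for three small hypothetical groups.

```python
def eta_squared(groups):
    """eta^2 = SS_between / SS_total for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# Hypothetical intensities at one wavenumber for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
eta2 = eta_squared(groups)
print(round(eta2, 4))  # SS_between = 26, SS_total = 32, so eta^2 is 0.8125
```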

Recommendation: Report both p-value AND effect size

Biological Interpretation

Steps:

  1. Identify significant peaks

    • Use statistical tests

    • Apply multiple testing correction

    • Check effect sizes

  2. Assign peaks to molecular vibrations

    • Consult literature

    • Use reference databases

    • Check glossary for common peaks

  3. Interpret biological meaning

    • What molecules changed?

    • Why would they change in disease/condition?

    • Consistent with known biology?

  4. Validate findings

    • Independent test set

    • Literature comparison

    • Biochemical validation

Example:

Significant peak @ 1655 cm⁻¹ (Amide I)
- Assignment: C=O stretch in proteins
- Increased in disease group
- Biological interpretation: Protein conformational change
- Literature: Consistent with protein misfolding in this disease

Export and Reporting

Export Options

Plots:

  • PNG: Raster image (300 DPI for publication)

  • SVG: Vector graphics (editable in Illustrator, Inkscape)

Data tables (depends on the selected analysis output):

  • CSV

  • XLSX (Excel)

  • JSON

  • TXT (tab-delimited)

  • PKL (pickle)

Saved result folders:

  • “Export report” currently creates a folder containing:

    • plot.png (if available)

    • data.csv (if available)

    • report.txt

Creating Reports

Steps:

  1. Complete all analyses

  2. Click the export/report action in the results panel

  3. Choose an output folder

Report Structure:

1. Introduction
   - Dataset description
   - Preprocessing pipeline used
   
2. Methods
   - Analysis methods
   - Statistical tests
   - Parameters

3. Results
   - PCA scores plot
   - Statistical comparison results
   - Significant peaks table
   
4. Discussion
   - Interpretation
   - Biological relevance
   
5. Appendix
   - Full parameter settings
   - Additional figures

Publication-Ready Figures

Requirements:

  • Resolution: 300+ DPI

  • Format: PNG (raster) or SVG (vector)

  • Fonts: Embed or convert to paths

  • Size: Match journal requirements

  • Color: Check color-blind friendly palettes

Settings:

Figure → Export Settings
- DPI: 300
- Format: PNG
- Font Size: 12pt
- Line Width: 2pt
- Color Palette: Colorblind-safe
- Background: White (for print)

Troubleshooting

No Group Separation in PCA

Possible Causes:

  • Groups are truly not different

  • Insufficient preprocessing

  • Too much noise

  • Wrong groups selected

Solutions:

  • Try different preprocessing

  • Check data quality

  • Use UMAP or t-SNE

  • Verify group labels are correct

Statistical Tests Show No Significance

Possible Causes:

  • Small sample size (low power)

  • High within-group variability

  • Multiple testing correction too strict

  • Groups not actually different

Solutions:

  • Increase sample size

  • Improve preprocessing to reduce noise

  • Use less conservative correction (FDR instead of Bonferroni)

  • Check effect sizes (a large effect with a non-significant p-value points to low statistical power)

Analysis Takes Too Long

Causes:

  • Large dataset (>5000 spectra)

  • Complex method (UMAP, t-SNE)

  • Insufficient RAM

Solutions:

  • Use PCA instead of UMAP/t-SNE

  • Subsample data for exploration

  • Close other applications

  • Enable batch processing


Next: Machine Learning Guide