Analysis Guide

Complete guide to exploratory and statistical analysis of Raman spectroscopy data.

Table of Contents

Overview
Exploratory Analysis
Statistical Analysis
Visualization Methods
Results Interpretation
Export and Reporting

Overview

Analysis Workflow

After preprocessing, analysis helps you:

Explore: Visualize data structure and patterns
Compare: Test differences between groups
Discover: Identify biomarkers and correlations
Validate: Assess the statistical significance of findings

    graph LR
    A[Preprocessed Data] --> B{Analysis Type}
    B -->|Unsupervised| C[Exploratory]
    B -->|Supervised| D[Statistical]
    C --> E[PCA/UMAP/tSNE]
    C --> F[Clustering]
    D --> G[Group Comparisons]
    D --> H[Correlation]
    E --> I[Interpretation]
    F --> I
    G --> I
    H --> I
    I --> J[Reporting]

Exploratory Analysis Page Interface

Analysis Page Layout

Figure: Analysis page showing dataset selector (left), method selection tabs (top-center), parameter panel (center), and results panel (right/bottom)

Note: The analysis page features:

Left Panel: Input data selection with grouped/ungrouped mode

Top-Center: Method category tabs (Exploratory, Statistical, Visualization)

Center: Method selection and parameter configuration

Right/Bottom: Results panel with interactive plots and summary statistics

Key Features:

Stop [⏹]: Cancel current analysis
Grouped Mode: Analyze by sample groups
Multiple Tabs: Results organized by analysis type
Interactive Plots: Zoom, pan, and export

Exploratory Analysis

Principal Component Analysis (PCA)

Purpose: Reduce dimensionality while preserving variance

When to use:

Visualize high-dimensional data in 2D/3D
Identify major sources of variation
Detect outliers
Reduce noise
Prepare data for ML

Running PCA

Steps:

Select preprocessed datasets
Choose grouped or ungrouped mode
Navigate to Exploratory tab
Select PCA
Configure parameters:
- N Components: Number of PCs (2-10, default: 3)
- Scaling: StandardScaler (recommended)
Click [Run Analysis]
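
For readers who want to reproduce the idea outside the application, the following is a minimal sketch of PCA with per-wavenumber standardization, written with NumPy's SVD. It is not the application's implementation. The function name pca_scores and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_scores(spectra, n_components=3):
    """Standardize each wavenumber column, then project onto the top PCs."""
    X = np.asarray(spectra, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    Z = (X - mean) / std
    # SVD of the standardized matrix: rows of Vt are the loadings
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # fraction of variance per PC
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], explained[:n_components]

# Synthetic stand-in for preprocessed data: 20 spectra x 50 wavenumbers
rng = np.random.default_rng(0)
spectra = rng.normal(size=(20, 50))
scores, loadings, explained = pca_scores(spectra, n_components=3)
print(scores.shape, loadings.shape)
```

The scores array holds one row per spectrum (the points in a scores plot), and each row of loadings has one value per wavenumber (the curve in a loadings plot).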

Results

Three Views:

1. Scores Plot (2D or 3D)

PC2 (25.3%)
    ↑
    │    ○ ○ ○     ● ● ●
    │  ○ ○ ○ ○   ● ● ● ●
    │    ○ ○       ● ● ●
    └──────────────────────→ PC1 (60.0%)
    
    ○ Healthy Control
    ● Disease Group

Interpretation:

Separation: Groups that occupy non-overlapping regions are well separated
Clusters: Indicate distinct spectral patterns
Outliers: Points far from cluster centers
PC1 variance: PC1 is the direction of greatest variance in the data

2. Loadings Plot

Loading values vs Wavenumber

Shows which wavenumbers contribute most to each PC
- Large positive loading → Intensity at this wavenumber rises as the PC score increases
- Large negative loading → Intensity at this wavenumber falls as the PC score increases
- Near-zero → Contributes little to this PC

Use: Identify biomarker peaks

3. Scree Plot

Explained Variance (%)
    ↑
 60 │ █
    │ ▐
 40 │ ▐ █
    │ ▐ ▐
 20 │ ▐ ▐ ▌ ▌ ▌
    │ ▐ ▐ ▐ ▐ ▐ ▍▍
  0 └─────────────────→
    PC1 2  3  4  5  6 ...

Use: Determine how many PCs to keep

Elbow point: Where the explained variance per PC levels off after a sharp drop
Typically keep enough PCs to explain 80-95% of the cumulative variance
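
The cumulative-variance rule can be sketched in a few lines. The helper name n_components_for is hypothetical. The first two percentages are taken from the scores plot example, and the remaining values are made up so that the list sums to 100.

```python
def n_components_for(explained_pct, target_pct=90.0):
    """Smallest number of leading PCs whose cumulative variance reaches target_pct."""
    total = 0.0
    for count, pct in enumerate(explained_pct, start=1):
        total += pct
        if total >= target_pct:
            return count
    return len(explained_pct)  # target not reached: keep everything

# Explained variance per PC in percent (values after PC2 are illustrative)
explained_pct = [60.0, 25.3, 8.0, 3.0, 2.0, 1.0, 0.7]

print(n_components_for(explained_pct, 80.0))  # 2 (60.0 + 25.3 = 85.3)
print(n_components_for(explained_pct, 90.0))  # 3 (85.3 + 8.0 = 93.3)
```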

Interpretation Tips

Good Separation:

Groups form distinct clusters
Minimal overlap
PC1 explains >50% variance and the groups separate along it
Clear biomarker peaks in loadings

Poor Separation:

Groups overlap completely
Random distribution
PC1 explains <30% variance
May need better preprocessing or more samples

Outliers:

Identify by visual inspection
Check original spectrum
Determine if:
- Real biological variability
- Technical artifact
- Data quality issue

UMAP (Uniform Manifold Approximation and Projection)

Purpose: Non-linear dimensionality reduction

When to use:

PCA shows poor separation
Non-linear relationships suspected
Clearer cluster visualization is needed
Exploratory analysis

Running UMAP

Parameters:

n_neighbors: Local neighborhood size (5-50, default: 15)
- Lower → Fine structure
- Higher → Global structure
min_dist: Minimum distance between points (0.0-1.0, default: 0.1)
- Lower → Tight clusters
- Higher → Looser clusters
n_components: Dimensions (2 or 3, default: 2)

Steps:

Select UMAP from Exploratory methods
Configure parameters
Click [Run Analysis]
View 2D/3D embedding

UMAP vs PCA

Aspect	PCA	UMAP
Type	Linear	Non-linear
Speed	Fast	Slower
Reproducibility	Deterministic	Stochastic
Interpretation	Loadings available	No loadings
Use Case	General exploration	Complex patterns

Recommendation: Start with PCA, use UMAP if PCA separation is poor

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Purpose: Visualize local structure and clusters

When to use:

Cluster visualization
Complex, non-linear relationships
Publication figures

Parameters:

perplexity: Balance local/global structure (5-50, default: 30)
n_iter: Iterations (250-5000, default: 1000)
learning_rate: Step size (10-1000, default: 200)

Warning:

Results vary between runs (stochastic)
Distances between clusters not meaningful
Cannot embed new data (no transform)

Clustering Analysis

Hierarchical Clustering

Purpose: Create dendrogram showing sample relationships

Steps:

Select Hierarchical Clustering
Parameters:
- Linkage: Ward, Average, Complete (default: Ward)
- Distance: Euclidean, Correlation (default: Euclidean; Ward linkage assumes Euclidean distance)
Click [Run Analysis]

Result: Dendrogram

Height
  ↑
  │         ┌─────────┐
  │    ┌────┤         ├────┐
  │ ┌──┤    │         │    ├──┐
  │ │  │    └─────────┘    │  │
  └─┴──┴──────────────────┴──┴───→
    Samples (colored by group)

Interpretation:

Tree branches: Similar samples cluster together
Height: Dissimilarity (higher = more different)
Cut height: Determines number of clusters

K-Means Clustering

Purpose: Partition data into k clusters

Steps:

Select K-Means Clustering
Parameters:
- N Clusters: Number of clusters (2-20)
- N Init: Random initializations (10-100, default: 10)
- Max Iter: Maximum iterations (100-1000, default: 300)
Use Elbow Method to determine optimal k
Click [Run Analysis]

Elbow Plot:

Inertia
  ↑
  │ ●
  │  ╲
  │   ●
  │    ╲●
  │      ●─●─●─●
  └──────────────→
     2 3 4 5 6 7 8
     Number of Clusters

Optimal k: Elbow point (e.g., k=3 in above plot)
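
The inertia values behind an elbow plot can be computed with a small version of Lloyd's algorithm, as in the sketch below. It is written in plain NumPy for illustration and is not the application's clustering code. The function name kmeans_inertia and the three synthetic blobs are assumptions.

```python
import numpy as np

def kmeans_inertia(X, k, n_init=10, max_iter=300, seed=0):
    """Lowest within-cluster sum of squares over n_init random starts."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_init):
        # Start from k distinct data points chosen at random
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Move each center to the mean of its points (keep it if empty)
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

# Three synthetic blobs in 2D, standing in for PCA scores
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(float(value), 1))
```

With data like this the curve would be expected to flatten after k=3, although random initialization means individual runs can differ.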

Statistical Analysis

Pairwise Group Comparisons

t-Test (Parametric)

Purpose: Compare means of two groups

Assumptions:

Normal distribution
Equal variances (or use Welch’s t-test)
Independent samples

When to use:

Two groups
Normally distributed data
Sufficient sample size (n≥30 per group)

Steps:

Select Statistical tab
Choose t-Test
Select two groups to compare
Parameters:
- Type: Student’s or Welch’s (default: Welch’s)
- Alpha: Significance level (0.01, 0.05, default: 0.05)
Click [Run Analysis]

Results:

p-values: Per wavenumber
Effect sizes: Cohen’s d
Volcano plot: -log10(p) vs effect size
Significant regions: p < 0.05 highlighted
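
As an illustration of what a per-wavenumber Welch's t-test computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom for each column with NumPy. Converting these to p-values requires the t distribution and is omitted here. The function name welch_t and the intensity values are hypothetical.

```python
import numpy as np

def welch_t(group_a, group_b):
    """Welch's t statistic and degrees of freedom for each wavenumber column."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = a.shape[0], b.shape[0]
    # Squared standard error of each group mean, using sample variance (ddof=1)
    se2_a = a.var(axis=0, ddof=1) / n_a
    se2_b = b.var(axis=0, ddof=1) / n_b
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t, df

# Hypothetical intensities: rows are spectra, columns are two wavenumbers
healthy = np.array([[1.0, 5.0], [2.0, 5.1], [3.0, 4.9], [4.0, 5.0], [5.0, 5.2]])
disease = np.array([[2.0, 5.0], [4.0, 5.2], [6.0, 4.8], [8.0, 5.1], [10.0, 5.1]])
t_stat, dof = welch_t(healthy, disease)
print(t_stat, dof)
```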

Interpretation:

p < 0.001: Highly significant (***) 
p < 0.01:  Very significant (**)
p < 0.05:  Significant (*)
p ≥ 0.05:  Not significant (ns)
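
The star notation above maps directly onto a small helper. The function name significance_label is an assumption for illustration.

```python
def significance_label(p):
    """Map a p-value to the star notation used in this guide."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"

for p in (0.0005, 0.002, 0.045, 0.123):
    print(p, significance_label(p))
```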

Mann-Whitney U Test (Non-Parametric)

Purpose: Non-parametric alternative to t-test

When to use:

Non-normal distribution
Small sample sizes
Ordinal data
Outliers present

Interpretation: p-values are read the same way as for the t-test, but the test compares the rank distributions of the two groups rather than their means

Multi-Group Comparisons

ANOVA (Analysis of Variance)

Purpose: Compare means across multiple groups (≥3)

Assumptions:

Normal distribution in each group
Equal variances (homoscedasticity)
Independent samples

Steps:

Select ANOVA
Choose 3+ groups
Parameters:
- Alpha: 0.01, 0.05 (default: 0.05)
- Post-hoc: Tukey HSD, Bonferroni (default: Tukey)
Click [Run Analysis]

Results:

F-statistic per wavenumber
p-values
Post-hoc pairwise comparisons
Effect sizes (eta-squared)

Post-hoc Tests: Identify which groups differ

Tukey HSD results:
Healthy vs Disease A: p = 0.002 (**)
Healthy vs Disease B: p = 0.123 (ns)
Disease A vs Disease B: p = 0.045 (*)

Note: ANOVA is currently disabled in the application UI. Use the pairwise statistical tests instead, comparing two groups at a time.

Correlation Analysis

Purpose: Find relationships between wavenumbers

Pearson Correlation

Formula: Linear correlation coefficient

r = cov(X, Y) / (σ_X * σ_Y)
Range: -1 to +1

Interpretation:

r > 0.7: Strong positive correlation
0.4 < r ≤ 0.7: Moderate positive correlation
-0.4 ≤ r ≤ 0.4: Weak correlation (r ≈ 0: no linear correlation)
-0.7 ≤ r < -0.4: Moderate negative correlation
r < -0.7: Strong negative correlation

When to use: Linear relationships
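
The Pearson formula translates to code almost term by term. The sketch below uses the population covariance and population standard deviations so that numerator and denominator are consistent. The function name pearson_r and the sample intensities are illustrative.

```python
import numpy as np

def pearson_r(x, y):
    """r = cov(X, Y) / (sigma_X * sigma_Y), using population statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Hypothetical intensities at two wavenumbers across five spectra
band_a = [1.0, 2.0, 3.0, 4.0, 5.0]
band_b = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(band_a, band_b)
print(round(r, 4))
```

NumPy's np.corrcoef returns the same coefficient for a pair of variables and can be used to check the result.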

Spearman Correlation

Non-parametric: Based on rank correlation

When to use:

Non-linear monotonic relationships
Outliers present
Ordinal data
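
Spearman correlation is the Pearson correlation of the ranks, which is why it captures monotonic but non-linear relationships. The sketch below assumes there are no tied values, since ties need averaged ranks. The function names ranks and spearman_rho are hypothetical.

```python
import numpy as np

def ranks(values):
    """Ranks 1..n of the values; assumes no ties."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)  # monotonic in x, but far from linear

rho = spearman_rho(x, y)
pearson = np.corrcoef(x, y)[0, 1]
print(round(rho, 3), round(pearson, 3))
```

Here the rank correlation is 1 because y always increases with x, while the Pearson coefficient is lower because the relationship is not a straight line.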

Steps:

Select Correlation Analysis
Choose correlation type (Pearson or Spearman)
Optional: Select wavenumber region
Click [Run Analysis]

Results: Correlation matrix heatmap

        400  600  800  1000 1200 ...
  400  [1.0  0.8  0.3  0.1  -0.2]
  600  [0.8  1.0  0.5  0.2  -0.1]
  800  [0.3  0.5  1.0  0.7   0.3]
 1000  [0.1  0.2  0.7  1.0   0.5]
 1200  [-0.2 -0.1 0.3  0.5   1.0]
  ...
  
  Color scale: Red (positive) to Blue (negative)

Use: Identify correlated spectral regions

Band Ratio Analysis

Purpose: Calculate ratios of specific peaks

When to use:

Known biomarker ratios
Normalize one peak by another
Create simple interpretable features

Steps:

Select Band Ratio Analysis
Define Peak 1 range: [1000-1010 cm⁻¹]
Define Peak 2 range: [1200-1210 cm⁻¹]
Click [Calculate Ratio]

Results:

Ratio values per spectrum
Box plot by group
Statistical test of group differences

Example:

I₁₆₅₅/I₁₄₄₅ ratio (Amide I / CH₂)
- Healthy: 1.2 ± 0.1
- Disease: 0.9 ± 0.1
- p = 0.003 (**)
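
A band ratio can be sketched as the mean intensity in one wavenumber window divided by the mean intensity in another. Whether the application uses the mean, the maximum, or the integrated area within each range is not stated here, so the mean is an assumption, as are the function name band_ratio and the synthetic spectra.

```python
import numpy as np

def band_ratio(wavenumbers, spectra, band1, band2):
    """Ratio of mean intensities in two wavenumber windows, one per spectrum."""
    wn = np.asarray(wavenumbers, dtype=float)
    X = np.atleast_2d(np.asarray(spectra, dtype=float))
    mask1 = (wn >= band1[0]) & (wn <= band1[1])
    mask2 = (wn >= band2[0]) & (wn <= band2[1])
    if not mask1.any() or not mask2.any():
        raise ValueError("band contains no data points")
    return X[:, mask1].mean(axis=1) / X[:, mask2].mean(axis=1)

# Two flat synthetic spectra; the first has raised bands near 1655 and 1445
wn = np.arange(400.0, 1801.0, 1.0)
spectra = np.ones((2, wn.size))
spectra[0, (wn >= 1650) & (wn <= 1660)] = 2.4
spectra[0, (wn >= 1440) & (wn <= 1450)] = 2.0

ratios = band_ratio(wn, spectra, (1650, 1660), (1440, 1450))
print(ratios)
```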

Visualization Methods

Interactive Heatmap

Purpose: Visualize all spectra as color-coded intensity map

Features:

Hierarchical clustering of samples (rows)
Dendrogram showing sample relationships
Group coloring
Zoom and pan

Use: Identify spectral patterns and outliers

Waterfall Plot

Purpose: 3D-style stacked spectra visualization

Features:

Offset spectra for visibility
Color by group
Interactive rotation (3D mode)
Export for publication

Use: Publication figures, presentation

Overlaid Spectra

Purpose: Plot multiple spectra on same axes

Features:

Mean ± standard deviation by group
Individual spectrum overlay (up to 100)
Group coloring
Legend management

Use: Visual comparison of groups

Peak Scatter Plot

Purpose: Plot peak intensity at two wavenumbers

Example:

Peak @ 1655 cm⁻¹
    ↑
    │    ○ ○
    │  ○ ○ ○ ○
    │        ● ● ●
    │      ● ● ● ●
    │        ● ●
    └──────────────→
      Peak @ 1445 cm⁻¹
      
  ○ Healthy
  ● Disease

Use: Visualize peak ratio separation

Correlation Matrix

Purpose: Heatmap of correlation coefficients

Features:

Hierarchical clustering of wavenumbers
Color scale (red = positive, blue = negative)
Interactive tooltips
Export as image

Use: Identify correlated spectral regions

Results Interpretation

Statistical Significance

Multiple Testing Correction:

When testing thousands of wavenumbers, apply correction:

Methods:

Bonferroni: Most conservative
- Adjusted α = 0.05 / n_tests
- Example: 1000 tests → α = 0.00005
FDR (False Discovery Rate): Recommended
- Benjamini-Hochberg procedure
- Controls the expected proportion of false positives among the results declared significant
- Less conservative than Bonferroni
Permutation tests: Data-driven
- Randomly shuffle group labels
- Re-compute test statistic
- p-value = proportion of shuffles with a statistic at least as extreme as the observed one
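
The permutation procedure can be sketched with the standard library alone. The test statistic here is the absolute difference in group means at a single wavenumber, and one is added to numerator and denominator so the p-value is never exactly zero. Both choices are common conventions rather than details confirmed for the application, and the function name and data are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Hypothetical peak intensities for two groups
disease = [10, 11, 12, 13, 14]
healthy = [1, 2, 3, 4, 5]
p_separated = permutation_p_value(disease, healthy)
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
print(p_separated, p_identical)
```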

Application: Check “Apply FDR correction” in analysis settings
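
To make the difference between the two corrections concrete, here is a minimal sketch of the Bonferroni threshold and the Benjamini-Hochberg step-up procedure using only the standard library. The function names and the p-values are made up for illustration, and this is not the application's own code.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values, alpha=0.05):
    """Return True for each hypothesis rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

threshold = bonferroni_alpha(0.05, len(p_values))
bonferroni_hits = sum(p <= threshold for p in p_values)
fdr_hits = sum(benjamini_hochberg(p_values))
print(bonferroni_hits, fdr_hits)  # 1 2
```

With these ten p-values Bonferroni keeps only the smallest one, while Benjamini-Hochberg keeps the two smallest, which illustrates why FDR is described as less conservative.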

Effect Size

Why it matters: Statistical significance does not imply practical importance

Cohen’s d (for t-tests):

d = (mean1 - mean2) / pooled_std

Interpretation:
|d| < 0.2: Negligible effect
0.2 ≤ |d| < 0.5: Small effect
0.5 ≤ |d| < 0.8: Medium effect
|d| ≥ 0.8: Large effect
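
The Cohen's d formula can be written out with the standard library, using the pooled standard deviation built from the two sample variances. The function name cohens_d and the example values are illustrative.

```python
import math
import statistics

def cohens_d(group1, group2):
    """d = (mean1 - mean2) / pooled_std, with sample variances (n - 1)."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)
    var2 = statistics.variance(group2)
    pooled_std = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_std

# Hypothetical peak intensities for two groups
group1 = [2, 4, 6, 8, 10]  # mean 6, sample variance 10
group2 = [1, 3, 5, 7, 9]   # mean 5, sample variance 10
d = cohens_d(group1, group2)
print(round(d, 4))  # 0.3162
```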

Eta-squared (η²) (for ANOVA):

η² = SS_between / SS_total

Interpretation:
η² < 0.01: Negligible effect
0.01 ≤ η² < 0.06: Small effect
0.06 ≤ η² < 0.14: Medium effect
η² ≥ 0.14: Large effect
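
Eta-squared follows directly from the two sums of squares in the formula. The sketch below computes it for a single wavenumber. The function name eta_squared and the three groups are made up for illustration.

```python
def eta_squared(groups):
    """eta^2 = SS_between / SS_total for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

# Three hypothetical groups: SS_between = 54, SS_total = 60
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(eta_squared(groups))  # 0.9
```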

Recommendation: Report both p-value AND effect size

Biological Interpretation

Steps:

Identify significant peaks
- Use statistical tests
- Apply multiple testing correction
- Check effect sizes
Assign peaks to molecular vibrations
- Consult literature
- Use reference databases
- Check glossary for common peaks
Interpret biological meaning
- What molecules changed?
- Why would they change in disease/condition?
- Consistent with known biology?
Validate findings
- Independent test set
- Literature comparison
- Biochemical validation

Example:

Significant peak @ 1655 cm⁻¹ (Amide I)
- Assignment: C=O stretch in proteins
- Increased in disease group
- Biological interpretation: Protein conformational change
- Literature: Consistent with protein misfolding in this disease

Export and Reporting

Export Options

Plots:

PNG: Raster image (300 DPI for publication)
SVG: Vector graphics (editable in Illustrator, Inkscape)

Data tables (depends on the selected analysis output):

CSV
XLSX (Excel)
JSON
TXT (tab-delimited)
PKL (pickle)

Saved result folders:

“Export report” currently creates a folder containing:
- plot.png (if available)
- data.csv (if available)
- report.txt

Creating Reports

Steps:

Complete all analyses
Click the export/report action in the results panel
Choose an output folder

Report Structure:

1. Introduction
   - Dataset description
   - Preprocessing pipeline used
   
2. Methods
   - Analysis methods
   - Statistical tests
   - Parameters

3. Results
   - PCA scores plot
   - Statistical comparison results
   - Significant peaks table
   
4. Discussion
   - Interpretation
   - Biological relevance
   
5. Appendix
   - Full parameter settings
   - Additional figures

Publication-Ready Figures

Requirements:

Resolution: 300+ DPI
Format: PNG (raster) or SVG (vector)
Fonts: Embed or convert to paths
Size: Match journal requirements
Color: Check color-blind friendly palettes

Settings:

Figure → Export Settings
- DPI: 300
- Format: PNG
- Font Size: 12pt
- Line Width: 2pt
- Color Palette: Colorblind-safe
- Background: White (for print)

Troubleshooting

No Group Separation in PCA

Possible Causes:

Groups are truly not different
Insufficient preprocessing
Too much noise
Wrong groups selected

Solutions:

Try different preprocessing
Check data quality
Use UMAP or t-SNE
Verify group labels are correct

Statistical Tests Show No Significance

Possible Causes:

Small sample size (low power)
High within-group variability
Multiple testing correction too strict
Groups not actually different

Solutions:

Increase sample size
Improve preprocessing to reduce noise
Use less conservative correction (FDR instead of Bonferroni)
Check effect sizes (a real but small effect may go undetected at the current sample size)

Analysis Takes Too Long

Causes:

Large dataset (>5000 spectra)
Complex method (UMAP, t-SNE)
Insufficient RAM

Solutions:

Use PCA instead of UMAP/t-SNE
Subsample data for exploration
Close other applications
Enable batch processing
