Analysis Guide
Complete guide to exploratory and statistical analysis of Raman spectroscopy data.
Overview
Analysis Workflow
After preprocessing, analysis helps you:
Explore: Visualize data structure and patterns
Compare: Test differences between groups
Discover: Identify biomarkers and correlations
Validate: Statistical significance of findings
graph LR
A[Preprocessed Data] --> B{Analysis Type}
B -->|Unsupervised| C[Exploratory]
B -->|Supervised| D[Statistical]
C --> E[PCA/UMAP/tSNE]
C --> F[Clustering]
D --> G[Group Comparisons]
D --> H[Correlation]
E --> I[Interpretation]
F --> I
G --> I
H --> I
I --> J[Reporting]
Exploratory Analysis Page Interface

Figure: Analysis page showing dataset selector (left), method selection tabs (top-center), parameter panel (center), and results panel (right/bottom)
Note: The analysis page features:
Left Panel: Input data selection with grouped/ungrouped mode
Top-Center: Method category tabs (Exploratory, Statistical, Visualization)
Center: Method selection and parameter configuration
Right/Bottom: Results panel with interactive plots and summary statistics
Key Features:
Stop [⏹]: Cancel current analysis
Grouped Mode: Analyze by sample groups
Multiple Tabs: Results organized by analysis type
Interactive Plots: Zoom, pan, and export
Exploratory Analysis
Principal Component Analysis (PCA)
Purpose: Reduce dimensionality while preserving variance
When to use:
Visualize high-dimensional data in 2D/3D
Identify major sources of variation
Detect outliers
Reduce noise
Prepare data for ML
Running PCA
Steps:
Select preprocessed datasets
Choose grouped or ungrouped mode
Navigate to Exploratory tab
Select PCA
Configure parameters:
N Components: Number of PCs (2-10, default: 3)
Scaling: StandardScaler (recommended)
Click [Run Analysis]
Results
Three Views:
1. Scores Plot (2D or 3D)
PC2 (25.3%)
↑
│ ○ ○ ○ ● ● ●
│ ○ ○ ○ ○ ● ● ● ●
│ ○ ○ ● ● ●
└──────────────────────→ PC1 (60.0%)
○ Healthy Control
● Disease Group
Interpretation:
Separation: Good if groups don’t overlap
Clusters: Indicates distinct spectral patterns
Outliers: Points far from cluster centers
PC1 variance: Most important direction
2. Loadings Plot
Loading values vs Wavenumber
Shows which wavenumbers contribute most to each PC
- Large positive or negative loading → Wavenumber contributes strongly to this PC
- Sign of loading → Direction of the contribution (wavenumbers with opposite signs vary inversely along this PC)
- Near-zero → Not important for this PC
Use: Identify biomarker peaks
3. Scree Plot
Scree plot (sketch): explained variance (%) on the vertical axis against component number (PC1, 2, 3, ...) on the horizontal axis; the bars decrease from about 60% for PC1 toward zero for later components.
Use: Determine how many PCs to keep
Elbow point: Where variance drops sharply
Typically keep PCs explaining 80-95% variance
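As a rough sketch of the cumulative-variance rule above, the helper below counts how many leading PCs are needed to reach a target percentage. The name `components_to_keep` is hypothetical, and the percentages after PC2 are made-up illustration values, not output from the application.

```python
def components_to_keep(explained_pct, threshold_pct=90.0):
    """Return the number of leading PCs whose cumulative explained
    variance (in percent) first reaches threshold_pct."""
    total = 0.0
    for k, pct in enumerate(explained_pct, start=1):
        total += pct
        if total >= threshold_pct:
            return k
    # Threshold never reached: keep every available component.
    return len(explained_pct)


# PC1 and PC2 match the scores plot example; the rest are assumed values.
explained = [60.0, 25.3, 6.0, 3.0, 2.0]
print(components_to_keep(explained, 80.0))  # 2
print(components_to_keep(explained, 90.0))  # 3
```

A threshold is only one heuristic; the elbow point on the scree plot may suggest a different cut.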
Interpretation Tips
Good Separation:
Groups form distinct clusters
Minimal overlap
PC1 explains >50% variance
Clear biomarker peaks in loadings
Poor Separation:
Groups overlap completely
Random distribution
PC1 explains <30% variance
May need better preprocessing or more samples
Outliers:
Identify by visual inspection
Check original spectrum
Determine if:
Real biological variability
Technical artifact
Data quality issue
UMAP (Uniform Manifold Approximation and Projection)
Purpose: Non-linear dimensionality reduction
When to use:
PCA shows poor separation
Non-linear relationships suspected
Better cluster visualization
Exploratory analysis
Running UMAP
Parameters:
n_neighbors: Local neighborhood size (5-50, default: 15)
Lower → Fine structure
Higher → Global structure
min_dist: Minimum distance between points (0.0-1.0, default: 0.1)
Lower → Tight clusters
Higher → Looser clusters
n_components: Dimensions (2 or 3, default: 2)
Steps:
Select UMAP from Exploratory methods
Configure parameters
Click [Run Analysis]
View 2D/3D embedding
UMAP vs PCA
| Aspect | PCA | UMAP |
|---|---|---|
| Type | Linear | Non-linear |
| Speed | Fast | Slower |
| Reproducibility | Deterministic | Stochastic |
| Interpretation | Loadings available | No loadings |
| Use Case | General exploration | Complex patterns |
Recommendation: Start with PCA, use UMAP if PCA separation is poor
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Purpose: Visualize local structure and clusters
When to use:
Cluster visualization
Complex, non-linear relationships
Publication figures
Parameters:
perplexity: Balance local/global structure (5-50, default: 30)
n_iter: Iterations (250-5000, default: 1000)
learning_rate: Step size (10-1000, default: 200)
Warning:
Results vary between runs (stochastic)
Distances between clusters not meaningful
Cannot embed new data (no transform)
Clustering Analysis
Hierarchical Clustering
Purpose: Create dendrogram showing sample relationships
Steps:
Select Hierarchical Clustering
Parameters:
Linkage: Ward, Average, Complete (default: Ward)
Distance: Euclidean, Correlation (default: Euclidean)
Click [Run Analysis]
Result: Dendrogram
Height
↑
│ ┌─────────┐
│ ┌────┤ ├────┐
│ ┌──┤ │ │ ├──┐
│ │ │ └─────────┘ │ │
└─┴──┴──────────────────┴──┴───→
Samples (colored by group)
Interpretation:
Tree branches: Similar samples cluster together
Height: Dissimilarity (higher = more different)
Cut height: Determines number of clusters
K-Means Clustering
Purpose: Partition data into k clusters
Steps:
Select K-Means Clustering
Parameters:
N Clusters: Number of clusters (2-20)
N Init: Random initializations (10-100, default: 10)
Max Iter: Maximum iterations (100-1000, default: 300)
Use Elbow Method to determine optimal k
Click [Run Analysis]
Elbow Plot:
Inertia
↑
│ ●
│ ╲
│ ●
│ ╲●
│ ●─●─●─●
└──────────────→
2 3 4 5 6 7 8
Number of Clusters
Optimal k: Elbow point (e.g., k=3 in above plot)
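The elbow method can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: `kmeans_inertia` is a hypothetical helper, and it uses a deterministic farthest-point initialisation instead of the random restarts controlled by N Init, so that the example is reproducible.

```python
def _dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_inertia(points, k, max_iter=300):
    """Run Lloyd's algorithm and return the inertia
    (sum of squared distances to the nearest center)."""
    # Assumption: farthest-point initialisation, chosen for determinism.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(_dist2(p, c) for c in centers))
        )
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda j: _dist2(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(_dist2(p, c) for c in centers) for p in points)


# Toy 2-D data with three well-separated groups (illustrative only).
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (0, 10), (0, 11), (1, 10)]
for k in range(1, 6):
    print(k, round(kmeans_inertia(points, k), 2))
```

On this toy data the inertia drops steeply up to k=3 and changes little afterwards, which is the elbow pattern described above.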
Statistical Analysis
Pairwise Group Comparisons
t-Test (Parametric)
Purpose: Compare means of two groups
Assumptions:
Normal distribution
Equal variances (or use Welch’s t-test)
Independent samples
When to use:
Two groups
Normally distributed data
Sufficient sample size (n≥30 per group)
Steps:
Select Statistical tab
Choose t-Test
Select two groups to compare
Parameters:
Type: Student’s or Welch’s (default: Welch’s)
Alpha: Significance level (0.01, 0.05, default: 0.05)
Click [Run Analysis]
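The statistic behind the default Welch’s option can be sketched in plain Python. This is a minimal illustration, not the application's code: `welch_t` is a hypothetical helper that returns only the t statistic and the Welch-Satterthwaite degrees of freedom. Turning these into a p-value requires a t-distribution CDF from a statistics library, which is left out here.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    for two independent samples with possibly unequal variances."""
    va = variance(a) / len(a)  # squared standard error, group a
    vb = variance(b) / len(b)  # squared standard error, group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Rows are spectra, columns are wavenumbers (made-up intensities).
healthy = [[1.0, 2.0], [1.2, 4.0], [0.8, 6.0]]
disease = [[1.1, 1.0], [0.9, 3.0], [1.0, 5.0]]

# One test per wavenumber: zip(*rows) iterates over columns.
per_wavenumber = [
    welch_t(h, d) for h, d in zip(zip(*healthy), zip(*disease))
]
print(per_wavenumber)
```

Because one test is run per wavenumber, the resulting p-values should be corrected for multiple testing before regions are reported as significant.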
Results:
p-values: Per wavenumber
Effect sizes: Cohen’s d
Volcano plot: -log10(p) vs effect size
Significant regions: p < 0.05 highlighted
Interpretation:
p < 0.001: Highly significant (***)
p < 0.01: Very significant (**)
p < 0.05: Significant (*)
p ≥ 0.05: Not significant (ns)
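The thresholds above map directly to a small lookup. The function name `significance_stars` is an assumption used here for illustration.

```python
def significance_stars(p):
    """Map a p-value to the star notation used in this guide."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


for p in (0.0005, 0.002, 0.045, 0.123):
    print(p, significance_stars(p))
```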
Mann-Whitney U Test (Non-Parametric)
Purpose: Non-parametric alternative to t-test
When to use:
Non-normal distribution
Small sample sizes
Ordinal data
Outliers present
Interpretation: Same as t-test but based on rank differences
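To show why the test is robust to outliers, the sketch below computes the U statistics by direct pair counting. `mann_whitney_u` is a hypothetical helper written for this guide; it returns the statistics only, and a p-value would need a statistics library or exact tables.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b): U_a counts pairs where a value from `a`
    exceeds a value from `b`, with ties counted as one half."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


# Only the ordering matters: replacing 7 with an extreme outlier (300)
# leaves the statistic unchanged.
print(mann_whitney_u([1, 2, 7], [4, 5, 6]))    # (3.0, 6.0)
print(mann_whitney_u([1, 2, 300], [4, 5, 6]))  # (3.0, 6.0)
```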
Multi-Group Comparisons
ANOVA (Analysis of Variance)
Purpose: Compare means across multiple groups (≥3)
Assumptions:
Normal distribution in each group
Equal variances (homoscedasticity)
Independent samples
Steps:
Select ANOVA
Choose 3+ groups
Parameters:
Alpha: 0.01, 0.05 (default: 0.05)
Post-hoc: Tukey HSD, Bonferroni (default: Tukey)
Click [Run Analysis]
Results:
F-statistic per wavenumber
p-values
Post-hoc pairwise comparisons
Effect sizes (eta-squared)
Post-hoc Tests: Identify which groups differ
Tukey HSD results:
Healthy vs Disease A: p = 0.002 (**)
Healthy vs Disease B: p = 0.123 (ns)
Disease A vs Disease B: p = 0.045 (*)
Note: ANOVA is currently disabled in the application UI. For comparing two groups, use pairwise statistical tests instead.
Correlation Analysis
Purpose: Find relationships between wavenumbers
Pearson Correlation
Formula: Linear correlation coefficient
r = cov(X, Y) / (σ_X * σ_Y)
Range: -1 to +1
Interpretation:
r > 0.7: Strong positive correlation
0.4 < r ≤ 0.7: Moderate positive correlation
r ≈ 0: No linear correlation
-0.7 ≤ r < -0.4: Moderate negative correlation
r < -0.7: Strong negative correlation
When to use: Linear relationships
Spearman Correlation
Non-parametric: Based on rank correlation
When to use:
Non-linear monotonic relationships
Outliers present
Ordinal data
Steps:
Select Correlation Analysis
Choose correlation type (Pearson or Spearman)
Optional: Select wavenumber region
Click [Run Analysis]
Results: Correlation matrix heatmap
400 600 800 1000 1200 ...
400 [1.0 0.8 0.3 0.1 -0.2]
600 [0.8 1.0 0.5 0.2 -0.1]
800 [0.3 0.5 1.0 0.7 0.3]
1000 [0.1 0.2 0.7 1.0 0.5]
1200 [-0.2 -0.1 0.3 0.5 1.0]
...
Color scale: Red (positive) to Blue (negative)
Use: Identify correlated spectral regions
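The difference between the two correlation types can be sketched with the standard library alone. The helpers `pearson`, `ranks`, and `spearman` are illustrative names, not the application's functions; Spearman is computed here as Pearson correlation of the ranks, with tied values given their average rank.

```python
import math


def pearson(x, y):
    """Pearson r = cov(X, Y) / (sigma_X * sigma_Y)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# A monotonic but non-linear relationship (y = x cubed).
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(round(pearson(x, y), 3))   # 0.943
print(round(spearman(x, y), 3))  # 1.0
```

Spearman reaches 1 because the ordering is perfectly preserved, while Pearson falls short because the relationship is not a straight line.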
Band Ratio Analysis
Purpose: Calculate ratios of specific peaks
When to use:
Known biomarker ratios
Normalize one peak by another
Create simple interpretable features
Steps:
Select Band Ratio Analysis
Define Peak 1 range: [1000-1010 cm⁻¹]
Define Peak 2 range: [1200-1210 cm⁻¹]
Click [Calculate Ratio]
Results:
Ratio values per spectrum
Box plot by group
Statistical test of group differences
Example:
I₁₆₅₅/I₁₄₄₅ ratio (Amide I / CH₂)
- Healthy: 1.2 ± 0.1
- Disease: 0.9 ± 0.1
- p = 0.003 (**)
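A band ratio can be sketched as below. One assumption is made loudly here: each band is summarised by the mean intensity inside its wavenumber range. The application may instead use the peak maximum or the integrated area. `band_mean` and `band_ratio` are hypothetical helper names, and the spectrum is made up.

```python
def band_mean(wavenumbers, intensities, low, high):
    """Mean intensity of the points whose wavenumber lies in [low, high]."""
    values = [i for w, i in zip(wavenumbers, intensities) if low <= w <= high]
    if not values:
        raise ValueError(f"no data points between {low} and {high} cm^-1")
    return sum(values) / len(values)


def band_ratio(wavenumbers, intensities, band1, band2):
    """Ratio of the mean intensity in band1 to that in band2."""
    return (band_mean(wavenumbers, intensities, *band1)
            / band_mean(wavenumbers, intensities, *band2))


# Illustrative spectrum sampled at a few wavenumbers (cm^-1).
wavenumbers = [1000, 1005, 1010, 1100, 1200, 1205, 1210]
intensities = [2.0, 4.0, 3.0, 1.0, 1.0, 2.0, 1.5]
print(band_ratio(wavenumbers, intensities, (1000, 1010), (1200, 1210)))  # 2.0
```

Computing this value for every spectrum gives the per-spectrum ratios that feed the box plot and the group test.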
Visualization Methods
Interactive Heatmap
Purpose: Visualize all spectra as color-coded intensity map
Features:
Hierarchical clustering of samples (rows)
Dendrogram showing sample relationships
Group coloring
Zoom and pan
Use: Identify spectral patterns and outliers
Waterfall Plot
Purpose: 3D-style stacked spectra visualization
Features:
Offset spectra for visibility
Color by group
Interactive rotation (3D mode)
Export for publication
Use: Publication figures, presentation
Overlaid Spectra
Purpose: Plot multiple spectra on same axes
Features:
Mean ± standard deviation by group
Individual spectrum overlay (up to 100)
Group coloring
Legend management
Use: Visual comparison of groups
Peak Scatter Plot
Purpose: Plot peak intensity at two wavenumbers
Example:
Peak @ 1655 cm⁻¹
↑
│ ○ ○
│ ○ ○ ○ ○
│ ● ● ●
│ ● ● ● ●
│ ● ●
└──────────────→
Peak @ 1445 cm⁻¹
○ Healthy
● Disease
Use: Visualize peak ratio separation
Correlation Matrix
Purpose: Heatmap of correlation coefficients
Features:
Hierarchical clustering of wavenumbers
Color scale (red = positive, blue = negative)
Interactive tooltips
Export as image
Use: Identify correlated spectral regions
Results Interpretation
Statistical Significance
Multiple Testing Correction:
When testing thousands of wavenumbers, apply correction:
Methods:
Bonferroni: Most conservative
Adjusted α = 0.05 / n_tests
Example: 1000 tests → α = 0.00005
FDR (False Discovery Rate): Recommended
Benjamini-Hochberg procedure
Controls the expected proportion of false positives among the results declared significant
Less conservative than Bonferroni
Permutation tests: Data-driven
Randomly shuffle group labels
Re-compute test statistic
p-value = proportion of shuffles whose test statistic is at least as extreme as the observed one
Application: Check “Apply FDR correction” in analysis settings
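The three correction approaches can be sketched in plain Python. These are minimal illustrations, not the code behind the application's FDR checkbox, and all function names are hypothetical. The permutation helper uses the absolute difference of group means as its test statistic and adds 1 to numerator and denominator so the p-value is never exactly zero; both choices are assumptions.

```python
import random
from statistics import mean


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


print(bonferroni_alpha(0.05, 1000))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4]))
```

A wavenumber is then reported as significant when its adjusted p-value is below the chosen alpha.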
Effect Size
Why it matters: Statistical significance ≠ practical importance
Cohen’s d (for t-tests):
d = (mean1 - mean2) / pooled_std
Interpretation:
|d| ≈ 0.2: Small effect
|d| ≈ 0.5: Medium effect
|d| ≥ 0.8: Large effect
Eta-squared (η²) (for ANOVA):
η² = SS_between / SS_total
Interpretation:
η² ≈ 0.01: Small effect
η² ≈ 0.06: Medium effect
η² ≥ 0.14: Large effect
Recommendation: Report both p-value AND effect size
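Both effect sizes follow directly from the formulas above. The sketch below is a plain-Python illustration with hypothetical function names; the pooled standard deviation uses the usual sample-size-weighted form, which is an assumption about what `pooled_std` means here.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d = (mean1 - mean2) / pooled_std."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def eta_squared(groups):
    """Eta-squared = SS_between / SS_total for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total


print(cohens_d([2, 4, 6], [1, 3, 5]))                    # 0.5
print(eta_squared([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))    # 0.9
```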
Biological Interpretation
Steps:
Identify significant peaks
Use statistical tests
Apply multiple testing correction
Check effect sizes
Assign peaks to molecular vibrations
Consult literature
Use reference databases
Check glossary for common peaks
Interpret biological meaning
What molecules changed?
Why would they change in disease/condition?
Consistent with known biology?
Validate findings
Independent test set
Literature comparison
Biochemical validation
Example:
Significant peak @ 1655 cm⁻¹ (Amide I)
- Assignment: C=O stretch in proteins
- Increased in disease group
- Biological interpretation: Protein conformational change
- Literature: Consistent with protein misfolding in this disease
Export and Reporting
Export Options
Plots:
PNG: Raster image (300 DPI for publication)
SVG: Vector graphics (editable in Illustrator, Inkscape)
Data tables (depends on the selected analysis output):
CSV
XLSX (Excel)
JSON
TXT (tab-delimited)
PKL (pickle)
Saved result folders:
“Export report” currently creates a folder containing:
plot.png (if available)
data.csv (if available)
report.txt
Creating Reports
Steps:
Complete all analyses
Click the export/report action in the results panel
Choose an output folder
Report Structure:
1. Introduction
- Dataset description
- Preprocessing pipeline used
2. Methods
- Analysis methods
- Statistical tests
- Parameters
3. Results
- PCA scores plot
- Statistical comparison results
- Significant peaks table
4. Discussion
- Interpretation
- Biological relevance
5. Appendix
- Full parameter settings
- Additional figures
Publication-Ready Figures
Requirements:
Resolution: 300+ DPI
Format: PNG (raster) or SVG (vector)
Fonts: Embed or convert to paths
Size: Match journal requirements
Color: Check color-blind friendly palettes
Settings:
Figure → Export Settings
- DPI: 300
- Format: PNG
- Font Size: 12pt
- Line Width: 2pt
- Color Palette: Colorblind-safe
- Background: White (for print)
Troubleshooting
No Group Separation in PCA
Possible Causes:
Groups are truly not different
Insufficient preprocessing
Too much noise
Wrong groups selected
Solutions:
Try different preprocessing
Check data quality
Use UMAP or t-SNE
Verify group labels are correct
Statistical Tests Show No Significance
Possible Causes:
Small sample size (low power)
High within-group variability
Multiple testing correction too strict
Groups not actually different
Solutions:
Increase sample size
Improve preprocessing to reduce noise
Use less conservative correction (FDR instead of Bonferroni)
Check effect sizes (a sizeable effect with a non-significant p-value points to low statistical power)
Analysis Takes Too Long
Causes:
Large dataset (>5000 spectra)
Complex method (UMAP, t-SNE)
Insufficient RAM
Solutions:
Use PCA instead of UMAP/t-SNE
Subsample data for exploration
Close other applications
Enable batch processing
See Also
Machine Learning Guide - Next step: Build ML models
Analysis Methods Reference - Detailed method documentation
Best Practices - Analysis recommendations
FAQ - Analysis - Common questions