Best Practices

Research best practices for Raman spectroscopy analysis: data quality, reproducibility, validation, and reporting standards.

Table of Contents

Data Quality
Preprocessing Strategy
Statistical Analysis
Machine Learning
Reproducibility
Publication and Reporting

Data Quality

Sample Preparation

Consistency is Critical:

Standard Operating Procedure (SOP):
Sample collection protocol
Storage conditions (-80°C for plasma)
Thawing procedure (room temp, 10 min)
Substrate preparation
Drying time
Environmental conditions (temp, humidity)

Documentation:

Record all parameters
Note any deviations
Track batch information
Include control samples

Quality Control Samples:

Per Batch:
- Positive control (known sample)
- Negative control (blank)
- Technical replicates (n=3 minimum)
- Standard reference material

Data Acquisition

Instrument Parameters:

Essential to Record:
- Laser wavelength (e.g., 785 nm)
- Laser power (e.g., 50 mW)
- Integration time (e.g., 10 s)
- Number of accumulations (e.g., 3)
- Objective magnification (e.g., 10×)
- Spectral range (e.g., 400-1800 cm⁻¹)
- Spectral resolution (e.g., 4 cm⁻¹)

Keep Constant:

Same spectrometer for entire study
Same acquisition parameters
Same environmental conditions
Same operator (if possible)

Calibration:

Frequency: Daily
Standards:
- Silicon peak at 520.7 cm⁻¹ (wavenumber)
- NIST standards (intensity)
- Polystyrene (multiple peaks)

Procedure:
1. Acquire calibration spectrum
2. Verify peak position
3. Apply correction if needed
4. Document calibration
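
The peak-position check in steps 2 and 3 can be sketched as follows. This is a minimal illustration, assuming the calibration spectrum is available as NumPy arrays; the function name `silicon_peak_shift`, the ±20 cm⁻¹ search window, and the 1 cm⁻¹ tolerance are hypothetical choices, not instrument specifications.

```python
import numpy as np

SILICON_REFERENCE_CM1 = 520.7  # reference position quoted in this guide


def silicon_peak_shift(wavenumbers, intensities, search_half_width=20.0):
    """Return (measured_peak, shift) for the silicon band near 520.7 cm^-1.

    The peak is the intensity maximum inside a window around the reference
    position, refined with a three-point parabolic fit.
    """
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    window = np.abs(wavenumbers - SILICON_REFERENCE_CM1) <= search_half_width
    idx = np.flatnonzero(window)
    k = idx[np.argmax(intensities[idx])]
    measured = wavenumbers[k]
    # Parabolic refinement (assumes a uniformly spaced axis near the peak)
    if 0 < k < len(wavenumbers) - 1:
        y0, y1, y2 = intensities[k - 1], intensities[k], intensities[k + 1]
        denom = y0 - 2.0 * y1 + y2
        if denom != 0:
            step = wavenumbers[k + 1] - wavenumbers[k]
            measured = wavenumbers[k] + 0.5 * (y0 - y2) / denom * step
    return measured, measured - SILICON_REFERENCE_CM1


# Synthetic calibration spectrum with the peak deliberately placed at 522.0
axis = np.arange(400.0, 700.0, 0.5)
spectrum = np.exp(-0.5 * ((axis - 522.0) / 2.0) ** 2)
peak, shift = silicon_peak_shift(axis, spectrum)
needs_correction = abs(shift) > 1.0  # hypothetical tolerance in cm^-1
```

If a correction is needed, record both the measured shift and the correction applied in the calibration log.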

Sample Size Determination

Minimum Recommendations:

Per Group:
- Exploratory analysis: n ≥ 30
- Statistical tests: n ≥ 50
- Machine learning: n ≥ 100
- Clinical validation: n ≥ 200

Power Analysis (for statistical tests):

# Example: t-test power analysis
Effect size (Cohen's d): 0.5 (moderate)
Significance level (α): 0.05
Power (1-β): 0.80

Required sample size: ~64 per group
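
The figure above can be reproduced approximately with the standard library. This sketch uses the normal approximation for a two-sided, two-sample t-test plus a small-sample correction term; the function name is hypothetical, and dedicated power-analysis software (which uses the noncentral t distribution) may differ by about one sample.

```python
import math
from statistics import NormalDist


def n_per_group_two_sample_t(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample t-test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    # Normal-approximation sample size
    n_normal = 2.0 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    # Correction for using the t distribution instead of the normal
    n_corrected = n_normal + z_alpha ** 2 / 4.0
    return math.ceil(n_corrected)


required_n = n_per_group_two_sample_t(effect_size=0.5, alpha=0.05, power=0.80)
```

Smaller expected effects raise the requirement quickly, because the sample size scales with 1/d².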

For Machine Learning:

Rule of Thumb:
- 10-20 samples per feature (wavenumber)
- With 1000 wavenumbers → n ≥ 10,000 ideal
- BUT: Use dimensionality reduction (PCA)
- Practical minimum: 50-100 per class

Reality Check:
- Small studies (n=50-200): Use simple models (logistic regression, random forest)
- Medium studies (n=200-1000): More complex models become viable (XGBoost, SVM)
- Large studies (n>1000): Deep learning possible

Data Organization

File Naming:

Good:
YYYY-MM-DD_batch_condition_replicate_ID.csv
2026-01-24_batch01_healthy_rep1_patient001.csv

Bad:
data.csv
p1.csv
20260124.csv
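
A naming convention is easier to enforce when it is checked automatically. The sketch below validates file names against the "Good" pattern above; the regular expression and the function name `parse_spectrum_filename` are illustrative assumptions about how the fields are formed (for example, that batches are written as `batchNN` and replicates as `repN`).

```python
import re
from datetime import datetime

# YYYY-MM-DD_batch_condition_replicate_ID.csv
FILENAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<batch>batch\d+)_"
    r"(?P<condition>[A-Za-z0-9]+)_"
    r"(?P<replicate>rep\d+)_"
    r"(?P<sample_id>[A-Za-z0-9]+)\.csv$"
)


def parse_spectrum_filename(filename):
    """Return the metadata fields encoded in a file name, or None if invalid."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        # Reject impossible dates such as 2026-13-45
        datetime.strptime(fields["date"], "%Y-%m-%d")
    except ValueError:
        return None
    return fields


good = parse_spectrum_filename("2026-01-24_batch01_healthy_rep1_patient001.csv")
bad = parse_spectrum_filename("data.csv")
```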

Folder Structure:

project_name/
├── raw_data/
│   ├── batch_01/
│   │   ├── healthy/
│   │   └── disease/
│   └── batch_02/
├── processed_data/
├── preprocessing_pipelines/
├── models/
├── results/
│   ├── figures/
│   └── tables/
├── documentation/
│   ├── sop.md
│   ├── notebook.md
│   └── metadata.xlsx
└── scripts/

Metadata Management:

sample_id,patient_id,group,batch,acquisition_date,laser_power,integration_time,notes
S001,P001,healthy,batch01,2026-01-15,50,10,good_quality
S002,P001,healthy,batch01,2026-01-15,50,10,technical_replicate
S003,P002,disease,batch01,2026-01-16,50,10,weak_signal

Preprocessing Strategy

Choosing Methods

Decision Tree:

graph TD
    A[Raw Spectra] --> B{Baseline?}
    B -->|Yes| C[AsLS/AirPLS]
    B -->|No| D{Noisy?}
    C --> D
    D -->|Yes| E[Savitzky-Golay]
    D -->|No| F{Intensity Varies?}
    E --> F
    F -->|Yes| G[Vector Norm]
    F -->|No| H[Done]
    G --> H

Conservative Approach (Recommended):

1. Baseline correction: AsLS (λ=1e5, p=0.01)
2. Light smoothing: Savitzky-Golay (window=11, order=3)
3. Normalization: Vector (L2 norm)

Why:
- Minimal information loss
- Widely accepted
- Reproducible
- Good starting point
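
The three conservative steps can be sketched with NumPy and SciPy. This is an illustrative implementation of the published AsLS scheme (Eilers and Boelens) followed by `scipy.signal.savgol_filter` and L2 normalization, not the application's own code; the function names and the fixed 10 iterations are assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.signal import savgol_filter
from scipy.sparse.linalg import spsolve


def asls_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Estimate a baseline by asymmetric least squares."""
    n = y.size
    # Second-order difference operator, shape (n - 2, n)
    diff = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (diff.T @ diff)
    weights = np.ones(n)
    for _ in range(n_iter):
        system = sparse.diags(weights) + penalty
        baseline = spsolve(system.tocsc(), weights * y)
        # Points above the baseline (peaks) receive the small weight p
        weights = np.where(y > baseline, p, 1.0 - p)
    return baseline


def preprocess(spectrum, lam=1e5, p=0.01, window=11, polyorder=3):
    """Baseline correction, light smoothing, then vector (L2) normalization."""
    y = np.asarray(spectrum, dtype=float)
    corrected = y - asls_baseline(y, lam=lam, p=p)
    smoothed = savgol_filter(corrected, window_length=window, polyorder=polyorder)
    norm = np.linalg.norm(smoothed)
    return smoothed / norm if norm > 0 else smoothed


# Synthetic spectrum: one peak on a sloping background with noise
axis = np.linspace(400.0, 1800.0, 700)
rng = np.random.default_rng(0)
raw = (
    np.exp(-0.5 * ((axis - 1000.0) / 8.0) ** 2)  # Raman-like peak
    + 1e-3 * (axis - 400.0)                      # sloping background
    + rng.normal(0.0, 0.01, axis.size)           # measurement noise
)
processed = preprocess(raw)
```

All three steps operate on one spectrum at a time, so they learn nothing from other samples and can be applied identically to training and test spectra.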

Avoid:

❌ Over-smoothing (window >21)
❌ Multiple normalization steps
❌ 2nd derivative without strong justification
❌ Deep learning methods on small datasets
❌ Undocumented manual corrections

Validation

Visual Inspection:

Check:
Overlay before/after preprocessing
Verify peaks not removed
Check baseline flatness
Compare multiple representative spectra
Inspect edge cases (weak signal, high noise)

Quantitative Metrics:

Calculate:
- SNR (Signal-to-Noise Ratio) improvement
- Peak height preservation
- Baseline level (should be near zero)
- Consistency across spectra
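
One possible way to quantify SNR improvement is shown below. The definition used here (peak height above the mean of a peak-free region, divided by that region's standard deviation) is one common convention and an assumption of this sketch; the moving average merely stands in for a real smoothing filter, and the wavenumber ranges are arbitrary.

```python
import numpy as np


def peak_snr(axis, spectrum, peak_range, noise_range):
    """Peak height above the noise-region mean, divided by the noise SD."""
    axis = np.asarray(axis, dtype=float)
    spectrum = np.asarray(spectrum, dtype=float)
    peak = spectrum[(axis >= peak_range[0]) & (axis <= peak_range[1])]
    noise = spectrum[(axis >= noise_range[0]) & (axis <= noise_range[1])]
    return (peak.max() - noise.mean()) / noise.std(ddof=1)


rng = np.random.default_rng(0)
axis = np.linspace(400.0, 1800.0, 1401)
raw = np.exp(-0.5 * ((axis - 1004.0) / 10.0) ** 2) + rng.normal(0.0, 0.05, axis.size)
smoothed = np.convolve(raw, np.ones(5) / 5.0, mode="same")  # stand-in smoother

snr_before = peak_snr(axis, raw, (980, 1030), (1500, 1700))
snr_after = peak_snr(axis, smoothed, (980, 1030), (1500, 1700))
snr_improvement_pct = 100.0 * (snr_after - snr_before) / snr_before
```

Report the peak and noise regions alongside the number, since the value depends strongly on both.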

Test on Subset:

Workflow:
Select 20% of the training data randomly (never from the held-out test set)
Try different preprocessing approaches
Evaluate on downstream analysis (PCA, classification)
Choose best performing pipeline
Apply to full dataset

Documentation

Pipeline File:

{
  "pipeline_name": "blood_plasma_standard_v1",
  "created_date": "2026-01-24",
  "methods": [
    {
      "step": 1,
      "method": "AsLS",
      "parameters": {"lambda": 100000, "p": 0.01},
      "reason": "Remove fluorescence background"
    },
    {
      "step": 2,
      "method": "Savitzky-Golay",
      "parameters": {"window": 11, "polyorder": 3, "deriv": 0},
      "reason": "Reduce noise while preserving peaks"
    },
    {
      "step": 3,
      "method": "Vector Normalization",
      "parameters": {},
      "reason": "Normalize intensity variations"
    }
  ],
  "validation": {
    "snr_improvement": "35%",
    "peak_preservation": "98%",
    "tested_on": "20% random subset"
  }
}

Lab Notebook Entry:

Date: 2026-01-24
Experiment: Preprocessing optimization

Tried:
- Pipeline A: Polynomial baseline → Poor performance
- Pipeline B: AsLS (λ=1e6) → Over-correction
- Pipeline C: AsLS (λ=1e5) → Good balance ✓

Selected Pipeline C:
- AsLS (λ=1e5, p=0.01)
- SavGol (w=11, order=3)
- Vector norm

Results:
- PCA separation improved
- Classification accuracy: 85% → 92%

Statistical Analysis

Test Selection

Flowchart:

graph TD
    A[Compare Groups] --> B{How many groups?}
    B -->|2| C{Normal distribution?}
    B -->|≥3| D[Use pairwise tests]
    C -->|Yes| E[t-test]
    C -->|No| F[Mann-Whitney U]
    D --> G[Mann-Whitney U for each pair]
    G --> H[Apply FDR correction]

Note: ANOVA and Kruskal-Wallis are not currently available in the UI. For 3+ groups, use pairwise tests with multiple comparison correction.

Check Assumptions:

Normality:
- Visual: Q-Q plot, histogram
- Statistical: Shapiro-Wilk test (p > 0.05 → no evidence against normality)

Equal Variance:
- Levene's test (p > 0.05 → no evidence of unequal variances)

Independence:
- Study design verification
- No repeated measures
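
The two-group branch of the flowchart, combined with these assumption checks, might look like the following with `scipy.stats`. The function name is hypothetical, and falling back to Welch's t-test when Levene's test rejects equal variances is an addition of this sketch rather than something the flowchart specifies.

```python
import numpy as np
from scipy import stats


def compare_two_groups(group_a, group_b, alpha=0.05):
    """Pick a two-group test from assumption checks; return (name, p_value)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Shapiro-Wilk on each group: small p-values indicate non-normality
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if normal:
        # Levene's test: small p-values indicate unequal variances
        equal_var = stats.levene(a, b).pvalue > alpha
        result = stats.ttest_ind(a, b, equal_var=equal_var)
        name = "Student t-test" if equal_var else "Welch t-test"
    else:
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        name = "Mann-Whitney U"
    return name, float(result.pvalue)


# Hypothetical peak intensities at one wavenumber for two groups
rng = np.random.default_rng(1)
healthy = rng.normal(1.00, 0.10, 40)
disease = rng.normal(1.08, 0.10, 40)
test_name, p_value = compare_two_groups(healthy, disease)
```

Independence cannot be tested this way; it has to be established from the study design.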

Multiple Testing Correction

Problem:

Testing 1000 wavenumbers at α=0.05
Expected false positives: 1000 × 0.05 = 50

Even with random data, expect ~50 "significant" results!

Solutions:

1. Bonferroni (most conservative):
   α_corrected = 0.05 / 1000 = 0.00005
   
2. FDR - Benjamini-Hochberg (recommended):
   Controls false discovery rate at 5%
   Less conservative than Bonferroni
   
3. Permutation tests:
   Data-driven correction
   More powerful for small samples

Implementation:

Always report whether correction was applied
Use FDR by default
Use Bonferroni for confirmatory studies
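
The Benjamini-Hochberg procedure is short enough to write out. This NumPy sketch returns adjusted p-values and a rejection mask; the function name is hypothetical, and the simulated p-values are uniform random numbers standing in for 1000 wavenumbers with no true effect.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return (reject_mask, adjusted_p) using the Benjamini-Hochberg step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted <= alpha, adjusted


# 1000 wavenumbers with no real group difference
rng = np.random.default_rng(0)
null_p = rng.uniform(0.0, 1.0, 1000)
n_uncorrected = int((null_p < 0.05).sum())
reject, adjusted_p = benjamini_hochberg(null_p, alpha=0.05)
n_fdr = int(reject.sum())
n_bonferroni = int((null_p < 0.05 / null_p.size).sum())
```

Comparing `n_uncorrected` with `n_fdr` on null data like this shows how many of the uncorrected hits survive correction.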

Effect Sizes

Always Report:

Example Bad:
"Peak at 1655 cm⁻¹ differs significantly between groups (p=0.003)"

Example Good:
"Peak at 1655 cm⁻¹ shows large effect (Cohen's d=0.85, 95% CI: 0.42-1.28)
with significant difference between groups (p=0.003, FDR-corrected)"

Guidelines:

Cohen's d:
- Small: |d| = 0.2
- Medium: |d| = 0.5
- Large: |d| = 0.8

Report:
- Point estimate
- Confidence interval (95%)
- Statistical significance
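
Cohen's d and an approximate confidence interval can be computed directly. The sketch below uses the pooled standard deviation and a common large-sample approximation for the standard error of d; the function name and the simulated intensities are hypothetical, and for small groups a noncentral-t or bootstrap interval would be more accurate.

```python
import math

import numpy as np


def cohens_d_with_ci(group_a, group_b, z=1.96):
    """Return (d, ci_low, ci_high) with an approximate 95% interval."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = a.size, b.size
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    d = (a.mean() - b.mean()) / math.sqrt(pooled_var)
    # Large-sample approximation to the standard error of d
    se = math.sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2.0 * (n_a + n_b)))
    return d, d - z * se, d + z * se


# Hypothetical normalized intensities at 1655 cm^-1
rng = np.random.default_rng(2)
disease = rng.normal(1.10, 0.12, 50)
healthy = rng.normal(1.00, 0.12, 50)
d, ci_low, ci_high = cohens_d_with_ci(disease, healthy)
```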

Machine Learning

Data Splitting

Critical Rules:

Split BEFORE any preprocessing or feature selection
Use patient-level splitting (not spectrum-level)
Never touch test set until final evaluation
Document random seed for reproducibility

Recommended Split:

Total Data
├── Training (70%)
│   ├── Used for: Model training, hyperparameter tuning, CV
│   └── Preprocessing fit on training only
├── Validation (15%) [Optional]
│   └── Used for: Model selection, early stopping
└── Test (15%, or 30% if no validation set is used)
    └── Used for: Final evaluation ONCE

Patient-Level Splitting:

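import numpy as np
from sklearn.model_selection import train_test_split

# Good: Patient-level
# (X, y, patient_ids are assumed to be aligned NumPy arrays, one row per spectrum)
patients, first_idx = np.unique(patient_ids, return_index=True)
patient_labels = y[first_idx]  # one label per patient
train_patients, test_patients = train_test_split(
    patients, test_size=0.3, stratify=patient_labels, random_state=42
)
train_mask = np.isin(patient_ids, train_patients)
X_train, X_test = X[train_mask], X[~train_mask]
y_train, y_test = y[train_mask], y[~train_mask]

# Bad: Spectrum-level (data leakage!)
train_spectra, test_spectra = train_test_split(
    spectra, test_size=0.3
)  # Patient spectra in both train and test!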

Cross-Validation Strategy

GroupKFold (Required):

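from sklearn.model_selection import GroupKFold

# Ensure all spectra from one patient stay together
# (X, y, patient_ids are assumed to be aligned NumPy arrays)
cv = GroupKFold(n_splits=5)
for train_idx, val_idx in cv.split(X, y, groups=patient_ids):
    # Train on train_idx, validate on val_idx
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # No patient appears in both
    assert not set(patient_ids[train_idx]) & set(patient_ids[val_idx])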

Stratified K-Fold (If no patient grouping):

from sklearn.model_selection import StratifiedKFold

# Maintains class balance in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Feature Selection

Timing Matters:

Wrong:
Select features on full dataset
Split data
Train model
→ Data leakage! Features selected using test set info

Correct:
Split data
Select features on training set only
Apply same selection to test set
Train model

Methods:

1. Variance threshold:
   - Remove low-variance wavenumbers
   - Threshold: <1% of max variance
   
2. Univariate selection:
   - t-test or ANOVA per wavenumber
   - Keep top k most significant
   
3. Model-based:
   - Random Forest feature importance
   - Keep top k features
   
4. Recursive Feature Elimination (RFE):
   - Iteratively remove least important features
   - Expensive but thorough

How Many Features?:

Rule of Thumb: n_features ≤ n_samples / 10

Examples:
- 100 samples → Select ≤10 features
- 500 samples → Select ≤50 features
- 1000 samples → Can use more features

Model Selection

Start Simple:

Baseline: Logistic Regression
Next: Random Forest (fast, robust)
Then: XGBoost (if RF works)
Advanced: SVM, Neural Networks (if needed)

Hyperparameter Tuning:

Workflow:
1. Use defaults for initial assessment
2. If promising, run grid search on training set
3. Use nested CV:
   - Outer loop: Performance estimation
   - Inner loop: Hyperparameter tuning
4. Evaluate best model on test set
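
Nested cross-validation with patient grouping can be written as an explicit outer loop. This sketch uses synthetic data and a deliberately tiny parameter grid; the grid values and fold counts are placeholders, and the inner `GridSearchCV` receives the training patients' IDs so that tuning is also patient-level.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in: 24 patients, 3 spectra each, 50 wavenumbers
rng = np.random.default_rng(42)
n_patients, spectra_per_patient = 24, 3
patient_labels = rng.permutation(np.repeat([0, 1], n_patients // 2))
patient_ids = np.repeat(np.arange(n_patients), spectra_per_patient)
y = np.repeat(patient_labels, spectra_per_patient)
X = rng.normal(size=(patient_ids.size, 50))
X[:, :3] += 1.5 * y[:, None]

param_grid = {"n_estimators": [25, 50], "max_depth": [2, None]}
outer_cv = GroupKFold(n_splits=4)
outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=patient_ids):
    # Inner loop: hyperparameter tuning on the outer-training patients only
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=GroupKFold(n_splits=3),
    )
    search.fit(X[train_idx], y[train_idx], groups=patient_ids[train_idx])
    # Outer loop: performance estimate on patients unseen during tuning
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

mean_accuracy = float(np.mean(outer_scores))
```

The spread of `outer_scores` is as informative as the mean and is worth reporting with it.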

Avoid:

Tuning on test set
Many model types without justification
Complex models on small data
Claiming “best model” without proper comparison

Validation Requirements

Minimum Standards:

For Publication:
Independent test set (held out from start)
Patient-level cross-validation (GroupKFold)
Multiple performance metrics (not just accuracy)
Confidence intervals (bootstrap or CV)
Feature importance analysis
Model interpretation (SHAP, etc.)
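
A percentile bootstrap is one simple way to attach a confidence interval to test-set accuracy. The sketch below resamples individual predictions; the function name and the example labels are hypothetical. When several spectra come from the same patient, resampling whole patients instead would respect the grouping.

```python
import numpy as np


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, confidence=0.95, seed=42):
    """Return (accuracy, ci_low, ci_high) from a percentile bootstrap."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = (y_true == y_pred).astype(float)
    rng = np.random.default_rng(seed)
    # Each row holds the indices of one bootstrap resample
    idx = rng.integers(0, correct.size, size=(n_boot, correct.size))
    boot_accuracies = correct[idx].mean(axis=1)
    tail = (1.0 - confidence) / 2.0
    ci_low, ci_high = np.quantile(boot_accuracies, [tail, 1.0 - tail])
    return float(correct.mean()), float(ci_low), float(ci_high)


# Hypothetical test-set labels and predictions with 10 errors out of 100
y_true = np.repeat([0, 1], 50)
y_pred = y_true.copy()
y_pred[:5] = 1
y_pred[-5:] = 0
accuracy, ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
```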

Gold Standard:

Training/test split
Cross-validation on training set
External validation (different hospital/cohort)
Temporal validation (future patients)
Prospective validation (real-time use)

Reproducibility

Code and Environment

Version Control:

Use Git:
- Track all code changes
- Document analysis steps
- Enable rollback
- Collaborate safely

Commit Messages:
"Add AsLS baseline correction to preprocessing pipeline"
"Update RF hyperparameters based on grid search results"

Environment Management:

# Create reproducible environment
uv venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate  # Windows

# Record dependencies
uv pip freeze > requirements.txt

# requirements.txt should pin exact versions, for example:
numpy==1.24.3
scikit-learn==1.3.0
pandas==2.0.3

Random Seeds:

# Set ALL random seeds
import numpy as np
import random
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Use in all functions
train_test_split(X, y, random_state=RANDOM_SEED)

Documentation

Lab Notebook (Digital or Physical):

Per Experiment:
- Date and time
- Objective
- Methods used
- Parameters
- Results
- Observations
- Next steps

Example:
---
Date: 2026-01-24 14:30
Objective: Test preprocessing pipelines
Methods: AsLS baseline correction with varying λ
Parameters: λ ∈ {1e4, 1e5, 1e6, 1e7}
Results: λ=1e5 gives best balance (see plot_baseline_comparison.png)
Next: Apply to full dataset and proceed to classification
---

README Files:

# Project: Blood Plasma Disease Classification

## Overview
Raman spectroscopy-based classification of disease states in blood plasma.

## Data
- Location: raw_data/batch_01/
- Samples: 143 (75 healthy, 68 disease)
- Acquisition: 2026-01-15 to 2026-01-20
- Spectrometer: RamanSpecPro 5000

## Preprocessing
Pipeline: preprocessing_pipelines/standard_pipeline_v1.json
- AsLS (λ=1e5, p=0.01)
- Savitzky-Golay (w=11, order=3)
- Vector normalization

## Results
Best model: Random Forest (see models/rf_best.pkl)
Test accuracy: 92.3% (95% CI: 86.1-96.8%)

## Citation
[Include publication info when available]

Method Descriptions:

Write as if for a peer reviewer:

"Baseline correction was performed using asymmetric least squares (AsLS) 
with λ=10⁵ and p=0.01, followed by Savitzky-Golay smoothing (window=11, 
polynomial order=3). Spectra were normalized using L2 norm. This pipeline 
was selected based on optimization on a 20% random subset of the training 
data, evaluated by classification accuracy and signal-to-noise ratio 
improvement."

Publication and Reporting

Figures

Quality Standards:

Resolution: 300 DPI minimum (600 DPI for line art)
Format: TIFF, PDF, or EPS (vector when possible)
Size: Match journal requirements
Fonts: Readable (≥8pt), consistent across figures
Colors: Colorblind-friendly palette

Good Figure Practices:

Do:
✓ Clear axis labels with units
✓ Legends with all necessary info
✓ Error bars (SD, SEM, or CI)
✓ Statistical significance marked
✓ High contrast
✓ Consistent style across all figures

Don't:
✗ 3D plots (hard to read)
✗ Rainbow color maps (colorblind issue)
✗ Cluttered plots
✗ Missing error bars
✗ Unlabeled axes

Tables

Results Table Example:

Table 1. Classification Performance

| Model         | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC   |
| ------------- | ------------ | --------------- | --------------- | ----- |
| Logistic Reg  | 78.3 ± 4.2   | 75.1 ± 5.3      | 81.2 ± 4.8      | 0.862 |
| Random Forest | 92.3 ± 3.1   | 90.5 ± 4.1      | 93.8 ± 3.5      | 0.967 |
| XGBoost       | **93.1 ± 2.8** | **91.2 ± 3.9** | **94.7 ± 3.1** | **0.972** |
| SVM (RBF)     | 89.4 ± 3.5   | 87.3 ± 4.5      | 91.1 ± 3.9      | 0.943 |

Values: Mean ± SD from 5-fold cross-validation.
AUC: Area under ROC curve.
Best performance in bold.

Methods Section

Structure:

1. Sample Collection and Preparation
   - Describe protocol in detail
   - Include all parameters
   - Reference standard protocols if applicable

2. Data Acquisition
   - Instrument specifications
   - Acquisition parameters
   - Calibration procedures
   - Quality control measures

3. Data Preprocessing
   - Methods used (with citations)
   - Parameters (exact values)
   - Validation approach
   - Software used (with version)

4. Analysis Methods
   - Statistical tests (assumptions checked)
   - Machine learning algorithms
   - Validation strategy (GroupKFold, etc.)
   - Hyperparameter optimization
   - Software and versions

5. Data Availability
   - Where data can be accessed
   - Code repository (GitHub)
   - Any restrictions

Results Section

Good Practice:

"Random Forest achieved the best performance with a test set accuracy 
of 92.3% (95% CI: 86.1-96.8%), sensitivity of 90.5%, and specificity 
of 93.8%. Hyperparameters were tuned by 5-fold cross-validation with 
patient-level splitting (GroupKFold) on the 100 training samples. Feature 
importance analysis revealed that wavenumbers at 1655 cm⁻¹ (Amide I), 
1450 cm⁻¹ (CH₂), and 1200 cm⁻¹ (Amide III) contributed most to 
classification (see Figure 3). External validation on an independent 
cohort (n=43) confirmed model generalization with 91.2% accuracy."

Discussion

Address:

1. Main findings
   - Put in context of literature
   - Explain biological/clinical significance

2. Comparison with previous work
   - How does your method compare?
   - What's novel?

3. Limitations
   - Sample size
   - Single center vs multi-center
   - Lack of external validation
   - Class imbalance
   - Preprocessing choices

4. Future directions
   - Larger studies
   - External validation
   - Clinical trials
   - Real-time implementation

Ethical Considerations

Required:

1. Ethics approval
   - IRB/Ethics committee approval number
   - Informed consent obtained

2. Patient privacy
   - Data de-identified
   - No patient information in figures
   - Secure data storage

3. Clinical use disclaimer
   - "For research use only"
   - "Not approved for clinical diagnostics"
   - Limitations clearly stated

4. Conflicts of interest
   - Funding sources disclosed
   - Commercial interests declared

Common Pitfalls to Avoid

❌ Data Leakage

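Wrong: Fit data-dependent preprocessing (scaling, PCA, feature selection) on all data, then split
Correct: Split first, fit preprocessing on the training set, then apply the fitted transform to the test set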

❌ Spectrum-Level Splitting

Wrong: Random split ignoring patients
Correct: Patient-level splitting with GroupKFold

❌ Peeking at Test Set

Wrong: Check test performance during development
Correct: Look at test set only for final evaluation

❌ Multiple Testing Without Correction

Wrong: Report all p<0.05 across 1000 wavenumbers
Correct: Apply FDR correction

❌ Cherry-Picking Results

Wrong: Only report best-performing model
Correct: Report all tested approaches, explain choice

❌ Overfitting

Wrong: Complex model on small dataset
Correct: Match model complexity to sample size

❌ Missing Documentation

Wrong: "Preprocessed using standard methods"
Correct: "AsLS (λ=10⁵, p=0.01), then SavGol (w=11, order=3)"

Resources

Software Citations

Always cite:

@article{scikit-learn,
  title={Scikit-learn: Machine Learning in {P}ython},
  author={Pedregosa, F. and others},
  journal={Journal of Machine Learning Research},
  volume={12},
  pages={2825--2830},
  year={2011}
}
