# Best Practices

Research best practices for Raman spectroscopy analysis: data quality, reproducibility, validation, and reporting standards.

## Table of Contents
- [Data Quality](#data-quality)
- [Preprocessing Strategy](#preprocessing-strategy)
- [Statistical Analysis](#statistical-analysis)
- [Machine Learning](#machine-learning)
- [Reproducibility](#reproducibility)
- [Publication and Reporting](#publication-and-reporting)

---

## Data Quality

### Sample Preparation

**Consistency is Critical**:
```
Standard Operating Procedure (SOP):
1. Sample collection protocol
2. Storage conditions (-80°C for plasma)
3. Thawing procedure (room temp, 10 min)
4. Substrate preparation
5. Drying time
6. Environmental conditions (temp, humidity)
```

**Documentation**:
- Record all parameters
- Note any deviations
- Track batch information
- Include control samples

**Quality Control Samples**:
```
Per Batch:
- Positive control (known sample)
- Negative control (blank)
- Technical replicates (n=3 minimum)
- Standard reference material
```

### Data Acquisition

**Instrument Parameters**:
```
Essential to Record:
- Laser wavelength (e.g., 785 nm)
- Laser power (e.g., 50 mW)
- Integration time (e.g., 10 s)
- Number of accumulations (e.g., 3)
- Objective magnification (e.g., 10×)
- Spectral range (e.g., 400-1800 cm⁻¹)
- Spectral resolution (e.g., 4 cm⁻¹)
```

**Keep Constant**:
- Same spectrometer for entire study
- Same acquisition parameters
- Same environmental conditions
- Same operator (if possible)

**Calibration**:
```
Frequency: Daily
Standards:
- Silicon peak at 520.7 cm⁻¹ (wavenumber)
- NIST standards (intensity)
- Polystyrene (multiple peaks)

Procedure:
1. Acquire calibration spectrum
2. Verify peak position
3. Apply correction if needed
4. Document calibration
```

### Sample Size Determination

**Minimum Recommendations**:
```
Per Group:
- Exploratory analysis: n ≥ 30
- Statistical tests: n ≥ 50
- Machine learning: n ≥ 100
- Clinical validation: n ≥ 200
```

**Power Analysis** (for statistical tests):
```python
# Example: t-test power analysis
Effect size (Cohen's d): 0.5 (moderate)
Significance level (α): 0.05
Power (1-β): 0.80

Required sample size: ~64 per group
```

**For Machine Learning**:
```
Rule of Thumb:
- 10-20 samples per feature (wavenumber)
- With 1000 wavenumbers → n ≥ 10,000 ideal
- BUT: Use dimensionality reduction (PCA)
- Practical minimum: 50-100 per class

Reality Check:
- Small studies (n=50-200): Use simple models (LR, RF)
- Medium studies (n=200-1000): Use complex models (XGBoost, SVM)
- Large studies (n>1000): Deep learning possible
```

### Data Organization

**File Naming**:
```
Good:
YYYY-MM-DD_batch_condition_replicate_ID.csv
2026-01-24_batch01_healthy_rep1_patient001.csv

Bad:
data.csv
p1.csv
20260124.csv
```

**Folder Structure**:
```
project_name/
├── raw_data/
│   ├── batch_01/
│   │   ├── healthy/
│   │   └── disease/
│   └── batch_02/
├── processed_data/
├── preprocessing_pipelines/
├── models/
├── results/
│   ├── figures/
│   └── tables/
├── documentation/
│   ├── sop.md
│   ├── notebook.md
│   └── metadata.xlsx
└── scripts/
```

**Metadata Management**:
```text
sample_id,patient_id,group,batch,acquisition_date,laser_power,integration_time,notes
S001,P001,healthy,batch01,2026-01-15,50,10,good_quality
S002,P001,healthy,batch01,2026-01-15,50,10,technical_replicate
S003,P002,disease,batch01,2026-01-16,50,10,weak_signal
```

---

## Preprocessing Strategy

### Choosing Methods

**Decision Tree**:
```mermaid
graph TD
    A[Raw Spectra] --> B{Baseline?}
    B -->|Yes| C[AsLS/AirPLS]
    B -->|No| D{Noisy?}
    C --> D
    D -->|Yes| E[Savitzky-Golay]
    D -->|No| F{Intensity Varies?}
    E --> F
    F -->|Yes| G[Vector Norm]
    F -->|No| H[Done]
    G --> H
```

**Conservative Approach** (Recommended):
```
1. Baseline correction: AsLS (λ=1e5, p=0.01)
2. Light smoothing: Savitzky-Golay (window=11, order=3)
3. Normalization: Vector (L2 norm)

Why:
- Minimal information loss
- Widely accepted
- Reproducible
- Good starting point
```

**Avoid**:
- ❌ Over-smoothing (window >21)
- ❌ Multiple normalization steps
- ❌ 2nd derivative without strong justification
- ❌ Deep learning methods on small datasets
- ❌ Undocumented manual corrections

### Validation

**Visual Inspection**:
```
Check:
1. Overlay before/after preprocessing
2. Verify peaks not removed
3. Check baseline flatness
4. Compare multiple representative spectra
5. Inspect edge cases (weak signal, high noise)
```

**Quantitative Metrics**:
```
Calculate:
- SNR (Signal-to-Noise Ratio) improvement
- Peak height preservation
- Baseline level (should be near zero)
- Consistency across spectra
```

**Test on Subset**:
```
Workflow:
1. Select 20% of data randomly
2. Try different preprocessing approaches
3. Evaluate on downstream analysis (PCA, classification)
4. Choose best performing pipeline
5. Apply to full dataset
```

### Documentation

**Pipeline File**:
```json
{
  "pipeline_name": "blood_plasma_standard_v1",
  "created_date": "2026-01-24",
  "methods": [
    {
      "step": 1,
      "method": "AsLS",
      "parameters": {"lambda": 100000, "p": 0.01},
      "reason": "Remove fluorescence background"
    },
    {
      "step": 2,
      "method": "Savitzky-Golay",
      "parameters": {"window": 11, "polyorder": 3, "deriv": 0},
      "reason": "Reduce noise while preserving peaks"
    },
    {
      "step": 3,
      "method": "Vector Normalization",
      "parameters": {},
      "reason": "Normalize intensity variations"
    }
  ],
  "validation": {
    "snr_improvement": "35%",
    "peak_preservation": "98%",
    "tested_on": "20% random subset"
  }
}
```

**Lab Notebook Entry**:
```
Date: 2026-01-24
Experiment: Preprocessing optimization

Tried:
- Pipeline A: Polynomial baseline → Poor performance
- Pipeline B: AsLS (λ=1e6) → Over-correction
- Pipeline C: AsLS (λ=1e5) → Good balance ✓

Selected Pipeline C:
- AsLS (λ=1e5, p=0.01)
- SavGol (w=11, order=3)
- Vector norm

Results:
- PCA separation improved
- Classification accuracy: 85% → 92%
```

---

## Statistical Analysis

### Test Selection

**Flowchart**:
```mermaid
graph TD
    A[Compare Groups] --> B{How many groups?}
    B -->|2| C{Normal distribution?}
    B -->|≥3| D[Use pairwise tests]
    C -->|Yes| E[t-test]
    C -->|No| F[Mann-Whitney U]
    D --> G[Mann-Whitney U for each pair]
    G --> H[Apply FDR correction]
```

**Note**: ANOVA and Kruskal-Wallis are not currently available in the UI. For 3+ groups, use pairwise tests with multiple comparison correction.

**Check Assumptions**:
```
Normality:
- Visual: Q-Q plot, histogram
- Statistical: Shapiro-Wilk test (p > 0.05 → normal)

Equal Variance:
- Levene's test (p > 0.05 → equal variance)

Independence:
- Study design verification
- No repeated measures
```

### Multiple Testing Correction

**Problem**:
```
Testing 1000 wavenumbers at α=0.05
Expected false positives: 1000 × 0.05 = 50

Even with random data, expect ~50 "significant" results!
```

**Solutions**:
```
1. Bonferroni (most conservative):
   α_corrected = 0.05 / 1000 = 0.00005
   
2. FDR - Benjamini-Hochberg (recommended):
   Controls false discovery rate at 5%
   Less conservative than Bonferroni
   
3. Permutation tests:
   Data-driven correction
   More powerful for small samples
```

**Implementation**:
- Always report whether correction was applied
- Use FDR by default
- Use Bonferroni for confirmatory studies

### Effect Sizes

**Always Report**:
```
Example Bad:
"Peak at 1655 cm⁻¹ differs significantly between groups (p=0.003)"

Example Good:
"Peak at 1655 cm⁻¹ shows large effect (Cohen's d=0.85, 95% CI: 0.42-1.28)
with significant difference between groups (p=0.003, FDR-corrected)"
```

**Guidelines**:
```
Cohen's d:
- Small: |d| = 0.2
- Medium: |d| = 0.5
- Large: |d| = 0.8

Report:
- Point estimate
- Confidence interval (95%)
- Statistical significance
```

---

## Machine Learning

### Data Splitting

**Critical Rules**:
```
1. Split BEFORE any preprocessing or feature selection
2. Use patient-level splitting (not spectrum-level)
3. Never touch test set until final evaluation
4. Document random seed for reproducibility
```

**Recommended Split**:
```
Total Data
├── Training (70%)
│   ├── Used for: Model training, hyperparameter tuning, CV
│   └── Preprocessing fit on training only
├── Validation (15%) [Optional]
│   └── Used for: Model selection, early stopping
└── Test (15-30%)
    └── Used for: Final evaluation ONCE
```

**Patient-Level Splitting**:
```python
# Good: Patient-level
patients = get_unique_patients(data)
train_patients, test_patients = train_test_split(
    patients, test_size=0.3, stratify=patient_labels, random_state=42
)

# Bad: Spectrum-level (data leakage!)
train_spectra, test_spectra = train_test_split(
    spectra, test_size=0.3
)  # Patient spectra in both train and test!
```

### Cross-Validation Strategy

**GroupKFold (Required)**:
```python
from sklearn.model_selection import GroupKFold

# Ensure all spectra from one patient stay together
cv = GroupKFold(n_splits=5)
for train_idx, val_idx in cv.split(X, y, groups=patient_ids):
    # Train on train_idx, validate on val_idx
    # No patient appears in both
```

**Stratified K-Fold** (If no patient grouping):
```python
from sklearn.model_selection import StratifiedKFold

# Maintains class balance in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```

### Feature Selection

**Timing Matters**:
```
Wrong:
1. Select features on full dataset
2. Split data
3. Train model
→ Data leakage! Features selected using test set info

Correct:
1. Split data
2. Select features on training set only
3. Apply same selection to test set
4. Train model
```

**Methods**:
```
1. Variance threshold:
   - Remove low-variance wavenumbers
   - Threshold: <1% of max variance
   
2. Univariate selection:
   - t-test or ANOVA per wavenumber
   - Keep top k most significant
   
3. Model-based:
   - Random Forest feature importance
   - Keep top k features
   
4. Recursive Feature Elimination (RFE):
   - Iteratively remove least important features
   - Expensive but thorough
```

**How Many Features?**:
```
Rule of Thumb: n_features ≤ n_samples / 10

Examples:
- 100 samples → Select ≤10 features
- 500 samples → Select ≤50 features
- 1000 samples → Can use more features
```

### Model Selection

**Start Simple**:
```
1. Baseline: Logistic Regression
2. Next: Random Forest (fast, robust)
3. Then: XGBoost (if RF works)
4. Advanced: SVM, Neural Networks (if needed)
```

**Hyperparameter Tuning**:
```
Workflow:
1. Use defaults for initial assessment
2. If promising, run grid search on training set
3. Use nested CV:
   - Outer loop: Performance estimation
   - Inner loop: Hyperparameter tuning
4. Evaluate best model on test set
```

**Avoid**:
- Tuning on test set
- Many model types without justification
- Complex models on small data
- Claiming "best model" without proper comparison

### Validation Requirements

**Minimum Standards**:
```
For Publication:
1. Independent test set (held out from start)
2. Patient-level cross-validation (GroupKFold)
3. Multiple performance metrics (not just accuracy)
4. Confidence intervals (bootstrap or CV)
5. Feature importance analysis
6. Model interpretation (SHAP, etc.)
```

**Gold Standard**:
```
1. Training/test split
2. Cross-validation on training set
3. External validation (different hospital/cohort)
4. Temporal validation (future patients)
5. Prospective validation (real-time use)
```

---

## Reproducibility

### Code and Environment

**Version Control**:
```
Use Git:
- Track all code changes
- Document analysis steps
- Enable rollback
- Collaborate safely

Commit Messages:
"Add AsLS baseline correction to preprocessing pipeline"
"Update RF hyperparameters based on grid search results"
```

**Environment Management**:
```bash
# Create reproducible environment
uv venv
source .venv/bin/activate  # Linux/Mac
.venv\Scripts\activate  # Windows

# Record dependencies
uv pip freeze > requirements.txt

# Include versions
numpy==1.24.3
scikit-learn==1.3.0
pandas==2.0.3
```

**Random Seeds**:
```python
# Set ALL random seeds
import numpy as np
import random
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42

np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Use in all functions
train_test_split(X, y, random_state=RANDOM_SEED)
```

### Documentation

**Lab Notebook** (Digital or Physical):
```
Per Experiment:
- Date and time
- Objective
- Methods used
- Parameters
- Results
- Observations
- Next steps

Example:
---
Date: 2026-01-24 14:30
Objective: Test preprocessing pipelines
Methods: AsLS baseline correction with varying λ
Parameters: λ ∈ {1e4, 1e5, 1e6, 1e7}
Results: λ=1e5 gives best balance (see plot_baseline_comparison.png)
Next: Apply to full dataset and proceed to classification
---
```

**README Files**:
```markdown
# Project: Blood Plasma Disease Classification

## Overview
Raman spectroscopy-based classification of disease states in blood plasma.

## Data
- Location: data/raw/batch_01/
- Samples: 143 (75 healthy, 68 disease)
- Acquisition: 2026-01-15 to 2026-01-20
- Spectrometer: RamanSpecPro 5000

## Preprocessing
Pipeline: preprocessing/standard_pipeline_v1.json
- AsLS (λ=1e5, p=0.01)
- Savitzky-Golay (w=11, order=3)
- Vector normalization

## Results
Best model: Random Forest (see models/rf_best.pkl)
Test accuracy: 92.3% (95% CI: 86.1-96.8%)

## Citation
[Include publication info when available]
```

**Method Descriptions**:
```
Write as if for a peer reviewer:

"Baseline correction was performed using asymmetric least squares (AsLS) 
with λ=10⁵ and p=0.01, followed by Savitzky-Golay smoothing (window=11, 
polynomial order=3). Spectra were normalized using L2 norm. This pipeline 
was selected based on optimization on a 20% random subset of the training 
data, evaluated by classification accuracy and signal-to-noise ratio 
improvement."
```

---

## Publication and Reporting

### Figures

**Quality Standards**:
```
Resolution: 300 DPI minimum (600 DPI for line art)
Format: TIFF, PDF, or EPS (vector when possible)
Size: Match journal requirements
Fonts: Readable (≥8pt), consistent across figures
Colors: Colorblind-friendly palette
```

**Good Figure Practices**:
```
Do:
✓ Clear axis labels with units
✓ Legends with all necessary info
✓ Error bars (SD, SEM, or CI)
✓ Statistical significance marked
✓ High contrast
✓ Consistent style across all figures

Don't:
✗ 3D plots (hard to read)
✗ Rainbow color maps (colorblind issue)
✗ Cluttered plots
✗ Missing error bars
✗ Unlabeled axes
```

### Tables

**Results Table Example**:
```
Table 1. Classification Performance

| Model         | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC   |
| ------------- | ------------ | --------------- | --------------- | ----- |
| Logistic Reg  | 78.3 ± 4.2   | 75.1 ± 5.3      | 81.2 ± 4.8      | 0.862 |
| Random Forest | 92.3 ± 3.1   | 90.5 ± 4.1      | 93.8 ± 3.5      | 0.967 |
| XGBoost       | 93.1 ± 2.8   | 91.2 ± 3.9      | 94.7 ± 3.1      | 0.972 |
| SVM (RBF)     | 89.4 ± 3.5   | 87.3 ± 4.5      | 91.1 ± 3.9      | 0.943 |

Values: Mean ± SD from 5-fold cross-validation.
AUC: Area under ROC curve.
Best performance in bold.
```

### Methods Section

**Structure**:
```
1. Sample Collection and Preparation
   - Describe protocol in detail
   - Include all parameters
   - Reference standard protocols if applicable

2. Data Acquisition
   - Instrument specifications
   - Acquisition parameters
   - Calibration procedures
   - Quality control measures

3. Data Preprocessing
   - Methods used (with citations)
   - Parameters (exact values)
   - Validation approach
   - Software used (with version)

4. Analysis Methods
   - Statistical tests (assumptions checked)
   - Machine learning algorithms
   - Validation strategy (GroupKFold, etc.)
   - Hyperparameter optimization
   - Software and versions

5. Data Availability
   - Where data can be accessed
   - Code repository (GitHub)
   - Any restrictions
```

(reporting-results)=

### Results Section

**Good Practice**:
```
"Random Forest achieved the best performance with a test set accuracy 
of 92.3% (95% CI: 86.1-96.8%), sensitivity of 90.5%, and specificity 
of 93.8%. The model was trained using 5-fold cross-validation with 
patient-level splitting (GroupKFold) on 100 training samples. Feature 
importance analysis revealed that wavenumbers at 1655 cm⁻¹ (Amide I), 
1450 cm⁻¹ (CH₂), and 1200 cm⁻¹ (Amide III) contributed most to 
classification (see Figure 3). External validation on an independent 
cohort (n=43) confirmed model generalization with 91.2% accuracy."
```

### Discussion

**Address**:
```
1. Main findings
   - Put in context of literature
   - Explain biological/clinical significance

2. Comparison with previous work
   - How does your method compare?
   - What's novel?

3. Limitations
   - Sample size
   - Single center vs multi-center
   - Lack of external validation
   - Class imbalance
   - Preprocessing choices

4. Future directions
   - Larger studies
   - External validation
   - Clinical trials
   - Real-time implementation
```

### Ethical Considerations

**Required**:
```
1. Ethics approval
   - IRB/Ethics committee approval number
   - Informed consent obtained

2. Patient privacy
   - Data de-identified
   - No patient information in figures
   - Secure data storage

3. Clinical use disclaimer
   - "For research use only"
   - "Not approved for clinical diagnostics"
   - Limitations clearly stated

4. Conflicts of interest
   - Funding sources disclosed
   - Commercial interests declared
```

---

## Checklists

### Before Analysis
- [ ] Data quality checked (visual inspection, statistics)
- [ ] Metadata recorded (all acquisition parameters)
- [ ] Control samples included
- [ ] Calibration performed
- [ ] Data backed up
- [ ] File naming convention followed
- [ ] Lab notebook entry made

### Before Preprocessing
- [ ] Multiple approaches tested on subset
- [ ] Visual validation performed
- [ ] Quantitative metrics calculated
- [ ] Pipeline saved with parameters
- [ ] Documentation written

### Before Statistical Analysis
- [ ] Test assumptions checked (normality, equal variance)
- [ ] Multiple testing correction planned
- [ ] Effect sizes will be reported
- [ ] Sample size sufficient (power analysis)

### Before Machine Learning
- [ ] Data split patient-level (GroupKFold planned)
- [ ] Test set held out and not touched
- [ ] Preprocessing plan documented
- [ ] Feature selection strategy defined
- [ ] Validation strategy chosen
- [ ] Random seeds set

### Before Publication
- [ ] All analyses documented
- [ ] Code version controlled (GitHub)
- [ ] Figures publication-quality (300+ DPI)
- [ ] Tables formatted correctly
- [ ] Methods section complete and reproducible
- [ ] Data availability statement included
- [ ] Ethics approval obtained
- [ ] Author contributions defined
- [ ] Funding acknowledged
- [ ] Conflicts of interest disclosed

---

## Common Pitfalls to Avoid

(data-leakage)=

### ❌ Data Leakage
```
Wrong: Normalize all data, then split
Correct: Split first, normalize training, apply to test
```

### ❌ Spectrum-Level Splitting
```
Wrong: Random split ignoring patients
Correct: Patient-level splitting with GroupKFold
```

### ❌ Peeking at Test Set
```
Wrong: Check test performance during development
Correct: Look at test set only for final evaluation
```

### ❌ Multiple Testing Without Correction
```
Wrong: Report all p<0.05 across 1000 wavenumbers
Correct: Apply FDR correction
```

### ❌ Cherry-Picking Results
```
Wrong: Only report best-performing model
Correct: Report all tested approaches, explain choice
```

### ❌ Overfitting
```
Wrong: Complex model on small dataset
Correct: Match model complexity to sample size
```

### ❌ Missing Documentation
```
Wrong: "Preprocessed using standard methods"
Correct: "AsLS (λ=10⁵, p=0.01), then SavGol (w=11, order=3)"
```

---

## Resources

### Recommended Reading

**Raman Spectroscopy**:
1. Butler et al. (2016). Nature Protocols. "Using Raman spectroscopy to characterize biological materials"
2. Movasaghi et al. (2007). Applied Spectroscopy Reviews. "Raman Spectroscopy of Biological Tissues"

**Machine Learning**:
3. Hastie et al. (2009). "The Elements of Statistical Learning"
4. Goodfellow et al. (2016). "Deep Learning"

**Statistics**:
5. Altman & Bland (1995). BMJ. "Statistics notes" series
6. Wasserstein & Lazar (2016). Am Stat. "The ASA statement on p-values"

### Software Citations

**Always cite**:
```bibtex
@article{scikit-learn,
  title={Scikit-learn: Machine Learning in Python},
  author={Pedregosa et al.},
  journal={JMLR},
  year={2011}
}
```

---

## See Also

- [Data Import Guide](data-import.md) - Best practices for data organization
- [Preprocessing Guide](preprocessing.md) - Detailed preprocessing methods
- [Analysis Guide](analysis.md) - Statistical analysis best practices
- [Machine Learning Guide](machine-learning.md) - ML workflow and validation

---

**Complete User Guide Navigation**:
- [Interface Overview](interface-overview.md)
- [Data Import](data-import.md)
- [Preprocessing](preprocessing.md)
- [Analysis](analysis.md)
- [Machine Learning](machine-learning.md)
- **[Best Practices](best-practices.md)** ← You are here