# Preprocessing Methods Reference

Comprehensive reference for all 40+ preprocessing methods available in the application.

## Table of Contents
- [Baseline Correction](#baseline-correction)
- [Smoothing and Denoising](#smoothing-and-denoising)
- [Normalization](#normalization)
- [Derivatives](#derivatives)
- [Feature Engineering](#feature-engineering)
- [Advanced Methods](#advanced-methods)

---

## Baseline Correction

Remove fluorescence background and baseline drift from spectra.

### AsLS (Asymmetric Least Squares)

**Full Name**: Asymmetric Least Squares Smoothing

**Purpose**: Remove baseline while preserving peaks using asymmetric weighting

**Theory**:
Fits a smoothed baseline by minimizing:
```
Σ w_i(y_i - z_i)² + λ Σ (Δ²z_i)²
```
where `w_i` are asymmetric weights (lower for peaks, higher for baseline)

**Parameters**:

| Parameter    | Type  | Range       | Default | Description                                      |
| ------------ | ----- | ----------- | ------- | ------------------------------------------------ |
| `lambda` (λ) | float | 1e2 - 1e9   | 1e5     | Smoothness parameter. Higher = smoother baseline |
| `p`          | float | 0.001 - 0.1 | 0.01    | Asymmetry parameter. Lower = fit valleys better  |
| `max_iter`   | int   | 5 - 50      | 10      | Maximum iterations for convergence               |
| `tol`        | float | 1e-6 - 1e-3 | 1e-6    | Convergence tolerance                            |

**Parameter Guide**:
```python
# Conservative (preserves more peaks)
lambda = 1e5, p = 0.01

# Standard (balanced)
lambda = 1e6, p = 0.01

# Aggressive (removes more baseline)
lambda = 1e7, p = 0.001
```

**Usage Example**:
```python
from functions.preprocess import apply_asls

# Apply AsLS baseline correction
corrected = apply_asls(
    spectra,
    lam=1e5,
    p=0.01,
    max_iter=10
)
```

**Interpretation**:
- **λ (lambda)**: Controls baseline smoothness
  - Too low → Follows peaks (underfitting)
  - Too high → Over-smooths (may miss real baseline curvature)
  - Sweet spot: 1e5 - 1e6 for most Raman spectra

- **p**: Controls asymmetry
  - Lower → Treats peaks as outliers (good for sharp peaks)
  - Higher → Fits through peaks (use if baseline has structure)

**Common Issues**:
1. **Peaks removed**: λ too high or p too high → Reduce λ or p
2. **Baseline remains**: λ too low → Increase λ
3. **Slow convergence**: Increase tol or reduce max_iter
4. **Oscillations**: p too low → Increase p slightly

**When to Use**:
- ✓ General-purpose baseline correction
- ✓ Fluorescence background
- ✓ Unknown baseline shape
- ✗ Not for: Very noisy data (smooth first)

**Reference**: 
Eilers & Boelens (2005). "Baseline Correction with Asymmetric Least Squares Smoothing"

---

### AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares)

**Full Name**: Adaptive Iteratively Reweighted Penalized Least Squares

**Purpose**: Automatic baseline correction with minimal parameter tuning

**Theory**:
Iteratively fits baseline by:
1. Penalized least squares fitting
2. Adaptive weighting based on residuals
3. Iteration until convergence

**Parameters**:

| Parameter    | Type  | Range       | Default | Description                  |
| ------------ | ----- | ----------- | ------- | ---------------------------- |
| `lambda` (λ) | float | 1e2 - 1e7   | 1e5     | Smoothness parameter         |
| `porder`     | int   | 1, 2        | 1       | Difference order for penalty |
| `max_iter`   | int   | 10 - 100    | 15      | Maximum iterations           |
| `min_diff`   | float | 1e-6 - 1e-3 | 1e-5    | Convergence criterion        |

**Parameter Guide**:
```python
# Fast (fewer iterations)
lambda = 1e4, max_iter = 10

# Balanced (recommended)
lambda = 1e5, max_iter = 15

# Thorough (more iterations)
lambda = 1e6, max_iter = 30
```

**Usage Example**:
```python
from functions.preprocess import apply_airpls

corrected = apply_airpls(
    spectra,
    lam=1e5,
    porder=1,
    max_iter=15
)
```

**Interpretation**:
- **Automatic adaptation**: Weights adjust to separate peaks from baseline
- **Less sensitive to p**: No asymmetry parameter needed
- **porder=1**: First-order differences (smoother)
- **porder=2**: Second-order differences (more flexible)

**Common Issues**:
1. **Negative baseline**: Normal behavior, corrects fluorescence
2. **Slow**: Reduce max_iter or increase λ
3. **Incomplete correction**: Increase max_iter or reduce λ

**When to Use**:
- ✓ Automatic baseline correction
- ✓ Batch processing
- ✓ When AsLS parameters unclear
- ✓ Complex baseline shapes

**Reference**:
Zhang et al. (2010). "Baseline correction using adaptive iteratively reweighted penalized least squares"

---

### Polynomial Baseline

**Purpose**: Fit and subtract polynomial baseline

**Theory**:
Fits polynomial of degree `n`:
```
baseline = a₀ + a₁x + a₂x² + ... + aₙxⁿ
```

**Parameters**:

| Parameter | Type | Range  | Default | Description       |
| --------- | ---- | ------ | ------- | ----------------- |
| `degree`  | int  | 1 - 10 | 3       | Polynomial degree |

**Parameter Guide**:
```python
degree = 1  # Linear baseline
degree = 2  # Quadratic
degree = 3  # Cubic (most common)
degree = 4-5  # Higher order (flexible)
degree > 5  # Risky (may overfit)
```

**Usage Example**:
```python
from functions.preprocess import apply_polynomial_baseline

corrected = apply_polynomial_baseline(
    spectra,
    degree=3
)
```

**Common Issues**:
1. **Underfitting**: degree too low → Increase degree
2. **Overfitting**: degree too high → Reduce degree
3. **Peaks affected**: Use robust fitting or lower degree

**When to Use**:
- ✓ Simple, smooth baseline
- ✓ Known polynomial shape
- ✓ Fast processing needed
- ✗ Not for: Complex baseline, fluorescence

---

### Whittaker Smoothing

**Purpose**: Smooth baseline using penalized least squares

**Parameters**:

| Parameter     | Type  | Range     | Default | Description          |
| ------------- | ----- | --------- | ------- | -------------------- |
| `lambda` (λ)  | float | 1e2 - 1e9 | 1e5     | Smoothness parameter |
| `differences` | int   | 1, 2, 3   | 2       | Order of differences |

**Usage Example**:
```python
from functions.preprocess import apply_whittaker

corrected = apply_whittaker(
    spectra,
    lam=1e5,
    differences=2
)
```

**When to Use**:
- ✓ Smooth baseline
- ✓ Preserving peak shapes
- ✓ Alternative to polynomial

**Reference**:
Eilers (2003). "A Perfect Smoother"

---

### FABC (Fully Automatic Baseline Correction)

**Purpose**: Completely automatic baseline correction with no tuning

**Parameters**:

| Parameter       | Type | Range     | Default | Description                     |
| --------------- | ---- | --------- | ------- | ------------------------------- |
| `window_length` | int  | 100 - 500 | 200     | Window size for local fitting   |
| `iterations`    | int  | 1 - 20    | 10      | Number of correction iterations |

**Usage Example**:
```python
from functions.preprocess.fabc_fixed import apply_fabc

corrected = apply_fabc(
    spectra,
    window_length=200,
    iterations=10
)
```

**When to Use**:
- ✓ No user expertise required
- ✓ Batch processing
- ✓ Standardized pipelines
- ✓ Unknown baseline characteristics

---

### Butterworth High-Pass Filter

**Purpose**: Remove low-frequency baseline components using frequency domain filtering

**Parameters**:

| Parameter | Type  | Range       | Default | Description                   |
| --------- | ----- | ----------- | ------- | ----------------------------- |
| `cutoff`  | float | 0.001 - 0.1 | 0.01    | Cutoff frequency (normalized) |
| `order`   | int   | 1 - 10      | 4       | Filter order                  |

**Usage Example**:
```python
from functions.preprocess import apply_butterworth_highpass

corrected = apply_butterworth_highpass(
    spectra,
    cutoff=0.01,
    order=4
)
```

**When to Use**:
- ✓ Frequency-domain baseline removal
- ✓ Uniform low-frequency drift
- ✗ Not ideal for: Non-uniform baseline

---

## Smoothing and Denoising

Reduce noise while preserving spectral features.

### Savitzky-Golay Filter

**Purpose**: Polynomial smoothing that preserves peak shape and height

**Theory**:
Fits local polynomials using least squares within a moving window

**Parameters**:

| Parameter       | Type | Range        | Default | Description                                           |
| --------------- | ---- | ------------ | ------- | ----------------------------------------------------- |
| `window_length` | int  | 5 - 51 (odd) | 11      | Size of smoothing window                              |
| `polyorder`     | int  | 2 - 5        | 3       | Polynomial order                                      |
| `deriv`         | int  | 0 - 2        | 0       | Derivative order (0=smooth, 1=1st deriv, 2=2nd deriv) |

**Parameter Guide**:
```python
# Light smoothing
window_length = 7, polyorder = 3

# Moderate smoothing (recommended)
window_length = 11, polyorder = 3

# Heavy smoothing
window_length = 21, polyorder = 3

# Peak sharpening (1st derivative)
window_length = 11, polyorder = 3, deriv = 1

# Peak resolution (2nd derivative)
window_length = 11, polyorder = 3, deriv = 2
```

**Usage Example**:
```python
from functions.preprocess import apply_savgol

# Smoothing
smoothed = apply_savgol(
    spectra,
    window_length=11,
    polyorder=3,
    deriv=0
)

# First derivative
derivative = apply_savgol(
    spectra,
    window_length=11,
    polyorder=3,
    deriv=1
)
```

**Interpretation**:
- **window_length**: Larger window = more smoothing, but peak broadening
- **polyorder**: Higher order preserves sharp features better
- **deriv=1**: Converts peaks to zero-crossings, removes baseline
- **deriv=2**: Converts peaks to negative dips, enhances resolution

**Common Issues**:
1. **Over-smoothing**: Window too large → Reduce window
2. **Peak broadening**: Window too large → Use window ≤ 11
3. **Noisy derivatives**: Smooth first, then derivative
4. **Oscillations**: polyorder too high → Reduce to 2 or 3

**When to Use**:
- ✓ General smoothing
- ✓ Peak-preserving smoothing
- ✓ Derivatives for baseline removal
- ✓ Most common choice

**Constraints**:
- window_length > polyorder
- window_length must be odd

**Reference**:
Savitzky & Golay (1964). "Smoothing and Differentiation of Data by Simplified Least Squares Procedures"

---

### Gaussian Smoothing

**Purpose**: Strong smoothing using Gaussian kernel

**Theory**:
Convolves spectrum with Gaussian kernel:
```
K(x) = (1/√(2πσ²)) exp(-x²/2σ²)
```

**Parameters**:

| Parameter | Type  | Range     | Default | Description                           |
| --------- | ----- | --------- | ------- | ------------------------------------- |
| `sigma`   | float | 0.5 - 5.0 | 2.0     | Standard deviation of Gaussian kernel |

**Parameter Guide**:
```python
sigma = 1.0  # Light smoothing
sigma = 2.0  # Moderate (default)
sigma = 3.0  # Heavy smoothing
sigma > 4.0  # Very strong (may blur peaks)
```

**Usage Example**:
```python
from functions.preprocess import apply_gaussian

smoothed = apply_gaussian(
    spectra,
    sigma=2.0
)
```

**Common Issues**:
1. **Peak broadening**: sigma too high → Reduce sigma
2. **Insufficient smoothing**: sigma too low → Increase sigma

**When to Use**:
- ✓ Very noisy data
- ✓ When peak shape less critical
- ✓ Visualization
- ✗ Not for: Quantitative analysis (peak heights change)

---

### Moving Average

**Purpose**: Simple uniform smoothing

**Parameters**:

| Parameter     | Type | Range        | Default | Description |
| ------------- | ---- | ------------ | ------- | ----------- |
| `window_size` | int  | 3 - 21 (odd) | 5       | Window size |

**Usage Example**:
```python
from functions.preprocess import apply_moving_average

smoothed = apply_moving_average(
    spectra,
    window_size=5
)
```

**When to Use**:
- ✓ Quick smoothing
- ✓ Preliminary exploration
- ✗ Not recommended for: Publication (use Savitzky-Golay)

---

### Median Filter

**Purpose**: Remove spike noise (cosmic rays, detector artifacts)

**Theory**:
Replaces each point with median of surrounding window

**Parameters**:

| Parameter     | Type | Range        | Default | Description |
| ------------- | ---- | ------------ | ------- | ----------- |
| `window_size` | int  | 3 - 11 (odd) | 5       | Window size |

**Usage Example**:
```python
from functions.preprocess import apply_median_filter

despike = apply_median_filter(
    spectra,
    window_size=5
)
```

**Common Issues**:
1. **Peak clipping**: Window too large → Use window=3 or 5
2. **Spikes remain**: Window too small → Increase to 7

**When to Use**:
- ✓ Cosmic ray removal
- ✓ Spike artifacts
- ✓ Before other preprocessing
- ✓ CCD detector noise

**Best Practice**: Apply BEFORE baseline correction and smoothing

---

### Kernel Denoising

**Purpose**: Advanced denoising using various kernel functions

**Available Kernels**: Gaussian, Epanechnikov, Tricube, Triweight

**Parameters**:

| Parameter   | Type  | Range     | Default    | Description      |
| ----------- | ----- | --------- | ---------- | ---------------- |
| `kernel`    | str   | -         | 'gaussian' | Kernel type      |
| `bandwidth` | float | 0.5 - 5.0 | 1.0        | Kernel bandwidth |

**Usage Example**:
```python
from functions.preprocess.kernel_denoise import apply_kernel_denoise

denoised = apply_kernel_denoise(
    spectra,
    kernel='gaussian',
    bandwidth=1.0
)
```

**When to Use**:
- ✓ Alternative to Gaussian smoothing
- ✓ Experimenting with different kernels

---

## Normalization

Scale spectra to comparable ranges.

### Vector Normalization (L2 Norm)

**Purpose**: Normalize to unit length (most common)

**Theory**:
```
normalized = spectrum / ||spectrum||₂
where ||spectrum||₂ = √(Σ x²)
```

**Parameters**: None

**Usage Example**:
```python
from functions.preprocess import apply_vector_norm

normalized = apply_vector_norm(spectra)
```

**Effect**:
- Makes all spectra have same total "energy"
- Removes intensity variations
- Preserves relative peak ratios

**When to Use**:
- ✓ Most common choice
- ✓ Classification tasks
- ✓ Removing concentration effects
- ✓ SVM, neural networks
- ✓ After baseline correction

---

### Min-Max Normalization

**Purpose**: Scale to [0, 1] range

**Theory**:
```
normalized = (spectrum - min) / (max - min)
```

**Parameters**:

| Parameter       | Type  | Range | Default | Description             |
| --------------- | ----- | ----- | ------- | ----------------------- |
| `feature_range` | tuple | -     | (0, 1)  | Target range (min, max) |

**Usage Example**:
```python
from functions.preprocess import apply_minmax_norm

normalized = apply_minmax_norm(
    spectra,
    feature_range=(0, 1)
)
```

**When to Use**:
- ✓ Neural networks
- ✓ Visualization
- ✓ Bounded input required
- ✗ Not for: Preserving absolute intensities

---

### Area Normalization

**Purpose**: Normalize by area under curve

**Theory**:
```
normalized = spectrum / Σ|spectrum|
```

**Parameters**: None

**Usage Example**:
```python
from functions.preprocess import apply_area_norm

normalized = apply_area_norm(spectra)
```

**When to Use**:
- ✓ Concentration normalization
- ✓ Comparing relative peak heights
- ✓ Eliminating total intensity differences
- ✗ Not for: Absolute quantification

---

### Standard Normal Variate (SNV)

**Purpose**: Remove multiplicative scatter effects

**Theory**:
```
SNV = (spectrum - mean) / std
```

**Parameters**: None

**Usage Example**:
```python
from functions.preprocess import apply_snv

normalized = apply_snv(spectra)
```

**When to Use**:
- ✓ Solid samples with scattering
- ✓ Particle size variations
- ✓ NIR spectroscopy (also useful for Raman)
- ✓ Removing multiplicative effects

**Reference**:
Barnes et al. (1989). "Standard Normal Variate Transformation"

---

### Multiplicative Scatter Correction (MSC)

**Purpose**: Correct for light scattering variations

**Theory**:
Fits each spectrum to a reference (mean spectrum):
```
spectrum_i = a + b × reference
corrected = (spectrum_i - a) / b
```

**Parameters**:

| Parameter   | Type  | Default | Description                               |
| ----------- | ----- | ------- | ----------------------------------------- |
| `reference` | array | None    | Reference spectrum (default: mean of all) |

**Usage Example**:
```python
from functions.preprocess import apply_msc

normalized = apply_msc(
    spectra,
    reference=None  # Auto: use mean
)
```

**When to Use**:
- ✓ Diffuse reflectance spectroscopy
- ✓ Particle size effects
- ✓ Scattering correction
- ✓ Quantitative analysis

**Requirement**: Need reference spectrum (usually mean)

**Reference**:
Geladi et al. (1985). "Linearization and Scatter-Correction for Near-Infrared Reflectance Spectra"

---

### Quantile Normalization

**Purpose**: Match distributions across spectra

**Parameters**:

| Parameter     | Type | Range      | Default | Description         |
| ------------- | ---- | ---------- | ------- | ------------------- |
| `n_quantiles` | int  | 100 - 1000 | 1000    | Number of quantiles |

**Usage Example**:
```python
from functions.preprocess.advanced_normalization import apply_quantile_norm

normalized = apply_quantile_norm(
    spectra,
    n_quantiles=1000
)
```

**When to Use**:
- ✓ Batch effect correction
- ✓ Making distributions identical
- ✗ Less common in Raman

---

### Probabilistic Quotient Normalization (PQN)

**Purpose**: Dilution correction for metabolomics

**Theory**:
```
1. Calculate reference (median spectrum)
2. Calculate quotients: spectrum / reference
3. Normalize by median quotient
```

**Parameters**:

| Parameter   | Type  | Default | Description        |
| ----------- | ----- | ------- | ------------------ |
| `reference` | array | None    | Reference spectrum |

**Usage Example**:
```python
from functions.preprocess.advanced_normalization import apply_pqn

normalized = apply_pqn(
    spectra,
    reference=None
)
```

**When to Use**:
- ✓ Metabolomics
- ✓ Dilution correction
- ✓ Biological fluids

**Reference**:
Dieterle et al. (2006). "Probabilistic Quotient Normalization"

---

### Rank Transformation

**Purpose**: Convert to ranks (non-parametric)

**Parameters**: None

**Usage Example**:
```python
from functions.preprocess.advanced_normalization import apply_rank_transform

ranked = apply_rank_transform(spectra)
```

**When to Use**:
- ✓ Non-parametric analysis
- ✓ Outlier-resistant
- ✓ When distribution doesn't matter

---

## Derivatives

Calculate spectral derivatives for baseline removal and peak resolution.

### First Derivative (Savitzky-Golay)

**Purpose**: Remove baseline, enhance peak differences

**Theory**:
First derivative of smoothed spectrum using Savitzky-Golay

**Parameters**: Same as Savitzky-Golay + `deriv=1`

**Usage Example**:
```python
from functions.preprocess import apply_savgol

derivative1 = apply_savgol(
    spectra,
    window_length=11,
    polyorder=3,
    deriv=1
)
```

**Effect**:
- Peaks become positive/negative transitions
- Baseline removed (constant → zero)
- Peak maxima → zero-crossings

**When to Use**:
- ✓ Alternative to baseline correction
- ✓ Overlapping peak resolution
- ✓ Chemometric analysis
- ✗ Amplifies noise (smooth first!)

---

### Second Derivative

**Purpose**: Sharpen peaks, resolve overlaps

**Theory**:
Second derivative of smoothed spectrum

**Parameters**: Same as Savitzky-Golay + `deriv=2`

**Usage Example**:
```python
from functions.preprocess import apply_savgol

derivative2 = apply_savgol(
    spectra,
    window_length=11,
    polyorder=3,
    deriv=2
)
```

**Effect**:
- Peaks become negative dips
- Sharpens overlapping peaks
- Greatly amplifies noise

**When to Use**:
- ✓ Overlapping peak resolution
- ✓ Peak identification
- ✓ Advanced analysis
- ⚠️ Warning: Requires good smoothing, amplifies noise significantly

---

## Feature Engineering

Create new features from spectra.

### Peak Ratio

**Purpose**: Calculate ratio between two peaks

**Parameters**:

| Parameter     | Type  | Description                        |
| ------------- | ----- | ---------------------------------- |
| `peak1_range` | tuple | (start, end) wavenumber for peak 1 |
| `peak2_range` | tuple | (start, end) wavenumber for peak 2 |

**Usage Example**:
```python
from functions.preprocess.feature_engineering import calculate_peak_ratio

# Amide I / CH2 ratio
ratio = calculate_peak_ratio(
    spectra,
    wavenumbers,
    peak1_range=(1645, 1665),  # Amide I
    peak2_range=(1440, 1460)   # CH2
)
```

**When to Use**:
- ✓ Known biomarker ratios
- ✓ Creating interpretable features
- ✓ Reducing dimensionality
- ✓ Band ratio analysis

**Example Ratios**:
```
I₁₆₅₅/I₁₄₄₅ (Amide I / CH₂)
I₁₂₉₀/I₁₂₄₀ (Amide III ratio)
I₁₀₀₀/I₁₆₀₀ (Custom biomarkers)
```

---

### Wavelet Transform

**Purpose**: Multi-resolution decomposition for denoising

**Parameters**:

| Parameter   | Type | Range  | Default | Description                     |
| ----------- | ---- | ------ | ------- | ------------------------------- |
| `wavelet`   | str  | -      | 'db4'   | Wavelet type (db4, sym8, coif5) |
| `level`     | int  | 1 - 10 | 4       | Decomposition level             |
| `threshold` | str  | -      | 'soft'  | Thresholding type (soft, hard)  |

**Usage Example**:
```python
from functions.preprocess.feature_engineering import apply_wavelet_transform

denoised = apply_wavelet_transform(
    spectra,
    wavelet='db4',
    level=4,
    threshold='soft'
)
```

**When to Use**:
- ✓ Complex noise patterns
- ✓ Multi-scale features
- ✓ Non-stationary noise
- ✗ Computationally expensive

**Reference**:
Mallat (1989). "A theory for multiresolution signal decomposition: the wavelet representation"

---

## Advanced Methods

Specialized preprocessing methods.

### Convolutional Denoising Autoencoder (CDAE)

**Purpose**: Deep learning-based denoising

**Theory**:
Neural network trained to reconstruct clean spectra from noisy input

**Parameters**:

| Parameter    | Type | Default | Description               |
| ------------ | ---- | ------- | ------------------------- |
| `model_path` | str  | None    | Path to trained model     |
| `batch_size` | int  | 32      | Batch size for prediction |

**Requirements**:
- PyTorch installed
- GPU recommended
- Pre-trained model

**Usage Example**:
```python
from functions.preprocess.deep_learning import apply_cdae

denoised = apply_cdae(
    spectra,
    model_path='models/cdae_raman.pth',
    batch_size=32
)
```

**When to Use**:
- ✓ After training on your data type
- ✓ Complex noise patterns
- ✓ Large datasets
- ✗ Requires: Training data, GPU, expertise

---

### Background Subtraction

**Purpose**: Subtract background/blank spectrum

**Parameters**:

| Parameter    | Type  | Description                     |
| ------------ | ----- | ------------------------------- |
| `background` | array | Background spectrum to subtract |

**Usage Example**:
```python
from functions.preprocess import subtract_background

corrected = subtract_background(
    spectra,
    background=blank_spectrum
)
```

**When to Use**:
- ✓ Measured blank/background available
- ✓ Removing substrate contribution
- ✓ Consistent background across samples

---

### Calibration

**Purpose**: Wavenumber axis calibration

**Types**:
1. **Linear Shift**: Single reference peak
2. **Polynomial**: Multiple reference peaks

**Parameters**:

| Parameter         | Type | Description              |
| ----------------- | ---- | ------------------------ |
| `reference_peaks` | list | Expected peak positions  |
| `measured_peaks`  | list | Measured peak positions  |
| `method`          | str  | 'linear' or 'polynomial' |

**Usage Example**:
```python
from functions.preprocess.calibration import apply_calibration

# Calibrate using silicon peak
calibrated_wn = apply_calibration(
    wavenumbers,
    reference_peaks=[520.7],  # Silicon
    measured_peaks=[522.3],    # Measured
    method='linear'
)
```

**When to Use**:
- ✓ Wavenumber axis errors detected
- ✓ Instrument drift correction
- ✓ Before combining datasets

**Reference Standards**:
- Silicon: 520.7 cm⁻¹
- Polystyrene: 1001, 1031, 1602 cm⁻¹
- Diamond: 1332 cm⁻¹

---

## Method Selection Guide

### Decision Matrix

| Goal                     | Recommended Method(s) | Parameters         |
| ------------------------ | --------------------- | ------------------ |
| **Remove baseline**      | AsLS                  | λ=1e5, p=0.01      |
|                          | AirPLS                | λ=1e5              |
|                          | Polynomial            | degree=3           |
| **Reduce noise**         | Savitzky-Golay        | window=11, order=3 |
|                          | Gaussian              | sigma=2.0          |
|                          | Median (spikes)       | window=5           |
| **Normalize intensity**  | Vector (L2)           | -                  |
|                          | SNV                   | -                  |
|                          | Area                  | -                  |
| **Remove scatter**       | MSC                   | reference=mean     |
|                          | SNV                   | -                  |
| **Baseline alternative** | 1st Derivative        | SavGol deriv=1     |
| **Peak resolution**      | 2nd Derivative        | SavGol deriv=2     |
| **Create features**      | Peak Ratios           | Custom ranges      |

(recommended-pipelines)=

### Common Pipelines

**Standard Pipeline**:
```python
1. AsLS (λ=1e5, p=0.01)
2. Savitzky-Golay (w=11, order=3)
3. Vector Normalization
```

**High-Noise Pipeline**:
```python
1. Median Filter (w=5)
2. AirPLS (λ=1e6)
3. Gaussian (σ=2.0)
4. SNV
```

**Derivative Pipeline**:
```python
1. AsLS (λ=1e5, p=0.01)
2. Savitzky-Golay 1st Derivative (w=11, order=3, deriv=1)
3. Vector Normalization
```

**Quantitative Pipeline**:
```python
1. AsLS (λ=1e6, p=0.001)
2. MSC (reference=mean)
3. Savitzky-Golay (w=9, order=3)
4. Area Normalization
```


---

(validation)=

## Parameter Constraints

### Automatic Validation

All methods include automatic parameter validation:

**Type Checking**:
```text
# Integer parameters converted if needed
window_length = "11" → 11 (converted)
polyorder = 3.0 → 3 (converted)

# Float parameters validated
lambda = "1e5" → 100000.0 (converted)
p = 0.01 (valid)
```

**Range Validation**:
```text
# Values clamped to valid ranges
lambda = 1e10 → 1e9 (max allowed)
window_length = 3 → 5 (min allowed)
p = -0.01 → 0.001 (min allowed)
```

**Logical Validation**:
```python
# Constraints enforced
if window_length <= polyorder:
    window_length = polyorder + 2
    
if window_length % 2 == 0:
    window_length += 1  # Make odd
```

---

## Best Practices

### General Guidelines

1. **Order Matters**:
   ```
   Correct: Spike removal → Baseline → Smoothing → Normalization
   Wrong: Normalization → Baseline (affects baseline estimation)
   ```

2. **Less is More**:
   - Use minimal necessary steps
   - Over-processing loses information
   - Validate each step visually

3. **Parameter Tuning**:
   - Start with defaults
   - Adjust based on visual inspection
   - Test on representative subset
   - Document final parameters

4. **Validation**:
   - Always preview before applying
   - Check multiple spectra
   - Compare before/after
   - Verify peak preservation

### Method-Specific Tips

**AsLS**:
- Start with λ=1e5, adjust if needed
- Lower p for sharper peaks
- Check that real peaks aren't removed

**Savitzky-Golay**:
- Window ≤ 11 for most cases
- polyorder = 3 is usually optimal
- Don't over-smooth

**Derivatives**:
- Always smooth before derivative
- 1st derivative: window ≥ 11
- 2nd derivative: window ≥ 15, very noisy

**Normalization**:
- Apply AFTER baseline and smoothing
- Vector norm: most common choice
- SNV: for scattering issues

---

## Troubleshooting

### Common Issues

| Issue            | Cause                  | Solution                        |
| ---------------- | ---------------------- | ------------------------------- |
| Peaks removed    | λ too high in AsLS     | Reduce λ to 1e4-1e5             |
| Baseline remains | λ too low              | Increase λ to 1e6-1e7           |
| Over-smoothed    | Window too large       | Reduce SavGol window to 7-9     |
| Noisy            | Insufficient smoothing | Increase window or sigma        |
| Negative values  | Normal after baseline  | Use area/vector norm            |
| Spikes remain    | Window too small       | Use median filter window=5-7    |
| Slow processing  | Too many iterations    | Reduce max_iter or increase tol |

---

## References

1. **Eilers & Boelens (2005)**: AsLS method
2. **Zhang et al. (2010)**: AirPLS method
3. **Savitzky & Golay (1964)**: SavGol filter
4. **Barnes et al. (1989)**: SNV normalization
5. **Geladi et al. (1985)**: MSC method
6. **Dieterle et al. (2006)**: PQN normalization
7. **Eilers (2003)**: Whittaker smoother
8. **Mallat (1989)**: Wavelet theory

See [References](../references.md) for complete citations.

---

## See Also

- [Preprocessing User Guide](../user-guide/preprocessing.md) - Step-by-step tutorials
- [Best Practices](../user-guide/best-practices.md) - Preprocessing strategies
- [FAQ - Preprocessing](../faq.md#preprocessing-questions) - Common questions
- [Glossary](../glossary.md) - Term definitions

---

**Total Methods Documented**: 40+
**Last Updated**: 2026-01-24