# Preprocessing Guide Complete guide to building and applying preprocessing pipelines for Raman spectroscopy data. ## Table of Contents - {ref}`Overview ` - {ref}`Pipeline Builder Interface ` - {ref}`Method Categories ` - {ref}`Building a Pipeline ` - {ref}`Common Pipelines ` - {ref}`Troubleshooting ` --- (preprocessing-overview)= ## Overview ### Why Preprocess? Raman spectra often contain noise and artifacts that can interfere with analysis: **Common Issues**: - 📈 **Baseline drift**: Fluorescence background - 📊 **Noise**: Random intensity fluctuations - 📉 **Intensity variations**: Sample-to-sample differences - ⚡ **Cosmic rays**: Sharp spikes from detector artifacts **Preprocessing Goals**: 1. Remove or reduce unwanted signal components 2. Enhance relevant spectral features 3. Normalize intensity variations 4. Prepare data for analysis or ML ### Preprocessing Philosophy **Best Practices**: - ✅ **Less is more**: Use minimal necessary steps - ✅ **Understand each step**: Know what each method does - ✅ **Validate effects**: Always preview before applying - ✅ **Document pipelines**: Save and share for reproducibility - ✅ **Test order**: Method sequence matters **Common Mistakes**: - ❌ Over-smoothing (loss of real features) - ❌ Over-normalization (loss of quantitative information) - ❌ Wrong method order (e.g., normalizing before baseline correction) - ❌ Not validating on test set - ❌ Using same parameters for all datasets --- (pipeline-builder-interface)= ## Pipeline Builder Interface ### Main Layout ![Preprocessing Page Layout](../../assets/screenshots/en/preprocessing-page.png) *Figure: Preprocessing page showing method selector (left), parameter panel (center), and preview plot (right), with pipeline builder at the bottom* > **Note**: The preprocessing page is divided into three main sections: > - **Left**: Method selector with categories (Baseline, Smoothing, Normalization, etc.) > - **Center**: Parameter configuration panel for the selected method > - **Right**: Real-time preview showing before/after comparison > - **Bottom**: Pipeline steps list with controls for each step (visibility, delete, reorder) ### Components **1. Method Selector (Left Panel)** - Browse 40+ preprocessing methods by category - Search by name or keyword - Hover for method description **2. Parameter Panel (Center)** - Configure selected method - Real-time validation - Tooltips explain each parameter - Reset to defaults button **3. Preview Panel (Right)** - Before/after comparison - Multiple spectra overlay - Zoom and pan - Statistics (SNR, baseline level) **4. Pipeline Steps (Bottom)** - Ordered list of methods - Toggle enable/disable: ☑/☐ - View parameters: 👁 - Delete step: 🗑 - Reorder: ⬆⬇ - Drag to reorder **5. Action Buttons** - **Clear**: Remove all steps - **Save Pipeline**: Export as JSON - **Load Pipeline**: Import from file - **Apply to Dataset**: Process all spectra --- (method-categories)= ## Method Categories ### 1. Baseline Correction **Purpose**: Remove fluorescence background and baseline drift #### AsLS (Asymmetric Least Squares) **Best for**: General baseline correction **Parameters**: - `lambda` (λ): Smoothness (1e2 - 1e9, default: 1e5) - Lower → Flexible baseline - Higher → Smooth baseline - `p`: Asymmetry (0.001 - 0.1, default: 0.01) - Lower → Fit peaks as baseline - Higher → Fit valleys as baseline **Example**: ```python # Conservative: λ=1e5, p=0.01 # Aggressive: λ=1e7, p=0.001 ``` #### AirPLS (Adaptive Iteratively Reweighted Penalized Least Squares) **Best for**: Automatic baseline estimation **Parameters**: - `lambda`: Smoothness (1e2 - 1e7, default: 1e5) - `porder`: Penalty order (1, 2, default: 1) - `max_iter`: Maximum iterations (10-100, default: 15) **When to use**: Unknown baseline shape, automatic processing #### Polynomial Baseline **Best for**: Simple polynomial baseline **Parameters**: - `degree`: Polynomial degree (1-10, default: 3) - 1 = linear - 2 = quadratic - 3-5 = most common **When to use**: Smooth, predictable baseline #### FABC (Fully Automatic Baseline Correction) **Best for**: Completely automatic correction **Parameters**: - `window_length`: Window size (100-500, default: 200) - `iterations`: Number of iterations (1-20, default: 10) **When to use**: No user tuning desired, batch processing ### 2. Smoothing and Denoising #### Savitzky-Golay Filter **Best for**: Noise reduction while preserving peaks **Parameters**: - `window_length`: Window size (5-51, odd numbers, default: 11) - Smaller → Less smoothing, more noise - Larger → More smoothing, peak broadening - `polyorder`: Polynomial order (2-5, default: 3) - 2 = quadratic - 3 = cubic (recommended) - `deriv`: Derivative order (0-2, default: 0) - 0 = smoothing only - 1 = first derivative - 2 = second derivative **Example**: ```python # Light smoothing: window=7, order=3 # Moderate: window=11, order=3 # Heavy: window=21, order=3 ``` #### Gaussian Smoothing **Best for**: Strong noise reduction **Parameters**: - `sigma`: Standard deviation (0.5-5.0, default: 2.0) - Lower → Less smoothing - Higher → More smoothing **When to use**: Very noisy data, when peak shape preservation is less critical #### Moving Average **Best for**: Simple, uniform smoothing **Parameters**: - `window_size`: Window size (3-21, default: 5) **When to use**: Quick smoothing, preliminary exploration #### Median Filter **Best for**: Removing spike noise (cosmic rays) **Parameters**: - `window_size`: Window size (3-11, odd numbers, default: 5) **When to use**: Detector artifacts, cosmic ray spikes ### 3. Normalization #### Vector Normalization (L2 Norm) **Best for**: Intensity variations, general use **Formula**: Divide each spectrum by its L2 norm ``` normalized = spectrum / sqrt(sum(spectrum^2)) ``` **When to use**: - Remove intensity differences - Classification tasks - Most common choice #### Min-Max Normalization **Best for**: Scaling to fixed range [0, 1] **Formula**: ``` normalized = (spectrum - min) / (max - min) ``` **When to use**: - Visualization - Neural networks - When you need bounded values #### Area Normalization **Best for**: Concentration normalization **Formula**: Divide by area under curve ``` normalized = spectrum / sum(abs(spectrum)) ``` **When to use**: - Comparing relative peak heights - Eliminating concentration effects #### SNV (Standard Normal Variate) **Best for**: Scattering correction **Formula**: ``` snv = (spectrum - mean) / std ``` **When to use**: - Solid samples with scattering - Removing multiplicative effects #### MSC (Multiplicative Scatter Correction) **Best for**: Light scattering correction **Requirements**: Reference spectrum (mean of all spectra) **When to use**: - Diffuse reflectance spectroscopy - Particle size effects ### 4. Derivatives #### First Derivative (Savitzky-Golay) **Best for**: Baseline removal, peak resolution **Effect**: - Removes baseline - Enhances peak differences - Converts peaks to positive/negative transitions **Parameters**: Same as Savitzky-Golay filter + `deriv=1` **When to use**: - Alternative to baseline correction - Improve peak separation - Chemometric analysis #### Second Derivative **Best for**: Peak sharpening, overlapping peaks **Effect**: - Further sharpens peaks - Converts peaks to negative dips - Enhances fine structure **Parameters**: Same as Savitzky-Golay filter + `deriv=2` **When to use**: - Overlapping peak resolution - Peak identification - Advanced spectroscopic analysis **Warning**: Amplifies noise significantly ### 5. Advanced Methods #### Convolutional Denoising Autoencoder (CDAE) **Best for**: Deep learning-based denoising **Requirements**: - GPU recommended - Training data with clean/noisy pairs - PyTorch installed **Parameters**: - `model_path`: Path to trained model - `batch_size`: Processing batch size (8-64, default: 32) **When to use**: - Complex noise patterns - After training on your specific data type #### Wavelet Transform **Best for**: Multi-resolution denoising **Parameters**: - `wavelet`: Wavelet type ('db4', 'sym8', default: 'db4') - `level`: Decomposition level (1-10, default: 4) - `threshold`: Threshold method ('soft', 'hard', default: 'soft') **When to use**: - Non-stationary noise - Multi-scale features #### Peak Ratio Feature Engineering **Best for**: Creating ratio features **Parameters**: - `peak1_range`: [start, end] wavenumber for peak 1 - `peak2_range`: [start, end] wavenumber for peak 2 **Output**: New feature = Peak1 / Peak2 **When to use**: - Known biomarker ratios - Reducing dimensionality - Creating interpretable features --- (building-a-pipeline)= ## Building a Pipeline ### Step-by-Step Guide **Example Task**: Preprocess blood plasma Raman spectra for classification #### Step 1: Start with Raw Data **Visualize**: 1. Import spectra 2. Plot overlay of all spectra 3. Identify issues: - High baseline? → Need baseline correction - Noisy? → Need smoothing - Varying intensities? → Need normalization #### Step 2: Add Baseline Correction **Choose Method**: ```python # For fluorescence background: AsLS (λ=1e5, p=0.01) # For automatic: AirPLS (λ=1e5) ``` **Add to Pipeline**: 1. Select "AsLS" from Baseline category 2. Set λ=1e5, p=0.01 3. Click **[Preview]** → Check effect 4. Adjust parameters if needed 5. Click **[Add to Pipeline]** **Preview Tips**: - Overlay before/after - Check multiple representative spectra - Ensure peaks are not removed - Baseline should be flat near zero #### Step 3: Add Smoothing (Optional) **When needed**: If spectra still noisy after baseline correction ```python # Moderate smoothing: Savitzky-Golay (window=11, polyorder=3, deriv=0) ``` **Add to Pipeline**: 1. Select "Savitzky-Golay" 2. Set window=11, polyorder=3 3. Preview and adjust 4. Add to pipeline **Warning**: Don't over-smooth! - Check peak widths - Compare to raw data - Ensure real features remain #### Step 4: Add Normalization **Why last?**: - Baseline correction first removes offsets - Smoothing reduces noise - Normalization as final step for intensity correction ```python # Most common: Vector Normalization (L2) ``` **Add to Pipeline**: 1. Select "Vector Normalization" 2. No parameters needed 3. Preview 4. Add to pipeline **Final Pipeline**: ``` 1. AsLS Baseline (λ=1e5, p=0.01) 2. Savitzky-Golay (window=11, order=3) 3. Vector Normalization ``` #### Step 5: Validate Pipeline **Check on Representative Spectra**: 1. Click each step's 👁 (eye) to see effect 2. Toggle steps on/off to compare 3. Verify on different groups 4. Check edge cases (weak signal, high noise) **Quality Metrics**: - SNR improvement - Peak preservation - Baseline flatness - Consistency across spectra #### Step 6: Save and Apply **Save Pipeline**: 1. Click **[Save Pipeline]** 2. Name: `blood_plasma_standard.json` 3. Add description 4. Save to pipeline library **Apply to Dataset**: 1. Select input data 2. Click **[Apply to Dataset]** 3. Progress bar shows processing 4. Processed spectra saved automatically --- (common-pipelines)= ## Common Pipelines ### 1. Standard Pipeline (General Purpose) ``` Pipeline: standard_preprocessing 1. AsLS Baseline (λ=1e5, p=0.01) 2. Savitzky-Golay Smoothing (window=11, order=3) 3. Vector Normalization Use Case: General Raman spectroscopy, classification tasks Pros: Robust, widely applicable, minimal parameter tuning Cons: May not be optimal for specific cases ``` ### 2. Minimal Pipeline (Low Noise Data) ``` Pipeline: minimal_preprocessing 1. Polynomial Baseline (degree=3) 2. Vector Normalization Use Case: High-quality spectra, minimal noise Pros: Fast, preserves maximum information Cons: Not suitable for noisy or complex baseline ``` ### 3. Aggressive Denoising (High Noise Data) ``` Pipeline: heavy_denoising 1. Median Filter (window=5) # Remove spikes 2. AirPLS Baseline (λ=1e6) 3. Gaussian Smoothing (σ=2.0) 4. SNV Normalization Use Case: Low signal-to-noise ratio, cosmic rays Pros: Maximum noise reduction Cons: Risk of over-smoothing, loss of fine features ``` ### 4. Derivative-Based Pipeline ``` Pipeline: derivative_based 1. AsLS Baseline (λ=1e5, p=0.01) 2. Savitzky-Golay 1st Derivative (window=11, order=3, deriv=1) 3. Vector Normalization Use Case: Baseline drift issues, overlapping peaks Pros: Removes baseline, enhances peak differences Cons: Amplifies noise, requires good smoothing ``` ### 5. Chemometric Pipeline (Quantitative Analysis) ``` Pipeline: chemometric 1. AsLS Baseline (λ=1e6, p=0.001) 2. MSC (Multiplicative Scatter Correction) 3. Savitzky-Golay Smoothing (window=9, order=3) 4. Area Normalization Use Case: Quantitative analysis, concentration prediction Pros: Corrects scattering, preserves peak ratios Cons: Requires reference spectrum for MSC ``` ### 6. Deep Learning Preprocessing ``` Pipeline: deep_learning_prep 1. Median Filter (window=3) # Remove outliers 2. AsLS Baseline (λ=1e5, p=0.01) 3. Min-Max Normalization [0, 1] Use Case: Neural network input Pros: Bounded values, simple normalization Cons: May need data augmentation ``` --- ## Advanced Tips ### Parameter Optimization **Grid Search for AsLS**: ```python # Test combinations: lambda_values = [1e3, 1e4, 1e5, 1e6, 1e7] p_values = [0.001, 0.01, 0.1] # For each combination: # 1. Apply preprocessing # 2. Run classification # 3. Select best performing parameters ``` **Optimization Tools**: - Use ML page → Hyperparameter Optimization - Include preprocessing parameters - Cross-validate on training set only ### Method Order Guidelines **General Rules**: 1. **Spike removal first**: Median filter 2. **Baseline correction next**: AsLS, AirPLS, etc. 3. **Smoothing**: Savitzky-Golay, Gaussian 4. **Derivatives** (if used): After smoothing 5. **Normalization last**: Vector, SNV, etc. **Why this order?**: - Spikes corrupt baseline estimation - Baseline affects normalization - Smoothing should be on baseline-corrected data - Normalization as final intensity correction ### Computational Efficiency **For large datasets (>1000 spectra)**: 1. **Disable real-time preview** during building 2. **Use batch processing**: Process in chunks 3. **Optimize parameters** on subset first 4. **Save pipelines** for reuse 5. **Consider GPU acceleration** for deep learning methods ### Validation Strategy **Split-Sample Validation**: ``` 1. Split data: 80% train, 20% test 2. Build pipeline on training set only 3. Apply same pipeline to test set 4. Evaluate on both sets 5. Ensure no overfitting to training data ``` **Cross-Validation**: ``` 1. Use K-fold CV on training set 2. Apply same pipeline to each fold 3. Average performance across folds 4. Final test on held-out test set ``` --- (preprocessing-troubleshooting)= ## Troubleshooting ### Problem: Peaks Disappear After Preprocessing **Causes**: - Over-smoothing - Aggressive baseline correction - Second derivative without smoothing **Solutions**: - Reduce Savitzky-Golay window size - Lower AsLS lambda parameter - Skip derivative methods - Validate on known peaks ### Problem: Baseline Still Present **Causes**: - Insufficient lambda in AsLS - Wrong asymmetry parameter - Polynomial degree too low **Solutions**: - Increase λ from 1e5 to 1e6 or 1e7 - Adjust p parameter (try 0.001 for higher baseline) - Use higher polynomial degree or different method - Try AirPLS for automatic correction ### Problem: Spectra Look Too Smooth **Causes**: - Window size too large - Multiple smoothing steps - Gaussian sigma too high **Solutions**: - Reduce window size - Remove redundant smoothing steps - Compare to raw data visually - Check if real peaks are lost ### Problem: Pipeline Slow to Apply **Causes**: - Large dataset - Computationally expensive methods (CDAE, wavelet) - Real-time preview enabled **Solutions**: - Disable preview during application - Process in batches - Use faster methods (polynomial instead of AsLS) - Enable parallel processing in settings ### Problem: Inconsistent Results **Causes**: - Different preprocessing for train/test - Parameter randomness (if applicable) - Data leakage (normalizing before splitting) **Solutions**: - Save and apply same pipeline to all data - Split data BEFORE preprocessing - Document exact pipeline used - Version control pipeline files --- ## See Also - [Analysis Methods Reference](../analysis-methods/preprocessing.md) - Detailed method documentation - [Data Import Guide](data-import.md) - Prepare data for preprocessing - [Analysis Guide](analysis.md) - Next step after preprocessing - [Best Practices](best-practices.md) - Preprocessing recommendations --- **Next**: [Analysis Guide](analysis.md) →