Functions API

Reference documentation for processing functions.


Data Loading Functions

functions/data_loader.py

load_spectra_from_csv()

Load Raman spectra from CSV file.

from functions.data_loader import load_spectra_from_csv

data = load_spectra_from_csv('data/samples.csv')

# Returns:
# {
#     'wavenumbers': np.array([400, 401, 402, ...]),
#     'spectra': np.array([[...], [...], ...]),
#     'labels': ['Sample1', 'Sample2', ...],
#     'groups': ['Control', 'Control', 'Treatment', ...],
#     'metadata': {
#         'acquisition_date': '2026-01-24',
#         'instrument': 'Renishaw inVia',
#         'laser_power': 50,
#         'exposure_time': 10
#     }
# }

Parameters:

  • filepath (str): Path to CSV file

  • has_header (bool, default=True): First row contains headers

  • wavenumber_column (int or str, default=0): Wavenumber column index/name

  • delimiter (str, default=','): Column delimiter

Returns: dict - Spectral data dictionary

CSV Format:

Wavenumber,Sample1,Sample2,Sample3
400.0,0.12,0.15,0.11
401.0,0.13,0.16,0.12
402.0,0.14,0.17,0.13
...

Raises:

  • FileNotFoundError: File doesn’t exist

  • ValueError: Invalid file format

  • DataError: Inconsistent data dimensions
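The CSV layout above maps onto the returned dictionary roughly as follows. This is a minimal sketch using only the standard library and NumPy, not the actual implementation of load_spectra_from_csv(). The helper name parse_spectra_csv is hypothetical, and the transpose assumes one CSV column per sample so that spectra ends up as (n_samples, n_features).

```python
import csv
import io

import numpy as np


def parse_spectra_csv(text, delimiter=","):
    """Hypothetical helper: parse a header row plus numeric rows into a dict."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    header, body = rows[0], rows[1:]
    # Reject ragged rows up front (the documented loader raises DataError here).
    if any(len(row) != len(header) for row in body):
        raise ValueError("inconsistent number of columns")
    table = np.array([[float(value) for value in row] for row in body])
    return {
        "wavenumbers": table[:, 0],  # first column is the wavenumber axis
        "spectra": table[:, 1:].T,   # one row per sample after transposing
        "labels": header[1:],        # sample names taken from the header row
    }


sample = (
    "Wavenumber,Sample1,Sample2,Sample3\n"
    "400.0,0.12,0.15,0.11\n"
    "401.0,0.13,0.16,0.12\n"
    "402.0,0.14,0.17,0.13\n"
)
data = parse_spectra_csv(sample)
```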

save_spectra_to_csv()

Save spectra to CSV file.

from functions.data_loader import save_spectra_to_csv

save_spectra_to_csv(
    data={
        'wavenumbers': wavenumbers,
        'spectra': spectra,
        'labels': labels
    },
    filepath='output/preprocessed.csv',
    include_metadata=True
)

Parameters:

  • data (dict): Spectral data dictionary

  • filepath (str): Output file path

  • include_metadata (bool, default=True): Include metadata as comments

  • delimiter (str, default=','): Column delimiter

  • float_format (str, default='%.6f'): Number formatting

Side Effects: Creates file at specified path

load_raman_peaks()

Load reference Raman peak database.

from functions.data_loader import load_raman_peaks

peaks = load_raman_peaks()

# Returns:
# {
#     'proteins': {
#         'phenylalanine': [1004, 1033],
#         'amide_I': [1650, 1680],
#         'amide_III': [1230, 1300]
#     },
#     'lipids': {
#         'CH2_stretch': [2850, 2900],
#         'CH2_bend': [1440, 1470]
#     },
#     'nucleic_acids': {
#         'DNA_backbone': [785, 810],
#         'RNA_bases': [810, 850]
#     }
# }

Returns: dict - Peak database by biomolecule category

Source: assets/data/raman_peaks.json


Preprocessing Functions

Baseline Correction

apply_asls()

Apply Asymmetric Least Squares (AsLS) baseline correction.

from functions.preprocess.baseline import apply_asls

corrected = apply_asls(
    spectrum=spectrum,
    lambda_=1e5,    # Smoothness
    p=0.01,         # Asymmetry
    max_iter=10
)

Parameters:

  • spectrum (np.ndarray): Input spectrum (1D array)

  • lambda_ (float): Smoothness parameter (1e2 - 1e9)

    • Lower → follows peaks more closely

    • Higher → smoother baseline

  • p (float): Asymmetry parameter (0.001 - 0.1)

    • Lower → fits valleys more closely

    • Higher → baseline goes through peaks

  • max_iter (int, default=10): Maximum iterations

Returns: np.ndarray - Baseline-corrected spectrum

Algorithm: Iteratively reweighted least squares with asymmetric weights

Reference: Eilers & Boelens (2005) Baseline Correction with Asymmetric Least Squares Smoothing
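The AsLS iteration can be sketched in a few lines of NumPy. This is an illustrative dense-matrix version for short spectra, not the code behind apply_asls(). The function name asls_baseline is hypothetical, and a production version would normally use sparse matrices.

```python
import numpy as np


def asls_baseline(y, lam=1e5, p=0.01, max_iter=10):
    """Hypothetical sketch: estimate a baseline by asymmetric least squares."""
    n = y.size
    # Second-difference operator, shape (n - 2, n); penalizes baseline curvature.
    D = np.diff(np.eye(n), n=2, axis=0)
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(max_iter):
        # Solve the weighted, penalized least-squares problem for the baseline z.
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # Points above the baseline (peaks) get the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


# Synthetic example: sloped baseline plus one Gaussian peak of height 5.
x = np.linspace(0.0, 1.0, 200)
true_baseline = 2.0 + 3.0 * x
y = true_baseline + 5.0 * np.exp(-((x - 0.5) / 0.02) ** 2)
z = asls_baseline(y)
corrected = y - z  # baseline-corrected spectrum
```

The function returns the baseline itself, so the caller subtracts it from the spectrum to get the corrected signal.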

apply_airpls()

Apply Adaptive Iteratively Reweighted Penalized Least Squares (airPLS).

from functions.preprocess.baseline import apply_airpls

corrected = apply_airpls(
    spectrum=spectrum,
    lambda_=100,
    porder=1,
    max_iter=15
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • lambda_ (float): Smoothness parameter (1 - 1e6)

  • porder (int): Difference order (1 or 2)

  • max_iter (int, default=15): Maximum iterations

Returns: np.ndarray - Baseline-corrected spectrum

Algorithm: Adaptive weighting based on residual signs

Reference: Zhang et al. (2010) Baseline correction using adaptive iteratively reweighted penalized least squares

apply_polynomial_baseline()

Apply polynomial baseline fitting.

from functions.preprocess.baseline import apply_polynomial_baseline

corrected = apply_polynomial_baseline(
    spectrum=spectrum,
    degree=3,
    mask_peaks=True,
    threshold=0.8
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • degree (int): Polynomial degree (1-10)

    • 1: Linear baseline

    • 2-3: Gentle curvature

    • 4-6: Complex baseline

    • 7+: Risk of overfitting

  • mask_peaks (bool, default=False): Exclude peaks from fit

  • threshold (float, default=0.8): Peak detection threshold

Returns: np.ndarray - Baseline-corrected spectrum

Algorithm: Least squares polynomial fitting

apply_whittaker_baseline()

Apply Whittaker smoothing baseline.

from functions.preprocess.baseline import apply_whittaker_baseline

corrected = apply_whittaker_baseline(
    spectrum=spectrum,
    lambda_=1000,
    differences=2
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • lambda_ (float): Smoothness parameter (1 - 1e6)

  • differences (int): Order of differences (1 or 2)

Returns: np.ndarray - Baseline-corrected spectrum

Algorithm: Penalized least squares with difference penalty

apply_fabc()

Apply Fully Automatic Baseline Correction (FABC).

from functions.preprocess.fabc_fixed import apply_fabc

corrected = apply_fabc(
    spectrum=spectrum,
    window_length=51,
    iterations=10
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • window_length (int): Smoothing window size (5-201, odd)

  • iterations (int, default=10): Number of iterations

Returns: np.ndarray - Baseline-corrected spectrum

Algorithm: Iterative morphological operations

apply_butterworth_filter()

Apply Butterworth high-pass filter for baseline removal.

from functions.preprocess.baseline import apply_butterworth_filter

corrected = apply_butterworth_filter(
    spectrum=spectrum,
    cutoff=0.01,
    order=4,
    sampling_rate=1.0
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • cutoff (float): Cutoff frequency (0.001 - 0.1)

  • order (int): Filter order (2-10)

  • sampling_rate (float): Sampling rate

Returns: np.ndarray - High-pass filtered spectrum

Algorithm: Butterworth high-pass filter (removes low-frequency baseline)

Smoothing

apply_savgol()

Apply Savitzky-Golay smoothing.

from functions.preprocess.kernel_denoise import apply_savgol

smoothed = apply_savgol(
    spectrum=spectrum,
    window_length=11,
    polyorder=3,
    deriv=0
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • window_length (int): Window size (5-51, odd, > polyorder)

    • Smaller → less smoothing, preserves peaks

    • Larger → more smoothing, may broaden peaks

  • polyorder (int): Polynomial order (2-5)

    • 2-3: Most common

    • 4-5: More flexible but risk of artifacts

  • deriv (int, default=0): Derivative order (0, 1, or 2)

    • 0: Smoothing only

    • 1: First derivative

    • 2: Second derivative

Returns: np.ndarray - Smoothed spectrum

Algorithm: Local polynomial regression within sliding window

Reference: Savitzky & Golay (1964) Smoothing and Differentiation of Data by Simplified Least Squares Procedures
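To make the local polynomial regression concrete, here is a NumPy-only sketch that derives the filter coefficients from a least-squares fit and applies them as a sliding window. It is not the implementation of apply_savgol(). The name savgol_sketch, the edge padding, and the unit sample spacing are assumptions made for illustration.

```python
from math import factorial

import numpy as np


def savgol_sketch(y, window_length=11, polyorder=3, deriv=0):
    """Hypothetical sketch of Savitzky-Golay filtering with unit sample spacing."""
    if window_length % 2 == 0 or window_length <= polyorder:
        raise ValueError("window_length must be odd and greater than polyorder")
    half = window_length // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Columns are 1, k, k^2, ...: the design matrix of the local polynomial fit.
    A = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse gives the fitted coefficient of k^deriv;
    # multiplying by deriv! turns it into the derivative at the window center.
    coeffs = np.linalg.pinv(A)[deriv] * factorial(deriv)
    padded = np.pad(y, half, mode="edge")  # repeat edge values at both ends
    return np.correlate(padded, coeffs, mode="valid")


grid = np.arange(50.0)
cubic = 0.01 * grid**3 - grid + 2.0
smoothed = savgol_sketch(cubic, window_length=11, polyorder=3)
slope = savgol_sketch(cubic, window_length=11, polyorder=3, deriv=1)
```

Because the test signal is itself a cubic, a polyorder=3 filter reproduces it (and its derivative) exactly away from the padded edges.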

apply_gaussian_filter()

Apply Gaussian smoothing.

from functions.preprocess.kernel_denoise import apply_gaussian_filter

smoothed = apply_gaussian_filter(
    spectrum=spectrum,
    sigma=2.0,
    mode='reflect'
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • sigma (float): Standard deviation (0.5 - 5.0)

    • Smaller → less smoothing

    • Larger → more smoothing

  • mode (str, default='reflect'): Edge handling mode

Returns: np.ndarray - Smoothed spectrum

Algorithm: Convolution with Gaussian kernel

apply_moving_average()

Apply moving average smoothing.

from functions.preprocess.kernel_denoise import apply_moving_average

smoothed = apply_moving_average(
    spectrum=spectrum,
    window_size=5,
    mode='same'
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • window_size (int): Window size (3-21, odd)

  • mode (str, default='same'): Output size mode

Returns: np.ndarray - Smoothed spectrum

Algorithm: Uniform weighting within sliding window

apply_median_filter()

Apply median filter for noise removal.

from functions.preprocess.kernel_denoise import apply_median_filter

filtered = apply_median_filter(
    spectrum=spectrum,
    kernel_size=5
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • kernel_size (int): Kernel size (3-15, odd)

Returns: np.ndarray - Filtered spectrum

Algorithm: Median value within sliding window (robust to outliers)

apply_kernel_denoise()

Apply adaptive kernel denoising.

from functions.preprocess.kernel_denoise import apply_kernel_denoise

denoised = apply_kernel_denoise(
    spectrum=spectrum,
    kernel_type='gaussian',
    bandwidth=2.0,
    iterations=1
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • kernel_type (str): Kernel type ('gaussian', 'epanechnikov', 'uniform')

  • bandwidth (float): Kernel bandwidth (0.5 - 5.0)

  • iterations (int, default=1): Number of passes

Returns: np.ndarray - Denoised spectrum

Algorithm: Kernel regression with adaptive bandwidth

Normalization

apply_vector_norm()

Apply vector (L2) normalization.

from functions.preprocess.normalization import apply_vector_norm

normalized = apply_vector_norm(spectrum)

# Spectrum now has unit L2 norm: np.isclose(np.linalg.norm(normalized), 1.0)

Parameters:

  • spectrum (np.ndarray): Input spectrum

Returns: np.ndarray - Normalized spectrum

Formula: \(x_{norm} = \frac{x}{\sqrt{\sum x_i^2}}\)

Use Case: Makes spectra comparable regardless of absolute intensity

apply_minmax_norm()

Apply Min-Max normalization.

from functions.preprocess.normalization import apply_minmax_norm

normalized = apply_minmax_norm(
    spectrum=spectrum,
    feature_range=(0, 1)
)

# Values now in [0, 1]: min=0, max=1

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • feature_range (tuple, default=(0, 1)): Target range (min, max)

Returns: np.ndarray - Normalized spectrum

Formula: \(x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \times (r_{max} - r_{min}) + r_{min}\), where \((r_{min}, r_{max})\) is feature_range

Use Case: Scale to specific range, preserves distribution shape
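A short sketch of min-max scaling to an arbitrary target range follows. The name minmax_sketch and the error raised for a constant spectrum are assumptions, and apply_minmax_norm() may handle that edge case differently.

```python
import numpy as np


def minmax_sketch(spectrum, feature_range=(0.0, 1.0)):
    """Hypothetical sketch: rescale a spectrum linearly into feature_range."""
    low, high = feature_range
    x_min, x_max = spectrum.min(), spectrum.max()
    if x_max == x_min:
        raise ValueError("a constant spectrum cannot be min-max scaled")
    unit = (spectrum - x_min) / (x_max - x_min)  # now in [0, 1]
    return unit * (high - low) + low             # stretch and shift to target


scaled = minmax_sketch(np.array([2.0, 4.0, 6.0]), feature_range=(-1.0, 1.0))
```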

apply_area_norm()

Apply area (sum) normalization.

from functions.preprocess.normalization import apply_area_norm

normalized = apply_area_norm(spectrum)

# Total area is now 1.0: np.isclose(np.sum(normalized), 1.0)

Parameters:

  • spectrum (np.ndarray): Input spectrum

Returns: np.ndarray - Normalized spectrum

Formula: \(x_{norm} = \frac{x}{\sum x_i}\)

Use Case: Normalize by total signal intensity

apply_snv()

Apply Standard Normal Variate (SNV) normalization.

from functions.preprocess.normalization import apply_snv

normalized = apply_snv(spectrum)

# Mean=0, Std=1: np.mean(normalized) ≈ 0, np.std(normalized) ≈ 1

Parameters:

  • spectrum (np.ndarray): Input spectrum

Returns: np.ndarray - Normalized spectrum

Formula: \(x_{norm} = \frac{x - \bar{x}}{\sigma_x}\)

Use Case: Remove multiplicative scatter effects, common in NIR/Raman
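The following sketch shows why SNV removes offset and gain differences: a spectrum and a scaled, shifted copy of it map to the same result. The name snv_sketch is hypothetical, and the population standard deviation (ddof=0) is an assumption chosen to match the np.std check above.

```python
import numpy as np


def snv_sketch(spectrum):
    """Hypothetical sketch: center to zero mean and scale to unit std."""
    return (spectrum - spectrum.mean()) / spectrum.std()


raw = np.array([1.0, 3.0, 2.0, 8.0, 4.0])
scattered = 3.0 * raw + 5.0  # same shape, different gain and offset
snv_raw = snv_sketch(raw)
snv_scattered = snv_sketch(scattered)
```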

apply_msc()

Apply Multiplicative Scatter Correction (MSC).

from functions.preprocess.normalization import apply_msc

# For single spectrum
normalized = apply_msc(
    spectrum=spectrum,
    reference=mean_spectrum
)

# For multiple spectra (calculates mean reference automatically)
from functions.preprocess.normalization import apply_msc_batch

normalized_spectra = apply_msc_batch(spectra)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • reference (np.ndarray): Reference spectrum (typically mean of all spectra)

Returns: np.ndarray - Normalized spectrum

Algorithm: Linear regression to reference, then correction

Formula:

  • Fit: \(x = a + b \cdot x_{ref}\)

  • Correct: \(x_{corr} = \frac{x - a}{b}\)

Use Case: Correct for scatter differences between samples
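The fit-then-correct steps above can be sketched with np.polyfit. This is an illustration under the stated formula, not the code of apply_msc(), and the name msc_sketch is hypothetical.

```python
import numpy as np


def msc_sketch(spectrum, reference):
    """Hypothetical sketch: regress spectrum on reference, then undo the fit."""
    # polyfit returns the highest power first: slope b, then intercept a.
    b, a = np.polyfit(reference, spectrum, deg=1)
    return (spectrum - a) / b


reference = np.linspace(1.0, 2.0, 50) ** 2
distorted = 0.5 + 2.0 * reference  # additive offset and multiplicative gain
restored = msc_sketch(distorted, reference)
```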

apply_quantile_norm()

Apply quantile normalization.

from functions.preprocess.advanced_normalization import apply_quantile_norm

normalized = apply_quantile_norm(
    spectra=spectra,  # 2D array: (n_samples, n_features)
    n_quantiles=1000
)

Parameters:

  • spectra (np.ndarray): Multiple spectra (2D array)

  • n_quantiles (int, default=1000): Number of quantiles

Returns: np.ndarray - Normalized spectra

Algorithm: Rank-based transformation to common distribution

Use Case: Make distributions identical across samples

apply_pqn()

Apply Probabilistic Quotient Normalization (PQN).

from functions.preprocess.advanced_normalization import apply_pqn

normalized = apply_pqn(
    spectrum=spectrum,
    reference=reference_spectrum
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • reference (np.ndarray): Reference spectrum

Returns: np.ndarray - Normalized spectrum

Algorithm: Compute point-wise quotients to the reference, then divide the spectrum by the median quotient

Use Case: Robust normalization for metabolomics/lipidomics
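A minimal sketch of the quotient step is shown below. The name pqn_sketch is hypothetical, and skipping zero-valued reference points is an assumption. Common PQN workflows also apply total-area normalization first, which is omitted here.

```python
import numpy as np


def pqn_sketch(spectrum, reference):
    """Hypothetical sketch: divide the spectrum by its median quotient."""
    valid = reference != 0  # avoid division by zero in the quotients
    quotients = spectrum[valid] / reference[valid]
    return spectrum / np.median(quotients)


reference = np.array([1.0, 2.0, 4.0, 8.0])
diluted = 0.5 * reference  # overall dilution by a factor of two
diluted[3] = 40.0          # one strongly changed band
normalized = pqn_sketch(diluted, reference)
```

Because the median ignores the single changed band, the unchanged bands are restored to the reference level.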

apply_rank_transform()

Apply rank transformation.

from functions.preprocess.advanced_normalization import apply_rank_transform

ranked = apply_rank_transform(spectrum)

# Values replaced by ranks: 1, 2, 3, ..., n

Parameters:

  • spectrum (np.ndarray): Input spectrum

Returns: np.ndarray - Rank-transformed spectrum

Use Case: Non-parametric transformation, robust to outliers

Derivatives

apply_first_derivative()

Calculate first derivative.

from functions.preprocess.derivatives import apply_first_derivative

deriv1 = apply_first_derivative(
    spectrum=spectrum,
    method='savgol',
    window_length=11,
    polyorder=3
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • method (str): Derivative method ('savgol', 'gradient', 'diff')

  • window_length (int): Window size for Savitzky-Golay

  • polyorder (int): Polynomial order for Savitzky-Golay

Returns: np.ndarray - First derivative

Use Case: Enhance peak resolution, remove baseline drift

apply_second_derivative()

Calculate second derivative.

from functions.preprocess.derivatives import apply_second_derivative

deriv2 = apply_second_derivative(
    spectrum=spectrum,
    method='savgol',
    window_length=11,
    polyorder=3
)

Parameters: Same as apply_first_derivative()

Returns: np.ndarray - Second derivative

Use Case: Sharpen overlapping peaks, enhance fine structure

Advanced Processing

apply_cdae()

Apply Convolutional Denoising Autoencoder (CDAE).

from functions.preprocess.deep_learning import apply_cdae

denoised = apply_cdae(
    spectrum=spectrum,
    model_path='models/cdae_raman.pth',
    batch_size=32,
    device='cuda'  # or 'cpu'
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • model_path (str): Path to trained CDAE model

  • batch_size (int, default=32): Batch size for inference

  • device (str, default='cpu'): Computation device

Returns: np.ndarray - Denoised spectrum

Algorithm: Deep learning-based denoising via autoencoder

Requirements: PyTorch, trained CDAE model

apply_background_subtraction()

Subtract background spectrum.

from functions.preprocess.background_subtraction import apply_background_subtraction

corrected = apply_background_subtraction(
    spectrum=spectrum,
    background=background_spectrum,
    method='direct',
    scale_factor=1.0
)

Parameters:

  • spectrum (np.ndarray): Sample spectrum

  • background (np.ndarray): Background spectrum

  • method (str): Subtraction method ('direct', 'scaled', 'weighted')

  • scale_factor (float, default=1.0): Background scaling factor

Returns: np.ndarray - Background-corrected spectrum

apply_wavelength_calibration()

Apply wavelength/wavenumber calibration.

from functions.preprocess.calibration import apply_wavelength_calibration

calibrated = apply_wavelength_calibration(
    wavenumbers=wavenumbers,
    reference_peaks=[1004, 1445, 1660],
    measured_peaks=[1002, 1443, 1658],
    method='linear'
)

Parameters:

  • wavenumbers (np.ndarray): Original wavenumber axis

  • reference_peaks (List[float]): Known reference peak positions

  • measured_peaks (List[float]): Measured peak positions

  • method (str): Calibration method ('linear', 'polynomial', 'spline')

Returns: np.ndarray - Calibrated wavenumber axis
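For the 'linear' method, one plausible reading is a straight-line fit that maps measured peak positions onto their reference positions and is then applied to the whole axis. The sketch below assumes that reading, and the name calibrate_linear is hypothetical.

```python
import numpy as np


def calibrate_linear(wavenumbers, reference_peaks, measured_peaks):
    """Hypothetical sketch: fit reference = slope * measured + intercept."""
    slope, intercept = np.polyfit(measured_peaks, reference_peaks, deg=1)
    return slope * wavenumbers + intercept


axis = np.arange(400.0, 1801.0, 100.0)
calibrated_axis = calibrate_linear(
    axis,
    reference_peaks=[1004, 1445, 1660],
    measured_peaks=[1002, 1443, 1658],
)
```

With the peak lists from the example above, every measured peak sits 2 cm⁻¹ below its reference, so the fitted correction is a uniform shift of +2.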

apply_peak_ratio()

Calculate peak intensity ratios.

from functions.preprocess.feature_engineering import apply_peak_ratio

ratio = apply_peak_ratio(
    spectrum=spectrum,
    wavenumbers=wavenumbers,
    band1_range=(1640, 1680),  # Amide I
    band2_range=(2840, 2900),  # CH2 stretch
    integration_method='trapz'
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • wavenumbers (np.ndarray): Wavenumber axis

  • band1_range (tuple): First band (min, max) wavenumbers

  • band2_range (tuple): Second band (min, max) wavenumbers

  • integration_method (str): Integration method ('trapz', 'simps', 'max')

Returns: float - Peak ratio (band1 / band2)

Use Case: Create discriminative features for classification
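The band-ratio calculation can be sketched with a manual trapezoidal rule. The names band_area and peak_ratio_sketch are hypothetical, and treating both band limits as inclusive is an assumption.

```python
import numpy as np


def band_area(spectrum, wavenumbers, band):
    """Hypothetical helper: trapezoidal area of the spectrum inside a band."""
    low, high = band
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    x, y = wavenumbers[mask], spectrum[mask]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def peak_ratio_sketch(spectrum, wavenumbers, band1_range, band2_range):
    """Ratio of integrated intensities, band1 / band2."""
    return band_area(spectrum, wavenumbers, band1_range) / band_area(
        spectrum, wavenumbers, band2_range
    )


wavenumbers = np.arange(400.0, 3001.0)  # 1 cm^-1 spacing
flat = np.ones_like(wavenumbers)        # flat spectrum: area equals band width
ratio = peak_ratio_sketch(flat, wavenumbers, (1640, 1680), (2840, 2900))
```

For a flat spectrum the ratio reduces to the ratio of band widths, 40 / 60.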

apply_wavelet_transform()

Apply wavelet transform for denoising or feature extraction.

from functions.preprocess.feature_engineering import apply_wavelet_transform

coeffs, denoised = apply_wavelet_transform(
    spectrum=spectrum,
    wavelet='db4',
    level=5,
    threshold_method='soft',
    threshold_value='auto'
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • wavelet (str): Wavelet type ('db4', 'sym5', 'coif3')

  • level (int): Decomposition level (1-10)

  • threshold_method (str): Thresholding ('soft', 'hard', 'garrote')

  • threshold_value (str or float): Threshold ('auto' or specific value)

Returns: Tuple[list, np.ndarray] - (wavelet_coefficients, reconstructed_signal)

Use Case: Multi-resolution denoising, feature extraction


Analysis Functions

Dimensionality Reduction

apply_pca()

Perform Principal Component Analysis.

from functions.ML.linear_regression import apply_pca

result = apply_pca(
    spectra=spectra,
    n_components=2,
    whiten=False,
    random_state=42
)

# Returns:
# {
#     'scores': np.array([[...], [...]]),  # PC scores
#     'loadings': np.array([[...], [...]]),  # PC loadings
#     'explained_variance': np.array([...]),
#     'explained_variance_ratio': np.array([...]),
#     'model': fitted_pca_model
# }

Parameters:

  • spectra (np.ndarray): Input spectra (n_samples, n_features)

  • n_components (int or float): Number of components or variance to retain

    • int: Specific number of PCs

    • float (0-1): Retain components explaining this variance fraction

  • whiten (bool, default=False): Whiten components (unit variance)

  • random_state (int, optional): Random seed for reproducibility

Returns: dict - PCA results

Algorithm: Singular Value Decomposition (SVD)

Use Case: Dimensionality reduction, visualization, noise reduction
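The SVD route to scores, loadings, and explained variance can be sketched as follows. This is a NumPy-only illustration, not the implementation of apply_pca(). The name pca_sketch is hypothetical, and whitening and float-valued n_components are left out.

```python
import numpy as np


def pca_sketch(spectra, n_components=2):
    """Hypothetical sketch: PCA of mean-centered spectra via SVD."""
    mean = spectra.mean(axis=0)
    U, S, Vt = np.linalg.svd(spectra - mean, full_matrices=False)
    variance = S**2 / (spectra.shape[0] - 1)  # variance along each component
    k = n_components
    return {
        "scores": U[:, :k] * S[:k],           # sample coordinates in PC space
        "loadings": Vt[:k],                   # one row per principal component
        "explained_variance": variance[:k],
        "explained_variance_ratio": variance[:k] / variance.sum(),
        "mean": mean,
    }


# Synthetic spectra built from two latent factors, so two PCs explain everything.
rng = np.random.default_rng(42)
spectra = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 50)) + 5.0
result = pca_sketch(spectra, n_components=2)
rebuilt = result["scores"] @ result["loadings"] + result["mean"]
```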

apply_umap()

Perform UMAP (Uniform Manifold Approximation and Projection).

from functions.ML.linear_regression import apply_umap

result = apply_umap(
    spectra=spectra,
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean',
    random_state=42
)

# Returns:
# {
#     'embedding': np.array([[...], [...]]),
#     'model': fitted_umap_model
# }

Parameters:

  • spectra (np.ndarray): Input spectra

  • n_neighbors (int, default=15): Neighborhood size (2-200)

  • min_dist (float, default=0.1): Minimum distance between points (0.0-0.99)

  • n_components (int, default=2): Embedding dimensions

  • metric (str, default='euclidean'): Distance metric

  • random_state (int, optional): Random seed

Returns: dict - UMAP results

Use Case: Non-linear dimensionality reduction; preserves local structure and typically more global structure than t-SNE

apply_tsne()

Perform t-SNE (t-Distributed Stochastic Neighbor Embedding).

from functions.ML.linear_regression import apply_tsne

result = apply_tsne(
    spectra=spectra,
    n_components=2,
    perplexity=30,
    learning_rate=200,
    n_iter=1000,
    random_state=42
)

# Returns:
# {
#     'embedding': np.array([[...], [...]]),
#     'model': fitted_tsne_model
# }

Parameters:

  • spectra (np.ndarray): Input spectra

  • n_components (int, default=2): Embedding dimensions

  • perplexity (float, default=30): Perplexity parameter (5-50)

  • learning_rate (float, default=200): Learning rate (10-1000)

  • n_iter (int, default=1000): Maximum iterations (250-5000)

  • random_state (int, optional): Random seed

Returns: dict - t-SNE results

Use Case: Visualization, preserves local structure (neighborhoods)

Clustering

apply_kmeans()

Perform K-means clustering.

from functions.ML.linear_regression import apply_kmeans

result = apply_kmeans(
    spectra=spectra,
    n_clusters=3,
    init='k-means++',
    n_init=10,
    max_iter=300,
    random_state=42
)

# Returns:
# {
#     'labels': np.array([0, 0, 1, 1, 2, 2, ...]),
#     'centers': np.array([[...], [...], [...]]),
#     'inertia': 123.45,
#     'model': fitted_kmeans_model
# }

Parameters:

  • spectra (np.ndarray): Input spectra

  • n_clusters (int): Number of clusters

  • init (str, default='k-means++'): Initialization method

  • n_init (int, default=10): Number of initializations

  • max_iter (int, default=300): Maximum iterations

  • random_state (int, optional): Random seed

Returns: dict - Clustering results

Use Case: Partition data into K groups

apply_hierarchical_clustering()

Perform hierarchical clustering.

from functions.ML.linear_regression import apply_hierarchical_clustering

result = apply_hierarchical_clustering(
    spectra=spectra,
    n_clusters=3,
    linkage='ward',
    metric='euclidean'
)

# Returns:
# {
#     'labels': np.array([0, 0, 1, 1, 2, 2, ...]),
#     'linkage_matrix': np.array([[...], [...]]),
#     'cophenetic_corr': 0.85,
#     'dendrogram': {...}
# }

Parameters:

  • spectra (np.ndarray): Input spectra

  • n_clusters (int, optional): Number of clusters (if None, returns full tree)

  • linkage (str, default='ward'): Linkage criterion ('ward', 'complete', 'average', 'single')

  • metric (str, default='euclidean'): Distance metric

Returns: dict - Clustering results with dendrogram

Use Case: Hierarchical grouping, visualize cluster relationships

apply_dbscan()

Perform DBSCAN (Density-Based Spatial Clustering).

from functions.ML.linear_regression import apply_dbscan

result = apply_dbscan(
    spectra=spectra,
    eps=0.5,
    min_samples=5,
    metric='euclidean'
)

# Returns:
# {
#     'labels': np.array([0, 0, -1, 1, 1, ...]),  # -1 = noise
#     'core_samples': np.array([True, True, False, ...]),
#     'n_clusters': 2,
#     'n_noise': 15,
#     'model': fitted_dbscan_model
# }

Parameters:

  • spectra (np.ndarray): Input spectra

  • eps (float): Maximum distance between neighbors (0.1-10.0)

  • min_samples (int): Minimum samples in neighborhood (3-20)

  • metric (str, default='euclidean'): Distance metric

Returns: dict - Clustering results

Use Case: Find arbitrary-shaped clusters, detect outliers

Statistical Tests

apply_ttest()

Perform t-test between two groups.

from functions.utils import apply_ttest

result = apply_ttest(
    group1=control_spectra,
    group2=treatment_spectra,
    equal_var=False,  # Welch's t-test
    alternative='two-sided'
)

# Returns:
# {
#     'statistic': np.array([...]),  # t-statistic per feature
#     'pvalue': np.array([...]),     # p-value per feature
#     'significant_features': np.array([100, 234, 567, ...]),
#     'effect_size': np.array([...])  # Cohen's d per feature
# }

Parameters:

  • group1 (np.ndarray): First group spectra (n_samples, n_features)

  • group2 (np.ndarray): Second group spectra

  • equal_var (bool, default=False): Assume equal variances

    • True: Student’s t-test

    • False: Welch’s t-test (recommended)

  • alternative (str, default='two-sided'): Alternative hypothesis ('two-sided', 'less', 'greater')

Returns: dict - Test results

Use Case: Compare means between two groups at each wavenumber
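The per-feature statistic and effect size can be sketched in NumPy as shown below. The name welch_t_and_cohens_d is hypothetical. P-values are omitted because they need the t-distribution with Welch-Satterthwaite degrees of freedom, which a statistics library would normally supply.

```python
import numpy as np


def welch_t_and_cohens_d(group1, group2):
    """Hypothetical sketch: Welch t-statistic and Cohen's d for each feature."""
    n1, n2 = group1.shape[0], group2.shape[0]
    mean_diff = group1.mean(axis=0) - group2.mean(axis=0)
    v1 = group1.var(axis=0, ddof=1)  # unbiased sample variances
    v2 = group2.var(axis=0, ddof=1)
    t_stat = mean_diff / np.sqrt(v1 / n1 + v2 / n2)  # unequal variances allowed
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return t_stat, mean_diff / pooled_sd


# Two features: the first differs between groups, the second has equal means.
control = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]])
treatment = np.array([[2.0, 10.0], [4.0, 12.0], [6.0, 14.0]])
t_stat, cohens_d = welch_t_and_cohens_d(control, treatment)
```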

apply_mannwhitneyu()

Perform Mann-Whitney U test (non-parametric t-test alternative).

from functions.utils import apply_mannwhitneyu

result = apply_mannwhitneyu(
    group1=control_spectra,
    group2=treatment_spectra,
    alternative='two-sided'
)

# Returns similar structure to apply_ttest()

Parameters:

  • group1 (np.ndarray): First group spectra

  • group2 (np.ndarray): Second group spectra

  • alternative (str, default='two-sided'): Alternative hypothesis

Returns: dict - Test results

Use Case: Compare distributions when data is non-normal

apply_anova()

Perform one-way ANOVA for multiple groups.

from functions.utils import apply_anova

result = apply_anova(
    *groups,  # Variable number of group arrays
    post_hoc='tukey'
)

# Example:
result = apply_anova(
    control_spectra,
    treatment_a_spectra,
    treatment_b_spectra,
    post_hoc='tukey'
)

# Returns:
# {
#     'f_statistic': np.array([...]),
#     'pvalue': np.array([...]),
#     'significant_features': np.array([...]),
#     'effect_size': np.array([...]),  # eta-squared
#     'post_hoc_results': {...}  # if post_hoc specified
# }

Parameters:

  • *groups (np.ndarray): Multiple group arrays

  • post_hoc (str, optional): Post-hoc test ('tukey', 'bonferroni', 'holm')

Returns: dict - ANOVA results

Use Case: Compare means across multiple groups

apply_kruskal()

Perform Kruskal-Wallis test (non-parametric ANOVA alternative).

from functions.utils import apply_kruskal

result = apply_kruskal(
    *groups,
    post_hoc='dunn'
)

Parameters:

  • *groups (np.ndarray): Multiple group arrays

  • post_hoc (str, optional): Post-hoc test ('dunn')

Returns: dict - Test results

Use Case: Compare distributions across multiple groups (non-parametric)

apply_correlation()

Calculate correlation matrix.

from functions.utils import apply_correlation

result = apply_correlation(
    spectra=spectra,
    method='pearson'
)

# Returns:
# {
#     'correlation_matrix': np.array([[1.0, 0.8, ...], [0.8, 1.0, ...], ...]),
#     'pvalue_matrix': np.array([[0.0, 0.001, ...], [0.001, 0.0, ...], ...])
# }

Parameters:

  • spectra (np.ndarray): Input spectra (n_samples, n_features)

  • method (str): Correlation method ('pearson', 'spearman', 'kendall')

Returns: dict - Correlation results

Use Case: Find relationships between samples or features

apply_multiple_testing_correction()

Apply correction for multiple hypothesis testing.

from functions.utils import apply_multiple_testing_correction

result = apply_multiple_testing_correction(
    pvalues=raw_pvalues,
    method='fdr_bh',
    alpha=0.05
)

# Returns:
# {
#     'corrected_pvalues': np.array([...]),
#     'reject': np.array([True, False, True, ...]),
#     'significant_count': 42
# }

Parameters:

  • pvalues (np.ndarray): Uncorrected p-values

  • method (str): Correction method

    • 'bonferroni': Most conservative

    • 'holm': Step-down Bonferroni

    • 'fdr_bh': Benjamini-Hochberg FDR (recommended)

    • 'fdr_by': Benjamini-Yekutieli FDR

  • alpha (float, default=0.05): Significance level

Returns: dict - Corrected results

Use Case: Control false positives when testing many hypotheses
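As an illustration of the 'fdr_bh' option, the Benjamini-Hochberg adjustment can be written in a few lines of NumPy. This is a sketch of the standard procedure, not the code of apply_multiple_testing_correction(). The name benjamini_hochberg is hypothetical, and the returned keys simply mirror the dictionary shown above.

```python
import numpy as np


def benjamini_hochberg(pvalues, alpha=0.05):
    """Hypothetical sketch: Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downwards, cap at 1.
    adjusted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    corrected = np.empty(n)
    corrected[order] = adjusted  # restore the original ordering
    reject = corrected <= alpha
    return {
        "corrected_pvalues": corrected,
        "reject": reject,
        "significant_count": int(reject.sum()),
    }


fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005], alpha=0.03)
```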


Machine Learning Functions

Model Training

train_svm()

Train Support Vector Machine classifier.

from functions.ML.svm import train_svm

result = train_svm(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    kernel='rbf',
    C=1.0,
    gamma='scale',
    cv=5
)

# Returns:
# {
#     'model': fitted_svm_model,
#     'train_score': 0.98,
#     'test_score': 0.92,
#     'cv_scores': np.array([0.90, 0.93, 0.91, 0.94, 0.89]),
#     'confusion_matrix': np.array([[45, 5], [3, 47]]),
#     'classification_report': {...},
#     'support_vectors': np.array([[...], ...])
# }

Parameters:

  • X_train (np.ndarray): Training features (n_samples, n_features)

  • y_train (np.ndarray): Training labels

  • X_test (np.ndarray, optional): Test features

  • y_test (np.ndarray, optional): Test labels

  • kernel (str, default='rbf'): Kernel type ('linear', 'rbf', 'poly', 'sigmoid')

  • C (float, default=1.0): Regularization parameter (0.1-100)

  • gamma (str or float, default='scale'): Kernel coefficient

  • cv (int, default=5): Cross-validation folds

Returns: dict - Training results

Use Case: Binary or multi-class classification with kernel trick

train_random_forest()

Train Random Forest classifier.

from functions.ML.random_forest import train_random_forest

result = train_random_forest(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    cv=5,
    random_state=42
)

# Returns similar structure to train_svm() plus:
# {
#     ...,
#     'feature_importances': np.array([...]),
#     'oob_score': 0.91  # if oob_score=True
# }

Parameters:

  • X_train, y_train, X_test, y_test: Data arrays

  • n_estimators (int, default=100): Number of trees (100-1000)

  • max_depth (int, optional): Maximum tree depth

  • min_samples_split (int, default=2): Minimum samples to split (2-20)

  • min_samples_leaf (int, default=1): Minimum samples per leaf

  • max_features (str or int, default='sqrt'): Features per split

  • cv (int, default=5): Cross-validation folds

  • random_state (int, optional): Random seed

Returns: dict - Training results

Use Case: Robust ensemble classifier, handles non-linearity

train_xgboost()

Train XGBoost classifier.

from functions.ML.xgboost import train_xgboost

result = train_xgboost(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    cv=5,
    early_stopping_rounds=10,
    random_state=42
)

# Returns similar structure plus:
# {
#     ...,
#     'feature_importances': {
#         'gain': np.array([...]),
#         'weight': np.array([...]),
#         'cover': np.array([...])
#     },
#     'best_iteration': 87
# }

Parameters:

  • X_train, y_train, X_test, y_test: Data arrays

  • n_estimators (int, default=100): Number of boosting rounds (100-1000)

  • learning_rate (float, default=0.1): Step size shrinkage (0.01-0.3)

  • max_depth (int, default=6): Tree depth (3-10)

  • subsample (float, default=0.8): Sample fraction (0.5-1.0)

  • colsample_bytree (float, default=0.8): Feature fraction (0.5-1.0)

  • gamma (float, default=0): Minimum split loss (0-10)

  • reg_lambda (float, default=1): L2 regularization (0-10)

  • reg_alpha (float, default=0): L1 regularization (0-10)

  • cv (int, default=5): Cross-validation folds

  • early_stopping_rounds (int, optional): Stop if no improvement

  • random_state (int, optional): Random seed

Returns: dict - Training results

Use Case: Gradient-boosted tree classifier with built-in L1/L2 regularization and early stopping

train_logistic_regression()

Train Logistic Regression classifier.

from functions.ML.logistic_regression import train_logistic_regression

result = train_logistic_regression(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    C=1.0,
    penalty='l2',
    solver='lbfgs',
    cv=5,
    random_state=42
)

# Returns similar structure plus:
# {
#     ...,
#     'coefficients': np.array([[...], ...]),
#     'intercept': np.array([...]),
#     'odds_ratios': np.array([...])
# }

Parameters:

  • X_train, y_train, X_test, y_test: Data arrays

  • C (float, default=1.0): Inverse regularization strength (0.01-100)

  • penalty (str, default='l2'): Regularization type ('l1', 'l2', 'elasticnet', 'none')

  • solver (str, default='lbfgs'): Optimization algorithm

  • max_iter (int, default=100): Maximum iterations (100-10000)

  • cv (int, default=5): Cross-validation folds

  • random_state (int, optional): Random seed

Returns: dict - Training results

Use Case: Interpretable linear classifier with probabilities

train_mlp()

Train Multi-Layer Perceptron (Neural Network) classifier.

from functions.ML.logistic_regression import train_mlp

result = train_mlp(
    X_train=X_train,
    y_train=y_train,
    X_test=X_test,
    y_test=y_test,
    hidden_layer_sizes=(100, 50),
    activation='relu',
    alpha=0.0001,
    learning_rate_init=0.001,
    max_iter=200,
    early_stopping=True,
    cv=5,
    random_state=42
)

# Returns similar structure plus:
# {
#     ...,
#     'loss_curve': [0.8, 0.6, 0.5, 0.4, ...],
#     'best_loss': 0.35
# }

Parameters:

  • X_train, y_train, X_test, y_test: Data arrays

  • hidden_layer_sizes (tuple, default=(100,)): Neurons per hidden layer

  • activation (str, default=‘relu’): Activation function (‘relu’, ‘tanh’, ‘logistic’)

  • alpha (float, default=0.0001): L2 regularization (0.0001-0.01)

  • learning_rate_init (float, default=0.001): Initial learning rate (0.0001-0.01)

  • max_iter (int, default=200): Maximum epochs (200-1000)

  • early_stopping (bool, default=True): Stop if validation score doesn’t improve

  • cv (int, default=5): Cross-validation folds

  • random_state (int, optional): Random seed

Returns: dict - Training results

Use Case: Flexible non-linear classifier (feed-forward neural network)

Model Evaluation

evaluate_model()

Comprehensive model evaluation.

from functions.ML.utils import evaluate_model

metrics = evaluate_model(
    model=trained_model,
    X_test=X_test,
    y_test=y_test,
    average='weighted'
)

# Returns:
# {
#     'accuracy': 0.92,
#     'precision': 0.92,
#     'recall': 0.92,
#     'f1_score': 0.92,
#     'roc_auc': 0.95,
#     'confusion_matrix': np.array([[45, 5], [3, 47]]),
#     'classification_report': {
#         'Control': {'precision': 0.94, 'recall': 0.90, 'f1-score': 0.92},
#         'Treatment': {'precision': 0.90, 'recall': 0.94, 'f1-score': 0.92}
#     }
# }

Parameters:

  • model: Trained scikit-learn model

  • X_test (np.ndarray): Test features

  • y_test (np.ndarray): True labels

  • average (str, default=‘weighted’): Averaging method for multi-class

Returns: dict - Evaluation metrics
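The per-class and weighted metrics can be derived directly from the confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes (the scikit-learn convention); it is not the internals of evaluate_model.

```python
import numpy as np

# Confusion matrix from the example above: rows are true classes,
# columns are predicted classes (Control, Treatment).
cm = np.array([[45, 5],
               [3, 47]])

true_pos = np.diag(cm).astype(float)
precision = true_pos / cm.sum(axis=0)   # per predicted class
recall = true_pos / cm.sum(axis=1)      # per true class
f1 = 2 * precision * recall / (precision + recall)

support = cm.sum(axis=1)                # samples per true class
weights = support / support.sum()

accuracy = true_pos.sum() / cm.sum()
weighted_precision = float(np.sum(weights * precision))
weighted_recall = float(np.sum(weights * recall))
weighted_f1 = float(np.sum(weights * f1))

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

With average='weighted', recall is weighted by class support, so it always equals overall accuracy.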

plot_confusion_matrix()

Plot confusion matrix heatmap.

from functions.visualization.plots import plot_confusion_matrix

fig = plot_confusion_matrix(
    y_true=y_test,
    y_pred=predictions,
    class_names=['Control', 'Treatment'],
    normalize=True,
    cmap='Blues'
)

Parameters:

  • y_true (np.ndarray): True labels

  • y_pred (np.ndarray): Predicted labels

  • class_names (List[str], optional): Class labels

  • normalize (bool, default=False): Normalize by row (true labels)

  • cmap (str, default=‘Blues’): Colormap

Returns: matplotlib.figure.Figure

plot_roc_curve()

Plot ROC curve.

from functions.visualization.plots import plot_roc_curve

fig = plot_roc_curve(
    y_true=y_test,
    y_scores=y_proba,
    class_names=['Control', 'Treatment']
)

Parameters:

  • y_true (np.ndarray): True labels

  • y_scores (np.ndarray): Predicted probabilities

  • class_names (List[str], optional): Class labels

Returns: matplotlib.figure.Figure

Displays: ROC curve with AUC score
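For a binary problem, the AUC shown on the plot equals the fraction of (positive, negative) sample pairs in which the positive sample gets the higher score. A small sketch of that definition; roc_auc_binary is a hypothetical helper, not part of this package:

```python
import numpy as np

def roc_auc_binary(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true).astype(bool)
    y_scores = np.asarray(y_scores, dtype=float)
    pos = y_scores[y_true]
    neg = y_scores[~y_true]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one sample from each class")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)  # ties count half
    return float(wins / (pos.size * neg.size))

auc = roc_auc_binary([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The pairwise comparison uses memory proportional to n_positive × n_negative, so it suits small test sets only.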

plot_learning_curve()

Plot learning curve (train/validation score vs. training size).

from functions.visualization.plots import plot_learning_curve

fig = plot_learning_curve(
    model=model,
    X=X_train,
    y=y_train,
    cv=5,
    scoring='accuracy'
)

Parameters:

  • model: Scikit-learn estimator

  • X (np.ndarray): Training features

  • y (np.ndarray): Training labels

  • cv (int, default=5): Cross-validation folds

  • scoring (str, default=‘accuracy’): Metric to plot

Returns: matplotlib.figure.Figure

Use Case: Diagnose overfitting/underfitting

Feature Importance

get_feature_importance()

Extract feature importance from model.

from functions.ML.utils import get_feature_importance

importance = get_feature_importance(
    model=trained_model,
    method='default',
    wavenumbers=wavenumbers
)

# Returns:
# {
#     'importance': np.array([...]),
#     'feature_indices': np.array([...]),
#     'wavenumbers': np.array([...]),
#     'top_features': {
#         'indices': [234, 567, 890],
#         'wavenumbers': [1004, 1445, 1660],
#         'importance': [0.15, 0.12, 0.10]
#     }
# }

Parameters:

  • model: Trained model

  • method (str, default=‘default’): Importance method

    • 'default': Use model’s native importance

    • 'permutation': Permutation importance (model-agnostic)

    • 'shap': SHAP values (requires shap library)

  • wavenumbers (np.ndarray, optional): Wavenumber axis

Returns: dict - Feature importance results
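The 'top_features' entry can be read as the result of sorting the importance array in descending order. A sketch under that assumption; top_features here is a hypothetical helper and the values are illustrative:

```python
import numpy as np

def top_features(importance, wavenumbers, top_n=3):
    """Return the top_n most important features, highest first."""
    importance = np.asarray(importance, dtype=float)
    wavenumbers = np.asarray(wavenumbers)
    # argsort is ascending, so reverse it before taking the first top_n.
    order = np.argsort(importance)[::-1][:top_n]
    return {
        'indices': order.tolist(),
        'wavenumbers': wavenumbers[order].tolist(),
        'importance': importance[order].tolist(),
    }

importance = np.array([0.02, 0.15, 0.01, 0.12, 0.10])
wavenumbers = np.array([800, 1004, 1200, 1445, 1660])
top = top_features(importance, wavenumbers)
print(top['wavenumbers'])  # [1004, 1445, 1660]
```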

plot_feature_importance()

Plot feature importance as bar plot or spectrum overlay.

from functions.visualization.plots import plot_feature_importance

fig = plot_feature_importance(
    importance=importance_array,
    wavenumbers=wavenumbers,
    top_n=20,
    plot_type='spectrum'
)

Parameters:

  • importance (np.ndarray): Feature importance values

  • wavenumbers (np.ndarray): Wavenumber axis

  • top_n (int, default=20): Number of top features to highlight

  • plot_type (str, default=‘bar’): Plot type (‘bar’, ‘spectrum’, ‘both’)

Returns: matplotlib.figure.Figure

calculate_permutation_importance()

Calculate permutation feature importance (model-agnostic).

from functions.ML.utils import calculate_permutation_importance

importance = calculate_permutation_importance(
    model=trained_model,
    X=X_test,
    y=y_test,
    n_repeats=10,
    random_state=42,
    scoring='accuracy'
)

# Returns:
# {
#     'importances_mean': np.array([...]),
#     'importances_std': np.array([...]),
#     'importances': np.array([[...], [...], ...])  # shape: (n_features, n_repeats)
# }

Parameters:

  • model: Trained model

  • X (np.ndarray): Test features

  • y (np.ndarray): True labels

  • n_repeats (int, default=10): Number of times each feature is shuffled

  • random_state (int, optional): Random seed

  • scoring (str or callable, default=‘accuracy’): Metric

Returns: dict - Permutation importance results

Algorithm: Shuffle each feature and measure score drop
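The shuffle-and-score idea can be sketched in a few lines of NumPy. This is an illustration, not the package's implementation: it takes a plain predict callable instead of a model object and always scores by accuracy.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=10, random_state=None):
    """Mean and std of the accuracy drop when each feature is shuffled."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    baseline = np.mean(predict(X) == y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link to the labels.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - np.mean(predict(X_perm) == y)
    return {
        'importances_mean': drops.mean(axis=1),
        'importances_std': drops.std(axis=1),
        'importances': drops,  # shape: (n_features, n_repeats)
    }

# Toy model (hypothetical): the class depends only on feature 0.
def predict(X):
    return (X[:, 0] > 0).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = predict(X)
result = permutation_importance_sketch(predict, X, y, n_repeats=5, random_state=42)
print(result['importances_mean'])
```

Only feature 0 should show a non-zero drop here, because the toy model ignores the other two columns.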


Utility Functions

functions/utils.py

validate_spectra_data()

Validate spectral data structure.

from functions.utils import validate_spectra_data

is_valid, errors = validate_spectra_data(data)

if not is_valid:
    for error in errors:
        print(f"Validation error: {error}")

Parameters:

  • data (dict): Data dictionary

Returns: Tuple[bool, List[str]] - (is_valid, error_messages)

Checks:

  • Required keys present

  • Correct data types

  • Consistent dimensions

  • No NaN/Inf values

  • Valid wavenumber range
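A rough sketch of how checks like these can be combined into an (is_valid, errors) result. The required keys and the reading of "valid wavenumber range" as a strictly increasing axis are assumptions, and validate_sketch is a hypothetical stand-in for validate_spectra_data:

```python
import numpy as np

def validate_sketch(data):
    """Return (is_valid, errors) for a spectral data dictionary."""
    errors = []
    for key in ('wavenumbers', 'spectra'):  # assumed required keys
        if key not in data:
            errors.append(f"Missing required key: {key}")
    if errors:
        return False, errors

    wavenumbers = np.asarray(data['wavenumbers'], dtype=float)
    spectra = np.asarray(data['spectra'], dtype=float)

    if spectra.ndim != 2:
        errors.append("spectra must be 2-D (n_samples, n_features)")
    elif spectra.shape[1] != wavenumbers.size:
        errors.append("spectra and wavenumbers have different lengths")
    if not np.all(np.isfinite(spectra)):
        errors.append("spectra contain NaN or Inf values")
    # Assumed meaning of a valid wavenumber range: strictly increasing axis.
    if wavenumbers.size > 1 and not np.all(np.diff(wavenumbers) > 0):
        errors.append("wavenumbers must be strictly increasing")
    return len(errors) == 0, errors

bad = {'wavenumbers': [400, 401, 402], 'spectra': [[0.1, float('nan')]]}
is_valid, errors = validate_sketch(bad)
for error in errors:
    print(f"Validation error: {error}")
```

Collecting all errors instead of stopping at the first one lets the caller report every problem in a single pass.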

generate_mock_spectra()

Generate synthetic spectra for testing.

from functions.utils import generate_mock_spectra

data = generate_mock_spectra(
    n_samples=100,
    n_features=1000,
    n_groups=3,
    noise_level=0.05,
    random_state=42
)

# Returns standard data dictionary

Parameters:

  • n_samples (int): Number of spectra

  • n_features (int): Number of wavenumbers

  • n_groups (int): Number of groups

  • noise_level (float): Noise standard deviation (0-1)

  • random_state (int, optional): Random seed

Returns: dict - Synthetic spectral data

Use Case: Testing, prototyping, demonstrations

calculate_snr()

Calculate Signal-to-Noise Ratio.

from functions.utils import calculate_snr

snr = calculate_snr(
    spectrum=spectrum,
    signal_range=(1640, 1680),  # Amide I region
    noise_range=(1800, 2000)    # Baseline region
)

print(f"SNR: {snr:.2f} dB")

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • signal_range (tuple): (min, max) wavenumbers for signal

  • noise_range (tuple): (min, max) wavenumbers for noise estimation

Returns: float - SNR in decibels (dB)

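Formula: \(\mathrm{SNR} = 20 \log_{10}\left(\frac{\mu_{\text{signal}}}{\sigma_{\text{noise}}}\right)\), where \(\mu_{\text{signal}}\) is the mean intensity in signal_range and \(\sigma_{\text{noise}}\) is the standard deviation of the intensity in noise_range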
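The formula can be sketched as follows, taking the signal term as the mean intensity inside signal_range and the noise term as the standard deviation inside noise_range. Unlike calculate_snr, this hypothetical snr_db takes the wavenumber axis explicitly so the ranges can be mapped to array indices.

```python
import numpy as np

def snr_db(spectrum, wavenumbers, signal_range, noise_range):
    """SNR in dB: 20 * log10(mean(signal) / std(noise))."""
    spectrum = np.asarray(spectrum, dtype=float)
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    in_signal = (wavenumbers >= signal_range[0]) & (wavenumbers <= signal_range[1])
    in_noise = (wavenumbers >= noise_range[0]) & (wavenumbers <= noise_range[1])
    mu_signal = spectrum[in_signal].mean()
    sigma_noise = spectrum[in_noise].std()
    return float(20 * np.log10(mu_signal / sigma_noise))

# Synthetic spectrum: flat band of height 1.0 plus alternating +/-0.01 noise.
wavenumbers = np.arange(1600.0, 2001.0)
spectrum = np.zeros_like(wavenumbers)
spectrum[(wavenumbers >= 1640) & (wavenumbers <= 1680)] = 1.0
noise_mask = (wavenumbers >= 1800) & (wavenumbers <= 1999)
signs = np.where(np.arange(noise_mask.sum()) % 2 == 0, 1.0, -1.0)
spectrum[noise_mask] = 0.01 * signs

snr = snr_db(spectrum, wavenumbers, (1640, 1680), (1800, 1999))
print(f"SNR: {snr:.2f} dB")  # close to 40 dB (signal mean 1.0, noise std 0.01)
```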

find_peaks()

Find peaks in spectrum.

from functions.utils import find_peaks

peaks = find_peaks(
    spectrum=spectrum,
    wavenumbers=wavenumbers,
    prominence=0.05,
    distance=10,
    width=5
)

# Returns:
# {
#     'peak_indices': np.array([...]),
#     'peak_wavenumbers': np.array([...]),
#     'peak_intensities': np.array([...]),
#     'properties': {
#         'prominences': np.array([...]),
#         'widths': np.array([...]),
#         'left_bases': np.array([...]),
#         'right_bases': np.array([...])
#     }
# }

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • wavenumbers (np.ndarray): Wavenumber axis

  • prominence (float, optional): Minimum peak prominence

  • distance (int, optional): Minimum distance between peaks

  • width (int or tuple, optional): Required peak width

  • height (float or tuple, optional): Required peak height

Returns: dict - Peak information

Use Case: Peak identification, peak tracking

integrate_region()

Integrate spectrum over wavenumber range.

from functions.utils import integrate_region

area = integrate_region(
    spectrum=spectrum,
    wavenumbers=wavenumbers,
    region=(1640, 1680),
    method='trapz'
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • wavenumbers (np.ndarray): Wavenumber axis

  • region (tuple): (min, max) wavenumbers

  • method (str, default=‘trapz’): Integration method (‘trapz’, ‘simps’)

Returns: float - Integrated area

Use Case: Quantification, peak ratios
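For reference, trapezoidal integration over a wavenumber window can be written directly in NumPy. This is a sketch of the 'trapz' idea only; integrate_trapezoid is a hypothetical name, and the inclusive handling of the region boundaries is an assumption.

```python
import numpy as np

def integrate_trapezoid(spectrum, wavenumbers, region):
    """Area under the spectrum between region[0] and region[1] (inclusive)."""
    spectrum = np.asarray(spectrum, dtype=float)
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    low, high = region
    mask = (wavenumbers >= low) & (wavenumbers <= high)
    x, y = wavenumbers[mask], spectrum[mask]
    # Trapezoidal rule: mean of neighbouring heights times the spacing.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

wavenumbers = np.arange(1600.0, 1701.0)
spectrum = np.full_like(wavenumbers, 2.0)
area = integrate_trapezoid(spectrum, wavenumbers, (1640, 1680))
print(area)  # 80.0
```

A peak ratio is then the quotient of two such areas computed over different regions.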

resample_spectrum()

Resample spectrum to new wavenumber axis.

from functions.utils import resample_spectrum

resampled = resample_spectrum(
    spectrum=spectrum,
    old_wavenumbers=old_wn,
    new_wavenumbers=new_wn,
    method='linear'
)

Parameters:

  • spectrum (np.ndarray): Input spectrum

  • old_wavenumbers (np.ndarray): Original wavenumber axis

  • new_wavenumbers (np.ndarray): Target wavenumber axis

  • method (str, default=‘linear’): Interpolation method (‘linear’, ‘cubic’, ‘spline’)

Returns: np.ndarray - Resampled spectrum

Use Case: Align spectra with different wavenumber axes
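Linear resampling corresponds to what np.interp does. A minimal illustration with made-up values; whether resample_spectrum uses np.interp internally is not stated here.

```python
import numpy as np

old_wn = np.array([400.0, 402.0, 404.0])
spectrum = np.array([1.0, 3.0, 5.0])
new_wn = np.array([400.0, 401.0, 402.0, 403.0, 404.0])

# np.interp expects the original axis (old_wn) in increasing order.
resampled = np.interp(new_wn, old_wn, spectrum)
print(resampled)
```

np.interp clamps points outside the original range to the edge values, so it helps to keep new_wn within the range covered by old_wn.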

split_train_test()

Split data into train/test sets with stratification.

from functions.utils import split_train_test

X_train, X_test, y_train, y_test = split_train_test(
    X=spectra,
    y=labels,
    test_size=0.2,
    stratify=True,
    random_state=42
)

Parameters:

  • X (np.ndarray): Features

  • y (np.ndarray): Labels

  • test_size (float, default=0.2): Test set fraction

  • stratify (bool, default=True): Preserve class distribution

  • random_state (int, optional): Random seed

Returns: Tuple[np.ndarray, …] - X_train, X_test, y_train, y_test
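Stratification means the test fraction is drawn separately from each class, so class proportions are preserved in both sets. A NumPy-only sketch of the idea; stratified_split is a hypothetical helper, not the implementation of split_train_test:

```python
import numpy as np

def stratified_split(X, y, test_size=0.2, random_state=None):
    """Draw test_size of each class into the test set."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    y = np.asarray(y)
    test_idx = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        # Keep at least one test sample per class.
        n_test = max(1, int(round(test_size * idx.size)))
        test_idx.extend(idx[:n_test].tolist())
    test_mask = np.zeros(y.size, dtype=bool)
    test_mask[test_idx] = True
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]

X = np.arange(40).reshape(20, 2)
y = np.array(['Control'] * 10 + ['Treatment'] * 10)
X_train, X_test, y_train, y_test = stratified_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (16, 2) (4, 2)
```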

export_pipeline()

Export preprocessing pipeline to JSON.

from functions.utils import export_pipeline

export_pipeline(
    pipeline=[
        {'method': 'asls', 'params': {'lambda': 1e5, 'p': 0.01}},
        {'method': 'savgol', 'params': {'window_length': 11, 'polyorder': 3}},
        {'method': 'vector_norm', 'params': {}}
    ],
    filepath='pipelines/my_pipeline.json',
    metadata={
        'description': 'Standard preprocessing for tissue samples',
        'author': 'Researcher Name',
        'created': '2026-01-24'
    }
)

Parameters:

  • pipeline (List[dict]): Pipeline steps

  • filepath (str): Output file path

  • metadata (dict, optional): Additional metadata

Side Effects: Creates JSON file

load_pipeline()

Load preprocessing pipeline from JSON.

from functions.utils import load_pipeline

pipeline, metadata = load_pipeline('pipelines/my_pipeline.json')

# pipeline: [{'method': 'asls', 'params': {...}}, ...]
# metadata: {'description': '...', 'author': '...', 'created': '...'}

Parameters:

  • filepath (str): Pipeline file path

Returns: Tuple[List[dict], dict] - (pipeline_steps, metadata)
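An export/load round trip can be sketched with the standard json module. The file layout used here (top-level 'pipeline' and 'metadata' keys) is an assumption, and export_sketch and load_sketch are hypothetical stand-ins for export_pipeline and load_pipeline.

```python
import json
import os
import tempfile

def export_sketch(pipeline, filepath, metadata=None):
    """Write pipeline steps and metadata to a JSON file."""
    payload = {'pipeline': pipeline, 'metadata': metadata or {}}
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(payload, f, indent=2)

def load_sketch(filepath):
    """Read back (pipeline_steps, metadata)."""
    with open(filepath, 'r', encoding='utf-8') as f:
        payload = json.load(f)
    return payload['pipeline'], payload.get('metadata', {})

steps = [
    {'method': 'asls', 'params': {'lambda': 1e5, 'p': 0.01}},
    {'method': 'vector_norm', 'params': {}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'my_pipeline.json')
    export_sketch(steps, path, metadata={'author': 'Researcher Name'})
    pipeline, metadata = load_sketch(path)

print(pipeline == steps)  # True
```

JSON can only hold plain types, so NumPy arrays or callables in 'params' would need converting before export.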


Pipeline Execution

execute_preprocessing_pipeline()

Execute complete preprocessing pipeline.

from functions.preprocess.pipeline import execute_preprocessing_pipeline

result = execute_preprocessing_pipeline(
    spectra=raw_spectra,
    pipeline=[
        {'method': 'asls', 'params': {'lambda': 1e5, 'p': 0.01}},
        {'method': 'savgol', 'params': {'window_length': 11, 'polyorder': 3}},
        {'method': 'vector_norm', 'params': {}}
    ],
    progress_callback=lambda i, total, msg: print(f"{i}/{total}: {msg}")
)

# Returns:
# {
#     'preprocessed_spectra': np.array([[...], [...]]),
#     'success': True,
#     'execution_time': 2.34,
#     'steps_applied': ['asls', 'savgol', 'vector_norm'],
#     'intermediate_results': {...}  # if return_intermediate=True
# }

Parameters:

  • spectra (np.ndarray): Input spectra (n_samples, n_features)

  • pipeline (List[dict]): Pipeline steps

  • progress_callback (callable, optional): Progress function(current, total, message)

  • return_intermediate (bool, default=False): Return results after each step

  • parallel (bool, default=False): Process spectra in parallel

  • n_jobs (int, default=-1): Number of parallel jobs (-1 = all CPUs)

Returns: dict - Processing results

Raises:

  • PreprocessingError: If any step fails

  • ValueError: If pipeline configuration is invalid
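Conceptually, a pipeline executor looks up each step's method, applies it with the step's params, and reports progress. The sketch below illustrates that loop with two stand-in steps; REGISTRY, offset and run_pipeline are hypothetical names, not the package's internals.

```python
import time
import numpy as np

def vector_norm(spectra):
    """Scale each spectrum (row) to unit Euclidean length."""
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / np.where(norms == 0, 1.0, norms)

def offset(spectra, value=0.0):
    """Add a constant to every intensity (stand-in for a real step)."""
    return spectra + value

# Hypothetical registry mapping method names to callables.
REGISTRY = {'vector_norm': vector_norm, 'offset': offset}

def run_pipeline(spectra, pipeline, progress_callback=None):
    start = time.perf_counter()
    current = np.asarray(spectra, dtype=float)
    applied = []
    for i, step in enumerate(pipeline, start=1):
        method = step['method']
        if method not in REGISTRY:
            raise ValueError(f"Unknown pipeline method: {method}")
        current = REGISTRY[method](current, **step.get('params', {}))
        applied.append(method)
        if progress_callback is not None:
            progress_callback(i, len(pipeline), f"applied {method}")
    return {
        'preprocessed_spectra': current,
        'success': True,
        'execution_time': time.perf_counter() - start,
        'steps_applied': applied,
    }

messages = []
result = run_pipeline(
    spectra=[[3.0, 4.0], [0.0, 2.0]],
    pipeline=[
        {'method': 'offset', 'params': {'value': 0.0}},
        {'method': 'vector_norm', 'params': {}},
    ],
    progress_callback=lambda i, total, msg: messages.append(f"{i}/{total}: {msg}"),
)
print(result['steps_applied'], messages)
```

Validating method names before applying any step would let an invalid configuration fail before partial results are computed.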


Best Practices

Function Usage

from functions.preprocess.baseline import apply_asls

# Good: Clear, explicit parameters
corrected = apply_asls(
    spectrum=spectrum,
    lambda_=1e5,
    p=0.01,
    max_iter=10
)

# Avoid: Relying on defaults without understanding
corrected = apply_asls(spectrum)  # May not be appropriate for your data

Error Handling

from functions.preprocess.baseline import apply_asls
from functions._utils_ import PreprocessingError

try:
    corrected = apply_asls(spectrum, lambda_=1e5, p=0.01)
except PreprocessingError as e:
    print(f"Preprocessing failed: {e}")
    # Handle error or use fallback method
except ValueError as e:
    print(f"Invalid parameters: {e}")

Pipeline Design

# Good: Logical order
pipeline = [
    {'method': 'asls', 'params': {...}},      # 1. Baseline correction
    {'method': 'savgol', 'params': {...}},    # 2. Smoothing
    {'method': 'vector_norm', 'params': {}}   # 3. Normalization
]

# Avoid: Illogical order
pipeline = [
    {'method': 'vector_norm', 'params': {}},  # Normalizes while the baseline is still present,
    {'method': 'asls', 'params': {...}}       # so the baseline contributes to the norm
]

Performance Optimization

# For large datasets, use parallel processing
result = execute_preprocessing_pipeline(
    spectra=large_dataset,
    pipeline=my_pipeline,
    parallel=True,
    n_jobs=-1  # Use all CPU cores
)

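# For repeated operations, cache results keyed on hashable inputs
import json
from functools import lru_cache

import numpy as np

@lru_cache(maxsize=128)
def cached_preprocessing(spectrum_tuple, pipeline_json):
    # lru_cache requires hashable arguments, so pass tuple(spectrum)
    # and json.dumps(pipeline, sort_keys=True)
    spectra = np.array(spectrum_tuple)[np.newaxis, :]  # shape (1, n_features)
    pipeline = json.loads(pipeline_json)
    result = execute_preprocessing_pipeline(spectra=spectra, pipeline=pipeline)
    return result['preprocessed_spectra'][0]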


Last Updated: 2026-01-24