Functions API
Reference documentation for processing functions.
Data Loading Functions
functions/data_loader.py
load_spectra_from_csv()
Load Raman spectra from CSV file.
from functions.data_loader import load_spectra_from_csv
data = load_spectra_from_csv('data/samples.csv')
# Returns:
# {
# 'wavenumbers': np.array([400, 401, 402, ...]),
# 'spectra': np.array([[...], [...], ...]),
# 'labels': ['Sample1', 'Sample2', ...],
# 'groups': ['Control', 'Control', 'Treatment', ...],
# 'metadata': {
# 'acquisition_date': '2026-01-24',
# 'instrument': 'Renishaw inVia',
# 'laser_power': 50,
# 'exposure_time': 10
# }
# }
Parameters:
filepath (str): Path to CSV file
has_header (bool, default=True): First row contains headers
wavenumber_column (int or str, default=0): Wavenumber column index/name
delimiter (str, default=','): Column delimiter
Returns: dict - Spectral data dictionary
CSV Format:
Wavenumber,Sample1,Sample2,Sample3
400.0,0.12,0.15,0.11
401.0,0.13,0.16,0.12
402.0,0.14,0.17,0.13
...
Raises:
FileNotFoundError: File doesn’t exist
ValueError: Invalid file format
DataError: Inconsistent data dimensions
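The CSV layout above can be parsed with the standard library and NumPy. The following is a minimal sketch, assuming the first column holds wavenumbers and every other column holds one spectrum. `parse_spectra_csv_sketch` is a hypothetical helper, not the library's `load_spectra_from_csv`, and it returns only a subset of the documented keys.

```python
import csv
import io

import numpy as np


def parse_spectra_csv_sketch(text, delimiter=","):
    """Parse CSV text laid out as Wavenumber,Sample1,Sample2,... (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    # Reject ragged rows, mirroring the "inconsistent data dimensions" error case.
    if any(len(row) != len(header) for row in body):
        raise ValueError("inconsistent number of columns")
    values = np.array(body, dtype=float)  # shape: (n_points, 1 + n_samples)
    return {
        "wavenumbers": values[:, 0],
        "spectra": values[:, 1:].T,  # assumption: one row per sample
        "labels": header[1:],
    }


example = """Wavenumber,Sample1,Sample2,Sample3
400.0,0.12,0.15,0.11
401.0,0.13,0.16,0.12
402.0,0.14,0.17,0.13
"""
parsed = parse_spectra_csv_sketch(example)
print(parsed["labels"], parsed["spectra"].shape)
```

Storing spectra as (n_samples, n_points) is an assumption chosen to match the (n_samples, n_features) arrays used by the analysis functions.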
save_spectra_to_csv()
Save spectra to CSV file.
from functions.data_loader import save_spectra_to_csv
save_spectra_to_csv(
data={
'wavenumbers': wavenumbers,
'spectra': spectra,
'labels': labels
},
filepath='output/preprocessed.csv',
include_metadata=True
)
Parameters:
data (dict): Spectral data dictionary
filepath (str): Output file path
include_metadata (bool, default=True): Include metadata as comments
delimiter (str, default=','): Column delimiter
float_format (str, default='%.6f'): Number formatting
Side Effects: Creates file at specified path
load_raman_peaks()
Load reference Raman peak database.
from functions.data_loader import load_raman_peaks
peaks = load_raman_peaks()
# Returns:
# {
# 'proteins': {
# 'phenylalanine': [1004, 1033],
# 'amide_I': [1650, 1680],
# 'amide_III': [1230, 1300]
# },
# 'lipids': {
# 'CH2_stretch': [2850, 2900],
# 'CH2_bend': [1440, 1470]
# },
# 'nucleic_acids': {
# 'DNA_backbone': [785, 810],
# 'RNA_bases': [810, 850]
# }
# }
Returns: dict - Peak database by biomolecule category
Source: assets/data/raman_peaks.json
Preprocessing Functions
Baseline Correction
apply_asls()
Apply Asymmetric Least Squares (AsLS) baseline correction.
from functions.preprocess.baseline import apply_asls
corrected = apply_asls(
spectrum=spectrum,
lambda_=1e5, # Smoothness
p=0.01, # Asymmetry
max_iter=10
)
Parameters:
spectrum (np.ndarray): Input spectrum (1D array)
lambda_ (float): Smoothness parameter (1e2 - 1e9)
  Lower → follows peaks more closely
  Higher → smoother baseline
p (float): Asymmetry parameter (0.001 - 0.1)
  Lower → fits valleys more closely
  Higher → baseline goes through peaks
max_iter (int, default=10): Maximum iterations
Returns: np.ndarray - Baseline-corrected spectrum
Algorithm: Iteratively weighted least squares with asymmetric weights
Reference: Eilers & Boelens (2005) Baseline Correction with Asymmetric Least Squares Smoothing
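The iteration behind AsLS can be sketched in a few lines of NumPy. This is a dense-matrix illustration of the Eilers and Boelens scheme for short spectra, not the code behind `apply_asls`. The name `asls_baseline_sketch` and the synthetic test spectrum are assumptions made for the example.

```python
import numpy as np


def asls_baseline_sketch(y, lam=1e5, p=0.01, max_iter=10):
    """Estimate a baseline by asymmetric least squares (dense, for short spectra)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-difference operator, shape (n - 2, n); it penalises baseline curvature.
    D = np.diff(np.eye(n), n=2, axis=0)
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(max_iter):
        # Weighted, penalised least squares: (W + lam * D'D) z = W y
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # Points above the baseline (peaks) get weight p, points below get 1 - p.
        w = np.where(y > z, p, 1.0 - p)
    return z


# Synthetic spectrum: sloping background plus one Gaussian peak.
x = np.arange(200, dtype=float)
true_baseline = 0.5 + 0.002 * x
peak = np.exp(-0.5 * ((x - 100.0) / 5.0) ** 2)
spectrum = true_baseline + peak

baseline = asls_baseline_sketch(spectrum)
corrected = spectrum - baseline
```

A production implementation would normally use sparse matrices, because the dense solve scales poorly with spectrum length.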
apply_airpls()
Apply Adaptive Iteratively Reweighted Penalized Least Squares (airPLS).
from functions.preprocess.baseline import apply_airpls
corrected = apply_airpls(
spectrum=spectrum,
lambda_=100,
porder=1,
max_iter=15
)
Parameters:
spectrum (np.ndarray): Input spectrum
lambda_ (float): Smoothness parameter (1 - 1e6)
porder (int): Difference order (1 or 2)
max_iter (int, default=15): Maximum iterations
Returns: np.ndarray - Baseline-corrected spectrum
Algorithm: Adaptive weighting based on residual signs
Reference: Zhang et al. (2010) Baseline correction using adaptive iteratively reweighted penalized least squares
apply_polynomial_baseline()
Apply polynomial baseline fitting.
from functions.preprocess.baseline import apply_polynomial_baseline
corrected = apply_polynomial_baseline(
spectrum=spectrum,
degree=3,
mask_peaks=True,
threshold=0.8
)
Parameters:
spectrum (np.ndarray): Input spectrum
degree (int): Polynomial degree (1-10)
  1: Linear baseline
  2-3: Gentle curvature
  4-6: Complex baseline
  7+: Risk of overfitting
mask_peaks (bool, default=False): Exclude peaks from fit
threshold (float, default=0.8): Peak detection threshold
Returns: np.ndarray - Baseline-corrected spectrum
Algorithm: Least squares polynomial fitting
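Least-squares polynomial baseline removal can be illustrated with `np.polyfit`. This is a minimal sketch that fits the whole spectrum without peak masking. `polynomial_baseline_sketch` is a hypothetical name, and fitting on an index axis rescaled to [-1, 1] is an assumption made for numerical stability.

```python
import numpy as np


def polynomial_baseline_sketch(spectrum, degree=3):
    """Fit a polynomial to the whole spectrum and subtract it (no peak masking)."""
    y = np.asarray(spectrum, dtype=float)
    # Rescale the index axis to [-1, 1] to keep the fit well conditioned.
    x = np.linspace(-1.0, 1.0, y.size)
    coeffs = np.polyfit(x, y, degree)
    baseline = np.polyval(coeffs, x)
    return y - baseline, baseline


x = np.linspace(-1.0, 1.0, 101)
drift = 1.0 + 0.5 * x - 0.3 * x**2 + 0.2 * x**3  # purely cubic background
corrected, baseline = polynomial_baseline_sketch(drift, degree=3)
```

Because the input here is exactly cubic, the corrected signal is zero up to rounding. On real spectra, peaks pull the fit upward unless they are excluded from it.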
apply_whittaker_baseline()
Apply Whittaker smoothing baseline.
from functions.preprocess.baseline import apply_whittaker_baseline
corrected = apply_whittaker_baseline(
spectrum=spectrum,
lambda_=1000,
differences=2
)
Parameters:
spectrum (np.ndarray): Input spectrum
lambda_ (float): Smoothness parameter (1 - 1e6)
differences (int): Order of differences (1 or 2)
Returns: np.ndarray - Baseline-corrected spectrum
Algorithm: Penalized least squares with difference penalty
apply_fabc()
Apply Fully Automatic Baseline Correction (FABC).
from functions.preprocess.fabc_fixed import apply_fabc
corrected = apply_fabc(
spectrum=spectrum,
window_length=51,
iterations=10
)
Parameters:
spectrum (np.ndarray): Input spectrum
window_length (int): Smoothing window size (5-201, odd)
iterations (int, default=10): Number of iterations
Returns: np.ndarray - Baseline-corrected spectrum
Algorithm: Iterative morphological operations
apply_butterworth_filter()
Apply Butterworth high-pass filter for baseline removal.
from functions.preprocess.baseline import apply_butterworth_filter
corrected = apply_butterworth_filter(
spectrum=spectrum,
cutoff=0.01,
order=4,
sampling_rate=1.0
)
Parameters:
spectrum (np.ndarray): Input spectrum
cutoff (float): Cutoff frequency (0.001 - 0.1)
order (int): Filter order (2-10)
sampling_rate (float): Sampling rate
Returns: np.ndarray - High-pass filtered spectrum
Algorithm: Butterworth high-pass filter (removes low-frequency baseline)
Smoothing
apply_savgol()
Apply Savitzky-Golay smoothing.
from functions.preprocess.kernel_denoise import apply_savgol
smoothed = apply_savgol(
spectrum=spectrum,
window_length=11,
polyorder=3,
deriv=0
)
Parameters:
spectrum (np.ndarray): Input spectrum
window_length (int): Window size (5-51, odd, > polyorder)
  Smaller → less smoothing, preserves peaks
  Larger → more smoothing, may broaden peaks
polyorder (int): Polynomial order (2-5)
  2-3: Most common
  4-5: More flexible but risk of artifacts
deriv (int, default=0): Derivative order (0, 1, or 2)
  0: Smoothing only
  1: First derivative
  2: Second derivative
Returns: np.ndarray - Smoothed spectrum
Algorithm: Local polynomial regression within sliding window
Reference: Savitzky & Golay (1964) Smoothing and Differentiation of Data by Simplified Least Squares Procedures
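The "local polynomial regression within a sliding window" idea reduces to a fixed set of convolution weights. This NumPy sketch derives those weights for plain smoothing (deriv=0) and applies them with reflected edges. The function names and the edge handling are assumptions for the example and may differ from what `apply_savgol` does.

```python
import numpy as np


def savgol_coeffs_sketch(window_length, polyorder):
    """Smoothing weights for the centre point of a Savitzky-Golay window."""
    half = window_length // 2
    offsets = np.arange(-half, half + 1)
    # Design matrix of the local polynomial: columns are offsets**0 ... offsets**polyorder.
    A = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse maps window samples to the fitted value at offset 0.
    return np.linalg.pinv(A)[0]


def savgol_smooth_sketch(spectrum, window_length=11, polyorder=3):
    """Smooth a 1D spectrum; edges are handled by reflection (an assumption)."""
    y = np.asarray(spectrum, dtype=float)
    half = window_length // 2
    padded = np.pad(y, half, mode="reflect")
    coeffs = savgol_coeffs_sketch(window_length, polyorder)
    # The smoothing weights are symmetric, so convolution equals correlation here.
    return np.convolve(padded, coeffs, mode="valid")


weights = savgol_coeffs_sketch(5, 2)  # classic 5-point quadratic weights

t = np.arange(50.0)
cubic = 0.001 * t**3 - 0.05 * t**2 + t
smoothed = savgol_smooth_sketch(cubic, window_length=11, polyorder=3)
```

A cubic signal passes through a polyorder=3 filter unchanged away from the edges, which is why low-order peak shapes are preserved better than with a moving average.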
apply_gaussian_filter()
Apply Gaussian smoothing.
from functions.preprocess.kernel_denoise import apply_gaussian_filter
smoothed = apply_gaussian_filter(
spectrum=spectrum,
sigma=2.0,
mode='reflect'
)
Parameters:
spectrum (np.ndarray): Input spectrum
sigma (float): Standard deviation (0.5 - 5.0)
  Smaller → less smoothing
  Larger → more smoothing
mode (str, default='reflect'): Edge handling mode
Returns: np.ndarray - Smoothed spectrum
Algorithm: Convolution with Gaussian kernel
apply_moving_average()
Apply moving average smoothing.
from functions.preprocess.kernel_denoise import apply_moving_average
smoothed = apply_moving_average(
spectrum=spectrum,
window_size=5,
mode='same'
)
Parameters:
spectrum (np.ndarray): Input spectrum
window_size (int): Window size (3-21, odd)
mode (str, default='same'): Output size mode
Returns: np.ndarray - Smoothed spectrum
Algorithm: Uniform weighting within sliding window
apply_median_filter()
Apply median filter for noise removal.
from functions.preprocess.kernel_denoise import apply_median_filter
filtered = apply_median_filter(
spectrum=spectrum,
kernel_size=5
)
Parameters:
spectrum (np.ndarray): Input spectrum
kernel_size (int): Kernel size (3-15, odd)
Returns: np.ndarray - Filtered spectrum
Algorithm: Median value within sliding window (robust to outliers)
apply_kernel_denoise()
Apply adaptive kernel denoising.
from functions.preprocess.kernel_denoise import apply_kernel_denoise
denoised = apply_kernel_denoise(
spectrum=spectrum,
kernel_type='gaussian',
bandwidth=2.0,
iterations=1
)
Parameters:
spectrum (np.ndarray): Input spectrum
kernel_type (str): Kernel type ('gaussian', 'epanechnikov', 'uniform')
bandwidth (float): Kernel bandwidth (0.5 - 5.0)
iterations (int, default=1): Number of passes
Returns: np.ndarray - Denoised spectrum
Algorithm: Kernel regression with adaptive bandwidth
Normalization
apply_vector_norm()
Apply vector (L2) normalization.
from functions.preprocess.normalization import apply_vector_norm
normalized = apply_vector_norm(spectrum)
# Spectrum now has unit L2 norm: np.isclose(np.linalg.norm(normalized), 1.0)
Parameters:
spectrum(np.ndarray): Input spectrum
Returns: np.ndarray - Normalized spectrum
Formula: \(x_{norm} = \frac{x}{\sqrt{\sum x_i^2}}\)
Use Case: Makes spectra comparable regardless of absolute intensity
apply_minmax_norm()
Apply Min-Max normalization.
from functions.preprocess.normalization import apply_minmax_norm
normalized = apply_minmax_norm(
spectrum=spectrum,
feature_range=(0, 1)
)
# Values now in [0, 1]: min=0, max=1
Parameters:
spectrum (np.ndarray): Input spectrum
feature_range (tuple, default=(0, 1)): Target range (min, max)
Returns: np.ndarray - Normalized spectrum
Formula: \(x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \times (r_{max} - r_{min}) + r_{min}\), where \((r_{min}, r_{max})\) is feature_range
Use Case: Scale to specific range, preserves distribution shape
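The min-max formula maps directly onto NumPy. This is a minimal sketch in which `minmax_sketch` is a hypothetical name. Mapping a flat spectrum to the lower bound of feature_range is an assumption, since the formula is undefined when the maximum equals the minimum.

```python
import numpy as np


def minmax_sketch(spectrum, feature_range=(0.0, 1.0)):
    """Rescale a spectrum so its minimum and maximum hit the feature_range bounds."""
    y = np.asarray(spectrum, dtype=float)
    lo, hi = feature_range
    span = y.max() - y.min()
    if span == 0:
        # A flat spectrum has no range to stretch; this sketch maps it to the lower bound.
        return np.full_like(y, lo)
    return (y - y.min()) / span * (hi - lo) + lo


scaled = minmax_sketch([2.0, 4.0, 6.0], feature_range=(-1.0, 1.0))
print(scaled)  # [-1.  0.  1.]
```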
apply_area_norm()
Apply area (sum) normalization.
from functions.preprocess.normalization import apply_area_norm
normalized = apply_area_norm(spectrum)
# Total area is now 1.0: np.isclose(np.sum(normalized), 1.0)
Parameters:
spectrum(np.ndarray): Input spectrum
Returns: np.ndarray - Normalized spectrum
Formula: \(x_{norm} = \frac{x}{\sum x_i}\)
Use Case: Normalize by total signal intensity
apply_snv()
Apply Standard Normal Variate (SNV) normalization.
from functions.preprocess.normalization import apply_snv
normalized = apply_snv(spectrum)
# Mean=0, Std=1: np.mean(normalized) ≈ 0, np.std(normalized) ≈ 1
Parameters:
spectrum(np.ndarray): Input spectrum
Returns: np.ndarray - Normalized spectrum
Formula: \(x_{norm} = \frac{x - \bar{x}}{\sigma_x}\)
Use Case: Remove multiplicative scatter effects, common in NIR/Raman
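The SNV formula is a per-spectrum z-score. The sketch below uses the population standard deviation (ddof=0), which is an assumption chosen so that np.std of the result is 1; some implementations use ddof=1 instead. `snv_sketch` is a hypothetical name.

```python
import numpy as np


def snv_sketch(spectrum):
    """Centre a spectrum on its own mean and scale by its own standard deviation."""
    y = np.asarray(spectrum, dtype=float)
    return (y - y.mean()) / y.std()


raw = np.array([1.0, 2.0, 4.0, 3.0])
distorted = 2.0 + 3.0 * raw  # same shape with an offset and a gain applied

snv_raw = snv_sketch(raw)
snv_distorted = snv_sketch(distorted)
```

Both inputs give the same output, which illustrates why SNV cancels a constant offset and a positive gain applied to an individual spectrum.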
apply_msc()
Apply Multiplicative Scatter Correction (MSC).
from functions.preprocess.normalization import apply_msc
# For single spectrum
normalized = apply_msc(
spectrum=spectrum,
reference=mean_spectrum
)
# For multiple spectra (calculates mean reference automatically)
from functions.preprocess.normalization import apply_msc_batch
normalized_spectra = apply_msc_batch(spectra)
Parameters:
spectrum (np.ndarray): Input spectrum
reference (np.ndarray): Reference spectrum (typically mean of all spectra)
Returns: np.ndarray - Normalized spectrum
Algorithm: Linear regression to reference, then correction
Formula:
Fit: \(x = a + b \cdot x_{ref}\)
Correct: \(x_{corr} = \frac{x - a}{b}\)
Use Case: Correct for scatter differences between samples
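The fit-then-correct steps of MSC can be written with a degree-1 `np.polyfit`. This is a minimal sketch for a single spectrum, with `msc_sketch` as a hypothetical name and a synthetic reference used only for illustration.

```python
import numpy as np


def msc_sketch(spectrum, reference):
    """Regress the spectrum on the reference, then undo offset and slope."""
    x = np.asarray(spectrum, dtype=float)
    ref = np.asarray(reference, dtype=float)
    # Least-squares fit x ≈ a + b * ref; polyfit returns [slope, intercept].
    b, a = np.polyfit(ref, x, 1)
    return (x - a) / b


reference = np.array([0.1, 0.4, 0.9, 0.3, 0.2])
scattered = 2.0 + 3.0 * reference  # reference distorted by offset a=2 and slope b=3
restored = msc_sketch(scattered, reference)
```

For a spectrum that is an exact affine copy of the reference, the correction recovers the reference itself.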
apply_quantile_norm()
Apply quantile normalization.
from functions.preprocess.advanced_normalization import apply_quantile_norm
normalized = apply_quantile_norm(
spectra=spectra, # 2D array: (n_samples, n_features)
n_quantiles=1000
)
Parameters:
spectra (np.ndarray): Multiple spectra (2D array)
n_quantiles (int, default=1000): Number of quantiles
Returns: np.ndarray - Normalized spectra
Algorithm: Rank-based transformation to common distribution
Use Case: Make distributions identical across samples
apply_pqn()
Apply Probabilistic Quotient Normalization (PQN).
from functions.preprocess.advanced_normalization import apply_pqn
normalized = apply_pqn(
spectrum=spectrum,
reference=reference_spectrum
)
Parameters:
spectrum (np.ndarray): Input spectrum
reference (np.ndarray): Reference spectrum
Returns: np.ndarray - Normalized spectrum
Algorithm: Compute element-wise quotients of the spectrum to the reference, then divide the spectrum by the median quotient
Use Case: Robust normalization for metabolomics/lipidomics
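The quotient step of PQN fits in a few lines. This sketch shows the core calculation only and skips any preliminary normalization a full PQN pipeline might apply. `pqn_sketch` is a hypothetical name, and ignoring variables where the reference is zero is an assumption.

```python
import numpy as np


def pqn_sketch(spectrum, reference):
    """Divide a spectrum by the median of its element-wise quotients to a reference."""
    x = np.asarray(spectrum, dtype=float)
    ref = np.asarray(reference, dtype=float)
    # Use only variables where the reference is non-zero to avoid division by zero.
    valid = ref != 0
    quotients = x[valid] / ref[valid]
    return x / np.median(quotients)


reference = np.array([1.0, 2.0, 4.0, 3.0, 1.5])
diluted = 0.4 * reference
diluted[2] *= 3.0  # one band changes for a reason other than dilution
normalized = pqn_sketch(diluted, reference)
```

The median ignores the single changed band, so the dilution factor of 0.4 is recovered and the genuine threefold change at that band survives normalization.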
apply_rank_transform()
Apply rank transformation.
from functions.preprocess.advanced_normalization import apply_rank_transform
ranked = apply_rank_transform(spectrum)
# Values replaced by ranks: 1, 2, 3, ..., n
Parameters:
spectrum(np.ndarray): Input spectrum
Returns: np.ndarray - Rank-transformed spectrum
Use Case: Non-parametric transformation, robust to outliers
Derivatives
apply_first_derivative()
Calculate first derivative.
from functions.preprocess.derivatives import apply_first_derivative
deriv1 = apply_first_derivative(
spectrum=spectrum,
method='savgol',
window_length=11,
polyorder=3
)
Parameters:
spectrum (np.ndarray): Input spectrum
method (str): Derivative method ('savgol', 'gradient', 'diff')
window_length (int): Window size for Savitzky-Golay
polyorder (int): Polynomial order for Savitzky-Golay
Returns: np.ndarray - First derivative
Use Case: Enhance peak resolution, remove baseline drift
apply_second_derivative()
Calculate second derivative.
from functions.preprocess.derivatives import apply_second_derivative
deriv2 = apply_second_derivative(
spectrum=spectrum,
method='savgol',
window_length=11,
polyorder=3
)
Parameters: Same as apply_first_derivative()
Returns: np.ndarray - Second derivative
Use Case: Sharpen overlapping peaks, enhance fine structure
Advanced Processing
apply_cdae()
Apply Convolutional Denoising Autoencoder (CDAE).
from functions.preprocess.deep_learning import apply_cdae
denoised = apply_cdae(
spectrum=spectrum,
model_path='models/cdae_raman.pth',
batch_size=32,
device='cuda' # or 'cpu'
)
Parameters:
spectrum (np.ndarray): Input spectrum
model_path (str): Path to trained CDAE model
batch_size (int, default=32): Batch size for inference
device (str, default='cpu'): Computation device
Returns: np.ndarray - Denoised spectrum
Algorithm: Deep learning-based denoising via autoencoder
Requirements: PyTorch, trained CDAE model
apply_background_subtraction()
Subtract background spectrum.
from functions.preprocess.background_subtraction import apply_background_subtraction
corrected = apply_background_subtraction(
spectrum=spectrum,
background=background_spectrum,
method='direct',
scale_factor=1.0
)
Parameters:
spectrum (np.ndarray): Sample spectrum
background (np.ndarray): Background spectrum
method (str): Subtraction method ('direct', 'scaled', 'weighted')
scale_factor (float, default=1.0): Background scaling factor
Returns: np.ndarray - Background-corrected spectrum
apply_wavelength_calibration()
Apply wavelength/wavenumber calibration.
from functions.preprocess.calibration import apply_wavelength_calibration
calibrated = apply_wavelength_calibration(
wavenumbers=wavenumbers,
reference_peaks=[1004, 1445, 1660],
measured_peaks=[1002, 1443, 1658],
method='linear'
)
Parameters:
wavenumbers (np.ndarray): Original wavenumber axis
reference_peaks (List[float]): Known reference peak positions
measured_peaks (List[float]): Measured peak positions
method (str): Calibration method ('linear', 'polynomial', 'spline')
Returns: np.ndarray - Calibrated wavenumber axis
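A linear calibration amounts to fitting a straight line from measured to reference peak positions and applying it to the whole axis. This is a minimal sketch of that idea, not the code behind `apply_wavelength_calibration`; `linear_calibration_sketch` is a hypothetical name.

```python
import numpy as np


def linear_calibration_sketch(wavenumbers, reference_peaks, measured_peaks):
    """Map a measured wavenumber axis onto the reference scale with a linear fit."""
    # Fit reference = slope * measured + intercept from the matched peak pairs.
    slope, intercept = np.polyfit(measured_peaks, reference_peaks, 1)
    return slope * np.asarray(wavenumbers, dtype=float) + intercept


axis = np.array([400.0, 1000.0, 1800.0])
calibrated = linear_calibration_sketch(
    axis,
    reference_peaks=[1004, 1445, 1660],
    measured_peaks=[1002, 1443, 1658],
)
```

In the example every measured peak sits 2 cm⁻¹ below its reference, so the fit reduces to a constant shift of +2 across the axis.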
apply_peak_ratio()
Calculate peak intensity ratios.
from functions.preprocess.feature_engineering import apply_peak_ratio
ratio = apply_peak_ratio(
spectrum=spectrum,
wavenumbers=wavenumbers,
band1_range=(1640, 1680), # Amide I
band2_range=(2840, 2900), # CH2 stretch
integration_method='trapz'
)
Parameters:
spectrum (np.ndarray): Input spectrum
wavenumbers (np.ndarray): Wavenumber axis
band1_range (tuple): First band (min, max) wavenumbers
band2_range (tuple): Second band (min, max) wavenumbers
integration_method (str): Integration method ('trapz', 'simps', 'max')
Returns: float - Peak ratio (band1 / band2)
Use Case: Create discriminative features for classification
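A band ratio with trapezoidal integration can be sketched as follows. The helper names are hypothetical, band edges are treated as inclusive (an assumption), and the trapezoidal rule is written out by hand so the example does not depend on a particular NumPy version.

```python
import numpy as np


def band_area_sketch(spectrum, wavenumbers, band_range):
    """Integrate the spectrum over one band with the trapezoidal rule."""
    lo, hi = band_range
    mask = (wavenumbers >= lo) & (wavenumbers <= hi)
    x, y = wavenumbers[mask], spectrum[mask]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def peak_ratio_sketch(spectrum, wavenumbers, band1_range, band2_range):
    """Return area(band1) / area(band2)."""
    area1 = band_area_sketch(spectrum, wavenumbers, band1_range)
    area2 = band_area_sketch(spectrum, wavenumbers, band2_range)
    return area1 / area2


wavenumbers = np.arange(1600.0, 3001.0)
spectrum = np.where(wavenumbers < 2000, 2.0, 1.0)  # flat intensity 2, then 1
ratio = peak_ratio_sketch(spectrum, wavenumbers, (1640, 1680), (2840, 2900))
```

With these flat test intensities the areas are 2 × 40 and 1 × 60, so the ratio is 80 / 60.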
apply_wavelet_transform()
Apply wavelet transform for denoising or feature extraction.
from functions.preprocess.feature_engineering import apply_wavelet_transform
coeffs, denoised = apply_wavelet_transform(
spectrum=spectrum,
wavelet='db4',
level=5,
threshold_method='soft',
threshold_value='auto'
)
Parameters:
spectrum (np.ndarray): Input spectrum
wavelet (str): Wavelet type ('db4', 'sym5', 'coif3')
level (int): Decomposition level (1-10)
threshold_method (str): Thresholding ('soft', 'hard', 'garrote')
threshold_value (str or float): Threshold ('auto' or specific value)
Returns: Tuple[list, np.ndarray] - (wavelet_coefficients, reconstructed_signal)
Use Case: Multi-resolution denoising, feature extraction
Analysis Functions
Dimensionality Reduction
apply_pca()
Perform Principal Component Analysis.
from functions.ML.linear_regression import apply_pca
result = apply_pca(
spectra=spectra,
n_components=2,
whiten=False,
random_state=42
)
# Returns:
# {
# 'scores': np.array([[...], [...]]), # PC scores
# 'loadings': np.array([[...], [...]]), # PC loadings
# 'explained_variance': np.array([...]),
# 'explained_variance_ratio': np.array([...]),
# 'model': fitted_pca_model
# }
Parameters:
spectra (np.ndarray): Input spectra (n_samples, n_features)
n_components (int or float): Number of components or variance to retain
  int: Specific number of PCs
  float (0-1): Retain components explaining this variance fraction
whiten (bool, default=False): Whiten components (unit variance)
random_state (int, optional): Random seed for reproducibility
Returns: dict - PCA results
Algorithm: Singular Value Decomposition (SVD)
Use Case: Dimensionality reduction, visualization, noise reduction
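The SVD route to PCA can be shown directly in NumPy. This sketch mirrors the keys of the documented result dictionary but is not the library's `apply_pca`. `pca_svd_sketch` is a hypothetical name, and the random input exists only to make the example runnable.

```python
import numpy as np


def pca_svd_sketch(spectra, n_components=2):
    """PCA via SVD of the mean-centred data matrix (n_samples, n_features)."""
    X = np.asarray(spectra, dtype=float)
    centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    # Singular values relate to component variances by S**2 / (n_samples - 1).
    variance = S**2 / (X.shape[0] - 1)
    return {
        "scores": U[:, :n_components] * S[:n_components],
        "loadings": Vt[:n_components],
        "explained_variance": variance[:n_components],
        "explained_variance_ratio": variance[:n_components] / variance.sum(),
    }


rng = np.random.default_rng(0)
spectra = rng.normal(size=(20, 50))
result = pca_svd_sketch(spectra, n_components=2)
```

The scores equal the centred data projected onto the loadings, which gives a quick consistency check for any PCA output.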
apply_umap()
Perform UMAP (Uniform Manifold Approximation and Projection).
from functions.ML.linear_regression import apply_umap
result = apply_umap(
spectra=spectra,
n_neighbors=15,
min_dist=0.1,
n_components=2,
metric='euclidean',
random_state=42
)
# Returns:
# {
# 'embedding': np.array([[...], [...]]),
# 'model': fitted_umap_model
# }
Parameters:
spectra (np.ndarray): Input spectra
n_neighbors (int, default=15): Neighborhood size (2-200)
min_dist (float, default=0.1): Minimum distance between points (0.0-0.99)
n_components (int, default=2): Embedding dimensions
metric (str, default='euclidean'): Distance metric
random_state (int, optional): Random seed
Returns: dict - UMAP results
Use Case: Non-linear dimensionality reduction; preserves local structure and, depending on n_neighbors, some global structure
apply_tsne()
Perform t-SNE (t-Distributed Stochastic Neighbor Embedding).
from functions.ML.linear_regression import apply_tsne
result = apply_tsne(
spectra=spectra,
n_components=2,
perplexity=30,
learning_rate=200,
n_iter=1000,
random_state=42
)
# Returns:
# {
# 'embedding': np.array([[...], [...]]),
# 'model': fitted_tsne_model
# }
Parameters:
spectra (np.ndarray): Input spectra
n_components (int, default=2): Embedding dimensions
perplexity (float, default=30): Perplexity parameter (5-50)
learning_rate (float, default=200): Learning rate (10-1000)
n_iter (int, default=1000): Maximum iterations (250-5000)
random_state (int, optional): Random seed
Returns: dict - t-SNE results
Use Case: Visualization, preserves local structure (neighborhoods)
Clustering
apply_kmeans()
Perform K-means clustering.
from functions.ML.linear_regression import apply_kmeans
result = apply_kmeans(
spectra=spectra,
n_clusters=3,
init='k-means++',
n_init=10,
max_iter=300,
random_state=42
)
# Returns:
# {
# 'labels': np.array([0, 0, 1, 1, 2, 2, ...]),
# 'centers': np.array([[...], [...], [...]]),
# 'inertia': 123.45,
# 'model': fitted_kmeans_model
# }
Parameters:
spectra (np.ndarray): Input spectra
n_clusters (int): Number of clusters
init (str, default='k-means++'): Initialization method
n_init (int, default=10): Number of initializations
max_iter (int, default=300): Maximum iterations
random_state (int, optional): Random seed
Returns: dict - Clustering results
Use Case: Partition data into K groups
apply_hierarchical_clustering()
Perform hierarchical clustering.
from functions.ML.linear_regression import apply_hierarchical_clustering
result = apply_hierarchical_clustering(
spectra=spectra,
n_clusters=3,
linkage='ward',
metric='euclidean'
)
# Returns:
# {
# 'labels': np.array([0, 0, 1, 1, 2, 2, ...]),
# 'linkage_matrix': np.array([[...], [...]]),
# 'cophenetic_corr': 0.85,
# 'dendrogram': {...}
# }
Parameters:
spectra (np.ndarray): Input spectra
n_clusters (int, optional): Number of clusters (if None, returns full tree)
linkage (str, default='ward'): Linkage criterion ('ward', 'complete', 'average', 'single')
metric (str, default='euclidean'): Distance metric
Returns: dict - Clustering results with dendrogram
Use Case: Hierarchical grouping, visualize cluster relationships
apply_dbscan()
Perform DBSCAN (Density-Based Spatial Clustering).
from functions.ML.linear_regression import apply_dbscan
result = apply_dbscan(
spectra=spectra,
eps=0.5,
min_samples=5,
metric='euclidean'
)
# Returns:
# {
# 'labels': np.array([0, 0, -1, 1, 1, ...]), # -1 = noise
# 'core_samples': np.array([True, True, False, ...]),
# 'n_clusters': 2,
# 'n_noise': 15,
# 'model': fitted_dbscan_model
# }
Parameters:
spectra (np.ndarray): Input spectra
eps (float): Maximum distance between neighbors (0.1-10.0)
min_samples (int): Minimum samples in neighborhood (3-20)
metric (str, default='euclidean'): Distance metric
Returns: dict - Clustering results
Use Case: Find arbitrary-shaped clusters, detect outliers
Statistical Tests
apply_ttest()
Perform t-test between two groups.
from functions.utils import apply_ttest
result = apply_ttest(
group1=control_spectra,
group2=treatment_spectra,
equal_var=False, # Welch's t-test
alternative='two-sided'
)
# Returns:
# {
# 'statistic': np.array([...]), # t-statistic per feature
# 'pvalue': np.array([...]), # p-value per feature
# 'significant_features': np.array([100, 234, 567, ...]),
# 'effect_size': np.array([...]) # Cohen's d per feature
# }
Parameters:
group1 (np.ndarray): First group spectra (n_samples, n_features)
group2 (np.ndarray): Second group spectra
equal_var (bool, default=False): Assume equal variances
  True: Student’s t-test
  False: Welch’s t-test (recommended)
alternative (str, default='two-sided'): Alternative hypothesis ('two-sided', 'less', 'greater')
Returns: dict - Test results
Use Case: Compare means between two groups at each wavenumber
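The per-wavenumber statistics can be computed column by column. This sketch calculates only the Welch t-statistic and Cohen's d; p-values need the t distribution and are left out. `welch_t_sketch` is a hypothetical name, and using the pooled standard deviation for Cohen's d is an assumption about the effect-size definition.

```python
import numpy as np


def welch_t_sketch(group1, group2):
    """Welch t-statistic and Cohen's d for every feature (column)."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = g1.shape[0], g2.shape[0]
    m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
    v1, v2 = g1.var(axis=0, ddof=1), g2.var(axis=0, ddof=1)
    # Welch: separate variance estimates, no equal-variance assumption.
    t = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)
    # Cohen's d with the pooled standard deviation.
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    return t, d


control = np.array([[1.0, 10.0], [2.0, 11.0], [3.0, 12.0]])
treatment = np.array([[2.0, 10.0], [4.0, 11.0], [6.0, 12.0]])
t_stat, cohens_d = welch_t_sketch(control, treatment)
```

The second feature has identical values in both groups, so its statistic and effect size are both zero.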
apply_mannwhitneyu()
Perform Mann-Whitney U test (non-parametric t-test alternative).
from functions.utils import apply_mannwhitneyu
result = apply_mannwhitneyu(
group1=control_spectra,
group2=treatment_spectra,
alternative='two-sided'
)
# Returns similar structure to apply_ttest()
Parameters:
group1 (np.ndarray): First group spectra
group2 (np.ndarray): Second group spectra
alternative (str, default='two-sided'): Alternative hypothesis
Returns: dict - Test results
Use Case: Compare distributions when data is non-normal
apply_anova()
Perform one-way ANOVA for multiple groups.
from functions.utils import apply_anova
result = apply_anova(
*groups, # Variable number of group arrays
post_hoc='tukey'
)
# Example:
result = apply_anova(
control_spectra,
treatment_a_spectra,
treatment_b_spectra,
post_hoc='tukey'
)
# Returns:
# {
# 'f_statistic': np.array([...]),
# 'pvalue': np.array([...]),
# 'significant_features': np.array([...]),
# 'effect_size': np.array([...]), # eta-squared
# 'post_hoc_results': {...} # if post_hoc specified
# }
Parameters:
*groups (np.ndarray): Multiple group arrays
post_hoc (str, optional): Post-hoc test ('tukey', 'bonferroni', 'holm')
Returns: dict - ANOVA results
Use Case: Compare means across multiple groups
apply_kruskal()
Perform Kruskal-Wallis test (non-parametric ANOVA alternative).
from functions.utils import apply_kruskal
result = apply_kruskal(
*groups,
post_hoc='dunn'
)
Parameters:
*groups (np.ndarray): Multiple group arrays
post_hoc (str, optional): Post-hoc test ('dunn')
Returns: dict - Test results
Use Case: Compare distributions across multiple groups (non-parametric)
apply_correlation()
Calculate correlation matrix.
from functions.utils import apply_correlation
result = apply_correlation(
spectra=spectra,
method='pearson'
)
# Returns:
# {
# 'correlation_matrix': np.array([[1.0, 0.8, ...], [0.8, 1.0, ...], ...]),
# 'pvalue_matrix': np.array([[0.0, 0.001, ...], [0.001, 0.0, ...], ...])
# }
Parameters:
spectra (np.ndarray): Input spectra (n_samples, n_features)
method (str): Correlation method ('pearson', 'spearman', 'kendall')
Returns: dict - Correlation results
Use Case: Find relationships between samples or features
apply_multiple_testing_correction()
Apply correction for multiple hypothesis testing.
from functions.utils import apply_multiple_testing_correction
corrected_pvalues = apply_multiple_testing_correction(
pvalues=raw_pvalues,
method='fdr_bh',
alpha=0.05
)
# Returns:
# {
# 'corrected_pvalues': np.array([...]),
# 'reject': np.array([True, False, True, ...]),
# 'significant_count': 42
# }
Parameters:
pvalues (np.ndarray): Uncorrected p-values
method (str): Correction method
  'bonferroni': Most conservative
  'holm': Step-down Bonferroni
  'fdr_bh': Benjamini-Hochberg FDR (recommended)
  'fdr_by': Benjamini-Yekutieli FDR
alpha (float, default=0.05): Significance level
Returns: dict - Corrected results
Use Case: Control false positives when testing many hypotheses
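The Benjamini-Hochberg adjustment behind the 'fdr_bh' option can be sketched in NumPy. This illustrates the standard step-up procedure and is not the code behind `apply_multiple_testing_correction`; `benjamini_hochberg_sketch` is a hypothetical name.

```python
import numpy as np


def benjamini_hochberg_sketch(pvalues, alpha=0.05):
    """Return BH-adjusted p-values and the reject mask at level alpha."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0.0, 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted  # restore the original ordering
    return adjusted, adjusted <= alpha


adjusted, reject = benjamini_hochberg_sketch([0.01, 0.04, 0.03, 0.20], alpha=0.05)
```

In this small example only the first hypothesis stays significant after adjustment, although three raw p-values were below 0.05.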
Machine Learning Functions
Model Training
train_svm()
Train Support Vector Machine classifier.
from functions.ML.svm import train_svm
result = train_svm(
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
kernel='rbf',
C=1.0,
gamma='scale',
cv=5
)
# Returns:
# {
# 'model': fitted_svm_model,
# 'train_score': 0.98,
# 'test_score': 0.92,
# 'cv_scores': np.array([0.90, 0.93, 0.91, 0.94, 0.89]),
# 'confusion_matrix': np.array([[45, 5], [3, 47]]),
# 'classification_report': {...},
# 'support_vectors': np.array([[...], ...])
# }
Parameters:
X_train (np.ndarray): Training features (n_samples, n_features)
y_train (np.ndarray): Training labels
X_test (np.ndarray, optional): Test features
y_test (np.ndarray, optional): Test labels
kernel (str, default='rbf'): Kernel type ('linear', 'rbf', 'poly', 'sigmoid')
C (float, default=1.0): Regularization parameter (0.1-100)
gamma (str or float, default='scale'): Kernel coefficient
cv (int, default=5): Cross-validation folds
Returns: dict - Training results
Use Case: Binary or multi-class classification with kernel trick
train_random_forest()
Train Random Forest classifier.
from functions.ML.random_forest import train_random_forest
result = train_random_forest(
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
n_estimators=100,
max_depth=None,
min_samples_split=2,
cv=5,
random_state=42
)
# Returns similar structure to train_svm() plus:
# {
# ...,
# 'feature_importances': np.array([...]),
# 'oob_score': 0.91 # if oob_score=True
# }
Parameters:
X_train, y_train, X_test, y_test: Data arrays
n_estimators (int, default=100): Number of trees (100-1000)
max_depth (int, optional): Maximum tree depth
min_samples_split (int, default=2): Minimum samples to split (2-20)
min_samples_leaf (int, default=1): Minimum samples per leaf
max_features (str or int, default='sqrt'): Features per split
cv (int, default=5): Cross-validation folds
random_state (int, optional): Random seed
Returns: dict - Training results
Use Case: Robust ensemble classifier, handles non-linearity
train_xgboost()
Train XGBoost classifier.
from functions.ML.xgboost import train_xgboost
result = train_xgboost(
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
n_estimators=100,
learning_rate=0.1,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
cv=5,
early_stopping_rounds=10,
random_state=42
)
# Returns similar structure plus:
# {
# ...,
# 'feature_importances': {
# 'gain': np.array([...]),
# 'weight': np.array([...]),
# 'cover': np.array([...])
# },
# 'best_iteration': 87
# }
Parameters:
X_train, y_train, X_test, y_test: Data arrays
n_estimators (int, default=100): Number of boosting rounds (100-1000)
learning_rate (float, default=0.1): Step size shrinkage (0.01-0.3)
max_depth (int, default=6): Tree depth (3-10)
subsample (float, default=0.8): Sample fraction (0.5-1.0)
colsample_bytree (float, default=0.8): Feature fraction (0.5-1.0)
gamma (float, default=0): Minimum split loss (0-10)
reg_lambda (float, default=1): L2 regularization (0-10)
reg_alpha (float, default=0): L1 regularization (0-10)
cv (int, default=5): Cross-validation folds
early_stopping_rounds (int, optional): Stop if no improvement
random_state (int, optional): Random seed
Returns: dict - Training results
Use Case: Gradient-boosted tree classifier with built-in regularization and early stopping
train_logistic_regression()
Train Logistic Regression classifier.
from functions.ML.logistic_regression import train_logistic_regression
result = train_logistic_regression(
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
C=1.0,
penalty='l2',
solver='lbfgs',
cv=5,
random_state=42
)
# Returns similar structure plus:
# {
# ...,
# 'coefficients': np.array([[...], ...]),
# 'intercept': np.array([...]),
# 'odds_ratios': np.array([...])
# }
Parameters:
X_train, y_train, X_test, y_test: Data arrays
C (float, default=1.0): Inverse regularization strength (0.01-100)
penalty (str, default='l2'): Regularization type ('l1', 'l2', 'elasticnet', 'none')
solver (str, default='lbfgs'): Optimization algorithm
max_iter (int, default=100): Maximum iterations (100-10000)
cv (int, default=5): Cross-validation folds
random_state (int, optional): Random seed
Returns: dict - Training results
Use Case: Interpretable linear classifier with probabilities
train_mlp()
Train Multi-Layer Perceptron (Neural Network) classifier.
from functions.ML.logistic_regression import train_mlp
result = train_mlp(
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
hidden_layer_sizes=(100, 50),
activation='relu',
alpha=0.0001,
learning_rate_init=0.001,
max_iter=200,
early_stopping=True,
cv=5,
random_state=42
)
# Returns similar structure plus:
# {
# ...,
# 'loss_curve': [0.8, 0.6, 0.5, 0.4, ...],
# 'best_loss': 0.35
# }
Parameters:
X_train, y_train, X_test, y_test: Data arrays
hidden_layer_sizes (tuple, default=(100,)): Neurons per hidden layer
activation (str, default='relu'): Activation function ('relu', 'tanh', 'logistic')
alpha (float, default=0.0001): L2 regularization (0.0001-0.01)
learning_rate_init (float, default=0.001): Initial learning rate (0.0001-0.01)
max_iter (int, default=200): Maximum epochs (200-1000)
early_stopping (bool, default=True): Stop if validation score doesn’t improve
cv (int, default=5): Cross-validation folds
random_state (int, optional): Random seed
Returns: dict - Training results
Use Case: Flexible non-linear classifier (feed-forward neural network)
Model Evaluation
evaluate_model()
Comprehensive model evaluation.
from functions.ML.utils import evaluate_model
metrics = evaluate_model(
model=trained_model,
X_test=X_test,
y_test=y_test,
average='weighted'
)
# Returns:
# {
# 'accuracy': 0.92,
# 'precision': 0.92,
# 'recall': 0.92,
# 'f1_score': 0.92,
# 'roc_auc': 0.95,
# 'confusion_matrix': np.array([[45, 5], [3, 47]]),
# 'classification_report': {
# 'Control': {'precision': 0.94, 'recall': 0.90, 'f1-score': 0.92},
# 'Treatment': {'precision': 0.90, 'recall': 0.94, 'f1-score': 0.92}
# }
# }
Parameters:
model: Trained scikit-learn model
X_test (np.ndarray): Test features
y_test (np.ndarray): True labels
average (str, default='weighted'): Averaging method for multi-class
Returns: dict - Evaluation metrics
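Most of the reported metrics follow from the confusion matrix alone. The sketch below recomputes them for the matrix shown in the example output, assuming the scikit-learn convention that rows are true classes and columns are predicted classes. The variable names are illustrative and not part of the library.

```python
import numpy as np

confusion = np.array([[45, 5], [3, 47]])  # rows: true class, columns: predicted class

support = confusion.sum(axis=1)  # number of samples per true class
true_positives = np.diag(confusion)

precision = true_positives / confusion.sum(axis=0)  # per predicted class
recall = true_positives / support                   # per true class
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_positives.sum() / confusion.sum()

# 'weighted' averaging weights each class by its support.
weights = support / support.sum()
weighted_precision = float(np.sum(weights * precision))
weighted_recall = float(np.sum(weights * recall))
weighted_f1 = float(np.sum(weights * f1))
```

Support-weighted recall always equals accuracy, which is a useful sanity check on any reported set of weighted metrics.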
plot_confusion_matrix()
Plot confusion matrix heatmap.
from functions.visualization.plots import plot_confusion_matrix
fig = plot_confusion_matrix(
y_true=y_test,
y_pred=predictions,
class_names=['Control', 'Treatment'],
normalize=True,
cmap='Blues'
)
Parameters:
y_true (np.ndarray): True labels
y_pred (np.ndarray): Predicted labels
class_names (List[str], optional): Class labels
normalize (bool, default=False): Normalize by row (true labels)
cmap (str, default='Blues'): Colormap
Returns: matplotlib.figure.Figure
plot_roc_curve()
Plot ROC curve.
from functions.visualization.plots import plot_roc_curve
fig = plot_roc_curve(
y_true=y_test,
y_scores=y_proba,
class_names=['Control', 'Treatment']
)
Parameters:
y_true (np.ndarray): True labels
y_scores (np.ndarray): Predicted probabilities
class_names (List[str], optional): Class labels
Returns: matplotlib.figure.Figure
Displays: ROC curve with AUC score
plot_learning_curve()
Plot learning curve (train/validation score vs. training size).
from functions.visualization.plots import plot_learning_curve
fig = plot_learning_curve(
model=model,
X=X_train,
y=y_train,
cv=5,
scoring='accuracy'
)
Parameters:
model: Scikit-learn estimator
X (np.ndarray): Training features
y (np.ndarray): Training labels
cv (int, default=5): Cross-validation folds
scoring (str, default='accuracy'): Metric to plot
Returns: matplotlib.figure.Figure
Use Case: Diagnose overfitting/underfitting
Feature Importance
get_feature_importance()
Extract feature importance from model.
from functions.ML.utils import get_feature_importance
importance = get_feature_importance(
    model=trained_model,
    method='default',
    wavenumbers=wavenumbers
)
# Returns:
# {
#     'importance': np.array([...]),
#     'feature_indices': np.array([...]),
#     'wavenumbers': np.array([...]),
#     'top_features': {
#         'indices': [234, 567, 890],
#         'wavenumbers': [1004, 1445, 1660],
#         'importance': [0.15, 0.12, 0.10]
#     }
# }
Parameters:
model: Trained model
method(str, default='default'): Importance method
  'default': Use model’s native importance
  'permutation': Permutation importance (model-agnostic)
  'shap': SHAP values (requires shap library)
wavenumbers(np.ndarray, optional): Wavenumber axis
Returns: dict - Feature importance results
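The `top_features` entry can be thought of as a ranking of the importance array. The sketch below shows one way to build it with NumPy; the helper `top_features` and its `top_n` argument are hypothetical and only illustrate the shape of the result.

```python
import numpy as np


def top_features(importance, wavenumbers, top_n=3):
    """Hypothetical helper: pick the top_n most important features."""
    importance = np.asarray(importance, dtype=float)
    wavenumbers = np.asarray(wavenumbers)
    idx = np.argsort(importance)[::-1][:top_n]   # indices by descending importance
    return {
        'indices': idx.tolist(),
        'wavenumbers': wavenumbers[idx].tolist(),
        'importance': importance[idx].tolist(),
    }


top = top_features(
    importance=[0.02, 0.15, 0.01, 0.12, 0.10],
    wavenumbers=[800, 1004, 1200, 1445, 1660],
)
```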
plot_feature_importance()
Plot feature importance as bar plot or spectrum overlay.
from functions.visualization.plots import plot_feature_importance
fig = plot_feature_importance(
    importance=importance_array,
    wavenumbers=wavenumbers,
    top_n=20,
    plot_type='spectrum'
)
Parameters:
importance(np.ndarray): Feature importance values
wavenumbers(np.ndarray): Wavenumber axis
top_n(int, default=20): Number of top features to highlight
plot_type(str, default='bar'): Plot type ('bar', 'spectrum', 'both')
Returns: matplotlib.figure.Figure
calculate_permutation_importance()
Calculate permutation feature importance (model-agnostic).
from functions.ML.utils import calculate_permutation_importance
importance = calculate_permutation_importance(
    model=trained_model,
    X=X_test,
    y=y_test,
    n_repeats=10,
    random_state=42,
    scoring='accuracy'
)
# Returns:
# {
#     'importances_mean': np.array([...]),
#     'importances_std': np.array([...]),
#     'importances': np.array([[...], [...], ...])  # shape: (n_features, n_repeats)
# }
Parameters:
model: Trained model
X(np.ndarray): Test features
y(np.ndarray): True labels
n_repeats(int, default=10): Number of permutations
random_state(int, optional): Random seed
scoring(str or callable, default='accuracy'): Metric
Returns: dict - Permutation importance results
Algorithm: Shuffle each feature and measure score drop
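The shuffle-and-score idea can be sketched in a few lines of NumPy. Everything here is illustrative: `ThresholdModel` is a toy stand-in for a trained classifier, `permutation_importance_sketch` is a hypothetical helper, and accuracy is hard-coded as the metric.

```python
import numpy as np


class ThresholdModel:
    """Toy stand-in for a trained classifier: predicts 1 when feature 0 > 0."""

    def predict(self, X):
        return (X[:, 0] > 0).astype(int)


def permutation_importance_sketch(model, X, y, n_repeats=10, random_state=None):
    """Hypothetical helper: accuracy drop after shuffling each feature."""
    rng = np.random.default_rng(random_state)
    baseline = np.mean(model.predict(X) == y)
    importances = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link to the labels.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            importances[j, r] = baseline - np.mean(model.predict(X_perm) == y)
    return {
        'importances_mean': importances.mean(axis=1),
        'importances_std': importances.std(axis=1),
        'importances': importances,
    }


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)   # only feature 0 carries label information
result = permutation_importance_sketch(ThresholdModel(), X, y, n_repeats=5, random_state=42)
```

Features the model ignores show an importance of zero, while shuffling the informative feature reduces accuracy to roughly chance level.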
Utility Functions
functions/utils.py
validate_spectra_data()
Validate spectral data structure.
from functions.utils import validate_spectra_data
is_valid, errors = validate_spectra_data(data)
if not is_valid:
    for error in errors:
        print(f"Validation error: {error}")
Parameters:
data(dict): Data dictionary
Returns: Tuple[bool, List[str]] - (is_valid, error_messages)
Checks:
Required keys present
Correct data types
Consistent dimensions
No NaN/Inf values
Valid wavenumber range
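A minimal sketch of how such checks might be combined is shown below. The helper `validate_sketch` is hypothetical; the required keys and the wavenumber rule (finite, non-negative, strictly increasing) are assumptions for illustration, not the library's documented behaviour.

```python
import numpy as np


def validate_sketch(data):
    """Hypothetical validator mirroring the listed checks; returns (is_valid, errors)."""
    errors = []
    missing = [k for k in ('wavenumbers', 'spectra', 'labels') if k not in data]
    if missing:
        return False, [f"Missing keys: {missing}"]

    wn = np.asarray(data['wavenumbers'], dtype=float)
    spectra = np.asarray(data['spectra'], dtype=float)

    if spectra.ndim != 2:
        errors.append("spectra must be 2-D (n_samples, n_features)")
    else:
        if spectra.shape[1] != wn.size:
            errors.append("spectra columns do not match wavenumbers length")
        if spectra.shape[0] != len(data['labels']):
            errors.append("spectra rows do not match number of labels")
    if not np.all(np.isfinite(spectra)):
        errors.append("spectra contain NaN/Inf values")
    # Assumed range rule: finite, non-negative, strictly increasing wavenumbers.
    if not (np.all(np.isfinite(wn)) and np.all(wn >= 0) and np.all(np.diff(wn) > 0)):
        errors.append("wavenumbers must be finite, non-negative and increasing")
    return not errors, errors


good = {'wavenumbers': [400, 401, 402], 'spectra': [[0.1, 0.2, 0.3]], 'labels': ['S1']}
bad = {'wavenumbers': [400, 401, 402], 'spectra': [[0.1, float('nan')]], 'labels': ['S1']}
```

Collecting all errors before returning, rather than stopping at the first one, lets the caller report every problem in a single pass.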
generate_mock_spectra()
Generate synthetic spectra for testing.
from functions.utils import generate_mock_spectra
data = generate_mock_spectra(
    n_samples=100,
    n_features=1000,
    n_groups=3,
    noise_level=0.05,
    random_state=42
)
# Returns standard data dictionary
Parameters:
n_samples(int): Number of spectra
n_features(int): Number of wavenumbers
n_groups(int): Number of groups
noise_level(float): Noise standard deviation (0-1)
random_state(int, optional): Random seed
Returns: dict - Synthetic spectral data
Use Case: Testing, prototyping, demonstrations
calculate_snr()
Calculate Signal-to-Noise Ratio.
from functions.utils import calculate_snr
snr = calculate_snr(
    spectrum=spectrum,
    signal_range=(1640, 1680),  # Amide I region
    noise_range=(1800, 2000)    # Baseline region
)
print(f"SNR: {snr:.2f} dB")
Parameters:
spectrum(np.ndarray): Input spectrum
signal_range(tuple): (min, max) wavenumbers for signal
noise_range(tuple): (min, max) wavenumbers for noise estimation
Returns: float - SNR in decibels (dB)
Formula: \(\mathrm{SNR} = 20 \log_{10}\left(\frac{\mu_{\text{signal}}}{\sigma_{\text{noise}}}\right)\)
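The formula can be worked through with NumPy as follows. The helper `snr_db` is hypothetical; unlike the documented signature it takes an explicit `wavenumbers` axis so the sketch is self-contained, and it assumes the noise term is the population standard deviation of the noise region.

```python
import numpy as np


def snr_db(spectrum, wavenumbers, signal_range, noise_range):
    """Hypothetical helper: SNR in dB from mean signal and noise standard deviation."""
    spectrum = np.asarray(spectrum, dtype=float)
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    in_signal = (wavenumbers >= signal_range[0]) & (wavenumbers <= signal_range[1])
    in_noise = (wavenumbers >= noise_range[0]) & (wavenumbers <= noise_range[1])
    mu_signal = spectrum[in_signal].mean()
    sigma_noise = spectrum[in_noise].std()   # population standard deviation (ddof=0)
    return float(20.0 * np.log10(mu_signal / sigma_noise))


wavenumbers = np.array([1640, 1660, 1680, 1800, 1850, 1900, 1950])
spectrum = np.array([10.0, 10.0, 10.0, 1.0, -1.0, 1.0, -1.0])
snr = snr_db(spectrum, wavenumbers, (1640, 1680), (1800, 2000))
```

Here the mean signal is 10 and the noise standard deviation is 1, so the ratio of 10 corresponds to 20 dB.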
find_peaks()
Find peaks in spectrum.
from functions.utils import find_peaks
peaks = find_peaks(
    spectrum=spectrum,
    wavenumbers=wavenumbers,
    prominence=0.05,
    distance=10,
    width=5
)
# Returns:
# {
#     'peak_indices': np.array([...]),
#     'peak_wavenumbers': np.array([...]),
#     'peak_intensities': np.array([...]),
#     'properties': {
#         'prominences': np.array([...]),
#         'widths': np.array([...]),
#         'left_bases': np.array([...]),
#         'right_bases': np.array([...])
#     }
# }
Parameters:
spectrum(np.ndarray): Input spectrum
wavenumbers(np.ndarray): Wavenumber axis
prominence(float, optional): Minimum peak prominence
distance(int, optional): Minimum distance between peaks
width(int or tuple, optional): Required peak width
height(float or tuple, optional): Required peak height
Returns: dict - Peak information
Use Case: Peak identification, peak tracking
integrate_region()
Integrate spectrum over wavenumber range.
from functions.utils import integrate_region
area = integrate_region(
    spectrum=spectrum,
    wavenumbers=wavenumbers,
    region=(1640, 1680),
    method='trapz'
)
Parameters:
spectrum(np.ndarray): Input spectrum
wavenumbers(np.ndarray): Wavenumber axis
region(tuple): (min, max) wavenumbers
method(str, default='trapz'): Integration method ('trapz', 'simps')
Returns: float - Integrated area
Use Case: Quantification, peak ratios
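As an illustration of the trapezoidal option, the sketch below integrates a region by summing trapezoids between neighbouring points. The helper `integrate_trapezoid` is hypothetical and assumes the wavenumber axis is in increasing order with both region bounds included.

```python
import numpy as np


def integrate_trapezoid(spectrum, wavenumbers, region):
    """Hypothetical helper: trapezoidal area between region[0] and region[1]."""
    spectrum = np.asarray(spectrum, dtype=float)
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    mask = (wavenumbers >= region[0]) & (wavenumbers <= region[1])
    x, y = wavenumbers[mask], spectrum[mask]
    # Each trapezoid: interval width times the mean of its two endpoint heights.
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


wavenumbers = np.arange(1600.0, 1701.0, 10.0)   # 1600, 1610, ..., 1700
spectrum = np.full(wavenumbers.shape, 2.0)      # flat spectrum of height 2
area = integrate_trapezoid(spectrum, wavenumbers, region=(1640, 1680))
```

A flat spectrum of height 2 over a 40 cm⁻¹ window gives an area of 80, which is an easy case to check by hand.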
resample_spectrum()
Resample spectrum to new wavenumber axis.
from functions.utils import resample_spectrum
resampled = resample_spectrum(
    spectrum=spectrum,
    old_wavenumbers=old_wn,
    new_wavenumbers=new_wn,
    method='linear'
)
Parameters:
spectrum(np.ndarray): Input spectrum
old_wavenumbers(np.ndarray): Original wavenumber axis
new_wavenumbers(np.ndarray): Target wavenumber axis
method(str, default='linear'): Interpolation method ('linear', 'cubic', 'spline')
Returns: np.ndarray - Resampled spectrum
Use Case: Align spectra with different wavenumber axes
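For the linear case, resampling amounts to one-dimensional interpolation. The sketch below uses `np.interp` as a stand-in; whether `resample_spectrum` uses it internally is an assumption.

```python
import numpy as np

old_wn = np.array([400.0, 402.0, 404.0])
spectrum = np.array([1.0, 3.0, 5.0])
new_wn = np.array([400.0, 401.0, 403.0])

# np.interp expects the original axis (old_wn) in increasing order.
resampled = np.interp(new_wn, old_wn, spectrum)
```

Target points outside the original range are clamped to the end values by `np.interp`, so it is worth restricting `new_wn` to the overlap of the two axes.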
split_train_test()
Split data into train/test sets with stratification.
from functions.utils import split_train_test
X_train, X_test, y_train, y_test = split_train_test(
    X=spectra,
    y=labels,
    test_size=0.2,
    stratify=True,
    random_state=42
)
Parameters:
X(np.ndarray): Features
y(np.ndarray): Labels
test_size(float, default=0.2): Test set fraction
stratify(bool, default=True): Preserve class distribution
random_state(int, optional): Random seed
Returns: Tuple[np.ndarray, …] - X_train, X_test, y_train, y_test
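Stratification means drawing the test fraction from each class separately. A minimal NumPy sketch of that idea follows; `stratified_split` is a hypothetical helper that keeps the original sample order within each split, which the real function may not do.

```python
import numpy as np


def stratified_split(X, y, test_size=0.2, random_state=None):
    """Hypothetical helper: per-class sampling so both splits keep class proportions."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(random_state)
    test_idx = []
    for cls in np.unique(y):
        cls_idx = rng.permutation(np.flatnonzero(y == cls))   # shuffle within the class
        n_test = int(round(test_size * cls_idx.size))
        test_idx.extend(cls_idx[:n_test])
    test_mask = np.zeros(y.size, dtype=bool)
    test_mask[test_idx] = True
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]


X = np.arange(40).reshape(20, 2)
y = np.array(['Control'] * 10 + ['Treatment'] * 10)
X_train, X_test, y_train, y_test = stratified_split(X, y, test_size=0.2, random_state=42)
```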
export_pipeline()
Export preprocessing pipeline to JSON.
from functions.utils import export_pipeline
export_pipeline(
    pipeline=[
        {'method': 'asls', 'params': {'lambda': 1e5, 'p': 0.01}},
        {'method': 'savgol', 'params': {'window_length': 11, 'polyorder': 3}},
        {'method': 'vector_norm', 'params': {}}
    ],
    filepath='pipelines/my_pipeline.json',
    metadata={
        'description': 'Standard preprocessing for tissue samples',
        'author': 'Researcher Name',
        'created': '2026-01-24'
    }
)
Parameters:
pipeline(List[dict]): Pipeline steps
filepath(str): Output file path
metadata(dict, optional): Additional metadata
Side Effects: Creates JSON file
load_pipeline()
Load preprocessing pipeline from JSON.
from functions.utils import load_pipeline
pipeline, metadata = load_pipeline('pipelines/my_pipeline.json')
# pipeline: [{'method': 'asls', 'params': {...}}, ...]
# metadata: {'description': '...', 'author': '...', 'created': '...'}
Parameters:
filepath(str): Pipeline file path
Returns: Tuple[List[dict], dict] - (pipeline_steps, metadata)
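The export and load round trip can be sketched with the standard `json` module. The on-disk layout used here (a single object with `pipeline` and `metadata` keys) is an assumption for illustration; the actual file format is not specified above.

```python
import json
import os
import tempfile

pipeline = [
    {'method': 'asls', 'params': {'lambda': 1e5, 'p': 0.01}},
    {'method': 'vector_norm', 'params': {}},
]
metadata = {'description': 'Example pipeline'}

# Assumed file layout: one JSON object with 'pipeline' and 'metadata' keys.
path = os.path.join(tempfile.mkdtemp(), 'my_pipeline.json')
with open(path, 'w', encoding='utf-8') as fh:
    json.dump({'pipeline': pipeline, 'metadata': metadata}, fh, indent=2)

# Read the file back and unpack it into the (pipeline_steps, metadata) pair.
with open(path, encoding='utf-8') as fh:
    loaded = json.load(fh)
loaded_pipeline, loaded_metadata = loaded['pipeline'], loaded['metadata']
```

Because the steps are plain dictionaries of JSON-compatible values, the loaded pipeline compares equal to the one that was saved.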
Pipeline Execution
execute_preprocessing_pipeline()
Execute complete preprocessing pipeline.
from functions.preprocess.pipeline import execute_preprocessing_pipeline
result = execute_preprocessing_pipeline(
    spectra=raw_spectra,
    pipeline=[
        {'method': 'asls', 'params': {'lambda': 1e5, 'p': 0.01}},
        {'method': 'savgol', 'params': {'window_length': 11, 'polyorder': 3}},
        {'method': 'vector_norm', 'params': {}}
    ],
    progress_callback=lambda i, total, msg: print(f"{i}/{total}: {msg}")
)
# Returns:
# {
#     'preprocessed_spectra': np.array([[...], [...]]),
#     'success': True,
#     'execution_time': 2.34,
#     'steps_applied': ['asls', 'savgol', 'vector_norm'],
#     'intermediate_results': {...}  # if return_intermediate=True
# }
Parameters:
spectra(np.ndarray): Input spectra (n_samples, n_features)
pipeline(List[dict]): Pipeline steps
progress_callback(callable, optional): Progress function (current, total, message)
return_intermediate(bool, default=False): Return results after each step
parallel(bool, default=False): Process spectra in parallel
n_jobs(int, default=-1): Number of parallel jobs (-1 = all CPUs)
Returns: dict - Processing results
Raises:
PreprocessingError: If any step fails
ValueError: If pipeline configuration is invalid
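To make the step-by-step behaviour concrete, here is a small sketch of a pipeline runner that dispatches on the `method` name and reports progress. `STEP_REGISTRY`, `run_pipeline`, and the toy `offset` step are hypothetical and only mimic the documented interface; the real function's internals are not shown in this reference.

```python
import time

import numpy as np


def vector_norm(spectra):
    """Scale each spectrum (row) to unit Euclidean length."""
    return spectra / np.linalg.norm(spectra, axis=1, keepdims=True)


def offset(spectra, value=0.0):
    """Toy step used only for this illustration."""
    return spectra + value


# Hypothetical registry mapping method names to callables.
STEP_REGISTRY = {'vector_norm': vector_norm, 'offset': offset}


def run_pipeline(spectra, pipeline, progress_callback=None):
    """Apply each step in order; unknown methods raise ValueError."""
    start = time.perf_counter()
    result = np.asarray(spectra, dtype=float)
    applied = []
    for i, step in enumerate(pipeline, start=1):
        if step['method'] not in STEP_REGISTRY:
            raise ValueError(f"Unknown method: {step['method']}")
        result = STEP_REGISTRY[step['method']](result, **step.get('params', {}))
        applied.append(step['method'])
        if progress_callback is not None:
            progress_callback(i, len(pipeline), f"Applied {step['method']}")
    return {
        'preprocessed_spectra': result,
        'success': True,
        'execution_time': time.perf_counter() - start,
        'steps_applied': applied,
    }


messages = []
out = run_pipeline(
    spectra=[[2.0, 3.0], [-1.0, 7.0]],
    pipeline=[
        {'method': 'offset', 'params': {'value': 1.0}},
        {'method': 'vector_norm', 'params': {}},
    ],
    progress_callback=lambda i, total, msg: messages.append(f"{i}/{total}: {msg}"),
)
```

Keeping each step as a name plus a parameter dictionary is what allows the same pipeline definition to be exported to JSON and replayed later.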
Best Practices
Function Usage
# Good: Clear, explicit parameters
corrected = apply_asls(
    spectrum=spectrum,
    lambda_=1e5,
    p=0.01,
    max_iter=10
)
# Avoid: Relying on defaults without understanding
corrected = apply_asls(spectrum) # May not be appropriate for your data
Error Handling
from functions.preprocess.baseline import apply_asls
from functions._utils_ import PreprocessingError
try:
    corrected = apply_asls(spectrum, lambda_=1e5, p=0.01)
except PreprocessingError as e:
    print(f"Preprocessing failed: {e}")
    # Handle error or use fallback method
except ValueError as e:
    print(f"Invalid parameters: {e}")
Pipeline Design
# Good: Logical order
pipeline = [
    {'method': 'asls', 'params': {...}},       # 1. Baseline correction
    {'method': 'savgol', 'params': {...}},     # 2. Smoothing
    {'method': 'vector_norm', 'params': {}}    # 3. Normalization
]
# Avoid: Illogical order
pipeline = [
    {'method': 'vector_norm', 'params': {}},   # Normalizing before baseline correction
    {'method': 'asls', 'params': {...}}        # May normalize baseline as well
]
Performance Optimization
# For large datasets, use parallel processing
result = execute_preprocessing_pipeline(
    spectra=large_dataset,
    pipeline=my_pipeline,
    parallel=True,
    n_jobs=-1  # Use all CPU cores
)
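# For repeated operations, cache intermediate results
from functools import lru_cache
import numpy as np
@lru_cache(maxsize=128)
def cached_preprocessing(spectrum_tuple, pipeline_hash):
    # lru_cache needs hashable arguments, so pass tuple(spectrum), not an array.
    # pipeline_hash is only a cache key; `pipeline` is read from the enclosing scope.
    spectrum = np.array(spectrum_tuple)
    return apply_preprocessing(spectrum, pipeline)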
See Also
Core API - Core application modules
Pages API - Application pages
Components API - UI components
Analysis Methods - Detailed method documentation
Best Practices - Guidelines and recommendations
Last Updated: 2026-01-24