Machine Learning Methods
Comprehensive reference for classification and regression algorithms.
Note: Multi-Layer Perceptron (MLP) and neural networks are planned for future releases.
Support Vector Machines (SVM)
Purpose: Binary or multi-class classification using optimal decision boundaries
Theory
Core Concept: Find hyperplane that maximizes margin between classes
Key Components:
Support Vectors: Data points closest to decision boundary
Margin: Distance between hyperplane and nearest points
Kernel: Function to transform data to higher dimensions
Decision Function:
f(x) = sign(Σ αᵢ yᵢ K(xᵢ, x) + b)
Where:
α: Lagrange multipliers
y: Class labels
K: Kernel function
b: Bias term
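The decision function above can be sketched without any library. This is a minimal illustration, assuming an RBF kernel and a hand-picked set of support vectors; the multipliers, bias, and the helper names `rbf_kernel` and `svm_decision` are hypothetical and do not come from a trained model.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """Return (score, label) with score = sum_i alpha_i * y_i * K(x_i, x) + b."""
    score = bias + sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )
    return score, (1 if score >= 0 else -1)

# Hypothetical model: one support vector per class (labels are +1 / -1).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [-1, +1]
bias = 0.0

score_near_pos, label_near_pos = svm_decision((1.8, 2.1), support_vectors, alphas, labels, bias)
score_near_neg, label_near_neg = svm_decision((0.1, -0.2), support_vectors, alphas, labels, bias)
print(label_near_pos, label_near_neg)
```

Only the support vectors enter the sum, which is why prediction cost grows with their number rather than with the size of the training set.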
Kernel Functions
1. Linear Kernel
Formula: K(x, x') = xᵀx'
When to Use:
✓ Linearly separable data
✓ High-dimensional data (text, spectra)
✓ Large datasets (fast)
Pros:
Fast training and prediction
Interpretable (feature weights)
Less prone to overfitting
Cons:
Cannot handle non-linear boundaries
2. RBF (Radial Basis Function) Kernel
Formula: K(x, x') = exp(-γ ||x - x'||²)
When to Use:
✓ Non-linear boundaries
✓ Default choice for most problems
✓ Unknown data structure
Pros:
Handles non-linearity
Flexible decision boundaries
Cons:
More parameters to tune (C, γ)
Slower than linear
Risk of overfitting
Parameter γ (gamma):
High γ (e.g., 0.1): Narrow influence → Complex boundaries (overfitting risk)
Low γ (e.g., 0.001): Wide influence → Smooth boundaries
‘scale’ (default): γ = 1 / (n_features × variance)
‘auto’: γ = 1 / n_features
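To make the γ settings concrete, the sketch below computes the 'scale' and 'auto' values for a toy matrix and shows how the kernel value between two fixed points changes with γ. The data are hypothetical, and the variance is taken over all entries of X, which is an assumption matching the formula listed above.

```python
import math
from statistics import pvariance

# Hypothetical toy dataset: 4 samples, 2 features.
X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
n_features = len(X[0])

# 'scale' divides by the variance over all entries of X; 'auto' ignores the data scale.
all_values = [v for row in X for v in row]
gamma_scale = 1.0 / (n_features * pvariance(all_values))
gamma_auto = 1.0 / n_features

def rbf(sq_dist, gamma):
    """RBF kernel value for a given squared distance."""
    return math.exp(-gamma * sq_dist)

print(f"gamma_scale={gamma_scale:.4f}, gamma_auto={gamma_auto:.4f}")

# Kernel value for two points at squared distance 4, for several gammas:
# a larger gamma makes the similarity fall off faster (narrower influence).
for gamma in (0.001, 0.1, 1.0):
    print(f"gamma={gamma}: K={rbf(4.0, gamma):.4f}")
```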
3. Polynomial Kernel
Formula: K(x, x') = (γ xᵀx' + r)ᵈ
Parameters:
d: Polynomial degree (2 or 3 typical)
r: Coefficient (usually 0)
When to Use:
✓ Known polynomial relationship
✓ Image data
✗ Rarely used for Raman spectra
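The idea that a kernel implicitly maps data to a higher-dimensional space can be checked by hand for the polynomial kernel. The sketch below assumes 2-D inputs with γ = 1, r = 0, d = 2; the helper names are illustrative.

```python
import math

def poly_kernel(x, z, gamma=1.0, r=0.0, d=2):
    """K(x, z) = (gamma * x.z + r)^d."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (gamma * dot + r) ** d

def degree2_feature_map(x):
    """Explicit 3-D feature map matching the kernel for gamma=1, r=0, d=2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)

# Kernel evaluated in the original 2-D space.
k_implicit = poly_kernel(x, z)

# Ordinary dot product after mapping both points to 3-D.
k_explicit = sum(a * b for a, b in zip(degree2_feature_map(x), degree2_feature_map(z)))

print(k_implicit, k_explicit)
```

Both numbers agree, so the kernel gives the inner product in the mapped space without ever constructing the mapped vectors.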
Hyperparameters
C (Regularization Parameter)
Purpose: Control trade-off between margin and misclassification
Effect:
High C (e.g., 100): Smaller margin, fewer errors → Overfitting risk
Low C (e.g., 0.1): Larger margin, more errors → Underfitting risk
Typical Range: 0.1 - 100
Tuning Strategy:
# Grid search over C
C_range = [0.1, 1, 10, 100]
Gamma (γ) - RBF Kernel Only
Purpose: Define influence radius of single training example
Effect:
High γ (e.g., 1.0): Tight fit → Overfitting
Low γ (e.g., 0.001): Loose fit → Underfitting
Typical Range: 0.0001 - 1.0
Tuning Strategy:
# Grid search over gamma
gamma_range = [0.0001, 0.001, 0.01, 0.1, 1.0]
Usage Example
from sklearn.model_selection import train_test_split
from functions.ML.svm import train_svm_model
# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
preprocessed_spectra,
labels,
test_size=0.2,
random_state=42,
stratify=labels
)
# Train SVM with RBF kernel
svm_model = train_svm_model(
X_train, y_train,
kernel='rbf',
C=10.0,
gamma='scale',
random_state=42
)
# Predictions
y_pred = svm_model.predict(X_test)
# Evaluate
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))
Hyperparameter Optimization
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Define parameter grid
param_grid = {
'C': [0.1, 1, 10, 100],
'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
'kernel': ['rbf']
}
# Grid search with cross-validation
grid_search = GridSearchCV(
SVC(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
# Best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Use best model
best_svm = grid_search.best_estimator_
Interpretation
Decision Function Values:
# Get decision function scores
decision_scores = svm_model.decision_function(X_test)
# For binary classification:
# Positive → Class 1
# Negative → Class 0
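# Magnitude → Distance from the boundary (an uncalibrated confidence)
# For multi-class: SVC trains one-vs-one classifiers internally, but by default
# (decision_function_shape='ovr') decision_function returns one score per class;
# set decision_function_shape='ovo' to get one score per class pair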
Support Vectors:
# Number of support vectors per class
print(f"Support vectors: {svm_model.n_support_}")
# Indices of support vectors
support_indices = svm_model.support_
# Support vectors themselves
support_vectors = X_train[support_indices]
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Poor accuracy | Wrong kernel | Try RBF if using linear |
| Overfitting | C too high or γ too high | Reduce C, reduce γ |
| Underfitting | C too low or γ too low | Increase C, increase γ |
| Very slow training | Large dataset with RBF | Use LinearSVC or subsample |
| Poor validation | Data leakage | Use GroupKFold for patient-level |
When to Use
Use SVM when:
✓ Binary or multi-class classification
✓ High-dimensional data (spectra work well)
✓ Clear margin between classes
✓ Small to medium datasets (< 50,000 samples)
✓ Need probability estimates (probability=True adds Platt scaling via internal cross-validation, which slows training)
Consider alternatives when:
✗ Very large datasets (> 100,000) → Random Forest
✗ Need interpretability → Logistic Regression
✗ Categorical features → Random Forest/XGBoost
✗ Need feature importance → Random Forest
Class Imbalance
from sklearn.svm import SVC
# Handle imbalanced classes
svm_model = SVC(
kernel='rbf',
C=10,
gamma='scale',
class_weight='balanced', # Automatic adjustment
random_state=42
)
# Or specify custom weights
class_weights = {0: 1.0, 1: 3.0} # Give 3x weight to class 1
svm_model = SVC(class_weight=class_weights)
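For reference, the 'balanced' option weights each class inversely to its frequency, using n_samples / (n_classes × class count). The sketch below reproduces that formula on a hypothetical 90/10 label vector; `balanced_class_weights` is an illustrative helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(y):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y_example = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_example)
print(weights)
```

The minority class receives the larger weight, so its misclassifications are penalized more heavily during training.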
Reference
Cortes & Vapnik (1995). “Support-Vector Networks”
Random Forest
Purpose: Ensemble of decision trees for robust classification/regression
Theory
Core Concept: Combine multiple decision trees to reduce overfitting
Bagging (Bootstrap Aggregating):
Create multiple bootstrap samples (random sampling with replacement)
Train decision tree on each sample
Aggregate predictions (vote for classification, average for regression)
Random Feature Selection:
At each split, consider random subset of features
Decorrelates trees → Better ensemble
Out-of-Bag (OOB) Error:
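Each tree is trained on a bootstrap sample containing ~63% of the unique training samples
The remaining ~37% (out-of-bag samples) give a built-in validation estimate; a held-out test set is still advisable for the final evaluation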
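The ~63% / ~37% figures follow from sampling with replacement: the chance that a given sample appears in a bootstrap sample of size n is 1 − (1 − 1/n)ⁿ, which approaches 1 − 1/e ≈ 0.632. A small simulation, with an arbitrary sample size and seed, illustrates this:

```python
import random

def unique_fraction(n_samples, rng):
    """Fraction of distinct indices in one bootstrap sample of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples

rng = random.Random(42)   # fixed seed so the run is repeatable
n_samples = 2000

# Average the in-bag fraction over 50 simulated bootstrap samples.
fractions = [unique_fraction(n_samples, rng) for _ in range(50)]
mean_in_bag = sum(fractions) / len(fractions)

# Closed-form expectation for comparison.
expected = 1 - (1 - 1 / n_samples) ** n_samples

print(f"simulated in-bag fraction: {mean_in_bag:.3f}")
print(f"expected in-bag fraction:  {expected:.3f}")
```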
Hyperparameters
n_estimators
Purpose: Number of trees in forest
Effect:
More trees → More stable, but slower
Fewer trees → Faster, but less stable
Typical Range: 100 - 1000
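Recommendation: Start with 100, increase while the OOB error is still decreasing
from sklearn.ensemble import RandomForestClassifier
# Check OOB error vs n_trees
n_tree_grid = range(10, 500, 10)
oob_errors = []
for n in n_tree_grid:
    rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    oob_errors.append(1 - rf.oob_score_)
# Plot oob_errors against n_tree_grid to find the plateau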
max_depth
Purpose: Maximum depth of each tree
Effect:
None (default): Trees grow until pure leaves
Shallow (e.g., 5-10): Prevents overfitting
Deep (e.g., 20-30): Captures complex patterns
Typical Range: 10 - 30, or None
Tuning:
# Start with None, then limit if overfitting
max_depth = None # or 20 for regularization
min_samples_split
Purpose: Minimum samples required to split node
Effect:
Low (e.g., 2): More splits → More complex trees
High (e.g., 10-20): Fewer splits → Regularization
Default: 2
Recommendation: Increase to 5-10 if overfitting
min_samples_leaf
Purpose: Minimum samples in leaf node
Effect:
Low (e.g., 1): Allows pure leaves
High (e.g., 5-10): Smooths predictions
Default: 1
Recommendation: Set to 2-5 for regularization
max_features
Purpose: Number of features to consider for each split
Options:
‘sqrt’ (default): √n_features (recommended for classification)
‘log2’: log₂(n_features)
int: Specific number
float: Proportion (e.g., 0.3 = 30%)
Effect:
Fewer features → More diversity → Better ensemble
More features → Individual trees stronger → Less diversity
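As a rough guide to what these options mean for spectral data, the sketch below converts each setting into a feature count. The helper `resolve_max_features` and the truncate-to-integer rounding are assumptions for illustration; check the library documentation for the exact rule in your version.

```python
import math

def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of candidate features per split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    # Otherwise treat the value as a fraction of all features.
    return max(1, int(max_features * n_features))

# Hypothetical spectrum with 1000 wavenumber bins.
for setting in ("sqrt", "log2", 0.3, 50, None):
    print(setting, "->", resolve_max_features(setting, 1000))
```

With 1000 features, 'sqrt' examines about 31 candidates per split and 'log2' about 9, which shows how strongly the setting changes tree diversity.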
Usage Example
from functions.ML.random_forest import train_rf_model
# Train Random Forest
rf_model = train_rf_model(
X_train, y_train,
n_estimators=100,
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
max_features='sqrt',
random_state=42,
n_jobs=-1 # Use all CPU cores
)
# Predictions
y_pred = rf_model.predict(X_test)
# Prediction probabilities
y_proba = rf_model.predict_proba(X_test)
# Feature importance
importances = rf_model.feature_importances_
Hyperparameter Optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
# Define parameter distributions
param_dist = {
'n_estimators': [100, 200, 500],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', 0.3]
}
# Random search (faster than grid search)
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions=param_dist,
n_iter=50, # Number of parameter combinations to try
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42,
verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
Feature Importance
Gini Importance (Default):
import numpy as np
import matplotlib.pyplot as plt
# Get feature importances
importances = rf_model.feature_importances_
# Sort by importance
indices = np.argsort(importances)[::-1]
top_k = 20
# Plot top features
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), importances[indices[:top_k]])
plt.xlabel('Feature (Wavenumber) Index')
plt.ylabel('Importance')
plt.title('Top 20 Important Features')
plt.tight_layout()
plt.show()
# Map to wavenumbers
top_wavenumbers = wavenumbers[indices[:top_k]]
print(f"Top wavenumbers: {top_wavenumbers}")
Permutation Importance (More Accurate):
from sklearn.inspection import permutation_importance
# Calculate permutation importance
perm_importance = permutation_importance(
rf_model,
X_test,
y_test,
n_repeats=10,
random_state=42,
n_jobs=-1
)
# Get importances and standard deviations
perm_importances_mean = perm_importance.importances_mean
perm_importances_std = perm_importance.importances_std
# Sort
indices = np.argsort(perm_importances_mean)[::-1]
top_k = 20
# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(top_k),
perm_importances_mean[indices[:top_k]],
yerr=perm_importances_std[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Permutation Importance')
plt.title('Top 20 Features (Permutation Importance)')
plt.tight_layout()
plt.show()
SHAP Values
Purpose: Explain model predictions via Shapley-value-based feature attribution.
When to use:
✓ Per-sample explanations (which wavenumbers drive a prediction)
✓ Global importance aggregated from local attributions
Note: This typically uses the external shap library.
In this application: SHAP explainability is exposed via the ML page SHAP action and produces a per-spectrum explanation with a contribution bar plot across the Raman shift axis (red = positive, blue = negative), a ranked contributor table, and an exportable bundle.
Out-of-Bag (OOB) Score
# Train with OOB scoring
rf_model = RandomForestClassifier(
n_estimators=100,
oob_score=True,
random_state=42
)
rf_model.fit(X_train, y_train)
# OOB score (similar to cross-validation)
print(f"OOB Score: {rf_model.oob_score_:.3f}")
# OOB class-probability estimates, shape (n_samples, n_classes)
oob_proba = rf_model.oob_decision_function_
Advantages
✓ Robust: Less prone to overfitting than single tree
✓ Feature importance: Automatic ranking
✓ Handles non-linearity: No kernel needed
✓ Missing values: Can handle with imputation
✓ No scaling required: Works with raw features
✓ Parallelizable: Fast training with multiple cores
Limitations
✗ Black box: Hard to interpret individual predictions
✗ Large models: Memory intensive for many trees
✗ Extrapolation: Poor performance outside training range
✗ Biased importance: Gini importance favors features with many distinct values
When to Use
Use Random Forest when:
✓ Tabular data with mixed features
✓ Need feature importance
✓ Don’t want to tune many hyperparameters
✓ Want robust, reliable performance
✓ Have sufficient data (> 1000 samples)
Consider alternatives when:
✗ Need probabilistic model → Logistic Regression
✗ Have sequential/spatial data → Neural networks
✗ Need speed → Logistic Regression or LinearSVC
✗ Want best accuracy → XGBoost (usually better)
Reference
Breiman (2001). “Random Forests”
XGBoost
Purpose: Gradient boosting for high-performance classification/regression
Theory
Core Concept: Build trees sequentially, each correcting previous errors
Gradient Boosting:
Start with simple model (e.g., mean prediction)
Calculate residuals (errors)
Train new tree to predict residuals
Add to ensemble with small weight
Repeat until convergence
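The steps above can be sketched in plain Python for squared-error regression on one feature, using depth-1 trees (stumps). This is a teaching sketch with a fixed number of rounds, not XGBoost's algorithm: it has no regularization or second-order terms, and all function names are illustrative.

```python
def fit_stump(x, residuals):
    """Find the single threshold on 1-D x that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]   # (threshold, left_mean, right_mean)

def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    base = sum(y) / len(y)                 # 1. start with the mean prediction
    preds = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):              # 5. repeat (fixed number of rounds here)
        residuals = [yi - pi for yi, pi in zip(y, preds)]    # 2. residuals
        stump = fit_stump(x, residuals)                      # 3. fit tree to residuals
        stumps.append(stump)
        preds = [pi + learning_rate * stump_predict(stump, xi)   # 4. add with small weight
                 for xi, pi in zip(x, preds)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, mse_history

def boosted_predict(xi, base, stumps, learning_rate=0.1):
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)

# Hypothetical data: y = x^2 on [0, 3.9].
x_train = [i / 10 for i in range(40)]
y_train = [xi ** 2 for xi in x_train]
base, stumps, mse_history = gradient_boost(x_train, y_train)
print(f"MSE after round 1: {mse_history[0]:.3f}, after round 50: {mse_history[-1]:.3f}")
```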
XGBoost Innovations:
Regularization (L1/L2) to prevent overfitting
Handling missing values automatically
Parallelized split finding within each tree (trees are still added sequentially)
Built-in cross-validation
Formula:
F(x) = Σ fₖ(x)
where fₖ is k-th tree
Hyperparameters
n_estimators
Purpose: Number of boosting rounds (trees)
Effect:
More trees → Better training fit → Overfitting risk
Fewer trees → Underfitting
Typical Range: 100 - 1000
Best Practice: Use early stopping to find optimal number
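# Early stopping
# (XGBoost >= 1.6 takes early_stopping_rounds on the estimator; the fit() argument was removed in 2.0)
from xgboost import XGBClassifier
xgb_model = XGBClassifier(
    n_estimators=1000,         # Upper bound; early stopping picks the actual number
    early_stopping_rounds=50,  # Stop if no improvement for 50 rounds
    eval_metric='logloss'
)
# X_val, y_val: a validation split held out from the training data
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)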
learning_rate (eta)
Purpose: Step size for each tree’s contribution
Effect:
High (e.g., 0.3): Fast convergence → Overfitting risk
Low (e.g., 0.01): Slow convergence → More trees needed → Better generalization
Typical Range: 0.01 - 0.3
Rule of Thumb: Lower learning_rate requires more n_estimators
# Conservative approach
learning_rate = 0.1
n_estimators = 500
# Aggressive approach (faster, riskier)
learning_rate = 0.3
n_estimators = 100
max_depth
Purpose: Maximum depth of each tree
Effect:
Deep (e.g., 6-10): Captures complex interactions
Shallow (e.g., 3-5): Prevents overfitting
Typical Range: 3 - 10
Default: 6
subsample
Purpose: Fraction of samples used for each tree
Effect:
< 1.0 (e.g., 0.8): Reduces overfitting, speeds up training
= 1.0: Use all samples
Typical Range: 0.5 - 1.0
Recommendation: 0.8 for robustness
colsample_bytree
Purpose: Fraction of features used for each tree
Effect:
< 1.0 (e.g., 0.8): Reduces overfitting, adds diversity
= 1.0: Use all features
Typical Range: 0.5 - 1.0
Recommendation: 0.8
gamma (min_split_loss)
Purpose: Minimum loss reduction to make split
Effect:
Higher: More conservative splitting → Regularization
Lower: More splits → Risk of overfitting
Typical Range: 0 - 5
Default: 0
lambda (reg_lambda) - L2 Regularization
Purpose: L2 penalty on leaf weights
Effect:
Higher: Stronger regularization
Lower: Less regularization
Typical Range: 0 - 10
Default: 1
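How gamma and lambda enter a split decision can be seen from the split gain given in Chen & Guestrin (2016): Gain = ½ [G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ, where G and H are the sums of first and second derivatives of the loss in each child. The sketch below evaluates it for hypothetical gradient sums and omits the L1 term; the function names are illustrative.

```python
def leaf_score(G, H, reg_lambda):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node; the split is kept only if this is positive."""
    children = leaf_score(G_left, H_left, reg_lambda) + leaf_score(G_right, H_right, reg_lambda)
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    return 0.5 * (children - parent) - gamma

# Hypothetical gradient/hessian sums for the two children of a candidate split.
stats = dict(G_left=-4.0, H_left=2.0, G_right=3.0, H_right=2.0)

gain_default = split_gain(**stats)                        # lambda=1, gamma=0
gain_high_gamma = split_gain(**stats, gamma=5.0)          # larger gamma can veto the split
gain_high_lambda = split_gain(**stats, reg_lambda=10.0)   # larger lambda shrinks the gain

print(f"{gain_default:.3f} {gain_high_gamma:.3f} {gain_high_lambda:.3f}")
```

Raising gamma subtracts a fixed cost from every candidate split, while raising lambda shrinks the scores themselves.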
alpha (reg_alpha) - L1 Regularization
Purpose: L1 penalty on leaf weights (encourages sparse leaf weights)
Effect:
Higher: More sparsity (some weights → 0)
Lower: Less sparsity
Typical Range: 0 - 10
Default: 0
Usage Example
from functions.ML.xgboost import train_xgboost_model
# Train XGBoost
xgb_model = train_xgboost_model(
X_train, y_train,
n_estimators=500,
learning_rate=0.1,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
gamma=0,
reg_lambda=1,
reg_alpha=0,
random_state=42,
n_jobs=-1
)
# Predictions
y_pred = xgb_model.predict(X_test)
y_proba = xgb_model.predict_proba(X_test)
Hyperparameter Optimization
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
# Define parameter distributions
param_dist = {
'n_estimators': [100, 200, 500, 1000],
'learning_rate': [0.01, 0.05, 0.1, 0.3],
'max_depth': [3, 5, 7, 9],
'subsample': [0.6, 0.8, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'gamma': [0, 1, 5],
'reg_lambda': [0, 1, 10],
'reg_alpha': [0, 0.1, 1]
}
# Random search
random_search = RandomizedSearchCV(
XGBClassifier(random_state=42, eval_metric='logloss'),
param_distributions=param_dist,
n_iter=50,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42,
verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
Feature Importance
import matplotlib.pyplot as plt
# Get feature importances
importances = xgb_model.feature_importances_
# Plot top features
import numpy as np
indices = np.argsort(importances)[::-1]
top_k = 20
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), importances[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Importance (Gain)')
plt.title('Top 20 Important Features (XGBoost)')
plt.tight_layout()
plt.show()
# Importance types in XGBoost:
# - 'weight': Number of times the feature is used in a split
# - 'gain': Average gain of splits using the feature (default for feature_importances_)
# - 'cover': Average coverage (roughly, samples affected) of splits using the feature
# Get different importance types (dicts keyed by feature name;
# features never used in a split are omitted)
importance_gain = xgb_model.get_booster().get_score(importance_type='gain')
importance_weight = xgb_model.get_booster().get_score(importance_type='weight')
Learning Curves
# Train with evaluation set to monitor performance
# (XGBoost >= 2.0: eval_metric is set on the estimator, not passed to fit())
eval_set = [(X_train, y_train), (X_val, y_val)]
xgb_model.set_params(eval_metric='logloss')
xgb_model.fit(
    X_train, y_train,
    eval_set=eval_set,
    verbose=50 # Print every 50 rounds
)
# Get evaluation results
results = xgb_model.evals_result()
# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(results['validation_0']['logloss'], label='Train')
plt.plot(results['validation_1']['logloss'], label='Validation')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.legend()
plt.title('XGBoost Learning Curves')
plt.tight_layout()
plt.show()
Advantages
✓ State-of-the-art accuracy: Often wins competitions
✓ Regularization: Built-in L1/L2
✓ Handles missing data: Automatically
✓ Fast: Parallelized split finding
✓ Early stopping: Prevents overfitting
✓ Feature importance: Gain, weight, cover
Limitations
✗ Many hyperparameters: Requires tuning
✗ Sensitive to overfitting: With default params
✗ Black box: Hard to interpret
✗ Memory intensive: For large datasets
When to Use
Use XGBoost when:
✓ Need best possible accuracy
✓ Tabular data
✓ Have time to tune hyperparameters
✓ Competition or critical application
✓ Medium to large datasets
Consider alternatives when:
✗ Need interpretability → Logistic Regression
✗ Limited time → Random Forest (fewer params)
✗ Very small data → Logistic Regression or SVM
Reference
Chen & Guestrin (2016). “XGBoost: A Scalable Tree Boosting System”
Logistic Regression
Purpose: Linear model for probabilistic binary/multi-class classification
Theory
Core Concept: Model log-odds as linear combination of features
Binary Classification:
P(y=1|x) = 1 / (1 + exp(-(β₀ + β₁x₁ + ... + βₙxₙ)))
Multi-class (one-vs-rest or multinomial):
P(y=k|x) = exp(xᵀβₖ) / Σⱼ exp(xᵀβⱼ)
Decision Boundary: Linear in feature space
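The two probability formulas above can be evaluated directly. The sketch below uses hypothetical coefficients and also checks that the softmax with two classes reduces to the binary sigmoid; the helper names are illustrative.

```python
import math

def sigmoid(z):
    """Binary case: P(y=1|x) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Multi-class case: P(y=k|x) = exp(s_k) / sum_j exp(s_j)."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba_binary(x, beta0, beta):
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(z)

# Hypothetical coefficients and sample: z = -1 + 2*1 - 0.5*2 = 0.
beta0, beta = -1.0, [2.0, -0.5]
x = [1.0, 2.0]
p1 = predict_proba_binary(x, beta0, beta)
print(f"P(y=1|x) = {p1:.3f}")

# With two classes and scores (0, z), softmax gives the same value as sigmoid(z).
two_class = softmax([0.0, 1.3])
print(f"{two_class[1]:.6f} vs {sigmoid(1.3):.6f}")
```

A sample with z = 0 lies exactly on the decision boundary, which is why its probability is 0.5.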
Hyperparameters
C (Inverse Regularization)
Purpose: Control strength of regularization
Effect:
High C (e.g., 100): Weak regularization → Overfitting risk
Low C (e.g., 0.01): Strong regularization → Underfitting risk
Typical Range: 0.01 - 100
Note: C is the inverse of regularization strength (C = 1/λ), the same convention as SVM; smaller C means stronger regularization
penalty
Purpose: Type of regularization
Options:
‘l2’ (default): Ridge regularization (shrinks coefficients)
‘l1’: Lasso regularization (sparse coefficients, feature selection)
‘elasticnet’: Mix of L1 and L2
None: No regularization (older scikit-learn versions used the string ‘none’)
When to Use:
L2: Default, works well generally
L1: When want feature selection (many irrelevant features)
Elasticnet: When L1 too aggressive
solver
Purpose: Optimization algorithm
Options:
‘lbfgs’ (default): Good for small datasets, L2 only
‘saga’: Supports all penalties, good for large datasets
‘liblinear’: Good for small datasets, supports L1/L2
Recommendation: Use ‘saga’ for flexibility
max_iter
Purpose: Maximum iterations for convergence
Default: 100
Increase if: Warning about non-convergence
Typical Range: 100 - 10000
Usage Example
from functions.ML.logistic_regression import train_lr_model
# Train Logistic Regression
lr_model = train_lr_model(
X_train, y_train,
C=1.0,
penalty='l2',
solver='saga',
max_iter=1000,
random_state=42
)
# Predictions
y_pred = lr_model.predict(X_test)
y_proba = lr_model.predict_proba(X_test)
# Get coefficients (feature weights)
coefficients = lr_model.coef_[0] # For binary classification
# Intercept
intercept = lr_model.intercept_
Hyperparameter Optimization
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# Define parameter grid
param_grid = {
'C': [0.01, 0.1, 1, 10, 100],
'penalty': ['l1', 'l2'],
'solver': ['saga'],
'max_iter': [1000]
}
# Grid search
grid_search = GridSearchCV(
LogisticRegression(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
Interpretation
Coefficients (Feature Weights):
import numpy as np
import matplotlib.pyplot as plt
# Get coefficients
coefs = lr_model.coef_[0]
# Find most important features
indices = np.argsort(np.abs(coefs))[::-1]
top_k = 20
# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), coefs[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Coefficient')
plt.title('Top 20 Feature Coefficients')
plt.axhline(y=0, color='k', linestyle='--')
plt.tight_layout()
plt.show()
# Positive coefficient → Increases probability of class 1
# Negative coefficient → Decreases probability of class 1
Odds Ratios:
# Convert coefficients to odds ratios
odds_ratios = np.exp(coefs)
# Interpretation:
# OR = 2.0 → One unit increase in feature doubles odds
# OR = 0.5 → One unit increase halves odds
# OR = 1.0 → No effect
print("Odds ratios for top features:")
for i in indices[:10]:
    print(f" Feature {i}: OR = {odds_ratios[i]:.3f}")
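The odds-ratio reading of a coefficient can be verified numerically for a one-feature model: raising the feature by one unit multiplies the odds by exp(β₁), whatever the starting value. The coefficients below are hypothetical, chosen so that the odds ratio is 2.

```python
import math

def prob(x, beta0, beta1):
    """P(y=1|x) for a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    return p / (1.0 - p)

beta0, beta1 = -0.4, math.log(2.0)   # hypothetical model with exp(beta1) = 2

# Odds before and after a one-unit increase, at two different starting points.
ratio_at_0 = odds(prob(1.0, beta0, beta1)) / odds(prob(0.0, beta0, beta1))
ratio_at_3 = odds(prob(4.0, beta0, beta1)) / odds(prob(3.0, beta0, beta1))

print(f"{ratio_at_0:.3f} {ratio_at_3:.3f} exp(beta1)={math.exp(beta1):.3f}")
```

Note that "one unit" refers to the feature's scale after preprocessing, so odds ratios of standardized features describe a one-standard-deviation change.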
Probability Calibration:
# Check calibration
from sklearn.calibration import calibration_curve
# Predicted probabilities
y_proba = lr_model.predict_proba(X_test)[:, 1]
# Calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
y_test, y_proba, n_bins=10
)
# Plot
plt.figure(figsize=(8, 6))
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated')
plt.plot(mean_predicted_value, fraction_of_positives, 'o-', label='Logistic Regression')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.tight_layout()
plt.show()
Advantages
✓ Interpretable: Clear feature weights
✓ Probabilistic: Direct probability estimates
✓ Fast: Training and prediction
✓ Well-calibrated: Reliable probabilities
✓ Few hyperparameters: Mainly C and penalty
✓ Linear decision boundary: Simple
Limitations
✗ Linear only: Cannot capture non-linear relationships
✗ Feature engineering: May need manual feature creation
✗ Sensitive to scaling: Requires standardization
✗ Multicollinearity: Correlated features problematic
When to Use
Use Logistic Regression when:
✓ Need interpretability (coefficients)
✓ Want probabilistic outputs
✓ Approximately linear class boundary
✓ Baseline model
✓ Small datasets
✓ Need fast predictions
Consider alternatives when:
✗ Non-linear boundaries → SVM (RBF) or Random Forest
✗ Many irrelevant features → Random Forest (feature importance)
✗ Complex interactions → XGBoost or Neural Networks
Reference
Cox (1958). “The Regression Analysis of Binary Sequences”
Multi-Layer Perceptron (MLP)
Purpose: Neural network for non-linear classification/regression
Theory
Architecture: Input → Hidden Layers → Output
Neuron Computation:
output = activation(Σ wᵢxᵢ + b)
Activation Functions:
ReLU: f(x) = max(0, x) [Default, recommended]
Tanh: f(x) = tanh(x)
Logistic: f(x) = 1/(1 + e⁻ˣ)
Training: Backpropagation with gradient descent
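The neuron computation above can be traced by hand through a tiny network. This sketch runs a forward pass only (no training) through a hypothetical 2-2-1 network with ReLU in the hidden layer; the weights and helper names are illustrative.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, weights, biases, activation):
    """One layer: each neuron computes activation(sum_i w_i * x_i + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output.
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
out_w = [[2.0, 1.0]]
out_b = [0.5]

x = [3.0, 1.0]
hidden = dense(x, hidden_w, hidden_b, relu)        # [relu(3 - 1), relu(1.5 + 0.5 - 1)]
output = dense(hidden, out_w, out_b, identity)     # [2*2 + 1*1 + 0.5]
print(hidden, output)
```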
Hyperparameters
activation
Purpose: Non-linear activation function
Options:
‘relu’ (default): Recommended, works well
‘tanh’: Alternative, slower convergence
‘logistic’: Sigmoid, rarely used
Recommendation: Use ‘relu’
alpha
Purpose: L2 regularization strength
Effect:
Higher (e.g., 0.01): Strong regularization
Lower (e.g., 0.0001): Weak regularization
Typical Range: 0.0001 - 0.01
Default: 0.0001
learning_rate_init
Purpose: Initial learning rate
Effect:
Higher (e.g., 0.01): Faster convergence → Instability risk
Lower (e.g., 0.0001): Slower convergence → More stable
Typical Range: 0.0001 - 0.01
Default: 0.001
max_iter
Purpose: Maximum epochs (training iterations)
Default: 200
Typical Range: 200 - 1000
Increase if: Training loss still decreasing
early_stopping
Purpose: Stop training when validation score stops improving
Recommended: True
Parameters:
validation_fraction: Fraction for validation (default 0.1)
n_iter_no_change: Patience (default 10)
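The patience rule behind early stopping can be sketched as a counter that resets whenever the validation score improves. This is a simplified illustration, not the library's exact stopping logic; `early_stopping_epoch`, its `tol` argument, and the scores are hypothetical.

```python
def early_stopping_epoch(val_scores, patience=10, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + tol:
            best = score                      # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1   # no meaningful improvement this epoch
        if epochs_without_improvement >= patience:
            return epoch
    return None

# Hypothetical validation accuracies per epoch.
scores = [0.70, 0.75, 0.80, 0.80, 0.79, 0.80, 0.78]
stop_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch)
```

A larger patience tolerates longer plateaus at the cost of extra epochs.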
Usage Example
from functions.ML.mlp import train_mlp_model
# Train MLP
mlp_model = train_mlp_model(
X_train, y_train,
hidden_layer_sizes=(100, 50),
activation='relu',
alpha=0.0001,
learning_rate_init=0.001,
max_iter=500,
early_stopping=True,
random_state=42
)
# Predictions
y_pred = mlp_model.predict(X_test)
y_proba = mlp_model.predict_proba(X_test)
Hyperparameter Optimization
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
# Define parameter distributions
param_dist = {
'hidden_layer_sizes': [(50,), (100,), (100, 50), (200, 100)],
'activation': ['relu', 'tanh'],
'alpha': [0.0001, 0.001, 0.01],
'learning_rate_init': [0.0001, 0.001, 0.01],
'max_iter': [500]
}
# Random search
random_search = RandomizedSearchCV(
MLPClassifier(early_stopping=True, random_state=42),
param_distributions=param_dist,
n_iter=20,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")
Learning Curves
# Access loss history
train_loss = mlp_model.loss_curve_
# Plot
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(train_loss)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('MLP Training Loss')
plt.tight_layout()
plt.show()
# Check for convergence:
# - Should decrease steadily
# - Flatten at end
# - If still decreasing → increase max_iter
Advantages
✓ Non-linear: Can learn complex patterns
✓ Flexible: Arbitrary architectures
✓ Universal approximator: Theoretically can learn any function
✓ Feature learning: Automatic feature extraction
Limitations
✗ Black box: Hard to interpret
✗ Hyperparameters: Many to tune
✗ Convergence: Can be slow or unstable
✗ Scaling required: Sensitive to feature scales
✗ Random initialization: Different results per run
When to Use
Use MLP when:
✓ Non-linear, complex patterns
✓ Large datasets (> 10,000 samples)
✓ Don’t need interpretability
✓ Have time to tune
✓ Sufficient data to prevent overfitting
Consider alternatives when:
✗ Need interpretability → Logistic Regression
✗ Small data → SVM or Random Forest
✗ Want speed → Random Forest
✗ Tabular data → XGBoost (usually better)
Reference
Rumelhart et al. (1986). “Learning representations by back-propagating errors”
Model Evaluation
Classification Metrics
Accuracy
Formula: (TP + TN) / (TP + TN + FP + FN)
When to Use: Balanced classes
Limitation: Misleading for imbalanced data
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
Precision
Formula: TP / (TP + FP)
Interpretation: Of predicted positives, how many are correct?
When to Use: Cost of false positives high (e.g., spam detection)
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred, average='weighted')
print(f"Precision: {precision:.3f}")
Recall (Sensitivity)
Formula: TP / (TP + FN)
Interpretation: Of actual positives, how many detected?
When to Use: Cost of false negatives high (e.g., disease screening)
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred, average='weighted')
print(f"Recall: {recall:.3f}")
F1-Score
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Interpretation: Harmonic mean of precision and recall
When to Use: Balance precision and recall
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1-Score: {f1:.3f}")
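To see how the four formulas relate, the sketch below computes them from raw confusion counts for a hypothetical imbalanced screening result (10 positives among 1000 samples). The function name and counts are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical result: only half of the 10 true positives are found.
metrics = classification_metrics(tp=5, fp=5, fn=5, tn=985)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy is 0.99 here even though half of the positives are missed, which is the situation the accuracy limitation above warns about.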
ROC-AUC
Purpose: Measure discrimination ability across all thresholds
Range: 0 to 1; 0.5 = random ranking, 1.0 = perfect, below 0.5 = worse than random (inverted ranking)
When to Use: Imbalanced classes, need threshold-independent metric
from sklearn.metrics import roc_auc_score, roc_curve
# Binary classification
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc:.3f}")
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.tight_layout()
plt.show()
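ROC-AUC also has a direct ranking interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing all positive/negative pairs on a small hypothetical example; `roc_auc_pairwise` is an illustrative helper and is only practical for small inputs.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc_pairwise(y_true, y_score)
auc_inverted = roc_auc_pairwise(y_true, [1 - s for s in y_score])
print(auc_value, auc_inverted)
```

Three of the four pairs are ordered correctly, giving 0.75; reversing the scores gives 0.25.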
Confusion Matrix
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
xticklabels=class_names,
yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()
Classification Report
from sklearn.metrics import classification_report
report = classification_report(y_test, y_pred, target_names=class_names)
print(report)
# Example output (illustrative values):
#               precision    recall  f1-score   support
#
#      Class A       0.92      0.95      0.93        20
#      Class B       0.88      0.85      0.87        20
#
#     accuracy                           0.90        40
#    macro avg       0.90      0.90      0.90        40
# weighted avg       0.90      0.90      0.90        40
Cross-Validation
K-Fold Cross-Validation
from sklearn.model_selection import cross_val_score
# 5-fold CV
scores = cross_val_score(
model,
X_train,
y_train,
cv=5,
scoring='accuracy'
)
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
Stratified K-Fold
Use when: Imbalanced classes
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
model,
X_train,
y_train,
cv=skf,
scoring='accuracy'
)
print(f"Stratified CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
Group K-Fold (Patient-Level)
Critical for Raman data: Prevent data leakage from same patient
from sklearn.model_selection import GroupKFold, cross_val_score
# groups: Patient IDs for each sample
gkf = GroupKFold(n_splits=5)
scores = cross_val_score(
model,
X_train,
y_train,
groups=patient_ids, # Ensure all samples from same patient in same fold
cv=gkf,
scoring='accuracy'
)
print(f"Group CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
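What group-wise splitting guarantees can be shown without any library: no patient ever appears on both sides of a split. The sketch below assigns patients to folds round-robin, which is a simplification (a production splitter would also balance fold sizes); `group_k_fold` and the patient IDs are hypothetical.

```python
def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx); each group lands in exactly one test fold."""
    unique_groups = sorted(set(groups))
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx

# Hypothetical dataset: 8 spectra from 4 patients.
patient_ids = ["P1", "P1", "P2", "P2", "P2", "P3", "P4", "P4"]
splits = list(group_k_fold(patient_ids, n_splits=2))

for train_idx, test_idx in splits:
    train_patients = {patient_ids[i] for i in train_idx}
    test_patients = {patient_ids[i] for i in test_idx}
    # No patient contributes spectra to both sides of the split.
    print(sorted(train_patients), sorted(test_patients))
```

With plain K-fold, spectra from the same patient can fall into both training and test folds, which inflates the reported accuracy.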
See Also
Machine Learning User Guide - Step-by-step tutorials
Best Practices - ML strategies
Preprocessing Methods - Data preparation
Statistical Methods - Hypothesis testing
Last Updated: 2026-01-24