Machine Learning Methods

Comprehensive reference for classification and regression algorithms.

Note: Multi-Layer Perceptron (MLP) and neural networks are planned for future releases.


Support Vector Machines (SVM)

Purpose: Binary or multi-class classification using optimal decision boundaries

Theory

Core Concept: Find hyperplane that maximizes margin between classes

Key Components:

  1. Support Vectors: Data points closest to decision boundary

  2. Margin: Distance between hyperplane and nearest points

  3. Kernel: Function to transform data to higher dimensions

Decision Function:

f(x) = sign(Σ αᵢ yᵢ K(xᵢ, x) + b)

Where:

  • α: Lagrange multipliers

  • y: Class labels

  • K: Kernel function

  • b: Bias term
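
The decision function above can be evaluated by hand for a tiny model. The following is a minimal plain-Python sketch, not the library implementation: the support vectors, α values, and bias are hypothetical numbers chosen for illustration, and an RBF kernel is assumed.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, alphas, labels, bias, gamma):
    """Return (raw score, predicted sign) for a single sample x."""
    score = sum(
        a * y * rbf_kernel(sv, x, gamma)
        for a, y, sv in zip(alphas, labels, support_vectors)
    ) + bias
    return score, (1 if score >= 0 else -1)

# Hypothetical model: one support vector per class (illustration only)
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, 1.0]
labels = [-1, +1]
bias = 0.0

score_near_pos, pred_near_pos = svm_decision(
    [1.9, 2.1], support_vectors, alphas, labels, bias, gamma=0.5)
score_near_neg, pred_near_neg = svm_decision(
    [0.1, -0.1], support_vectors, alphas, labels, bias, gamma=0.5)

print(pred_near_pos, round(score_near_pos, 3))
print(pred_near_neg, round(score_near_neg, 3))
```

A sample near the positive support vector gets a positive score, and a sample near the negative one gets a negative score; the magnitude shrinks as the sample moves toward the boundary.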

Kernel Functions

1. Linear Kernel

Formula: K(x, x') = xᵀx'

When to Use:

  • ✓ Linearly separable data

  • ✓ High-dimensional data (text, spectra)

  • ✓ Large datasets (fast)

Pros:

  • Fast training and prediction

  • Interpretable (feature weights)

  • Less prone to overfitting

Cons:

  • Cannot handle non-linear boundaries

2. RBF (Radial Basis Function) Kernel

Formula: K(x, x') = exp(-γ ||x - x'||²)

When to Use:

  • ✓ Non-linear boundaries

  • ✓ Default choice for most problems

  • ✓ Unknown data structure

Pros:

  • Handles non-linearity

  • Flexible decision boundaries

Cons:

  • More parameters to tune (C, γ)

  • Slower than linear

  • Risk of overfitting

Parameter γ (gamma):

  • High γ (e.g., 1.0): Narrow influence → Complex boundaries (overfitting risk)

  • Low γ (e.g., 0.001): Wide influence → Smooth boundaries

  • ‘scale’ (default): γ = 1 / (n_features × Var(X)), where Var(X) is the variance over all entries of the training matrix

  • ‘auto’: γ = 1 / n_features
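
To make the γ settings concrete, here is a small plain-Python sketch that computes the 'scale' and 'auto' values for a toy matrix and shows how γ changes the RBF similarity between two points. The data are made up for illustration, and the helper names are hypothetical (they are not library functions).

```python
import math

def variance(values):
    """Population variance (divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def gamma_scale(X):
    """gamma = 1 / (n_features * Var(X)), variance taken over all entries."""
    flat = [v for row in X for v in row]
    return 1.0 / (len(X[0]) * variance(flat))

def gamma_auto(X):
    """gamma = 1 / n_features."""
    return 1.0 / len(X[0])

def rbf_kernel(x, z, gamma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Toy data: 4 samples, 2 features (illustration only)
X = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0], [2.0, 2.0]]

g_scale = gamma_scale(X)
g_auto = gamma_auto(X)

# Similarity between the first two samples under a low and a high gamma
k_low = rbf_kernel(X[0], X[1], gamma=g_scale)
k_high = rbf_kernel(X[0], X[1], gamma=1.0)

print(f"scale={g_scale}, auto={g_auto}")
print(f"K with low gamma={k_low:.4f}, K with high gamma={k_high:.6f}")
```

With the higher γ the similarity between the same two points drops toward zero, which is why each training sample then influences only its immediate neighborhood.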

3. Polynomial Kernel

Formula: K(x, x') = (xᵀx' + r)ᵈ

Parameters:

  • d: Polynomial degree (2 or 3 typical)

  • r: Coefficient (usually 0)
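
As a worked example of the polynomial kernel, the sketch below checks that for d = 2 and r = 0 the kernel value equals an ordinary dot product in an explicit quadratic feature space. It is plain Python with made-up 2-D inputs; `phi` is a hypothetical helper written for this illustration. Note that scikit-learn additionally scales the dot product by γ, which is omitted here.

```python
import math

def poly_kernel(x, z, degree=2, r=0.0):
    """K(x, z) = (x . z + r)^degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (dot + r) ** degree

def phi(x):
    """Explicit degree-2 feature map for 2-D input (valid for r = 0)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x = [1.0, 2.0]
z = [3.0, 4.0]

k_implicit = poly_kernel(x, z)  # (1*3 + 2*4)^2
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))

print(k_implicit, round(k_explicit, 6))
```

The kernel computes the same quantity without ever building the higher-dimensional features, which is the point of the kernel trick.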

When to Use:

  • ✓ Known polynomial relationship

  • ✓ Image data

  • ✗ Rarely used for Raman spectra

Hyperparameters

C (Regularization Parameter)

Purpose: Control trade-off between margin and misclassification

Effect:

  • High C (e.g., 100): Smaller margin, fewer errors → Overfitting risk

  • Low C (e.g., 0.1): Larger margin, more errors → Underfitting risk

Typical Range: 0.1 - 100

Tuning Strategy:

# Grid search over C
C_range = [0.1, 1, 10, 100]

Gamma (γ) - RBF Kernel (ignored by the linear kernel)

Purpose: Define influence radius of single training example

Effect:

  • High γ (e.g., 1.0): Tight fit → Overfitting

  • Low γ (e.g., 0.001): Loose fit → Underfitting

Typical Range: 0.0001 - 1.0

Tuning Strategy:

# Grid search over gamma
gamma_range = [0.0001, 0.001, 0.01, 0.1, 1.0]

Usage Example

from sklearn.model_selection import train_test_split
from functions.ML.svm import train_svm_model

# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
    preprocessed_spectra,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels
)

# Train SVM with RBF kernel
svm_model = train_svm_model(
    X_train, y_train,
    kernel='rbf',
    C=10.0,
    gamma='scale',
    random_state=42
)

# Predictions
y_pred = svm_model.predict(X_test)

# Evaluate
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))

Hyperparameter Optimization

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf']
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# Best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Use best model
best_svm = grid_search.best_estimator_

Interpretation

Decision Function Values:

# Get decision function scores
decision_scores = svm_model.decision_function(X_test)

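# For binary classification:
# Positive → svm_model.classes_[1]
# Negative → svm_model.classes_[0]
# Magnitude → Confidence (larger = further from the boundary; not a probability)

# For multi-class:
# SVC trains one-vs-one classifiers internally (one per class pair);
# with the scikit-learn default decision_function_shape='ovr' the
# returned array has shape (n_samples, n_classes)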

Support Vectors:

# Number of support vectors per class
print(f"Support vectors: {svm_model.n_support_}")

# Indices of support vectors
support_indices = svm_model.support_

# Support vectors themselves
support_vectors = X_train[support_indices]

Troubleshooting

Issue              | Cause                    | Solution
Poor accuracy      | Wrong kernel             | Try RBF if using linear
Overfitting        | C too high or γ too high | Reduce C, reduce γ
Underfitting       | C too low or γ too low   | Increase C, increase γ
Very slow training | Large dataset with RBF   | Use LinearSVC or subsample
Poor validation    | Data leakage             | Use GroupKFold for patient-level

When to Use

Use SVM when:

  • ✓ Binary or multi-class classification

  • ✓ High-dimensional data (spectra work well)

  • ✓ Clear margin between classes

  • ✓ Small to medium datasets (< 50,000 samples)

  • ✓ Need probabilistic outputs (set probability=True; this adds Platt scaling with internal cross-validation, so training is slower)

Consider alternatives when:

  • ✗ Very large datasets (> 100,000) → Random Forest

  • ✗ Need interpretability → Logistic Regression

  • ✗ Categorical features → Random Forest/XGBoost

  • ✗ Need feature importance → Random Forest

Class Imbalance

from sklearn.svm import SVC

# Handle imbalanced classes
svm_model = SVC(
    kernel='rbf',
    C=10,
    gamma='scale',
    class_weight='balanced',  # Automatic adjustment
    random_state=42
)

# Or specify custom weights
class_weights = {0: 1.0, 1: 3.0}  # Give 3x weight to class 1
svm_model = SVC(class_weight=class_weights)
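
For reference, the 'balanced' setting weights each class by n_samples / (n_classes × class_count). The plain-Python sketch below reproduces that formula on a made-up 90/10 label split; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

The minority class receives a proportionally larger weight, so misclassifying one of its samples costs as much as misclassifying several majority samples.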

Reference

Cortes & Vapnik (1995). “Support-Vector Networks”


Random Forest

Purpose: Ensemble of decision trees for robust classification/regression

Theory

Core Concept: Combine multiple decision trees to reduce overfitting

Bagging (Bootstrap Aggregating):

  1. Create multiple bootstrap samples (random sampling with replacement)

  2. Train decision tree on each sample

  3. Aggregate predictions (vote for classification, average for regression)

Random Feature Selection:

  • At each split, consider random subset of features

  • Decorrelates trees → Better ensemble

Out-of-Bag (OOB) Error:

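  • Each tree is trained on a bootstrap sample that contains ~63% of the unique training samples

  • The remaining ~37% (out-of-bag samples) give a built-in validation estimate without a separate validation split; keep a held-out test set for the final evaluation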
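
The ~63% figure follows from sampling with replacement: the chance that a given sample appears in a bootstrap sample of size n is 1 − (1 − 1/n)ⁿ, which tends to 1 − 1/e ≈ 0.632. A short plain-Python simulation (a sketch with an arbitrary sample size and seed) confirms it:

```python
import random

def in_bag_fraction(n_samples, rng):
    """Fraction of distinct samples drawn in one bootstrap sample."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples

rng = random.Random(42)
n_samples = 1000

# Average over many bootstrap samples
fractions = [in_bag_fraction(n_samples, rng) for _ in range(200)]
mean_in_bag = sum(fractions) / len(fractions)

# Closed-form expectation for comparison
expected = 1 - (1 - 1 / n_samples) ** n_samples

print(f"simulated in-bag fraction: {mean_in_bag:.3f}")
print(f"expected in-bag fraction:  {expected:.3f}")
```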

Hyperparameters

n_estimators

Purpose: Number of trees in forest

Effect:

  • More trees → More stable, but slower

  • Fewer trees → Faster, but less stable

Typical Range: 100 - 1000

Recommendation: Start with 100, increase while the OOB error is still decreasing

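from sklearn.ensemble import RandomForestClassifier

# Check OOB error vs n_trees
# (with very few trees some samples are never out-of-bag and scikit-learn warns)
oob_errors = []
for n in range(10, 500, 10):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    oob_errors.append(1 - rf.oob_score_)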

# Plot to find plateau

max_depth

Purpose: Maximum depth of each tree

Effect:

  • None (default): Trees grow until pure leaves

  • Shallow (e.g., 5-10): Prevents overfitting

  • Deep (e.g., 20-30): Captures complex patterns

Typical Range: 10 - 30, or None

Tuning:

# Start with None, then limit if overfitting
max_depth = None  # or 20 for regularization

min_samples_split

Purpose: Minimum samples required to split node

Effect:

  • Low (e.g., 2): More splits → More complex trees

  • High (e.g., 10-20): Fewer splits → Regularization

Default: 2

Recommendation: Increase to 5-10 if overfitting

min_samples_leaf

Purpose: Minimum samples in leaf node

Effect:

  • Low (e.g., 1): Allows pure leaves

  • High (e.g., 5-10): Smooths predictions

Default: 1

Recommendation: Set to 2-5 for regularization

max_features

Purpose: Number of features to consider for each split

Options:

  • ‘sqrt’ (default): √n_features (recommended for classification)

  • ‘log2’: log₂(n_features)

  • int: Specific number

  • float: Proportion (e.g., 0.3 = 30%)

Effect:

  • Fewer features → More diversity → Better ensemble

  • More features → Individual trees stronger → Less diversity
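
To see what the options mean in practice, the sketch below converts each max_features setting into the number of candidate features per split. It is a plain-Python illustration that follows the option descriptions above (truncating to an integer, minimum 1); `n_split_candidates` is a hypothetical helper and the feature count is made up.

```python
import math

def n_split_candidates(max_features, n_features):
    """Number of features examined at each split for a given setting."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))
    raise ValueError(f"Unsupported max_features: {max_features!r}")

n_features = 1024  # e.g. a spectrum with 1024 wavenumber channels

candidates = {
    option: n_split_candidates(option, n_features)
    for option in ("sqrt", "log2", 0.25, 50, None)
}
for option, count in candidates.items():
    print(f"max_features={option!r}: {count} features per split")
```

For high-dimensional spectra the gap between 'sqrt' (32) and 'log2' (10) is large, so the choice noticeably changes how decorrelated the trees are.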

Usage Example

from functions.ML.random_forest import train_rf_model

# Train Random Forest
rf_model = train_rf_model(
    X_train, y_train,
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

# Predictions
y_pred = rf_model.predict(X_test)

# Prediction probabilities
y_proba = rf_model.predict_proba(X_test)

# Feature importance
importances = rf_model.feature_importances_

Hyperparameter Optimization

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define parameter distributions
param_dist = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.3]
}

# Random search (faster than grid search)
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,  # Number of parameter combinations to try
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

Feature Importance

Gini Importance (Default):

import numpy as np
import matplotlib.pyplot as plt

# Get feature importances
importances = rf_model.feature_importances_

# Sort by importance
indices = np.argsort(importances)[::-1]
top_k = 20

# Plot top features
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), importances[indices[:top_k]])
plt.xlabel('Feature (Wavenumber) Index')
plt.ylabel('Importance')
plt.title('Top 20 Important Features')
plt.tight_layout()
plt.show()

# Map to wavenumbers (wavenumbers: 1-D NumPy array of the Raman shift axis, one entry per feature)
top_wavenumbers = wavenumbers[indices[:top_k]]
print(f"Top wavenumbers: {top_wavenumbers}")

Permutation Importance (computed on held-out data; less biased than Gini importance):

from sklearn.inspection import permutation_importance

# Calculate permutation importance
perm_importance = permutation_importance(
    rf_model,
    X_test,
    y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

# Get importances and standard deviations
perm_importances_mean = perm_importance.importances_mean
perm_importances_std = perm_importance.importances_std

# Sort
indices = np.argsort(perm_importances_mean)[::-1]
top_k = 20

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), 
        perm_importances_mean[indices[:top_k]],
        yerr=perm_importances_std[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Permutation Importance')
plt.title('Top 20 Features (Permutation Importance)')
plt.tight_layout()
plt.show()

SHAP Values

Purpose: Explain model predictions via Shapley-value-based feature attribution.

When to use:

  • ✓ Per-sample explanations (which wavenumbers drive a prediction)

  • ✓ Global importance aggregated from local attributions

Note: This typically uses the external shap library.

In this application: SHAP explainability is exposed via the ML page SHAP action and produces a per-spectrum explanation with a contribution bar plot across the Raman shift axis (red = positive, blue = negative), a ranked contributor table, and an exportable bundle.

Out-of-Bag (OOB) Score

# Train with OOB scoring
rf_model = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,
    random_state=42
)
rf_model.fit(X_train, y_train)

# OOB score (similar to cross-validation)
print(f"OOB Score: {rf_model.oob_score_:.3f}")

# OOB class-probability estimates, shape (n_samples, n_classes)
oob_proba = rf_model.oob_decision_function_

Advantages

  • Robust: Less prone to overfitting than a single tree

  • Feature importance: Automatic ranking

  • Handles non-linearity: No kernel needed

  • Missing values: Handled via imputation

  • No scaling required: Works with raw features

  • Parallelizable: Fast training with multiple cores

Limitations

  • Black box: Hard to interpret individual predictions

  • Large models: Memory intensive for many trees

  • Extrapolation: Poor performance outside training range

  • Biased: Gini importance favors features with many categories or distinct values

When to Use

Use Random Forest when:

  • ✓ Tabular data with mixed features

  • ✓ Need feature importance

  • ✓ Don’t want to tune many hyperparameters

  • ✓ Want robust, reliable performance

  • ✓ Have sufficient data (> 1000 samples)

Consider alternatives when:

  • ✗ Need probabilistic model → Logistic Regression

  • ✗ Have sequential/spatial data → Neural networks

  • ✗ Need speed → Logistic Regression or LinearSVC

  • ✗ Want best accuracy → XGBoost (often more accurate on tabular data)

Reference

Breiman (2001). “Random Forests”


XGBoost

Purpose: Gradient boosting for high-performance classification/regression

Theory

Core Concept: Build trees sequentially, each correcting previous errors

Gradient Boosting:

  1. Start with simple model (e.g., mean prediction)

  2. Calculate residuals (errors)

  3. Train new tree to predict residuals

  4. Add to ensemble with small weight

  5. Repeat until convergence
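
The five steps above can be sketched in plain Python. This is a minimal illustration of generic gradient boosting with squared-error loss and one-split trees (stumps) on made-up 1-D data. It is not XGBoost's algorithm, which adds second-order gradients and regularization; all names here are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up regression data: y = x^2
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]

learning_rate = 0.3
n_rounds = 50

base = sum(ys) / len(ys)                 # Step 1: start with the mean
preds = [base] * len(xs)
trees = []
history = [mse(ys, preds)]

for _ in range(n_rounds):                # Step 5: repeat
    residuals = [y - p for y, p in zip(ys, preds)]   # Step 2: residuals
    tree = fit_stump(xs, residuals)                  # Step 3: fit tree
    trees.append(tree)
    preds = [p + learning_rate * tree(x)             # Step 4: small weight
             for p, x in zip(preds, xs)]
    history.append(mse(ys, preds))

def predict(x):
    """Ensemble prediction: base value plus the shrunken tree outputs."""
    return base + learning_rate * sum(tree(x) for tree in trees)

print(f"MSE before boosting: {history[0]:.2f}")
print(f"MSE after {n_rounds} rounds: {history[-1]:.2f}")
```

Each round can only lower the training error, which is also why the number of rounds needs a validation-based stopping rule rather than the training loss.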

XGBoost Innovations:

  • Regularization (L1/L2) to prevent overfitting

  • Handling missing values automatically

  • Parallelized split finding within each tree (fast)

  • Built-in cross-validation

Formula:

F(x) = Σ fₖ(x)
where fₖ is k-th tree

Hyperparameters

n_estimators

Purpose: Number of boosting rounds (trees)

Effect:

  • More trees → Better training fit → Overfitting risk

  • Fewer trees → Underfitting

Typical Range: 100 - 1000

Best Practice: Use early stopping to find optimal number

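from xgboost import XGBClassifier
# Early stopping (XGBoost >= 1.6: set early_stopping_rounds on the estimator;
# passing it to fit() was removed in 2.0). X_val, y_val: held-out validation split
xgb_model = XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50,  # Stop if no improvement for 50 rounds
    eval_metric='logloss'
)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)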

learning_rate (eta)

Purpose: Step size for each tree’s contribution

Effect:

  • High (e.g., 0.3): Fast convergence → Overfitting risk

  • Low (e.g., 0.01): Slow convergence → More trees needed → Better generalization

Typical Range: 0.01 - 0.3

Rule of Thumb: Lower learning_rate requires more n_estimators

# Conservative approach
learning_rate = 0.1
n_estimators = 500

# Aggressive approach (faster, riskier)
learning_rate = 0.3
n_estimators = 100

max_depth

Purpose: Maximum depth of each tree

Effect:

  • Deep (e.g., 6-10): Captures complex interactions

  • Shallow (e.g., 3-5): Prevents overfitting

Typical Range: 3 - 10

Default: 6

subsample

Purpose: Fraction of samples used for each tree

Effect:

  • < 1.0 (e.g., 0.8): Reduces overfitting, speeds up training

  • = 1.0: Use all samples

Typical Range: 0.5 - 1.0

Recommendation: 0.8 for robustness

colsample_bytree

Purpose: Fraction of features used for each tree

Effect:

  • < 1.0 (e.g., 0.8): Reduces overfitting, adds diversity

  • = 1.0: Use all features

Typical Range: 0.5 - 1.0

Recommendation: 0.8

gamma (min_split_loss)

Purpose: Minimum loss reduction to make split

Effect:

  • Higher: More conservative splitting → Regularization

  • Lower: More splits → Risk of overfitting

Typical Range: 0 - 5

Default: 0

lambda (reg_lambda) - L2 Regularization

Purpose: L2 penalty on leaf weights

Effect:

  • Higher: Stronger regularization

  • Lower: Less regularization

Typical Range: 0 - 10

Default: 1

alpha (reg_alpha) - L1 Regularization

Purpose: L1 penalty on leaf weights (encourages sparse leaf weights rather than selecting features directly)

Effect:

  • Higher: More sparsity (some weights → 0)

  • Lower: Less sparsity

Typical Range: 0 - 10

Default: 0
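
The effect of gamma, reg_lambda, and reg_alpha can be read off the closed-form expressions from the XGBoost paper: the optimal leaf weight is w* = −G / (H + λ), and a split is kept only if its gain exceeds γ. The plain-Python sketch below evaluates these for made-up gradient sums G and Hessian sums H. The L1 soft-thresholding step reflects how the penalty is commonly described and is an assumption here, not a statement about the library internals.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (L1 penalty); zero inside [-alpha, alpha]."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value: w* = -T(G, alpha) / (H + lambda)."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split minus the complexity penalty gamma."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma

# Made-up gradient statistics for one leaf
G, H = -4.0, 3.0
w_default = leaf_weight(G, H)                     # lambda = 1
w_no_reg = leaf_weight(G, H, reg_lambda=0.0)      # no L2 shrinkage
w_l1 = leaf_weight(G, H, reg_alpha=1.0)           # L1 shrinks the numerator
w_l1_strong = leaf_weight(G, H, reg_alpha=5.0)    # L1 zeroes the leaf

# Made-up statistics for a candidate split
gain_free = split_gain(-4.0, 2.0, 4.0, 2.0, gamma=0.0)
gain_penalized = split_gain(-4.0, 2.0, 4.0, 2.0, gamma=6.0)

print(w_no_reg, w_default, w_l1, w_l1_strong)
print(gain_free, gain_penalized)
```

Larger λ and α pull leaf values toward zero, and a γ above the raw gain turns the split's net gain negative, so the split is pruned.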

Usage Example

from functions.ML.xgboost import train_xgboost_model

# Train XGBoost
xgb_model = train_xgboost_model(
    X_train, y_train,
    n_estimators=500,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=0,
    reg_lambda=1,
    reg_alpha=0,
    random_state=42,
    n_jobs=-1
)

# Predictions
y_pred = xgb_model.predict(X_test)
y_proba = xgb_model.predict_proba(X_test)

Hyperparameter Optimization

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Define parameter distributions
param_dist = {
    'n_estimators': [100, 200, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 1, 5],
    'reg_lambda': [0, 1, 10],
    'reg_alpha': [0, 0.1, 1]
}

# Random search
random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42, eval_metric='logloss'),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

Feature Importance

import matplotlib.pyplot as plt

# Get feature importances
importances = xgb_model.feature_importances_

# Plot top features
import numpy as np
indices = np.argsort(importances)[::-1]
top_k = 20

plt.figure(figsize=(10, 6))
plt.bar(range(top_k), importances[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Importance (Gain)')
plt.title('Top 20 Important Features (XGBoost)')
plt.tight_layout()
plt.show()

# Importance types in XGBoost:
# - 'weight': Number of times feature used
# - 'gain': Average gain (default for feature_importances_ with tree boosters)
# - 'cover': Average coverage

# Get different importance types (get_score itself defaults to 'weight')
importance_gain = xgb_model.get_booster().get_score(importance_type='gain')
importance_weight = xgb_model.get_booster().get_score(importance_type='weight')

Learning Curves

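# Train with evaluation set to monitor performance
# (X_val, y_val: held-out validation split)
eval_set = [(X_train, y_train), (X_val, y_val)]

# XGBoost >= 1.6: eval_metric is an estimator parameter, not a fit() argument
xgb_model.set_params(eval_metric='logloss')
xgb_model.fit(
    X_train, y_train,
    eval_set=eval_set,
    verbose=50  # Print every 50 rounds
)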

# Get evaluation results
results = xgb_model.evals_result()

# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(results['validation_0']['logloss'], label='Train')
plt.plot(results['validation_1']['logloss'], label='Validation')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.legend()
plt.title('XGBoost Learning Curves')
plt.tight_layout()
plt.show()

Advantages

  • State-of-the-art accuracy: Often wins competitions

  • Regularization: Built-in L1/L2

  • Handles missing data: Automatically

  • Fast: Parallelized split finding

  • Early stopping: Prevents overfitting

  • Feature importance: Gain, weight, cover

Limitations

  • Many hyperparameters: Requires tuning

  • Sensitive to overfitting: With default parameters

  • Black box: Hard to interpret

  • Memory intensive: For large datasets

When to Use

Use XGBoost when:

  • ✓ Need best possible accuracy

  • ✓ Tabular data

  • ✓ Have time to tune hyperparameters

  • ✓ Competition or critical application

  • ✓ Medium to large datasets

Consider alternatives when:

  • ✗ Need interpretability → Logistic Regression

  • ✗ Limited time → Random Forest (fewer params)

  • ✗ Very small data → Logistic Regression or SVM

Reference

Chen & Guestrin (2016). “XGBoost: A Scalable Tree Boosting System”


Logistic Regression

Purpose: Linear model for probabilistic binary/multi-class classification

Theory

Core Concept: Model log-odds as linear combination of features

Binary Classification:

P(y=1|x) = 1 / (1 + exp(-(β₀ + β₁x₁ + ... + βₙxₙ)))

Multi-class (multinomial, i.e. softmax; one-vs-rest instead fits one binary model per class):

P(y=k|x) = exp(xᵀβₖ) / Σⱼ exp(xᵀβⱼ)

Decision Boundary: Linear in feature space
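
The two probability formulas above can be checked numerically. This plain-Python sketch uses hypothetical coefficients and a made-up sample; it also shows that a two-class softmax reduces to the binary sigmoid of the score difference.

```python
import math

def sigmoid(z):
    """Binary case: P(y=1|x) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Multinomial case: exp(s_k) / sum_j exp(s_j), shifted for stability."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear_score(x, beta, intercept=0.0):
    """Log-odds: intercept + sum_i beta_i * x_i."""
    return intercept + sum(b * v for b, v in zip(beta, x))

x = [1.0, 2.0]  # made-up sample with two features

# Binary: hypothetical coefficients giving a score of exactly 0
p_binary = sigmoid(linear_score(x, beta=[0.5, -0.25]))

# Three classes: one hypothetical coefficient vector per class
class_betas = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
p_multi = softmax([linear_score(x, beta) for beta in class_betas])

# Two-class softmax equals the sigmoid of the score difference
p_two_class = softmax([1.3, 0.0])[0]

print(p_binary)
print([round(p, 3) for p in p_multi])
print(round(p_two_class, 6), round(sigmoid(1.3), 6))
```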

Hyperparameters

C (Inverse Regularization)

Purpose: Control strength of regularization

Effect:

  • High C (e.g., 100): Weak regularization → Overfitting risk

  • Low C (e.g., 0.01): Strong regularization → Underfitting risk

Typical Range: 0.01 - 100

Note: C is the inverse of regularization strength (as in SVM, but unlike alpha in MLP or reg_lambda in XGBoost, where larger values mean stronger regularization)

penalty

Purpose: Type of regularization

Options:

  • ‘l2’ (default): Ridge regularization (shrinks coefficients)

  • ‘l1’: Lasso regularization (sparse coefficients, feature selection)

  • ‘elasticnet’: Mix of L1 and L2

  • ‘none’: No regularization (recent scikit-learn versions expect penalty=None instead of the string)

When to Use:

  • L2: Default, works well generally

  • L1: When want feature selection (many irrelevant features)

  • Elasticnet: When L1 too aggressive

solver

Purpose: Optimization algorithm

Options:

  • ‘lbfgs’ (default): Good for small datasets, L2 or no penalty

  • ‘saga’: Supports all penalties, good for large datasets

  • ‘liblinear’: Good for small datasets, supports L1/L2

Recommendation: Use ‘saga’ for flexibility; standardize features first, since saga converges slowly on unscaled data

max_iter

Purpose: Maximum iterations for convergence

Default: 100

Increase if: Warning about non-convergence

Typical Range: 100 - 10000

Usage Example

from functions.ML.logistic_regression import train_lr_model

# Train Logistic Regression
lr_model = train_lr_model(
    X_train, y_train,
    C=1.0,
    penalty='l2',
    solver='saga',
    max_iter=1000,
    random_state=42
)

# Predictions
y_pred = lr_model.predict(X_test)
y_proba = lr_model.predict_proba(X_test)

# Get coefficients (feature weights)
coefficients = lr_model.coef_[0]  # For binary classification

# Intercept
intercept = lr_model.intercept_

Hyperparameter Optimization

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['saga'],
    'max_iter': [1000]
}

# Grid search
grid_search = GridSearchCV(
    LogisticRegression(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

Interpretation

Coefficients (Feature Weights):

import numpy as np
import matplotlib.pyplot as plt

# Get coefficients
coefs = lr_model.coef_[0]

# Find most important features
indices = np.argsort(np.abs(coefs))[::-1]
top_k = 20

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), coefs[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Coefficient')
plt.title('Top 20 Feature Coefficients')
plt.axhline(y=0, color='k', linestyle='--')
plt.tight_layout()
plt.show()

# Positive coefficient → Increases probability of class 1
# Negative coefficient → Decreases probability of class 1

Odds Ratios:

# Convert coefficients to odds ratios
odds_ratios = np.exp(coefs)

# Interpretation:
# OR = 2.0 → One unit increase in feature doubles odds
# OR = 0.5 → One unit increase halves odds
# OR = 1.0 → No effect

print("Odds ratios for top features:")
for i in indices[:10]:
    print(f"  Feature {i}: OR = {odds_ratios[i]:.3f}")

Probability Calibration:

# Check calibration
from sklearn.calibration import calibration_curve

# Predicted probabilities
y_proba = lr_model.predict_proba(X_test)[:, 1]

# Calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, y_proba, n_bins=10
)

# Plot
plt.figure(figsize=(8, 6))
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated')
plt.plot(mean_predicted_value, fraction_of_positives, 'o-', label='Logistic Regression')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.tight_layout()
plt.show()

Advantages

  • Interpretable: Clear feature weights

  • Probabilistic: Direct probability estimates

  • Fast: Training and prediction

  • Well-calibrated: Reliable probabilities

  • Few hyperparameters: Mainly C and penalty

  • Linear decision boundary: Simple

Limitations

  • Linear only: Cannot capture non-linear relationships

  • Feature engineering: May need manual feature creation

  • Sensitive to scaling: Requires standardization

  • Multicollinearity: Correlated features problematic

When to Use

Use Logistic Regression when:

  • ✓ Need interpretability (coefficients)

  • ✓ Want probabilistic outputs

  • ✓ Classes that are approximately linearly separable

  • ✓ Baseline model

  • ✓ Small datasets

  • ✓ Need fast predictions

Consider alternatives when:

  • ✗ Non-linear boundaries → SVM (RBF) or Random Forest

  • ✗ Many irrelevant features → Random Forest (feature importance)

  • ✗ Complex interactions → XGBoost or Neural Networks

Reference

Cox (1958). “The Regression Analysis of Binary Sequences”


Multi-Layer Perceptron (MLP)

Purpose: Neural network for non-linear classification/regression

Theory

Architecture: Input → Hidden Layers → Output

Neuron Computation:

output = activation(Σ wᵢxᵢ + b)

Activation Functions:

  • ReLU: f(x) = max(0, x) [Default, recommended]

  • Tanh: f(x) = tanh(x)

  • Logistic: f(x) = 1/(1 + e⁻ˣ)

Training: Backpropagation with gradient descent
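
The neuron computation above, chained through one hidden layer, gives the forward pass of a small MLP. The following plain-Python sketch uses hypothetical hand-picked weights for illustration; training (backpropagation) is not shown.

```python
def relu(v):
    """ReLU activation: max(0, v)."""
    return max(0.0, v)

def identity(v):
    return v

def dense(inputs, weights, biases, activation):
    """One layer: activation(sum_i w_i * x_i + b) for every neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 2 hidden neurons (ReLU) -> 1 output
x = [1.0, -2.0]
hidden_weights = [[1.0, 1.0], [1.0, -1.0]]
hidden_biases = [0.0, 0.5]
output_weights = [[2.0, 1.0]]
output_biases = [-1.0]

# Hidden pre-activations are -1.0 and 3.5; ReLU clips the first to 0
hidden = dense(x, hidden_weights, hidden_biases, relu)
output = dense(hidden, output_weights, output_biases, identity)

print(hidden, output)
```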

Hyperparameters

hidden_layer_sizes

Purpose: Architecture of hidden layers

Format: Tuple (layer1_size, layer2_size, …)

Examples:

# Single hidden layer with 100 neurons
hidden_layer_sizes = (100,)

# Two hidden layers (100, 50)
hidden_layer_sizes = (100, 50)

# Three hidden layers (200, 100, 50)
hidden_layer_sizes = (200, 100, 50)

Rules of Thumb:

  • Start with (100,) or (100, 50)

  • More neurons → More capacity → Overfitting risk

  • More layers → Can learn complex patterns

activation

Purpose: Non-linear activation function

Options:

  • ‘relu’ (default): Recommended, works well

  • ‘tanh’: Alternative, slower convergence

  • ‘logistic’: Sigmoid, rarely used

Recommendation: Use ‘relu’

alpha

Purpose: L2 regularization strength

Effect:

  • Higher (e.g., 0.01): Strong regularization

  • Lower (e.g., 0.0001): Weak regularization

Typical Range: 0.0001 - 0.01

Default: 0.0001

learning_rate_init

Purpose: Initial learning rate

Effect:

  • Higher (e.g., 0.01): Faster convergence → Instability risk

  • Lower (e.g., 0.0001): Slower convergence → More stable

Typical Range: 0.0001 - 0.01

Default: 0.001

max_iter

Purpose: Maximum epochs (training iterations)

Default: 200

Typical Range: 200 - 1000

Increase if: Training loss still decreasing

early_stopping

Purpose: Stop training when validation score stops improving

Recommended: True

Parameters:

  • validation_fraction: Fraction for validation (default 0.1)

  • n_iter_no_change: Patience (default 10)
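
The patience rule can be sketched as follows. This is a generic plain-Python illustration of early stopping on a list of made-up validation scores, not scikit-learn's exact internal logic; `early_stopping_epoch` and its `tol` default are assumptions for the example.

```python
def early_stopping_epoch(val_scores, patience=10, tol=1e-4):
    """Return the epoch index at which training would stop.

    Training stops once the validation score has failed to improve by
    more than `tol` for `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best + tol:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(val_scores) - 1  # ran out of epochs without triggering

# Made-up validation accuracy per epoch: plateaus after epoch 2
val_scores = [0.70, 0.80, 0.85, 0.85, 0.84, 0.85, 0.83]

stop_short_patience = early_stopping_epoch(val_scores, patience=3)
stop_long_patience = early_stopping_epoch(val_scores, patience=10)

print(stop_short_patience, stop_long_patience)
```

A small patience stops soon after the plateau begins, while a large patience lets training run to the last epoch.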

Usage Example

from functions.ML.mlp import train_mlp_model

# Train MLP
mlp_model = train_mlp_model(
    X_train, y_train,
    hidden_layer_sizes=(100, 50),
    activation='relu',
    alpha=0.0001,
    learning_rate_init=0.001,
    max_iter=500,
    early_stopping=True,
    random_state=42
)

# Predictions
y_pred = mlp_model.predict(X_test)
y_proba = mlp_model.predict_proba(X_test)

Hyperparameter Optimization

from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Define parameter distributions
param_dist = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50), (200, 100)],
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.0001, 0.001, 0.01],
    'max_iter': [500]
}

# Random search
random_search = RandomizedSearchCV(
    MLPClassifier(early_stopping=True, random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

Learning Curves

# Access loss history
train_loss = mlp_model.loss_curve_

# Plot
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(train_loss)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('MLP Training Loss')
plt.tight_layout()
plt.show()

# Check for convergence:
# - Should decrease steadily
# - Flatten at end
# - If still decreasing → increase max_iter

Advantages

  • Non-linear: Can learn complex patterns

  • Flexible: Arbitrary architectures

  • Universal approximator: Theoretically can learn any function

  • Feature learning: Automatic feature extraction

Limitations

  • Black box: Hard to interpret

  • Hyperparameters: Many to tune

  • Convergence: Can be slow or unstable

  • Scaling required: Sensitive to feature scales

  • Random initialization: Different results per run

When to Use

Use MLP when:

  • ✓ Non-linear, complex patterns

  • ✓ Large datasets (> 10,000 samples)

  • ✓ Don’t need interpretability

  • ✓ Have time to tune

  • ✓ Sufficient data to prevent overfitting

Consider alternatives when:

  • ✗ Need interpretability → Logistic Regression

  • ✗ Small data → SVM or Random Forest

  • ✗ Want speed → Random Forest

  • ✗ Tabular data → XGBoost (usually better)

Reference

Rumelhart et al. (1986). “Learning representations by back-propagating errors”


Model Evaluation

Classification Metrics

Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)

When to Use: Balanced classes

Limitation: Misleading for imbalanced data

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

Precision

Formula: TP / (TP + FP)

Interpretation: Of predicted positives, how many are correct?

When to Use: Cost of false positives high (e.g., spam detection)

from sklearn.metrics import precision_score

precision = precision_score(y_test, y_pred, average='weighted')
print(f"Precision: {precision:.3f}")

Recall (Sensitivity)

Formula: TP / (TP + FN)

Interpretation: Of actual positives, how many detected?

When to Use: Cost of false negatives high (e.g., disease screening)

from sklearn.metrics import recall_score

recall = recall_score(y_test, y_pred, average='weighted')
print(f"Recall: {recall:.3f}")

F1-Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)

Interpretation: Harmonic mean of precision and recall

When to Use: Balance precision and recall

from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1-Score: {f1:.3f}")
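
To see why accuracy misleads on imbalanced data, the four formulas above can be applied directly to confusion-matrix counts. The sketch below is plain Python with made-up counts (10 positives among 1000 samples); `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Model A: always predicts the negative class
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# Model B: finds 8 of 10 positives at the cost of 32 false alarms
screening = classification_metrics(tp=8, fp=32, fn=2, tn=958)

for name, metrics in [("always negative", always_negative),
                      ("screening", screening)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Model A has the higher accuracy (0.99) yet detects no positives, while Model B's recall of 0.8 is what matters for screening.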

ROC-AUC

Purpose: Measure discrimination ability across all thresholds

Range: 0 to 1.0; 0.5 corresponds to random ranking and 1.0 to perfect separation (values below 0.5 mean the ranking is inverted)

When to Use: Imbalanced classes, need threshold-independent metric

from sklearn.metrics import roc_auc_score, roc_curve

# Binary classification
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc:.3f}")

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.tight_layout()
plt.show()
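
ROC-AUC also has a direct interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as one half). The plain-Python sketch below computes it that way on a made-up four-sample example; it is an O(n²) illustration, not how libraries compute it.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc(y_true, y_score)

# Negating the scores inverts the ranking and mirrors the AUC around 0.5
auc_inverted = roc_auc(y_true, [-s for s in y_score])

print(auc_value, auc_inverted)
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75; the inverted scores give 0.25.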

Confusion Matrix

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Compute confusion matrix (class_names: label names in sorted label order)
cm = confusion_matrix(y_test, y_pred)

# Plot
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

Classification Report

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred, target_names=class_names)
print(report)

# Example output (illustrative values):
#               precision    recall  f1-score   support
#
#      Class A       0.86      0.95      0.90        20
#      Class B       0.94      0.85      0.89        20
#
#     accuracy                           0.90        40
#    macro avg       0.90      0.90      0.90        40
# weighted avg       0.90      0.90      0.90        40

Cross-Validation

K-Fold Cross-Validation

from sklearn.model_selection import cross_val_score

# 5-fold CV (for classifiers, an integer cv uses StratifiedKFold without shuffling)
scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=5,
    scoring='accuracy'
)

print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Stratified K-Fold

Use when: Imbalanced classes

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=skf,
    scoring='accuracy'
)

print(f"Stratified CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Group K-Fold (Patient-Level)

Critical for Raman data: Spectra from the same patient are correlated, so keep each patient's spectra in a single fold to prevent data leakage

from sklearn.model_selection import GroupKFold, cross_val_score

# patient_ids: one patient ID per row of X_train (same order)
gkf = GroupKFold(n_splits=5)

scores = cross_val_score(
    model,
    X_train,
    y_train,
    groups=patient_ids,  # Ensure all samples from same patient in same fold
    cv=gkf,
    scoring='accuracy'
)

print(f"Group CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
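
The leakage guarantee can be verified directly. The sketch below is a simplified plain-Python stand-in for group-wise splitting: it assigns patients to folds round-robin, whereas scikit-learn's GroupKFold also balances fold sizes. The patient IDs are made up and `group_folds` is a hypothetical helper.

```python
def group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) so that no group spans both sides."""
    unique_groups = sorted(set(groups))
    fold_of_group = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx

# Made-up patient IDs, one per spectrum (several spectra per patient)
patient_ids = ["P1", "P1", "P2", "P2", "P2", "P3", "P4", "P4", "P5", "P6"]

splits = list(group_folds(patient_ids, n_splits=3))

# Patients shared between train and test in each fold (should be empty)
leaks = [
    {patient_ids[i] for i in train_idx} & {patient_ids[i] for i in test_idx}
    for train_idx, test_idx in splits
]

for fold, (train_idx, test_idx) in enumerate(splits):
    test_patients = sorted({patient_ids[i] for i in test_idx})
    print(f"fold {fold}: test patients {test_patients}")
print("leaking patients per fold:", leaks)
```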


Last Updated: 2026-01-24