Machine Learning Methods

Comprehensive reference for classification and regression algorithms.

Table of Contents

Support Vector Machines (SVM)
Random Forest
XGBoost
Logistic Regression
Model Evaluation
Hyperparameter Optimization
Feature Importance

Note: Multi-Layer Perceptron (MLP) and neural networks are planned for future releases.

Support Vector Machines (SVM)

Purpose: Binary or multi-class classification using optimal decision boundaries

Theory

Core Concept: Find hyperplane that maximizes margin between classes

Key Components:

Support Vectors: Data points closest to decision boundary
Margin: Distance between hyperplane and nearest points
Kernel: Function to transform data to higher dimensions

Decision Function:

f(x) = sign(Σ αᵢ yᵢ K(xᵢ, x) + b)

Where:

α: Lagrange multipliers
y: Class labels
K: Kernel function
b: Bias term
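
As a minimal sketch of the decision function above, the following evaluates f(x) by hand with an RBF kernel. The support vectors, multipliers, labels, and bias are made-up toy values (not taken from a fitted model), chosen only to show how the terms combine.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Toy values for illustration only (hypothetical, not from a trained SVM)
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([1.0, 1.0])   # Lagrange multipliers α_i
labels = np.array([-1, 1])      # class labels y_i in {-1, +1}
bias = 0.0                      # bias term b
gamma = 0.5

def decision_value(x):
    """Σ α_i y_i K(x_i, x) + b, before taking the sign."""
    total = sum(
        a * y * rbf_kernel(sv, x, gamma)
        for a, y, sv in zip(alphas, labels, support_vectors)
    )
    return total + bias

x_near_negative = np.array([0.1, 0.0])   # close to the class -1 support vector
x_near_positive = np.array([1.9, 2.1])   # close to the class +1 support vector
pred_negative = int(np.sign(decision_value(x_near_negative)))
pred_positive = int(np.sign(decision_value(x_near_positive)))
print(pred_negative, pred_positive)
```

Each query point is dominated by the support vector nearest to it, because the RBF kernel decays quickly with distance.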

Kernel Functions

1. Linear Kernel

Formula: K(x, x') = xᵀx'

When to Use:

✓ Linearly separable data
✓ High-dimensional data (text, spectra)
✓ Large datasets (fast)

Pros:

Fast training and prediction
Interpretable (feature weights)
Less prone to overfitting

Cons:

Cannot handle non-linear boundaries

2. RBF (Radial Basis Function) Kernel

Formula: K(x, x') = exp(-γ ||x - x'||²)

When to Use:

✓ Non-linear boundaries
✓ Default choice for most problems
✓ Unknown data structure

Pros:

Handles non-linearity
Flexible decision boundaries

Cons:

More parameters to tune (C, γ)
Slower than linear
Risk of overfitting

Parameter γ (gamma):

High γ (e.g., 0.1): Narrow influence → Complex boundaries (overfitting risk)
Low γ (e.g., 0.001): Wide influence → Smooth boundaries
‘scale’ (default): γ = 1 / (n_features × variance)
‘auto’: γ = 1 / n_features
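
To make the γ settings above concrete, here is a small sketch that computes the ‘scale’ and ‘auto’ values for a toy matrix and compares the kernel value at a fixed distance for a high and a low γ. The data are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=2.0, size=(50, 4))  # 50 samples, 4 features (toy data)

n_features = X.shape[1]
gamma_scale = 1.0 / (n_features * X.var())  # 'scale': uses the variance over all entries of X
gamma_auto = 1.0 / n_features               # 'auto': depends only on the feature count

def rbf(distance_squared, gamma):
    """RBF kernel value for a given squared distance."""
    return np.exp(-gamma * distance_squared)

# Kernel value for two points at squared distance 100 under high and low γ
k_high = rbf(100.0, 0.1)    # narrow influence: similarity is almost 0
k_low = rbf(100.0, 0.001)   # wide influence: similarity stays close to 1
print(gamma_scale, gamma_auto, k_high, k_low)
```

With γ = 0.1 the two points barely influence each other, while with γ = 0.001 they still look similar, which is why a low γ produces smoother boundaries.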

3. Polynomial Kernel

Formula: K(x, x') = (γ xᵀx' + r)ᵈ

Parameters:

d: Polynomial degree (2 or 3 typical)
γ: Scale factor applied to the inner product
r: Independent term (coef0 in scikit-learn, default 0)

When to Use:

✓ Known polynomial relationship
✓ Image data
✗ Rarely used for Raman spectra
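
The polynomial kernel formula above can be checked against an explicit feature map. The sketch below assumes γ = 1, r = 1, d = 2 and 2-D inputs (values picked for illustration) and shows that the kernel equals an ordinary inner product in a 6-dimensional space.

```python
import numpy as np

def poly_kernel(x, z, gamma=1.0, r=1.0, d=2):
    """K(x, z) = (gamma * x·z + r)^d."""
    return (gamma * np.dot(x, z) + r) ** d

def explicit_features(x):
    """Feature map for 2-D input whose inner product equals (x·z + 1)^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)                                   # kernel trick
k_explicit = np.dot(explicit_features(x), explicit_features(z))  # explicit mapping
print(k_implicit, k_explicit)
```

Both routes give the same number; the kernel simply avoids building the higher-dimensional vectors.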

Hyperparameters

C (Regularization Parameter)

Purpose: Control trade-off between margin and misclassification

Effect:

High C (e.g., 100): Smaller margin, fewer errors → Overfitting risk
Low C (e.g., 0.1): Larger margin, more errors → Underfitting risk

Typical Range: 0.1 - 100

Tuning Strategy:

# Grid search over C
C_range = [0.1, 1, 10, 100]

Gamma (γ) - RBF Kernel Only

Purpose: Define influence radius of single training example

Effect:

High γ (e.g., 1.0): Tight fit → Overfitting
Low γ (e.g., 0.001): Loose fit → Underfitting

Typical Range: 0.0001 - 1.0

Tuning Strategy:

# Grid search over gamma
gamma_range = [0.0001, 0.001, 0.01, 0.1, 1.0]

Usage Example

from sklearn.model_selection import train_test_split
from functions.ML.svm import train_svm_model

# Prepare data
X_train, X_test, y_train, y_test = train_test_split(
    preprocessed_spectra,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels
)

# Train SVM with RBF kernel
svm_model = train_svm_model(
    X_train, y_train,
    kernel='rbf',
    C=10.0,
    gamma='scale',
    random_state=42
)

# Predictions
y_pred = svm_model.predict(X_test)

# Evaluate
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred))

Hyperparameter Optimization

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'],
    'kernel': ['rbf']
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# Best parameters
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Use best model
best_svm = grid_search.best_estimator_

Interpretation

Decision Function Values:

# Get decision function scores
decision_scores = svm_model.decision_function(X_test)

# For binary classification:
# Positive → Class 1
# Negative → Class 0
# Magnitude → Confidence

# For multi-class:
# SVC trains one-vs-one classifiers internally, but with the default
# decision_function_shape='ovr' the returned array has one column per class

Support Vectors:

# Number of support vectors per class
print(f"Support vectors: {svm_model.n_support_}")

# Indices of support vectors
support_indices = svm_model.support_

# Support vectors themselves
support_vectors = X_train[support_indices]

Troubleshooting

Issue	Cause	Solution
Poor accuracy	Wrong kernel	Try RBF if using linear
Overfitting	C too high or γ too high	Reduce C, reduce γ
Underfitting	C too low or γ too low	Increase C, increase γ
Very slow training	Large dataset with RBF	Use LinearSVC or subsample
Validation score far above test score	Data leakage (same patient in training and validation folds)	Use GroupKFold with patient IDs

When to Use

Use SVM when:

✓ Binary or multi-class classification
✓ High-dimensional data (spectra work well)
✓ Clear margin between classes
✓ Small to medium datasets (< 50,000 samples)
✓ Need probability estimates (set probability=True; this adds an internal cross-validated calibration step and slows training)

Consider alternatives when:

✗ Very large datasets (> 100,000) → Random Forest
✗ Need interpretability → Logistic Regression
✗ Categorical features → Random Forest/XGBoost
✗ Need feature importance → Random Forest

Class Imbalance

# Handle imbalanced classes
svm_model = SVC(
    kernel='rbf',
    C=10,
    gamma='scale',
    class_weight='balanced',  # Automatic adjustment
    random_state=42
)

# Or specify custom weights
class_weights = {0: 1.0, 1: 3.0}  # Give 3x weight to class 1
svm_model = SVC(class_weight=class_weights)
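
As a quick check of what class_weight='balanced' does, the sketch below reproduces the weighting heuristic documented by scikit-learn, n_samples / (n_classes * np.bincount(y)), on a toy label vector. The 90/10 split is an illustrative assumption.

```python
import numpy as np

y = np.array([0] * 90 + [1] * 10)   # toy labels: 90 samples of class 0, 10 of class 1

n_samples = len(y)
class_counts = np.bincount(y)        # samples per class: [90, 10]
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
balanced_weights = n_samples / (n_classes * class_counts)
print(dict(enumerate(balanced_weights)))
```

The minority class receives a weight nine times larger than the majority class, so both classes contribute equally to the penalty term overall.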

Reference

Cortes & Vapnik (1995). “Support-Vector Networks”

Random Forest

Purpose: Ensemble of decision trees for robust classification/regression

Theory

Core Concept: Combine multiple decision trees to reduce overfitting

Bagging (Bootstrap Aggregating):

Create multiple bootstrap samples (random sampling with replacement)
Train decision tree on each sample
Aggregate predictions (vote for classification, average for regression)

Random Feature Selection:

At each split, consider random subset of features
Decorrelates trees → Better ensemble

Out-of-Bag (OOB) Error:

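Each tree is trained on a bootstrap sample that contains ~63% of the unique training samples
The remaining ~37% (out-of-bag samples) give a built-in validation estimate; keep a held-out test set for the final evaluation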
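
The ~63% figure follows from sampling with replacement: the chance that a given sample appears at least once in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632. A small simulation (sample size and seed chosen arbitrarily) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# One bootstrap sample: draw n_samples indices with replacement
bootstrap_indices = rng.integers(0, n_samples, size=n_samples)

in_bag_fraction = len(np.unique(bootstrap_indices)) / n_samples
oob_fraction = 1.0 - in_bag_fraction
expected_in_bag = 1.0 - np.exp(-1.0)  # limit of 1 - (1 - 1/n)^n as n grows
print(f"in-bag: {in_bag_fraction:.3f}, out-of-bag: {oob_fraction:.3f}")
```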

Hyperparameters

n_estimators

Purpose: Number of trees in forest

Effect:

More trees → More stable, but slower
Fewer trees → Faster, but less stable

Typical Range: 100 - 1000

Recommendation: Start with 100, increase if the OOB error is still decreasing

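from sklearn.ensemble import RandomForestClassifier

# Check OOB error vs n_trees
oob_errors = []
for n in range(10, 500, 10):
    # Fixed seed so the curve is reproducible; n_jobs=-1 uses all CPU cores
    rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    oob_errors.append(1 - rf.oob_score_)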

# Plot to find plateau

max_depth

Purpose: Maximum depth of each tree

Effect:

None (default): Trees grow until pure leaves
Shallow (e.g., 5-10): Prevents overfitting
Deep (e.g., 20-30): Captures complex patterns

Typical Range: 10 - 30, or None

Tuning:

# Start with None, then limit if overfitting
max_depth = None  # or 20 for regularization

min_samples_split

Purpose: Minimum samples required to split node

Effect:

Low (e.g., 2): More splits → More complex trees
High (e.g., 10-20): Fewer splits → Regularization

Default: 2

Recommendation: Increase to 5-10 if overfitting

min_samples_leaf

Purpose: Minimum samples in leaf node

Effect:

Low (e.g., 1): Allows pure leaves
High (e.g., 5-10): Smooths predictions

Default: 1

Recommendation: Set to 2-5 for regularization

max_features

Purpose: Number of features to consider for each split

Options:

‘sqrt’ (default): √n_features (recommended for classification)
‘log2’: log₂(n_features)
int: Specific number
float: Proportion (e.g., 0.3 = 30%)

Effect:

Fewer features → More diversity → Better ensemble
More features → Individual trees stronger → Less diversity
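
For a sense of scale, the sketch below resolves each max_features option to a feature count for a spectrum with 1024 bins. The helper features_per_split is hypothetical, and the truncation to an integer is an assumption about the rounding rule rather than a statement of the library's exact behavior.

```python
import math

n_features = 1024  # e.g. number of wavenumber bins in a spectrum (illustrative value)

def features_per_split(max_features, n_features):
    """Resolve a max_features setting to a feature count (truncation assumed)."""
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))
    return int(max_features)  # plain integer: used as given

for option in ("sqrt", "log2", 0.3, 50):
    print(option, "->", features_per_split(option, n_features))
```

For high-dimensional spectra, ‘sqrt’ and ‘log2’ look at only a few dozen candidate features per split, which is what keeps the trees decorrelated.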

Usage Example

from functions.ML.random_forest import train_rf_model

# Train Random Forest
rf_model = train_rf_model(
    X_train, y_train,
    n_estimators=100,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

# Predictions
y_pred = rf_model.predict(X_test)

# Prediction probabilities
y_proba = rf_model.predict_proba(X_test)

# Feature importance
importances = rf_model.feature_importances_

Hyperparameter Optimization

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define parameter distributions
param_dist = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.3]
}

# Random search (faster than grid search)
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,  # Number of parameter combinations to try
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

Feature Importance

Gini Importance (Default):

import numpy as np
import matplotlib.pyplot as plt

# Get feature importances
importances = rf_model.feature_importances_

# Sort by importance
indices = np.argsort(importances)[::-1]
top_k = 20

# Plot top features
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), importances[indices[:top_k]])
plt.xlabel('Feature (Wavenumber) Index')
plt.ylabel('Importance')
plt.title('Top 20 Important Features')
plt.tight_layout()
plt.show()

# Map to wavenumbers
top_wavenumbers = wavenumbers[indices[:top_k]]
print(f"Top wavenumbers: {top_wavenumbers}")

Permutation Importance (More Accurate):

from sklearn.inspection import permutation_importance

# Calculate permutation importance
perm_importance = permutation_importance(
    rf_model,
    X_test,
    y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

# Get importances and standard deviations
perm_importances_mean = perm_importance.importances_mean
perm_importances_std = perm_importance.importances_std

# Sort
indices = np.argsort(perm_importances_mean)[::-1]
top_k = 20

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), 
        perm_importances_mean[indices[:top_k]],
        yerr=perm_importances_std[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Permutation Importance')
plt.title('Top 20 Features (Permutation Importance)')
plt.tight_layout()
plt.show()

SHAP Values

Purpose: Explain model predictions via Shapley-value-based feature attribution.

When to use:

✓ Per-sample explanations (which wavenumbers drive a prediction)
✓ Global importance aggregated from local attributions

Note: This typically uses the external shap library.

In this application: SHAP explainability is exposed via the ML page SHAP action and produces a per-spectrum explanation with a contribution bar plot across the Raman shift axis (red = positive, blue = negative), a ranked contributor table, and an exportable bundle.

Out-of-Bag (OOB) Score

# Train with OOB scoring
rf_model = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,
    random_state=42
)
rf_model.fit(X_train, y_train)

# OOB score (similar to cross-validation)
print(f"OOB Score: {rf_model.oob_score_:.3f}")

# OOB class probabilities, shape (n_samples, n_classes)
oob_pred = rf_model.oob_decision_function_

Advantages

✓ Robust: Less prone to overfitting than single tree
✓ Feature importance: Automatic ranking
✓ Handles non-linearity: No kernel needed
✓ Missing values: Can handle with imputation
✓ No scaling required: Works with raw features
✓ Parallelizable: Fast training with multiple cores

Limitations

✗ Black box: Hard to interpret individual predictions
✗ Large models: Memory intensive for many trees
✗ Extrapolation: Poor performance outside training range
✗ Biased importances: Gini importance favors features with many distinct values

When to Use

Use Random Forest when:

✓ Tabular data with mixed features
✓ Need feature importance
✓ Don’t want to tune many hyperparameters
✓ Want robust, reliable performance
✓ Have sufficient data (> 1000 samples)

Consider alternatives when:

✗ Need probabilistic model → Logistic Regression
✗ Have sequential/spatial data → Neural networks
✗ Need speed → Logistic Regression or LinearSVC
✗ Want best accuracy → XGBoost (usually better)

Reference

Breiman (2001). “Random Forests”

XGBoost

Purpose: Gradient boosting for high-performance classification/regression

Theory

Core Concept: Build trees sequentially, each correcting previous errors

Gradient Boosting:

Start with simple model (e.g., mean prediction)
Calculate residuals (errors)
Train new tree to predict residuals
Add to ensemble with small weight
Repeat until convergence
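
The five steps above can be sketched with depth-1 trees (stumps) and squared-error loss. This is plain gradient boosting on toy 1-D data, not XGBoost itself (which adds second-order gradients and regularization); fit_stump and predict_stump are hypothetical helpers written for this illustration.

```python
import numpy as np

def fit_stump(x, residuals):
    """Find the threshold split on 1-D x that best fits the residuals (least squares)."""
    best = None
    for t in np.unique(x)[:-1]:          # exclude the max so the right side is never empty
        left = residuals[x <= t]
        right = residuals[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())            # 1. start with the mean prediction
mse_history = [np.mean((y - prediction) ** 2)]
for _ in range(100):                              # 5. repeat for a fixed number of rounds
    residuals = y - prediction                    # 2. residuals of the current ensemble
    stump = fit_stump(x, residuals)               # 3. fit a new tree to the residuals
    prediction += learning_rate * predict_stump(stump, x)  # 4. add it with a small weight
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE before: {mse_history[0]:.4f}, after: {mse_history[-1]:.4f}")
```

The training error never increases from one round to the next, which is also why the number of rounds has to be controlled with a validation set.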

XGBoost Innovations:

Regularization (L1/L2) to prevent overfitting
Handling missing values automatically
Parallel tree construction (fast)
Built-in cross-validation

Formula:

F(x) = Σ fₖ(x)
where fₖ is k-th tree

Hyperparameters

n_estimators

Purpose: Number of boosting rounds (trees)

Effect:

More trees → Better training fit → Overfitting risk
Fewer trees → Underfitting

Typical Range: 100 - 1000

Best Practice: Use early stopping to find optimal number

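from xgboost import XGBClassifier

# Early stopping: in XGBoost >= 1.6, early_stopping_rounds is set on the estimator
# (releases before 2.0 also accepted it as a fit() argument)
xgb_model = XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50,  # Stop if no improvement for 50 rounds
)
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)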

learning_rate (eta)

Purpose: Step size for each tree’s contribution

Effect:

High (e.g., 0.3): Fast convergence → Overfitting risk
Low (e.g., 0.01): Slow convergence → More trees needed → Better generalization

Typical Range: 0.01 - 0.3

Rule of Thumb: Lower learning_rate requires more n_estimators

# Conservative approach
learning_rate = 0.1
n_estimators = 500

# Aggressive approach (faster, riskier)
learning_rate = 0.3
n_estimators = 100

max_depth

Purpose: Maximum depth of each tree

Effect:

Deep (e.g., 6-10): Captures complex interactions
Shallow (e.g., 3-5): Prevents overfitting

Typical Range: 3 - 10

Default: 6

subsample

Purpose: Fraction of samples used for each tree

Effect:

< 1.0 (e.g., 0.8): Reduces overfitting, speeds up training
= 1.0: Use all samples

Typical Range: 0.5 - 1.0

Recommendation: 0.8 for robustness

colsample_bytree

Purpose: Fraction of features used for each tree

Effect:

< 1.0 (e.g., 0.8): Reduces overfitting, adds diversity
= 1.0: Use all features

Typical Range: 0.5 - 1.0

Recommendation: 0.8

gamma (min_split_loss)

Purpose: Minimum loss reduction to make split

Effect:

Higher: More conservative splitting → Regularization
Lower: More splits → Risk of overfitting

Typical Range: 0 - 5

Default: 0

lambda (reg_lambda) - L2 Regularization

Purpose: L2 penalty on leaf weights

Effect:

Higher: Stronger regularization
Lower: Less regularization

Typical Range: 0 - 10

Default: 1
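
To see how gamma (min_split_loss) and lambda act together, the sketch below evaluates the leaf-weight and split-gain expressions from Chen & Guestrin (2016): w* = -G / (H + λ) and gain = ½ [G_L²/(H_L+λ) + G_R²/(H_R+λ) − (G_L+G_R)²/(H_L+H_R+λ)] − γ, where G and H are sums of first and second derivatives of the loss in a node. The numeric values are made up, and the L1 term (alpha) is left out for simplicity.

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal leaf weight w* = -G / (H + λ)."""
    return -G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    """Loss reduction of a split minus the per-split penalty γ."""
    def score(G, H):
        return G ** 2 / (H + reg_lambda)
    parent = score(G_left + G_right, H_left + H_right)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right) - parent) - gamma

# Toy gradient/Hessian sums for the two children of a candidate split (made-up values)
G_left, H_left = -4.0, 2.0
G_right, H_right = 3.0, 2.0

gain_no_penalty = split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0)
gain_with_penalty = split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=5.0)
w_small_lambda = leaf_weight(G_left, H_left, reg_lambda=0.0)
w_large_lambda = leaf_weight(G_left, H_left, reg_lambda=10.0)

print(f"gain (gamma=0): {gain_no_penalty:.3f}, gain (gamma=5): {gain_with_penalty:.3f}")
print(f"leaf weight (lambda=0): {w_small_lambda:.3f}, (lambda=10): {w_large_lambda:.3f}")
```

A split is only kept when its gain is positive, so raising gamma prunes weak splits, while raising lambda shrinks leaf weights toward zero.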

alpha (reg_alpha) - L1 Regularization

Purpose: L1 penalty on leaf weights (feature selection)

Effect:

Higher: More sparsity (some weights → 0)
Lower: Less sparsity

Typical Range: 0 - 10

Default: 0

Usage Example

from functions.ML.xgboost import train_xgboost_model

# Train XGBoost
xgb_model = train_xgboost_model(
    X_train, y_train,
    n_estimators=500,
    learning_rate=0.1,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=0,
    reg_lambda=1,
    reg_alpha=0,
    random_state=42,
    n_jobs=-1
)

# Predictions
y_pred = xgb_model.predict(X_test)
y_proba = xgb_model.predict_proba(X_test)

Hyperparameter Optimization

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Define parameter distributions
param_dist = {
    'n_estimators': [100, 200, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1, 0.3],
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 1, 5],
    'reg_lambda': [0, 1, 10],
    'reg_alpha': [0, 0.1, 1]
}

# Random search
random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42, eval_metric='logloss'),  # use_label_encoder is deprecated and no longer needed
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

Feature Importance

import numpy as np
import matplotlib.pyplot as plt

# Get feature importances
importances = xgb_model.feature_importances_

# Sort and select top features
indices = np.argsort(importances)[::-1]
top_k = 20

plt.figure(figsize=(10, 6))
plt.bar(range(top_k), importances[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Importance (Gain)')
plt.title('Top 20 Important Features (XGBoost)')
plt.tight_layout()
plt.show()

# Importance types in XGBoost:
# - 'weight': Number of times feature used in splits
# - 'gain': Average gain (default for feature_importances_ with tree boosters)
# - 'cover': Average coverage
# Note: get_score() defaults to 'weight', so pass importance_type explicitly

# Get different importance types
importance_gain = xgb_model.get_booster().get_score(importance_type='gain')
importance_weight = xgb_model.get_booster().get_score(importance_type='weight')

Learning Curves

# Train with evaluation set to monitor performance
eval_set = [(X_train, y_train), (X_val, y_val)]

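# eval_metric is set on the estimator (XGBoost >= 1.6; passing it to fit() was removed in 2.0)
xgb_model.set_params(eval_metric='logloss')
xgb_model.fit(
    X_train, y_train,
    eval_set=eval_set,
    verbose=50  # Print every 50 rounds
)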

# Get evaluation results
results = xgb_model.evals_result()

# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(results['validation_0']['logloss'], label='Train')
plt.plot(results['validation_1']['logloss'], label='Validation')
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.legend()
plt.title('XGBoost Learning Curves')
plt.tight_layout()
plt.show()

Advantages

✓ State-of-the-art accuracy: Often wins competitions
✓ Regularization: Built-in L1/L2
✓ Handles missing data: Automatically
✓ Fast: Parallel tree construction
✓ Early stopping: Prevents overfitting
✓ Feature importance: Gain, weight, cover

Limitations

✗ Many hyperparameters: Requires tuning
✗ Sensitive to overfitting: With default params
✗ Black box: Hard to interpret
✗ Memory intensive: For large datasets

When to Use

Use XGBoost when:

✓ Need best possible accuracy
✓ Tabular data
✓ Have time to tune hyperparameters
✓ Competition or critical application
✓ Medium to large datasets

Consider alternatives when:

✗ Need interpretability → Logistic Regression
✗ Limited time → Random Forest (fewer params)
✗ Very small data → Logistic Regression or SVM

Reference

Chen & Guestrin (2016). “XGBoost: A Scalable Tree Boosting System”

Logistic Regression

Purpose: Linear model for probabilistic binary/multi-class classification

Theory

Core Concept: Model log-odds as linear combination of features

Binary Classification:

P(y=1|x) = 1 / (1 + exp(-(β₀ + β₁x₁ + ... + βₙxₙ)))

Multi-class (one-vs-rest or multinomial):

P(y=k|x) = exp(xᵀβₖ) / Σⱼ exp(xᵀβⱼ)

Decision Boundary: Linear in feature space
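
The two probability formulas above can be evaluated directly. The sketch below uses made-up coefficients and a toy feature vector, and also checks that a two-class softmax with scores [0, z] reproduces the binary sigmoid.

```python
import numpy as np

def sigmoid(z):
    """Binary case: P(y=1|x) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Multi-class case: exp(score_k) / sum_j exp(score_j)."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

x = np.array([0.5, -1.2, 2.0])           # toy feature vector
beta0 = 0.1                              # intercept β₀ (made-up)
beta = np.array([0.8, 0.3, -0.5])        # coefficients β₁..βₙ (made-up)

z = beta0 + beta @ x                     # linear log-odds
p_binary = sigmoid(z)

# Softmax over two classes with scores [0, z] gives the same probability
p_two_class = softmax(np.array([0.0, z]))[1]

# Three-class example: one linear score per class
class_scores = np.array([1.0, 0.2, -0.5])
p_multi = softmax(class_scores)
print(p_binary, p_two_class, p_multi)
```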

Hyperparameters

C (Inverse Regularization)

Purpose: Control strength of regularization

Effect:

High C (e.g., 100): Weak regularization → Overfitting risk
Low C (e.g., 0.01): Strong regularization → Underfitting risk

Typical Range: 0.01 - 100

Note: C is the inverse of regularization strength, as in SVM (unlike alpha or lambda in other methods, where a larger value means stronger regularization)

penalty

Purpose: Type of regularization

Options:

‘l2’ (default): Ridge regularization (shrinks coefficients)
‘l1’: Lasso regularization (sparse coefficients, feature selection)
‘elasticnet’: Mix of L1 and L2
None: No regularization (written as the string ‘none’ in scikit-learn releases before 1.2)

When to Use:

L2: Default, works well generally
L1: When want feature selection (many irrelevant features)
Elasticnet: When L1 too aggressive

solver

Purpose: Optimization algorithm

Options:

‘lbfgs’ (default): Good for small datasets, supports L2 or no penalty
‘saga’: Supports all penalties, good for large datasets
‘liblinear’: Good for small datasets, supports L1/L2

Recommendation: Use ‘saga’ for flexibility

max_iter

Purpose: Maximum iterations for convergence

Default: 100

Increase if: Warning about non-convergence

Typical Range: 100 - 10000

Usage Example

from functions.ML.logistic_regression import train_lr_model

# Train Logistic Regression
lr_model = train_lr_model(
    X_train, y_train,
    C=1.0,
    penalty='l2',
    solver='saga',
    max_iter=1000,
    random_state=42
)

# Predictions
y_pred = lr_model.predict(X_test)
y_proba = lr_model.predict_proba(X_test)

# Get coefficients (feature weights)
coefficients = lr_model.coef_[0]  # For binary classification

# Intercept
intercept = lr_model.intercept_

Hyperparameter Optimization

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['saga'],
    'max_iter': [1000]
}

# Grid search
grid_search = GridSearchCV(
    LogisticRegression(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

Interpretation

Coefficients (Feature Weights):

import numpy as np
import matplotlib.pyplot as plt

# Get coefficients
coefs = lr_model.coef_[0]

# Find most important features
indices = np.argsort(np.abs(coefs))[::-1]
top_k = 20

# Plot
plt.figure(figsize=(10, 6))
plt.bar(range(top_k), coefs[indices[:top_k]])
plt.xlabel('Feature Index')
plt.ylabel('Coefficient')
plt.title('Top 20 Feature Coefficients')
plt.axhline(y=0, color='k', linestyle='--')
plt.tight_layout()
plt.show()

# Positive coefficient → Increases probability of class 1
# Negative coefficient → Decreases probability of class 1

Odds Ratios:

# Convert coefficients to odds ratios
odds_ratios = np.exp(coefs)

# Interpretation:
# OR = 2.0 → One unit increase in feature doubles odds
# OR = 0.5 → One unit increase halves odds
# OR = 1.0 → No effect

print("Odds ratios for top features:")
for i in indices[:10]:
    print(f"  Feature {i}: OR = {odds_ratios[i]:.3f}")

Probability Calibration:

# Check calibration
from sklearn.calibration import calibration_curve

# Predicted probabilities
y_proba = lr_model.predict_proba(X_test)[:, 1]

# Calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, y_proba, n_bins=10
)

# Plot
plt.figure(figsize=(8, 6))
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated')
plt.plot(mean_predicted_value, fraction_of_positives, 'o-', label='Logistic Regression')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.tight_layout()
plt.show()

Advantages

✓ Interpretable: Clear feature weights
✓ Probabilistic: Direct probability estimates
✓ Fast: Training and prediction
✓ Well-calibrated: Reliable probabilities
✓ Few hyperparameters: Mainly C (plus the choice of penalty and solver)
✓ Linear decision boundary: Simple

Limitations

✗ Linear only: Cannot capture non-linear relationships
✗ Feature engineering: May need manual feature creation
✗ Sensitive to scaling: Requires standardization
✗ Multicollinearity: Correlated features problematic

When to Use

Use Logistic Regression when:

✓ Need interpretability (coefficients)
✓ Want probabilistic outputs
✓ Linearly separable data
✓ Baseline model
✓ Small datasets
✓ Need fast predictions

Consider alternatives when:

✗ Non-linear boundaries → SVM (RBF) or Random Forest
✗ Many irrelevant features → Random Forest (feature importance)
✗ Complex interactions → XGBoost or Neural Networks

Reference

Cox (1958). “The Regression Analysis of Binary Sequences”

Multi-Layer Perceptron (MLP)

Purpose: Neural network for non-linear classification/regression

Theory

Architecture: Input → Hidden Layers → Output

Neuron Computation:

output = activation(Σ wᵢxᵢ + b)

Activation Functions:

ReLU: f(x) = max(0, x) [Default, recommended]
Tanh: f(x) = tanh(x)
Logistic: f(x) = 1/(1 + e⁻ˣ)

Training: Backpropagation with gradient descent
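
As a minimal sketch of the neuron computation above, the following runs one forward pass through a single hidden layer with ReLU, followed by a softmax output layer. The weights are random and untrained, and the layer sizes are arbitrary toy values; no backpropagation is shown.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_samples, n_inputs, n_hidden, n_classes = 5, 8, 4, 3   # toy sizes

X = rng.normal(size=(n_samples, n_inputs))
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))   # random, untrained weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

hidden = relu(X @ W1 + b1)                  # each hidden neuron: activation(Σ wᵢxᵢ + b)
probabilities = softmax(hidden @ W2 + b2)   # one probability per class and sample
print(probabilities.shape)
```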

Hyperparameters

hidden_layer_sizes

Purpose: Architecture of hidden layers

Format: Tuple (layer1_size, layer2_size, …)

Examples:

# Single hidden layer with 100 neurons
hidden_layer_sizes = (100,)

# Two hidden layers (100, 50)
hidden_layer_sizes = (100, 50)

# Three hidden layers (200, 100, 50)
hidden_layer_sizes = (200, 100, 50)

Rules of Thumb:

Start with (100,) or (100, 50)
More neurons → More capacity → Overfitting risk
More layers → Can learn complex patterns

activation

Purpose: Non-linear activation function

Options:

‘relu’ (default): Recommended, works well
‘tanh’: Alternative, slower convergence
‘logistic’: Sigmoid, rarely used

Recommendation: Use ‘relu’

alpha

Purpose: L2 regularization strength

Effect:

Higher (e.g., 0.01): Strong regularization
Lower (e.g., 0.0001): Weak regularization

Typical Range: 0.0001 - 0.01

Default: 0.0001

learning_rate_init

Purpose: Initial learning rate

Effect:

Higher (e.g., 0.01): Faster convergence → Instability risk
Lower (e.g., 0.0001): Slower convergence → More stable

Typical Range: 0.0001 - 0.01

Default: 0.001

max_iter

Purpose: Maximum epochs (training iterations)

Default: 200

Typical Range: 200 - 1000

Increase if: Training loss still decreasing

early_stopping

Purpose: Stop training when validation score stops improving

Recommended: True

Parameters:

validation_fraction: Fraction for validation (default 0.1)
n_iter_no_change: Patience (default 10)

Usage Example

from functions.ML.mlp import train_mlp_model

# Train MLP
mlp_model = train_mlp_model(
    X_train, y_train,
    hidden_layer_sizes=(100, 50),
    activation='relu',
    alpha=0.0001,
    learning_rate_init=0.001,
    max_iter=500,
    early_stopping=True,
    random_state=42
)

# Predictions
y_pred = mlp_model.predict(X_test)
y_proba = mlp_model.predict_proba(X_test)

Hyperparameter Optimization

from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Define parameter distributions
param_dist = {
    'hidden_layer_sizes': [(50,), (100,), (100, 50), (200, 100)],
    'activation': ['relu', 'tanh'],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.0001, 0.001, 0.01],
    'max_iter': [500]
}

# Random search
random_search = RandomizedSearchCV(
    MLPClassifier(early_stopping=True, random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

Learning Curves

# Access loss history
train_loss = mlp_model.loss_curve_

# Plot
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(train_loss)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('MLP Training Loss')
plt.tight_layout()
plt.show()

# Check for convergence:
# - Should decrease steadily
# - Flatten at end
# - If still decreasing → increase max_iter

Advantages

✓ Non-linear: Can learn complex patterns
✓ Flexible: Arbitrary architectures
✓ Universal approximator: Theoretically can learn any function
✓ Feature learning: Automatic feature extraction

Limitations

✗ Black box: Hard to interpret
✗ Hyperparameters: Many to tune
✗ Convergence: Can be slow or unstable
✗ Scaling required: Sensitive to feature scales
✗ Random initialization: Different results per run

When to Use

Use MLP when:

✓ Non-linear, complex patterns
✓ Large datasets (> 10,000 samples)
✓ Don’t need interpretability
✓ Have time to tune
✓ Sufficient data to prevent overfitting

Consider alternatives when:

✗ Need interpretability → Logistic Regression
✗ Small data → SVM or Random Forest
✗ Want speed → Random Forest
✗ Tabular data → XGBoost (usually better)

Reference

Rumelhart et al. (1986). “Learning representations by back-propagating errors”

Model Evaluation

Classification Metrics

Accuracy

Formula: (TP + TN) / (TP + TN + FP + FN)

When to Use: Balanced classes

Limitation: Misleading for imbalanced data

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")
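
To illustrate why accuracy misleads on imbalanced data, consider a toy screening set with 5% positives and a trivial classifier that always predicts the majority class. The numbers are invented for the example.

```python
# Toy screening set: 95 healthy (0) and 5 diseased (1) samples
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100            # a "classifier" that always predicts healthy

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_positive = true_positives / sum(t == 1 for t in y_true)
print(f"Accuracy: {accuracy:.2f}, recall on class 1: {recall_positive:.2f}")
```

The accuracy looks excellent even though not a single diseased sample is detected.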

Precision

Formula: TP / (TP + FP)

Interpretation: Of predicted positives, how many are correct?

When to Use: Cost of false positives high (e.g., spam detection)

from sklearn.metrics import precision_score

precision = precision_score(y_test, y_pred, average='weighted')
print(f"Precision: {precision:.3f}")

Recall (Sensitivity)

Formula: TP / (TP + FN)

Interpretation: Of actual positives, how many detected?

When to Use: Cost of false negatives high (e.g., disease screening)

from sklearn.metrics import recall_score

recall = recall_score(y_test, y_pred, average='weighted')
print(f"Recall: {recall:.3f}")

F1-Score

Formula: 2 × (Precision × Recall) / (Precision + Recall)

Interpretation: Harmonic mean of precision and recall

When to Use: Balance precision and recall

from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1-Score: {f1:.3f}")
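
The accuracy, precision, recall, and F1 formulas above can be checked by hand from confusion-matrix counts. The counts below are illustrative values for a binary problem.

```python
# Illustrative confusion-matrix counts for a binary problem
TP, FP, FN, TN = 40, 10, 20, 30

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-Score:  {f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of precision and recall than a simple average would.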

ROC-AUC

Purpose: Measure discrimination ability across all thresholds

Range: 0 to 1; 0.5 corresponds to random ranking and 1.0 to perfect separation (values below 0.5 indicate inverted predictions)

When to Use: Imbalanced classes, need threshold-independent metric

from sklearn.metrics import roc_auc_score, roc_curve

# Binary classification
y_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc:.3f}")

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)

import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.tight_layout()
plt.show()
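
ROC-AUC also has a ranking interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A small sketch with four toy samples computes it by comparing all positive/negative pairs:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])   # toy predicted probabilities

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative; ties count as half
diff = pos[:, None] - neg[None, :]
auc_pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
print(f"Pairwise AUC: {auc_pairwise:.2f}")
```

Three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75 regardless of which decision threshold is chosen later.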

Confusion Matrix

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

Classification Report

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred, target_names=class_names)
print(report)

# Example output:
#               precision    recall  f1-score   support
#
#      Class A       0.92      0.95      0.93        20
#      Class B       0.88      0.85      0.87        20
#
#     accuracy                           0.90        40
#    macro avg       0.90      0.90      0.90        40
# weighted avg       0.90      0.90      0.90        40

Cross-Validation

K-Fold Cross-Validation

from sklearn.model_selection import cross_val_score

# 5-fold CV
scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=5,
    scoring='accuracy'
)

print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Stratified K-Fold

Use when: Imbalanced classes

from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(
    model,
    X_train,
    y_train,
    cv=skf,
    scoring='accuracy'
)

print(f"Stratified CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Group K-Fold (Patient-Level)

Critical for Raman data: Prevent data leakage from same patient

from sklearn.model_selection import GroupKFold, cross_val_score

# groups: Patient IDs for each sample
gkf = GroupKFold(n_splits=5)

scores = cross_val_score(
    model,
    X_train,
    y_train,
    groups=patient_ids,  # Ensure all samples from same patient in same fold
    cv=gkf,
    scoring='accuracy'
)

print(f"Group CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
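
A quick way to confirm that a grouped split is leak-free is to check that no patient appears on both sides of any fold. The sketch below does this for a toy set of 12 spectra from 4 patients; the patient IDs and data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy setup: 12 spectra from 4 patients, 3 spectra each (hypothetical IDs)
patient_ids = np.repeat(["P1", "P2", "P3", "P4"], 3)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=patient_ids):
    # Patients that appear in both the training and validation part of this fold
    shared = set(patient_ids[train_idx]) & set(patient_ids[val_idx])
    overlaps.append(len(shared))
    print("validation patients:", sorted(set(patient_ids[val_idx])))
```

The same check can be run on real data before cross-validation; any non-zero overlap means spectra from one patient are leaking between training and validation.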