# Machine Learning Methods Comprehensive reference for classification and regression algorithms. ## Table of Contents - [Support Vector Machines (SVM)](#support-vector-machines-svm) - [Random Forest](#random-forest) - [XGBoost](#xgboost) - [Logistic Regression](#logistic-regression) - [Model Evaluation](#model-evaluation) - [Hyperparameter Optimization](#hyperparameter-optimization) - [Feature Importance](#feature-importance) **Note**: Multi-Layer Perceptron (MLP) and neural networks are planned for future releases. --- ## Support Vector Machines (SVM) **Purpose**: Binary or multi-class classification using optimal decision boundaries ### Theory **Core Concept**: Find hyperplane that maximizes margin between classes **Key Components**: 1. **Support Vectors**: Data points closest to decision boundary 2. **Margin**: Distance between hyperplane and nearest points 3. **Kernel**: Function to transform data to higher dimensions **Decision Function**: ``` f(x) = sign(Σ αᵢ yᵢ K(xᵢ, x) + b) ``` Where: - α: Lagrange multipliers - y: Class labels - K: Kernel function - b: Bias term ### Kernel Functions #### 1. Linear Kernel **Formula**: `K(x, x') = xᵀx'` **When to Use**: - ✓ Linearly separable data - ✓ High-dimensional data (text, spectra) - ✓ Large datasets (fast) **Pros**: - Fast training and prediction - Interpretable (feature weights) - Less prone to overfitting **Cons**: - Cannot handle non-linear boundaries #### 2. RBF (Radial Basis Function) Kernel **Formula**: `K(x, x') = exp(-γ ||x - x'||²)` **When to Use**: - ✓ Non-linear boundaries - ✓ Default choice for most problems - ✓ Unknown data structure **Pros**: - Handles non-linearity - Flexible decision boundaries **Cons**: - More parameters to tune (C, γ) - Slower than linear - Risk of overfitting **Parameter γ (gamma)**: - **High γ** (e.g., 0.1): Narrow influence → Complex boundaries (overfitting risk) - **Low γ** (e.g., 0.001): Wide influence → Smooth boundaries - **'scale'** (default): γ = 1 / (n_features × variance) - **'auto'**: γ = 1 / n_features #### 3. Polynomial Kernel **Formula**: `K(x, x') = (γ xᵀx' + r)ᵈ` **Parameters**: - **d**: Polynomial degree (2 or 3 typical) - **r**: Coefficient (usually 0) **When to Use**: - ✓ Known polynomial relationship - ✓ Image data - ✗ Rarely used for Raman spectra ### Hyperparameters #### C (Regularization Parameter) **Purpose**: Control trade-off between margin and misclassification **Effect**: - **High C** (e.g., 100): Smaller margin, fewer errors → Overfitting risk - **Low C** (e.g., 0.1): Larger margin, more errors → Underfitting risk **Typical Range**: 0.1 - 100 **Tuning Strategy**: ```python # Grid search over C C_range = [0.1, 1, 10, 100] ``` #### Gamma (γ) - RBF Kernel Only **Purpose**: Define influence radius of single training example **Effect**: - **High γ** (e.g., 1.0): Tight fit → Overfitting - **Low γ** (e.g., 0.001): Loose fit → Underfitting **Typical Range**: 0.0001 - 1.0 **Tuning Strategy**: ```python # Grid search over gamma gamma_range = [0.0001, 0.001, 0.01, 0.1, 1.0] ``` ### Usage Example ```python from functions.ML.svm import train_svm_model # Prepare data X_train, X_test, y_train, y_test = train_test_split( preprocessed_spectra, labels, test_size=0.2, random_state=42, stratify=labels ) # Train SVM with RBF kernel svm_model = train_svm_model( X_train, y_train, kernel='rbf', C=10.0, gamma='scale', random_state=42 ) # Predictions y_pred = svm_model.predict(X_test) # Evaluate from sklearn.metrics import accuracy_score, classification_report accuracy = accuracy_score(y_test, y_pred) print(f"Accuracy: {accuracy:.3f}") print(classification_report(y_test, y_pred)) ``` ### Hyperparameter Optimization ```python from sklearn.model_selection import GridSearchCV from sklearn.svm import SVC # Define parameter grid param_grid = { 'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 'scale', 'auto'], 'kernel': ['rbf'] } # Grid search with cross-validation grid_search = GridSearchCV( SVC(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1 ) grid_search.fit(X_train, y_train) # Best parameters print(f"Best parameters: {grid_search.best_params_}") print(f"Best CV score: {grid_search.best_score_:.3f}") # Use best model best_svm = grid_search.best_estimator_ ``` ### Interpretation **Decision Function Values**: ```python # Get decision function scores decision_scores = svm_model.decision_function(X_test) # For binary classification: # Positive → Class 1 # Negative → Class 0 # Magnitude → Confidence # For multi-class (one-vs-one): # Multiple decision functions (one per pair) ``` **Support Vectors**: ```python # Number of support vectors per class print(f"Support vectors: {svm_model.n_support_}") # Indices of support vectors support_indices = svm_model.support_ # Support vectors themselves support_vectors = X_train[support_indices] ``` ### Troubleshooting | Issue | Cause | Solution | | ------------------ | ------------------------ | -------------------------------- | | Poor accuracy | Wrong kernel | Try RBF if using linear | | Overfitting | C too high or γ too high | Reduce C, reduce γ | | Underfitting | C too low or γ too low | Increase C, increase γ | | Very slow training | Large dataset with RBF | Use LinearSVC or subsample | | Poor validation | Data leakage | Use GroupKFold for patient-level | ### When to Use **Use SVM when**: - ✓ Binary or multi-class classification - ✓ High-dimensional data (spectra work well) - ✓ Clear margin between classes - ✓ Small to medium datasets (< 50,000 samples) - ✓ Need probabilistic outputs (use probability=True) **Consider alternatives when**: - ✗ Very large datasets (> 100,000) → Random Forest - ✗ Need interpretability → Logistic Regression - ✗ Categorical features → Random Forest/XGBoost - ✗ Need feature importance → Random Forest ### Class Imbalance ```python # Handle imbalanced classes svm_model = SVC( kernel='rbf', C=10, gamma='scale', class_weight='balanced', # Automatic adjustment random_state=42 ) # Or specify custom weights class_weights = {0: 1.0, 1: 3.0} # Give 3x weight to class 1 svm_model = SVC(class_weight=class_weights) ``` ### Reference Cortes & Vapnik (1995). "Support-Vector Networks" --- ## Random Forest **Purpose**: Ensemble of decision trees for robust classification/regression ### Theory **Core Concept**: Combine multiple decision trees to reduce overfitting **Bagging (Bootstrap Aggregating)**: 1. Create multiple bootstrap samples (random sampling with replacement) 2. Train decision tree on each sample 3. Aggregate predictions (vote for classification, average for regression) **Random Feature Selection**: - At each split, consider random subset of features - Decorrelates trees → Better ensemble **Out-of-Bag (OOB) Error**: - Each tree trained on ~63% of data - Remaining 37% used for validation (no separate test set needed) ### Hyperparameters #### n_estimators **Purpose**: Number of trees in forest **Effect**: - **More trees** → More stable, but slower - **Fewer trees** → Faster, but less stable **Typical Range**: 100 - 1000 **Recommendation**: Start with 100, increase if training loss still decreasing ```python # Check OOB error vs n_trees oob_errors = [] for n in range(10, 500, 10): rf = RandomForestClassifier(n_estimators=n, oob_score=True) rf.fit(X_train, y_train) oob_errors.append(1 - rf.oob_score_) # Plot to find plateau ``` #### max_depth **Purpose**: Maximum depth of each tree **Effect**: - **None** (default): Trees grow until pure leaves - **Shallow** (e.g., 5-10): Prevents overfitting - **Deep** (e.g., 20-30): Captures complex patterns **Typical Range**: 10 - 30, or None **Tuning**: ```python # Start with None, then limit if overfitting max_depth = None # or 20 for regularization ``` #### min_samples_split **Purpose**: Minimum samples required to split node **Effect**: - **Low** (e.g., 2): More splits → More complex trees - **High** (e.g., 10-20): Fewer splits → Regularization **Default**: 2 **Recommendation**: Increase to 5-10 if overfitting #### min_samples_leaf **Purpose**: Minimum samples in leaf node **Effect**: - **Low** (e.g., 1): Allows pure leaves - **High** (e.g., 5-10): Smooths predictions **Default**: 1 **Recommendation**: Set to 2-5 for regularization #### max_features **Purpose**: Number of features to consider for each split **Options**: - **'sqrt'** (default): √n_features (recommended for classification) - **'log2'**: log₂(n_features) - **int**: Specific number - **float**: Proportion (e.g., 0.3 = 30%) **Effect**: - **Fewer features** → More diversity → Better ensemble - **More features** → Individual trees stronger → Less diversity ### Usage Example ```python from functions.ML.random_forest import train_rf_model # Train Random Forest rf_model = train_rf_model( X_train, y_train, n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', random_state=42, n_jobs=-1 # Use all CPU cores ) # Predictions y_pred = rf_model.predict(X_test) # Prediction probabilities y_proba = rf_model.predict_proba(X_test) # Feature importance importances = rf_model.feature_importances_ ``` ### Hyperparameter Optimization ```python from sklearn.model_selection import RandomizedSearchCV from sklearn.ensemble import RandomForestClassifier # Define parameter distributions param_dist = { 'n_estimators': [100, 200, 500], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'max_features': ['sqrt', 'log2', 0.3] } # Random search (faster than grid search) random_search = RandomizedSearchCV( RandomForestClassifier(random_state=42), param_distributions=param_dist, n_iter=50, # Number of parameter combinations to try cv=5, scoring='accuracy', n_jobs=-1, random_state=42, verbose=1 ) random_search.fit(X_train, y_train) print(f"Best parameters: {random_search.best_params_}") print(f"Best CV score: {random_search.best_score_:.3f}") ``` ### Feature Importance **Gini Importance** (Default): ```python import numpy as np import matplotlib.pyplot as plt # Get feature importances importances = rf_model.feature_importances_ # Sort by importance indices = np.argsort(importances)[::-1] top_k = 20 # Plot top features plt.figure(figsize=(10, 6)) plt.bar(range(top_k), importances[indices[:top_k]]) plt.xlabel('Feature (Wavenumber) Index') plt.ylabel('Importance') plt.title('Top 20 Important Features') plt.tight_layout() plt.show() # Map to wavenumbers top_wavenumbers = wavenumbers[indices[:top_k]] print(f"Top wavenumbers: {top_wavenumbers}") ``` **Permutation Importance** (More Accurate): ```python from sklearn.inspection import permutation_importance # Calculate permutation importance perm_importance = permutation_importance( rf_model, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1 ) # Get importances and standard deviations perm_importances_mean = perm_importance.importances_mean perm_importances_std = perm_importance.importances_std # Sort indices = np.argsort(perm_importances_mean)[::-1] top_k = 20 # Plot plt.figure(figsize=(10, 6)) plt.bar(range(top_k), perm_importances_mean[indices[:top_k]], yerr=perm_importances_std[indices[:top_k]]) plt.xlabel('Feature Index') plt.ylabel('Permutation Importance') plt.title('Top 20 Features (Permutation Importance)') plt.tight_layout() plt.show() ``` (shap-values)= #### SHAP Values **Purpose**: Explain model predictions via Shapley-value-based feature attribution. **When to use**: - ✓ Per-sample explanations (which wavenumbers drive a prediction) - ✓ Global importance aggregated from local attributions **Note**: This typically uses the external `shap` library. **In this application**: SHAP explainability is exposed via the ML page **SHAP** action and produces a per-spectrum explanation with a contribution **bar plot across the Raman shift axis** (red = positive, blue = negative), a ranked contributor table, and an exportable bundle. ### Out-of-Bag (OOB) Score ```python # Train with OOB scoring rf_model = RandomForestClassifier( n_estimators=100, oob_score=True, random_state=42 ) rf_model.fit(X_train, y_train) # OOB score (similar to cross-validation) print(f"OOB Score: {rf_model.oob_score_:.3f}") # OOB predictions oob_pred = rf_model.oob_decision_function_ ``` ### Advantages ✓ **Robust**: Less prone to overfitting than single tree ✓ **Feature importance**: Automatic ranking ✓ **Handles non-linearity**: No kernel needed ✓ **Missing values**: Can handle with imputation ✓ **No scaling required**: Works with raw features ✓ **Parallelizable**: Fast training with multiple cores ### Limitations ✗ **Black box**: Hard to interpret individual predictions ✗ **Large models**: Memory intensive for many trees ✗ **Extrapolation**: Poor performance outside training range ✗ **Biased**: Toward features with many categories ### When to Use **Use Random Forest when**: - ✓ Tabular data with mixed features - ✓ Need feature importance - ✓ Don't want to tune many hyperparameters - ✓ Want robust, reliable performance - ✓ Have sufficient data (> 1000 samples) **Consider alternatives when**: - ✗ Need probabilistic model → Logistic Regression - ✗ Have sequential/spatial data → Neural networks - ✗ Need speed → Logistic Regression or LinearSVC - ✗ Want best accuracy → XGBoost (usually better) ### Reference Breiman (2001). "Random Forests" --- ## XGBoost **Purpose**: Gradient boosting for high-performance classification/regression ### Theory **Core Concept**: Build trees sequentially, each correcting previous errors **Gradient Boosting**: 1. Start with simple model (e.g., mean prediction) 2. Calculate residuals (errors) 3. Train new tree to predict residuals 4. Add to ensemble with small weight 5. Repeat until convergence **XGBoost Innovations**: - Regularization (L1/L2) to prevent overfitting - Handling missing values automatically - Parallel tree construction (fast) - Built-in cross-validation **Formula**: ``` F(x) = Σ fₖ(x) where fₖ is k-th tree ``` ### Hyperparameters #### n_estimators **Purpose**: Number of boosting rounds (trees) **Effect**: - **More trees** → Better training fit → Overfitting risk - **Fewer trees** → Underfitting **Typical Range**: 100 - 1000 **Best Practice**: Use early stopping to find optimal number ```python # Early stopping xgb_model.fit( X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=50, # Stop if no improvement for 50 rounds verbose=False ) ``` #### learning_rate (eta) **Purpose**: Step size for each tree's contribution **Effect**: - **High** (e.g., 0.3): Fast convergence → Overfitting risk - **Low** (e.g., 0.01): Slow convergence → More trees needed → Better generalization **Typical Range**: 0.01 - 0.3 **Rule of Thumb**: Lower learning_rate requires more n_estimators ```python # Conservative approach learning_rate = 0.1 n_estimators = 500 # Aggressive approach (faster, riskier) learning_rate = 0.3 n_estimators = 100 ``` #### max_depth **Purpose**: Maximum depth of each tree **Effect**: - **Deep** (e.g., 6-10): Captures complex interactions - **Shallow** (e.g., 3-5): Prevents overfitting **Typical Range**: 3 - 10 **Default**: 6 #### subsample **Purpose**: Fraction of samples used for each tree **Effect**: - **< 1.0** (e.g., 0.8): Reduces overfitting, speeds up training - **= 1.0**: Use all samples **Typical Range**: 0.5 - 1.0 **Recommendation**: 0.8 for robustness #### colsample_bytree **Purpose**: Fraction of features used for each tree **Effect**: - **< 1.0** (e.g., 0.8): Reduces overfitting, adds diversity - **= 1.0**: Use all features **Typical Range**: 0.5 - 1.0 **Recommendation**: 0.8 #### gamma (min_split_loss) **Purpose**: Minimum loss reduction to make split **Effect**: - **Higher**: More conservative splitting → Regularization - **Lower**: More splits → Risk of overfitting **Typical Range**: 0 - 5 **Default**: 0 #### lambda (reg_lambda) - L2 Regularization **Purpose**: L2 penalty on leaf weights **Effect**: - **Higher**: Stronger regularization - **Lower**: Less regularization **Typical Range**: 0 - 10 **Default**: 1 #### alpha (reg_alpha) - L1 Regularization **Purpose**: L1 penalty on leaf weights (feature selection) **Effect**: - **Higher**: More sparsity (some weights → 0) - **Lower**: Less sparsity **Typical Range**: 0 - 10 **Default**: 0 ### Usage Example ```python from functions.ML.xgboost import train_xgboost_model # Train XGBoost xgb_model = train_xgboost_model( X_train, y_train, n_estimators=500, learning_rate=0.1, max_depth=6, subsample=0.8, colsample_bytree=0.8, gamma=0, reg_lambda=1, reg_alpha=0, random_state=42, n_jobs=-1 ) # Predictions y_pred = xgb_model.predict(X_test) y_proba = xgb_model.predict_proba(X_test) ``` ### Hyperparameter Optimization ```python from sklearn.model_selection import RandomizedSearchCV from xgboost import XGBClassifier # Define parameter distributions param_dist = { 'n_estimators': [100, 200, 500, 1000], 'learning_rate': [0.01, 0.05, 0.1, 0.3], 'max_depth': [3, 5, 7, 9], 'subsample': [0.6, 0.8, 1.0], 'colsample_bytree': [0.6, 0.8, 1.0], 'gamma': [0, 1, 5], 'reg_lambda': [0, 1, 10], 'reg_alpha': [0, 0.1, 1] } # Random search random_search = RandomizedSearchCV( XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'), param_distributions=param_dist, n_iter=50, cv=5, scoring='accuracy', n_jobs=-1, random_state=42, verbose=1 ) random_search.fit(X_train, y_train) print(f"Best parameters: {random_search.best_params_}") print(f"Best CV score: {random_search.best_score_:.3f}") ``` ### Feature Importance ```python import matplotlib.pyplot as plt # Get feature importances importances = xgb_model.feature_importances_ # Plot top features import numpy as np indices = np.argsort(importances)[::-1] top_k = 20 plt.figure(figsize=(10, 6)) plt.bar(range(top_k), importances[indices[:top_k]]) plt.xlabel('Feature Index') plt.ylabel('Importance (Gain)') plt.title('Top 20 Important Features (XGBoost)') plt.tight_layout() plt.show() # Importance types in XGBoost: # - 'weight': Number of times feature used # - 'gain': Average gain (default, most useful) # - 'cover': Average coverage # Get different importance types import xgboost as xgb importance_gain = xgb_model.get_booster().get_score(importance_type='gain') importance_weight = xgb_model.get_booster().get_score(importance_type='weight') ``` ### Learning Curves ```python # Train with evaluation set to monitor performance eval_set = [(X_train, y_train), (X_val, y_val)] xgb_model.fit( X_train, y_train, eval_set=eval_set, eval_metric='logloss', verbose=50 # Print every 50 rounds ) # Get evaluation results results = xgb_model.evals_result() # Plot learning curves plt.figure(figsize=(10, 6)) plt.plot(results['validation_0']['logloss'], label='Train') plt.plot(results['validation_1']['logloss'], label='Validation') plt.xlabel('Boosting Round') plt.ylabel('Log Loss') plt.legend() plt.title('XGBoost Learning Curves') plt.tight_layout() plt.show() ``` ### Advantages ✓ **State-of-the-art accuracy**: Often wins competitions ✓ **Regularization**: Built-in L1/L2 ✓ **Handles missing data**: Automatically ✓ **Fast**: Parallel tree construction ✓ **Early stopping**: Prevents overfitting ✓ **Feature importance**: Gain, weight, cover ### Limitations ✗ **Many hyperparameters**: Requires tuning ✗ **Sensitive to overfitting**: With default params ✗ **Black box**: Hard to interpret ✗ **Memory intensive**: For large datasets ### When to Use **Use XGBoost when**: - ✓ Need best possible accuracy - ✓ Tabular data - ✓ Have time to tune hyperparameters - ✓ Competition or critical application - ✓ Medium to large datasets **Consider alternatives when**: - ✗ Need interpretability → Logistic Regression - ✗ Limited time → Random Forest (fewer params) - ✗ Very small data → Logistic Regression or SVM ### Reference Chen & Guestrin (2016). "XGBoost: A Scalable Tree Boosting System" --- ## Logistic Regression **Purpose**: Linear model for probabilistic binary/multi-class classification ### Theory **Core Concept**: Model log-odds as linear combination of features **Binary Classification**: ``` P(y=1|x) = 1 / (1 + exp(-(β₀ + β₁x₁ + ... + βₙxₙ))) ``` **Multi-class** (one-vs-rest or multinomial): ``` P(y=k|x) = exp(xᵀβₖ) / Σⱼ exp(xᵀβⱼ) ``` **Decision Boundary**: Linear in feature space ### Hyperparameters #### C (Inverse Regularization) **Purpose**: Control strength of regularization **Effect**: - **High C** (e.g., 100): Weak regularization → Overfitting risk - **Low C** (e.g., 0.01): Strong regularization → Underfitting risk **Typical Range**: 0.01 - 100 **Note**: C is inverse of regularization strength (unlike most methods) #### penalty **Purpose**: Type of regularization **Options**: - **'l2'** (default): Ridge regularization (shrinks coefficients) - **'l1'**: Lasso regularization (sparse coefficients, feature selection) - **'elasticnet'**: Mix of L1 and L2 - **'none'**: No regularization **When to Use**: - **L2**: Default, works well generally - **L1**: When want feature selection (many irrelevant features) - **Elasticnet**: When L1 too aggressive #### solver **Purpose**: Optimization algorithm **Options**: - **'lbfgs'** (default): Good for small datasets, L2 only - **'saga'**: Supports all penalties, good for large datasets - **'liblinear'**: Good for small datasets, supports L1/L2 **Recommendation**: Use 'saga' for flexibility #### max_iter **Purpose**: Maximum iterations for convergence **Default**: 100 **Increase if**: Warning about non-convergence **Typical Range**: 100 - 10000 ### Usage Example ```python from functions.ML.logistic_regression import train_lr_model # Train Logistic Regression lr_model = train_lr_model( X_train, y_train, C=1.0, penalty='l2', solver='saga', max_iter=1000, random_state=42 ) # Predictions y_pred = lr_model.predict(X_test) y_proba = lr_model.predict_proba(X_test) # Get coefficients (feature weights) coefficients = lr_model.coef_[0] # For binary classification # Intercept intercept = lr_model.intercept_ ``` ### Hyperparameter Optimization ```python from sklearn.model_selection import GridSearchCV from sklearn.linear_model import LogisticRegression # Define parameter grid param_grid = { 'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2'], 'solver': ['saga'], 'max_iter': [1000] } # Grid search grid_search = GridSearchCV( LogisticRegression(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1 ) grid_search.fit(X_train, y_train) print(f"Best parameters: {grid_search.best_params_}") print(f"Best CV score: {grid_search.best_score_:.3f}") ``` ### Interpretation **Coefficients** (Feature Weights): ```python import numpy as np import matplotlib.pyplot as plt # Get coefficients coefs = lr_model.coef_[0] # Find most important features indices = np.argsort(np.abs(coefs))[::-1] top_k = 20 # Plot plt.figure(figsize=(10, 6)) plt.bar(range(top_k), coefs[indices[:top_k]]) plt.xlabel('Feature Index') plt.ylabel('Coefficient') plt.title('Top 20 Feature Coefficients') plt.axhline(y=0, color='k', linestyle='--') plt.tight_layout() plt.show() # Positive coefficient → Increases probability of class 1 # Negative coefficient → Decreases probability of class 1 ``` **Odds Ratios**: ```python # Convert coefficients to odds ratios odds_ratios = np.exp(coefs) # Interpretation: # OR = 2.0 → One unit increase in feature doubles odds # OR = 0.5 → One unit increase halves odds # OR = 1.0 → No effect print(f"Odds ratios for top features:") for i in indices[:10]: print(f" Feature {i}: OR = {odds_ratios[i]:.3f}") ``` **Probability Calibration**: ```python # Check calibration from sklearn.calibration import calibration_curve # Predicted probabilities y_proba = lr_model.predict_proba(X_test)[:, 1] # Calibration curve fraction_of_positives, mean_predicted_value = calibration_curve( y_test, y_proba, n_bins=10 ) # Plot plt.figure(figsize=(8, 6)) plt.plot([0, 1], [0, 1], 'k--', label='Perfectly Calibrated') plt.plot(mean_predicted_value, fraction_of_positives, 'o-', label='Logistic Regression') plt.xlabel('Mean Predicted Probability') plt.ylabel('Fraction of Positives') plt.title('Calibration Curve') plt.legend() plt.tight_layout() plt.show() ``` ### Advantages ✓ **Interpretable**: Clear feature weights ✓ **Probabilistic**: Direct probability estimates ✓ **Fast**: Training and prediction ✓ **Well-calibrated**: Reliable probabilities ✓ **No hyperparameters**: (except C) ✓ **Linear decision boundary**: Simple ### Limitations ✗ **Linear only**: Cannot capture non-linear relationships ✗ **Feature engineering**: May need manual feature creation ✗ **Sensitive to scaling**: Requires standardization ✗ **Multicollinearity**: Correlated features problematic ### When to Use **Use Logistic Regression when**: - ✓ Need interpretability (coefficients) - ✓ Want probabilistic outputs - ✓ Linearly separable data - ✓ Baseline model - ✓ Small datasets - ✓ Need fast predictions **Consider alternatives when**: - ✗ Non-linear boundaries → SVM (RBF) or Random Forest - ✗ Many irrelevant features → Random Forest (feature importance) - ✗ Complex interactions → XGBoost or Neural Networks ### Reference Cox (1958). "The Regression Analysis of Binary Sequences" --- ## Multi-Layer Perceptron (MLP) **Purpose**: Neural network for non-linear classification/regression ### Theory **Architecture**: Input → Hidden Layers → Output **Neuron Computation**: ``` output = activation(Σ wᵢxᵢ + b) ``` **Activation Functions**: - **ReLU**: f(x) = max(0, x) [Default, recommended] - **Tanh**: f(x) = tanh(x) - **Logistic**: f(x) = 1/(1 + e⁻ˣ) **Training**: Backpropagation with gradient descent ### Hyperparameters #### hidden_layer_sizes **Purpose**: Architecture of hidden layers **Format**: Tuple (layer1_size, layer2_size, ...) **Examples**: ```python # Single hidden layer with 100 neurons hidden_layer_sizes = (100,) # Two hidden layers (100, 50) hidden_layer_sizes = (100, 50) # Three hidden layers (200, 100, 50) hidden_layer_sizes = (200, 100, 50) ``` **Rules of Thumb**: - Start with (100,) or (100, 50) - More neurons → More capacity → Overfitting risk - More layers → Can learn complex patterns #### activation **Purpose**: Non-linear activation function **Options**: - **'relu'** (default): Recommended, works well - **'tanh'**: Alternative, slower convergence - **'logistic'**: Sigmoid, rarely used **Recommendation**: Use 'relu' #### alpha **Purpose**: L2 regularization strength **Effect**: - **Higher** (e.g., 0.01): Strong regularization - **Lower** (e.g., 0.0001): Weak regularization **Typical Range**: 0.0001 - 0.01 **Default**: 0.0001 #### learning_rate_init **Purpose**: Initial learning rate **Effect**: - **Higher** (e.g., 0.01): Faster convergence → Instability risk - **Lower** (e.g., 0.0001): Slower convergence → More stable **Typical Range**: 0.0001 - 0.01 **Default**: 0.001 #### max_iter **Purpose**: Maximum epochs (training iterations) **Default**: 200 **Typical Range**: 200 - 1000 **Increase if**: Training loss still decreasing #### early_stopping **Purpose**: Stop training when validation score stops improving **Recommended**: True **Parameters**: - **validation_fraction**: Fraction for validation (default 0.1) - **n_iter_no_change**: Patience (default 10) ### Usage Example ```python from functions.ML.mlp import train_mlp_model # Train MLP mlp_model = train_mlp_model( X_train, y_train, hidden_layer_sizes=(100, 50), activation='relu', alpha=0.0001, learning_rate_init=0.001, max_iter=500, early_stopping=True, random_state=42 ) # Predictions y_pred = mlp_model.predict(X_test) y_proba = mlp_model.predict_proba(X_test) ``` ### Hyperparameter Optimization ```python from sklearn.model_selection import RandomizedSearchCV from sklearn.neural_network import MLPClassifier # Define parameter distributions param_dist = { 'hidden_layer_sizes': [(50,), (100,), (100, 50), (200, 100)], 'activation': ['relu', 'tanh'], 'alpha': [0.0001, 0.001, 0.01], 'learning_rate_init': [0.0001, 0.001, 0.01], 'max_iter': [500] } # Random search random_search = RandomizedSearchCV( MLPClassifier(early_stopping=True, random_state=42), param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42 ) random_search.fit(X_train, y_train) print(f"Best parameters: {random_search.best_params_}") print(f"Best CV score: {random_search.best_score_:.3f}") ``` ### Learning Curves ```python # Access loss history train_loss = mlp_model.loss_curve_ # Plot import matplotlib.pyplot as plt plt.figure(figsize=(10, 6)) plt.plot(train_loss) plt.xlabel('Epoch') plt.ylabel('Loss') plt.title('MLP Training Loss') plt.tight_layout() plt.show() # Check for convergence: # - Should decrease steadily # - Flatten at end # - If still decreasing → increase max_iter ``` ### Advantages ✓ **Non-linear**: Can learn complex patterns ✓ **Flexible**: Arbitrary architectures ✓ **Universal approximator**: Theoretically can learn any function ✓ **Feature learning**: Automatic feature extraction ### Limitations ✗ **Black box**: Hard to interpret ✗ **Hyperparameters**: Many to tune ✗ **Convergence**: Can be slow or unstable ✗ **Scaling required**: Sensitive to feature scales ✗ **Random initialization**: Different results per run ### When to Use **Use MLP when**: - ✓ Non-linear, complex patterns - ✓ Large datasets (> 10,000 samples) - ✓ Don't need interpretability - ✓ Have time to tune - ✓ Sufficient data to prevent overfitting **Consider alternatives when**: - ✗ Need interpretability → Logistic Regression - ✗ Small data → SVM or Random Forest - ✗ Want speed → Random Forest - ✗ Tabular data → XGBoost (usually better) ### Reference Rumelhart et al. (1986). "Learning representations by back-propagating errors" --- ## Model Evaluation ### Classification Metrics #### Accuracy **Formula**: (TP + TN) / (TP + TN + FP + FN) **When to Use**: Balanced classes **Limitation**: Misleading for imbalanced data ```python from sklearn.metrics import accuracy_score accuracy = accuracy_score(y_test, y_pred) print(f"Accuracy: {accuracy:.3f}") ``` #### Precision **Formula**: TP / (TP + FP) **Interpretation**: Of predicted positives, how many are correct? **When to Use**: Cost of false positives high (e.g., spam detection) ```python from sklearn.metrics import precision_score precision = precision_score(y_test, y_pred, average='weighted') print(f"Precision: {precision:.3f}") ``` #### Recall (Sensitivity) **Formula**: TP / (TP + FN) **Interpretation**: Of actual positives, how many detected? **When to Use**: Cost of false negatives high (e.g., disease screening) ```python from sklearn.metrics import recall_score recall = recall_score(y_test, y_pred, average='weighted') print(f"Recall: {recall:.3f}") ``` #### F1-Score **Formula**: 2 × (Precision × Recall) / (Precision + Recall) **Interpretation**: Harmonic mean of precision and recall **When to Use**: Balance precision and recall ```python from sklearn.metrics import f1_score f1 = f1_score(y_test, y_pred, average='weighted') print(f"F1-Score: {f1:.3f}") ``` #### ROC-AUC **Purpose**: Measure discrimination ability across all thresholds **Range**: 0.5 (random) to 1.0 (perfect) **When to Use**: Imbalanced classes, need threshold-independent metric ```python from sklearn.metrics import roc_auc_score, roc_curve # Binary classification y_proba = model.predict_proba(X_test)[:, 1] auc = roc_auc_score(y_test, y_proba) print(f"ROC-AUC: {auc:.3f}") # Plot ROC curve fpr, tpr, thresholds = roc_curve(y_test, y_proba) import matplotlib.pyplot as plt plt.figure(figsize=(8, 6)) plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}') plt.plot([0, 1], [0, 1], 'k--', label='Random') plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.title('ROC Curve') plt.legend() plt.tight_layout() plt.show() ``` #### Confusion Matrix ```python from sklearn.metrics import confusion_matrix import seaborn as sns # Compute confusion matrix cm = confusion_matrix(y_test, y_pred) # Plot plt.figure(figsize=(8, 6)) sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names) plt.xlabel('Predicted') plt.ylabel('Actual') plt.title('Confusion Matrix') plt.tight_layout() plt.show() ``` #### Classification Report ```python from sklearn.metrics import classification_report report = classification_report(y_test, y_pred, target_names=class_names) print(report) # Example output: # precision recall f1-score support # # Class A 0.92 0.95 0.93 20 # Class B 0.88 0.85 0.87 20 # # accuracy 0.90 40 # macro avg 0.90 0.90 0.90 40 # weighted avg 0.90 0.90 0.90 40 ``` ### Cross-Validation #### K-Fold Cross-Validation ```python from sklearn.model_selection import cross_val_score # 5-fold CV scores = cross_val_score( model, X_train, y_train, cv=5, scoring='accuracy' ) print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}") ``` #### Stratified K-Fold **Use when**: Imbalanced classes ```python from sklearn.model_selection import StratifiedKFold, cross_val_score skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) scores = cross_val_score( model, X_train, y_train, cv=skf, scoring='accuracy' ) print(f"Stratified CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}") ``` (groupkfold)= #### Group K-Fold (Patient-Level) **Critical for Raman data**: Prevent data leakage from same patient ```python from sklearn.model_selection import GroupKFold, cross_val_score # groups: Patient IDs for each sample gkf = GroupKFold(n_splits=5) scores = cross_val_score( model, X_train, y_train, groups=patient_ids, # Ensure all samples from same patient in same fold cv=gkf, scoring='accuracy' ) print(f"Group CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}") ``` --- ## See Also - [Machine Learning User Guide](../user-guide/machine-learning.md) - Step-by-step tutorials - [Best Practices](../user-guide/best-practices.md) - ML strategies - [Preprocessing Methods](preprocessing.md) - Data preparation - [Statistical Methods](statistical.md) - Hypothesis testing --- **Last Updated**: 2026-01-24