# Exploratory Analysis Methods Comprehensive reference for dimensionality reduction and clustering methods. ## Table of Contents - [Principal Component Analysis (PCA)](#principal-component-analysis-pca) - [UMAP](#umap-uniform-manifold-approximation-and-projection) - [t-SNE](#t-sne-t-distributed-stochastic-neighbor-embedding) - [Hierarchical Clustering](#hierarchical-clustering) - [K-Means Clustering](#k-means-clustering) - [DBSCAN](#dbscan-density-based-spatial-clustering) - [Method Comparison](#method-comparison) --- (pca)= ## Principal Component Analysis (PCA) **Purpose**: Linear dimensionality reduction preserving maximum variance ### Theory PCA finds orthogonal directions (principal components) that capture most variance in data: ``` X = USVᵀ (Singular Value Decomposition) ``` Where: - **U**: Left singular vectors (sample scores) - **S**: Singular values (eigenvalues) - **V**: Right singular vectors (loadings) **Mathematical Steps**: 1. Center data: `X_centered = X - mean(X)` 2. Compute covariance matrix: `C = XᵀX / (n-1)` 3. Eigendecomposition: `C = VΛVᵀ` 4. Project data: `scores = X × V` ### Parameters | Parameter | Type | Range | Default | Description | | -------------- | --------- | ---------------- | ------- | ------------------------------------------------------ | | `n_components` | int/float | 2-100 or 0.0-1.0 | 2 | Number of PCs or variance to retain | | `whiten` | bool | - | False | Divide components by singular values | | `svd_solver` | str | - | 'auto' | SVD algorithm ('auto', 'full', 'arpack', 'randomized') | | `random_state` | int | - | 42 | Random seed for reproducibility | **Parameter Guide**: ```python # Visualization (2D/3D) n_components = 2 # or 3 # Keep 95% variance n_components = 0.95 # Keep specific number n_components = 10 # Whitening (for clustering) whiten = True ``` ### Usage Example ```python from functions.ML import apply_pca # Apply PCA pca_result = apply_pca( data=preprocessed_spectra, n_components=2, labels=group_labels ) # Access results scores = pca_result['scores'] # (n_samples, n_components) loadings = pca_result['loadings'] # (n_features, n_components) explained_var = pca_result['explained_variance_ratio'] ``` ### Output Components **1. Scores** (Sample Coordinates): ```python scores = pca_result['scores'] # Shape: (n_samples, n_components) # Each row: sample position in PC space # Use for: Visualization, clustering ``` **2. Loadings** (Feature Contributions): ```python loadings = pca_result['loadings'] # Shape: (n_features, n_components) # Each column: wavenumber contributions to PC # Use for: Identifying important peaks ``` **3. Explained Variance**: ```python var_ratio = pca_result['explained_variance_ratio'] # Array of variance % for each PC # Example: [0.65, 0.23, 0.08, ...] # PC1 explains 65%, PC2 23%, etc. ``` **4. Scree Plot Data**: ```python cumulative_var = np.cumsum(var_ratio) # Cumulative variance explained ``` ### Interpretation #### Scores Plot ``` PC2 (23%) ↑ A | C A | CC AAA | CCC ─────────────→ PC1 (65%) BBB| BB| B| ``` **What it Shows**: - **Separation**: Groups A, B, C are distinct - **Distance**: Similar samples cluster together - **Outliers**: Points far from main cluster **Interpretation**: - **Tight clusters**: Homogeneous groups - **Overlapping**: Spectral similarity - **Trend**: Continuous variation (e.g., concentration) #### Loadings Plot ```python # Positive peaks in PC1 loading # → Increase PC1 score # Negative peaks in PC1 loading # → Decrease PC1 score ``` **Example**: ``` Loading PC1: ↑ | ___ | / \ ← Important peak |___/ \___ └─────────────→ Wavenumber If Group A has high PC1 scores, they have strong signal at this peak ``` **Key Wavenumbers**: ```python # Find most important peaks top_indices = np.argsort(np.abs(loadings[:, 0]))[-10:] important_peaks = wavenumbers[top_indices] ``` #### Explained Variance **Scree Plot**: ``` Variance (%) 100 |● 80 | ● 60 | ● 40 | ● 20 | ●●●●● 0 |________________ 1 2 3 4 5 6 7 8 PC ``` **Rules of Thumb**: - **PC1 + PC2 > 70%**: Good 2D representation - **PC1 + PC2 < 50%**: Consider 3D or UMAP - **Elbow point**: # PCs to retain (here: ~3-4) ### Common Use Cases #### 1. Group Visualization ```python # 2D scatter plot plt.scatter(scores[:, 0], scores[:, 1], c=labels) plt.xlabel(f'PC1 ({var_ratio[0]:.1%})') plt.ylabel(f'PC2 ({var_ratio[1]:.1%})') ``` #### 2. Outlier Detection ```python # Hotelling's T² statistic from scipy import stats T2 = np.sum((scores / np.std(scores, axis=0))**2, axis=1) threshold = stats.chi2.ppf(0.95, df=n_components) outliers = T2 > threshold ``` #### 3. Feature Selection ```python # Important wavenumbers from PC1 loading_weights = np.abs(loadings[:, 0]) top_features = np.argsort(loading_weights)[-20:] ``` #### 4. Dimensionality Reduction for ML ```python # Reduce to 95% variance pca = apply_pca(data, n_components=0.95) reduced_data = pca['scores'] # Use reduced_data for classification ``` ### Troubleshooting | Issue | Cause | Solution | | --------------------- | ----------------------- | -------------------------- | | Poor separation | Low variance in PC1-2 | Try more PCs, use UMAP | | All groups overlap | No spectral differences | Check preprocessing | | PC1 = baseline | Preprocessing issue | Better baseline correction | | One outlier dominates | Extreme spectrum | Remove outlier, re-run | | Scores look random | Data not standardized | Check normalization | ### Assumptions ✓ **Linear relationships**: PCA finds linear combinations ✓ **Variance = importance**: Assumes max variance = most informative ✗ **No scaling**: Apply normalization first ✗ **Not rotation-invariant**: PC axes arbitrary ### When to Use **Use PCA when**: - ✓ Visualizing high-dimensional data - ✓ Reducing dimensions for ML - ✓ Identifying important features - ✓ Quick exploratory analysis - ✓ Data roughly linear **Consider alternatives when**: - ✗ Non-linear structure (use UMAP/t-SNE) - ✗ Preserving distances (use MDS) - ✗ Only want clustering (use K-means directly) ### Advanced Options #### Whitening ```python # Normalize PC variances (useful before clustering) pca_result = apply_pca(data, whiten=True) ``` #### Incremental PCA ```python # For very large datasets from sklearn.decomposition import IncrementalPCA ipca = IncrementalPCA(n_components=2) ipca.partial_fit(data_batch_1) ipca.partial_fit(data_batch_2) scores = ipca.transform(data) ``` ### Reference Jolliffe & Cadima (2016). "Principal component analysis: a review and recent developments" --- (mcr-als)= ## MCR-ALS **Purpose**: Spectral unmixing / component extraction from mixtures. MCR-ALS (Multivariate Curve Resolution – Alternating Least Squares) aims to decompose a data matrix $X$ into concentrations $C$ and component spectra $S$: $$ X \approx C S^T $$ **When to use**: - ✓ Each measured spectrum is a mixture of a small number of underlying “pure” components - ✓ You want interpretable component spectra and relative contributions **Typical constraints**: - Non-negativity on $C$ and/or $S$ - Normalization or closure constraints (depending on experiment) **Practical notes**: - Sensitive to initialization; try multiple starts. - Preprocessing (baseline correction, normalization) usually improves results. --- ## UMAP (Uniform Manifold Approximation and Projection) **Purpose**: Non-linear dimensionality reduction preserving local and global structure ### Theory UMAP constructs high-dimensional graph, then optimizes low-dimensional representation: 1. **Build k-nearest neighbor graph** in high-dimensional space 2. **Compute fuzzy simplicial complex** (topological structure) 3. **Optimize low-dimensional layout** preserving topology **Key Difference from PCA**: - PCA: Linear projection, preserves variance - UMAP: Non-linear, preserves topology (neighbors stay neighbors) ### Parameters | Parameter | Type | Range | Default | Description | | -------------- | ----- | ---------- | ----------- | ----------------------------- | | `n_neighbors` | int | 2 - 200 | 15 | # neighbors for graph | | `min_dist` | float | 0.0 - 0.99 | 0.1 | Minimum distance in embedding | | `n_components` | int | 2 - 3 | 2 | Output dimensions | | `metric` | str | - | 'euclidean' | Distance metric | | `random_state` | int | - | 42 | Random seed | **Parameter Guide**: ```python # Local structure (tight clusters) n_neighbors = 5-10 min_dist = 0.0 # Balanced (recommended) n_neighbors = 15 min_dist = 0.1 # Global structure (broader view) n_neighbors = 50-100 min_dist = 0.5 ``` **n_neighbors**: - **Low (5-10)**: Focus on local structure, tight clusters - **Medium (15-30)**: Balanced local/global - **High (50-200)**: Focus on global structure, looser clusters **min_dist**: - **0.0**: Densest packing, tight clusters - **0.1**: Default, good separation - **0.5+**: Spread out, overview ### Usage Example ```python from functions.ML import apply_umap # Apply UMAP umap_result = apply_umap( data=preprocessed_spectra, n_neighbors=15, min_dist=0.1, n_components=2, labels=group_labels ) # Access results embedding = umap_result['embedding'] # (n_samples, 2) ``` ### PCA vs UMAP Comparison | Aspect | PCA | UMAP | | -------------------- | --------------- | --------------------- | | **Type** | Linear | Non-linear | | **Speed** | Fast | Slower | | **Preserves** | Variance | Topology | | **Global structure** | Good | Excellent | | **Local structure** | Poor | Excellent | | **Deterministic** | Yes | No (use random_state) | | **Interpretability** | High (loadings) | Low (no loadings) | | **Outliers** | Sensitive | Robust | **Decision Guide**: ``` Use PCA when: - Need feature importance (loadings) - Want speed - Linear relationships - Interpretability critical Use UMAP when: - PCA shows overlap - Need better separation - Non-linear structure - Visualization priority ``` ### Interpretation **UMAP Embedding**: ``` Dim 2 ↑ A | C AA | CCC AAA | CC ─────────────→ Dim 1 BB | BBB| B| ``` **What to Look For**: - **Clusters**: Distinct groups - **Distance**: Relative, not absolute - **Shape**: Cluster density and spread - **Bridges**: Transitional samples **Warnings**: ⚠️ **Distances not quantitative**: UMAP preserves topology, not exact distances ⚠️ **Cluster size misleading**: Doesn't reflect true cluster variance ⚠️ **Different seeds → different layouts**: Use random_state for reproducibility ### Troubleshooting | Issue | Cause | Solution | | ------------------- | -------------------- | --------------------------- | | Too many clusters | n_neighbors too low | Increase to 30-50 | | All points together | n_neighbors too high | Decrease to 10-15 | | Unclear structure | min_dist too high | Reduce to 0.05-0.1 | | Overlapping groups | Inherent similarity | Try different preprocessing | | Different results | Random seed | Set random_state=42 | ### When to Use **Use UMAP when**: - ✓ PCA shows poor separation - ✓ Suspect non-linear structure - ✓ Want beautiful visualizations - ✓ Exploring complex datasets **Use PCA when**: - ✓ Need interpretable features - ✓ Speed critical - ✓ Publishing quantitative results ### Reference McInnes et al. (2018). "UMAP: Uniform Manifold Approximation and Projection" --- ## t-SNE (t-Distributed Stochastic Neighbor Embedding) **Purpose**: Non-linear dimensionality reduction emphasizing local structure ### Theory t-SNE preserves pairwise similarities between points: 1. Compute pairwise probabilities in high dimensions (Gaussian) 2. Compute pairwise probabilities in low dimensions (t-distribution) 3. Minimize KL divergence between probability distributions **vs UMAP**: t-SNE focuses more on local structure, UMAP balances local/global ### Parameters | Parameter | Type | Range | Default | Description | | -------------------- | ----- | ---------- | ------- | ----------------------- | | `perplexity` | float | 5 - 50 | 30 | # effective neighbors | | `n_iter` | int | 250 - 5000 | 1000 | Optimization iterations | | `learning_rate` | float | 10 - 1000 | 200 | Step size | | `early_exaggeration` | float | 1 - 20 | 12 | Initial separation | | `random_state` | int | - | 42 | Random seed | **Parameter Guide**: ```python # Small dataset (n < 100) perplexity = 5-10 n_iter = 1000 # Medium dataset (n = 100-1000) perplexity = 30 n_iter = 1000-2000 # Large dataset (n > 1000) perplexity = 50 n_iter = 2000-5000 ``` **Perplexity**: - Rule of thumb: `5 < perplexity < n_samples/3` - **Low (5-10)**: Focus on local clusters - **Medium (30)**: Balanced - **High (50+)**: Focus on global structure ### Usage Example ```python from functions.ML import apply_tsne # Apply t-SNE tsne_result = apply_tsne( data=preprocessed_spectra, perplexity=30, n_iter=1000, labels=group_labels ) # Access results embedding = tsne_result['embedding'] # (n_samples, 2) ``` ### Interpretation **t-SNE Output**: ``` Similar to UMAP, but: - Even tighter clusters - Less emphasis on global distances - More sensitive to perplexity ``` **Key Points**: - ⚠️ **Cluster sizes meaningless**: Don't interpret relative sizes - ⚠️ **Distances not quantitative**: Within-cluster distances OK, between-cluster not - ⚠️ **Slow**: Much slower than PCA/UMAP for large datasets ### UMAP vs t-SNE | Aspect | UMAP | t-SNE | | -------------------- | ------------- | ------------ | | **Speed** | Fast | Slow | | **Global structure** | Better | Worse | | **Local structure** | Good | Excellent | | **Scalability** | 100k+ samples | <10k samples | | **Reproducibility** | Better | Worse | | **General use** | Preferred | Specialized | **Recommendation**: Use UMAP unless you specifically need t-SNE's extreme local focus ### Troubleshooting | Issue | Cause | Solution | | ---------------------- | ------------------- | ------------------- | | Blob without structure | Perplexity too high | Reduce perplexity | | Many tiny clusters | Perplexity too low | Increase perplexity | | Not converged | n_iter too low | Increase to 2000+ | | Different results | Random seed | Set random_state | | Very slow | Large dataset | Use UMAP instead | ### When to Use **Use t-SNE when**: - ✓ Small-medium datasets (< 5000 samples) - ✓ Need extreme local structure emphasis - ✓ Publication requires it (legacy) **Use UMAP instead when**: - ✓ Large datasets - ✓ Need reproducibility - ✓ Want global structure too - ✓ Speed matters ### Reference van der Maaten & Hinton (2008). "Visualizing Data using t-SNE" --- ## Hierarchical Clustering **Purpose**: Create tree of nested clusters (dendrogram) ### Theory **Agglomerative (bottom-up)**: 1. Start: Each point is a cluster 2. Repeat: Merge closest clusters 3. Stop: All points in one cluster **Result**: Dendrogram showing merge history ### Parameters | Parameter | Type | Options | Default | Description | | ------------ | ---- | ------------------------------- | ----------- | ------------------------ | | `linkage` | str | ward, complete, average, single | 'ward' | Cluster distance metric | | `metric` | str | euclidean, cosine, correlation | 'euclidean' | Distance function | | `n_clusters` | int | 2 - 20 | None | Cut tree to get clusters | **Linkage Methods**: | Method | Distance Between Clusters | Use Case | | ------------ | --------------------------------- | --------------------------- | | **Ward** | Minimizes within-cluster variance | **Recommended for most** | | **Complete** | Maximum distance | Compact clusters | | **Average** | Average distance | Balanced | | **Single** | Minimum distance | Can find elongated clusters | **Recommendation**: Use **Ward** with **Euclidean** distance ### Usage Example ```python from functions.ML import apply_hierarchical_clustering # Apply hierarchical clustering hc_result = apply_hierarchical_clustering( data=preprocessed_spectra, linkage='ward', metric='euclidean', n_clusters=3 ) # Access results clusters = hc_result['clusters'] # Cluster assignments linkage_matrix = hc_result['linkage'] # For dendrogram ``` ### Dendrogram Interpretation ``` Height | 10| ┌─┐ | ┌───┤ ├───┐ 5| ┌─┤ └─┘ ├─┐ | ┌─┤ └─┐ ┌─┘ ├─┐ 0| └─┘ └─────┘ └─┘ A₁ A₂ A₃ A₄ B₁ B₂ ``` **How to Read**: - **Horizontal lines**: Clusters - **Height**: Distance at merge - **Vertical lines**: Similarity - **Cut horizontally**: Define clusters **Cutting the Tree**: ```python # Cut at height = 7 # → 2 clusters: [A₁,A₂,A₃,A₄] and [B₁,B₂] # Cut at height = 3 # → 3 clusters ``` **Cophenetic Distance**: ```python from scipy.cluster.hierarchy import cophenet from scipy.spatial.distance import pdist c, coph_dists = cophenet(linkage_matrix, pdist(data)) # c > 0.7: Good representation # c < 0.5: Poor representation ``` ### Visualization ```python from scipy.cluster.hierarchy import dendrogram import matplotlib.pyplot as plt # Plot dendrogram plt.figure(figsize=(10, 7)) dendrogram( linkage_matrix, labels=sample_names, leaf_rotation=90, leaf_font_size=8 ) plt.xlabel('Sample') plt.ylabel('Distance') plt.title('Hierarchical Clustering Dendrogram') plt.tight_layout() plt.show() ``` ### Choosing Number of Clusters **Method 1: Visual Inspection** - Look for large height jumps in dendrogram - Cut before major merge **Method 2: Elbow Method** ```python from scipy.cluster.hierarchy import fcluster distances = [] for k in range(2, 11): clusters = fcluster(linkage_matrix, k, criterion='maxclust') dist = calculate_within_cluster_distance(data, clusters) distances.append(dist) # Plot and find elbow plt.plot(range(2, 11), distances) ``` **Method 3: Silhouette Score** ```python from sklearn.metrics import silhouette_score scores = [] for k in range(2, 11): clusters = fcluster(linkage_matrix, k, criterion='maxclust') score = silhouette_score(data, clusters) scores.append(score) # Choose k with highest score best_k = np.argmax(scores) + 2 ``` ### Troubleshooting | Issue | Cause | Solution | | -------------------------- | --------------------- | --------------------- | | Unbalanced clusters | Single linkage | Use Ward linkage | | Unclear structure | Wrong distance metric | Try different metrics | | Chains instead of clusters | Single linkage | Use Ward/Complete | | Dendrogram too complex | Too many samples | Truncate or subset | ### When to Use **Use Hierarchical Clustering when**: - ✓ Want to explore cluster structure - ✓ Don't know # clusters - ✓ Need hierarchical relationships - ✓ Small-medium datasets (< 5000 samples) **Use K-Means when**: - ✓ Know # clusters - ✓ Large datasets - ✓ Speed critical ### Reference Müllner (2013). "fastcluster: Fast Hierarchical, Agglomerative Clustering" --- ## K-Means Clustering **Purpose**: Partition data into K non-overlapping clusters ### Theory **Algorithm**: 1. Initialize K centroids randomly 2. Assign each point to nearest centroid 3. Update centroids to cluster means 4. Repeat 2-3 until convergence **Objective**: Minimize within-cluster sum of squares (WCSS) ### Parameters | Parameter | Type | Range | Default | Description | | -------------- | ---- | --------------------- | ----------- | ------------------------ | | `n_clusters` | int | 2 - 20 | 3 | Number of clusters | | `init` | str | 'k-means++', 'random' | 'k-means++' | Initialization method | | `n_init` | int | 10 - 100 | 10 | # random initializations | | `max_iter` | int | 100 - 1000 | 300 | Maximum iterations | | `random_state` | int | - | 42 | Random seed | **Recommendation**: Use **k-means++** initialization (smarter than random) ### Usage Example ```python from functions.ML import apply_kmeans # Apply K-means kmeans_result = apply_kmeans( data=preprocessed_spectra, n_clusters=3, init='k-means++', n_init=10 ) # Access results clusters = kmeans_result['clusters'] # Cluster assignments centroids = kmeans_result['centroids'] # Cluster centers inertia = kmeans_result['inertia'] # WCSS ``` ### Choosing Number of Clusters (K) (elbow-method)= #### Method 1: Elbow Method ```python inertias = [] K_range = range(2, 11) for k in K_range: result = apply_kmeans(data, n_clusters=k) inertias.append(result['inertia']) # Plot elbow curve plt.plot(K_range, inertias, 'bo-') plt.xlabel('Number of Clusters (K)') plt.ylabel('Inertia (WCSS)') plt.title('Elbow Method') plt.show() # Look for "elbow" point ``` **Elbow Plot**: ``` Inertia |● | ● | ● | ●___ | ●___●___● └───────────────── 2 3 4 5 6 7 8 K ↑ Elbow (K=4) ``` #### Method 2: Silhouette Score ```python from sklearn.metrics import silhouette_score silhouette_scores = [] for k in range(2, 11): result = apply_kmeans(data, n_clusters=k) score = silhouette_score(data, result['clusters']) silhouette_scores.append(score) # Choose K with highest score best_k = np.argmax(silhouette_scores) + 2 ``` **Silhouette Score**: - Range: [-1, 1] - **> 0.7**: Strong structure - **0.5 - 0.7**: Reasonable structure - **< 0.5**: Weak structure, try different K #### Method 3: Gap Statistic ```python from scipy.cluster.vq import kmeans from sklearn.metrics import pairwise_distances def gap_statistic(data, K_max=10, n_references=10): gaps = [] for k in range(1, K_max+1): # Cluster actual data result = apply_kmeans(data, n_clusters=k) actual_wcss = result['inertia'] # Cluster random reference data reference_wcss = [] for _ in range(n_references): reference = np.random.uniform( data.min(), data.max(), size=data.shape ) ref_result = apply_kmeans(reference, n_clusters=k) reference_wcss.append(ref_result['inertia']) # Gap = log(E[WCSS_ref]) - log(WCSS_actual) gap = np.log(np.mean(reference_wcss)) - np.log(actual_wcss) gaps.append(gap) return gaps # Choose K where gap stops increasing ``` ### Cluster Validation #### Silhouette Analysis ```python from sklearn.metrics import silhouette_samples import matplotlib.pyplot as plt # Compute silhouette scores for each sample silhouette_vals = silhouette_samples(data, clusters) # Plot silhouette for each cluster fig, ax = plt.subplots() y_lower = 10 for i in range(n_clusters): cluster_silhouette = silhouette_vals[clusters == i] cluster_silhouette.sort() size = cluster_silhouette.shape[0] y_upper = y_lower + size ax.fill_betweenx( np.arange(y_lower, y_upper), 0, cluster_silhouette, alpha=0.7 ) y_lower = y_upper + 10 ax.axvline(silhouette_scores[k-2], color="red", linestyle="--") ax.set_xlabel("Silhouette Coefficient") ax.set_ylabel("Cluster") ``` **Good Clustering**: - All clusters above average line - Similar widths (balanced sizes) - All positive values **Poor Clustering**: - Clusters below average - Very different widths - Negative values (misassigned points) ### Troubleshooting | Issue | Cause | Solution | | ------------------- | ------------------- | --------------------------------- | | Empty clusters | K too high | Reduce K | | Unbalanced sizes | Poor initialization | Use k-means++ | | Different results | Random init | Set random_state, increase n_init | | Not converging | max_iter too low | Increase max_iter | | Unexpected clusters | Need normalization | Apply vector norm | ### Assumptions and Limitations **Assumes**: - ✓ Clusters are spherical (isotropic) - ✓ Clusters have similar sizes - ✓ Clusters are equally dense **Limitations**: - ✗ Must specify K in advance - ✗ Sensitive to outliers - ✗ Can't find non-convex clusters - ✗ Random initialization (use n_init=10+) ### When to Use **Use K-Means when**: - ✓ Know or can estimate K - ✓ Clusters roughly spherical - ✓ Large datasets (fast algorithm) - ✓ Need deterministic results (set random_state) **Use Hierarchical when**: - ✓ Don't know K - ✓ Want dendrogram - ✓ Small-medium datasets **Use DBSCAN when**: - ✓ Arbitrary cluster shapes - ✓ Noise points present - ✓ Unknown K ### Reference Lloyd (1982). "Least squares quantization in PCM" --- ## DBSCAN (Density-Based Spatial Clustering) **Purpose**: Find arbitrarily-shaped clusters based on density ### Theory **Concepts**: - **Core point**: Has ≥ `min_samples` neighbors within `eps` - **Border point**: Within `eps` of core point - **Noise point**: Neither core nor border **Algorithm**: 1. Find all core points 2. Connect core points within `eps` 3. Assign border points to nearby clusters 4. Mark remaining points as noise (-1) ### Parameters | Parameter | Type | Range | Default | Description | | ------------- | ----- | --------------- | ----------- | --------------------------------- | | `eps` | float | 0.1 - 10.0 | 0.5 | Maximum distance for neighborhood | | `min_samples` | int | 3 - 20 | 5 | Minimum points for core point | | `metric` | str | euclidean, etc. | 'euclidean' | Distance metric | **Parameter Guide**: **eps (epsilon)**: ```python # Determine eps using k-distance graph from sklearn.neighbors import NearestNeighbors neighbors = NearestNeighbors(n_neighbors=min_samples) neighbors.fit(data) distances, indices = neighbors.kneighbors(data) # Sort and plot distances to min_samples-th neighbor sorted_distances = np.sort(distances[:, -1]) plt.plot(sorted_distances) plt.ylabel(f'Distance to {min_samples}-th Neighbor') plt.xlabel('Points (sorted)') # eps = value at "elbow" of curve ``` **min_samples**: - Rule of thumb: `2 × n_dimensions` - **3-5**: Detect finer clusters - **10-20**: More robust to noise ### Usage Example ```python from functions.ML import apply_dbscan # Apply DBSCAN dbscan_result = apply_dbscan( data=preprocessed_spectra, eps=0.5, min_samples=5 ) # Access results clusters = dbscan_result['clusters'] # -1 = noise n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0) n_noise = list(clusters).count(-1) print(f"Clusters found: {n_clusters}") print(f"Noise points: {n_noise}") ``` ### Advantages ✓ **No need to specify K**: Finds natural clusters ✓ **Handles arbitrary shapes**: Not limited to spherical ✓ **Identifies outliers**: Noise points marked as -1 ✓ **Deterministic**: Same results every run ### Disadvantages ✗ **Sensitive to parameters**: eps and min_samples critical ✗ **Varying densities**: Struggles with clusters of different densities ✗ **High dimensions**: Distance-based, suffers curse of dimensionality ### Troubleshooting | Issue | Cause | Solution | | -------------------- | -------------------- | -------------------- | | One giant cluster | eps too large | Reduce eps | | All points are noise | eps too small | Increase eps | | Too many clusters | min_samples too low | Increase min_samples | | Undetected clusters | min_samples too high | Reduce min_samples | ### When to Use **Use DBSCAN when**: - ✓ Don't know number of clusters - ✓ Clusters have arbitrary shapes - ✓ Outliers present - ✓ Clusters vary in size (but not density) **Use K-Means when**: - ✓ Spherical clusters - ✓ Know K - ✓ All points should be clustered ### Reference Ester et al. (1996). "A density-based algorithm for discovering clusters" --- ## Method Comparison ### Quick Reference Table | Method | Type | Speed | Strengths | Limitations | Best For | | ---------------- | ------------- | ----- | --------------------------------------- | -------------------------- | ----------------------------------- | | **PCA** | Linear DR | ⚡⚡⚡ | Fast, interpretable, feature importance | Linear only | Quick exploration, ML preprocessing | | **UMAP** | Non-linear DR | ⚡⚡ | Preserves global+local, beautiful plots | No feature importance | Complex data visualization | | **t-SNE** | Non-linear DR | ⚡ | Excellent local structure | Slow, no global structure | Small datasets, local patterns | | **Hierarchical** | Clustering | ⚡⚡ | Dendrogram, no K needed | Slow for large data | Exploratory, unknown K | | **K-Means** | Clustering | ⚡⚡⚡ | Fast, simple | Need K, spherical clusters | Known K, large datasets | | **DBSCAN** | Clustering | ⚡⚡ | Arbitrary shapes, finds outliers | Sensitive to parameters | Outliers, complex shapes | ### Decision Tree ``` START: What's your goal? │ ├─ Visualization │ │ │ ├─ Quick overview → PCA (2D/3D) │ │ │ ├─ Poor PCA separation → UMAP │ │ │ └─ Extreme local focus → t-SNE │ ├─ Clustering │ │ │ ├─ Know # clusters → K-Means │ │ │ ├─ Don't know # clusters → Hierarchical (dendrogram) │ │ │ └─ Arbitrary shapes + outliers → DBSCAN │ └─ Feature reduction for ML │ ├─ Linear data → PCA (keep loadings) │ └─ Non-linear data → UMAP → then ML ``` ### Typical Workflow **1. Initial Exploration**: ```python # Start with PCA (fast) pca_result = apply_pca(data, n_components=2) plot_pca_scores(pca_result) # Check explained variance if pca_result['explained_variance_ratio'][:2].sum() < 0.5: # Poor separation, try UMAP umap_result = apply_umap(data, n_neighbors=15) ``` **2. Clustering Investigation**: ```python # Try hierarchical first (explore # clusters) hc_result = apply_hierarchical_clustering(data, linkage='ward') plot_dendrogram(hc_result) # Identify optimal K from dendrogram optimal_k = 3 # from visual inspection # Apply K-means with optimal K kmeans_result = apply_kmeans(data, n_clusters=optimal_k) ``` **3. Validation**: ```python # Validate clustering quality from sklearn.metrics import silhouette_score, davies_bouldin_score sil_score = silhouette_score(data, clusters) db_score = davies_bouldin_score(data, clusters) print(f"Silhouette: {sil_score:.3f} (higher better)") print(f"Davies-Bouldin: {db_score:.3f} (lower better)") ``` --- ## Validation Metrics ### Silhouette Score **Formula**: ``` s(i) = (b(i) - a(i)) / max(a(i), b(i)) where: a(i) = avg distance to points in same cluster b(i) = avg distance to points in nearest different cluster ``` **Interpretation**: - **+1**: Perfect clustering - **0**: On cluster boundary - **-1**: Misassigned **Usage**: ```python from sklearn.metrics import silhouette_score score = silhouette_score(data, clusters) ``` ### Davies-Bouldin Index **Measures**: Ratio of within-cluster to between-cluster distances **Interpretation**: - **Lower is better** - **0**: Perfect clustering **Usage**: ```python from sklearn.metrics import davies_bouldin_score score = davies_bouldin_score(data, clusters) ``` ### Calinski-Harabasz Index **Measures**: Ratio of between-cluster to within-cluster variance **Interpretation**: - **Higher is better** **Usage**: ```python from sklearn.metrics import calinski_harabasz_score score = calinski_harabasz_score(data, clusters) ``` --- ## Best Practices ### General Guidelines 1. **Preprocessing First**: ```python # Always preprocess before analysis data = apply_baseline_correction(raw_data) data = apply_smoothing(data) data = apply_normalization(data) ``` 2. **Try Multiple Methods**: ```python # Compare different approaches pca_result = apply_pca(data) umap_result = apply_umap(data) # Choose based on results ``` 3. **Validate Results**: ```python # Check quality metrics silhouette = silhouette_score(data, clusters) if silhouette < 0.5: print("Warning: Poor clustering quality") ``` 4. **Use Domain Knowledge**: - Does clustering match expected groups? - Are separated groups biologically meaningful? - Check against known standards ### Reproducibility ```python # Always set random state pca_result = apply_pca(data, random_state=42) umap_result = apply_umap(data, random_state=42) kmeans_result = apply_kmeans(data, random_state=42, n_init=10) ``` --- ## See Also - [Analysis User Guide](../user-guide/analysis.md) - Step-by-step tutorials - [Statistical Methods](statistical.md) - Hypothesis testing - [Machine Learning Methods](machine-learning.md) - Classification algorithms - [Best Practices](../user-guide/best-practices.md) - Analysis strategies --- **Last Updated**: 2026-01-24