Exploratory Analysis Methods
Comprehensive reference for dimensionality reduction and clustering methods.
Principal Component Analysis (PCA)
Purpose: Linear dimensionality reduction preserving maximum variance
Theory
PCA finds orthogonal directions (principal components) that capture the most variance in the data:
X = USVᵀ (Singular Value Decomposition of the centered data)
Where:
U: Left singular vectors (the sample scores are US)
S: Singular values (the eigenvalues of the covariance matrix are S²/(n-1))
V: Right singular vectors (loadings)
Mathematical Steps:
Center data:
X_centered = X - mean(X)
Compute covariance matrix:
C = X_centeredᵀ X_centered / (n-1)
Eigendecomposition:
C = VΛVᵀ
Project data:
scores = X_centered × V
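The steps above can be sketched in plain NumPy. This is a minimal illustration, not the implementation behind `apply_pca`; the function name `pca_svd` and the synthetic data are assumptions made for the example. It also checks that the SVD route and the covariance route give the same eigenvalues.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA via SVD of the mean-centered data matrix (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    eigenvalues = s**2 / (n_samples - 1)          # variance along each PC
    loadings = Vt[:n_components].T                # (n_features, n_components)
    scores = X_centered @ loadings                # same as U[:, :k] * s[:k]
    ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, loadings, eigenvalues, ratio

# Synthetic correlated data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))
scores, loadings, eigenvalues, ratio = pca_svd(X_demo, n_components=2)

# Cross-check against the covariance route (eigendecomposition of C)
C = np.cov(X_demo, rowvar=False)
cov_eigenvalues = np.sort(np.linalg.eigvalsh(C))[::-1]
print(np.allclose(eigenvalues, cov_eigenvalues))
```

The SVD route is usually preferred numerically because it never forms the covariance matrix explicitly.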
Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| n_components | int/float | 2-100 or 0.0-1.0 | 2 | Number of PCs or fraction of variance to retain |
| whiten | bool | - | False | Divide components by singular values |
|  | str | - | 'auto' | SVD algorithm ('auto', 'full', 'arpack', 'randomized') |
| random_state | int | - | 42 | Random seed for reproducibility |
Parameter Guide:
# Visualization (2D/3D)
n_components = 2 # or 3
# Keep 95% variance
n_components = 0.95
# Keep specific number
n_components = 10
# Whitening (for clustering)
whiten = True
Usage Example
from functions.ML import apply_pca
# Apply PCA
pca_result = apply_pca(
data=preprocessed_spectra,
n_components=2,
labels=group_labels
)
# Access results
scores = pca_result['scores'] # (n_samples, n_components)
loadings = pca_result['loadings'] # (n_features, n_components)
explained_var = pca_result['explained_variance_ratio']
Output Components
1. Scores (Sample Coordinates):
scores = pca_result['scores']
# Shape: (n_samples, n_components)
# Each row: sample position in PC space
# Use for: Visualization, clustering
2. Loadings (Feature Contributions):
loadings = pca_result['loadings']
# Shape: (n_features, n_components)
# Each column: wavenumber contributions to PC
# Use for: Identifying important peaks
3. Explained Variance:
var_ratio = pca_result['explained_variance_ratio']
# Array of variance % for each PC
# Example: [0.65, 0.23, 0.08, ...]
# PC1 explains 65%, PC2 23%, etc.
4. Scree Plot Data:
cumulative_var = np.cumsum(var_ratio)
# Cumulative variance explained
Interpretation
Scores Plot
PC2 (23%)
↑
A | C
A | CC
AAA | CCC
─────────────→ PC1 (65%)
BBB|
BB|
B|
What it Shows:
Separation: Groups A, B, C are distinct
Distance: Similar samples cluster together
Outliers: Points far from main cluster
Interpretation:
Tight clusters: Homogeneous groups
Overlapping: Spectral similarity
Trend: Continuous variation (e.g., concentration)
Loadings Plot
# Positive peaks in PC1 loading
# → Increase PC1 score
# Negative peaks in PC1 loading
# → Decrease PC1 score
Example:
Loading PC1:
↑
| ___
| / \ ← Important peak
|___/ \___
└─────────────→ Wavenumber
If Group A has high PC1 scores,
they have strong signal at this peak
Key Wavenumbers:
# Find most important peaks
top_indices = np.argsort(np.abs(loadings[:, 0]))[-10:]
important_peaks = wavenumbers[top_indices]
Explained Variance
Scree Plot:
Variance (%)
100 |●
80 | ●
60 | ●
40 | ●
20 | ●●●●●
0 |________________
1 2 3 4 5 6 7 8 PC
Rules of Thumb:
PC1 + PC2 > 70%: Good 2D representation
PC1 + PC2 < 50%: Consider 3D or UMAP
Elbow point: # PCs to retain (here: ~3-4)
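The rules of thumb above can be checked numerically. The sketch below is a hypothetical helper (`n_components_for_variance` is not part of `functions.ML`); the first three ratios come from the example in this section and the last two are made up to complete the array.

```python
import numpy as np

def n_components_for_variance(var_ratio, threshold=0.95):
    """Smallest number of PCs whose cumulative variance reaches `threshold`."""
    cumulative = np.cumsum(var_ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, fall back to all available PCs
    return min(k, len(cumulative))

var_ratio_demo = np.array([0.65, 0.23, 0.08, 0.03, 0.01])
k_95 = n_components_for_variance(var_ratio_demo, 0.95)
two_d_ok = var_ratio_demo[:2].sum() > 0.70   # "PC1 + PC2 > 70%" rule
print(k_95, two_d_ok)
```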
Common Use Cases
1. Group Visualization
# 2D scatter plot
plt.scatter(scores[:, 0], scores[:, 1], c=labels)
plt.xlabel(f'PC1 ({var_ratio[0]:.1%})')
plt.ylabel(f'PC2 ({var_ratio[1]:.1%})')
2. Outlier Detection
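# Hotelling's T² statistic
import numpy as np
from scipy import stats
n_components = scores.shape[1]
T2 = np.sum((scores / np.std(scores, axis=0, ddof=1))**2, axis=1)
# Chi-square limit is a large-sample approximation; an F-based limit is stricter for small n
threshold = stats.chi2.ppf(0.95, df=n_components)
outliers = T2 > threshold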
3. Feature Selection
# Important wavenumbers from PC1
loading_weights = np.abs(loadings[:, 0])
top_features = np.argsort(loading_weights)[-20:]
4. Dimensionality Reduction for ML
# Reduce to 95% variance
pca = apply_pca(data, n_components=0.95)
reduced_data = pca['scores']
# Use reduced_data for classification
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Poor separation | Low variance in PC1-2 | Try more PCs, use UMAP |
| All groups overlap | No spectral differences | Check preprocessing |
| PC1 = baseline | Preprocessing issue | Better baseline correction |
| One outlier dominates | Extreme spectrum | Remove outlier, re-run |
| Scores look random | Data not standardized | Check normalization |
Assumptions
✓ Linear relationships: PCA finds linear combinations
✓ Variance = importance: Assumes max variance = most informative
✗ Not scale-invariant: Features with larger variance dominate, so apply normalization first
✗ Sign ambiguity: The sign of each PC axis is arbitrary and can flip between runs
When to Use
Use PCA when:
✓ Visualizing high-dimensional data
✓ Reducing dimensions for ML
✓ Identifying important features
✓ Quick exploratory analysis
✓ Data roughly linear
Consider alternatives when:
✗ Non-linear structure (use UMAP/t-SNE)
✗ Preserving distances (use MDS)
✗ Only want clustering (use K-means directly)
Advanced Options
Whitening
# Normalize PC variances (useful before clustering)
pca_result = apply_pca(data, whiten=True)
Incremental PCA
# For very large datasets
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=2)
ipca.partial_fit(data_batch_1)
ipca.partial_fit(data_batch_2)
scores = ipca.transform(data)
Reference
Jolliffe & Cadima (2016). “Principal component analysis: a review and recent developments”
MCR-ALS
Purpose: Spectral unmixing / component extraction from mixtures.
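MCR-ALS (Multivariate Curve Resolution – Alternating Least Squares) decomposes a data matrix \(X\) into concentrations \(C\) and component spectra \(S\) under the standard bilinear model:
\[ X = C S^{\mathsf{T}} + E \]
where \(E\) holds the residuals not explained by the components.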
When to use:
✓ Each measured spectrum is a mixture of a small number of underlying “pure” components
✓ You want interpretable component spectra and relative contributions
Typical constraints:
Non-negativity on \(C\) and/or \(S\)
Normalization or closure constraints (depending on experiment)
Practical notes:
Sensitive to initialization; try multiple starts.
Preprocessing (baseline correction, normalization) usually improves results.
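A bare-bones sketch of the alternating least squares idea, assuming the bilinear model X ≈ C Sᵀ. This is not a production MCR-ALS: non-negativity is enforced by simply clipping negative values to zero rather than by a proper non-negative least squares solver, there is no closure constraint and no convergence test, and the function name `mcr_als` and the synthetic two-peak data are assumptions made for illustration.

```python
import numpy as np

def mcr_als(X, n_components, n_iter=200, seed=0):
    """Alternating least squares with non-negativity enforced by clipping."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Random non-negative initial guess for the spectra: (n_features, n_components)
    S = rng.random((X.shape[1], n_components))
    for _ in range(n_iter):
        # Solve X ≈ C S^T for C with S fixed, then clip negatives to zero
        C = np.clip(X @ np.linalg.pinv(S.T), 0.0, None)
        # Solve X ≈ C S^T for S with C fixed, then clip negatives to zero
        S = np.clip((np.linalg.pinv(C) @ X).T, 0.0, None)
    residual = np.linalg.norm(X - C @ S.T) / np.linalg.norm(X)
    return C, S, float(residual)

# Synthetic mixtures of two Gaussian "pure" spectra (illustration only)
axis = np.linspace(0.0, 1.0, 60)
pure = np.stack([np.exp(-((axis - 0.3) / 0.05) ** 2),
                 np.exp(-((axis - 0.7) / 0.05) ** 2)], axis=1)   # (60, 2)
true_C = np.random.default_rng(1).random((30, 2))
X_demo = true_C @ pure.T

C_hat, S_hat, residual = mcr_als(X_demo, n_components=2)
print(C_hat.shape, S_hat.shape, residual)
```

Because the result depends on the starting guess, running this with several values of `seed` and comparing the residuals is one simple way to follow the "try multiple starts" advice.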
UMAP (Uniform Manifold Approximation and Projection)
Purpose: Non-linear dimensionality reduction preserving local and global structure
Theory
UMAP constructs a graph in the high-dimensional space, then optimizes a low-dimensional representation of it:
1. Build a k-nearest neighbor graph in the high-dimensional space
2. Compute a fuzzy simplicial complex (the topological structure)
3. Optimize a low-dimensional layout that preserves this topology
Key Difference from PCA:
PCA: Linear projection, preserves variance
UMAP: Non-linear, preserves topology (neighbors stay neighbors)
Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| n_neighbors | int | 2 - 200 | 15 | Number of neighbors for graph |
| min_dist | float | 0.0 - 0.99 | 0.1 | Minimum distance in embedding |
| n_components | int | 2 - 3 | 2 | Output dimensions |
|  | str | - | 'euclidean' | Distance metric |
| random_state | int | - | 42 | Random seed |
Parameter Guide:
# Local structure (tight clusters)
n_neighbors = 5    # typical range: 5 to 10
min_dist = 0.0
# Balanced (recommended)
n_neighbors = 15
min_dist = 0.1
# Global structure (broader view)
n_neighbors = 50   # typical range: 50 to 100
min_dist = 0.5
n_neighbors:
Low (5-10): Focus on local structure, tight clusters
Medium (15-30): Balanced local/global
High (50-200): Focus on global structure, looser clusters
min_dist:
0.0: Densest packing, tight clusters
0.1: Default, good separation
0.5+: Spread out, overview
Usage Example
from functions.ML import apply_umap
# Apply UMAP
umap_result = apply_umap(
data=preprocessed_spectra,
n_neighbors=15,
min_dist=0.1,
n_components=2,
labels=group_labels
)
# Access results
embedding = umap_result['embedding'] # (n_samples, 2)
PCA vs UMAP Comparison
| Aspect | PCA | UMAP |
|---|---|---|
| Type | Linear | Non-linear |
| Speed | Fast | Slower |
| Preserves | Variance | Topology |
| Global structure | Good | Partial (depends on n_neighbors) |
| Local structure | Poor | Excellent |
| Deterministic | Yes | No (use random_state) |
| Interpretability | High (loadings) | Low (no loadings) |
| Outliers | Sensitive | Robust |
Decision Guide:
Use PCA when:
- Need feature importance (loadings)
- Want speed
- Linear relationships
- Interpretability critical
Use UMAP when:
- PCA shows overlap
- Need better separation
- Non-linear structure
- Visualization priority
Interpretation
UMAP Embedding:
Dim 2
↑
A | C
AA | CCC
AAA | CC
─────────────→ Dim 1
BB |
BBB|
B|
What to Look For:
Clusters: Distinct groups
Distance: Relative, not absolute
Shape: Cluster density and spread
Bridges: Transitional samples
Warnings:
⚠️ Distances not quantitative: UMAP preserves topology, not exact distances
⚠️ Cluster size misleading: Doesn’t reflect true cluster variance
⚠️ Different seeds → different layouts: Use random_state for reproducibility
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Too many clusters | n_neighbors too low | Increase to 30-50 |
| All points together | n_neighbors too high | Decrease to 10-15 |
| Unclear structure | min_dist too high | Reduce to 0.05-0.1 |
| Overlapping groups | Inherent similarity | Try different preprocessing |
| Different results | Random seed | Set random_state=42 |
When to Use
Use UMAP when:
✓ PCA shows poor separation
✓ Suspect non-linear structure
✓ Want beautiful visualizations
✓ Exploring complex datasets
Use PCA when:
✓ Need interpretable features
✓ Speed critical
✓ Publishing quantitative results
Reference
McInnes et al. (2018). “UMAP: Uniform Manifold Approximation and Projection”
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Purpose: Non-linear dimensionality reduction emphasizing local structure
Theory
t-SNE preserves pairwise similarities between points:
Compute pairwise probabilities in high dimensions (Gaussian)
Compute pairwise probabilities in low dimensions (t-distribution)
Minimize KL divergence between probability distributions
vs UMAP: t-SNE focuses more on local structure, UMAP balances local/global
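The three steps above can be made concrete by computing the objective for a given embedding. The sketch below is a simplification: it uses one fixed Gaussian bandwidth `sigma` for every point, whereas real t-SNE tunes a bandwidth per point to match the perplexity, and it only evaluates the KL divergence without optimizing it. The function names and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.maximum(D, 0.0)

def tsne_kl(X, Y, sigma=1.0):
    """KL(P || Q) between high-dim data X and a candidate embedding Y."""
    n = X.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    # Step 1: Gaussian conditional probabilities p(j|i) in high dimensions
    P_cond = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma**2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)
    P = (P_cond + P_cond.T) / (2.0 * n)           # symmetrized joint probabilities
    # Step 2: Student-t (one degree of freedom) similarities in low dimensions
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    # Step 3: the quantity t-SNE minimizes
    eps = 1e-12
    p, q = P[off_diag], Q[off_diag]
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Two well-separated synthetic groups in 5 dimensions
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.5, size=(20, 5))
cluster_b = rng.normal(0.0, 0.5, size=(20, 5)) + 10.0
X_demo = np.vstack([cluster_a, cluster_b])

Y_good = X_demo[:, :2]                       # embedding that keeps the groups apart
Y_bad = Y_good[np.r_[0:40:2, 1:40:2]]        # same coordinates, neighbors scrambled
kl_good = tsne_kl(X_demo, Y_good)
kl_bad = tsne_kl(X_demo, Y_bad)
print(kl_good < kl_bad)
```

An embedding that keeps high-dimensional neighbors close gets a lower KL divergence than one that scatters them, which is what the optimizer exploits.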
Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| perplexity | float | 5 - 50 | 30 | Number of effective neighbors |
| n_iter | int | 250 - 5000 | 1000 | Optimization iterations |
|  | float | 10 - 1000 | 200 | Step size |
|  | float | 1 - 20 | 12 | Initial separation |
| random_state | int | - | 42 | Random seed |
Parameter Guide:
# Small dataset (n < 100)
perplexity = 5     # typical range: 5 to 10
n_iter = 1000
# Medium dataset (n = 100-1000)
perplexity = 30
n_iter = 1000      # up to 2000
# Large dataset (n > 1000)
perplexity = 50
n_iter = 2000      # up to 5000
Perplexity:
Rule of thumb: 5 < perplexity < n_samples/3
Low (5-10): Focus on local clusters
Medium (30): Balanced
High (50+): Focus on global structure
Usage Example
from functions.ML import apply_tsne
# Apply t-SNE
tsne_result = apply_tsne(
data=preprocessed_spectra,
perplexity=30,
n_iter=1000,
labels=group_labels
)
# Access results
embedding = tsne_result['embedding'] # (n_samples, 2)
Interpretation
t-SNE Output:
Similar to UMAP, but:
- Even tighter clusters
- Less emphasis on global distances
- More sensitive to perplexity
Key Points:
⚠️ Cluster sizes meaningless: Don’t interpret relative sizes
⚠️ Distances not quantitative: Within-cluster distances OK, between-cluster not
⚠️ Slow: Much slower than PCA/UMAP for large datasets
UMAP vs t-SNE
| Aspect | UMAP | t-SNE |
|---|---|---|
| Speed | Fast | Slow |
| Global structure | Better | Worse |
| Local structure | Good | Excellent |
| Scalability | 100k+ samples | <10k samples |
| Reproducibility | Better | Worse |
| General use | Preferred | Specialized |
Recommendation: Use UMAP unless you specifically need t-SNE’s extreme local focus
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Blob without structure | Perplexity too high | Reduce perplexity |
| Many tiny clusters | Perplexity too low | Increase perplexity |
| Not converged | n_iter too low | Increase to 2000+ |
| Different results | Random seed | Set random_state |
| Very slow | Large dataset | Use UMAP instead |
When to Use
Use t-SNE when:
✓ Small-medium datasets (< 5000 samples)
✓ Need extreme local structure emphasis
✓ Publication requires it (legacy)
Use UMAP instead when:
✓ Large datasets
✓ Need reproducibility
✓ Want global structure too
✓ Speed matters
Reference
van der Maaten & Hinton (2008). “Visualizing Data using t-SNE”
Hierarchical Clustering
Purpose: Create tree of nested clusters (dendrogram)
Theory
Agglomerative (bottom-up):
Start: Each point is a cluster
Repeat: Merge closest clusters
Stop: All points in one cluster
Result: Dendrogram showing merge history
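The bottom-up procedure above can be sketched in a few lines of NumPy. This is a naive O(n³) illustration, not what SciPy or `apply_hierarchical_clustering` does internally, and it uses average linkage for brevity instead of the Ward linkage recommended in this section. The function name and the toy data are assumptions.

```python
import numpy as np

def agglomerative_average_linkage(X, n_clusters):
    """Naive agglomerative clustering with average linkage."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]       # start: each point is a cluster
    merges = []                                   # merge history (dendrogram data)
    while len(clusters) > n_clusters:
        best_d, best_a, best_b = np.inf, -1, -1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best_d:
                    best_d, best_a, best_b = d, a, b
        # Merge the closest pair and record the height of the merge
        merges.append((clusters[best_a][:], clusters[best_b][:], float(best_d)))
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels, merges

# Two tight toy groups, five points each
group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X_demo = np.vstack([group, group + 5.0])
labels_demo, merges = agglomerative_average_linkage(X_demo, n_clusters=2)
print(labels_demo)
```

Stopping at `n_clusters` corresponds to cutting the dendrogram; letting the loop run down to one cluster would record the full merge history.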
Parameters
| Parameter | Type | Options | Default | Description |
|---|---|---|---|---|
| linkage | str | ward, complete, average, single | 'ward' | Linkage criterion (how distance between clusters is measured) |
| metric | str | euclidean, cosine, correlation | 'euclidean' | Distance function |
| n_clusters | int | 2 - 20 | None | Cut tree to get clusters |
Linkage Methods:
| Method | Distance Between Clusters | Use Case |
|---|---|---|
| Ward | Minimizes within-cluster variance | Recommended for most |
| Complete | Maximum distance | Compact clusters |
| Average | Average distance | Balanced |
| Single | Minimum distance | Can find elongated clusters |
Recommendation: Use Ward with Euclidean distance
Usage Example
from functions.ML import apply_hierarchical_clustering
# Apply hierarchical clustering
hc_result = apply_hierarchical_clustering(
data=preprocessed_spectra,
linkage='ward',
metric='euclidean',
n_clusters=3
)
# Access results
clusters = hc_result['clusters'] # Cluster assignments
linkage_matrix = hc_result['linkage'] # For dendrogram
Dendrogram Interpretation
Height
|
10| ┌─┐
| ┌───┤ ├───┐
5| ┌─┤ └─┘ ├─┐
| ┌─┤ └─┐ ┌─┘ ├─┐
0| └─┘ └─────┘ └─┘
A₁ A₂ A₃ A₄ B₁ B₂
How to Read:
Horizontal lines: Merges between clusters
Height: Distance at which the merge occurs
Vertical lines: The longer the line, the more dissimilar the merged clusters
Cut horizontally: Define clusters
Cutting the Tree:
# Cut at height = 7
# → 2 clusters: [A₁,A₂,A₃,A₄] and [B₁,B₂]
# Cut at height = 3
# → 3 clusters
Cophenetic Distance:
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
c, coph_dists = cophenet(linkage_matrix, pdist(data))
# c > 0.7: Good representation
# c < 0.5: Poor representation
Visualization
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt
# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(
linkage_matrix,
labels=sample_names,
leaf_rotation=90,
leaf_font_size=8
)
plt.xlabel('Sample')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Dendrogram')
plt.tight_layout()
plt.show()
Choosing Number of Clusters
Method 1: Visual Inspection
Look for large height jumps in dendrogram
Cut before major merge
Method 2: Elbow Method
from scipy.cluster.hierarchy import fcluster
distances = []
for k in range(2, 11):
    clusters = fcluster(linkage_matrix, k, criterion='maxclust')
    # calculate_within_cluster_distance is a user-supplied helper (not part of SciPy)
    dist = calculate_within_cluster_distance(data, clusters)
    distances.append(dist)
# Plot and find elbow
plt.plot(range(2, 11), distances)
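The elbow loop above calls `calculate_within_cluster_distance`, which is not defined in this document. One plausible definition, assuming it is meant to return the within-cluster sum of squares (the same quantity K-means calls inertia), is sketched below; treat the body as an assumption rather than the project's actual helper.

```python
import numpy as np

def calculate_within_cluster_distance(data, clusters):
    """Sum of squared distances from each sample to its cluster mean (WCSS)."""
    data = np.asarray(data, dtype=float)
    clusters = np.asarray(clusters)
    total = 0.0
    for label in np.unique(clusters):
        members = data[clusters == label]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return float(total)

# Four toy points: two tight pairs far apart
demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
wcss_two = calculate_within_cluster_distance(demo, [1, 1, 2, 2])
wcss_one = calculate_within_cluster_distance(demo, [1, 1, 1, 1])
print(wcss_two, wcss_one)   # 4.0 104.0
```

Labels may start at 1, as they do for `fcluster` output, because the function iterates over the unique label values.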
Method 3: Silhouette Score
import numpy as np
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score
scores = []
for k in range(2, 11):
    clusters = fcluster(linkage_matrix, k, criterion='maxclust')
    score = silhouette_score(data, clusters)
    scores.append(score)
# Choose k with highest score (index 0 corresponds to k = 2)
best_k = np.argmax(scores) + 2
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Unbalanced clusters | Single linkage | Use Ward linkage |
| Unclear structure | Wrong distance metric | Try different metrics |
| Chains instead of clusters | Single linkage | Use Ward/Complete |
| Dendrogram too complex | Too many samples | Truncate or subset |
When to Use
Use Hierarchical Clustering when:
✓ Want to explore cluster structure
✓ Don’t know # clusters
✓ Need hierarchical relationships
✓ Small-medium datasets (< 5000 samples)
Use K-Means when:
✓ Know # clusters
✓ Large datasets
✓ Speed critical
Reference
Müllner (2013). “fastcluster: Fast Hierarchical, Agglomerative Clustering”
K-Means Clustering
Purpose: Partition data into K non-overlapping clusters
Theory
Algorithm:
1. Initialize K centroids randomly
2. Assign each point to nearest centroid
3. Update centroids to cluster means
4. Repeat 2-3 until convergence
Objective: Minimize within-cluster sum of squares (WCSS)
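The algorithm above (Lloyd's algorithm) can be sketched directly in NumPy. This is an illustration with plain random initialization, not the k-means++ variant and not the code behind `apply_kmeans`; the function name `lloyd_kmeans` and the toy data are assumptions.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 1. Initialize K centroids by picking K distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid
        labels = nearest_centroid(X, centroids)
        # 3. Update centroids to cluster means (keep the old one if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = nearest_centroid(X, centroids)
    wcss = float(np.sum((X - centroids[labels]) ** 2))   # the objective being minimized
    return labels, centroids, wcss

# Toy data: two pairs of points on a line
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
labels_demo, centroids_demo, wcss_demo = lloyd_kmeans(X_demo, k=2)
print(labels_demo, wcss_demo)
```

On harder data a single random start can end in a poor local minimum, which is why several initializations (`n_init`) are normally run and the lowest WCSS kept.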
Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| n_clusters | int | 2 - 20 | 3 | Number of clusters |
| init | str | 'k-means++', 'random' | 'k-means++' | Initialization method |
| n_init | int | 10 - 100 | 10 | Number of random initializations |
| max_iter | int | 100 - 1000 | 300 | Maximum iterations |
| random_state | int | - | 42 | Random seed |
Recommendation: Use k-means++ initialization (smarter than random)
Usage Example
from functions.ML import apply_kmeans
# Apply K-means
kmeans_result = apply_kmeans(
data=preprocessed_spectra,
n_clusters=3,
init='k-means++',
n_init=10
)
# Access results
clusters = kmeans_result['clusters'] # Cluster assignments
centroids = kmeans_result['centroids'] # Cluster centers
inertia = kmeans_result['inertia'] # WCSS
Choosing Number of Clusters (K)
Method 1: Elbow Method
inertias = []
K_range = range(2, 11)
for k in K_range:
    result = apply_kmeans(data, n_clusters=k)
    inertias.append(result['inertia'])
# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.show()
# Look for "elbow" point
Elbow Plot:
Inertia
|●
| ●
| ●
| ●___
| ●___●___●
└─────────────────
2 3 4 5 6 7 8 K
↑
Elbow (K=4)
Method 2: Silhouette Score
from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in range(2, 11):
    result = apply_kmeans(data, n_clusters=k)
    score = silhouette_score(data, result['clusters'])
    silhouette_scores.append(score)
# Choose K with highest score
best_k = np.argmax(silhouette_scores) + 2
Silhouette Score:
Range: [-1, 1]
> 0.7: Strong structure
0.5 - 0.7: Reasonable structure
< 0.5: Weak structure, try different K
Method 3: Gap Statistic
import numpy as np

def gap_statistic(data, K_max=10, n_references=10, random_state=42):
    rng = np.random.default_rng(random_state)
    # Per-feature bounds so the reference data fills the same bounding box
    lower, upper = data.min(axis=0), data.max(axis=0)
    gaps = []
    for k in range(1, K_max + 1):
        # Cluster actual data
        result = apply_kmeans(data, n_clusters=k)
        actual_wcss = result['inertia']
        # Cluster uniform random reference data
        reference_log_wcss = []
        for _ in range(n_references):
            reference = rng.uniform(lower, upper, size=data.shape)
            ref_result = apply_kmeans(reference, n_clusters=k)
            reference_log_wcss.append(np.log(ref_result['inertia']))
        # Gap(k) = E[log(WCSS_ref)] - log(WCSS_actual)
        gap = np.mean(reference_log_wcss) - np.log(actual_wcss)
        gaps.append(gap)
    return gaps

# Choose the smallest K after which the gap stops increasing
Cluster Validation
Silhouette Analysis
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt
import numpy as np
# Compute silhouette scores for each sample
silhouette_vals = silhouette_samples(data, clusters)
# Plot silhouette for each cluster
fig, ax = plt.subplots()
y_lower = 10
for label in np.unique(clusters):
    cluster_silhouette = np.sort(silhouette_vals[clusters == label])
    size = cluster_silhouette.shape[0]
    y_upper = y_lower + size
    ax.fill_betweenx(
        np.arange(y_lower, y_upper),
        0, cluster_silhouette,
        alpha=0.7
    )
    y_lower = y_upper + 10
# Mark the average silhouette score across all samples
ax.axvline(silhouette_vals.mean(), color="red", linestyle="--")
ax.set_xlabel("Silhouette Coefficient")
ax.set_ylabel("Cluster")
Good Clustering:
All clusters above average line
Similar widths (balanced sizes)
All positive values
Poor Clustering:
Clusters below average
Very different widths
Negative values (misassigned points)
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Empty clusters | K too high | Reduce K |
| Unbalanced sizes | Poor initialization | Use k-means++ |
| Different results | Random init | Set random_state, increase n_init |
| Not converging | max_iter too low | Increase max_iter |
| Unexpected clusters | Need normalization | Apply vector norm |
Assumptions and Limitations
Assumes:
✓ Clusters are spherical (isotropic)
✓ Clusters have similar sizes
✓ Clusters are equally dense
Limitations:
✗ Must specify K in advance
✗ Sensitive to outliers
✗ Can’t find non-convex clusters
✗ Random initialization (use n_init=10+)
When to Use
Use K-Means when:
✓ Know or can estimate K
✓ Clusters roughly spherical
✓ Large datasets (fast algorithm)
✓ Need deterministic results (set random_state)
Use Hierarchical when:
✓ Don’t know K
✓ Want dendrogram
✓ Small-medium datasets
Use DBSCAN when:
✓ Arbitrary cluster shapes
✓ Noise points present
✓ Unknown K
Reference
Lloyd (1982). “Least squares quantization in PCM”
DBSCAN (Density-Based Spatial Clustering)
Purpose: Find arbitrarily-shaped clusters based on density
Theory
Concepts:
Core point: Has ≥ min_samples neighbors within eps
Border point: Within eps of a core point
Noise point: Neither core nor border
Algorithm:
1. Find all core points
2. Connect core points within eps
3. Assign border points to nearby clusters
4. Mark remaining points as noise (-1)
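The concepts and steps above can be sketched in NumPy. This is a brute-force illustration (it builds the full distance matrix), not the code behind `apply_dbscan`. It assumes the convention that a point counts as its own neighbor when compared against `min_samples`; the function name and the toy data are assumptions.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    X = np.asarray(X, dtype=float)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhoods within eps (each point is included in its own neighborhood)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    # 1. Find all core points
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)                 # 4. anything never reached stays noise (-1)
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # 2. Grow a cluster by connecting core points within eps of each other
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id  # 3. border points join but do not expand
                    if is_core[q]:
                        stack.append(q)
        cluster_id += 1
    return labels, is_core

group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X_demo = np.vstack([
    group,                 # dense group A
    [[0.55, 0.0]],         # sits on the edge of group A
    group + 5.0,           # dense group B
    [[20.0, 20.0]],        # isolated point
])
labels_demo, is_core_demo = dbscan_sketch(X_demo, eps=0.5, min_samples=4)
print(labels_demo)
```

In this toy set the point at (0.55, 0.0) has too few neighbors to be a core point but lies within `eps` of one, so it is attached to group A as a border point, while the isolated point stays labeled -1.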
Parameters
| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| eps | float | 0.1 - 10.0 | 0.5 | Maximum distance for neighborhood |
| min_samples | int | 3 - 20 | 5 | Minimum points for core point |
|  | str | euclidean, etc. | 'euclidean' | Distance metric |
Parameter Guide:
eps (epsilon):
# Determine eps using k-distance graph
from sklearn.neighbors import NearestNeighbors
neighbors = NearestNeighbors(n_neighbors=min_samples)
neighbors.fit(data)
distances, indices = neighbors.kneighbors(data)
# Sort and plot distances to min_samples-th neighbor
sorted_distances = np.sort(distances[:, -1])
plt.plot(sorted_distances)
plt.ylabel(f'Distance to {min_samples}-th Neighbor')
plt.xlabel('Points (sorted)')
# eps = value at "elbow" of curve
min_samples:
Rule of thumb: 2 × n_dimensions
3-5: Detect finer clusters
10-20: More robust to noise
Usage Example
from functions.ML import apply_dbscan
# Apply DBSCAN
dbscan_result = apply_dbscan(
data=preprocessed_spectra,
eps=0.5,
min_samples=5
)
# Access results
clusters = dbscan_result['clusters'] # -1 = noise
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = list(clusters).count(-1)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
Advantages
✓ No need to specify K: Finds natural clusters
✓ Handles arbitrary shapes: Not limited to spherical
✓ Identifies outliers: Noise points marked as -1
✓ Deterministic: Same results every run for a fixed data order (border point assignment can depend on the order)
Disadvantages
✗ Sensitive to parameters: eps and min_samples critical
✗ Varying densities: Struggles with clusters of different densities
✗ High dimensions: Distance-based, suffers curse of dimensionality
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| One giant cluster | eps too large | Reduce eps |
| All points are noise | eps too small | Increase eps |
| Too many clusters | min_samples too low | Increase min_samples |
| Undetected clusters | min_samples too high | Reduce min_samples |
When to Use
Use DBSCAN when:
✓ Don’t know number of clusters
✓ Clusters have arbitrary shapes
✓ Outliers present
✓ Clusters vary in size (but not density)
Use K-Means when:
✓ Spherical clusters
✓ Know K
✓ All points should be clustered
Reference
Ester et al. (1996). “A density-based algorithm for discovering clusters”
Method Comparison
Quick Reference Table
| Method | Type | Speed | Strengths | Limitations | Best For |
|---|---|---|---|---|---|
| PCA | Linear DR | ⚡⚡⚡ | Fast, interpretable, feature importance | Linear only | Quick exploration, ML preprocessing |
| UMAP | Non-linear DR | ⚡⚡ | Preserves global+local, beautiful plots | No feature importance | Complex data visualization |
| t-SNE | Non-linear DR | ⚡ | Excellent local structure | Slow, no global structure | Small datasets, local patterns |
| Hierarchical | Clustering | ⚡⚡ | Dendrogram, no K needed | Slow for large data | Exploratory, unknown K |
| K-Means | Clustering | ⚡⚡⚡ | Fast, simple | Need K, spherical clusters | Known K, large datasets |
| DBSCAN | Clustering | ⚡⚡ | Arbitrary shapes, finds outliers | Sensitive to parameters | Outliers, complex shapes |
Decision Tree
START: What's your goal?
│
├─ Visualization
│ │
│ ├─ Quick overview → PCA (2D/3D)
│ │
│ ├─ Poor PCA separation → UMAP
│ │
│ └─ Extreme local focus → t-SNE
│
├─ Clustering
│ │
│ ├─ Know # clusters → K-Means
│ │
│ ├─ Don't know # clusters → Hierarchical (dendrogram)
│ │
│ └─ Arbitrary shapes + outliers → DBSCAN
│
└─ Feature reduction for ML
│
├─ Linear data → PCA (keep loadings)
│
└─ Non-linear data → UMAP → then ML
Typical Workflow
1. Initial Exploration:
# Start with PCA (fast)
pca_result = apply_pca(data, n_components=2)
plot_pca_scores(pca_result)
# Check explained variance
if pca_result['explained_variance_ratio'][:2].sum() < 0.5:
    # PC1 + PC2 explain under 50% of the variance, so try UMAP
    umap_result = apply_umap(data, n_neighbors=15)
2. Clustering Investigation:
# Try hierarchical first (explore # clusters)
hc_result = apply_hierarchical_clustering(data, linkage='ward')
plot_dendrogram(hc_result)
# Identify optimal K from dendrogram
optimal_k = 3 # from visual inspection
# Apply K-means with optimal K
kmeans_result = apply_kmeans(data, n_clusters=optimal_k)
3. Validation:
# Validate clustering quality
from sklearn.metrics import silhouette_score, davies_bouldin_score
clusters = kmeans_result['clusters']  # labels from step 2
sil_score = silhouette_score(data, clusters)
db_score = davies_bouldin_score(data, clusters)
print(f"Silhouette: {sil_score:.3f} (higher better)")
print(f"Davies-Bouldin: {db_score:.3f} (lower better)")
Validation Metrics
Silhouette Score
Formula:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
where:
a(i) = avg distance to points in same cluster
b(i) = avg distance to points in nearest different cluster
Interpretation:
+1: Perfect clustering
0: On cluster boundary
-1: Misassigned
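To make the formula concrete, the sketch below computes s(i) for every sample from scratch. It is meant for understanding only; `silhouette_score` from scikit-learn is the practical choice. The function name and toy data are assumptions, and singleton clusters are given s(i) = 0 by convention.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    unique = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                                   # singleton cluster: s(i) = 0
        # a(i): mean distance to the other members of the same cluster
        a = D[i, same].sum() / (n_same - 1)
        # b(i): mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean() for c in unique if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight pairs far apart: every sample should score close to +1
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
s_demo = silhouette_values(X_demo, [0, 0, 1, 1])
print(s_demo.mean())
```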
Usage:
from sklearn.metrics import silhouette_score
score = silhouette_score(data, clusters)
Davies-Bouldin Index
Measures: Ratio of within-cluster to between-cluster distances
Interpretation:
Lower is better
0: Perfect clustering
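A from-scratch sketch of the index, assuming the usual definition: for each cluster, take the worst ratio of combined scatter to centroid separation against any other cluster, then average over clusters. The function name and toy data are assumptions made for illustration.

```python
import numpy as np

def davies_bouldin(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in unique])
    # Scatter: mean distance of each cluster's members to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(unique)
    ])
    k = len(unique)
    worst = np.zeros(k)
    for i in range(k):
        # Within-to-between ratio against every other cluster; keep the worst
        worst[i] = max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return float(worst.mean())

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
db_demo = davies_bouldin(X_demo, [0, 0, 1, 1])
print(db_demo)   # scatter 0.5 + 0.5 over a centroid distance of 10
```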
Usage:
from sklearn.metrics import davies_bouldin_score
score = davies_bouldin_score(data, clusters)
Calinski-Harabasz Index
Measures: Ratio of between-cluster to within-cluster variance
Interpretation:
Higher is better
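A from-scratch sketch, assuming the usual definition: between-cluster dispersion divided by (k - 1), over within-cluster dispersion divided by (n - k). The function name and toy data are assumptions made for illustration.

```python
import numpy as np

def calinski_harabasz(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, unique = len(X), np.unique(labels)
    k = len(unique)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in unique:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Between: cluster size times squared distance of centroid to overall mean
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Within: squared distances of members to their own centroid
        within += np.sum((members - centroid) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
ch_demo = calinski_harabasz(X_demo, [0, 0, 1, 1])
print(ch_demo)   # (100 / 1) / (1 / 2)
```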
Usage:
from sklearn.metrics import calinski_harabasz_score
score = calinski_harabasz_score(data, clusters)
Best Practices
General Guidelines
Preprocessing First:
# Always preprocess before analysis
data = apply_baseline_correction(raw_data)
data = apply_smoothing(data)
data = apply_normalization(data)
Try Multiple Methods:
# Compare different approaches
pca_result = apply_pca(data)
umap_result = apply_umap(data)
# Choose based on results
Validate Results:
# Check quality metrics
silhouette = silhouette_score(data, clusters)
if silhouette < 0.5:
    print("Warning: Poor clustering quality")
Use Domain Knowledge:
Does clustering match expected groups?
Are separated groups biologically meaningful?
Check against known standards
Reproducibility
# Always set random state
pca_result = apply_pca(data, random_state=42)
umap_result = apply_umap(data, random_state=42)
kmeans_result = apply_kmeans(data, random_state=42, n_init=10)
See Also
Analysis User Guide - Step-by-step tutorials
Statistical Methods - Hypothesis testing
Machine Learning Methods - Classification algorithms
Best Practices - Analysis strategies
Last Updated: 2026-01-24