Exploratory Analysis Methods

Comprehensive reference for dimensionality reduction and clustering methods.

Table of Contents

Principal Component Analysis (PCA)
MCR-ALS
UMAP
t-SNE
Hierarchical Clustering
K-Means Clustering
DBSCAN
Method Comparison
Validation Metrics
Best Practices

Principal Component Analysis (PCA)

Purpose: Linear dimensionality reduction preserving maximum variance

Theory

PCA finds orthogonal directions (principal components) that capture the most variance in the data. For mean-centered data X:

X = USVᵀ  (Singular Value Decomposition)

Where:

U: Left singular vectors (sample scores = US)
S: Singular values (covariance eigenvalues are λᵢ = sᵢ² / (n-1))
V: Right singular vectors (loadings)

Mathematical Steps:

Center data: X_centered = X - mean(X)
Compute covariance matrix: C = X_centeredᵀ X_centered / (n-1)
Eigendecomposition: C = VΛVᵀ
Project data: scores = X_centered × V
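
The steps above can be checked numerically. The following is a minimal NumPy sketch, not the implementation behind `apply_pca`; the function name `pca_svd` and the synthetic data are assumptions made for illustration. It shows that the SVD route and the covariance route agree, with eigenvalues equal to s² / (n-1).

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA of X (n_samples, n_features) via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    loadings = Vt[:n_components].T              # (n_features, n_components)
    scores = X_centered @ loadings              # same as U[:, :k] * s[:k]
    eigenvalues = s**2 / (X.shape[0] - 1)       # covariance eigenvalues
    explained_variance_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, loadings, explained_variance_ratio

# Synthetic correlated data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 8))
scores, loadings, ratio = pca_svd(X, n_components=2)
print(scores.shape, loadings.shape, ratio)
```

The sign of each component is arbitrary, so scores from different libraries may be mirrored along an axis.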

Parameters

Parameter	Type	Range	Default	Description
`n_components`	int/float	2-100 or 0.0-1.0	2	Number of PCs or variance to retain
`whiten`	bool	-	False	Rescale each PC's scores to unit variance
`svd_solver`	str	-	‘auto’	SVD algorithm (‘auto’, ‘full’, ‘arpack’, ‘randomized’)
`random_state`	int	-	42	Random seed for reproducibility

Parameter Guide:

# Visualization (2D/3D)
n_components = 2  # or 3

# Keep 95% variance
n_components = 0.95

# Keep specific number
n_components = 10

# Whitening (for clustering)
whiten = True

Usage Example

from functions.ML import apply_pca

# Apply PCA
pca_result = apply_pca(
    data=preprocessed_spectra,
    n_components=2,
    labels=group_labels
)

# Access results
scores = pca_result['scores']  # (n_samples, n_components)
loadings = pca_result['loadings']  # (n_features, n_components)
explained_var = pca_result['explained_variance_ratio']

Output Components

1. Scores (Sample Coordinates):

scores = pca_result['scores']
# Shape: (n_samples, n_components)
# Each row: sample position in PC space
# Use for: Visualization, clustering

2. Loadings (Feature Contributions):

loadings = pca_result['loadings']
# Shape: (n_features, n_components)
# Each column: wavenumber contributions to PC
# Use for: Identifying important peaks

3. Explained Variance:

var_ratio = pca_result['explained_variance_ratio']
# Array of variance % for each PC
# Example: [0.65, 0.23, 0.08, ...]
# PC1 explains 65%, PC2 23%, etc.

4. Scree Plot Data:

cumulative_var = np.cumsum(var_ratio)
# Cumulative variance explained
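
The cumulative curve can also be used to pick the number of components for a target variance. This is a small sketch; the helper name `n_components_for_variance` and the example ratios are hypothetical.

```python
import numpy as np

def n_components_for_variance(var_ratio, threshold=0.95):
    """Smallest number of PCs whose cumulative variance reaches threshold."""
    cumulative = np.cumsum(var_ratio)
    # Index of the first cumulative value >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, fall back to all available PCs
    return min(k, len(cumulative))

var_ratio = np.array([0.65, 0.23, 0.08, 0.03, 0.01])
k = n_components_for_variance(var_ratio, threshold=0.95)
print(k)
```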

Interpretation

Scores Plot

     PC2 (23%)
         ↑
    A    |    C
    A    |   CC
   AAA   | CCC
  ─────────────→ PC1 (65%)
     BBB|
      BB|
       B|

What it Shows:

Separation: Groups A, B, C are distinct
Distance: Similar samples cluster together
Outliers: Points far from main cluster

Interpretation:

Tight clusters: Homogeneous groups
Overlapping: Spectral similarity
Trend: Continuous variation (e.g., concentration)

Loadings Plot

# Positive peaks in PC1 loading
# → Increase PC1 score

# Negative peaks in PC1 loading
# → Decrease PC1 score

Example:

Loading PC1:
  ↑
  |     ___
  |    /   \    ← Important peak
  |___/     \___
  └─────────────→ Wavenumber
  
If Group A has high PC1 scores,
they have strong signal at this peak

Key Wavenumbers:

# Find most important peaks
top_indices = np.argsort(np.abs(loadings[:, 0]))[-10:]
important_peaks = wavenumbers[top_indices]

Explained Variance

Scree Plot:

Variance (%)
|●
|  ●
|    ●
|      ●
|        ●●●●●
|________________
2 3 4 5 6 7 8  PC

Rules of Thumb:

PC1 + PC2 > 70%: Good 2D representation
PC1 + PC2 < 50%: Consider 3D or UMAP
Elbow point: # PCs to retain (here: ~3-4)

Common Use Cases

1. Group Visualization

# 2D scatter plot
plt.scatter(scores[:, 0], scores[:, 1], c=labels)
plt.xlabel(f'PC1 ({var_ratio[0]:.1%})')
plt.ylabel(f'PC2 ({var_ratio[1]:.1%})')

2. Outlier Detection

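# Hotelling's T² statistic
import numpy as np
from scipy import stats

n_components = scores.shape[1]
T2 = np.sum((scores / np.std(scores, axis=0, ddof=1))**2, axis=1)
# Chi-squared limit is a large-sample approximation;
# an F-distribution limit is more accurate for small sample sizes
threshold = stats.chi2.ppf(0.95, df=n_components)
outliers = T2 > threshold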

3. Feature Selection

# Important wavenumbers from PC1
loading_weights = np.abs(loadings[:, 0])
top_features = np.argsort(loading_weights)[-20:]

4. Dimensionality Reduction for ML

# Reduce to 95% variance
pca = apply_pca(data, n_components=0.95)
reduced_data = pca['scores']
# Use reduced_data for classification

Troubleshooting

Issue	Cause	Solution
Poor separation	Low variance in PC1-2	Try more PCs, use UMAP
All groups overlap	No spectral differences	Check preprocessing
PC1 = baseline	Preprocessing issue	Better baseline correction
One outlier dominates	Extreme spectrum	Remove outlier, re-run
Scores look random	Data not standardized	Check normalization

Assumptions

✓ Linear relationships: PCA finds linear combinations
✓ Variance = importance: Assumes max variance = most informative
✗ Not scale-invariant: Apply normalization first
✗ Sign is arbitrary: Each PC axis can flip sign without changing its meaning

When to Use

Use PCA when:

✓ Visualizing high-dimensional data
✓ Reducing dimensions for ML
✓ Identifying important features
✓ Quick exploratory analysis
✓ Data roughly linear

Consider alternatives when:

✗ Non-linear structure (use UMAP/t-SNE)
✗ Preserving distances (use MDS)
✗ Only want clustering (use K-means directly)

Advanced Options

Whitening

# Normalize PC variances (useful before clustering)
pca_result = apply_pca(data, whiten=True)

Incremental PCA

# For very large datasets
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=2)
ipca.partial_fit(data_batch_1)
ipca.partial_fit(data_batch_2)
scores = ipca.transform(data)

Reference

Jolliffe & Cadima (2016). “Principal component analysis: a review and recent developments”

MCR-ALS

Purpose: Spectral unmixing / component extraction from mixtures.

MCR-ALS (Multivariate Curve Resolution – Alternating Least Squares) aims to decompose a data matrix \(X\) into concentrations \(C\) and component spectra \(S\):

\[ X \approx C S^T \]

When to use:

✓ Each measured spectrum is a mixture of a small number of underlying “pure” components
✓ You want interpretable component spectra and relative contributions

Typical constraints:

Non-negativity on \(C\) and/or \(S\)
Normalization or closure constraints (depending on experiment)

Practical notes:

Sensitive to initialization; try multiple starts.
Preprocessing (baseline correction, normalization) usually improves results.
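
The alternating updates can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: non-negativity is enforced by clipping after each unconstrained least-squares solve (dedicated MCR packages often use proper non-negative least squares), the initial spectra are randomly chosen measured spectra, and the function name `mcr_als` and the synthetic two-component mixture are hypothetical.

```python
import numpy as np

def mcr_als(X, n_components, n_iter=200, seed=0):
    """Clipped alternating least squares for X ≈ C @ S.T with C, S >= 0."""
    rng = np.random.default_rng(seed)
    # Initial guess for S: randomly chosen measured spectra (an assumption)
    rows = rng.choice(X.shape[0], size=n_components, replace=False)
    S = X[rows].T.copy()                          # (n_channels, n_components)
    for _ in range(n_iter):
        # Solve S @ C.T ≈ X.T for C, then enforce non-negativity by clipping
        C = np.linalg.lstsq(S, X.T, rcond=None)[0].T
        C = np.clip(C, 0, None)
        # Solve C @ S.T ≈ X for S, then clip
        S = np.linalg.lstsq(C, X, rcond=None)[0].T
        S = np.clip(S, 0, None)
    return C, S

# Synthetic mixtures of two Gaussian-shaped "pure" spectra
axis = np.linspace(0, 1, 100)
S_true = np.stack([np.exp(-((axis - 0.3) / 0.05) ** 2),
                   np.exp(-((axis - 0.7) / 0.05) ** 2)], axis=1)
C_true = np.random.default_rng(1).uniform(0, 1, size=(30, 2))
X = C_true @ S_true.T

C, S = mcr_als(X, n_components=2)
rel_error = np.linalg.norm(X - C @ S.T) / np.linalg.norm(X)
print(f"Relative reconstruction error: {rel_error:.4f}")
```

Because the result depends on the starting spectra, running with several `seed` values and comparing the residuals is one way to act on the multiple-starts advice.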

UMAP (Uniform Manifold Approximation and Projection)

Purpose: Non-linear dimensionality reduction preserving local and global structure

Theory

UMAP constructs a graph in the high-dimensional space, then optimizes a low-dimensional representation of it:

Build k-nearest neighbor graph in high-dimensional space
Compute fuzzy simplicial complex (topological structure)
Optimize low-dimensional layout preserving topology

Key Difference from PCA:

PCA: Linear projection, preserves variance
UMAP: Non-linear, preserves topology (neighbors stay neighbors)
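
One way to quantify "neighbors stay neighbors" is to compare each sample's nearest neighbors before and after embedding. The sketch below is an assumption-laden illustration: `neighbor_overlap` is a hypothetical helper, and a simple coordinate projection stands in for a real embedding such as `umap_result['embedding']`.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row, excluding itself."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighbor_overlap(X_high, X_low, k=10):
    """Mean fraction of high-dimensional neighbors kept in the embedding."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_high, nn_low)]
    return float(np.mean(overlaps))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
projection = X[:, :2]                      # stand-in for a 2D embedding
random_layout = rng.normal(size=(60, 2))   # unrelated layout for comparison

print(neighbor_overlap(X, projection, k=5))
print(neighbor_overlap(X, random_layout, k=5))
```

A value near 1 means local neighborhoods survive the embedding; a value near k / n_samples is what an unrelated layout would give.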

Parameters

Parameter	Type	Range	Default	Description
`n_neighbors`	int	2 - 200	15	# neighbors for graph
`min_dist`	float	0.0 - 0.99	0.1	Minimum distance in embedding
`n_components`	int	2 - 3	2	Output dimensions
`metric`	str	-	‘euclidean’	Distance metric
`random_state`	int	-	42	Random seed

Parameter Guide:

# Local structure (tight clusters)
n_neighbors = 10  # typical range: 5 to 10
min_dist = 0.0

# Balanced (recommended)
n_neighbors = 15
min_dist = 0.1

# Global structure (broader view)
n_neighbors = 50  # typical range: 50 to 100
min_dist = 0.5

n_neighbors:

Low (5-10): Focus on local structure, tight clusters
Medium (15-30): Balanced local/global
High (50-200): Focus on global structure, looser clusters

min_dist:

0.0: Densest packing, tight clusters
0.1: Default, good separation
0.5+: Spread out, overview

Usage Example

from functions.ML import apply_umap

# Apply UMAP
umap_result = apply_umap(
    data=preprocessed_spectra,
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    labels=group_labels
)

# Access results
embedding = umap_result['embedding']  # (n_samples, 2)

PCA vs UMAP Comparison

Aspect	PCA	UMAP
Type	Linear	Non-linear
Speed	Fast	Slower
Preserves	Variance	Topology
Global structure	Good	Moderate (improves with larger n_neighbors)
Local structure	Poor	Excellent
Deterministic	Yes	No (use random_state)
Interpretability	High (loadings)	Low (no loadings)
Outliers	Sensitive	Robust

Decision Guide:

Use PCA when:
- Need feature importance (loadings)
- Want speed
- Linear relationships
- Interpretability critical

Use UMAP when:
- PCA shows overlap
- Need better separation
- Non-linear structure
- Visualization priority

Interpretation

UMAP Embedding:

     Dim 2
         ↑
    A    |    C
    AA   |   CCC
   AAA   | CC
  ─────────────→ Dim 1
     BB |
     BBB|
       B|

What to Look For:

Clusters: Distinct groups
Distance: Relative, not absolute
Shape: Cluster density and spread
Bridges: Transitional samples

Warnings:

⚠️ Distances not quantitative: UMAP preserves topology, not exact distances
⚠️ Cluster size misleading: Doesn’t reflect true cluster variance
⚠️ Different seeds → different layouts: Use random_state for reproducibility

Troubleshooting

Issue	Cause	Solution
Too many clusters	n_neighbors too low	Increase to 30-50
All points together	n_neighbors too high	Decrease to 10-15
Unclear structure	min_dist too high	Reduce to 0.05-0.1
Overlapping groups	Inherent similarity	Try different preprocessing
Different results	Random seed	Set random_state=42

When to Use

Use UMAP when:

✓ PCA shows poor separation
✓ Suspect non-linear structure
✓ Want beautiful visualizations
✓ Exploring complex datasets

Use PCA when:

✓ Need interpretable features
✓ Speed critical
✓ Publishing quantitative results

Reference

McInnes et al. (2018). “UMAP: Uniform Manifold Approximation and Projection”

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Purpose: Non-linear dimensionality reduction emphasizing local structure

Theory

t-SNE preserves pairwise similarities between points:

Compute pairwise probabilities in high dimensions (Gaussian)
Compute pairwise probabilities in low dimensions (t-distribution)
Minimize KL divergence between probability distributions
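
The objective in the last step can be written out directly. The sketch below is a simplification, not the full algorithm: it uses one fixed Gaussian bandwidth `sigma` for all points, whereas t-SNE tunes a bandwidth per point to match the requested perplexity, and it only evaluates the cost without optimizing it. The function name `tsne_kl` and the synthetic data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff**2, axis=-1)

def tsne_kl(X, Y, sigma=1.0):
    """KL(P || Q): Gaussian similarities in X, Student-t similarities in Y."""
    n = X.shape[0]
    off_diag = ~np.eye(n, dtype=bool)          # ignore self-similarities
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma**2))
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    P = P[off_diag] / P[off_diag].sum()
    Q = Q[off_diag] / Q[off_diag].sum()
    mask = P > 0                               # 0 * log(0) is treated as 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

# Two well-separated groups in 5 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 5)), rng.normal(size=(20, 5)) + 6.0])
Y_good = X[:, :2]                              # layout that keeps the groups
Y_shuffled = Y_good[rng.permutation(len(X))]   # layout that scrambles them

kl_good = tsne_kl(X, Y_good)
kl_shuffled = tsne_kl(X, Y_shuffled)
print(kl_good, kl_shuffled)
```

A layout that keeps similar samples together gives the lower cost, which is what the optimizer works toward.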

vs UMAP: t-SNE focuses more on local structure, UMAP balances local/global

Parameters

Parameter	Type	Range	Default	Description
`perplexity`	float	5 - 50	30	# effective neighbors
`n_iter`	int	250 - 5000	1000	Optimization iterations
`learning_rate`	float	10 - 1000	200	Step size
`early_exaggeration`	float	1 - 20	12	Initial separation
`random_state`	int	-	42	Random seed

Parameter Guide:

# Small dataset (n < 100)
perplexity = 10  # typical range: 5 to 10
n_iter = 1000

# Medium dataset (n = 100-1000)
perplexity = 30
n_iter = 1000  # up to 2000

# Large dataset (n > 1000)
perplexity = 50
n_iter = 2000  # up to 5000

Perplexity:

Rule of thumb: 5 < perplexity < n_samples/3
Low (5-10): Focus on local clusters
Medium (30): Balanced
High (50+): Focus on global structure

Usage Example

from functions.ML import apply_tsne

# Apply t-SNE
tsne_result = apply_tsne(
    data=preprocessed_spectra,
    perplexity=30,
    n_iter=1000,
    labels=group_labels
)

# Access results
embedding = tsne_result['embedding']  # (n_samples, 2)

Interpretation

t-SNE Output:

Similar to UMAP, but:
- Even tighter clusters
- Less emphasis on global distances
- More sensitive to perplexity

Key Points:

⚠️ Cluster sizes meaningless: Don’t interpret relative sizes
⚠️ Distances not quantitative: Within-cluster distances OK, between-cluster not
⚠️ Slow: Much slower than PCA/UMAP for large datasets

UMAP vs t-SNE

Aspect	UMAP	t-SNE
Speed	Fast	Slow
Global structure	Better	Worse
Local structure	Good	Excellent
Scalability	100k+ samples	<10k samples
Reproducibility	Better	Worse
General use	Preferred	Specialized

Recommendation: Use UMAP unless you specifically need t-SNE’s extreme local focus

Troubleshooting

Issue	Cause	Solution
Blob without structure	Perplexity too high	Reduce perplexity
Many tiny clusters	Perplexity too low	Increase perplexity
Not converged	n_iter too low	Increase to 2000+
Different results	Random seed	Set random_state
Very slow	Large dataset	Use UMAP instead

When to Use

Use t-SNE when:

✓ Small-medium datasets (< 5000 samples)
✓ Need extreme local structure emphasis
✓ Publication requires it (legacy)

Use UMAP instead when:

✓ Large datasets
✓ Need reproducibility
✓ Want global structure too
✓ Speed matters

Reference

van der Maaten & Hinton (2008). “Visualizing Data using t-SNE”

Hierarchical Clustering

Purpose: Create tree of nested clusters (dendrogram)

Theory

Agglomerative (bottom-up):

Start: Each point is a cluster
Repeat: Merge closest clusters
Stop: All points in one cluster

Result: Dendrogram showing merge history
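
The merge history is what SciPy returns as a linkage matrix. The following sketch uses `scipy.cluster.hierarchy` directly on synthetic data (the two groups are made up for illustration) rather than the `apply_hierarchical_clustering` wrapper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two synthetic groups of 10 samples each
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.3, size=(10, 4))
group_b = rng.normal(loc=5.0, scale=0.3, size=(10, 4))
data = np.vstack([group_a, group_b])

# Each row of Z records one merge:
# [cluster index 1, cluster index 2, merge distance, size of new cluster]
Z = linkage(data, method="ward", metric="euclidean")

# Cut the tree so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")

print(Z[-3:])    # the last merges; the final one joins the two groups
print(labels)
```

With n samples there are n - 1 merges, and a large jump in the third column between consecutive rows marks a natural place to cut.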

Parameters

Parameter	Type	Options	Default	Description
`linkage`	str	ward, complete, average, single	‘ward’	Linkage criterion (how distance between clusters is defined)
`metric`	str	euclidean, cosine, correlation	‘euclidean’	Distance function
`n_clusters`	int	2 - 20	None	Cut tree to get clusters

Linkage Methods:

Method	Distance Between Clusters	Use Case
Ward	Minimizes within-cluster variance	Recommended for most
Complete	Maximum distance	Compact clusters
Average	Average distance	Balanced
Single	Minimum distance	Can find elongated clusters

Recommendation: Use Ward with Euclidean distance (Ward linkage is only defined for Euclidean distance)

Usage Example

from functions.ML import apply_hierarchical_clustering

# Apply hierarchical clustering
hc_result = apply_hierarchical_clustering(
    data=preprocessed_spectra,
    linkage='ward',
    metric='euclidean',
    n_clusters=3
)

# Access results
clusters = hc_result['clusters']  # Cluster assignments
linkage_matrix = hc_result['linkage']  # For dendrogram

Dendrogram Interpretation

Height
  |
 10|          ┌─┐
  |      ┌───┤ ├───┐
  5|    ┌─┤   └─┘   ├─┐
  |  ┌─┤ └─┐     ┌─┘ ├─┐
  0|  └─┘   └─────┘   └─┘
     A₁ A₂  A₃ A₄   B₁ B₂

How to Read:

Horizontal lines: Merges between two clusters
Height: Distance at which the merge happens
Vertical lines: Individual clusters (longer lines = more distinct clusters)
Cut horizontally: Define clusters

Cutting the Tree:

# Cut at height = 7
# → 2 clusters: [A₁,A₂,A₃,A₄] and [B₁,B₂]

# Cut at height = 3
# → 3 clusters

Cophenetic Distance:

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(linkage_matrix, pdist(data))
# c > 0.7: Good representation
# c < 0.5: Poor representation

Visualization

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(
    linkage_matrix,
    labels=sample_names,
    leaf_rotation=90,
    leaf_font_size=8
)
plt.xlabel('Sample')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Dendrogram')
plt.tight_layout()
plt.show()

Choosing Number of Clusters

Method 1: Visual Inspection

Look for large height jumps in dendrogram
Cut before major merge

Method 2: Elbow Method

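import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster

# calculate_within_cluster_distance is a user-supplied helper that returns
# the total within-cluster sum of squared distances for a labeling
distances = []
for k in range(2, 11):
    clusters = fcluster(linkage_matrix, k, criterion='maxclust')
    dist = calculate_within_cluster_distance(data, clusters)
    distances.append(dist)

# Plot and find elbow
plt.plot(range(2, 11), distances)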
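
The helper `calculate_within_cluster_distance` used in the loop is not a library function. One possible definition, assuming the intended quantity is the within-cluster sum of squared Euclidean distances to each centroid, is sketched here.

```python
import numpy as np

def calculate_within_cluster_distance(data, clusters):
    """Total squared Euclidean distance of samples to their cluster centroid."""
    data = np.asarray(data, dtype=float)
    clusters = np.asarray(clusters)
    total = 0.0
    for label in np.unique(clusters):
        members = data[clusters == label]
        centroid = members.mean(axis=0)
        total += np.sum((members - centroid) ** 2)
    return float(total)

# Tiny 1D check: two tight pairs far apart
demo = np.array([[0.0], [1.0], [10.0], [11.0]])
wcss_two = calculate_within_cluster_distance(demo, [1, 1, 2, 2])
wcss_one = calculate_within_cluster_distance(demo, [1, 1, 1, 1])
print(wcss_two, wcss_one)
```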

Method 3: Silhouette Score

import numpy as np
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    clusters = fcluster(linkage_matrix, k, criterion='maxclust')
    score = silhouette_score(data, clusters)
    scores.append(score)

# Choose k with highest score (scores[0] corresponds to k = 2)
best_k = np.argmax(scores) + 2

Troubleshooting

Issue	Cause	Solution
Unbalanced clusters	Single linkage	Use Ward linkage
Unclear structure	Wrong distance metric	Try different metrics
Chains instead of clusters	Single linkage	Use Ward/Complete
Dendrogram too complex	Too many samples	Truncate or subset

When to Use

Use Hierarchical Clustering when:

✓ Want to explore cluster structure
✓ Don’t know # clusters
✓ Need hierarchical relationships
✓ Small-medium datasets (< 5000 samples)

Use K-Means when:

✓ Know # clusters
✓ Large datasets
✓ Speed critical

Reference

Müllner (2013). “fastcluster: Fast Hierarchical, Agglomerative Clustering”

K-Means Clustering

Purpose: Partition data into K non-overlapping clusters

Theory

Algorithm:

1. Initialize K centroids randomly
2. Assign each point to nearest centroid
3. Update centroids to cluster means
4. Repeat 2-3 until convergence

Objective: Minimize within-cluster sum of squares (WCSS)
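
The algorithm above can be sketched in a few lines of NumPy. This is an illustration only, not the code behind `apply_kmeans`: the function name `lloyd_kmeans` and the synthetic blobs are assumptions, and it uses plain random initialization rather than k-means++.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=300, tol=1e-8, seed=42):
    """Basic Lloyd iteration; returns labels, centroids and WCSS."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):          # keep old centroid if cluster is empty
                new_centroids[j] = X[labels == j].mean(axis=0)
        # Step 4: stop once the centroids no longer move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    wcss = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, wcss

# Two synthetic blobs of 50 samples each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
labels, centroids, wcss = lloyd_kmeans(X, k=2)
print(centroids, wcss)
```

A single run can end in a poor local minimum on harder data, which is why production implementations repeat the procedure `n_init` times and keep the run with the lowest WCSS.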

Parameters

Parameter	Type	Range	Default	Description
`n_clusters`	int	2 - 20	3	Number of clusters
`init`	str	‘k-means++’, ‘random’	‘k-means++’	Initialization method
`n_init`	int	10 - 100	10	# random initializations
`max_iter`	int	100 - 1000	300	Maximum iterations
`random_state`	int	-	42	Random seed

Recommendation: Use k-means++ initialization (smarter than random)

Usage Example

from functions.ML import apply_kmeans

# Apply K-means
kmeans_result = apply_kmeans(
    data=preprocessed_spectra,
    n_clusters=3,
    init='k-means++',
    n_init=10
)

# Access results
clusters = kmeans_result['clusters']  # Cluster assignments
centroids = kmeans_result['centroids']  # Cluster centers
inertia = kmeans_result['inertia']  # WCSS

Choosing Number of Clusters (K)

Method 1: Elbow Method

inertias = []
K_range = range(2, 11)

for k in K_range:
    result = apply_kmeans(data, n_clusters=k)
    inertias.append(result['inertia'])

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.show()

# Look for "elbow" point

Elbow Plot:

Inertia
  |●
  | ●
  |  ●
  |   ●___
  |       ●___●___●
  └─────────────────
   2 3 4 5 6 7 8  K
       ↑
     Elbow (K=4)

Method 2: Silhouette Score

from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):
    result = apply_kmeans(data, n_clusters=k)
    score = silhouette_score(data, result['clusters'])
    silhouette_scores.append(score)

# Choose K with highest score
best_k = np.argmax(silhouette_scores) + 2

Silhouette Score:

Range: [-1, 1]
> 0.7: Strong structure
0.5 - 0.7: Reasonable structure
< 0.5: Weak structure, try different K

Method 3: Gap Statistic

import numpy as np

def gap_statistic(data, K_max=10, n_references=10, random_state=42):
    rng = np.random.default_rng(random_state)
    # Reference data is uniform within each feature's observed range
    low, high = data.min(axis=0), data.max(axis=0)
    gaps = []
    for k in range(1, K_max+1):
        # Cluster actual data
        result = apply_kmeans(data, n_clusters=k)
        actual_wcss = result['inertia']
        
        # Cluster random reference data
        reference_log_wcss = []
        for _ in range(n_references):
            reference = rng.uniform(low, high, size=data.shape)
            ref_result = apply_kmeans(reference, n_clusters=k)
            reference_log_wcss.append(np.log(ref_result['inertia']))
        
        # Gap = E[log(WCSS_ref)] - log(WCSS_actual)
        gap = np.mean(reference_log_wcss) - np.log(actual_wcss)
        gaps.append(gap)
    
    return gaps

# Choose the smallest K after which the gap stops increasing

Cluster Validation

Silhouette Analysis

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_samples

# Compute silhouette scores for each sample
silhouette_vals = silhouette_samples(data, clusters)
# Assumes labels run from 0 to n_clusters - 1, as K-means returns them
n_clusters = len(np.unique(clusters))

# Plot silhouette for each cluster
fig, ax = plt.subplots()
y_lower = 10

for i in range(n_clusters):
    cluster_silhouette = silhouette_vals[clusters == i]
    cluster_silhouette.sort()
    
    size = cluster_silhouette.shape[0]
    y_upper = y_lower + size
    
    ax.fill_betweenx(
        np.arange(y_lower, y_upper),
        0, cluster_silhouette,
        alpha=0.7
    )
    y_lower = y_upper + 10

# Red dashed line: average silhouette over all samples
ax.axvline(silhouette_vals.mean(), color="red", linestyle="--")
ax.set_xlabel("Silhouette Coefficient")
ax.set_ylabel("Cluster")

Good Clustering:

All clusters above average line
Similar widths (balanced sizes)
All positive values

Poor Clustering:

Clusters below average
Very different widths
Negative values (misassigned points)

Troubleshooting

Issue	Cause	Solution
Empty clusters	K too high	Reduce K
Unbalanced sizes	Poor initialization	Use k-means++
Different results	Random init	Set random_state, increase n_init
Not converging	max_iter too low	Increase max_iter
Unexpected clusters	Need normalization	Apply vector normalization

Assumptions and Limitations

Assumes:

✓ Clusters are spherical (isotropic)
✓ Clusters have similar sizes
✓ Clusters are equally dense

Limitations:

✗ Must specify K in advance
✗ Sensitive to outliers
✗ Can’t find non-convex clusters
✗ Random initialization (use n_init=10+)

When to Use

Use K-Means when:

✓ Know or can estimate K
✓ Clusters roughly spherical
✓ Large datasets (fast algorithm)
✓ Need deterministic results (set random_state)

Use Hierarchical when:

✓ Don’t know K
✓ Want dendrogram
✓ Small-medium datasets

Use DBSCAN when:

✓ Arbitrary cluster shapes
✓ Noise points present
✓ Unknown K

Reference

Lloyd (1982). “Least squares quantization in PCM”

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Purpose: Find arbitrarily-shaped clusters based on density

Theory

Concepts:

Core point: Has ≥ min_samples points within eps (counting itself, as in scikit-learn)
Border point: Not a core point, but within eps of one
Noise point: Neither core nor border

Algorithm:

Find all core points
Connect core points within eps
Assign border points to nearby clusters
Mark remaining points as noise (-1)
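
These steps can be sketched with a brute-force distance matrix. This is a compact illustration, not the implementation behind `apply_dbscan`; the function name `dbscan` and the synthetic data are assumptions, and the O(n²) memory use makes it suitable only for small datasets.

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_samples):
    """Brute-force DBSCAN; returns labels with -1 for noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Each neighborhood includes the point itself (distance 0)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue                      # start clusters only from unlabeled core points
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if is_core[q]:        # only core points keep expanding
                        queue.append(q)
        cluster_id += 1
    return labels                         # points never reached stay -1 (noise)

# Two dense blobs plus one isolated point
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(20, 2)),
               rng.normal(5.0, 0.2, size=(20, 2)),
               [[20.0, 20.0]]])
labels = dbscan(X, eps=1.0, min_samples=4)
print(labels)
```

A border point within eps of core points from two clusters is assigned to whichever cluster reaches it first, so its label can depend on the order of the data.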

Parameters

Parameter	Type	Range	Default	Description
`eps`	float	0.1 - 10.0	0.5	Maximum distance for neighborhood
`min_samples`	int	3 - 20	5	Minimum points for core point
`metric`	str	euclidean, etc.	‘euclidean’	Distance metric

Parameter Guide:

eps (epsilon):

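# Determine eps using k-distance graph
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_samples = 5
neighbors = NearestNeighbors(n_neighbors=min_samples)
neighbors.fit(data)
# Querying the fitted data returns each point as its own first neighbor,
# which matches min_samples in DBSCAN (it also counts the point itself)
distances, indices = neighbors.kneighbors(data)

# Sort and plot distances to the farthest of those min_samples neighbors
sorted_distances = np.sort(distances[:, -1])
plt.plot(sorted_distances)
plt.ylabel(f'Distance to {min_samples}-th Neighbor (incl. self)')
plt.xlabel('Points (sorted)')

# eps = value at "elbow" of curve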

min_samples:

Rule of thumb: 2 × n_dimensions
3-5: Detect finer clusters
10-20: More robust to noise

Usage Example

from functions.ML import apply_dbscan

# Apply DBSCAN
dbscan_result = apply_dbscan(
    data=preprocessed_spectra,
    eps=0.5,
    min_samples=5
)

# Access results
clusters = dbscan_result['clusters']  # -1 = noise
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = list(clusters).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

Advantages

✓ No need to specify K: Finds natural clusters
✓ Handles arbitrary shapes: Not limited to spherical
✓ Identifies outliers: Noise points marked as -1
✓ Deterministic: Same core and noise points every run (border-point assignment can depend on data order)

Disadvantages

✗ Sensitive to parameters: eps and min_samples critical
✗ Varying densities: Struggles with clusters of different densities
✗ High dimensions: Distance-based, suffers curse of dimensionality

Troubleshooting

Issue	Cause	Solution
One giant cluster	eps too large	Reduce eps
All points are noise	eps too small	Increase eps
Too many clusters	min_samples too low	Increase min_samples
Undetected clusters	min_samples too high	Reduce min_samples

When to Use

Use DBSCAN when:

✓ Don’t know number of clusters
✓ Clusters have arbitrary shapes
✓ Outliers present
✓ Clusters vary in size (but not density)

Use K-Means when:

✓ Spherical clusters
✓ Know K
✓ All points should be clustered

Reference

Ester et al. (1996). “A density-based algorithm for discovering clusters”

Method Comparison

Quick Reference Table

Method	Type	Speed	Strengths	Limitations	Best For
PCA	Linear DR	⚡⚡⚡	Fast, interpretable, feature importance	Linear only	Quick exploration, ML preprocessing
UMAP	Non-linear DR	⚡⚡	Preserves global+local, beautiful plots	No feature importance	Complex data visualization
t-SNE	Non-linear DR	⚡	Excellent local structure	Slow, no global structure	Small datasets, local patterns
Hierarchical	Clustering	⚡⚡	Dendrogram, no K needed	Slow for large data	Exploratory, unknown K
K-Means	Clustering	⚡⚡⚡	Fast, simple	Need K, spherical clusters	Known K, large datasets
DBSCAN	Clustering	⚡⚡	Arbitrary shapes, finds outliers	Sensitive to parameters	Outliers, complex shapes

Decision Tree

START: What's your goal?
│
├─ Visualization
│  │
│  ├─ Quick overview → PCA (2D/3D)
│  │
│  ├─ Poor PCA separation → UMAP
│  │
│  └─ Extreme local focus → t-SNE
│
├─ Clustering
│  │
│  ├─ Know # clusters → K-Means
│  │
│  ├─ Don't know # clusters → Hierarchical (dendrogram)
│  │
│  └─ Arbitrary shapes + outliers → DBSCAN
│
└─ Feature reduction for ML
   │
   ├─ Linear data → PCA (keep loadings)
   │
   └─ Non-linear data → UMAP → then ML

Typical Workflow

1. Initial Exploration:

# Start with PCA (fast)
pca_result = apply_pca(data, n_components=2)
plot_pca_scores(pca_result)

# Check explained variance
if pca_result['explained_variance_ratio'][:2].sum() < 0.5:
    # First two PCs capture under half the variance; try UMAP
    umap_result = apply_umap(data, n_neighbors=15)

2. Clustering Investigation:

# Try hierarchical first (explore # clusters)
hc_result = apply_hierarchical_clustering(data, linkage='ward')
plot_dendrogram(hc_result)

# Identify optimal K from dendrogram
optimal_k = 3  # from visual inspection

# Apply K-means with optimal K
kmeans_result = apply_kmeans(data, n_clusters=optimal_k)

3. Validation:

# Validate clustering quality
from sklearn.metrics import silhouette_score, davies_bouldin_score

clusters = kmeans_result['clusters']
sil_score = silhouette_score(data, clusters)
db_score = davies_bouldin_score(data, clusters)

print(f"Silhouette: {sil_score:.3f} (higher better)")
print(f"Davies-Bouldin: {db_score:.3f} (lower better)")

Validation Metrics

Silhouette Score

Formula:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where:
a(i) = avg distance to points in same cluster
b(i) = avg distance to points in nearest different cluster
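
The formula can be evaluated by hand for small data. The sketch below is a direct, unoptimized translation; the function name `silhouette_values` is hypothetical, at least two clusters are assumed, and a point that is alone in its cluster is given s(i) = 0, the convention scikit-learn also follows.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s(i) = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                    # exclude the point itself
        if not same.any():
            continue                       # singleton cluster: s(i) stays 0
        a = dists[i, same].mean()          # mean distance within own cluster
        b = min(dists[i, labels == other].mean()
                for other in unique if other != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# 1D example: clusters {0, 1} and {10, 11}
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
s = silhouette_values(X, labels)
print(s, s.mean())
```

For the point at 0, a = 1 and b = 10.5, so s = 9.5 / 10.5; for the point at 1, a = 1 and b = 9.5, so s = 8.5 / 9.5.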

Interpretation:

+1: Perfect clustering
0: On cluster boundary
-1: Misassigned

Usage:

from sklearn.metrics import silhouette_score
score = silhouette_score(data, clusters)

Davies-Bouldin Index

Measures: Ratio of within-cluster to between-cluster distances

Interpretation:

Lower is better
0: Lower bound (best possible)

Usage:

from sklearn.metrics import davies_bouldin_score
score = davies_bouldin_score(data, clusters)

Calinski-Harabasz Index

Measures: Ratio of between-cluster to within-cluster variance

Interpretation:

Higher is better

Usage:

from sklearn.metrics import calinski_harabasz_score
score = calinski_harabasz_score(data, clusters)
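
For reference, the ratio can be written out as (between-cluster dispersion / (k - 1)) divided by (within-cluster dispersion / (n - k)). The NumPy sketch below uses a hypothetical function name `calinski_harabasz` and a tiny made-up dataset so the arithmetic can be followed by hand.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Between/within dispersion ratio, each scaled by its degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    n, k = len(X), len(unique)
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for label in unique:
        members = X[labels == label]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return float((between / (k - 1)) / (within / (n - k)))

# 1D example: clusters {0, 1} and {10, 11}
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
ch = calinski_harabasz(X, labels)
print(ch)
```

Here the between-cluster dispersion is 2 × 5² + 2 × 5² = 100 and the within-cluster dispersion is 4 × 0.5² = 1, so the index is (100 / 1) / (1 / 2) = 200.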

Best Practices

General Guidelines

Preprocessing First:

# Always preprocess before analysis
data = apply_baseline_correction(raw_data)
data = apply_smoothing(data)
data = apply_normalization(data)

Try Multiple Methods:

# Compare different approaches
pca_result = apply_pca(data)
umap_result = apply_umap(data)
# Choose based on results

Validate Results:

# Check quality metrics
silhouette = silhouette_score(data, clusters)
if silhouette < 0.5:
    print("Warning: Poor clustering quality")

Use Domain Knowledge:
- Does clustering match expected groups?
- Are separated groups biologically meaningful?
- Check against known standards

Reproducibility

# Always set random state
pca_result = apply_pca(data, random_state=42)
umap_result = apply_umap(data, random_state=42)
kmeans_result = apply_kmeans(data, random_state=42, n_init=10)