Exploratory Analysis Methods

Comprehensive reference for dimensionality reduction and clustering methods.


Principal Component Analysis (PCA)

Purpose: Linear dimensionality reduction preserving maximum variance

Theory

PCA finds orthogonal directions (principal components) that capture the most variance in the data. For centered data X:

X = USVᵀ  (Singular Value Decomposition)

Where:

  • U: Left singular vectors (sample scores = US)

  • S: Singular values (the covariance eigenvalues are S² / (n-1))

  • V: Right singular vectors (loadings)

Mathematical Steps:

  1. Center data: X_centered = X - mean(X)

  2. Compute covariance matrix: C = X_centeredᵀ X_centered / (n-1)

  3. Eigendecomposition: C = VΛVᵀ

  4. Project data: scores = X_centered × V
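The steps above can be sketched in plain NumPy. This is a minimal illustration under the assumption that rows are samples and columns are features; `pca_svd` and `X_demo` are hypothetical names, not part of `functions.ML`, and the project's `apply_pca` may differ in details.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA of X (n_samples, n_features) via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)                      # step 1: center
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    loadings = Vt[:n_components].T                       # (n_features, n_components)
    scores = X_centered @ loadings                       # same as U[:, :k] * s[:k]
    eigenvalues = s**2 / (X.shape[0] - 1)                # eigenvalues of the covariance matrix
    explained_variance_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return scores, loadings, explained_variance_ratio

# Synthetic demo data (hypothetical): two high-variance features, six low-variance ones
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 8)) * np.array([5, 3, 1, 1, 1, 1, 1, 1])
scores, loadings, ratio = pca_svd(X_demo, n_components=2)
print(scores.shape, loadings.shape, ratio)
```

Going through the SVD avoids forming the covariance matrix explicitly, which tends to be more stable numerically when there are many more features than samples.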

Parameters

| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| n_components | int/float | 2-100 or 0.0-1.0 | 2 | Number of PCs or variance to retain |
| whiten | bool | - | False | Divide components by singular values |
| svd_solver | str | - | 'auto' | SVD algorithm ('auto', 'full', 'arpack', 'randomized') |
| random_state | int | - | 42 | Random seed for reproducibility |

Parameter Guide:

# Visualization (2D/3D)
n_components = 2  # or 3

# Keep 95% variance
n_components = 0.95

# Keep specific number
n_components = 10

# Whitening (for clustering)
whiten = True

Usage Example

from functions.ML import apply_pca

# Apply PCA
pca_result = apply_pca(
    data=preprocessed_spectra,
    n_components=2,
    labels=group_labels
)

# Access results
scores = pca_result['scores']  # (n_samples, n_components)
loadings = pca_result['loadings']  # (n_features, n_components)
explained_var = pca_result['explained_variance_ratio']

Output Components

1. Scores (Sample Coordinates):

scores = pca_result['scores']
# Shape: (n_samples, n_components)
# Each row: sample position in PC space
# Use for: Visualization, clustering

2. Loadings (Feature Contributions):

loadings = pca_result['loadings']
# Shape: (n_features, n_components)
# Each column: wavenumber contributions to PC
# Use for: Identifying important peaks

3. Explained Variance:

var_ratio = pca_result['explained_variance_ratio']
# Array of variance % for each PC
# Example: [0.65, 0.23, 0.08, ...]
# PC1 explains 65%, PC2 23%, etc.

4. Scree Plot Data:

cumulative_var = np.cumsum(var_ratio)
# Cumulative variance explained
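As an illustration of how the cumulative curve can be turned into a component count, the sketch below finds the smallest number of PCs whose cumulative variance reaches a threshold. `n_components_for_variance` is a hypothetical helper, not part of `functions.ML`.

```python
import numpy as np

def n_components_for_variance(var_ratio, threshold=0.95):
    """Smallest number of PCs whose cumulative explained variance reaches threshold."""
    cumulative = np.cumsum(var_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))   # clamp if the threshold is never reached

example_ratio = [0.65, 0.23, 0.08, 0.04]   # the example values used above
print(n_components_for_variance(example_ratio, 0.95))  # → 3
```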

Interpretation

Scores Plot

     PC2 (23%)
         ↑
    A    |    C
    A    |   CC
   AAA   | CCC
  ─────────────→ PC1 (65%)
     BBB|
      BB|
       B|

What it Shows:

  • Separation: Groups A, B, C are distinct

  • Distance: Similar samples cluster together

  • Outliers: Points far from main cluster

Interpretation:

  • Tight clusters: Homogeneous groups

  • Overlapping: Spectral similarity

  • Trend: Continuous variation (e.g., concentration)

Loadings Plot

# Positive peaks in PC1 loading
# → Increase PC1 score

# Negative peaks in PC1 loading
# → Decrease PC1 score

Example:

Loading PC1:
  ↑
  |     ___
  |    /   \    ← Important peak
  |___/     \___
  └─────────────→ Wavenumber
  
If Group A has high PC1 scores,
they have strong signal at this peak

Key Wavenumbers:

# Find most important peaks
top_indices = np.argsort(np.abs(loadings[:, 0]))[-10:]
important_peaks = wavenumbers[top_indices]

Explained Variance

Scree Plot:

Variance (%)
100 |●
 80 |  ●
 60 |    ●
 40 |      ●
 20 |        ●●●●●
  0 |________________
     1 2 3 4 5 6 7 8  PC

Rules of Thumb:

  • PC1 + PC2 > 70%: Good 2D representation

  • PC1 + PC2 < 50%: Consider 3D or UMAP

  • Elbow point: # PCs to retain (here: ~3-4)

Common Use Cases

1. Group Visualization

# 2D scatter plot
plt.scatter(scores[:, 0], scores[:, 1], c=labels)
plt.xlabel(f'PC1 ({var_ratio[0]:.1%})')
plt.ylabel(f'PC2 ({var_ratio[1]:.1%})')

2. Outlier Detection

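# Hotelling's T² statistic (the chi-square limit is a large-sample approximation)
import numpy as np
from scipy import stats

n_components = scores.shape[1]
T2 = np.sum((scores / np.std(scores, axis=0, ddof=1))**2, axis=1)
threshold = stats.chi2.ppf(0.95, df=n_components)
outliers = T2 > threshold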

3. Feature Selection

# Important wavenumbers from PC1
loading_weights = np.abs(loadings[:, 0])
top_features = np.argsort(loading_weights)[-20:]

4. Dimensionality Reduction for ML

# Reduce to 95% variance
pca = apply_pca(data, n_components=0.95)
reduced_data = pca['scores']
# Use reduced_data for classification

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Poor separation | Low variance in PC1-2 | Try more PCs, use UMAP |
| All groups overlap | No spectral differences | Check preprocessing |
| PC1 = baseline | Preprocessing issue | Better baseline correction |
| One outlier dominates | Extreme spectrum | Remove outlier, re-run |
| Scores look random | Data not standardized | Check normalization |

Assumptions

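  • Linear relationships: PCA finds linear combinations of features

  • Variance = importance: Assumes the directions of maximum variance are the most informative

  • Scale-sensitive: PCA does not rescale features; apply normalization first

  • Sign ambiguity: The sign of each PC is arbitrary; scores and loadings can flip together between runs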

When to Use

Use PCA when:

  • ✓ Visualizing high-dimensional data

  • ✓ Reducing dimensions for ML

  • ✓ Identifying important features

  • ✓ Quick exploratory analysis

  • ✓ Data roughly linear

Consider alternatives when:

  • ✗ Non-linear structure (use UMAP/t-SNE)

  • ✗ Preserving distances (use MDS)

  • ✗ Only want clustering (use K-means directly)

Advanced Options

Whitening

# Normalize PC variances (useful before clustering)
pca_result = apply_pca(data, whiten=True)
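To show what whitening does to the scores, here is a small NumPy sketch on synthetic data. It assumes the usual definition (each score column divided by its standard deviation); whether `apply_pca` implements it exactly this way is an assumption.

```python
import numpy as np

# Hypothetical data with very different variances per feature
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) * np.array([10.0, 4.0, 2.0, 1.0, 0.5])
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 3
scores = X_centered @ Vt[:k].T
# Standard deviation of the i-th score column is s[i] / sqrt(n - 1)
whitened = scores / (s[:k] / np.sqrt(X.shape[0] - 1))

print(scores.std(axis=0, ddof=1))    # decreasing spread, PC1 dominates
print(whitened.std(axis=0, ddof=1))  # all columns have unit spread
```

After whitening, every retained PC contributes equally to Euclidean distances, which is why it can matter for distance-based clustering; it also amplifies low-variance (possibly noisy) components.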

Incremental PCA

# For very large datasets
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=2)
ipca.partial_fit(data_batch_1)
ipca.partial_fit(data_batch_2)
scores = ipca.transform(data)

Reference

Jolliffe & Cadima (2016). “Principal component analysis: a review and recent developments”


MCR-ALS

Purpose: Spectral unmixing / component extraction from mixtures.

MCR-ALS (Multivariate Curve Resolution – Alternating Least Squares) aims to decompose a data matrix \(X\) into concentrations \(C\) and component spectra \(S\):

\[ X \approx C S^T \]

When to use:

  • ✓ Each measured spectrum is a mixture of a small number of underlying “pure” components

  • ✓ You want interpretable component spectra and relative contributions

Typical constraints:

  • Non-negativity on \(C\) and/or \(S\)

  • Normalization or closure constraints (depending on experiment)

Practical notes:

  • Sensitive to initialization; try multiple starts.

  • Preprocessing (baseline correction, normalization) usually improves results.
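The alternating scheme can be sketched as follows. This is a simplified illustration, not a production MCR-ALS: non-negativity is enforced by clipping after each unconstrained least-squares step (a common shortcut, not true non-negative least squares), there is no closure constraint or convergence check, and `mcr_als` plus the synthetic data are hypothetical.

```python
import numpy as np

def mcr_als(X, n_components, n_iter=200, seed=0):
    """Clipped alternating least squares for X ≈ C @ S.T with C, S >= 0."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    S = rng.random((n_features, n_components))        # random non-negative start
    for _ in range(n_iter):
        # Fix S, solve X.T ≈ S @ C.T for C, then clip negatives to zero
        C = np.linalg.lstsq(S, X.T, rcond=None)[0].T
        C = np.clip(C, 0, None)
        # Fix C, solve X ≈ C @ S.T for S, then clip negatives to zero
        S = np.linalg.lstsq(C, X, rcond=None)[0].T
        S = np.clip(S, 0, None)
    return C, S

# Synthetic mixtures of two Gaussian "pure spectra" (hypothetical data)
x = np.linspace(0, 1, 100)
S_true = np.stack([np.exp(-((x - 0.3) / 0.05)**2),
                   np.exp(-((x - 0.7) / 0.05)**2)], axis=1)   # (100, 2)
C_true = np.random.default_rng(1).random((20, 2))             # (20, 2)
X_demo = C_true @ S_true.T

C_est, S_est = mcr_als(X_demo, n_components=2)
rel_error = np.linalg.norm(X_demo - C_est @ S_est.T) / np.linalg.norm(X_demo)
print(f"Relative reconstruction error: {rel_error:.4f}")
```

Because the result depends on the random start, running with several `seed` values and comparing the recovered spectra is one way to act on the "try multiple starts" advice.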


UMAP (Uniform Manifold Approximation and Projection)

Purpose: Non-linear dimensionality reduction preserving local structure and much of the global structure

Theory

UMAP constructs a graph in the high-dimensional space, then optimizes a low-dimensional representation of it:

  1. Build k-nearest neighbor graph in high-dimensional space

  2. Compute fuzzy simplicial complex (topological structure)

  3. Optimize low-dimensional layout preserving topology

Key Difference from PCA:

  • PCA: Linear projection, preserves variance

  • UMAP: Non-linear, preserves topology (neighbors stay neighbors)
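Step 1 (the k-nearest neighbor graph) can be illustrated with a brute-force NumPy sketch. It covers only the graph construction, not the fuzzy simplicial set or the layout optimization, and real UMAP implementations typically use approximate nearest-neighbor search instead. `knn_graph` is a hypothetical helper.

```python
import numpy as np

def knn_graph(X, n_neighbors=15):
    """Brute-force k-nearest neighbors: returns (indices, distances) per sample."""
    # Pairwise Euclidean distances; O(n^2) memory, fine for small datasets only
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                    # a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    d = np.take_along_axis(dist, idx, axis=1)
    return idx, d

# Four points on a line (hypothetical data)
X_line = np.array([[0.0], [1.0], [2.5], [10.0]])
idx, d = knn_graph(X_line, n_neighbors=1)
print(idx.ravel())  # → [1 0 1 2]
print(d.ravel())    # → [1.  1.  1.5 7.5]
```

A small `n_neighbors` makes each point see only its immediate surroundings, which is why low values emphasize local structure.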

Parameters

| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| n_neighbors | int | 2 - 200 | 15 | # neighbors for graph |
| min_dist | float | 0.0 - 0.99 | 0.1 | Minimum distance in embedding |
| n_components | int | 2 - 3 | 2 | Output dimensions |
| metric | str | - | 'euclidean' | Distance metric |
| random_state | int | - | 42 | Random seed |

Parameter Guide:

# Local structure (tight clusters)
n_neighbors = 5    # typical range: 5-10
min_dist = 0.0

# Balanced (recommended)
n_neighbors = 15
min_dist = 0.1

# Global structure (broader view)
n_neighbors = 50   # typical range: 50-100
min_dist = 0.5

n_neighbors:

  • Low (5-10): Focus on local structure, tight clusters

  • Medium (15-30): Balanced local/global

  • High (50-200): Focus on global structure, looser clusters

min_dist:

  • 0.0: Densest packing, tight clusters

  • 0.1: Default, good separation

  • 0.5+: Spread out, overview

Usage Example

from functions.ML import apply_umap

# Apply UMAP
umap_result = apply_umap(
    data=preprocessed_spectra,
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    labels=group_labels
)

# Access results
embedding = umap_result['embedding']  # (n_samples, 2)

PCA vs UMAP Comparison

| Aspect | PCA | UMAP |
|---|---|---|
| Type | Linear | Non-linear |
| Speed | Fast | Slower |
| Preserves | Variance | Topology |
| Global structure | Good | Good (depends on n_neighbors) |
| Local structure | Poor | Excellent |
| Deterministic | Yes | No (use random_state) |
| Interpretability | High (loadings) | Low (no loadings) |
| Outliers | Sensitive | Robust |

Decision Guide:

Use PCA when:
- Need feature importance (loadings)
- Want speed
- Linear relationships
- Interpretability critical

Use UMAP when:
- PCA shows overlap
- Need better separation
- Non-linear structure
- Visualization priority

Interpretation

UMAP Embedding:

     Dim 2
         ↑
    A    |    C
    AA   |   CCC
   AAA   | CC
  ─────────────→ Dim 1
     BB |
     BBB|
       B|

What to Look For:

  • Clusters: Distinct groups

  • Distance: Relative, not absolute

  • Shape: Cluster density and spread

  • Bridges: Transitional samples

Warnings:

  • ⚠️ Distances not quantitative: UMAP preserves topology, not exact distances

  • ⚠️ Cluster size misleading: Doesn’t reflect true cluster variance

  • ⚠️ Different seeds → different layouts: Use random_state for reproducibility

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Too many clusters | n_neighbors too low | Increase to 30-50 |
| All points together | n_neighbors too high | Decrease to 10-15 |
| Unclear structure | min_dist too high | Reduce to 0.05-0.1 |
| Overlapping groups | Inherent similarity | Try different preprocessing |
| Different results | Random seed | Set random_state=42 |

When to Use

Use UMAP when:

  • ✓ PCA shows poor separation

  • ✓ Suspect non-linear structure

  • ✓ Want beautiful visualizations

  • ✓ Exploring complex datasets

Use PCA when:

  • ✓ Need interpretable features

  • ✓ Speed critical

  • ✓ Publishing quantitative results

Reference

McInnes et al. (2018). “UMAP: Uniform Manifold Approximation and Projection”


t-SNE (t-Distributed Stochastic Neighbor Embedding)

Purpose: Non-linear dimensionality reduction emphasizing local structure

Theory

t-SNE preserves pairwise similarities between points:

  1. Compute pairwise probabilities in high dimensions (Gaussian)

  2. Compute pairwise probabilities in low dimensions (t-distribution)

  3. Minimize KL divergence between probability distributions
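To make step 1 and the role of perplexity concrete, the sketch below computes the Gaussian conditional probabilities for a single point, using a bisection search for the kernel width that matches a target perplexity. It is an illustration only (no symmetrization, no low-dimensional optimization); `conditional_probabilities` is a hypothetical helper.

```python
import numpy as np

def conditional_probabilities(sq_dists_i, perplexity, n_steps=50):
    """P(j|i) for one point i, given squared distances from i to all other points."""
    target_entropy = np.log(perplexity)      # perplexity = exp(entropy in nats)
    beta_lo, beta_hi = 0.0, np.inf           # beta = 1 / (2 * sigma^2)
    beta = 1.0
    for _ in range(n_steps):
        # Shift by the minimum distance for numerical stability
        p = np.exp(-beta * (sq_dists_i - sq_dists_i.min()))
        p /= p.sum()
        entropy = -np.sum(p * np.log(p + 1e-12))
        if entropy > target_entropy:         # too flat: narrow the Gaussian
            beta_lo = beta
            beta = beta * 2 if np.isinf(beta_hi) else (beta + beta_hi) / 2
        else:                                # too peaked: widen the Gaussian
            beta_hi = beta
            beta = (beta + beta_lo) / 2
    return p

# Hypothetical data: 30 random points in 5 dimensions
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
sq = ((X_demo[0] - X_demo[1:])**2).sum(axis=1)   # distances from point 0 to the rest
p = conditional_probabilities(sq, perplexity=10.0)
effective = np.exp(-np.sum(p * np.log(p + 1e-12)))
print(f"Effective number of neighbors: {effective:.2f}")
```

The effective number of neighbors ends up close to the requested perplexity, which is why perplexity is often read as "how many neighbors each point pays attention to".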

vs UMAP: t-SNE focuses more on local structure, UMAP balances local/global

Parameters

| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| perplexity | float | 5 - 50 | 30 | # effective neighbors |
| n_iter | int | 250 - 5000 | 1000 | Optimization iterations |
| learning_rate | float | 10 - 1000 | 200 | Step size |
| early_exaggeration | float | 1 - 20 | 12 | Initial separation |
| random_state | int | - | 42 | Random seed |

Parameter Guide:

# Small dataset (n < 100)
perplexity = 5     # typical range: 5-10
n_iter = 1000

# Medium dataset (n = 100-1000)
perplexity = 30
n_iter = 1000      # typical range: 1000-2000

# Large dataset (n > 1000)
perplexity = 50
n_iter = 2000      # typical range: 2000-5000

Perplexity:

  • Rule of thumb: 5 < perplexity < n_samples/3

  • Low (5-10): Focus on local clusters

  • Medium (30): Balanced

  • High (50+): Focus on global structure

Usage Example

from functions.ML import apply_tsne

# Apply t-SNE
tsne_result = apply_tsne(
    data=preprocessed_spectra,
    perplexity=30,
    n_iter=1000,
    labels=group_labels
)

# Access results
embedding = tsne_result['embedding']  # (n_samples, 2)

Interpretation

t-SNE Output:

Similar to UMAP, but:
- Even tighter clusters
- Less emphasis on global distances
- More sensitive to perplexity

Key Points:

  • ⚠️ Cluster sizes meaningless: Don’t interpret relative sizes

  • ⚠️ Distances not quantitative: Within-cluster distances OK, between-cluster not

  • ⚠️ Slow: Much slower than PCA/UMAP for large datasets

UMAP vs t-SNE

| Aspect | UMAP | t-SNE |
|---|---|---|
| Speed | Fast | Slow |
| Global structure | Better | Worse |
| Local structure | Good | Excellent |
| Scalability | 100k+ samples | <10k samples |
| Reproducibility | Better | Worse |
| General use | Preferred | Specialized |

Recommendation: Use UMAP unless you specifically need t-SNE’s extreme local focus

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Blob without structure | Perplexity too high | Reduce perplexity |
| Many tiny clusters | Perplexity too low | Increase perplexity |
| Not converged | n_iter too low | Increase to 2000+ |
| Different results | Random seed | Set random_state |
| Very slow | Large dataset | Use UMAP instead |

When to Use

Use t-SNE when:

  • ✓ Small-medium datasets (< 5000 samples)

  • ✓ Need extreme local structure emphasis

  • ✓ Publication requires it (legacy)

Use UMAP instead when:

  • ✓ Large datasets

  • ✓ Need reproducibility

  • ✓ Want global structure too

  • ✓ Speed matters

Reference

van der Maaten & Hinton (2008). “Visualizing Data using t-SNE”


Hierarchical Clustering

Purpose: Create tree of nested clusters (dendrogram)

Theory

Agglomerative (bottom-up):

  1. Start: Each point is a cluster

  2. Repeat: Merge closest clusters

  3. Stop: All points in one cluster

Result: Dendrogram showing merge history
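The merge loop can be sketched naively in NumPy. This is an O(n³) illustration for small data, assuming Euclidean distance and supporting only single, complete and average linkage (Ward is omitted); it stops at `n_clusters` instead of recording the full merge history, which corresponds to cutting the dendrogram at that level. `agglomerative` is a hypothetical helper, not the project's implementation.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns one integer label per sample."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=-1))
    clusters = [[i] for i in range(len(X))]           # start: each point is a cluster
    reduce_fn = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Inter-cluster distance according to the chosen linkage
                d = reduce_fn(dist[np.ix_(clusters[a], clusters[b])])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]       # merge the closest pair
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Two well-separated groups of three points (hypothetical data)
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = agglomerative(X_demo, n_clusters=2, linkage="average")
print(labels)
```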

Parameters

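| Parameter | Type | Options | Default | Description |
|---|---|---|---|---|
| linkage | str | ward, complete, average, single | 'ward' | Linkage criterion (how the distance between clusters is computed) |
| metric | str | euclidean, cosine, correlation | 'euclidean' | Distance function between samples (Ward requires euclidean) |
| n_clusters | int | 2 - 20 | None | Cut tree to get clusters |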

Linkage Methods:

| Method | Distance Between Clusters | Use Case |
|---|---|---|
| Ward | Minimizes within-cluster variance | Recommended for most |
| Complete | Maximum distance | Compact clusters |
| Average | Average distance | Balanced |
| Single | Minimum distance | Can find elongated clusters |

Recommendation: Use Ward with Euclidean distance

Usage Example

from functions.ML import apply_hierarchical_clustering

# Apply hierarchical clustering
hc_result = apply_hierarchical_clustering(
    data=preprocessed_spectra,
    linkage='ward',
    metric='euclidean',
    n_clusters=3
)

# Access results
clusters = hc_result['clusters']  # Cluster assignments
linkage_matrix = hc_result['linkage']  # For dendrogram

Dendrogram Interpretation

Height
  |
 10|          ┌─┐
  |      ┌───┤ ├───┐
  5|    ┌─┤   └─┘   ├─┐
  |  ┌─┤ └─┐     ┌─┘ ├─┐
  0|  └─┘   └─────┘   └─┘
     A₁ A₂  A₃ A₄   B₁ B₂

How to Read:

  • Horizontal lines: Merges between clusters

  • Height: Distance at merge

  • Vertical lines: Branches; a long branch means the cluster stayed separate over a wide range of distances

  • Cut horizontally: Define clusters

Cutting the Tree:

# Cut at height = 7
# → 2 clusters: [A₁,A₂,A₃,A₄] and [B₁,B₂]

# Cut at height = 3
# → 3 clusters

Cophenetic Distance:

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

c, coph_dists = cophenet(linkage_matrix, pdist(data))
# c > 0.7: Good representation
# c < 0.5: Poor representation

Visualization

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(
    linkage_matrix,
    labels=sample_names,
    leaf_rotation=90,
    leaf_font_size=8
)
plt.xlabel('Sample')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Dendrogram')
plt.tight_layout()
plt.show()

Choosing Number of Clusters

Method 1: Visual Inspection

  • Look for large height jumps in dendrogram

  • Cut before major merge

Method 2: Elbow Method

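import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster

distances = []
for k in range(2, 11):
    clusters = fcluster(linkage_matrix, k, criterion='maxclust')
    # Within-cluster sum of squared distances to each cluster mean
    dist = sum(
        ((data[clusters == c] - data[clusters == c].mean(axis=0))**2).sum()
        for c in np.unique(clusters)
    )
    distances.append(dist)

# Plot and find elbow
plt.plot(range(2, 11), distances)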

Method 3: Silhouette Score

import numpy as np
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    clusters = fcluster(linkage_matrix, k, criterion='maxclust')
    score = silhouette_score(data, clusters)
    scores.append(score)

# Choose k with highest score
best_k = np.argmax(scores) + 2

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Unbalanced clusters | Single linkage | Use Ward linkage |
| Unclear structure | Wrong distance metric | Try different metrics |
| Chains instead of clusters | Single linkage | Use Ward/Complete |
| Dendrogram too complex | Too many samples | Truncate or subset |

When to Use

Use Hierarchical Clustering when:

  • ✓ Want to explore cluster structure

  • ✓ Don’t know # clusters

  • ✓ Need hierarchical relationships

  • ✓ Small-medium datasets (< 5000 samples)

Use K-Means when:

  • ✓ Know # clusters

  • ✓ Large datasets

  • ✓ Speed critical

Reference

Müllner (2013). “fastcluster: Fast Hierarchical, Agglomerative Clustering”


K-Means Clustering

Purpose: Partition data into K non-overlapping clusters

Theory

Algorithm:

  1. Initialize K centroids randomly

  2. Assign each point to nearest centroid

  3. Update centroids to cluster means

  4. Repeat 2-3 until convergence

Objective: Minimize within-cluster sum of squares (WCSS)
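The four steps can be sketched directly in NumPy. This is a minimal version of Lloyd's algorithm with plain random initialization (no k-means++ and no `n_init` restarts); `kmeans_lloyd` is a hypothetical name and not the project's `apply_kmeans`.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :])**2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    wcss = ((X - centroids[labels])**2).sum()     # within-cluster sum of squares
    return labels, centroids, wcss

# Two obvious groups on a line (hypothetical data)
X_demo = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels, centroids, wcss = kmeans_lloyd(X_demo, k=2)
print(labels, centroids.ravel(), wcss)
```

On this toy data the two groups are recovered with centroids at 1 and 11 and a WCSS of 4; which group receives label 0 depends on the random initialization.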

Parameters

| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| n_clusters | int | 2 - 20 | 3 | Number of clusters |
| init | str | 'k-means++', 'random' | 'k-means++' | Initialization method |
| n_init | int | 10 - 100 | 10 | # random initializations |
| max_iter | int | 100 - 1000 | 300 | Maximum iterations |
| random_state | int | - | 42 | Random seed |

Recommendation: Use k-means++ initialization (smarter than random)

Usage Example

from functions.ML import apply_kmeans

# Apply K-means
kmeans_result = apply_kmeans(
    data=preprocessed_spectra,
    n_clusters=3,
    init='k-means++',
    n_init=10
)

# Access results
clusters = kmeans_result['clusters']  # Cluster assignments
centroids = kmeans_result['centroids']  # Cluster centers
inertia = kmeans_result['inertia']  # WCSS

Choosing Number of Clusters (K)

Method 1: Elbow Method

inertias = []
K_range = range(2, 11)

for k in K_range:
    result = apply_kmeans(data, n_clusters=k)
    inertias.append(result['inertia'])

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.show()

# Look for "elbow" point

Elbow Plot:

Inertia
  |●
  | ●
  |  ●
  |   ●___
  |       ●___●___●
  └─────────────────
   2 3 4 5 6 7 8  K
       ↑
     Elbow (K=4)

Method 2: Silhouette Score

from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):
    result = apply_kmeans(data, n_clusters=k)
    score = silhouette_score(data, result['clusters'])
    silhouette_scores.append(score)

# Choose K with highest score
best_k = np.argmax(silhouette_scores) + 2

Silhouette Score:

  • Range: [-1, 1]

  • > 0.7: Strong structure

  • 0.5 - 0.7: Reasonable structure

  • < 0.5: Weak structure, try different K

Method 3: Gap Statistic

import numpy as np

def gap_statistic(data, K_max=10, n_references=10, seed=42):
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)  # per-feature bounding box
    gaps = []
    for k in range(1, K_max+1):
        # Cluster actual data
        result = apply_kmeans(data, n_clusters=k)
        actual_wcss = result['inertia']

        # Cluster uniform reference data drawn from the bounding box
        reference_log_wcss = []
        for _ in range(n_references):
            reference = rng.uniform(lo, hi, size=data.shape)
            ref_result = apply_kmeans(reference, n_clusters=k)
            reference_log_wcss.append(np.log(ref_result['inertia']))

        # Gap = E[log(WCSS_ref)] - log(WCSS_actual)
        gap = np.mean(reference_log_wcss) - np.log(actual_wcss)
        gaps.append(gap)

    return gaps

# Choose the smallest K after which the gap stops increasing
# (Tibshirani et al. use Gap(K) >= Gap(K+1) - s(K+1), with s the reference standard error)

Cluster Validation

Silhouette Analysis

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_samples

# Compute silhouette scores for each sample
silhouette_vals = silhouette_samples(data, clusters)
n_clusters = len(np.unique(clusters))  # assumes labels 0..n_clusters-1

# Plot silhouette for each cluster
fig, ax = plt.subplots()
y_lower = 10

for i in range(n_clusters):
    cluster_silhouette = silhouette_vals[clusters == i]
    cluster_silhouette.sort()
    
    size = cluster_silhouette.shape[0]
    y_upper = y_lower + size
    
    ax.fill_betweenx(
        np.arange(y_lower, y_upper),
        0, cluster_silhouette,
        alpha=0.7
    )
    y_lower = y_upper + 10

ax.axvline(silhouette_vals.mean(), color="red", linestyle="--")  # average silhouette of this clustering
ax.set_xlabel("Silhouette Coefficient")
ax.set_ylabel("Cluster")

Good Clustering:

  • All clusters above average line

  • Similar widths (balanced sizes)

  • All positive values

Poor Clustering:

  • Clusters below average

  • Very different widths

  • Negative values (misassigned points)

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Empty clusters | K too high | Reduce K |
| Unbalanced sizes | Poor initialization | Use k-means++ |
| Different results | Random init | Set random_state, increase n_init |
| Not converging | max_iter too low | Increase max_iter |
| Unexpected clusters | Need normalization | Apply vector norm |

Assumptions and Limitations

Assumes:

  • ✓ Clusters are spherical (isotropic)

  • ✓ Clusters have similar sizes

  • ✓ Clusters are equally dense

Limitations:

  • ✗ Must specify K in advance

  • ✗ Sensitive to outliers

  • ✗ Can’t find non-convex clusters

  • ✗ Random initialization (use n_init=10+)

When to Use

Use K-Means when:

  • ✓ Know or can estimate K

  • ✓ Clusters roughly spherical

  • ✓ Large datasets (fast algorithm)

  • ✓ Need deterministic results (set random_state)

Use Hierarchical when:

  • ✓ Don’t know K

  • ✓ Want dendrogram

  • ✓ Small-medium datasets

Use DBSCAN when:

  • ✓ Arbitrary cluster shapes

  • ✓ Noise points present

  • ✓ Unknown K

Reference

Lloyd (1982). “Least squares quantization in PCM”


DBSCAN (Density-Based Spatial Clustering)

Purpose: Find arbitrarily-shaped clusters based on density

Theory

Concepts:

  • Core point: Has ≥ min_samples points within eps (scikit-learn counts the point itself)

  • Border point: Not a core point, but within eps of one

  • Noise point: Neither core nor border

Algorithm:

  1. Find all core points

  2. Connect core points within eps

  3. Assign border points to nearby clusters

  4. Mark remaining points as noise (-1)
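The algorithm above can be sketched with brute-force distances in NumPy. This is an illustration for small data under two assumptions: Euclidean distance, and neighbor counts that include the point itself. `dbscan` here is a hypothetical function, not the project's `apply_dbscan`.

```python
import numpy as np

def dbscan(X, eps=0.5, min_samples=5):
    """Minimal DBSCAN; returns labels with -1 for noise."""
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=-1))
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]   # includes i itself
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])  # step 1
    labels = np.full(n, -1)                       # everything starts as noise (step 4)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from this unassigned core point
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster           # core or border point joins (steps 2-3)
                    if is_core[q]:
                        stack.append(q)           # only core points keep expanding
        cluster += 1
    return labels

# Two dense groups and one isolated point (hypothetical data)
X_demo = np.array([[0.0], [0.3], [0.6], [5.0], [5.3], [5.6], [20.0]])
labels = dbscan(X_demo, eps=0.5, min_samples=2)
print(labels)  # → [ 0  0  0  1  1  1 -1]
```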

Parameters

| Parameter | Type | Range | Default | Description |
|---|---|---|---|---|
| eps | float | 0.1 - 10.0 | 0.5 | Maximum distance for neighborhood |
| min_samples | int | 3 - 20 | 5 | Minimum points for core point |
| metric | str | euclidean, etc. | 'euclidean' | Distance metric |

Parameter Guide:

eps (epsilon):

# Determine eps using k-distance graph
from sklearn.neighbors import NearestNeighbors

neighbors = NearestNeighbors(n_neighbors=min_samples)
neighbors.fit(data)
distances, indices = neighbors.kneighbors(data)

# Sort and plot distances to min_samples-th neighbor
sorted_distances = np.sort(distances[:, -1])
plt.plot(sorted_distances)
plt.ylabel(f'Distance to {min_samples}-th Neighbor')
plt.xlabel('Points (sorted)')

# eps = value at "elbow" of curve

min_samples:

  • Rule of thumb: 2 × n_dimensions

  • 3-5: Detect finer clusters

  • 10-20: More robust to noise

Usage Example

from functions.ML import apply_dbscan

# Apply DBSCAN
dbscan_result = apply_dbscan(
    data=preprocessed_spectra,
    eps=0.5,
    min_samples=5
)

# Access results
clusters = dbscan_result['clusters']  # -1 = noise
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = list(clusters).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")

Advantages

  • No need to specify K: Finds natural clusters

  • Handles arbitrary shapes: Not limited to spherical

  • Identifies outliers: Noise points marked as -1

  • Deterministic: Same core and noise points every run (border-point assignment can depend on data order)

Disadvantages

  • Sensitive to parameters: eps and min_samples critical

  • Varying densities: Struggles with clusters of different densities

  • High dimensions: Distance-based, suffers curse of dimensionality

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| One giant cluster | eps too large | Reduce eps |
| All points are noise | eps too small | Increase eps |
| Too many clusters | min_samples too low | Increase min_samples |
| Undetected clusters | min_samples too high | Reduce min_samples |

When to Use

Use DBSCAN when:

  • ✓ Don’t know number of clusters

  • ✓ Clusters have arbitrary shapes

  • ✓ Outliers present

  • ✓ Clusters vary in size (but not density)

Use K-Means when:

  • ✓ Spherical clusters

  • ✓ Know K

  • ✓ All points should be clustered

Reference

Ester et al. (1996). “A density-based algorithm for discovering clusters”


Method Comparison

Quick Reference Table

| Method | Type | Speed | Strengths | Limitations | Best For |
|---|---|---|---|---|---|
| PCA | Linear DR | ⚡⚡⚡ | Fast, interpretable, feature importance | Linear only | Quick exploration, ML preprocessing |
| UMAP | Non-linear DR | ⚡⚡ | Preserves global+local, beautiful plots | No feature importance | Complex data visualization |
| t-SNE | Non-linear DR | ⚡ | Excellent local structure | Slow, weak global structure | Small datasets, local patterns |
| Hierarchical | Clustering | ⚡⚡ | Dendrogram, no K needed | Slow for large data | Exploratory, unknown K |
| K-Means | Clustering | ⚡⚡⚡ | Fast, simple | Need K, spherical clusters | Known K, large datasets |
| DBSCAN | Clustering | ⚡⚡ | Arbitrary shapes, finds outliers | Sensitive to parameters | Outliers, complex shapes |

Decision Tree

START: What's your goal?
│
├─ Visualization
│  │
│  ├─ Quick overview → PCA (2D/3D)
│  │
│  ├─ Poor PCA separation → UMAP
│  │
│  └─ Extreme local focus → t-SNE
│
├─ Clustering
│  │
│  ├─ Know # clusters → K-Means
│  │
│  ├─ Don't know # clusters → Hierarchical (dendrogram)
│  │
│  └─ Arbitrary shapes + outliers → DBSCAN
│
└─ Feature reduction for ML
   │
   ├─ Linear data → PCA (keep loadings)
   │
   └─ Non-linear data → UMAP → then ML

Typical Workflow

1. Initial Exploration:

# Start with PCA (fast)
pca_result = apply_pca(data, n_components=2)
plot_pca_scores(pca_result)

# Check explained variance
if pca_result['explained_variance_ratio'][:2].sum() < 0.5:
    # PC1 + PC2 explain less than 50% of the variance, try UMAP
    umap_result = apply_umap(data, n_neighbors=15)

2. Clustering Investigation:

# Try hierarchical first (explore # clusters)
hc_result = apply_hierarchical_clustering(data, linkage='ward')
plot_dendrogram(hc_result)

# Identify optimal K from dendrogram
optimal_k = 3  # from visual inspection

# Apply K-means with optimal K
kmeans_result = apply_kmeans(data, n_clusters=optimal_k)

3. Validation:

# Validate clustering quality
from sklearn.metrics import silhouette_score, davies_bouldin_score

sil_score = silhouette_score(data, clusters)
db_score = davies_bouldin_score(data, clusters)

print(f"Silhouette: {sil_score:.3f} (higher better)")
print(f"Davies-Bouldin: {db_score:.3f} (lower better)")

Validation Metrics

Silhouette Score

Formula:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where:
a(i) = avg distance to points in same cluster
b(i) = avg distance to points in nearest different cluster
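The formula can be checked by hand with a small NumPy sketch. It assumes Euclidean distance and uses the common convention that a point alone in its cluster gets s(i) = 0; `silhouette_values` is a hypothetical helper that mirrors the definition rather than any library's internals.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s(i) = (b - a) / max(a, b) with Euclidean distance."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=-1))
    unique = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                                  # singleton cluster: s(i) = 0
        a = dist[i, same].sum() / (n_same - 1)        # mean distance to own cluster, excluding i
        b = min(dist[i, labels == c].mean()           # mean distance to the nearest other cluster
                for c in unique if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Two tight pairs far apart (hypothetical data)
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 1, 1])
s = silhouette_values(X_demo, labels_demo)
print(s)          # first point: a = 1, b = 10.5, so s = 9.5 / 10.5
print(s.mean())   # overall silhouette score
```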

Interpretation:

  • +1: Perfect clustering

  • 0: On cluster boundary

  • -1: Misassigned

Usage:

from sklearn.metrics import silhouette_score
score = silhouette_score(data, clusters)

Davies-Bouldin Index

Measures: Ratio of within-cluster to between-cluster distances

Interpretation:

  • Lower is better

  • 0: Perfect clustering
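A NumPy sketch of the index, assuming the usual definition: for each cluster, take the worst-case ratio of summed within-cluster scatter to centroid separation, then average over clusters. `davies_bouldin` is a hypothetical helper written for illustration.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index with Euclidean distances (lower is better)."""
    unique = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in unique])
    # Scatter: mean distance of each cluster's points to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(unique)
    ])
    centroid_dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(centroid_dist, np.inf)           # exclude comparing a cluster with itself
    ratios = (scatter[:, None] + scatter[None, :]) / centroid_dist
    return ratios.max(axis=1).mean()

# Two clusters with scatter 1 each and centroids 10 apart (hypothetical data)
X_demo = np.array([[0.0], [2.0], [10.0], [12.0]])
labels_demo = np.array([0, 0, 1, 1])
print(davies_bouldin(X_demo, labels_demo))  # → 0.2
```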

Usage:

from sklearn.metrics import davies_bouldin_score
score = davies_bouldin_score(data, clusters)

Calinski-Harabasz Index

Measures: Ratio of between-cluster to within-cluster variance

Interpretation:

  • Higher is better
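For comparison, a NumPy sketch of this ratio under the standard definition, with between-cluster dispersion divided by (k - 1) and within-cluster dispersion divided by (n - k). `calinski_harabasz` is a hypothetical helper written for illustration.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index (higher is better)."""
    unique = np.unique(labels)
    n, k = len(X), len(unique)
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for c in unique:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean)**2)
        within += np.sum((members - centroid)**2)
    return (between / (k - 1)) / (within / (n - k))

# Same toy data as a hand check: between = 100, within = 4, n = 4, k = 2
X_demo = np.array([[0.0], [2.0], [10.0], [12.0]])
labels_demo = np.array([0, 0, 1, 1])
print(calinski_harabasz(X_demo, labels_demo))  # → 50.0
```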

Usage:

from sklearn.metrics import calinski_harabasz_score
score = calinski_harabasz_score(data, clusters)

Best Practices

General Guidelines

  1. Preprocessing First:

    # Always preprocess before analysis
    data = apply_baseline_correction(raw_data)
    data = apply_smoothing(data)
    data = apply_normalization(data)
    
  2. Try Multiple Methods:

    # Compare different approaches
    pca_result = apply_pca(data)
    umap_result = apply_umap(data)
    # Choose based on results
    
  3. Validate Results:

    # Check quality metrics
    silhouette = silhouette_score(data, clusters)
    if silhouette < 0.5:
        print("Warning: Poor clustering quality")
    
  4. Use Domain Knowledge:

    • Does clustering match expected groups?

    • Are separated groups biologically meaningful?

    • Check against known standards

Reproducibility

# Always set random state
pca_result = apply_pca(data, random_state=42)
umap_result = apply_umap(data, random_state=42)
kmeans_result = apply_kmeans(data, random_state=42, n_init=10)


Last Updated: 2026-01-24