Data Import Guide

Complete guide to importing, organizing, and managing spectral data in the application.

Table of Contents

Supported File Formats
Import Workflow
Data Organization
Group Management
Data Validation
Advanced Features
Best Practices
Troubleshooting

Supported File Formats

Primary Formats

CSV Files (Recommended)

Format: Comma-Separated Values

Structure:

Wavenumber,Sample1,Sample2,Sample3
400.0,100.5,98.3,102.1
401.0,101.2,99.1,103.5
402.0,102.8,100.4,104.2
...

Requirements:

First column: Wavenumbers (numeric, ascending)
Subsequent columns: Intensity values for each spectrum
Header row: Sample identifiers (optional but recommended)
Decimal separator: Period (.)
No missing values (use 0 or interpolate)

Example Import:

# File: blood_plasma_data.csv
# Columns: wavenumber, patient_001, patient_002, patient_003
# Rows: 1000+ wavenumber points
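The requirements above can be checked before a file is handed to the application. The following is a minimal sketch using pandas, not the application's own loader; the function name `load_spectra_csv` and the inline sample data are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical sample in the layout described above.
CSV_TEXT = """Wavenumber,Sample1,Sample2,Sample3
400.0,100.5,98.3,102.1
401.0,101.2,99.1,103.5
402.0,102.8,100.4,104.2
"""


def load_spectra_csv(source):
    """Read a spectra CSV and check the requirements listed above."""
    df = pd.read_csv(source)  # period is the default decimal separator
    # First column holds the wavenumber axis; the rest are spectra.
    wavenumbers = df.iloc[:, 0]
    if not pd.api.types.is_numeric_dtype(wavenumbers):
        raise ValueError("wavenumber column must be numeric")
    if not (wavenumbers.is_monotonic_increasing and wavenumbers.is_unique):
        raise ValueError("wavenumbers must be strictly ascending")
    if df.isna().any().any():
        raise ValueError("missing values found")
    return df.set_index(df.columns[0])


spectra = load_spectra_csv(io.StringIO(CSV_TEXT))
```

Passing a file path instead of the `StringIO` object works the same way, since `pd.read_csv` accepts both.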

TXT Files (Text Format)

Format: Tab or space-delimited

Structure:

400.0    100.5    98.3     102.1
401.0    101.2    99.1     103.5
402.0    102.8    100.4    104.2
...

Requirements:

Similar to CSV but using tabs or spaces
Optional header row
Consistent delimiter throughout file
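For a quick check outside the application, a headerless tab- or space-delimited file can be read with NumPy. This is a sketch under the assumption that the first column is the wavenumber axis; the sample text is made up for illustration.

```python
import io

import numpy as np

# Hypothetical headerless, whitespace-delimited sample.
TXT_TEXT = """400.0    100.5    98.3     102.1
401.0    101.2    99.1     103.5
402.0    102.8    100.4    104.2
"""

# With the default delimiter, loadtxt splits on any run of whitespace,
# so tabs and spaces are both accepted.
table = np.loadtxt(io.StringIO(TXT_TEXT))
wavenumbers = table[:, 0]
intensities = table[:, 1:]  # one column per spectrum
```

If the file has a header row, passing `skiprows=1` to `np.loadtxt` skips it.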

ASC/ASCII Files

Format: Text format containing two columns: wavenumber and intensity

Supported extensions: .asc, .ascii

PKL Files

Format: Pickled pandas DataFrame

Supported extension: .pkl

Future Import Support (Planned)

SPC: Galactic SPC binary format
WDF: Renishaw WiRE format

Import Workflow

Step 1: Navigate to Data Package Page

Open the application
Click 📦 Data Package tab
Ensure you’re in an active project (create new if needed)

Step 2: Select Files for Import

Method A: File Browser

Click [Import Data] button
File dialog opens
Navigate to your data directory
Select one or multiple files (CSV/TXT/ASC/PKL)
Click [Open]

Method B: Drag and Drop

Open file explorer (Windows Explorer, Finder)
Navigate to your data files
Drag files directly into the import area
Release to drop

Method C: Paste File Paths

Copy file path(s) from explorer
Click [Import from Path]
Paste paths (one per line for multiple files)
Click [Import]

Step 3: Data Validation

The application validates each file automatically. During import, a validation status panel (or toast notification) lists items such as:

File format (CSV/TXT/ASC/PKL)
Wavenumber column detected
Number of spectra (samples)
Wavenumber range (e.g., 400–1800 cm⁻¹)
Data integrity checks
Missing values handling (if enabled)

Visual reference: See the Data Package Page screenshot in interface-overview.md.

Validation Checks:

File format compatibility
Wavenumber column detection
Consistent wavenumber spacing
No duplicate wavenumbers
Numeric data types
Missing value handling
Outlier detection (optional)
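Several of these checks concern the wavenumber axis alone. The sketch below shows one way they could be expressed with NumPy; it is an illustration, not the application's validator, and the function name and the `rtol` tolerance are assumptions.

```python
import numpy as np


def check_wavenumber_axis(wavenumbers, rtol=1e-3):
    """Return a list of problems found in a wavenumber axis (empty if none)."""
    problems = []
    try:
        axis = np.asarray(wavenumbers, dtype=float)
    except (TypeError, ValueError):
        return ["non-numeric values"]
    if np.isnan(axis).any():
        problems.append("missing values")
    if np.unique(axis).size != axis.size:
        problems.append("duplicate wavenumbers")
    steps = np.diff(axis)
    # Spacing counts as consistent if every step is close to the median step.
    if steps.size and not np.allclose(steps, np.median(steps), rtol=rtol):
        problems.append("inconsistent spacing")
    return problems
```

Returning a list of findings rather than raising on the first one mirrors the status panel, which reports all checks together.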

Step 4: Preview and Confirm

The preview dialog shows:

Selected file name
Sample count
Wavenumber range
A preview plot (typically the first few spectra)
Import options (auto-detect wavenumber column, interpolate missing values, etc.)
Cancel / Import actions

Options:

Auto-detect wavenumber column: Automatically identify x-axis
Interpolate missing values: Fill gaps with linear interpolation
Apply baseline correction: Pre-process during import (optional)

Step 5: Confirmation

After import completes, a success notification confirms the number of spectra imported and the source file.

Data Organization

Project Structure

Data is organized hierarchically:

Project: blood_plasma_study/
├── Data Packages/
│   ├── batch1_healthy/
│   │   ├── healthy_001.csv
│   │   ├── healthy_002.csv
│   │   └── metadata.json
│   ├── batch2_disease/
│   │   ├── disease_001.csv
│   │   ├── disease_002.csv
│   │   └── metadata.json
│   └── batch3_validation/
│       └── validation_set.csv
├── Preprocessing Pipelines/
│   └── standard_pipeline.json
└── Results/
    ├── analysis/
    └── ml_models/

Creating Data Packages

A data package is a collection of related spectra.

Create New Package:

Click [+ New Package] in Data Package page
Enter package name: batch1_healthy
Add description (optional): “Healthy controls, batch 1”
Import files into this package

Benefits:

Organize by experimental batch
Group by sample type
Separate training/validation/test sets
Apply batch-specific processing

Metadata Management

Each data package can have metadata:

{
  "package_name": "batch1_healthy",
  "description": "Healthy control samples from first batch",
  "acquisition_date": "2025-12-15",
  "laser_power": 50,
  "integration_time": 10,
  "spectrometer": "RamanSpecPro 5000",
  "notes": "Room temperature, 785nm laser"
}

Edit Metadata:

Right-click on data package
Select Edit Metadata
Fill in fields
Click Save

Group Management

Creating Sample Groups

Groups are used for:

Classification labels
Statistical comparisons
Visualization colors
Cross-validation splits

Create Group:

Click [Manage Groups] button
Click [+ New Group]
Enter group details:
- Name: Healthy Control
- Label: 0 (numeric for ML)
- Color: 🟢 Green
- Description: “Healthy patients without disease”
Click Create

Common Group Naming:

For Classification:
- Healthy Control (label: 0)
- Disease Group A (label: 1)
- Disease Group B (label: 2)

For Regression:
- Low Concentration (value: 0-5)
- Medium Concentration (value: 5-10)
- High Concentration (value: 10-20)

Assigning Samples to Groups

Method A: Manual Selection

Select samples in data list (Ctrl+Click for multiple)
Right-click → Assign to Group
Select group from dropdown
Click Assign

Method B: Bulk Assignment

Click [Bulk Assign] button
Use pattern matching:
- Pattern: healthy_* → Group: Healthy Control
- Pattern: disease_* → Group: Disease
Preview assignments
Click Apply
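The pattern matching in Method B behaves like shell-style wildcards. A minimal sketch with the standard library follows; the rule list, the first-match-wins policy, and the function name are assumptions, and the application may resolve overlapping patterns differently.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: the first matching pattern wins.
RULES = [
    ("healthy_*", "Healthy Control"),
    ("disease_*", "Disease"),
]


def assign_groups(sample_names, rules=RULES):
    """Map each sample name to a group, or None when no pattern matches."""
    assignments = {}
    for name in sample_names:
        assignments[name] = next(
            (group for pattern, group in rules if fnmatchcase(name, pattern)),
            None,
        )
    return assignments


preview = assign_groups(["healthy_001", "disease_001", "blank_01"])
```

Samples mapped to `None` correspond to the unassigned entries worth checking in the preview before clicking Apply.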

Method C: CSV Mapping

Create a CSV file with sample-to-group mapping:

sample_name,group_label
healthy_001,Healthy Control
healthy_002,Healthy Control
disease_001,Disease
disease_002,Disease

Import:

Click [Import Group Mapping]
Select CSV file
Verify mappings
Click Apply
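Before importing a mapping file, it can help to confirm that every sample name in it exists in the dataset. The sketch below does this with the standard `csv` module; the function name and the set of known samples are hypothetical.

```python
import csv
import io

MAPPING_TEXT = """sample_name,group_label
healthy_001,Healthy Control
healthy_002,Healthy Control
disease_001,Disease
disease_002,Disease
"""


def read_group_mapping(handle, known_samples):
    """Return (mapping, unknown); unknown lists names not in the dataset."""
    mapping, unknown = {}, []
    for row in csv.DictReader(handle):
        name = row["sample_name"].strip()
        if name in known_samples:
            mapping[name] = row["group_label"].strip()
        else:
            unknown.append(name)
    return mapping, unknown


mapping, unknown = read_group_mapping(
    io.StringIO(MAPPING_TEXT),
    known_samples={"healthy_001", "healthy_002", "disease_001"},
)
```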

Multi-Group Assignment

Some samples may belong to multiple groups:

Example: Clinical study with multiple factors

Group 1: Disease Status (Healthy, Disease A, Disease B)
Group 2: Gender (Male, Female)
Group 3: Age Range (<30, 30-50, >50)

Enable:

Settings → Data Management → Allow Multiple Groups
Assign samples to multiple group hierarchies
Select active grouping for analysis

Data Validation

Automatic Checks

The application performs the following validation checks on import:

1. Wavenumber Consistency

Check: All spectra must have identical wavenumber axis

✓ All spectra: 400-1800 cm⁻¹, 1000 points
✗ Mismatch detected:
  - File 1: 400-1800 cm⁻¹
  - File 2: 500-1700 cm⁻¹ (different range)

Solution:

Interpolate to common grid
Crop to common range
Use “Align Wavenumbers” tool
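The first two solutions can be combined: crop to the overlapping range, then interpolate onto one grid. The following NumPy sketch illustrates the idea under the assumption of ascending axes and linear interpolation; it does not describe how the Align Wavenumbers tool is implemented, and the 1 cm⁻¹ default step is arbitrary.

```python
import numpy as np


def align_to_common_grid(spectra, step=1.0):
    """Crop to the overlapping range, then linearly resample every spectrum.

    `spectra` is a list of (wavenumbers, intensities) pairs with ascending axes.
    """
    low = max(wn[0] for wn, _ in spectra)
    high = min(wn[-1] for wn, _ in spectra)
    if low >= high:
        raise ValueError("spectra share no common wavenumber range")
    grid = np.arange(low, high + step / 2, step)
    aligned = np.vstack([np.interp(grid, wn, y) for wn, y in spectra])
    return grid, aligned


wn_a = np.linspace(400, 1800, 1401)
wn_b = np.linspace(500, 1700, 601)  # coarser axis, narrower range
grid, aligned = align_to_common_grid([(wn_a, wn_a * 2.0), (wn_b, wn_b * 2.0)])
```

Cropping first avoids extrapolation: `np.interp` would otherwise repeat the edge value outside a spectrum's measured range.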

2. Missing Values

Check: No NaN or infinite values

⚠ Missing values detected:
  - Spectrum 15: 3 NaN values at 1200-1202 cm⁻¹
  - Spectrum 47: 1 NaN value at 850 cm⁻¹

Solutions:

Linear interpolation (default)
Polynomial interpolation
Remove affected spectra
Manual correction
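Linear interpolation of missing points can be sketched with `np.interp`, as below. The function name is hypothetical and this is an illustration of the technique rather than the application's code.

```python
import numpy as np


def fill_missing(wavenumbers, intensities):
    """Replace NaN/inf intensities by linear interpolation over valid points."""
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(intensities, dtype=float).copy()
    bad = ~np.isfinite(y)
    if bad.all():
        raise ValueError("spectrum has no valid points to interpolate from")
    y[bad] = np.interp(x[bad], x[~bad], y[~bad])
    return y


filled = fill_missing(
    [1199, 1200, 1201, 1202, 1203],
    [10.0, np.nan, np.nan, np.nan, 18.0],
)
```

Gaps at either end of the spectrum are filled with the nearest valid value, because `np.interp` does not extrapolate.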

3. Outlier Detection

Check: Identify spectra with unusual intensity values

⚠ Potential outliers:
  - Spectrum 32: Intensity >10σ from mean
  - Spectrum 88: Negative intensity values

Solutions:

Flag for review (don’t remove yet)
Visual inspection (plot spectrum)
Remove if confirmed (after manual check)
Note in metadata (keep but annotate)
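The two example warnings suggest two simple rules: negative intensities, and a spectrum whose overall intensity lies far from the rest. The sketch below flags both without removing anything. Comparing each spectrum's mean against the other spectra (leave-one-out) is an assumption made here so that an extreme spectrum does not inflate its own σ; the application's actual criterion is not documented in this guide.

```python
import numpy as np


def flag_outliers(intensities, sigma=10.0):
    """Return {row_index: [reasons]} for a (n_spectra, n_points) array."""
    data = np.asarray(intensities, dtype=float)
    flags = {}
    # Rule 1: any negative intensity.
    for i in np.flatnonzero((data < 0).any(axis=1)):
        flags.setdefault(int(i), []).append("negative intensity")
    # Rule 2: mean intensity far from the mean of the other spectra.
    means = data.mean(axis=1)
    for i in range(len(means)):
        others = np.delete(means, i)
        spread = others.std()
        if spread > 0 and abs(means[i] - others.mean()) > sigma * spread:
            flags.setdefault(i, []).append(f"mean intensity >{sigma:g} sigma")
    return flags


rows = [[100.0 + i] * 3 for i in range(10)]  # ten ordinary spectra
rows.append([1000.0] * 3)                    # index 10: far too bright
rows.append([105.0, -1.0, 105.0])            # index 11: negative value
flags = flag_outliers(rows)
```

The result is a set of flags for review, in line with the advice to inspect a spectrum before removing it.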

4. Duplicate Spectra

Check: Detect identical or near-identical spectra

⚠ Duplicates detected:
  - Spectra 15 and 47: 99.8% correlation
  - Spectra 22 and 23: Identical (100%)

Solutions:

Remove exact duplicates (keep one copy)
Flag near-duplicates (may be technical replicates)
Keep all (if intentional replicates)
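A pairwise comparison of this kind can be sketched with Pearson correlation. The threshold value and function name below are assumptions for illustration. Note that Pearson correlation ignores scale and offset, so two spectra that differ only in overall intensity would also be reported as near-duplicates.

```python
import numpy as np


def find_duplicates(intensities, threshold=0.995):
    """Return (i, j, kind) for rows that are identical or highly correlated."""
    data = np.asarray(intensities, dtype=float)
    corr = np.corrcoef(data)  # Pearson correlation between rows
    pairs = []
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if np.array_equal(data[i], data[j]):
                pairs.append((i, j, "identical"))
            elif corr[i, j] >= threshold:
                pairs.append((i, j, "near-duplicate"))
    return pairs


a = [1.0, 2.0, 3.0, 4.0, 5.0]
pairs = find_duplicates(
    [a, a, [1.0, 2.1, 2.9, 4.0, 5.1], [5.0, 1.0, 4.0, 2.0, 3.0]],
    threshold=0.99,
)
```

Keeping "identical" and "near-duplicate" as separate labels matches the different handling suggested above for exact copies and possible technical replicates.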

Manual Validation Tools

Spectrum Viewer

Inspect individual spectra:

Click on spectrum in list
Viewer shows:
- Full spectrum plot
- Statistics (mean, std, min, max)
- Peak detection
- Quality metrics

Actions:

Accept: Mark as validated
Reject: Remove from dataset
Edit: Manually correct issues
Notes: Add comments

Batch Validation

Review multiple spectra:

Click [Batch Validation]
Spectra displayed in grid (e.g., 3x3)
Navigate: Next/Previous pages
Actions: Accept, Reject, Flag

Use the on-screen controls for review actions.

Advanced Features

Wavenumber Calibration

Purpose: Correct systematic shifts in wavenumber axis

Calibration Methods:

Reference Peak Calibration
- Select known peak (e.g., 1001 cm⁻¹ for polystyrene)
- Specify expected position
- Apply constant shift correction
Multi-Peak Calibration
- Use multiple reference peaks
- Fit polynomial correction curve
- Apply non-linear calibration

Workflow:

# Example: Calibrate using the 1001 cm⁻¹ polystyrene peak
Click [Calibration] in Data Package page
Select calibration standard spectrum
Mark expected peak position: 1001 cm⁻¹
Detected peak: 1003.5 cm⁻¹
Shift: -2.5 cm⁻¹
Apply to all spectra in package
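The arithmetic behind both calibration methods is short. In the sketch below, the single-peak case reproduces the shift from the workflow above (expected 1001 cm⁻¹, detected 1003.5 cm⁻¹), and the multi-peak case fits a polynomial from detected to expected positions with `np.polyfit`. The function names and the default degree are assumptions, not the application's API.

```python
import numpy as np


def shift_correction(expected, detected):
    """Constant offset to add to the axis so the peak lands on `expected`."""
    return expected - detected


def polynomial_correction(expected_peaks, detected_peaks, degree=2):
    """Fit detected -> expected positions; returns a callable for the axis."""
    coeffs = np.polyfit(detected_peaks, expected_peaks, degree)
    return np.poly1d(coeffs)


shift = shift_correction(expected=1001.0, detected=1003.5)
axis = np.array([1000.0, 1003.5, 1010.0])
calibrated = axis + shift  # the detected peak now sits at 1001.0
```

A polynomial fit needs at least `degree + 1` reference peaks, and the correction is only trustworthy inside the range those peaks span.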

Data Merging

Combine multiple datasets:

Select data packages to merge
Click [Merge Packages]
Choose merge strategy:
- Concatenate: Stack spectra (keep all)
- Average: Mean of all spectra per group
- Interleave: Alternate between datasets
Handle wavenumber mismatches:
- Interpolate: Resample to common grid
- Crop: Use common wavenumber range only
Click Merge

Use Cases:

Combine multiple experimental batches
Create larger training sets
Merge technical replicates

Data Splitting

Split dataset into train/validation/test:

Select data package
Click [Split Dataset]
Configure split ratios:
- Training: 70%
- Validation: 15%
- Test: 15%
Choose split strategy:
- Random: Random assignment
- Stratified: Maintain group proportions
- Patient-level: Keep all spectra from one patient together
Click Split

Result: Three new data packages created automatically
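The patient-level strategy matters because spectra from one patient are correlated, so splitting them across subsets leaks information into the test set. The sketch below shows the idea with the standard library: patients are shuffled and partitioned, and every spectrum follows its patient. The function name, the seed, and the sample naming are hypothetical.

```python
import random


def patient_level_split(sample_to_patient, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split sample names into train/validation/test without splitting a patient."""
    patients = sorted(set(sample_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    subset_of = {}
    for i, patient in enumerate(patients):
        if i < n_train:
            subset_of[patient] = "train"
        elif i < n_train + n_val:
            subset_of[patient] = "validation"
        else:
            subset_of[patient] = "test"
    split = {"train": [], "validation": [], "test": []}
    for sample, patient in sample_to_patient.items():
        split[subset_of[patient]].append(sample)
    return split


# Hypothetical dataset: 20 patients with 3 spectra each.
samples = {f"p{p:02d}_s{s}": f"p{p:02d}" for p in range(20) for s in range(3)}
split = patient_level_split(samples)
```

The ratios here apply to patients rather than spectra, so the spectrum counts per subset only match the ratios when patients contribute similar numbers of spectra.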

Export Data

Export for external use:

In the current application:

The Data Package page can export metadata as JSON.
The Analysis page can export:
- Plots: PNG, SVG
- Data tables: CSV, XLSX, JSON, TXT, PKL

Options:

Available export options depend on the selected analysis method and output type.

Batch Import

Import multiple files at once:

Click [Batch Import]
Select folder containing CSV files
Options:
- Recursive: Include subfolders
- Pattern: Filter by filename (e.g., *.csv)
- Auto-group: Assign groups by folder name
Preview file list
Click Import All
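The three options map naturally onto `pathlib`. The following sketch lists the files a batch import would pick up and derives a group from each file's parent folder. It is an illustration only; the function name is hypothetical, and how the application names groups from folders is assumed.

```python
from pathlib import Path


def discover_files(root, pattern="*.csv", recursive=True, auto_group=True):
    """Return (path, group) pairs; group is the parent folder name or None."""
    root = Path(root)
    paths = root.rglob(pattern) if recursive else root.glob(pattern)
    found = []
    for path in sorted(paths):
        if path.is_file():
            group = path.parent.name if auto_group else None
            found.append((path, group))
    return found
```

Calling `discover_files("data/raw/batch1")` on the folder layout recommended under Best Practices would pair each patient file with `healthy` or `disease`.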

Progress:

During batch import, the application shows a progress indicator with:

Overall percent complete
Processed/total file count
Current file name
Estimated time remaining

Best Practices

File Organization

Recommended folder structure:

data/
├── raw/
│   ├── batch1/
│   │   ├── healthy/
│   │   │   ├── patient_001.csv
│   │   │   └── patient_002.csv
│   │   └── disease/
│   │       ├── patient_101.csv
│   │       └── patient_102.csv
│   └── batch2/
│       └── ...
└── processed/
    └── ...

Benefits:

Clear organization by batch and condition
Easy batch import
Automatic group assignment
Simplified version control

Naming Conventions

Files:

Good: patient_001_healthy.csv
Bad:  p1.csv

Good: disease_group_a_replicate_1.csv
Bad:  data.csv

Groups:

Good: Healthy_Control, Disease_GroupA, Disease_GroupB
Bad:  Group1, Group2, G3

Quality Control

Before analysis:

✓ Visual inspection of spectra
✓ Check for outliers
✓ Verify group assignments
✓ Validate wavenumber calibration
✓ Document any issues in metadata

During project:

Keep raw data unchanged
Version processed datasets
Document preprocessing steps
Backup regularly

Troubleshooting

Import Fails

Error: “Could not parse CSV file”

Solutions:

Check delimiter (comma vs tab vs semicolon)
Verify decimal separator (period vs comma)
Check for non-numeric characters
Use UTF-8 encoding
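Delimiter and decimal-separator problems often occur together, for example in files exported with semicolons and decimal commas. The sketch below uses the standard library's `csv.Sniffer` to guess the delimiter and then normalizes decimal commas. It assumes a header row and is a diagnostic aid for inspecting a failing file, not part of the application.

```python
import csv
import io


def parse_numeric_table(text):
    """Guess the delimiter, then parse rows, accepting ',' as a decimal mark."""
    dialect = csv.Sniffer().sniff(text, delimiters=",;\t")
    rows = list(csv.reader(io.StringIO(text), dialect))
    header, body = rows[0], rows[1:]
    # When the delimiter is not a comma, a comma inside a field is a decimal mark.
    values = [[float(cell.replace(",", ".")) for cell in row] for row in body]
    return dialect.delimiter, header, values


sample = "Wavenumber;Sample1;Sample2\n400,0;100,5;98,3\n401,0;101,2;99,1\n"
delimiter, header, values = parse_numeric_table(sample)
```

A `ValueError` from `float` points at a non-numeric character in the data, and a `csv.Error` from `sniff` means no consistent delimiter was found.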

Wavenumber Mismatch

Error: “Spectra have different wavenumber axes”

Solutions:

Use Align Wavenumbers tool
Interpolate to common grid
Crop to common range
Import separately and merge later

Memory Issues

Error: “Out of memory during import”

Solutions:

Import in smaller batches
Close other applications
Enable “Chunked Loading” in settings
Use 64-bit version of application

Missing Groups

Error: “No groups defined for classification”

Solutions:

Create groups first
Assign samples to groups
Verify group labels are correct
Check for unassigned samples
