Data Import Guide

Complete guide to importing, organizing, and managing spectral data in the application.

Supported File Formats

Primary Formats

TXT Files (Text Format)

Format: Tab or space-delimited

Structure:

400.0    100.5    98.3     102.1
401.0    101.2    99.1     103.5
402.0    102.8    100.4    104.2
...

Requirements:

  • Same column layout as CSV, but delimited by tabs or spaces

  • Optional header row

  • Consistent delimiter throughout file
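
As a rough illustration of the layout above, the following sketch parses tab- or space-delimited text into a wavenumber axis and one list per spectrum. The function name `parse_spectra_text` is hypothetical and is not part of the application; it also assumes that any non-numeric rows before the first data row are a header. Note that it is more lenient than the requirements listed above, because it accepts mixed tabs and spaces.

```python
def parse_spectra_text(text):
    """Parse tab- or space-delimited spectra.

    The first column is the wavenumber; every further column is one spectrum.
    Non-numeric rows before the first data row are treated as a header.
    """
    wavenumbers, rows = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of tabs or spaces
        if not fields:
            continue  # ignore blank lines
        try:
            values = [float(f) for f in fields]
        except ValueError:
            if not wavenumbers:
                continue  # optional header row before any data
            raise ValueError(f"line {line_no}: non-numeric value")
        if rows and len(values) - 1 != len(rows[0]):
            raise ValueError(f"line {line_no}: inconsistent column count")
        wavenumbers.append(values[0])
        rows.append(values[1:])
    # Transpose so that spectra[i] is the i-th intensity column.
    spectra = [list(column) for column in zip(*rows)]
    return wavenumbers, spectra


sample = (
    "400.0\t100.5\t98.3\t102.1\n"
    "401.0\t101.2\t99.1\t103.5\n"
    "402.0\t102.8\t100.4\t104.2\n"
)
wavenumbers, spectra = parse_spectra_text(sample)
```

Here `wavenumbers` holds the first column and `spectra` holds three lists, one per intensity column.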

ASC/ASCII Files

Format: Text format containing two columns: wavenumber and intensity

Supported extensions: .asc, .ascii

PKL Files

Format: Pickled pandas DataFrame

Supported extension: .pkl

Future Import Support (Planned)

  • SPC: Galactic SPC binary format

  • WDF: Renishaw WiRE format


Import Workflow

Step 1: Navigate to Data Package Page

  1. Open the application

  2. Click 📦 Data Package tab

  3. Ensure you’re in an active project (create a new one if needed)

Step 2: Select Files for Import

Method A: File Browser

  1. Click [Import Data] button

  2. File dialog opens

  3. Navigate to your data directory

  4. Select one or multiple files (CSV/TXT/ASC/PKL)

  5. Click [Open]

Method B: Drag and Drop

  1. Open file explorer (Windows Explorer, Finder)

  2. Navigate to your data files

  3. Drag files directly into the import area

  4. Release to drop

Method C: Paste File Paths

  1. Copy file path(s) from explorer

  2. Click [Import from Path]

  3. Paste paths (one per line for multiple files)

  4. Click [Import]

Step 3: Data Validation

During import, the application validates each file automatically and shows a validation status panel or toast listing items such as:

  • File format (CSV/TXT/ASC/PKL)

  • Wavenumber column detected

  • Number of spectra (samples)

  • Wavenumber range (e.g., 400–1800 cm⁻¹)

  • Data integrity checks

  • Missing values handling (if enabled)

Visual reference: See the Data Package Page screenshot in interface-overview.md.

Validation Checks:

  • File format compatibility

  • Wavenumber column detection

  • Consistent wavenumber spacing

  • No duplicate wavenumbers

  • Numeric data types

  • Missing value handling

  • Outlier detection (optional)
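
To check a file before importing it, some of the axis-related checks above can be approximated outside the application. This is a minimal sketch under assumptions: the function name `check_wavenumber_axis` and the spacing tolerance `rel_tol` are hypothetical, and the application's own checks may use different criteria.

```python
import math


def check_wavenumber_axis(wavenumbers, rel_tol=1e-6):
    """Return a list of human-readable problems found in a wavenumber axis."""
    problems = []
    if any(not math.isfinite(w) for w in wavenumbers):
        problems.append("non-finite wavenumber")
    if len(set(wavenumbers)) != len(wavenumbers):
        problems.append("duplicate wavenumbers")
    # Differences between neighbouring points should all match the first one.
    steps = [b - a for a, b in zip(wavenumbers, wavenumbers[1:])]
    if steps and any(not math.isclose(s, steps[0], rel_tol=rel_tol) for s in steps):
        problems.append("inconsistent wavenumber spacing")
    return problems
```

An empty list means that none of these three problems was found.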

Step 4: Preview and Confirm

The preview dialog shows:

  • Selected file name

  • Sample count

  • Wavenumber range

  • A preview plot (typically the first few spectra)

  • Import options (auto-detect wavenumber column, interpolate missing values, etc.)

  • Cancel / Import actions

Options:

  • Auto-detect wavenumber column: Automatically identify x-axis

  • Interpolate missing values: Fill gaps with linear interpolation

  • Apply baseline correction: Pre-process during import (optional)

Step 5: Confirmation

After import completes, a success notification confirms the number of spectra imported and the source file.


Data Organization

Project Structure

Data is organized hierarchically:

Project: blood_plasma_study/
├── Data Packages/
│   ├── batch1_healthy/
│   │   ├── healthy_001.csv
│   │   ├── healthy_002.csv
│   │   └── metadata.json
│   ├── batch2_disease/
│   │   ├── disease_001.csv
│   │   ├── disease_002.csv
│   │   └── metadata.json
│   └── batch3_validation/
│       └── validation_set.csv
├── Preprocessing Pipelines/
│   └── standard_pipeline.json
└── Results/
    ├── analysis/
    └── ml_models/

Creating Data Packages

A data package is a collection of related spectra.

Create New Package:

  1. Click [+ New Package] in Data Package page

  2. Enter package name: batch1_healthy

  3. Add description (optional): “Healthy controls, batch 1”

  4. Import files into this package

Benefits:

  • Organize by experimental batch

  • Group by sample type

  • Separate training/validation/test sets

  • Apply batch-specific processing

Metadata Management

Each data package can have metadata:

{
  "package_name": "batch1_healthy",
  "description": "Healthy control samples from first batch",
  "acquisition_date": "2025-12-15",
  "laser_power": 50,
  "integration_time": 10,
  "spectrometer": "RamanSpecPro 5000",
  "notes": "Room temperature, 785nm laser"
}

Edit Metadata:

  1. Right-click on data package

  2. Select Edit Metadata

  3. Fill in fields

  4. Click Save
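
Because the metadata is plain JSON, it can also be edited with a script. The sketch below uses only Python's standard `json` module; the helper name `update_metadata` and the rule that `package_name` must not be empty are assumptions made for illustration.

```python
import json


def update_metadata(json_text, **changes):
    """Apply field changes to a metadata JSON document and return new JSON text."""
    metadata = json.loads(json_text)
    if not isinstance(metadata, dict):
        raise ValueError("metadata must be a JSON object")
    metadata.update(changes)
    # Assumption: a package always needs a non-empty name.
    if not metadata.get("package_name"):
        raise ValueError("package_name must not be empty")
    return json.dumps(metadata, indent=2, ensure_ascii=False)


original = '{"package_name": "batch1_healthy", "laser_power": 50}'
updated = json.loads(update_metadata(original, laser_power=45, notes="Re-measured"))
```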


Group Management

Creating Sample Groups

Groups are used for:

  • Classification labels

  • Statistical comparisons

  • Visualization colors

  • Cross-validation splits

Create Group:

  1. Click [Manage Groups] button

  2. Click [+ New Group]

  3. Enter group details:

    • Name: Healthy Control

    • Label: 0 (numeric for ML)

    • Color: 🟢 Green

    • Description: “Healthy patients without disease”

  4. Click Create

Common Group Naming:

For Classification:
- Healthy Control (label: 0)
- Disease Group A (label: 1)
- Disease Group B (label: 2)

For Regression:
- Low Concentration (value: 0-5)
- Medium Concentration (value: 5-10)
- High Concentration (value: 10-20)

Assigning Samples to Groups

Method A: Manual Selection

  1. Select samples in data list (Ctrl+Click for multiple)

  2. Right-click → Assign to Group

  3. Select group from dropdown

  4. Click Assign

Method B: Bulk Assignment

  1. Click [Bulk Assign] button

  2. Use pattern matching:

    • Pattern: healthy_* → Group: Healthy Control

    • Pattern: disease_* → Group: Disease

  3. Preview assignments

  4. Click Apply
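
The pattern rules above behave like shell-style wildcards. As a sketch of that idea, assuming the first matching pattern wins (the application's actual matching rules are not documented here), Python's standard `fnmatch` module can preview an assignment. The name `bulk_assign` is hypothetical.

```python
from fnmatch import fnmatchcase


def bulk_assign(sample_names, rules):
    """Assign each sample to the group of the first matching pattern.

    rules is a list of (pattern, group) pairs. Returns the assignments and
    the list of samples that matched no pattern.
    """
    assignments, unassigned = {}, []
    for name in sample_names:
        for pattern, group in rules:
            if fnmatchcase(name, pattern):  # case-sensitive wildcard match
                assignments[name] = group
                break
        else:
            unassigned.append(name)
    return assignments, unassigned


rules = [("healthy_*", "Healthy Control"), ("disease_*", "Disease")]
assignments, unassigned = bulk_assign(
    ["healthy_001", "disease_002", "validation_set"], rules
)
```

Listing the unmatched samples separately makes it easy to spot files that need manual assignment.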

Method C: CSV Mapping

Create a CSV file with sample-to-group mapping:

sample_name,group_label
healthy_001,Healthy Control
healthy_002,Healthy Control
disease_001,Disease
disease_002,Disease

Import:

  1. Click [Import Group Mapping]

  2. Select CSV file

  3. Verify mappings

  4. Click Apply
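
A mapping file can be checked before import with a short script. This sketch reads the two columns shown above with Python's standard `csv` module and reports rows that are empty, duplicated, or refer to a group that does not exist. The function name `read_group_mapping` and the specific checks are assumptions.

```python
import csv
import io


def read_group_mapping(csv_text, known_groups):
    """Return (mapping, errors) for a sample_name,group_label CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    mapping, errors = {}, []
    for row in reader:
        name = (row.get("sample_name") or "").strip()
        group = (row.get("group_label") or "").strip()
        if not name or not group:
            errors.append(f"line {reader.line_num}: empty sample_name or group_label")
        elif group not in known_groups:
            errors.append(f"line {reader.line_num}: unknown group {group!r}")
        elif name in mapping:
            errors.append(f"line {reader.line_num}: duplicate sample {name!r}")
        else:
            mapping[name] = group
    return mapping, errors


mapping_text = (
    "sample_name,group_label\n"
    "healthy_001,Healthy Control\n"
    "disease_001,Disease\n"
    "disease_002,Unknown\n"
)
mapping, errors = read_group_mapping(mapping_text, {"Healthy Control", "Disease"})
```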

Multi-Group Assignment

Some samples may belong to multiple groups:

Example: Clinical study with multiple factors

  • Group 1: Disease Status (Healthy, Disease A, Disease B)

  • Group 2: Gender (Male, Female)

  • Group 3: Age Range (<30, 30-50, >50)

Enable:

  1. Settings → Data Management → Allow Multiple Groups

  2. Assign samples to multiple group hierarchies

  3. Select active grouping for analysis


Data Validation

Automatic Checks

The application performs the following checks on import:

1. Wavenumber Consistency

Check: All spectra must have an identical wavenumber axis

✓ All spectra: 400-1800 cm⁻¹, 1000 points
✗ Mismatch detected:
  - File 1: 400-1800 cm⁻¹
  - File 2: 500-1700 cm⁻¹ (different range)

Solutions:

  • Interpolate to common grid

  • Crop to common range

  • Use “Align Wavenumbers” tool
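
The first two options can be sketched in a few lines: find the range shared by all files, then linearly interpolate each spectrum onto a grid inside that range. The function names are hypothetical, and the sketch assumes each axis is strictly increasing with at least two points. It is not necessarily how the “Align Wavenumbers” tool works.

```python
def common_range(axes):
    """Overlap of several wavenumber axes, or None if they do not overlap."""
    low = max(min(axis) for axis in axes)
    high = min(max(axis) for axis in axes)
    return (low, high) if low <= high else None


def interpolate_to_grid(wavenumbers, intensities, grid):
    """Linearly interpolate one spectrum onto an increasing grid.

    Assumes wavenumbers is strictly increasing and covers every grid point.
    """
    result, j = [], 0
    for x in grid:
        if not wavenumbers[0] <= x <= wavenumbers[-1]:
            raise ValueError(f"{x} lies outside the measured range")
        # Advance to the segment [wavenumbers[j], wavenumbers[j + 1]] holding x.
        while wavenumbers[j + 1] < x:
            j += 1
        x0, x1 = wavenumbers[j], wavenumbers[j + 1]
        y0, y1 = intensities[j], intensities[j + 1]
        result.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return result


shared = common_range([[400, 1800], [500, 1700]])
resampled = interpolate_to_grid(
    [400.0, 402.0, 404.0], [10.0, 20.0, 40.0], [401.0, 402.0, 403.0]
)
```

For the mismatch shown above, the shared range is 500 to 1700 cm⁻¹, so cropping to it discards the outer 100 cm⁻¹ on each side of File 1.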

2. Missing Values

Check: No NaN or infinite values

⚠ Missing values detected:
  - Spectrum 15: 3 NaN values at 1200-1202 cm⁻¹
  - Spectrum 47: 1 NaN value at 850 cm⁻¹

Solutions:

  • Linear interpolation (default)

  • Polynomial interpolation

  • Remove affected spectra

  • Manual correction
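
Linear interpolation replaces each missing point with a value on the straight line between its nearest valid neighbours. A minimal sketch follows; the function name `fill_missing` is hypothetical, and filling gaps at either end with the nearest valid value is an assumption.

```python
import math


def fill_missing(wavenumbers, intensities):
    """Replace NaN or infinite intensities by linear interpolation."""
    valid = [i for i, y in enumerate(intensities) if math.isfinite(y)]
    if not valid:
        raise ValueError("spectrum contains no finite values")
    filled = list(intensities)
    for i, y in enumerate(intensities):
        if math.isfinite(y):
            continue
        # Nearest valid neighbours on each side of the gap, if any.
        left = max((k for k in valid if k < i), default=None)
        right = min((k for k in valid if k > i), default=None)
        if left is None:
            filled[i] = intensities[right]  # gap at the start
        elif right is None:
            filled[i] = intensities[left]  # gap at the end
        else:
            span = wavenumbers[right] - wavenumbers[left]
            t = (wavenumbers[i] - wavenumbers[left]) / span
            filled[i] = intensities[left] + t * (intensities[right] - intensities[left])
    return filled


nan = float("nan")
repaired = fill_missing(
    [1199.0, 1200.0, 1201.0, 1202.0, 1203.0], [10.0, nan, nan, nan, 50.0]
)
```

Interpolating across a wide gap can erase a real peak, so long runs of missing values are better handled by removing the spectrum or correcting it manually.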

3. Outlier Detection

Check: Identify spectra with unusual intensity values

⚠ Potential outliers:
  - Spectrum 32: Intensity >10σ from mean
  - Spectrum 88: Negative intensity values

Solutions:

  • Flag for review (don’t remove yet)

  • Visual inspection (plot spectrum)

  • Remove if confirmed (after manual check)

  • Note in metadata (keep but annotate)
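
One simple way to produce flags like those above is a z-score on each spectrum's mean intensity plus a check for negative values. This is a sketch under assumptions: the function name `flag_outliers` is hypothetical, and the application may use a different statistic.

```python
import statistics


def flag_outliers(spectra, z_threshold=10.0):
    """Return {index: reason} for spectra that look suspicious."""
    means = [statistics.fmean(s) for s in spectra]
    centre = statistics.fmean(means)
    spread = statistics.pstdev(means)  # population standard deviation
    flags = {}
    for i, (spectrum, mean) in enumerate(zip(spectra, means)):
        if min(spectrum) < 0:
            flags[i] = "negative intensity values"
        elif spread > 0 and abs(mean - centre) / spread > z_threshold:
            flags[i] = f"mean intensity more than {z_threshold} sigma from the dataset mean"
    return flags


# Nine similar spectra and one that is ten times more intense.
demo_spectra = [[100.0, 101.0] for _ in range(9)] + [[1000.0, 1001.0]]
demo_flags = flag_outliers(demo_spectra, z_threshold=2.5)
```

In a small dataset the outlier itself inflates the standard deviation: with n spectra, no z-score computed this way can exceed sqrt(n - 1). A 10σ threshold therefore only makes sense for datasets with more than about 100 spectra; for smaller ones a lower threshold or a median-based statistic is more suitable.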

4. Duplicate Spectra

Check: Detect identical or near-identical spectra

⚠ Duplicates detected:
  - Spectra 15 and 47: 99.8% correlation
  - Spectra 22 and 23: Identical (100%)

Solutions:

  • Remove exact duplicates (keep one copy)

  • Flag near-duplicates (may be technical replicates)

  • Keep all (if intentional replicates)
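
A correlation-based duplicate check can be sketched as follows. It compares every pair of spectra, reports identical pairs separately, and flags pairs whose Pearson correlation reaches a threshold. The function names and the 0.998 default are assumptions chosen to mirror the example report above.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined for a completely flat spectrum
    return cov / math.sqrt(var_a * var_b)


def find_duplicates(spectra, threshold=0.998):
    """Return (exact, near) lists of index pairs."""
    exact, near = [], []
    for (i, a), (j, b) in combinations(enumerate(spectra), 2):
        if a == b:
            exact.append((i, j))
        elif pearson(a, b) >= threshold:
            near.append((i, j))
    return exact, near


demo = [
    [1.0, 2.0, 3.0, 2.0],
    [1.0, 2.0, 3.0, 2.0],  # identical to spectrum 0
    [3.0, 1.0, 0.5, 4.0],
    [2.0, 4.0, 6.0, 4.0],  # spectrum 0 scaled by 2
]
exact, near = find_duplicates(demo)
```

Correlation ignores overall intensity scale, so a spectrum and a scaled copy of it count as near-duplicates here. Spectra of the same sample type are also naturally highly correlated, which is why near-duplicates should be flagged for review rather than removed automatically.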

Manual Validation Tools

Spectrum Viewer

Inspect individual spectra:

  1. Click on spectrum in list

  2. Viewer shows:

    • Full spectrum plot

    • Statistics (mean, std, min, max)

    • Peak detection

    • Quality metrics

Actions:

  • Accept: Mark as validated

  • Reject: Remove from dataset

  • Edit: Manually correct issues

  • Notes: Add comments

Batch Validation

Review multiple spectra:

  1. Click [Batch Validation]

  2. Spectra displayed in grid (e.g., 3x3)

  3. Navigate: Next/Previous pages

  4. Actions: Accept, Reject, Flag

Use the on-screen controls for review actions.


Advanced Features

Wavenumber Calibration

Purpose: Correct systematic shifts in wavenumber axis

Calibration Methods:

  1. Reference Peak Calibration

    • Select known peak (e.g., the 1001 cm⁻¹ ring-breathing band of polystyrene)

    • Specify expected position

    • Apply linear shift correction

  2. Multi-Peak Calibration

    • Use multiple reference peaks

    • Fit polynomial correction curve

    • Apply non-linear calibration

Workflow:

# Example: calibrate using the 1001 cm⁻¹ polystyrene peak
1. Click [Calibration] in Data Package page
2. Select calibration standard spectrum
3. Mark expected peak position: 1001 cm⁻¹
4. Note the detected peak position: 1003.5 cm⁻¹
5. Confirm the computed shift: 1001 - 1003.5 = -2.5 cm⁻¹
6. Apply to all spectra in package
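
The arithmetic behind this workflow is a constant offset: shift = expected position minus detected position, added to every point on the axis. The sketch below reproduces the numbers from the example (expected 1001 cm⁻¹, detected 1003.5 cm⁻¹). The function names are hypothetical, and taking the peak as the most intense point inside a search window is a simplifying assumption.

```python
def detect_peak(wavenumbers, intensities, low, high):
    """Wavenumber of the most intense point inside the window [low, high]."""
    window = [(y, w) for w, y in zip(wavenumbers, intensities) if low <= w <= high]
    if not window:
        raise ValueError("no data points inside the search window")
    return max(window)[1]


def apply_shift(wavenumbers, shift):
    """Add a constant offset to every point of the wavenumber axis."""
    return [w + shift for w in wavenumbers]


axis = [1002.5, 1003.0, 1003.5, 1004.0, 1004.5]
standard = [5.0, 20.0, 80.0, 25.0, 6.0]

detected = detect_peak(axis, standard, low=995.0, high=1010.0)
shift = 1001.0 - detected  # expected position minus detected position
calibrated_axis = apply_shift(axis, shift)
```

After the correction, the most intense point of the standard sits at 1001 cm⁻¹. The same shift is then applied to every spectrum in the package.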

Data Merging

Combine multiple datasets:

  1. Select data packages to merge

  2. Click [Merge Packages]

  3. Choose merge strategy:

    • Concatenate: Stack spectra (keep all)

    • Average: Mean of all spectra per group

    • Interleave: Alternate between datasets

  4. Handle wavenumber mismatches:

    • Interpolate: Resample to common grid

    • Crop: Use common wavenumber range only

  5. Click Merge

Use Cases:

  • Combine multiple experimental batches

  • Create larger training sets

  • Merge technical replicates

Data Splitting

Split dataset into train/validation/test:

  1. Select data package

  2. Click [Split Dataset]

  3. Configure split ratios:

    • Training: 70%

    • Validation: 15%

    • Test: 15%

  4. Choose split strategy:

    • Random: Random assignment

    • Stratified: Maintain group proportions

    • Patient-level: Keep all spectra from one patient together

  5. Click Split

Result: Three new data packages created automatically
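
The patient-level strategy matters because spectra from one patient are not independent: if they end up in both the training and the test set, test scores become over-optimistic. A minimal sketch of such a split follows, with a hypothetical function name and the assumption that the ratios apply to patients rather than to individual spectra.

```python
import random
from collections import defaultdict


def patient_level_split(sample_patients, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split samples into (train, validation, test) without splitting a patient.

    sample_patients maps each sample name to its patient ID.
    """
    by_patient = defaultdict(list)
    for sample, patient in sample_patients.items():
        by_patient[patient].append(sample)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # fixed seed for a reproducible split
    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    parts = (
        patients[:n_train],
        patients[n_train:n_train + n_val],
        patients[n_train + n_val:],  # test set takes the remainder
    )
    return tuple([s for p in part for s in by_patient[p]] for part in parts)


# 20 patients with 3 spectra each.
samples = {f"p{p:02d}_s{s}": f"p{p:02d}" for p in range(20) for s in range(3)}
train, validation, test = patient_level_split(samples)
```

With 20 patients, this yields 14, 3, and 3 patients per set, so the set sizes in spectra follow the split ratios only approximately when patients contribute different numbers of spectra.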

Export Data

Export for external use:

In the current application:

  • The Data Package page can export metadata as JSON.

  • The Analysis page can export:

    • Plots: PNG, SVG

    • Data tables: CSV, XLSX, JSON, TXT, PKL

Options:

Available export options depend on the selected analysis method and output type.

Batch Import

Import multiple files at once:

  1. Click [Batch Import]

  2. Select folder containing CSV files

  3. Options:

    • Recursive: Include subfolders

    • Pattern: Filter by filename (e.g., *.csv)

    • Auto-group: Assign groups by folder name

  4. Preview file list

  5. Click Import All
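
The three options map naturally onto a folder scan. The sketch below uses Python's standard `pathlib` to collect matching files and, for auto-grouping, takes the name of each file's parent folder as its group. The function name `collect_files` is hypothetical, and the demo builds a throwaway folder tree so that it can run anywhere.

```python
import tempfile
from pathlib import Path


def collect_files(folder, pattern="*.csv", recursive=True, auto_group=True):
    """Return a sorted list of (path, group) pairs for files matching pattern."""
    root = Path(folder)
    paths = root.rglob(pattern) if recursive else root.glob(pattern)
    found = []
    for path in sorted(p for p in paths if p.is_file()):
        group = path.parent.name if auto_group else None
        found.append((path, group))
    return found


# Demo on a temporary folder tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("healthy/patient_001.csv", "disease/patient_101.csv", "healthy/notes.txt"):
        target = root / rel
        target.parent.mkdir(exist_ok=True)
        target.write_text("")
    demo_files = [(p.name, group) for p, group in collect_files(root)]
    top_level_only = collect_files(root, recursive=False)
```

With the Recursive option off, only files directly inside the selected folder are considered, so the demo tree above yields no files at all.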

Progress:

During batch import, the application shows a progress indicator with:

  • Overall percent complete

  • Processed/total file count

  • Current file name

  • Estimated time remaining


Best Practices

File Organization

Recommended folder structure:

data/
├── raw/
│   ├── batch1/
│   │   ├── healthy/
│   │   │   ├── patient_001.csv
│   │   │   └── patient_002.csv
│   │   └── disease/
│   │       ├── patient_101.csv
│   │       └── patient_102.csv
│   └── batch2/
│       └── ...
└── processed/
    └── ...

Benefits:

  • Clear organization by batch and condition

  • Easy batch import

  • Automatic group assignment

  • Simplified version control

Naming Conventions

Files:

Good: patient_001_healthy.csv
Bad:  p1.csv

Good: disease_group_a_replicate_1.csv
Bad:  data.csv

Groups:

Good: Healthy_Control, Disease_GroupA, Disease_GroupB
Bad:  Group1, Group2, G3

Quality Control

Before analysis:

  1. ✓ Visual inspection of spectra

  2. ✓ Check for outliers

  3. ✓ Verify group assignments

  4. ✓ Validate wavenumber calibration

  5. ✓ Document any issues in metadata

During project:

  • Keep raw data unchanged

  • Version processed datasets

  • Document preprocessing steps

  • Backup regularly


Troubleshooting

Import Fails

Error: “Could not parse CSV file”

Solutions:

  • Check delimiter (comma vs tab vs semicolon)

  • Verify decimal separator (period vs comma)

  • Check for non-numeric characters

  • Use UTF-8 encoding
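
Delimiter and decimal-separator problems often occur together, for example in files exported with a European locale that use semicolons between fields and commas inside numbers. The sketch below shows how one such row can be converted; the function name `parse_row` is hypothetical.

```python
def parse_row(line, delimiter=",", decimal="."):
    """Convert one delimited text row to a list of floats."""
    if delimiter == decimal:
        raise ValueError("delimiter and decimal separator must differ")
    fields = line.strip().split(delimiter)
    # Normalise the decimal separator before converting each field.
    return [float(field.strip().replace(decimal, ".")) for field in fields]


row = parse_row("400,0;100,5;98,3", delimiter=";", decimal=",")
```

A row that still raises `ValueError` after the separators are set correctly usually contains a non-numeric character, such as a unit or a stray quote.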

Wavenumber Mismatch

Error: “Spectra have different wavenumber axes”

Solutions:

  1. Use Align Wavenumbers tool

  2. Interpolate to common grid

  3. Crop to common range

  4. Import separately and merge later

Memory Issues

Error: “Out of memory during import”

Solutions:

  • Import in smaller batches

  • Close other applications

  • Enable “Chunked Loading” in settings

  • Use 64-bit version of application

Missing Groups

Error: “No groups defined for classification”

Solutions:

  1. Create groups first

  2. Assign samples to groups

  3. Verify group labels are correct

  4. Check for unassigned samples


See Also


Next: Preprocessing Guide