Methodology
We present a low-sample pre-screening pipeline for MI-BCI literacy that combines brief resting-state baselines with only the first n = 15 motor imagery trials per subject.
Phase 1: Ground Truth Label Generation (CSP-LDA)
We establish ground-truth MI performance with a subject-specific CSP-LDA decoder following the MetaBCI framework.
- Dataset: PhysioNet EEG Motor Movement/Imagery — 109 subjects, 64 channels, 160 Hz
- ROI: C3, Cz, C4 (+ FC1/FC2, CP1/CP2 for spatial context)
- Filtering: Mu-band 8–13 Hz, post-cue window 0.5–4.0 s
- CSP: Top k=4 + bottom k=4 spatial filters (8 components)
- Classifier: LDA with shrinkage regularization
- Validation: Stratified 5-fold CV within each subject → one literacy score per subject
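The Phase 1 decoder can be sketched with a plain NumPy/scikit-learn implementation. The actual pipeline follows MetaBCI; the k = 4 top/bottom filter count, shrinkage LDA, and stratified 5-fold CV follow the bullets above, while the function names and data shapes here are illustrative:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

def csp_filters(X, y, k=4):
    """CSP spatial filters: bottom-k + top-k generalized eigenvectors.
    X: (n_trials, n_channels, n_samples); y: binary labels."""
    covs = []
    for c in np.unique(y):
        trials = X[y == c]
        # trace-normalized spatial covariance, averaged over trials
        cov = np.mean([t @ t.T / np.trace(t @ t.T) for t in trials], axis=0)
        covs.append(cov)
    # generalized eigenproblem C1 w = lambda (C1 + C2) w
    evals, evecs = eigh(covs[0], covs[0] + covs[1])
    order = np.argsort(evals)
    picks = np.r_[order[:k], order[-k:]]   # bottom k + top k -> 2k components
    return evecs[:, picks]                 # (n_channels, 2k)

def csp_features(X, W):
    """Log of normalized variance of spatially filtered trials."""
    Z = np.einsum('ck,ncs->nks', W, X)
    var = Z.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))

def subject_literacy(X, y, k=4, seed=0):
    """Stratified 5-fold CV accuracy -> one literacy score per subject."""
    cv = StratifiedKFold(5, shuffle=True, random_state=seed)
    accs = []
    for tr, te in cv.split(X, y):
        W = csp_filters(X[tr], y[tr], k)
        lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')
        lda.fit(csp_features(X[tr], W), y[tr])
        accs.append(lda.score(csp_features(X[te], W), y[te]))
    return float(np.mean(accs))
```

Fitting the CSP filters inside each fold (rather than once on all trials) keeps the per-subject literacy score free of train/test leakage.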
Phase 2: Feature Extraction
We extract neurophysiological markers from the first 15 trials of left-vs-right MI runs (runs 4/8/12) plus brief resting baselines (R01/R02). Two feature families are computed:
2a. Subject-Level Trait & Task-Modulation Features
2b. Early-Trial Decodability Features
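As one concrete example of the family-2a markers, a mu-band ERD/ERS value (in the spirit of the `erdrs_mu_C3`-style features selected later) can be computed per channel as a percent power change from a pre-cue baseline. This is a minimal sketch; the window boundaries and filter order are illustrative assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def erd_percent(trial, fs, baseline=(0.0, 1.0), task=(1.5, 4.0), band=(8, 13)):
    """Mu-band ERD/ERS for one channel: percent power change from baseline.
    trial: 1-D signal epoched from -1 s relative to cue (index 0 = -1 s),
    so baseline (0, 1) s covers the pre-cue second and task (1.5, 4.0) s
    covers 0.5-3.0 s post-cue."""
    b, a = butter(4, np.array(band) / (fs / 2), btype='band')
    mu = filtfilt(b, a, trial)              # zero-phase mu-band filtering
    power = mu ** 2
    t = np.arange(len(trial)) / fs
    base = power[(t >= baseline[0]) & (t < baseline[1])].mean()
    act = power[(t >= task[0]) & (t < task[1])].mean()
    return 100.0 * (act - base) / base      # negative = desynchronization
```

Negative values indicate mu suppression during imagery, the marker that stronger performers typically show more clearly.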
Phase 2.5: Merged Feature Pipeline
Features from Phase 2a and 2b are merged by subject ID into a single design matrix: 38 total merged features per subject for all 109 subjects.
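The merge step can be sketched with pandas, assuming each phase produces one per-subject table; the `subject` key column is an illustrative name:

```python
import pandas as pd

def merge_feature_tables(trait_df, decod_df, on='subject'):
    """Inner-join the Phase 2a and 2b feature tables on subject ID,
    keeping only subjects present in both tables (109 expected)."""
    merged = trait_df.merge(decod_df, on=on, how='inner', validate='one_to_one')
    return merged.set_index(on)
```

The `validate='one_to_one'` guard makes duplicated subject IDs fail loudly instead of silently inflating the design matrix.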
Feature Selection
To reduce redundancy and control false discoveries:
- Rank-based association: Spearman correlation with 5,000-permutation significance tests
- FDR correction: Benjamini–Hochberg at α = 0.05
- Minimum effect size: |ρ| ≥ 0.1
This retained 12 selected predictors for the final models:
| Feature | ρ | pFDR |
|---|---|---|
| csp_class_separability | 0.454 | 0.002 |
| smr_strength | 0.399 | 0.002 |
| erdrs_mu_C3 | −0.365 | 0.002 |
| rpl | 0.314 | 0.008 |
| + 8 more (resting_rpl_alpha, resting_pse_avg, band_power_beta_C3/C4, erdrs_mu_Cz, snr_mean, mu_erd_imagined_C3, resting_tar) | — | — |
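The selection procedure can be sketched as follows. The function name and exact permutation scheme are illustrative, but the Spearman statistic, 5,000-permutation default, Benjamini-Hochberg step-up, and |ρ| ≥ 0.1 floor follow the bullets above:

```python
import numpy as np
from scipy.stats import spearmanr

def select_features(X, y, n_perm=5000, alpha=0.05, min_rho=0.1, seed=0):
    """Permutation-tested Spearman screening with Benjamini-Hochberg FDR.
    X: (n_subjects, n_features); y: literacy scores. Returns kept indices."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rho = np.array([spearmanr(X[:, j], y).correlation for j in range(p)])
    # permutation null: shuffle y, recompute |rho| for every feature
    null = np.empty((n_perm, p))
    for i in range(n_perm):
        yp = rng.permutation(y)
        null[i] = [abs(spearmanr(X[:, j], yp).correlation) for j in range(p)]
    pvals = (1 + (null >= np.abs(rho)).sum(axis=0)) / (n_perm + 1)
    # Benjamini-Hochberg step-up at level alpha
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, p + 1) / p
    passed = pvals[order] <= thresh
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    keep = np.zeros(p, bool)
    keep[order[:k]] = True
    # enforce the minimum effect-size floor on top of FDR control
    return np.where(keep & (np.abs(rho) >= min_rho))[0]
```

The `(1 + hits) / (n_perm + 1)` estimator keeps permutation p-values strictly positive, which the step-up procedure requires.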
Phase 3: BCI Literacy Screening Models
We trained two predictive models on the merged feature vectors (standardized prior to fitting):
Regression
We compared four models (Random Forest, Gradient Boosting, RBF-SVR, Ridge) using 5-fold CV; Random Forest performed best (R² = 0.302, Pearson r = 0.580).
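The comparison can be sketched as below, assuming the merged design matrix `X` and per-subject literacy scores `y`; hyperparameters are scikit-learn defaults, which the actual runs may have tuned:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def compare_regressors(X, y, seed=0):
    """5-fold CV R^2 for the four candidate regressors -> {name: mean score}.
    Standardization is fit inside each fold via the pipeline."""
    models = {
        'rf': RandomForestRegressor(random_state=seed),
        'gb': GradientBoostingRegressor(random_state=seed),
        'svr': SVR(kernel='rbf'),
        'ridge': Ridge(),
    }
    cv = KFold(5, shuffle=True, random_state=seed)
    return {name: cross_val_score(make_pipeline(StandardScaler(), est),
                                  X, y, cv=cv, scoring='r2').mean()
            for name, est in models.items()}
```

Wrapping the scaler in the pipeline, rather than standardizing once up front, keeps the fold-wise R² estimates honest.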
Binary Classifier
A Random Forest classifier separates HIGH from LOW performers at a decoding-accuracy threshold of 0.65. Evaluated via leave-one-out cross-validation (LOOCV), it reaches 78.8% accuracy, with stronger detection of low performers (precision 0.833, recall 0.890).
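A minimal sketch of the screening classifier and its LOOCV evaluation (the function name is illustrative); the low-performer metrics are obtained by treating LOW as the positive label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loocv_screen(X, scores, threshold=0.65, seed=0):
    """Binarize literacy at the threshold (1 = HIGH performer) and
    evaluate a Random Forest screen with leave-one-out CV."""
    y = (scores >= threshold).astype(int)
    clf = RandomForestClassifier(random_state=seed)
    pred = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    return {
        'accuracy': accuracy_score(y, pred),
        # pos_label=0: precision/recall for LOW-performer detection
        'low_precision': precision_score(y, pred, pos_label=0),
        'low_recall': recall_score(y, pred, pos_label=0),
    }
```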
Preprocessing
- Re-referencing: Common-average reference
- Filtering: FIR bandpass 1–40 Hz
- Epoching: −1 to +4 s relative to cue onset; ±100 μV artifact rejection
- ICA: Picard, 20 components; Fp1 + AF7/AF8 as EOG proxies
- Laplacian: Subtracted mean of 4 nearest neighbors for ROI channels
- Imputation: Feature-wise median for missing feature groups
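The referencing, filtering, epoching, and rejection steps above can be sketched without MNE as follows; the full pipeline uses MNE-Python, including the Picard ICA and the Laplacian, which are omitted here:

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def preprocess(raw, fs, cue_samples, tmin=-1.0, tmax=4.0, reject_uv=100.0):
    """CAR -> 1-40 Hz FIR bandpass -> epoching -> amplitude rejection.
    raw: (n_channels, n_samples) in microvolts; cue_samples: cue-onset
    sample indices. (ICA and the surface Laplacian are applied with
    MNE-Python in the full pipeline.)"""
    car = raw - raw.mean(axis=0, keepdims=True)         # common-average ref
    taps = firwin(int(fs) + 1, [1.0, 40.0], pass_zero=False, fs=fs)
    filt = filtfilt(taps, [1.0], car, axis=1)           # zero-phase FIR
    lo, hi = int(tmin * fs), int(tmax * fs)
    epochs = np.stack([filt[:, c + lo:c + hi] for c in cue_samples])
    keep = np.abs(epochs).max(axis=(1, 2)) <= reject_uv  # +/-100 uV rule
    return epochs[keep]
```

Applying the amplitude criterion after filtering (as here) rejects fewer epochs than applying it to raw data; either choice should simply be stated and kept consistent.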
System Architecture: HIL Simulation
The current demonstration uses a Hardware-in-the-Loop (HIL) Simulation:
- Signal Source: Pre-recorded PhysioNet data streamed as a "Digital Twin"
- Processing: Feature extraction and ML decoding run in real time
- Justification: Validates end-to-end latency deterministically before human deployment
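The digital-twin replay can be sketched as a paced chunk streamer that also records per-chunk processing latency; the function name and callback interface are illustrative:

```python
import time
import numpy as np

def stream_digital_twin(raw, fs, chunk, process, pace=True):
    """Replay a pre-recorded recording in fixed-size chunks, pacing
    delivery to wall-clock time so end-to-end latency is measured
    deterministically. raw: (n_channels, n_samples); process: callback
    invoked with each (n_channels, chunk) block. Returns per-chunk
    processing latencies in seconds."""
    latencies = []
    period = chunk / fs                       # real-time budget per chunk
    for start in range(0, raw.shape[1] - chunk + 1, chunk):
        t0 = time.perf_counter()
        process(raw[:, start:start + chunk])
        latencies.append(time.perf_counter() - t0)
        if pace:                              # sleep off the remaining budget
            time.sleep(max(0.0, period - latencies[-1]))
    return latencies
```

Any latency exceeding `period` flags a chunk where the pipeline fell behind real time, which is exactly the condition the HIL stage is meant to rule out before human deployment.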
Software Stack
- Python 3.9+
- MNE-Python (EEG Processing)
- MetaBCI (Decoding Pipelines)
- scikit-learn (Machine Learning)
- Flask (API)