VECTOR: Parallel Processing Engine for Real-Time Multi-Sensor Fusion
MPI-based distributed pipeline processes high-bandwidth multi-sensor streams with SIMD-accelerated per-node compute, near-linear scaling with node count, and partial-result output under node failure without full pipeline restart.

Real-time fusion of high-bandwidth data streams from multiple sensors exceeds single-node processing capacity at operational sample rates. General-purpose distributed computing frameworks introduce coordination overhead incompatible with the pipeline's latency requirements. A custom MPI-based pipeline eliminates middleware overhead and provides direct control over partitioning and communication patterns.
Built an MPI-based distributed processing pipeline with data-parallel partitioning across compute nodes, SIMD-accelerated local processing on each node, and a low-latency aggregation layer with configurable windowing. Explicit message-passing eliminates shared-state coordination overhead. The architecture scales near-linearly with node count and handles node failure without full pipeline restart.
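A minimal rank-role sketch of this split, assuming rank 0 hosts the aggregation layer and every other rank acts as a partition worker; the two helper calls are placeholders, not the project's actual entry points:

    #include <mpi.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            // run_aggregator(size);  // hypothetical: gather + windowed fusion
        } else {
            // run_worker(rank);      // hypothetical: SIMD kernels on assigned sensors
        }

        MPI_Finalize();
        return 0;
    }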
Data-Parallel Partitioning
Affinity-aware stream partitioning across the MPI node cluster
Incoming sensor streams are partitioned across compute nodes using a hashing scheme that assigns each sensor's data to a fixed node for the duration of a processing window. This affinity-aware approach maximises cache locality: each node processes the same sensor data repeatedly, keeping calibration tables and filter coefficients hot in cache. Dynamic rebalancing at window boundaries redistributes load when a new node joins or an existing node is removed.
- Sensor-affinity partitioning: each sensor assigned to a fixed MPI rank
- Hash-based assignment reproducible across restarts without coordination
- Dynamic rebalancing at window checkpoints on topology change
- Partition map broadcast to all nodes on startup and after rebalancing
- Straggler detection: slow partitions flagged for rebalancing at next checkpoint
- Per-node load monitoring with configurable rebalancing thresholds evaluated at checkpoint boundaries
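A minimal sketch of the hash-based, restart-reproducible assignment described above, assuming string sensor IDs, an FNV-1a hash (one reproducible choice, not necessarily the one used), and rank 0 reserved for aggregation:

    #include <cstdint>
    #include <string>

    // FNV-1a: deterministic across processes and restarts, no coordination needed.
    uint64_t fnv1a(const std::string& key) {
        uint64_t h = 1469598103934665603ULL;          // FNV offset basis
        for (unsigned char c : key) {
            h ^= c;
            h *= 1099511628211ULL;                    // FNV prime
        }
        return h;
    }

    // Map a sensor to a fixed worker rank (assumes at least one worker rank).
    int assign_rank(const std::string& sensor_id, int world_size) {
        const int workers = world_size - 1;           // rank 0 is the aggregator
        return 1 + static_cast<int>(fnv1a(sensor_id) % static_cast<uint64_t>(workers));
    }

Because the mapping depends only on the sensor ID and the worker count, the same assignment can be recomputed after a restart without coordination, matching the reproducibility requirement above.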
SIMD-Accelerated Local Processing
Per-node vectorised compute using platform intrinsics
Each compute node processes its assigned data partition using SIMD intrinsics targeting the available instruction set: SSE4.2, AVX2, or AVX-512 depending on the hardware target. The processing kernels are written with explicit vectorisation: data is loaded into wide registers and operations applied to multiple samples per instruction cycle. A pre-processing step normalises irregular sensor data into SIMD-friendly aligned structures before the main processing loop.
- AVX2 / SSE4.2 intrinsics for floating-point signal processing kernels
- Aligned memory allocation for SIMD load/store operations
- Data normalisation pre-pass for irregular sensor data formats
- Loop unrolling calibrated to cache line size of target hardware
- Throughput benchmarking per kernel: reported as samples/second
- Fallback scalar path for non-SIMD hardware targets
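As an illustration of the vectorised kernels, a hedged sketch of a per-sample gain/offset calibration in AVX2, processing eight floats per iteration (FMA assumed available alongside AVX2). A 32-byte-aligned buffer is assumed, and the trailing scalar loop doubles as the fallback path for non-SIMD targets:

    #include <immintrin.h>
    #include <cstddef>

    // samples must be 32-byte aligned for the aligned load/store used below.
    void calibrate_avx2(float* samples, std::size_t n, float gain, float offset) {
        const __m256 g = _mm256_set1_ps(gain);
        const __m256 o = _mm256_set1_ps(offset);
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 v = _mm256_load_ps(samples + i);   // aligned load, 8 samples
            v = _mm256_fmadd_ps(v, g, o);             // v * gain + offset via FMA
            _mm256_store_ps(samples + i, v);
        }
        for (; i < n; ++i)                            // scalar tail / fallback path
            samples[i] = samples[i] * gain + offset;
    }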
Low-Latency Aggregation and Windowed Output
Collecting, fusing, and windowing results from all compute nodes
The aggregation layer collects processed results from all compute nodes using MPI gather operations and fuses them into a single output stream with configurable time-windowing. Window size and overlap are tunable at startup: narrower windows reduce latency, wider windows improve statistical quality of the fused output. The aggregator detects node failures and produces partial fused output from available nodes rather than blocking on the missing contribution.
- MPI gather-based result collection with configurable timeout per node
- Time-windowed fusion with configurable window size and hop length
- Partial-result fusion on node failure: output continues from available nodes
- Output stream formatted for downstream consumer compatibility
- Latency measurement per fusion cycle logged to performance monitor
- Zero shared memory across nodes: all coordination via explicit MPI messages
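One way to realise the timeout-bounded, partial-result collection is with non-blocking point-to-point receives rather than a blocking collective, since a plain MPI_Gather would stall on a missing rank. The sketch below is an assumption-laden illustration: fixed per-window payload sizes, rank 0 as aggregator, and a simple per-sample mean standing in for the real fusion step:

    #include <mpi.h>
    #include <vector>

    // Collect one window of per-worker results on rank 0, waiting at most
    // timeout_s seconds, then fuse whatever arrived (here: a per-sample mean).
    std::vector<float> collect_window(int world_size, int samples_per_worker,
                                      double timeout_s) {
        const int workers = world_size - 1;           // ranks 1..N-1 send results
        std::vector<std::vector<float>> bufs(
            workers, std::vector<float>(samples_per_worker));
        std::vector<MPI_Request> reqs(workers);
        std::vector<int> done(workers, 0);

        for (int w = 0; w < workers; ++w)
            MPI_Irecv(bufs[w].data(), samples_per_worker, MPI_FLOAT,
                      /*source=*/w + 1, /*tag=*/0, MPI_COMM_WORLD, &reqs[w]);

        // Poll until every worker has reported or the per-window deadline passes.
        const double deadline = MPI_Wtime() + timeout_s;
        int remaining = workers;
        while (remaining > 0 && MPI_Wtime() < deadline) {
            for (int w = 0; w < workers; ++w) {
                if (done[w]) continue;
                int flag = 0;
                MPI_Test(&reqs[w], &flag, MPI_STATUS_IGNORE);
                if (flag) { done[w] = 1; --remaining; }
            }
        }

        // Drop contributions that never arrived instead of blocking on them.
        for (int w = 0; w < workers; ++w) {
            if (done[w]) continue;
            MPI_Cancel(&reqs[w]);
            MPI_Wait(&reqs[w], MPI_STATUS_IGNORE);
        }

        // Illustrative partial fusion: per-sample mean over the workers that reported.
        std::vector<float> fused(samples_per_worker, 0.0f);
        int reporters = 0;
        for (int w = 0; w < workers; ++w) {
            if (!done[w]) continue;
            ++reporters;
            for (int i = 0; i < samples_per_worker; ++i) fused[i] += bufs[w][i];
        }
        if (reporters > 0)
            for (float& v : fused) v /= static_cast<float>(reporters);
        return fused;
    }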
The data partitioning strategy determines whether per-node SIMD efficiency translates into end-to-end throughput gains or is wasted on cross-partition communication: poorly chosen partition boundaries force inter-node data movement that saturates the MPI communication layer before the SIMD compute budget is reached.
SIMD vectorisation for irregular sensor data layouts requires a pre-processing step that normalises data into SIMD-aligned structures, and this step must not itself become the bottleneck.
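A minimal sketch of such a pre-pass, assuming an illustrative array-of-structs record format (not the project's actual layout): values are copied into a 32-byte-aligned buffer zero-padded to a multiple of the AVX2 width, so the main loop needs neither unaligned loads nor tail handling:

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>
    #include <vector>

    struct SensorRecord {        // hypothetical irregular input record
        double timestamp;
        float  value;
        int    quality_flag;
    };

    // Returns a 32-byte-aligned float buffer of padded_n samples; caller frees
    // with std::free. Padding samples are zeroed.
    float* normalise_to_aligned(const std::vector<SensorRecord>& records,
                                std::size_t& padded_n) {
        padded_n = (records.size() + 7) & ~std::size_t{7};    // round up to 8
        float* out = static_cast<float*>(
            std::aligned_alloc(32, padded_n * sizeof(float)));
        std::memset(out, 0, padded_n * sizeof(float));
        for (std::size_t i = 0; i < records.size(); ++i)
            out[i] = records[i].value;                        // strip to plain floats
        return out;
    }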
Fault recovery in MPI without a full pipeline restart is non-standard: the aggregator must handle missing contributions without blocking on an acknowledgement that will never arrive. The windowing layer must maintain consistent timestamps across partitions even when nodes process data at slightly different rates due to scheduling jitter.