
VECTOR: Parallel Processing Engine for Real-Time Multi-Sensor Fusion

An MPI-based distributed pipeline processes high-bandwidth multi-sensor streams with SIMD-accelerated per-node compute, scales near-linearly with node count, and emits partial results under node failure without a full pipeline restart.

Figure: VECTOR sensor fusion pipeline
// The Challenge

Real-time fusion of high-bandwidth data streams from multiple sensors exceeds single-node processing capacity at operational sample rates, and general-purpose distributed computing frameworks introduce coordination overhead incompatible with the latency budget. A custom MPI-based pipeline eliminates middleware overhead and gives direct control over partitioning and communication patterns.

// Our Approach

Built an MPI-based distributed processing pipeline with data-parallel partitioning across compute nodes, SIMD-accelerated local processing on each node, and a low-latency aggregation layer with configurable windowing. Explicit message-passing eliminates shared-state coordination overhead. The architecture scales near-linearly with node count and handles node failure without full pipeline restart.

Module 01

Data-Parallel Partitioning

Affinity-aware stream partitioning across the MPI node cluster

Incoming sensor streams are partitioned across compute nodes using a hashing scheme that assigns each sensor's data to a fixed node for the duration of a processing window. This affinity-aware approach maximises cache locality: each node processes the same sensor data repeatedly, keeping calibration tables and filter coefficients hot in cache. Dynamic rebalancing at window boundaries redistributes load when a new node joins or an existing node is removed.

Sensor-affinity partitioning: each sensor stream maps to a dedicated MPI rank. All four compute nodes feed their results into a single aggregation layer that windows and fuses the outputs.
  • Sensor-affinity partitioning: each sensor assigned to a fixed MPI rank
  • Hash-based assignment reproducible across restarts without coordination
  • Dynamic rebalancing at window checkpoints on topology change
  • Partition map broadcast to all nodes on startup and after rebalancing
  • Load monitoring with straggler detection: slow partitions flagged and rebalanced at the next checkpoint
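The hash-based assignment above can be sketched as follows. This is a minimal illustration, not the production code: the function names are hypothetical, and FNV-1a stands in for whatever stable hash VECTOR actually uses (a fixed, dependency-free hash is what makes the mapping reproducible across restarts without coordination, since `std::hash` gives no cross-implementation guarantee).

```cpp
#include <cstdint>
#include <string>

// FNV-1a: a stable, dependency-free hash, so the sensor-to-rank mapping
// is reproducible across restarts without any coordination messages.
std::uint64_t fnv1a(const std::string& key) {
    std::uint64_t h = 1469598103934665603ULL;   // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Map a sensor ID to a fixed MPI rank for the duration of a processing
// window, giving each rank a stable affinity to one sensor's data.
int rank_for_sensor(const std::string& sensor_id, int num_ranks) {
    return static_cast<int>(fnv1a(sensor_id) % static_cast<std::uint64_t>(num_ranks));
}
```

Because the mapping depends only on the sensor ID and the rank count, every node can compute the full partition map independently; the broadcast after rebalancing is then only needed to agree on the rank count and window boundary.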
Module 02

SIMD-Accelerated Local Processing

Per-node vectorised compute using platform intrinsics

Each compute node processes its assigned data partition using SIMD intrinsics targeting the available instruction set: SSE4.2, AVX2, or AVX-512 depending on the hardware target. The processing kernels are written with explicit vectorisation: data is loaded into wide registers and operations applied to multiple samples per instruction cycle. A pre-processing step normalises irregular sensor data into SIMD-friendly aligned structures before the main processing loop.

SIMD vectorisation: scalar code processes one sample per instruction; AVX2 loads eight samples into a single wide register and applies the same operation in one instruction cycle, achieving 8x throughput per clock.
  • AVX2 / SSE4.2 intrinsics for floating-point signal processing kernels
  • Aligned memory allocation for SIMD load/store operations
  • Data normalisation pre-pass for irregular sensor data formats
  • Loop unrolling calibrated to cache line size of target hardware
  • Throughput benchmarking per kernel: reported as samples/second
  • Fallback scalar path for non-SIMD hardware targets
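A minimal sketch of this pattern, under assumptions: the kernel name and the gain-application operation are illustrative stand-ins for VECTOR's actual signal-processing kernels. It shows the shape of an AVX2 path processing eight floats per instruction, with the scalar fallback compiled in when AVX2 is unavailable, as the bullet above describes.

```cpp
#include <cstddef>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Hypothetical per-node kernel: apply a calibration gain to a block of
// float samples. With AVX2, eight samples are handled per instruction;
// otherwise the scalar loop below serves as the fallback path.
void apply_gain(const float* in, float* out, std::size_t n, float gain) {
    std::size_t i = 0;
#ifdef __AVX2__
    __m256 g = _mm256_set1_ps(gain);            // broadcast gain to 8 lanes
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(in + i);     // load 8 samples at once
        _mm256_storeu_ps(out + i, _mm256_mul_ps(v, g));
    }
#endif
    for (; i < n; ++i)                          // scalar tail / fallback
        out[i] = in[i] * gain;
}
```

The same loop doubles as the tail handler for buffers whose length is not a multiple of eight; the production pipeline avoids even that by padding buffers in the normalisation pre-pass.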
Module 03

Low-Latency Aggregation and Windowed Output

Collecting, fusing, and windowing results from all compute nodes

The aggregation layer collects processed results from all compute nodes using MPI gather operations and fuses them into a single output stream with configurable time-windowing. Window size and overlap are tunable at startup: narrower windows reduce latency, wider windows improve statistical quality of the fused output. The aggregator detects node failures and produces partial fused output from available nodes rather than blocking on the missing contribution.

  • MPI gather-based result collection with configurable timeout per node
  • Time-windowed fusion with configurable window size and hop length
  • Partial-result fusion on node failure: output continues from available nodes
  • Output stream formatted for downstream consumer compatibility
  • Latency measurement per fusion cycle logged to performance monitor
  • Zero shared memory across nodes: all coordination via explicit MPI messages
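The partial-result behaviour can be sketched without the MPI plumbing. Everything here is illustrative: the names are hypothetical, a plain mean stands in for the real fusion operator, and an empty `optional` models a node that missed the gather timeout. The point is the control flow: the aggregator fuses whatever arrived instead of blocking on a missing contribution.

```cpp
#include <optional>
#include <vector>

// One fusion cycle over the per-node windowed results. An empty optional
// marks a node whose contribution did not arrive before the timeout.
std::optional<double> fuse_window(const std::vector<std::optional<double>>& contributions) {
    double sum = 0.0;
    int present = 0;
    for (const auto& c : contributions) {
        if (c) { sum += *c; ++present; }
    }
    if (present == 0) return std::nullopt;  // no node reported: skip this window
    return sum / present;                   // partial result from available nodes
}
```

In the real pipeline the gather itself must also be non-blocking (e.g. per-node receives with a deadline) so that a dead rank cannot stall the collective; the sketch only covers what happens after collection.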
// Technical Complexity

Data partitioning strategy determines whether SIMD efficiency translates to end-to-end throughput gains or is wasted on cross-partition communication: getting partition boundaries wrong forces inter-node data movement that saturates the MPI communication layer before the SIMD compute budget is reached.

SIMD vectorisation for irregular sensor data layouts requires a pre-processing step that normalises data into SIMD-aligned structures, and this step must not itself become the bottleneck.
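One way such a pre-pass can be structured, sketched under assumptions (the struct and function names are hypothetical): irregular records are copied once into a 32-byte-aligned, zero-padded buffer sized to a whole number of 8-float AVX2 vectors, so the main loop needs neither unaligned loads nor tail handling. Keeping it to a single copy per window is what keeps the pre-pass off the critical path.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <vector>

// A SIMD-ready buffer: 32-byte aligned, length padded to a multiple of 8.
struct AlignedBlock {
    float* data;             // caller frees with std::free
    std::size_t padded_len;  // multiple of 8 floats
};

AlignedBlock normalise(const std::vector<float>& irregular) {
    std::size_t padded = (irregular.size() + 7) / 8 * 8;  // round up to 8 floats
    // padded * sizeof(float) is a multiple of 32, as aligned_alloc requires.
    float* buf = static_cast<float*>(std::aligned_alloc(32, padded * sizeof(float)));
    std::memcpy(buf, irregular.data(), irregular.size() * sizeof(float));
    std::memset(buf + irregular.size(), 0,
                (padded - irregular.size()) * sizeof(float));  // zero-pad the tail
    return {buf, padded};
}
```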

Fault recovery in MPI without a full pipeline restart is non-standard: the aggregator must handle missing contributions without blocking on an acknowledgement that will never arrive. The windowing layer must maintain consistent timestamps across partitions even when nodes process data at slightly different rates due to scheduling jitter.

// Stack and Methods
MPI · C++ · SIMD · AVX2 · Signal Processing · HPC · Parallel Computing · Sensor Fusion