UoP

Real-Time Multi-Dyadic Interaction Detection in Crowded Surveillance Video

Surveillance footage in the wild often contains several two-person interactions happening at once in a crowded frame. Most action-recognition pipelines were trained on single-actor clips, so they break down when interactions overlap. The pipeline had to localize each pair, isolate it, and classify the interaction in real time without dropping frames.

2023

Approach

Built a three-stage pipeline: YOLOv7 + SORT for spatiotemporal localization of person tracks, YOLO-Pose for skeleton extraction on each tracked subject, and X3D-M with attention for interaction classification on the dyad's stacked skeleton sequence. The skeleton-only classification stage is what kept the pipeline real-time on commodity hardware. Published at IEEE ICIIS 2023.
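
To make "the dyad's stacked skeleton sequence" concrete, here is a minimal sketch of how two tracked skeleton streams could be aligned and stacked before classification. It is an illustration only, not the published implementation: the `stack_dyad` helper, the `Track` layout (frame index to skeleton), and the choice to keep only frames where both people are tracked are all assumptions.

```python
from typing import Dict, List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence); layout is an assumption
Skeleton = List[Keypoint]              # one entry per joint
Track = Dict[int, Skeleton]            # frame index -> skeleton for one tracked person


def stack_dyad(track_a: Track, track_b: Track, clip_len: int) -> List[List[Skeleton]]:
    """Return the most recent `clip_len` frames in which both people are tracked.

    The result is a nested list indexed [frame][person][joint].
    """
    if clip_len <= 0:
        return []
    # Only frames present in both tracks can form a two-person sample.
    shared = sorted(set(track_a) & set(track_b))
    window = shared[-clip_len:]
    return [[track_a[f], track_b[f]] for f in window]


# Toy data: 2 joints per person; person A is seen in frames 0-3, person B in 1-4.
track_a = {f: [(10.0 + f, 20.0, 0.9), (12.0 + f, 30.0, 0.8)] for f in range(0, 4)}
track_b = {f: [(50.0 - f, 20.0, 0.9), (48.0 - f, 30.0, 0.7)] for f in range(1, 5)}

clip = stack_dyad(track_a, track_b, clip_len=2)
print(len(clip), len(clip[0]), len(clip[0][0]))  # frames, people, joints
```

In a real pipeline the nested list would be converted to a tensor and normalized before being passed to the classifier; those steps are omitted here.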

Why skeletons, not RGB

Running an RGB-based action classifier on every pair in a crowded frame doesn't fit the real-time budget. Skeletons are a 100× smaller representation and, for two-person interactions, carry most of the discriminative signal.
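
A rough back-of-the-envelope comparison of input sizes shows why the skeleton representation is so much cheaper. The numbers below are illustrative assumptions, not measurements from the paper: a 224×224 RGB crop per pair, and 17 joints per person with (x, y, confidence) each. The exact ratio depends heavily on crop size, joint count, and how the comparison is made, so it should not be read as a derivation of the figure quoted above.

```python
def rgb_values_per_frame(height: int, width: int, channels: int = 3) -> int:
    """Number of scalar values in one RGB crop."""
    return height * width * channels


def skeleton_values_per_frame(people: int, joints: int, dims: int = 3) -> int:
    """Number of scalar values in one frame of stacked skeletons."""
    return people * joints * dims


# Assumed sizes (hypothetical, chosen only for illustration).
rgb = rgb_values_per_frame(224, 224)         # one crop covering the pair
skel = skeleton_values_per_frame(2, 17)      # two people, 17 joints, (x, y, conf)
ratio = rgb / skel

print(f"RGB values per frame:      {rgb}")
print(f"Skeleton values per frame: {skel}")
print(f"Ratio: {ratio:.0f}x")
```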

Where it broke

Heavy occlusion still defeats the pose stage; that's the failure mode the paper is honest about. The planned future work was an appearance-feature fallback for tracks whose pose confidence drops below a threshold.
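
The fallback idea can be sketched as a simple per-track routing rule. This was future work, so nothing below reflects an existing implementation: the `choose_branch` helper, the use of mean keypoint confidence as the trigger, and the 0.5 threshold are all hypothetical.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence); layout is an assumption


def mean_pose_confidence(skeleton: List[Keypoint]) -> float:
    """Average keypoint confidence; an empty skeleton counts as zero confidence."""
    if not skeleton:
        return 0.0
    return sum(conf for _, _, conf in skeleton) / len(skeleton)


def choose_branch(skeleton: List[Keypoint], threshold: float = 0.5) -> str:
    """Route a track to the pose branch or to an appearance-feature fallback.

    The threshold value is a placeholder, not a tuned number.
    """
    if mean_pose_confidence(skeleton) >= threshold:
        return "pose"
    return "appearance"


visible = [(10.0, 20.0, 0.9), (12.0, 30.0, 0.8)]
occluded = [(10.0, 20.0, 0.2), (12.0, 30.0, 0.1)]

print(choose_branch(visible))
print(choose_branch(occluded))
```

A mean over all joints is the simplest trigger; a per-joint or temporally smoothed rule may be more robust when occlusion is partial.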