Why skeletons, not RGB
Running an RGB-based action classifier on every pair of people in a crowded frame doesn't fit the real-time budget. Skeletons are a 100× smaller representation and, for two-person interactions, carry most of the discriminative signal.
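To make the per-pair cost concrete, here is a minimal sketch of how the number of candidate pairs grows with the number of people in a frame. It assumes interactions are scored over unordered pairs of tracked people. The function names `candidate_pairs` and `pair_count` are hypothetical, not taken from the paper.

```python
from itertools import combinations


def candidate_pairs(track_ids):
    """Return every unordered pair of tracked people in one frame.

    Assumption: each pair is scored once, so (a, b) and (b, a) are the same pair.
    """
    return list(combinations(sorted(track_ids), 2))


def pair_count(n_people):
    """Closed form for the number of unordered pairs: n * (n - 1) / 2."""
    return n_people * (n_people - 1) // 2


if __name__ == "__main__":
    # Pair count grows quadratically with crowd size.
    for n in (2, 5, 10, 20):
        print(f"{n} people -> {pair_count(n)} pairs")
```

Whatever a single classifier call costs, it is multiplied by this pair count on every frame, which is why a compact per-person representation matters in crowded scenes.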
Where it broke
Heavy occlusion still defeats the pose stage; that's the failure mode the paper is honest about. Future work was to add an appearance-feature fallback for tracks whose pose confidence drops below a threshold.
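The fallback idea could be sketched roughly as below. This is an illustration under stated assumptions, not the paper's implementation: the `Track` structure, the threshold value of 0.3, and the choice to summarize pose confidence as the mean keypoint confidence are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

# Hypothetical threshold; the source does not give a value.
POSE_CONF_THRESHOLD = 0.3


@dataclass
class Track:
    track_id: int
    keypoint_conf: List[float]  # per-joint confidence from the pose stage


def pose_confidence(track: Track) -> float:
    """Summarize a track's pose quality as its mean keypoint confidence.

    Assumption: the mean is used here for simplicity; a track with no
    keypoints at all is treated as having zero confidence.
    """
    return mean(track.keypoint_conf) if track.keypoint_conf else 0.0


def route(track: Track, threshold: float = POSE_CONF_THRESHOLD) -> str:
    """Choose the feature branch for a track.

    Tracks whose pose confidence drops below the threshold fall back to
    appearance features; all others stay on the skeleton branch.
    """
    return "skeleton" if pose_confidence(track) >= threshold else "appearance"


if __name__ == "__main__":
    visible = Track(track_id=1, keypoint_conf=[0.9, 0.8, 0.7])
    occluded = Track(track_id=2, keypoint_conf=[0.1, 0.2, 0.0])
    print(route(visible), route(occluded))
```

Routing per track rather than per frame keeps the cheaper skeleton path for everyone who is clearly visible and pays the appearance cost only where the pose stage is unreliable.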