
Accurately estimating human pose was among the first tasks addressed by deep learning. Early models like OpenPose focused on localizing human joints as 2D keypoints in image coordinates. Google later introduced MediaPipe, followed by YOLO-Pose, which gained major attention and is widely adopted for its efficiency and accuracy.
Naturally, the next frontier was estimating human poses in 3D: predicting the (x, y, z) locations of joints in a global reference frame. Since recovering 3D from a single image is an ill-posed problem, using multiple cameras promised to make the task easier. However, despite years of research, multi-view 3D multi-person pose estimation remains surprisingly difficult.
Breaking down the multi-view puzzle
Multi-view 3D multi-person pose estimation is a collection of sub-problems. Until recently, most studies first estimated 2D keypoints independently in each view using a detector such as MediaPipe or YOLO-Pose, then matched the corresponding joints across views, followed by person matching, and finally used the camera parameters to triangulate the 2D keypoints into 3D.
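The triangulation step at the end of this classic pipeline is typically the Direct Linear Transform (DLT): each view's 2D detection and camera projection matrix contribute two linear constraints on the homogeneous 3D point, and the system is solved by SVD. A minimal NumPy sketch (the two toy cameras and the point are illustrative, not from any dataset):

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Triangulate one 3D joint from its 2D detections in several views
    using the Direct Linear Transform (DLT).

    proj_mats : list of (3, 4) camera projection matrices
    points_2d : list of (x, y) pixel coordinates, one per view
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X: x * (P[2] @ X) = P[0] @ X, and
        # y * (P[2] @ X) = P[1] @ X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: the right singular vector associated
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Two toy cameras observing the point (0, 0, 4)
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])  # identity pose
P2 = np.hstack([np.array([[0., 0., -1.], [0., 1., 0.], [1., 0., 0.]]),
                np.array([[4.], [0.], [4.]])])
X_true = np.array([0., 0., 4.])

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X_hat = triangulate_dlt([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

With clean 2D detections the reconstruction is exact; in practice, detection noise in any view propagates directly into the 3D estimate, which is exactly the error-compounding problem discussed next.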
However, a major shortcoming of such multi-stage pipelines is that errors compound from stage to stage. Moreover, these approaches failed to leverage the visual cues in the multi-view images: the first step itself discarded most of the pixel information, and the entire remaining pipeline rested on the 2D keypoints estimated by an off-the-shelf detector.
End-to-end learning: A paradigm shift
Recently, some researchers shifted their focus to an important question: can the entire task be supervised end to end? Let's look at the challenges such an approach raises.
First, such a setting requires the model to process the entire multi-view image input, which is computationally expensive, unlike previous approaches that saved computation by simply discarding most of the visual information. Second, how can the model learn geometric triangulation in an end-to-end differentiable framework? Lastly, since such a model would directly regress the 3D joints, how well would it generalize to new settings?

A few recent studies, including Learnable Triangulation, MvP, and MVGFormer, have explored this problem. Learnable Triangulation proposed two triangulation approaches, one algebraic and one volumetric. Crucially, both are end-to-end differentiable, which allows direct optimization of the target metric.
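The algebraic variant can be illustrated by adding per-view confidence weights to the DLT system: in Learnable Triangulation these weights are predicted by a network, and because SVD is differentiable, gradients flow from the 3D error back to the weights. The NumPy sketch below uses hand-set weights purely for illustration (the real model runs this inside a deep learning framework's differentiable SVD):

```python
import numpy as np

def triangulate_weighted(proj_mats, points_2d, weights):
    """Weighted algebraic (DLT) triangulation of one joint. Each view
    contributes two linear constraints on the homogeneous 3D point,
    scaled by that view's confidence weight."""
    rows = []
    for P, (x, y), w in zip(proj_mats, points_2d, weights):
        rows.append(w * (x * P[2] - P[0]))
        rows.append(w * (y * P[2] - P[1]))
    _, _, Vt = np.linalg.svd(np.stack(rows))
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Three toy cameras observing the point (0, 0, 4); view 3's 2D
# detection is corrupted to mimic an occluded, unreliable view.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.array([[0., 0., -1.], [0., 1., 0.], [1., 0., 0.]]),
                np.array([[4.], [0.], [4.]])])
P3 = np.hstack([np.array([[0., 0., 1.], [0., 1., 0.], [-1., 0., 0.]]),
                np.array([[-4.], [0.], [4.]])])
X_true = np.array([0., 0., 4.])

pts = [project(P, X_true) for P in (P1, P2)] + [np.array([0.5, 0.3])]
X_bad = triangulate_weighted([P1, P2, P3], pts, [1.0, 1.0, 1.0])
X_good = triangulate_weighted([P1, P2, P3], pts, [1.0, 1.0, 1e-3])
# Down-weighting the corrupted view pulls the estimate back to X_true.
```

The point of making this step differentiable is that the network can learn, from the 3D supervision alone, to assign low weight to occluded or unreliable views.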
MvP went a step further and directly regressed multi-view 3D human poses without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information in the input images to directly regress the 3D joint locations. A key contribution is a geometrically guided attention mechanism, called projective attention, which fuses cross-view information more precisely for each joint.
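In rough terms, the idea behind projective attention is: project the joint query's current 3D estimate into each view, sample image features at the projected locations, and fuse them across views. The sketch below is a simplified illustration of that geometric idea, not MvP's exact formulation (the single-point sampling and plain softmax fusion are our simplifying assumptions):

```python
import numpy as np

def project(P, X):
    """Project a 3D point with a (3, 4) camera matrix; returns pixels."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def bilinear_sample(feat, xy):
    """Sample a (H, W, C) feature map at continuous pixel coordinates."""
    H, W, _ = feat.shape
    x, y = xy
    x = np.clip(x, 0, W - 1 - 1e-6)
    y = np.clip(y, 0, H - 1 - 1e-6)
    x0, y0 = int(x), int(y)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * feat[y0, x0]
            + dx * (1 - dy) * feat[y0, x0 + 1]
            + (1 - dx) * dy * feat[y0 + 1, x0]
            + dx * dy * feat[y0 + 1, x0 + 1])

def projective_attention(query, X_est, feats, proj_mats):
    """Fuse multi-view features for one joint query: project the current
    3D estimate into every view, sample features there, and take an
    attention-weighted sum (softmax over query-key scores)."""
    sampled = np.stack([bilinear_sample(f, project(P, X_est))
                        for f, P in zip(feats, proj_mats)])  # (V, C)
    scores = sampled @ query / np.sqrt(len(query))           # (V,)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return attn @ sampled                                    # (C,)

# Toy usage: two identical views with constant feature maps
C = 4
feats = [np.ones((16, 16, C))] * 2
K = np.array([[10., 0., 8.], [0., 10., 8.], [0., 0., 1.]])  # intrinsics
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
fused = projective_attention(np.arange(C, dtype=float),
                             np.array([0., 0., 4.]), feats, [P, P])
```

Because the camera geometry decides where each view is sampled, attention is restricted to pixels that are geometrically consistent with the joint's current 3D estimate, rather than attending over whole images.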
Recently, MVGFormer raised an important concern about regression models like MvP: they generalize poorly.
Exposé: The generalization crisis
MVGFormer demonstrated that earlier models overfit to their training datasets: if the number of cameras is reduced or increased during testing, AP25 scores drop sharply, indicating that these models cannot exploit even a gain in visual information. Likewise, when camera positions or orientations change, or when the models are evaluated across datasets, performance degrades heavily.
MVGFormer addressed this with transformers that combine geometric triangulation and learning, its appearance modules making effective use of the visual information that earlier multi-stage pipelines discarded.

Block Architecture: Architecture of the (a) Mamba block, (b) VSS block, and (c) the proposed PSS block. The PSS block captures joint spatial relationships through projective attention and state space modeling, progressively refining results. Credit: The authors

Visual Comparisons: We present a visual comparison against MVGFormer on the CMU Panoptic benchmark. Ground-truth human poses are shown in red, with the predicted poses overlaid for direct comparison. MV-SSM achieves accurate poses, especially in difficult scenarios; as illustrated in the first row, it better predicts, for example, the person's left foot. Note that the person colors differ since we do not perform ID matching. Credit: The authors
Strong generalization: MV-SSM’s winning metric
Building upon this research problem, and to better address the generalization issue, we proposed MV-SSM—a novel model for 3D human pose estimation using Multi-View State Space Modeling. MV-SSM was presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) held in Nashville, June 11–15. MV-SSM explicitly models the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level.
We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba’s traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block.
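To give a sense of the underlying machinery, a state space model processes a token sequence with a linear recurrence, and a bidirectional scan runs it in both directions so every token sees context from both sides. The sketch below is a generic, fixed-parameter scan for intuition only; it is not the PSS block or GTBS (which orders tokens with grid-token guidance), and Mamba additionally makes the parameters input-dependent:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence over a (T, D) token
    sequence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

def bidirectional_scan(tokens, A, B, C):
    """Scan the sequence forward and backward and sum the outputs,
    so each token aggregates context from both sides."""
    fwd = ssm_scan(tokens, A, B, C)
    bwd = ssm_scan(tokens[::-1], A, B, C)[::-1]
    return fwd + bwd

# Toy sequence: T=6 tokens, D=3 channels, N=4 hidden states
rng = np.random.default_rng(0)
A = 0.5 * np.eye(4)                     # decaying state transition
B = rng.standard_normal((4, 3))         # input projection
C_out = rng.standard_normal((3, 4))     # output projection
tokens = rng.standard_normal((6, 3))
y = bidirectional_scan(tokens, A, B, C_out)
```

The appeal of this recurrence over full attention is that it runs in time linear in the sequence length while still propagating information along the whole scan path, which is why the scanning order over joint tokens matters.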
Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art models: +24% on the challenging three-camera setting in CMU Panoptic, +13% on varying camera arrangements, and +38% on Campus A1 in cross-dataset evaluations.
Known calibration: The Achilles’ heel of 3D pose estimation
However, like previous models, MV-SSM has a major limitation: it assumes the camera parameters are known. Estimating 3D human poses without being constrained to a specific camera arrangement, tied to specific scenes in the training data, or dependent on a fixed number of cameras remains a major challenge, one which, if solved, would have tremendous industrial utility.
This story is part of Science X Dialog, where researchers can report findings from their published research articles.
More information:
Aviral Chharia et al, MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation, Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) (2025).
Aviral Chharia is a graduate student at Carnegie Mellon University. He has been awarded the ATK-Nick G. Vlahakis Graduate Fellowship at CMU, the Students' Undergraduate Research Graduate Excellence (SURGE) fellowship at IIT Kanpur, India, and the MITACS Globalink Research Fellowship at the University of British Columbia. Additionally, he was a two-time recipient of the Dean's List Scholarship during his undergraduate studies. His research interests include computer vision, computer graphics, and machine learning.
Citation: More cameras, more problems? Why deep learning still struggles with 3D human sensing (2025, August 12), retrieved 12 August 2025 from https://techxplore.com/news/2025-08-cameras-problems-deep-struggles-3d.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.








