More cameras, more problems? Why deep learning still struggles with 3D human sensing

by Simon Osuji
August 12, 2025
in Artificial Intelligence


Motivation: Comparison of different token scanning methods. (a) Cross Attention acts on all image tokens. (b) Projective Attention obtains anchors with perspective projection and selectively attends to sample tokens surrounding the anchor points. (c) The proposed Grid Token-guided Bidirectional Scanning (GTBS) encodes the local context and the joint spatial sequence at the visual feature and person keypoint levels. Credit: The authors

Accurately estimating human pose was among the first tasks addressed by deep learning. Early models like OpenPose focused on localizing human joints as 2D keypoints in image coordinates. Later, Google released MediaPipe, followed by YOLOPose, which gained major attention and is widely adopted for its efficiency and accuracy.


Naturally, the next frontier was estimating human poses in 3D—predicting the (x, y, z) locations of joints in a global reference frame. Since single-image to 3D is an ill-posed problem, the task promised to become easier using multiple cameras. However, despite years of research, multi-view 3D multi-person pose estimation remains surprisingly difficult.
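The ill-posedness is easy to see with a pinhole camera model: every 3D point along a camera ray lands on the same pixel, so a single image cannot distinguish depths. A minimal illustration (the intrinsics and the example joint below are made up for the sketch):

```python
def project(point, focal=1000.0, cx=640.0, cy=360.0):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)

joint = (0.2, -0.1, 2.0)                  # a joint 2 m in front of the camera
farther = tuple(2.5 * c for c in joint)   # same viewing ray, 5 m away

print(project(joint), project(farther))   # identical pixels, different depths
```

A second camera breaks the tie, because the two rays through the two projections intersect at only one depth.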

Breaking down the multi-view puzzle

Multi-view 3D multi-person pose estimation is really a collection of sub-problems. Until recently, most studies first estimated 2D keypoints independently in each view using an off-the-shelf detector such as MediaPipe or YOLOPose, then matched the corresponding joints across views, followed by person matching, and finally used the camera parameters to triangulate the matched 2D keypoints into 3D.

However, a major shortcoming of such multi-stage pipelines is that errors at each stage compound. Moreover, these approaches failed to leverage the visual cues in the multi-view images: the first step discarded most of the pixel information, and the entire remaining pipeline rested on the 2D keypoints produced by an off-the-shelf detector.
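The triangulation stage of such pipelines is typically a Direct Linear Transform (DLT). A minimal sketch, assuming matched 2D detections and known 3x4 projection matrices (the two toy cameras and the point are illustrative, not from any dataset):

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """Triangulate one joint from matched 2D detections across views
    via the Direct Linear Transform (homogeneous least squares)."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]          # null-space vector of the stacked constraints
    return X[:3] / X[3]  # dehomogenize

# Two toy cameras: a reference view and one shifted 1 m along x.
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 3.0, 1.0])
obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in (P1, P2)]
print(triangulate_dlt(obs, [P1, P2]))  # recovers [0.2, -0.1, 3.0]
```

With noise-free observations the recovery is exact; with noisy 2D detections, the SVD returns the least-squares compromise, which is exactly where upstream keypoint errors leak into the 3D output.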

End-to-end learning: A paradigm shift

Recently, some researchers shifted their focus to an important question: can the entire task be supervised end to end? Let's consider the challenges such an approach raises.

First, such a setting requires the model to process the entire set of multi-view images, which is computationally expensive, unlike earlier approaches that saved computation by simply discarding most of the visual information. Second, how can the model learn geometric triangulation in an end-to-end differentiable framework? Lastly, since such a model directly regresses the 3D joints, how well would it generalize to new settings?

Model Architecture: MV-SSM processes multi-view images through a ResNet-50 backbone to extract multi-scale features, which are refined by stacked Projective State Space (PSS) blocks. These blocks leverage projective attention and state space modeling to progressively refine the keypoints, with the final 3D keypoints estimated via geometric triangulation. Credit: The authors

A few recent studies, including Learnable Triangulation, MvP, and MVGFormer, have explored this problem. Learnable Triangulation proposed two triangulation approaches—an algebraic and a volumetric. Crucially, both approaches are end-to-end differentiable, which allows for direct optimization of the target metric.
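The algebraic variant can be sketched as a confidence-weighted DLT: each view's equations are scaled by a predicted weight, and since the SVD-based solve is differentiable with respect to those weights, gradients can flow end to end. A simplified numpy illustration (no gradients shown; the cameras, weights, and the corrupted detection are invented for the demo):

```python
import numpy as np

def weighted_triangulate(points_2d, proj_mats, weights):
    """Algebraic triangulation with per-view confidence weights: each
    view's DLT rows are scaled by its weight, so unreliable detections
    are down-weighted. The SVD solve is differentiable in the weights,
    which is what lets such a layer be trained end to end."""
    rows = []
    for (u, v), P, w in zip(points_2d, proj_mats, weights):
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]

# Three toy cameras; the third view's detection is corrupted by 50 px.
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
Ps = [K @ np.hstack([np.eye(3), t.reshape(3, 1)])
      for t in (np.zeros(3), np.array([-1.0, 0, 0]), np.array([0, -1.0, 0]))]
X_true = np.array([0.2, -0.1, 3.0, 1.0])
obs = [(P @ X_true)[:2] / (P @ X_true)[2] for P in Ps]
obs[2] = obs[2] + 50.0

uniform = weighted_triangulate(obs, Ps, [1.0, 1.0, 1.0])
weighted = weighted_triangulate(obs, Ps, [1.0, 1.0, 0.05])
# Down-weighting the corrupted view yields a far closer estimate.
```

In the learned setting, the per-view weights come from the network itself, so the model can discover which views to trust for each joint.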

MvP utilized this method and directly regressed the multi-view 3D human poses without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. A key contribution is the geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint.
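Stripped to its essentials, projective attention is a project-and-sample step: project the current 3D joint estimate into each view and read features near the projected anchor. A hedged sketch, reducing "attend to sample tokens around the anchor" to a single bilinear sample per view (the function names and toy data are mine, not MvP's):

```python
import numpy as np

def project_point(X, P):
    """Project a 3D point with a 3x4 camera matrix to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def sample_feature(feat_map, uv):
    """Bilinearly sample an (H, W, C) feature map at pixel uv, i.e. read
    the feature at the projected anchor, here reduced to one sample."""
    u, v = uv
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    f = feat_map[v0:v0 + 2, u0:u0 + 2]  # 2x2 pixel neighborhood
    return ((1 - du) * (1 - dv) * f[0, 0] + du * (1 - dv) * f[0, 1]
            + (1 - du) * dv * f[1, 0] + du * dv * f[1, 1])

# Per-view samples around the projected joint are then fused (e.g. averaged).
feat = np.arange(32, dtype=float).reshape(4, 4, 2)  # toy 4x4 map, 2 channels
P = np.hstack([np.eye(3), np.zeros((3, 1))])        # toy identity camera
anchor = project_point(np.array([1.0, 2.0, 1.0]), P)
print(sample_feature(feat, anchor))
```

The appeal over plain cross attention is cost: each joint query touches only a small neighborhood per view instead of every image token.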

Recently, MVGFormer raised an important concern regarding regression models like MvP—that they have very low generalizability.

Exposé: The generalization crisis

MVGFormer demonstrated that earlier models overfit to their training datasets: if the number of cameras is reduced or increased at test time, AP25 scores drop sharply, indicating that even a gain in visual information cannot be effectively exploited by these models. Likewise, when camera positions or orientations change, or when the models are evaluated across datasets, performance degrades heavily.
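AP25 rewards predictions whose error to the matched ground truth stays under 25 mm. A simplified sketch of the metric, assuming predictions are already matched to ground-truth poses (the full metric additionally sweeps confidence scores for a precision-recall curve):

```python
import math

def mpjpe(pred, gt):
    """Mean per-joint position error between two poses, each a list of
    (x, y, z) joints in millimeters."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)

def ap_at_threshold(preds, gts, thresh_mm=25.0):
    """Simplified AP@25mm: the fraction of predicted poses whose MPJPE
    to the matched ground truth falls under the threshold."""
    hits = sum(mpjpe(p, g) <= thresh_mm for p, g in zip(preds, gts))
    return hits / len(preds)

gts = [[(0.0, 0.0, 0.0)] * 14, [(0.0, 0.0, 0.0)] * 14]
preds = [[(10.0, 0.0, 0.0)] * 14,   # 10 mm off: counts at 25 mm
         [(60.0, 0.0, 0.0)] * 14]   # 60 mm off: a miss
print(ap_at_threshold(preds, gts))  # 0.5
```

The tight 25 mm threshold is what makes the camera-count and cross-dataset drops so visible: a model that is merely "roughly right" in a new setting scores poorly.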

MVGFormer addressed this using transformers that combined geometric triangulation in a learning-based setting, with its appearance modules effectively using the visual information that was being discarded in earlier multi-stage pipelines.

Block Architecture: Architecture of the (a) Mamba block, (b) VSS block, and (c) the proposed PSS block. The PSS block captures joint spatial relationships through projective attention and state space modeling, progressively refining results. Credit: The authors

Visual Comparisons: A visual comparison against MVGFormer on the CMU Panoptic benchmark. Ground-truth human poses are shown in red, with the predicted pose overlaid for an accurate comparison. MV-SSM achieves accurate poses, especially in difficult scenarios; as illustrated in the first row, it better predicts, for example, the person's left foot. Note that the person colors differ because no ID-matching is performed. Credit: The authors

Strong generalization: MV-SSM’s winning metric

Building upon this research problem, and to better address the generalization issue, we proposed MV-SSM—a novel model for 3D human pose estimation using Multi-View State Space Modeling. MV-SSM was presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) held in Nashville, June 11–15. MV-SSM explicitly models the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level.

We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba’s traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block.
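The exact GTBS formulation is in the paper; the general pattern it modifies, Mamba-style bidirectional scanning, can be illustrated as a forward and a backward recurrent pass over flattened grid tokens, summed so each position carries context from both directions (a toy recurrence with a fixed scalar decay, not the paper's parameterization):

```python
import numpy as np

def bidirectional_scan(tokens, decay=0.9):
    """A forward and a backward recurrent pass over an (N, D) token
    sequence, summed so every position sees context from both sides.
    A toy stand-in for selective state space scans, whose decay is
    input-dependent rather than a fixed scalar."""
    n, d = tokens.shape
    fwd = np.zeros((n, d))
    bwd = np.zeros((n, d))
    state = np.zeros(d)
    for i in range(n):                # left-to-right pass
        state = decay * state + tokens[i]
        fwd[i] = state
    state = np.zeros(d)
    for i in reversed(range(n)):      # right-to-left pass
        state = decay * state + tokens[i]
        bwd[i] = state
    return fwd + bwd

# Grid tokens are flattened row-major before scanning.
grid = np.random.randn(8, 8, 16)
out = bidirectional_scan(grid.reshape(-1, 16))
```

The recurrence gives each token linear-time access to the whole sequence, which is why state space scans are attractive for long multi-view token sequences.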

Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art models: +24% on the challenging three-camera setting in CMU Panoptic, +13% on varying camera arrangements, and +38% on Campus A1 in cross-dataset evaluations.

Known calibration: The Achilles’ heel of 3D pose estimation

However, like previous models, MV-SSM has a major limitation: it assumes the camera parameters are known. Though the results are impressive, estimating 3D human poses without being constrained to a specific camera arrangement, tied to specific scenes in the training data, or dependent on a fixed number of cameras remains a major challenge, and one with tremendous industrial utility if solved.

This story is part of Science X Dialog, where researchers can report findings from their published research articles. Visit this page for information about Science X Dialog and how to participate.

More information:
Aviral Chharia et al, MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation, Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) (2025).

Aviral Chharia is a graduate student at Carnegie Mellon University. He has been awarded the ATK-Nick G. Vlahakis Graduate Fellowship at CMU, the Students' Undergraduate Research Graduate Excellence (SURGE) fellowship at IIT Kanpur, India, and the MITACS Globalink Research Fellowship at the University of British Columbia. Additionally, he was a two-time recipient of the Dean's List Scholarship during his undergraduate studies. His research interests include computer vision, computer graphics, and machine learning.

Citation:
More cameras, more problems? Why deep learning still struggles with 3D human sensing (2025, August 12)
retrieved 12 August 2025
from https://techxplore.com/news/2025-08-cameras-problems-deep-struggles-3d.html

