Toward 3D pose pseudo-annotation on in-the-wild fitness videos
SIROCCHI, CLAUDIO
2024/2025
Abstract
Human 3D pose estimation from monocular videos in uncontrolled environments is hindered by the lack of direct three-dimensional annotations. In this work, we propose a modular architecture that aligns 2D poses estimated from in-the-wild videos with 3D ground-truth poses from a controlled dataset (Fit3D). The system applies anatomical normalization (neck–pelvis), followed by axial rotations, and then performs frame-by-frame matching based on composite metrics (MPJPE, PCK, pairwise similarity) to select the most congruent pose. Qualitative validation is conducted by overlaying the projected SMPL-X mesh on real video frames. Experiments comparing ground-truth data with videos acquired in uncontrolled settings demonstrate consistent alignments and robust performance under moderate variations in angle, viewpoint, and context. The proposed architecture represents a solid starting point toward automating 3D annotation in fitness, rehabilitation, and sports domains.
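The normalization and matching steps described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation: the joint indices (`neck_idx`, `pelvis_idx`), the PCK threshold, and the use of MPJPE alone as the selection criterion (rather than the full composite metric) are all assumptions made here for clarity.

```python
import numpy as np

def normalize_neck_pelvis(joints, neck_idx=0, pelvis_idx=1):
    """Anatomical normalization: center the skeleton on the pelvis
    and scale so the neck-pelvis segment has unit length."""
    centered = joints - joints[pelvis_idx]
    torso_len = np.linalg.norm(centered[neck_idx])
    return centered / torso_len

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: mean Euclidean distance
    between corresponding joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck(pred, gt, threshold=0.2):
    """Percentage of Correct Keypoints: fraction of joints whose
    distance to ground truth falls below the threshold."""
    return (np.linalg.norm(pred - gt, axis=-1) < threshold).mean()

def best_match(query, gt_frames):
    """Frame-by-frame matching: return the index of the ground-truth
    frame minimizing MPJPE after anatomical normalization."""
    q = normalize_neck_pelvis(query)
    errors = [mpjpe(q, normalize_neck_pelvis(f)) for f in gt_frames]
    return int(np.argmin(errors))
```

In practice the selection would combine MPJPE, PCK, and pairwise similarity into a composite score before taking the argmin, as the abstract indicates.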
File | Size | Format
---|---|---
Tesi_ClaudioFinalepdfA.pdf (under embargo until 12/01/2027) | 20.58 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.12075/22684