Road traffic accident classification using a sparse video transformer and adaptive fragmentation

Tetiana V. Normatova
Sergii V. Mashtalir

Abstract

In this work, we propose a simple yet effective approach for classifying short road traffic video clips into car accident and normal scenes. From each clip, eight frames are uniformly sampled across the sequence so that key events are preserved even in longer videos. Based on the Farnebäck optical flow map, adaptive fragment selection is performed, in which the patch size (8, 16, or 32 pixels) is determined for each region of the base grid. Smaller patches are used in areas with intensive motion to capture finer details, while larger patches are used in static regions to reduce computation. The selected fragments are non-overlapping, resized to a uniform scale, and converted into feature vectors. The architecture operates in two stages. First, a spatial transformer processes each frame independently, attending only to the selected fragments, which drastically reduces the number of feature tokens. Second, a temporal transformer processes the sequence of classification (CLS) tokens, i.e., compact per-frame representations, aggregating temporal dynamics across frames. This space-time factorization significantly lowers computational cost and memory consumption while maintaining high informativeness in motion-intensive regions. To address class imbalance, we employ a weighted cross-entropy loss (or a focal loss emphasizing hard examples) together with weighted random sampling during training. Optical flow maps and fragment lists are precomputed and cached on disk, which accelerates training epochs even on CPUs without specialized hardware. Evaluation was conducted on the Car Crash Dataset (1,500 accident and 3,000 normal videos) using an 80/20 train-test split with preserved class proportions. The proposed method achieved Accuracy = 0.864 and Macro-F1 = 0.851. Preliminary comparisons show that our approach outperforms both baseline uniform-patch Vision Transformers and traditional temporal aggregation schemes. The key advantage of the method lies in combining motion-guided feature reduction with a two-stage spatial-temporal processing pipeline, making the model suitable for realistic computational constraints (CPU-level inference) while maintaining high sensitivity to short and localized accident events. The approach is easily scalable and can be integrated with self-supervised pretraining techniques (e.g., masked video reconstruction). All experimental conditions, hyperparameters, and configurations are documented to ensure full reproducibility.
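
As an illustration of the motion-guided adaptive fragmentation described above, a minimal Python sketch is given below. It assumes OpenCV's Farnebäck implementation; the 32-pixel base grid, the magnitude thresholds, the uniform output size, and the helper names (sample_frames, flow_magnitude, adaptive_fragments) are illustrative assumptions, not the exact configuration reported in the paper.

    # Minimal sketch of motion-guided adaptive fragmentation (illustrative
    # thresholds, grid size, and helper names; not the authors' exact setup).
    import cv2
    import numpy as np

    def sample_frames(frames, num_frames=8):
        # Uniformly sample num_frames frames across the whole clip.
        idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
        return [frames[i] for i in idx]

    def flow_magnitude(prev_bgr, next_bgr):
        # Per-pixel Farneback optical flow magnitude between two frames.
        prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
        next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, next_gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        return np.linalg.norm(flow, axis=2)

    def adaptive_fragments(frame, mag, base=32, t_high=4.0, t_low=1.0,
                           out_size=16):
        # For each cell of the base grid, choose a patch size (8/16/32 px)
        # from the mean flow magnitude, crop the non-overlapping patches and
        # resize them to a uniform scale. A static cell yields one coarse
        # patch; a motion-intensive cell yields many fine patches.
        h, w = mag.shape
        fragments, coords = [], []
        for y in range(0, h - base + 1, base):
            for x in range(0, w - base + 1, base):
                m = mag[y:y + base, x:x + base].mean()
                size = 8 if m > t_high else (16 if m > t_low else 32)
                for py in range(y, y + base, size):
                    for px in range(x, x + base, size):
                        patch = frame[py:py + size, px:px + size]
                        fragments.append(cv2.resize(patch, (out_size, out_size)))
                        coords.append((py, px, size))
        return np.stack(fragments), coords

In this sketch, the resized fragments (together with their coordinates) would be flattened into feature vectors for the spatial transformer, whose per-frame CLS tokens then form the input sequence of the temporal transformer.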

Article Details

Section

Theoretical aspects of computer science, programming and data analysis

Author Biographies

Tetiana V. Normatova, Kharkiv National University of Radio Electronics, 14 Nauky Ave., Kharkiv, 61166, Ukraine

PhD student, Informatics Department

Sergii V. Mashtalir, Kharkiv National University of Radio Electronics, 14 Nauky Ave., Kharkiv, 61166, Ukraine

Doctor of Engineering Science, Professor, Informatics Department

Scopus Author ID: 36183980100
