Road traffic accident classification using a sparse video transformer and adaptive fragmentation

Tetiana V. Normatova
Sergii V. Mashtalir

Abstract

In this work, we propose a simple yet effective approach for classifying short road traffic video clips into car accident and normal scenes. From each clip, eight frames are uniformly sampled across the sequence so that key events are preserved even in longer videos. Based on the Farnebäck optical flow map, adaptive fragment selection is performed, in which the patch size (8, 16, or 32 pixels) is determined for each region of the base grid. Smaller patches are used in areas with intensive motion to capture finer details, while larger patches are used in static regions to reduce computation. The selected fragments are non-overlapping, resized to a uniform scale, and converted into feature vectors. The architecture operates in two stages. First, a spatial transformer processes each frame independently, attending only to the selected fragments, which drastically reduces the number of feature tokens. Second, a temporal transformer processes the sequence of classification (CLS) tokens, i.e., compact per-frame representations, aggregating temporal dynamics across frames. This space-time factorization significantly lowers computational cost and memory consumption while maintaining high informativeness in motion-intensive regions. To address class imbalance, we employ a weighted cross-entropy loss (or a focal loss emphasizing hard examples) together with weighted random sampling during training. Optical flow maps and fragment lists are precomputed and cached on disk, which accelerates training epochs even on CPUs without specialized hardware. Evaluation was conducted on the Car Crash Dataset (1,500 accident and 3,000 normal videos) using an 80/20 train-test split with preserved class proportions. The proposed method achieved Accuracy = 0.864 and Macro-F1 = 0.851. Preliminary comparisons show that our approach outperforms both baseline uniform-patch Vision Transformers and traditional temporal aggregation schemes. The key advantage of the method lies in combining motion-guided feature reduction with a two-stage spatial-temporal processing pipeline, making the model suitable for realistic computational constraints (CPU-level inference) while maintaining high sensitivity to short and localized accident events. The approach is easily scalable and can be integrated with self-supervised pretraining techniques (e.g., masked video reconstruction). All experimental conditions, hyperparameters, and configurations are documented to ensure full reproducibility.
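
As an illustration of the motion-guided adaptive fragmentation described above, a minimal Python sketch is given below. It assumes OpenCV's Farnebäck implementation; the 32-pixel base grid, the magnitude thresholds, the uniform output size, and the helper names (sample_frames, flow_magnitude, adaptive_fragments) are illustrative assumptions, not the exact configuration reported in the paper.

    # Minimal sketch of motion-guided adaptive fragmentation (illustrative
    # thresholds, grid size, and helper names; not the authors' exact setup).
    import cv2
    import numpy as np

    def sample_frames(frames, num_frames=8):
        # Uniformly sample num_frames frames across the whole clip.
        idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
        return [frames[i] for i in idx]

    def flow_magnitude(prev_bgr, next_bgr):
        # Per-pixel Farneback optical flow magnitude between two frames.
        prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
        next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, next_gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        return np.linalg.norm(flow, axis=2)

    def adaptive_fragments(frame, mag, base=32, t_high=4.0, t_low=1.0,
                           out_size=16):
        # For each cell of the base grid, choose a patch size (8/16/32 px)
        # from the mean flow magnitude, crop the non-overlapping patches and
        # resize them to a uniform scale. A static cell yields one coarse
        # patch; a motion-intensive cell yields many fine patches.
        h, w = mag.shape
        fragments, coords = [], []
        for y in range(0, h - base + 1, base):
            for x in range(0, w - base + 1, base):
                m = mag[y:y + base, x:x + base].mean()
                size = 8 if m > t_high else (16 if m > t_low else 32)
                for py in range(y, y + base, size):
                    for px in range(x, x + base, size):
                        patch = frame[py:py + size, px:px + size]
                        fragments.append(cv2.resize(patch, (out_size, out_size)))
                        coords.append((py, px, size))
        return np.stack(fragments), coords

In this sketch, the resized fragments (together with their coordinates) would be flattened into feature vectors for the spatial transformer, whose per-frame CLS tokens then form the input sequence of the temporal transformer.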

Article Details

Section

Theoretical aspects of computer science, programming and data analysis

Author Biographies

Tetiana V. Normatova, Kharkiv National University of Radio Electronics, 14 Nauky Ave., Kharkiv, 61166, Ukraine

PhD student, Informatics Department

Sergii V. Mashtalir, Kharkiv National University of Radio Electronics, 14 Nauky Ave., Kharkiv, 61166, Ukraine

Doctor of Engineering Science, Professor, Informatics Department

Scopus Author ID: 36183980100
