Action Classification

For this task, the model is finetuned end-to-end on downstream datasets, which are UCF101 and HMDB51 in our case. In Table 1, we compare our best performing model with other previous state-of-the-art approaches.

Observations: With only 30k videos compared to 240k videos used by other pretext tasks, we show that our model outperforms by good margin on UCF101 against single and multi-modal approaches. On HMDB51, we have a comparable performance. TCLR effectively takes in a larger clip duration at pre-training stage, whereas, other approaches use a bigger frame size or more number of frames.

Table 1: Comparison with previous approaches pre-trained on K400 full set. Ours (*best performing) is RSPNet pretrained on 30k subset of K400. † represents model with different backbone than R21D. *reproduced results.