Experiments on Pre-training Dataset


We first analyze the effect of varying the pre-training dataset size. The network is trained on four subsets of the K400 dataset containing 10,000 (10k), 30,000 (30k), 50,000 (50k), and 100,000 (100k) videos. Within each subset, every class contributes the same number of videos. Each smaller subset is contained in every larger one (i.e., 10k is a subset of 30k, and so on).
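The subset construction described above (equal videos per class, smaller subsets nested in larger ones) can be sketched as follows. This is a minimal illustration and not the sampling code used for the experiments: the function name nested_balanced_subsets, the seed, and the prefix-of-a-shuffle strategy are all assumptions made for the example.

```python
import random
from collections import defaultdict


def nested_balanced_subsets(labels, sizes, seed=0):
    """Return {size: sorted video indices} with class balance and nesting.

    labels: sequence where labels[i] is the class of video i.
    sizes:  total subset sizes; each must be divisible by the class count.
    (Hypothetical helper, written only to illustrate the protocol.)
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class once. Every subset takes a prefix of this fixed
    # order, so a smaller subset is always contained in a larger one.
    for indices in by_class.values():
        rng.shuffle(indices)

    num_classes = len(by_class)
    subsets = {}
    for size in sorted(sizes):
        per_class, remainder = divmod(size, num_classes)
        if remainder:
            raise ValueError(f"{size} is not divisible by {num_classes} classes")
        if any(len(v) < per_class for v in by_class.values()):
            raise ValueError(f"some class has fewer than {per_class} videos")
        subsets[size] = sorted(
            i for v in by_class.values() for i in v[:per_class]
        )
    return subsets


if __name__ == "__main__":
    # Toy data: 4 classes with 10 videos each.
    toy_labels = [i % 4 for i in range(40)]
    toy = nested_balanced_subsets(toy_labels, sizes=[4, 12, 20])
    print({size: len(indices) for size, indices in toy.items()})
```

With the 400 classes of K400 and an exactly even split, the four subset sizes correspond to 25, 75, 125, and 250 videos per class.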
We try to answer three questions regarding dependence on pre-train subset size:




Figure 1: Left: performance of three different architectures on the RSPNet pretext task for each pre-training subset (x-axis: subset size).
Right: CKA map for RSPNet on the 10k subset with the R21D backbone.



Table 1: RSPNet with different subset sizes on ShuffleNet/R21D.

Comparing the architectures in Figure 1, performance improves by 6-7% for all of them when the subset size increases from 10k to 30k. Increasing the subset size further, from 30k to 100k, has minimal effect on R21D and ShuffleNet, whereas VideoSwin still improves by 12.8%. As for training duration, across architectures and subsets the performance gain from training beyond 100 epochs is minimal (less than 1.5%).


Table 2: Performance of different pretext tasks on R21D with the 50k pre-training subset.

With the subset size fixed at 50k, the average gain in performance is less than 2% for every pretext task except PRP.

Qualitative Observation: Centered Kernel Alignment (CKA) Maps



Figure 2: CKA maps for layer representations of the R21D network on the RSPNet pretext task for all K400 subsets: 10k vs 10k, 30k vs 30k, 50k vs 50k, and 100k vs 100k (left to right).


Figure 3: CKA maps for layer representations of the R21D network on the RSPNet pretext task for the K400 10k subset: 50 epochs vs 50 epochs, 100 epochs vs 100 epochs, 150 epochs vs 150 epochs, and 200 epochs vs 200 epochs (left to right).


Figure 4: CKA maps for layer representations of the R21D network on the RSPNet pretext task for the K400 100k subset: 50 epochs vs 50 epochs, 100 epochs vs 100 epochs, 150 epochs vs 150 epochs, and 200 epochs vs 200 epochs (left to right).
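For readers unfamiliar with how such maps are built, the sketch below computes a layer-by-layer CKA matrix from stored activations. It assumes the linear variant of CKA and activations already flattened to shape (examples, features). The text does not state which CKA variant or feature extraction was used for the figures, so the function names linear_cka and cka_map and these choices are illustrative assumptions only.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices for the same n examples.

    x: array of shape (n, d1); y: array of shape (n, d2).
    Assumption: the linear kernel; other kernels give different values.
    """
    # Center every feature so the similarity ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


def cka_map(acts_a, acts_b):
    """Matrix whose (i, j) entry compares layer i of acts_a with layer j of acts_b.

    acts_a, acts_b: lists of per-layer activation matrices, each computed
    on the same n examples.
    """
    return np.array([[linear_cka(a, b) for b in acts_b] for a in acts_a])


if __name__ == "__main__":
    # Random stand-ins for three layers of activations on 64 examples.
    rng = np.random.default_rng(0)
    layers = [rng.normal(size=(64, d)) for d in (16, 32, 8)]
    print(cka_map(layers, layers).round(2))
```

Comparing a network with itself, as in the "10k vs 10k" style panels, gives a symmetric matrix with ones on the diagonal. Block structure away from the diagonal indicates groups of layers with similar representations.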




Inferences