Video Clip Order Prediction (VCOP)


This approach learns video representations by predicting the order of shuffled clips. The network is fed N non-overlapping clips sampled from a video and predicts which of the N! possible permutations they were shuffled into.
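The sample-and-shuffle step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the clip length of 16 frames, and the equal-segment sampling strategy are assumptions for demonstration.

```python
import itertools
import random


def sample_and_shuffle(video_frames, n_clips=3, clip_len=16):
    """Sample n_clips non-overlapping clips from a video and shuffle them.

    Returns the shuffled clips together with the index of the applied
    permutation among the n_clips! possibilities, which serves as the
    classification label for order prediction.
    (Illustrative sketch; parameters and sampling strategy are assumed.)
    """
    # Split the frame range into n_clips equal segments and take one clip
    # from the start of each segment, so the clips never overlap.
    segment = len(video_frames) // n_clips
    clips = [video_frames[i * segment : i * segment + clip_len]
             for i in range(n_clips)]

    # Enumerate all n_clips! orderings; the index of the chosen one
    # is the label the network must predict.
    perms = list(itertools.permutations(range(n_clips)))
    label = random.randrange(len(perms))
    shuffled = [clips[i] for i in perms[label]]
    return shuffled, label
```

For N = 3 clips this yields a 6-way classification problem; the label space grows factorially, which is why small N (e.g. 3 or 4) is typical for this kind of pretext task.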

Overview of the Clip Order Prediction framework. (a) Sample and Shuffle: sample non-overlapping clips from a video and shuffle them into a random order. (b) Feature Extraction: use a shared 3D ConvNet to extract a feature for each clip. The 3D ConvNet is not pre-trained on any dataset. (c) Order Prediction: the extracted features are pairwise concatenated, and fully connected layers are placed on top to predict the actual order. The dashed lines indicate that the corresponding weights are shared among clips. The framework is trained end-to-end, and after training the 3D ConvNet can be used as a video feature extractor or as pre-trained weights.
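The order-prediction head in step (c) can be sketched with plain numpy. This is a hedged illustration of the pairwise-concatenation idea only: the function name, layer sizes, and use of a single shared hidden layer are assumptions, and a real model would use a deep-learning framework with learned weights.

```python
import itertools

import numpy as np


def order_prediction_head(features, w1, b1, w2, b2):
    """Predict clip order from per-clip features (illustrative sketch).

    features: array of shape (n_clips, d), one feature vector per clip,
    produced by a shared 3D ConvNet backbone.
    Each pair of clip features is concatenated and passed through a
    fully connected layer whose weights (w1, b1) are shared across all
    pairs; the resulting pair embeddings are concatenated and mapped by
    (w2, b2) to logits over the n_clips! possible orders.
    """
    n = features.shape[0]
    # All unordered pairs (i, j), i < j, concatenated along the feature axis.
    pairs = [np.concatenate([features[i], features[j]])
             for i, j in itertools.combinations(range(n), 2)]
    # Shared fully connected layer with ReLU, applied to every pair.
    pair_emb = [np.maximum(p @ w1 + b1, 0.0) for p in pairs]
    # Concatenate all pair embeddings and score the n! candidate orders.
    h = np.concatenate(pair_emb)
    return h @ w2 + b2
```

For N = 3 clips there are 3 pairs and 3! = 6 output logits; training minimizes cross-entropy between these logits and the permutation label from the shuffle step.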