The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip is taken from a different YouTube video. The actions are human-focussed and cover a broad range of classes, including human-object interactions such as playing instruments and human-human interactions such as shaking hands.
The current version contains 306,245 videos divided into three splits: a training split with 250–1000 videos per class, a validation split with 50 videos per class, and a test split with 100 videos per class.
Kinetics videos are generally about 10 s long and centered on human actions. The dataset mainly comprises single-person actions, person-to-person actions, and person-object actions. For pre-training, we select a random subset of videos while keeping the classes equally represented. Unless otherwise stated, pre-training is done on the K400-50k subset for all experiments.
More details can be found here.
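The class-balanced random subset used for pre-training can be illustrated with a minimal sketch. This is not the authors' code: the function name `sample_balanced_subset`, the `(video_id, class_name)` input format, and the fixed seed are all hypothetical, and the sketch assumes that "equal distribution" means the same number of videos drawn from every class (for example, 50,000 videos over 400 classes would give 125 per class).

```python
import random
from collections import defaultdict


def sample_balanced_subset(labels, subset_size, seed=0):
    """Draw a random subset with the same number of videos per class.

    labels: iterable of (video_id, class_name) pairs (hypothetical format).
    subset_size: total number of videos to draw; assumed here to be
        divisible by the number of classes.
    """
    # Group video ids by their class label.
    by_class = defaultdict(list)
    for video_id, class_name in labels:
        by_class[class_name].append(video_id)

    per_class, remainder = divmod(subset_size, len(by_class))
    if remainder:
        raise ValueError("subset_size must be divisible by the number of classes")

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for class_name in sorted(by_class):  # sorted for a deterministic order
        videos = by_class[class_name]
        if len(videos) < per_class:
            raise ValueError(f"class {class_name!r} has too few videos")
        # Sample without replacement within the class.
        subset.extend(rng.sample(videos, per_class))
    return subset
```

Sampling within each class, rather than over the pooled training set, is what keeps the subset balanced even though the training split itself ranges from 250 to 1000 videos per class.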