TY - JOUR
T1 - Action Recognition in Videos Using Pre-Trained 2D Convolutional Neural Networks
AU - Kim, Jun Hwa
AU - Won, Chee Sun
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2020
Y1 - 2020
N2 - A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream in the two-stream CNN structure for videos, treating the representative frame selected from the video as an input. However, the CNN for the temporal stream in the two-stream CNN needs training from scratch using the optical flow frames, which demands expensive computations. In this paper, we propose to adopt a pre-trained 2D CNN for the temporal stream to avoid the optical flow computations. Specifically, three RGB frames selected at three different times in the video sequence are converted into grayscale images and are assigned to three R(red), G(green), and B(blue) channels, respectively, to form a Stacked Grayscale 3-channel Image (SG3I). Then, the pre-trained 2D CNN is fine-tuned by SG3Is for the temporal stream CNN. Therefore, only pre-trained 2D CNNs are used for both spatial and temporal streams. To learn long-range temporal motions in videos, we can use multiple SG3Is by partitioning the video shot into sub-shots and a single SG3I is generated for each sub-shot. Experimental results show that our two-stream CNN with the proposed SG3Is is about 14.6 times faster than the first version of the two-stream CNN with the optical flow, and yet achieves a similar recognition accuracy for UCF-101 and a 5.7% better result for HMDB-51.
AB - A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream in the two-stream CNN structure for videos, treating the representative frame selected from the video as an input. However, the CNN for the temporal stream in the two-stream CNN needs training from scratch using the optical flow frames, which demands expensive computations. In this paper, we propose to adopt a pre-trained 2D CNN for the temporal stream to avoid the optical flow computations. Specifically, three RGB frames selected at three different times in the video sequence are converted into grayscale images and are assigned to three R(red), G(green), and B(blue) channels, respectively, to form a Stacked Grayscale 3-channel Image (SG3I). Then, the pre-trained 2D CNN is fine-tuned by SG3Is for the temporal stream CNN. Therefore, only pre-trained 2D CNNs are used for both spatial and temporal streams. To learn long-range temporal motions in videos, we can use multiple SG3Is by partitioning the video shot into sub-shots and a single SG3I is generated for each sub-shot. Experimental results show that our two-stream CNN with the proposed SG3Is is about 14.6 times faster than the first version of the two-stream CNN with the optical flow, and yet achieves a similar recognition accuracy for UCF-101 and a 5.7% better result for HMDB-51.
KW - action recognition
KW - Convolutional neural network (CNN)
KW - two-stream convolutional neural networks
KW - video analysis
UR - http://www.scopus.com/inward/record.url?scp=85083249776&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2020.2983427
DO - 10.1109/ACCESS.2020.2983427
M3 - Article
AN - SCOPUS:85083249776
SN - 2169-3536
VL - 8
SP - 60179
EP - 60188
JO - IEEE Access
JF - IEEE Access
M1 - 9047853
ER -