Surveillance cameras capture large volumes of continuous video every day. Manually locating significant events in this data for analysis or investigation is laborious and tedious. Existing summarization approaches sometimes miss key frames with significant visual content, select unimportant frames with little or no activity, or both. To address this problem, this paper proposes a video summarization technique that combines three multimodal features to which human vision is sensitive: foreground objects, motion information, and visual saliency. Foreground objects are among the most important elements of a video stream because they carry more detailed information and play a major role in important events. Motion is another stimulus that strongly attracts human visual attention, and it is calculated here in the spatial domain as well as the frequency domain. Spatial motion information captures object motion accurately but is sensitive to illumination changes. Frequency-domain motion information, by contrast, is robust to illumination changes but is easily affected by noise. Therefore, motion information from both the spatial and the frequency domains is employed. Furthermore, the visual attention cue is a sensitive feature for indicating a user's attraction label when determining key frames. Because none of these features performs very well on its own, they are combined to obtain better results: an adaptive linear weighted fusion scheme is proposed that combines the features to rank video frames for summarization. Experimental results show that the proposed method outperforms the state-of-the-art methods.
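The fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the adaptive weights are computed, so the sketch below assumes, purely for illustration, that each feature's weight is proportional to the variance of its normalized per-frame scores (a feature that barely changes across frames contributes little). The function names `min_max_normalize` and `fuse_and_rank`, the feature names, and the toy scores are all hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def min_max_normalize(scores):
    """Rescale per-frame scores to [0, 1]; a constant feature maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(features, summary_size):
    """Fuse per-frame feature scores linearly and pick the top-ranked frames.

    features: dict mapping a feature name to a list of per-frame scores
              (all lists must have the same length).
    Returns (key_frames, fused_scores, weights).
    """
    normalized = {name: min_max_normalize(vals) for name, vals in features.items()}

    # ASSUMPTION (not from the paper): weights are proportional to the
    # variance of each normalized feature, then scaled to sum to 1.
    spreads = {name: pvariance(vals) for name, vals in normalized.items()}
    total = sum(spreads.values())
    if total == 0:
        weights = {name: 1.0 / len(features) for name in features}
    else:
        weights = {name: spread / total for name, spread in spreads.items()}

    n_frames = len(next(iter(normalized.values())))
    # Linear weighted fusion: one combined score per frame.
    fused = [
        sum(weights[name] * normalized[name][i] for name in normalized)
        for i in range(n_frames)
    ]

    # Rank frames by fused score, keep the best ones, return them in time order.
    ranked = sorted(range(n_frames), key=lambda i: fused[i], reverse=True)
    key_frames = sorted(ranked[:summary_size])
    return key_frames, fused, weights


# Toy per-frame scores for five frames (hypothetical values).
features = {
    "foreground": [0.0, 0.1, 0.9, 1.0, 0.1],
    "motion": [0.0, 0.2, 0.8, 1.0, 0.0],
    "saliency": [0.5, 0.5, 0.5, 0.5, 0.5],
}
key_frames, fused, weights = fuse_and_rank(features, summary_size=2)
print(key_frames)
```

In this toy input the saliency score is constant, so under the assumed weighting it receives zero weight and the ranking is driven by the foreground and motion scores. A real system would replace the toy lists with scores produced by actual foreground, motion, and saliency detectors.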
|Number of pages|13|
|Journal|Journal of the Optical Society of America A: Optics and Image Science, and Vision|
|Publication status|Published - 01 May 2017|