Adaptive fusion of human visual sensitive features for surveillance video summarization

Md Musfequs Salehin, Manoranjan Paul

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Surveillance video cameras capture large amounts of continuous video every day. Manually identifying significant events for analysis or investigation in this huge volume of video data is a laborious and tedious job. Existing approaches sometimes neglect key frames with significant visual content and/or select unimportant frames with little or no activity. To solve this problem, this paper proposes a video summarization technique that combines three multimodal human visual sensitive features: foreground objects, motion information, and visual saliency. Foreground objects are among the most important elements of a video stream, as they carry detailed information and play a major role in significant events. Motion is another stimulus that strongly attracts human visual attention; to capture it, motion information is computed in both the spatial and the frequency domains. Spatial motion information localizes object motion accurately but is sensitive to illumination changes, whereas frequency-domain motion information is robust to illumination changes but easily affected by noise, so both are employed. Furthermore, the visual attention cue indicates a user's level of attraction and helps determine key frames. Since none of these features performs well in isolation, an adaptive linear weighted fusion scheme is proposed to combine them and rank video frames for summarization. Experimental results show that the proposed method outperforms state-of-the-art methods.
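The adaptive linear weighted fusion described in the abstract can be sketched roughly as follows. The abstract does not specify the paper's actual weighting rule, so the variance-based weights, the min-max normalization, and all function names below are illustrative assumptions only, not the authors' method:

```python
import numpy as np

def normalize(x):
    # Scale a per-frame score vector to [0, 1] so the three features
    # are comparable before fusion (assumed normalization, not from the paper).
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

def fuse_scores(foreground, motion, saliency):
    """Fuse three per-frame feature scores with adaptive linear weights.

    Here the weight of each feature is proportional to its variance
    (a common adaptive heuristic: more discriminative features get
    larger weights). The paper's actual weighting rule may differ.
    """
    feats = [normalize(f) for f in (foreground, motion, saliency)]
    weights = np.array([f.var() for f in feats])
    total = weights.sum()
    weights = weights / total if total > 0 else np.full(3, 1.0 / 3.0)
    return sum(w * f for w, f in zip(weights, feats))

def top_key_frames(fused_scores, k):
    # Rank frames by fused score (descending) and return the top-k indices.
    return np.argsort(fused_scores)[::-1][:k]
```

A frame whose foreground, motion, and saliency scores are all maximal receives the highest fused score regardless of the weights, which matches the intuition that the fusion should favor frames salient under all three cues.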

Original language: English
Pages (from-to): 814-826
Number of pages: 13
Journal: Journal of the Optical Society of America A: Optics, Image Science, and Vision
Volume: 34
Issue number: 5
DOIs: 10.1364/JOSAA.34.000814
Publication status: Published - 01 May 2017

Cite this

@article{85b41ea583ea4697b7c1aaffa0cd8f69,
title = "Adaptive fusion of human visual sensitive features for surveillance video summarization",
abstract = "Surveillance video cameras capture large amounts of continuous video streams every day. To analyze or investigate any significant events, it is a laborious and boring job to identify these events from the huge video data if it is done manually. Existing approaches sometimes neglect key frames with significant visual contents and/or select some unimportant frames with low/no activity. To solve this problem, in this paper, a video summarization technique is proposed by combining three multimodal human visual sensitive features, such as foreground objects, motion information, and visual saliency. In a video stream, foreground objects are one of the most important pieces of a video as they contain more detailed information and play a major role in important events. Moreover, motion is another stimulus of a video that significantly attracts human visual attention. To obtain this, motion information is calculated in the spatial domain as well as the frequency domain. Spatial motion information can select object motion accurately; however, it is sensitive to illumination changes. On the other hand, frequency motion information is robust to illumination change, although it is easily affected by noise. Therefore, motion information in both the spatial and the frequency domains is employed. Furthermore, the visual attention cue is a sensitive feature to measure the indication of a user's attraction label for determining key frames. As these features individually cannot perform very well, they are combined to obtain better results. For this purpose, an adaptive linear weighted fusion scheme is proposed to combine the features to rank video frames for summarization. Experimental results reveal that the proposed method outperforms the state-of-the-art methods.",
author = "Salehin, {Md Musfequs} and Manoranjan Paul",
note = "Includes bibliographical references.",
year = "2017",
month = "5",
day = "1",
doi = "10.1364/JOSAA.34.000814",
language = "English",
volume = "34",
pages = "814--826",
journal = "Journal of the Optical Society of America A: Optics, Image Science, and Vision",
issn = "1084-7529",
publisher = "The Optical Society",
number = "5",

}

TY - JOUR

T1 - Adaptive fusion of human visual sensitive features for surveillance video summarization

AU - Salehin, Md Musfequs

AU - Paul, Manoranjan

N1 - Includes bibliographical references.

PY - 2017/5/1

Y1 - 2017/5/1

N2 - Surveillance video cameras capture large amounts of continuous video streams every day. To analyze or investigate any significant events, it is a laborious and boring job to identify these events from the huge video data if it is done manually. Existing approaches sometimes neglect key frames with significant visual contents and/or select some unimportant frames with low/no activity. To solve this problem, in this paper, a video summarization technique is proposed by combining three multimodal human visual sensitive features, such as foreground objects, motion information, and visual saliency. In a video stream, foreground objects are one of the most important pieces of a video as they contain more detailed information and play a major role in important events. Moreover, motion is another stimulus of a video that significantly attracts human visual attention. To obtain this, motion information is calculated in the spatial domain as well as the frequency domain. Spatial motion information can select object motion accurately; however, it is sensitive to illumination changes. On the other hand, frequency motion information is robust to illumination change, although it is easily affected by noise. Therefore, motion information in both the spatial and the frequency domains is employed. Furthermore, the visual attention cue is a sensitive feature to measure the indication of a user's attraction label for determining key frames. As these features individually cannot perform very well, they are combined to obtain better results. For this purpose, an adaptive linear weighted fusion scheme is proposed to combine the features to rank video frames for summarization. Experimental results reveal that the proposed method outperforms the state-of-the-art methods.

AB - Surveillance video cameras capture large amounts of continuous video streams every day. To analyze or investigate any significant events, it is a laborious and boring job to identify these events from the huge video data if it is done manually. Existing approaches sometimes neglect key frames with significant visual contents and/or select some unimportant frames with low/no activity. To solve this problem, in this paper, a video summarization technique is proposed by combining three multimodal human visual sensitive features, such as foreground objects, motion information, and visual saliency. In a video stream, foreground objects are one of the most important pieces of a video as they contain more detailed information and play a major role in important events. Moreover, motion is another stimulus of a video that significantly attracts human visual attention. To obtain this, motion information is calculated in the spatial domain as well as the frequency domain. Spatial motion information can select object motion accurately; however, it is sensitive to illumination changes. On the other hand, frequency motion information is robust to illumination change, although it is easily affected by noise. Therefore, motion information in both the spatial and the frequency domains is employed. Furthermore, the visual attention cue is a sensitive feature to measure the indication of a user's attraction label for determining key frames. As these features individually cannot perform very well, they are combined to obtain better results. For this purpose, an adaptive linear weighted fusion scheme is proposed to combine the features to rank video frames for summarization. Experimental results reveal that the proposed method outperforms the state-of-the-art methods.

UR - http://www.scopus.com/inward/record.url?scp=85018373568&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85018373568&partnerID=8YFLogxK

U2 - 10.1364/JOSAA.34.000814

DO - 10.1364/JOSAA.34.000814

M3 - Article

C2 - 28463326

AN - SCOPUS:85018373568

VL - 34

SP - 814

EP - 826

JO - Journal of the Optical Society of America A: Optics, Image Science, and Vision

JF - Journal of the Optical Society of America A: Optics, Image Science, and Vision

SN - 1084-7529

IS - 5

ER -