Abstract
Free viewpoint video (FVV) has attracted considerable attention in recent years, as it gives the user the freedom to observe a scene from different angles or viewpoints. A large number of views are required to facilitate the experience of FVV, which significantly increases transmission bandwidth and storage requirements in comparison to 2D video data. To reduce transmission bandwidth, an efficient approach is to encode a subset of views in the encoder and synthesise the other desired views in the decoder. However, view synthesis techniques suffer from poor rendering quality due to the holes which are created by occlusion and the integer rounding errors that occur during the warping process. To remove these holes in the virtual view, existing view synthesis techniques exploit spatial and temporal correlation in intra/inter-view images and depth maps. However, the resulting synthesised views still suffer from quality degradation at the boundary regions of foreground and background areas, due to the low spatial correlation in texture images and low correspondence in inter-view depth maps. To overcome such limitations, this thesis proposes five enhanced view synthesis techniques which exploit temporal correlations.
Firstly, in the proposed background improvement (BI) technique, adjacent views and their corresponding depth maps (with associated camera parameters) are taken as input. The input views are warped and blended into the target viewpoint (virtual view). Then, Gaussian mixture modelling (GMM) is used to model each pixel of the available target viewpoint frames. The GMM technique can provide rich information about the behaviour of pixels in previous frames for separating background and foreground pixels. Moreover, GMM models contain more correlative data than a static background frame, which is usually used to recover missing pixels in view synthesis. Then, the missing pixel intensities of the background areas are recovered using the most common frame in a scene (McFIS). The McFIS is generated from the GMM model(s), which provide(s) a smaller ratio between weight and standard deviation. This hole-filling approach provides a better pixel correspondence in comparison to the existing related techniques, as all decisions related to the foreground and the background are made based on the number of models, and the missing background pixels are recovered using the pixel intensities in the models. This technique improves the quality of the background areas.
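The per-pixel modelling step described above can be illustrated with a minimal Python sketch. It assumes a simplified online mixture update of the kind commonly used for background subtraction, applied to a single grey-level pixel. The names `Gaussian`, `update_pixel_models`, `background_estimate` and `fill_background_hole`, as well as the learning rate, matching threshold and initial variance, are illustrative assumptions and are not taken from the thesis. For simplicity, the sketch takes the highest-weight model as the background estimate; this is a stand-in for the McFIS selection rule based on weight and standard deviation, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Gaussian:
    weight: float
    mean: float
    var: float


def update_pixel_models(models, x, lr=0.05, max_models=3,
                        init_var=100.0, match_sigma=2.5):
    """Update one pixel's mixture in place with a new intensity x.

    All parameter values are illustrative assumptions.
    """
    # Find the first model whose mean lies within match_sigma std devs of x.
    matched = None
    for g in models:
        if abs(x - g.mean) <= match_sigma * g.var ** 0.5:
            matched = g
            break

    if matched is None:
        # No model explains x: add a new one, or replace the weakest one.
        new = Gaussian(weight=lr, mean=float(x), var=init_var)
        if len(models) < max_models:
            models.append(new)
        else:
            weakest = min(range(len(models)), key=lambda i: models[i].weight)
            models[weakest] = new
    else:
        # Reinforce the matched model and decay the others.
        for g in models:
            g.weight = (1 - lr) * g.weight + (lr if g is matched else 0.0)
        diff = x - matched.mean
        matched.mean += lr * diff
        matched.var = max((1 - lr) * matched.var + lr * diff * diff, 1.0)

    # Keep the weights normalised so they sum to one.
    total = sum(g.weight for g in models)
    for g in models:
        g.weight /= total
    return models


def background_estimate(models):
    """Assumed stand-in for McFIS: mean of the highest-weight model."""
    return max(models, key=lambda g: g.weight).mean


def fill_background_hole(warped_value, models):
    """Return the warped value, or the background estimate at a hole (None)."""
    return background_estimate(models) if warped_value is None else warped_value
```

Feeding the pixel a mostly constant intensity with a short burst of a different value (a passing foreground object) leaves the dominant model centred on the constant intensity, so a hole at that pixel would be filled with the background value rather than the foreground one.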
Secondly, to improve the quality of results generated using the BI technique, a weighted hole-filling using single view (WHFSV) technique is proposed. The WHFSV introduces a weighting factor in order to balance the relative contribution of the learned GMM models and the warped images. Within this process, the missing background pixels are recovered from the McFIS and missing foreground pixels are recovered from the weighted average of the warped image and learned foreground models. This technique provides a better pixel correspondence in comparison to the BI technique, by utilising GMM models for both the background and the foreground pixels. The WHFSV is further enhanced and named weighted hole-filling using multiple views (WHFMV). The WHFMV is similar to the WHFSV in that it also recovers missing foreground and background pixels. However, two adjacent views are considered instead of one, as adjacent views cover wider view angles. The WHFMV technique helps to improve synthesised view quality in comparison to the WHFSV technique. All three proposed techniques are able to improve the quality of the synthesised view in comparison to existing related techniques.
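The weighted recovery rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a scalar grey-level pixel, a fixed weighting factor in the range 0 to 1, and that a candidate warped value is available at the pixel being refined. The function names, the default weight of 0.5 and the convention that the weight applies to the model term are assumptions made for this example, not details from the thesis.

```python
def weighted_fill(warped, model_mean, weight):
    """Blend a learned model intensity with the warped intensity.

    weight = 1.0 trusts the model only; weight = 0.0 trusts the warped image only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * model_mean + (1.0 - weight) * warped


def fill_pixel(is_background, warped, mcfis_value, fg_model_mean, weight=0.5):
    """Recover one missing pixel.

    Background pixels come from the McFIS value; foreground pixels come from
    the weighted average of the warped image and the foreground model.
    """
    if is_background:
        return mcfis_value
    return weighted_fill(warped, fg_model_mean, weight)
```

For example, with a warped value of 100, a foreground model mean of 200 and a weight of 0.25, the recovered foreground intensity is 0.25 × 200 + 0.75 × 100 = 125.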
In the WHFSV and WHFMV techniques, the weighting factor does not vary with time, and therefore cannot accommodate the changes caused by a dynamic background and the motion of moving objects. An adaptive weighted average hole-filling (AWAHF) technique is proposed to overcome this problem by using an adaptive weighting factor. In this process, the missing pixels are recovered from the adaptively weighted average of the corresponding GMM model(s) pixel intensities and the warped image. In addition, an adaptive strategy is introduced to reset GMM modelling if the contributions of the pixel intensities from the GMM model(s) drop significantly due to the fast motion of background or foreground objects. The proposed AWAHF approach improves the quality of the synthesised view significantly in comparison to other related available techniques. The synthesised frame generated by the proposed AWAHF technique is used as an extra reference frame in two new multiview video coding (MVC) frameworks, which use different numbers of reference frames. The proposed MVC frameworks improve rate-distortion (RD) performance and reduce computational time in comparison to the conventional coding framework.
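The idea of an adaptive weighting factor with a reset condition can be illustrated with a short Python sketch. The abstract does not state how the weight is adapted or when the reset is triggered, so everything here is a hypothetical choice made for illustration: the weight is derived from how closely the model mean agrees with the warped value (a Gaussian-shaped agreement score), and a reset is signalled when the average recent weight falls below an assumed threshold. The function names and the threshold value are likewise assumptions.

```python
import math


def adaptive_weight(warped, model_mean, model_std, floor=1e-6):
    """Hypothetical rule: trust the model more when it agrees with the warped value."""
    z = abs(warped - model_mean) / max(model_std, floor)
    return math.exp(-0.5 * z * z)


def adaptive_fill(warped, model_mean, model_std):
    """Return the adaptively weighted average and the weight that was used."""
    w = adaptive_weight(warped, model_mean, model_std)
    return w * model_mean + (1.0 - w) * warped, w


def should_reset(recent_weights, threshold=0.1):
    """Signal a model reset when the model's average contribution is low.

    The threshold is an assumed value, not one from the thesis.
    """
    if not recent_weights:
        return False
    return sum(recent_weights) / len(recent_weights) < threshold
```

Under this assumed rule, a model that keeps disagreeing with the warped image (for example after fast motion) receives weights near zero, which in turn triggers `should_reset` so that modelling can start again from the current content.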
For all of the techniques proposed above, the frames of the target viewpoint are used for learning the GMM to improve the quality of the synthesised view. However, learning from the target viewpoint is not effective if a frame is the first incoming frame, if there is an instant change of view, or if the user switches views frequently. To address such possibilities, a new technique named view synthesis using side view information (VSSVI) is also proposed. In this process, the textures and depth of the adjacent views are used to learn the GMM. Then, a weighting factor is used to balance the contribution between the learned GMM models based on the side views and the warped image, in order to refine the missing pixel intensities of the synthesised view. The main advantage of VSSVI is that it allows users to switch views instantly and frequently for smooth FVV viewing.
Taken together, the contributions of this thesis improve view synthesis quality significantly in comparison to the state-of-the-art methods. Moreover, incorporating the generated view in the MVC coding framework improves RD performance and reduces computational time significantly in comparison to the 3D-high-efficiency video coding (3D-HEVC) MVC coding standard. The improvements to view synthesis described in this thesis are likely to be useful in various applications including professional training, video games, virtual reality (VR), sports events, movies, 360-degree video, 3D TV and free viewpoint television (FTV).
Original language | English |
---|---|
Qualification | Doctor of Philosophy |
Awarding Institution | |
Supervisors/Advisors | |
Award date | 29 Mar 2019 |
Publication status | Published - 01 Apr 2019 |