With the recent growth of multimedia traffic over the Internet and the emergence of multimedia streaming service providers, improving Quality of Experience (QoE) for HTTP Adaptive Streaming (HAS) becomes more important. Alongside other factors, such as the media quality, HAS relies on the performance of the media player's Adaptive Bitrate (ABR) algorithm to optimize QoE in multimedia streaming sessions. QoE in HAS suffers from weak or unstable internet connections and suboptimal ABR decisions. When the player adapts imperfectly to the characteristics and conditions of the internet connection, stall events and quality level switches of varying durations can occur and negatively affect the QoE. In this paper, we address several open issues related to QoE for HAS, notably (i) the minimum noticeable duration of stall events in HAS; (ii) the correlation between the media quality and the impact of stall events on QoE; (iii) the end-user preference regarding multiple shorter stall events versus a single longer stall event; and (iv) the end-user preference for media quality switches over stall events. We study these open issues from both objective and subjective evaluation perspectives and present the correlation between the two types of evaluations. The findings documented in this paper can be used as a baseline for improving ABR algorithms and policies in HAS.
Due to the growing demand for video streaming services, providers have to deal with increasing resource requirements for increasingly heterogeneous environments. To mitigate this problem, many works have been proposed which aim to (i) improve cloud/edge caching efficiency, (ii) use computation power available in the cloud/edge for on-the-fly transcoding, and (iii) optimize the trade-off among various cost parameters, e.g., storage, computation, and bandwidth. In this paper, we propose LwTE, a novel Light-weight Transcoding approach at the Edge, in the context of HTTP Adaptive Streaming (HAS). During the encoding process of a video segment at the origin side, computationally intensive search processes are performed. The main idea of LwTE is to store the optimal results of these search processes as metadata for each video bitrate and reuse them at the edge servers to reduce the required time and computational resources for on-the-fly transcoding. For unpopular video segments/bitrates, LwTE stores only the highest bitrate plus the corresponding metadata (of very small size). In this way, in addition to the significant reduction in bandwidth and storage consumption, the time required for on-the-fly transcoding of a requested segment is remarkably decreased by utilizing its corresponding metadata, because unnecessary search processes are avoided. Popular video segments/bitrates are stored.
We investigate our approach for Video-on-Demand (VoD) streaming services by optimizing storage and computation (transcoding) costs at the edge servers and then compare it to conventional methods (store all bitrates, partial transcoding). The results indicate that our approach reduces the transcoding time by at least 80% and decreases the aforementioned costs by 12% to 70% compared to the state-of-the-art approaches.
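The store-versus-transcode trade-off described above can be illustrated with a minimal sketch. The cost model, the `Representation` fields, the price parameters, and the rule that every request triggers one transcode are all assumptions made for illustration; they are not taken from the LwTE paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Representation:
    """One bitrate of one segment (all fields are hypothetical inputs)."""
    bitrate_kbps: int
    size_mb: float            # size of the encoded segment at this bitrate
    metadata_mb: float        # size of the stored search-result metadata
    expected_requests: float  # forecast requests in the accounting period
    transcode_seconds: float  # metadata-assisted transcoding time per request


def should_store(rep: Representation,
                 storage_price_per_mb: float,
                 compute_price_per_second: float) -> bool:
    """Store the bitrate if that is no more expensive than transcoding on demand."""
    store_cost = rep.size_mb * storage_price_per_mb
    # Assumption: every request triggers one transcode; the metadata must
    # still be stored so the edge server can skip the search processes.
    transcode_cost = (rep.metadata_mb * storage_price_per_mb
                      + rep.expected_requests * rep.transcode_seconds
                      * compute_price_per_second)
    return store_cost <= transcode_cost


def plan_edge_storage(reps: List[Representation],
                      storage_price_per_mb: float,
                      compute_price_per_second: float) -> Dict[int, str]:
    """Map each bitrate to 'store' or 'transcode'; the highest bitrate is always stored."""
    highest = max(rep.bitrate_kbps for rep in reps)
    plan: Dict[int, str] = {}
    for rep in reps:
        keep = rep.bitrate_kbps == highest or should_store(
            rep, storage_price_per_mb, compute_price_per_second)
        plan[rep.bitrate_kbps] = "store" if keep else "transcode"
    return plan
```

Under this toy model, a frequently requested bitrate is cheaper to store, while a rarely requested one is cheaper to rebuild from the highest bitrate and its metadata.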
HTTP Adaptive Streaming (HAS) is the de-facto solution for video delivery over the Internet. In HAS, each video is encoded at multiple quality levels and resolutions (i.e., representations) to enable adaptation of the streaming session to viewing and network conditions of the client. This requirement brings encoding challenges along with it, e.g., a video source should be encoded efficiently at multiple bitrates and resolutions. Fast multi-rate encoding approaches aim to address this challenge of encoding multiple representations from a single video by re-using information from already encoded representations. In this paper, a convolutional neural network is used to speed up both multi-rate and multi-resolution encoding for HAS. For multi-rate encoding, the lowest bitrate representation is chosen as the reference. For multi-resolution encoding, the highest bitrate from the lowest resolution representation is chosen as the reference. Pixel values from the target resolution and encoding information from the reference representation are used to predict Coding Tree Unit (CTU) split decisions in High-Efficiency Video Coding (HEVC) for dependent representations. Experimental results show that the proposed method for multi-rate encoding can reduce the overall encoding time by 15.08% and parallel encoding time by 41.26%, with a 0.89% bitrate increase compared to the HEVC reference software. Simultaneously, the proposed method for multi-resolution encoding can reduce the encoding time by 46.27% for the overall encoding and 27.71% for the parallel encoding on average with a 2.05% bitrate increase.
In HTTP Adaptive Streaming (HAS), each video is divided into smaller segments, and each segment is encoded at multiple pre-defined bitrates to construct a bitrate ladder. To optimize bitrate ladders, per-title encoding approaches encode each segment at various bitrates and resolutions to determine the convex hull. From the convex hull, an optimized bitrate ladder is constructed, resulting in an increased Quality of Experience (QoE) for end-users. With the ever-increasing efficiency of deep learning-based video enhancement approaches, they are increasingly employed at the client side to increase the QoE, specifically when GPU capabilities are available. Therefore, scalable approaches are needed to support end-user devices with both CPU and GPU capabilities (denoted as CPU-only and GPU-available end-users, respectively) as a new dimension of a bitrate ladder. To address this need, we propose DeepStream, a scalable content-aware per-title encoding approach to support both CPU-only and GPU-available end-users. (i) To support backward compatibility, DeepStream constructs a bitrate ladder based on any existing per-title encoding approach. Therefore, the video content will be provided for legacy end-user devices with CPU-only capabilities as a base layer (BL). (ii) For high-end end-user devices with GPU capabilities, an enhancement layer (EL) is added on top of the base layer comprising lightweight video super-resolution deep neural networks (DNNs) for each bitrate-resolution pair of the bitrate ladder. A content-aware video super-resolution approach leads to higher video quality, however, at the cost of bitrate overhead. To reduce the bitrate overhead for streaming content-aware video super-resolution DNNs, DeepCABAC, context-adaptive binary arithmetic coding for DNN compression, is used. Furthermore, the similarity among (i) segments within a scene and (ii) frames within a segment is used to reduce the training costs of DNNs.
Experimental results show bitrate savings of 34% and 36% to maintain the same PSNR and VMAF, respectively, for GPU-available end-users, while the CPU-only users get the desired video content as usual.
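The convex-hull step of per-title encoding mentioned above can be sketched as follows. This is a simplified illustration, not DeepStream's procedure: the function name `build_bitrate_ladder`, the tuple layout, and the use of a linear bitrate axis are assumptions (practical systems often work on a logarithmic bitrate axis and add further constraints).

```python
from typing import Dict, List, Tuple

# (bitrate_kbps, quality_score, resolution_label); quality could be PSNR or VMAF
Point = Tuple[float, float, str]


def build_bitrate_ladder(points: List[Point]) -> List[Point]:
    """Return the rate-quality points on the upper convex hull, ordered by bitrate."""
    # Keep only the best-quality encode for each bitrate.
    best: Dict[float, Point] = {}
    for p in points:
        if p[0] not in best or p[1] > best[p[0]][1]:
            best[p[0]] = p
    candidates = sorted(best.values(), key=lambda p: p[0])

    hull: List[Point] = []
    for p in candidates:
        # Drop the last hull point while it lies on or below the line
        # from the point before it to the new point p.
        while len(hull) >= 2:
            (x1, y1, _), (x2, y2, _) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)

    # Spending more bitrate for no quality gain is never useful, so stop
    # as soon as quality stops increasing along the hull.
    ladder: List[Point] = []
    for p in hull:
        if ladder and p[1] <= ladder[-1][1]:
            break
        ladder.append(p)
    return ladder
```

Each surviving point names the resolution that gives the best quality at its bitrate, which is the information a ladder constructor needs.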
DeepVCA: Deep Video Complexity Analyzer Amirpour, Hadi; Schoeffmann, Klaus; Ghanbari, Mohammad ...
IEEE transactions on circuits and systems for video technology,
2024
Journal Article
Peer reviewed
Open access
Video streaming and its applications are growing rapidly, making video optimization a primary target for content providers looking to enhance their services. Enhancing the quality of videos requires the adjustment of different encoding parameters such as bitrate, resolution, and frame rate. To avoid brute-force approaches for predicting optimal encoding parameters, video complexity features are typically extracted and utilized. To predict optimal encoding parameters effectively, content providers traditionally use unsupervised feature extraction methods, such as ITU-T's Spatial Information (SI) and Temporal Information (TI), to represent the spatial and temporal complexity of video sequences. Recently, Video Complexity Analyzer (VCA) was introduced to extract DCT-based features to represent the complexity of a video sequence (or parts thereof). These unsupervised features, however, cannot accurately predict video encoding parameters. To address this issue, this paper introduces a novel supervised feature extraction method named DeepVCA, which extracts the spatial and temporal complexity of video sequences using deep neural networks. In this approach, the encoding bits required to encode each frame in intra-mode and inter-mode are used as labels for spatial and temporal complexity, respectively. Initially, we benchmark various deep neural network structures to predict spatial complexity. We then leverage the similarity of features used to predict the spatial complexity of the current frame and its previous frame to rapidly predict temporal complexity. This approach is particularly useful as the temporal complexity may depend not only on the differences between two consecutive frames but also on their spatial complexity. Our proposed approach demonstrates significant improvement over unsupervised methods, especially for temporal complexity.
As an example application, we verify the effectiveness of these features in predicting the encoding bitrate and encoding time of video sequences, which are crucial tasks in video streaming. The source code and dataset are available at https://github.com/cd-athena/DeepVCA.
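For context, the unsupervised SI and TI baselines named above can be sketched in a few lines. The sketch follows the commonly cited ITU-T P.910 definitions (SI is the maximum over time of the spatial standard deviation of the Sobel-filtered luma frame; TI is the maximum over time of the spatial standard deviation of the frame difference). The function names and the list-of-rows frame layout are assumptions, and this is not DeepVCA itself.

```python
import math
from statistics import pstdev
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # luma plane as rows of pixel values


def sobel_magnitudes(frame: Frame) -> List[float]:
    """Sobel gradient magnitude for every interior pixel (borders are skipped)."""
    height, width = len(frame), len(frame[0])
    out: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = (frame[y - 1][x + 1] + 2 * frame[y][x + 1] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y][x - 1] - frame[y + 1][x - 1])
            gy = (frame[y + 1][x - 1] + 2 * frame[y + 1][x] + frame[y + 1][x + 1]
                  - frame[y - 1][x - 1] - 2 * frame[y - 1][x] - frame[y - 1][x + 1])
            out.append(math.hypot(gx, gy))
    return out


def spatial_information(frames: Sequence[Frame]) -> float:
    """SI: maximum over frames of the standard deviation of the Sobel output."""
    return max(pstdev(sobel_magnitudes(frame)) for frame in frames)


def temporal_information(frames: Sequence[Frame]) -> float:
    """TI: maximum over frame pairs of the standard deviation of the difference."""
    best = 0.0
    for prev, cur in zip(frames, frames[1:]):
        diff = [c - p
                for row_prev, row_cur in zip(prev, cur)
                for p, c in zip(row_prev, row_cur)]
        best = max(best, pstdev(diff))
    return best
```

Both values are computed without any training labels, which is why the abstract calls them unsupervised features.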
Light field imaging, which captures both spatial and angular information, improves user immersion by enabling post-capture actions, such as refocusing and changing view perspective. However, light fields represent very large volumes of data with a lot of redundancy that coding methods try to remove. State-of-the-art coding methods usually focus on improving compression efficiency and overlook other important features in light field compression such as scalability. In this paper, we propose a novel light field image compression method that enables (i) viewport scalability, (ii) quality scalability, (iii) spatial scalability, (iv) random access, and (v) uniform quality distribution among viewports, while keeping compression efficiency high. To this end, light fields in each spatial resolution are divided into sequential viewport layers, and viewports in each layer are encoded using the previously encoded viewports. In each viewport layer, the available viewports are used to synthesize intermediate viewports using a video interpolation deep learning network. The synthesized views are used as virtual reference images to enhance the quality of intermediate views. An image super-resolution method is applied to improve the quality of the lower spatial resolution layer. The super-resolved images are also used as virtual reference images to improve the quality of the higher spatial resolution layer. The proposed structure also improves the flexibility of light field streaming, provides random access to the viewports, and increases error resiliency. The experimental results demonstrate that the proposed method achieves a high compression efficiency and that it can adapt to the display type, transmission channel, network condition, processing power, and user needs.
CTU depth decision algorithms for HEVC: A survey Çetinkaya, Ekrem; Amirpour, Hadi; Ghanbari, Mohammad ...
Signal processing. Image communication,
November 2021, Volume 99
Journal Article
Peer reviewed
Open access
High Efficiency Video Coding (HEVC) surpasses its predecessors in encoding efficiency by introducing new coding tools at the cost of an increased encoding time-complexity. The Coding Tree Unit (CTU) is the main building block used in HEVC. In the HEVC standard, frames are divided into CTUs with a predetermined size of up to 64 × 64 pixels. Each CTU is then divided recursively into a number of equally sized square areas, known as Coding Units (CUs). Although this diversity of frame partitioning increases encoding efficiency, it also increases the time complexity because more candidate partitionings must be searched to find the optimal one. To address this complexity, numerous algorithms have been proposed to eliminate unnecessary searches during CTU partitioning by exploiting the correlation in the video. In this paper, existing CTU depth decision algorithms for HEVC are surveyed. These algorithms are categorized into two groups, namely statistics and machine learning approaches. Statistics approaches are further subdivided into neighboring and inherent approaches. Neighboring approaches exploit the similarity between adjacent CTUs to limit the depth range of the current CTU, while inherent approaches use only the available information within the current CTU. Machine learning approaches try to extract and exploit similarities implicitly. Traditional methods like support vector machines or random forests use manually selected features, while recently proposed deep learning methods extract features during training. Finally, this paper discusses extending these methods to more recent video coding formats such as Versatile Video Coding (VVC) and AOMedia Video 1 (AV1).
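The recursive CTU partitioning and the neighboring depth-range idea described above can be sketched as follows. The variance threshold is a hypothetical stand-in for the encoder's rate-distortion search, and the function names, the `slack` parameter, and the block tuple layout are assumptions for illustration; none of this reproduces a specific surveyed algorithm.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple

Block = Tuple[int, int, int, int]  # (x, y, size, depth)


def depth_range_from_neighbors(neighbor_depths: Sequence[int],
                               slack: int = 1,
                               max_depth: int = 3) -> Tuple[int, int]:
    """Limit the depth search range using the depths of adjacent CTUs."""
    if not neighbor_depths:
        return 0, max_depth
    low = max(0, min(neighbor_depths) - slack)
    high = min(max_depth, max(neighbor_depths) + slack)
    return low, high


def partition_ctu(pixels: Sequence[Sequence[float]],
                  x: int = 0, y: int = 0, size: int = 64, depth: int = 0,
                  min_depth: int = 0, max_depth: int = 3,
                  threshold: float = 100.0) -> List[Block]:
    """Recursively split a square block into four equally sized sub-blocks."""
    values = [pixels[r][c]
              for r in range(y, y + size)
              for c in range(x, x + size)]
    must_split = depth < min_depth
    may_split = depth < max_depth
    # Hypothetical split test: a real encoder compares rate-distortion costs.
    if may_split and (must_split or pvariance(values) > threshold):
        half = size // 2
        blocks: List[Block] = []
        for dy in (0, half):
            for dx in (0, half):
                blocks += partition_ctu(pixels, x + dx, y + dy, half, depth + 1,
                                        min_depth, max_depth, threshold)
        return blocks
    return [(x, y, size, depth)]
```

With a 64 × 64 CTU and depths 0 to 3, the leaves correspond to CU sizes from 64 × 64 down to 8 × 8. Passing the range returned by `depth_range_from_neighbors` as `min_depth` and `max_depth` prunes the depths that would otherwise be searched.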
In general, manipulated videos will eventually undergo recompression. Video transcoding occurs when the standard used for recompression differs from the prior standard. Therefore, as a special sign of recompression, video transcoding can also be considered evidence of forgery in video forensics. In this paper, we focus on the detection and localization of video transcoding from AVC to HEVC (AVC-HEVC). There are two probable cases of AVC-HEVC transcoding: whole video transcoding and partial frame transcoding. However, the existing forensic methods only consider the detection of whole video transcoding, and they do not consider partial frame transcoding localization. In view of this, we propose a framewise scheme based on a convolutional neural network. First, we show through analysis that the essential difference between AVC-HEVC and HEVC is reflected in the high-frequency components of decoded frames. Then, the partition and location information of prediction units (PUs) is introduced to generate frame-level PU maps to make full use of the local artifacts of PUs. Finally, taking the decoded frames and PU maps as inputs, a dual-path network including specific convolutional modules and an adaptive fusion module is proposed. Through it, the artifacts on a single frame can be better extracted, and the transcoded frames can be detected and localized. Coupled with a simple voting strategy, the results of whole transcoding detection can be easily obtained. A large number of experiments are conducted to verify the performance. The results show that the proposed scheme outperforms or rivals the state-of-the-art methods in AVC-HEVC transcoding detection and localization.
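One way a simple voting strategy over framewise predictions could look is sketched below. The abstract does not specify the rule, so the majority vote, the function names, and both thresholds are assumptions for illustration; the per-frame scores stand in for the output of the framewise network.

```python
from typing import List, Sequence


def localize_transcoded_frames(frame_scores: Sequence[float],
                               frame_threshold: float = 0.5) -> List[int]:
    """Return the indices of frames whose score marks them as transcoded."""
    return [i for i, score in enumerate(frame_scores) if score >= frame_threshold]


def is_whole_video_transcoded(frame_scores: Sequence[float],
                              frame_threshold: float = 0.5,
                              vote_ratio: float = 0.5) -> bool:
    """Majority vote: flag the video if more than vote_ratio of frames are flagged."""
    if not frame_scores:
        return False
    votes = len(localize_transcoded_frames(frame_scores, frame_threshold))
    return votes / len(frame_scores) > vote_ratio
```

The same per-frame decisions serve both tasks: the index list gives localization, and the vote gives the whole-video decision.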