Author: Mengyi Liu, Researcher
Summary: This blog describes a challenge on Content-Based Video Relevance Prediction (CBVRP) that Hulu hosted at the IEEE International Conference on Image Processing (ICIP). With this challenge, Hulu, as a premier video streaming service, drove the study of an open problem: exploiting content characteristics of the original video to deal with the "cold-start" issue encountered by most recommendation systems. We aimed to bring research teams together to develop algorithms for (video) content-to-content relevance. We learned a lot from the challenge, and Hulu is committed to continuing this open interaction with, and contribution to, the community.
Video streaming services, such as Hulu, depend heavily on their video recommender systems to help users discover videos they will enjoy (see Figure 1). Most existing recommendation systems compute video relevance based on users' implicit feedback, e.g., watch and search behavior. That is, the system analyzes user-to-video preferences and computes video-to-video relevance scores using traditional collaborative filtering methods.
However, this kind of method performs poorly on the "cold-start" problem: when a new video is added to the library, the recommendation system must bootstrap its relevance scores with very little known user behavior. One promising way to solve "cold-start" is to exploit video content for relevance prediction, i.e., to predict video relevance by analyzing the content of the videos, including image pixels, audio, subtitles, and metadata. Since the content contains almost all the information about a video, ideally we have enough detail to build the video relevance table from video content alone.
Generally, content-based methods focus on recommending items whose content characteristics are similar to those of the items the user liked in the past. One of the key issues is how to extract the most relevant content features of each item. In most existing systems, content features are associated with items as structured metadata (e.g., movie/show genre, director/actors, description) or as unstructured information from external sources, such as tags and textual reviews. In contrast to these kinds of "explicit" features, there are also "implicit" content characteristics that can be extracted from the original movie/show video. Such characteristics could be visual features encoding low-level information like lighting, color, shape, and motion, or high-level semantics like plot, mood, and artistic style.
To drive the study of this open problem, Hulu organized a challenge on Content-Based Video Relevance Prediction (CBVRP) to provide a common platform for algorithm development. The challenge was held at the IEEE International Conference on Image Processing (ICIP) in Beijing, China, September 17th–20th, 2017. Over the several-month challenge session, a total of 22 teams registered, with members from around the world, including the United States, Singapore, Mainland China, and Taiwan. In the end, we received several result submissions and one technical paper. In this tech blog, we introduce the challenge task in detail, together with the efforts from both the participants and ourselves.
The Challenge
Problem
The challenge task is to learn a model that can compute the relevance between TV shows/movies from the video content and the associated metadata (e.g., actor/actress, director, description, genre). Specifically, there are two sets of items: a retrieval set R and a candidate set C.
For each item (show/movie) r in R, we will provide:
- The show/movie's trailers (with thumbnails) and the associated metadata
- A relevance vector v(r) with respect to the candidate set C
  - v_c(r) = m indicates that candidate item c is the m-th most relevant item to item r among all items in the candidate set C, where m ∈ {1, 2, …, M}
  - v_c(r) = 0 indicates that candidate item c is not among the top M (M = 30) most relevant candidate items for item r
The relevance lists we provide are learned from massive user behavior and can be treated as ground truth. There is no overlap between R and C, i.e., R ∩ C = ∅. Due to legal constraints, we provide neither the videos nor the metadata of the items in the candidate set.
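To make the encoding concrete, here is a minimal Python sketch of how such a relevance vector could be stored; the candidate IDs and values are hypothetical and not part of the released data.

```python
# Hypothetical illustration of the relevance-vector encoding described above;
# the real data uses Hulu's own item identifiers.
M = 30

# v(r) for one retrieval item r: maps candidate_id -> rank m (1 = most relevant).
# Candidates absent from the mapping have v_c(r) = 0, i.e. not among the top M.
relevance_r = {"c017": 1, "c483": 2, "c102": 3}

def v(relevance_vector, candidate_id):
    """Return the rank of a candidate item, or 0 if it is not among the top M."""
    return relevance_vector.get(candidate_id, 0)

print(v(relevance_r, "c483"))  # 2 -> second most relevant candidate for r
print(v(relevance_r, "c999"))  # 0 -> not among the top M
```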
Challenge Dataset
We provide 400 trailers (i.e., videos) extracted from 80 TV shows/movies as the retrieval set (all videos are publicly available on Hulu's YouTube channel). Along with each video clip, a thumbnail and metadata (e.g., description, genre, actor/actress, director) are provided for each show/movie. For evaluation purposes, the whole set is further divided into three subsets: training, validation, and testing. The distribution is listed as follows:
- Training set, #shows = 20, #trailers = 163, #thumbnails = 163;
- Validation set, #shows = 20, #trailers = 59, #thumbnails = 59;
- Testing set, #shows = 40, #trailers = 178, #thumbnails = 178.
In the experiments, the training and validation sets can be combined for feature and model learning.
Representation
We utilize deep neural network features for the video representation. Specifically, for frame-level features, we decode each video at 1 fps and then feed the decoded frames into the Inception network [1]. There are two feature representations, depending on the dataset on which the model was trained: ImageNet (pool3 layer) and Google OpenImages (prob layer). After obtaining the frame features, we compute their mean as the final video signature. For video-level features, we also employ two state-of-the-art architectures, LSTM and 3D-CNN, using their most popular implementations from BVLC [3] and Facebook [2], respectively. Here we decode each video at 8 fps and feed the frame stream directly into the pre-trained models for feature extraction. The details of each feature are listed below (a short pooling sketch follows the list):
- Inception model (trained on ImageNet) — pool3 layer — dim: 2048
- Inception model (trained on OpenImages) — prob layer — dim: 6012
- BVLC LSTM model (trained on UCF101)
  - fc6 layer — dim: 4096
  - lstm layer — dim: 256
  - prob layer — dim: 101
- Facebook C3D model (trained on Sports1M)
  - res5b layer — dim: 25088
  - pool5 layer — dim: 512
  - prob layer — dim: 487
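As a minimal sketch of the frame-level pipeline, the mean pooling step looks like the following; the decoding and network helpers in the usage comment are placeholders, not part of the released features.

```python
import numpy as np

def video_signature(frame_features):
    """Mean-pool frame-level features (one row per decoded frame) into a single
    video-level signature, as done for the frame-level representations above."""
    frames = np.asarray(frame_features, dtype=np.float32)  # shape: (num_frames, dim)
    return frames.mean(axis=0)                              # shape: (dim,)

# Hypothetical usage: `decode_at_1fps` and `extract_pool3` stand in for decoding
# the trailer at 1 fps and running each frame through the Inception network
# (pool3 layer, dim = 2048); neither helper exists in the released toolkit.
# frames = [extract_pool3(img) for img in decode_at_1fps("trailer.mp4")]
# signature = video_signature(frames)   # 2048-d video descriptor
```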
Basic Solutions
The following solutions came from the submitted results, combined with our own findings.
Relevance Prediction
According to the challenge rules, participating teams have no access to the candidate data, so directly matching test data against candidate data is infeasible. To deal with this, there are two options for relevance prediction: (1) synthesize relevance lists/scores for the test data directly from the relevance lists/scores of the training data; or (2) synthesize the unseen candidate data and then generate the relevance lists/scores for the test data.
Scheme #1: Synthetic Relevance List/Score
We can simply calculate the relevance (cosine similarity) w(r*, r) between an item r* in the testing set and an item r in the training set based on the video content features, and then use w(r*, r) to compute a relevance score vector s(r*), where s_c(r*) indicates the relevance between r* and candidate item c:

s_c(r*) = Σ_{r=1..N} w(r*, r) · I[v_c(r) ≠ 0]

where N is the number of items in the training set, and I[v_c(r) ≠ 0] equals 1 if v_c(r) ≠ 0 and 0 otherwise.
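A minimal sketch of Scheme #1, assuming features and ground-truth relevance lists are stored per item in plain dictionaries (the variable names are illustrative, not the challenge toolkit's API):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def scheme1_rank(f_test, train_features, train_relevance, candidates):
    """Scheme #1: score each candidate c for a test item r* by accumulating the
    content similarity w(r*, r) over every training item r whose ground-truth
    top-M list contains c (i.e. v_c(r) != 0), then rank candidates by score."""
    scores = {c: 0.0 for c in candidates}
    for r, f_train in train_features.items():
        w = cosine(f_test, f_train)              # w(r*, r)
        for c in train_relevance[r]:             # candidates with v_c(r) != 0
            if c in scores:
                scores[c] += w                   # s_c(r*) += w(r*, r) * I[v_c(r) != 0]
    return sorted(candidates, key=lambda c: scores[c], reverse=True)
```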
Scheme #2: Synthetic Candidate Data
Alternatively, we can estimate the representation of the unseen candidate data according to its relevance to the training data [4]. Here we use the rank index of item c in the relevance list of training item r to represent their "similarity", i.e., w(c, r) = 1/rank(c, r), and then use w(c, r) as the weighting coefficient of each training feature f_r to synthesize the candidate feature f_c:

f_c = Σ_{r ∈ T_c} w(c, r) · f_r

where T_c is the subset of training items in whose top-M relevance lists candidate c appears. Figure 2 illustrates the basic idea of synthesizing the candidate feature.
Figure 2. Schema of synthesizing the candidate features.
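The candidate synthesis itself reduces to a weighted sum; here is a minimal sketch, using the same dictionary layout as the Scheme #1 sketch above:

```python
import numpy as np

def synthesize_candidate_feature(c, train_relevance, train_features):
    """Scheme #2: approximate the feature of an unseen candidate c as a weighted
    sum of training features, with weights w(c, r) = 1 / rank(c, r) taken from
    the ground-truth relevance lists of the training items r in T_c."""
    f_c = None
    for r, relevance in train_relevance.items():   # relevance: candidate_id -> rank m
        if c in relevance:                          # r belongs to T_c
            w = 1.0 / relevance[c]                  # w(c, r) = 1 / rank(c, r)
            contribution = w * np.asarray(train_features[r], dtype=np.float32)
            f_c = contribution if f_c is None else f_c + contribution
    return f_c  # None if c never appears in any training relevance list

# Once every candidate feature is synthesized, the relevance list for a test item
# r* can be produced by ranking candidates by cosine similarity to its feature.
```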
Experiments
In the experiments, we use two metrics, Recall and Hit Rate, for performance measurement. We aim to measure how difficult it is to derive the desired ranking from the predictions of the proposed methods. More specifically, suppose the ground-truth top-M most relevant candidate items for item r in the testing set are o(r) = [o_1(r), o_2(r), …, o_M(r)], where o_i(r) ∈ C is the candidate item ranked at the i-th position in o(r). Similarly, ô(r) = [ô_1(r), ô_2(r), …, ô_N(r)] denotes the top-N prediction results for item r. Then Recall@topN is defined as:

Recall@topN = |o(r) ∩ ô(r)| / M

Hit@topN for a single test case is 1 if Recall@topN > 0 and 0 otherwise. In the end, we report the averages of Recall and Hit over all test cases as the final performance.
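Both metrics are straightforward to compute per test item; a minimal sketch:

```python
def recall_at_topn(ground_truth_top_m, predicted_top_n):
    """Recall@topN = |o(r) ∩ ô(r)| / M for a single test item, where o(r) is the
    ground-truth top-M list and ô(r) is the predicted top-N list."""
    overlap = len(set(ground_truth_top_m) & set(predicted_top_n))
    return overlap / len(ground_truth_top_m)

def hit_at_topn(ground_truth_top_m, predicted_top_n):
    """Hit@topN = 1 if the prediction recovers at least one ground-truth item."""
    return 1.0 if recall_at_topn(ground_truth_top_m, predicted_top_n) > 0 else 0.0

# The reported numbers are the averages of Recall@topN and Hit@topN over all test items.
```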
Experimental results of Recall@topN for Scheme #1 and Scheme #2 based on the visual modality.
Experimental results of Hit@topN for Scheme #1 and Scheme #2 based on the visual modality.
Visualization of the performance comparison based on different features with different schemes.
From the results above, we can observe that Scheme #2 significantly outperforms Scheme #1. The reason may be the scarcity of training data. Specifically, in Scheme #1 we search for "similar" items for each test show; however, these "similar" items can be "fake neighbors" due to data sparsity, which leads to an inaccurate estimate of the relevance list. In Scheme #2, the relevance between the training data and the candidate data is derived from the ground truth, so we obtain a more reliable neighborhood for the synthetic data, bridging the gap between training and test data.
Conclusion
Video content-to-content relevance prediction is an interesting and challenging domain. Taking a leading role, Hulu provided massive video assets, as well as ground-truth data for evaluation, to build a common platform for algorithm development, and received various promising solutions from the challenge. Since content-based relevance prediction remains an open problem, it still requires substantial effort on issues such as benchmarks and applications. This challenge provided a good opportunity for both Hulu and the participants to better understand the problem with real data. Hulu will organize the 2nd CBVRP challenge in conjunction with the ACM Multimedia Conference 2018, and will continue to work with the community to explore and improve the understanding of video content relevance.
For more details, please visit our original CBVRP grand challenge website: https://github.com/CBVRP-ICIP-2017/CBVRP-ICIP2017/blob/master/CBVRP-ICIP2017.ipynb.
References
[1] Tensorflow: Image recognition. https://www.tensorflow.org/tutorials/image_recognition.
[2] Tran, Du, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. “Learning spatiotemporal features with 3D convolutional networks.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497. 2015.
[3] Donahue, Jeffrey, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. “Long-term recurrent convolutional networks for visual recognition and description.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634. 2015.
[4] Li, Yan, Hanjie Wang, Hailong Liu, and Bo Chen. “A Study on Content-based Video Recommendation.” In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2017.