Hypercolumns for Video Segmentation

Ben Eysenbach, Carl Vondrick, Antonio Torralba

Motivation

•  Large image datasets have significantly improved performance on many challenging computer vision tasks, such as object labeling and face recognition.
•  There are few large, labeled video datasets.
•  For image labeling, it was feasible to manually label every image. However, for a large video dataset, labeling every frame of every video would be prohibitively expensive and time intensive.

Hypercolumns

•  A hypercolumn is the vector of unit activations that a specific pixel produces at every layer as it passes through a convolutional neural network (CNN).
•  Units of the CNN learn object detectors. Simple detectors (edges, colors) occur in the first few layers, and complex detectors (faces, animals) are found deeper in the network. Thus, hypercolumns encode information about a pixel across many spatial scales.
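As a rough illustration of how a hypercolumn can be assembled, the sketch below resizes every feature map to the image resolution and stacks the values found at each pixel. It is a minimal sketch under stated assumptions, not the poster's implementation: the function names `upsample_nearest` and `hypercolumns`, the nested-list data layout, and the toy activation values are all hypothetical, and nearest-neighbour resizing is chosen only for brevity (other interpolation schemes are possible).

```python
def upsample_nearest(fmap, height, width):
    """Resize a 2-D feature map (list of rows) to height x width
    by nearest-neighbour sampling."""
    src_h, src_w = len(fmap), len(fmap[0])
    return [
        [fmap[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def hypercolumns(layers, height, width):
    """Return a height x width grid of per-pixel hypercolumn vectors.

    `layers` is a list of layers; each layer is a list of channels; each
    channel is a 2-D feature map that may be coarser than the image.
    """
    # Bring every channel of every layer to the image resolution.
    resized = [
        upsample_nearest(channel, height, width)
        for layer in layers
        for channel in layer
    ]
    # The hypercolumn of pixel (y, x) stacks the value of each map at (y, x).
    return [
        [[fmap[y][x] for fmap in resized] for x in range(width)]
        for y in range(height)
    ]


# Toy activations (made-up values): a full-resolution shallow layer with two
# channels and a half-resolution deeper layer with one channel.
shallow = [
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 0, 0]],
]
deep = [[[0.1, 0.2], [0.3, 0.4]]]

columns = hypercolumns([shallow, deep], height=4, width=4)
print(columns[0][0])  # [1, 0, 0.1]
print(columns[3][3])  # [16, 0, 0.4]
```

Each pixel ends up with one value per channel, so shallow (fine-scale) and deep (coarse-scale) information sit side by side in the same descriptor.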

•  We then train a support vector machine (SVM) classifier on the hypercolumn descriptors.

Graph Cuts

•  We want to convert the probability heatmap output by the SVM into a hard segmentation.
•  We use the Graph Cuts optimization technique to produce a segmentation which considers two factors:
    1.  The mask should look similar to the probability heatmap.
    2.  Adjacent pixels should be assigned to the same class.
•  Unlike the traditional application of Graph Cuts to image segmentation, our approach does not require human “scribble” annotations.
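The two Graph Cuts criteria described above can be written as an energy with a per-pixel term (agreement with the heatmap) and a pairwise term (a penalty when neighbouring pixels disagree), which an s-t minimum cut minimizes exactly for two classes. The following is a minimal sketch under stated assumptions rather than the poster's implementation: the function name `graph_cut_segment`, the negative-log-probability unary cost, the constant `smoothness` penalty, the 4-connected neighbourhood, and the use of a plain Edmonds-Karp max-flow are all illustrative choices.

```python
import math
from collections import deque


def graph_cut_segment(prob, smoothness=0.5, eps=1e-6):
    """Binarise a foreground-probability map with an s-t minimum cut.

    Energy = sum of -log(probability of the chosen label) over pixels
           + `smoothness` for each pair of 4-connected neighbours that disagree.
    """
    height, width = len(prob), len(prob[0])
    source, sink = "s", "t"
    cap = {}  # cap[u][v] = residual capacity of edge u -> v

    def add_edge(u, v, forward, backward):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0.0) + forward
        cap[v][u] = cap[v].get(u, 0.0) + backward

    for y in range(height):
        for x in range(width):
            p = min(max(prob[y][x], eps), 1.0 - eps)
            # Cutting source->pixel makes the pixel background (cost -log(1-p));
            # cutting pixel->sink makes it foreground (cost -log(p)).
            add_edge(source, (y, x), -math.log(1.0 - p), 0.0)
            add_edge((y, x), sink, -math.log(p), 0.0)
            if x + 1 < width:
                add_edge((y, x), (y, x + 1), smoothness, smoothness)
            if y + 1 < height:
                add_edge((y, x), (y + 1, x), smoothness, smoothness)

    def bfs_parents():
        """Breadth-first search from the source over edges with spare capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    # Edmonds-Karp: push flow along shortest augmenting paths until none remain.
    while True:
        parent = bfs_parents()
        if sink not in parent:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck

    # Pixels still reachable from the source are on the foreground side of the cut.
    reachable = bfs_parents()
    return [
        [1 if (y, x) in reachable else 0 for x in range(width)]
        for y in range(height)
    ]


# Toy heatmap (made-up values): a confident foreground block with one noisy
# pixel in its centre, next to a confident background column.
heatmap = [
    [0.9, 0.9, 0.9, 0.1],
    [0.9, 0.4, 0.9, 0.1],
    [0.9, 0.9, 0.9, 0.1],
]
for row in graph_cut_segment(heatmap):
    print(row)
# [1, 1, 1, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

With `smoothness=0.0` the result reduces to thresholding the heatmap at 0.5, so the centre pixel would be labeled background; the pairwise term is what pulls it into the surrounding foreground region. No user scribbles are needed because the heatmap supplies the per-pixel costs.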

Approach

We seek to build a dataset of segmented videos. In this dataset, every pixel in every frame will be labeled with the object to which the pixel belongs.

To avoid labeling every frame by hand, we propose a method to efficiently segment videos:
1.  Manually segment a couple of frames.
2.  Extract hypercolumns from labeled and unlabeled frames.
3.  Train a classifier on hypercolumns from labeled frames.
4.  Apply the classifier to hypercolumns from unlabeled frames.
5.  Compute a hard segmentation using Graph Cuts.

Results from applying the SVM to hypercolumns. Input frame (left), output of SVM (center), ground truth (right).

Conclusion

Our next step is to outsource the manual labeling to Mechanical Turk. This requires that the entire pipeline run fast in a web browser.
•  We’ve already written an efficient Graph Cuts implementation in JS using Emscripten: http://web.mit.edu/bce/www/segment/
•  We plan to modify ConvNetJS to extract hypercolumns and train the SVM classifier.

For more details on hypercolumns and object detectors in CNNs:

Hariharan, Bharath, et al. "Hypercolumns for object segmentation and fine-grained localization." arXiv preprint arXiv:1411.5752 (2014).
Zhou, Bolei, et al. "Object Detectors Emerge in Deep Scene CNNs." arXiv preprint arXiv:1412.6856 (2014).