Scaling Human Vision
April 5, 2016, GPU Technology Conference, San Jose, CA—David Luan of Dextro talked about developments that enable human users to search, classify, and analyze video. The main problem is that current classification techniques borrow keywords, metadata, and iconic images from static-image methods and apply them to video.
The biggest problem with the current approaches is that words and static images do not properly represent video content. Video is not just a series of images; it also includes action, audio, and non-iconic images. (An iconic image is one that clearly identifies the underlying object, like a stop sign or a hamburger.)
Another problem is that the volume of video grows at a pace beyond human comprehension. Beyond user-generated content, corporate and public-safety groups generate hours of footage per person per day. Although the increasing volume of video reflects greatly reduced friction for consumer creation, it also creates the challenge of big numbers. How do you curate and search these volumes of content when the available metadata may be only a few keywords? Unfortunately, metadata is at best about 43 percent useful.
New tools that help humans interact with video call for better extraction of content. The standard flow of ingest, core processing, and extraction of details to generate metadata doesn't always lead to useful search and discovery. Adding analytics is very helpful.
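The ingest, core processing, and metadata-extraction flow above can be sketched as a minimal pipeline. All function names and tags here are hypothetical stand-ins, not Dextro's actual API:

```python
# Minimal sketch of the ingest -> process -> extract-metadata flow.
# Every function and tag below is a hypothetical illustration.

def ingest(path):
    """Stand-in for decoding a video file into frames."""
    return [f"frame-{i}" for i in range(4)]

def extract_tags(frame):
    """Stand-in for a per-frame classifier that emits tags."""
    return {"road"} if frame.endswith("0") else {"sky"}

def process(path):
    """Run the full flow and collect metadata for search."""
    metadata = set()
    for frame in ingest(path):
        metadata |= extract_tags(frame)
    return sorted(metadata)

print(process("clip.mp4"))  # ['road', 'sky']
```

The point of the sketch is that the output is only a flat bag of tags, which is exactly why this flow alone doesn't always lead to useful search and discovery.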
Human sight uses many cues to determine relevance. In addition to the base visual information, we add motion detection, audio, and non-verbal cues like timelines, context and topic, highlights, and thumbnails. Identifying action is a very large search issue. The classifier must be content-aware and use analysis to evaluate the viewer response. Extracting insights requires enhancing the user's ability to search, filter, curate, query, and select from the underlying volumes of data.
Dextro's tool set provides video-level feature extraction and user-defined search parameters and functions. The tools enable clustering of features, so for example, video from police body-worn cameras can be searched based on timelines, activities, and even iconic features like any images with person and gun.
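A search like the body-camera example above amounts to filtering clips whose extracted features contain a required set of tags. A minimal sketch, with made-up clip records rather than Dextro's data model:

```python
# Hypothetical body-camera clips with tags produced by feature extraction.
clips = [
    {"id": "cam1", "tags": {"person", "car"},         "time": "14:02"},
    {"id": "cam2", "tags": {"person", "gun"},         "time": "14:05"},
    {"id": "cam3", "tags": {"road", "gun", "person"}, "time": "14:09"},
]

def search(clips, required):
    """Return ids of clips whose tags include every required tag."""
    return [c["id"] for c in clips if required <= c["tags"]]

print(search(clips, {"person", "gun"}))  # ['cam2', 'cam3']
```

Set containment (`required <= c["tags"]`) captures the "any images with person and gun" query; timelines and activities would just be additional filter predicates.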
The underlying philosophy is to represent video as video, and not as a series of connected images. Classifiers traditionally extracted the iconic portions of content at the frame level, even though the total content may not have anything that falls into the standard buckets. For example, an image might be classified as sky, trees, road, and grass, when the vital information is that a truck has spilled its cargo on the road ahead. As a result, they built a single model that encodes features, motion, and spatial information.
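One common way to encode appearance and motion jointly, shown here as a toy illustration and not as Dextro's actual model, is to stack frames alongside a motion signal such as frame differences, so a single network input carries both:

```python
import numpy as np

# Toy illustration of feeding a model "video as video": appearance and
# motion go in together, instead of classifying isolated frames.
# Motion is approximated here by frame differencing; a real system
# would learn spatiotemporal features.

frames = np.random.rand(8, 64, 64)           # T x H x W grayscale clip
motion = np.diff(frames, axis=0)             # (T-1) x H x W frame deltas
clip_tensor = np.stack([frames[1:], motion], axis=1)

print(clip_tensor.shape)  # (7, 2, 64, 64): time, channel, height, width
```

A model consuming `clip_tensor` sees what is in the scene and how it is changing, which is what a per-frame classifier misses in the spilled-cargo example.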
The tool needs training not on photos but on real videos. Tagging enables easier search and filtering for analytics. Determining what is important makes ontology and taxonomy critical. The taxonomies create trees of tags that address salience, timelines, and frequency. After collecting data, starting with photos and moving to video with non-iconic data, the classifier is now able to find and curate new videos to pinpoint sequences of interest.
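A tree of tags means that searching on a broad tag should also match its more specific children. A minimal sketch with a made-up taxonomy:

```python
# Hypothetical tag taxonomy: parent tags map to more specific children.
taxonomy = {
    "vehicle": ["truck", "car"],
    "person": ["officer", "pedestrian"],
}

def expand(tag):
    """Return a tag together with all of its descendants."""
    tags = {tag}
    for child in taxonomy.get(tag, []):
        tags |= expand(child)
    return tags

print(sorted(expand("vehicle")))  # ['car', 'truck', 'vehicle']
```

Expanding a query this way is what lets a search for "vehicle" surface a clip that was only ever tagged "truck".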
Working in video is computationally expensive, and this work requires extensive GPU support. The main bottleneck for Dextro's system is GPU memory, but they can still run 90 concurrent live video feeds on the latest Titan X hardware.
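Assuming the card in question is the 12 GB Titan X available at the time, a back-of-envelope calculation shows roughly how tight the per-feed memory budget is:

```python
# Rough per-feed memory budget, assuming a 12 GB Titan X.
titan_x_gb = 12
feeds = 90
mb_per_feed = titan_x_gb * 1024 / feeds
print(f"~{mb_per_feed:.0f} MB per feed")  # ~137 MB
```

Around 137 MB per live feed for decoded frames, activations, and buffers makes it clear why GPU memory, not compute, is the bottleneck.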