The Past, Present, and Future of YouTube 3-D
January 24, 2012, SPIE Electronic Imaging Conference, Burlingame, CA—Pete Bradshaw and Debargha Mukherjee from Google describe their efforts in promoting 3-D videos.
The first issue facing adoption of 3-D is that there is very little content available. Some of the first 3-D videos took a lot of work just to get the two cameras set up, and some had to use a third camera for the audio. The left and right images and the audio were then aligned and rendered in Final Cut.
In '09, they created a 3-D editor that cross-correlated the audio to align the left and right images in time. The vertical sync was based on the video content. This system was dropped with the advent of integrated 3-D cameras and camcorders. Now the standards allow for 3-D frame extensions in the browser, and the HTML5 <video> element works for all YouTube content. Partners like nVidia and LG are helping to optimize 3-D displays and cameras so users can generate more content, and the standards and other technologies now allow uploads of 3-D content.
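The audio-based alignment can be illustrated with a minimal sketch. It assumes each camera recorded its own mono audio track at the same sample rate; the function names and the use of NumPy are illustrative choices, not details of the editor described in the talk.

```python
import numpy as np


def estimate_lag(reference: np.ndarray, other: np.ndarray) -> int:
    """Return how many samples `other` is delayed relative to `reference`.

    A positive result means `other` started later than `reference`.
    """
    ref = reference - reference.mean()
    oth = other - other.mean()
    # Full cross-correlation; output index i corresponds to lag i - (len(ref) - 1).
    corr = np.correlate(oth, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)


def lag_to_frames(lag_samples: int, sample_rate: int, fps: float) -> int:
    """Convert an audio lag in samples to the nearest whole video frame."""
    return round(lag_samples * fps / sample_rate)
```

With the lag expressed in frames, an editor would trim that many frames from the start of the later clip before pairing the left and right images. Direct correlation is quadratic in clip length, so long recordings would call for an FFT-based correlation instead.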
The latest 3-D standards for Web browsers are motivated by the desire to have existing devices transfer content to the web. Viewers can see 3-D content on a 3-D display and 3-D content on a 2-D display. The encoding is based on a variant of H.264/AVC called MVC, which is also used for Blu-ray discs. The frame packing format packs the left and right images into a single mono frame, which is encoded as mono video plus metadata. The encoding adds a 20 percent overhead.
The problem with the encoding is that the resolution of each view drops by half. Side-by-side is the preferred coding, but top-bottom is acceptable. The encoding uses a frame-alternate sequence, although other encoding formats like checkerboard, row-by-row, column-by-column, interlaced, and others can be handled. The sampling options require the declaration of the sub-sampling pattern and offset so the browser knows how to decode the video.
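The halving of resolution is easiest to see in a small sketch of side-by-side and top-bottom packing. The function names, the every-other-sample decimation, and the pixel-repetition upsampling are assumptions made for illustration; a real encoder would filter before decimating, and a real player would interpolate.

```python
import numpy as np


def pack(left: np.ndarray, right: np.ndarray, axis: int = 1, offset: int = 0) -> np.ndarray:
    """Pack two views into one frame of the original size.

    axis=1 gives side-by-side, axis=0 gives top-bottom. Each view keeps
    every second sample along `axis`, starting at `offset` (0 or 1), so
    half of its resolution along that axis is discarded.
    """
    keep = [slice(None), slice(None)]
    keep[axis] = slice(offset, None, 2)
    return np.concatenate([left[tuple(keep)], right[tuple(keep)]], axis=axis)


def unpack(frame: np.ndarray, axis: int = 1) -> tuple:
    """Split a packed frame and stretch each half back to full size."""
    first, second = np.split(frame, 2, axis=axis)
    # Pixel repetition restores the size but not the discarded detail.
    return np.repeat(first, 2, axis=axis), np.repeat(second, 2, axis=axis)
```

The `offset` argument stands in for the declared sub-sampling offset: the decoder needs to know which samples were kept in order to place them correctly when it reconstructs each view.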
The H.264 SEI FPA defines the frame packing and how to insert the metadata on every key frame. The metadata fields specify the FPA type, whether the left or right view comes first, and other critical matters. For the Web, VP8 for the video and Vorbis for the audio are open-source codecs that allow anyone to get up to speed quickly. nVidia is supporting both of these codecs on the Firefox browser.
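A minimal sketch of how a player might represent these metadata fields follows. The class and method names are hypothetical. The numeric codes are believed to match the frame packing arrangement types in the H.264 SEI message, but they are reproduced from memory and should be checked against the specification before use.

```python
from dataclasses import dataclass
from enum import IntEnum


class FramePacking(IntEnum):
    # Codes assumed to follow the H.264 frame packing arrangement SEI.
    CHECKERBOARD = 0
    COLUMN_INTERLEAVED = 1
    ROW_INTERLEAVED = 2
    SIDE_BY_SIDE = 3
    TOP_BOTTOM = 4
    FRAME_ALTERNATE = 5


@dataclass(frozen=True)
class FramePackingInfo:
    """Hypothetical container for the fields a decoder needs."""
    packing: FramePacking
    left_first: bool

    def view_order(self) -> tuple:
        """Order in which the two views appear in the packed stream."""
        return ("left", "right") if self.left_first else ("right", "left")

    def halves_spatial_resolution(self) -> bool:
        """Spatial packings share one frame; frame-alternate shares time."""
        return self.packing is not FramePacking.FRAME_ALTERNATE
```

For example, `FramePackingInfo(FramePacking(3), left_first=True)` describes a side-by-side stream with the left view in the first half of each frame.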
The process to upload a video starts with ingesting the raw footage into H.264 SEI FPA or WebM 3-D. The alternative is to manually enter the frame packing information. The output of this processing is a file in either H.264 or WebM that can be uploaded directly to YouTube.
Another aspect of getting more content in 3-D is to convert 2-D into 3-D. The infrastructure is emerging as more studios and devices become 3-D enabled. The emerging market for autostereoscopic displays helps more people to become comfortable and familiar with the technologies. The conversion process is not easy. In Hollywood, the conversion process is mostly manual, and only a few conversion houses can perform it well.
TV makers have integrated some conversion technologies into their TVs, but the outputs are fairly low quality. The basic automated methodology is to estimate depth by building a per-pixel depth map. This map is pushed into a ray-based rendering engine to make left and right images. The problem is that depth shifts with screen parallax and even within a scene. The transforms have to work with non-linear frequencies and changing focal planes. Quality suffers because pixels need to be shifted, and the shifting causes overlaps or occlusions in the images. The rendering engine has to fill in the holes that develop where a portion of the image is left uncovered.
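The shift, occlusion, and hole-filling steps can be sketched in a few lines. This is a simplified pixel-shift renderer, not the ray-based engine mentioned above. It assumes a grayscale image, a depth map normalized to [0, 1] with 1 as the nearest value, and a naive nearest-neighbor hole fill; all names are illustrative.

```python
import numpy as np


def render_view(image, depth, max_disparity, direction):
    """Shift pixels horizontally in proportion to nearness.

    direction is +1 or -1 and selects which eye is rendered.
    Returns the new view and a boolean mask of unfilled holes.
    """
    h, w = depth.shape
    out = np.zeros_like(image)
    out_depth = np.full((h, w), -1.0)  # -1 marks a pixel nothing landed on
    for y in range(h):
        for x in range(w):
            nx = x + direction * int(round(depth[y, x] * max_disparity))
            # When two pixels land on the same spot, the nearer one wins.
            if 0 <= nx < w and depth[y, x] > out_depth[y, nx]:
                out[y, nx] = image[y, x]
                out_depth[y, nx] = depth[y, x]
    return out, out_depth < 0


def fill_holes(view, holes):
    """Fill each hole with the closest valid pixel in the same row."""
    filled = view.copy()
    for y in range(holes.shape[0]):
        valid = np.flatnonzero(~holes[y])
        if valid.size == 0:
            continue
        for x in np.flatnonzero(holes[y]):
            nearest = valid[np.argmin(np.abs(valid - x))]
            filled[y, x] = view[y, nearest]
    return filled
```

Copying the closest pixel is the crudest possible fill and tends to smear foreground edges. Production renderers usually prefer background pixels or inpainting, which is part of why automated conversion quality varies so much.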
Another problem for automation is the need to set near and far depth limits. These limits have to be managed on a per-shot basis so the objects per depth plane and the depth within an object match. The various tools try to gather cues from color, spatial layout, vanishing-line detection, motion analysis, and focus and blur analysis. Unfortunately, these analyses are very compute-intensive.
One way to make the conversion easier is to use existing content for analysis and also as a database to provide a set of disparity analyses and statistics of the various parameters. The color priors are classified on a per-scene basis. Spatial priors presume that the lower and middle parts of the frame are closer to the viewer. Motion analysis looks for feature points and tracks object movement.
Given these data, the processing flow generates a sparse point flow map. The inputs are the local motion vectors, which are interpolated to make smooth motion. This data-driven approach assumes that similar images have similar depth. These data feed the database with sets of depth stores and images and enable example-based depth interpolation.
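The idea that similar images have similar depth can be sketched as a nearest-neighbor lookup. This is only an illustration under stated assumptions: the block-averaged thumbnail descriptor, the Euclidean distance, and the median over the k closest examples are choices made here for brevity, not details from the talk.

```python
import numpy as np


def thumbnail(image: np.ndarray, size: int = 4) -> np.ndarray:
    """Describe an image by block-averaging it down to size x size."""
    h, w = image.shape
    trimmed = image[: h - h % size, : w - w % size]
    blocks = trimmed.reshape(size, trimmed.shape[0] // size,
                             size, trimmed.shape[1] // size)
    return blocks.mean(axis=(1, 3)).ravel()


def estimate_depth(query: np.ndarray, database: list, k: int = 2) -> np.ndarray:
    """Borrow depth from the k stored examples that look most like `query`.

    `database` is a list of (image, depth_map) pairs of identical shape.
    """
    q = thumbnail(query)
    dists = np.array([np.linalg.norm(thumbnail(img) - q) for img, _ in database])
    nearest = np.argsort(dists)[:k]
    # The per-pixel median is less sensitive to one bad match than the mean.
    return np.median([database[i][1] for i in nearest], axis=0)
```

A real system would use far richer descriptors and a much larger database, but the structure is the same: describe the query, find its neighbors, and interpolate their depth.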
The depth maps are approximations, so the tools have learning capabilities for features and depth. Tiling the images handles the spatial variations, and any changes per tile can be easily interpolated. The depth maps from color, motion, spatial layout, and other cues are combined with linear weights to produce the final images.
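The linear weighting of cues can be written out directly. The sketch below is a minimal illustration: the weights are placeholders rather than learned values, and the spatial prior is simplified to a top-to-bottom ramp in which only the lower rows are treated as nearer.

```python
import numpy as np


def spatial_prior(height: int, width: int) -> np.ndarray:
    """Depth prior that treats lower rows as nearer (1 = near, 0 = far)."""
    ramp = np.linspace(0.0, 1.0, height)
    return np.tile(ramp[:, None], (1, width))


def fuse_depth(cues: dict, weights: dict) -> np.ndarray:
    """Combine per-cue depth maps as a normalized weighted sum."""
    total = sum(weights[name] for name in cues)
    return sum(weights[name] * cues[name] for name in cues) / total
```

Normalizing by the total weight keeps the fused map in the same [0, 1] range as its inputs, so the near and far depth limits chosen for a shot still apply after fusion.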
These user-triggered tools went into beta in September '11, and users converted over 20k videos, mostly of good quality. Now the online tools for automatically converting a 2-D image into a 3-D image are available to everyone. The tools generate a 1080p short-form 2-D and 3-D set of files. The user has a trigger option to turn the conversion feature on or off.
The parallel processing infrastructure needed for these compute-intensive workloads is in the cloud. As much as 5 percent of all uploads go into the conversion process, where segments are transferred into the cloud for processing and integration. The conversion process uses the same hardware and software as normal encoding, just a lot more of both.