CEVA refines its Vision DSP by Peter McGuinness
September 2016 - CEVA announced its new XM-6 vision processor yesterday (27th Sept) and will reveal details at today’s Linley Microprocessor conference in Santa Clara.
CEVA is one of the companies leading the race to provide power-efficient vision processing aimed specifically at the mobile and embedded markets (as opposed to repurposing power-hungry GPUs for workstation and datacenter use). Its designs, and the traction they get in the market, tell us a lot about the development and maturity of vision-based equipment, including AR headsets, drones and more.
CEVA claims a 3x performance improvement on vector-heavy code for the new core and a 2x boost for ‘average’ code, both relative to the previous-generation XM-4. The company also revealed that, for the first time, it has included hardware dedicated to specific functions.
On the performance front, CEVA makes no mention of an increased clock rate, and the total compute resource (vector and scalar integer and floating point units) has not increased over the XM-4. The natural conclusion is that the improvements come from detailed architecture changes introduced with this generation.
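As a rough back-of-the-envelope reading of the two headline figures (this is not a calculation CEVA has published), one can ask what share of the runtime of ‘average’ code would have to be vector-heavy for a 3x vector speedup to yield a 2x overall speedup. The sketch below assumes a simple Amdahl's-law model in which the non-vector remainder runs no faster; both that model and the function name are assumptions made for illustration.

```python
def vector_fraction_for(overall_speedup: float, vector_speedup: float) -> float:
    """Solve Amdahl's law for the accelerated fraction f of the original runtime.

    Model (an assumption, not a CEVA statement):
        overall_speedup = 1 / ((1 - f) + f / vector_speedup)
    The remaining (1 - f) of the runtime is assumed to see no speedup at all.
    """
    if vector_speedup <= 1.0:
        raise ValueError("vector_speedup must be greater than 1")
    if not 1.0 <= overall_speedup <= vector_speedup:
        raise ValueError("overall_speedup must lie between 1 and vector_speedup")
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / vector_speedup)


# Figures quoted in the article: 3x on vector-heavy code, 2x on 'average' code.
fraction = vector_fraction_for(overall_speedup=2.0, vector_speedup=3.0)
print(f"implied vector share of runtime: {fraction:.2f}")
```

Under those assumptions the implied vector share is about three quarters of the original runtime. If the scalar side also sped up, the real share would be lower.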
On the basics, XM-6 is remarkably similar to XM-4: the base version of the core includes four 32-bit scalar units linked VLIW-style to 128 16-bit integer MACs, with a 32x16-bit floating point unit available as an option. In detail, however, there are some differences: where the 128 MACs were previously divided into two 64-wide SIMD vectors, the new core splits them differently, into one 64-wide SIMD unit and two 32-wide units.
At the most basic level, this allows for a convenient substitution of the optional FP units without disturbing the rest of the architecture, but it also lets CEVA tune the vector resources more closely to the workloads and increase the amount of available parallelism; a good proportion of the performance increase is likely to be due to that change alone.
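To make the parallelism argument concrete, here is a toy model of the two lane layouts described above. It is only a sketch: the greedy scheduler, the rule that each unit runs at most one operation per cycle, and the decision not to split a wide operation across narrow units are all simplifying assumptions, not a description of how the XM-6 actually issues instructions.

```python
def cycles_needed(unit_widths, op_widths):
    """Count cycles for a toy greedy schedule of independent vector operations.

    Each cycle, every SIMD unit executes at most one pending operation whose
    width (in 16-bit lanes) does not exceed the unit's width. Hypothetical
    model for illustration only.
    """
    pending = sorted(op_widths, reverse=True)  # widest operations first
    if pending and pending[0] > max(unit_widths):
        raise ValueError("an operation is wider than every available unit")
    cycles = 0
    while pending:
        cycles += 1
        # Fill narrow units first so wide units stay free for wide operations.
        for width in sorted(unit_widths):
            for i, op in enumerate(pending):
                if op <= width:
                    del pending[i]  # this unit is now busy for the cycle
                    break
    return cycles


old_layout = [64, 64]      # two 64-wide SIMD units (XM-4 as described above)
new_layout = [64, 32, 32]  # one 64-wide and two 32-wide units (XM-6)

for ops in ([32, 32, 32], [64, 32, 32], [64, 64]):
    print(ops, "old:", cycles_needed(old_layout, ops),
          "new:", cycles_needed(new_layout, ops))
```

In this model, three independent 32-wide operations fit in one cycle on the new layout but need two on the old one, while two 64-wide operations favour the old layout. That is the kind of trade-off that tuning the split to real workloads is meant to resolve.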
It is also interesting that the optional floating point has now been downgraded to half precision (mediump in GPU shader terms). This seems to reflect a growing confidence among VPU designers that 32 bits is overkill for the vectorized portions of their workloads, so they are optimizing for 16 bits. The fact that this functionality is still optional is another strong indicator that having floating point at all is a luxury which many applications cannot afford, something which continues to drive the widening split between dedicated VPUs and repurposed GPUs.
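For a sense of what is given up in the move to 16 bits, the following sketch rounds a few values to IEEE 754 half and single precision using Python's standard `struct` module (`e` is the half-precision format code, `f` the single-precision one). The helper names and sample values are illustrative choices, and nothing here is specific to CEVA's floating point unit.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Sample values chosen for illustration: a small fraction, a pixel-like
# normalised intensity, and an integer beyond half precision's exact range.
for value in (0.1, 200.0 / 255.0, 2049.5):
    half, single = to_half(value), to_single(value)
    print(f"{value!r}: half={half!r} (error {abs(half - value):.2e}), "
          f"single={single!r} (error {abs(single - value):.2e})")
```

Half precision carries an 11-bit significand, so 0.1 comes back as roughly 0.09998 and integers above 2048 can no longer all be represented exactly. For many pixel-level vector operations that error is acceptable, which is consistent with the designers' bet described above.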
Just as interesting is the fact that CEVA credits much of the performance improvement not to gains in computational capacity but to improvements in efficiency due to its proprietary data handling schemes. A sophisticated buffer handling unit, which offloads management of large image datasets, is teamed with two specific mechanisms for improving the vectorization of operands and hence the overall utilization of the vector units. Improvements to both of these, in particular to the scatter-gather mechanism whereby the load/store unit can assemble a 1D data vector from a 2D array in a single cycle, represent a large part of the reason for this new release.
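Functionally, the gather half of that mechanism amounts to collecting elements from arbitrary positions in a 2D array into one contiguous vector, and scatter is the reverse. The sketch below shows only that data movement; the function names and the coordinate-list interface are hypothetical, and a Python loop says nothing about the single-cycle hardware behaviour CEVA describes.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[int, int]  # (row, column) index into the 2D array


def gather_2d(image: Sequence[Sequence[int]], coords: Sequence[Coord]) -> List[int]:
    """Assemble a 1D vector from arbitrary (row, col) positions in a 2D array."""
    return [image[row][col] for row, col in coords]


def scatter_2d(image: List[List[int]], coords: Sequence[Coord],
               values: Sequence[int]) -> None:
    """Write a 1D vector back to arbitrary (row, col) positions, in place."""
    if len(coords) != len(values):
        raise ValueError("coords and values must have the same length")
    for (row, col), value in zip(coords, values):
        image[row][col] = value


# A 4x4 test image whose pixel value encodes its position: row * 10 + col.
image = [[row * 10 + col for col in range(4)] for row in range(4)]

diagonal = [(i, i) for i in range(4)]
vector = gather_2d(image, diagonal)           # non-contiguous reads, one vector
print(vector)

scatter_2d(image, diagonal, [v * 2 for v in vector])  # process, then write back
print(image)
```

Without hardware support, each of those non-contiguous reads would be a separate load followed by a pack step, leaving the vector units idle in the meantime. That is the data/compute mismatch the mechanism is aimed at.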
It’s certainly true that a large part of wringing maximum performance out of a massively parallel system is solving the data/compute mismatch problem, and this is CEVA’s answer to that. It will be interesting to see how it plays out against the various multithreaded and hybrid approaches out there.