This was originally posted on aventine.se.
Even the most high-end audio codecs are designed to work on really low-end DSP devices. ALAC (Apple Lossless Audio Codec), for example, decodes stereo fine in software on one of the 90 MHz ARM7TDMI cores in the original iPod. AAC requires a bit more, but it is still within the reach of software on a relatively slow processor, like a Pentium or G3. A modern ARM processor can decode MP3 at a clock speed of a mere 10 MHz, and with a bit more, AAC, which is essentially the most demanding codec that you’ll meet on the web.
Video codecs, on the other hand, are an entirely different story. The 2.4 GHz Core 2 Duo in my laptop (a MacBook Pro) has serious problems decoding high-end H.264 (1080p Hi10P, for example) with FFmpeg. My desktop, a reasonably modern quad-core Xeon, handles these videos fine using FFmpeg, but with significant load. Note that this is with an implementation that is hand-optimized in assembly. We cannot depend on hardware support to improve the situation either, because it is often out of date. For example, no graphics card in my collection supports this profile in hardware yet.
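To put the load in perspective, here is a rough back-of-the-envelope budget in Python. The 24 frames per second figure is my assumption (the frame rate of the test videos is not stated), and the calculation ignores memory bandwidth, the second core, and everything else running on the machine.

```python
# Rough per-pixel cycle budget for real-time 1080p decoding on one core.
WIDTH, HEIGHT = 1920, 1080   # 1080p frame size
FPS = 24                     # assumed frame rate, not stated in the post
CLOCK_HZ = 2.4e9             # the 2.4 GHz Core 2 Duo mentioned above


def cycles_per_pixel(clock_hz, width, height, fps):
    """Clock cycles available per output pixel on a single core."""
    return clock_hz / (width * height * fps)


budget = cycles_per_pixel(CLOCK_HZ, WIDTH, HEIGHT, FPS)
print(f"{budget:.1f} cycles per pixel")  # prints "48.2 cycles per pixel"
```

Under these assumptions, entropy decoding, prediction, the inverse transform, and deblocking all have to fit into a few dozen cycles per pixel, which leaves very little slack.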
Video codecs tend to rely heavily on fixed-point arithmetic for optimization; H.264 is even designed to avoid needing floating-point as much as possible. Even the discrete cosine transform and motion compensation in H.264 are modified to operate on fixed-point numbers instead of floating-point.
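As an illustration of what an integer-only transform looks like, here is a minimal Python sketch of the 4x4 forward core transform usually cited for H.264. The function names are mine, the scaling that the codec folds into quantization is left out, and a real decoder runs the matching inverse transform in SIMD assembly rather than anything like this.

```python
# Integer matrix commonly given for the H.264 4x4 forward core transform.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def transform_1d(x):
    """Apply CF to four samples using only adds, subtracts, and shifts."""
    s03 = x[0] + x[3]
    d03 = x[0] - x[3]
    s12 = x[1] + x[2]
    d12 = x[1] - x[2]
    return [s03 + s12, (d03 << 1) + d12, s03 - s12, d03 - (d12 << 1)]


def forward_4x4(block):
    """Compute CF * block * CF^T for a 4x4 list of integers."""
    # Column pass: cols[c][i] holds element (i, c) of CF * block.
    cols = [transform_1d([block[r][c] for r in range(4)]) for c in range(4)]
    tmp = [[cols[c][i] for c in range(4)] for i in range(4)]
    # Row pass: multiplies the intermediate result by CF^T.
    return [transform_1d(row) for row in tmp]


flat = [[1] * 4 for _ in range(4)]
print(forward_4x4(flat)[0][0])  # prints 16: a flat block has only a DC term
```

Note that no multiplication or floating-point operation appears anywhere; the factors of two are plain shifts, which is what makes this kind of transform map so well onto short integer instructions.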
The reason for this is that modern processors can often process fixed-point operations much faster, especially the 8- and 16-bit operations that are the most common. These short integer instructions often have at least four times the throughput of double-precision floating-point. Certain complex instructions, like division, make the difference irrelevant and in many cases require a fallback to floating-point, but these operations are extremely uncommon in H.264.
Most decoders utilize these SIMD instructions, which give them access to 8-16 times more throughput per core for simple operations. In addition, there are special instructions for optimizing MPEG codecs that give a quite measurable further speedup, which you are unlikely to be able to utilize without hand-optimized code.
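The idea behind packed integer operations can be imitated in plain software. The sketch below is a software stand-in and not how FFmpeg does it: it treats one 64-bit integer as eight independent 8-bit lanes and adds all of them with a handful of ordinary operations. The helper names are hypothetical; hardware SIMD registers do the same thing natively and with wider registers.

```python
# SIMD-within-a-register: eight unsigned 8-bit lanes in one 64-bit word.
HIGH = 0x8080808080808080  # top bit of every lane
LOW = 0x7F7F7F7F7F7F7F7F   # low seven bits of every lane


def pack(lanes):
    """Pack eight values in 0..255 into one 64-bit integer."""
    return int.from_bytes(bytes(lanes), "little")


def unpack(word):
    """Split a 64-bit integer back into its eight 8-bit lanes."""
    return list(word.to_bytes(8, "little"))


def packed_add_u8(a, b):
    """Add all eight lanes at once, each wrapping around modulo 256."""
    # Add the low seven bits so carries cannot spill into the next lane,
    # then restore each lane's top bit with XOR.
    return ((a & LOW) + (b & LOW)) ^ ((a ^ b) & HIGH)


a = pack([200, 1, 255, 0, 127, 128, 16, 99])
b = pack([100, 2, 1, 0, 1, 128, 240, 1])
print(unpack(packed_add_u8(a, b)))  # prints [44, 3, 0, 0, 128, 0, 0, 100]
```

The narrower the samples, the more lanes fit in a register, which is where the 8x (16-bit) and 16x (8-bit) figures for a 128-bit register come from.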
There are two obvious solutions to all of these problems that are being prototyped on the web right now: WebCL and Rivertrail. Both of these are designed mainly to solve the threading problem, which is likely not the biggest issue, but it is still significant.
WebCL (OpenCL for the web), on the other hand, already solves most or all of these problems, since it is essentially a massively parallel C with SIMD and device-specific extensions. It even allows the GPU to pick up most of the burden, which is in many cases preferable to running on the CPU because of the extra computational power available.
And while I would love to be proved wrong about this, I don’t think I will be for a long while, and at that point, there will be more advanced codecs and higher resolutions around to target.