As a side note to the recent post: the kernel presented there is already very fast, but guess what :-)
There are ways to make it even faster through memory access optimizations. Let's consider the memory access for the U and V planes. Both access the same data at the same positions within the same dimensions, so they can be consolidated into the same work item, which then accesses that memory only once (global memory is the slowest memory type to access).
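To make the consolidation idea a bit more tangible, here is a minimal sketch in plain host-side C rather than an actual OpenCL kernel. It assumes that U and V are both computed from the same interleaved RGB source pixel, which may not match the kernel from the earlier post. Every name in it (`load_global`, `g_loads`, `chroma_separate`, `chroma_merged`) is hypothetical, and the chroma formulas are a common BT.601-style integer approximation used only as a placeholder. The counter stands in for global memory reads, so the two variants can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a global memory read; counts every access. */
size_t g_loads = 0;

uint8_t load_global(const uint8_t *buf, size_t i) {
    g_loads++;
    return buf[i];
}

/* Placeholder chroma formulas (BT.601-style integer approximation),
   NOT taken from the original kernel. The +32768 is the 128 offset
   pre-scaled by 256, which keeps the shifted value non-negative. */
uint8_t to_u(int r, int g, int b) {
    return (uint8_t)((-38 * r - 74 * g + 112 * b + 128 + 32768) >> 8);
}

uint8_t to_v(int r, int g, int b) {
    return (uint8_t)((112 * r - 94 * g - 18 * b + 128 + 32768) >> 8);
}

/* Variant A: separate "work items" for U and V, so each pixel is read twice. */
void chroma_separate(const uint8_t *rgb, size_t n, uint8_t *u, uint8_t *v) {
    assert(rgb && u && v);
    for (size_t i = 0; i < n; i++) {
        int r = load_global(rgb, 3 * i);
        int g = load_global(rgb, 3 * i + 1);
        int b = load_global(rgb, 3 * i + 2);
        u[i] = to_u(r, g, b);
    }
    for (size_t i = 0; i < n; i++) {
        int r = load_global(rgb, 3 * i);
        int g = load_global(rgb, 3 * i + 1);
        int b = load_global(rgb, 3 * i + 2);
        v[i] = to_v(r, g, b);
    }
}

/* Variant B: one "work item" produces both U and V from a single read. */
void chroma_merged(const uint8_t *rgb, size_t n, uint8_t *u, uint8_t *v) {
    assert(rgb && u && v);
    for (size_t i = 0; i < n; i++) {
        int r = load_global(rgb, 3 * i);
        int g = load_global(rgb, 3 * i + 1);
        int b = load_global(rgb, 3 * i + 2);
        u[i] = to_u(r, g, b);
        v[i] = to_v(r, g, b);
    }
}
```

Under these assumptions both variants produce identical output, but the merged one issues half as many reads. How much of that shows up as real speedup on a given device depends on its caches, so treat the counter as an illustration, not a benchmark.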
Furthermore, “flattening” the work-group from 2D to 1D enables faster sequential memory access instead of the 2D access presented, which should benefit much more from prefetching and probably helps avoid bank conflicts…
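The difference between the two layouts can be sketched by listing the buffer offsets that one work-group touches. This is again plain C, not kernel code, and it assumes a row-major plane; the helper names (`offset_2d`, `group_offsets_2d`, `group_offsets_1d`, `count_runs`) are made up for this illustration. A 2D group touches one short run per row, while a 1D group of the same size touches a single contiguous run.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of element (x, y) in a plane that is `width` wide. */
size_t offset_2d(size_t x, size_t y, size_t width) {
    return y * width + x;
}

/* Offsets touched by a gw x gh 2D work-group whose origin is (x0, y0),
   listed in local-id order with x running fastest. */
void group_offsets_2d(size_t x0, size_t y0, size_t gw, size_t gh,
                      size_t width, size_t *out) {
    assert(out);
    for (size_t ly = 0; ly < gh; ly++)
        for (size_t lx = 0; lx < gw; lx++)
            out[ly * gw + lx] = offset_2d(x0 + lx, y0 + ly, width);
}

/* Offsets touched by a 1D work-group of n items starting at `first`:
   work item k simply handles element first + k. */
void group_offsets_1d(size_t first, size_t n, size_t *out) {
    assert(out);
    for (size_t k = 0; k < n; k++)
        out[k] = first + k;
}

/* Number of contiguous runs in a list of offsets. */
size_t count_runs(const size_t *off, size_t n) {
    assert(off != NULL || n == 0);
    size_t runs = (n > 0) ? 1 : 0;
    for (size_t k = 1; k < n; k++)
        if (off[k] != off[k - 1] + 1)
            runs++;
    return runs;
}
```

For a plane 64 elements wide, an 8x8 group yields 8 separate runs, while a 1D group of 64 items yields one. Inside a flattened kernel the coordinates, if still needed, can be recovered as `x = i % width` and `y = i / width`. Whether fewer runs actually translates into faster access depends on the device.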
So much for optimizations. If there is demand for a corresponding kernel, drop me an email.