This implementation is less likely to trigger internal compiler errors (ICEs) and is more correct. It also acts as a memory barrier, which helps prevent writes to global memory from being optimized away.
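For illustration only, here is a minimal sketch of the general technique, not necessarily the code this change introduces. It assumes a GCC- or Clang-compatible compiler with extended inline assembly. The names `do_not_optimize`, `g_sink`, and `sum_to` are hypothetical. An empty `asm volatile` statement with a `"memory"` clobber tells the compiler that memory may be read or written at that point, so pending stores to globals have to be emitted before it.

```cpp
#include <cassert>

// Hypothetical helper (the name is an assumption, not taken from the source).
// Requires GCC/Clang extended inline asm.
template <class T>
inline void do_not_optimize(T const& value) {
    // Empty asm statement: emits no instructions, but the compiler must treat
    // `value` as used ("g" = register, memory, or immediate operand) and must
    // assume any memory may be accessed here ("memory" clobber).
    asm volatile("" : : "g"(value) : "memory");
}

// Hypothetical global that the loop below writes to.
static int g_sink = 0;

// Sums 1..n, storing each partial sum to the global.
inline int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i;
        g_sink = total;           // write to global memory
        do_not_optimize(total);   // barrier: the store above must be emitted
    }
    return total;
}
```

Calling `sum_to(4)` returns 10 and leaves `g_sink` at 10. Without the barrier, an optimizer would be free to drop the intermediate stores and keep only the final one.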