Dear Matt,

DNorm2 is published now.
I've tried to find a smart method to decide between row-wise and column-wise processing. Unfortunately, this depends on the compiler and on the sizes of the first- and second-level caches, so my strategies remain very coarse.

I'd be happy to see a speed comparison for multi-core machines. And, as I said already, please omp up my mex. ;-)