You should consider using Eigen for linear algebra; I have personally found it much better performance-wise than bindings to ATLAS or other more standard linear algebra solutions. ML algorithms tend to be multistage (think of the weight update with momentum in a neural network, for example), and the primitives available in ATLAS or a BLAS library are really too low-level. Because Eigen generates code for a whole complicated expression at once, it can blow a standard linear algebra library out of the water for a certain class of problems. For others the highly tuned vendor BLAS code would obviously win, but I've seen huge speedups by using Eigen, and it fits complex ML operations well.
This is true, even though we would rather switch to Armadillo due to its easier handling and better high-level behaviour.
Right now the linear algebra library we use (uBLAS) has the same behaviour as Eigen for BLAS1-type expressions, so it tries to generate optimal (non-SSE) code. Only for BLAS2 and BLAS3 do we fall back to the ATLAS routines, which have the same performance as Eigen on the interesting problem sizes.
In the end it is not so interesting whether the BLAS1-type expressions are fast, as they make up less than 1% of the run time. The big chunks are the data processing inside the matrix-matrix multiplications of the neural networks and similar entities.
You forget that if you can do the whole weight update in a single-shot operation, the data doesn't have to go through the cache multiple times; at least on the problem sizes I am working on, it isn't the FP throughput that kills it but the memory bandwidth. Back to the NN example: if you can do the matrix multiply and the application of the delta weights in a single loop iteration, you get much better cache behaviour.
Another thing about code generation: in a project I'm working on I am also using a hacked version of Eigen that can compute tanh and the derivative of tanh together, so the NN activations go quite a bit faster, since you can generate vectorized code for the whole calculation that visits each memory location exactly once. While it's true that the calculation of the weight updates is where most of the time is spent, I saw a 3-4x speedup in the activation code by doing it as a single operation, due to better memory access patterns and fewer loop iterations. Better memory access patterns can also have synergistic effects on other code because there is less cache pollution. In my case, when I play fast and loose and introduce a few extra copies of the matrix data, performance falls off a cliff as soon as the working set no longer fits nicely in the CPU cache: a 10x difference in the particular case I remember.
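A minimal sketch of that fused tanh-plus-derivative pass, in plain C++ with a hypothetical function name (the hacked-Eigen version vectorizes the same idea). It exploits the identity tanh'(x) = 1 - tanh(x)^2, so the derivative comes almost for free while the activation value is still in a register:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical fused activation pass: y = tanh(x) and dy = 1 - y*y
// computed in the same loop, so each input element is read from
// memory once and both outputs are stored immediately.
void tanh_with_derivative(const std::vector<double>& x,
                          std::vector<double>& y,
                          std::vector<double>& dy) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double t = std::tanh(x[i]);
        y[i]  = t;
        dy[i] = 1.0 - t * t;  // tanh'(x) = 1 - tanh(x)^2
    }
}
```

Computing the activations and derivatives in two separate passes would walk the input array twice; on memory-bound problem sizes that second pass is exactly the kind of cache traffic described above.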
As always, performance is part art and part science, and perhaps it won't matter as much in the general case, but for my specific implementation and my matrix sizes Eigen has made a measurable difference compared to other solutions.