You should consider using Eigen for linear algebra; I have personally found it much better performance-wise than bindings to ATLAS or other more standard linear algebra solutions. ML algorithms tend to be multistage (think of the weight update with momentum in a neural network, for example), and the primitives available in ATLAS or a BLAS library are really too low-level. Because Eigen generates code for a whole complicated expression at once, it can blow a standard linear algebra library out of the water for a certain class of problems. For others the highly tuned vendor BLAS code would obviously win, but I've seen huge speedups by using Eigen, and it fits complex ML operations well.
This is true, even though we would rather switch to Armadillo due to its easier handling and better high-level behaviour.
Right now the linear algebra library we use (uBLAS) has the same behaviour as Eigen for BLAS1-type expressions, so it tries to generate optimal (non-SSE) code. Only for BLAS2 and BLAS3 do we fall back to the ATLAS routines, which have the same performance as Eigen on the interesting problem sizes.
In the end it is not so interesting whether the BLAS1-type expressions are fast, as they make up less than 1% of the run time. The big chunks are the data processing inside the matrix-matrix multiplications of the neural networks and similar entities.
You forget that if you can do the whole weight update in a single-shot operation, the data doesn't have to go through the cache multiple times; at least on the problem sizes I am working on, it isn't the FP throughput that kills it but the memory bandwidth. Back to the NN example: if you can do the matrix multiply and the application of the delta weights in a single loop iteration, you get much better cache behaviour.
Another thing about code generation: in a project I'm working on I am also using a hacked version of Eigen that can compute tanh and the derivative of tanh together, so the NN activations go quite a bit faster, since you can generate vectorized code for the whole calculation that visits each memory location exactly once. While it's true that the calculation of the weight updates is where most of the time is spent, I saw a 3-4x speedup in the activation code by doing it as a single operation, due to better memory access patterns and fewer loop iterations. Better memory access patterns can also have synergistic effects on other code because there is less cache pollution. In my case, when I play fast and loose and introduce a few extra copies of the matrix data, performance falls off a cliff as soon as the working set no longer fits nicely in the CPU cache: a 10x difference in the particular case I remember.
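A minimal sketch of that fused tanh-plus-derivative pass, in plain C++ with a hypothetical function name (the hacked-Eigen version vectorizes the same idea). It exploits the identity tanh'(x) = 1 - tanh(x)^2, so the derivative comes almost for free while the activation value is still in a register:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical fused activation pass: y = tanh(x) and dy = 1 - y*y
// computed in the same loop, so each input element is read from
// memory once and both outputs are stored immediately.
void tanh_with_derivative(const std::vector<double>& x,
                          std::vector<double>& y,
                          std::vector<double>& dy) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double t = std::tanh(x[i]);
        y[i]  = t;
        dy[i] = 1.0 - t * t;  // tanh'(x) = 1 - tanh(x)^2
    }
}
```

Computing the activations and derivatives in two separate passes would walk the input array twice; on memory-bound problem sizes that second pass is exactly the kind of cache traffic described above.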
As always, performance is part art and part science, and perhaps it won't matter as much in the general case, but for my specific implementation and my matrix sizes Eigen has made a measurable difference compared to other solutions.