The memory bandwidth is still a bit lower than Nvidia's best cards, and it doesn't have the equivalent of Tensor Cores. If they wanted they could compete, but it's clearly not their desire. They build consumer end products.
The neural engine on all recent Apple silicon (and A## devices) has "tensor" cores for matrix calculations (note: Apple abstracts all of this behind CoreML, so the ANE and the AMX instructions/hardware tend to get conflated). The M2 Ultra offers 31.6 trillion ops per second with fp16, for instance, which actually bests an A100.
The software support is terrible, of course, which is the biggest limitation, but Apple clearly wants to be in that realm as well.
The neural engine has severe limitations at the moment. I tried using it for BERT about a year ago and kept crashing its API with "out of memory" errors. The theoretical TOPS you mention also don't necessarily translate into usable TOPS, because memory bandwidth and caches limit how fast the compute units can be fed. This is why, for example, the comparison of the M1 Max with an RTX 3090 was completely off.
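To put a rough number on the "theoretical vs. usable TOPS" point, here is a minimal roofline-style sketch. It assumes throughput is capped by whichever is lower: the peak compute rate or memory bandwidth times the ops performed per byte moved. The 31.6 TOPS figure is the one quoted in this thread; the 800 GB/s bandwidth is an assumption (Apple's headline M2 Ultra figure, and it is optimistic to assume the neural engine sees all of it). The function name and the ops-per-byte values are made up for illustration.

```python
def attainable_tops(peak_tops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Roofline estimate: the lower of the compute ceiling and the memory ceiling."""
    # Ops per second that memory can feed, converted to tera-ops.
    memory_bound_tops = bandwidth_gbs * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, memory_bound_tops)


PEAK_TOPS = 31.6       # peak figure quoted in the thread
BANDWIDTH_GBS = 800.0  # assumption: full M2 Ultra memory bandwidth is available

# Illustrative ops-per-byte values, from bandwidth-starved to compute-heavy.
for ops_per_byte in (1, 10, 100):
    usable = attainable_tops(PEAK_TOPS, BANDWIDTH_GBS, ops_per_byte)
    print(f"{ops_per_byte:>3} ops/byte -> {usable:.1f} TOPS usable")
```

Under these assumptions, a workload that does only about one operation per byte fetched stays far below the advertised peak, and only compute-heavy workloads reach it.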
I certainly can't speak to your specific uses or issues, but we've really moved the goalposts from the prior claim that it didn't have tensor (i.e. matrix) functionality.
My daily work life includes a lot of model running on Apple hardware (Apple Silicon and A1# chips with the neural engine) using CoreML, often PyTorch models converted using coremltools. The performance of the Apple chips is spectacular if the intrinsics are supported (things obviously get dicier if there are currently unsupported ops). I mean, the memory bandwidth of the M2 Ultra is within spitting distance of the GDDR6X 4090.
People aren't going to be replacing H100 arrays with Apple Silicon, and even as a fan I use Nvidia hardware for training and convert the models to CoreML after the fact, but Apple clearly isn't satisfied with being some toy. They are continually climbing up that vine.
Yes, you are correct that the ANE does have the equivalent of tensor cores and that I didn’t mention that. I just don’t expect it to be usable beyond inference, because the number of compute units will not work for batches in medium/large/huge networks. That’s obviously by design! The ANE silicon size is tiny compared to the GPU area. I wouldn’t actually be surprised if Apple strategically only invests in using their GPU for LLM (1B+ params) work.
Note that if you are currently using CoreML for LLMs, all the work is done on the GPU.
Regarding Tensor Cores, it does have them as part of the 32-core Neural Engine. Apple considers AI/ML a consumer feature, all the way down to the iPhone hardware. At the same time, this isn't a data-center supercluster. It's still just a mid-sized workstation.
There is a difference. We train with large batch sizes these days. The ANE silicon is tiny and can't do the large matrix multiplications for big LLMs, with or without a batch size higher than 1. That means it cannot saturate the RAM bandwidth, and you're better off using the much bigger GPU on the Apple die.
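For a sense of why batch size matters so much here, a rough sketch of the arithmetic intensity (FLOPs per byte moved) of a single fp16 matrix multiplication as the batch grows. It assumes each operand is read or written exactly once and ignores caches; the 4096 x 4096 weight shape and the function name are hypothetical, chosen only for illustration, and nothing here models the ANE or the GPU specifically.

```python
def matmul_intensity(batch: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for (batch x k) @ (k x n), with each operand touched once."""
    flops = 2 * batch * k * n  # one multiply and one add per weight per row
    # Input activations + weights + output activations, fp16 by default.
    bytes_moved = bytes_per_elem * (batch * k + k * n + batch * n)
    return flops / bytes_moved


# Hypothetical 4096 x 4096 weight matrix, roughly one projection in an LLM layer.
for batch in (1, 32, 512):
    print(f"batch {batch:>3}: {matmul_intensity(batch, 4096, 4096):.1f} FLOPs/byte")
```

At batch size 1 the weights dominate the traffic, so the work is essentially one FLOP per byte; larger batches reuse the same weights and need far more compute per byte, which is where a bigger compute array pays off.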