
No, but an equivalent set of SIMD instructions called NEON is available. IMHO, having programmed both, the ARM ones seem to have less baggage and are better designed. Not sure about the performance of the wider-lane (512-bit) variants on x86, though; they may be significantly faster.


Common x86 implementations have 2 or 3 vector pipes (recent Intel is 2x AVX-512 or 3x AVX2); Apple ARM has 4, so the difference in throughput, though definitely there, is smaller than you might expect.

There is SVE, which has variable-width vectors. I haven't played with it--that gets you your width, but there are obviously compromises there.

The obvious longstanding omission from NEON is movemask. One nice thing it does have, though, is a permute with four source operands (!) for a 64-byte lookup table.
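For readers who haven't used it: x86's movemask packs the top bit of each byte lane into a scalar bitmask, which makes things like "find first match" a movemask plus a trailing-zero count. Here's a scalar model of the semantics--plain C, not intrinsics; the `movemask16` name is just illustrative:

```c
#include <stdint.h>

/* Scalar model of x86 _mm_movemask_epi8: collect the most significant
 * bit of each of 16 bytes into a 16-bit mask (byte 0 -> bit 0). */
static uint16_t movemask16(const uint8_t bytes[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (uint16_t)(bytes[i] >> 7) << i;
    return mask;
}
```

On NEON you end up emulating this with shifts and horizontal adds, which is exactly the baggage being discussed.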


We have data from vqsort showing a 3.2 GHz M1 to be half as fast as a 3 GHz Skylake. I'd be interested in other direct comparisons of x86 vs NEON code :)

SVE will be interesting, especially when vectors get wider. For the Arm V1, there are some unfortunate bottlenecks; see Table 3.41 (instruction throughput) in https://documentation-service.arm.com/static/6088343f85368c4...

For OP's application, the equivalents of compressstore aren't great. COMPACT is decent (1x 256-bit/cycle), but CNTP occupies the bottlenecked M0 pipeline.

Still, the V1 is probably fine for applications that do not often modify predicates.
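For context, compress-store writes only the lanes whose mask bit is set, packed contiguously--this is the core primitive of a vectorized partition. A scalar model of the semantics (the `compress_store` name and loop are illustrative, not any real intrinsic):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of AVX-512-style compress-store: copy only the elements
 * whose mask bit is set, packed contiguously into dst; return the count
 * written (which is popcount(mask), hence the CNTP dependency on SVE). */
static size_t compress_store(int32_t *dst, const int32_t *src,
                             uint8_t mask, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++)
        if (mask & (1u << i))
            dst[out++] = src[i];
    return out;
}
```

The returned count is what advances the write pointer, which is why the CNTP throughput matters as much as COMPACT itself.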


I see. Sort is an interesting case because it has very tight dependencies (IOW, I expect there are many interesting workloads for which this result doesn't generalise). When you halve the vector width, you either double the length of the dependency chain or halve your effective work unit, which amounts to the same thing along another axis. The performance impact of that is variable, but considering the M1 is very wide, it's probably not great. I wonder if there's any room for a parallel sum scan, but my intuition is that, on a superscalar core at such small sizes, it's wasted work.

While I haven't looked closely at vqsort yet, I was looking at vxsort not long ago. I suggested to its author the possibility of handling multiple partitions in parallel--handling only one architectural vector per partition at a time--and he said that, while that seemed like a potentially promising approach, he was more immediately concerned with bandwidth. Considering the M1's great width, memory bandwidth, and cache size (on top of the small vector size), this seems like an especially promising approach there.


I agree this is one specific use case.

Decreasing the vector size does have some upside: it makes the O(n log² n) sorting network cheaper.
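As a concrete example of the kind of network meant here: a minimal 5-comparator sorting network for 4 elements, sketched in scalar C. In a vectorized sort the compare-exchanges would be vector min/max plus shuffles, so fewer lanes means fewer network stages to pay for:

```c
/* Branchless compare-exchange: order the pair so *a <= *b. */
static void cswap(int *a, int *b) {
    int lo = *a < *b ? *a : *b;
    int hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

/* 5-comparator sorting network for 4 elements (Batcher odd-even merge). */
static void sort4(int v[4]) {
    cswap(&v[0], &v[1]); cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]); cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}
```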

For M1, performance is not terrible; it seems mostly limited by the NEON instruction set (masked stores and popcounting comparison results are expensive).

How do you mean, multiple partitions in parallel? One challenge is that subarrays vary in size. It could be interesting to do a multiway partition (larger base for the log n, hence fewer passes over memory), but that seems hard to reconcile with the vectorized in-place approach. ips4o effectively does a 64- or 256-way partition, but that's also nontrivial to vectorize.


> makes the O(n log² n) sorting network cheaper

Good point.

> subarrays vary in size

Hmm. I haven't thought about this too much yet, but it seems to me that you'd probably still win even if you had to branch to find a new subarray once you run out of the common prefix. Or maybe you try to find subarrays with similar sizes and mask out the last n stores (if your subarrays are too mismatched in general, then you hit the worst case of quicksort _anyway_, so).

> ips4o effectively does a 64 or 256-way partition, but that's also nontrivial to vectorize

This is interesting. 64 and 256 are annoyingly large--mainly because you have to scatter for it to make any sense, and scatter-store is comparatively slow where it's even supported--but just 4 seems potentially feasible.


Agree that scatter is expensive. We did actually have hopes for a 4-way partition. Even ignoring the difficulties of doing that in place, SIMD partition still seems to require one compressstore per partition. A quick prototype turned out to be half as fast as the 2-way partition, which negates the gains from doing half as many passes (this depends on the ratio of compute to bandwidth, though).
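For anyone following along, a scalar sketch of the 4-way partition idea with three pivots; the names and out-of-place bucket layout are illustrative. In a SIMD version, each of the four buckets needs its own compare mask and compress-store per input vector, which is where the 4x cost per vector comes from:

```c
#include <stddef.h>

/* Scalar sketch of a 4-way partition with three pivots p0 <= p1 <= p2.
 * Each element is routed to one of four buckets by counting how many
 * pivots it reaches; a SIMD version would replace this with three
 * vector compares and one compress-store per bucket. */
static void partition4(const int *src, size_t n,
                       int p0, int p1, int p2,
                       int *out[4], size_t cnt[4]) {
    for (int b = 0; b < 4; b++)
        cnt[b] = 0;
    for (size_t i = 0; i < n; i++) {
        int x = src[i];
        int b = (x >= p0) + (x >= p1) + (x >= p2);
        out[b][cnt[b]++] = x;
    }
}
```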



