This also happens in completely normal operation, like if you're using a TCP-based MPI implementation and do an all-to-all message send. The destination's buffers fill quickly from all the senders, the receiver drops packets, TCP sees that as a timeout after 250ms, and requests a retransmit. In principle, PAUSE frames give the sender feedback so it can pace its sends.
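Roughly the pattern that triggers it, as a minimal sketch (the buffer sizes here are illustrative, not from my actual setup):

    /* Every rank sends a chunk to every other rank at once. Over a TCP
       transport, all those flows converge on each receiver and overflow
       the buffers there. Compile with mpicc. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int chunk = 1 << 20;  /* 1 MiB per peer, illustrative */
        char *sendbuf = malloc((size_t)chunk * nranks);
        char *recvbuf = malloc((size_t)chunk * nranks);

        /* N*(N-1) simultaneous flows: the incast case described above */
        MPI_Alltoall(sendbuf, chunk, MPI_BYTE,
                     recvbuf, chunk, MPI_BYTE, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }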
Took me a long time to debug my MPI performance problems because of this.
Uh, no. Alltoall is a challenge for MPI, but not for the reason you describe. TCP receive windows mean the receivers themselves aren't the problem; it's all the switch queues in the middle.
TCP windows won't save you. A receive window only describes the receiver's socket buffer; TCP has no way to magically know when some buffer in the middle is full. Instead it notices packet loss and interprets it as congestion, which is not what you want, because it can significantly reduce throughput.

That's the problem when you don't really have congestion at all, just very high packet loss caused by small buffers: the transfer rate drops to nothing.
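If you want to watch this happen on Linux, you can pull the retransmit counters and current RTO straight off a connected socket with TCP_INFO; a quick sketch (the helper name is mine):

    /* Linux-specific: dump loss/RTO stats for a connected TCP socket. */
    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    void dump_tcp_stats(int fd) {
        struct tcp_info info;
        socklen_t len = sizeof(info);
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
            /* rto is in microseconds; cwnd is in segments */
            printf("rto=%u us cwnd=%u total_retrans=%u\n",
                   info.tcpi_rto, info.tcpi_snd_cwnd,
                   info.tcpi_total_retrans);
        }
    }

A climbing total_retrans alongside a collapsing transfer rate is the loss-misread-as-congestion case, not a genuinely full pipe.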
In fact, at the time, I was working at LBL in a lab that developed new congestion avoidance algorithms. The problem was that most of our tuning was for long-range links, while I was using the cluster for local high-performance networking.