This also happens in completely normal operation, like if you're using a TCP-based MPI implementation and do an all-to-all message send. The destination's buffers fill quickly from all the senders, the receiver drops packets, TCP sees that as a timeout after 250ms, and requests a retransmit. In principle, PAUSE frames give the sender feedback so it can pace its sends.
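Roughly the pattern that triggers it, as a minimal sketch (the buffer sizes here are illustrative, not from my actual setup):

    /* Every rank sends a chunk to every other rank at once. Over a TCP
       transport, all those flows converge on each receiver and overflow
       the buffers there. Compile with mpicc. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int nranks;
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int chunk = 1 << 20;  /* 1 MiB per peer, illustrative */
        char *sendbuf = malloc((size_t)chunk * nranks);
        char *recvbuf = malloc((size_t)chunk * nranks);

        /* N*(N-1) simultaneous flows: the incast case described above */
        MPI_Alltoall(sendbuf, chunk, MPI_BYTE,
                     recvbuf, chunk, MPI_BYTE, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }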
Took me a long time to debug my MPI performance problems because of this.
Uh, no. Alltoall is a challenge for MPI, but not for the reason you describe. TCP receive windows mean the receivers themselves aren't the problem; it's all the switch queues in the middle.
TCP windows won't save you. A receive window only describes the receiver's socket buffer; TCP has no way to magically know when some buffer in the middle is full. Instead it notices packet loss and interprets it as congestion, which is not what you want, because it can significantly reduce throughput.

That's the problem when you don't really have congestion at all, just very high packet loss caused by small buffers: the transfer rate drops to nothing.
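If you want to watch this happen on Linux, you can pull the retransmit counters and current RTO straight off a connected socket with TCP_INFO; a quick sketch (the helper name is mine):

    /* Linux-specific: dump loss/RTO stats for a connected TCP socket. */
    #include <stdio.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    void dump_tcp_stats(int fd) {
        struct tcp_info info;
        socklen_t len = sizeof(info);
        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
            /* rto is in microseconds; cwnd is in segments */
            printf("rto=%u us cwnd=%u total_retrans=%u\n",
                   info.tcpi_rto, info.tcpi_snd_cwnd,
                   info.tcpi_total_retrans);
        }
    }

A climbing total_retrans alongside a collapsing transfer rate is the loss-misread-as-congestion case, not a genuinely full pipe.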
In fact, at the time, I was working at LBL in a lab that developed new congestion avoidance algorithms. The problem was that most of our tuning was for long-range links, while I was using the cluster for local high-performance networking.