Ah... The problems of crappy consumer ethernet equipment ( I work at an ethernet switch vendor so excuse the rant )
What is likely happening is that your switch is configured by default to implement both rx and tx pause. Your TV, which is also (erroneously, in my opinion) configured to transmit pause, goes bonkers and starts sending pause frames to your switch. Your switch then buffers packets for your TV until the buffers are full, and then starts transmitting pause to everyone else on its other ports. The switch must have some horrible buffering policy where one port (the TV port) can hog all the buffers and deprive every other port of the ability to send...
Now the kicker is the way every endstation implements pause... Notice that the pause quanta in the Pause packet is in units of 512 bit-times, and in the packet you captured it is set to the maximal value of 65535. On a 100Mbps port (presuming 100Mbps, since the Mediatek part has 4x100Mbps (Fast Ethernet) and 2x1Gbps (Gigabit Ethernet) ports) that computes to 512*65535/100e6 = 0.3355392 seconds.
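Spelling that arithmetic out at a few link speeds (plain Python as a calculator; the only number taken from the capture is the 65535 quanta, the link speeds are just illustrative):

```python
# Worst-case pause duration: the quanta field is in units of 512 bit-times,
# so duration = quanta * 512 / link_rate_bps.
PAUSE_QUANTUM_BITS = 512
MAX_QUANTA = 0xFFFF  # 65535, the value in the captured frame

for name, bps in [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    seconds = MAX_QUANTA * PAUSE_QUANTUM_BITS / bps
    print(f"{name:>8}: {seconds * 1000:8.3f} ms per max-quanta pause")

# 100 Mbps: 335.539 ms, 1 Gbps: 33.554 ms, 10 Gbps: 3.355 ms
```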
A normal implementation sends this packet periodically, and once it has buffer space to receive again it sends a pause with a quanta of 0, meaning cancel the previous timer... but if it's malfunctioning, who knows if it ever will...
The sad part is that I don't even know what to recommend for a good consumer level switch that has good defaults or configurable defaults and sane buffer config... Mine is a dinky one probably vulnerable to this problem as well... Need to do some research.
> A normal implementation sends this packet periodically, and once it has buffer space to receive again it sends a pause with a quanta of 0, meaning cancel the previous timer... but if it's malfunctioning, who knows if it ever will...
As with most protocols, it doesn't work if it's not implemented properly.
Many hardware implementations have no knowledge of when the buffer can be emptied, so it's understandable that they treat it as an on/off switch, screaming: my buffer is full, don't send me anything for 65535*512 bit-times! Which is perfectly reasonable, because otherwise all those incoming frames would need to be dropped anyway.
Remember, small embedded devices often can't guarantee Ethernet DMA slot time to DRAM, and definitely can't afford a dedicated DRAM channel, so those buffers live in a 2-8 kB on-chip SRAM block or equivalent.
When the buffer is "full" (above the high-water mark), an interrupt gets generated and the device firmware sets up an appropriate DMA transfer to empty it. Once that is done, the device should of course send a PAUSE 0, and all is good.
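A toy model of that watermark dance (hypothetical names, not any particular driver):

```python
# Toy model of the high-water-mark logic described above -- hypothetical,
# not any real driver.  Note that an 8 kB FIFO only holds about five
# full-size (1518 B) frames, so the margin before pausing is tiny.
FIFO_SIZE = 8 * 1024
HIGH_WATER = int(FIFO_SIZE * 0.75)

class ToyMac:
    def __init__(self):
        self.fifo_bytes = 0
        self.peer_paused = False

    def on_frame_rx(self, frame_len):
        self.fifo_bytes += frame_len
        if self.fifo_bytes >= HIGH_WATER and not self.peer_paused:
            self.send_pause(quanta=0xFFFF)   # "stop sending for now"
            self.peer_paused = True
            self.raise_interrupt()           # firmware schedules a DMA drain

    def on_dma_complete(self):
        # Firmware has copied the FIFO contents out to DRAM.
        self.fifo_bytes = 0
        if self.peer_paused:
            self.send_pause(quanta=0)        # cancel the previous pause timer
            self.peer_paused = False

    def send_pause(self, quanta):
        print(f"TX PAUSE quanta={quanta}")

    def raise_interrupt(self):
        print("IRQ: FIFO above high-water mark")
```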
> The sad part is that I don't even know what to recommend for a good consumer level switch that has good defaults or configurable defaults and sane buffer config...
As you must know, you can turn it off entirely in most managed switches and see what happens to data transfer speed.
Most consumer-level gigabit switches seem to have maybe 16 kB of buffer. So they don't really have much buffer (or anything) to configure.
Like you said, the host might have small buffers and without Pause it would drop, but who's supposed to buffer the packets? The cheap switch with 16kB of buffers and a buffer configuration so idiotic that everyone else on that switch gets paused?
You seem to think that it's bad to drop packets in the NIC, and while some NICs might have buffers that are too small, in general you should drop. If you use TCP the window will adjust to whatever your bad NIC and embedded system can handle. At least you won't affect the others by spreading pause like a cancer (can you tell I am cynical about pause?)
Usually on a switch you can drop packets based on the number of packets destined to a port and the number buffered per input port. This is how you avoid head-of-line blocking, but again, if you're right about 16kB, that's barely enough for one jumbo packet (~9200B)... geez, that's depressing.
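The kind of per-port accounting I mean, sketched in Python with made-up limits - drop for the port that's stuck instead of pausing everybody else:

```python
# Illustrative shared-buffer accounting (made-up numbers): cap how much of
# the shared pool any one egress port may hold, and drop for that port
# instead of pausing everyone once it hits its cap.
SHARED_POOL = 16 * 1024       # total packet buffer, e.g. 16 kB
PER_PORT_CAP = 4 * 1024       # no single egress port may hog more than this

egress_usage = {}             # port -> bytes currently queued for it
pool_used = 0

def enqueue(egress_port, frame_len):
    """Return True if the frame is buffered, False if it should be dropped."""
    global pool_used
    used = egress_usage.get(egress_port, 0)
    if used + frame_len > PER_PORT_CAP or pool_used + frame_len > SHARED_POOL:
        return False          # drop: a stuck port only hurts itself
    egress_usage[egress_port] = used + frame_len
    pool_used += frame_len
    return True
```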
> If you use TCP the window will adjust to whatever your bad NIC and embedded system can handle.
TCP window, sigh... It can't deal with the situation where, say, every second frame is lost, because someone thought 2 kB is enough buffer. TCP congestion control mechanisms are great for actual congestion, but when packet loss is due to other causes, it's actually pretty bad.
Again, TCP is no substitute for flow control in this case.
Doesn't matter how nice a NIC you have. The problem usually happens before the packets reach your nice NIC.
For that reason you should basically never enable Ethernet flow control, except on a Fibre Channel over Ethernet SAN, and even then they had to invent Priority-based Flow Control to make it sane. If this is a managed switch then you should be able to disable it.
(I used to also work at an ethernet switch vendor.)
I've had the exact same thing happen when a host crashed. Relatively modern Intel network chip on the host, Netgear GS108 switch (BCM5398 I believe). Presumably when the host stops servicing interrupts, the card's buffer fills up and then generates pause frames.
I don't think it requires the switch to have a bad buffer policy - the switch ports didn't all die at once, just one by one as each connected device tried to send a broadcast packet. I don't see a way of avoiding this logical situation if pause frames are sacrosanct - it seems a switch would need a heuristic to give up on pausing and start silently dropping packets to just the affected ports.
(I've since disabled pause frames on those cards, since I don't really need them.)
This same root issue - trying to implement "reliable multicast" - is why DaveM rejected the AF_BUS IPC implementation a few years ago. In any multicast or broadcast system you can't allow one stoned endpoint to wedge the bus for everyone.
> The switch must have some horrible buffering policy where one port (the TV port) can hog all the buffers and deprive every other port of the ability to send...
Or could it be that the switch is oblivious to STP ethernet addresses and PAUSE frames? The frames shown (presumably originating from the TV) have a destination address of 01:80:C2:00:00:00, and if the dumb switch doesn't know that this is a special address, it'll just do what the multicast bit in that address tells it to do - copy the frame out to every port...
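For what it's worth, the standard PAUSE destination is 01:80:C2:00:00:01, which sits in the same reserved 01-80-C2-00-00-0x block that a compliant bridge is supposed to filter rather than forward. The frame itself is trivial to construct; a sketch in Python (field layout per 802.3 Annex 31B, the source MAC here is made up):

```python
import struct

def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Standard 802.3 PAUSE frame: dst 01:80:C2:00:00:01, EtherType 0x8808,
    MAC Control opcode 0x0001, 16-bit pause quanta, zero-padded to 60 bytes
    (the FCS is appended by hardware)."""
    dst = bytes.fromhex("0180c2000001")   # reserved MAC Control multicast
    ethertype = struct.pack("!H", 0x8808)
    opcode = struct.pack("!H", 0x0001)    # PAUSE
    pause_time = struct.pack("!H", quanta)
    frame = dst + src_mac + ethertype + opcode + pause_time
    return frame + b"\x00" * (60 - len(frame))

# Max-length pause from a made-up source MAC:
frame = build_pause_frame(bytes.fromhex("aabbccddeeff"), 0xFFFF)
```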
Yeah. IMO most of the blame here falls on the TV rather than the switch. Even when it's implemented well, Ethernet pause frame generation should not be enabled - certainly, not by default - on a consumer product, because it's really unreasonable to expect the average $17 consumer switch to handle it nicely. Furthermore, there's usually little need & little benefit to trying to make a home network lossless via L2 pause.
Also one of the only ways to negotiate your way out of a spanning tree broadcast storm. Generally the firmware on the MAC will reflect a pause frame to the source when its FIFO is full. That happens because the host is not pulling packets out of the FIFO fast enough, or the network has gone bonkers and is sending a gazillion packets per second.
The latter can happen when your misconfigured DHCP server gives out an address that other nodes on your network believe to be the broadcast address for the subnet. The device with that ill-fated address will get deluged after every packet it sends, as other hosts ACK or NAK or respond with queries. I saw that happen when a NetGear router had a netmask of 255.255.255.248, which the user had copied from the WAN config to the LAN config, while the DHCP server was told the netmask was 255.255.255.0. Hilarity (not) ensued.
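To illustrate with made-up addresses (not the actual ones from that incident): under the /29 on the LAN side, .7 is the subnet broadcast, while under the /24 the DHCP server believed in it looks like a perfectly ordinary host address:

```python
import ipaddress

# Hypothetical addresses: under the router's /29 LAN netmask, .7 is the
# subnet broadcast address; under the /24 the DHCP server was handing out,
# it looks like an ordinary host address.
addr = ipaddress.ip_address("192.168.1.7")
lan_29 = ipaddress.ip_network("192.168.1.0/29")
lan_24 = ipaddress.ip_network("192.168.1.0/24")

print(addr == lan_29.broadcast_address)   # True  -> every /29 node answers it
print(addr == lan_24.broadcast_address)   # False -> DHCP happily leases it out
```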
This also happens in completely normal operation, like if you're using a TCP-based MPI implementation and do an all-to-all message send. The destination buffers fill quickly from all the senders, the receiver drops the packets, the sender's TCP hits a retransmission timeout after ~250ms and retransmits. In principle, using PAUSE frames allows the sender to get feedback to pace its sends.
Took me a long time to debug my MPI performance problems because of this.
Uh, no. Alltoall is a challenge for MPI, but not for the reason you describe. TCP windows mean that the receivers aren't the problem. It's all the switch queues in the middle.
TCP windows won't save you. TCP has no way to magically know when some buffer is full. Instead it notices packet loss and interprets it as congestion. Which is not what you want, because it can significantly reduce throughput.
That's the problem when you don't really even have any congestion, but just very high packet loss caused by small buffers. Transfer rate drops to nothing.
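As a rough illustration (my numbers, not from the thread): the classic Mathis approximation bounds steady-state TCP throughput at about C*MSS/(RTT*sqrt(p)), so even modest random loss caps you hard, and at the "every second frame" level the model doesn't even apply because you just sit in retransmission timeouts:

```python
from math import sqrt

# Rough Mathis bound: throughput <= C * MSS / (RTT * sqrt(p)), C ~= 1.22.
# Illustrative numbers: 1460-byte MSS, 1 ms LAN RTT.
MSS_BITS = 1460 * 8
RTT = 1e-3
C = 1.22

for p in (0.0001, 0.001, 0.01):
    bps = C * MSS_BITS / (RTT * sqrt(p))
    print(f"loss {p:.2%}: ~{bps / 1e6:.0f} Mbit/s upper bound")

# loss 0.01%: ~1425, loss 0.10%: ~451, loss 1.00%: ~142 Mbit/s
```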
In fact, at the time, I was working at LBL in a lab that developed new congestion avoidance algorithms. The problem was that most of our tuning was for long-range links, while I was using the cluster for local high-performance networking.
Nobody knows about PAUSE frames until they bite you.
I found out about them when someone at a place I worked wanted to write a custom Ethernet driver for an embedded device. There was no good reason for it; they could have run the regular one shipped for that device (it was an RPi-equivalent kind of unit).
So off they went, and months later it emerged. Everyone was amazed: oh wow, a handcrafted Ethernet driver, impressive.
Except what ensued was months of debugging and Wireshark captures. Not handling PAUSE frames and flooding the network with packets took a good chunk of that time. Of course it was blamed on stupid switches and a broken protocol, and not on the bad decision to re-implement a known, well-defined and stable protocol without a good reason to do so.
> Nobody knows about PAUSE frames until they bite you.
I'll be an exception to that rule. I first heard about them when trying to optimize point-to-point 10GbE NIC throughput. I worked with the support team from the NIC manufacturer because I was trying to saturate four ports simultaneously (at the time that was a big deal, it's probably much easier now).
I knew a good deal about them before I was ever "bit" by them.
I'm sure it wasn't a very good one, but do you recall what the actual stated reason for this undertaking was? It surely couldn't have been "just because".
Low-latency processing and speed. But it was done without anyone measuring the latency and performance of the existing one, and that's the crazy part.
The PAUSE frame is meant to be sent by a station (host) to the switch (or vice versa) as a flow control mechanism, for that port only. Assuming the switch has at least some egress buffering, it shouldn't result in propagation away from that switch port (to, say, the switch's uplink port) unless the switch finds itself completely congested. Most hosts won't have flow control configured at layer 2, relying instead on TCP congestion control. It is only useful when you have non-TCP traffic, for instance Fibre Channel over Ethernet, and you want to avoid packet loss and would rather force buffering upstream.
Probably because the ethernet driver or hardware has a bug, and the rx buffer is full, and it has been configured to enable pause.
I once took down an entire corp. net by doing serial kernel debugging on a machine with pause frames enabled. Once the debugger took control of the kernel, the driver's rx interrupt handler stopped running, and the rx buffers filled. Eventually, the rx buffers were totally consumed, and the NIC started to send pause frames rather than dropping the packets. To make matters worse, I was remote, so I had to call somebody to powercycle the box.
I wonder if it is an issue with the kernel not keeping up, or if it is due to some other odd hardware effect (sometimes "gigabit" nics on embedded boards are really connected to a USB2.0 bus, which is slower than gigabit).
The only "safe" way to configure traditional (not per-priority) pause frames is to configure the switch to ignore pause frames coming from hosts, and to configure hosts to obey pause frames coming from switches. With data center bridging and per-priority pause, some of that goes out the window.
In the early days of 10GbE, I did drivers for a NIC that had a very small rx FIFO. In some cases we had to advise customers to enable pause frames, otherwise the NIC would be subject to tail drops when the switch burst traffic at us. I still feel kind of bad about giving out that advice.
L2 pause frames are a necessity for small networks that don't have server-grade hardware, where your MAC receive buffers are measured in a few kilobytes. That includes most embedded devices (consumer devices, printers, etc., even industrial) and consumer/small-business switches; they all have tiny buffers.
Your datacenter/server/workstation hardware is different. It deals with higher speeds and has appropriate buffering and control.
I wish people here would understand consumer/embedded space needs pause frames to function properly and that TCP congestion control will often significantly hurt performance otherwise. TCP can't magically know when some buffer is full. Without pause frames low level hardware will send at full throttle. It's not ok if every second frame is lost.
I wouldn't call this obscure by any means. You'll find Ethernet flow control enabled on just about every datacenter network, especially those that have a combined network and storage fabric.
Some datacenters enable Priority Flow Control (PFC) which is different in that it pauses only the traffic with a specific PCP ( Priority in 802.1Q vlan tag ). They assign storage traffic a specific vlan priority and treat it as lossless with flow control but the rest of the traffic is unaffected.
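For comparison with a plain PAUSE frame, a PFC frame carries a class-enable vector plus eight separate pause timers, one per priority. A rough sketch from memory of 802.1Qbb/802.3bd, so verify the field offsets before relying on it:

```python
import struct

def build_pfc_frame(src_mac: bytes, quanta_per_class: dict) -> bytes:
    """Sketch of a PFC frame: MAC Control opcode 0x0101, a class-enable
    vector, then eight 16-bit pause times (one per priority).  Layout from
    memory of the spec -- check before using."""
    dst = bytes.fromhex("0180c2000001")
    hdr = dst + src_mac + struct.pack("!HH", 0x8808, 0x0101)
    enable_vector = 0
    times = []
    for prio in range(8):
        if prio in quanta_per_class:
            enable_vector |= 1 << prio
        times.append(quanta_per_class.get(prio, 0))
    body = struct.pack("!H8H", enable_vector, *times)
    frame = hdr + body
    return frame + b"\x00" * max(0, 60 - len(frame))

# Pause only the storage class (say priority 3) for the maximum time,
# leaving the other seven priorities flowing:
frame = build_pfc_frame(bytes.fromhex("aabbccddeeff"), {3: 0xFFFF})
```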
The mechanism here, Pause, is an abomination which should never be enabled.
> Some datacenters enable Priority Flow Control (PFC) which is different in that it pauses only the traffic with a specific PCP ( Priority in 802.1Q vlan tag ). They assign storage traffic a specific vlan priority and treat it as lossless with flow control but the rest of the traffic is unaffected.
I don't think he had his TV in a datacenter.
From the article:
> After some clever deductive reasoning, a.k.a randomly unplugging cables from the router, I determined that my TV was sending these mystery frames (yes, my TV — I have a Sony X805D Android TV).
> The mechanism here, Pause, is an abomination which should never be enabled.
What? You can't be serious, I think you have no idea what that would cause in almost every ethernet network. Let me tell you: a lot of packet loss that messes with TCP streams etc.
L2 pause frames are used by practically all ethernet devices, and for a really good reason. Pause frames are a perfectly good way to do flow control in most networks. Not having them means a lot of lost frames and general pain in most networks... except of course datacenters.
Sure, it's not a standard. But it's good enough for 99.9% of use cases. Just maybe not in datacenter.
Chrome or other applications won't be aware of what is happening way down below at layer 2. To TCP up at layer 4, the pause is indistinguishable from severe network congestion.
Just speculating, but the stop/start oscillation in traffic rate could cause code running in chrome, such as a video codec, to exercise parts of its re-buffering code in a way that exposes a bug.
If you have standard compliant hardware, pause is point to point, not broadcast. You can configure hosts to ignore pause and also to not generate it; although it may be difficult to configure an embedded device, so you probably need to fix the switch or replace it with something that works.
"The very existence of Ethernet flow control may come as a shock, especially since protocols like TCP have explicit flow control"
Not at all. Ethernet is ancient, and there are other transport protocols besides TCP that can and have used it in the past: AppleTalk, IPX/SPX, DECnet, to name a few. This is the beauty of the OSI model: Ethernet at layer 2 is independent of what rides on top of it at layer 4.
Switches aren't supposed to exist in the OSI model at all, which makes it of questionable usefulness today - when was the last time you used a purely-hubbed network?
> OSI is just a model and it is not limited to end stations. What gives you that impression?
I didn't say anything about end stations? The point is that a switch is inherently a violation of OSI layering (it uses layer-3 information to make layer-2 decisions), which given that practically all modern networks are switched, suggests that the 7-layer model may not be that useful for modelling real-world networks.
You missed my point - that Ethernet, at layer 2, is agnostic to what network or transport layers run on top of it. Those examples were there to illustrate non-TCP transport layers. The date doesn't matter at all.
Does anyone know if this is controllable by software?
i.e. is it something DD-WRT, Tomato, et al. can alleviate?
I have been suffering similar symptoms on my home network and I also have a Sony Android TV, though to be honest it hadn't occurred to me to bust out Wireshark to figure out what was going on. In hindsight, I guess this was a rookie mistake on my part.
I always thought you were not supposed to enable flow control on a network with mixed 100-megabit and gigabit devices. From the list of things the OP has hooked up to the network, I would be surprised if they are all operating at 100 megabits.
AFAIR during 23C3(?) someone demonstrated NAV jamming; after turning it on, everyone's wifi connection in the whole hall dropped to zero speed while the person demonstrating was happily browsing :)