
DISCLAIMER: I work for Red Hat Consulting as an OpenShift/k8s consultant.

This is such a bad idea. And I get that their point is to reduce latency. But the point of k8s is to describe your workload accurately and allow it to make decisions on your behalf. The no-brainer way to fix this is to set the CPU Requests and Limits to the same value and add an HPA. Setting CPU Requests and Limits to the same value usually gives people the behavior they're expecting. Having more pods can also reduce latency. But taking away the Limits hides information about the workload while only working around the issue at low to medium workloads. If they were ever to hit Black Friday or other 2.5x workload peaks, I'd worry that the Limits removal would leave k8s unable to schedule the workload appropriately even if they had enough resources on paper. Remember, the idea of k8s is to scale atomically and horizontally while ensuring availability. If you're making something scale vertically, you'd likely want to re-evaluate that workload.
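For what it's worth, a sketch of what that looks like in a container spec (names and values are placeholders, not from TFA):

```yaml
# Setting requests == limits puts the pod in the Guaranteed QoS class,
# so the scheduler knows exactly what the workload needs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: example/app:latest   # placeholder image
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "500m"             # same as request
            memory: "512Mi"         # same as request
```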



What are your comments on the following links:

- https://github.com/kubernetes/kubernetes/issues/51135

- https://github.com/libero/reviewer/issues/1023

- https://medium.com/@betz.mark/understanding-resource-limits-...

- https://medium.com/omio-engineering/cpu-limits-and-aggressiv...

This doesn't seem as dangerous as is being suggested -- and in a world with the kernel bug and some separation of workloads it seems very viable.

Obviously, in a world without the kernel bug it makes much less sense to not set limits. But as far as scheduling goes, well-set requests (perhaps tuned by a VPA[0]) combined with an HPA[1] should be enough to handle sudden increases in scale, and for truly large increases that are a complete surprise (otherwise you could have planned for them), elastic infrastructure via a cluster autoscaler[2].

[0]: https://github.com/kubernetes/autoscaler/tree/master/vertica...

[1]: https://kubernetes.io/docs/tasks/run-application/horizontal-...

[2]: https://github.com/kubernetes/autoscaler/tree/master/cluster...


How are the limits incorporated into scheduling? I assumed that was based on requests.

What does "scale atomically" even mean? How does removing limits relate to horizontal vs vertical? HPA is based on request utilization, not limits, afaik.

What's your take on the arguments against limits in the comment at https://news.ycombinator.com/item?id=24356073 ?


>How does removing limits relate to horizontal vs vertical?

Vertical -> give more resources to the program

Horizontal -> run more instances of the program

Removing limits gives your pods more resources (scaling them vertically) whereas creating more pods creates more copies (scaling horizontally).

Assuming the parent meant scaling by whole units with "scale atomically": you have one or two running instances of the program, not "1.5" of one as you would if you just gave it 50% more resources.
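In k8s terms (illustrative fragments, placeholder values):

```yaml
# Vertical: give each instance more resources
# (in the container spec of a Deployment)
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2"        # raised (or removed) limit = more headroom per pod

# Horizontal: run more instances instead
# (in the Deployment spec)
replicas: 4         # was 2
```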


tpxl gets me. :D Even the "scale atomically" part.

People seem to have inferred that I believe that Limits are used by the Scheduler. I don't. But if we set "Requests = Limits", we're guaranteeing to the Scheduler that our pod workload will never need more than what is Requested, or we scale up to a new pod.

It seems to me latency is a symptom of the actual issue, not the actual problem.

If a workload idles at 25% of Request (12.5% of Limits, as in TFA) and peaks at 50% of Request (25% of Limits), that seems hugely wasteful. What's more, the workload has several "opportunities" to optimize latency. And uncapping the CPU Limit reduces the latency. If it were me, I'd be asking, "Why does my workload need access to (but not full utilization of) 4, 6, 8, 16, 32 cores to reduce its latency?"

More often than not, I've been able to help customers reduce their latency by DECREASING the Pod's Requests and Limits, but also INCREASING the replica count (via HPA or manually). It's not a silver bullet, and whether a workload is node.js, JBoss EAP, Spring Boot, or Quarkus does matter to some extent. The first thing I reach for in my k8s toolbox is to scale out. "Many hands make light work" is an old adage. N+1 workloads can usually respond to more traffic than N workloads in a shorter amount of time.

k8s' strength is that it is networked and clustered. Forcing one node or a set of nodes to work harder (TFA mentions "isolating" the workload) or vertically scaling is an anti-pattern in my book, especially when you understand the workload pattern well. What is being done here is that nodes (which are likely VMs) are being over-committed [0]. Now, those VMs live on physical hypervisors which are likely -guess what- over-committed. Turtles of (S)POFs all the way down I say.
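As a sketch, the "smaller pods, more replicas" approach is just an HPA over a Deployment with modest requests (all names and numbers here are placeholders):

```yaml
# Scale out on CPU utilization (a percentage of the *request*),
# rather than giving each pod a bigger slice.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # the Deployment to scale
  minReplicas: 3
  maxReplicas: 12
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods when avg CPU > 70% of request
```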

Also, TFA mentions

     In the past we’ve seen some nodes going to a "notReady" state, mainly because some services were using too much resources in a node.
and

     The downsides are that we lose in “container density”, the number of containers that can run in a single node. We could also end up with a lot of “slack” during a low traffic time. You could also hit some high CPU usage, but nodes autoscaling should help you with it.
So they acknowledge the risk is real and they've encountered it. For most of my customers, failing nodes, reduced "container density", and "slack" are unacceptable. That translates into increased engineer troubleshooting time and higher cloud provider bills. What's worse, the suggestion that the Cluster Autoscaler will protect you also comes with increased costs (licenses, VMs, storage, etc.). Not the solution I want. Seems like a blank check to your cloud provider.

But I get it. I've fought with customers that tell me, "By removing the Limit, my container starts up in half the time." Great. Then they get to Perf Testing and they see wildly inconsistent (or way sublinear) speed-ups when scaling out, or they're limited by resources in their ability to scale up even when metrics tell them they have resources available, or there is unchecked backpressure, or downstream bottlenecks, or this one workload ends up consuming an entire worker node, or ...

[0] https://www.openshift.com/blog/full-cluster-part-2-protectin...


Limits aren't consulted for scheduling (except when the CPU manager is enabled on the node, which can assign dedicated cores), so the above poster is wrong.


Isn't the solution painfully obvious? Remove the limits around the time you expect extreme loads. Like you said, it works most of the time. Take a hit on unexpected workload spikes. It's a design decision.


Since this started by citing me, I feel somewhat obligated to defend my guidance.

I stand by it.

In an ideal world where apps are totally regular and load is equally balanced and every request is equally expensive and libraries don't spawn threads, sure. Maybe it's fine to use limits. My experience, on the other hand, says that most apps are NOT regular, load-balancers sometimes don't, and the real costs of queries are often unpredictable.

This is not to say that everyone should set their limits to `1m` and cross their fingers.

If you want to do it scientifically:

Benchmark your app under a load that represents the high end of reality. If you are preparing for BFCM, triple that.

For these benchmarks, set CPU request = limit.

Measure the critical indicators. Vary the CPU request (and limit) up or down until the indicators are where you want them (e.g. p95 latency < 100ms).

If you provision too much CPU you will waste it. Maybe nobody cares about p95 @50ms vs @100ms. If you provision too little CPU, you won't meet your SLO under load.
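Concretely, that iteration might look like pinning request = limit and sweeping the value between benchmark runs (the number here is a placeholder):

```yaml
# Benchmark run N: request pinned equal to limit, so results aren't
# skewed by borrowed/bursted CPU. Sweep this value between runs until
# the indicators hit the target (e.g. p95 latency < 100ms) at peak load.
resources:
  requests:
    cpu: "1500m"
  limits:
    cpu: "1500m"
```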

Now you can ask: How much do I trust that benchmark? The truth is that accurate benchmarking is DAMN hard. However hard you think it is, it's way harder than that. Even within Google we only have a few apps that we REALLY trust the benchmarks on.

This is where I say to remove (or boost) the CPU limit. It's not going to change the scheduling or feasibility. If you don't use the extra CPU, it doesn't cost you anything. If you DO use it, it was either idle or you stole it from someone else who was borrowing it anyway.
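The resulting burstable shape, assuming a request sized from a benchmark like the one above (value is a placeholder):

```yaml
# Request sized from the benchmark; no CPU limit, so spikes can borrow
# idle cycles on the node. Under contention, CFS shares still divide
# CPU proportionally to requests.
resources:
  requests:
    cpu: "1500m"
  limits:
    memory: "1Gi"   # keep a memory limit; memory is not compressible
```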

When you take that unexpected spike - some query-of-doom or handling more load than expected or ... whatever - one of two things happens. Either you have extra CPU you can use, or you don't. When you set CPU limits you remove one of those options.

As for HPA and VPA - sure, great use them. We use that a LOT inside Google. But those don't act instantly - certainly not on the timescale of seconds. Why do you want a "brick-wall" at the end of your runway?

What's the flip-side of this? Well, if you are wildly off in your request, or if you don't re-run your benchmarks periodically, you can come to depend on the "extra". One day that extra won't be there, and your SLOs will be demolished.

Lastly, if you are REALLY sophisticated, you can collect stats and build a model of how much CPU is "idle" at any given time, on average. That's paid-for and not-used. You can statistically over-commit your machines by lowering requests, packing a bit more work onto the node, and relying on your stats to maintain your SLO. This works best when your various workloads are very un-correlated :)

TL;DR burstable CPU is a safety net. It has risks and requires some discipline to use properly, but for most users (even at Google) it is better than the alternative. But don't take it for granted!



