This is a compilation of gotcha-discovery-reports, distributed across the surface area of K8s (which, since K8s is huge, covers many nooks and crannies). This is not a compilation of "K8s sucks, here are 10 reasons why", which is what my kneejerk expectation was. (maybe I am too cynical...).
Overall, this is a fantastic index to some very interesting resources.
There's an old saying that you learn more from failure than from success (whose analogue is a blogpost with a title "look how easy it is to set up k8s in 15 minutes at home!"). If you really want to think about how to deploy nuanced technology like this at scale, operational crash reports such as these are invaluable.
Not to knock on Tolstoy but there are many ways in which unhappy families can be grouped together. You have the alcoholic parents, the living-vicariously-through-their-children folks, the abusive parents, etc etc etc.
To tie it back to Kubernetes, you have the scheduler-does-not-schedule lessons, the ingress-not-consistent lessons, the autoscaler-killed-all-my-pods lessons, the moving-stateful-services-with-60TB-storage-size-took-20-minutes lessons and probably many more. It's like Churchill's explanation of democracy: terrible, but better than all the alternatives. (at scale, at least)
I think you inadvertently explained why Tolstoy is right: while there are many categories of failure, within those categories each failure is its own bug, and even for the same bug there are often contextual reasons why the bug manifests in one scenario but not another. Yes, there are alcoholics, but every alcoholic has their own lonely story of descent into addiction.
There are infinite ways to fail, and only finite ways to succeed. That relative difference between infinite and finite is what Tolstoy is getting at. And this is a helpful perspective, because then success becomes less about trying to prevent infinite failure-points and more about doing a few finite things very well.
This is why you see advice from YC like — “don’t worry about competition”, and “talk to users, talk to users, talk to users.”
The problem is that everyone needs to learn those lessons on their own. (But I'm biased: I'm working on a troubleshooting tool for K8s which has remediation rules for all those common cases.)
I guess it depends on your definition of "at scale" but IMO with Hashicorp's stuff and maybe something like Backstage you get all the benefits of k8s but in a package that's much simpler to reason about and manage.
By building a neat box with few buttons on the outside and still the same or more complexity on the inside. If something goes wrong, you still have to deal with all the inside parts. The only hope is that the few outside buttons only trigger happy paths without complications. "At scale" this seems unlikely, simplifications usually only work in smaller, simpler use cases.
Nomad very definitely does not beget the same Gordian knot of complexity as Kubernetes, while working at substantially larger cluster sizes than Kubernetes is capable of.
> There's an old saying that you learn more from failure than from success
OT: That was easily the hardest lesson to instill in my math students, and the most impactful once internalized. Being comfortable experimenting with ideas you don't yet fully understand is critical to the learning process.
it's a hard thing to keep in mind and execute on, even if you know it consciously.
i think people in general can often have too much of a move fast and break things attitude, and i tend to be the opposite, but my default tendency toward risk aversion can definitely go too far. balance is important.
i think explicitly reminding myself to think about the realistic cost of failure can be helpful. e.g. 15 minutes or an hour spent on an uncertain approach to implementing a software feature, or writing something, or trying to sketch out a proof probably isn't a huge deal. 15 minutes or an hour trying some novel approach to fixing a production bug that's writing bad data when there's some known tedious thing that'd staunch the bleeding in a few minutes, or sinking many hours and/or much money into an uncertain hobby or activity, those might be less worth the risk =) (but i didn't get the impression you were including that sort of thing, just rambling)
At the same time your initial reaction must at the very least say something about how k8s is different from other technologies where you wouldn't have this reaction.
The current trend goes toward multi-cluster environments, because it's way too easy to destroy a single k8s cluster due to bugs, updates or human mistakes. Just like it's not a very unlikely event to kill a single host in the network, e.g. due to updates/maintenance.
For instance, we had several outages when upgrading the Kubernetes version in our clusters. If you have many small clusters it's much easier and safer to apply cluster-wide updates, one cluster at a time.
The tech churn cycle is getting more and more insane. It's the same process repeating endlessly.
1. Identify one problem you want to fix and ignore everything else.
2. Make a tool to manage the problem while still ignoring everything else.
3. Hype the tool up and shove it in every niche and domain possible.
4. Observe how "everything else" bites you in the ass.
5. Identify the worst problem from #4, use it to start the whole process again.
Microservices will save us! Oh no, they make things complicated. Well, I will solve that with containers! Oh no, managing containers is a pain. Container orchestration! Oh no, our clusters fail.
Meanwhile, the complexity of our technology goes up and its reliability in practice goes down. Plus, the concepts we operate with are less and less tethered to reality, which makes even talking about certain issues really hard.
I've got a different read on this. It's always been complicated, it's just that each.. I don't want to say "generation", but roughly the same concept, grew up with and internalized and knew about the complexities of the tech stacks they learned, and so when something comes about that moves the abstraction one level higher than what people are used to, it's seen as unstable crap.
This isn't some ageist kids-these-days thing, I mean that the complexity has always existed, and it's always abstracted one level higher over time. Kubernetes doesn't exist because Google and Google-adjacent engineers wanted to foist off "ooh, shiny" on the world, it exists because it solves a problem with containers at scale. Containers solve a problem with resource usage at scale, which are really just an evolution of VMs and a response to their inefficient resource usage and brittleness at scale, and even they were once seen as the new hot fancy overly-complicated thing.
For context, I entered the IT workforce just as virtual machines were being introduced into the average enterprise, and a distressingly high amount of the complaints I hear about containers and k8s were being used against VMs as well.
I'm curious to see where we go from k8s. What the next level of abstraction up will be.
I don't do anything at scale. I don't have to scale. My core competency is as a library or tool builder, and often my primary deliverable is a tool or library.
In the past year I got a new JR engineer on my team who was all hot and bothered with docker and k8s, and he spent a month changing our CI process to be docker-based. It went from a 20 line shell script to a pile of garbage.
I disagreed with the decision at the time, and while I'm a subject matter expert, I'm not the team lead, so I couldn't say no, stop, that's bad.
While I'm sure k8s solves problems for Google I'm not google. My company isn't google, and my team isn't solving that class of problems so docker is useless crap for me.
The person is no longer on the team. They left their docker mark, and ran off to dockerify some other project leaving a team with no container expertise and a CI pipeline that is hacks around docker bugs.
Docker on its own is in most cases a great thing with a plethora of benefits. It's worth the upgrade from a 20 line shell script and I feel like you're part of the "I don't like new things gang". On the other hand it should stop there, k8s or swarm or whatever is not necessary for 90% of use cases or applications.
The existence of right tools for the job implies the existence of wrong tools for the job. The engineer in GP's story used the wrong tool for the job. That is a people problem. Had he used the right tool for the job, the problem would not have existed. GP wrote their CI stuff in shell to solve a tech problem, not a people problem.
What's that quote about things you grew up with are seen as infrastructure, things invented during your early adulthood are amazing, and everything after is useless crap?
The fact that this is similar to something that already happened reassures me a little. Might Lambda and other FaaS frameworks be considered the next layer of abstraction? (even if cloud providers are using containers underneath)
Imagine k8s but all the core features you need to run MMO-level apps are the equivalent of simple Unity Engine functions like Instantiate(gameObject). I want. Maybe I help create?
Bruv. You're catching downvotes for two reasons. A: ageism and B: also not realising that tools like the "old ones" used lasted decades. Release dates are a poor metric here.
I get it, nobody actually got the joke. Should've put the () first lol.
What I'm playing off of is that the poster probably meant when VMware and such took off in the enterprise. As one would see in my post history I know about what came before and that these things are not new in fact. And in many many cases still in use today which is your second point. What started 49 years ago is still in use on the IBM z series and all of that stuff has awesome backwards compatibility. My dad started his career with 360 assembler (as can also be seen in my post history).
So I guess it's ageism in the direction of _young_ people. I hope it's them downvoting. Otherwise it's misjudging people's sense of humour ;)
I'm with you for most of this, but I do think one element you are missing here is that the ever-increasing scale is at least in part to blame as well. Yes, software is vastly more complicated today, and perhaps suffers more errors, although I would want to see data on that. But, like, YouTube in 2021 is a vastly more difficult engineering challenge than YouTube circa 2008. The same can be said for any site with users numbering now in the millions. E.g., not only have Netflix's streaming user counts exploded since 2007 when it launched, the quality expected now of on-demand streaming has strongly increased as well.
Having said that, in reality that scale applies only to a small minority of products. I think an additional part of the real problem is the ever-present Cargo-Cult Oriented Programming model we've had for a while. I am sure container orchestration is important and maybe even required at places at Google scale. But I hear far too much about its use; it seems that nearly everyone is using it, and I really don't understand why. There seems to be a bit of a keeping-up-with-the-Joneses effect, like people would be embarrassed to say that their startup relies on platforms that aren't the latest and greatest. Picking technologies because they sound cool on a resume or at a tech conference, instead of their use being appropriate for the circumstances, seems like a common issue.
> I really don't understand why. There seems to be a bit of a keeping-up-with-the-Joneses effect
The industry strongly incentivises individual engineers to make decisions that will ensure the latest buzzwords appear on their CVs - far more strongly than it incentivises making sound decisions for their current organisation.
It's true, nobody wants to be left behind with a bunch of skills that aren't attractive to desirable employers.
There's also very much a tendency to make things complicated in many workplaces as a form of job security and gatekeeping. But it doesn't last forever. Today's hot-shit is tomorrow's ball-and-chain.
10+ years from now, many k8s monsters will still be running and kept alive by stressed-out teams in India. Much like the "Oracle Enterprise Business Suite" giant-shit-ball from the late 90's is still working in many corporations doing absolute critical stuff with tentacles in every part of the org. Changes in these systems are nearly impossible because of the cost and nightmarish complexity, and _everyone_ is _forced_ to use it.
> There's also very much a tendency to make things complicated in many workplaces as a form of job security and gatekeeping
It's not only that, it's also boredom. Cranking out another cookie-cutter app can be much more fun if you do it in a novel way. Basically all the incentives are aligned towards exotic over-engineering rather than boring, safe but actually perfectly good choices. The danger will come if any savvy manager figures out what this costs them. So we're probably all safe, but you never know.
That's exactly what happens. But not out of ignorance, and not only in IT. It's a recipe for marketing success in any domain. Customers only know the problems of current solutions. Solve those and you get an instant influx of customers. Sure, your solution brings its own problems, but it will take a very long time for them to surface and become common knowledge. First people will only check that your solution does fix the old problems in testing environments. And it does, so they move on. When they encounter implementation/production issues, that's actually an opportunity for you to sell consulting. People who took the bait early on will mob on any naysayer: "Agile doesn't suck, it's just most implementations; real Agile rocks!"
I think about it differently. The "problem" is that we are solving problems that didn't need to be solved before (or so we thought). The infrastructure I work on today is in many ways much more complex but it also handles a lot of things that we just didn't back in the day when everything was simpler.
> Meanwhile, the complexity of our technology goes up and its reliability in practice goes down.
I don’t agree with this at all. Reliability in practice is improving at a phenomenal pace. 10 years ago maintenance outages were a normal feature of every service, and unplanned outages were perfectly ordinary occurrences. Consumers today expect a much higher level of availability and reliability, which they receive rather consistently. Today it takes much less resource to produce a much more reliable system than you would have been able to produce in the rather recent past.
I don't know which industry you worked in, but outages in the telecom industry were strictly forbidden and came with severe financial penalties even 10 years ago. And those companies managed to adhere to really strict uptime SLAs even then.
It might take less resource today though to achieve the same, I agree with that.
Telcos have historically had availability regulations in many places because of how people rely on them for access to emergency services. So they’re a special case here, and the amount of resources they invested into optimizing for that is beyond the capacity for most organizations.
10 years ago I was working for a company that provided a financial OLTP service. We had to invest a huge amount of money to be able to provide a reasonable HA architecture, and to be able to meet 4 hour DR SLAs, and we still had weekly maintenance outages. The amount of effort required to accomplish those service levels today is comparatively trivial, and you could reasonably expect even a low-budget one person operation to be able to exceed them.
You’d expect a service outage to be a significant public controversy today for a lot of companies. It’s never been a good thing, but we’ve come a long way from it being a completely routine event for most services. Especially given the explosion in online services.
Depends on what you compare. If you compare mainframes vs k8s I would say reliability is not on par yet. If you compare commodity systems with simple monolithic apps vs k8s system maybe we just reached parity.
> There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
Why does AWS insist on building this way? Has anyone tried to get the teams together and say, "stop, let's evaluate and simplify where we can?"
As an AWS customer, I find AWS a huge pile of overcomplicated offerings that is dense with its own jargon and way of doing things. IAM is a trash fire. Everything built on IAM like IRSA is also a trash fire. Why are there managed worker nodes, spot managed worker nodes, Fargate worker nodes, self-managed worker nodes, and spot self-managed worker nodes? Why are there circular dependencies for every piece of infra I stand up, making it almost impossible to delete any resources? Why can't I click on a service and see how much it costs me in one or two clicks?
This is a completely insane way of building software to me and I have to eat it.
Amazon has "Invent and Simplify" as one of the leadership principles. However, each team has its own hiring bar and culture, so some will take it seriously and some will not.
> I find AWS a huge pile of overcomplicated offerings
I fully agree. I'm speaking exclusively about the internal infrastructure, not about AWS.
I can't even imagine, sometimes I feel like AWS offers this pretty facade of "scale it up easy" and in reality its like "wait, I can't even see why this thing is failing because its a layer below, and AWS support costs mega bucks"
Fascinating, because I would have thought the whole idea is that AWS is dog-fooded by retail. Yet in truth AWS is too expensive and/or complex for Amazon itself! (Or at least for the original part of it.)
I'm curious what you mean here, since CAPI allows you to manage and scale clusters like any other Kubernetes object. You can literally "kubectl scale --replicas" a cluster control-plane and worker nodes.
You end up needing this anyway, because Kubernetes clusters are confined to single geographic regions and you need some way of managing the separate cluster(s) you have in each region
Not saying you should, but it's entirely possible to have nodes all over the world managed by a single set of master nodes. The system is not really designed with that use case in mind though, the added latencies will hurt a lot in getting etcd consensus.
While it is certainly possible to have mixed region clusters, from my experience it is better to have one cluster per region. Do your synchronous communication within the cluster / within the datacenter with sub millisecond latencies and do asynchronous communication across cluster boundaries with up to triple digit millisecond latencies.
In a way this reminds me of Kubernetes Virtual Clusters. Each virtual cluster has its own tenant control plane and namespaces. Multiple virtual clusters exist in a super cluster.
I think at a certain point up in the stack you don't want to have another orchestrator for the orchestrator. You would just tell each team to keep their cluster up-to-date without creating a (dangerous) master button to update them all at once.
That's what Federation in Kubernetes was for, sort of, but it has never really come to fruition. The closest I've seen is Admiralty, but I'm not sure how it actually works out in reality.
Kubernetes solves a problem. It isn't appropriate for all use cases, but the real problem isn't kubernetes but rather the knowledge gap around new technology.
Are you implying we need a manager1 (k8s) to manage containers? This manager1 cannot be trusted so introduce manager2 to manage manager1? What’s next? We need to look forward to manager3?
What are the best practices for transactional data storage in multi cluster environments, just ditch databases and go for distributed, raft based, nosql variants?
That's a very good question, I'm also interested in knowing the answer. Currently I have seen that every company I had the pleasure to work with had to settle with "eventual" consistency of replicated masters. This is of course totally unacceptable in some industries. That's why I would like to know whether someone found the holy grail of multi-region database design and is willing to share it for free!
Eventual consistency happens when you allow your write transaction to finish before the data has been spread across all masters. It allows you to have fast writes, but it's actually cheating (in my opinion), because only one master (the nearest) has seen your write, while the others don't know about it yet.
If you want to have a strong consistent multi-region database (or other form of storage), then you have to deal with slow writes, because your transaction is only allowed to terminate successfully when all masters have written the new data and are in a synchronous state.
I very much prefer slow strong consistent writes instead of fast eventual consistent writes.
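To make the trade-off concrete, here's a toy Go sketch of the two write paths (not any real database's replication code; `replicate` is a made-up stand-in for shipping one write to one master):

```go
package main

import (
	"context"
	"fmt"
)

// replicate stands in for shipping one write to one master; in a real system
// this would be a network call that can fail or be slow.
func replicate(ctx context.Context, master string, record []byte) error {
	fmt.Printf("replicated %d bytes to %s\n", len(record), master)
	return nil
}

// strongWrite only acknowledges once every master has the record:
// slow writes, but any read afterwards sees the data.
func strongWrite(ctx context.Context, masters []string, record []byte) error {
	for _, m := range masters {
		if err := replicate(ctx, m, record); err != nil {
			return fmt.Errorf("write not committed, %s failed: %w", m, err)
		}
	}
	return nil
}

// eventualWrite acknowledges after the nearest master accepts the record and
// lets the others catch up in the background: fast writes, but a read against
// another master may not see the data yet.
func eventualWrite(ctx context.Context, masters []string, record []byte) error {
	if err := replicate(ctx, masters[0], record); err != nil { // nearest master
		return err
	}
	for _, m := range masters[1:] {
		go replicate(context.Background(), m, record) // best-effort async fan-out
	}
	return nil
}

func main() {
	masters := []string{"eu-west", "us-east", "ap-south"}
	_ = strongWrite(context.Background(), masters, []byte("order 42"))
	_ = eventualWrite(context.Background(), masters, []byte("order 43"))
}
```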
Yes, but Google Spanner is not open source, so it's not so easy to actually look at it. But what is known to public is that Spanner relies on perfectly synchronized atomic clocks within their datacenters, called TrueTime. This solution is very special and does not work for other databases in general.
Which is why GCP's decision last year [1] to encourage people to only use 1 cluster with GKE for all of your minor, internal, exploring, testing, side and hobby projects was so frustrating.
For your main production load at work, paying management fees for many clusters is not that relevant a cost. So not an issue with my clients' projects.
But for my many hobby-projects, dev and few self-employed business workloads balancing them all on as few clusters as possible is a major pain.
Before that I/you scaled it out to separation of concerns, maybe only grouping a few workloads. And then quite happily scrapping whole clusters all the time.
Now any lifecycle issues with them affect lots of independent applications. They went from cattle to pets overnight.
Here's mine -
We were running on Cloud Foundry, had one DevOps person that mostly dealt with Jenkins, and paid for 32-64GB of RAM.
Decided to move to K8s (Azure AKS),
Three months later we have 4-6 DevOps people dealing with networking, cross-AZ replication, cluster size and autoscaling,
And we're paying thousands of $$$ pm for a minimum of 6 64GB VMs.
FAIL
Corporate decided to stop trying to compete with cloud vendors and shut down our in-house Cloud Foundry hosting.
Also Microsoft sales folks worked client decision makers pretty hard.
What? Take some damn responsibility for your decisions and stop blaming them on the tools. You alone decided to migrate to K8s while Jenkins was working fine, and you alone couldn't manage a cluster with 6 nodes. It's your responsibility that the tool you have chosen isn't fit for your purpose.
In every crowd you will find a set of loud victim blamers.
Take some responsibility for your tools instead of getting defensive worrying that if someone actually yells at one tool author they're coming for you next.
I work my butt off to write good tools and half the time when someone comes to me about a problem, they're still right.
We have a giant Dancing Bear problem in our industry. We sit around hand wringing about all of the bad things people do out there but we never stop to think that it's environmental. Children with fucked up parents become fucked up adults. You surround them with chaos and you get more chaos. You surround them with pain and you get more pain. If you put developers in an environment where garbage tools are not only tolerated but are aggressively defended, what have they been inspired to produce? What can you expect them to produce?
More garbage.
So we go 'round and 'round victim blaming and never owning our part in this cycle of violence.
> In every crowd you will find a set of loud victim blamers.
I think the line has to be drawn somewhere on whether someone is a victim, or are complicit in their own suffering. If someone steps in doggy-doo on a walk, that's unfortunate and they might actually be a victim. However, if someone says "Me and my team have been stepping on doggy-doo everyday for the past 5 months", they are no longer victims, unless someone is coercing them into doing that against their will; and I will have some questions to ask their (technical) leadership.
At any given moment, half of developers have less than 5 years of experience.
I'm not just taking responsibility for my actions, I have to take responsibility for theirs, too. It's expected, but it would be the right thing to do even if it's not. Glitchy, dangerous tools make it difficult to grow other people into senior positions, where they can take responsibility for their own decisions. Every bad tool I use slows the process of creating a peer down.
Some of the feedback I've had so far is that it was refreshing to get "permission" to consider alternatives vs. the current hype. I use K8s and K3s quite broadly myself, but increasingly see consulting prospects and customers who are not comfortable to make the leap, but are very happy on managed services with their chosen vendor - Azure / AWS / GCP.
The OpenFaaS project again is very coupled to Kubernetes; making it easier to use and more reliable is important to the community and the project's future. However, we created a version called faasd that works more like docker-compose. It's received much more traction than we expected, and companies and individuals are putting it into production. It doesn't have clustering and supports only 1 replica per function, so the traction is surprising. https://github.com/openfaas/faasd
I'll keep doing my bit to promote solutions that make K8s easier to understand like K3s (see also https://k3sup.dev) and to look into alternatives. But as you will see in my blog post - I don't think it's right to assume Kubernetes is the right solution for every team, and every project, without first talking about the problem being solved.
> I'll keep doing my bit to promote solutions that make K8s easier to understand like K3s (see also https://k3sup.dev) and to look into alternatives. But as you will see in my blog post - I don't think it's right to assume Kubernetes is the right solution for every team, and every project, without first talking about the problem being solved.
Thanks for doing your part. Frankly I don't believe the message is being spread enough. At least in the circles I circle, the unspoken expectation to use k8s for "everything" is persistent. The energy that I save by not dealing with it is sometimes spent on explaining why I'm not dealing with it. Yes, entirely an outcome of my circles. I will however share your links in the future, so thanks.
Oh good lord, I'm in the middle of one of these right now.
My team runs a fair number of K8s clusters, mostly on Azure AKS. Of the pods in our clusters, a few talk to the AKS API server for their cluster. Those pods that do, will, occasionally, lose contact with the API server. API calls will start timing out. It'll usually resolve on its own after some time, but then come back later. It's somewhat affected by load. (If we reboot a node or nodes, it usually happens after that reboot.)
I think SNAT port exhaustion is likely. Azure claims it isn't, and they don't provide monitoring on it, so I'm forced to take their word on it.
Azure support has been less than helpful. They think we're putting too much load on the API server. The sum total of all of our custom pods that make k8s API calls is only like ~1/3 of the total load. The rest is core k8s components, like the kubelet. That to me, doesn't smell like a lot of load. Their other suggestion is to buy the "Uptime SLA", a contract that adds an SLA to the cluster. Otherwise, all they offer is an SLO. (Which, empirically … we don't get.)
If someone would love to write the rest of this story… I'd love it. We're rolling out the SLA, and it seems promising, but we're hitting what we think is a separate issue. (Sometimes, the API server will refuse connections from particular nodes, usually for good: we've had to reboot the node to unwedge it. This is the second time that's happened; the first time, Azure support told us they couldn't debug without a live example. Now we have a live example…)
(I do, on the whole, like Kubernetes. It does a lot of things well.)
This is the downside of “cloud services”. It’s all sunshine and rainbows when it works. When it breaks, start praying and pull out the wallet hoping that helps. You are just at their mercy, because you can't look under the hood even if you want to. And it takes a long time before they decide to assign competent people to your case.
My experience with Azure is pretty bad. (Premier) Support is next to useless. AWS is somewhat better, especially the quality of their software/services/documentation. In my opinion, AWS services feel more battle-tested than Azure's.
Sorry for the rant, back on topic.
Do you have monitoring on the cluster? I would start to collect more low level metrics like open files, network traffic, pings
Also what could help is to run daemonset of a pod that can send you diagnostics at regular intervals. Perhaps even capability for remote shell so that you can troubleshoot from within the node when the problem surfaces again.
I would definitely be interested to know more how this turns out. I also run a fair share of AKS.
Would love to discuss in detail and see if I can help. At my last employer I wrote low level solutions for K8s related to security and at the moment I'm working on a stealth mode startup for K8s troubleshooting. Feel free to shoot me an email at aantny at gmail.
It takes a high level of skill and maturity to articulate such failure stories. The thing you missed is obvious to others, the "bug" you found is actually documented somewhere, and you can mitigate and move on from disaster without completely drilling down and reconstructing the problem. So thanks for posting.
Definitely a good one: "Experiences with running PostgreSQL on Kubernetes - Gravitational - blog post 2018"
For anyone who thinks running a database in a container environment is a neat idea, think again. I am guilty of using containers for temporary test databases, but the thought of running production databases in containers sends shivers down my spine.
I don’t know why anybody would presume that a technology focused on ephemeral resource provisioning would be a suitable place to put your persistence layer...
That said, I don’t think it’s a sin at all to use it for testing. My default local dev setup is to use a Postgres container. But persistence is very much not required in that situation.
> I don’t know why anybody would presume that a technology focused on ephemeral resource provisioning would be a suitable place to put your persistence layer...
Kubernetes does more than that, and features like PVCs + StatefulSets are basically intended for, and designed for, exactly this use case. If you see the HN comments[1], the top comment mentions this, and that the article waves it away for reasons not related to k8s, but to "well, if the underlying storage is slow or not durable, then…" … yeah, then it doesn't matter whether you're running k8s in the middle of it or not.
Kubernetes was not initially designed with persistence in mind. If it was then the etcd of the master nodes would also be in containers.
There are (good) attempts at shoving it in, but the general advice I would give is that if you care about your data you should give it every possible chance to not be corrupted or disrupted; and that means keeping the number of abstractions and indirections low.
You can make it work, but why would you want to? Databases aren’t generally something that benefits from using container orchestration. They’re not usually highly dynamic, horizontally scaling systems. Generally you’d optimize that part of your system to maximize stability and consistency. For most typical use cases I can’t see the intuitive leap required to decide that all that additional complexity is necessary to attempt to replicate what you’d get from a few VPS. Unless you have a specialized use case, to me it just seems like very obviously the wrong tool for the job.
Not that I advocate running your own Postgres setup in your own cluster instead of just renting a managed version, but I've run a few databases on K8s and found it pretty fine: useful for when your hosting provider doesn't support the database you want to run (ClickHouse managed AWS service when?) or for application-specific KV stores. EBS volumes and PVCs are great, solid performance, Kubernetes takes care of the networking and will resurrect it if the worst happens and it does go down.
I probably could have those things on their own instance, but then I'd have to go through the hassle of networking, failover/recreation, deployments, etc., and for the vast majority of cases that's 100% more effort than deploying a StatefulSet.
Now! Altinity runs Altinity.Cloud in AWS now. Feel free to drop by.
There are also services in other clouds. Yandex runs one in their cloud and there are at least 3 in China. ClickHouse has a big and active community of providers.
I strongly support this advice having felt the pain.
Inherited a setup using a semi-well-known vendor's Patroni/Postgres HA operator implementation on OpenShift, and it was extremely fragile to any kind of network latency/downtime (due to its strong tie to the master API) or worker node outage/drainage/maintenance. These events would mean hours of recovery work hacking around the operator.
It was not my decision to place Postgres on OpenShift, and I will strongly discourage anyone planning to do this for production (or even testing). Please do not do it if you value your time and sanity. Spin up a replica set on VMs using one of the already production-ready and battle-hardened solutions, or if in the cloud, use a managed PostgreSQL service.
For me, personally -- I cannot think of a sufficient justification to put a production database in a container. A good database server is designed for performance, reliability, scalability, security, etc., without containers. Putting a production database inside a container introduces a world of unnecessary edge cases and complexity.
Depends on requirements. Someone needs one big, highly-optimized DB instance. Someone else needs a high-availability 3+ instance cluster. Having a cluster of containers brings a performance penalty, but if your app is read-heavy, you can read from all instances and multiply read throughput...
Thanks for your feedback. I might run the DB on the host then, and just use containers for the app server. I'm not at the scale to warrant a separate host for the DB.
The Istio one hits home. It is the single scariest thing to work with in our kubernetes clusters. We've caused several outages by changing the smallest things.
I recommend staying away from Istio unless you have an experienced team of Istio ops and very good reasons for using it. I tried it a year ago and it was not pleasant at all. It looks nice (Kiali) when it works, but it's a nightmare once it stops working. Using TLS complicates debugging. It lacked basic functionality like simple Basic Auth (just recently implemented in wasm) and they deprecated the original plugin architecture. At one point it started logging certificate problems for me; I spent hours trying to figure out what was happening. Then I tried to upgrade to a newer Istio and it magically fixed it, so I think they shipped a broken build, but comparing commits resulted in so many changes that I was unable to figure out what happened. It's an unnecessarily complicated layer which almost no one really needs! Maybe in the future it will be a standard, but maybe not. I strongly don't recommend it unless you know what you are doing and have the work/troubleshooting capacity.
We go through several stages before it hits prod. Our test environment -> the dev environment -> staging -> prod and still we are bitten by istio.
A lot of the problems we have seen do not manifest themselves until you get significant traffic in the cluster.
Luckily, it is one of our goals this year to make our setup testable.
Our setup is:
- an ingressgateway running on each node so that our SIEM can get IP addresses
- 1000s of virtual service entries for all the services running (40 per namespace x 25+ namespaces)
- 100s of deploys a day which causes pilot to update all the time
- the clusters are small with like 50 nodes
- we have way too many services and consolidation is slow
- we haven't been able to upgrade to the Istio operator yet because there is an outstanding bug that breaks everything
I don’t use it yet, but I’m probably going to have to because my team is looking at using Kubeflow and it leverages Istio to do things like traffic-splitting and stuff so you can A/B test models without needing to handle that at your model code level.
KNative can also do really cool things like per-request routing, scale-to-zero deployments that it can bring back up when it gets a request, it’s pretty rad.
My company has deployed Kubeflow for production model training. Early on, we used their big deployment, but we got frustrated trying to manage it with kfctl, so we started using kustomize directly and deploying only what we need, like KFP. So no istio for us! YMMV with your use case.
Yeah, my plan is to use it the same way we use Kubernetes: everything via Kubernetes configs, and nobody using CLIs, which I'm convinced are the text version of "click ops" hahaha.
So Istio isn’t strictly required? Can I ask which components you deploy? At minimum I’m planning on just the deployment and serving components and just use straight Polyaxon for training.
Yeah, there are quite a few applications that actively discourage the use of anything other than their CLI, which I find profoundly bizarre. Istio had a big warning in their docs that their Helm charts were deprecated, even though istioctl uses them under the hood. Funny that Kubeflow has its own Istio deployment, and they make you use kfctl to deploy it, haha.
It's not strictly required, but might be for you, since A/B deployment actually is a thing Istio does. We only use Kubeflow Pipelines currently; looking into Katib.
My feeling about Kubeflow is that it's a package of a lot of things that exist, with nice things on top, and an easy way to deploy those things. Only, it ends up not being that easy, and not easy at all to configure, and various features of the underlying tools are hard to get to or completely unavailable.
If you deployed your own Istio, you'd understand it end-to-end and could solve problems with it when it goes south, and you'd even understand what exactly you're using it for and why. The biggest problem I have with kfctl is that it basically asks for system:masters privilege to install everything and then you get to figure out what it did on your own. I don't want Istio, Cert-Manager, Knative, Argo, etc. deployments that I don't understand and can't easily configure because they're buried 5 levels deep in Kustomize overlays. These are all things I can install from elsewhere and with more documentation.
Kubeflow Pipelines is still a little mysterious, but the footprint is much smaller and not as invasive.
Yes, the opposite of it - adjective-laden, flowery, non-specific hype really irks me. To me it reads like how people talked about car design before Ralph Nader pointed out how cars with metal dashboards and no seatbelts are a bad idea.
It's too bad we don't have more UNIX-style tools in the DevOps space (in the sense of "do one thing and do it well"). Some features of K8s like software-defined networking would be really useful for our infrastructure, but it seems you'll have to fully commit to the rest of the software to use a single feature. Personally I'd prefer to have a loose set of tools for containerization, networking, orchestration, load balancing etc. that I can combine as I see fit and gradually adopt, and not a single monolith that ties me into a given paradigm for all these things.
Surprised I don't see something about "context deadline exceeded" in there. This error has been plaguing me from Kubernetes 1.5 to 1.18, and nobody has a solid answer for it other than maybe it's Docker related. Solution? Restart your node(s).
It is just a timeout. Golang uses contexts, which can have a timeout set. Then you can make a request (HTTP/gRPC) under that context, and if it times out before getting a response, this error is returned. To find the real reason you need to know the architecture and monitor metrics. It can be a network problem or a kernel problem (resource exhaustion).
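For anyone wondering where the string literally comes from, a minimal Go example (the address below is just a placeholder for whichever component isn't answering):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Every Kubernetes component is a Go program making calls like this one.
	// If the deadline passes before the other side answers, the error returned
	// contains "context deadline exceeded" -- the string you see in the logs.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Placeholder address standing in for a kubelet/apiserver/etcd endpoint
	// that is overloaded or unreachable.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://192.0.2.10:10250/healthz", nil)
	if err != nil {
		panic(err)
	}
	if _, err := http.DefaultClient.Do(req); err != nil {
		fmt.Println(err) // ... context deadline exceeded
	}
}
```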
Maybe I should mutter[1] "That's Kubernetes as fuck" next time I see more shenanigans involving multiple overcomplicated layers with confusing documentation interacting with each other in a way nobody can figure out.
Is Kubernetes really overcomplicated, though? Say you wanted to do a release of a new version of your app. You'd probably boot up a VM for the new version, provision it an IP, copy your code to it, and check if it's healthy. Then you'd edit your load balancer configuration to send traffic to that new IP, and drain traffic from the old IP. Then you'd shut the old instance down.
That's basically what a Deployment in Kubernetes would do; create a Pod to provision compute resources and an IP, pull a container to copy your code into the environment, run a health check to see if that all went OK, and then update the Service's Endpoints to start sending it traffic. Then it would drain the old Pod in a reverse of starting a new one, perhaps running your configured cleanup steps along the way.
The difference between the example and what Kubernetes does is that Kubernetes is a computer program that can do this thousands of times for you without really making you think about getting the details right every time. Doesn't seem overly complicated to me. A lot of stuff? Sure. But it's pretty close to the right level of complexity. (Some may argue that this is even too simple. What about blue/green testing? What about canaries? When do my database migrations run? Kubernetes has no answer, and if it did, it certainly wouldn't make it simpler. Perhaps the real problem is that Kubernetes isn't complicated enough!)
Anyway, "that's Kubernetes as fuck" is kind of a good meme. I'll give you that. But a deep understanding usually makes memes less amusing.
Those people clearly haven’t tried to setup vm live migration or done anything with openstack. Or deployed things at scale on Borg or Tupperware. I can go on...
> You'd probably boot up a VM for the new version, provision it an IP, copy your code to it, and check if it's healthy. Then you'd edit your load balancer configuration to send traffic to that new IP, and drain traffic from the old IP. Then you'd shut the old instance down.
I haven't deployed an app this way in probably a decade. The process is much closer to: merge a new feature into master, let [GitHub Actions, CircleCI, Azure DevOps, something else] run the tests and deploy the code.
After your initial setup, if your deployment pipeline is "merge the code in" and you have a small-to-medium-sized application, k8s is way too much in about 95% of use cases.
The problem isn't deployments, it's that k8s doesn't only do deployments. It does a ton of stuff, and deployments are a small part of it. If you only wanted to use k8s for deploying your code, well, you can't.
That's what was great with the UNIX programs philosophy, and what is lost in Kubernetes. You can't pick just the feature you need, and the feature you want may well break because of a problem in another part of the system you don't even know existed (hello persistent volumes flagged for deletion, resource allocation failing, ...)
No single Kubernetes feature may be overcomplicated, but k8s as a whole is for sure extremely complex.
I think the core is very good. You need a pool of machines, you need to be able to schedule workloads, you need the workloads to be able to talk to each other over the network, you need the cluster to reach a consensus about what state it's in, etc. That is complicated, but basically mandatory.
Intrinsically, I think you take on a lot of problems by increasing the number of computers you control. Once you have decided that you need more than 0 computers, you are in for a world of hurt -- that's the entire field of software engineering. Once you have decided to make the jump from 1 computer to 2 computers, you now have to contend with the entire field of distributed systems, just so that you can serve your website when one machine fails, or you want to upgrade Linux, or whatever. At some point, even "small" companies have 10 engineers and 10 computers, and that is right about when "some guy will handle this for me" fails and you start needing software to manage things like scheduling or deployments. You can certainly make it work for longer than that without using Kubernetes, but the returns start diminishing very quickly.
There are absolutely quirks of Kubernetes that have caused users a lot of unnecessary hurt -- mandating a 5 node etcd cluster when they could have bought an RDS instance for $70/month or something. That seems to be being fixed, though.
The way I see it is that people run into trouble with the parts that are underspecified. Ingress is my classic example -- it only provides API surface for the simplest of HTTP routing operations, and it's too simple for any real-world application. The result is that people hack in their own ingress controllers in horrifying ways ("just put an nginx.conf snippet in an annotation!") and that bites a lot of users. It goes further than things like Ingress; who is providing monitoring, tracing, and authentication? Bring your own! Who gives you a container registry? Bring your own. How do you code review changes to your cluster? Bring your own gitops! (This is all in addition to the things I mentioned in my original comment.)
There are too many things that Kubernetes should probably do because a lot of people need them, but those people have to cobble together a bunch of unrelated components to get what they need. The flexibility is nice! But one way should win, and it would save people a lot of time.
> You need a pool of machines, you need to be able to schedule workloads, you need the workloads to be able to talk to each other over the network, you need the cluster to reach a consensus about what state it's in
Well, no? I need to be able to start an app, and when that app says 'I'm ready to run', shut down the old version. So at the extreme, I'd just need to ensure all my apps have 2 open endpoints for "readiness" and "kill", and one way to register themselves in some kind of central db of "what's running". That's it, that's an automated deployment solution. Of course, it doesn't really do a lot, but done well, it would be something already!
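As a toy Go sketch of just those two endpoints (registration with the central db left out; not a real deployment tool, only an illustration of the protocol described above):

```go
package main

import (
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// "I'm ready to run": the deployment tool polls this on the new version
	// until it returns 200, then starts cutting traffic over.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// "kill": the deployment tool calls this on the old version once the new
	// one is ready; a real app would drain in-flight requests first.
	http.HandleFunc("/kill", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		go func() {
			time.Sleep(100 * time.Millisecond) // let the response flush
			os.Exit(0)
		}()
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```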
I don't need a pool of machines, or at least that's not the matter for a "deployment solution".
I don't need scheduling, and I certainly don't want something to interfere with how I handle network in my app.
I don't say that all of this isn't good or often needed, but it's not atomic to the need of "deploying an app".
And then the problem arises that the feature set of k8s is neither complete w.r.t. some utopian "ultimate end-all solution" (because realistically, it can't be), nor atomic at all in what it does. And solutions like k8s end up somewhere in the middle, never finished, always updating, and, well, complicated!
I think we can draw parallels between C and Kubernetes - C can be an ideal tool for some stuff, and it's theoretically possible to write a secure program in C.
However, there's a reason why people keep saying that it's practically impossible to write C securely. C provides so many gotchas and footguns that it's not even funny. Unless your problem space really demands it, there's not much reason to use C other than "I love to live dangerously."
Similarly, Kubernetes can be an ideal choice for some people. It may even be an OK choice for some others. But there are many for whom there's no reason to use k8s other than "another shiny item on my resume."
Heh - yeah. Henning does all the hard work of curating these. But I had the domain sitting around and figured that it was a good use of it. Sharing and learning through stories like this is how systems get better.
If you want simple log forwarding, fluentbit is really good. But if you find yourself starting to tweak the fluentbit config, or writing a custom plugin in C (and recompiling it yourself), then it's time to move to fluentd.
These are great, I found myself trying to predict the related areas, “..Slow connections..” ooh that must be conntrack! Been burnt by kube2iam too.. I guess it’s k8s failure bingo.
I picked two or three random stories, and learned a lot just from that, be it technology I didn't know about or just some trick for debugging. Recommended.
really cool resource for learning but a lot of these have nothing to do with k8s, beyond the company in question having k8s as part of their stack (i'm addressing the possible perception of the post title suggesting k8s horror stories).
A lot of them do have things to do with k8s, though. Admission webhooks, Istio sidecar injection, etc.
The CPU limits = weird latency spikes also shows up a lot there, but it's technically a cgroups problem. (Set GOMAXPROCS=16, set cpu limit to 1, wonder why your program is asleep 15/16th of every cgroups throttling interval. I see that happen to people a lot, the key point being that GOMAXPROCS and the throttling interval are not something they ever manually configured, hence it's surprising how they interact. I ship https://github.com/uber-go/automaxprocs in all of my open source stuff to avoid bug reports about this particular issue. Fun stuff! :)
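(For reference, using automaxprocs is just a blank import for its side effect — a minimal example:)

```go
package main

import (
	"fmt"
	"runtime"

	// Blank import for the side effect: it sets GOMAXPROCS to match the
	// container's cgroup CPU quota instead of the node's core count.
	_ "go.uber.org/automaxprocs"
)

func main() {
	// On a 16-core node with a CPU limit of 1, this prints 1 instead of 16,
	// so the runtime stops burning the whole throttling quota in the first
	// slice of every period.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```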
DNS also makes a regular appearance, and I agree it's not Kubernetes' fault, but on the other hand, people probably just hard-coded service IPs for service discovery before Kubernetes, so DNS issues are a surprise to them. When they type "google.com" into their browser, it works every time, so why wouldn't "service.namespace.svc.cluster.local" work just as well? (I also love the cloud providers' approach to this rough spot -- GKE has a service that exists to scale up kube-dns if you manually scale it down!)
Anyway, it's all good reading. If you don't read this, you are bound to have these things happen to you. Many of these things will happen to you even if you don't use Kubernetes!
I have an environment variable to set process count for a nodejs service because everyone is finger pointing at who should be the one to sort out the fact that cgroup cpu limits and nprocs disagree most of the time. Our stuff is fine, those other guys should fix their stuff.
I can write my own config management, secret management, volume mounting, deployments, replica scaling, hardware affinity, antiaffinity for resilience, rollover kicking out e.g. staging if there's not enough room to schedule production workloads, and the list goes on.
I can also figure out how to install and upgrade each piece of software in my universe, across a few different languages.
Or I can just use kubernetes.
From the perspective of my developers, they tell me a few things like: "I need 4GB RAM, 2 cores, I need access to the postgres secret to talk to it, and here is my hostname" and while they can tell me many, many more things, with that little information, they can have a full application in one of several different languages in minutes.
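For concreteness, that conversation maps more or less onto a single container spec — sketched here with the Go client types rather than YAML, and with made-up image and secret names:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// "4GB RAM, 2 cores, access to the postgres secret" as one container spec.
	web := corev1.Container{
		Name:  "web",
		Image: "registry.example.com/web:1.2.3", // hypothetical image
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("2"),
				corev1.ResourceMemory: resource.MustParse("4Gi"),
			},
		},
		// Expose the keys of the "postgres" Secret as environment variables.
		EnvFrom: []corev1.EnvFromSource{{
			SecretRef: &corev1.SecretEnvSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: "postgres"},
			},
		}},
	}
	fmt.Println(web.Name, web.Resources.Requests.Cpu(), web.Resources.Requests.Memory())
}
```

(The hostname piece would live in an Ingress or Service object rather than the container spec.)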
If I were to hand roll this, it'd be held together by prayers and duct tape. And I lose out on a huge active community building tools around this. I now have istio and get cross service telemetry. For free. I can set networking rules up setting up QOS between arbitrary services, both inside my network and outside. All applied behind a single interface. There's a lot of stuff I get for free, and as my team discovers things it needs to do, I have a consistent layer to do that all behind.
I'm one guy. I don't work for Google. But my team still manages _a lot_ of services, both on application end, and things like queues, databases, etc. You can say we don't need all of that, but you'd be wrong. We arguably could use more. I can reason about my infrastructure in both the small and the large.
Then you get into the idea of transferring to other companies. If someone comes into my org, they can look at what I have and see more or less everything there is in any part of my infrastructure. It's all right there, laid out in YAML. If I were to switch companies, or even teams, it'd be the same. Having a lingua franca of common terms and ideas is super critical there. To say nothing of all the work happening in the ecosystem allowing me to help my users ship features faster to our stakeholders. Doing this in VMs would be a literal nightmare.
So, respectfully, I believe your opinion is not correct here. You personally may not need it, but it's not very long before you get to reasonable amounts of services for any project that's sufficiently complex, in my experience. And even when I have just 1 or 2 I still need many of the primitives of kubernetes.
Exactly. Kubernetes has established patterns. Once you understand them, everything gets EASY: how to set everything up, how to deploy, networking, service communication, secrets, etc.
I've had teams create brand new services in hours that used to take a week. They can create, deploy, debug and manage their own service without any help from me. Their service runs and looks exactly the same as every other service. When things go wrong, they know where to look.
My team is smaller than it was last year while running 3x as much. That is all from a good combination of CI jobs, terraform and kubernetes. Also, our cloud bill is WAY down from last year because of the fewer number of VMs we need to run.
Is kubernetes a fad when the CTO is happy because costs are down, while productivity and stability are up?
I think a lot depends on the environment you're working in / migrating from. If you're working from an on-prem datacenter with VMs and you're not moving to the cloud for whatever reason, your point about not having to hand-roll config management, secrets management, volumes, etc. is a good one, and Kubernetes makes a lot of sense in that environment. Or using Kubernetes as a target for vendor apps that may need to run in unknown or exotic environments.
But if you're already in the cloud, most of the k8s solutions already exist as individual tools, with little operational maintenance needed. "I need 4GB RAM, 2 cores, I need access to the postgres secret to talk to it, and here is my hostname" is all achievable with Lambdas or ECS Fargate task definitions and Parameter Store. Managed services for queues and databases lower the operational burden. All of this while not having to risk a cluster upgrade bringing everything down, or having to increase operational overhead by running multiple clusters to have some operational blast doors.
All of this to say: if Kubernetes is working for you, that's great! Congrats. But I'm on the hill with OP. Most platform/infrastructure engineers that I've heard pitch Kubernetes are trying to solve one problem, e.g. secrets management or deployment, not all of the problems at once, and bringing in Kubernetes complexity when there is already a fully managed, safer and cheaper solution to migrate to doesn't usually make sense.
There's no 100% right or wrong answer but I think you really have to evaluate what you're migrating from when evaluating moving to kubernetes
There are many options that are not kubernetes and not VMs. Cloud native offerings are extremely simple and powerful. ECS does almost every single thing kubernetes does with less headaches.
To be fair, the OP is probably referring to big companies' stable enterprise applications with a few thousand users. There is a push to migrate everything to Kubernetes but the applications are doing ok on dedicated VMs.
Three years ago I joined a startup to help with infrastructure setup. They wanted to use Kubernetes to install their monolithic django/celery AI/web single app using the same script both for their cloud environment and customers servers with single node installation. I didn't last more than a month.
Since then, and for every company I worked with, not a single time was Kubernetes proposed as a solution for the problem it really solves. It was always thought of as the magical solution for whatever problem we had (or didn't even have).