
Fun stuff. It's amusing that the definition of a distributed system used, "Where a computer that you never heard of can bring your system down," is actually one of Leslie Lamport's more famous quotes.

When I joined Sun in '86 I thought it was the pinnacle of technological excellence to be a kernel programmer, and I joined the Systems Group, the notional center of the Sun universe, in 1987. However I discovered that the primary reason you had to be picky about kernel programmers was that their bogus pointer references crashed the machine (as they occurred in kernel mode with full privileges), but then I discovered that network programmers could crash the whole world with their bugs. So clearly they must be in a pantheon above kernel programmers. :-)

The author has come to discover that in the network world things can die anywhere, and this makes reasoning about such systems very complicated. Having been a part of the RPC and CORBA evolution I keenly felt the challenges of making APIs that "looked" like function calls to a programmer but took place across a network fabric and thus introduced error conditions that couldn't exist in locally called routines (the inability to return from the function due to a network partition, to take a simple example).
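
To make that concrete, here's a minimal sketch in Go (against a made-up quotes.example.com service, nothing to do with Sun RPC or CORBA specifically) of a call that reads like an ordinary function but can fail in ways no local routine ever could:

    // Minimal sketch: a "function call" whose failure modes only exist
    // because a network sits in the middle. The service address is made up.
    package main

    import (
        "context"
        "errors"
        "fmt"
        "net"
        "time"
    )

    func lookupRemote(ctx context.Context, symbol string) (float64, error) {
        var d net.Dialer
        conn, err := d.DialContext(ctx, "tcp", "quotes.example.com:7000") // hypothetical service
        if err != nil {
            // The call may never even have started on the other side.
            return 0, fmt.Errorf("remote call failed: %w", err)
        }
        defer conn.Close()
        // ... write the request, read the reply ...
        return 0, errors.New("reply lost: peer or network went away mid-call")
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        if _, err := lookupRemote(ctx, "ACME"); err != nil {
            // A local routine has no equivalent of "it may have executed
            // over there, but we'll never hear back".
            fmt.Println(err)
        }
    }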

Lamport's work in this space is brilliant and inspired. Network systems can be analysed and reasoned about as physical systems, even though they exhibit discontinuities when considered as simple algorithms. The value here is to realize that a large number of physical systems tolerate a tremendous amount of randomness and continue to work as intended (windmills for example) while many algorithms only work consistently given a set of key invariants.

I gave a talk inspired by Dr. Lamport's work titled 'Java as Newtonian Physics', which was a call to action to create a set of invariants, in the spirit of physical laws, that would govern the behavior and capabilities of distributed systems. It was way ahead of its time (AOL dialup connections were still a thing) but much of the same inspiration (presumably from Lamport) made it into the Google Spanner project.

As with many things, at a surface level many people learn an API which does something under the covers across the network, but having come up through their education thinking of everything as an API, they don't fundamentally grasp the notion of distributed computation. Then at some point in their experience there will be that 'ah ha' moment when suddenly everything they know is wrong, which really means they suddenly see a bigger picture of things. It makes distributed systems questions in interviews an excellent litmus test for understanding where people are in their journey.



I've never seen an RPC system that I really liked. The closest to a model of distributed computing that gets me from 'a' to 'b' without going terminally insane is anything based on message passing. Even though there is significant overhead, I figure that by the time you go distributed and the target of your RPC call or message lives on the other side of a barrier with unknown latency, that overhead is probably low compared to the penalties you'll be hit with anyway.

So then the trick becomes to make sure that a message contains a payload that is 'worth it'.

Making the assumption that any message may not make it to its destination and that confirmations may be lost (akin to your return example) is still challenging but I find it easier to reason about than in the RPC analogy.
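
As a toy sketch of that reasoning (in Go, with a lossy in-process channel standing in for the network, so purely illustrative): both the message and the confirmation can be dropped, the sender retries until it hears back, and the receiver therefore has to tolerate duplicates.

    // Toy sketch: at-least-once delivery over a lossy "link", where both the
    // message and the confirmation can be dropped. The channel is a stand-in
    // for the network; the shape of the reasoning is what matters.
    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    type msg struct {
        id      int
        payload string
        ack     chan int // the confirmation path, which may also be "lost"
    }

    func lossyReceiver(in <-chan msg) {
        for m := range in {
            if rand.Intn(3) == 0 {
                continue // message dropped on the way in
            }
            // Process m.payload here (idempotently, since it may arrive twice).
            if rand.Intn(3) == 0 {
                continue // confirmation lost on the way back
            }
            m.ack <- m.id
        }
    }

    func sendReliably(out chan<- msg, id int, payload string) {
        ack := make(chan int, 1)
        for {
            out <- msg{id: id, payload: payload, ack: ack}
            select {
            case <-ack:
                return // confirmed
            case <-time.After(100 * time.Millisecond):
                // No confirmation: retry. We cannot tell whether the message
                // or the ack was lost, so the receiver must tolerate duplicates.
            }
        }
    }

    func main() {
        link := make(chan msg)
        go lossyReceiver(link)
        for i := 1; i <= 3; i++ {
            sendReliably(link, i, fmt.Sprintf("payload %d", i))
            fmt.Println("message", i, "confirmed")
        }
    }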

I love that Lamport quote :)

A nasty side effect of all this network business is that what looks like a function call can activate an immense cascade of work behind the scenes, gethostbyname (ok, getaddrinfo) is a nice example of such a function. On the surface it's a pretty easily understood affair but by the time you're done and you get your results back you've likely triggered millions of cycles on 'machines that you've never heard of'.
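
For instance, in Go the equivalent is a single net.LookupHost call; depending on the platform it goes through Go's own resolver or the system's getaddrinfo, and may consult /etc/hosts, local caches and remote DNS servers along the way:

    package main

    import (
        "fmt"
        "net"
    )

    func main() {
        // One innocuous-looking line; behind it, potentially several
        // machines you've never heard of.
        addrs, err := net.LookupHost("www.example.com")
        if err != nil {
            fmt.Println("lookup failed:", err)
            return
        }
        fmt.Println(addrs)
    }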


"I've never seen an RPC system that I really liked."

I must admit I've never seen a message passing system that I really liked either :-) Mind you, that's possibly because of time spent making stuff work in environments where someone made the decision "you shall use message passing for all inter-system communication" even when it wasn't always the best option.

These days my practical test for a remote API is whether I can stand using it through cURL - if I can happily do stuff from the command line then the chances are that code to do stuff won't be too insane.


I liked QNX, and currently I'm playing around with Erlang. (Erlang has tons of warts but it gets enough of the moving parts just right that I find it interesting.)


One does not often hear about the warts of Erlang. Which ones would you name?


Recently I was talking with a guy doing CRDT research. His background was in something CPU-design related. I had always considered a CPU a Newton/Turing ideal machine, so I was surprised to learn that it feels more like a distributed system. Due to high clock frequencies, events that happen in one part of the CPU are unknown to other parts for quite a while, i.e. so many ticks later that those parts have to act semi-independently.


Hi, author here. :-) Thank you for the fun anecdote and kind words.

Hopefully we can collectively get better at addressing these problems.


Absolutely there is more fun to be had. I clearly remember that sort of "ah ha" moment when I figured out that data structures could be computation. That took me from a loop that could not operate fast enough on the data, to one where the data set had some precomputation done on it and the loop only had to 'finish' it for various conditions, and was plenty fast. Suddenly large vistas of "wow" open up. The posting from Julia's blog about how computers are really fast was the same sort of experience for her. Suddenly a new understanding, the world shifts, and now you have a whole bunch of new insight to throw at problems. We can't help but get better at addressing problems.
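
(Not the actual problem from that story, which isn't spelled out here, but prefix sums are a minimal illustration of the idea in Go: precompute once, and the query loop only has to 'finish' the work.)

    // "Data structures as computation": precompute prefix sums once, and a
    // range-sum query becomes a single subtraction instead of re-summing.
    package main

    import "fmt"

    func main() {
        data := []int{3, 1, 4, 1, 5, 9, 2, 6}

        // Precomputation: prefix[i] = sum of data[0..i-1].
        prefix := make([]int, len(data)+1)
        for i, v := range data {
            prefix[i+1] = prefix[i] + v
        }

        // Each query is now O(1) instead of a loop over the range.
        rangeSum := func(lo, hi int) int { return prefix[hi] - prefix[lo] }
        fmt.Println(rangeSum(2, 6)) // 4+1+5+9 = 19
    }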

I believe it was Leslie but it might have been Butler Lampson who mentioned you could stomp on a bunch of ants and the colony still worked fine. Ants are a great example of a durable distributed system that is robust in the face of massive amounts of damage. When you start thinking about computers like that it makes you realize you can build 100% uptime systems after all. The implementation of that property (individual machines are junk, collectively they are unstoppable) was done really well inside Google's infrastructure. They got to watch it in action when a colo facility they had clusters in caught fire.



