Wow, I'm learning today that GitHub used Redis for persistent data, now that they have moved away :-) Anyway, I'm very happy that Redis helped to run such an important site. From the blog post it looks like moving certain things away from Redis was hard even though they are very skilled with MySQL. That is a good thing from the POV of Redis, since it means that Redis makes certain things easy to model. However, moving away was an important priority for them, so I'd like to know why they wanted it so badly and how Redis could be improved in order to serve users better. If Redis had been better for their use case, they could have avoided moving to MySQL, I guess. Unfortunately the blog post is short on details in that regard, perhaps because the blog post author(s) are too kind to bash Redis after using it for a long time.
FWIW we went through a process very similar to the one documented here by GitHub (~3 months ago). It was entirely due to operational reasons and nothing to do with shortcomings in Redis itself. MySQL was the master record for 99% of our data while Redis was the master record for the other 1% (as it happens, it was also a kind of activity stream). Having a single 'master' reference for our data reduced complexity to a degree that made it worth running a less computationally efficient setup. We also have nowhere near GitHub's volume, so we did not have to do such significant re-architecting to make unification possible.
Now we still use Redis for reading the activity streams and as an LRU cache for all sorts of data, but it is populated like all of our specialised slave-read systems (Elasticsearch, etc.) by replicating from the MySQL log.
Yes, this helps and totally makes sense to me. Thanks. I would do the same... In this case, however, it looks like there were certain high-volume writes that could be handled in a simpler manner with Redis. Still, it is totally possible that while this looks like an important use case, it accounted for a small percentage of all the data, so we are back to the consolidation argument: moving everything to a single system is in general a good idea.
In the case of ES the short answer is: we don't. We have fault tolerance in our replication system to guarantee eventual consistency instead. I would say using ES as a consistent source of data isn't really playing to its strengths, so we don't use it that way. The consistency you want is determined at read time: if you need consistency then hit MySQL, but for our use case that almost never happens, as eventual consistency is usually instantaneous enough.
Our other tool is to decouple lookup (which objects to fetch) and population (what data to return for each object). You can mix and match, e.g. do a lookup against an inconsistent ES but still get consistent objects by populating from MySQL (or vice versa). As others have alluded to it depends entirely on the requirements for the result set.
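To make the lookup/population split concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the dicts `search_index` and `primary_rows` are in-memory stand-ins for ES and MySQL, and `lookup` and `populate` are hypothetical helper names, not anything from our actual code.

```python
from typing import Dict, List

# Hypothetical stand-ins: a search index that may lag, and the authoritative store.
search_index: Dict[str, List[int]] = {"redis": [1, 2, 3]}
primary_rows: Dict[int, dict] = {
    1: {"id": 1, "title": "Moving away from Redis"},
    2: {"id": 2, "title": "Redis as an LRU cache (edited)"},
    # id 3 was deleted in the primary, but the index has not caught up yet
}


def lookup(term: str) -> List[int]:
    """Decide WHICH objects to fetch, using the eventually consistent index."""
    return list(search_index.get(term, []))


def populate(ids: List[int], store: Dict[int, dict]) -> List[dict]:
    """Decide WHAT to return for each id, reading from the chosen store."""
    # Ids missing from the store are skipped, so stale index entries drop out.
    return [store[i] for i in ids if i in store]


# Inconsistent lookup, consistent population.
results = populate(lookup("redis"), primary_rows)
```

The set of ids may be slightly stale, but every object that comes back reflects the primary store. Swapping the `store` argument for a cache gives the opposite trade-off.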
Where I work we use several different MySQL replicas in production, where we don't expect them to be in sync.
So long as the source of truth (Master MySQL node) is up to date, it's okay.
For example, if we show a user how much money is in their account on every page, we can run that query on a replica, since it's fine if this is a few seconds delayed. However, immediately after an action has changed their balance, on a confirmation screen, we'd want to show the value from Master.
It's entirely possible that any place Elasticsearch is being used just doesn't need consistency.
There are actually a few strong solutions out there for MySQL, most starting with change data capture like:
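A rough sketch of that routing rule in Python. The `ReadRouter` class, the `sticky_seconds` window, and the string labels are all hypothetical; the fixed time window is my own assumption about how "immediately after" could be decided, not how any particular site does it.

```python
import time
from typing import Callable, Dict


class ReadRouter:
    """Send reads to a replica unless this user wrote very recently."""

    def __init__(self, sticky_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.sticky_seconds = sticky_seconds  # assumed upper bound on replica lag
        self.clock = clock
        self._last_write: Dict[int, float] = {}

    def record_write(self, user_id: int) -> None:
        """Call this whenever an action changes the user's data."""
        self._last_write[user_id] = self.clock()

    def choose(self, user_id: int) -> str:
        """Return which node should serve this user's next read."""
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.sticky_seconds:
            return "master"  # e.g. the confirmation screen right after a payment
        return "replica"  # a few seconds of delay is acceptable here


router = ReadRouter()
before_write = router.choose(42)
router.record_write(42)
after_write = router.choose(42)
```

The window only helps if it is longer than the worst replica lag you expect, so treat the 5 seconds as a placeholder.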
https://github.com/shyiko/mysql-binlog-connector-java
(I link that one in particular because he links to alternatives right in his readme!)
Pgsql is a bit harder, but if I needed to start somewhere it would be with:
If you need any kind of consistency guarantee, I think you would need to use some kind of distributed transactions.
If you don't, you could tail the MySQL log and have a process making the same changes to Elasticsearch. Elasticsearch may lag behind if there are problems.
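Something like the following, as a sketch only. The event tuples are a made-up format standing in for whatever a real binlog reader would emit, and `IndexFollower` with its in-memory `index` dict is a hypothetical stand-in for the process that writes to the search index.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical change event: (log position, operation, row id, row data).
Event = Tuple[int, str, int, dict]


class IndexFollower:
    """Replay change events from a log into a secondary index."""

    def __init__(self) -> None:
        self.index: Dict[int, dict] = {}  # stand-in for the search index
        self.position = 0  # last log position that was applied

    def apply(self, events: Iterable[Event]) -> None:
        for pos, op, doc_id, row in events:
            if pos <= self.position:
                continue  # already applied, so replaying after a crash is safe
            if op == "delete":
                self.index.pop(doc_id, None)
            else:  # "insert" and "update" both become an upsert
                self.index[doc_id] = row
            self.position = pos


log = [
    (1, "insert", 10, {"title": "first"}),
    (2, "update", 10, {"title": "first (edited)"}),
    (3, "insert", 11, {"title": "second"}),
    (4, "delete", 11, {}),
]
follower = IndexFollower()
follower.apply(log[:2])  # the follower lags behind the log here
follower.apply(log)      # catching up; the first two events are skipped
```

Keeping the last applied position is what lets the follower fall behind and catch up later without applying a change twice.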
I'm facing a similar challenge, although at a much (MUCH!) smaller scale.
We have nearly everything in Postgres, and Redis serves both as a caching layer (non-persistent) and as storage for Rails sessions and Sidekiq (persistent).
Having one source of truth can make things like failover much easier. I can handle PG failover, and also Redis failover, but I'd rather not have to deal with both. Especially if you consider the potential for things going slightly out of sync (think of a job in Sidekiq that relies on an id in PG, where one of the two loses a few microseconds of data during replication; just speculating a scenario here).
Did anybody face similar challenges and care to share their thoughts?
Reading more into this point in the post than I maybe should, but "Take advantage of our expertise operating MySQL." sounds like they have more engineers familiar and comfortable working with MySQL than with Redis.
That's a valid reason indeed. Also technological consolidation, that is, if I can do everything with a single DB / language / ... I always tend to use a single thing.
Watch how far you generalize that view. I realize that "I can do everything with a single...language" can make "everything" mean anything. I had a coworker who helped a Middle Eastern country with security software. He went full JavaScript on it: Node, Mongo, etc. It worked. JS did everything. While he didn't go into details, afterwards he thought it was a bad idea. A great learning experience on how to define "everything" properly, as well as what is outside of it.
Yes, I totally agree that, as with most "rules on software", you need to be judicious enough to know when following the rule is not a good idea... However, I more often see the contrary: adding a multitude of systems together without very strong reasons.
Part of it is JSON as the storage format. Another part is its Node driver. The whole API fit it well. It understood async programming. It felt JS-like. The input and output were JSON instances. Finally, yea, from what I know of him "web scale" did play a part in the decision. Oh those heady days.
For data, that's mostly to do with the API provided by the particular MongoDB driver rather than MongoDB itself. Mongo stores and transmits BSON, not JSON. Most Mongo drivers expose an API that serialises your data to BSON for writes and wraps the BSON data with a JSON-like interface for reads.
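To show the difference, here is a tiny hand-rolled encoder in Python. It is a sketch that only covers flat documents with string and int32 values, and `encode_bson` is a hypothetical helper written for this comment; real drivers ship a full BSON implementation and you would use that instead.

```python
import json
import struct


def encode_bson(doc: dict) -> bytes:
    """Encode a flat dict of str/int32 values as a BSON document (sketch only)."""
    body = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"  # element names are C strings
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (including the NUL), then the bytes
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # type 0x10: little-endian int32
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported in this sketch: {type(value).__name__}")
    # The total length counts its own 4 bytes, the elements, and the trailing NUL.
    return struct.pack("<i", 4 + len(body) + 1) + body + b"\x00"


doc = {"hello": "world"}
as_json = json.dumps(doc)    # text: '{"hello": "world"}'
as_bson = encode_bson(doc)   # binary, length-prefixed, with a type byte per field
```

The JSON form is text you can read; the BSON form is bytes with explicit lengths and types, which is why the driver has to translate in both directions.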
That's a bit surprising and, I think, sad. One would expect GitHub to have at least reached out to inform and thank you, if not tried to support your project in some active way.
Whenever this comes up on HN the perspective quickly shifts to the developer's choice of license and to the point that there are no expectations. But let's shift the perspective to the other side. Surely startups and others using open source projects for commercial reasons, even if not obligated legally or expected to by the developers, have some ecosystem responsibility to try to contribute back in some meaningful way when they can.
Acquiring open source projects or hiring developers are 'influence plays' to gain control and should not be the only way for commercial projects to contribute.
I understand your POV, and I thank you for your comment, but mine is actually the opposite and I want to explain why. I consider Redis, even if the license is different, kind of like the old "Public Domain": you grab it and do whatever you want, without expecting much beyond what you can see of the project's direction and activity. However, GitHub was, I think, the very first big site using Redis and clearly stating it, when it was in beta, so they did a very bold thing and helped Redis a lot to grow up. GitHub's current CEO even wrote the first Redis-based queue system, which gave Redis a huge popularity boost. And they are still using Redis, even if no longer for durable data, so it's a 7-year symbiosis that continues going forward. Even if we never exchanged much info, I think it's fine; I actually think it's the hacker's way :-)
A lot of software helps host its own source code and development. Just considering GitHub,
1. MySQL is hosted on GitHub.
2. I think they use Elasticsearch, and it is hosted on GitHub.
and lots of others.
Software is pretty cool like that!
From the projects I've been involved in, the reason is simply that we don't want to have 2 persistent storage systems. There's a need for a fast cache system, and there's a need for a reliable – as in certainty above speed – database. The former is usually Redis, and the latter most often needs to be a full-blown SQL database to handle the required complexity of larger applications.
It's just easier to have one single source of truth. Please don't change Redis into a large SQL database. :)