Wow I'm learning today that Github used Redis for persistent data, now that they...

rthrfrd · on Jan 10, 2017

FWIW we went through a very similar process to that documented here by Github (~3 months ago). It was entirely due to operational reasons and nothing to do with shortcomings in Redis itself. MySQL was the master record for 99% of our data while Redis was the master record for the other 1% (as it happens it was also a kind of activity stream). Having a single 'master' reference for our data reduced complexity to a degree that it was worth running a less computationally efficient setup. We also have nowhere near Github's volume so we did not have to do such significant re-architecting to make unification possible.

Now we still use Redis for reading the activity streams and as an LRU cache for all sorts of data, but it is populated, like all of our specialised slave-read systems (elasticsearch, etc), by replicating from the MySQL log.

Hope that helps!

antirez · on Jan 10, 2017

Yes, this helps and totally makes sense to me. Thanks. I would do the same... In this case, however, it looks like there were certain high-volume writes that could be handled more simply with Redis. Still, it is entirely possible that while this looks like an important use case, it accounted for a small percentage of all the data, so we are back to consolidation: moving everything to a single system is in general a good idea.

sjeanpierre · on Jan 11, 2017

What method are you using to replicate from MySQL binlog to various other systems?

karmakaze · on Jan 11, 2017

FWIW, I've used github.com/siddontang/go-mysql to successfully replicate from MySQL to DynamoDB. Currently not using GTIDs and looking into that next.

parthdesai · on Jan 10, 2017

just asking for some info,

but how do you make sure that your multiple db systems are in sync (specifically interested in MySQL and elasticsearch)?

Hope it's alright to ask you that.

rthrfrd · on Jan 10, 2017

In the case of ES the short answer is: we don't. We have fault tolerance in our replication system to guarantee eventual consistency instead. I would say using ES as a consistent source of data isn't really playing to its strengths so we don't use it that way. The consistency you want is determined at read time: if you need consistency then hit MySQL, but for our use case that almost never happens as eventual consistency is usually instantaneous enough.

Our other tool is to decouple lookup (which objects to fetch) and population (what data to return for each object). You can mix and match, e.g. do a lookup against an inconsistent ES but still get consistent objects by populating from MySQL (or vice versa). As others have alluded to it depends entirely on the requirements for the result set.
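As a rough illustration of that lookup/population split, here is a minimal Python sketch. The two dictionaries are hypothetical in-memory stand-ins for a lagging search index and the primary database; none of the names come from the poster's actual system.

```python
# Hypothetical stand-ins for the two systems; names and data are illustrative only.
search_index = {"redis": [1, 2, 3]}  # term -> object ids; may lag behind the primary
primary_store = {
    1: {"id": 1, "title": "Moving persistent data out of Redis"},
    2: {"id": 2, "title": "Redis as an LRU cache"},
    # id 3 was deleted from the primary, but the index has not caught up yet
}


def lookup(term):
    """Lookup step: decide WHICH objects to fetch, using the (possibly stale) index."""
    return list(search_index.get(term, []))


def populate(ids):
    """Population step: fetch WHAT to return for each id from the primary store.

    Ids the primary no longer knows about are skipped, so stale index
    entries never surface as results.
    """
    return [primary_store[i] for i in ids if i in primary_store]


def search(term):
    """Inconsistent lookup combined with consistent population."""
    return populate(lookup(term))
```

Here `search("redis")` returns only objects 1 and 2: the index still lists id 3, but population from the primary store drops it.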

jeffasinger · on Jan 10, 2017

Where I work we use several different MySQL replicas in production, where we don't expect them to be in sync.

So long as the source of truth (Master MySQL node) is up to date, it's okay.

For example, if we show a user how much money is in their account on every page, we can run that query on a replica, since it's fine if this is a few seconds delayed. However, immediately after an action changed their balance, on a confirmation screen, we'd want to show the value from Master.
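A minimal Python sketch of that read-routing idea, assuming two plain dictionaries as stand-ins for the master and a lagging replica. The class name, method names, and the `require_fresh` flag are all hypothetical, not taken from any real setup.

```python
class BalanceStore:
    """Toy model of one master and one asynchronously updated replica."""

    def __init__(self):
        self.master = {}   # source of truth
        self.replica = {}  # may be a few seconds behind

    def write(self, user, balance):
        # Writes always go to the master only.
        self.master[user] = balance

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        self.replica = dict(self.master)

    def read(self, user, require_fresh=False):
        # Ordinary page views tolerate lag; a confirmation screen does not.
        source = self.master if require_fresh else self.replica
        return source.get(user)
```

A page header would call `read(user)`, while the confirmation screen shown right after a balance change would call `read(user, require_fresh=True)`.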

It's entirely possible that any place elasticsearch is being used just doesn't need consistency.

bpicolo · on Jan 11, 2017

There are actually a few strong solutions out there for MySQL, most starting with change data capture like: https://github.com/shyiko/mysql-binlog-connector-java (I link that one in particular because he links to alternatives right in his readme!)

Pgsql is a bit harder, but if I needed to start somewhere it would be with:

https://github.com/debezium/debezium

or https://github.com/confluentinc/bottledwater-pg

These are the start of pretty sophisticated solutions where you need super real-time elasticsearch indexes and can bring up infra like Kafka.

For many applications, queueing an update whenever your ORM writes a record, combined with an hourly/daily refresh, is pretty satisfactory.
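For what it's worth, the queue-plus-refresh approach could look roughly like this in Python. Everything here is a hypothetical in-memory sketch: `save` stands in for an ORM save hook, and the dictionaries stand in for the database and the search index.

```python
from collections import deque

database = {}      # stand-in for the primary database
search_index = {}  # stand-in for the search index
pending = deque()  # ids waiting to be reindexed


def save(obj_id, row):
    """Stand-in for an ORM save hook: write the row, then queue a reindex."""
    database[obj_id] = row
    pending.append(obj_id)


def drain_queue():
    """Worker step: reindex every queued id from the current database state."""
    while pending:
        obj_id = pending.popleft()
        if obj_id in database:
            search_index[obj_id] = database[obj_id]
        else:
            search_index.pop(obj_id, None)


def full_refresh():
    """Hourly/daily job: rebuild the index so missed updates are repaired."""
    search_index.clear()
    search_index.update(database)
```

The periodic `full_refresh` is what catches writes that bypass the hook, such as a manual fix applied directly to the database.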

sacheendra · on Jan 10, 2017

If you need any kind of consistency guarantee, I think you would need to use some kind of distributed transactions.

If you don't, you could tail the MySQL log and have a process making the same changes to elasticsearch. Elasticsearch may lag behind if there are problems.
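A minimal Python sketch of the log-tailing idea, assuming the change log is a plain list of events and the index is a dictionary. The event shape (`op`, `id`, `row`) is made up for illustration and is not the real MySQL binlog format.

```python
def apply_events(log, index, position):
    """Apply log[position:] to the index and return the new position.

    Upserts and deletes are idempotent, so replaying events after a crash
    leaves the index in the same state. Persisting the returned position
    lets the process resume where it stopped, which is where the lag (and
    the eventual consistency) comes from.
    """
    for event in log[position:]:
        if event["op"] == "delete":
            index.pop(event["id"], None)
        else:  # "insert" and "update" are both treated as an upsert
            index[event["id"]] = event["row"]
        position += 1
    return position
```

Calling `apply_events` again with the saved position picks up only the events appended since the last run.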

gingerlime · on Jan 11, 2017

I'm facing a similar challenge, although at a much (MUCH!) smaller scale.

We have nearly everything in Postgres, and redis serves both as a caching layer (non-persistent) and as storage for rails sessions and Sidekiq (persistent).

Having one source of truth can make things like failover much easier. I can handle PG failover, and also redis, but I'd rather not have to deal with both. Especially if you consider the potential of things going slightly out-of-sync (think a job in sidekiq that relies on an id in PG, one of which loses a few microseconds of data during replication etc, just speculating a scenario here)

Did anybody face similar challenges and care to share their thoughts?

clarkenheim · on Jan 10, 2017

Reading more into this point in the post than I maybe should, but "Take advantage of our expertise operating MySQL." sounds like they have more engineers familiar and comfortable working with MySQL than with Redis.

antirez · on Jan 10, 2017

That's a valid reason indeed. Also technological consolidation, that is, if I can do everything with a single DB / language / ... I always tend to use a single thing.

virmundi · on Jan 10, 2017

Watch how far you generalize that view. I realize that "I can do everything with a single...language" can make "everything" mean anything. I had a coworker that helped a middle eastern country with security software. He went full Javascript on it: Node, Mongo, etc. It worked. JS did everything. While he didn't go into details, afterwards he thought it was a bad idea. A great learning experience on when to define "everything" properly as well as what is outside of it.

antirez · on Jan 10, 2017

Yes, I totally agree that, like most "rules on software", there is the need to be judicious enough to know when following the rule is not a good idea... However, I more often see the contrary: adding a multitude of systems together without very strong reasons.

ohstopitu · on Jan 10, 2017

just wondering why Mongo is considered a part of "JS". Is it because of the MEAN stack?

virmundi · on Jan 10, 2017

Part of it is JSON as the storage format. Another part is its Node driver. The whole API fit it well. It understood async programming. It felt JS-like. The input and output were JSON instances. Finally, yea, from what I know of him "web scale" did play a part in the decision. Oh those heady days.

secoif · on Jan 11, 2017

That's mostly to do with the API provided by the particular mongodb driver, rather than mongodb itself. Mongo stores and transmits BSON, not JSON. Most mongo drivers expose an API that serialises your data to BSON for writes and wraps the BSON data with a JSON-like interface for reads.

sandGorgon · on Jan 11, 2017

postgresql has the jsonb column type which is just as powerful as mongo

nine_k · on Jan 10, 2017

Possibly because of JSON as the storage format.

koolba · on Jan 10, 2017

Because of JSON and "web scale".

throw2016 · on Jan 10, 2017

That's a bit surprising and I think sad. One would expect Github would have at least reached out to inform and thank you, if not tried to support your project in some active way.

Whenever this comes up on HN, the perspective quickly shifts to the developer's choice of license and the fact that there are no expectations. But let's shift the perspective to the other side. Surely startups and others using open source projects for commercial reasons, even if not legally obligated or expected to by the developers, have some ecosystem responsibility to try to contribute back in some meaningful way when they can.

Acquiring open source projects or hiring developers are 'influence plays' to gain control and should not be the only way for commercial projects to contribute.

antirez · on Jan 10, 2017

I understand your POV, and I thank you for your comment, but mine is actually the opposite and I want to explain why. I consider Redis, even if the license is different, kind of like the old "Public Domain": you grab it and do whatever you want, without expecting much beyond what you see of the project's direction and activity. However, Github was, I think, the very first big site using Redis and clearly stating it, back when it was in beta, so they did a very bold thing and helped Redis grow a lot. Github's current CEO even wrote the first Redis-based queue system, which gave Redis a huge popularity boost. And they are still using Redis, even if no longer for durable data, so it's a 7-year symbiosis that is still going. Even if we never exchanged much info, I think it's fine; I actually think it's the hacker's way :-)

why-el · on Jan 10, 2017

I think this is key:

"We needed something that would work for both github.com and GitHub Enterprise, so we decided to lean on our operational experience with MySQL."

artursapek · on Jan 10, 2017

Pretty cool that redis was helping to host its own source code & development.

sacheendra · on Jan 10, 2017

A lot of software helps host its own source code and development. Just considering Github: 1. MySQL is hosted on Github. 2. I think they use elasticsearch, and it is hosted on Github. And lots of others. Software is pretty cool like that!

vacri · on Jan 11, 2017

I'm pretty sure that git's source is also version-controlled in git :)

liveoneggs · on Jan 10, 2017

You might appreciate http://fossil-scm.org/

dvirsky · on Jan 10, 2017

from the post sounds like it still is.

wvh · on Jan 11, 2017

From the projects I've been involved in, the reason is simply that we don't want to have 2 persistent storage systems. There's a need for a fast cache system, and there's a need for a reliable – as in certainty above speed – database. The former is usually Redis, and the latter most often needs to be a full-blown SQL database to handle the required complexity of larger applications.

It's just easier to have one single source of truth. Please don't change Redis into a large SQL database. :)

antirez · on Jan 11, 2017

Thanks! No plans to change it into an SQL database :-) Actually the idea is to focus more on the caching/streaming area.