Facebook: Moving Fast at Scale
“Robert Johnson talks about: the need to prepare for horizontal scalability, very short release cycles associated with a streamlined deploying process, and making the entire process faster every day.”
The neatest thing I got from this presentation was how agile Facebook are. They do daily releases. To be able to do this, they’ve put control of QA and deployment back in the hands of the engineers. The engineers are responsible for rolling their changes out slowly and keeping an eye out for post-release bugs.
There are still operations people to assist, but this avoids the common scenario where the operations team gets burnt by a few bad deploys and then refuses to roll out further updates, which in turn delays the release cycle, the feedback to developers, and so on.
The presenter claimed that once they got into the habit of this, the cycle reinforced itself. Instead of doing a release every 3-4 weeks, roughly 12 a year with 12 sets of lessons learnt, releasing daily (or near-daily) gives you far more lessons from each release, and the feedback can be folded straight back into improving the process.
Other neat things they talked about were how they scale – horizontally across their tiers (web server, index/memcache, persistence/MySQL). A recurring problem is that there’s no clean way to cluster Facebook’s data into any kind of domain-centric database or geographic partition, since FB users have friends all over the world, belong to different groups, and ‘like’ different things. The content on your own FB page is dynamic and pulled from sources all over the world.
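To give a feel for that tiered read path, here’s a minimal cache-aside sketch of my own – not Facebook’s actual code. The in-memory dict stands in for the memcache tier and fetch_profile_from_mysql is a made-up helper for the persistence tier:

```python
# Sketch of a cache-aside read across the tiers described above.
# NOTE: my own illustration, not Facebook code. The dict stands in for
# memcache, and fetch_profile_from_mysql is a hypothetical helper for
# the MySQL/persistence tier.

cache = {}  # stand-in for the index/memcache tier


def fetch_profile_from_mysql(user_id):
    # Hypothetical persistence-tier lookup; real code would query MySQL.
    return {"user_id": user_id, "name": "example"}


def get_profile(user_id):
    key = f"profile:{user_id}"
    profile = cache.get(key)        # 1. try the cache tier first
    if profile is None:
        profile = fetch_profile_from_mysql(user_id)  # 2. miss: hit MySQL
        cache[key] = profile        # 3. populate the cache for next time
    return profile


print(get_profile(42))
```

The point of the pattern is that the web tier only ever talks to the cache and falls through to the database on a miss, which is what lets the read load scale horizontally.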
He also talked about deployment challenges, in particular getting different versions of PHP and JavaScript to interact when different versions of FB code were in the wild on different web servers. They handle this with a library they wrote called GateKeeper, which tells the FB code which version it is running against and essentially which code path to follow. The code looks a bit ugly at first glance (if this version, do this; if that version, do that), but it puts the problem of inconsistent behaviour, caused by different data structures and code for different versions of the software being in the wild, back into the hands of the developers to manage. The code doesn’t fall over when different versions interact; it knows what to do, which means they can worry about more pressing things.
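For a rough idea of what that version-aware branching might look like, here’s a sketch in Python rather than Facebook’s PHP. The gatekeeper_allows function and the “new_payload_format” gate are invented for illustration and are not GateKeeper’s real API:

```python
# Sketch of version/feature gating in the style described above.
# NOTE: illustration only. gatekeeper_allows and "new_payload_format"
# are made up; the real GateKeeper library is PHP and not shown here.

DEPLOYED_VERSION = "2010-06-18"  # version this web server is running


def gatekeeper_allows(gate_name, version):
    # Hypothetical check: decide whether this code version should take
    # the new path. ISO date strings compare correctly as strings.
    enabled_since = {"new_payload_format": "2010-06-15"}
    return version >= enabled_since.get(gate_name, "9999-99-99")


def build_response(data):
    if gatekeeper_allows("new_payload_format", DEPLOYED_VERSION):
        # New code path: newer code in the wild expects this structure.
        return {"items": data, "format": 2}
    else:
        # Old code path kept around so older versions still interoperate.
        return {"list": data}


print(build_response(["a", "b"]))
```

The branching looks ugly, as noted above, but it makes the compatibility decision explicit in the code instead of leaving mismatched versions to break each other.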
Not too technical, but a great mix of software practice for distributed systems, and something that will hopefully get discussed more and more (and maybe even taught to undergraduates some day).