We’ve been running a fairly high-traffic site on one big server for more than a year and a half. The site was originally PHP, but about four months ago we migrated it to Rails. The migration has been relatively smooth, and we are now serving more than nine sites (they differ mostly in branding, with some content differences) from one server.
The server has the whole stack: Apache, Passenger, MySQL. Once we migrated to Rails, however, we simply had to start thinking about performance in a way we hadn’t before. I’m not going to argue that this is a reason not to use Rails. I believe that if we had moved to Symfony or another PHP framework we would have been dealing with the same issues. But there are more moving parts in the migrated application, and the complexity of deployment means we spend more time thinking about the server than we would like.
When the server was purchased, we asked for a 2-Proc RAID 5 box with 4 GB of memory. The client gets a lot of traffic and is sensitive to the perceived overhead of managing multiple boxes versus ‘one big box’. That’s fine with me: we set up one box and let it roll. However, we didn’t exactly request a 4 CPU box; we actually started with a 2 CPU box with 2 GB of memory and then decided to upgrade to bigger hardware after only a few weeks with the smaller box.
No problem – Rackspace took the machine down, swapped the motherboard, added the memory and we were good.
Fast forward to the Rails upgrade 1.5 years later. The box’s performance had generally been fine, but in the back of my mind something wasn’t quite right about the actual server. Should I have benchmarked it and resolved this nagging feeling? Probably, but I’m not as familiar with Red Hat as with Ubuntu/Debian, so I blamed part of my feeling that it wasn’t performing as fast as it should on something related to RHEL4. But it’s a 4 CPU box and it has lots of memory, so Rails should be fine with Apache and Passenger.
But it wasn’t.
We had some long-running requests, and Ruby seemed to get backed up behind something. No problem, let’s make sure we’re doing everything we can to enable page caching. Page-level caching pretty much solved the performance issues. But deployments using Capistrano clear the cache, and rebuilding up to 5,000 pages in the filesystem wasn’t fast, so deployments at particular times were out. Performance was good but not great, and not enough of a problem to make anyone question the configuration of the server. Still, that “PHP Feeling” of being able to deploy a change and have it instantly available just wasn’t there. This wouldn’t have been a problem if the team weren’t used to the flexibility of making changes at any moment, and I wasn’t happy about having to make excuses for moving to Rails and ending up with a less flexible operation.
Our concerns, and our experiences with other deployments, led us to think that Apache/Passenger was the source of some of the slowness, so we decided to move to Unicorn and Nginx. The conversion from Apache to Nginx freed up a lot of memory, and Unicorn’s graceful restart capability smoothed out performance during deployments. We’ll cover this conversion in detail in part two. It was a total success: overall throughput went up by a factor of 3, and deployments are now much smoother.
However, were things as fast as they could be?
While investigating an unrelated issue, we followed up with Rackspace on a kernel patch that couldn’t be applied to our server. One of the technicians immediately realized why: we were not running the SMP kernel, so only one of the four CPUs was in use. For almost two years our 4 CPU racehorse had been hopping around the track on one leg. Let’s look at the numbers. At 9 am this morning, we switched to the proper kernel:

Every request is now almost twice as fast. ApacheBench says our throughput is 3.5 times greater. That stings a little. It feels a bit like I just piloted the Shenandoah 22,000 miles around the globe, winning every battle, only to find out that the team lost the war a few months before…
What should we have done differently? None of us are truly full-time ‘operations guys’, but we’re all really good developers and ‘good enough’ at operations. We have no problem configuring Linux init files, syslogs, virtual ethernet interfaces, iptables, benchmarking tools, netcat, you name it. We’re pretty confident with the toolset. So why did we never cat /proc/cpuinfo? And why did it take Rackspace 22 months to figure out that we were running a misconfigured server? Why did I have a spidey-sense that something was wrong with this server, and why didn’t I track it down as thoroughly as I could have? Why did I focus on the software rather than the hardware? Then again, shouldn’t I be able to trust that Rackspace has that covered? What is the proper remedy for this failure on Rackspace’s part? (They refunded the difference in price between the requested and ‘actual’ hardware without hesitation.) Would we have made the same software changes if we had been getting ‘good enough’ performance from an actual 4 CPU system? Would we just be happily running on Apache/Passenger?
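For what it’s worth, the check we never ran is tiny. Here is a minimal sketch in Python of what “cat /proc/cpuinfo and count” amounts to. It assumes an x86-style /proc/cpuinfo with one “processor” line per CPU the running kernel can see; the function names and the expected count of 4 (taken from what we thought we had) are made up for illustration, not part of any real tool.

```python
def count_logical_cpus(cpuinfo_text):
    """Count the 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        # Only lines whose key is exactly 'processor' mark a CPU entry.
        if sep and key.strip() == "processor":
            count += 1
    return count


def check_cpu_count(expected, cpuinfo_text):
    """Return (cpus_seen, ok); ok is True when the kernel sees at least `expected` CPUs."""
    seen = count_logical_cpus(cpuinfo_text)
    return seen, seen >= expected


if __name__ == "__main__":
    EXPECTED_CPUS = 4  # assumption: what the hardware order said we were paying for
    with open("/proc/cpuinfo") as f:
        seen, ok = check_cpu_count(EXPECTED_CPUS, f.read())
    status = "OK" if ok else "MISMATCH"
    print(f"{status}: kernel sees {seen} CPU(s), expected {EXPECTED_CPUS}")
```

Something like this could run once after any hardware change or kernel swap. On a box like ours, booted on a non-SMP kernel, it would presumably have reported a single CPU on day one.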
Answers to some of those questions in part two…