cat /proc/cpuinfo or Don’t Trust Your Cores To Rackspace, Part I

Posted by chad on October 27, 2009

We’ve been running a pretty high-traffic site on one big server for more than a year and a half. The site was originally PHP, but about four months ago we migrated it to Rails. The Rails migration has been relatively smooth, and we are now serving more than 9 sites (mostly branding and some content differences between them) from one server.

The server has the whole stack – Apache, Passenger, MySQL. However, once we migrated to Rails we simply had to start thinking about performance in a way we hadn’t before. I’m not going to argue that this is a reason not to use Rails – I believe that if we had moved to Symfony or another PHP framework we would have been dealing with the same issues – but there are more moving parts in the migrated application, and the complexity of deployment means more time is spent thinking about the server than we would want.

When the server was purchased, we asked for a 2-Proc RAID 5 box with 4 GB of memory. The client gets a lot of traffic and is sensitive to the perceived overhead of managing multiple boxes vs ‘one big box’. That’s fine with me: we set up one box and let it roll. However, we didn’t exactly request a 4 CPU box at the outset – we actually started with a 2 CPU box with 2 GB of memory and then decided to upgrade to bigger hardware after only a few weeks with the smaller box.

No problem – Rackspace took the machine down, swapped the motherboard, added the memory and we were good.

Fast forward to the Rails upgrade 1.5 years later. The box’s performance has generally been fine, but in the back of my mind something wasn’t quite right about the actual server. Should I have benchmarked and resolved this nagging feeling? Probably, but I’m not as familiar with Red Hat as with Ubuntu/Debian, so I blamed part of my feeling that it wasn’t performing as fast as it should on something related to RHEL4. But it’s a 4 CPU box and it has lots of memory; Rails should be fine with Apache and Passenger.

But it wasn’t.

We had some long-running requests – Ruby seemed to get backed up behind – something? No problem, let’s make sure we’re doing everything we can to enable page caching – OK, page-level caching pretty much solved any performance issues. But still – deployments using Capistrano clear the cache, and the process of rebuilding up to 5,000 pages in the filesystem wasn’t fast. Deployments at particular times are out. Performance is good but not great, yet not enough of a problem to cause anyone to question the configuration of the server. But that “PHP Feeling” of being able to deploy a change and have it instantly available just wasn’t there. This wouldn’t be a problem if the team weren’t used to the flexibility of making changes at any moment, and I wasn’t happy with having to make excuses for moving to Rails and ending up with a less flexible operation.

Our concerns, and experiences with other deployments, led us to think that Apache/Passenger was the source of some of the slowness, so we decided to move to Unicorn and nginx. The conversion from Apache to nginx freed up a lot of memory, and Unicorn’s graceful restart capability smoothed out performance during deployments. In part 2 we’ll cover this conversion in detail. It was a total success – overall throughput went up by a factor of 3, and deployments are now much smoother.

However, were things as fast as they could be?

While investigating an unrelated issue, we followed up with Rackspace on a kernel patch that couldn’t be applied to our server. One of the technicians immediately realized why – we were not running the SMP kernel, so the operating system was only using one of the CPUs. For almost two years our 4 CPU racehorse has been hopping around the track on one leg. Let’s look at the numbers: at 9 am this morning, we switched to the proper kernel:
[Graph: 1 CPU vs 4 CPUs]

Every request is now almost twice as fast. Apache Bench says our throughput is 3.5 times greater. That stings a little. It feels a bit like I just piloted the Shenandoah 22,000 miles around the globe, winning every battle, only to find out that the team lost the war a few months before…
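A before/after comparison like the Apache Bench one is easy to script. The following is only a sketch in Python: it pulls the “Requests per second” line out of two saved ab reports and prints the ratio. The function names are hypothetical, and the two report excerpts are made-up sample values for illustration, not our measurements.

```python
import re


def requests_per_second(ab_report):
    """Extract the mean requests/second figure from Apache Bench output."""
    # ab prints a line such as: "Requests per second:    50.00 [#/sec] (mean)"
    match = re.search(r"^Requests per second:\s+([\d.]+)", ab_report,
                      flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Requests per second' line found")
    return float(match.group(1))


def speedup(before_report, after_report):
    """Ratio of throughput after a change to throughput before it."""
    return requests_per_second(after_report) / requests_per_second(before_report)


# Made-up excerpts for illustration only (NOT the numbers from our server).
BEFORE = "Requests per second:    12.50 [#/sec] (mean)\n"
AFTER = "Requests per second:    50.00 [#/sec] (mean)\n"

print("throughput ratio: %.1fx" % speedup(BEFORE, AFTER))
```

Saving the full ab output from each run and comparing the files afterwards keeps the raw reports around, which helps when a number looks too good (or too bad) to be true.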

What should we have done differently? None of us are truly full-time ‘operations guys’ – but we’re all really good developers and ‘good enough’ at operations. We have no problem with configuring Linux init files, syslogs, virtual ethernet interfaces, iptables, benchmarking tools, netcat, you name it – we’re pretty confident with the toolset. So why did we never cat /proc/cpuinfo? And why did it take Rackspace 22 months to figure out that we were running a misconfigured server? Why did I have a spidey-sense (or is that spidy (spidie?) -sense?) that something was wrong with this server? And why didn’t I track it down as thoroughly as I could have? Why did I focus on the software rather than the hardware? But shouldn’t I be able to trust that Rackspace has that covered? What is the proper remedy for this failure from Rackspace? (They refunded the difference in price between the requested and ‘actual’ hardware without hesitation.) Would we have made the same software changes if we were getting ‘good enough’ performance from an actual 4 CPU system? Would we just be happily running on Apache/Passenger?
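The check we skipped is a tiny one. As a minimal sketch (not something we ran at the time), assuming a Linux box whose /proc/cpuinfo uses the usual x86 layout with one “processor : N” stanza per logical CPU, a few lines of Python can compare what the running kernel sees with what was ordered. The function names and the EXPECTED_CPUS value are made up for illustration.

```python
import os
import re


def count_processors(cpuinfo_text):
    """Count the logical CPUs listed in /proc/cpuinfo-style text."""
    # Assumption: each CPU the running kernel can use has its own
    # "processor : N" line (the x86 layout). A non-SMP kernel lists only one.
    return len(re.findall(r"^processor\s*:\s*\d+", cpuinfo_text,
                          flags=re.MULTILINE))


def check_cpus(expected, cpuinfo_text):
    """Return (ok, seen): whether the kernel sees at least `expected` CPUs."""
    seen = count_processors(cpuinfo_text)
    return seen >= expected, seen


if __name__ == "__main__":
    EXPECTED_CPUS = 4  # hypothetical: whatever the hardware order says
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as f:
            ok, seen = check_cpus(EXPECTED_CPUS, f.read())
        status = "OK" if ok else "MISMATCH"
        print("kernel sees %d CPU(s), expected %d: %s"
              % (seen, EXPECTED_CPUS, status))
```

Run once when a box is delivered and again after any hardware or kernel change, a mismatch like ours would show up in seconds rather than 22 months.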

Answers to some of those questions in part two…

Comments

  1. Eric Waller Tue, 27 Oct 2009 19:06:48 UTC

    Somewhat of a tangent, but related to your apache+passenger to nginx+unicorn switch:

    I noticed a similar gain in throughput (10-12 req/s to 25-30 req/s) for action-cached pages (in Merb) after switching out nginx+passenger for nginx+thin. This is on a 256 MB slice, and seems totally counter to the general opinion that Passenger is great for VPSes (or for anything, if these performance numbers are representative).

  2. Lamnk Tue, 27 Oct 2009 19:32:01 UTC

    Sorry, but it looks like it’s your fault. The first thing I do on a new server is check the specs and run some burn-in tests.

  3. Brian Armstrong Wed, 28 Oct 2009 06:17:12 UTC

    Interesting writeup Chad. I use passenger/nginx and hadn’t heard of unicorn. Will have to check it out.

    It sounds like more and more apps are moving to the one-big-machine solution instead of sharding or something. I read recently that the entire Basecamp database is served on one big machine, 8 cores, 128 GB RAM. I think with InnoDB you can keep almost the entire database in memory. Or as they put it, they keep getting saved from sharding by Moore’s law.

    It’s also definitely true that the moment you start dreading a deploy, your code suffers. It’s worth trying to make deploys fast and painless.

    @Eric Waller – I agree Passenger doesn’t make a whole lot of sense on a 256 MB slice. It really shines if you have many instances running because of the shared memory. But if you only have 1 or 2 instances it’s just extra overhead. Conservative spawning mitigates some of this.

  4. Segedunum Wed, 27 Jan 2010 08:01:49 UTC

    Conservative spawning just gives you the same issues as using Mongrels, so unless your application isn’t compatible with Passenger’s smart spawning you don’t really want to be using it. Few applications running under Mongrel will have that problem.

    Passenger makes more sense on shared hosts with shared applications, or when you’re not quite sure how many application processes you’ll need at any given time.
