the pulse

You’ve bought a bunch of servers and they’re all set up at a facility, but how are they doing? How do you know whether your application is slow because there are lots of users, because you’ve got a disk going bad, or because your indexes no longer fit into memory? How do you know when Apache has died?

The questions above come in two separate types. We want to:

    1. know when things die, break, or are close to breaking
    2. get the pulse of our systems to see trends, problems, bottlenecks

      First, we need a tool to notify a responsible party when a service dies, a disk fills up, a database stops accepting connections, a server’s load spikes, or a switch starts dropping packets. We’ve chosen Nagios to fill this void. Nagios lets us monitor all of the above on each of our servers through plugins, and its notifications have redundancy built in: alerts can be sent to a secondary party if the first neither fixes a problem nor acknowledges it within a certain amount of time. In addition to a highly flexible set of bundled plugins, it’s easy to add new plugins to monitor custom application services and verify that things are in working order.
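To give a feel for how small a custom plugin can be, here is a minimal sketch in Python. It relies only on the standard Nagios plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The function names and the 80%/90% thresholds are illustrative assumptions, not our actual configuration.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit_code, label). Thresholds are illustrative."""
    if used_pct >= crit:
        return CRITICAL, "CRITICAL"
    if used_pct >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is reported as UNKNOWN, not OK.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_pct = 100.0 * usage.used / usage.total
    code, label = classify(used_pct, warn, crit)
    return code, f"DISK {label} - {path} is {used_pct:.1f}% full"


def main():
    code, line = check_disk()
    print(line)  # Nagios shows the first line of output in the alert
    return code  # Nagios uses the exit code as the service state


# As a plugin script, finish with: sys.exit(main())
```

The same shape should work for an application-level check: replace `check_disk` with whatever probe makes sense (a test login, a queue depth) and keep the exit-code contract.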

      Second, we need a tool that lets us see trends in load, memory, disk space, network traffic, database queries, mail queues, and application metrics in graphical format. At the workshop in New York, I was turned on to Ganglia, which monitors and graphs metrics just like these using RRDTool (by the author of MRTG). Ganglia monitors clustered systems by using multicast to communicate among the servers. We now track trends across our web, database, mail, and supporting servers in a nice web interface. In addition, I threw together a PHP script to monitor MySQL metrics in a matter of a few hours.
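The MySQL script itself was PHP and is not reproduced here. As a rough illustration of the idea, the Python sketch below turns tab-separated `SHOW GLOBAL STATUS` output into calls to Ganglia's `gmetric` command-line tool. The choice of counters, the `mysql_` metric prefix, and the function names are all assumptions made for the example; the status text is assumed to come from something like `mysql --batch --skip-column-names -e "SHOW GLOBAL STATUS"`.

```python
import subprocess

# Hypothetical selection of MySQL status counters and the units to graph them in.
WANTED = {
    "Threads_connected": "threads",
    "Questions": "queries",
    "Slow_queries": "queries",
}


def parse_status(text):
    """Parse tab-separated `SHOW GLOBAL STATUS` output into {name: value}."""
    status = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:
            status[parts[0].strip()] = parts[1].strip()
    return status


def gmetric_commands(status, wanted=WANTED):
    """Build one gmetric command line per wanted counter that is present."""
    commands = []
    for name, units in wanted.items():
        value = status.get(name)
        if value is not None and value.isdigit():
            commands.append([
                "gmetric",
                "--name", f"mysql_{name.lower()}",
                "--value", value,
                "--type", "double",  # avoids overflow on large counters
                "--units", units,
            ])
    return commands


def publish(commands):
    """Run each command; requires gmetric on PATH and a running gmond."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Run from cron every minute or so, a script like this would make the counters show up alongside the built-in host metrics.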

      Not only can we get immediate notification (to a pager) when things break, but we can now also diagnose more abstract issues, like bottlenecks and failing hardware, before they become critical.

      [tags]monitoring, nagios, ganglia, linux[/tags]

      Lily’s mast cell tumor #2

      About two weeks ago Libby noticed a lump on Lillian’s posterior, and we took her in on Saturday, February 11th to have it examined. Our vet felt and then aspirated the lump and was disappointed to find some granules in the sample. Unfortunately, these are characteristic of a mast cell tumor. Lily had a mast cell tumor removed from the back of her neck a few years ago. We were able to schedule her for surgery on Monday, February 13th to have the tumor removed. The surgery was very successful, and the tumor was excised with an ample margin.

      Today, we heard the results of the pathology. They confirmed that it was indeed a grade 2 mast cell tumor. We’ll be taking her to an oncologist as soon as we can. From what we’ve read, there are a number of tests we can do to determine the extent to which the tumor has spread. The small size of the tumor, the successful surgery, and the long time between these incidents are all good signs. Obviously we’ve been worried about her since we heard this, but we can only hope for the best and do everything in our power to ensure her speedy recovery.

      one-step deployment

      One-step deployment has been a goal of ours for a long time. Our determination was renewed by the workshop last week. Cal shared with us a screenshot of the Flickr deployment interface, which involves two buttons on an HTML page that deploy the current codebase to staging and/or production.

      But, as I asked in my previous article, we still have a bit of distance to cover.

      1. Put more and more emphasis on the quality of code checked into the repository. We get lots of mileage out of code reviews and unit tests, but we need developers to seek out other developers, our support team, and the rest of the company to test features and/or changes that are particularly important (arguably this is easier if deployment is easier).
      2. Continue to make our development process more agile, so it can handle evolving requirements, needs identified as work progresses, and optimizations that customer use shows to be necessary.
      3. Finally, make deployment a one-step process. Make the web servers less tied to a given backend. Ensure the process works seamlessly, quickly, and atomically.
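One common way to get the "atomically" part of step 3 is to unpack each release into its own directory and then swap a symlink that the web server uses as its document root. This is a sketch of that general technique, not a description of our process or of Flickr's; the `activate_release` name and the `releases/` plus `current` layout are assumptions for illustration, and it targets POSIX systems only.

```python
import os


def activate_release(release_dir, current_link):
    """Atomically repoint `current_link` at `release_dir` (POSIX only)."""
    tmp_link = current_link + ".tmp"
    # Clear a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.abspath(release_dir), tmp_link)
    # Renaming over the old link happens in a single step, so the web server
    # sees either the old release or the new one, never a half-copied tree.
    os.replace(tmp_link, current_link)
```

A deploy button would then only need to export the codebase into a fresh directory such as `releases/2006-02-20` (a made-up name) and call `activate_release` on it. Rolling back is the same call pointed at the previous directory.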

      Database changes don’t fit well with one-step deployment. They still need to be made manually, but they are tracked in source control.

      Every day we move further in this direction.

      [tags]agile development, deployment, quality assurance[/tags]