May 16, 2014 · Joe Uhl · Ops

Why Colocation?

We've mentioned it before, but we use dedicated hardware to drive most of MailChimp and the surrounding services. I often run into people who are shocked to hear we're building out our own environments instead of using "the cloud." Often, these same people are using single instances or communicating their request volume in minutes, hours, or days—instead of seconds.

There are many reasons for our approach, and the cloud isn't automatically the best answer for every company.

Why use the cloud?

By "cloud," I mean "managed hosting," whether that's virtual machines, physical machines, or a mix of the two. The various providers add value through their software and services, and take advantage of their scale for spreading the cost of infrastructure, networking, and skills to keep everything running.

For all providers, at any scale, you pay a premium for that added value, convenience, and incremental cost control. That premium is well spent for new companies, new products, or products with highly variable load. If you're a startup, it's a great fit.

The feasibility of taking this approach further has changed dramatically over the last few years as well. IO used to be laughably, inexcusably terrible in fully virtualized environments, but great improvements have come through SSDs and higher-speed connectivity. Virtual machine IO has gotten fast, and exotic bare-metal configs with SSDs and 10Gbps links are available from some providers if you really need big speed.
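
If you want to sanity-check a provider's IO story yourself, a quick random-read benchmark tells you more than a spec sheet. Here's a minimal Python sketch, assuming a Linux host; the scratch file path and sizes are arbitrary placeholders, and a purpose-built tool like fio will give you far more rigorous numbers.

    import mmap
    import os
    import random
    import time

    PATH = "/tmp/io_test.bin"        # hypothetical scratch file on the disk under test
    BLOCK = 4096                     # 4 KiB reads, the classic random-IO unit
    FILE_SIZE = 256 * 1024 * 1024    # 256 MiB test file

    # Lay down a test file full of random data.
    with open(PATH, "wb") as f:
        f.write(os.urandom(FILE_SIZE))

    # O_DIRECT bypasses the page cache so we measure the disk, not RAM.
    fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)       # page-aligned buffer, required by O_DIRECT

    reads = 2000
    start = time.monotonic()
    for _ in range(reads):
        offset = random.randrange(FILE_SIZE // BLOCK) * BLOCK
        os.preadv(fd, [buf], offset)
    elapsed = time.monotonic() - start
    os.close(fd)

    print(f"{reads / elapsed:.0f} random 4KiB reads/sec")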

Why not use the cloud?

It turns out, the cloud is just a bunch of servers in a datacenter somewhere. You have no visibility or control over those servers. Outages happen to everyone, us included, and at some point, it's helpful to have more visibility into their causes or to limit the layers and systems that can potentially contribute to them.

The value providers build on top comes from incredibly complex layers of software that have to accommodate a messy, multi-tenant environment. I've experienced tremendous downtime due to outages and upgrades necessary to handle neighboring companies' workloads and behavior. In my experience, no matter which provider you select, the core strength of their machines and networking becomes your base building block as you throw out more and more services that can no longer scale or carry unworkable restrictions. In the end, you're left with machines and networking, both of which can be done better, and in a more focused manner, in-house.

For MailChimp, it primarily boils down to the following necessities:

Control: I want to know the person who's grabbing a server when it's broken. I don't want to work with a tech team comprising people of varying competence, one where context and progress are reset with every shift change. I want our network to support MailChimp and only MailChimp, and for us to have deep insight into exactly what is happening in it. No matter what is being fixed, I want the people fixing it to be MailChimp employees who are proud of their work and take personally any impact to our users or other engineering teams.

Options: We have dozens of MTAs that can individually push 10Gbps out to the internet. Any cloud/hosting provider struggles with this; in our experience, even those that will sell you such a connection struggle to make it work.

Sustainability: In my adventures with managed cloud/hosting providers, I have run into many bad decisions and shift-focused, indifferent attitudes, including:

  • RAID cards that were never tested and can't hot swap
  • Consumer-grade parts presented as enterprise appropriate
  • Poorly built servers with loose parts and missing components
  • "Redundant" bonds where only one port has a physical cable in it
  • Dishonesty around upstream network capacity
  • No-shows for scheduled maintenance
  • Dropped tickets
  • The wrong machines powered down, sometimes because customers were mixed up
  • The wrong disks hot swapped, resulting in lost arrays
  • The wrong PSUs pulled during hot swaps
  • Incorrect RAID levels
  • Vanishing virtual machines
  • Major mainstream CDN providers using tcp_tw_recycle on their edge (more on this below)
  • Portal compromises
  • Dishonesty around disk wipes
  • Short (or no) notice on hardware retirement and decommission
  • Random deletion of VLANs

I could go on. This stuff adds up over time. It slows everything down.
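
The tcp_tw_recycle item is worth a quick note. That sysctl reuses TIME_WAIT sockets based on per-host TCP timestamps, which silently drops connections from clients behind NAT, exactly the wrong behavior for a CDN edge; the knob was eventually removed from the kernel in Linux 4.12. A minimal sketch for checking a Linux host:

    # Check whether a Linux host has tcp_tw_recycle enabled. The knob
    # breaks clients behind NAT (their TCP timestamps look non-monotonic
    # to the server) and was removed entirely in Linux 4.12.
    from pathlib import Path

    knob = Path("/proc/sys/net/ipv4/tcp_tw_recycle")
    if knob.exists():
        enabled = knob.read_text().strip() == "1"
        print("tcp_tw_recycle is", "ON (dangerous behind NAT)" if enabled else "off")
    else:
        print("no tcp_tw_recycle knob (removed in Linux 4.12)")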

Reputation: This is extremely important in our industry, and we take it very seriously. We use our own IPs so we have complete control over their use and reputation. Few managed hosting providers will route customer-owned IP space.

Money: The cloud is more expensive once your load justifies the investment in space, gear, and people, even when you factor in aggressively negotiated, up-front pricing for long-term contracts.
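
To put rough shape on that claim, here's a deliberately simplified break-even sketch. Every number in it is a hypothetical placeholder, not our actual cost data; the point is the structure of the comparison: amortized capex plus colo and staffing overhead versus a per-server rental price.

    # Hypothetical break-even sketch: at what fleet size does owning
    # hardware beat renting it? All figures are illustrative placeholders.
    SERVER_CAPEX = 8_000                    # purchase price per box, USD
    AMORTIZED_MONTHLY = SERVER_CAPEX / 36   # amortized over 3 years
    COLO_PER_SERVER = 150                   # space, power, network per box per month
    OPS_OVERHEAD = 40_000                   # monthly staffing cost spread over the fleet
    CLOUD_PER_SERVER = 1_200                # comparable dedicated instance, discounted

    def colo_cheaper(server_count: int) -> bool:
        colo = server_count * (AMORTIZED_MONTHLY + COLO_PER_SERVER) + OPS_OVERHEAD
        cloud = server_count * CLOUD_PER_SERVER
        return colo < cloud

    # Find the fleet size where colocation starts winning.
    n = next(n for n in range(1, 10_000) if colo_cheaper(n))
    print(f"Colo wins at about {n} servers (under these assumptions)")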

As your machine count grows, the effort required to fight through that lack of company-focused rigor slows you down from the bottom up. Our team does things well and does not cut corners. We make mistakes and we have downtime (everyone does), but it's our environment and only ours. There are no sacrifices or compromises we have to make with others in mind. Our business is not a startup and is not going anywhere, so cruising along and hoping to bail before we hit a brick wall, or to exit first, is not an option. The other factors matter and the savings are real, but sustainability and control are the big ones.

Why colocation fits MailChimp

We're growing fast and have a good handle on how to scale our infrastructure. By monitoring everything very closely, including user counts, shard sizes, and capacity, we can plan and provision machines ahead of our growth curve with accuracy. We get new boxes into service as they're needed, timed so they won't sit idle. With 10,000 new accounts coming in every day, and all of our existing accounts growing alongside them, those new machines stay busy. Our limiting factors are almost always disk- or network-related, not CPU or memory. The variable scale and cost of a cloud provider is of little use to us.
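
For a flavor of what that planning looks like, here's a toy Python sketch of the shard math. The constants are hypothetical stand-ins for the metrics we actually monitor, and a real model would also account for growth within existing accounts, not just new signups.

    # Toy capacity model: given signup rate and per-account disk growth,
    # how soon must we order hardware for the next shard? All constants
    # below are hypothetical stand-ins for monitored metrics.
    ACCOUNTS_PER_DAY = 10_000        # new signups per day
    GB_PER_ACCOUNT = 0.002           # hypothetical average disk footprint
    SHARD_CAPACITY_GB = 2_000        # usable space on a database shard
    FILL_TARGET = 0.70               # provision before a shard is 70% full
    LEAD_TIME_DAYS = 45              # hypothetical order-to-rack time

    def days_until_order(used_gb: float) -> float:
        """Days remaining before hardware for the next shard must be ordered."""
        daily_growth_gb = ACCOUNTS_PER_DAY * GB_PER_ACCOUNT
        headroom_gb = SHARD_CAPACITY_GB * FILL_TARGET - used_gb
        return headroom_gb / daily_growth_gb - LEAD_TIME_DAYS

    print(f"Order hardware for the next shard in {days_until_order(400.0):.0f} days")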

As we've grown while maintaining an extremely efficient ratio of engineers to users, stability at every layer has helped tremendously. That doesn't mean we can be lazy about how we build things, and we still assume everything will fail.

MailChimp is a big, mature application. We send around 5,000 emails, handle 10,000 requests, and eat 100,000 queries every second. We have millions of users who themselves have billions of subscribers (for whom we handle all the opens/clicks/views). We're international enough at this point that our load is barely variable; there's never a quiet time. Our size and server count are large enough that there are significant savings in rolling our own. Even after accounting for networking hardware, colocation space and power, carrier lines, and the people to build and run everything, the savings are large. Against hourly instance cost it's laughably different, and even against heavily discounted long-term cloud contracts the price difference is still impressive.
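
For a sense of what those per-second figures mean over a day, the arithmetic is simple:

    # Back-of-envelope: the per-second rates above, scaled to a day.
    SECONDS_PER_DAY = 86_400
    rates_per_sec = {"emails sent": 5_000, "requests": 10_000, "queries": 100_000}
    for name, rate in rates_per_sec.items():
        print(f"{name}: ~{rate * SECONDS_PER_DAY:,} per day")
    # emails sent: ~432,000,000 per day
    # requests:    ~864,000,000 per day
    # queries:   ~8,640,000,000 per day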

We don't try to do everything. We do still lean on certain services for distribution, protection, and reducing latency, because it just doesn't make sense for us to have servers all over the world. Not yet.

Want to help us continue growing this environment? We're hiring.
