Here at MailChimp, we love redundancy. As explained by Joe in this post, we also love simplicity. Redundancy at every level of our infrastructure supports that philosophy by increasing our fault tolerance and mitigating downtime every step of the way.
Arguably, the most important form of redundancy is power. At our datacenters, we use two separate rPDUs with 208V surge protection in each rack. Each PDU is connected to its own independent circuit maintained by the datacenter, likely with more layers of redundancy under the hood. Each PDU pair is cross-connected and maintained as a single power node. This works well in conjunction with the two power supplies in each server and switch, each of which is plugged into a separate PDU. If one side goes down (whether a PSU, a PDU, or a datacenter circuit), there is typically no downtime due to power failure.
Network connectivity is another hugely important area where redundancy needs to be present in order to maintain uptime for our customers. Currently, we use two production switches per rack, each of which has two 10gbps uplinks (one "main" and one redundant) for a total of 40gbps. These uplinks are spread between four separate port expander cards on two independent aggregate switches. A primary bonded interface on each of the servers is formed across the two top-of-rack switches, creating yet another layer of redundancy. If a single fiber line were to fail, we would still be able to serve 10gbps to the impacted switch. If an entire top-of-rack switch goes down in any way, the bonded interfaces on the servers fail over to the surviving switch and keep serving traffic at full speed. At the top end, we use several enterprise-level internet service providers to mitigate any downtime caused by regional outages. If one line goes down, we have multiple dependable failovers we can fall back on within minutes.
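The server-side bond described above can be sketched as a Debian-style network interface stanza. This is a minimal illustration of an active-backup bond across two NICs, one cabled to each top-of-rack switch; the interface names, address, and mode here are assumptions for the example, not our actual configuration:

```
# /etc/network/interfaces (requires the ifenslave package)
# eth0 -> top-of-rack switch A, eth1 -> top-of-rack switch B (assumed names)
auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode active-backup   # traffic rides one link; the other is standby
    bond-miimon 100           # check link health every 100 ms
    bond-primary eth0
```

With a setup along these lines, losing a link or a whole switch just flips traffic to the standby slave, which is what lets a top-of-rack failure pass without interrupting the servers behind it.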
We also have the ability to back up user data 7 ways from sundown. Well, closer to 5. Each of our database machines is paired with another, and the two replicate off of each other, creating two identical copies of user data. Every night, one copy of this data is backed up to a local storage node, which cycles out data older than 7 days as fresh backups arrive. Another copy is sent to a non-localized storage machine and lives there for 7 days as well. After 7 days, both of these storage nodes cycle the old data onto big archive machines, which store those copies for another 30 days before deleting that data for good. This may seem over the top to some, but everything here is done with purpose. We have been bitten before by not having enough backups stored in multiple locations, causing permanent data loss for some of our users. With this system in place, that hasn't happened.
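The retention policy above boils down to a simple lifecycle: a nightly backup spends 7 days on the storage nodes, then up to 30 more on the archive machines, then it's gone. A small sketch of that logic (the function and constant names are ours, invented for illustration, not part of any MailChimp tooling):

```python
from datetime import date, timedelta

# Retention windows from the backup policy described above.
STORAGE_RETENTION_DAYS = 7   # local and non-localized storage nodes
ARCHIVE_RETENTION_DAYS = 30  # big archive machines

def backup_location(backup_date: date, today: date) -> str:
    """Where a nightly backup taken on backup_date lives as of today."""
    age = (today - backup_date).days
    if age < 0:
        raise ValueError("backup date is in the future")
    if age < STORAGE_RETENTION_DAYS:
        return "storage"   # still on both storage nodes
    if age < STORAGE_RETENTION_DAYS + ARCHIVE_RETENTION_DAYS:
        return "archive"   # cycled onto the archive machines
    return "deleted"       # gone for good

today = date(2014, 6, 30)
print(backup_location(today - timedelta(days=3), today))   # storage
print(backup_location(today - timedelta(days=14), today))  # archive
print(backup_location(today - timedelta(days=40), today))  # deleted
```

The takeaway is the 37-day total window: any given night's data is recoverable from somewhere for just over five weeks, and from two independent locations for the first week.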
These topics are but a few of the areas in which we've taken redundancy pretty seriously to make all of our lives simpler. Not only does it mitigate downtime for our customers, it also saves us from scrambling to get things running smoothly again.