Sep 12, 2016 by Toby Chappell Systems

Performance Testing with SSDs, Part 1

At MailChimp, we've historically had mixed feelings about SSD-based servers. Years ago, our servers were hosted by a provider that utilized SSDs for storage, and many of those SSDs ended up failing at the same time. It caused quite a few headaches for the team, and we were hesitant to consider SSDs as an option for a long time. So, when we decided it was time to give SSD-based servers another chance, we needed to convince ourselves that the performance (and the risk, given our past experiences) was worth the cost. Everyone knew that modern, server-class SSDs would be faster, but we didn't know exactly how much faster.

Here is a brief recap of the hardware used for our two test boxes:

  • HP DL380P Gen9 24SFF
  • 2x 600GB 10k SAS disks (OS partition), connected via onboard HP P440 RAID controller
  • 6x 800GB Intel S3700 series SSDs (database storage)
  • 128GB DDR4-2133 RAM
  • A pair of HP P440ar or LSI SAS9300-8i disk controllers for the SSDs

The only hardware difference between the 2 test servers is in the disk controller. One server was configured with a pair of HP P440ar controllers. This is a hardware RAID-capable controller, and is the PCI version of the onboard HP P440 controller. The second server had a pair of PCI LSI SAS9300-8i controllers. These are simple HBAs (Host Bus Adapters), not RAID controllers. Initially, both were tested with CentOS 6.x (2.6 series kernel). We later re-ran some tests on the "winning" controller config under CentOS 7.x (3.10 series kernel) and determined that this second kernel provided even more performance. But more on that in a minute.

The deep questions

As you can see, this test largely became about which controller was a better fit for our unique scenario, and what we'd need for effective tuning. This is an important question for us, because we try to keep our hardware configs as consistent as possible (we literally have racks and racks and racks of HP DL380s), so if we make a fundamental change in this config, it needs to be a change we can live with for a long time.

These are the questions we were trying to answer about this hardware setup:

  • Which of these 2 controller types performs best?
  • What OS and hardware tuning are needed to maximize performance?
  • Are there any other non-performance factors that need to be considered?

The disks

The HP DL380p 24SFF servers have 24x 2.5" disk bays grouped into three enclosures or, in the parlance of the HP controller tool, "boxes."

The only disks on the onboard controller (slot "0") are the root drives. The SSDs are distributed evenly across the other 2 controllers. Because each SATA cable is really a bundle of 4 independent cables, it doesn't really matter how the SSDs are arranged on each controller.

Presenting the disks to the OS

The HP P440ar controllers can run in 2 modes: RAID (Redundant Array of Independent Disks) and HBA (Host Bus Adapter). For performance reasons, if you'll be using software RAID (mdadm), HBA mode would be the clear choice because it presents the most direct I/O path to each individual disk. But, and this is a big "but" for us, in HBA mode you cannot use the controller tools to turn on the location indicator light for a given disk. This is because the location light protocol actually works on the enclosure and not the disk itself. And, in HBA mode, the disks aren't treated as being in enclosures.

At MailChimp, we have so many servers (~1600 at last count!) that we depend on things like disk lights to double-check our work. So, even though our tests showed around a 2% performance improvement from running the HP controllers in HBA mode, that wasn't enough of a boost to override our need to control the lights.

Since we're running in RAID mode, there is an extra step on the HP controllers that wasn't necessary on the LSI controllers. To present each disk individually, we turn it into a RAID-0 volume.

We do the same for both HP controllers. You can see the list of disks by running hpssacli ctrl slot=X pd all show, where X is the PCI slot holding the controller.

sudo hpssacli ctrl slot=2 create type=ld raid=0 ss=1024 drives=1I:2:1 size=max aa=disable
sudo hpssacli ctrl slot=2 create type=ld raid=0 ss=1024 drives=1I:2:2 size=max aa=disable
sudo hpssacli ctrl slot=2 create type=ld raid=0 ss=1024 drives=1I:2:3 size=max aa=disable

This is repeated for the other SSDs that will be used.
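Since the same command is repeated per drive, one way to avoid typos is to generate the commands and review them before running anything. The sketch below only prints the commands. The function name print_ld_create_cmds is hypothetical, and the slot number and drive addresses are the ones shown above, so the values for the second controller would need to come from the pd all show output on your own hardware.

```shell
# Print (not run) one "create" command per drive so the list can be reviewed.
# $1: controller PCI slot
# remaining args: physical drive addresses as listed by "pd all show"
print_ld_create_cmds() {
    slot="$1"
    shift
    for drive in "$@"; do
        echo "sudo hpssacli ctrl slot=${slot} create type=ld raid=0 ss=1024 drives=${drive} size=max aa=disable"
    done
}

# Slot 2 and these three addresses are the ones used in the post.
print_ld_create_cmds 2 1I:2:1 1I:2:2 1I:2:3
```

Once the printed list looks right, it can be piped to sh to actually run it.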

The ss option is the stripe size in kilobytes. Ideally, this would be set to the erase block size of the SSD (2048KB), but 1024KB is the maximum supported by the HP controllers. As long as the erase block size is a small multiple of the stripe size, it shouldn't be an issue. The other option above is aa=disable, where aa is short for arrayaccelerator. It's not an option you want to enable on SSDs: they handle their own caching and can handle orders of magnitude more IOPS (I/O operations per second) than a standard hard drive, so you want to be sure those I/O requests make it to the SSD as quickly as possible.

For each controller, as long as it only has SSDs, you should disable surface scan mode as well. Otherwise, the controller will significantly shorten the life of your SSDs, since it continually checks and rewrites parity bits to be sure your data is recoverable.

sudo hpssacli controller slot=2 modify surfacescanmode=disable

Once all of the above has been done, each disk will then show up as /dev/sdX to the OS. For example, /dev/sdc.

None of this is necessary on the LSI controllers, since they present the SSDs directly to the operating system without need of further configuration.

Making the disks usable

Now that we've got disks on both controller types exposed to the operating system (CentOS, in our case), the commands are basically the same for both HP and LSI controllers from here on out.

Each disk must be partitioned to ensure proper alignment with the underlying block size that arises from the individual chips within the SSDs. The intricacies of SSD alignment will be covered in Part 2 of this post, but suffice it to say that offsetting the beginning of the disk partition by 4096 sectors provided the best alignment and performance. For the disks we created above, these parted commands partition the SSDs accordingly. We're only showing 3 disks here, but it's the same command for all:

sudo parted -s -a optimal /dev/sdb mklabel gpt -- mkpart primary xfs 4096s -1
sudo parted -s -a optimal /dev/sdc mklabel gpt -- mkpart primary xfs 4096s -1
sudo parted -s -a optimal /dev/sdd mklabel gpt -- mkpart primary xfs 4096s -1

We'll follow the same pattern for the other devices that will be mirrored and striped.
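To make the 4096-sector offset a bit more concrete, here is a small sketch that converts a start sector into kilobytes and prints the parted command for each device. It assumes 512-byte logical sectors, which is an assumption on our part rather than something stated above (check it with your own disks), and the function names are hypothetical. Under that assumption, 4096 sectors works out to 2048KB, which matches the erase block size mentioned earlier; the full reasoning is left for Part 2.

```shell
# Offset in KB of a partition that starts at a given sector.
# $1: start sector
# $2: logical sector size in bytes (defaults to 512, which is an assumption)
partition_offset_kb() {
    echo $(( $1 * ${2:-512} / 1024 ))
}

# Print (not run) the parted command from the post for each device given.
print_parted_cmds() {
    for dev in "$@"; do
        echo "sudo parted -s -a optimal ${dev} mklabel gpt -- mkpart primary xfs 4096s -1"
    done
}

partition_offset_kb 4096    # prints 2048 with 512-byte sectors
print_parted_cmds /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
```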

Mirror, mirror

Our desired software RAID setup is RAID 1+0, which is a collection of mirrors (RAID-1) striped together (RAID-0). This uses more SSDs than, for example, a traditional RAID-5 setup, but provides a good balance of redundancy and performance. While mdadm does provide a one-step RAID-10 setup, we found in testing that configuring individual RAID-1 pairs and then striping across them provides much better performance and visibility.
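As a quick back-of-the-envelope illustration of the capacity trade-off, the sketch below compares raw usable space for the 6x 800GB SSDs in our test boxes. The function names are hypothetical, and the numbers ignore filesystem and metadata overhead.

```shell
# Raw usable capacity in GB for a set of identical disks.
# $1: number of disks, $2: size of each disk in GB
raid10_usable_gb() { echo $(( $1 / 2 * $2 )); }     # half the disks hold mirror copies
raid5_usable_gb()  { echo $(( ($1 - 1) * $2 )); }   # one disk's worth goes to parity

raid10_usable_gb 6 800   # prints 2400
raid5_usable_gb 6 800    # prints 4000
```

In other words, RAID 1+0 gives up 1600GB of raw space on this hardware compared to RAID-5, in exchange for mirrored pairs and no parity calculations.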

So, to create this RAID 1+0 setup, we first have to build the mirrors. To protect us against a controller failure, we want each mirror to have disks that are presented by different controllers.

In our case, devices /dev/sdb through /dev/sdd were on one controller, and /dev/sde through /dev/sdg on the other.

sudo mdadm -C /dev/md1 --level=raid1 --raid-devices=2 /dev/sdb1 /dev/sde1
sudo mdadm -C /dev/md2 --level=raid1 --raid-devices=2 /dev/sdc1 /dev/sdf1
sudo mdadm -C /dev/md3 --level=raid1 --raid-devices=2 /dev/sdd1 /dev/sdg1
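The pairing rule (the Nth disk on one controller mirrors the Nth disk on the other) can also be written as a loop, which helps when there are more than a few pairs. This is only a sketch that prints the commands. The function name is hypothetical, and it assumes you already know which partitions sit behind which controller, as we did above.

```shell
# Print (not run) one mdadm mirror command per cross-controller pair.
# $1: space-separated partitions on the first controller
# $2: space-separated partitions on the second controller, in the same order
print_mirror_cmds() {
    ctrl_a="$1"
    ctrl_b="$2"
    n=1
    for a in $ctrl_a; do
        # Pick the nth partition from the second controller's list.
        b=$(echo "$ctrl_b" | awk -v n="$n" '{ print $n }')
        [ -n "$b" ] || { echo "no partner for ${a}" >&2; return 1; }
        echo "sudo mdadm -C /dev/md${n} --level=raid1 --raid-devices=2 ${a} ${b}"
        n=$((n + 1))
    done
}

print_mirror_cmds "/dev/sdb1 /dev/sdc1 /dev/sdd1" "/dev/sde1 /dev/sdf1 /dev/sdg1"
```

With the device lists from our servers, this prints the same three commands shown above.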

After all the mirrors are built, the next step is to stripe them together:

sudo mdadm -C /dev/md101 --level=raid0 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3

It's important to capture this mdadm config so it'll be reapplied when the server reboots. To do this, create the file /etc/mdadm.conf. Its contents should be the output of this command:

sudo mdadm --detail --scan
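One possible way to get that output into the file is sketched below. It writes to a temporary file first, so a failed scan doesn't leave a half-written config behind. The helper name save_md_config and the .new suffix are hypothetical, and on a real server this would need to run as root.

```shell
# Save the output of a command to a config file, replacing the file only
# if the command succeeds.
# $1: destination file (the post uses /etc/mdadm.conf)
# remaining args: the command whose output should be saved
save_md_config() {
    dest="$1"
    shift
    tmp="${dest}.new"
    if "$@" > "$tmp"; then
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}

# On a real server, as root:
#   save_md_config /etc/mdadm.conf mdadm --detail --scan
```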

We've covered a lot of ground today, but there's still more to learn. Next time, we'll discuss OS tuning, mounting the filesystem, and review the final outcome of our SSD performance testing.