Thrashing Your SATA for Fun and Profit

N.B. This page is not intended for the really inexperienced! It may turn into "how to boil an egg on my disk?" quite quickly, and some of the damage may be irreversible.

One of the major issues for my target is high-speed sustained writing to disks. In the process of deciding how to continue, or just getting an answer to the usual question "what have I thrown my money at?", finding the hardware and/or software limits is a must. As a rule of thumb, if I may say so, never try to measure maximum hardware performance, especially when disks are involved, with any unnecessary software layer sitting above the lowest-level drivers. For the vast majority who haven't understood what I just wrote, that translates to "forget about file systems, LVM or anything like that!".

Another important rule when estimating the raw disk speed is to go for sequential writing/reading. Jumping back and forth over the disk costs time, and that's expensive in terms of sustained transfer rate. It's also very important to feed the data fast enough so that the disk logic doesn't decide to park the heads between bursts.
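The sequential rule above can be sketched in a few lines of C. This is only an illustration, not the test program used for the measurements: `write_sequential` is a hypothetical helper, the 4096-byte alignment is an assumption about what `O_DIRECT` typically requires, and the descriptor is expected to have been opened by the caller (on a raw device with `O_DIRECT`, or on a scratch file when just trying the code out).

```c
#define _GNU_SOURCE             /* for O_DIRECT when the caller opens a raw device */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes `blocks` blocks of `block_size` bytes back to back, starting at
 * offset 0, so the target never has to seek between requests.
 * Returns the number of bytes written, or -1 on error.
 * The buffer is aligned to 4096 bytes (assumed O_DIRECT requirement);
 * the alignment is harmless for ordinary descriptors. */
int64_t write_sequential(int fd, size_t block_size, size_t blocks)
{
    void *buf = NULL;
    int64_t total = 0;

    assert(block_size > 0 && block_size % 4096 == 0);
    if (posix_memalign(&buf, 4096, block_size) != 0)
        return -1;
    memset(buf, 0xA5, block_size);              /* arbitrary test pattern */

    for (size_t n = 0; n < blocks; n++) {
        size_t done = 0;
        while (done < block_size) {             /* finish short writes */
            ssize_t w = pwrite(fd, (char *)buf + done, block_size - done,
                               (off_t)(n * block_size + done));
            if (w <= 0) {
                free(buf);
                return -1;
            }
            done += (size_t)w;
        }
        total += (int64_t)block_size;
    }
    free(buf);
    return total;
}
```

Calling `write_sequential(fd, 0x100000, 1024)` would push 1GB in 1MB blocks. Because each call to `pwrite` blocks until the request is done, this simple form cannot keep several disks busy at once, which is why the real test relies on asynchronous I/O.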

So I bought four 1TB Seagate Barracuda 7200.11 drives (ST31000340AS), as they are both extremely fast, at a sustained speed of 105MB/s, and quite reliable when compared to their bigger 1.5TB brothers. Even worse, I bought two SATA port multipliers to be able to connect the disks in various ways and see how far I would get. With an ideal target of over 160MB/s, nothing was too expensive for experimenting.

As of writing this page, it has been a month since Seagate launched the 7200.12 series, which provides sustained transfer rates of up to 160MB/s. But with the money already spent, and given the delay between a device's launch and its availability on the Romanian market, those drives will have to wait for a revisit sometime in the future.

"One Small Step for a Man ..."

Never go for the whole structure right away. Sometimes (please read it as almost always) it just means wasting money and time! And these days (Feb 2009) money is the more important of the two, as time is something some of us might have plenty of, unfortunately! So I started with just one disk, to see how fast I could go with it and how truthful the Seagate datasheet is. Fortunately for them, and for me, I could get almost the announced 105MB/s sustained transfer speed.

There are, though, some simple rules to consider:

Speed Matters

With such good results obtained using a single disk, I moved on. After installing a second disk I started looking for the best approach in software, because writing to two disks in parallel is not an easy task. The only certain thing is that the disks must be written to in turns, so that while one is busy executing a set of commands, the other is being fed with commands. I took the following steps, all failing but the last, of course:

As a last test, I threw in both port multipliers, ensuring that the disks are connected alternately, so that I have '/dev/sdb' and '/dev/sdd' on one of them and '/dev/sdc' and '/dev/sde' on the other. The final result was an astonishing 240MB/s when no other major disturbing activity occurred.

I did the same test on a rather fast PC, with two of the disks. I could not get faster than 160MB/s, so the little chip from Marvell is really amazing with its completely interconnected architecture.

Adding Extra Kernel Layers

Just for the completeness of this batch of tests, I decided to go one step further and build a RAID0 array out of the 4 disks. As it was pretty obvious from the very beginning that the performance would be affected by the array creation parameters, I decided to run a set of tests and see what happens. The commands and the corresponding results are:

| Command | Speed |
| ------- | ----- |
| mdadm -C /dev/md0 -v -n 4 -l raid0 /dev/sda /dev/sdb /dev/sdc /dev/sdd | Started at 178MB/s, degraded to 176MB/s |
| mdadm -C /dev/md0 -v -n 4 -l raid0 /dev/sda /dev/sdc /dev/sdb /dev/sdd | Started at 178MB/s, degraded to 170MB/s |
| mdadm -C /dev/md0 -v -n 4 -c 1024 -l raid0 /dev/sda /dev/sdb /dev/sdc /dev/sdd | 249MB/s |
| mdadm -C /dev/md0 -v -n 4 -c 1024 -l raid0 /dev/sda /dev/sdc /dev/sdb /dev/sdd | 244MB/s |
| mdadm -C /dev/md0 -v -n 4 -c 2048 -l raid0 /dev/sda /dev/sdb /dev/sdc /dev/sdd | 247MB/s |

The first conclusion that really matters is that the RAID chunk size should properly match the available hardware. In this case a chunk of 1024KB (the '-c 1024' option) gave the best results.
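To see what the chunk size actually changes, here is a minimal sketch of the usual RAID0 address mapping, assuming equally sized members striped in order. The names `stripe_loc` and `raid0_map` are made up for this illustration and are not taken from the md driver.

```c
#include <assert.h>
#include <stdint.h>

/* Where a logical byte of the array ends up. */
struct stripe_loc {
    unsigned disk;      /* index of the member disk, 0..ndisks-1 */
    uint64_t offset;    /* byte offset on that member */
};

/* Maps a logical byte offset of a RAID0 array onto a member disk.
 * Chunks are dealt out to the members round-robin: chunk 0 to disk 0,
 * chunk 1 to disk 1, and so on, wrapping around after `ndisks` chunks. */
struct stripe_loc raid0_map(uint64_t logical, uint64_t chunk, unsigned ndisks)
{
    struct stripe_loc loc;
    uint64_t chunk_no;

    assert(chunk > 0 && ndisks > 0);
    chunk_no   = logical / chunk;                   /* which chunk of the array */
    loc.disk   = (unsigned)(chunk_no % ndisks);     /* member that holds it */
    loc.offset = (chunk_no / ndisks) * chunk        /* full chunks already on that member */
               + logical % chunk;                   /* position inside the chunk */
    return loc;
}
```

With a 1024KB chunk and four members, an aligned 1MB request maps to a single disk and the next one to the following disk. With a smaller chunk (the first two commands in the table pass no '-c', so mdadm's default applies) the same request is cut into pieces spread over all members. That is a plausible reading of the difference in the table, not something the measurements prove.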

The second conclusion, quite puzzling I must admit, is that the disks should be accessed first on one of the SATA ports and then on the other. The speed penalty is not big, but it might matter. As of kernel version 2.6.29-rc3-git10 this has apparently been fixed: due to a series of fixes in the md support, the behavior now seems to be consistent with my initial logic.

But, of course, the most annoying thing is that RAID0, properly set, proved to be faster than my best access scheme.

"No RAID" Revisited

Based on the RAID0 tests, I decided to make a final test by accessing the disks in the order '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd' instead of the original order '/dev/sda', '/dev/sdc', '/dev/sdb', '/dev/sdd'.

With this final test I regained confidence in my reasoning: just over 252MB/s!

The 252MB/s Wall

After a lot more work I have finally understood what limits the SATA transfer speed to 252MB/s, no matter how the disks are installed on the two port multipliers. As the board is announced as having 128MB of DDR memory, and after briefly checking the SoC and memory chip datasheets, I had assumed that a theoretical limit of about 510-520MB/s read speed could be achieved. Totally wrong, as the SoC is capable of transferring just 4 32-bit words in a burst instead of the maximum of 8 32-bit words. So the theoretical limit is around 313MB/s. With the extra overhead of SATA DMA and the actual operating system running, 252MB/s (about 80% of that limit) is quite a decent speed.

The Code

The code used to test this aspect is shown next. 'libaio' must be installed in order to compile it (link with '-laio'), and the kernel must have asynchronous I/O support enabled.

#define _GNU_SOURCE             // needed for O_DIRECT and O_LARGEFILE
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>           // gettimeofday()
#include <unistd.h>

//////////////////////////////////////////////////////////////////////////////

#define BufferSize      0x00100000                      // size of one write request (1MB)
#define BuffersNum      ((0x40000000/BufferSize)*512)   // number of requests sent to each drive
#define DrivesNum       4
//#define DrivesNum       1 // when testing RAID 0
static const char *Drives[]={"/dev/sda","/dev/sdb","/dev/sdc","/dev/sdd"};
//static const char *Drives[]={"/dev/md0","/dev/sdb","/dev/sdc","/dev/sdd"}; // when testing RAID 0
#define BuffsPerDrive   (0x01000000/BufferSize/DrivesNum)   // 16MB of buffers shared by all drives

//////////////////////////////////////////////////////////////////////////////

//  WARNING: this overwrites the raw devices listed in Drives[].
int main(void)
{ void *Buf[DrivesNum][BuffsPerDrive];
  int fd[DrivesNum];
  int Err;
  size_t i,j,BufIdx;
  uint64_t StartTime,LastTime;
  uint32_t _StepsNum;
  uint64_t _TotalTime;
  io_context_t Ctx[DrivesNum];
  struct iocb CBs[DrivesNum][BuffsPerDrive];
  struct iocb *CBPs[DrivesNum][BuffsPerDrive];
  struct timeval tv;
  //  Opens the raw devices, bypassing the page cache (O_DIRECT).
  for (i=0;i<DrivesNum;i++)
    if ((fd[i]=open(Drives[i],O_DIRECT|O_LARGEFILE|O_WRONLY))==-1) perror(Drives[i]);
  //  Allocates the memory blocks from which the write should be done.
  //  O_DIRECT needs aligned buffers; posix_memalign() returns pointers that free() accepts.
  for (i=0;i<DrivesNum;i++)
    for (j=0;j<BuffsPerDrive;j++)
      if (posix_memalign(&Buf[i][j],4096,BufferSize)) { fprintf(stderr,"Out of memory\n");return 1; }
  //  Creates one asynchronous I/O context per drive.
  for (i=0;i<DrivesNum;i++)
  { Ctx[i]=NULL;
    Err=io_setup(64,&Ctx[i]);
    if (Err) printf("io_setup %zu - Error %d\n",i,Err);
    for (j=0;j<BuffsPerDrive;j++) CBPs[i][j]=&CBs[i][j];
  }
  BufIdx=0;
  for (_TotalTime=0,_StepsNum=1;_StepsNum<=BuffersNum;_StepsNum++)
  { gettimeofday(&tv,NULL);
    StartTime=(uint64_t)tv.tv_sec*1000000+tv.tv_usec;
    //  Feeds the drives in turns: one request per drive in every step.
    for (i=0;i<DrivesNum;i++) if (fd[i]!=-1)
    { struct io_event Ev;
      //  Once the drive's queue is full, waits for one request to complete before submitting the next.
      if (_StepsNum>=BuffsPerDrive) io_getevents(Ctx[i],1,1,&Ev,NULL);
      io_prep_pwrite(&CBs[i][BufIdx],fd[i],Buf[i][BufIdx],BufferSize,(uint64_t)_StepsNum*BufferSize);
      io_submit(Ctx[i],1,&CBPs[i][BufIdx]);
    }
    gettimeofday(&tv,NULL);
    _TotalTime+=(LastTime=(uint64_t)tv.tv_sec*1000000+tv.tv_usec-StartTime);
    BufIdx=(BufIdx+1)%BuffsPerDrive;
    //  Bytes per microsecond is the same as MB/s: speed of the last step, then the average so far.
    if (!(_StepsNum%10)) printf("Step %u/%u - %.02lf MB/s - %.02lf MB/s          \r",
                                (unsigned)_StepsNum,(unsigned)BuffersNum,
                                (double)BufferSize*DrivesNum/(double)LastTime,
                                (double)BufferSize*DrivesNum*_StepsNum/(double)_TotalTime);
  }
  for (i=0;i<DrivesNum;i++) io_destroy(Ctx[i]);
  //  Frees the memory blocks.
  for (i=0;i<DrivesNum;i++) for (j=0;j<BuffsPerDrive;j++) free(Buf[i][j]);
  //  Closes all opened files.
  for (i=0;i<DrivesNum;i++) if (fd[i]!=-1) close(fd[i]);
  return 0;
}
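
The two figures printed by the loop above are simply bytes divided by microseconds, which is numerically the same as decimal MB/s. A small sketch of that arithmetic follows; the helper names `elapsed_usec` and `throughput_mb_s` are hypothetical and not part of the test program.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds between two gettimeofday() samples, computed in 64 bits
 * so that tv_sec * 1000000 cannot wrap around. */
uint64_t elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    int64_t sec   = (int64_t)end->tv_sec  - (int64_t)start->tv_sec;
    int64_t usec  = (int64_t)end->tv_usec - (int64_t)start->tv_usec;
    int64_t total = sec * 1000000 + usec;

    assert(total >= 0);                 /* end must not precede start */
    return (uint64_t)total;
}

/* Bytes per microsecond, i.e. MB/s with 1MB = 1000000 bytes. */
double throughput_mb_s(uint64_t bytes, uint64_t usec)
{
    assert(usec > 0);
    return (double)bytes / (double)usec;
}
```

For example, four 1MB buffers (4194304 bytes) submitted in 16644 microseconds come out at roughly 252MB/s.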