Lightweight estimation of disk misalignment performance penalty

by Martin Monperrus
Many modern disks have a physical sector size that is larger than 512B (typically 4096B), but the software chain (BIOS, OS, partitioning tools) often still assumes a sector size of 512B. This has two consequences. First, I/O requests that are smaller than the physical sector size are translated into larger requests. For instance, if the physical sector size is 4096B, asking to read or write 512B makes the disk handle a full 4096B sector anyway. In other words, it is not worth issuing small I/O requests. Second, it creates potential disk alignment issues.
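
To see whether a given disk is in this situation, one can compare the logical and physical sector sizes reported by the kernel. The following is a minimal sketch, assuming a Linux system that exposes logical_block_size and physical_block_size under /sys/block/<device>/queue/. The function name sector_sizes and its optional second argument are made up for illustration. Virtualized devices may report 512 for both values whatever the underlying storage is.

```shell
# Hypothetical helper: print the logical and physical sector sizes of a disk.
# $1: device name without the /dev/ prefix (e.g. sda)
# $2: sysfs root (optional, defaults to /sys/block; overridable for testing)
sector_sizes() {
  local dev=$1
  local sysroot=${2:-/sys/block}
  local queue="$sysroot/$dev/queue"
  local logical physical
  logical=$(cat "$queue/logical_block_size") || return 1
  physical=$(cat "$queue/physical_block_size") || return 1
  echo "logical=$logical physical=$physical"
}
```

For instance, "sector_sizes sda" would print one line of the form "logical=... physical=..." for /dev/sda.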

Disk alignment issues occur when a request should fit into one physical sector but actually requires reading/writing two of them. For instance, assume that we have physical sectors of 4096B. If I ask for 4096B starting at byte #4094, the disk must read the first and the second physical sector to fulfill the request. In other words, one should always issue requests that start at addresses that are multiples of the physical sector size. Let's now use the "old" 512B unit, and call this unit "s". A physical sector size of 4096B corresponds to 8s. In this case, one should ask for 8s at addresses that are multiples of 8, e.g. #0, #8, #16, #32. On the contrary, if I ask for 8s at addresses #6, #14, #30, all requests are misaligned.
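
The arithmetic above can be checked with a few lines of shell. This is only an illustrative sketch: the function name sectors_touched is made up, and the physical sector size of 4096B is an assumed default.

```shell
# Hypothetical helper: number of physical sectors touched by one request.
# $1: start offset in bytes
# $2: request length in bytes (must be > 0)
# $3: physical sector size in bytes (optional, defaults to 4096)
sectors_touched() {
  local offset=$1 length=$2 psize=${3:-4096}
  local first=$(( offset / psize ))                 # sector holding the first byte
  local last=$(( (offset + length - 1) / psize ))   # sector holding the last byte
  echo $(( last - first + 1 ))
}

sectors_touched 0 512       # prints 1: a small request still costs one full sector
sectors_touched 0 4096      # prints 1: aligned request
sectors_touched 4094 4096   # prints 2: misaligned request
```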

Consequently, in the presence of large physical sectors, the average time to seek N random aligned blocks should be shorter than the average time to seek N random misaligned blocks. I propose to measure those two values as a way to estimate the impact of disk misalignment on I/O performance. This approach to assessing the misalignment performance overhead is lightweight because, contrary to most existing solutions, it does not require (and is independent of) a file system, a database or an I/O workload.

I modified seeker.sh to test this (see below). On Amazon Elastic Block Storage (EBS), I noticed a difference of at least one millisecond on average between accessing aligned random 4096B blocks (~10ms) and misaligned random 4096B blocks (~12ms, see below). It means that EBS is sensitive to disk and partition misalignment with 4KiB requests (which typically correspond to filesystem blocks).

Finally, note that to ensure that requests are aligned, partitions must be aligned too, i.e. they must start at addresses that are multiples of the physical sector size.
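
Checking a partition boils down to a modulo on its start sector. The sketch below is illustrative: the function name is_aligned is made up, and 63 and 2048 are sample values, commonly cited as the legacy and the modern default start sector of a first partition. On Linux, the start of a partition, expressed in 512B sectors, can be read from /sys/block/<disk>/<partition>/start.

```shell
# Hypothetical helper: succeed if a partition start is aligned.
# $1: partition start, in 512B sectors
# $2: physical sector size, in 512B sectors (optional, defaults to 8, i.e. 4096B)
is_aligned() {
  local start=$1 block=${2:-8}
  [ $(( start % block )) -eq 0 ]
}

for start in 63 2048; do
  if is_aligned "$start"; then
    echo "start sector $start: aligned"
  else
    echo "start sector $start: misaligned"
  fi
done
```

With 4096B physical sectors, a start at sector 63 is reported as misaligned (63 is not a multiple of 8) and a start at sector 2048 as aligned.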

Related resources

@IBM: http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/
@Microsoft: http://msdn.microsoft.com/en-us/library/dd758814%28v=sql.100%29.aspx
@VMWare: http://www.vmware.com/pdf/esx3_partition_align.pdf

Tests

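Note that we use sort and awk to compute the median (with N=200 samples, the 100th value of the sorted list), which plays the same role as the average and is often better because it is less sensitive to outliers. See also this post for creating histograms of seek time using Scilab.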
root@ec2:~# ./seeker-advanced.sh 8 0 200 /dev/xvda | sort -g | awk 'NR==100'
block size (byte): 4096
block size (512B-sector): 8
disk size (byte) : 10737418240
disk size (512B-sector) : 20971520
disk size (block) : 2621440
misalignment (bytes): 0
misalignment (512B-sector): 0
0.010
root@ec2:~# ./seeker-advanced.sh 8 1 200 /dev/xvda | sort -g | awk 'NR==100'
block size (byte): 4096
block size (512B-sector): 8
disk size (byte) : 10737418240
disk size (512B-sector) : 20971520
disk size (block) : 2621440
misalignment (bytes): 512
misalignment (512B-sector): 1
0.012 ## there is a penalty of 2 ms per random seek of unaligned blocks.
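
The awk 'NR==100' filter used above is tied to N=200 and picks the lower of the two middle values. A possible generalization, given here as a sketch only (the function name median is made up, and sort -g assumes GNU sort as in the commands above), works for any number of samples:

```shell
# Hypothetical helper: median of the numbers read on stdin (one per line).
# For an even number of samples, the two middle values are averaged.
median() {
  sort -g | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# One slow outlier (0.250) barely moves the median, while it would dominate the average.
printf '%s\n' 0.012 0.010 0.011 0.250 | median   # prints 0.0115
```

It can replace the sort and awk stages of the pipeline, e.g. "./seeker.sh 8 1 200 /dev/sda | median".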

Software


#!/bin/bash
# see http://www.monperrus.net/martin/lightweight-analysis-disk-alignment
#
# Usage: ./seeker.sh BLOCKSIZE MISALIGNMENT N DEVICE REQUESTSIZE
#   BLOCKSIZE, MISALIGNMENT and REQUESTSIZE are given in 512B sectors
# e.g. ./seeker.sh 8 1 200 /dev/sda 8
#   reads 200 random 4KiB blocks, each at an address misaligned by one sector (512B)


# positional arguments, each falling back to a default when missing or empty
BLOCKSIZESECTOR=${1:-8}
MISALIGNMENT=${2:-0}
N=${3:-200}
F=${4:-/dev/sda}
COUNT=${5:-$BLOCKSIZESECTOR}


SECTORSIZE=512

BLOCKSIZEBYTE=$((SECTORSIZE*BLOCKSIZESECTOR))

MAXBYTE=$(blockdev --getsize64 "$F") ## measured in bytes

MAXSECTOR=$((MAXBYTE/SECTORSIZE))

MAXBLOCK=$((MAXBYTE/BLOCKSIZEBYTE))

echo block size "(byte)": $BLOCKSIZEBYTE >&2
echo block size "(${SECTORSIZE}B-sector)": $BLOCKSIZESECTOR >&2
echo disk size "(byte)" : $MAXBYTE >&2
echo disk size "(${SECTORSIZE}B-sector)" : $MAXSECTOR >&2
echo disk size "(block)" : $MAXBLOCK >&2
echo misalignment "(bytes)": $((MISALIGNMENT*SECTORSIZE)) >&2
echo misalignment "(${SECTORSIZE}B-sector)": $MISALIGNMENT >&2

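# maximum value returned by bash's $RANDOM
RANDOM_MAX=32767

# make the "time" keyword print only the elapsed real time, in seconds
TIMEFORMAT="%E"

for i in $(seq 1 "$N")
do
  ## the seek is always computed in blocks in order to be aligned;
  ## block 0 is skipped so that subtracting the misalignment never gives a
  ## negative address, and block MAXBLOCK is skipped so that the read stays on the disk
  SEEKBLOCK=$((1+(RANDOM*(MAXBLOCK-2))/RANDOM_MAX))

  ## translating the seek address in ${SECTORSIZE}B-sectors and misaligning by $MISALIGNMENT sectors
  SEEKSECTOR=$(((SEEKBLOCK*BLOCKSIZESECTOR)-MISALIGNMENT))

  #echo $SEEKBLOCK/$MAXBLOCK $SEEKSECTOR/$MAXSECTOR

  ## dd's own messages are captured in $A and discarded; the elapsed time is
  ## written by "time" to the loop's stderr, which is redirected to stdout below
  A=$(time dd if="$F" of=/dev/null ibs=$SECTORSIZE skip=$SEEKSECTOR count=$COUNT 2>&1)
done 2>&1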
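
Since $RANDOM only yields 15 bits, the loop above can reach at most 32768 distinct block positions, whatever the disk size. One possible refinement, which is not part of the script, is to combine two draws into a 30-bit value. The sketch below is an assumption about how this could be done: the function name random_block is made up, it requires bash, and the modulo introduces a slight bias. Block 0 and block maxblock are left out so that a small misalignment can be subtracted and the read still stays on the disk.

```shell
#!/bin/bash
# Hypothetical helper: print a random block index in [1, maxblock - 1],
# built from two 15-bit $RANDOM draws (30 bits in total).
# $1: number of blocks on the device (assumed to be between 2 and 2^30)
random_block() {
  local maxblock=$1
  local r=$(( (RANDOM << 15) | RANDOM ))
  echo $(( 1 + r % (maxblock - 1) ))
}

MAXBLOCK=2621440   # 10 GiB disk split into 4096B blocks, as in the tests
random_block "$MAXBLOCK"
```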