Performance of read-write throughput with iSCSI

by Martin Monperrus

I recently encountered some performance issues using iSCSI. I use the open-iscsi implementation on the client side. After hours of googling and trial and error, here are some points related to iSCSI performance.

Readahead

The performance is highly dependent on the block device readahead parameter (sector count for filesystem read-ahead).
$ blockdev --getra /dev/sda
256

By setting it to 1024 instead of the default 256, I doubled the read throughput.
$ blockdev --setra 1024 /dev/sda

Note: on 2.6 kernels, this is equivalent to $ hdparm -a 1024 /dev/sda (see the blockdev man page: "--setfra N: Set filesystem readahead (same like --setra on 2.6 kernels)").
$ hdparm -a 1024 /dev/sda
/dev/sda:
setting fs readahead to 1024
readahead = 1024 (on)

This is also equivalent to setting /sys/block/sda/queue/read_ahead_kb, except that the unit differs: blockdev and hdparm count in sectors, while read_ahead_kb is in kilobytes.

Note that setting the readahead to a value larger than max_sectors_kb (/sys/block/sda/queue/max_sectors_kb) has no effect; the minimum of the two values is used.

To see the effect of your changes, look at the avgrq-sz field of $ iostat -x -d 2 while running hdparm -t.
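
To make the readahead setting survive reboots, one possibility is a udev rule (a sketch, assuming a udev-based distribution and that /dev/sda is the iSCSI disk; the file name and value are mine). Note that the sysfs attribute is in kilobytes, so 1024 sectors of 512 bytes correspond to 512 KB:
# /etc/udev/rules.d/60-iscsi-readahead.rules
SUBSYSTEM=="block", KERNEL=="sda", ACTION=="add|change", ATTR{queue/read_ahead_kb}="512"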


MTU

On http://publib.boulder.ibm.com/infocenter/iseries/v7r1m0/index.jsp?topic=/rzahq/mtuconsiderations.htm, it is stated that "High bandwidth and low latency is desirable for the iSCSI network. Storage and virtual Ethernet can take advantage of a maximum transmission unit (MTU) up to a 9000 byte 'jumbo' frame if the iSCSI network supports the larger MTU." Jumbo frames also seem to be a solution according to several posts on the web. The reason is that a basic filesystem block is 4096 bytes, which requires 3 packets with the default MTU of 1500 bytes. With jumbo frames, on the contrary, one network packet can contain one and sometimes even two sequential FS blocks.

The effective MTU along the path is determined by Path MTU discovery, which requires that ICMP packets are not firewalled.
To set the MTU of the network interface controller (NIC): $ ifconfig eth0 mtu 7200
To test the maximum MTU between the initiator and the target: (initiator) $ tracepath target.com
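
To check that jumbo frames actually pass unfragmented end to end, ping with the don't-fragment flag (target.com is the same placeholder as above); with an MTU of 9000 bytes, the largest ICMP payload is 9000 - 28 = 8972 bytes (20-byte IP header plus 8-byte ICMP header):
$ ping -M do -s 8972 target.com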

Partition alignment

The partition should be aligned for maximum performance. On my iSCSI disk, I have a single partition which starts at sector 2048, i.e. at a 1 MiB offset.
See:
* heads and sectors for partition alignment
* http://communities.vmware.com/docs/DOC-10510
* http://groups.google.com/group/open-iscsi/browse_thread/thread/37741fb3b3eca1e4
* http://comments.gmane.org/gmane.linux.iscsi.open-iscsi/2240
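
To check where a partition starts (a sketch, /dev/sda being the usual placeholder): fdisk -u prints the partition table in sectors; a start sector that is a multiple of 8 is aligned on 4096-byte boundaries, and 2048 corresponds to a 1 MiB offset. Recent versions of parted can also check the alignment directly.
$ fdisk -l -u /dev/sda
$ parted /dev/sda align-check optimal 1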

/sys/block/sdX/queue/scheduler

The "noop" scheduler/elevator seems to be the best choice for iSCSI. In my setup, noop is better because it avoids crashing the server under heavy I/O load, e.g. when writing very large files (echo noop > /sys/block/sda/queue/scheduler).
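
To list the available schedulers for the device (the active one is shown in brackets; the exact list depends on the kernel, the output below is just a typical example on 2.6):
$ cat /sys/block/sda/queue/scheduler
noop anticipatory deadline [cfq]
$ echo noop > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
[noop] anticipatory deadline cfq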

/sys/block/sdX/queue/max_sectors_kb

The default value on Linux is 512, i.e. a single request is at most 512 KB. Since one request is translated to one SCSI Command PDU, the lower max_sectors_kb, the higher the number of SCSI Command PDUs needed to transfer the same amount of data in sequential read/write. Based on this observation, I noticed that my sequential write throughput significantly increased after raising max_sectors_kb to 16384 (echo 16384 > /sys/block/sda/queue/max_sectors_kb).
To monitor the number of SCSI Command PDUs: iscsiadm -m session --stats, field scsicmd_pdus
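
A quick way to observe this (a sketch, assuming a single iSCSI session): read the counter, run a sequential read, and read the counter again; the difference is the number of SCSI Command PDUs issued for the 100 MB transfer.
$ iscsiadm -m session --stats | grep scsicmd_pdus
$ dd if=/dev/sda of=/dev/null bs=1024k count=100
$ iscsiadm -m session --stats | grep scsicmd_pdus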


/sys/block/sdX/queue/nr_requests

Increasing the maximum size of the I/O queue (nr_requests) often improves performance:
echo 1024 > /sys/block/sda/queue/nr_requests
See http://www.monperrus.net/martin/scheduler+queue+size+and+resilience+to+heavy+IO.
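
To read the current value before changing it (128 is the usual default):
$ cat /sys/block/sda/queue/nr_requests
128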

iSCSI R2T

I still don't know whether InitialR2T has an impact on throughput or latency. For a given request, the number of R2T PDUs is approximately equal to Size / MaxBurstLength. For instance, copying 100 MB of data over a connection configured with MaxBurstLength=262144 results in roughly 100000000/262144 ≈ 381 R2T PDUs.
To monitor the number of R2T PDUs: iscsiadm -m session --stats, field r2t_pdus
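
For experimenting with these parameters on the initiator side, a sketch (the values below are only an illustration, and the negotiated values also depend on what the target accepts, cf. the negotiated parameters in the Troubleshooting section): set them in /etc/iscsi/iscsid.conf before discovering the target, or update an already discovered node record with iscsiadm -m node -o update, then log out and log in again.
# excerpt of /etc/iscsi/iscsid.conf
node.session.iscsi.InitialR2T = No
node.session.iscsi.MaxBurstLength = 16776192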


vm.vfs_cache_pressure

Decreasing vm.vfs_cache_pressure makes the kernel retain more of the filesystem cache (dentries and inodes), hence decreases the number of accesses to the iSCSI disk. I set it to the commonly used value of 50: $ sysctl -w vm.vfs_cache_pressure=50 (the current value can be read with cat /proc/sys/vm/vfs_cache_pressure). I find the argument plausible, but I have not set up an experiment to verify that it actually improves performance.

To empty the cache (free pagecache, dentries and inodes):
$ echo 3 > /proc/sys/vm/drop_caches
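
To make vm.vfs_cache_pressure persistent across reboots (assuming the standard /etc/sysctl.conf mechanism):
$ echo "vm.vfs_cache_pressure = 50" >> /etc/sysctl.conf
$ sysctl -p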


Troubleshooting

$ iscsid --version
iscsid version 2.0-870

$ iscsiadm --version
iscsiadm version 2.0-870

$ iscsiadm -m session -P 2
iSCSI Transport Class version 2.0-870
iscsiadm version 2.0-870
Target: iqn.2007-10.net.ovh:r35173vol0
Current Portal: 91.121.191.30:3260,1
Persistent Portal: 91.121.191.30:3260,1

Interface:

Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.2005-03.org.open-iscsi:e4fe229d280f
Iface IPaddress: 94.23.243.67
Iface HWaddress: default
Iface Netdev: default
SID: 1
iSCSI Connection State: LOGGED IN
iSCSI Session State: LOGGED_IN
Internal iscsid Session State: NO CHANGE

Negotiated iSCSI params:

HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 8192
FirstBurstLength: 65536
MaxBurstLength: 262144
ImmediateData: Yes
InitialR2T: Yes
MaxOutstandingR2T: 1

Tests
$ hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 876 MB in 2.00 seconds = 438.04 MB/sec
Timing buffered disk reads: 22 MB in 3.66 seconds = 6.01 MB/sec

## read test
## the kernel does not cache block devices
$ dd if=/dev/sda of=/dev/null bs=1024k count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 4.68656 s, 11.2 MB/s

## write test (to a real file on top of the filesystem, so first empty the kernel cache)
$ echo 3 > /proc/sys/vm/drop_caches
$ dd if=/dev/zero of=/foo bs=1024k count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB) copied, 4.31652 s, 12.1 MB/s
$ rm /foo

## read and write tests with bonnie++
$ bonnie++ -f
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
r35173.ovh.ne 1000M 9132 5 4756 3 11636 2 209.0 8
Latency 10596ms 3903ms 773ms 2373ms
Version 1.96 ------Sequential Create------ --------Random Create--------
r35173.ovh.net -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 6180 41 +++++ +++ 10183 46 7648 50 +++++ +++ 10177 46
Latency 58855us 1460us 4112us 463us 241us 200us

# random access time with seeker from http://www.linuxinsight.com/how_fast_is_your_disk.html
$ ./seeker /dev/sda
Seeker v2.0, 2007-01-15, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [20480MB], wait 30 seconds.............................
Results: 20 seeks/second, 49.02 ms random access time

# parallel random access time with seeker_baryluk from http://smp.if.uj.edu.pl/~baryluk/seeker_baryluk.c
# I set the number of threads to 32 because it is the maximum number of parallel requests (/sys/block/sda/device/queue_depth)
# note that the reported random access time is buggy (see the source code)
$ ./seeker_baryluk /dev/sda 32
Seeker v3.0, 2009-06-17, http://www.linuxinsight.com/how_fast_is_your_disk.html
Benchmarking /dev/sda [41943040 blocks, 21474836480 bytes, 20 GB, 20480 MB, 21 GiB, 21474 MiB]
[512 logical sector size, 512 physical sector size]
[32 threads]
Wait 30 seconds..............................
Results: 164 seeks/second, 6.088 ms random access time (67320 < offsets < 21471527830)

# to read the current values of node.conn[0].timeo.noop_out_interval, node.conn[0].timeo.noop_out_timeout and node.session.timeo.replacement_timeout
# you may have to adapt the session and connection numbers
$ cat /sys/class/iscsi_connection/connection1\:0/recv_tmo/ping_tmo
$ cat /sys/class/iscsi_connection/connection1\:0/recv_tmo/recv_tmo
$ cat /sys/class/iscsi_session/session1/recovery_tmo
$ cat /sys/class/iscsi_session/session1/abort_tmo
$ cat /sys/class/iscsi_session/session1/lu_reset_tmo
