- Background
- dCache disk pools
- TCP/network tuning
- Enable BBR congestion control
- vm.swappiness
- vm.min_free_kbytes
- vm.dirty
- disk scheduler
- dCache tape pools
- disk scheduler/readahead
- Write pool caveats
- Hardware specific issues
- HP(E) Smart Array RAID controllers
- Queue depth
- Write performance degradation due to overly large max_sectors_kb setting
- Experiences
- UIO
- Managing NCQ on SATA hard drives in GNU/Linux
- Viewing NCQ status
- Managing NCQ
- How to check the current queue depth value?
Background
The Linux kernel is not really tuned for large-scale IO or high-bandwidth long-haul network traffic by default. This page aims to document the various tunings applicable to the services run within NDGF.
dCache disk pools
TCP/network tuning
NDGF dCache pools do single-stream TCP transfers for pool-to-pool (p2p) copies, so we tune for single-stream performance and assume that this will also be good enough for end-user transfers.
The bandwidth-delay product limits the transfer speed that can be achieved. The RTT over the LHC OPN between HPC2N and IJS is approx 75 ms, so sustaining 800 MB/s transfers requires a TCP window of 800 MB/s × 0.075 s = 60 MB, which is larger than most Linux distribution defaults allow for TCP auto tuning.
Check your current tuning with: sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
You need to modify the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem sysctls to allow Linux TCP autotuning to reach at least the needed TCP window size. Change only the rightmost value. Note that the value set for the read buffer typically needs to be 50% larger than the wanted TCP window size due to overhead.
Below is an example /etc/sysctl.d/60-tcptuning.conf
# Typical defaults, listed by sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem:
# net.ipv4.tcp_rmem = 4096 87380 6291456
# net.ipv4.tcp_wmem = 4096 16384 4194304
#
# Tuning for 64MiB tcp windows, with similar tcp_rmem overhead as default:
net.ipv4.tcp_rmem = 4096 87380 100663296
net.ipv4.tcp_wmem = 4096 16384 67108864
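To apply the new limits without rebooting and confirm them, something like the following should work; during a running transfer, ss can show the actual per-socket buffer sizes:

# Reload all sysctl.d snippets and verify the limits:
sysctl --system
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# During a transfer, inspect per-socket memory/window details:
ss -tmi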
Enable BBR congestion control
Switching to the BBR congestion control algorithm is preferred in order to better handle links with spurious periods of high packet loss. This condition causes the default Linux congestion control to overreact and recover with a slow ramp-up from a full stop. BBR is able to detect spurious loss and react in a more appropriate manner.
BBR is available in Linux distributions with a sufficiently recent kernel, for example Ubuntu 18.04, RHEL 8, Debian 10, and more.
Below is an example /etc/sysctl.d/60-net-tcp-bbr.conf
# Enable the Bottleneck Bandwidth and RTT (BBR) congestion control algorithm.
#
# Use fq qdisc for best performance.
# From https://github.com/google/bbr/blob/master/Documentation/bbr-quick-start.md:
# Any qdisc will work, though "fq" performs better for highly-loaded servers.
# (Note that TCP-level pacing was added in v4.13-rc1 but did not work well for
# BBR until a fix was added in 4.20.)
net.core.default_qdisc=fq
#
# Enable BBR, despite the name also applies to IPv6
net.ipv4.tcp_congestion_control=bbr
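After applying the settings (for example with sysctl --system), it is worth confirming that the kernel actually uses BBR and the fq qdisc:

# Confirm the active algorithm and default qdisc:
sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc
# bbr should also show up among the available algorithms once the module is loaded:
sysctl net.ipv4.tcp_available_congestion_control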
vm.swappiness
It generally makes no sense to preemptively swap out (parts of) running applications (i.e. java/dCache) in order to gain a little more disk cache.
Add the following as /etc/sysctl.d/60-vm-swappiness.conf
# Tell the kernel to reduce the willingness to swap out applications when doing
# IO in order to increase the amount of file system cache.
# Default: 60
# Note that it needs to be set to minimum 1 with newer kernels to have the same
# behaviour as 0 has with older kernels.
vm.swappiness = 1
vm.min_free_kbytes
To avoid "failed 0-order allocation" errors, vm.min_free_kbytes needs to be large enough; 0.5-1.0 seconds worth of network traffic is a reasonable starting point.
Older distributions/kernels have a default tuned for GigE class networking.
The suggested starting value for 10GigE class networking is 524288 kbytes (512 MiB).
Check your current value with sysctl vm.min_free_kbytes. If it's smaller than the suggested value you need to increase it.
To increase it, add the following as /etc/sysctl.d/60-vm-minfree.conf
vm.min_free_kbytes = 524288
vm.dirty
The defaults are quite large on modern machines with lots of RAM and wait far too long before starting writeout, which causes write storms and a huge impact on reads. The total lack of IO pacing makes this behaviour have a bigger impact than it should have.
The workaround is to reduce the vm.dirty settings, but care also needs to be taken that this doesn't cause files to be written in multiple fragments, as that reduces the efficiency of read-ahead. Aim for setting it as low as possible without causing a lot of fragmentation.
Proceed in the following way to find suitable tuning for this:
- Start out with vm.dirty_background_bytes approximating 0.5 s of IO, i.e. the same value as vm.min_free_kbytes, but note bytes vs kbytes!
- Set vm.dirty_bytes to 4*vm.dirty_background_bytes
- Verify that you’re not getting overfragmented files:
- Copy/write a big file (multiple GB)
- sync
- Check file fragmentation with filefrag filename
- If more than a couple (single-digit) extents are listed for each file, increase vm.dirty* and try again.
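As a concrete sketch of that check (assuming a hypothetical pool filesystem mounted at /pool and a multi-GB test file; adjust paths to your setup):

# Write a large file, flush it to disk, then inspect its on-disk layout:
cp /path/to/multi-GB-file /pool/fragtest
sync
filefrag /pool/fragtest    # "1 extent found" is ideal, a handful is acceptable
rm /pool/fragtest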
Add the tuning to /etc/sysctl.d/60-vm-dirty.conf:
# Start writeout of pending writes earlier.
# Limit size of pending writeouts.
#
# When to start writeout (default: 10% of system RAM)
# 512 MiB
vm.dirty_background_bytes = 536870912
#
# Limit of pending writes (default: 20% of system RAM)
# 2GiB
vm.dirty_bytes = 2147483648
disk scheduler
We recommend using the deadline disk scheduler. Either hardcode it by passing elevator=deadline to the kernel on boot, or create a nifty udev rule file; see the tape pool tuning for an example.
NOTE: Recent multiqueue kernels already use the mq-deadline scheduler as default. Check what's available/used before trying to change it.
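To check which scheduler a device currently uses (and which are available), and to switch it at runtime, something like the following works; sdX is a placeholder for your device:

# The active scheduler is shown in brackets:
cat /sys/block/sdX/queue/scheduler
# Switch at runtime (use mq-deadline instead on multiqueue kernels):
echo deadline > /sys/block/sdX/queue/scheduler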
dCache tape pools
The goal: to be able to stream to/from a tape drive at a decent speed despite other IO happening. Linux/XFS is really bad at this; in the long run, ZFS on Linux is probably a better idea as it has some sort of IO pacing.
NOTE: These tunings are in addition to the disk pool tunings.
If you find that writes are starving reads, consider lowering vm.dirty*_bytes more. vm.dirty_bytes should be significantly smaller than the write cache in your raid controller.
Note however that you’ll want vm.dirty* to be at least a few multiples larger than the raid stripe size in order to be able to do full-stripe writes.
disk scheduler/readahead
We recommend using an udev rule to set the tuning attributes. This ensures that the tunings get reapplied if the device is recreated for some reason (device resets, changes, etc).
Add the following as /etc/udev/rules.d/99-iotuning-largedev.rules (verified on Ubuntu):
# Note the ugly trick to only match large devices: glob on the size attribute.
# dCache tape pool optimized tunings for large block devices:
# - Set scheduler to deadline
# - 64MB read-ahead gives somewhat decent streaming reads while writing.
# - Lowering nr_requests from default 128 improves read performance with a slight write penalty.
# - Tell the IO scheduler it's OK to starve writes more times in favor of reads
SUBSYSTEM=="block", ACTION=="add|change", ATTR{size}=="6233995932*", ATTR{bdi/read_ahead_kb}="65536", ATTR{queue/scheduler}="deadline", ATTR{queue/nr_requests}="64", ATTR{queue/iosched/writes_starved}="10"
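After installing the rule file, the rules can be reloaded and re-triggered without rebooting, and the resulting attributes verified; a sketch, with sdX as a placeholder for an affected device:

udevadm control --reload
udevadm trigger --subsystem-match=block --action=change
# Verify that the tunings took effect:
cat /sys/block/sdX/queue/scheduler /sys/block/sdX/queue/nr_requests
cat /sys/block/sdX/bdi/read_ahead_kb /sys/block/sdX/queue/iosched/writes_starved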
Write pool caveats
If you're still having problems keeping the tape drive streaming, investigate what the OS thinks the IO latencies are. The r_await column in the output of iostat -dxm device-name 4 is a good starting point. Consider worst-case r_await × tape-drive speed to be the absolute minimum requirement for bdi/read_ahead_kb.
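A hypothetical worked example of that rule of thumb (the numbers below are made up for illustration):

# Watch the r_await column (read latency in ms) while the drive is busy:
iostat -dxm sdX 4
# Suppose worst-case r_await is ~200 ms and the tape drive streams at 300 MB/s:
#   0.2 s * 300 MB/s = 60 MB  ->  bdi/read_ahead_kb needs to be at least ~61440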
If you find that writes still starve reads, consider changing queue/nr_requests even more. Note that changing nr_requests also affects read behaviour/performance. Increasing bdi/read_ahead_kb further might also help.
See also hardware specific issues (Smart Array controller queue depth for example).
Hardware specific issues
HP(E) Smart Array RAID controllers
Queue depth
Change the HP Smart Array controller queue depth from auto to 8 for more balanced read/write performance, e.g.:
hpssacli ctrl slot=X modify queuedepth=8
This is suggested primarily for tape pools.
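To inspect the current setting before changing it, the controller details can be listed; a sketch (the exact field names vary between tool and firmware versions):

hpssacli ctrl slot=X show | grep -i queue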
Write performance degradation due to overly large max_sectors_kb setting
Newer kernels include a change where the hpsa driver advertises a much larger maximum IO size (the RAID stripe size); it is present in Ubuntu Vivid 3.19.0 (and newer) kernels and CentOS 7 3.10.0-327.36.3 (and newer) kernels.
While the controllers can handle such big writes without error, performance suffers in some conditions.
A workaround is to cap max_sectors_kb to the old value used before this change.
The following udev rule can be used for this:
#
# HP/HPE Smart Array controllers can't handle IO:s much larger than 1 MiB with
# good performance although they can handle 4 MiB without error. The driver
# advertises the stripe size as maximum IO size, which can be substantially
# larger for bulk IO setups.
#
# Install this as
# /etc/udev/rules.d/90-smartarray-limitiosize.rules
# to limit max_sectors_kb to 512, the default before Linux 3.18.22.
#
# hpsa driver, Px1x and newer
SUBSYSTEM=="block", ACTION=="add|change", DRIVERS=="hpsa", ENV{DEVTYPE}=="disk", ATTR{queue/max_sectors_kb}="512"
#
# cciss driver, Px0x and older
SUBSYSTEM=="block", ACTION=="add|change", DRIVERS=="cciss", ENV{DEVTYPE}=="disk", ATTR{queue/max_sectors_kb}="512"
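After reloading and re-triggering udev (see the tape pool section for the udevadm commands), the effective limit can be verified per device; sdX is a placeholder:

# Should print 512 for disks behind a Smart Array controller:
cat /sys/block/sdX/queue/max_sectors_kb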
Experiences
UIO
We tried myriads (*) of combinations of hardware and OS configurations to prevent read requests from being stalled by a continuous stream of writes. To date (20141128), the only really helpful trick was to use cgroups to cap the bandwidth of writes at the block level. (*) It felt like myriads at least 😉
Here is what you need to do:
1. Check if cgroups and IO throttling are enabled in the kernel
$ grep CONFIG_BLK_CGROUP /boot/config-$(uname -r)
CONFIG_BLK_CGROUP=y
$ grep CONFIG_BLK_DEV_THROTTLING /boot/config-$(uname -r)
CONFIG_BLK_DEV_THROTTLING=y
2. Create a mount point for the cgroups interface and mount it
$ mkdir -p /cgroup/blkio
$ mount -t cgroup -o blkio none /cgroup/blkio
3. Enable non-root uids to write to /cgroup/blkio/blkio.throttle.write_bps_device (needed for integration with endit/tsmarchiver.pl)
$ chmod 666 /cgroup/blkio/blkio.throttle.write_bps_device
4. Determine major:minor of the device for which you want to throttle write bandwidth
$ stat -c "0x%t 0x%T" $(readlink -f /dev/mapper/cachevg-tapecachelv) | gawk --non-decimal-data '{ printf "%d:%d\n", $1, $2 }'
253:6
5. Cap write bandwidth at, for example, 300 MiB/s
$ echo "253:6 314572800" > /cgroup/blkio/blkio.throttle.write_bps_device
Managing NCQ on SATA hard drives in GNU/Linux
SATA NCQ (Native Command Queueing) is a feature supported by many hard drives and controllers that allows the drive to execute IO requests not in the order they arrive, but in a more optimal order, based (for example) on the relative position of the read/write head and the requested sectors.
If both the drive and the controller support it, NCQ is activated automatically. To disable NCQ for a given drive in GNU/Linux, set that drive's queue_depth (command queue depth) parameter to 1. By default (with NCQ enabled) this parameter is most often 31.
Viewing NCQ status
echo -n "NCQ depths: " ; cat /sys/block/sd?/device/queue_depth | tr "\n" " " ; echo
echo "NCQ depths:"; find /sys/block/sd? -maxdepth 0 -exec sh -c "echo -n <> =\ ; cat <>/device/queue_depth" \;
Managing NCQ
This parameter can be managed with the -Q option of hdparm. In addition, its value can be changed "manually" by writing the desired values to the corresponding files in /sys.
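For example, with hdparm (a sketch; -Q reads or sets the drive's command queue depth, and the flag syntax may differ slightly between hdparm versions):

hdparm -Q /dev/sda     # show the current queue depth
hdparm -Q1 /dev/sda    # disable NCQ
hdparm -Q31 /dev/sda   # re-enable NCQ with the usual depth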
Disable NCQ for sda:
echo 1 > /sys/block/sda/device/queue_depth
Enable NCQ for sda:
echo 31 > /sys/block/sda/device/queue_depth
Disable NCQ for all drives:
find /sys/block/sd?/device/queue_depth -exec sh -c "echo 1 > {}" \;
Enable NCQ for all drives:
find /sys/block/sd?/device/queue_depth -exec sh -c "echo 31 > {}" \;
How to check the current queue depth value?
I have an OEL server connected via fibre to a NetApp SAN. How can I view the queue depth as the OS sees it? Output from lspci:
05:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
05:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
08:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
08:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
Loaded storage-related kernel modules (lsmod):
qla2xxx               1262209  352
scsi_transport_fc       83145  1 qla2xxx
scsi_mod               199641  15 be2iscsi,ib_iser,iscsi_tcp,bnx2i,libcxgbi,libiscsi2,scsi_transport_iscsi2,scsi_dh,sr_mod,sg,qla2xxx,scsi_transport_fc,libata,cciss,sd_mod
/sys/class/fc_host/host3:
device  issue_lip  port_id  port_state  speed  subsystem  supported_speeds  system_hostname  uevent
fabric_name  node_name  port_name  port_type  statistics  supported_classes  symbolic_name  tgtid_bind_type
(host4, host5 and host6 contain the same entries.)
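One way to see the queue depth the OS actually uses is to read it from sysfs for each SCSI device, i.e. the same queue_depth attribute used in the NCQ section above; a sketch:

# Per-LUN queue depth as seen by the SCSI layer:
grep . /sys/bus/scsi/devices/*/queue_depth
# Or per block device:
cat /sys/block/sd*/device/queue_depth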