Hang with innodb_use_native_aio=0 and innodb_write_io_threads=1
The reason seems to be the following combination of parameters:
To add confusion, InnoDB will claim that it is using native AIO (because innodb_use_native_aio=1 by default), but a little later, it will silently change to simulated asynchronous I/O (innodb_use_native_aio=0) due to io_setup() failure.
The hang is possible, because starting with MDEV-23855 , the doublewrite buffer issues asynchronous writes. On buf_dblwr_t::flush_buffered_writes_completed(), up to 128 page writes may be submitted. But, there may already be 128 outstanding page writes. The thread would then hang here, because the maximum number of outstanding requests per thread is 256:
#14 0x000055c699979745 in io_slots::acquire (this=0x55c69c565d50) at /home/mariadb/semaphore1/10.5-enterprise/storage/innobase/os/os0file.cc:103
#15 0x000055c699977bfe in os_aio (type=. buf=0x7f9c7847c000, offset=4128768, n=16384) at /home/mariadb/semaphore1/10.5-enterprise/storage/innobase/os/os0file.cc:4211
#16 0x000055c699c14ce0 in fil_space_t::io (this=0x55c69cd565e8, type=. offset=4128768, len=16384, buf=0x7f9c7847c000, bpage=0x7f9c78009ab0)
at /home/mariadb/semaphore1/10.5-enterprise/storage/innobase/fil/fil0fil.cc:3431
#17 0x000055c699b9113c in buf_dblwr_t::flush_buffered_writes_completed (this=0x55c69b62b960 , request=. ) at /home/mariadb/semaphore1/10.5-enterprise/storage/innobase/buf/buf0dblwr.cc:678
#18 0x000055c699c14fb9 in fil_aio_callback (request=. ) at /home/mariadb/semaphore1/10.5-enterprise/storage/innobase/fil/fil0fil.cc:3466
#19 0x000055c699976827 in io_callback (cb=0x55c69c565f40) at /home/mariadb/semaphore1/10.5-enterprise/storage/innobase/os/os0file.cc:3918
#20 0x000055c699ca75dd in tpool::simulated_aio::simulated_aio_callback (param=0x55c69c565f40) at /home/mariadb/semaphore1/10.5-enterprise/tpool/aio_simulated.cc:162
A possible workaround could be to make innodb_write_io_threads=2 the minimum value.
Thanks to wlad for participating in the debugging.
Innodb using linux native aio
If you follow best practices for database design and tuning techniques for SQL operations, but your database is still slow due to heavy disk I/O activity, consider these disk I/O optimizations. If the Unix top tool or the Windows Task Manager shows that the CPU usage percentage with your workload is less than 70%, your workload is probably disk-bound.
- Increase buffer pool size When table data is cached in the InnoDB buffer pool, it can be accessed repeatedly by queries without requiring any disk I/O. Specify the size of the buffer pool with the innodb_buffer_pool_size option. This memory area is important enough that it is typically recommended that innodb_buffer_pool_size is configured to 50 to 75 percent of system memory. For more information see, Section 8.12.3.1, “How MySQL Uses Memory”.
- Adjust the flush method In some versions of GNU/Linux and Unix, flushing files to disk with the Unix fsync() call (which InnoDB uses by default) and similar methods is surprisingly slow. If database write performance is an issue, conduct benchmarks with the innodb_flush_method parameter set to O_DSYNC .
- Configure a threshold for operating system flushes By default, when InnoDB creates a new data file, such as a new log file or tablespace file, the file is fully written to the operating system cache before it is flushed to disk, which can cause a large amount of disk write activity to occur at once. To force smaller, periodic flushes of data from the operating system cache, you can use the innodb_fsync_threshold variable to define a threshold value, in bytes. When the byte threshold is reached, the contents of the operating system cache are flushed to disk. The default value of 0 forces the default behavior, which is to flush data to disk only after a file is fully written to the cache. Specifying a threshold to force smaller, periodic flushes may be beneficial in cases where multiple MySQL instances use the same storage devices. For example, creating a new MySQL instance and its associated data files could cause large surges of disk write activity, impeding the performance of other MySQL instances that use the same storage devices. Configuring a threshold helps avoid such surges in write activity.
- Use fdatasync() instead of fsync() On platforms that support fdatasync() system calls, the innodb_use_fdatasync variable, introduced in MySQL 8.0.26, permits using fdatasync() instead of fsync() for operating system flushes. An fdatasync() system call does not flush changes to file metadata unless required for subsequent data retrieval, providing a potential performance benefit. A subset of innodb_flush_method settings such as fsync , O_DSYNC , and O_DIRECT use fsync() system calls. The innodb_use_fdatasync variable is applicable when using those settings.
- Use a noop or deadline I/O scheduler with native AIO on Linux InnoDB uses the asynchronous I/O subsystem (native AIO) on Linux to perform read-ahead and write requests for data file pages. This behavior is controlled by the innodb_use_native_aio configuration option, which is enabled by default. With native AIO, the type of I/O scheduler has greater influence on I/O performance. Generally, noop and deadline I/O schedulers are recommended. Conduct benchmarks to determine which I/O scheduler provides the best results for your workload and environment. For more information, see Section 15.8.6, “Using Asynchronous I/O on Linux”.
- Use direct I/O on Solaris 10 for x86_64 architecture When using the InnoDB storage engine on Solaris 10 for x86_64 architecture (AMD Opteron), use direct I/O for InnoDB -related files to avoid degradation of InnoDB performance. To use direct I/O for an entire UFS file system used for storing InnoDB -related files, mount it with the forcedirectio option; see mount_ufs(1M) . (The default on Solaris 10/x86_64 is not to use this option.) To apply direct I/O only to InnoDB file operations rather than the whole file system, set innodb_flush_method = O_DIRECT . With this setting, InnoDB calls directio() instead of fcntl() for I/O to data files (not for I/O to log files).
- Use raw storage for data and log files with Solaris 2.6 or later When using the InnoDB storage engine with a large innodb_buffer_pool_size value on any release of Solaris 2.6 and up and any platform (sparc/x86/x64/amd64), conduct benchmarks with InnoDB data files and log files on raw devices or on a separate direct I/O UFS file system, using the forcedirectio mount option as described previously. (It is necessary to use the mount option rather than setting innodb_flush_method if you want direct I/O for the log files.) Users of the Veritas file system VxFS should use the convosync=direct mount option. Do not place other MySQL data files, such as those for MyISAM tables, on a direct I/O file system. Executables or libraries must not be placed on a direct I/O file system.
- Use additional storage devices Additional storage devices could be used to set up a RAID configuration. For related information, see Section 8.12.1, “Optimizing Disk I/O”. Alternatively, InnoDB tablespace data files and log files can be placed on different physical disks. For more information, refer to the following sections:
- innodb_checksum_algorithm The crc32 option uses a faster checksum algorithm and is recommended for fast storage systems.
- innodb_flush_neighbors Optimizes I/O for rotational storage devices. Disable it for non-rotational storage or a mix of rotational and non-rotational storage. It is disabled by default.
- innodb_idle_flush_pct Permits placing a limit on page flushing during idle periods, which can help extend the life of non-rotational storage devices. Introduced in MySQL 8.0.18.
- innodb_io_capacity The default setting of 200 is generally sufficient for a lower-end non-rotational storage device. For higher-end, bus-attached devices, consider a higher setting such as 1000.
- innodb_io_capacity_max The default value of 2000 is intended for workloads that use non-rotational storage. For a high-end, bus-attached non-rotational storage device, consider a higher setting such as 2500.
- innodb_log_compressed_pages If redo logs are on non-rotational storage, consider disabling this option to reduce logging. See Disable logging of compressed pages.
- innodb_log_file_size (deprecated in MySQL 8.0.30) If redo logs are on non-rotational storage, configure this option to maximize caching and write combining.
- innodb_redo_log_capacity If redo logs are on non-rotational storage, configure this option to maximize caching and write combining.
- innodb_page_size Consider using a page size that matches the internal sector size of the disk. Early-generation SSD devices often have a 4KB sector size. Some newer devices have a 16KB sector size. The default InnoDB page size is 16KB. Keeping the page size close to the storage device block size minimizes the amount of unchanged data that is rewritten to disk.
- binlog_row_image If binary logs are on non-rotational storage and all tables have primary keys, consider setting this option to minimal to reduce logging.
Ensure that TRIM support is enabled for your operating system. It is typically enabled by default.
- History list length low, below a few thousand.
- Insert buffer merges close to rows inserted.
- Modified pages in buffer pool consistently well below innodb_max_dirty_pages_pct of the buffer pool. (Measure at a time when the server is not doing bulk inserts; it is normal during bulk inserts for the modified pages percentage to rise significantly.)
- Log sequence number — Last checkpoint is at less than 7/8 or ideally less than 6/8 of the total size of the InnoDB log files.
Because the doublewrite buffer setting is global, the doublewrite buffer is also disabled for data files that do not reside on Fusion-io hardware.
Innodb using linux native aio
InnoDB uses the asynchronous I/O subsystem (native AIO) on Linux to perform read-ahead and write requests for data file pages. This behavior is controlled by the innodb_use_native_aio configuration option, which applies to Linux systems only and is enabled by default. On other Unix-like systems, InnoDB uses synchronous I/O only. Historically, InnoDB only used asynchronous I/O on Windows systems. Using the asynchronous I/O subsystem on Linux requires the libaio library.
With synchronous I/O, query threads queue I/O requests, and InnoDB background threads retrieve the queued requests one at a time, issuing a synchronous I/O call for each. When an I/O request is completed and the I/O call returns, the InnoDB background thread that is handling the request calls an I/O completion routine and returns to process the next request. The number of requests that can be processed in parallel is n , where n is the number of InnoDB background threads. The number of InnoDB background threads is controlled by innodb_read_io_threads and innodb_write_io_threads . See Section 15.8.5, “Configuring the Number of Background InnoDB I/O Threads”.
With native AIO, query threads dispatch I/O requests directly to the operating system, thereby removing the limit imposed by the number of background threads. InnoDB background threads wait for I/O events to signal completed requests. When a request is completed, a background thread calls an I/O completion routine and resumes waiting for I/O events.
The advantage of native AIO is scalability for heavily I/O-bound systems that typically show many pending reads/writes in SHOW ENGINE INNODB STATUS\G output. The increase in parallel processing when using native AIO means that the type of I/O scheduler or properties of the disk array controller have a greater influence on I/O performance.
A potential disadvantage of native AIO for heavily I/O-bound systems is lack of control over the number of I/O write requests dispatched to the operating system at once. Too many I/O write requests dispatched to the operating system for parallel processing could, in some cases, result in I/O read starvation, depending on the amount of I/O activity and system capabilities.
If a problem with the asynchronous I/O subsystem in the OS prevents InnoDB from starting, you can start the server with innodb_use_native_aio=0 . This option may also be disabled automatically during startup if InnoDB detects a potential problem such as a combination of tmpdir location, tmpfs file system, and Linux kernel that does not support asynchronous I/O on tmpfs .