Linux: how to restart a service automatically to avoid server downtime
A long-running Linux service can quit unexpectedly because of a crash, a signal, a kill, and so on. To restart it automatically and avoid or reduce downtime, use the systemd service restart policy.
Symptom
I had an nginx server running for months when I suddenly got an alarm from the monitoring service indicating that nginx was no longer serving requests. I could still ssh to the server, so the server itself was online. Checking the nginx status with systemctl status nginx showed that nginx was not running because of a core dump. Yes, even nginx may crash.
$ systemctl status nginx # or sudo service nginx status
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: core-dump) since Thu 2022-01-06 18:51:49 PST; 702ms ago
       Docs: man:nginx(8)
    Process: 720751 ExecReload=/usr/sbin/nginx -g daemon on; master_process on; -s reload (code=exited, status=0/SUCCES>
   Main PID: 700787 (code=dumped, signal=SEGV)
      Tasks: 0 (limit: 1110)
     Memory: 14.5M
     CGroup: /system.slice/nginx.service

Jan 06 10:57:38 systemd[1]: Reloading A high performance web server and a reverse proxy server.
Jan 06 10:57:38 systemd[1]: Reloaded A high performance web server and a reverse proxy server.
Jan 06 18:51:46 systemd[1]: Reloading A high performance web server and a reverse proxy server.
Jan 06 18:51:46 systemd[1]: Reloaded A high performance web server and a reverse proxy server.
Jan 06 18:51:49 systemd[1]: nginx.service: Main process exited, code=dumped, status=11/SEGV
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718970 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718971 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718970 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Killing process 718971 (nginx) with signal SIGKILL.
Jan 06 18:51:49 systemd[1]: nginx.service: Failed with result 'core-dump'.
ps -ef | grep nginx also confirms that no nginx process is running:
$ ps -ef|grep nginx
ubuntu 720761 718615 0 18:52 pts/0 00:00:00 grep --color=auto nginx
I can restart the nginx service manually to recover, but I would like the service to restart automatically so that downtime is kept to a minimum.
Solution to automatically restart Linux service to avoid downtime
Luckily, systemd, the Linux system and service manager, already provides this feature in the service configuration: you can specify a Restart policy.
Service Restart policy
The following is the full description of the Restart= option from the systemd.service(5) man page:
Configures whether the service shall be restarted when the service process exits, is killed, or a timeout is reached. The service process may be the main service process, but it may also be one of the processes specified with ExecStartPre=, ExecStartPost=, ExecStop=, ExecStopPost=, or ExecReload=. When the death of the process is a result of systemd operation (e.g. service stop or restart), the service will not be restarted. Timeouts include missing the watchdog “keep-alive ping” deadline and a service start, reload, and stop operation timeouts.
Restart value takes one of no, on-success, on-failure, on-abnormal, on-watchdog, on-abort, or always.
If set to no (the default), the service will not be restarted.
If set to on-success, it will be restarted only when the service process exits cleanly. In this context, a clean exit means any of the following:
- exit code of 0;
- for types other than Type=oneshot, one of the signals SIGHUP, SIGINT, SIGTERM, or SIGPIPE;
- exit statuses and signals specified in SuccessExitStatus=.
If set to on-failure, the service will be restarted when the process exits with a non-zero exit code, is terminated by a signal (including on core dump, but excluding the aforementioned four signals), when an operation (such as service reload) times out, and when the configured watchdog timeout is triggered.
If set to on-abnormal, the service will be restarted when the process is terminated by a signal (including on core dump, excluding the aforementioned four signals), when an operation times out, or when the watchdog timeout is triggered.
If set to on-abort, the service will be restarted only if the service process exits due to an uncaught signal not specified as a clean exit status.
If set to on-watchdog, the service will be restarted only if the watchdog timeout for the service expires.
If set to always, the service will be restarted regardless of whether it exited cleanly or not, got terminated abnormally by a signal, or hit a timeout.
Table: Exit causes and the effect of the Restart= settings
┌──────────────────────────────┬────┬────────┬────────────┬────────────┬─────────────┬──────────┬─────────────┐
│ Restart settings/Exit causes │ no │ always │ on-success │ on-failure │ on-abnormal │ on-abort │ on-watchdog │
├──────────────────────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│ Clean exit code or signal    │    │ X      │ X          │            │             │          │             │
├──────────────────────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│ Unclean exit code            │    │ X      │            │ X          │             │          │             │
├──────────────────────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│ Unclean signal               │    │ X      │            │ X          │ X           │ X        │             │
├──────────────────────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│ Timeout                      │    │ X      │            │ X          │ X           │          │             │
├──────────────────────────────┼────┼────────┼────────────┼────────────┼─────────────┼──────────┼─────────────┤
│ Watchdog                     │    │ X      │            │ X          │ X           │          │ X           │
└──────────────────────────────┴────┴────────┴────────────┴────────────┴─────────────┴──────────┴─────────────┘
As exceptions to the setting above, the service will not be restarted if the exit code or signal is specified in RestartPreventExitStatus= or the service is stopped with systemctl stop or an equivalent operation. Also, the services will always be restarted if the exit code or signal is specified in RestartForceExitStatus=.
Note that service restart is subject to unit start rate limiting configured with StartLimitIntervalSec= and StartLimitBurst=, see systemd.unit(5) for details. A restarted service enters the failed state only after the start limits are reached.
Setting this to on-failure is the recommended choice for long-running services, in order to increase reliability by attempting automatic recovery from errors. For services that shall be able to terminate on their own choice (and avoid immediate restarting), on-abnormal is an alternative choice.
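To make the table easier to reason about, here is a minimal shell sketch that encodes it as a lookup. The function name should_restart and the cause labels (clean, unclean-exit, unclean-signal, timeout, watchdog) are made up for this illustration and are not systemd identifiers. The sketch also ignores the RestartPreventExitStatus=, RestartForceExitStatus= and start rate limiting exceptions described above.

```shell
# should_restart POLICY CAUSE
# Prints "yes" when the table marks the combination with an X, else "no".
# CAUSE is one of: clean, unclean-exit, unclean-signal, timeout, watchdog
# (hypothetical labels for the table rows).
should_restart() {
    policy=$1
    cause=$2
    case "$policy:$cause" in
        always:*)
            echo yes ;;
        on-success:clean)
            echo yes ;;
        on-failure:unclean-exit|on-failure:unclean-signal|on-failure:timeout|on-failure:watchdog)
            echo yes ;;
        on-abnormal:unclean-signal|on-abnormal:timeout|on-abnormal:watchdog)
            echo yes ;;
        on-abort:unclean-signal)
            echo yes ;;
        on-watchdog:watchdog)
            echo yes ;;
        *)
            # Covers Restart=no and every blank cell in the table.
            echo no ;;
    esac
}
```

For the core dump in the symptom above, should_restart on-failure unclean-signal prints yes, while should_restart on-success unclean-signal prints no.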
Change Restart policy
systemctl has an edit command to override a service's configuration:
edit UNIT. Edit a drop-in snippet or a whole replacement file if --full is specified, to extend or override the specified unit.
To override the existing unit file for nginx, run sudo systemctl edit nginx, then paste the following two lines to set the Restart policy to always, so that the service is restarted regardless of whether it exited cleanly, was terminated abnormally by a signal, or hit a timeout:
[Service]
Restart=always
Save and quit. systemctl edit reloads the systemd configuration once the editor exits, so the change should take effect immediately.
The sudo systemctl edit nginx command writes the override to /etc/systemd/system/nginx.service.d/override.conf; you can use cat to see its content.
$ cat /etc/systemd/system/nginx.service.d/override.conf
[Service]
Restart=always
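If opening an editor is not convenient (for example in a provisioning script), the same drop-in could be written by hand. The following is a rough sketch, not part of the original procedure: the helper name write_restart_dropin is hypothetical, and it takes the drop-in directory as an argument so it can be tried against a scratch directory first.

```shell
# write_restart_dropin DIR [POLICY]
# Creates DIR/override.conf containing a [Service] section with the given
# Restart= policy (defaults to "always"). Hypothetical helper for illustration.
write_restart_dropin() {
    dropin_dir=$1
    policy=${2:-always}
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nRestart=%s\n' "$policy" > "$dropin_dir/override.conf"
}

# Example for the nginx unit from this article (run as root, not executed here):
#   write_restart_dropin /etc/systemd/system/nginx.service.d always
#   systemctl daemon-reload
```

Unlike systemctl edit, a file written this way is only picked up after systemctl daemon-reload.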
You can also check the full configuration of the nginx service unit, including any drop-ins, with systemctl cat nginx.service:
$ systemctl cat nginx.service
# /lib/systemd/system/nginx.service
# Stop dance for nginx
# =======================
#
# ExecStop sends SIGSTOP (graceful stop) to the nginx process.
# If, after 5s (--retry QUIT/5) nginx is still running, systemd takes control
# and sends SIGTERM (fast shutdown) to the main process.
# After another 5s (TimeoutStopSec=5), and if nginx is alive, systemd sends
# SIGKILL to all the remaining processes in the process group (KillMode=mixed).
#
# nginx signals reference doc:
# http://nginx.org/en/docs/control.html
#
[Unit]
Description=A high performance web server and a reverse proxy server
Documentation=man:nginx(8)
After=network.target

[Service]
Type=forking
PIDFile=/run/nginx.pid
ExecStartPre=/usr/sbin/nginx -t -q -g 'daemon on; master_process on;'
ExecStart=/usr/sbin/nginx -g 'daemon on; master_process on;'
ExecReload=/usr/sbin/nginx -g 'daemon on; master_process on;' -s reload
ExecStop=-/sbin/start-stop-daemon --quiet --stop --retry QUIT/5 --pidfile /run/nginx.pid
TimeoutStopSec=5
KillMode=mixed

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/nginx.service.d/override.conf
[Service]
Restart=always
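In the output above the drop-in is printed after the main unit file, and for a single-valued setting such as Restart= the last assignment is the one that applies. As a rough sketch of that rule, the hypothetical helper below reads unit text on stdin and prints the last Restart= value it sees, falling back to the default of no.

```shell
# effective_restart
# Reads unit file text (for example the output of `systemctl cat`) on stdin
# and prints the value of the last Restart= line, or "no" when there is none.
# Hypothetical helper; it does not handle indentation or line continuations.
effective_restart() {
    awk -F= '
        /^Restart=/ { value = $2 }
        END { print (value == "" ? "no" : value) }
    '
}

# Example (not executed here):
#   systemctl cat nginx.service | effective_restart
```

On a live system, systemctl show --property=Restart --value nginx.service asks systemd directly for the policy in effect and is the more reliable check.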
$ sudo systemctl start nginx
$ systemctl status nginx
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/nginx.service.d
             └─override.conf
     Active: active (running) since Tue 2022-02-01 15:18:18 PST; 4s ago
       Docs: man:nginx(8)
    Process: 2302672 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 2302673 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
   Main PID: 2302674 (nginx)
      Tasks: 3 (limit: 1113)
     Memory: 9.2M
     CGroup: /system.slice/nginx.service
             ├─2302674 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
             ├─2302675 nginx: worker process
             └─2302676 nginx: worker process
To test, kill the nginx processes with sudo pkill nginx, then check the status again with systemctl status nginx. You should see that nginx is active (running) but with different process IDs, which indicates that the nginx service restarted automatically. Cheers.
$ sudo pkill nginx
$ systemctl status nginx
● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/nginx.service.d
             └─override.conf
     Active: active (running) since Tue 2022-02-01 15:18:56 PST; 2s ago
       Docs: man:nginx(8)
    Process: 2302830 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
    Process: 2302831 ExecStart=/usr/sbin/nginx -g daemon on; master_process on; (code=exited, status=0/SUCCESS)
   Main PID: 2302832 (nginx)
      Tasks: 3 (limit: 1113)
     Memory: 9.8M
     CGroup: /system.slice/nginx.service
             ├─2302832 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
             ├─2302833 nginx: worker process
             └─2302834 nginx: worker process
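The manual check above (compare Main PID before and after the kill) can also be scripted. The following is only a sketch: the function names are hypothetical, the 2 second wait is an arbitrary choice (the default RestartSec= is 100ms), and it assumes a systemd recent enough to support systemctl show --value.

```shell
# main_pid UNIT: print the unit's current main PID (0 when it is not running).
main_pid() {
    systemctl show --property=MainPID --value "$1"
}

# pid_changed BEFORE AFTER: succeed when AFTER is a running PID that differs
# from BEFORE.
pid_changed() {
    [ -n "$2" ] && [ "$2" != "0" ] && [ "$1" != "$2" ]
}

# check_auto_restart UNIT: kill the main process and report whether systemd
# brought the unit back with a new main PID. Hypothetical helper.
check_auto_restart() {
    unit=$1
    before=$(main_pid "$unit")
    # Refuse to continue when the unit is not running: "kill 0" would signal
    # the caller's whole process group.
    if [ -z "$before" ] || [ "$before" = "0" ]; then
        echo "$unit is not running"
        return 1
    fi
    sudo kill "$before" || return 1
    sleep 2
    after=$(main_pid "$unit")
    if pid_changed "$before" "$after"; then
        echo "$unit restarted: PID $before -> $after"
    else
        echo "$unit did not restart (PID before=$before, after=$after)"
        return 1
    fi
}

# Example (not executed here): check_auto_restart nginx.service
```

Only run check_auto_restart against a service you can afford to interrupt briefly.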
Summary
To configure a service to restart automatically, run sudo systemctl edit followed by the service name and add the following configuration:
[Service]
Restart=always
Save and quit. It's that simple.