ceph upgrade: Job for ceph-FSID@osd.ID.service failed because a timeout was exceeded.

Timeout on cephadm upgrade

Posted March 3, 2023 by Frank Holtz ‐ 3 min read ‐ Category: Ceph

If cephadm stops starting a service after a timeout

If you upgrade Ceph with ceph orch upgrade, or if a service has crashed, cephadm may report timeout messages. In our case, the timeout occurred while starting the OSD services.

Output of ceph -W cephadm:

2023-03-03T18:46:54.603629+0100 mgr.HOST.daiycs [INF] Upgrade: Updating osd.36 (2/12)
2023-03-03T18:46:54.672130+0100 mgr.HOST.daiycs [INF] Deploying daemon osd.36 on HOST
2023-03-03T18:49:04.703865+0100 mgr.HOST.daiycs [ERR] cephadm exited with an error code: 1, stderr:Redeploy daemon osd.36 ...
Non-zero exit code 1 from systemctl start ceph-FSID@osd.36
systemctl: stderr Job for ceph-FSID@osd.36.service failed because a timeout was exceeded.
systemctl: stderr See "systemctl status ceph-FSID@osd.36.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "/var/lib/ceph/FSID/cephadm.ID", line 9248, in <module>
    main()
  File "/var/lib/ceph/FSID/cephadm.ID", line 9236, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/FSID/cephadm.ID", line 1990, in _default_image
    return func(ctx)
  File "/var/lib/ceph/FSID/cephadm.ID", line 5041, in command_deploy
    ports=daemon_ports)
  File "/var/lib/ceph/FSID/cephadm.ID", line 2952, in deploy_daemon
    c, osd_fsid=osd_fsid, ports=ports)
  File "/var/lib/ceph/FSID/cephadm.ID", line 3197, in deploy_daemon_units
    call_throws(ctx, ['systemctl', 'start', unit_name])
  File "/var/lib/ceph/FSID/cephadm.ID", line 1657, in call_throws
    raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: systemctl start ceph-FSID@osd.36: Job for ceph-FSID@osd.36.service failed because a timeout was exceeded.
See "systemctl status ceph-FSID@osd.36.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1456, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1353, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Redeploy daemon osd.36 ...
Non-zero exit code 1 from systemctl start ceph-FSID@osd.36
systemctl: stderr Job for ceph-FSID@osd.36.service failed because a timeout was exceeded.
systemctl: stderr See "systemctl status ceph-FSID@osd.36.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File "/var/lib/ceph/FSID/cephadm.ID", line 9248, in <module>
    main()
  File "/var/lib/ceph/FSID/cephadm.ID", line 9236, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/FSID/cephadm.ID", line 1990, in _default_image
    return func(ctx)
  File "/var/lib/ceph/FSID/cephadm.ID", line 5041, in command_deploy
    ports=daemon_ports)
  File "/var/lib/ceph/FSID/cephadm.ID", line 2952, in deploy_daemon
    c, osd_fsid=osd_fsid, ports=ports)
  File "/var/lib/ceph/FSID/cephadm.ID", line 3197, in deploy_daemon_units
    call_throws(ctx, ['systemctl', 'start', unit_name])
  File "/var/lib/ceph/FSID/cephadm.ID", line 1657, in call_throws
    raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: systemctl start ceph-FSID@osd.36: Job for ceph-FSID@osd.36.service failed because a timeout was exceeded.
See "systemctl status ceph-FSID@osd.36.service" and "journalctl -xe" for details.

2023-03-03T18:49:04.704379+0100 mgr.HOST.daiycs [ERR] Upgrade: Paused due to UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.36 on host HOST failed.

The cause is a hardcoded start timeout of 120 seconds (TimeoutStartSec=120) that cephadm writes into the systemd service definition.
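
To confirm that a host is affected, the timeout can be read straight from the unit file. The following is a minimal sketch: show_start_timeout is a hypothetical helper name, and the path in the usage comment assumes the unit template location used by the fix script in this post, with FSID standing in for the FSID of your cluster.

```shell
#!/bin/bash
# Hypothetical helper (not part of cephadm): print the TimeoutStartSec
# line(s) of the systemd unit file given as first argument.
show_start_timeout() {
    grep -E '^TimeoutStartSec=' "$1"
}

# Usage on a Ceph host (replace FSID with the FSID of your cluster):
# show_start_timeout /etc/systemd/system/ceph-FSID@.service
```

If the helper prints TimeoutStartSec=120, the host still uses the short timeout.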

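With a simple fix on each host, we can change this timeout value. The script below raises it to 720 seconds, and a cron job runs the script every minute so that the change is reapplied if the unit file is regenerated.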

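# Quote the heredoc delimiter so that ${SERVICE_FILE} is written literally
# instead of being expanded (to an empty string) by the current shell.
cat >/usr/local/sbin/ceph-timeout-fix <<'EOF'
#!/bin/bash
# Unit template of this cluster; replace the FSID with your own.
SERVICE_FILE="/etc/systemd/system/ceph-0f33077d-9ae3-4c43-b873-cacf0c2db052@.service"
# Nothing to do if the 120-second timeout is not present (any more).
grep -q "TimeoutStartSec=120" "${SERVICE_FILE}" || exit 0

# Raise the start timeout from 120 to 720 seconds and reload systemd.
sed -i 's/TimeoutStartSec=120/TimeoutStartSec=720/' "${SERVICE_FILE}"
/usr/bin/systemctl daemon-reload
EOF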

chmod 755 /usr/local/sbin/ceph-timeout-fix

cat >/etc/cron.d/ceph-timeout-fix <<EOF
* * * * *       root    /usr/local/sbin/ceph-timeout-fix
EOF
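
The substitution can be tried on a scratch file before touching a real unit file. This is only a sketch: raise_start_timeout is a hypothetical function name, and the scratch file contains just the two timeout lines rather than a complete cephadm unit file.

```shell
#!/bin/bash
# Hypothetical helper: apply the same substitution as ceph-timeout-fix
# to the file given as first argument (no systemctl daemon-reload here).
raise_start_timeout() {
    local unit_file="$1"
    # Leave the file untouched if the 120-second value is not present.
    grep -q 'TimeoutStartSec=120' "$unit_file" || return 0
    sed -i 's/TimeoutStartSec=120/TimeoutStartSec=720/' "$unit_file"
}

# Try it on a scratch file with made-up, minimal content.
scratch="$(mktemp)"
printf '[Service]\nTimeoutStartSec=120\nTimeoutStopSec=120\n' > "$scratch"
raise_start_timeout "$scratch"
cat "$scratch"
rm -f "$scratch"
```

Only the start timeout changes; TimeoutStopSec keeps its value of 120. Running the function a second time is harmless, because the grep check returns early once the old value is gone.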

After applying this on all hosts, the Ceph upgrade can be resumed with ceph orch upgrade resume.