RHEL Labs - Solving a Full Filesystem Issue
Can you fix this broken RHEL system?
Full Filesystem #
Scott: “Ok, those web folks have stopped bothering me for some reason, guess they finally figured out THEIR problem. Oh hey, not a big deal, but the database server’s been alerting all night that its disk is full. I just ack’d it, those emails were really getting annoying.”
- Can you figure out why the disk is full and fix it?
- Free up some disk space before the database gets corrupted!
df tells us that /dev/sda2, mounted on the root filesystem /, is full:
root@rhel:~# df -HT --total
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 4.2M 0 4.2M 0% /dev
tmpfs tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs tmpfs 767M 8.9M 759M 2% /run
/dev/sda2 xfs 22G 22G 1.4M 100% /
/dev/sda1 vfat 210M 7.4M 203M 4% /boot/efi
/dev/sdb ext4 511M 57M 417M 13% /opt/instruqt/bootstrap
tmpfs tmpfs 384M 0 384M 0% /run/user/0
total - 26G 22G 3.7G 86% -
root@rhel:~#
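The same check can be scripted instead of read by eye. This is a minimal sketch, assuming GNU coreutils df (for --output); the function name flag_full and the 90% threshold are made up for illustration:

```shell
# Print every mount point whose usage is at or above a threshold (in percent).
# Reads "Use% mountpoint" pairs on stdin, e.g. from GNU df --output=pcent,target.
# flag_full and the default of 90 are illustrative choices, not part of the lab.
flag_full() {
    threshold=${1:-90}
    while read -r pcent target; do
        usage=${pcent%\%}                 # strip the trailing "%"
        case $usage in
            ''|*[!0-9]*) continue ;;      # skip the header line and "-" entries
        esac
        if [ "$usage" -ge "$threshold" ]; then
            echo "$target is at $usage%"
        fi
    done
}

# Usage on a live system:
df --output=pcent,target | flag_full 90
```

Fed the df output above, this would report only the root filesystem.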
Running du under / tells us that the /var directory is using 17G of space:
root@rhel:/# du -ch -d 1 2> /dev/null
327M ./boot
0 ./dev
0 ./proc
8.5M ./run
0 ./sys
4.0K ./tmp
22M ./etc
84K ./root
17G ./var
2.6G ./usr
0 ./afs
176K ./home
0 ./media
0 ./mnt
55M ./opt
0 ./srv
20G .
20G total
Inside /var, it’s the log directory:
root@rhel:/var# du -ch -d 1 2> /dev/null
209M ./lib
17G ./log
0 ./adm
127M ./cache
0 ./db
0 ./empty
0 ./ftp
0 ./games
0 ./local
0 ./nis
0 ./opt
0 ./preserve
8.0K ./spool
4.0K ./tmp
0 ./yp
0 ./kerberos
0 ./crash
0 ./www
17G .
17G total
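The drill-down above can be wrapped in a small helper that sorts the result, so the culprit is always near the top. A sketch only: biggest_dirs is a hypothetical name, and the flags assume GNU du, sort and head:

```shell
# List the largest entries directly under a directory, biggest first (sizes in KiB).
# -x keeps du on one filesystem, so other mounts such as /boot/efi are not counted.
# The first output line is the directory itself, i.e. the total.
# biggest_dirs is a hypothetical helper name.
biggest_dirs() {
    dir=${1:-/}
    count=${2:-10}
    du -x -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Usage: the five largest entries under /var
biggest_dirs /var 5
```

Repeating the call on the largest subdirectory each time follows the same path as the manual steps: /, then /var, then /var/log.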
The file /var/log/biglog is 17G in size:
root@rhel:/var/log# ls -lhS | head
total 17G
-rw-r--r--. 1 root root 17G Aug 6 16:20 biglog
-rw-------. 1 root root 974K Nov 14 2023 messages-20231114
-rw-r--r--. 1 root root 957K Aug 6 16:18 dnf.log
-rw-rw-r--. 1 root utmp 289K Aug 6 16:17 lastlog
-rw-r--r--. 1 root root 269K Aug 6 16:18 dnf.librepo.log
-rw-------. 1 root root 260K May 27 14:10 messages-20240806
-rw-r--r--. 1 root root 215K Aug 6 16:18 dnf.rpm.log
-rw-------. 1 root root 214K Nov 14 2023 messages-20240527
-rw-------. 1 root root 124K Aug 6 16:23 messages
Alternatively, du -aBm | sort -rn | head lists the largest files and directories with sizes in megabytes.
Deleting the file before killing the process using it #
Before we delete the file, let’s check whether a process is using it:
root@rhel:/var/log# lsof biglog
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
vim 2158 root 3r REG 8,2 17652908032 36327 biglog
We can see that vim with PID 2158, running as root, has file descriptor 3 open for reading (r) on a regular (REG) file called biglog. This file is on the device with ID 8,2, and it’s 17.6 GB in size with inode number 36327.
If we delete the file first without killing vim, the file descriptor will stay open:
root@rhel:/var/log# rm -f biglog && lsof biglog
lsof: status error on biglog: No such file or directory
root@rhel:/var/log# lsof | grep -i deleted
dbus-brok 595 dbus 12u REG 0,1 2097152 3 /memfd:dbus-broker-log (deleted)
firewalld 607 root 9u REG 0,1 4096 4 /memfd:libffi (deleted)
firewalld 607 879 gmain root 9u REG 0,1 4096 4 /memfd:libffi (deleted)
vim 2158 root 3r REG 8,2 17652908032 36327 /var/log/biglog (deleted)
root@rhel:/var/log#
When you delete a file in Linux while a process is still using it, the file’s data remains on disk because the system only removes the directory entry, not the actual data. The file stays accessible to the process through its open file descriptor (FD), which points to the file’s inode. The file’s data is fully removed from the disk only after all FDs referencing it are closed.
Terminating vim closes its descriptor and reclaims the 17G of disk space:
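The behaviour described above is easy to reproduce safely on a scratch file. A minimal sketch, using a temporary file and descriptor 3 in the current shell rather than vim and biglog:

```shell
# Scratch file created with mktemp; nothing here touches /var/log.
tmpfile=$(mktemp)
echo "still here" > "$tmpfile"

# Open the file for reading on file descriptor 3, then remove its name.
exec 3< "$tmpfile"
rm -f "$tmpfile"

# The directory entry is gone...
if [ ! -e "$tmpfile" ]; then name_gone=yes; else name_gone=no; fi

# ...but the data is still reachable through the open descriptor.
content=$(cat <&3)

# Closing the descriptor drops the last reference; only now can the blocks be freed.
exec 3<&-

echo "name gone: $name_gone"
echo "content read via FD 3: $content"
```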
root@rhel:/var/log# kill -9 2158
root@rhel:/var/log# lsof | grep -i deleted
dbus-brok 595 dbus 12u REG 0,1 2097152 3 /memfd:dbus-broker-log (deleted)
firewalld 607 root 9u REG 0,1 4096 4 /memfd:libffi (deleted)
firewalld 607 879 gmain root 9u REG 0,1 4096 4 /memfd:libffi (deleted)
Available space is now 18G on /dev/sda2.
root@rhel:/var/log# df -HT --total
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 4.2M 0 4.2M 0% /dev
tmpfs tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs tmpfs 767M 8.9M 759M 2% /run
/dev/sda2 xfs 22G 3.6G 18G 17% /
/dev/sda1 vfat 210M 7.4M 203M 4% /boot/efi
/dev/sdb ext4 511M 57M 417M 13% /opt/instruqt/bootstrap
total - 25G 3.7G 21G 15% -
root@rhel:/var/log#
Killing the process using the file first #
When the process using the file exits or is killed first, you can simply delete the file afterwards and get the space back right away:
root@rhel:~# lsof /var/log/biglog
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
vim 2162 root 3r REG 8,2 17651924992 36319 /var/log/biglog
root@rhel:~# kill -9 2162
root@rhel:~# lsof /var/log/biglog
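If lsof is not installed, the same "is anything still holding this file?" check can be approximated by walking /proc. This is a rough sketch, not a replacement for lsof or fuser: holders is a hypothetical helper, it only sees processes you are allowed to inspect (run it as root), and it ignores memory-mapped files:

```shell
# Print the PIDs of processes that hold the given file open, one per line.
# holders is a hypothetical helper; lsof and fuser are more complete.
holders() {
    target=$(readlink -f "$1")
    for link in /proc/[0-9]*/fd/*; do
        if [ "$(readlink "$link" 2>/dev/null)" = "$target" ]; then
            pid=${link#/proc/}            # "2162/fd/3"
            echo "${pid%%/*}"             # "2162"
        fi
    done | sort -un
}

# Report when nothing has the file open (path taken from the lab).
file=/var/log/biglog
if [ -e "$file" ] && [ -z "$(holders "$file")" ]; then
    echo "no open descriptors on $file, deleting it will free the space"
fi
```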
Recover deleted file #
If a process has an open file descriptor (FD) to a file (like vim in this case, or a daemon like httpd that keeps its log files open until it restarts, stops, or the logs are rotated) and you accidentally delete the file, it’s very easy to recover it.
Get the PID of the process using the file:
root@rhel:~# rm -f /var/log/biglog
root@rhel:~# lsof | grep deleted
dbus-brok 596 dbus 12u REG 0,1 2097152 3 /memfd:dbus-broker-log (deleted)
firewalld 608 root 9u REG 0,1 4096 4 /memfd:libffi (deleted)
firewalld 608 866 gmain root 9u REG 0,1 4096 4 /memfd:libffi (deleted)
vim 2160 root 3r REG 8,2 17652908032 36319 /var/log/biglog (deleted)
Look for the file descriptor pointing to the file:
root@rhel:~# ls -l /proc/2160/fd/
total 0
lrwx------. 1 root root 64 Aug 6 16:43 0 -> /dev/pts/0
lrwx------. 1 root root 64 Aug 6 16:43 1 -> /dev/pts/0
lrwx------. 1 root root 64 Aug 6 16:43 2 -> /dev/pts/0
lr-x------. 1 root root 64 Aug 6 16:43 3 -> '/var/log/biglog (deleted)'
lrwx------. 1 root root 64 Aug 6 16:43 4 -> /var/log/.biglog.swp
root@rhel:~#
As a rule of thumb, I copy the recovered file to the same filesystem as the original. cp reads the data through the descriptor and writes it to a brand-new inode, so copying to another filesystem should also work, but I haven’t tried it:
root@rhel:~# mkdir /tmp/recover
root@rhel:~#
root@rhel:~# cp /proc/2160/fd/3 /tmp/recover/
root@rhel:~#
root@rhel:~# ls -lSh /tmp/recover/
total 17G
-rw-r--r--. 1 root root 17G Aug 6 16:45 3
root@rhel:~#
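The whole recovery can be rehearsed on a throwaway file before doing it for real. A sketch under stated assumptions: a background sleep stands in for vim, demo.log is a made-up file name, and the script runs as the same user as the process it inspects:

```shell
workdir=$(mktemp -d)
echo "important data" > "$workdir/demo.log"

# Stand-in for vim/httpd: a background process holding the file open on FD 0.
sleep 60 < "$workdir/demo.log" &
pid=$!

# Wait (up to about 5 seconds) until the child has actually opened the file.
tries=0
while [ "$tries" -lt 50 ]; do
    case $(readlink "/proc/$pid/fd/0") in
        */demo.log) break ;;
    esac
    sleep 0.1
    tries=$((tries + 1))
done

rm -f "$workdir/demo.log"             # the "accident"

# Find the descriptor whose target is marked "(deleted)" and copy the data back.
for link in "/proc/$pid/fd/"*; do
    case $(readlink "$link") in
        *"/demo.log (deleted)") cp "$link" "$workdir/demo.log" ;;
    esac
done

recovered=$(cat "$workdir/demo.log")
echo "recovered: $recovered"

kill "$pid"
wait "$pid" 2>/dev/null
```

The copy has to happen while the process is still running: once it exits, the last descriptor closes and the data is gone.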
Little experiment with httpd #
I tried this with Apache/httpd, but it applies to any process that keeps a file descriptor open on a file: delete /var/log/httpd/access_log, list the file descriptors of the process, and copy the file back to the same location on disk. Apache will continue to write to the deleted file until it is restarted. After the restart, it begins writing to the new access_log, which has a different inode.
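The experiment can be simulated without Apache. In this sketch the shell itself plays the daemon by keeping descriptor 3 open for appending on a scratch file, so /proc/self is used where you would normally put the daemon's PID; stat -c assumes GNU coreutils:

```shell
log=$(mktemp)
echo "line 1" > "$log"

exec 3>> "$log"                   # stand-in for httpd: keep the log open for appending
old_inode=$(stat -c %i "$log")

rm -f "$log"                      # delete the log while it is open
cp /proc/self/fd/3 "$log"         # copy the data back to the same path
new_inode=$(stat -c %i "$log")

echo "line 2" >&3                 # the writer still appends to the deleted inode

restored=$(cat "$log")            # only "line 1": the new file misses later writes

exec 3>&-                         # "restart": drop the old descriptor...
exec 3>> "$log"                   # ...and reopen the path, which is now the new inode
echo "line 3" >&3
exec 3>&-

echo "old inode: $old_inode, new inode: $new_inode"
echo "restored file before restart: $restored"
echo "restored file after restart:"
cat "$log"
```

Anything written between the copy and the restart ("line 2" here) lands in the deleted inode and is lost when the old descriptor closes.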