RHEL Labs - Solving a Full Filesystem Issue

Can you fix this broken RHEL system?

Full Filesystem #

Scott: “Ok, those web folks have stopped bothering me for some reason, guess they finally figured out THEIR problem. Oh hey, not a big deal, but the database server’s been alerting all night that its disk is full. I just ack’d it, those emails were really getting annoying.”

  • Can you figure out why the disk is full and fix it?
  • Free up some disk space before the database gets corrupted!

df tells us that /dev/sda2, mounted as the root filesystem /, is full:

root@rhel:~# df -HT --total
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  4.2M     0  4.2M   0% /dev
tmpfs          tmpfs     2.0G     0  2.0G   0% /dev/shm
tmpfs          tmpfs     767M  8.9M  759M   2% /run
/dev/sda2      xfs        22G   22G  1.4M 100% /
/dev/sda1      vfat      210M  7.4M  203M   4% /boot/efi
/dev/sdb       ext4      511M   57M  417M  13% /opt/instruqt/bootstrap
tmpfs          tmpfs     384M     0  384M   0% /run/user/0
total          -          26G   22G  3.7G  86% -
root@rhel:~# 
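
To pick out the full filesystem without reading the table by eye, the usage column can be filtered with awk. This is only a minimal sketch and not part of the lab: flag_full is a hypothetical helper name, the threshold of 90 is arbitrary, and it assumes the POSIX layout of df -P, where the fifth column is the capacity and the sixth is the mount point (a mount point containing spaces would be cut off).

```shell
# flag_full LIMIT: read `df -P` output on stdin and print the mount point
# and usage of every filesystem at or above LIMIT percent.
# (hypothetical helper, not from the lab)
flag_full() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "100%" -> "100"
            if (use + 0 >= limit) print $6, $5
        }'
}

# Usage: df -P | flag_full 90
```

Reading from stdin keeps the parsing separate from the df call, so the same filter can be run against saved output.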

Running du from / shows that the /var directory is using 17G of space:

root@rhel:/# du -ch -d 1 2> /dev/null
327M    ./boot
0       ./dev
0       ./proc
8.5M    ./run
0       ./sys
4.0K    ./tmp
22M     ./etc
84K     ./root
17G     ./var
2.6G    ./usr
0       ./afs
176K    ./home
0       ./media
0       ./mnt
55M     ./opt
0       ./srv
20G     .
20G     total

Inside /var it’s the log directory:

root@rhel:/var# du -ch -d 1 2> /dev/null
209M    ./lib
17G     ./log
0       ./adm
127M    ./cache
0       ./db
0       ./empty
0       ./ftp
0       ./games
0       ./local
0       ./nis
0       ./opt
0       ./preserve
8.0K    ./spool
4.0K    ./tmp
0       ./yp
0       ./kerberos
0       ./crash
0       ./www
17G     .
17G     total

The file /var/log/biglog is 17G in size:

root@rhel:/var/log# ls -lhS | head
total 17G
-rw-r--r--. 1 root   root    17G Aug  6 16:20 biglog
-rw-------. 1 root   root   974K Nov 14  2023 messages-20231114
-rw-r--r--. 1 root   root   957K Aug  6 16:18 dnf.log
-rw-rw-r--. 1 root   utmp   289K Aug  6 16:17 lastlog
-rw-r--r--. 1 root   root   269K Aug  6 16:18 dnf.librepo.log
-rw-------. 1 root   root   260K May 27 14:10 messages-20240806
-rw-r--r--. 1 root   root   215K Aug  6 16:18 dnf.rpm.log
-rw-------. 1 root   root   214K Nov 14  2023 messages-20240527
-rw-------. 1 root   root   124K Aug  6 16:23 messages

Alternatively, du -aBm | sort -rn | head lists the largest files and directories, with sizes in MiB.
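
The directory-by-directory descent above can be wrapped in a small function. This is a sketch under assumptions: top_usage is a hypothetical name, it relies on the -d and -x options of GNU du, and -x is added so that other mounted filesystems are not counted against the one being investigated.

```shell
# top_usage DIR [N]: print the N largest lines of `du` for DIR and its
# immediate subdirectories (sizes in KiB), staying on DIR's filesystem.
# The first line is DIR itself, i.e. the total.
# (hypothetical helper, not from the lab)
top_usage() {
    du -xk -d 1 -- "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Usage: top_usage / 5, then top_usage /var 5, and so on down the tree
```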

Deleting the file before killing the process using it #

Before we delete the file, let’s first check whether a process is using it:

root@rhel:/var/log# lsof biglog 
COMMAND  PID USER   FD   TYPE DEVICE    SIZE/OFF  NODE NAME
vim     2158 root    3r   REG    8,2 17652908032 36327 biglog

We can see that vim with PID 2158, running as root, has file descriptor 3 open for reading (r) on a regular file (REG) called biglog. The file is on the device with ID 8,2 (major 8, minor 2, i.e. /dev/sda2), is 17.6 GB in size, and has inode number 36327.

If we delete the file first without killing vim, the file descriptor will stay open:

root@rhel:/var/log# rm -f biglog && lsof biglog 
lsof: status error on biglog: No such file or directory

root@rhel:/var/log# lsof | grep -i deleted
dbus-brok  595                   dbus   12u      REG                0,1     2097152          3 /memfd:dbus-broker-log (deleted)
firewalld  607                   root    9u      REG                0,1        4096          4 /memfd:libffi (deleted)
firewalld  607  879 gmain        root    9u      REG                0,1        4096          4 /memfd:libffi (deleted)
vim       2158                   root    3r      REG                8,2 17652908032      36327 /var/log/biglog (deleted)
root@rhel:/var/log# 
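
On a machine without lsof, the same information can be read from /proc, where each entry under /proc/PID/fd is a symbolic link whose target ends in " (deleted)" once the file has been unlinked. The following is a rough sketch, not a replacement for lsof: list_deleted_open_files is a hypothetical name, it only works on Linux, and without root it only sees the descriptors of your own processes.

```shell
# List every open file descriptor whose target has been deleted.
# The body runs in a subshell so the loop variables do not leak.
# (hypothetical helper, Linux only)
list_deleted_open_files() (
    for fd in /proc/[0-9]*/fd/*; do
        # readlink fails for processes we may not inspect or that just exited
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
)

# Usage: list_deleted_open_files | grep biglog
```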

When you delete a file in Linux while a process still has it open, the file’s data remains on disk because rm only removes the directory entry, not the data. The file stays accessible to the process through its open file descriptor (FD), which refers to the file’s inode. The data is freed only after every FD referencing it has been closed.

From lsof.readthedocs.io: Only when the process closes the file will its resources, particularly disk space, be released.
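
This behaviour is easy to reproduce safely with a throwaway file. The sketch below is not from the lab; demo_unlinked_read is a hypothetical name, and the temporary file stands in for biglog.

```shell
# Show that data stays readable through an open descriptor after the
# directory entry is removed. Runs in a subshell so fd 3 does not leak.
demo_unlinked_read() (
    tmp=$(mktemp) || exit 1
    printf 'still here\n' > "$tmp"
    exec 3< "$tmp"        # open the file for reading on fd 3
    rm -f "$tmp"          # remove the name; fd 3 still references the inode
    [ ! -e "$tmp" ] || exit 1
    cat <&3               # the contents are still readable through fd 3
    exec 3<&-             # closing the last descriptor lets the kernel free the blocks
)

demo_unlinked_read   # prints "still here"
```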

Terminating vim to reclaim the 17G of disk space:

root@rhel:/var/log# kill -9 2158

root@rhel:/var/log# lsof | grep -i deleted
dbus-brok  595                   dbus   12u      REG                0,1  2097152          3 /memfd:dbus-broker-log (deleted)
firewalld  607                   root    9u      REG                0,1     4096          4 /memfd:libffi (deleted)
firewalld  607  879 gmain        root    9u      REG                0,1     4096          4 /memfd:libffi (deleted)

Available space is now 18G on /dev/sda2.

root@rhel:/var/log# df -HT --total
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  4.2M     0  4.2M   0% /dev
tmpfs          tmpfs     2.0G     0  2.0G   0% /dev/shm
tmpfs          tmpfs     767M  8.9M  759M   2% /run
/dev/sda2      xfs        22G  3.6G   18G  17% /
/dev/sda1      vfat      210M  7.4M  203M   4% /boot/efi
/dev/sdb       ext4      511M   57M  417M  13% /opt/instruqt/bootstrap
total          -          25G  3.7G   21G  15% -
root@rhel:/var/log# 

Killing the process using the file first #

If the process using the file exits or is killed first, you can simply delete the file afterwards and the space is freed immediately:

root@rhel:~# lsof /var/log/biglog 
COMMAND  PID USER   FD   TYPE DEVICE    SIZE/OFF  NODE NAME
vim     2162 root    3r   REG    8,2 17651924992 36319 /var/log/biglog

root@rhel:~# kill -9 2162
root@rhel:~# lsof /var/log/biglog 

Recover deleted file #

If a process still has an open file descriptor (FD) to a file, for example:

  • an editor such as vim, as in this case
  • a daemon like httpd, which keeps its log files open until it stops, restarts, or rotates its logs

and you accidentally delete the file, it’s very easy to recover it.

Get the PID of the process using the file:

root@rhel:~# rm -f /var/log/biglog 

root@rhel:~# lsof | grep deleted
dbus-brok  596                   dbus   12u      REG                0,1     2097152          3 /memfd:dbus-broker-log (deleted)
firewalld  608                   root    9u      REG                0,1        4096          4 /memfd:libffi (deleted)
firewalld  608  866 gmain        root    9u      REG                0,1        4096          4 /memfd:libffi (deleted)
vim       2160                   root    3r      REG                8,2 17652908032      36319 /var/log/biglog (deleted)

Look for the file descriptor pointing to the file:

root@rhel:~# ls -l /proc/2160/fd/
total 0
lrwx------. 1 root root 64 Aug  6 16:43 0 -> /dev/pts/0
lrwx------. 1 root root 64 Aug  6 16:43 1 -> /dev/pts/0
lrwx------. 1 root root 64 Aug  6 16:43 2 -> /dev/pts/0
lr-x------. 1 root root 64 Aug  6 16:43 3 -> '/var/log/biglog (deleted)'
lrwx------. 1 root root 64 Aug  6 16:43 4 -> /var/log/.biglog.swp
root@rhel:~# 

Copy the file out through the file descriptor. cp reads the data and writes a new file with a new inode, so the copy should work on any filesystem with enough free space. Here I copied it to the same device as the original (I haven’t tried copying across filesystems):

root@rhel:~# mkdir /tmp/recover
root@rhel:~# 
root@rhel:~# cp /proc/2160/fd/3 /tmp/recover/
root@rhel:~# 
root@rhel:~# ls -lSh /tmp/recover/
total 17G
-rw-r--r--. 1 root root 17G Aug  6 16:45 3
root@rhel:~# 
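
The recovery steps can be collected into one small function. This is a sketch only: recover_fd is a hypothetical name, it works only on Linux while the owning process is still running, and it copies the file as it is at that moment, so a file that is still being written to may be incomplete.

```shell
# recover_fd PID FD DEST: copy the data behind /proc/PID/fd/FD to DEST.
# Refuses to overwrite an existing DEST.
# (hypothetical helper, Linux only)
recover_fd() {
    src="/proc/$1/fd/$2"
    [ -L "$src" ] || { echo "no such descriptor: $src" >&2; return 1; }
    [ ! -e "$3" ] || { echo "refusing to overwrite $3" >&2; return 1; }
    cp -- "$src" "$3"     # cp follows the link and reads the deleted file's data
}

# Usage: recover_fd 2160 3 /tmp/recover/biglog
```

Passing the destination as a full file name avoids ending up with a file named after the descriptor number, like the 3 in the listing above.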

Little experiment with httpd #

I tried this with Apache (httpd), but it applies to any process that keeps a file descriptor open on a file. If you delete /var/log/httpd/access_log, find its file descriptor under /proc/<PID>/fd, and copy the file back to its original location, Apache will continue to write to the deleted file until it is restarted. After a restart, it begins writing to the new access_log, which has a different inode.
