Saving your bacon: recovering from a serious Linux issue

Imagine, if you will, that you stumble upon a directory on one of the CentOS Linux servers that you administer named /opt/config/etc. “That’s odd”, you say to yourself, “that must have been when I was experimenting with placing /etc/ under version control.” You do a quick listing of all of the files in /opt/config/etc and notice that the files are basically identical to the ones in /etc/. You say to yourself, “Hmmm… better get rid of these files – they’re just taking up space here.” And you type sudo rm -rf /opt/config/etc. And … all hell breaks loose. What has just happened?

You are no longer able to sudo. You receive odd messages about user 501. It slowly dawns on you that … /opt/config/etc was a link to /etc. You cd into /etc and your fears are realized. There is nothing there. Nada. Zilch. Well then. Well, here’s another nice mess you’ve gotten yourself into. The question is how to recover.
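In hindsight, a quick look at what the path really was would have been cheap insurance. The following is only a sketch; the function name inspect_before_rm is made up for illustration, and it assumes readlink -f and (optionally) mountpoint are available, as they are on most Linux systems.

```shell
# Hypothetical helper: report what a path really is before running rm -rf on it.
inspect_before_rm() {
    path=$1
    if [ -L "$path" ]; then
        # Symbolic link: show where it ultimately points.
        echo "symlink -> $(readlink -f "$path")"
    elif command -v mountpoint >/dev/null 2>&1 && mountpoint -q "$path"; then
        # Bind mounts and other mount points look like ordinary directories in ls.
        echo "mount point"
    elif [ -d "$path" ]; then
        echo "plain directory"
    else
        echo "not a directory"
    fi
}
```

Running inspect_before_rm /opt/config/etc before the delete would have reported anything other than a plain directory, which is the cue to stop and think.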

Well, here’s what I did. I rebooted the server using the fabulous Linux System Rescue CD. The System Rescue CD will boot the computer from CD. Initially, it does not attempt to mount the server’s hard drive. We use CentOS, so by default the drives are configured as LVM volumes. This complicates the recovery very slightly. Using the tips from this blog post, because we can never remember our lvm commands, we do this:

# lvm vgscan
Reading all physical volumes.  This may take a while...
Found volume group "VolGroup00" using metadata type lvm2
# lvm vgchange -ay
# lvm lvs
  LV       VG         Attr   LSize Origin Snap%  Move Log Copy%  Convert
  LogVol00 VolGroup00 -wi-ao 6.88G
  LogVol01 VolGroup00 -wi-ao 1.00G

This shows us that we do have two logical volumes on the CentOS disk, which makes sense: a 7 GB root volume and a 1 GB swap volume. The root volume is the one we’re after, so we can do this from the command line:

# mkdir /disk
# mount /dev/VolGroup00/LogVol00 /disk
# cd /disk/etc
# ls /disk/etc

At this point, we see nothing, no files, just as we had feared. Time to restore from our backup. Nothing magical here; if you don’t have a backup, you’ll be re-installing the OS.
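If you forget how the device path handed to mount is put together, it is just /dev/ followed by the VG and LV names from the lvs listing. Here is a small sketch that builds those paths; the function name lv_device_paths is an invention for this example, and it assumes you feed it two-column "VG LV" output.

```shell
# Hypothetical helper: turn "VG LV" pairs read from stdin into /dev/VG/LV paths.
# Intended input: lvm lvs --noheadings -o vg_name,lv_name
lv_device_paths() {
    awk 'NF >= 2 { printf "/dev/%s/%s\n", $1, $2 }'
}
```

From the rescue shell, something like lvm lvs --noheadings -o vg_name,lv_name | lv_device_paths should list one mountable path per logical volume.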

In our case, we use the terrific rsnapshot script to periodically store backups of important directories (like /etc) to another server. Because the data is remote, we need to bring up networking in the System Rescue CD environment. You can run ifconfig eth0 with the IP address you want to assign (plain ifconfig eth0 only displays the interface) and then route add default gw yyy.yyy.yyy.yyy to bring up the adapter and establish a route. You could also edit /etc/resolv.conf so that you have access to a name server.
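For the record, here is roughly what those networking commands look like when spelled out. This is a sketch, not what we literally typed: the helper name print_net_setup is made up, it only prints the commands so you can review them, and the addresses in the usage line are documentation placeholders that you would replace with your own.

```shell
# Hypothetical dry-run helper: prints the commands instead of running them.
# Arguments: interface, IP address, netmask, gateway, name server.
print_net_setup() {
    echo "ifconfig $1 $2 netmask $3 up"
    echo "route add default gw $4"
    echo "echo 'nameserver $5' > /etc/resolv.conf"
}

# Example with placeholder addresses (substitute real ones):
# print_net_setup eth0 192.0.2.10 255.255.255.0 192.0.2.1 192.0.2.53
```

Once the printed commands look right, piping the output to sh runs them. Note that the last command overwrites /etc/resolv.conf in the rescue environment, which is harmless there.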

At this point you should be able to ssh to the host that holds the backups. In our case, because of firewall configuration, we cannot ssh into the backup server. Rather, we need to ssh into the system being repaired from the backup server. When System Rescue CD starts up, it actually starts an sshd server and root is allowed to connect to it. However, we had to set root’s password before we were able to connect: run passwd at the command line and enter a reasonable password. You may also need to fiddle with the ssh settings on the backup server; after all, the rescue environment does not present the same host key as the original server.

On the backup host, we did tar cvzfp etcbackup.tgz etc/* to create a tgz archive containing all of the files from the backed-up etc directory (the “z” is what actually gzips the archive, and the later extraction expects it). Note the “p” option: we’re trying to preserve file modes and ownership of the files to be restored. Strictly speaking, tar records both when it creates the archive, and “p” takes effect at extraction time, where GNU tar applies it by default when run as root. We then used scp to copy etcbackup.tgz over to the host to be rescued, which put the archive in root’s home directory there. Back on the machine to be rescued, we did tar xvfz etcbackup.tgz and examined /root/etc/* to see that the files were there. At this point, we copied the files back into /disk/etc/ (the previously mounted hard disk for the damaged server), crossed our fingers and rebooted.
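The pack, unpack and copy-back steps can be sketched as two small functions. The names pack_etc and restore_etc and their argument order are inventions for this example, not something from our actual session, and the sketch assumes GNU tar and cp. Two deliberate differences from what we typed: it archives the etc directory itself rather than etc/*, so dotfiles at the top level come along, and it uses cp -a for the final copy so modes, timestamps and symlinks survive.

```shell
# Hypothetical helpers; the names and arguments are illustrative only.

# On the backup host: archive the etc directory found under $1 into archive $2.
pack_etc() {
    tar -czpf "$2" -C "$1" etc
}

# On the rescued host: unpack archive $1 under staging dir $2, then copy the
# contents into target dir $3. cp -a keeps modes, timestamps and symlinks
# (and ownership when run as root).
restore_etc() {
    tar -xzpf "$1" -C "$2" &&
    cp -a "$2/etc/." "$3/"
}
```

On the rescue system this would be something like restore_etc /root/etcbackup.tgz /root /disk/etc, assuming the archive landed in root’s home directory as described above.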

The machine came back up without any issues and we are back in business. The total time commitment was about 35 minutes from scary start to relieved finish.

You may still be nervous that there are things that are broken in /etc/ that will cause unforeseen problems down the road. Here’s one way to do some checking with regard to that:

cd /etc
rpm -qf * | grep -v "is not owned"|sort | uniq >/tmp/etcpkgs
for x in $(cat /tmp/etcpkgs);do rpm -V $x;done

Here we are leveraging RPM’s package validation tools. You change into the /etc directory. First, you determine what packages own the files in /etc/ and strip out any information about files that are not owned by any rpm package. Obviously, the assumption here is that you primarily use pre-built RPM packages and do not install much software from source. We then sort the list of packages and save the unique package names to a file in /tmp called etcpkgs. For each package in that list you then run the rpm --verify command. That command will return information like the following:

.......T c /etc/audit/auditd.conf
S.5....T c /etc/yum.repos.d/CentOS-Base.repo
S.5....T c /etc/httpd/conf/httpd.conf
.M...... /etc/httpd/logs
.M...... /etc/httpd/run
.......T c /etc/inittab
S.5....T c /etc/ssh/sshd_config
....L... c /etc/pam.d/system-auth
S.5....T c /etc/php.ini
S.5....T c /etc/postfix/
S.5....T c /etc/postfix/virtual
......G. /var/cache/samba/winbindd_privileged
S.5....T c /etc/mail/
S.5....T c /etc/mail/
S.5....T c /etc/aliases
S.5....T c /etc/printcap
S.5....T c /etc/sudoers

Each letter in the first column flags one of the following differences; a dot means that test passed, and the c marks a configuration file:

S file Size differs
M Mode differs (includes permissions and file type)
5 MD5 sum differs
D Device major/minor number mismatch
L readLink(2) path mismatch
U User ownership differs
G Group ownership differs
T mTime differs

Based on the returned output, you can investigate further. Logical candidates for further investigation are any files with file Mode, Link, User or Group issues.
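If the verify output is long, you can narrow it down to those candidates mechanically. The two filters below are a sketch; the names unique_owners and suspect_lines are made up for this example, and both simply read text on stdin, so they make no assumptions beyond the output formats shown above.

```shell
# Hypothetical helpers; names are illustrative.

# Reduce `rpm -qf` output to a sorted list of unique package names.
unique_owners() {
    grep -v 'is not owned' | sort -u
}

# Keep only `rpm -V` lines whose flag field reports a Mode, Link,
# User or Group difference.
suspect_lines() {
    awk '$1 ~ /[MLUG]/'
}
```

Chained together, rpm -qf /etc/* | unique_owners | xargs rpm -V | suspect_lines does the whole check in one pass without the temporary file. Size, MD5 and mtime differences on config files are filtered out, since those are what you would expect on any server whose configuration has been edited.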
