Friday, 10 April 2015

Found duplicate PV: using /dev/... not /dev/...

When you mix software RAID1 (md) and LVM, in some situations you can get this message:

Found duplicate PV: using /dev/sdb1 not /dev/md0 ...

and the LVM doesn't assemble. The exact device names may differ, of course. But how does this happen?

What happened is that at some point vgscan was run and read partition(s) that were later made into RAID1 members, and saw a Physical Volume (PV) UUID on them. Since the PV UUID of a RAID1 array is identical to the PV UUID of its members, you get duplicate(s).

RAID1 members are usually not candidates for PVs, as vgscan normally excludes such devices from consideration. However there is a cache, /etc/lvm/cache/.cache, which may contain outdated entries. In the example above it contained an entry for /dev/sdb1, which should have been filtered out by virtue of being in a RAID array. The solution is simple: just run vgscan again to update the cache. But you may have a problem if the device is needed for booting up. If the root device is on a different partition, or you have a rescue DVD, you might be able to mount the root filesystem containing /etc read-write and refresh the cache.
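
On a running system the refresh should be no more than:

vgscan
vgchange -ay

with the second command re-attempting activation of any volume group that failed to come up earlier.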

Some articles suggest editing lvm.conf to specify a filter that excludes the RAID1 members. Try refreshing the cache first before you resort to this; it should just work.
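
For reference, if you do end up needing a filter, it goes in the devices section of /etc/lvm/lvm.conf and would look something like this, rejecting the member partition from the example above and accepting everything else:

filter = [ "r|/dev/sdb1|", "a|.*|" ]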

This problem occurred in the context of converting in-situ a filesystem on a single disk to reside in a RAID1.

Thursday, 9 April 2015

Converting single disk to RAID1 in-situ

You have this Linux system that doesn't use RAID. You start to worry about the loss of files (since the last backup; you do backups, right?) and the downtime should the disk fail. Maybe it is a good idea to have RAID. But how do you retrofit RAID1 without a lot of downtime spent backing up, reformatting the disks and restoring the data?

I suspected there might be a way to start off with a degraded RAID1 array on the second, new disk, copy the data from the old disk's partition onto it, change the partition type on the old disk to a RAID member, add it to the array and let it resync. Sure enough it can be done, and François Marier has blogged about it. In fact he goes further and shows how to reinstall the boot loader. I didn't have to do this because my partition is /home. The critical tip is the use of the keyword missing to create the degraded array without issues.
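
A sketch of the key steps, with device names that are only examples (/dev/sda1 the existing partition, /dev/sdb1 its counterpart on the new disk):

mdadm --create /dev/md0 --level=1 --raid-devices=2 missing /dev/sdb1
# put a filesystem (or LVM, as described below) on /dev/md0 and copy the
# data over; then, after switching over and setting the old partition's
# type to Linux RAID autodetect, attach it and let the mirror resync:
mdadm --add /dev/md0 /dev/sda1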

In my case the decision to go RAID1 was made after a failed disk caused loss of files. It was not a wise decision by the system builder to omit RAID1 in the first place.

I've varied the procedure a little. Instead of putting ext4 directly on the RAID partition, I put a logical volume on it, and then created an ext4 filesystem inside that. This allows me to migrate the content to a larger disk, if expansion is needed in future, using logical volume operations, with little downtime.
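
Something like this, with hypothetical volume group and logical volume names:

pvcreate /dev/md0
vgcreate vg_home /dev/md0
lvcreate -l 100%FREE -n lv_home vg_home
mkfs.ext4 /dev/vg_home/lv_home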

There's one thing you should do if you decide to use logical volumes on the RAID. After you have assembled the RAID array, run vgscan. This reinitialises the cache in /etc/lvm/cache/.cache. Otherwise the cache will contain entries for the components of the array and cause a failure to assemble later on, with a mysterious (to me at first) duplicate PV error, because LVM thinks the array components are candidates for physical volumes. LVM is normally configured to ignore components of RAID arrays, but only if the cache is up to date. See here for more details.

A couple of caveats: on other Linux systems mdadm.conf may be in /etc, not /etc/mdadm. Also, the output of mdadm --detail --scan used to generate the mdadm.conf line will contain a spares=1 directive if it is run while the array is resyncing. Remove it, or you will have problems on the next boot.
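
Roughly, using the Debian-style path (adjust to /etc/mdadm.conf where applicable):

mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# then edit the appended ARRAY line and delete any spares=1 from it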

Saturday, 17 January 2015

ssh hangs at SSH2_MSG_KEX_DH_GEX_GROUP trying to connect to servers behind Cisco firewall

Today I was unable to ssh to some CentOS servers behind a Cisco firewall. I was connected using AnyConnect. When I ran ssh with -v, it showed that the connection hung at expecting SSH2_MSG_KEX_DH_GEX_GROUP.

A search on the Internet turned up this article: Natty Narwhal: Problems connecting to servers behind (Cisco) firewalls using ssh. Shortening the Cipher and MAC lists as suggested solved the problem, apparently because the long lists overflow some packet size limit somewhere. I'll leave it to the experts to work out what it is about Cisco and ssh.
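
The fix amounts to entries along these lines in ~/.ssh/config; the host pattern and the exact algorithm names are only examples, the point being that the lists are much shorter than the defaults:

Host *.internal.example.com
    Ciphers aes128-ctr,aes192-ctr,aes256-ctr
    MACs hmac-sha1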

Friday, 16 January 2015

NVIDIA driver, libglx.so and hardware acceleration crashing X server

My users on CentOS complained that Firefox would crash the X server after I updated the package xorg-x11-server-Xorg.

A search returned suggestions to disable hardware acceleration in Firefox. However a user mentioned that when he reinstalled the NVIDIA driver, it gave the warning "libglx.so is not a symbolic link" before setting things right.

An investigation showed that the NVIDIA installer replaces the file /usr/lib64/xorg/modules/extensions/libglx.so with a symlink to /usr/lib64/xorg/modules/extensions/libglx.so.X.Y, where X.Y is the NVIDIA package version. So every time the xorg-x11-server-Xorg package is updated or reinstalled, the symlink is overwritten with the stock libglx.so, hardware acceleration fails, and the X server crashes.

The same problem for Debian is documented in this forum thread.

I could blacklist xorg-x11-server-Xorg in yum, but I'd rather not do that. Since I may not notice when an update of that package happens, I wrote a shell script to restore the symlink if it has been replaced, and a cron job to call it periodically. I'm also looking to write a Puppet stanza to do this.
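
The script is essentially a few lines; here is a sketch, where the driver version (340.76) is a placeholder to adjust to whatever the NVIDIA installer put on your system:

#!/bin/sh
# Restore the NVIDIA libglx.so symlink if an Xorg update replaced it
# with the stock file. 340.76 is a placeholder driver version.
DIR=/usr/lib64/xorg/modules/extensions
if [ ! -L $DIR/libglx.so ] && [ -e $DIR/libglx.so.340.76 ]; then
    ln -sf $DIR/libglx.so.340.76 $DIR/libglx.so
fi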

Thursday, 15 January 2015

Squid failing to fetch IPv6 web resources and dns_v4_first

I had a strange symptom on my Chromebook on my home LAN. A particular Internet web page would not work because a jQuery file distributed by cdnjs.cloudflare.com could not be fetched. When I disabled the use of a proxy on my Chromebook, the web page worked. I have a squid caching proxy on my LAN, partly to be able to zap advertisements. So I knew it had something to do with squid.

Maybe it was a bad object in the cache? I knew there was a way to delete cached objects from the command line, and a search showed me that all I had to do was add these lines to /etc/squid/squid.conf:

acl purge method PURGE

http_access deny purge !localhost

and then use the command:

squidclient -m PURGE https://cdnjs.cloudflare.com/blah.js

to remove it. However when I ran

squidclient https://cdnjs.cloudflare.com/blah.js

to fetch it again, I saw that squid was trying to use the IPv6 address of cdnjs.cloudflare.com to get the resource and failing.
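
You can confirm that the host publishes both A and AAAA records with something like:

dig cdnjs.cloudflare.com A +short
dig cdnjs.cloudflare.com AAAA +short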

My LAN is fully IPv6 enabled: my Linux machines all support the IPv6 stack, I have a BIND server giving out IPv6 addresses, and in fact a lot of the internal traffic, such as Apache's, goes over IPv6 transparently. I wish I had an IPv6 broadband connection, but I don't, so I cannot reach IPv6 addresses in the outside world.

So the problem boiled down to: how can I prevent squid from trying to reach IPv6 sites? It turns out that the directive dns_v4_first is intended for this. Just adding:

dns_v4_first on

to /etc/squid/squid.conf worked and now I can view that Internet web page.
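
Note that squid only picks up the change after rereading its configuration, so run something like this first:

squid -k reconfigure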

One symptom remained unexplained though: why didn't the Chrome browser on the host running squid suffer the same problem? I can only surmise that it's because in that case the proxy is specified as an IPv4 address and port, so squid thinks the request is coming from an IPv4 host and forwards it to an IPv4 origin. The Chromebook, however, sends the request to the (internal) IPv6 address of squid, so squid thinks it's allowed to forward the request to an IPv6 origin.

Friday, 9 January 2015

Apache won't start on CentOS 5 because self-signed certificate used by mod_nss expired

You are not likely to hit this error with the usual configuration, since mod_ssl is the module normally used. However mod_nss is used by the Fedora Directory Server (now known as 389 Directory Server) on CentOS for the console's LDAP authentication (centos-idm-console). It is possibly also used by RHEL.

One day I found that Apache would not start. The error log indicated that the certificate had expired. I searched and searched for how to generate a new certificate and tried various things. Too many steps and too hard.
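
The expiry can also be confirmed by listing the certificate in the NSS database; Server-Cert is the usual mod_nss nickname, though yours may differ:

certutil -L -d /etc/httpd/alias -n Server-Cert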

Then I had an idea. The certificate must have been generated when the package mod_nss was installed. Let's try reinstalling it. First we move the old database directory out of the way:

mv /etc/httpd/alias /etc/httpd/alias.old

Then reinstall mod_nss:

yum reinstall mod_nss

and voila, after a while, a new database with a fresh certificate was generated and I could start Apache.

Monday, 23 June 2014

Apache2 403 error but no clue in error_log, check rewrite rules

I had a directory inside an Apache2 website that I needed to open for indexed browsing. But it always returned 403 errors. This old but still valid document explains how to solve 403 errors. Generally the first place you should look is /var/log/apache2/error_log. The usual reason is a client IP address not being allowed, or a file permission problem, and there will be a line in the log saying so. In my situation, however, there was absolutely no diagnostic.

Finally I figured the error must be generated by a RewriteRule. But which one? This page shows how to debug rewrite rules by enabling the rewrite log. Put the following in, say, /etc/apache2/conf.d/rewrite.conf (Debian):

RewriteEngine On
RewriteLog "/var/log/apache2/rewrite.log"
RewriteLogLevel 3

It turned out to be a rule in Joomla!'s .htaccess that was blocking indexing on the directory. I also had to re-enable Options Indexes for that directory, as well as modify .htaccess to comment out IndexIgnore *, since that directive cannot be overridden, and Options -Indexes blocks indexing already anyway.
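
For reference, re-enabling indexing for that one directory amounted to something along these lines (the path is only an example):

<Directory /var/www/mysite/downloads>
    Options +Indexes
</Directory>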

Remember to disable the rewrite log afterwards.