Thursday 27 June 2013

A caching scheme to reduce downloads for Linux package updates

As I administer sites with a lot of (nearly) identical Linux hosts, I thought a lot about the issue of reducing the download for package updates. Even if you have unlimited data quota for your site, downloading the updates multiple times for multiple machines takes time.

I looked at various schemes suggested on the Internet. One is to set up a satellite repository. This is reliable but involves a lot of configuration. Also you may download a lot of package updates that are not used at your site.

Another class of scheme involves interposing a proxy. It could be a specialised proxy for the distro, or a general proxy like squid. This can work well but is sometimes defeated by load balancers or smart mirrors where you are not guaranteed to hit the same repository every download, as the actual repository used could vary.

Eventually I came up with a scheme that is applicable to three distros that I have tried it on: CentOS, Debian (and derivatives) and openSUSE. The key observation is that fact that the updaters in these distros have a package cache. If packages are added to the cache on the side, the updater will skip downloading them. I describe this as a caching scheme, not a proxying scheme as no change is made to the updater configuration files and normal, independent update works.

The scheme:

There are three parts to the scheme.

1. A download script runs on a designated master host. This runs during off-peak hours. We use the updater program in download but not install mode so we will only download packages we need. It's also necessary to set the repository to retain packages.

2. A synchronisation script runs on the other hosts to pull the new updates to their cache. This is run as soon as possible after the download.

3. Various utility scripts are used to fire off updates on all hosts. I make use of parallel ssh and ssh-agent to make life easy.

Let's use a concrete example to illustrate this.

CentOS:

1. Download:

The download script is this:

yum --disableexcludes=updates --downloadonly -y update

You need to install the auxiliary package yum-downloadonly. The --disableexcludes=updates is to disable the regular expressions that prevent the installation of the kernel on this host. On this host I disabled kernel updates due to the need to reboot and recompile VirtualBox modules whenever the kernel is updated, and as this is a server that engineers use constantly, I delay kernel updates until downtime can be scheduled to do them manually. Other software that may need recompilation on kernel update include proprietary video drivers.

To set the repository to retain packages, add

keepcache=1

to /etc/yum.conf

2. Synchronisation:

Next I set up a rsync server. Initially I used rsync over ssh, but this requires setting up a login account, restricted commands, etc. for security reasons. Package updates are not sensitive data so I just share the cache directory with rsyncd. Please consult your distro documentation on how to enable the rsyncd daemon. Usually it's run out of xinetd. Here is /etc/rsyncd.conf:

secrets file = /etc/rsyncd.secrets
motd file = /etc/rsyncd.motd
read only = yes
list = yes
uid = nobody
gid = nobody

[yum]
comment = YUM cache
path = /var/cache/yum
filter = + *.rpm + */ - *

Note the filter expression to share RPM files and directories but not files of other types.

The synchronisation script, call it syncfrommaster, on the other hosts is:

rsync -avi masterhost::yum /var/cache/yum

In the cron job that runs this command I insert a random delay so that the hosts don't all try to synchronise at the same time and overload the master.

sleep $(($RANDOM \% 900)); /root/bin/syncfrommaster

The % is escaped with \ in cron scripts by the way.

I arrange for this job to be run an hour after the download script.

3. Update.

You have probably noticed that it is possible that an update will be published in the time between the master pulling it down and you applying the updates. In this case each host merely downloads the package on its own, and nothing breaks. However you may wish to run a script to cause the cache to be updated and the hosts to catch up.

#!/bin/sh
yum --disableexcludes=updates --downloadonly -y update
pssh -h /root/linux-servers -P -p 1 -t -1 /root/bin/syncfrommaster

This first command is just the download script again. The second command makes use of the parallel ssh program to fire off syncfrommaster on each host. Before you run this you need to run ssh-agent bash, followed by ssh-add to make it possible to run pssh without supplying a login password for each host. This assumes have already installed ssh keys on each host.

4. Finally you can update all the hosts in one go:

pssh -h /root/linux-servers -P -p 1 -t -1 yum update -y

You need the -y as pssh cannot run interactive commands. You might want to run

yum update

on the master host first to check that the updates will go as expected.

It is possible that the master doesn't have all the packages in the union of all packages installed on all hosts. No harm done, the host that has distinct packages will download it on its own. Or you may wish to install that package on the master so the next update will be covered.

openSUSE:

1. Download:

zypper -n up -l -d

To set the repository to retain packages, run

zypper mr -k <repo>

for the repositories must be retained, usually repo-update and any extras such as packman.

2. Synchronisation:

/etc/rsyncd.conf (I only show the stanza added to the existing config file)

[zypp]
comment = Zypp archive
path = /var/cache/zypp/packages
read only = yes
list = yes

Notice that read only and list are in the zypp section. It can be either in the global or module section.

Synchronisation script:

rsync -avi masterhost::zypp /var/cache/zypp/packages

3. Update:

zypper up

With this info you can work out the catchup script and parallel ssh scripts. As in CentOS the -y flag also works to make zypper non-interactive.

Debian:

1. Download:

apt-get update; apt-get upgrade -d -y

2. Synchronisation:

/etc/rsyncd.conf:


[apt]
comment = APT archive
path = /var/cache/apt/archives
read only = yes
list = yes
filter = - lock

Synchronisation script:

rsync -avi masterhost::apt /var/cache/apt/archives

3. Update:

apt-get upgrade

and you can work out the catchup script and parallel ssh scripts. Again -y works to make apt-get upgrade non-interactive.

Notes:

To keep the explanation simpler I have not added the shell directives to redirect output to /dev/null. If cron jobs generate output, root or you will get mailed the output. This could be useful for debuging and to keep an eye on things, but gets tiresome after a while, given that this all works well and falls back to normal update if the caching is not effective. You will need 2>&1 to redirect stderr also.

You will notice that packages will accumulate in the cache after a while. While Debian commendably has apt-get autoclean, the other two RPM based distros do not have an analogous feature. You might just have a cron job remove packages older than a certain number of days.

Thursday 13 June 2013

Install tmm, matplotlib, scipy and numpy on CentOS/RHEL 5

I hope this post saves you some searching if you have to go through the same steps as I did.

A user at this site wanted to run tmm for simulations on CentOS 5 servers.

CentOS/RHEL 5 unfortunately comes with python 2.4, too old for the required modules. Fortunately there is an easy way to install python 2.6 side by side and that is to get the python26 package from EPEL. If you do:
yum search python26
you will get a list of python26 packages. Aside from the base python26 package, I also needed python26-numpy. This will pull in any dependencies. When you have installed python26 you must remember to invoke python scripts with python26 or to edit the #! first line.

The scipy module was installed from source, that was the easiest way. Then I installed the tmm module. BTW in all the python extension installs I used python26 setup.py install rather than any package manager.

After that the test.py program in the tmm distribution ran. But my user pointed out that examples.py didn't run. This requires matplotlib. Ok, installed that from source. Now the function sample1() in examples.py ran, but generated no output. More searching showed that you have to specify the backend in ~/.matplotlib/matplotlibrc like this:

backend: TkAgg
interactive: true

(I'm leaving out some useless experiments involving the Postscript backend.) This generated an error about Tkinter not loading. I went back and looked at the setup step for matplotlib and noticed that when I ran:
python26 setup.py
it said I didn't have the Tkinter header files installed so it wasn't building the interface for the Tkinter backend. So I did yum install tk-devel, built and installed matplotlib again, and finally I got the example program to display a plot.

Whew!