Monday 27 November 2017

EFI System Partition in soft RAID1

One reason you might want to put the EFI System Partition (ESP) in a RAID1 array on a computer with Linux soft RAID is to have redundancy when booting. If one disk fails, you want the boot to continue from the other disk.

At first I thought this wasn't possible, since a RAID1 member partition wouldn't have the specific FAT filesystem and GUID required by the specification. However, the fact that the CentOS 7 install media offered the choice of putting the ESP on a RAID1 array, and that it actually works, made me doubt my hypothesis.

The key to this is that the CentOS 7 installer uses RAID metadata format 1.0, which is located at the end of the partition. Thus it doesn't clash with the beginning of the partition, which is where the BIOS will check to see if the partition is an ESP. However, most Linux partition tools will detect it first as a RAID member, so it's not immediately obvious that it's an ESP.
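
As a rough sketch of building such an array by hand (not necessarily what the installer runs), the commands would look something like the following. The device names /dev/sda1, /dev/sdb1 and /dev/md/esp are placeholders, and the script only prints the commands unless DRY_RUN=0 is set. It does not touch the partition table, so each member partition is assumed to be already marked as an EFI System Partition.

```shell
#!/bin/sh
# Sketch: print (or, with DRY_RUN=0, run) the commands for an ESP on RAID1.
# All device names below are placeholders, not taken from a real machine.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"          # dry run: show the command only
    else
        "$@"
    fi
}

# make_esp_raid MD_DEVICE MEMBER_A MEMBER_B
make_esp_raid() {
    md=$1; part_a=$2; part_b=$3
    # Metadata 1.0 keeps the RAID superblock at the end of each member,
    # so the start of each partition still looks like a plain FAT filesystem.
    run mdadm --create "$md" --level=1 --raid-devices=2 --metadata=1.0 "$part_a" "$part_b"
    # Format the array (not the members) so both copies stay identical.
    run mkfs.vfat -F 32 "$md"
}

make_esp_raid /dev/md/esp /dev/sda1 /dev/sdb1
```

Review the printed commands before running them for real, since mdadm --create overwrites whatever is on the member partitions.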

There are some caveats to this scheme. All writing of the ESP must be done while it's mounted as a RAID array so that there is no discrepancy between the two members. If the only OS on the disks is Linux, this won't be a problem. But don't use this scheme if the ESP also boots other operating systems that don't know about Linux RAID.

For CentOS, when you look at the choice of boot devices in the BIOS, you should see two disk boot candidates, both labelled CentOS.

On the machines I used, HP z230 workstations, I found that I had to disable Legacy Boot or errors reading the boot sectors would be triggered.

The bottom line is I now have workstations with soft RAID1 whose disks are fully redundant. If one disk fails, the other will continue to boot and run with degraded arrays for each of the partitions.
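
Since a degraded array keeps running quietly, it is worth checking for one now and then. Below is a minimal sketch that lists arrays with a missing member by looking for an underscore in the status field of /proc/mdstat (e.g. [U_] instead of [UU]). The function name degraded_arrays is my own, and the parsing assumes the usual mdstat layout.

```shell
#!/bin/sh
# Sketch: list md arrays that have a missing member.
# Reads /proc/mdstat by default; a different file can be given for testing.
degraded_arrays() {
    awk '
        /^md/ { name = $1 }                     # remember the current array
        /\[[U_]*_[U_]*\]/ { print name }        # status such as [U_] or [_U]
    ' "${1:-/proc/mdstat}"
}

if [ -r /proc/mdstat ]; then
    degraded_arrays
fi
```

No output means every array listed in /proc/mdstat has all its members.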

Thursday 23 November 2017

grub2 error: failure reading sector 0x0 from 'hd0'. Press any key to continue

After I had installed CentOS 7 as the only OS on an HP z230 workstation in UEFI boot mode, I got this message before booting. It was actually the last of three errors:

error: failure reading sector 0xfc from 'hd0'. 
error: failure reading sector 0xe0 from 'hd0'. 
error: failure reading sector 0x0 from 'hd0'

Boot would resume from the hard disk after a timeout, but the pause was unacceptable and would worry users.

A search showed many articles like this but none solved my problem. I tried various things: refreshing grub.cfg, disabling the CD/DVD (thinking it might be trying to read the optical drive), and checking whether having the EFI System Partition (ESP) in a RAID1 array was disallowed. (I figured out how ESP can work with RAID1, and its limitations, but that's for another blog entry.) None of my experiments worked.

However, the linked web page alluded to turning off Secure Boot, so I went into that part of the BIOS setup. I found that it was already turned off, but there was a setting there for Legacy Boot, which was enabled. So I turned it off to see what would happen. Lo and behold, the error messages ceased, and UEFI boot worked as expected. Also the Boot Order menu stopped showing a Legacy section.

Since debugging the innards of the GRUB2 loader is beyond me, I can only surmise that the presence of Legacy Boot entries in the BIOS makes GRUB2 try reading the sectors in question. Because the disk is formatted with GPT partitions and UEFI is in force, the sector reads fail, for some definition of fail. Maybe somebody can figure out the significance of the sectors 0xfc, 0xe0, and 0x0.
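
For anyone who wants to dig further, this small sketch converts the sector numbers from the error messages to decimal and to a byte offset. It assumes 512-byte sectors, which is my assumption and not something the errors state. It explains nothing about why GRUB2 reads those sectors; it only saves some hex arithmetic.

```shell
#!/bin/sh
# Sketch: convert the sector numbers from the GRUB2 errors to decimal and to
# byte offsets. SECTOR_SIZE=512 is an assumption; adjust for 4K-sector disks.
SECTOR_SIZE=512

sector_info() {
    sector=$(( $1 ))       # arithmetic expansion accepts 0x... hex constants
    echo "$1 = sector $sector, byte offset $(( sector * SECTOR_SIZE ))"
}

for s in 0xfc 0xe0 0x0; do
    sector_info "$s"
done
```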

Thursday 16 November 2017

Dos and Don'ts deploying sssd for authentication against Windows AD

New: For deployment on Redhat/CentOS 6, see here.

sssd (and realmd) in RedHat/CentOS 7 offers the chance to use Windows as a single authentication base. The RedHat manual was the most useful but there were also good debugging tips on stackoverflow and similar forums. However in deploying sssd I found some things worked for me and some things didn't.
  • Do harmonise all the Windows and Linux login IDs. If there are users with two different IDs, then they'll have to bite the bullet and accept the change of one ID. Unfortunately domain logins cannot have aliases.
  • When you join Linux to AD using the realm command and an unprivileged account, you may encounter this 10 machine limit. Here's how to raise the limit.
  • Do use ntpd to keep all the clients in time sync. Specify the domain servers as NTP servers in ntp.conf. I had an issue where one client wouldn't authenticate, even though all its config files were identical to those on a working client. Finally I realised I had not enabled and started ntpd. It turned out to be clock skew. Kerberos is sensitive to this.
  • Do enable GSSAPI with MIC in sshd. It really works and you can use putty to ssh to the server without specifying a password provided the Windows user has authenticated to the domain.
  • Do use AD security groups to restrict access to the Linux servers. Otherwise all AD users can log in by default. This means that enrolling a new Linux user across all the servers is simply a matter of adding the user to your chosen security group. Create one if necessary. Oddjobd will take care of creating the home directory on first login, which is very nice. I used the simple access_provider. I couldn't get the ad access_provider and ad_access_filter to work, but this is probably because I couldn't work out the correct LDAP strings.
  • You can also use a security group to specify who can have extra privileges in sudo.
  • I used the deterministic hash scheme for mapping SIDs to UIDs because I didn't want to (and didn't have authority to) add attributes to the AD schema.
  • When migrating existing user accounts, make sure you find all the places a user might have a file. Not just /home but also /var/spool/cron and /var/spool/mail. Kick all the users off and kill all of their processes before you do the chown. Since after the switchover the names will map to the new UIDs, you can cd /home and run a loop: for u in *; do chown -R "$u" "$u"; done. Do the same in the cron and mail directories.
  • If you have software that must have simple login IDs, i.e. fred and not fred@example.com, then you should set use_fully_qualified_names = False. This implies you cannot have a default_domain_suffix. If you have a single domain, then you don't need domain suffixes. If you have multiple domains, then this is beyond my knowledge. I found that some applications cannot handle usernames of the domain form. Even the crontab command will create and require cron files of the domain form if domain suffixes are enabled.
  • I couldn't get the sssd idmap to work with Samba so I chose winbind. Also you have to use winbind if you have to support NTLM authentication.
  • New: If you are running 32-bit applications, you should also install the 32-bit libsss* shared libraries corresponding to the 64-bit ones, otherwise those applications may not be able to get user account info via PAM. This showed up in icfb, an old 32-bit Cadence executable, that worked for local users (in /etc/passwd) but failed for SSSD authenticated users.
  • New: If oddjob_mkhomedir doesn't work, as evidenced by no home directory created for a new login, check /var/log/messages. SELinux is probably blocking this. Either make the policy permissive, or create a policy for this.
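
The chown step in the migration item above can be sketched as a small script. This is only an outline under a few assumptions: every entry in /home, /var/spool/cron and /var/spool/mail is named after its owner, entries whose name doesn't resolve to an account are skipped, and the script only prints the commands unless DRY_RUN=0 is set. The helper names run and rechown_dir are my own.

```shell
#!/bin/sh
# Sketch: re-own per-user files after the names start mapping to new UIDs.
# DRY_RUN=1 (the default here) only prints the chown commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rechown_dir DIR [-R]: chown each entry in DIR to the user it is named after.
rechown_dir() {
    dir=$1
    flag=$2                                   # -R for home directories, empty for spool files
    [ -d "$dir" ] || return 0
    for path in "$dir"/*; do
        [ -e "$path" ] || continue            # empty directory: the glob matched nothing
        user=${path##*/}
        # Skip entries (lost+found, shared dirs) that are not known accounts.
        id -u "$user" >/dev/null 2>&1 || continue
        run chown $flag "$user" "$path"
    done
}

rechown_dir /home -R
rechown_dir /var/spool/cron
rechown_dir /var/spool/mail
```

Run it once in dry-run mode and read the list before repeating with DRY_RUN=0 as root, with all users logged off.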