SLAC ESD Software Engineering Group

 

UNIX SYSTEM ADMIN

 

 

RHEL5 Upgrade Issues and Solutions



 

 

This document describes issues found during the RHEL5 upgrade and the solutions applied.

 

NFS mounting issue

rsize and wsize are the sizes of the data blocks that the NFS client and server pass back and forth to each other, i.e., the NFS data transfer buffer sizes. They can be set on both the NFS server and the NFS client. The default is normally rsize=32768,wsize=32768
(32k = 32*1024 = 32768 bytes). On a RHEL NFS client, the setting is defined in, e.g.,


grep NFSSVC_MAXBLKSIZE /usr/src/kernels/2.6.18-274.7.1.el5-PAE-i686/include/linux/nfsd/const.h
#define NFSSVC_MAXBLKSIZE (32*1024)

On a RHEL4 client machine, the default size (rsize=32768,wsize=32768) is used for NFS, regardless of the setting on the NFS server.

However, on a RHEL5 client machine, even if NFSSVC_MAXBLKSIZE is defined as 32768 bytes in const.h, the client may not use this default; instead, it can use the setting defined on the NFS server if that size is larger.

e.g. on lcls-archsrv (RHEL5)

jingchen@lcls-archsrv $ grep wain029 /proc/mounts
wain029:/g.archiver /a/wain029/g.archiver nfs rw,nosuid,vers=3,rsize=1048576,wsize=1048576,hard,intr,proto=tcp,timeo=600,retrans=3,sec=sys,addr=wain029 0 0

The size is 32 times larger (1048576 / 32768 = 32). This large setting is meant to optimize NFS performance in general; however, it can become a performance problem if the files transferred are small, as with IndexBuilder on lcls-archsrv.

For a taylored machine which uses automount, the client can be forced to use the default by changing selectors_on_defaults to selectors_in_defaults in /etc/amd.conf (and touching /usr/etc/linux24.amd.flag).
With this change, the setting on lcls-archsrv is

jingchen@lcls-archsrv $ grep wain029 /proc/mounts
wain029:/g.archiver /a/wain029/g.archiver nfs rw,nosuid,vers=3,rsize=32768,wsize=32768,hard,intr,proto=tcp,timeo=600,retrans=3,sec=sys,addr=wain029 0 0
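To compare the effective sizes across several mounts without reading the full option string, the check above can be condensed. The following is a minimal sketch; the function name nfs_buffer_sizes is hypothetical, and it assumes the usual /proc/mounts column layout (device, mount point, type, options).

```shell
# Print "mountpoint rsize wsize" for every NFS entry in a mounts file.
# The file argument is optional and defaults to /proc/mounts.
nfs_buffer_sizes() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        rsize = wsize = "?"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] ~ /^rsize=/) rsize = substr(opts[i], 7)
            if (opts[i] ~ /^wsize=/) wsize = substr(opts[i], 7)
        }
        print $2, rsize, wsize
    }' "${1:-/proc/mounts}"
}
```

Calling nfs_buffer_sizes with no argument reports the mounts of the current machine, one line per NFS mount point.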

For a standalone machine, e.g. lcls-opi09, this setting can be redefined in /etc/fstab (as a mount option):

rsize=32768,wsize=32768

One can also do so on the command line:
mount -o rsize=1024,wsize=1024 eris:/mn/eris/local /mnt

Tuning rsize and wsize

Experiment to find the rsize and wsize values that work and are as fast as possible. You can test the speed of your options with some simple commands, e.g. to test the sequential write performance:

time dd if=/dev/zero of=/mnt/testfile bs=16k count=4096

This creates a 64 MB file (4096 blocks of 16k each) of zeroed bytes, which should be large enough that caching is not a significant part of the measured performance; use a larger file if you have a lot of memory. Repeat it several times (5-10) and average the times. Then you can test the read performance by reading back the file:

time dd if=/mnt/testfile of=/dev/null bs=16k

Mirroring system disk sda to a backup disk sdb on RHEL5

Starting with RHEL5 (perhaps due to the new version of GRUB that comes with the RHEL5 installation), /boot/grub/grub.conf (or /etc/grub.conf) uses LABEL instead of a device name (disk partition, e.g. root=/dev/sda1). This makes our dd-based disk mirroring and system recovery procedure fail: presumably because both disks carry the same labels after dd, GRUB confuses which disk to load, and the system somehow mounts / on the second drive when it reboots after dd. grub.conf is used by GRUB, which is started by the BIOS, to load the Linux kernel and device drivers (stage1: IPL, the initial program loader in the MBR on the first disk and first partition; stage2: loading the kernel and device drivers from /boot).

The fix is to replace LABEL with the disk partition, i.e., to replace root=LABEL=/ with root=/dev/sda1 in grub.conf before running dd.
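The replacement can be made with sed. The sketch below is one possible way to do it, not the procedure actually used; the function name use_device_root is hypothetical, and it relies on GNU sed's -i option with a backup suffix.

```shell
# Rewrite root=LABEL=... to an explicit partition in a GRUB config file.
# Arguments: path to grub.conf, replacement device (default /dev/sda1).
# The original file is kept as <file>.bak.
use_device_root() {
    conf=$1
    dev=${2:-/dev/sda1}
    sed -i.bak "s|root=LABEL=[^[:space:]]*|root=$dev|" "$conf"
}
```

It is probably safer to run this on /boot/grub/grub.conf rather than /etc/grub.conf: the latter is normally a symbolic link, and sed -i would replace the link with a regular file.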

 

Shell scripts which rely on an old POSIX version

From Shashi: the GNU utilities normally conform to the version of POSIX that is standard for your system. To make them conform to a different version of POSIX, define the _POSIX2_VERSION environment variable to a value of the form yyyymm specifying the year and month the standard was adopted. Two values are currently supported for _POSIX2_VERSION: '199209' stands for POSIX 1003.2-1992, and '200112' stands for POSIX 1003.1-2001. For example, if you have a newer system but are running software that assumes an older version of POSIX and uses 'sort +1' or 'tail +10',
you can work around any compatibility problems by setting '_POSIX2_VERSION=199209' in your environment.

We may or may not have shell scripts which use such obsolete options for commands like sort, cut, etc. In case we (or users) discover any:

My proposal, in case we receive any complaints about scripts breaking, is:

Temporary fix - add one line to the script:

export _POSIX2_VERSION=199209

Permanent fix - change the code to conform to the new POSIX standard.
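For the permanent fix, the two obsolete forms mentioned above have direct modern equivalents. The sketch below wraps them in hypothetical helper functions purely for illustration; in a real script the replacement would be made in place.

```shell
# Obsolete form          Replacement accepted by current POSIX
#   sort +1 FILE     ->    sort -k 2 FILE      (sort key starts at field 2)
#   tail +10 FILE    ->    tail -n +10 FILE    (print from line 10 onward)

sort_from_second_field() { sort -k 2 "$@"; }
tail_from_line_ten()     { tail -n +10 "$@"; }
```

Both replacements also work on RHEL4, so a script changed this way does not need _POSIX2_VERSION on either release.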

Note: POSIX (Portable Operating System Interface) is a family of standards
specified by the IEEE for maintaining compatibility between operating systems.
POSIX defines the API, along with command line shells and utility interfaces,
for software compatible with variants of Unix and other operating systems.

Xvfb

On RHEL4: /usr/X11R6/bin/Xvfb
On RHEL5: /usr/bin/Xvfb

The startup script S97st.Xvfb needs to be updated with the new path.
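If the same startup script has to run on both releases, it could pick whichever path exists instead of hard-coding one. This is a minimal sketch; the helper name first_executable is hypothetical, and the actual contents of S97st.Xvfb are not shown here.

```shell
# Print the first argument that is an executable file; return 1 if none is.
first_executable() {
    for candidate in "$@"; do
        if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Possible use in a startup script: prefer the RHEL5 path, fall back to RHEL4.
# XVFB=$(first_executable /usr/bin/Xvfb /usr/X11R6/bin/Xvfb) || exit 1
```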

iocLogAndFwdServer

sioc-sys0-ml02>
../../../src/libCom/logClient/logClient.c:286 unable to bind error Cannot assign requested address

The messages are cleared after resending from the soft IOCs.

Virtual IP on lcls-daemon1

A virtual IP is no longer needed for softIOCs on lcls-daemon1, so it has been disabled. EPICS now supports nodename for a softIOC.

find works a bit differently

On RHEL4 -
$ strings /usr/bin/find | grep Automatically

On RHEL5 -
$ strings /usr/bin/find | grep Automatically
WARNING: Hard link count is wrong for %s: this may be a bug in your filesystem driver. Automatically turning on find's -noleaf option. Earlier results may have failed to include directories that should have been searched.

Solution: pass the -noleaf option to find.
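As an illustration of where the option goes, the sketch below lists directories with the hard-link-count optimization turned off. The function name list_dirs_noleaf is hypothetical; -noleaf is a GNU find option and is placed before the tests.

```shell
# List all directories under a tree without relying on hard-link counts.
# Argument: the directory to search.
list_dirs_noleaf() {
    find "$1" -noleaf -type d
}
```

The same placement applies to existing scripts: insert -noleaf directly after the search path, ahead of tests such as -type or -name.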

OPI Issues


See Shashi's doc


LD_ASSUME_KERNEL


LD_ASSUME_KERNEL is an environment variable used by the dynamic linker to decide which implementation of a library is used. In most cases, the most important library is the C library, "libc" or "glibc". glibc is important because it contains the thread implementation for the system. For properly written apps, there should be no reason to use this setting. However, for some legacy apps that depend on a particular thread implementation in glibc, LD_ASSUME_KERNEL can be used to force the app to use an older implementation.

This is no longer needed and has been disabled in
./loos/ENVS:
#export LD_ASSUME_KERNEL=2.4.1


python launching

There were issues where Murali was including a non-standard version of python in some of his scripts. This is a problem because the python API sometimes changes between versions.

The correct way to do this is to use
#!/usr/bin/env python as your script launcher.

As of today, this should get you Python 2.6.4 on all the environments.
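To find scripts that still hard-code an interpreter path, the first line of each script can be checked. This is a rough sketch; the function name has_env_python_shebang is hypothetical, and it accepts only the exact launcher recommended above.

```shell
# Succeed if the first line of FILE is "#!/usr/bin/env python"
# (optionally followed by whitespace or arguments).
has_env_python_shebang() {
    head -n 1 "$1" | grep -Eq '^#![[:space:]]*/usr/bin/env[[:space:]]+python([[:space:]]|$)'
}

# Possible use: report python scripts in the current directory that do not comply.
# for f in *.py; do has_env_python_shebang "$f" || echo "check: $f"; done
```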

eclipse

- /usr/local/lcls/package/eclipse/3.7/eclipse reinstalled
- /usr/local/lcls/package/java/jdk1.6.0_11/bin is needed for CVS to work

- /home/softegr/partha/ENVS
alias eclipse="/usr/local/lcls/package/eclipse/3.7/eclipse -vm /usr/local/lcls/package/java/jdk1.6.0_11/bin -clean -data /home/softegr/partha/workspace"

- updated /usr/local/lcls/package/eclipse/3.7/eclipse.ini

-Dorg.eclipse.swt.browser.XULRunnerPath=/usr/lib/xulrunner-1.9.2

The root cause is the Mozilla browser that Eclipse was trying to run. Partha was trying to use xulrunner-1.9.2 even though it was going through swt-mozilla. It looks like xulrunner is not properly registered. Partha added the line above to eclipse.ini and it worked like a charm!

NFS steals the port number for cupsd

On RHEL4 and RHEL5, the NFS client dynamically allocates a port at reboot; thus it sometimes uses 631, which is reserved
for cupsd, making cupsd fail. The failure of cupsd causes the Elog printing queues to fail.

The solution is to make NFS use static ports:

update /etc/sysconfig/nfs and uncomment the following lines:

[jingchen@lcls-opi01 sysconfig]$ grep PORT /etc/sysconfig/nfs
#RQUOTAD_PORT=875
#LOCKD_TCPPORT=32803
#LOCKD_UDPPORT=32769
#MOUNTD_PORT=892
#STATD_PORT=662
#STATD_OUTGOING_PORT=2020
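The uncommenting can be done by hand or with sed. The following is a sketch under the assumption that the file contains only the port lines shown above in commented form; the function name enable_static_nfs_ports is hypothetical, and the pattern would also uncomment any other line of the form #NAME_PORT=number.

```shell
# Uncomment every "#..._PORT=<number>" style line in an NFS sysconfig file.
# Argument: path to the file (on RHEL, /etc/sysconfig/nfs).
# The original file is kept as <file>.bak.
enable_static_nfs_ports() {
    sed -i.bak 's/^#\([A-Z_]*PORT=[0-9][0-9]*\)/\1/' "$1"
}
```

After the change, the NFS services most likely need to be restarted (or the machine rebooted) before the static ports take effect.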

 

References:

  1. Problems and Solutions in RHEL5 upgrade (Shashi)
  2. Physic-elog Upgrade to RHEL5 (Shashi)
  3. Special configuration

Author: Jingchen Zhou (x4661, jingchen@slac). Last edited on March 25, 2012