We have recently migrated to a larger single-server configuration to support increasing load. The new server is a dual-processor Xeon 2.53 GHz machine with 12 GB RAM running Ubuntu 8.04 Server. While we were at it, we also upgraded from 6.0.1 to 6.0.3.

Generally the server runs very well and supports several dozen simultaneous users with aplomb, but a couple of times per day the load average climbs to 10+ and the server becomes temporarily unresponsive. It typically recovers before the underlying protocols time out, but it is quite noticeable when it occurs.

In investigating the issue we have followed the tuning guidelines, increasing the InnoDB buffer pool to 40% of total RAM and reducing the JVM heap to 20% of total RAM. Unfortunately, while these measures seem to improve the "sunny day" performance, the problem still occurs with the same severity. When it occurs, it is primarily Java processes monopolizing CPU time, with little or no iowait. Sometimes the stats process is running, other times it is not.

I haven't yet found anything in zimbra.log or mailbox.log that seems to correlate, but I have found that we frequently see errors like this:
zmmtaconfig: Skipping getAllReverseProxyURLs ERROR: service.FAILURE (system failure: ZimbraLdapContext) (cause: javax.naming.CommunicationException zimbra.mydomain.com:389)
zmmtaconfig: gacf ERROR: service.FAILURE (system failure: ZimbraLdapContext) (cause: javax.naming.CommunicationException zimbra.mydomain.com:389)
zmmtaconfig: Skipping getAllMtaAuthURLs ERROR: service.FAILURE (system failure: ZimbraLdapContext) (cause: javax.naming.CommunicationException zimbra.mydomain.com:389)
zmmtaconfig: Sleeping...Key lookup failed.
These occur several times per minute, day in and day out. In researching these messages it would appear they are frequently associated with services failing to start, but in my case everything starts happily and the condition of things looks good:
$ zmcontrol status
$ ldap status
slapd running pid: 18976
* DNS and IP assignments remained static through the transition.
* Architecture was the same on both machines (hence dummy install).
* My /etc/hosts file may have been malformed at the times the dummy install and upgrade were run. Presently it looks like this:
127.0.0.1 localhost.localdomain localhost
220.127.116.11 zimbra.mydomain.com zimbra
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
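
A commonly cited hosts-file problem is the server's fully qualified name ending up on a loopback line. As a rough way to verify the /etc/hosts contents quoted above, the sketch below parses hosts-style text and reports which addresses a name maps to. The helper names are hypothetical and the parsing is deliberately simple (comments stripped, whitespace-separated fields); it assumes Python 3 and does not consult DNS.

```python
import ipaddress


def hosts_entries(text):
    """Parse hosts-file text into a list of (address, [names]) tuples."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) >= 2:
            entries.append((fields[0], fields[1:]))
    return entries


def addresses_for(text, name):
    """Return every address whose line lists the given name."""
    return [addr for addr, names in hosts_entries(text) if name in names]


def maps_to_loopback(text, name):
    """True if any address listed for the name is a loopback address."""
    for addr in addresses_for(text, name):
        try:
            if ipaddress.ip_address(addr).is_loopback:
                return True
        except ValueError:
            continue  # skip anything that is not a plain IP address
    return False


# The hosts file as quoted in this post.
SAMPLE = """127.0.0.1 localhost.localdomain localhost
220.127.116.11 zimbra.mydomain.com zimbra
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
"""

print(addresses_for(SAMPLE, "zimbra.mydomain.com"))
print(maps_to_loopback(SAMPLE, "zimbra.mydomain.com"))
```

To run it against the live file rather than the quoted text, the contents of /etc/hosts can be read with open("/etc/hosts").read() and passed in place of SAMPLE.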
Obviously I'm just looking for advice on how to isolate and resolve. Thanks in advance for any pointers.