Hi All,

I'm running Zimbra 5.0.20 NE on a two-node active/standby cluster running CentOS 4.8. The other day the cluster failed over to the standby node, and I'm trying to determine why. In the logs I see:

/var/log/messages on node 1 (originally the standby, became the active):
Code:
May  7 06:40:16 wsl-mx1 clurgmgrd[5374]: <notice> Recovering failed service mx.mydomain.com 
May  7 06:40:17 wsl-mx1 kernel: kjournald starting.  Commit interval 5 seconds
May  7 06:40:17 wsl-mx1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
May  7 06:40:17 wsl-mx1 kernel: EXT3 FS on emcpowera1, internal journal
May  7 06:40:17 wsl-mx1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May  7 06:42:59 wsl-mx1 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
May  7 06:42:59 wsl-mx1 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
May  7 06:43:03 wsl-mx1 clurgmgrd: [5374]: <err> script:zimbra: start of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1) 
May  7 06:43:03 wsl-mx1 clurgmgrd[5374]: <notice> start on script "zimbra" returned 1 (generic error) 
May  7 06:43:03 wsl-mx1 clurgmgrd[5374]: <warning> #68: Failed to start service:mx.mydomain.com; return value: 1 
May  7 06:43:03 wsl-mx1 clurgmgrd[5374]: <notice> Stopping service mx.mydomain.com 
May  7 06:43:14 wsl-mx1 clurgmgrd: [5374]: <notice> Forcefully unmounting /opt/zimbra-cluster/mountpoints/mx.mydomain.com 
May  7 06:43:14 wsl-mx1 clurgmgrd: [5374]: <warning> killing process 7666 (zimbra amavisd /opt/zimbra-cluster/mountpoints/mx.mydomain.com)
...(more killing process messages)
May  7 06:43:20 wsl-mx1 clurgmgrd[5374]: <notice> Service mx.mydomain.com is recovering 
May  7 07:46:16 wsl-mx1 clurgmgrd[5374]: <notice> Starting stopped service mx.mydomain.com 
May  7 07:46:16 wsl-mx1 kernel: kjournald starting.  Commit interval 5 seconds
May  7 07:46:16 wsl-mx1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
May  7 07:46:16 wsl-mx1 kernel: EXT3 FS on emcpowera1, internal journal
May  7 07:46:16 wsl-mx1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May  7 07:48:09 wsl-mx1 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
May  7 07:48:09 wsl-mx1 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
May  7 07:48:13 wsl-mx1 clurgmgrd[5374]: <notice> Service mx.mydomain.com started

/var/log/messages on node 2 (originally the active, became the standby):
Code:
May  7 06:36:20 wsl-mx2 clurgmgrd: [5376]: <err> script:zimbra: status of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1) 
May  7 06:36:20 wsl-mx2 clurgmgrd[5376]: <notice> status on script "zimbra" returned 1 (generic error) 
May  7 06:36:20 wsl-mx2 clurgmgrd[5376]: <notice> Stopping service mx.mydomain.com 
May  7 06:37:08 wsl-mx2 clurgmgrd[5376]: <notice> Service mx.mydomain.com is recovering 
May  7 06:37:08 wsl-mx2 clurgmgrd[5376]: <notice> Recovering failed service mx.mydomain.com 
May  7 06:37:08 wsl-mx2 kernel: kjournald starting.  Commit interval 5 seconds
May  7 06:37:08 wsl-mx2 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
May  7 06:37:08 wsl-mx2 kernel: EXT3 FS on emcpowera1, internal journal
May  7 06:37:08 wsl-mx2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
May  7 06:39:55 wsl-mx2 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
May  7 06:39:55 wsl-mx2 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
May  7 06:39:59 wsl-mx2 clurgmgrd: [5376]: <err> script:zimbra: start of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1) 
May  7 06:39:59 wsl-mx2 clurgmgrd[5376]: <notice> start on script "zimbra" returned 1 (generic error) 
May  7 06:39:59 wsl-mx2 clurgmgrd[5376]: <warning> #68: Failed to start service:mx.mydomain.com; return value: 1 
May  7 06:39:59 wsl-mx2 clurgmgrd[5376]: <notice> Stopping service mx.mydomain.com 
May  7 06:40:10 wsl-mx2 clurgmgrd: [5376]: <notice> Forcefully unmounting /opt/zimbra-cluster/mountpoints/mx.mydomain.com 
May  7 06:40:10 wsl-mx2 clurgmgrd: [5376]: <warning> killing process 6870 (zimbra amavisd /opt/zimbra-cluster/mountpoints/mx.mydomain.com) 
...(more killing process messages)
May  7 06:40:16 wsl-mx2 clurgmgrd[5376]: <notice> Service mx.mydomain.com is recovering 
May  7 06:40:16 wsl-mx2 clurgmgrd[5376]: <warning> #71: Relocating failed service mx.mydomain.com

I didn't see anything particularly interesting in the Zimbra logs. They were too big to include in this message, so I'll post them in a reply.

I found two threads that might be related to this:
http://www.zimbra.com/forums/install...g-problem.html
http://www.zimbra.com/forums/install...vice-well.html

The former suggests deleting the log directory; the latter suggests increasing the zmcluctl timeout. However, neither thread says whether the suggested fix actually solved the problem.
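
One way to sanity-check the timeout theory is to time each start attempt in the excerpts above. This is only a rough Python sketch, assuming the syslog format shown in this post. The function name `start_attempts`, the `year` argument (syslog lines carry no year, so 2010 is an arbitrary placeholder), and the `SAMPLE` list are illustrative and not part of Zimbra or rgmanager:

```python
import re
from datetime import datetime

# Matches both clurgmgrd prefixes seen in the excerpts:
#   "clurgmgrd[5374]: <notice> ..."  and  "clurgmgrd: [5376]: <err> ..."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) "
    r"clurgmgrd(?:\[\d+\])?:(?: \[\d+\]:)? <\w+> (.*)$"
)

def start_attempts(lines, year=2010):
    """Return (host, seconds, outcome) for each service start attempt.

    The year is an assumption; only time differences are used.
    """
    begun = {}      # host -> time the current start attempt began
    attempts = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a clurgmgrd line
        stamp, host, msg = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if msg.startswith(("Recovering failed service", "Starting stopped service")):
            begun[host] = when
        elif host in begun and (msg.startswith("start on script") or msg.endswith("started")):
            outcome = "ok" if msg.endswith("started") else "failed"
            elapsed = int((when - begun.pop(host)).total_seconds())
            attempts.append((host, elapsed, outcome))
    return attempts

# A few lines copied from the excerpts above.
SAMPLE = [
    'May  7 06:37:08 wsl-mx2 clurgmgrd[5376]: <notice> Recovering failed service mx.mydomain.com ',
    'May  7 06:39:59 wsl-mx2 clurgmgrd[5376]: <notice> start on script "zimbra" returned 1 (generic error) ',
    'May  7 06:40:16 wsl-mx1 clurgmgrd[5374]: <notice> Recovering failed service mx.mydomain.com ',
    'May  7 06:43:03 wsl-mx1 clurgmgrd[5374]: <notice> start on script "zimbra" returned 1 (generic error) ',
    'May  7 07:46:16 wsl-mx1 clurgmgrd[5374]: <notice> Starting stopped service mx.mydomain.com ',
    'May  7 07:48:13 wsl-mx1 clurgmgrd[5374]: <notice> Service mx.mydomain.com started',
]

if __name__ == "__main__":
    for host, seconds, outcome in start_attempts(SAMPLE):
        print(f"{host}: {seconds}s ({outcome})")
```

On the lines quoted above this gives 171 s and 167 s for the two failed starts and 117 s for the one that succeeded. The failed attempts ran longer than the successful one, which is at least compatible with the timeout theory, though it doesn't prove it.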

Any suggestions? Thanks!