
Thread: ZCS cluster went down - /opt/zimbra-cluster/bin/zmcluctl failed (returned 1)


  1. #1
    Join Date
    May 2010
    Posts
    12
    Rep Power
    5

    Default ZCS cluster went down - /opt/zimbra-cluster/bin/zmcluctl failed (returned 1)

    Hi All,

    I'm running Zimbra 5.0.20 NE on a two-node active/standby cluster on CentOS 4.8. The other day the cluster failed over to the standby, and I'm trying to determine why. In the logs, I see:

    /var/log/messages on node 1 (originally the standby, became the active):
    Code:
    May  7 06:40:16 wsl-mx1 clurgmgrd[5374]: <notice> Recovering failed service mx.mydomain.com 
    May  7 06:40:17 wsl-mx1 kernel: kjournald starting.  Commit interval 5 seconds
    May  7 06:40:17 wsl-mx1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
    May  7 06:40:17 wsl-mx1 kernel: EXT3 FS on emcpowera1, internal journal
    May  7 06:40:17 wsl-mx1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
    May  7 06:42:59 wsl-mx1 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
    May  7 06:42:59 wsl-mx1 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
    May  7 06:43:03 wsl-mx1 clurgmgrd: [5374]: <err> script:zimbra: start of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1) 
    May  7 06:43:03 wsl-mx1 clurgmgrd[5374]: <notice> start on script "zimbra" returned 1 (generic error) 
    May  7 06:43:03 wsl-mx1 clurgmgrd[5374]: <warning> #68: Failed to start service:mx.mydomain.com; return value: 1 
    May  7 06:43:03 wsl-mx1 clurgmgrd[5374]: <notice> Stopping service mx.mydomain.com 
    May  7 06:43:14 wsl-mx1 clurgmgrd: [5374]: <notice> Forcefully unmounting /opt/zimbra-cluster/mountpoints/mx.mydomain.com 
    May  7 06:43:14 wsl-mx1 clurgmgrd: [5374]: <warning> killing process 7666 (zimbra amavisd /opt/zimbra-cluster/mountpoints/mx.mydomain.com)
    ...(more killing process messages)
    May  7 06:43:20 wsl-mx1 clurgmgrd[5374]: <notice> Service mx.mydomain.com is recovering 
    May  7 07:46:16 wsl-mx1 clurgmgrd[5374]: <notice> Starting stopped service mx.mydomain.com 
    May  7 07:46:16 wsl-mx1 kernel: kjournald starting.  Commit interval 5 seconds
    May  7 07:46:16 wsl-mx1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
    May  7 07:46:16 wsl-mx1 kernel: EXT3 FS on emcpowera1, internal journal
    May  7 07:46:16 wsl-mx1 kernel: EXT3-fs: mounted filesystem with ordered data mode.
    May  7 07:48:09 wsl-mx1 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
    May  7 07:48:09 wsl-mx1 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
    May  7 07:48:13 wsl-mx1 clurgmgrd[5374]: <notice> Service mx.mydomain.com started
    /var/log/messages on node 2 (originally the active, became the standby):
    Code:
    May  7 06:36:20 wsl-mx2 clurgmgrd: [5376]: <err> script:zimbra: status of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1) 
    May  7 06:36:20 wsl-mx2 clurgmgrd[5376]: <notice> status on script "zimbra" returned 1 (generic error) 
    May  7 06:36:20 wsl-mx2 clurgmgrd[5376]: <notice> Stopping service mx.mydomain.com 
    May  7 06:37:08 wsl-mx2 clurgmgrd[5376]: <notice> Service mx.mydomain.com is recovering 
    May  7 06:37:08 wsl-mx2 clurgmgrd[5376]: <notice> Recovering failed service mx.mydomain.com 
    May  7 06:37:08 wsl-mx2 kernel: kjournald starting.  Commit interval 5 seconds
    May  7 06:37:08 wsl-mx2 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
    May  7 06:37:08 wsl-mx2 kernel: EXT3 FS on emcpowera1, internal journal
    May  7 06:37:08 wsl-mx2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
    May  7 06:39:55 wsl-mx2 saslauthd: auth_zimbra_init: zimbra_cert_check is off!
    May  7 06:39:55 wsl-mx2 saslauthd: auth_zimbra_init: 1 auth urls initialized for round-robin
    May  7 06:39:59 wsl-mx2 clurgmgrd: [5376]: <err> script:zimbra: start of /opt/zimbra-cluster/bin/zmcluctl failed (returned 1) 
    May  7 06:39:59 wsl-mx2 clurgmgrd[5376]: <notice> start on script "zimbra" returned 1 (generic error) 
    May  7 06:39:59 wsl-mx2 clurgmgrd[5376]: <warning> #68: Failed to start service:mx.mydomain.com; return value: 1 
    May  7 06:39:59 wsl-mx2 clurgmgrd[5376]: <notice> Stopping service mx.mydomain.com 
    May  7 06:40:10 wsl-mx2 clurgmgrd: [5376]: <notice> Forcefully unmounting /opt/zimbra-cluster/mountpoints/mx.mydomain.com 
    May  7 06:40:10 wsl-mx2 clurgmgrd: [5376]: <warning> killing process 6870 (zimbra amavisd /opt/zimbra-cluster/mountpoints/mx.mydomain.com) 
    ...(more killing process messages)
    May  7 06:40:16 wsl-mx2 clurgmgrd[5376]: <notice> Service mx.mydomain.com is recovering 
    May  7 06:40:16 wsl-mx2 clurgmgrd[5376]: <warning> #71: Relocating failed service mx.mydomain.com
    I didn't see anything particularly interesting in the zimbra logs, and they were too big to post in this message, so I'll reply back with them.

    I found two threads that might be related to this:
    http://www.zimbra.com/forums/install...g-problem.html
    http://www.zimbra.com/forums/install...vice-well.html

    The former suggests deleting the log directory, the latter suggests increasing the zmcluctl timeout. However, neither thread indicates whether the suggested fix actually solved the problem.
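
    For what it's worth, the log lines show rgmanager calling the script with start, stop, and status arguments, so the status check can be run and timed by hand to see what it returns and how long it takes (I'm guessing the exact invocation from those log lines):
    Code:
    # run the same status check rgmanager runs, and time it
    time /opt/zimbra-cluster/bin/zmcluctl status
    echo "exit code: $?"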

    Any suggestions? Thanks!

  2. #2
    Join Date
    May 2010
    Posts
    12
    Rep Power
    5

    Default zimbra.log

    Attached are my zimbra.log snippets from both servers. Thanks!

  3. #3
    Join Date
    May 2010
    Posts
    12
    Rep Power
    5

    Default

    (Sorry to spam my own post)

    I was looking through the zimbra.log on node 2 (originally the active), and noticed this:

    Code:
    May  7 06:36:20 wsl-mx2 zimbra-cluster[2821]: status - rc=1 from zmcontrol: output=[Host mx.mydomain.com <EOL>, 	antispam                Running <EOL>, 	antivirus               Stopped <EOL>, 		zmclamdctl is not running <EOL>, 	imapproxy               Running <EOL>, 	ldap                    Running <EOL>, 	logger                  Running <EOL>, 	mailbox                 Running <EOL>, 	mta                     Running <EOL>, 	snmp                    Running <EOL>, 	spell                   Running <EOL>, 	stats                   Running ] 
    May  7 06:36:21 wsl-mx2 zimbra-cluster[3300]: stop -  Zimbra stop initiated via zmcluctl
    Could this indicate that the cluster failed over because antivirus wasn't running? I don't see anything worthwhile in clamd.log from that time, other than the daemon starting up after the failover:

    Code:
    Fri May  7 06:20:02 2010 -> SelfCheck: Database status OK.
    Fri May  7 06:30:02 2010 -> SelfCheck: Database status OK.
    Fri May  7 06:35:56 2010 -> Reading databases from /opt/zimbra/data/clamav/db
    Fri May  7 06:38:36 2010 -> +++ Started at Fri May  7 06:38:36 2010
    Fri May  7 06:38:36 2010 -> clamd daemon 0.95.1-broken-compiler (OS: linux-gnu, ARCH: i386, CPU: i686)
    Fri May  7 06:38:36 2010 -> Log file size limited to 20971520 bytes.
    Fri May  7 06:38:36 2010 -> Reading databases from /opt/zimbra/data/clamav/db
    Fri May  7 06:38:36 2010 -> Not loading PUA signatures.
    Fri May  7 06:41:43 2010 -> +++ Started at Fri May  7 06:41:43 2010
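
    Since the status output in that first snippet comes from zmcontrol, I figure I can re-run the same check by hand and grep the logs around the failover window. Something like this (the log path is my guess for a standard install):
    Code:
    # re-run the check the cluster script wraps
    su - zimbra -c 'zmcontrol status'
    # look for clamd-related messages around the failover window
    grep clamd /var/log/zimbra.log | grep 'May  7 06:3'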

  4. #4
    Join Date
    Mar 2006
    Location
    Beaucaire, France
    Posts
    2,322
    Rep Power
    13

    Default

    This looks like the way the cluster scripts work: they test all the components every X minutes (5 or 10, I can't remember). If one of the components is not working, it switches!

    It's supposed to fence the node on which the component failed and switch to the spare node.

    If fencing is not working properly, you can end up with both nodes active at the same time (split-brain). Very, very bad.
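
    If you want to check the exact interval on your boxes, it should be in the resource agent metadata for the "script" resource. Assuming a stock RHCS install (the path is from memory, so verify it):
    Code:
    # status-check interval for "script" resources
    grep -A 2 '<action name="status"' /usr/share/cluster/script.sh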

  5. #5
    Join Date
    May 2010
    Posts
    12
    Rep Power
    5

    Default

    Ah ok, so you agree that it failed over because AV wasn't running? How would I figure out why AV wasn't running, as I don't see anything in the clamd log file?

    And yes, the failover did not work properly. I have it set to fence using DRAC, so it should just power off the failed node and start up the services on the standby. However, that didn't happen: both servers stayed online, and the Zimbra service was taken down and did not come back up automatically; it had to be brought up manually.
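
    I'm going to test the fencing path directly next. As I understand it, fence_node asks the cluster to fence the named node using whatever cluster.conf defines, so if the DRAC setup is right this should actually power the target off (only safe in a maintenance window, obviously):
    Code:
    # WARNING: this really fences (powers off) the named node per cluster.conf
    fence_node wsl-mx2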

  6. #6
    Join Date
    Mar 2006
    Location
    Beaucaire, France
    Posts
    2,322
    Rep Power
    13

    Default

    Fencing not working as it should, and the scripts switching "too quickly" (waiting for an operator to restart the antivirus would have been fine), are the reasons some of our customers left RHCS and now run "hand clustering".

    If a module goes down, it is manually restarted.
    If the active server goes down, ZCS is manually switched to the spare server.

    In ZCS 6, there's a new way of configuring RHCS so that it switches only in case of "hardware" failure.

  7. #7
    Join Date
    May 2010
    Posts
    12
    Rep Power
    5

    Default

    Got it, so RHCS failover is not very reliable. I'll probably just write a wrapper script for the cluster to call instead of zmcluctl: it would return 1 only if the mailbox service is down or port 25 stops answering, and would just alert me for anything else.
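
    Something along these lines -- an untested sketch, where the alert address is a placeholder and I'm assuming rgmanager calls the wrapper with start/stop/status just like it calls zmcluctl:
    Code:
    #!/bin/bash
    # Untested sketch of a zmcluctl wrapper; ALERT_ADDR is a placeholder.
    ALERT_ADDR="admin@mydomain.com"
    ZMCLUCTL=/opt/zimbra-cluster/bin/zmcluctl

    case "$1" in
      start|stop)
        # hand start/stop straight to the real script
        exec "$ZMCLUCTL" "$1"
        ;;
      status)
        OUT=$(su - zimbra -c 'zmcontrol status' 2>&1)
        # fail over only for the two conditions I actually care about
        if echo "$OUT" | grep -q 'mailbox.*Stopped'; then
          echo "$OUT" | mail -s "$(hostname): mailbox down, failing over" "$ALERT_ADDR"
          exit 1
        fi
        if ! (echo > /dev/tcp/127.0.0.1/25) 2>/dev/null; then
          echo "port 25 not answering" | mail -s "$(hostname): SMTP down, failing over" "$ALERT_ADDR"
          exit 1
        fi
        # anything else stopped (e.g. antivirus): alert me, but keep the service up
        if echo "$OUT" | grep -q 'Stopped'; then
          echo "$OUT" | mail -s "$(hostname): non-critical Zimbra service stopped" "$ALERT_ADDR"
        fi
        exit 0
        ;;
      *)
        exec "$ZMCLUCTL" "$@"
        ;;
    esac
    That way a stopped antivirus just sends me mail instead of bouncing the whole service between nodes.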

    Any idea why the clamd service might have died, or where I could check to glean more info? Thanks!
