Results 1 to 4 of 4

Thread: My Zimbra Server crashed this morning...

Hybrid View

  1. #1
    Join Date
    Jul 2006
    Location
    New York, NY
    Posts
    122
    Rep Power
    9

    Exclamation My Zimbra Server crashed this morning...

    So this morning around 8:45am the main Zimbra mailbox server became unresponsive. Normally when I can't ssh into a *nix system I pop up a terminal on the machine if its local and switch over to a second terminal (ALT+F2) if the first one is frozen for some reason. I was on my way into work at the time but the employees were already causing a fuss, as most employees do when email is down.

    Anyways since I wasn't there another admin did the only thing he could think of and restarted the server. When I came in the server was still booting, more on that later, and in took some troubleshooting to get the system back to normal, so here's where I'm at now:

    This next part is long......

    1) Bravo Gents that was the first outage (read: more than just a few minutes) we've had since we migrated from Exchange last May.

    2) Since the system had not been rebooted in such a long time it when fsck when booting, on most systems thats fine, but when checking through 1TB of HD space it takes quite a long time to finish. Is there anyway to disable forced filesystem checks in Linux? (I can probably Google this one)

    3) Maybe I never noticed this before but using Ubuntu 6.06 server with a clean install postfix has this setup in rc.d:
    Code:
       /etc/rc0.d/K20postfix
       /etc/rc1.d/K20postfix
       /etc/rc2.d/S20postfix
       /etc/rc3.d/S20postfix
       /etc/rc4.d/S20postfix
       /etc/rc5.d/S20postfix
       /etc/rc6.d/K20postfix
    This caused the MTA service to not startup as something was already using port 25 and was not owned by zimbra so it could not stop it. Killing the process as root and restart the MTA service brought it back up fine and dandy.

    Does anyone else with Ubuntu 6.06 and no LAMP installation also have this setup? If so is it safe to remove? I figure it will have the same problem again the next time it has to reboot.

    4) So I'm currently diagnosing what caused the problem and here's where it gets tricky. Since the system was powered off while still unresponsive the log files in /var/log all have create dates of when the server came back online (10:07am). "kern.log" However seems to have maintain its data from previous day but doesn't report anything in terms of a hardware failure between yesterday and this morning. The log files before that end at 6:29am, which is the time they were rotated. This is what I have from the logs stored in /opt/zimbra/log:

    From audit.log:

    Code:
    2008-04-03 08:50:05,693 INFO  [ImapServer-3848] [ip=xxx;] security - cmd=Auth; account=user1@domain.com; protocol=imap;
    2008-04-03 08:50:09,081 INFO  [ImapServer-3850] [ip=xxx;] security - cmd=Auth; account=user1@domain.com; protocol=imap;
    2008-04-03 08:55:25,188 WARN  [ImapServer-3876] [ip=xxx;] security - cmd=Auth; account=user2@domain.com; protocol=imap; error=system failure: getDirectContext;
    2008-04-03 08:55:51,565 WARN  [ImapServer-3879] [ip=xxx;] security - cmd=Auth; account=user3@domain.com; protocol=imap; error=system failure: getDirectContext;
    2008-04-03 08:56:29,224 WARN  [ImapServer-3883] [ip=xxx;] security - cmd=Auth; account=user4@domain.com; protocol=imap; error=system failure: getDirectContext;
    2008-04-03 08:56:29,224 WARN  [ImapServer-3884] [ip=xxx;] security - cmd=Auth; account=user5@domain.com; protocol=imap; error=system failure: getDirectContext;
    2008-04-03 08:57:08,013 WARN  [http-443-Processor18] [ip=xxx;ua=ZimbraWebClient - IE7 (Win);] security - cmd=Auth; account=user5@domain.com; protocol=soap; error=system failure: getDirectContext;
    2008-04-03 08:57:17,094 WARN  [http-443-Processor21] [ip=xxx;ua=ZimbraWebClient - IE7 (Win);] security - cmd=Auth; account=user5@domain.com; protocol=soap; error=system failure: getDirectContext;
    2008-04-03 08:57:38,389 WARN  [http-443-Processor29] [ip=xxx;ua=ZimbraWebClient - IE7 (Win);] security - cmd=Auth; account=user5@domain.com; protocol=soap; error=system failure: getDirectContext;
    2008-04-03 09:22:27,695 WARN  [http-7071-Processor48] [ip=xxx;ua=ZimbraWebClient - FF2.0 (Win)/undefined;] security - cmd=AdminAuth; account=admin; error=system failure: getDirectContext;
    
    After system restart
    
    2008-04-03 10:08:06,856 INFO  [ImapServer-1] [ip=xxx;] security - cmd=Auth; account=user3@domain.com; protocol=imap;
    2008-04-03 10:08:06,862 INFO  [ImapServer-2] [ip=xxx;] security - cmd=Auth; account=user6@domain.com; protocol=imap
    What does this error mean "error=system failure: getDirectContext;" ?

    From mailbox.log:

    Code:
    2008-04-03 08:53:36,735 INFO  [ImapServer-3870] [] imap - [xxx] connected
    2008-04-03 08:53:42,116 INFO  [ImapServer-3871] [] imap - [xxx] connected
    2008-04-03 08:53:50,647 INFO  [IndexWritersSweeperThread] [] MailboxIndex - open index writers sweep: before=21, closed=5, after=16 (241ms)
    2008-04-03 08:53:50,799 WARN  [http-443-Processor43] [] misc - unable to lookup account for log, id: c885be3e-927c-42bc-8a4b-960855dfc60a
    com.zimbra.common.service.ServiceException: system failure: getDirectContext
    	at com.zimbra.common.service.ServiceException.FAILURE(ServiceException.java:165)
    	at com.zimbra.cs.account.ldap.LdapUtil.getDirContext(LdapUtil.java:200)
    	at com.zimbra.cs.account.ldap.LdapUtil.getDirContext(LdapUtil.java:184)
    	at com.zimbra.cs.account.ldap.LdapProvisioning.getAccountByQuery(LdapProvisioning.java:441)
    	at com.zimbra.cs.account.ldap.LdapProvisioning.getAccountById(LdapProvisioning.java:469)
    	at com.zimbra.cs.account.ldap.LdapProvisioning.getAccountById(LdapProvisioning.java:496)
    	at com.zimbra.cs.account.ldap.LdapProvisioning.get(LdapProvisioning.java:485)
    	at com.zimbra.cs.account.Account.addAccountToLogContext(Account.java:119)
    	at com.zimbra.soap.SoapEngine.dispatch(SoapEngine.java:123)
    	at com.zimbra.soap.SoapEngine.dispatch(SoapEngine.java:85)
    	at com.zimbra.soap.SoapServlet.doPost(SoapServlet.java:220)
    	at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
    	at com.zimbra.cs.servlet.ZimbraServlet.service(ZimbraServlet.java:152)
    	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
    	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
    	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
    	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
    	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:541)
    	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
    	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
    	at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:667)
    	at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
    	at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
    	at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    	at java.lang.Thread.run(Thread.java:619)
    Caused by: javax.naming.CommunicationException: Timeout exceeded while waiting for a connection: 30000ms
    	at com.sun.jndi.ldap.pool.Connections.get(Connections.java:128)
    	at com.sun.jndi.ldap.pool.Pool.getPooledConnection(Pool.java:129)
    	at com.sun.jndi.ldap.LdapPoolManager.getLdapClient(LdapPoolManager.java:310)
    	at com.sun.jndi.ldap.LdapClient.getInstance(LdapClient.java:1572)
    	at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2616)
    	at com.sun.jndi.ldap.LdapCtx.<init>(LdapCtx.java:287)
    	at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:175)
    	at com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:193)
    	at com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:136)
    	at com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:66)
    	at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:667)
    	at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:288)
    	at javax.naming.InitialContext.init(InitialContext.java:223)
    	at javax.naming.ldap.InitialLdapContext.<init>(InitialLdapContext.java:134)
    	at com.zimbra.cs.account.ldap.LdapUtil.getDirContext(LdapUtil.java:195)
    	... 27 more
    There wasn't much in any of the other logs. I'm lost on that one so please help me out.

    5) After the main server was back up and running, MTA and all I noticed in the admin gui that only 1 server out of 3 was being reported. Knowing this problem usually stems from "sysklogd" needing a restart I did that on all the servers to no avail. Checking the local syslogs on the 2 other servers showed they were just not reporting anything since around 8:44am:

    Code:
    Apr  3 08:44:04 relay zimbramon[23889]: 23889:info: 2008-04-03 08:44:01, STATUS: relay.domain.com: snmp: Running
    Apr  3 08:44:04 relay zimbramon[23889]: 23889:info: 2008-04-03 08:44:01, STATUS: relay.domain.com: stats: Running
    Apr  3 11:10:04 relay2 zimbramon[31305]: 31305:info: 2008-04-03 11:10:01, STATUS: relay.domain.com: antispam: Running
    Apr  3 11:10:04 relay zimbramon[31305]: 31305:info: 2008-04-03 11:10:01, STATUS: relay.domain.com: antivirus: Running
    So to start fresh I ran 'zmcontrol stop' then 'ps auxww | grep zimbra' and got this:
    Code:
    zimbra   12628 99.7  2.6  62520 54344 ?        R     2007 220693:12 amavisd (ch12-12628-12)
    zimbra   24133  0.0  0.0      0     0 ?        Zs   08:46   0:00 [zmstatuslog] <defunct>
    zimbra   24141  0.0  0.2   6872  4204 ?        S    08:46   0:00 /usr/bin/perl /opt/zimbra/bin/zmcontrol status
    zimbra   24155  0.0  0.0   4208  1644 ?        S    08:46   0:00 sh -c /opt/zimbra/bin/zmprov -l gs relay.domain.com 2>/dev/null
    zimbra   24156  0.0  1.0 413968 22004 ?        Sl   08:46   0:00 /opt/zimbra/java/bin/java -client -Xmx256m -Dzimbra.home=/opt/zimbra -Djava.library.path=/opt/zimbra/lib -Djava.ext.dirs=/opt/zimbra/java/jre/lib/ext:/opt/zimbra/lib/jars:/opt/zimbra/lib/ext com.zimbra.cs.account.ProvUtil -l gs relay.domain.com
    zimbra   24188  0.0  0.0   2956  1368 ?        S    08:46   0:00 sh -c /opt/zimbra/bin/zmprov -l gacf | grep -v InheritedAttr
    zimbra   24189  0.0  1.0 414300 22116 ?        Sl   08:46   0:00 /opt/zimbra/java/bin/java -client -Xmx256m -Dzimbra.home=/opt/zimbra -Djava.library.path=/opt/zimbra/lib -Djava.ext.dirs=/opt/zimbra/java/jre/lib/ext:/opt/zimbra/lib/jars:/opt/zimbra/lib/ext com.zimbra.cs.account.ProvUtil -l gacf
    zimbra   24192  0.0  0.0   1636   504 ?        S    08:46   0:00 grep -v InheritedAttr
    Mind you nothing should have been running, killing all of those and restarting Zimbra gave me this:
    Code:
    zimbra   30882  0.0  0.2   6980  4648 pts/0    S    11:09   0:00 /usr/bin/perl /opt/zimbra/libexec/zmmtaconfig
    zimbra   30932  0.1  2.4  57440 49836 ?        Ss   11:09   0:00 amavisd (master)
    zimbra   30948  0.0  2.4  58732 50600 ?        S    11:09   0:00 amavisd (ch2-avail)
    zimbra   30949  0.0  2.4  58816 50684 ?        S    11:09   0:00 amavisd (ch4-avail)
    zimbra   30950  0.0  2.4  58716 50672 ?        S    11:09   0:00 amavisd (ch2-avail)
    zimbra   30952  0.0  2.4  59708 51680 ?        S    11:09   0:00 amavisd (ch3-avail)
    zimbra   30953  0.0  2.4  59640 51532 ?        S    11:09   0:00 amavisd (ch2-avail)
    zimbra   30954  0.0  2.3  58168 48892 ?        S    11:09   0:00 amavisd (virgin child)
    zimbra   30955  0.0  2.3  58168 48888 ?        S    11:09   0:00 amavisd (virgin child)
    zimbra   30956  0.0  2.3  58168 48896 ?        S    11:09   0:00 amavisd (virgin child)
    zimbra   30957  0.0  2.3  58168 48892 ?        S    11:09   0:00 amavisd (virgin child)
    zimbra   30958  0.0  2.3  58168 48896 ?        S    11:09   0:00 amavisd (virgin child)
    zimbra   30993  1.5  5.2 119756 108644 ?       Ss   11:09   0:07 /opt/zimbra/clamav/sbin/clamd --config-file /opt/zimbra/conf/clamd.conf
    zimbra   31002  0.0  0.1   5688  3948 pts/0    S    11:09   0:00 /usr/bin/perl /opt/zimbra/libexec/swatch --config-file=/opt/zimbra/conf/swatchrc --use-cpan-file-tail --script-dir=/tmp -t /var/log/zimbra.log
    zimbra   31014  0.0  0.3   8380  6544 pts/0    S    11:09   0:00 /usr/bin/perl /tmp/.swatch_script.31002
    root     31094  0.0  0.0   6384  1688 ?        Ss   11:09   0:00 /opt/zimbra/postfix-2.2.9/libexec/master
    zimbra   31113  0.0  0.0   5584   872 ?        Ss   11:09   0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra
    zimbra   31114  0.0  0.0   5584   468 ?        S    11:09   0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra
    zimbra   31115  0.0  0.0   5584   524 ?        S    11:09   0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra
    zimbra   31117  0.0  0.0   5584   468 ?        S    11:09   0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra
    zimbra   31118  0.0  0.0   5584   468 ?        S    11:09   0:00 /opt/zimbra/cyrus-sasl-2.1.22.3/sbin/saslauthd -r -a zimbra
    zimbra   31166  0.0  0.1   4924  3296 pts/0    S    11:09   0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-proc
    zimbra   31182  0.0  0.1   4792  3136 pts/0    S    11:09   0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-cpu
    zimbra   31184  0.0  0.1   4788  3124 pts/0    S    11:09   0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-vm
    zimbra   31219  0.0  0.1   4792  3080 pts/0    S    11:09   0:00 /usr/bin/perl -w /opt/zimbra/libexec/zmstat-mtaqueue
    zimbra   31275  0.0  0.0   1692   576 pts/0    S    11:10   0:00 /usr/bin/vmstat -n -S K 30
    So my question is what are those java processes? It looks like they hung because the main server was querying them and hung itself.

    6) Lastly I've noticed this happening more frequently ever since a failed backup triggered multiple zmlogger processes that don't die when you run "zmcontrol stop"
    Code:
    top - 17:20:15 up  7:55,  2 users,  load average: 1.26, 0.79, 0.74
    Tasks: 136 total,   2 running, 134 sleeping,   0 stopped,   0 zombie
    Cpu(s): 12.3% us,  8.6% sy,  0.0% ni, 40.9% id, 37.9% wa,  0.0% hi,  0.2% si
    Mem:   4147624k total,  3997224k used,   150400k free,     1540k buffers
    Swap:  1935824k total,  1011920k used,   923904k free,    78724k cached
    As you can see I'm swapping big time, and sometimes the CPU will spike to over 5.00 unless I kill zmlogger, and this is just from it running today. I've checked and repaired the logger tables, and even rebuilt the db recently but it keeps coming back. The server in question has (2) Dual Core Intel Xeon CPU 2.66GHz Cpu's with 4GB of RAM, we only have 75 users though that also includes accounts as HAM, SPAM, Wiki. I'm am also at a loss on this one, as it wasn't happening in early February, but now I deal with it everyday.

    All in all still must better than using and troubleshooting Exchange.
    Last edited by glitch23; 04-03-2008 at 03:56 PM.
    Because we all can't be geniuses, I'll go first.

  2. #2
    Join Date
    Feb 2007
    Location
    Portland, OR
    Posts
    1,147
    Rep Power
    10

    Default

    Quote Originally Posted by glitch23 View Post
    2) Since the system had not been rebooted in such a long time it when fsck when booting, on most systems thats fine, but when checking through 1TB of HD space it takes quite a long time to finish. Is there anyway to disable forced filesystem checks in Linux? (I can probably Google this one)
    tune2fs -i 0 /device-name

    As far as the logger not shutting down for a zmcontrol -stop there have been a lot of fixes related to this implemented in recent 5.0.x versions. If there is some reason that you can't upgrade your servers right now then running something like
    Code:
    kill -9 `ps -u zimbra -o "pid="`
    should clear up those processes...

  3. #3
    Join Date
    Jul 2006
    Location
    New York, NY
    Posts
    122
    Rep Power
    9

    Default

    Thanks a bunch for that, easiest way to kill all the zimbra processes. Since 5.0.4 recently came out I have to wait at least 3 weeks before scheduling downtime for an upgrade. ZCS 5.0 was supposed to have html signatures, however it did not until 5.0.4. I currently have 5.0.4 running in a dev environment so its going to happen at some point, but I can't chance any bugs appearing right after I upgrade.
    Because we all can't be geniuses, I'll go first.

  4. #4
    Join Date
    Jul 2006
    Location
    New York, NY
    Posts
    122
    Rep Power
    9

    Default

    Does anyone else have any insight to what could have cause those error messages?
    Because we all can't be geniuses, I'll go first.

Similar Threads

  1. [SOLVED] Install Problem in Ubuntu 6.06 Server
    By xtimox in forum Installation
    Replies: 16
    Last Post: 03-27-2008, 10:36 AM
  2. Replies: 22
    Last Post: 12-02-2007, 05:05 PM
  3. [SOLVED] Error Installing Zimbra on RHEL 5
    By harris7139 in forum Installation
    Replies: 10
    Last Post: 09-25-2007, 12:39 PM
  4. svn version still won't start
    By kinaole in forum Developers
    Replies: 0
    Last Post: 10-04-2006, 07:47 AM
  5. Monitoring : Data not yet avalaible
    By s3nz3x in forum Installation
    Replies: 7
    Last Post: 11-30-2005, 07:18 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •