Thread: Very puzzling performance issues with Zimbra server

  1. #1

    Hi everyone.

    We've been using Zimbra for a while but have had some serious issues lately that have baffled our support reps. I'll post the details below in the hope that one of you has seen this problem before. **Also good to note: the Zimbra guest is the only virtual machine on the host; there are no other virtual machines on that hypervisor.

    Below is a detailed recap of our issue.

    ________________________________________
    Events and Timeline

    • Purchased and configured email server in September 2013. Specs are as follows:
    o Model: Dell Poweredge R320
    o Storage: 4 X 3TB SAS 3.5” drives
    o RAID controller: PERC H310
    o RAID configuration: RAID 10
    o Memory: 32GB RAM (16GB allocated to Zimbra server)
    o Processor: 1 X Processor, 6 cores
    o Host OS: Windows Server 2012 with Hyper-V
    o Guest OS: Ubuntu 12.04
    o Zimbra Version: ZCS 8.0.5
    • Server ran beautifully for about six months.
    • On March 2nd (i.e., Mardi Gras . . . they always break down during a holiday, don’t they?), the email server slowed to a crawl and stopped responding.
    o Symptoms:
    ▪ Emails not going through. SSH terminal is slow.
    ▪ The top command shows over 40 java processes in the queue.
    ▪ CPU utilization > 99%
    ▪ RAM consumption < 13%
    ▪ Disk I/O normal (no high wait times).
    ▪ Console message reads: INFO: task java:28193 blocked for more than 120 seconds. Not tainted 2.6.32-431.5.1.el6.x86_64 #1. “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
    o Troubleshooting steps:
    ▪ Rebooted server; no improvement.
    ▪ Increased java heap size; no improvement.
    ▪ Made other minor changes to zmlocalconfig and server settings; no improvement.
    ▪ Reinstalled everything using the better-supported configuration listed below (recommended by Cole Haynes); server performance returned to normal.
    • Windows Server 2012 Hyper-V → ESXi
    • Ubuntu 12.04 → CentOS 6.5
    • Zimbra 8.0.5 → 8.0.6 (upgraded after the zmrestore)
    • Server ran fine until Friday, March 21st. The symptoms were the same. Our steps were as follows:
    o Troubleshooting steps:
    ▪ Adjusted java heap size; no improvement.
    ▪ Disabled hyperthreading; no improvement.
    ▪ Adjusted 2 vCPUs → 1 vCPU (recommended by Frederico); server performance returned to normal.
    • Installed New Relic monitoring software. New Relic monitors CPU, RAM, and storage performance and notifies me when any stats exceed thresholds set by a custom policy. I’m currently looking into Nagios as a free alternative.
    • Disabled antispam and antivirus services in Zimbra.
    • On Saturday, March 29th, at around 11:00 PM, the server once again slowed to a crawl and then stopped responding.
    o New Relic performance logs:
    ▪ CPU usage shows a huge increase in utilization (1.7% → 61.4%) between 7:45 PM and 8:00 PM Saturday. The usage then holds at about 70% until 9:15 PM, when it starts to spike from 0% to 73% repeatedly until it stops altogether (that may be when I restarted the machine).
    ▪ RAM usage shows normal memory consumption until about 7:45 PM. Then it rose steadily until 9:00 PM. Then it began to spike repeatedly.
    ▪ **Note: I’ve attached the 3 hour and 6 hour CPU & RAM reports ending at 11:00 PM Saturday March 29th. You can review the specific processes in queue from there.
    o Troubleshooting steps:
    ▪ Changed the following host settings; server performance returned to normal.
    • In the hardware’s BIOS settings, disabled Power Management.
    • In vSphere, disabled Power Management.
    • In vSphere, disabled hyperthreading.
    • In vSphere, returned vCPU count back to two vCPUs.
    o Benchmarking steps using sysbench tool:
    ▪ One vCPU vs. two vCPUs → Two vCPUs performed much better than one.
    ▪ vSphere Power Management enabled vs. disabled (did not test BIOS power management) → No significant change in performance.
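
    For anyone chasing the same console message: the line quoted in the March 2nd symptoms follows the kernel's hung-task warning format, so the affected processes can be tallied from a saved log. This is only a minimal Python sketch. The function names are made up for illustration, and it assumes you have already captured the kernel log lines (for example from dmesg output) into a list of strings.

```python
import re
from collections import Counter

# Matches the kernel's hung-task warning, for example:
# "INFO: task java:28193 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_tasks(lines):
    """Return a (name, pid, seconds) tuple for each hung-task warning found."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


def count_by_process(lines):
    """Count hung-task warnings per process name (hypothetical helper)."""
    return Counter(name for name, _pid, _secs in parse_hung_tasks(lines))


if __name__ == "__main__":
    # The first line is the message from the post; the second is an
    # unrelated, made-up line to show that non-matching input is ignored.
    sample = [
        "INFO: task java:28193 blocked for more than 120 seconds.",
        "some unrelated kernel message",
    ]
    print(parse_hung_tasks(sample))
    print(count_by_process(sample))
```

    If most of the warnings name java, that matches the symptom above but does not by itself say whether the processes are waiting on CPU scheduling or on storage.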
    ________________________________________
    Points of Interest
    • The issue occurred on both Hyper-V and ESXi.
    • A processor issue is currently our prime suspect. Frederico thinks the processor may not be assigning or running processes properly (or something to that effect).
    • Somehow, decreasing the vCPU count fixed the issue on the 21st, even though our benchmark showed that one vCPU does not perform better than two. Our Dell support rep thinks this may have reduced the total traffic to and from the RAID controller, temporarily fixing the problem.
    • In addition to the BIOS power management settings, our Dell support rep also noted that “… if C States or C1E are still enabled (you can check in the BIOS under Processor Settings) then that might cause some of the issues we are seeing. A physical core getting throttled without the VM knowing what’s going on could cause any number of issues.”
    • In addition to the New Relic charts from Saturday, I’ve attached the screenshots of some commands (e.g., iostat, top, etc.) that were run while the issue was occurring.
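
    To pin down when an incident starts from exported monitoring data, a simple pass over the samples is enough. The following is a rough Python sketch, not anything New Relic provides: the function name, the 30-point threshold, and the sample layout are all assumptions. The three readings are taken from the figures quoted above (1.7%, 61.4%, and roughly 70%), and the 8:15 PM label is invented for the example.

```python
def find_jumps(samples, min_rise=30.0):
    """Find sudden rises between consecutive samples.

    samples is a time-ordered list of (label, cpu_percent) pairs.
    Returns (prev_label, label, prev_value, value) for every pair whose
    rise is at least min_rise percentage points (an assumed threshold).
    """
    jumps = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 - v0 >= min_rise:
            jumps.append((t0, t1, v0, v1))
    return jumps


if __name__ == "__main__":
    # CPU readings from the Saturday incident as described in the post.
    saturday = [("7:45 PM", 1.7), ("8:00 PM", 61.4), ("8:15 PM", 70.0)]
    for start, end, before, after in find_jumps(saturday):
        print(f"CPU jumped from {before}% to {after}% between {start} and {end}")
```

    Running the same check over the RAM series would show whether memory started climbing before or after the CPU did.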

    ________________________________________

  2. #2

    The H310 (disk controller) is, from what I've read, very, very bad (no write cache capability).
    While you don't have visible I/O issues, couldn't your problems be related to it, given that they happen with both Hyper-V and ESXi?

    Dell PERC H310 slow performance - Best Tech Blog | Windows | Linux | Mac | How To
    Sam's IT Blog: Dell PERC H310 Controller RAID 5 Performance Issues
    https://serverfault.com/questions/54...rc-h310-raid-5
    BlackCat Research Facility · Poor disk performance on Dell servers
    Extremely poor I/O performance on XenServer 6.1 - Storage - Discussions

  3. #3

    @Klug: Well, I'm certainly not ruling anything out at this point, and yes, logic says it is probably a hardware issue, since it followed us across hypervisors.

    Also, the links you posted were certainly eye-opening. I've sent an email suggesting this as a possible cause to our Dell support representative; hopefully he can tell me how to test it. I'll keep you posted.
