We are a Zimbra Hosting Provider running a virtualised Zimbra farm, presently on 7.2.1; we also support multiple clients' virtualised Zimbra multi- and single-server environments. All of our storage is SAN-backed.

We are looking to improve our DR methodology for the case where a Sandy/Irene/Katrina-type event takes down, or is expected to take down, the primary data center, and we fail over to a secondary data center at least several hundred miles away.

Specifically, we are looking for a Zimbra-supported process which:
  1. is hypervisor agnostic (but which may rely on hypervisor-specific data transport tools).
  2. is data-safe (e.g. with Zimbra versions prior to 8, copies LDAP from the production to the DR site using LDIF rather than relying on an rsync of the entire /opt/zimbra tree).
  3. is, to the extent possible, Zimbra server-version agnostic.
  4. minimizes data loss when failing over to the DR site by leveraging intra-day transport of the redo logs.
  5. minimizes the time to "light up" a working DR site once a decision is made to fail over.



Background
Historically, our DR has relied on having cold standby Zimbra servers ready in the secondary data center; rsyncing the Zimbra NE backups from the primary to the secondary data center; and running zmrestoreldap and zmrestoreoffline, followed by changes to public DNS to complete the failover. As store sizes grow, this process consumes a great deal of inter-data center bandwidth (to rsync the backups), and the failover takes longer to complete because zmrestoreoffline needs a fair amount of time to do its work.

The Admin Guide points out (NE 7.2, April 2012 edition, page 239) that storage layer snapshots, in conjunction with zmplayredo, may be used as an alternative to Zimbra's own backup and restore feature by leveraging Zimbra's redo logs. However, the Admin Guide does not lay out the entire redo log-based DR process to the extent it does for a Zimbra server replacement.

Consequently, we are starting this thread to try to document this redo log-based DR process. Once done, I'm happy to write a wiki article and suggest Zimbra mark it as Certified Documentation once they have QA'd it. We are seeing hardly any new Zimbra installs going on bare metal, so upgrading the documented DR processes to take advantage of hypervisor-related technologies seems timely. Please help!


Proposed High-Level DR Process
Reading between the lines of the Admin Guide passage mentioned above, we are contemplating testing the following process, and would be grateful for feedback from the community before we get too far ahead of ourselves. At a high level, it seems to us the process would comprise:


One-Time or Infrequent Initialization Tasks
  1. Shorten the TTLs for all Zimbra server public DNS records, and/or use a third-party DNS provider who can orchestrate rapid DNS failover-based changes.
  2. On each production Zimbra server: as root, run "chkconfig zimbra off"; as the zimbra user, run "zmcontrol stop"; shut down the virtual machine and note the exact time; take a hypervisor-level snapshot of the virtual machine; restart the virtual machine and Zimbra; then, as root, run "chkconfig zimbra on".
  3. Depending on the hypervisor, use appropriate procedures to create clones of the Zimbra servers from the snapshots and transfer them to the DR data center.
  4. (As and when production Zimbra is upgraded/patched, steps 2. and 3. here will need to be repeated).
  5. At the DR site, boot the cloned Zimbra mailbox servers only (Zimbra itself will not start, and should not be started).
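
The in-guest part of step 2 could be scripted roughly as below. This is only a sketch under assumptions: the run helper, the DRY_RUN switch and the two function names are ours, not anything Zimbra ships, and the hypervisor snapshot itself happens outside the guest, between the two functions. With DRY_RUN=1 (the default) each command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: in-guest preparation for a hypervisor-level snapshot.
# DRY_RUN=1 (default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run as root on the production server before taking the snapshot.
prepare_for_snapshot() {
    run chkconfig zimbra off              # clone must not autostart Zimbra
    run su - zimbra -c "zmcontrol stop"   # stop Zimbra cleanly
    # Note the exact time (UTC) just before the VM goes down.
    echo "shutdown-time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    run shutdown -h now
}

# Run as root after the snapshot is taken and the VM has been restarted.
finish_after_snapshot() {
    run su - zimbra -c "zmcontrol start"
    run chkconfig zimbra on               # restore autostart on production
}
```

Calling prepare_for_snapshot in dry-run mode first is a cheap way to review the sequence before letting it touch a production server.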


Routine Syncing Between The Production and DR Servers
  1. Schedule a cron job which, after each zmbackup on the production servers, rsyncs selected subdirectories of /opt/zimbra/backup from the production servers to the DR servers: after full backups, just /opt/zimbra/backup/ldap and /opt/zimbra/backup/sys; after incremental backups, /opt/zimbra/backup/ldap, /opt/zimbra/backup/sys and /opt/zimbra/backup/redologs.
  2. After the above rsync is complete, run zmrestoreldap on the DR servers (daily) and, after the incremental backup rsyncs only, zmplayredo --logfile=/opt/zimbra/backup/sessions/incr-[sessionID].
  3. In zimbra's crontab at the DR site, comment out the lines which run full and incremental zmbackup, but keep the line which prunes older backups.
  4. Schedule a cron job which, every hour or so, rsyncs /opt/zimbra/redolog (with the --delete switch) from the production Zimbra mailbox servers to the DR site Zimbra mailbox servers.
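
The rsync jobs in steps 1 and 4 might look something like the sketch below. The DR host name (dr-mbox.example.com), the zimbra SSH login, the function names and the DRY_RUN switch are all assumptions for illustration; only the directory paths and the --delete switch come from the steps above.

```shell
#!/bin/sh
# Sketch only: production-side sync jobs for the DR site.
DR_HOST="${DR_HOST:-dr-mbox.example.com}"   # hypothetical DR mailbox server
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, do not run

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: $1 is "full" or "incr", matching the zmbackup that just finished.
sync_backup_dirs() {
    dirs="ldap sys"
    if [ "$1" = "incr" ]; then
        dirs="$dirs redologs"
    fi
    for d in $dirs; do
        run rsync -a "/opt/zimbra/backup/$d/" \
            "zimbra@$DR_HOST:/opt/zimbra/backup/$d/"
    done
}

# Step 4: mirror the live redo logs, deleting files removed on production.
sync_redologs() {
    run rsync -a --delete /opt/zimbra/redolog/ \
        "zimbra@$DR_HOST:/opt/zimbra/redolog/"
}

# Dispatch on the first argument; with no argument nothing is run.
case "${1:-}" in
    full|incr) sync_backup_dirs "$1" ;;
    redolog)   sync_redologs ;;
esac
```

Saved under a path of your choosing, the hourly job would then be a crontab line that calls the script with the redolog argument, and the post-backup job would call it with full or incr.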


Failover to DR Site Process

  1. On the DR Zimbra mailbox servers, run zmplayredo --logfile=/opt/zimbra/redolog one last time, then start Zimbra on all DR servers.
  2. If the production site is still reachable, shut down the Zimbra servers there.
  3. Update public DNS to point to the DR Zimbra servers.
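
Step 1 on a DR mailbox server could be wrapped along these lines. Again a sketch under assumptions: the freshness check, its 90-minute threshold, the function names and the DRY_RUN switch are our own additions and not part of the Admin Guide process; the zmplayredo invocation is copied from step 1 as written. The check relies on find's -mmin test, which GNU and BSD find support but POSIX does not require.

```shell
#!/bin/sh
# Sketch only: DR-side failover for one mailbox server.
DRY_RUN="${DRY_RUN:-1}"                            # 1 = print, do not run
REDOLOG_DIR="${REDOLOG_DIR:-/opt/zimbra/redolog}"
MAX_AGE_MIN="${MAX_AGE_MIN:-90}"                   # hypothetical threshold

# Hypothetical helper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Succeeds if directory $1 holds a file modified under $2 minutes ago.
redolog_is_fresh() {
    [ -n "$(find "$1" -type f -mmin "-$2" 2>/dev/null | head -n 1)" ]
}

failover_mailbox_server() {
    # Warn (but carry on) if the last hourly rsync looks stale, since that
    # bounds how much mail may be missing after failover.
    if ! redolog_is_fresh "$REDOLOG_DIR" "$MAX_AGE_MIN"; then
        echo "WARNING: no redo log in $REDOLOG_DIR newer than" \
             "$MAX_AGE_MIN minutes" >&2
    fi
    run su - zimbra -c "zmplayredo --logfile=$REDOLOG_DIR"
    run su - zimbra -c "zmcontrol start"
}
```

The warning is deliberately non-fatal: in a real disaster you would most likely fail over anyway, but it is useful to know up front roughly how large the data-loss window is.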


Questions and Missing Bits
  1. Is it true that, once a Zimbra server is snapshotted as above and transported to the DR site, keeping this DR Zimbra server periodically in sync with the production site requires only restoring a recent version of LDAP and replaying the redo logs? Is there any other data which needs to be copied over to avoid having to ship the entire /opt/zimbra/backup tree between the production and DR sites?
  2. Is anyone using this technique (or a variation thereof) presently, and have you ever tested, or actually had to, fail over to the DR site?
  3. What else needs to be included from a process standpoint that we haven't thought of?


Thanks!
Mark