There is a fundamental issue with this workflow that needs to be considered:
If a service stops, it stops for a reason. This workflow does nothing to address that problem.
This means that if there is a larger issue, such as an unhandled exception... well, it's only a matter of time before it goes down again. Since this idea would automatically restart the service, you may never know you hit an unhandled exception. It might even make things worse...
Zimbra has good handlers already. We have our own watchdog process for things like the MTA, ClamAV, and Java. If those die, it tries to restart them. If there is a condition preventing the restart, it won't restart them.
The moral of the story is that if the server goes down, you really should figure out why, as opposed to just restarting the service.
I do think the underlying idea is good, which is why I'm pointing out that the problem is with the workflow itself.
There's a high-availability/failover script floating around. You might want to look at that.
Does the watchdog process send an e-mail to the admin if a process dies and it has to restart it, or can't restart it? Is there an option to set something like that up? I realize that if a service does die there could be a bigger underlying issue, but I would like an alert telling me it died and could/couldn't be restarted, rather than finding out when all my customers call and complain. ;-)
I was just trying to be proactive and get alerted to the issue first if something were to happen.
Thanks for the input.
Well, it wouldn't be able to send an e-mail, because the server is down and thus SMTP is down. If e-mail is down, you probably won't get the message anyway.
What I would do is have a script that monitors the services. If a condition is raised where the services go down, you could have it send an HTTP POST to your "support server" or something. If you're using Windows NT, you could whip up a script where, when that POST is received, it uses the Windows Messenger service (not MSN Messenger, but the messenger protocol built into Windows NT machines) to send your machine an alert.
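A minimal sketch of the monitoring side of that idea, assuming a hypothetical support-server URL and service list (none of these names or ports come from anyone's actual setup):

```python
import socket
import urllib.request

# Assumptions for illustration only: replace with your own support server
# and the services you actually care about.
SUPPORT_URL = "http://support.example.invalid/alert"
SERVICES = {"smtp": 25, "imap": 143}

def is_up(port, host="localhost", timeout=2):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def build_alert(service):
    """Build the POST body announcing that a service is down."""
    return "host=%s&service=%s&state=down" % (socket.gethostname(), service)

def post_alert(service):
    """Report over HTTP instead of e-mail, since SMTP may be the dead service."""
    data = build_alert(service).encode()
    urllib.request.urlopen(SUPPORT_URL, data=data, timeout=5)

def main():
    # Run this from cron or a loop; it checks each port and reports failures.
    for name, port in SERVICES.items():
        if not is_up(port):
            post_alert(name)
```

The point of posting over HTTP rather than mailing is exactly the concern raised above: the alert path must not depend on the service being monitored.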
Just some thoughts.
SMTP may not be down, but another service could be. In any case, since this is a disaster-related script, you should plan for the event that SMTP is unavailable.
Everything you say, John, is right, but:
I have a multi-store architecture with store servers WAN-connected to a central hub.
I have a store that dies when the WAN connection to the master goes away; at the moment I don't know any way to recover it without using monit. If you can suggest something different, you are welcome!
Any advice would be appreciated.
Hey there... no one's done anything with this in a while, but I figured I would post my working monitor script. The one thing to note is that the purpose of the script is NOT to restart a failed process, but simply to give the administrator a heads-up that something is about to go bad (e.g. process hung, running out of resources, process died, etc.).
So, think of this as an early warning system. Monit can easily be set to use a different SMTP server than your Zimbra server, so it gets around that problem as well.

Code:
check system myhost.local
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if memory usage > 85% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 50% then alert
    if cpu usage (wait) > 20% then alert

check process Zimbra.Apache with pidfile "/opt/zimbra/log/httpd.pid"
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    if failed port 80 protocol http then alert
    group zimbra

check process Zimbra.Logwatch with pidfile "/opt/zimbra/log/logswatch.pid"
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    group zimbra

check process Zimbra.MySQL with pidfile "/opt/zimbra/db/mysql.pid"
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    if failed port 7306 protocol mysql then alert
    group zimbra

check process Zimbra.MySQL_Logger with pidfile "/opt/zimbra/logger/db/mysql.pid"
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    depends on Zimbra.MySQL
    group zimbra

check process Zimbra.MTA_Config with pidfile "/opt/zimbra/log/zmmtaconfig.pid"
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    group zimbra

check process Zimbra.Mailbox_Java with pidfile "/opt/zimbra/log/zmmailboxd_java.pid"
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    if failed port 143 protocol imap then alert
    group zimbra

check process Zimbra.Mailbox_Control with pidfile "/opt/zimbra/log/zmmailboxd_manager.pid"
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    group zimbra

check process Zimbra.ClamAV with pidfile /opt/zimbra/log/clamd.pid
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    group zimbra

check process Zimbra.Cyrus_SASL with pidfile /opt/zimbra/cyrus-sasl/state/saslauthd.pid
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    group zimbra

check process Zimbra.Postfix with pidfile /opt/zimbra/data/postfix/spool/pid/master.pid
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    if failed port 25 protocol smtp then alert
    group zimbra

check process Zimbra.LDAP with pidfile /opt/zimbra/openldap/var/run/slapd.pid
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    if failed host myhost.local port 389 protocol ldap3 then alert
    group zimbra

check process Zimbra.Amavis with pidfile /opt/zimbra/log/amavisd.pid
    if children > 255 for 5 cycles then alert
    if cpu usage > 95% for 3 cycles then alert
    group zimbra
Depending on your environment, you may not want the service left down. If, say, it went down at 4am and you get a wakeup call at 8am from irate users, your investigation time would be limited and you would have to restart the service anyway.
So the real moral of the story: know what you need before you implement. Just leaving a service down is great in theory, while we take our time exchanging pleasantries with Zimbra tech support to get the issue resolved, but that's not always a quick process.
As someone mentioned later, monit can be configured to send alerts via another SMTP server, so based on your alert config, you will be notified of a down situation.
You can also comment out the start/stop lines and just have the alerts sent out; it's pretty flexible.
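For reference, the alternate-SMTP-server setup mentioned above is just a couple of monit directives along these lines (the server name, credentials, and address here are placeholders, not from anyone's actual config):

Code:
set mailserver smtp.example.com port 587
    username "monit" password "secret"
set alert admin@example.com

With those in place, monit delivers its alerts through the alternate SMTP server even while the Zimbra MTA itself is down.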
Oh, I completely agree... that's the whole point of the monitrc posting I put up. All it does is let the admin know that either (a) a service has gone down, or (b) the server appears to be struggling with something; either way, they should look into it. The monit script I posted doesn't even have start/stop lines, and that's completely intentional.
The idea behind having the alerts for child processes, memory utilization, load, etc. is that the administrator can get in and, in the worst case, warn the users that the system is going down. In my experience, the anger level of a client is generally inversely proportional to the amount of warning they had; e.g. "You're getting a lot of spam; it looks like it's about to hang the system" is appreciated far more than "The reason you haven't received e-mail in the last 4 hours is that spam clogged the system".
... god I hate spam.