May 25, 2009

DIY ESX Server Health Monitoring - Part 4

If you're just catching this series on creating an ESX health report, in Part 1, Part 2, and Part 3 we set up everything we need to schedule the daily health check and send the results in a HTML formatted email. Running the health check once a day is probably not sufficient if you want to be on top of developing issues, however, and if you have a lot of ESX hosts, reading through a long list of performance statistics may be unreasonable. So to wrap this project up, we'll look at setting up a second cron job that will only send out an alert message when an ESX host exceeds a specified threshold.

Due to the simple design of the health report scripts, to set up this functionality we only need to modify a few lines from the run-esx-report.sh script:
  • The first change is in the loop where we SSH into each ESX host and run the esx-report.sh script. We'll simply change the append redirection symbols, >>, to the create or truncate symbol, >, this way we're creating a new report output file for each host, rather than a combined report. To be extra sure the temp file is truncated each time through the loop, we'll use the noclobber override option as well, so the >> symbols become >|

  • Next, we grep for the word WARNING in the output file, and wrap the rest of the script in an if statement so the email is only sent out if the grep command returns true

  • And finally, we'll just change the subject of the email message


###############################################################################
#
#  run-esx-threshold.sh
#
###############################################################################
#
#  To create the run-esx-threshold.sh script in the ~/esx-report directory,
#  copy this entire code segment into your shell.
#  If you'd rather copy just the script, select everything between the
#  SCRIPTCREATOR limit strings.
#
#  putty will ignore all the tabs, making the copied script quite ugly
#
###############################################################################

# If the ~/esx-report directory exists, cd to it so the script is created there
[ -d ~/esx-report ] && cd ~/esx-report

cat > ./run-esx-threshold.sh <<'SCRIPTCREATOR'
#! /bin/bash
  PATH="/bin:/usr/bin"

  if [ -z $1 ]; then
    echo "No ESX hosts specified, exiting"
    exit 1
  fi

  if ! pgrep ssh-agent >/dev/null; then
    echo "The ssh-agent process does not appear to be running, exiting"
    exit 1
  fi

  RUNDIR=$(dirname "$(which "$0")")

  source "${HOME}/.ssh-agent" >/dev/null || exit 1

  THISHOST=$(hostname | cut -d . -f 1)

  TEMPTEXT=$(mktemp "${RUNDIR}/temptext.XXXXXXXXXX")

  TEMPHTML=$(mktemp "${RUNDIR}/temphtml.XXXXXXXXXX")

  for host in $@; do
    if [ $(echo $host | cut -d . -f 1) = $THISHOST ]; then
      "${RUNDIR}/esx-report.sh" >| "$TEMPTEXT"
    else
      ssh -q $host "$(cat "${RUNDIR}/esx-report.sh")" >| "$TEMPTEXT" || \
        printf "WARNING: SSH connection to $host failed\n\n\n\n" >| "$TEMPTEXT"
    fi

    if grep WARNING "$TEMPTEXT" >/dev/null; then

      cat >| "$TEMPHTML" <<-'HEADEREOF'
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
	<html>
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<style type="text/css">
	body { font-family: monospace; font-size: 12px }
	pre { font-family: monospace; font-size: 12px }
	</style>
	</head>
	<body>
	<pre>
	HEADEREOF

      cat "$TEMPTEXT" | \
      sed -e 's/>/\&#62/g' \
          -e 's/WARNING:.*/<span style="color: red">&<\/span>/' >> "$TEMPHTML"

      cat >> "$TEMPHTML" <<-'FOOTEREOF'
	</pre>
	</body>
	</html>
	FOOTEREOF

      "${RUNDIR}/html-mailer.pl" -f esx-report@yourdomain.dom \
                                 -r administrator@yourdomain.dom \
                                 -s "Alert on $host" \
                                 -m exchange.yourdomain.com \
                                 -b "$TEMPHTML"
    fi
  done

  rm -f "$TEMPTEXT"; rm -f "$TEMPHTML"

SCRIPTCREATOR

chmod 0700 ./run-esx-threshold.sh

###############################################################################


Don't spam yourself
When considering how often you want to run the threshold check script, keep one shortcoming of this method in mind: if a parameter continues to exceed its threshold, the script will continue to email you every time it runs. If you set this up to run every five minutes, and head out into the woods over a holiday weekend, you're going to get a thousand alert messages before you get a chance to resolve the issue.

For our purposes, once every 30 minutes will suffice, and so we'll add another cron job by issuing a crontab -e command as the non-root user, press i to enter insert mode, and below the line containing the 7:10 AM ESX server health report job, we'll add:

0,30 * * * * ${HOME}/esx-report/run-esx-threshold.sh ESX LIST >/dev/null 2>&1

Press Esc, then :wq to write the crontab and exit vi, and we're done!

If you do want to run the threshold check every five minutes, instead of specifying a list like 0,5,10,15, etc., use the range of minutes followed by a forward slash and interval, like:

0-59/5 * * * * ${HOME}/esx-report/run-esx-threshold.sh ESX LIST >/dev/null 2>&1

Tweak the thresholds
You'll definitely want to play with the threshold settings from the esx-report.sh script in Part 1. The threshold is the third parameter supplied to the scale function, and I've highlighted it below for the memory usage check:

  printf "  Memory Usage:\n"
  (free | awk '/^Mem:/ {print $3, $2, "100", $1}
              /^Swap:/ {print $3, $2, "1", $1}') | \
    while read line; do scale $line; done

That does it for the DIY ESX Server Health Monitoring project, I hope you'll find this information easy to customize for your own environment. If you add new performance checks or enhancements, feel free to describe the changes in a comment.

Install it
If you'd like to set the whole thing up, just copy and paste each code segment with a light blue background into a putty session. To install:
  • Create the esx-report.sh script from Part 1 as the non-root user. Copying the entire code segment in the light blue box into a putty window will create the esx-report folder under the home folder of the user that executes it.

  • From Part 2, execute the ssh-keygen command as the non-root user. Then run the esxcfg-firewall command as root to open an outbound port for SSH. Create the two remaining scripts; copykey.sh, and start-ssh-agent.sh as the non-root user.
    Use copykey.sh to distribute the public key file, then launch ~/start-ssh-agent.sh to load the private key into memory, both as the non-root user. Make test ssh connections to each ESX host you need to run the report on, but make sure you source the .ssh-agent file first so the variables are exported to your shell, source ~/.ssh-agent

  • Now create the html-mailer.pl script from Part 3 as the non-root user. As root, run the esxcfg-firewall command to open outbound SMTP in the firewall. Change users back to the non-root user, and create the run-esx-report.sh script and change the email settings for your environment.

  • Create the run-esx-threshold.sh script from this post as the non-root user and change the email settings.

  • Set up the cron jobs for the daily health report and the threshold check. Customize the whole thing any way you see fit.

* A couple of tips:
  • Try to schedule the daily health check and threshold checks so they don't run at the same time. The jobs will run fine simultaneously, but the usage numbers could be inflated.

  • Configure reverse DNS records for your ESX hosts on the DNS servers they point to or you'll see long pauses during SSH connection attempts as the server times out attempting to resolve the connecting client's hostname from its IP.


No comments:

Post a Comment