May 12, 2009

DIY ESX Server Health Monitoring - Part 1

Updated: May 25, 2009
I've had a chance to test this project out with vSphere - ESX 4.0, and everything works the same, except for the last section of the esx-report.sh script that parsed the /proc/vmware/mem file. This file has moved and the format was changed. Since the hypervisor memory usage can be monitored and alerts triggered from vCenter Server, I've just removed that section from the script.

VMware creates some pretty amazing stuff. If they didn't, I wouldn't have a blog about it, and you wouldn't be reading blogs about it. But let's be honest, it's not perfect (what software is?), and every now and then something buggy will happen. Sometimes these quirks occur at scary moments, like when removing a snapshot, or in the middle of a VMotion operation. Rarely do they cause any actual downtime, however.

You could classify the quirks into two categories: vCenter Server and ESX. The vCenter issues are almost always harmless, and are sometimes resolved simply by closing and reopening the VI/vSphere client. It's the ESX quirks that can be really serious, as any problem with ESX can lead to virtual machine downtime.

That's why it is so important to have some type of ESX performance monitoring in place. vCenter provides some basics, but doesn't offer any real visibility into the Service Console, which is where the serious problems can be lurking. There are some very good commercial products, and one in particular, Veeam Monitor, is even offered in a free edition. Sometimes you need something more customized, however, and the only option is to build it yourself.

In this four-part series, we'll build our own ESX health report with a shell script, use key-based SSH authentication so that one ESX server can run the script on the others, and then email the report using a Perl script. We'll finish with a quick modification to enable the report to trigger an email when performance thresholds are reached. The format of the report will be designed to display perfectly on a BlackBerry Curve set to its smallest font size, allowing us to know about issues from anywhere.

The motivation
I started working on this script after two separate ESX Service Console incidents. The first occurred after upgrading an ESX 3.5 server to Update 2. The upgrade was successful, and everything seemed perfectly normal. But as it turned out, there was an issue between the HP Systems Insight Manager server and the new update, causing a new process to be launched in the Service Console every five minutes, but the processes never terminated. After a week of this, several thousand zombie processes were running in the Service Console. There is a limit on the number of processes before the ESX server will stop launching new ones, and once you hit that limit, chaos ensues.

The second incident was less severe, but was just as scary because no one really knew how long the problem had been occurring. For reasons never understood, one of the VMware log files started filling up with the same generic HA message, logging more than five entries a second. The log file was rotating through all of its file names several times an hour, and the Service Console processor usage was pegged. The problem was finally noticed when a VMotion took longer than fifteen minutes to complete.

The plan
The setup for the ESX health report is fairly simple, and it satisfies two key components of a good security policy: do not permit SSH access for root, and never store passwords in scripts. The basic plan is:
  • A non-root user account will be used for the entire process, and a public SSH key for this account will be distributed to the other ESX servers

  • From the central ESX server the non-root account will SSH to each ESX host, execute the script, and redirect the script output to the central ESX host

  • The combined output from each ESX server will then be emailed using a perl script, formatted to display nicely on a BlackBerry Curve

For this setup to work, you'll need to use the same non-root account on each ESX server. Even though most of the VMware command-line tools can only be executed by root, we can get most of the critical stats we need with just a regular non-privileged account.

It will also be necessary to open an outbound port for SMTP in the firewall on the ESX host that will be emailing the report. If your network design has isolated the ESX hosts from the rest of the network (and it should!), and only the server running vCenter has access to the service console network, you'll need to set up SMTP on the vCenter server and configure it to forward to an Exchange server or whatever groupware application you use, which we'll cover in Part 3.

The script
The shell script used to gather data from the ESX hosts is pretty straightforward, and can be developed and tested from a local ESX console, as the output is just being sent to the terminal.

The scale function will perform all calculations and print the histogram output. The function expects four parameters: the value, the maximum range for the value, a threshold percentage for generating a warning, and a description for the histogram.

The CSCALE local variable determines how the data is scaled, or how many intervals the graph will display. I've used a value of twenty here, so each hash represents a 5% interval, mainly because a bar of that width displays perfectly on my BlackBerry Curve. If you need more resolution than that in the graphs, it's just a matter of changing the CSCALE constant:

  function scale {
    local CSCALE=20

The bash shell can only do integer arithmetic, which makes calculating percentages pretty tough. However, awk fills this gap easily, and can round the percentages with a printf format that specifies a floating point number with a precision of 0, so no decimal point is printed (%.0f):

   avg=$(echo $1 $2 | awk '{printf "%.0f", ($1/$2) * 100}')
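
As a quick check of the rounding, the same awk expression can be wrapped in a small helper and fed a couple of values. This is only a sketch: percent is a hypothetical name used for illustration, and the numbers are made up.

```shell
# percent VALUE MAX: print VALUE as a whole-number percentage of MAX.
# "percent" is an illustrative name, not part of esx-report.sh.
percent() {
  echo "$1" "$2" | awk '{printf "%.0f", ($1/$2) * 100}'
}

percent 512 2048; echo    # 25
percent 1 3; echo         # 33 (33.33... printed with no decimals)
```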

We'll use awk again to determine how many hashes should be printed in the histogram, but this time we'll multiply by the CSCALE value rather than 100:

  scaled=$(echo $1 $2 $CSCALE | awk '{printf "%.0f", ($1/$2) * $3}')
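
With CSCALE at 20, small readings round down to zero hashes, which is why the full script at the end of this post forces a single # for any non-zero average. A sketch with made-up values follows; hashes is a hypothetical helper name.

```shell
CSCALE=20

# hashes VALUE MAX: number of # characters for VALUE on a CSCALE-wide bar.
# "hashes" is an illustrative name, not part of esx-report.sh.
hashes() {
  echo "$1" "$2" "$CSCALE" | awk '{printf "%.0f", ($1/$2) * $3}'
}

hashes 50 100; echo    # 10 (half of the bar)
hashes 2 100; echo     # 0  (2% is less than half of one 5% interval)
```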

With the percentage value rounded and stored in avg, we will compare it to the threshold value specified as the third parameter to the scale function. If avg is greater than or equal to the threshold, we'll store the third parameter in a variable named alert that we can check for later:

  if [ $avg -ge $3 ]; then
    alert=$3
  fi

When a bash function is called, its parameters arrive as a space-delimited list. That is a problem if an argument contains spaces, which the descriptions for the histograms most certainly will. So in the scale function we make the description the last parameter; that way everything from the fourth parameter on is part of the description. Using the built-in shift command, we shift off the first three parameters, which leaves $@, the array of all arguments passed to the function, holding only the description. We can then grab the whole array of leftover arguments and store it in a single variable:

  shift 3; histtext=$@
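
Here is the shift technique in isolation, as a sketch with a hypothetical function name, describe. It uses "$*" to join whatever arguments are left into one string separated by single spaces.

```shell
# describe A B C WORDS...: drop the three leading arguments and print
# whatever is left as one string. "describe" is an illustrative name.
describe() {
  shift 3
  text="$*"    # remaining arguments joined with single spaces
  printf '%s\n' "$text"
}

describe 42 100 75 Free disk space    # Free disk space
```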

Now that the calculations are done and parameters stored, we can start printing some histograms. To keep everything lined up in the output, we'll let printf format specifiers do all the work. This first printf command outputs the description for the histogram followed by the average. The first format directive, %-10.10s, specifies a string value (s) will be printed in a 10 character width field (10), and it should be left justified (-) with a precision of 10, meaning only print the first 10 characters even if it is longer (.10).
The second format directive, %3d%%, specifies an integer value will be printed in a three character width field, and a percent sign will follow it:

  printf "    %-10.10s %3d%% " "$histtext" "$avg"
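
To see what the two directives do, here is a sketch that wraps each field in brackets so the padding and truncation are visible. The brackets and the name label are my additions for illustration only.

```shell
# label TEXT PCT: print TEXT in a 10-character left-justified field and PCT
# in a 3-character right-justified field. The brackets only make the field
# edges visible. "label" is an illustrative name.
label() {
  printf '[%-10.10s][%3d%%]\n' "$1" "$2"
}

label "User:" 7                # [User:     ][  7%]
label "/var/log/vmware" 100    # [/var/log/v][100%]
```

Short descriptions are padded with spaces and long ones are cut at ten characters, so the percentage column always starts in the same place.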

This loop adds # characters to the hist variable up to the value of scaled, giving us the bar for the histogram:

  for ((i=0; i<scaled; i++)); do
    hist="${hist}#"
  done

And now we'll print the histogram bar, enclose it in [] brackets, left justify it (-), and print it in a field width equal to the CSCALE value:

  printf "[%-${CSCALE}.${CSCALE}s]" "$hist"
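
The loop and the bracketed printf can be tried together outside the script. This sketch uses a hypothetical function named bar, and swaps the bash for ((...)) loop for a plain while loop so it also runs in other Bourne-style shells.

```shell
# bar COUNT WIDTH: print COUNT hashes inside a WIDTH-wide bracketed field.
# "bar" is an illustrative name; the while loop is a portable stand-in for
# the bash for ((...)) loop.
bar() {
  hist=""
  i=0
  while [ "$i" -lt "$1" ]; do
    hist="${hist}#"
    i=$((i + 1))
  done
  printf "[%-${2}.${2}s]\n" "$hist"
}

bar 3 10     # [###       ]
bar 25 20    # a full 20-hash bar: the precision caps the bar at WIDTH
```

The precision in the format is what keeps the closing bracket in place if a value ever exceeds its maximum.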

If the average was greater than or equal to the threshold, the alert variable will be defined, so we will print a warning message right below the histogram, lining it up perfectly by using printf format directives to specify the field width:

  if [ $alert ]; then
    printf "%28s%3d%% %8s" "WARNING:" "$alert" "threshold"
    printf "\n"
  fi

With the scale function defined, we can start gathering data and formatting the report. We'll begin by printing the hostname of the ESX server:

  printf "$(hostname)\n\n"

To get the load average for the service console, we'll use the last section of the output from uptime. Using egrep with the -o option tells it to only print out the section of the line that matches. Then use tr to change the lower case 'l' in 'load' to upper case, and we have a nicely formatted load average:

  printf " Service Console Stats:\n"
  printf "  $(uptime | egrep -o 'load.*' | tr 'l' 'L')\n"
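
The pipeline can be tried on a canned line without touching a live host. In this sketch the uptime line is made up, and grep -E stands in for egrep, which newer versions of GNU grep flag as obsolescent.

```shell
# A made-up line in the format uptime prints.
sample=' 10:15:01 up 12 days,  3:02,  1 user,  load average: 0.08, 0.03, 0.01'

# Keep only the part of the line from "load" onward, then capitalize.
# tr replaces every "l" it sees, which is safe here only because
# "load average" and the numbers after it contain just one.
load=$(printf '%s\n' "$sample" | grep -Eo 'load.*' | tr 'l' 'L')
printf '  %s\n' "$load"    # "  Load average: 0.08, 0.03, 0.01"
```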

Getting the number of running processes is easy: pipe ps into grep, match any character (.), and output the number of lines that matched (-c). This won't be completely accurate, as the script itself and any subshells it launches will be included in the count. That's not a big deal, though: the process count is only critical when it gets into the thousands, so we don't really care if it's inflated by a few:

   printf "  Number of running processes: $(ps -e | grep . -c)\n\n"
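
The header line that ps prints is counted as well, which adds one more to the total. The sketch below shows this on made-up ps output; dropping the first line with tail is one possible way to exclude it, not something the report script does.

```shell
# Made-up output in the shape ps -e produces: one header, three processes.
sample='  PID TTY          TIME CMD
    1 ?        00:00:02 init
 1234 ?        00:00:00 sshd
 5678 pts/0    00:00:00 bash'

# grep -c . counts every non-empty line, header included.
with_header=$(printf '%s\n' "$sample" | grep -c .)
# tail -n +2 starts output at the second line, skipping the header.
without_header=$(printf '%s\n' "$sample" | tail -n +2 | grep -c .)

echo "$with_header $without_header"    # 4 3
```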

We can get the processor use average from vmstat, telling it to calculate the average over five seconds. We're grabbing two stats here, the user and system processor use percentages, so the '\n' newline in the middle of the awk command splits the output onto two lines. In order to handle sending the two lines to the scale function, we use read in a while loop to process each line.

In the awk command, we'll send the scale function the appropriate field parsed from vmstat, followed by the maximum value for the data (in this case 100, as vmstat is reporting a percentage), the threshold value for triggering a warning (75), and a description for the histogram (User: or System:).

  printf "  Proc 5 sec avg:\n"
  (vmstat 5 2 | awk 'END {
   print $13, "100", "75", "User:\n" $14, "100", "75", "System:"}') | \
    while read line; do scale $line; done
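
To see what the while loop receives, the awk step can be run against canned vmstat output. This is a sketch under a few assumptions: the sample numbers are made up, the awk program saves fields 13 and 14 on every line rather than reading them inside END, and echo stands in for the scale function.

```shell
# Made-up vmstat output: two header lines plus two samples. The second
# sample is the 5 second average the report uses.
sample='procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 120000  30000 200000    0    0     5     3  100  200  4  2 93  1  0
 0  0      0 119500  30000 200000    0    0     0    12  110  210  7  3 89  1  0'

# Remember fields 13 (us) and 14 (sy) from each line, so END prints the
# values from the last line: value, maximum, threshold, description.
rows=$(printf '%s\n' "$sample" | awk '{ us = $13; sy = $14 }
  END { print us, "100", "75", "User:"; print sy, "100", "75", "System:" }')

# Stand-in for the scale call: show the arguments scale would be given.
printf '%s\n' "$rows" | while read line; do echo "scale $line"; done
# scale 7 100 75 User:
# scale 3 100 75 System:
```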

To parse the memory use information from free, we'll use the pattern matching capabilities of awk to execute print statements only on the lines that begin with Mem: and Swap:

You might want to exclude the memory use histogram and only report on swap usage, as Linux will use memory not allocated to applications for buffering files, resulting in free almost always reporting very high memory usage:

  printf "  Memory Usage:\n"
  (free | awk '/^Mem:/ {print $3, $2, "100", $1}
              /^Swap:/ {print $3, $2, "1", $1}') | \
    while read line; do scale $line; done
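
Running the same awk patterns against canned output shows how the fields map onto the scale parameters: used, total, threshold, description. The sample below is made up, in the layout printed by the older free that includes a buffers/cache row.

```shell
# Made-up output in the layout of an older free.
sample='             total       used       free     shared    buffers     cached
Mem:        268548     251944      16604          0      38060     124948
-/+ buffers/cache:      88936     179612
Swap:       554200        124     554076'

rows=$(printf '%s\n' "$sample" | awk '/^Mem:/  {print $3, $2, "100", $1}
                                      /^Swap:/ {print $3, $2, "1", $1}')
printf '%s\n' "$rows"
# 251944 268548 100 Mem:
# 124 554200 1 Swap:
```

In this sample the Mem: row works out to about 94% used even though most of that is buffers and cache, which illustrates why the swap figure is the more useful one to watch.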

And finally, for the disk usage output from df, we want to print every line except the header, so we'll use awk's negated match operator (!~) to skip any line whose first field matches Filesystem. We also need to strip the % sign from the df output, and awk can do this for us with its sub function:

  printf "  Disk Usage:\n"
  (df -mP | \
   awk '$1 !~ /Filesystem/ {sub(/%/,""); print $5, "100", "75", $6}') | \
    while read line; do scale $line; done
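
Here is the df filter applied to canned output, as a sketch. The filesystems and sizes are made up, and the threshold shown is the 75 used in the full script at the end of this post.

```shell
# Made-up output in the layout df -mP prints.
sample='Filesystem         1048576-blocks      Used Available Capacity Mounted on
/dev/sda2                   4958      1675      3027      36% /
/dev/sda1                     99        30        64      32% /boot'

# Skip the header, strip the % sign, then print: percent used, maximum,
# threshold, mount point.
rows=$(printf '%s\n' "$sample" |
  awk '$1 !~ /Filesystem/ {sub(/%/,""); print $5, "100", "75", $6}')
printf '%s\n' "$rows"
# 36 100 75 /
# 32 100 75 /boot
```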

Try it out
You can copy the entire script below into a PuTTY session as a non-root user and try it out for yourself. In Part 2 of this series, we'll explore a technique for running the script on multiple ESX hosts using SSH.


###############################################################################
#
#  esx-report.sh
#
###############################################################################
#
#  Copy this entire code segment into your shell to create the ~/esx-report
#  directory and esx-report.sh script and make it executable.
#  If you would rather copy just the script itself, select everything between 
#  the SCRIPTCREATOR limit strings.
#
###############################################################################
if [ ! -d ~/esx-report ]; then mkdir ~/esx-report -m 0700; fi
cd ~/esx-report

cat > ./esx-report.sh <<'SCRIPTCREATOR'
#! /bin/bash

 PATH="/bin:/usr/bin"

# Usage: [value] [max value] [threshold percentage] [description]
function scale ()
{
  # return early if called with fewer than four parameters
  if [ -z "$4" ]; then
    return 1
  fi

  local avg alert scaled hist histtext i
  # change histogram scale here
  local CSCALE=20

  # protect against divide by zero, even though awk doesn't complain
  if [ $1 -gt 0 ] && [ $2 -gt 0 ]; then

    # no floating point in bash, use awk to get avg and round (%.0f) to int
    avg=$(echo $1 $2 | awk '{printf "%.0f", ($1/$2) * 100}')

    scaled=$(echo $1 $2 $CSCALE | awk '{printf "%.0f", ($1/$2) * $3}')
  else
    avg=0; scaled=0
  fi

  if [ $avg -ge $3 ]; then
    alert=$3
  fi

  # shift off first three args, leaving rest of args for description
  shift 3
  # grab whole array of args left over, this allows for spaces in description
  histtext="$@"

  printf "    %-10.10s %3d%% " "$histtext" "$avg"

  for ((i=0; i<scaled; i++)); do
    hist="${hist}#"
  done

  # with scaling, low values may show nothing in histogram
  # if hist undef, add a #, but we want a zero value to show nothing
  if [ ! $hist ] && [ $avg -gt 0 ]; then
    hist='#'
  fi

  printf "[%-${CSCALE}.${CSCALE}s]" "$hist"
  printf "\n"

  if [ $alert ]; then
    printf "%28s%3d%% %8s" "WARNING:" "$alert" "threshold"
    printf "\n"
  fi
}

  printf ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n"
  printf ">\n"
  printf "> $(hostname)\n\n"

  printf " Service Console Stats:\n"
  printf "  $(uptime | egrep -o 'load.*' | tr 'l' 'L')\n"
  printf "  Number of running processes: $(ps -e | grep . -c)\n\n"
  printf "  Proc 5 sec avg:\n"
  (vmstat 5 2 | awk 'END {
   print $13, "100", "75", "User:\n" $14, "100", "75", "System:"}') | \
    while read line; do scale $line; done

  printf "  Memory Usage:\n"
  (free | awk '/^Mem:/ {print $3, $2, "100", $1}
              /^Swap:/ {print $3, $2, "1", $1}') | \
    while read line; do scale $line; done

  printf "  Disk Usage:\n"
  (df -mP | \
   awk '$1 !~ /Filesystem/ {sub(/%/,""); print $5, "100", "75", $6}') | \
    while read line; do scale $line; done
  printf "\n\n\n\n"

SCRIPTCREATOR

chmod 0700 ./esx-report.sh

###############################################################################

2 comments:

  1. Very nice script..... I look forward to part two!

    But I do wonder how well this script will scale with 70+ hosts to monitor...

  2. Hey Andrew, that's a good question, I'm hoping you guys will let me know how it works in a large installation. To be honest, I don't think I had quite that many hosts in mind...

    But out of curiosity, I let it run against 60 hosts in my test environment. I only have a cluster of three ESX hosts in that setup, so I just looped it on the three 20 times, so not exactly the best test, but probably pretty close. Anyway it took 5 minutes and 52 seconds to complete, but when you consider there is a 5 second pause in the script to gather processor use from vmstat, that's actually only 52 seconds to run the script, so less than a second per host.

    Outlook reports the email size as 66 KB, so not too bad. One issue though, I sent it to my BlackBerry, and it would only display about 30 hosts before it truncated the message, so that could be an issue if you want more than 30 hosts per email on your Berry.
