Updated: May 25, 2009
I've had a chance to test this project out with vSphere - ESX 4.0, and everything works the same, except for the last section of the esx-report.sh script that parsed the /proc/vmware/mem file. This file has moved and the format was changed. Since the hypervisor memory usage can be monitored and alerts triggered from vCenter Server, I've just removed that section from the script.
VMware creates some pretty amazing stuff. If they didn't, I wouldn't have a blog about it, and you wouldn't be reading blogs about it. But let's be honest, it's not perfect (what software is?), and every now and then something buggy will happen. Sometimes these quirks occur at scary moments, like when removing a snapshot, or in the middle of a VMotion operation. Rarely do they cause any actual downtime, however.
You could classify the quirks into two categories: vCenter Server and ESX. The vCenter issues are almost always harmless, and are sometimes resolved simply by closing and reopening the VI/vSphere client. It's the ESX quirks that can be really serious, as any problems with ESX can lead to virtual machine downtime.
That's why it is so important to have some type of ESX performance monitoring in place. vCenter provides some basics, but doesn't offer any real visibility into the Service Console, which is where the serious problems can be lurking. There are some very good commercial products, and one in particular, Veeam Monitor, is even offered in a free edition. Sometimes you need something more customized, however, and the only option is to build it yourself.
In this four-part series, we'll build our own ESX health report with a shell script, use key-based SSH authentication so that one ESX server can run the script on the others, and then email the report using a Perl script. We'll finish with a quick modification to enable the report to trigger an email when performance thresholds are reached. The format of the report will be designed to display perfectly on a BlackBerry Curve set to its smallest font size, allowing us to know about issues from anywhere.
The motivation
I started working on this script after two separate ESX Service Console incidents. The first occurred after upgrading an ESX 3.5 server to Update 2. The upgrade was successful, and everything seemed perfectly normal. But as it turned out, there was an issue between the HP Systems Insight Manager server and the new update, causing a new process to be launched in the Service Console every five minutes, but the processes never terminated. After a week of this, several thousand zombie processes were running in the Service Console. There is a limit on the number of processes before the ESX server will stop launching new ones, and once you hit that limit, chaos ensues.
The second incident was less severe, but was just as scary because no one really knew how long the problem had been occurring. For reasons never understood, one of the VMware log files started filling up with the same generic HA message, logging more than five entries a second. The log file was rotating through all of its file names several times an hour, and the Service Console processor usage was pegged. The problem was finally noticed when a VMotion took longer than fifteen minutes to complete.
The plan
The setup for the ESX health report is fairly simple, and it follows two key rules of a good security policy: do not permit SSH access for root, and never store passwords in scripts. The basic plan is:
- A non-root user account will be used for the entire process, and a public SSH key for this account will be distributed to the other ESX servers
- From the central ESX server the non-root account will SSH to each ESX host, execute the script, and redirect the script output to the central ESX host
- The combined output from each ESX server will then be emailed using a perl script, formatted to display nicely on a BlackBerry Curve
For this setup to work, you'll need to use the same non-root account on each ESX server. Even though most of the VMware command-line tools can only be executed by root, we can get most of the critical stats we need with just a regular non-privileged account.
It will also be necessary to open an outbound port for SMTP in the firewall on the ESX host that will be emailing the report. If your network design has isolated the ESX hosts from the rest of the network (and it should!), and only the server running vCenter has access to the service console network, you'll need to set up SMTP on the vCenter server and configure it to forward to an Exchange server or whatever groupware application you use, which we'll cover in Part 3.
The script
The shell script used to gather data from the ESX hosts is pretty straightforward, and can be developed and tested from a local ESX console, as the output is just being sent to the terminal.
The scale function will perform all calculations and print the histogram output. The function expects four parameters: the value, the maximum range for the value, a threshold percentage for generating a warning, and a description for the histogram.
The CSCALE local variable determines how the data is scaled, or how many intervals the graph will display. I've used a value of twenty here, mainly because it displays perfectly on my BlackBerry Curve, so each hash will represent a 5% interval. If you need more resolution than that in the graphs, it's just a matter of changing the CSCALE constant:
function scale {
local CSCALE=20
The bash shell lacks the ability to perform floating point calculations, which makes getting percentages pretty tough. However, awk fills this gap easily, and can round the percentages by simply using a printf format specifying a floating point number with a precision of 0, so no decimal point will be printed (%.0f):
avg=$(echo $1 $2 | awk '{printf "%.0f", ($1/$2) * 100}')
We'll use awk again to determine how many hashes should be printed in the histogram, but this time we'll multiply by the CSCALE value rather than 100:
scaled=$(echo $1 $2 $CSCALE | awk '{printf "%.0f", ($1/$2) * $3}')
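To see the two awk calculations side by side, here is a minimal sketch. The percent and hashes helper names are made up for this illustration and are not part of esx-report.sh; the awk expressions are the same ones used above.

```shell
# Hypothetical helpers for illustration, not part of esx-report.sh.

# percent VALUE MAX: VALUE as a whole-number percentage of MAX.
percent() {
    echo "$1" "$2" | awk '{printf "%.0f", ($1/$2) * 100}'
}

# hashes VALUE MAX WIDTH: how many # characters fill a bar WIDTH columns wide.
hashes() {
    echo "$1" "$2" "$3" | awk '{printf "%.0f", ($1/$2) * $3}'
}

printf '%s%% -> %s of 20 hashes\n' "$(percent 1536 2048)" "$(hashes 1536 2048 20)"
printf '%s%% -> %s of 20 hashes\n' "$(percent 1 3)" "$(hashes 1 3 20)"
```

The first line reports 75% and 15 hashes. The second reports 33% and 7 hashes: the two values are rounded independently, so the bar can be one hash longer or shorter than the printed percentage suggests.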
With the percentage value rounded and stored in avg, we will compare it to the threshold value specified as the third parameter to the scale function. If avg is greater than or equal to the threshold, we'll store the third parameter in a variable named alert that we can check for later:
if [ $avg -ge $3 ]; then
alert=$3
fi
When calling a bash function, the parameters are specified as a space delimited list. This is problematic if the arguments themselves have spaces in them, which the descriptions for the histograms most certainly will. So in the scale function, we'll make the description the last parameter, so that everything from the fourth parameter on is part of the description. Using the built-in shift command, we can shift off the first three parameters, leaving $@, the array of all the arguments passed to the function, holding only the description. We can then grab the whole array of arguments left over and store it in a single variable:
shift 3; histtext=$@
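A quick way to convince yourself that the description survives intact is a throwaway function like the one below. The description_of name is hypothetical, and the sketch uses "$*" rather than $@ to make the join into one space-separated string explicit.

```shell
# Hypothetical demo function: return everything after the third argument.
description_of() {
    shift 3
    # "$*" joins the remaining arguments with single spaces (default IFS).
    local text="$*"
    printf '%s\n' "$text"
}

# Unquoted $line is split on whitespace, just as in the report's read loops.
line="512 2048 75 Service Console Memory"
description_of $line
```

This prints Service Console Memory, even though the description arrived as three separate arguments.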
Now that the calculations are done and parameters stored, we can start printing some histograms. To keep everything lined up in the output, we'll let printf format specifiers do all the work. This first printf command outputs the description for the histogram followed by the average. The first format directive, %-10.10s, specifies a string value (s) will be printed in a 10 character width field (10), and it should be left justified (-) with a precision of 10, meaning only print the first 10 characters even if it is longer (.10).
The second format directive, %3d%%, specifies an integer value will be printed in a three character width field, and a percent sign will follow it:
printf " %-10.10s %3d%% " "$histtext" "$avg"
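The effect of the width and precision is easier to see with the padding made visible. This is only a sketch: the label helper is hypothetical, and the | characters are added purely to mark where the output starts and ends.

```shell
# Hypothetical helper: print a label and percentage with the report's format
# directives, wrapped in | characters so the padding is visible.
label() {
    printf '|%-10.10s %3d%%|\n' "$1" "$2"
}

label "User:" 7
label "/var/log/vmware" 100
```

The short label is padded out to ten columns and the single digit is right aligned in three, while the long path is cut off after its first ten characters (/var/log/v), so both lines come out exactly the same length.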
This loop adds # characters to the hist variable up to the value of scaled, giving us the bar for the histogram:
for ((i=0; i<scaled; i++)); do
hist="${hist}#"
done
And now we'll print the histogram bar, enclose it in [] brackets, left justify it (-), and print it in a field width equal to the CSCALE value:
printf "[%-${CSCALE}.${CSCALE}s]" "$hist"
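Here is the bar logic on its own as a small sketch. The bar helper and its argument order are assumptions made for this example, and it uses a plain while loop so it also runs in shells without bash's C-style for loop.

```shell
# Hypothetical helper: bar WIDTH COUNT prints COUNT hashes in a WIDTH-column box.
bar() {
    local width=$1 count=$2 hist="" i=0
    # Portable while loop; the report itself uses bash's C-style for loop.
    while [ "$i" -lt "$count" ]; do
        hist="${hist}#"
        i=$((i + 1))
    done
    # The width pads short bars, the precision truncates bars that are too long.
    printf "[%-${width}.${width}s]\n" "$hist"
}

bar 10 3
bar 10 14
```

The first call prints three hashes followed by seven spaces inside the brackets. The second asks for fourteen hashes but prints only ten, because the precision caps the bar at the field width.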
If the average was greater than or equal to the threshold, the alert variable will be defined, so we will print a warning message right below the histogram, lining it up perfectly by using printf format directives to specify the field width:
if [ $alert ]; then
printf "%28s%3d%% %8s" "WARNING:" "$alert" "threshold"
printf "\n"
fi
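The threshold check and the warning line can be tried in isolation with a sketch like this one. The warn_if_over name is hypothetical, and the test uses the quoted [ -n "$alert" ] form, which behaves the same as the unquoted test above for these values.

```shell
# Hypothetical helper: warn_if_over AVG THRESHOLD prints the report's warning
# line when AVG is greater than or equal to THRESHOLD, and nothing otherwise.
warn_if_over() {
    local avg=$1 threshold=$2 alert=""
    if [ "$avg" -ge "$threshold" ]; then
        alert=$threshold
    fi
    if [ -n "$alert" ]; then
        printf "%28s%3d%% %8s\n" "WARNING:" "$alert" "threshold"
    fi
}

warn_if_over 80 75   # prints the right-aligned warning line
warn_if_over 40 75   # prints nothing
```

The %28s directive right-aligns WARNING: in a 28 column field, which is what pushes the message under the histogram bar.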
With the scale function defined, we can start gathering data and formatting the report. We'll begin by printing the hostname of the ESX server:
printf "$(hostname)\n\n"
To get the load average for the service console, we'll use the last section of the output from uptime. Using egrep with the -o option tells it to only print out the section of the line that matches. Then use tr to change the lower case 'l' in 'load' to upper case, and we have a nicely formatted load average:
printf " Service Console Stats:\n"
printf " $(uptime | egrep -o 'load.*' | tr 'l' 'L')\n"
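To try the filter without waiting on a real host, you can feed it a canned line. The uptime text below is made up for illustration, the load_average helper is hypothetical, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical helper: keep only the load average part of an uptime line.
load_average() {
    grep -Eo 'load.*' | tr 'l' 'L'
}

# Made-up uptime output, for illustration only.
echo ' 10:15:01 up 12 days,  3:04,  1 user,  load average: 0.08, 0.03, 0.01' |
    load_average
```

This prints Load average: 0.08, 0.03, 0.01. Note that tr translates every lower case 'l' it sees, which is harmless here only because 'load' holds the one 'l' in the matched text.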
Getting the number of running processes is easy: just pipe ps into grep, match everything (.), and output the number of lines that matched (-c). This won't be completely accurate, as the ps header line, the script itself, and any subshells it launches will be included in the count. But that's not a big deal. The process count is only critical when it gets into the thousands, so we don't really care if it's inflated by a few counts:
printf " Number of running processes: $(ps -e | grep . -c)\n\n"
We can get the processor use average from vmstat, telling it to calculate the average over five seconds. With vmstat 5 2, the first sample is the average since boot and the second covers the five second interval, which is why the awk END block reads its fields from the last line. We're grabbing two stats here, the user and system processor use percentages, so the '\n' newline in the middle of the awk command splits the output onto two lines. In order to handle sending the two lines to the scale function, we use read in a while loop to process each line.
In the awk command, we'll print the appropriate field parsed from vmstat for the scale function, and then follow it with the maximum value for the data (in this case 100, as vmstat is reporting a percentage), the threshold value for triggering a warning (75), and a description for the histogram (User: or System:):
printf " Proc 5 sec avg:\n"
(vmstat 5 2 | awk 'END {
print $13, "100", "75", "User:\n" $14, "100", "75", "System:"}') | \
while read line; do scale $line; done
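If you want to check the field positions without a five second wait, the same awk program can be run against canned input. Everything here except the awk program is an assumption for the sake of the example: the vmstat numbers are invented, and show_args stands in for the scale function so the arguments it would receive are visible. Like the original, it relies on awk keeping the last record's fields available in END, which gawk and mawk do.

```shell
# Stand-in for the scale function: show the four arguments it would receive.
show_args() {
    printf 'value=%s max=%s threshold=%s desc=%s\n' "$1" "$2" "$3" "$4"
}

# Same awk program as the report: fields 13 and 14 are us and sy.
cpu_fields() {
    awk 'END {
        print $13, "100", "75", "User:\n" $14, "100", "75", "System:"}'
}

# Made-up vmstat output; the last line plays the five second sample.
sample_vmstat() {
cat <<'EOF'
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0 120000  30000 200000    0    0     5     3  100   150  2  1 97  0
 1  0      0 119000  30000 200000    0    0     0    12  110   160 12  4 84  0
EOF
}

sample_vmstat | cpu_fields | while read line; do show_args $line; done
```

With this sample, the loop calls the stand-in twice, once with a value of 12 for User: and once with a value of 4 for System:.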
To parse the memory use information from free, we'll use the pattern matching capabilities of awk to execute print statements only on the lines that begin with Mem: and Swap:.
You might want to exclude the memory use histogram and only report on swap usage, as Linux will use memory not allocated to applications for buffering files, resulting in free almost always reporting very high memory usage:
printf " Memory Usage:\n"
(free | awk '/^Mem:/ {print $3, $2, "100", $1}
/^Swap:/ {print $3, $2, "1", $1}') | \
while read line; do scale $line; done
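One possible variation, if you would rather keep a memory histogram, is to read the "-/+ buffers/cache:" line instead of the Mem: line, since it already subtracts buffers and cache. This is only a sketch and not what esx-report.sh does: it assumes the free on your host prints that line (older versions do, newer ones have dropped it), the sample numbers are invented, and the app_memory_fields name, the Apps: label, and the 90 percent threshold are all made up for the example.

```shell
# Made-up output in the layout of the older free command, for illustration.
sample_free() {
cat <<'EOF'
             total       used       free     shared    buffers     cached
Mem:        268548     251320      17228          0      31204     140520
-/+ buffers/cache:      79596     188952
Swap:       554200       1024     553176
EOF
}

# Hypothetical variation: report memory used by applications, plus swap.
# Usage: app_memory_fields THRESHOLD
app_memory_fields() {
    awk -v threshold="$1" '
        /^Mem:/     {total = $2}
        $1 == "-/+" {print $3, total, threshold, "Apps:"}
        /^Swap:/    {print $3, $2, "1", $1}'
}

sample_free | app_memory_fields 90
```

The output lines have the same value, maximum, threshold, description layout as before, so they could be piped into the same while read loop.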
And finally, for the disk usage output from df, we want to print every line except the header line, so we'll use awk's non-match operator (!~) to skip the line whose first field is Filesystem. We also need to filter out the % signs from the df output, and awk can do this for us with its sub function:
printf " Disk Usage:\n"
(df -mP | \
awk '$1 !~ /Filesystem/ {sub(/%/,""); print $5, "100", "75", $6}') | \
while read line; do scale $line; done
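The df parsing can be checked the same way against canned input. The filesystem list below is invented, and the disk_fields helper, which takes the threshold as an argument, is a hypothetical wrapper around the same awk program.

```shell
# Made-up df -mP output, for illustration only.
sample_df() {
cat <<'EOF'
Filesystem         1048576-blocks      Used Available Capacity Mounted on
/dev/sda2                   4959      1622      3082      35% /
/dev/sda1                     99        26        68      28% /boot
none                         132         0       132       0% /dev/shm
/dev/sda6                   1984        83      1799       5% /var/log
EOF
}

# Hypothetical wrapper: skip the header, strip the % sign, and print the
# value, maximum, threshold and mount point for each filesystem.
disk_fields() {
    awk -v threshold="$1" \
        '$1 !~ /Filesystem/ {sub(/%/, ""); print $5, "100", threshold, $6}'
}

sample_df | disk_fields 75
```

The first line of output is 35 100 75 /, with the % already removed, so the value can be used in the integer comparison inside scale.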
Try it out
You can copy the entire script below into a PuTTY session as a non-root user and try the script out for yourself. In Part 2 of this series, we'll explore a technique for running the script on multiple ESX hosts using SSH.
###############################################################################
#
# esx-report.sh
#
###############################################################################
#
# Copy this entire code segment into your shell to create the ~/esx-report
# directory and esx-report.sh script and make it executable.
# If you would rather copy just the script itself, select everything between
# the SCRIPTCREATOR limit strings.
#
###############################################################################
if [ ! -d ~/esx-report ]; then mkdir ~/esx-report -m 0700; fi
cd ~/esx-report
cat > ./esx-report.sh <<'SCRIPTCREATOR'
#! /bin/bash
PATH="/bin:/usr/bin"
# Usage: [value] [max value] [threshold percentage] [description]
function scale ()
{
# exit if called without four parameters
if [ -z $4 ]; then
return 1
fi
local avg alert scaled hist histtext i
# change histogram scale here
local CSCALE=20
# protect against divide by zero, even though awk doesn't complain
if [ $1 -gt 0 ] && [ $2 -gt 0 ]; then
# no floating point in bash, use awk to get avg and round (%.0f) to int
avg=$(echo $1 $2 | awk '{printf "%.0f", ($1/$2) * 100}')
scaled=$(echo $1 $2 $CSCALE | awk '{printf "%.0f", ($1/$2) * $3}')
else
avg=0; scaled=0
fi
if [ $avg -ge $3 ]; then
alert=$3
fi
# shift off first three args, leaving rest of args for description
shift 3
# grab whole array of args left over, this allows for spaces in description
histtext="$@"
printf " %-10.10s %3d%% " "$histtext" "$avg"
for ((i=0; i<scaled; i++)); do
hist="${hist}#"
done
# with scaling, low values may show nothing in histogram
# if hist undef, add a #, but we want a zero value to show nothing
if [ ! $hist ] && [ $avg -gt 0 ]; then
hist='#'
fi
printf "[%-${CSCALE}.${CSCALE}s]" "$hist"
printf "\n"
if [ $alert ]; then
printf "%28s%3d%% %8s" "WARNING:" "$alert" "threshold"
printf "\n"
fi
}
printf ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n"
printf ">\n"
printf "> $(hostname)\n\n"
printf " Service Console Stats:\n"
printf " $(uptime | egrep -o 'load.*' | tr 'l' 'L')\n"
printf " Number of running processes: $(ps -e | grep . -c)\n\n"
printf " Proc 5 sec avg:\n"
(vmstat 5 2 | awk 'END {
print $13, "100", "75", "User:\n" $14, "100", "75", "System:"}') | \
while read line; do scale $line; done
printf " Memory Usage:\n"
(free | awk '/^Mem:/ {print $3, $2, "100", $1}
/^Swap:/ {print $3, $2, "1", $1}') | \
while read line; do scale $line; done
printf " Disk Usage:\n"
(df -mP | \
awk '$1 !~ /Filesystem/ {sub(/%/,""); print $5, "100", "75", $6}') | \
while read line; do scale $line; done
printf "\n\n\n\n"
SCRIPTCREATOR
chmod 0700 ./esx-report.sh
###############################################################################