July 23, 2009

Virtualize Cisco Routers with GNS3


This title is a little misleading, I figure I get to do that at least once. Yes, GNS3, http://www.gns3.net, emulates the hardware of several Cisco router platforms, and it will boot real IOS images. It's also free, easy to install, and there are abundant video and written tutorials. But you're not going to use it in your production environment.

So what's it good for? Well, there's the obvious; if you are studying for a Cisco certification test, GNS3 is like a dream come true. Though there have been Cisco emulation packages around for a while, they are usually expensive, and only provide a small fraction of the IOS feature set.

But even if you're not studying for an exam, or have no interest in Cisco IOS, GNS3 is a fantastic tool for testing network and service architecture designs. When used in combination with VMware Workstation, connecting your virtual machines to a GNS3 topology takes about three clicks. In the simple diagram above, I've connected two VMs across a virtual WAN link, using two Cisco 3640 routers and OSPF as the dynamic routing protocol. The computer shapes are really cloud objects with customized symbols. When you configure a cloud object, the vmnet virtual networks appear in the drop-down list of available NIO Ethernet objects. To connect a virtual machine to a GNS3 virtual router, just add the virtual network the VM is plugged into in the cloud configuration window, and then create a Fast Ethernet connection to a router.

The mind boggles at the thought of all the projects you can test with this setup. Imagine being able to test a change to the Active Directory infrastructure between multiple WAN sites, or verify client connections after changing from a static routing configuration to OSPF. A few months back I completely modeled a client's network infrastructure using GNS3, and proved how they could reduce their routing tables by about 10,000 entries by configuring EIGRP route summarization on a single link.
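For the curious, the kind of summarization mentioned above is configured per interface in Cisco IOS. A hypothetical sketch (the AS number, interface, and summary range are invented for illustration):

```
interface Serial0/0
 ip summary-address eigrp 100 10.0.0.0 255.0.0.0
```

This tells EIGRP AS 100 to advertise a single 10.0.0.0/8 summary out Serial0/0 instead of every individual subnet behind the router, which is where those 10,000-entry reductions come from.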

If you are running a recent version of Ubuntu on your demo laptop, installing GNS3 is as simple as sudo apt-get install gns3. The install is pretty painless on XP as well, and possible on Vista (though I've read of folks having a lot of issues), but GNS3 runs a lot slower and you won't be able to run as many router instances with XP or Vista.

My current demo laptop, a Core2 Duo T9600 running Ubuntu 9.04 64-bit, VMware Workstation 6.5 and GNS3, is the coolest thing since sliced bread. And I would know, because I eat a lot of sandwiches.

Oh that's bad..... but it's true, I do eat a lot of sandwiches


July 12, 2009

Know Your History

The UNIX/Linux shells evolved on a planet where saving every keystroke and millisecond of time was absolutely essential to survival. As a result, they're chock full of shortcuts, many of them with overlapping functionality, letting the user choose the method that works best for them.

The shell command history is a prime example. Even a newborn knows to use the up and down arrow keys to recall commands, and most toddlers are piping history into less to perform manual searches for complicated commands they don't feel like recreating. But if you ever have the chance to stare over the shoulder of a grey-bearded shell guru, you'll see that the true masters use several different techniques to pull up commands in the most efficient way. The following report was compiled from the sage advice given by these mysterious wizards.

Bang!
We'll start with the classic C Shell history syntax. You've no doubt used an exclamation point to re-execute a command from the history list, like !42, or have used the double exclamation to execute the previous command, !!. You can get pretty fancy with the available event designators, for instance, !-3 will execute the command three commands ago. And !echo will execute the most recently run command that began with echo.

This syntax has always scared me, as I could see myself executing a dangerous command without realizing it, especially if I'm in a hurry. But there is a safer way to use it: append the :p modifier, which displays the command but doesn't actually execute it. For instance, if the last command executed was echo Hello, typing !!:p would preview the command without running it, so you'd see echo Hello instead of Hello. This lets you make sure it's the command you intended to recall, and you can simply press the up arrow to execute it.

The ! syntax can be really useful in a grep-the-history-file workflow. Say I grep for the last time I issued an esxcfg-firewall command:

  history | grep esxcfg

    Output:
    203  esxcfg-firewall --openPort 9090,tcp,in,SimpleHTTP

  !203:p

    Output:
    esxcfg-firewall --openPort 9090,tcp,in,SimpleHTTP

Now if I hit the up arrow, the command is put on the prompt, and I can edit it to add a different port, etc.

If you'd like to delve deeper, google "c shell command history".

Fix Command
Another method for recalling commands from the history list is the bash built-in Fix Command, invoked with fc. To get the skinny on fc, bring up the man page for the built-in commands, with man builtins. Invoking fc with the -l option will print the last 16 commands. You can also specify a range of history commands to display, like:

  fc -l 208 234

The fc command can come in real handy when you are trying to recall and edit a whole series of commands. For instance, say you remember adding an alias to your .bashrc file, and you want to add an additional alias using the same series of commands to make sure it's configured properly. You recall using a cp command first to back up the .bashrc file. Specifying 'cp' as a search string after fc -l will print the history list beginning with the last occurrence of a command matching the search string:

  fc -l cp

  110  cp .bashrc backup_bashrc
  111  echo "alias lax='ls -lax'" >> ~/.bashrc
  112  cat .bashrc
  113  . .bashrc
  114  lax

To create an la alias, invoke fc with a range of history events to copy them all into the default editor, which is set to vi in the ESX console:

  fc 110 114

Running the above command will copy the history events 110 through 114 into a vi editor session, where they can be modified to create the alias for la. Typing ZZ in vi will exit the editor and execute all the commands in the buffer: backing up .bashrc, creating the alias, cat'ing the file to see the change, re-sourcing .bashrc, and finally testing the new la alias. I don't use fc very often, but for scenarios like this it is a great tool to be familiar with.

vi Command Editing
The next two methods for using the command history rely on functionality provided by the GNU Readline library, a standard set of C functions available to any application, including bash. There are two editing and key-binding modes, emacs and vi. The default mode is emacs, and you can see what mode your shell is in now by typing:

  set | grep SHELLOPTS

If you're in the default emacs mode, you'll see emacs in the colon-separated list of options. To change this to vi mode, type:

  set -o vi

Now if you check the SHELLOPTS variable, you'll find emacs has been replaced with vi. I'm a vi kind of guy, so I always add the set option to my .bashrc file, with a command like:

  echo "set -o vi" >> ~/.bashrc

And source .bashrc again to get the changes:

  . ~/.bashrc

We'll look at how to use the Readline library in vi editing mode with an easy example. I've just grepped the default word file that comes with the service console for every word that begins with 'a' (this word file is not present in ESX 4.0). I'll do the same for the letters 'b' through 'f', and if I execute history, this is the output:

[admin@esx02 admin]$ history
    1  grep ^a /usr/share/dict/words
    2  grep ^b /usr/share/dict/words
    3  grep ^c /usr/share/dict/words
    4  grep ^d /usr/share/dict/words
    5  grep ^e /usr/share/dict/words
    6  grep ^f /usr/share/dict/words
    7  history

Back at the shell prompt, we can search through the history list for the last instance of the grep command. Press Esc to enter vi command editing mode, and then use a forward slash to search, just like in a normal instance of vi:

/grep

After hitting return, we should find grep ^f /usr/share/dict/words has been placed on the command line. Pressing n will iterate through each grep command until the first instance is found -- the grep command for words starting with 'a'. Continuing to press n will do nothing now, as we've reached the end of the matches for grep. However, pressing N will now iterate through the grep matches in reverse, working its way back to the most recent grep command. This is handy and easy to remember: n or N for next, forward or backward through the history list.

Of course, the whole point of accessing the history list with this method is to easily edit the commands before executing them. If we want to modify the grep command to find all words starting with 'fox', we can just press Esc at a bash prompt to enter vi command editing mode, type /grep ^f to find the history entry where we searched for words starting with f, press i to enter vi insert mode, edit the command to read grep ^fox /usr/share/dict/words, and press return to execute it. If we then want every word that starts with ox, we just perform another search, /grep ^fox, move the cursor over the 'f', and hit x -- the vi delete-character command -- to remove it.

This example is beyond goofy, but if you play around with this method, you'll find it to be very powerful and a huge timesaver.

Ctrl-r
If you press Ctrl-r in your terminal, you'll get a curious looking prompt:

  (reverse-i-search)`':

If you start typing a part of the command you wish to search through the history for, the command will appear after the prompt:

  (reverse-i-search)`grep': grep ^f /usr/share/dict/words

This is the most recent grep found while searching back through the history list. If you press Ctrl-r again, you'll find the one before that:

  (reverse-i-search)`grep': grep ^e /usr/share/dict/words

If you hit return, the command will execute. If you hit the left or right arrow keys, the command will be left on the prompt so you can edit it first.

The reverse-i-search prompt comes from a Readline function named reverse-search-history. You can see that by default it is bound to the Ctrl-r key by issuing a bind -P command. If you look through the list, you should also see forward-search-history has been bound to Ctrl-s. The forward-search-history command is a useful companion, because if you play with Ctrl-r, you'll notice that as it iterates through the history list, it remembers where it left off. So if you reverse search all the way to the start of the history file, you then have to cycle back around, even though the command you want was just one command more recent.

But there is a big problem with forward-search-history: its default binding, Ctrl-s, is also the terminal's stop-output character, and the terminal driver intercepts it before Readline has a chance to see it (if you already pressed Ctrl-s and your terminal appears dead, just type Ctrl-q to bring it back to life).

You can view the key bindings used by stty with this command:

  stty -a
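One workaround, assuming you don't need software flow control in your terminal session, is to disable XON/XOFF handling entirely so that Ctrl-s actually reaches Readline; a one-line .bashrc fragment:

```
# Disable XON/XOFF flow control; Ctrl-s will no longer freeze the
# terminal, so forward-search-history works on its default binding
stty -ixon
```

Keep in mind some serial consoles still rely on flow control, so the rebinding approach described next is the more conservative fix.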

But these two commands are too handy to let a little key binding issue get in the way, and we can easily add custom key bindings to our .bashrc file:

  bind '"\C-f": forward-search-history'

  bind '"\C-b": reverse-search-history'

Just edit the ~/.bashrc file with the editor you prefer, add those two lines in, and re-source with: . ~/.bashrc

Now Ctrl-b searches back through the history file, and Ctrl-f searches forward, and you can toggle back and forth between them and the search term will remain. Nice!

Share your secrets
Hopefully you found a new trick to play with, and if you have your own history workflow, please leave a comment with the details.


June 30, 2009

Cut Your Exchange Backup Window in Half

Going all the way back to the Exchange 5.5 days, I've preferred doing disk-to-disk-to-tape Exchange backups. I'll use NTBackup for the disk-to-disk part, and the regular file backup agent that comes with whatever backup software we happen to be using for the to-tape part. This means eschewing the Exchange backup agents that most vendors provide, which in my mind is a big plus. Opinions will certainly vary, but using this method provides cost savings, fast restore times, extra protection and redundancy, and the reassurance that comes with using the native Exchange backup application. If you need more convincing, there are a few white papers from Microsoft describing this same setup as the Exchange backup method used by their in-house staff.

A client that is currently using this configuration has just surpassed the 200 GB mark for their combined mailbox store size, and the nightly full backups were taking a little longer than 6 hours to complete. This large backup window was preventing the nightly database maintenance tasks from completing, so a new strategy was in order. While thinking through some possibilities, I remembered reading about some registry tweaks that could improve the NTBackup performance when backing up to disk. After a little research, I made the changes, and the results were almost unbelievable: the Exchange backup job that had previously taken more than 6 hours to complete now finished in just under 2 1/2 hours!

While taking advantage of this dramatic speed boost only requires three registry changes and an additional command line parameter, there is a big bummer at first glance: the NTBackup registry keys that need to be changed reside in the HKEY_CURRENT_USER hive. This really cramps my style as I always configure the scheduled task that kicks off the NTBackup job as the NT AUTHORITY\SYSTEM account with a blank password. If you work in an environment with strict password change policies, even for system accounts, you know the pain of having to maintain passwords in scheduled tasks and scripts. Life is so much easier if it can just be avoided. But since the system account doesn't execute NTBackup interactively, the registry keys don't get created, and I assumed this meant there was no way to have the application check for the configuration tweaks.

But thankfully I was wrong, and it's a pretty simple process to manually create the necessary keys in the right spot:

  • First of all, you need to actually complete a backup job once to get the registry entries set up. So, as a regular administrator on the Exchange server, launch NTBackup, select a single temp file somewhere to back up, let the job run to completion, and then just delete the temporary backup set.

  • Launch regedit, and drill down to HKEY_CURRENT_USER\Software\Microsoft\Ntbackup\Backup Engine

  • You should already see the values we're about to change, if not, something didn't get created properly, so try a manual NTBackup job again. If the keys are present, make the following changes:

    • Change Logical Disk Buffer Size from 32 to 64

    • Change Max Buffer Size from 512 to 1024

    • Change Max Num Tape Buffers from 9 to 16

  • After making the changes, select the Backup Engine key from the left pane, and right click and select Export. Save it as a .reg file, and make sure Selected branch at the bottom of the Export window is set to HKEY_CURRENT_USER\Software\Microsoft\Ntbackup\Backup Engine

  • Now we'll locate the system account's registry settings. With regedit still open, browse to HKEY_USERS\S-1-5-18\Software\Microsoft\Ntbackup. S-1-5-18 is the well-known SID for the Local System account, and unless you've scheduled NTBackup to run as NT AUTHORITY\SYSTEM before, the key will most likely be empty.

  • We need to schedule a job to run as NT AUTHORITY\SYSTEM to create the default keys, so launch NTBackup in advanced mode, select the Schedule Jobs tab, and set up a temp job to just back up any text file and schedule it to run in a couple of minutes from now. When prompted for the credentials that should be used for the job, you'll need to change the user account to NT AUTHORITY\SYSTEM with a blank password, several times. In fact, it still won't save it as the account to use, so after saving the scheduled job, open the task from the Scheduled Tasks panel and change the user account to NT AUTHORITY\SYSTEM with a blank password again.

  • After the job runs, you should see the following registry keys have been created under HKEY_USERS\S-1-5-18\Software\Microsoft\Ntbackup; Backup Engine, Backup Utility, Display, and Log Files. But if you drill into Backup Engine, you'll see it didn't create the keys we modified a few steps ago.

  • To easily create the keys, just edit the .reg file we exported earlier in Notepad. Change the line [HKEY_CURRENT_USER\Software\Microsoft\Ntbackup\Backup Engine] to [HKEY_USERS\S-1-5-18\Software\Microsoft\Ntbackup\Backup Engine], and save the file.

  • Now right click the .reg file, and select Merge. You should find the registry settings have been created for the system account, and NTBackup will now use the much speedier settings even when running as the system account.
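As a sketch, the edited .reg file will look something like this, trimmed to just the three values we changed (a real export of the Backup Engine key contains many more values, and the surrounding data may differ on your system):

```
Windows Registry Editor Version 5.00

[HKEY_USERS\S-1-5-18\Software\Microsoft\Ntbackup\Backup Engine]
"Logical Disk Buffer Size"="64"
"Max Buffer Size"="1024"
"Max Num Tape Buffers"="16"
```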

There's another performance mod we need to make to give the backup even more boost. Since Windows Server 2003 Service Pack 1, NTBackup has been equipped with a secret and offensively named /fu switch, for 'file unbuffered' mode. To bolt this on, just edit the Scheduled Task for the NTBackup job, and add the /fu switch after the /hc:off parameter. When you're done, the Run: text box of the Scheduled Task will look something like this:


  C:\WINDOWS\system32\ntbackup.exe backup "@C:\Documents and Settings\
    Administrator\Local Settings\Application Data\Microsoft\Windows NT\
    NTBackup\data\Exchange_Daily.bks" /n "exchange_Backup.bkf created 
    6/30/2009 at 6:06 PM" /d "Set created 6/30/2009 at 6:06 PM" /v:no 
    /r:no /rs:no /hc:off /fu /m normal /j "Exchange_Daily" /l:s /f 
    "E:\exchange_ backups\exchange_Backup.bkf"



June 10, 2009

Configure a Vyatta Cluster for Redundant Virtual Firewalls

If you missed the Protect the Service Console Network With a Virtual Firewall project, we looked at how to use a Vyatta firewall to protect the ESX Service Console network and restrict SSH and VI or vSphere Client access to only a few specific workstations. Vyatta offers an impressive network operating system that can be run from a live CD, permanently installed on physical or virtual hardware, or downloaded as a virtual appliance. It comes with some high end features like stateful packet inspection, site-to-site VPN, OSPF, and BGP. There's a completely free edition with unrestricted access to all the features, but it can also be purchased with support offerings.

If you followed along with the original post, you may have noticed a potential pitfall: what if the ESX server hosting the Vyatta virtual machine goes down? You may have HA enabled, but what if it takes several minutes for the VM to boot all the way up on another host? Even a couple of minutes with no SSH or VM console access during a crisis would feel like an eternity.

Amazingly, the Vyatta operating system also includes clustering, and it's very simple to configure. To set up a cluster, we'll need the following:

  • Two Vyatta VC5 virtual machines, preferably with very similar configurations, see Protect the Service Console Network With a Virtual Firewall for a quick setup tutorial

  • Both Vyatta VMs will need two virtual NICs, each with its own real IP address: one in the Service Console network, and one in a LAN network

  • The clustered Vyatta VMs will host two virtual IPs: one will be the default gateway address configured on every ESX host (172.20.1.254 in our case), and the second will be a LAN address you specify as the route to the Service Console network on the Layer 3 device routing between your LAN subnets. In our case, we have a very simplified setup and the Vyatta's LAN facing interface is the default gateway for the LAN (10.1.1.254)

There's some good documentation on setting up a cluster in the High Availability Reference Guide for VC5 available for download at the Vyatta website.

Once you've got the two Vyatta VMs up and running on different ESX hosts, this is how the cluster configuration will look on the primary Vyatta firewall. We're using a conservative dead-interval of ten seconds, meaning a failover will only occur if keepalives are missed for that long, and keepalives are being sent out every two seconds over the eth0 (Console Network) interface.

The service commands define the virtual IPs the cluster will bring up on the secondary if the primary stops responding:

cluster {
    dead-interval 10000
    group sc-cluster {
        auto-failback true
        primary sc-firewall-pri
        secondary sc-firewall-sec
        service 10.1.1.254/24/eth1
        service 172.20.1.254/24/eth0
    }
    interface eth0
    keepalive-interval 2000
    pre-shared-secret ****************
}

interfaces {
    ethernet eth0 {
        address 172.20.1.252/24
        hw-id 00:50:56:9c:3b:0b
    }
    ethernet eth1 {
        address 10.1.1.252/24
        hw-id 00:50:56:9c:04:a5
    }
    loopback lo {
    }
}


And here's the cluster configuration on the secondary firewall. The actual cluster commands are identical, only the real IPs assigned to the interfaces are different:

cluster {
    dead-interval 10000
    group sc-cluster {
        auto-failback true
        primary sc-firewall-pri
        secondary sc-firewall-sec
        service 10.1.1.254/24/eth1
        service 172.20.1.254/24/eth0
    }
    interface eth0
    keepalive-interval 2000
    pre-shared-secret ****************
}

interfaces {
    ethernet eth0 {
        address 172.20.1.253/24
        hw-id 00:50:56:9c:3c:e0
    }
    ethernet eth1 {
        address 10.1.1.253/24
        hw-id 00:50:56:9c:38:04
    }
    loopback lo {
    }
}

As you can see, it's pretty simple to set up, but one annoyance with the cluster feature is that you have to create the same firewall rules on each device; there's no functionality for syncing up the configurations.

It wasn't too hard, however, to write a quick and dirty little shell script that copies just the firewall configuration from the primary to the secondary, so you only need to maintain the rules on the primary and then remember to run the script after saving changes. If you like, you can set up public key authentication for SSH access from the primary to the secondary like we did in DIY ESX Server Health Monitoring - Part 2, but it's not necessary; the script will prompt you for the password during the SSH connection attempt.


#!/bin/bash

# Vyatta cluster firewall sync script
# by Robert Patton - 2009
#
# Copies firewall rules from primary to secondary
# and applies them to the appropriate interfaces.
#
# Deletes existing firewall rules on secondary and
# removes any firewall sets on interfaces, so make
# sure this is only run from the primary.
#
# Replace the SECONDARY value with the hostname or IP
# of the secondary device in the cluster.

SECONDARY="sc-firewall-sec"

TEMPFWRULES=$(mktemp TEMPFWRULES.XXXXXXXX)
TEMPINTCMDS=$(mktemp TEMPINTCMDS.XXXXXXXX)
TEMPSETCMDS=$(mktemp TEMPSETCMDS.XXXXXXXX)

# Match just the firewall section from the boot config file
awk '/^firewall {/, /^}/' /opt/vyatta/etc/config/config.boot > $TEMPFWRULES

# Match the interface section, we filter for firewall set statements later
awk '/^interfaces {/, /^}/' /opt/vyatta/etc/config/config.boot > $TEMPINTCMDS

# Create a script to run on the secondary with the firewall set commands
# The vyatta-config-gen-sets.pl script creates set commands from the config
cat > $TEMPSETCMDS <<'EOF1'
configure
# First remove any firewalls from interfaces
for int in $(show interfaces ethernet | \
awk '/eth[0-9]/ {print $1}'); \
do delete interfaces ethernet $int firewall; \
done
# Now delete all firewalls
for fwall in $(show firewall name | \
awk '/^ \w* {$/ {print $1}')
do delete firewall name $fwall; \
done
EOF1

cat >> $TEMPSETCMDS <<EOF2
# Create firewalls found on primary
$(/opt/vyatta/sbin/vyatta-config-gen-sets.pl $TEMPFWRULES)
# Apply firewalls to interfaces as defined on primary
$(/opt/vyatta/sbin/vyatta-config-gen-sets.pl $TEMPINTCMDS | grep firewall)
commit
save
exit
exit
EOF2

# Force a tty for the ssh connection - Vyatta environment variables
# and special shell are only set up during an interactive login
cat $TEMPSETCMDS | ssh -tt $SECONDARY

rm -f $TEMPFWRULES $TEMPSETCMDS $TEMPINTCMDS
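The awk range patterns doing the heavy lifting in the script are easy to experiment with outside of Vyatta. A standalone sketch against a made-up config snippet (file names and content invented for the demo):

```shell
# Build a sample file resembling the curly-brace layout of config.boot
cat > /tmp/sample.boot <<'EOF'
firewall {
    name SC-IN {
        rule 10 {
            action accept
        }
    }
}
interfaces {
    ethernet eth0 {
        address 172.20.1.252/24
    }
}
EOF

# The range /^firewall {/, /^}/ prints every line from the one starting
# with "firewall {" through the first closing brace in column one; the
# indented inner braces don't match ^}, so the whole section is captured
awk '/^firewall {/, /^}/' /tmp/sample.boot > /tmp/fw-section
cat /tmp/fw-section
```

Only the firewall section is printed; the interfaces section after the first column-one closing brace never matches the range.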




June 5, 2009

Another ESX Server HTTP File Trick

While reading some docs on the vSphere CLI, I came across this note:

...you can browse datastore contents and host files using a Web browser. Connect to the following location:

http://ESX_host_IP_Address/host
http://ESX_host_IP_Address/folder

You can view datacenter and datastore directories from this root URL...


I knew you could browse the datastores this way as there is a link from the main welcome page, but the /host URL is news to me. Log in with the root account, and it brings up a page titled Configuration files with links to view a bunch of important - you guessed it - configuration files.

Not too terribly interesting, but I can already see myself hitting this URL just to get the vSphere license key or double check that the proper entries are in an ESX server's hosts file.


May 29, 2009

Instantly Serve an ESX Directory via HTTP


python -m SimpleHTTPServer 9090

Before you type that in, understand that it's not going to work unless you've made some ill-advised changes to the Service Console firewall. Also, make sure you fully grasp the security risk you're about to take. This command will start up a simple web server on TCP port 9090 in the current working directory, allowing anyone to browse the files and subdirectories from a web browser under the security context of the user that executed the command. In other words, if you execute this as the root user, in the root directory, any file in the Service Console can be downloaded from a web browser.

This one-liner is extremely dangerous, but it is also extremely handy, and if used correctly in a properly designed environment, the potential risks can be managed. I use this all the time in my test lab to get output files from scripts by simply cd'ing to the script directory, running the above command, and pointing a web browser to http://IP_OF_ESX:9090 from the vCenter server.
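If you want a feel for it without an ESX host, the same module ships with every Python install; a local sketch using python3's http.server (the python3 rename of the py2 SimpleHTTPServer module found on the ESX console), with a made-up scratch directory:

```shell
# Serve a scratch directory over HTTP, fetch a file back, and tear the
# server down again (run this on a workstation, not an ESX host)
mkdir -p /tmp/webshare
cd /tmp/webshare
echo "hello from the scratch share" > index.html

python3 -m http.server 9090 >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1    # give the server a moment to bind the port

# Fetch the file through the web server
python3 -c "import urllib.request; print(urllib.request.urlopen('http://127.0.0.1:9090/index.html').read().decode(), end='')"

kill $SERVER_PID    # tear the server down as soon as we're done
```

The same serve-then-kill discipline is exactly what the safety list below boils down to.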

How to make it safer:
  • The ESX Service Console network should be completely isolated from the LAN, and only vCenter servers and specific administrative workstations are allowed access

  • The Python command should be executed while the working directory is a folder created just for this purpose, and only contains the specific files you want to share and no subdirectories

  • The command should only be executed by a non-root user and the web server torn down as soon as the files have been downloaded by issuing a Ctrl-C

  • The root user must open a specific port in the firewall prior to using the command; for example, to open TCP port 9090:

    esxcfg-firewall --openPort 9090,tcp,in,SimpleHTTP

  • The port should then be closed immediately after the needed files have been downloaded; for example, to close down the previous command:

    esxcfg-firewall --closePort 9090,tcp,in


This also works in ESX 3.5, but the version of Python in the Service Console lacks the -m option, so the path to SimpleHTTPServer.py must be specified:

# ESX 3.5
python /usr/lib/python2.2/SimpleHTTPServer.py 9090


Might be too dangerous for production, so consider the risks carefully. But for testing, it can be really handy.


May 26, 2009

VM Security in vSphere - Same Ol' Situation (S.O.S.)

Over the weekend, I had a chance to test out the directives for locking down the virtual machine security issues discussed in Hardening the VMX File with vSphere / ESX 4.0. Unfortunately, all of the security issues are still present in the GA release of vSphere, including non-privileged users having the ability to disconnect virtual NICs and change the time synchronization behavior.

I can't imagine why this situation still persists through version 4.0 of VMware's enterprise virtualization platform. Are there customers who prefer non-privileged user accounts retain this ability? And if so, couldn't we disable this functionality by default, and require .vmx directives to enable it?

Yes, it is easy to change the default settings, and any sysadmin worth his or her salary will make the changes and audit their environment for compliance. That's a tired argument, however, and better "out of the box" security should be a goal for any product. Anybody remember Windows 2000?


May 25, 2009

DIY ESX Server Health Monitoring - Part 4

If you're just catching this series on creating an ESX health report, in Part 1, Part 2, and Part 3 we set up everything we need to schedule the daily health check and send the results in a HTML formatted email. Running the health check once a day is probably not sufficient if you want to be on top of developing issues, however, and if you have a lot of ESX hosts, reading through a long list of performance statistics may be unreasonable. So to wrap this project up, we'll look at setting up a second cron job that will only send out an alert message when an ESX host exceeds a specified threshold.

Due to the simple design of the health report scripts, to set up this functionality we only need to modify a few lines from the run-esx-report.sh script:
  • The first change is in the loop where we SSH into each ESX host and run the esx-report.sh script. We'll simply change the append redirection symbols, >>, to the create or truncate symbol, >, this way we're creating a new report output file for each host, rather than a combined report. To be extra sure the temp file is truncated each time through the loop, we'll use the noclobber override option as well, so the >> symbols become >|

  • Next, we grep for the word WARNING in the output file, and wrap the rest of the script in an if statement so the email is only sent out if the grep command returns true

  • And finally, we'll just change the subject of the email message


###############################################################################
#
#  run-esx-threshold.sh
#
###############################################################################
#
#  To create the run-esx-threshold.sh script in the ~/esx-report directory,
#  copy this entire code segment into your shell.
#  If you'd rather copy just the script, select everything between the
#  SCRIPTCREATOR limit strings.
#
#  putty will ignore all the tabs, making the copied script quite ugly
#
###############################################################################

# If the ~/esx-report directory exists, cd to it so the script is created there
[ -d ~/esx-report ] && cd ~/esx-report

cat > ./run-esx-threshold.sh <<'SCRIPTCREATOR'
#! /bin/bash
  PATH="/bin:/usr/bin"

  if [ -z "$1" ]; then
    echo "No ESX hosts specified, exiting"
    exit 1
  fi

  if ! pgrep ssh-agent >/dev/null; then
    echo "The ssh-agent process does not appear to be running, exiting"
    exit 1
  fi

  RUNDIR=$(dirname "$(which "$0")")

  source "${HOME}/.ssh-agent" >/dev/null || exit 1

  THISHOST=$(hostname | cut -d . -f 1)

  TEMPTEXT=$(mktemp "${RUNDIR}/temptext.XXXXXXXXXX")

  TEMPHTML=$(mktemp "${RUNDIR}/temphtml.XXXXXXXXXX")

  for host in $@; do
    if [ $(echo $host | cut -d . -f 1) = $THISHOST ]; then
      "${RUNDIR}/esx-report.sh" >| "$TEMPTEXT"
    else
      ssh -q $host "$(cat "${RUNDIR}/esx-report.sh")" >| "$TEMPTEXT" || \
        printf "WARNING: SSH connection to $host failed\n\n\n\n" >| "$TEMPTEXT"
    fi

    if grep WARNING "$TEMPTEXT" >/dev/null; then

      cat >| "$TEMPHTML" <<-'HEADEREOF'
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
	<html>
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<style type="text/css">
	body { font-family: monospace; font-size: 12px }
	pre { font-family: monospace; font-size: 12px }
	</style>
	</head>
	<body>
	<pre>
	HEADEREOF

      cat "$TEMPTEXT" | \
      sed -e 's/>/\&#62;/g' \
          -e 's/WARNING:.*/<span style="color: red">&<\/span>/' >> "$TEMPHTML"

      cat >> "$TEMPHTML" <<-'FOOTEREOF'
	</pre>
	</body>
	</html>
	FOOTEREOF

      "${RUNDIR}/html-mailer.pl" -f esx-report@yourdomain.dom \
                                 -r administrator@yourdomain.dom \
                                 -s "Alert on $host" \
                                 -m exchange.yourdomain.com \
                                 -b "$TEMPHTML"
    fi
  done

  rm -f "$TEMPTEXT"; rm -f "$TEMPHTML"

SCRIPTCREATOR

chmod 0700 ./run-esx-threshold.sh

###############################################################################


Don't spam yourself
When considering how often you want to run the threshold check script, keep one shortcoming of this method in mind: if a parameter continues to exceed its threshold, the script will continue to email you every time it runs. If you set this up to run every five minutes, and head out into the woods over a holiday weekend, you're going to get a thousand alert messages before you get a chance to resolve the issue.
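
One way to blunt the flood, sketched below with hypothetical paths and a placeholder in place of the real mailer call, is a per-host state file: alert on the first failure only, and re-arm once the check comes back clean:

```shell
#!/bin/bash
# Sketch: suppress repeat alerts with one state file per host.
# STATEDIR and the report file here are hypothetical stand-ins.
STATEDIR=$(mktemp -d /tmp/esx-alerted.XXXXXX)

alert_once() {   # usage: alert_once HOST REPORT_FILE
  local host=$1 report=$2
  if grep -q WARNING "$report"; then
    if [ ! -e "${STATEDIR}/${host}" ]; then
      touch "${STATEDIR}/${host}"
      echo "would email alert for $host"   # the real script would call html-mailer.pl
    fi
  else
    rm -f "${STATEDIR}/${host}"            # threshold cleared: re-arm the alert
  fi
}

report=$(mktemp /tmp/demo-report.XXXXXX)
echo "WARNING: CPU high" > "$report"
alert_once esx02 "$report"   # first run prints the alert line
alert_once esx02 "$report"   # repeat run stays silent
```

The same grep test from the script above decides whether mail goes out; the state file just remembers that it already did.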

For our purposes, once every 30 minutes will suffice. We'll add another cron job: issue a crontab -e command as the non-root user, press i to enter insert mode, and below the line containing the 7:10 AM ESX server health report job, add:

0,30 * * * * ${HOME}/esx-report/run-esx-threshold.sh ESX LIST >/dev/null 2>&1

Press Esc, then :wq to write the crontab and exit vi, and we're done!

If you do want to run the threshold check every five minutes, instead of listing every minute (0,5,10,15, and so on), use a range of minutes followed by a forward slash and the step interval (Vixie cron also accepts the shorthand */5), like:

0-59/5 * * * * ${HOME}/esx-report/run-esx-threshold.sh ESX LIST >/dev/null 2>&1

Tweak the thresholds
You'll definitely want to tune the threshold settings in the esx-report.sh script from Part 1. The threshold is the third parameter supplied to the scale function: in the memory usage check below, that's the "100" on the Mem line and the "1" on the Swap line (so any swap usage at all raises a warning):

  printf "  Memory Usage:\n"
  (free | awk '/^Mem:/ {print $3, $2, "100", $1}
              /^Swap:/ {print $3, $2, "1", $1}') | \
    while read line; do scale $line; done
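
To see how that third parameter behaves, here's a minimal stand-in for the scale function (the real one is in Part 1 and is not reproduced here; this sketch only mimics its four-parameter interface):

```shell
#!/bin/bash
# Minimal stand-in for the Part 1 scale function (not the original code):
# scale VALUE MAX THRESHOLD DESCRIPTION
scale() {
  local value=$1 max=$2 threshold=$3 desc=$4
  local CSCALE=20                       # 20 intervals = 5% per hash
  local pct=$(( value * 100 / max ))
  local hashes=$(( pct * CSCALE / 100 ))
  local bar=
  while [ ${#bar} -lt $hashes ]; do bar="${bar}#"; done
  printf '  %-6s [%-20s] %3d%%' "$desc" "$bar" "$pct"
  [ "$pct" -ge "$threshold" ] && printf ' WARNING'
  printf '\n'
}

scale 1500 2000 90 Mem     # 75% used, below the 90% threshold
scale 1900 2000 90 Mem     # 95% used, trips the WARNING
```

Raising the threshold from 90 to 96 in the second call would silence the WARNING without changing the bar itself.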

That does it for the DIY ESX Server Health Monitoring project. I hope you'll find this information easy to customize for your own environment. If you add new performance checks or enhancements, feel free to describe the changes in a comment.

Install it
If you'd like to set the whole thing up, just copy and paste each code segment with a light blue background into a putty session. To install:
  • Create the esx-report.sh script from Part 1 as the non-root user. Copying the entire code segment in the light blue box into a putty window will create the esx-report folder under the home folder of the user that executes it.

  • From Part 2, execute the ssh-keygen command as the non-root user. Then run the esxcfg-firewall command as root to open an outbound port for SSH. Create the two remaining scripts, copykey.sh and start-ssh-agent.sh, as the non-root user.
    Use copykey.sh to distribute the public key file, then launch ~/start-ssh-agent.sh to load the private key into memory, both as the non-root user. Make test SSH connections to each ESX host you need to run the report on, but first source the .ssh-agent file (source ~/.ssh-agent) so the variables are exported to your shell.

  • Now create the html-mailer.pl script from Part 3 as the non-root user. As root, run the esxcfg-firewall command to open outbound SMTP in the firewall. Change users back to the non-root user, and create the run-esx-report.sh script and change the email settings for your environment.

  • Create the run-esx-threshold.sh script from this post as the non-root user and change the email settings.

  • Set up the cron jobs for the daily health report and the threshold check. Customize the whole thing any way you see fit.

A couple of tips:
  • Try to schedule the daily health check and threshold checks so they don't run at the same time. The jobs will run fine simultaneously, but the usage numbers could be inflated.

  • Configure reverse DNS records for your ESX hosts on the DNS servers they point to, or you'll see long pauses during SSH connection attempts while the server times out trying to resolve the connecting client's hostname from its IP.
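
If adding PTR records isn't practical, reasonably recent OpenSSH builds let you disable the lookup on each ESX host instead; this is a sketch of a server-side alternative, not something from this series:

```
# /etc/ssh/sshd_config on each ESX host
UseDNS no
```

Restart sshd afterwards (as root, service sshd restart) for the change to take effect.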



May 19, 2009

DIY ESX Server Health Monitoring - Part 3

Updated: June 18, 2009
Added a semicolon to run-esx-report.sh that was left out and responsible for some ugly HTML formatting.

With the secure SSH access problem solved in Part 2, we'll move on to getting the data in the proper format and emailing it from the ESX Service Console. As you probably know, the Linux distribution installed with ESX 3.5 lacks sendmail or an equivalent command, but we can roll our own from a perl script.

The perl mailer script
We need to import two perl modules for the script, and both are included by default in the Service Console. Getopt::Std provides a simple way to get command line options, and Net::SMTP will interface with an Exchange or SMTP server accessible from the console network:

  use Getopt::Std;
  use Net::SMTP;

This getopt call is all that's necessary to declare the command line options (-f, -r, -s, etc.), and it automatically populates a set of corresponding variables named opt_*. We'll do a quick check to make sure all the command line options were specified, and if not, display the usage message:

  getopt ('frsmb');

  unless ($opt_f && $opt_r && $opt_s && $opt_m && $opt_b) {
    print_usage();
    exit 1;
  }

Next we'll create a filehandle named BODY, opening the file specified on the command line. After reading every line into the array @body_data, we'll close the handle:

  open(BODY, $opt_b) || error("Unable to open file $opt_b");

  my @body_data=<BODY>;
  close(BODY);

The Net::SMTP module is pretty straightforward. To generate an HTML formatted email, we just need to specify the MIME version, the content type as HTML, and the character encoding as ISO 8859-1. If you would rather send the message as plain text, just remove the MIME-Version and Content-Type lines:

  my $smtp = Net::SMTP->new($opt_m) ||
    error("SMTP connection to $opt_m failed.");
  $smtp->mail($opt_f);
  $smtp->to($opt_r);
  $smtp->data();
  $smtp->datasend("MIME-Version: 1.0\n");
  $smtp->datasend("Content-Type: text/html; charset=iso-8859-1\n");
  $smtp->datasend("To: $opt_r\n");
  $smtp->datasend("From: $opt_f\n");
  $smtp->datasend("Subject: $opt_s\n");
  foreach $line (@body_data)
    {
      $smtp->datasend("$line");
    }
  $smtp->dataend();
  $smtp->quit;

Here's the complete html-mailer.pl script:


###############################################################################
#
#  html-mailer.pl
#
###############################################################################
#
#  To create the html-mailer.pl script in the ~/esx-report directory, copy
#  this entire code segment into your shell.
#  If you'd rather copy just the script, select everything between the
#  SCRIPTCREATOR limit strings.
#
###############################################################################

# If the ~/esx-report directory exists, cd to it so the script is created there
[ -d ~/esx-report ] && cd ~/esx-report

cat > ./html-mailer.pl <<'SCRIPTCREATOR'
#! /usr/bin/perl -w

 use strict;
 use Getopt::Std;
 use Net::SMTP;

 # Options:
 # $opt_f  email address of the sender
 # $opt_r  recipient email address
 # $opt_s  message subject, enclose in quotes if spaces
 # $opt_m  SMTP server FQDN or IP address
 # $opt_b  HTML formatted file for the message body

  our ($opt_f, $opt_r, $opt_s, $opt_m, $opt_b);

  getopt ('frsmb');

  unless ($opt_f && $opt_r && $opt_s && $opt_m && $opt_b) {
    print_usage();
    exit 1;
  }

  open(BODY, $opt_b) || error("Unable to open file $opt_b");

  my @body_data=<BODY>;
  close(BODY);
  my $line;

  my $smtp = Net::SMTP->new($opt_m) ||
    error("SMTP connection to $opt_m failed");
  $smtp->mail($opt_f);
  $smtp->to($opt_r);
  $smtp->data();
  $smtp->datasend("MIME-Version: 1.0\n");
  $smtp->datasend("Content-Type: text/html; charset=iso-8859-1\n");
  $smtp->datasend("To: $opt_r\n");
  $smtp->datasend("From: $opt_f\n");
  $smtp->datasend("Subject: $opt_s\n");
  foreach $line (@body_data)
    {
      $smtp->datasend("$line");
    }
  $smtp->dataend();
  $smtp->quit;

sub error {
  my $msg = shift;
  print STDERR "html-mailer.pl: $msg\n";
  exit 1;
}

sub print_usage {
  print STDERR <<EOF

  html-mailer.pl - HTML Formatted Message Mailer

  Usage: html-mailer.pl -f FROM -r RECIP -s SUBJ -m SMTP_HOST -b HTML_FILE

  Sends an email to the specified address, filling the message body with the
  HTML formatted file specified.

EOF
}

SCRIPTCREATOR

chmod 0700 ./html-mailer.pl

###############################################################################


Enable outbound SMTP
Now that we've got a script that we can send test messages with, we need to enable outbound SMTP through the ESX firewall on the ESX server that will have the script scheduled from a cron job. Just type this command as root to open the port:

  # Execute as root
  esxcfg-firewall --openPort 25,tcp,out,SMTP


If your Exchange or SMTP server is reachable from the Service Console network, execute html-mailer.pl with the appropriate parameters, and specify any old text file:

  ./html-mailer.pl -f me@mydomain.dom \
                   -r me@mydomain.dom \
                   -s "Test message" \
                   -m exchange.mydomain.dom \
                   -b ./testfile.txt


The Service Console can't reach the Exchange server...
No worries, as long as you're able to reach the VirtualCenter server, we can install the SMTP service and set it up to forward to the Exchange server. To install SMTP on a Windows 2003 server, do the following:

  • Open the Control Panel > Add or Remove Programs > Add/Remove Windows Components > double click Application Server > and then double click Internet Information Services (IIS). Put a check next to SMTP Service and click OK, OK, and Next

  • After the SMTP install is complete, open the Start Menu > Programs > Administrative Tools > Internet Information Services (IIS) Manager, then right click Default SMTP Virtual Server and select Properties

  • In the General tab, set the IP address: drop-down to the IP address on the Service Console network, if different from the LAN. This will keep the SMTP service from popping up on your network security guy's port scans :)

  • In the Access tab, click the Connection button, choose the Only the list below radio button, then click Add to add the appropriate subnet address and mask to the Group of computers option, or add each ESX server one at a time

  • In the Access tab again, click the Relay button, choose the Only the list below radio button, then click Add to add the appropriate subnet address and mask to the Group of computers option, or add each ESX server one at a time. Uncheck the option Allow all computers which successfully authenticate to relay

  • On the Delivery tab, click the Advanced button and add your Exchange server information in the Smart host: box. With a smart host specified, the SMTP service simply forwards everything to the Exchange server, letting it make all the decisions about which domains to accept mail for, etc.

  • Now test the SMTP forwarder by using telnet to initiate an SMTP session from the ESX server that will be sending the messages:
    
     telnet virtualcenter.lab.local 25
     ehlo
     mail from:spongebob@lab.local
     rcpt to:administrator@lab.local
     data
     Subject:test
     .
     quit
    

One script to rule them all
Almost there, so let's recap what we've done so far. In Part 1, we created the health check script that will run on each ESX server and send key performance stats and scaled histograms to the terminal. Then in Part 2, we covered how to distribute public keys so the script can be executed on several ESX servers via SSH. So far in Part 3, we've looked at a perl script that will email the combined script output, and now we need to create a script to tie it all together, and then schedule the script from a cron job.

Let's break down the main components of the script. First of all, if ssh-agent isn't running, the script isn't going to get very far, so we'll use pgrep to check for the process and exit if it's not found:

  if ! pgrep ssh-agent >/dev/null; then
    echo "The ssh-agent process does not appear to be running, exiting"
    exit 1
  fi

We need to source .ssh-agent, the file with the ssh-agent PID and socket info set up by the start-ssh-agent.sh script, or exit if it doesn't exist:

  source "${HOME}/.ssh-agent" >/dev/null || exit 1

Since we'll be running everything from an ESX Service Console, and that server is likely to be part of the health check, we should compare the list of ESX hosts to the local hostname so we don't open an SSH connection to the local machine. We use cut here to strip off the domain name so we'll match whether the FQDN or just the bare hostname is specified:

  THISHOST=$(hostname | cut -d . -f 1)

  for host in $@; do
    if [ $(echo $host | cut -d . -f 1) = $THISHOST ]; then
      "${RUNDIR}/esx-report.sh" >> "$TEMPTEXT"
    else
      ssh -q $host "$(cat "${RUNDIR}/esx-report.sh")" >> "$TEMPTEXT" || \
        printf "WARNING: SSH connection to $host failed\n\n\n\n" >> "$TEMPTEXT"
    fi
  done

After the health check script has looped through the list of ESX hosts, we'll start building the HTML file with the necessary tags. Setting the font size for the pre tag is the secret sauce for getting the email to display perfectly on a BlackBerry:

  cat > "$TEMPHTML" <<-'HEADEREOF'
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
	<html>
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<style type="text/css">
	body { font-family: monospace; font-size: 12px }
	pre { font-family: monospace; font-size: 12px }
	</style>
	</head>
	<body>
	<pre>
	HEADEREOF

If you need literal < or > symbols in an HTML document, you have to use a character reference, since HTML treats words wrapped in those symbols as tags. We'll use a sed filter to replace every > with its numeric character reference, &#62;, and add a color style to any line containing the word WARNING to make it stand out:

  cat "$TEMPTEXT" | \
  sed -e 's/>/\&#62;/g' \
      -e 's/WARNING:.*/<span style="color: red">&<\/span>/' >> "$TEMPHTML"
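
The escaping step is easy to test in isolation. A sample report line goes in, and the numeric reference &#62; (note the trailing semicolon) comes out:

```shell
#!/bin/bash
# Feed two representative report lines through the same sed filter:
printf 'Mem [####>     ] 45%%\nWARNING: SSH connection to esx03 failed\n' | \
sed -e 's/>/\&#62;/g' \
    -e 's/WARNING:.*/<span style="color: red">&<\/span>/'
```

The first line comes back with &#62; in place of the >, and the WARNING line comes back wrapped in the red span.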

Then we'll add the closing tags for everything to the end of the HTML file:

  cat >> "$TEMPHTML" <<-'FOOTEREOF'
	</pre>
	</body>
	</html>
	FOOTEREOF

And finally, we'll execute html-mailer.pl with the appropriate parameters. You'll need to change this section of the script for your environment:

  "${RUNDIR}/html-mailer.pl" -f esx-report@lab.local \
                             -r administrator@lab.local \
                             -s "ESX Health Report" \
                             -m lab-vc \
                             -b "$TEMPHTML"

Here's the run-esx-report.sh script. Remember to change the email address and mail server parameters for your environment:

###############################################################################
#
#  run-esx-report.sh
#
###############################################################################
#
#  To create the run-esx-report.sh script in the ~/esx-report directory, copy
#  this entire code segment into your shell.
#  If you'd rather copy just the script, select everything between the
#  SCRIPTCREATOR limit strings.
#
#  putty will ignore all the tabs, making the copied script quite ugly
#
###############################################################################

# If the ~/esx-report directory exists, cd to it so the script is created there
[ -d ~/esx-report ] && cd ~/esx-report

cat > ./run-esx-report.sh <<'SCRIPTCREATOR'
#! /bin/bash
  PATH="/bin:/usr/bin"

  if [ -z "$1" ]; then
    echo "No ESX hosts specified, exiting"
    exit 1
  fi

  if ! pgrep ssh-agent >/dev/null; then
    echo "The ssh-agent process does not appear to be running, exiting"
    exit 1
  fi

  RUNDIR=$(dirname "$(which "$0")")

  source "${HOME}/.ssh-agent" >/dev/null || exit 1

  THISHOST=$(hostname | cut -d . -f 1)

  TEMPTEXT=$(mktemp "${RUNDIR}/temptext.XXXXXXXXXX")

  TEMPHTML=$(mktemp "${RUNDIR}/temphtml.XXXXXXXXXX")

  for host in $@; do
    if [ $(echo $host | cut -d . -f 1) = $THISHOST ]; then
      "${RUNDIR}/esx-report.sh" >> "$TEMPTEXT"
    else
      ssh -q $host "$(cat "${RUNDIR}/esx-report.sh")" >> "$TEMPTEXT" || \
        printf "WARNING: SSH connection to $host failed\n\n\n\n" >> "$TEMPTEXT"
    fi
  done

  cat > "$TEMPHTML" <<-'HEADEREOF'
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
	<html>
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<style type="text/css">
	body { font-family: monospace; font-size: 12px }
	pre { font-family: monospace; font-size: 12px }
	</style>
	</head>
	<body>
	<pre>
	HEADEREOF

  cat "$TEMPTEXT" | \
  sed -e 's/>/\&#62;/g' \
      -e 's/WARNING:.*/<span style="color: red">&<\/span>/' >> "$TEMPHTML"

  cat >> "$TEMPHTML" <<-'FOOTEREOF'
	</pre>
	</body>
	</html>
	FOOTEREOF

  "${RUNDIR}/html-mailer.pl" -f esx-report@yourdomain.dom \
                             -r administrator@yourdomain.dom \
                             -s "ESX Health Report" \
                             -m exchange.yourdomain.com \
                             -b "$TEMPHTML"

  rm -f "$TEMPTEXT"; rm -f "$TEMPHTML"

SCRIPTCREATOR

chmod 0700 ./run-esx-report.sh

###############################################################################


To cron foo, thanks for everything
Still with us? One more step, and it's an easy one. We'll add a cron job to run the script at 7:10 AM every morning. Remember to add the job for the user account you distributed SSH keys for.

To edit the cron entries for the user, type:

  crontab -e

This starts vi and opens up the user's crontab. To enter insert mode, type i

Assuming you've set everything up using the code segments used in this series, to add an entry for 7:10 AM, type this line, replacing ESX LIST with a space separated list of ESX hosts:

  10 7 * * * ${HOME}/esx-report/run-esx-report.sh ESX LIST >/dev/null 2>&1

After adding the entry, press the Esc key, and type :wq to write the crontab and quit.

If you have a long list of hosts, put them all in a text file, separated by spaces or each on its own line, and use command substitution to feed the list to run-esx-report.sh:

  run-esx-report.sh $(cat ${HOME}/esx-report/hostlist.txt)

There's more?!
What if we wanted to trigger an email warning if an ESX host exceeds a threshold value? As we'll see in Part 4, we can do this easily with a quick modification to the run-esx-report.sh script.


May 15, 2009

DIY ESX Server Health Monitoring - Part 2

In Part 1 of this series, we created the shell script that will generate a formatted health report for each ESX server. In order to have a combined health report for all of the ESX hosts, we'll use SSH to run the shell script on each host and send the output to the central ESX server responsible for gathering the data and sending the email.

To schedule the health report from a script and cron job, we'll need to use key based authentication rather than interactive passwords to access the remote ESX servers. We'll also encrypt the private key with a passphrase. That way if the filesystem security were ever compromised and someone were able to obtain the private key, they would still need the passphrase to unlock it and gain access to the remote systems.

Encrypting the private key presents a problem, however, as unlocking it during a connection attempt requires entering the password interactively. Thankfully there is a solution, ssh-agent, which will allow us to unlock the private key once with an interactive password prompt, and then keep it in memory until ssh-agent is terminated or the server is rebooted.

SSH authentication using keys
If you are new to the concept of using key based authentication for SSH, a quick Google search for 'ssh using keys' will provide a wealth of info. Here are a couple of links that explain it much better than I can: http://www.sshkeychain.org/mirrors/SSH-with-Keys-HOWTO and http://wiki.archlinux.org/index.php/Using_SSH_Keys
And this link covers some challenges with using ssh-agent: http://www.ibm.com/developerworks/library/l-keyc2

To get started, we'll generate a 2048 bit RSA key pair for authentication. There's plenty of debate on the merits of DSA over RSA, and vice versa, but we'll flip a coin, and pick RSA.

Logged in as the non-root account you are planning to use for the ESX health report, execute this command to generate a 2048 bit RSA key pair:


  ssh-keygen -t rsa -b 2048

When prompted with: Enter file in which to save the key, hit return to accept the default location. At the prompt: Enter passphrase (empty for no passphrase):, enter a strong passphrase for the private key.

If you type ls -la in the user's home directory, you should see that a .ssh folder was created. If you cd into that directory, you'll find the private key file, id_rsa, and the public key file, id_rsa.pub, have been created.

Distribute the public key
For the next step, we need to copy the public key we just generated to each ESX host we want to SSH into and execute the health report script on. You can simply scp them, or even use a Windows SFTP client if you wish (yuck). One issue with that approach is that unless you have used the SSH client or gone through the ssh-keygen process on each remote host, the necessary .ssh folder hasn't been created in the user's home folder. The following script takes care of the whole process: SSH to each host, create the .ssh directory if needed, and append the public key to the authorized_keys file on the remote ESX server.

By default, the ESX server firewall blocks outgoing SSH client connections, so issue this command as root on the central reporting ESX server to enable outbound SSH:


  # Execute as root
  esxcfg-firewall -e sshClient

To use the public key distribution script below, paste the entire code block into a putty window, then execute the script with the list of ESX hosts you wish to copy the key to. You'll get a string of password and host key authenticity prompts, but we only have to do this once. Remember to run this from the ESX server that will be polling the others, logged in as the non-root user that will be executing the health check script:


###############################################################################
#
#  copykey.sh
#
###############################################################################
#
#  To create the copykey.sh script, copy this whole code segment into your
#  shell. If you'd rather copy just the script, select everything between the
#  SCRIPTCREATOR limit strings.
#
###############################################################################

# If the ~/esx-report directory exists, cd to it so the script is created there
[ -d ~/esx-report ] && cd ~/esx-report

cat > ./copykey.sh <<'SCRIPTCREATOR'
#! /bin/bash
  PATH="/bin:/usr/bin"

  if [ ! -e ~/.ssh/id_rsa.pub ]; then
    echo "RSA public key file ~/.ssh/id_rsa.pub not found!"
    exit 1
  fi

  for esxhost in $@; do
    ssh -q "${USER}@${esxhost}" \
    "if [ ! -d ~/.ssh ]; then mkdir -m 0700 ~/.ssh; fi; \
    echo $(cat ~/.ssh/id_rsa.pub) >> ~/.ssh/authorized_keys; \
    chmod 0600 ~/.ssh/authorized_keys" || echo "Unable to connect to $esxhost"
  done
SCRIPTCREATOR

chmod 0700 ./copykey.sh

###############################################################################

Invoke the copykey.sh script with a space delimited list of ESX hosts:

  ./copykey.sh esx02.vmnet.local esx03.vmnet.local esx04.vmnet.local

Or read them from a text file if you have a lot of hosts. The text file can be space delimited or have each host on its own line:

  ./copykey.sh $(cat hostlist.txt)

In the copykey.sh script, notice how we used the command substitution syntax, $( ), to echo the text of the public key file into the authorized_keys file on the remote host. The local shell interprets the command substitution before the SSH command, so it executes cat on the local public key file. This is a handy trick, and we'll use it later to execute the locally stored health report shell script on the remote hosts.
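
The expansion order is easy to verify without a remote host; here sh -c stands in for ssh (an illustrative sketch only):

```shell
#!/bin/bash
# $( ) is expanded by the local shell before the double-quoted command string
# is handed to the child process, exactly as it is before ssh ships it off.
msg="expanded locally"
sh -c "echo $(echo "$msg")"   # the child only ever sees the literal text
```

Swap sh -c for ssh $host and the same rule explains why cat reads the local public key file, not a remote one.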

Keep the private key unlocked with ssh-agent
Now that the public keys are pushed out, make a test SSH connection to one of the remote servers with an ssh somehost command. If you've set everything up correctly to this point, you should receive a prompt like Enter passphrase for key '/home/user/.ssh/id_rsa':, which is different from the user@host password: prompt of a typical SSH connection. We're being prompted to decrypt the local private key before the key algorithm is run to verify the connection attempt. Obviously, that's not going to work from a cron job.

This is where ssh-agent comes in. If you run it from a putty session, you should get some unusual output like:

  SSH_AUTH_SOCK=/tmp/ssh-UIbA2689/agent.2689; export SSH_AUTH_SOCK;
  SSH_AGENT_PID=2690; export SSH_AGENT_PID;
  echo Agent pid 2690;

The output provides the environment variables needed to locate the ssh-agent PID and socket. ssh-agent doesn't actually export any of this information into your shell; it expects you to do that. You can test this out by typing echo $SSH_AGENT_PID after running ssh-agent; the variable isn't defined in the current shell.

There are a couple of ways to fix that: you could invoke it as ssh-agent bash, which opens a new bash shell with the variables exported, or execute eval $(ssh-agent) to export the variables into your current shell. Since we'll be running this from a cron job rather than interactively, we'll redirect the output from ssh-agent into a file, and then source that file from the cron job.

Something has to get ssh-agent running every time the ESX server is rebooted or someone kills the process, so let's create a handy start-ssh-agent.sh shell script in the non-root user's home directory:


###############################################################################
#
#  start-ssh-agent.sh
#
###############################################################################
#
#  To create the start-ssh-agent.sh script in the current user's home
#  directory, copy this whole code segment into your shell.
#  If you'd rather copy just the script, select everything between the
#  SCRIPTCREATOR limit strings.
#
###############################################################################

cat > ~/start-ssh-agent.sh <<'SCRIPTCREATOR'
#! /bin/bash
  PATH="/bin:/usr/bin"
  killall ssh-agent >/dev/null 2>&1
  ssh-agent > ~/.ssh-agent
  chmod 0600 ~/.ssh-agent
  source ~/.ssh-agent
    export SSH_AUTH_SOCK
    export SSH_AGENT_PID
  ssh-add
SCRIPTCREATOR

chmod 0700 ~/start-ssh-agent.sh

###############################################################################

The ssh-add command at the end of the script loads the private key into ssh-agent, and will prompt for the private key passphrase. Once the key is loaded, you can log off and ssh-agent will continue to run until the process is killed or the server is rebooted. You'll need to run start-ssh-agent.sh from an interactive login each time the ESX server is rebooted, but that's probably not very often, and the added security of using an encrypted private key certainly makes up for the hassle.

Execute the script above by typing ~/start-ssh-agent.sh to load the private key into ssh-agent, and we can test the health report script on multiple hosts. Paste the following into a putty window, after replacing the hostnames with your own, and the script output should display on your terminal:

[ -d ~/esx-report ] && cd ~/esx-report
source ~/.ssh-agent; \
for host in esx02.vmnet.local esx03.vmnet.local; do \
ssh -q $host "$(cat esx-report.sh)"; done

Notice again how we used command substitution, $( ), to cat the locally stored script file through the SSH session, running the commands in the script on the remote host. For a small script like esx-report.sh, this is a really simple and efficient method, and it makes it very easy to add additional checks to the script.

Stay tuned
Coming up in Part 3, we'll take a look at emailing the script output in HTML format, and tie the whole process together from a cron job.


May 12, 2009

DIY ESX Server Health Monitoring - Part 1

Updated: May 25, 2009
I've had a chance to test this project out with vSphere - ESX 4.0, and everything works the same, except for the last section of the esx-report.sh script that parsed the /proc/vmware/mem file. This file has moved and the format was changed. Since the hypervisor memory usage can be monitored and alerts triggered from vCenter Server, I've just removed that section from the script.

VMware creates some pretty amazing stuff. If they didn't, I wouldn't have a blog about it, and you wouldn't be reading blogs about it. But let's be honest, it's not perfect (what software is?), and every now and then something buggy will happen. Sometimes these quirks occur at scary moments, like when removing a snapshot, or in the middle of a VMotion operation. Rarely do they cause any actual downtime, however.

You could classify the quirks into two categories: vCenter Server and ESX. The vCenter issues are almost always harmless, and are sometimes resolved simply by closing and reopening the VI/vSphere client. It's the ESX quirks that can be really serious, as any problems with ESX can lead to virtual machine downtime.

That's why it is so important to have some type of ESX performance monitoring in place. vCenter provides some basics, but doesn't offer any real visibility into the Service Console, which is where the serious problems can be lurking. There are some very good commercial products, and one in particular, Veeam Monitor, is even offered in a free edition. Sometimes you need something more customized, however, and the only option is to build it yourself.

In this four part series, we'll build our own ESX health report with a shell script, use key based SSH authentication so that one ESX server can run the script on the others, and then email the report using a perl script. We'll finish with a quick modification to enable the report to trigger an email when performance thresholds are reached. The format of the report will be designed to display perfectly on a BlackBerry Curve set to its smallest font size, allowing us to know about issues from anywhere.

The motivation
I started working on this script after two separate ESX Service Console incidents. The first occurred after upgrading an ESX 3.5 server to Update 2. The upgrade was successful, and everything seemed perfectly normal. But as it turned out, there was an issue between the HP Systems Insight Manager server and the new update, causing a new process to be launched in the Service Console every five minutes, but the processes never terminated. After a week of this, several thousand zombie processes were running in the Service Console. There is a limit on the number of processes before the ESX server will stop launching new ones, and once you hit that limit, chaos ensues.

The second incident was less severe, but was just as scary because no one really knew how long the problem had been occurring. For reasons never understood, one of the VMware log files started filling up with the same generic HA message, logging more than five entries a second. The log file was rotating through all of its file names several times an hour, and the Service Console processor usage was pegged. The problem was finally noticed when a VMotion took longer than fifteen minutes to complete.

The plan
The setup for the ESX health report is fairly simple, and it satisfies two key components of a good security policy: do not permit SSH access for root, and never store passwords in scripts. The basic plan is:
  • A non-root user account will be used for the entire process, and a public SSH key for this account will be distributed to the other ESX servers

  • From the central ESX server the non-root account will SSH to each ESX host, execute the script, and redirect the script output to the central ESX host

  • The combined output from each ESX server will then be emailed using a perl script, formatted to display nicely on a BlackBerry Curve
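The collection loop at the heart of this plan can be sketched in a few lines. The host names, the reportuser account, and the paths below are made up for illustration, and DRYRUN=echo prints each command instead of actually connecting, so you can preview the loop safely; Part 2 builds the real version:

```shell
# Hypothetical host list and report location; DRYRUN=echo turns the
# ssh commands into printed text for a safe dry run.
HOSTS="esx01 esx02 esx03"
REPORT=/tmp/esx-report.txt
DRYRUN=echo

: > "$REPORT"                       # start with an empty report
for host in $HOSTS; do
  # each remote host appends its section to the combined report
  $DRYRUN ssh "reportuser@$host" '~/esx-report/esx-report.sh' >> "$REPORT"
done
cat "$REPORT"
```

Remove the DRYRUN variable (or set it empty) once key-based authentication is in place and the script is distributed to each host.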

For this setup to work, you'll need to use the same non-root account on each ESX server. Even though most of the VMware command-line tools can only be executed by root, we can get most of the critical stats we need with just a regular non-privileged account.

It will also be necessary to open an outbound port for SMTP in the firewall on the ESX host that will be emailing the report. If your network design has isolated the ESX hosts from the rest of the network (and it should!), so that only the server running vCenter has access to the service console network, you'll need to set up SMTP on the vCenter server and configure it to forward to an Exchange server or whatever groupware application you use. We'll cover that in Part 3.

The script
The shell script used to gather data from the ESX hosts is pretty straightforward, and can be developed and tested from a local ESX console, as the output is just being sent to the terminal.

The scale function will perform all calculations and print the histogram output. The function expects four parameters: the value, the maximum range for the value, a threshold percentage for generating a warning, and a description for the histogram.

The CSCALE local variable determines how the data is scaled, or how many intervals the graph will display. I've used a value of twenty here, mainly because it displays perfectly on my BlackBerry Curve, so each hash will represent a 5% interval. If you need more resolution than that in the graphs, it's just a matter of changing the CSCALE constant:

  function scale {
    local CSCALE=20

The bash shell lacks the ability to perform floating point calculations, which makes getting percentages pretty tough. However, awk fills this gap easily, and can round the percentages by simply using a printf format specifying a floating point number with a precision of 0, so no decimal point will be printed (%.0f):

  avg=$(echo $1 $2 | awk '{printf "%.0f", ($1/$2) * 100}')
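You can sanity check the rounding right at the prompt; for example, 1 out of 3 comes out as 33 (the input values are arbitrary):

```shell
# 1/3 is 33.33...%; printf "%.0f" rounds it to a whole number
echo 1 3 | awk '{printf "%.0f\n", ($1/$2) * 100}'
# prints 33
```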

We'll use awk again to determine how many hashes should be printed in the histogram, but this time we'll multiply by the CSCALE value rather than 100:

  scaled=$(echo $1 $2 $CSCALE | awk '{printf "%.0f", ($1/$2) * $3}')

With the percentage value rounded and stored in avg, we will compare it to the threshold value specified as the third parameter to the scale function. If avg is greater than or equal to the threshold, we'll store the third parameter in a variable named alert that we can check for later:

  if [ $avg -ge $3 ]; then
    alert=$3
  fi

When calling a bash function, the parameters are specified as a space-delimited list. This is problematic if the arguments themselves contain spaces, which the descriptions for the histograms most certainly will. So in the scale function, we'll make the description the last parameter; that way we know everything from the fourth parameter on is part of the description. Using the built-in shift command, we can shift off the first three parameters, leaving $@, the array of all arguments passed to the function, holding only the description. We can then grab the whole array of arguments left over and store it in a single variable:

  shift 3; histtext=$@
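You can simulate the function's parameter handling at the prompt with set, which assigns the positional parameters directly (the sample values are arbitrary):

```shell
# value, max, threshold, then a multi-word description
set -- 512 2048 75 Swap used on esx01
shift 3                 # drop the first three parameters
histtext="$@"           # everything left is the description
echo "$histtext"
# prints: Swap used on esx01
```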

Now that the calculations are done and parameters stored, we can start printing some histograms. To keep everything lined up in the output, we'll let printf format specifiers do all the work. This first printf command outputs the description for the histogram followed by the average. The first format directive, %-10.10s, specifies a string value (s) will be printed in a 10 character width field (10), and it should be left justified (-) with a precision of 10, meaning only print the first 10 characters even if it is longer (.10).
The second format directive, %3d%%, specifies an integer value will be printed in a three character width field, and a percent sign will follow it:

  printf "    %-10.10s %3d%% " "$histtext" "$avg"

This loop adds # characters to the hist variable up to the value of scaled, giving us the bar for the histogram:

  for ((i=0; i<scaled; i++)); do
    hist="${hist}#"
  done

And now we'll print the histogram bar, enclose it in [] brackets, left justify it (-), and print it in a field width equal to the CSCALE value:

  printf "[%-${CSCALE}.${CSCALE}s]" "$hist"
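For example, with seven hashes in hist and CSCALE at 20, the bar is left justified and padded out to a fixed 20-character field (the values here are arbitrary):

```shell
CSCALE=20
hist="#######"                      # pretend 7 intervals were filled
# left justify the bar inside brackets, padded to CSCALE characters
printf "[%-${CSCALE}.${CSCALE}s]\n" "$hist"
```

The .${CSCALE} precision also truncates a bar that would overflow the field, so a value above the maximum can't break the layout.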

If the average was greater than or equal to the threshold, the alert variable will be defined, so we will print a warning message right below the histogram, lining it up perfectly by using printf format directives to specify the field width:

  if [ $alert ]; then
    printf "%28s%3d%% %8s" "WARNING:" "$alert" "threshold"
    printf "\n"
  fi

With the scale function defined, we can start gathering data and formatting the report. We'll begin by printing the hostname of the ESX server:

  printf "$(hostname)\n\n"

To get the load average for the service console, we'll use the last section of the output from uptime. Using egrep with the -o option tells it to only print out the section of the line that matches. Then use tr to change the lower case 'l' in 'load' to upper case, and we have a nicely formatted load average:

  printf " Service Console Stats:\n"
  printf "  $(uptime | egrep -o 'load.*' | tr 'l' 'L')\n"
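To try that pipeline in isolation, feed it a canned uptime line (the times and load values below are invented):

```shell
# egrep -o keeps only the matching part; tr capitalizes the 'l'
echo ' 10:15:01 up 12 days,  3 users,  load average: 0.15, 0.10, 0.05' | \
  egrep -o 'load.*' | tr 'l' 'L'
# prints: Load average: 0.15, 0.10, 0.05
```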

Getting the number of running processes is easy: just pipe ps into grep, match everything (.), and output the number of lines that matched (-c). This won't be completely accurate, as the script itself and any subshells it launches will be included in the count. But that's not a big deal; the process count is only critical when it gets into the thousands, so we don't really care if it's inflated by a couple of processes:

  printf "  Number of running processes: $(ps -e | grep . -c)\n\n"

We can get the processor use average from vmstat, telling it to calculate the average over five seconds. We're grabbing two stats here, the user and system processor use percentages, so the '\n' newline in the middle of the awk command splits the output onto two lines. In order to handle sending the two lines to the scale function, we use read in a while loop to process each line.

In the awk command, we'll send the scale function the appropriate field parsed from vmstat, and then follow it with the maximum value for the data (in this case 100, as vmstat is reporting a percentage), the threshold value for triggering a warning (75), and a description for the histogram (User: or System:).

  printf "  Proc 5 sec avg:\n"
  (vmstat 5 2 | awk 'END {
   print $13, "100", "75", "User:\n" $14, "100", "75", "System:"}') | \
    while read line; do scale $line; done
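To see how the embedded '\n' splits one awk print into two separate scale calls, you can substitute a stub for the scale function and canned vmstat values:

```shell
# stand-in for the real scale function: just show what it receives
scale() { echo "scale got: $*"; }

# the "User:\n3" string contains a newline, so awk emits two lines,
# and the while loop hands each line to scale as a separate call
(echo | awk '{print "12", "100", "75", "User:\n3", "100", "75", "System:"}') | \
  while read line; do scale $line; done
```

Each line comes through as the four space-separated parameters the scale function expects: value, maximum, threshold, and description.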

To parse the memory use information from free, we'll use the pattern matching capabilities of awk to execute print statements only on the lines that begin with Mem: and Swap:

You might want to exclude the memory use histogram and only report on swap usage, as Linux will use memory not allocated to applications for buffering files, resulting in free almost always reporting very high memory usage:

  printf "  Memory Usage:\n"
  (free | awk '/^Mem:/ {print $3, $2, "100", $1}
              /^Swap:/ {print $3, $2, "1", $1}') | \
    while read line; do scale $line; done
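You can test those patterns against a canned copy of free's output (the numbers below are invented):

```shell
# each matching line is turned into: used total max description
printf '%s\n' \
  '             total       used       free     shared    buffers     cached' \
  'Mem:        271356     252564      18792          0      15908     193148' \
  'Swap:       554200       1584     552616' | \
  awk '/^Mem:/  {print $3, $2, "100", $1}
       /^Swap:/ {print $3, $2, "1", $1}'
```

The header line matches neither pattern, so only the Mem: and Swap: lines produce output, already in the parameter order the scale function expects.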

And finally, for disk usage output from df, we want to print every line except the header, so we'll use a negated pattern match in awk to skip the first line. We also need to filter out the % signs from the df output, and awk can do this for us with its sub() function:

  printf "  Disk Usage:\n"
  (df -mP | \
   awk '$1 !~ /Filesystem/ {sub(/%/,""); print $5, "100", "75", $6}') | \
    while read line; do scale $line; done

Try it out
You can copy the entire script below into a PuTTY session as a non-root user and try it out for yourself. In Part 2 of this series, we'll explore a technique for running the script on multiple ESX hosts using SSH.


###############################################################################
#
#  esx-report.sh
#
###############################################################################
#
#  Copy this entire code segment into your shell to create the ~/esx-report
#  directory and esx-report.sh script and make it executable.
#  If you would rather copy just the script itself, select everything between 
#  the SCRIPTCREATOR limit strings.
#
###############################################################################
if [ ! -d ~/esx-report ]; then mkdir -m 0700 ~/esx-report; fi
cd ~/esx-report

cat > ./esx-report.sh <<'SCRIPTCREATOR'
#! /bin/bash

PATH="/bin:/usr/bin"

# Usage: [value] [max value] [threshold percentage] [description]
function scale ()
{
  # exit if called without four parameters
  if [ -z "$4" ]; then
    return 1
  fi

  local avg alert scaled hist histtext i
  # change histogram scale here
  local CSCALE=20

  # protect against divide by zero, even though awk doesn't complain
  if [ $1 -gt 0 ] && [ $2 -gt 0 ]; then

    # no floating point in bash, use awk to get avg and round (%.0f) to int
    avg=$(echo $1 $2 | awk '{printf "%.0f", ($1/$2) * 100}')

    scaled=$(echo $1 $2 $CSCALE | awk '{printf "%.0f", ($1/$2) * $3}')
  else
    avg=0; scaled=0
  fi

  if [ $avg -ge $3 ]; then
    alert=$3
  fi

  # shift off first three args, leaving rest of args for description
  shift 3
  # grab whole array of args left over, this allows for spaces in description
  histtext="$@"

  printf "    %-10.10s %3d%% " "$histtext" "$avg"

  for ((i=0; i<scaled; i++)); do
    hist="${hist}#"
  done

  # with scaling, low values may show nothing in histogram
  # if hist undef, add a #, but we want a zero value to show nothing
  if [ ! $hist ] && [ $avg -gt 0 ]; then
    hist='#'
  fi

  printf "[%-${CSCALE}.${CSCALE}s]" "$hist"
  printf "\n"

  if [ $alert ]; then
    printf "%28s%3d%% %8s" "WARNING:" "$alert" "threshold"
    printf "\n"
  fi
}

  printf ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>\n"
  printf ">\n"
  printf "> $(hostname)\n\n"

  printf " Service Console Stats:\n"
  printf "  $(uptime | egrep -o 'load.*' | tr 'l' 'L')\n"
  printf "  Number of running processes: $(ps -e | grep . -c)\n\n"
  printf "  Proc 5 sec avg:\n"
  (vmstat 5 2 | awk 'END {
   print $13, "100", "75", "User:\n" $14, "100", "75", "System:"}') | \
    while read line; do scale $line; done

  printf "  Memory Usage:\n"
  (free | awk '/^Mem:/ {print $3, $2, "100", $1}
              /^Swap:/ {print $3, $2, "1", $1}') | \
    while read line; do scale $line; done

  printf "  Disk Usage:\n"
  (df -mP | \
   awk '$1 !~ /Filesystem/ {sub(/%/,""); print $5, "100", "75", $6}') | \
    while read line; do scale $line; done
  printf "\n\n\n\n"

SCRIPTCREATOR

chmod 0700 ./esx-report.sh

###############################################################################

May 6, 2009

Become Friends with find

A while back I noticed a tip posted somewhere on how to use the find utility to register a bunch of virtual machines at once. It was a really helpful post and illustrated some of the potential of the Swiss Army-like find utility. But it overlooked one of the coolest features of find, the -exec option.

Using -exec, you can launch a command with find and pass each found file as a parameter to it, eliminating the need to run vmware-cmd in a for loop. Just place the command to run after the -exec parameter, and find will replace the string {} with the current file being processed:

[root@esx02 root]# find /vmfs/volumes/ -name '*.vmx' -exec vmware-cmd -s register {} \;
 
register(/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-dc1/lab-dc1.vmx) = 1
register(/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-ex1/lab-ex1.vmx) = 1
register(/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/uda14-esx/uda14-esx.vmx) = 1

The -exec option is often considered less efficient than using xargs, which feeds multiple parameters to a command at once rather than launching the command once per file like -exec does. But for use with a command like vmware-cmd, which only expects to have one .vmx file passed to it, -exec is perfect.
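The difference is easy to see with echo standing in for the real command (the /tmp/xargs-demo directory and files are created just for this illustration):

```shell
# create a couple of throwaway files to find
mkdir -p /tmp/xargs-demo
touch /tmp/xargs-demo/a.vmx /tmp/xargs-demo/b.vmx

# -exec: one command launch per file found (two lines of output)
find /tmp/xargs-demo -name '*.vmx' -exec echo register {} \;

# xargs: file names batched into a single command launch (one line)
find /tmp/xargs-demo -name '*.vmx' | xargs echo register
```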

And find has another option, -ok, which does the same thing but will present a prompt before running the command on each .vmx file that is found. This can be really handy if you are registering or powering on a bunch of VMs but know that there are some you are going to want to skip:

[root@esx02 root]# find /vmfs/volumes/ -name '*.vmx' -ok vmware-cmd -s register {} \;

< vmware-cmd ... /vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-dc1/lab-dc1.vmx > ? y
register(/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-dc1/lab-dc1.vmx) = 1

< vmware-cmd ... /vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-ex1/lab-ex1.vmx > ? y
register(/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-ex1/lab-ex1.vmx) = 1

< vmware-cmd ... /vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/uda14-esx/uda14-esx.vmx > ? n

find rules
I've been in a couple of jams where find really saved the day. Consider a situation where a standalone ESX server needs to be rebuilt, but the virtual machines are all on networked storage. Using find afterwards to register and start the VMs makes the process trivial:

[root@esx02 root]# find /vmfs/volumes/ -name 'lab*.vmx' -exec vmware-cmd -s register {} \;

register(/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-dc1/lab-dc1.vmx) = 1
register(/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-ex1/lab-ex1.vmx) = 1


[root@esx02 root]# find /vmfs/volumes/ -name 'lab*.vmx' -exec vmware-cmd {} start \;

VMControl error -16: Virtual machine requires user input to continue
VMControl error -16: Virtual machine requires user input to continue


[root@esx02 root]# find /vmfs/volumes/ -name 'lab*.vmx' -print -exec vmware-cmd {} getstate \;

/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-dc1/lab-dc1.vmx
getstate() = stuck

/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-ex1/lab-ex1.vmx
getstate() = stuck


[root@esx02 root]# find /vmfs/volumes/ -name 'lab*.vmx' -print -exec vmware-cmd {} answer \;

/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-dc1/lab-dc1.vmx

Question (id = 1) :msg.uuid.moved:The location of this virtual machine's configuration file has changed since it was last powered on.

If the virtual machine has been copied, you should create a new unique identifier (UUID).  If it has been moved, you should keep its old identifier.

If you are not sure, create a new identifier.

What do you want to do?
        0) Create
        1) Keep
        2) Always Create
        3) Always Keep
        4) Cancel
Select choice. Press enter for default: 1
selected 1 : Keep

/vmfs/volumes/495e44a0-d41258bc-fac4-000c299206d0/lab-ex1/lab-ex1.vmx

Question (id = 1) :msg.uuid.moved:The location of this virtual machine's configuration file has changed since it was last powered on.

If the virtual machine has been copied, you should create a new unique identifier (UUID).  If it has been moved, you should keep its old identifier.

If you are not sure, create a new identifier.

What do you want to do?
        0) Create
        1) Keep
        2) Always Create
        3) Always Keep
        4) Cancel
Select choice. Press enter for default: 1
selected 1 : Keep

You could even use the -ok option to introduce a pause between the VM start-ups.

Shrimp tacos
The syntax for -exec and -ok can be a little difficult to remember. The ; at the end designates the end of the command find should execute, and it has to be escaped as \; so the shell doesn't interpret it.

I could never get the syntax right from memory until I associated it with shrimp tacos. The curly braces {} are a corn tortilla, and I'm using a spatula to get my shrimp off the skillet, \;

It may be corny, but I never forget the syntax now!


May 3, 2009

Secrets of the e1000

Updated: May 23, 2009
Note that as of vSphere (ESX 4.0), if you select a custom configuration and virtual machine version 7 in the new virtual machine wizard, the e1000 is now presented as a virtual adapter option for a Windows guest (along with Flexible, VMXNET 2, and VMXNET 3. Sweet!). The operating-system swap and .vmx file editing hacks detailed below should no longer be necessary if you want to use the e1000 in a VM. For details on the available virtual NIC options, see this KB article.

There is still no equivalent to vlance.noOprom = "true" or vmxnet.noOprom = "true" for the e1000 to directly disable the PXE option ROM. However the solution described below also works with vSphere.


If you haven't used the e1000 virtual NIC before, it's a virtual implementation of the ubiquitous Intel PRO/1000 Ethernet adapter. According to this VMware KB article, the performance of the e1000 device lies somewhere in between the vlance and vmxnet devices, making it the perfect choice for a virtual machine that doesn't have VMware tools installed, and is therefore unable to utilize the advanced vmxnet virtual NIC.

Pesky pixies
I've been thinking all along that the e1000 virtual NIC lacks the option ROM necessary for PXE booting, and so have never bothered to disable it. While doing some testing on the Vyatta virtual machine I set up for the Protect the Service Console Network With a Virtual Firewall project, I noticed the familiar network boot screen after forgetting to connect the Vyatta installation CD for the initial boot.

If you read my post on Hardening the VMX File, you'll remember I discussed a potential exploit using PXE. Since I thought the e1000 was option ROM free, I failed to discuss disabling it if you have no use for PXE in your environment.

I can make at least one excuse for this oversight: there actually is no .vmx directive to disable the option ROM in the e1000. The option ROMs in the vlance and vmxnet adapters can be disabled with one of these directives, vlance.noOprom = "true" or vmxnet.noOprom = "true", but there is no equivalent command for the e1000.

There is an easy workaround for the lack of a disabling command, however. We can set the memory size that the BIOS makes available to the option ROM to zero, effectively preventing it from loading with this .vmx directive:


ethernet0.opromsize = "0"


You'll need to add this directive for each e1000 adapter, ethernet1.opromsize = "0" for example, if you have two of them in a virtual machine. I've gone back to the Hardening the VMX File post and added this as an additional recommended directive if you are using e1000 adapters.

Not so secret
In an attempt to understand how I never noticed that PXE boot was available with the e1000, I searched through the release notes for each update version of ESX 3.5 on the VMware downloads page. I was able to confirm that I am actually crazy: PXE boot has always been available with the e1000. I also discovered that the e1000 is the default when creating virtual machines with some specific guest operating systems. As of Update 4, selecting one of the Linux 64-bit options, Netware, Solaris, or Other (64-bit) causes the New Virtual Machine Wizard to present the e1000 as the default network adapter. So if you don't feel like manually adding an e1000 adapter by editing the .vmx file, you could initially set a VM up as one of those guest operating systems and then change it back after the VM is created.

If you don't mind editing a .vmx file, just add this directive, changing the device name to the specific adapter you wish to change:


ethernet0.virtualDev = "e1000"

