Solaris ATA-over-Ethernet (AoE) Installation and Operation Guide

Norman Wilson
2007 August 19

1. Introduction

This manual describes Version 1.4 of the Solaris ATA-over-Ethernet subsystem, which makes EtherDrive®1 Storage Blades available as disk devices. It explains how to install and configure the AoE package; how the disks appear to the rest of the Solaris system; and some quirks and implementation details. It is not a manual on how to install EtherDrive hardware.

Parts of the AoE subsystem work differently if the Solaris System Management Facility (SMF) is in use (svcs, svcadm, and svccfg(1M)). Solaris 10 normally uses SMF; Solaris 9 and earlier systems don't have it. When the difference matters, this manual will refer to SMF and non-SMF systems.

Parts may also behave differently on the slightly-different versions of Solaris on SPARC and x86 (Intel/AMD) hardware. Such differences will be annotated `on SPARC' or `on x86.'

This document uses the new binary-multiple prefixes described in the IEEE 1541-2002 standard. Prefixes like kilo, mega, giga, tera (k, M, G, T) refer to powers of ten; when the near-equivalent powers of two are meant, they are called kibi, mebi, gibi, tebi (ki, Mi, Gi, Ti). For example, a gigabyte (GB) is 10^9 bytes, while a gibibyte (GiB) is 2^30 (1073741824) bytes. See http://physics.nist.gov/cuu/Units/binary.html for more details.
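
The gap between the two prefix families is easy to check numerically. The following is only an illustrative sketch; the constant and function names are invented for this example.

```python
# Decimal (SI) and binary (IEEE 1541) multiples, expressed in bytes.
GB = 10**9    # gigabyte
GiB = 2**30   # gibibyte
TB = 10**12   # terabyte
TiB = 2**40   # tebibyte

def to_gib(nbytes):
    """Convert a byte count to gibibytes."""
    return nbytes / GiB

print(GiB)                           # 1073741824
print(round(to_gib(500 * GB), 1))    # 465.7: a "500 GB" disk in GiB
```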

1.1. Requirements and limitations

The programs described here have been tested on Solaris 7, 8, 9, and 10 on 32- and 64-bit SPARC systems, and on Solaris 10 on 32- and 64-bit x86 (IA32) systems. They are unlikely to function correctly on Solaris 2.6 or older without additional programming work. Solaris 9/x86 may work, but has not been tested. They are likely to work on Open Solaris, but since that is a moving target, no promises can be made.

Coraid EtherDrive® storage blades and Coraid SR RAID controllers have been tested. Any target device conforming to version 1 of the AoE protocol specification should work, provided the attached (or emulated) ATA disk supports logical-block addressing. Any ATA disk manufactured since 1995 ought to work fine.

The driver supports both 28- and 48-bit sector addresses, allowing disks as large as 128 pebibytes. Older versions of Solaris may be limited to 1TiB; see below for the full scoop.

1.2. Release notes

Version 1.4 has these additional features and carries these warnings:

1.3. Background: AoE and EtherDrive blades

AoE (ATA over Ethernet) is a network protocol for accessing ATA devices over an Ethernet. An initiator system constructs an AoE message containing an ATA command such as `read sector,' and sends it to the Ethernet address of the desired target device; the target responds with an AoE message containing the resulting device status and data (if any). EtherDrive Storage Blades and SR RAID controllers are two examples of such targets, but the protocol is more general. The programs described here will work with any target that correctly implements the protocol.

AoE is defined at the Ethernet level, using an AoE-specific Ethernet protocol type and Ethernet MAC addresses. IP is not involved in any way. There is no inter-network routing: initiators and targets must share an Ethernet network. Neither is there any security built into the protocol: a target will process commands from any node on the same Ethernet. It is usually best to dedicate an Ethernet (or VLAN) to AoE traffic, or at least to share only with other secure traffic, e.g. SNMP or other management protocols. It is quite unwise to run AoE on a network into which a random person may plug his untrustworthy laptop.

Each target is identified by a pair of numbers, aoemin and aoemaj. For an EtherDrive blade, aoemaj is the shelf number, aoemin the slot; for an SR RAID controller, they are set through the configuration interface. The Solaris implementation allows AoE to be active on several network interfaces; each interface is assigned an aoechan number, and targets are addressed by the triplet (aoechan, aoemaj, aoemin). Error messages and support programs often express such target addresses as three numbers separated by slashes: 0/2/8 means the target with aoechan 0, aoemaj 2, aoemin 8.
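
Scripts that read AoE log messages or drive the support programs often need to split the slash-separated form back into its three numbers. A minimal sketch, assuming only the aoechan/aoemaj/aoemin notation described above; the Target class and parse_target function are hypothetical names, not part of the AoE package.

```python
from typing import NamedTuple

class Target(NamedTuple):
    """An AoE target address: channel, major (shelf), minor (slot)."""
    aoechan: int
    aoemaj: int
    aoemin: int

    def __str__(self):
        return f"{self.aoechan}/{self.aoemaj}/{self.aoemin}"

def parse_target(text):
    """Parse 'chan/maj/min' into a Target; raise ValueError if malformed."""
    parts = text.strip().split("/")
    if len(parts) != 3:
        raise ValueError(f"expected chan/maj/min, got {text!r}")
    chan, maj, minor = (int(p) for p in parts)
    if min(chan, maj, minor) < 0:
        raise ValueError(f"negative component in {text!r}")
    return Target(chan, maj, minor)

t = parse_target("0/2/8")
print(t.aoemaj, t.aoemin)   # 2 8
print(t)                    # 0/2/8
```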

A special broadcast message asks every target on the network, or the target with a specified address, to respond with a check-in message reporting its address and a few other parameters. When a target is first powered on (e.g. when an EtherDrive blade is inserted in a shelf) it broadcasts such a check-in response without being asked. Thus host software can find a target's Ethernet address given its target address, assemble a list of all targets on a network, and learn when a new target appears.
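
For readers curious about what such a broadcast looks like on the wire, here is a rough sketch of the query frame. The field layout is taken from the public AoE version 1 protocol specification as the editor understands it, not from this manual, and it is not a description of how the Solaris driver builds its messages. The function name and the zero source address in the example are placeholders.

```python
import struct

AOE_ETHERTYPE = 0x88A2        # Ethernet protocol type used by AoE
AOE_VERSION = 1
CMD_QUERY_CONFIG = 1          # AoE command 1: Query Config Information
BROADCAST_MAC = b"\xff" * 6
ALL_MAJORS, ALL_MINORS = 0xFFFF, 0xFF   # wildcard aoemaj / aoemin

def build_query(src_mac, major=ALL_MAJORS, minor=ALL_MINORS, tag=0):
    """Build an Ethernet frame asking targets to check in."""
    # Ethernet header: destination, source, protocol type (14 bytes).
    eth = BROADCAST_MAC + src_mac + struct.pack("!H", AOE_ETHERTYPE)
    # Common AoE header: version/flags, error, major, minor, command, tag.
    aoe = struct.pack("!BBHBBI", AOE_VERSION << 4, 0,
                      major, minor, CMD_QUERY_CONFIG, tag)
    # Query Config body: buffer count, firmware version, sector count,
    # AoE version/config command, config string length; all zero in a query.
    cfg = struct.pack("!HHBBH", 0, 0, 0, 0, 0)
    return eth + aoe + cfg

frame = build_query(bytes(6))    # placeholder source address
print(len(frame))                # 32
```

Passing specific major and minor values instead of the wildcards corresponds to asking only the target with that address to respond.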

2. Installing, removing, and upgrading the software

2.1. Installing AoE

The AoE subsystem is a single Solaris package distributed as a single (datastream format) file, usually named CORDaoe-version-arch, e.g. CORDaoe-1.4-x86. It may arrive compressed, e.g. CORDaoe-1.4-x86.Z; use uncompress(1) to unpack it.

To install the package, use pkgadd(1M) as super-user:

pkgadd -d CORDaoe-1.4-x86 CORDaoe

The uncompressed package file is about 900kiB; the installed subsystem takes about 1000kiB.

Pkgadd may warn that scripts are to be run as super-user. Installing the package runs add_drv(1M) to add three device drivers to the system, and uses awk to add entries to /etc/devlink.tab. (On Solaris 10, pkgadd claims /etc/devlink.tab is itself a set-userid file, apparently a garbling of the super-user script warning.) On an SMF system, installation runs svccfg(1M) to import the service manifest, and svcadm(1M) to work around a bug in establishing service dependencies.

Pkgadd may also warn that the ownership of the three /usr/share/man directories is being changed. The trouble is that these directories may not always exist, and there is no way both to tell pkgadd to create them if necessary and to tell it to leave existing permissions be.

2.2. Removing AoE

To remove AoE from the system:

  1. Save any configuration files you may need later, in particular /etc/aoe.conf and /usr/kernel/drv/aoed.conf.
  2. Ensure that no AoE disks are in use: unmount any file systems and zpool export any ZFS pools on AoE disks; remove any AoE swap areas. Then run svcadm disable aoe (SMF system) or /etc/init.d/aoe stop (non-SMF).
  3. Ensure that the system won't try to use AoE devices later: remove or comment out vfstab entries referring to AoE disks; remove or comment out every line of /etc/aoe.conf, or remove or rename the file.
  4. Run pkgrm CORDaoe. This calls rem_drv(1M) to remove the device drivers from the system configuration, then deletes all files created when the package was installed.

Rem_drv removes the device files in /devices, but may leave the symbolic links in /dev. That's just the way some versions of Solaris work. It's OK to remove the dangling links yourself.

2.3. Upgrades and downgrades

To change from one version of AoE to another, remove the existing CORDaoe package and install the desired version as explained above. Remember to save /usr/kernel/drv/aoed.conf if you've customized it; otherwise it will be reset to its default contents when the new package is installed. If using ZFS, remember to zpool export any pools stored on AoE disks before removing the old package, and to zpool import them after installing the new one.

2.4. What's installed

The package installs files in the following directories:

/usr/kernel/strmod
/usr/kernel/strmod/sparcv9
/usr/kernel/strmod/amd64
/usr/kernel/drv
/usr/kernel/drv/sparcv9
/usr/kernel/drv/amd64
    Device driver binaries and configuration files. There are two device drivers and one STREAMS module, each in both 32- and 64-bit (sparcv9 or amd64) versions.
/usr/sbin
    Configuration and control programs: aoectl, aoestart, aoestop, aoemkconf, aoelabinit, aoemon, aoeunlabel.
/etc/init.d
/etc/rc2.d
    Non-SMF start-and-stop control script. /etc/init.d/aoe is installed on both SMF and non-SMF systems, the /etc/rc2.d/S00aoe link only on a non-SMF system.
/lib/svc/method
/var/svc/manifest/device
    SMF start-and-stop control script /lib/svc/method/device-aoe and service manifest /var/svc/manifest/device/aoe.xml. Installed only on an SMF system.
/etc/aoe
    Initially empty; holds one file per active aoechan, used (with fattach(3)) to hold the channel open.
/usr/share/man/man1m
/usr/share/man/man7d
/usr/share/man/man7m
    Manual entries describing the programs and drivers.
/opt/CORDaoe/doc
    Longer documents (this one, for instance) about the use and workings of AoE. See /opt/CORDaoe/doc/README for details.
/opt/CORDaoe/lib
    Odds and ends used during package installation.

3. Configuration and use

3.1. Setting up a network channel

Before AoE targets can be accessed, the network channels it will use must be defined in file /etc/aoe.conf. Until this file has been set up, starting AoE silently does nothing.

To set up a single channel, create aoe.conf with a single line declaring the device to be used. For example:

aoechan 0 ether /dev/bge2
configures a channel using the third connector of the four-port gigabit interface built into some systems.

Any Ethernet device Solaris supports may be used. Special configuration may be required to enable jumbo frames; see Sun's documentation for details.

The Ethernet device need not be dedicated to AoE; in particular it may be shared with IP, though as noted above, this is often unwise.

To use more than one network device for AoE, declare additional channels as explained below.

Normally nothing special is needed in aoe.conf to permit jumbo frames, but jumbo support may need to be enabled in Solaris for the network device to be used. See the discussion of jumbo frames for the full scoop.

Once aoe.conf has been set up, AoE must be started:

3.2. Device names

AoE disk devices are named similarly to directly attached disks. A block device has a name like /dev/dsk/cad107s2; the corresponding raw device is /dev/rdsk/cad107s2.

What does cad107s2 mean? In AoE's use of standard Sun device-name conventions:

ca
    Solaris controller ID a. AoE disks are always ca.
d107
    Device 107. The number is a decimal-digit encoding of the target address for this disk: cMMmm for target c/MM/mm, with leading zeroes discarded. Thus d107 is target 0/1/7; d3 is 0/0/3; d10319 is 1/3/19.
s2
    Partition 2. The eight standard Solaris partitions s0 - s7 (sixteen, s0 - s15, on x86) are allowed.

If the disk has a DOS (fdisk) partition table, partition suffixes p1 - p4 refer to the four direct DOS partitions. Only primary DOS partitions are accessible; extended partitions are not supported. Unlike standard Sun drivers, DOS partitions are allowed even on SPARC systems.

On x86 systems only, a Sun VTOC label may be encapsulated in a DOS partition; in that case the sn partitions refer to the VTOC partitions contained within that DOS partition, as with standard Sun drivers.

There is also a device name with no partition suffix, e.g. /dev/dsk/cad107; this accesses the whole disk regardless of the disk label, including the parts normally reserved for system use.

By default devices are created for targets 0/0/0 - 0/0/14 (i.e. cad0s* - cad14s*) and 0/1/0 - 0/1/14 (cad100s* - cad114s*): enough for two EtherDrive shelves or SR RAID devices numbered 0 and 1, each with up to fifteen blades. To make more devices requires editing a configuration file; in Solaris 9 and earlier, the system usually must be rebooted as well. See More devices for details.

In Solaris, names in /dev are symbolic links to real device files in /devices. For an AoE disk the real device name is of the form

/devices/pseudo/aoed@inst:part
Inst is the decimal-digit encoding of the target address, part the /devices-style partition name. For example, /dev/dsk/cad107s2 is a link to /devices/pseudo/aoed@107:c; /dev/rdsk/cad107 to /devices/pseudo/aoed@107:wd,raw.
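
The naming convention can be expressed as a few lines of arithmetic. This is a sketch only: the function names are invented, the limit of two decimal digits each for aoemaj and aoemin is inferred from the cMMmm form, and the slice-to-letter mapping (s0 to a, s2 to c, and so on through s7) is extrapolated from the examples in this manual.

```python
def instance_number(chan, maj, minor):
    """Conventional decimal-digit encoding cMMmm of target chan/maj/minor."""
    if not (0 <= chan <= 9 and 0 <= maj <= 99 and 0 <= minor <= 99):
        raise ValueError("target does not fit the cMMmm convention")
    return chan * 10000 + maj * 100 + minor

def decode_instance(inst):
    """Inverse of instance_number: return (chan, maj, minor)."""
    return inst // 10000, (inst // 100) % 100, inst % 100

def dev_name(chan, maj, minor, slice_no=None, raw=False):
    """Name under /dev/dsk or /dev/rdsk; slice_no=None means the whole disk."""
    suffix = "" if slice_no is None else f"s{slice_no}"
    directory = "rdsk" if raw else "dsk"
    return f"/dev/{directory}/cad{instance_number(chan, maj, minor)}{suffix}"

def devices_path(chan, maj, minor, slice_no=None, raw=False):
    """Corresponding name under /devices (slices s0 - s7 only; assumed mapping)."""
    if slice_no is None:
        part = "wd"
    elif 0 <= slice_no <= 7:
        part = "abcdefgh"[slice_no]
    else:
        raise ValueError("only s0 - s7 are handled in this sketch")
    inst = instance_number(chan, maj, minor)
    return f"/devices/pseudo/aoed@{inst}:{part}" + (",raw" if raw else "")

print(dev_name(0, 1, 7, 2))             # /dev/dsk/cad107s2
print(devices_path(0, 1, 7, 2))         # /devices/pseudo/aoed@107:c
print(devices_path(0, 1, 7, raw=True))  # /devices/pseudo/aoed@107:wd,raw
print(decode_instance(10319))           # (1, 3, 19)
```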

3.3. Enabling devices

Before an AoE disk device can be opened, it must be enabled. Normally this is done automatically: a target's check-in message is intercepted by aoemon; aoemon enables its devices, and reports to syslog that it did so.

aoectl list lists enabled targets. Aoectl may also be used to enable or disable specific targets, or to repeat the check-in broadcast so targets will repeat their responses. See the section on tools for details.

If a target isn't enabled but should be, try resetting it; e.g. pull the EtherDrive blade out of the shelf and push it back in, or put the SR lblade offline and then online. If all is well, within a few seconds aoemon should report that it has enabled that target.

3.4. Using AoE disks

Once enabled, AoE /dev/dsk and /dev/rdsk devices work like any other disks.

Format(1M) can be used to set up partition tables (the partition submenu) and to run simple read and read-write disk tests (analyze), modulo a few bugs and quirks described below. Low-level formatting, bad-block repair, and defect-list management (the format, repair, and defect menus) are not supported. A partition table may also be written with fmthard(1M) or read with prtvtoc(1M). On a brand-new huge drive, or one bearing a spurious old label, it may be necessary to use aoelabinit to initialize the label first.

If the disk doesn't have a label (partition table), the driver makes one up. In the made-up label, subdevice s0 (and s2 if the disk supports VTOC labels) is all of the disk intended for normal use, excluding parts reserved for use by diagnostics and the label itself.

Any file system except the root or /usr may be stored on an AoE disk. AoE is started very early in the boot process; AoE file systems listed in vfstab(4) are mounted automatically when the system boots. It is not possible to boot from an AoE device.

AoE disks may belong to ZFS storage pools, but beware the reliability problems (not specific to AoE) described below.

AoE disks may be used for swapping. AoE swap areas listed in vfstab will be added automatically. On a non-SMF system, swap areas are added twice, once before AoE has been started and once after. The first attempt produces a harmless message like

/dev/dsk/cad108s1: No such device or address
If all is well, the second attempt succeeds. On an SMF system, AoE is started earlier and AoE swap areas are enabled without fuss.

On an SMF system, if the device/aoe service is enabled, it is started automatically, if possible, in single-user mode. To use AoE in single-user mode on a non-SMF system, ensure that /usr is mounted if it is a separate file system, then run /etc/init.d/aoe start.

3.4.1. Large disks

Support for disks larger than 1TiB is somewhat complicated, because of the way Solaris supports such disks.

3.4.1.1. What Solaris does

The following is mostly a summary of Sun's documentation, which should be consulted for details.

The original Solaris disk-labelling scheme (VTOC) stored the size of each partition as a 32-bit signed sector count. This limited the size of a single disk subdevice (partition) to 2^31 sectors, or 1 tebibyte. The UFS file system format had similar limits: no file system could span more than 1TiB.

DOS partition labels are similarly constrained by 32-bit values to disks no larger than 1TiB.

The Solaris 9 4/03 release added a new labelling scheme (EFI), derived from an Intel specification. EFI labels store partition parameters in 64-bit unsigned values; a disk device or any single partition may span 2^64 sectors, or 8 zebibytes, or about eight billion tebibytes. Solaris 9 8/03 updated UFS to match: a new option to newfs(1M) creates a `multi-terabyte [sic] file system.' These upgrades may be installed on an older Solaris 9 system by installing the current recommended patch cluster; they are not available for Solaris 8 or older systems.

EFI labels afford several benefits besides large-disk handling:

  • The VTOC format is sensitive to the byte order of the computer accessing the label: a VTOC written by Solaris/SPARC makes no sense to Solaris/x86. EFI labels have a defined byte order, independent of the host CPU.
  • The VTOC format assumed the disk comprised an array of cylinders, all of the same size. Disks ceased to be built that way many years ago; modern ATA disks no longer report useful sector, head, or cylinder counts. EFI labels treat the disk as a simple array of blocks.

One new constraint is that all partitions in a single EFI label must be non-overlapping. In particular the Solaris tradition that subdevice s2 accesses the entire user-data part of the disk regardless of other partitions cannot be honoured under EFI.

To retain compatibility with existing disks and older systems, VTOC labels are still supported. Sun's standard tools write VTOC labels to 1TiB or smaller disks, EFI labels to larger disks. The system looks for a VTOC label first, then an EFI label; thus a small disk may have either type of label.
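
The size limits quoted in this section, and the rule Sun's tools use to pick a label type, can be checked with a short calculation. This is a sketch assuming 512-byte sectors; the names are invented for illustration.

```python
SECTOR_BYTES = 512
TiB = 2 ** 40
ZiB = 2 ** 70

VTOC_MAX_BYTES = 2 ** 31 * SECTOR_BYTES   # 32-bit signed sector count
EFI_MAX_BYTES = 2 ** 64 * SECTOR_BYTES    # 64-bit unsigned sector count

def default_label(disk_bytes):
    """Label type Sun's standard tools write for a disk of this size."""
    return "VTOC" if disk_bytes <= VTOC_MAX_BYTES else "EFI"

print(VTOC_MAX_BYTES // TiB)          # 1  (tebibyte)
print(EFI_MAX_BYTES // ZiB)           # 8  (zebibytes)
print(EFI_MAX_BYTES // TiB)           # 8589934592, about eight billion
print(default_label(750 * 10**9))     # VTOC
print(default_label(2 * TiB))         # EFI
```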

3.4.1.2. How it works with AoE disks

The Solaris AoE driver supports either labelling scheme.

If an unlabelled disk is no larger than 1TiB, the AoE driver pretends it has a VTOC label in which s0 and s2 both access the entire user-data part of the disk. The regular Solaris tools read and modify the label without trouble.

If an unlabelled disk is larger than 1TiB, the disk driver makes no pretense. The whole-disk subdevice (cadnnn) accesses the entire physical device; none of the standard partition subdevices (cadnnnsn) may be used. Sun's standard tools report no label until one is written; that is the best that can be done within Sun's disk-driver interface. Format(1M) cannot handle an unlabelled disk large enough to require EFI labels; use aoelabinit as a workaround.

If a disk already has a VTOC label, everything works in the standard way. The eight standard partitions s0-s7 work as described in the label; Sun's standard tools display the label and can modify it.

If a disk already has an EFI label, and the system accessing it runs a version of Solaris new enough to have EFI support, everything also just works. The eight standard partitions s0-s7 are defined by the first eight EFI partitions. Sun's standard tools display the label and can modify it, with minor bugs.

If a disk already has an EFI label, pre-EFI Solaris can access it with limitations. The AoE driver reads the label and defines the partitions; the corresponding s0-s7 subdevices work; but Sun's standard tools don't understand the EFI label and report errors if asked to display it. In particular, format on a pre-EFI system doesn't work at all with an EFI-labelled disk. Aoeunlabel will wipe out an existing EFI label, allowing the disk to be partitioned from scratch.

Pre-EFI Solaris running in 32-bit mode can access at most 1TiB from a disk partition, even if the EFI label defines it to be larger. Pre-EFI Solaris in 64-bit mode can access the whole partition no matter how big. Pre-EFI Solaris cannot make a UFS file system larger than 1TiB, nor can such a system read a larger file system written by a newer system.

3.4.2. format(1M) bugs and limitations

The format, repair, and defect menus of format(1M) are not supported on AoE disks.

An AoE target is listed in format's disk-selection menu only if it is enabled. If an enabled device is unreachable (broken, removed, mistakenly enabled by hand), format may stall for a few seconds trying to access it, as it would for a configured-but-inaccessible SCSI disk.

Format exhibits a few bugs when working with an AoE disk larger than 1TiB:

3.4.3. ZFS is surprisingly fragile

We have learned that ZFS is surprisingly non-robust in the face of I/O errors: if a disk reports repeated errors and the ZFS pool doesn't have enough remaining redundancy to allow the disk to be set aside, the whole system panics (crashes).

For example, the system will panic if:

  • Any disk fails in a pool with neither RAID nor mirror redundancy.
  • A disk fails in a raidz1 pool that is already DEGRADED, i.e. a disk has already failed and has not yet been replaced.
  • A disk fails in a raidz2 pool in which two disks have already failed and neither has yet been replaced.
It is not unreasonable for ZFS to withdraw access to a pool that has suffered such a severe failure, but to do so by crashing the whole system (even if other ZFS pools or other non-ZFS operations still have to function) is, to put it politely, excessive.
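
The three cases above follow one rule: the panic occurs when one more failure would exceed the number of failures the layout can absorb. A simplified sketch of that rule, treating the pool as a single redundancy group; the function names and layout strings are made up for this example, and nothing here queries ZFS itself.

```python
def tolerated_failures(layout, mirror_width=2):
    """Simultaneous disk failures a single redundancy group survives."""
    if layout == "none":        # plain disk or stripe, no redundancy
        return 0
    if layout == "raidz1":
        return 1
    if layout == "raidz2":
        return 2
    if layout == "mirror":      # an n-way mirror survives n - 1 failures
        return mirror_width - 1
    raise ValueError(f"unknown layout {layout!r}")

def next_failure_panics(layout, already_failed, mirror_width=2):
    """True if one more failed disk would exceed the group's redundancy."""
    return already_failed + 1 > tolerated_failures(layout, mirror_width)

print(next_failure_panics("none", 0))     # True
print(next_failure_panics("raidz1", 1))   # True  (pool already DEGRADED)
print(next_failure_panics("raidz2", 2))   # True
print(next_failure_panics("raidz2", 1))   # False
```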

This is a problem in Solaris, not in the AoE code; it has been reported with iSCSI arrays as well. Rumour has it that Sun are aware of the problem and a fix is in the works.

We urge customers to take special care when using ZFS, unless it is acceptable for the entire system to crash because of a single broken disk. Use redundancy of one sort or another for all pools, whether by using the mirroring and RAID mechanisms built into ZFS or by using an external RAID mechanism (such as that built into Coraid's RAID appliances). Whatever redundancy is used, make use of hot spare mechanisms so that redundancy will be restored automatically as quickly as possible. Check regularly for failures and replace broken disks so that the hot-spare supply won't run out.

All of this is good practice for any modern storage system. That ZFS is crash-prone just sharpens the consequences of error.

3.4.4. Metadisks

AoE disks may belong to metadisks managed by the Solaris Volume Manager in Solaris 9 and newer systems. Concatenation, striping, mirroring, RAID 5, and soft partitions all work just as for any other disk. Solstice DiskSuite, the extra-cost predecessor to Volume Manager, has not been tested; it probably works but this is not a promise. Field reports are welcome.

On an old computer system, initializing a RAID group (metainit dnn -r) with four or more AoE-disk components may lock up the system. The system is actually OK, but so busy initializing RAID data that it cannot do much else. All returns to normal when RAID initialization has finished. A 140MHz Ultra-1 system exhibits the problem; an Ultra-2 with a single 300MHz processor does not. Any current system ought to be OK.

3.5. Error logging and monitoring

Kernel drivers in the AoE subsystem log error messages to syslog facility kern; in particular this is how disk errors are reported. All messages from the AoE disk driver contain the string aoed and give the device instance number and target address. Disk-error messages always contain the string ATA error and give the ATA error-register value and an ASCII interpretation of its contents. Messages about errors in the AoE part of a target (not the ATA disk part) say AoE error and give both the AoE error code and its conventional meaning.
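
Because the messages carry fixed marker strings, a log watcher can sort them without knowing the full message layout. A minimal sketch: only the strings aoed, ATA error, and AoE error come from this manual; the sample lines below are invented and do not reproduce the driver's actual wording.

```python
def classify(line):
    """Classify a syslog line from the AoE disk driver.

    Returns None for lines not from aoed, else 'disk' for ATA errors,
    'target' for AoE-level errors, or 'other'.
    """
    if "aoed" not in line:
        return None
    if "ATA error" in line:
        return "disk"
    if "AoE error" in line:
        return "target"
    return "other"

# Invented sample lines, for illustration only.
samples = [
    "aoed107 (0/1/7): ATA error ...",
    "aoed3 (0/0/3): AoE error ...",
    "sshd: accepted connection",
]
print([classify(s) for s in samples])   # ['disk', 'target', None]
```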

User-mode programs responsible for starting AoE channels and for monitoring for unusual events (aoestart and aoemon, described in more detail below) log to syslog facility daemon. This log is where aoestart reports that an aoechan could not be started, and where aoemon reports that a target was automatically enabled. Aoemon also logs a packet dump when an error occurs: for an AoE-protocol or ATA error, a dump of the message received; for a timeout, a dump of the message that couldn't be sent.

Iostat(1M) lists AoE disks that have been used at least once since the last boot. Additional AoE-specific statistics may be fetched with kstat(1M).

4. Advanced topics

4.1. More devices

File /usr/kernel/drv/aoed.conf lists the AoE disk devices to be created. Only the devices listed when the aoed driver last processed the file (usually when the system last booted) will work, even if additional special files happen to exist, and even if other targets exist and have been enabled.

The AoE subsystem comes with a default aoed.conf file declaring an initial set of targets. If more are needed, the file must be edited and the system told to reread it.

4.1.1. Editing aoed.conf

aoed.conf is a Solaris driver configuration file, in the form described in driver.conf(4). The easiest way to add to the file is to generate new entries with aoemkconf and edit them in. You may also write your own entries.

To summarize the general Solaris rules: a driver configuration file contains lines of text. Anything following # is a comment. Empty lines are ignored. Non-empty lines declare devices or driver properties (parameters, given as name=number or name="string"). Each line must be terminated with a semi-colon.

Each line in aoed.conf declares a device instance, specifying an instance number and target address:

name="aoed" parent="pseudo" instance=inst aoechan=cno aoemaj=majno aoemin=minno;
The punctuation is important; in particular the double-quotes and the terminal semicolon must be present. The required fields have these meanings:
name="aoed" parent="pseudo"
    Constant values telling the system which driver this is and how it fits into the system.
instance=inst
    Inst, a decimal number, is the instance number of this device, used to generate device names: instance=102 produces /devices/pseudo/aoed@102:a, /dev/dsk/cad102s0, and so on.
aoechan=cno
aoemaj=majno
aoemin=minno
    Cno, majno, and minno are decimal numbers expressing the target address: this is target cno/majno/minno.

These optional fields are allowed:

timeout=ms
    Allow this device ms milliseconds before retrying or expiring a command. The default is 200ms. It may need to be lengthened for a long-latency device, or shortened to make the best of a badly-congested network, e.g. when running AoE over 10Mbps Ethernet (allowed but probably not a good idea).
hd=nhead
sec=nsec
    Pretend this drive has cylinders of nhead heads and nsec sectors, ignoring any information returned by the hardware or the default numbers in the driver. Numbers stored in a valid disk label take precedence. Both hd and sec must be specified; if only one is supplied, it is ignored.
maxdata=size
    Limit ATA data transfers for this device to at most size bytes. Usually needed only when using jumbo frames on problematic networks. Normally it is better to set the size for the whole aoechan rather than for individual targets.
maxbuf=ncmds
    Allow at most ncmds unacknowledged commands for this target. Subsequent commands will be held until a pending command has been answered or has timed out. Normally initiator and target negotiate a suitable value, but it may occasionally be necessary to limit it on congested networks or when many initiators access a target concurrently.

The optional parameters must be added by hand; aoemkconf won't put them in.

The declarations in aoed.conf, and nothing else, connect instance numbers to target addresses. That the instance number is a decimal encoding of the target address is only a convention; nowhere in the driver subsystem is this assumed. If you write your own entries you may use whatever mapping you like, as long as no inst value is used for more than one device, and none is too large (note 2): 20164 on a 32-bit SPARC system, 330382099 on 64-bit SPARC; 12483 and 204522252 on x86.
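
For those writing their own entries, the line format is simple enough to generate with a few lines of script. This is a hedged sketch, not a replacement for aoemkconf, whose exact output may differ; the function names are invented. Unlike the driver, which ignores a lone hd or sec, this sketch rejects it so that the mistake is noticed.

```python
OPTIONAL_FIELDS = ("timeout", "hd", "sec", "maxdata", "maxbuf")

def aoed_entry(chan, maj, minor, instance=None, **optional):
    """Return one aoed.conf line for target chan/maj/minor."""
    if instance is None:
        # Conventional cMMmm encoding; any unique number would do.
        instance = chan * 10000 + maj * 100 + minor
    unknown = set(optional) - set(OPTIONAL_FIELDS)
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    if ("hd" in optional) != ("sec" in optional):
        raise ValueError("hd and sec must be given together")
    fields = [
        'name="aoed"', 'parent="pseudo"',
        f"instance={instance}",
        f"aoechan={chan}", f"aoemaj={maj}", f"aoemin={minor}",
    ]
    fields += [f"{key}={int(optional[key])}"
               for key in OPTIONAL_FIELDS if key in optional]
    return " ".join(fields) + ";"

def shelf_entries(chan, maj, slots=15):
    """Entries for every slot of one shelf, one per line."""
    return "\n".join(aoed_entry(chan, maj, slot) for slot in range(slots))

print(aoed_entry(0, 1, 2))
# name="aoed" parent="pseudo" instance=102 aoechan=0 aoemaj=1 aoemin=2;
print(aoed_entry(0, 2, 8, timeout=500))
# name="aoed" parent="pseudo" instance=208 aoechan=0 aoemaj=2 aoemin=8 timeout=500;
```

Calling shelf_entries(0, 2) would produce fifteen lines for shelf 2 on channel 0, ready to be appended to aoed.conf by hand.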

4.1.2. Making the new file effective

On Solaris 10, changes to aoed.conf can be made effective while the system is running:

# update_drv -f aoed
unloads the aoed module if possible, reloads it, and updates /devices and /dev. If some AoE devices are in use (mounted, in use for swapping, device file open), update_drv will complain that the module could not be unloaded; new devices will be added, but existing devices whose aoed.conf entries have changed will not be updated until the next driver reload or system boot.

On Solaris 9 and older systems, changes to aoed.conf have no effect until the next time the aoed driver module is loaded, and /devices and /dev are not updated until Solaris is told to do so. The simplest way to change the AoE configuration is to update aoed.conf and then perform a configuration reboot: reboot -- -r if Solaris is running, boot -r at the Open Boot ok prompt. It is also possible to make changes effective right away if no AoE device is in use, by unloading the aoed module and explicitly reconfiguring:

# modinfo | awk '$6 == "aoed"'
190 780a4000 a570 229 1 aoed (AoE disk driver v1.2)
# modunload -i 190
# devfsadm -i aoed
The -i argument to modunload is the first field of the line printed by modinfo. If modinfo doesn't list the aoed module, the driver wasn't loaded; skip ahead and run devfsadm.
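
When the procedure is scripted, the module ID can be picked out of the modinfo listing the same way the awk filter does, by matching the sixth field. A small sketch, assuming the column layout shown in the sample output above; the function name is hypothetical.

```python
def aoed_module_id(modinfo_output):
    """Return the module ID of aoed from modinfo output, or None if not loaded."""
    for line in modinfo_output.splitlines():
        fields = line.split()
        # Same test as the awk filter: sixth field is the module name.
        if len(fields) >= 6 and fields[5] == "aoed":
            return int(fields[0])
    return None

sample = "190 780a4000 a570 229 1 aoed (AoE disk driver v1.2)"
print(aoed_module_id(sample))    # 190
print(aoed_module_id(""))        # None: driver not loaded, go straight to devfsadm
```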

On Solaris 7, devfsadm may not exist (it was added by a patch) and in any case appears to be undocumented. The older equivalent, which works in either case, is:

# drvconfig -i aoed
# devlinks

It is wise to plan ahead, especially on a non-SMF system where the driver must be unloaded to reconfigure. For example, when installing a new EtherDrive shelf, add a device instance for each slot, not just for those you plan to use right away; when installing an EtherDrive Storage Appliance, create a few extra devices. To create hundreds of never-used devices would be silly, but the cost of a few extras is modest: a kibibyte or so inside the operating system per device, a handful of inodes for special files and symbolic links, extra entries in the /dev/dsk and /dev/rdsk directories.

4.2. Fewer devices

If a target is removed while the system is running, it is prudent to disable it manually:

aoectl disable target ...
Otherwise attempts to access it will stall for a few seconds before timing out. In particular, format may stall during its initial search for disks, just as it sometimes stalls when a SCSI disk has been removed. If the system reboots, a missing AoE target won't respond and will not be enabled. If the target is plugged back in while the system is running it will be enabled automatically.

Removing a target's entry from /usr/kernel/drv/aoed.conf will save a little memory after the system is next booted. There's little point in doing this for a single target; it may make sense if many disks have been permanently removed.

When a target is removed from aoed.conf its entries in /devices or /dev may remain, even after a configuration reboot. Attempts to use such ghost devices return errors. If the same target is later restored to aoed.conf (and the driver reconfigured) the devices will work again. This is just the way some versions of Solaris work, especially older ones.

4.3. More channels

File /etc/aoe.conf assigns AoE network channels to Solaris Ethernet devices. Every device through which AoE devices will be accessed must be listed, with a distinct aoechan number.

To add network interfaces, add lines to that file, in the same format as the first:

aoechan channum ether device

Each channel must have a distinct channum in the range 0-9. Channels may not share a network device, though of course a multiport device like the Quad Fast Ethernet may support several channels, one to a port. The network device need not be dedicated to AoE; in particular it may also be configured as an IP interface.
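
The rules above (one channel per line, channum in 0-9, no channel number or network device used twice) can be checked before restarting AoE. A sketch of such a check; the function name is invented, and treating # as a comment character in aoe.conf is an assumption, since this manual does not state the comment syntax.

```python
def parse_aoe_conf(text):
    """Parse aoe.conf text into {channum: device}, enforcing the rules above."""
    channels = {}
    devices = set()
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()   # '#' comments: an assumption
        if not line:
            continue
        fields = line.split()
        if len(fields) != 4 or fields[0] != "aoechan" or fields[2] != "ether":
            raise ValueError(f"line {lineno}: expected 'aoechan N ether DEVICE'")
        chan, device = int(fields[1]), fields[3]
        if not 0 <= chan <= 9:
            raise ValueError(f"line {lineno}: channum {chan} not in 0-9")
        if chan in channels:
            raise ValueError(f"line {lineno}: channum {chan} used twice")
        if device in devices:
            raise ValueError(f"line {lineno}: {device} used by two channels")
        channels[chan] = device
        devices.add(device)
    return channels

conf = "aoechan 0 ether /dev/bge2\naoechan 1 ether /dev/ce0\n"
print(parse_aoe_conf(conf))   # {0: '/dev/bge2', 1: '/dev/ce0'}
```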

Changes to aoe.conf take effect when the system is next booted, or when svcadm refresh svc:/device/aoe (SMF system) or /etc/init.d/aoe restart (non-SMF) is run. Adding an aoechan usually requires adding entries to aoed.conf, so a non-SMF system usually must be booted anyway.

4.4. Jumbo frames

The Ethernet standard allows at most 1500 bytes of payload per frame. The AoE header for an ATA command occupies 22 bytes; transfers must be in whole 512-byte sectors. Thus only two sectors (1024 bytes) may be read or written at a time using standard Ethernet.

Per-frame overhead is quite significant in gigabit Ethernet, not just for AoE but for many protocols. Hence many manufacturers of gigabit network cards and switches allow larger frames, so that more data may be packed into a single frame and fewer frames sent. Usually the maximum size is about 9000 bytes, but it varies by product.
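
The arithmetic behind these numbers is worth seeing once. A short sketch using the 22-byte header and 512-byte sector size given above; the 9000-byte figure is the typical jumbo payload mentioned in the text, not a guaranteed value for any product.

```python
AOE_ATA_HEADER = 22    # bytes of AoE header for an ATA command
SECTOR_BYTES = 512

def sectors_per_frame(payload_bytes):
    """Whole sectors that fit in one frame with the given payload size."""
    return max(0, (payload_bytes - AOE_ATA_HEADER) // SECTOR_BYTES)

print(sectors_per_frame(1500))   # 2
print(sectors_per_frame(9000))   # 17
```

With a 9000-byte payload, each frame carries 17 sectors instead of 2, so the same transfer needs roughly one-eighth as many frames.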

Jumbo frames can speed up AoE I/O quite a bit. Usually it is necessary to enable them explicitly in Solaris, and some care may be needed to make everything work right.

4.4.1. Requirements for jumbo frames

For jumbo frames to work, they must be supported in three places:

  1. The AoE target must support jumbo frames. All current gigabit-capable Coraid devices do.
  2. The host network interface to be used for AoE must support jumbo frames, and jumbo support must be enabled. Not all Sun-supported network devices are jumbo-capable. In particular jumbo frames are allowed by the ce device (e.g. the add-on GigaSwift adapter), but not the bge device (e.g. the embedded Ethernet devices on many Sun motherboards).

    Beware that jumbo support in early versions of Sun's drivers may not have been well-tested, especially in the ways AoE uses it. In particular, the earliest version of the Solaris 10 ce driver may be prone to crash when used with AoE. It is wise to patch such drivers to their most-recent versions before using jumbos.

  3. If there are switches in the network between AoE host and target, they must all support jumbo frames. Check the manufacturer's specifications for the specific models of switch you have.

4.4.2. Enabling jumbo frames

To make AoE use jumbo frames:

  1. Enable jumbo frames in the appropriate Solaris device driver. See Sun's documentation for how to do this, and for the expected maximum frame size (usually expressed as MTU, the maximum Ethernet data payload). Make sure you set things up so that jumbo frames are automatically enabled at boot time for that device.
  2. If there are any network switches between hosts and targets, find out the maximum frame size supported by each switch. If this is smaller than the jumbo size allowed by Solaris, configure the jumbo frame size by hand, as explained below.
  3. If no switches are involved or all switches support sufficiently large frames, simply restart AoE or reboot with jumbo frames enabled in Solaris. The AoE subsystem will automatically detect the maximum frame size allowed by the Solaris network driver being used, reduce it if necessary to match that allowed by the AoE target, and use that size for subsequent I/O to that target.

The AoE subsystem can see the frame size configured in Solaris, and that available in the AoE target, but it can't see any limits imposed by switches. It's up to you to get that right.

A maximum frame size may be set for each AoE channel. This may be used only to reduce the frame size, not to increase it; the maximum size allowed by Solaris for the network device used by the channel cannot be exceeded. For each target accessed over this channel, the maximum size is further reduced if necessary to that supported by that target (discovered through part of the AoE protocol).

By default, AoE uses the largest size supported by the network device. Manual configuration may be needed to handle size limits in switches, or for debugging. It is not a substitute for properly configuring Sun's network drivers.

To specify a maximum frame size, add any of the following arguments to the channel's entry in /etc/aoe.conf:

mtu n
Frame data-payload size may be no greater than n bytes. If n is zero, the largest size supported by this channel's network device is used.
maxdata n
AoE ATA data transfers may be no larger than n bytes, rounded down to the nearest multiple of 512. maxdata n means the same as mtu n+22.
jumbo
nojumbo
Shorthand for mtu 0 (jumbo) or mtu 1500 (nojumbo).

The current maximum transfer size for a particular target is shown in the maxdata field of the conf kstat table.
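The way these limits combine can be modelled with a short sketch. It assumes a POSIX shell; effective_maxdata is a hypothetical helper, and expressing the target's limit as a payload size in bytes is a simplification made for illustration (the driver discovers the real limit through the AoE protocol).

```shell
# Sketch only: models how the limits described above might combine.
AOE_HDR=22   # AoE header bytes, from the text
SECTOR=512   # bytes per sector

# effective_maxdata DRIVER_MTU CHANNEL_MTU TARGET_MTU
# CHANNEL_MTU of 0 means "no channel limit", like the mtu 0 argument.
effective_maxdata() {
    mtu=$1
    # A channel setting may only reduce the size allowed by the driver.
    if [ "$2" -gt 0 ] && [ "$2" -lt "$mtu" ]; then mtu=$2; fi
    # The target's own limit (assumed here to be in bytes) reduces it further.
    if [ "$3" -lt "$mtu" ]; then mtu=$3; fi
    # Round the data portion down to a whole number of sectors.
    echo $(( (mtu - AOE_HDR) / SECTOR * SECTOR ))
}

effective_maxdata 9000 0 9000      # prints 8704
effective_maxdata 9000 4118 9000   # prints 4096 (channel set to mtu 4118)
effective_maxdata 9000 0 1500      # prints 1024 (target limited to standard frames)
```

In this model, mtu 4118 and maxdata 4096 describe the same limit, since 4096 + 22 = 4118.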

4.4.3. Jumbo frames and disk I/O speed

Is bigger always better? Apparently not.

Our tests suggest that maxdata values of at least 3-4 kibibytes (6-8 sectors) produce large speed improvements over the default 1kiB (2 sectors); reads and writes through the file system are 2-3 times as fast. With even larger frames, the story is not as clear. For some operations speeds continue to increase, though not as markedly; for others there appears to be an optimal size (varying with type of operation and type of disk) after which speeds drop a little, though again not markedly.

It is not a bad idea just to use the largest value supported by your hardware and software; I/O speed will be much better than with standard 1500-byte frames, and the largest value will cause the least system overhead within Solaris. If it's important to squeeze every possible drop of speed out of your system, you may want to experiment to find the maxdata value that best matches your network, device drivers, disks, and application mix.

Those who do run their own tests are invited to share them with us, or to contact us for details about the simple tests we have used so far.

4.5. More monitoring

On Solaris 8 and newer systems, additional statistics can be read via a special kstat(1M) table called aoedN,stats, where N is the instance number appearing in device names (as in /dev/rdsk/cadNs0). For example, to check the statistics for target 0/0/5:

# kstat -n aoed5,stats
module: aoed            instance: 5
name:   aoed5,stats     class:    misc

        atarcv          316771
        atasnd          316771
        crtime          83.211228318
        erraoe          0
        errata          0
        errtimeout      0
        errtrunc        0
        errwmsg         0
        retries         0
        snaptime        215074.845901558
        softerr         0

crtime and snaptime are standard values supplied by the system: when data-gathering began and when the data shown were last updated, in seconds since the system booted. Other values are counters maintained by the AoE disk driver, separately for each AoE target:

atarcv
atasnd
ATA-command responses received (atarcv) and ATA commands sent (atasnd).
erraoe
AoE protocol errors.
errata
ATA (disk) errors.
errtimeout
Hard timeouts (no response received even after several retries).
errtrunc
Truncated AoE responses.
errwmsg
AoE requests that couldn't be sent (likely a problem with the network channel).
retries
Retransmissions because a response wasn't received after a standard timeout interval (that listed as timeout in the conf table, below).
softerr
Soft (ECC-corrected) errors reported by the drive.
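For routine monitoring it may be handy to show only the counters that indicate trouble. The sketch below filters kstat output of the form shown above and prints any error, retry, or soft-error counter that is not zero. It assumes a POSIX shell and awk; the function name nonzero_errors is hypothetical.

```shell
# Sketch only: nonzero_errors is a hypothetical helper.
# Reads kstat output on standard input; prints name=value for each
# err* counter, retries, or softerr whose value is not zero.
nonzero_errors() {
    awk '($1 ~ /^err/ || $1 == "retries" || $1 == "softerr") && $2 + 0 != 0 {
        print $1 "=" $2
    }'
}

# Typical use on a system with AoE installed:
#   kstat -n aoed5,stats | nonzero_errors
```

No output means that all of those counters are zero.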

Static configuration info may be retrieved from kstat table aoedN,conf, where N is the instance number:

# kstat -n aoed5,conf
module: aoed            instance: 5
name:   aoed5,conf      class:    misc

        crtime          83.211229535
        cylsize         56960
        maxbuf          3
        maxdata         1024
        mibsize         39266
        snaptime        215076.003383030
        timeout         200

The aoed-specific values are:
cylsize
Number of sectors per cylinder for this drive, discovered from the disk or (more often) invented by the driver.
maxbuf
Maximum number of unacknowledged AoE commands allowed by this target.
maxdata
Maximum data-segment size agreed among aoestart, network device, and target.
mibsize
Size of this disk, in mebibytes.
timeout
I/O operation timeout, in milliseconds.

4.6. Tools and control programs

See the manual pages supplied with the software for details. Except as noted, programs are installed in directory /usr/sbin, normally part of the super-user's shell search path.

4.6.1. Aoectl

Aoectl issues control commands to the AoE subsystem:

aoectl list address ...
List enabled targets at each address; if none given, all enabled targets.
aoectl enable address ...
aoectl disable address ...
Enable or disable targets at the addresses.
aoectl probe address ...
Broadcast a check-in request for each target address; if none given, for the wildcard address on every active aoechan. Any responding targets are enabled by aoemon.

Addresses are target addresses of the form aoechan/aoemaj/aoemin, like 0/2/4. If an address field is missing or empty, it matches any value.

For example:

aoectl list 0
List all enabled targets on aoechan 0.
aoectl probe 0/0/5 0/0/7 0//2
Probe for 0/0/5, 0/0/7, and any target on aoechan 0 with aoemin 2 no matter what its aoemaj.
aoectl disable 0/1
Disable all targets with aoechan 0, aoemaj 1.
aoectl list
List all enabled targets.
aoectl probe
Broadcast a wildcard check-in request to every active aoechan.
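The wildcard rule for address fields can be modelled in a few lines. This is only a sketch of the matching rule as described above, not aoectl's actual code; it assumes a POSIX shell, and the names field and addr_matches are hypothetical.

```shell
# Sketch only: models "a missing or empty field matches any value".
field() {
    # field N ADDRESS: print the Nth /-separated field of ADDRESS.
    # Padding with // makes missing trailing fields read as empty.
    echo "$2//" | cut -d/ -f"$1"
}

addr_matches() {
    # addr_matches PATTERN ADDRESS
    # PATTERN may omit fields (0, 0/1, 0//2); ADDRESS is a full chan/maj/min.
    for n in 1 2 3; do
        want=$(field "$n" "$1")
        have=$(field "$n" "$2")
        if [ -n "$want" ] && [ "$want" != "$have" ]; then
            return 1
        fi
    done
    return 0
}

addr_matches 0//2 0/5/2 && echo "0//2 matches 0/5/2"
addr_matches 0/1 0/2/4 || echo "0/1 does not match 0/2/4"
```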

Aoectl need not be used during normal operation, only in exceptional cases or for monitoring. aoectl list helps spot targets that should be there but aren't; aoectl probe may be useful if the network is rearranged while the system is running; aoectl disable will prevent programs like format from stalling because a target has broken or been removed.

4.6.2. Aoestart and aoestop

Aoestart starts a single aoechan according to its arguments. It is called automatically for each channel named in /etc/aoe.conf when AoE is started; normally it need not be run by hand. In fact each line of /etc/aoe.conf is simply used as arguments to aoestart.

Aoestop stops one, several, or all (-a) named aoechans. Normally there is no need to do this at all.

4.6.3. Aoemkconf

Aoemkconf turns target addresses into device declarations suitable for inclusion in aoed.conf. It simply prints the entries on the standard output; it's up to you to edit them into the configuration file.

An argument of the form 0/7/3 names a single target; 0/7/9-11 is shorthand for 0/7/9 0/7/10 0/7/11. Thus

aoemkconf 0/10/0-14 0/11/0-14
generates entries for two 15-slot EtherDrive shelves numbered 10 and 11. Instance numbers use the standard decimal encoding.
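The range shorthand can be illustrated with a small sketch that expands one argument into individual target addresses. It assumes a POSIX shell; expand_target is a hypothetical name, and the sketch shows only the address expansion, not the aoed.conf entries that aoemkconf prints.

```shell
# Sketch only: expand_target is a hypothetical helper, not part of AoE.
expand_target() {
    # $1 = address such as 0/7/9-11 or 0/7/3
    prefix=${1%/*}       # aoechan/aoemaj part, e.g. 0/7
    slots=${1##*/}       # last field, e.g. 9-11 or 3
    first=${slots%-*}    # start of range (or the single number)
    last=${slots#*-}     # end of range (or the single number)
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$prefix/$i"
        i=$((i + 1))
    done
}

expand_target 0/7/9-11   # prints 0/7/9, 0/7/10 and 0/7/11, one per line
```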

Arguments that don't look like target addresses are taken to be existing configuration files; aoemkconf silently omits entries whose instance numbers would duplicate any in the files. Hence if /usr/kernel/drv/aoed.conf already has some entries for shelf 10,

aoemkconf /usr/kernel/drv/aoed.conf 0/10/0-15
will print only the missing ones.

4.6.4. Aoelabinit

Aoelabinit writes a Solaris disk label to the single disk named as an argument:

aoelabinit /dev/rdsk/cad209

If the disk is already labelled, option -f (force) must be given:

aoelabinit -f /dev/rdsk/cad209

By default, aoelabinit invents a label as follows:

  • If the disk is no larger than 1TiB: VTOC label; subdevices s0 and s2 access the whole user-data part of the disk, excluding system-reserved areas.
  • If the disk is larger: EFI label; s0 spans the whole user-data part of the disk, excluding disk labels and a small mandatory system-reserved area.

The defaults may be overridden in several ways; see the manual page.

Aoelabinit is not meant to be a general-purpose disk labeller, just a workaround for the format bug forbidding access to large unlabelled disks.

4.6.5. Aoemon

Aoemon is a daemon that processes messages from the kernel AoE subsystem. It has two main functions:

  1. When a check-in response arrives, aoemon enables the corresponding target, and logs that it did so.
  2. When an error or an unexpected event occurs (e.g. a message arrives when the AoE disk driver is inactive, or in response to a request the system didn't make, or appears to be mangled) aoemon logs the details.

Aoemon logs through syslog facility daemon. It uses severity info to report that a target has been enabled, notice when an unexpected or ill-formed AoE message arrives, and error for conditions that prevent aoemon from working at all.

Aoemon is started automatically when AoE is started. Only one copy is needed for all aoechans. If aoemon is not running, AoE disk devices will still work, but targets will not be automatically enabled and some errors may go unreported.

4.6.6. Aoeunlabel

Aoeunlabel destroys any Solaris disk labels on the devices named as arguments:

aoeunlabel /dev/rdsk/cad312

Both VTOC and EFI labels are destroyed.

Aoeunlabel is unlikely to be needed except when a disk already EFI-labelled by a newer system is to be repartitioned on an older one.

4.6.7. Starting and stopping AoE

4.6.7.1. With SMF

On an SMF system, AoE is represented as service svc:/device/aoe. Normal SMF tools may be used:

svcs svc:/device/aoe
Display the state of the service: online if all is well; disabled if not enabled; maintenance or offline if the service couldn't be started. In the latter case, log files /var/svc/log/device-aoe:default.log and /etc/svc/volatile/device-aoe:default.log may offer clues.
svcadm enable svc:/device/aoe
Start AoE if not already running: start aoemon, then call aoestart for each channel listed in aoe.conf. Also record that AoE should be started automatically when the system boots.
svcadm disable svc:/device/aoe
Abruptly stop AoE if it was running: call aoestop for each active channel, then kill aoemon. Also record that AoE should not be started at boot. All AoE disks become inaccessible, even if still mounted or open.
svcadm clear svc:/device/aoe
Reset the service from maintenance state, e.g. if it failed to start when last enabled.
svcadm refresh svc:/device/aoe
Shut down any currently-active AoE channels, and restart AoE with the channels now listed in /etc/aoe.conf. Equivalent to svcadm disable svc:/device/aoe; svcadm enable svc:/device/aoe, but quicker to type and less disruptive to active AoE devices.

Once enabled, the AoE service is started automatically early in the boot process. In particular, the services that start swap devices and mount local file systems depend on svc:/device/aoe. AoE is started even for the single-user milestone, though if AoE fails to start, single-user mode still works.

Usually it is safe to abbreviate the service name to aoe.
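The states listed above suggest a simple check that could be run from a monitoring script. The sketch below maps a service state to a suggested next step; it assumes a POSIX shell, and aoe_advice is a hypothetical helper, not part of the AoE package.

```shell
# Sketch only: aoe_advice is a hypothetical helper.
# $1 = SMF state of svc:/device/aoe
aoe_advice() {
    case $1 in
        online)      echo "ok" ;;
        disabled)    echo "run: svcadm enable svc:/device/aoe" ;;
        maintenance) echo "check logs, then run: svcadm clear svc:/device/aoe" ;;
        offline)     echo "check logs: /var/svc/log/device-aoe:default.log" ;;
        *)           echo "unexpected state: $1" ;;
    esac
}
```

On an SMF system it might be called as aoe_advice "$(svcs -H -o state svc:/device/aoe)", where -H suppresses the header line and -o state selects the state column.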

4.6.7.2. Without SMF

On a non-SMF system, /etc/init.d/aoe is a conventional boot-and-shutdown shell script to start and shut down the AoE subsystem:

/etc/init.d/aoe start
Start AoE: start aoemon, call aoestart for each channel listed in aoe.conf.
/etc/init.d/aoe stop
Abruptly shut AoE down: call aoestop to stop all channels, then kill aoemon. All AoE disks become inaccessible, even if still mounted or open.
/etc/init.d/aoe restart
Shut down any currently-active AoE channels, start aoemon if it is not already running, then call aoestart for each channel listed in /etc/aoe.conf. Equivalent to aoe stop; aoe start but quicker to type and to execute, and less disruptive to active AoE devices.

AoE is started very early during the boot process, before any file system but the root and /usr has been mounted; thus any other file system may be put on an AoE disk, but all files used by the aoe script itself (the kernel drivers, aoestart, aoemon) must be in the root or /usr.

4.6.7.3. Customization

With or without SMF, parts of the initialization script may be customized by placing shell-variable declarations in file /etc/default/aoe.options. Here are some of the things that may be set:

AOEBIN=dir
Where to find aoestart and aoemon; default /usr/sbin.
AOECONF=file
Where to find the file declaring channels; default /etc/aoe.conf.
AOESTARTOPTS="options"
AOEMONOPTS="options"
Additional options for aoestart or aoemon; default empty.
AOEDIR=dir
Directory for open-channel files; default /etc/aoe.
AOEDIRPERM=perms
Permissions with which to create $AOEDIR if it doesn't exist; default 700, i.e. only the owner (the super-user) may look within.
AOEPRESTART="shell-commands"
AOEPOSTSTART="shell-commands"
AOEPRESTOP="shell-commands"
AOEPOSTSTOP="shell-commands"
Additional shell commands to be executed just before or just after starting channels, or just before or just after stopping them. May be helpful if magic commands are required to initialize network interfaces.
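As an illustration only, an /etc/default/aoe.options file might look like the sketch below. All values are hypothetical. In particular, it assumes that a hook string such as AOEPOSTSTART is run as an ordinary shell command line, which follows the description above but has not been checked against the script itself. logger(1) is the standard command for writing a message to syslog.

```shell
# Hypothetical /etc/default/aoe.options; values are illustrative only.

# Keep the channel declarations in the default place.
AOECONF=/etc/aoe.conf

# Let the owner's group look inside $AOEDIR as well (default is 700).
AOEDIRPERM=750

# Assumed to be run as a shell command once channels have been started.
AOEPOSTSTART="logger -p daemon.notice 'AoE channels started'"
```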

4.7. Driver and device file details

The AoE subsystem comprises three distinct device drivers.

A network device is made into an AoE channel by opening it, issuing Solaris DLPI commands to set device unit number and Ethernet protocol type, and pushing an instance of the aoecomm STREAMS module. There is a little more configuration to tell aoecomm the desired aoechan number, and to protect against denial-of-service attacks (anybody could push aoecomm on a pipe, for example). Aoestart does all this.

The aoed driver affords disk access, through both block and character devices. It uses an internal call to inform aoecomm of its interest in incoming AoE messages, and another call to send messages out. It also calls aoecomm when a device is opened, to see whether the corresponding target has been enabled; if not, the open call fails. Aoed creates 18 /devices/pseudo/aoed@* devices for each target: the eight standard block subdevices :a-:h, their raw counterparts :a,raw-:h,raw, and whole-disk subdevices :wd and :wd,raw.

The aoectl driver provides two control devices:

/devices/pseudo/aoectl@0:ctl
/dev/aoectl
Written by the aoectl program to issue control commands, by aoemon to enable devices, and by aoestart as part of channel configuration.
/devices/pseudo/aoectl@0:mon
/dev/aoemon
Read by aoemon. May not be reopened while open; hence cannot be opened again while aoemon is running, not even by another copy of aoemon.

The aoectl driver calls aoecomm to send AoE messages, to register its interest in unexpected, ill-formed, or error-reporting AoE messages (to be made available through /dev/aoemon), and to query or change the enabled-target list.

Each of these drivers is loaded into the kernel as needed: when the STREAMS module is pushed, or one of the device files is opened. Because the aoed and aoectl drivers depend on aoecomm, loading the former automatically loads the latter first, and aoecomm cannot be unloaded while either of the other two modules is loaded.


Footnotes

1. EtherDrive® is a registered trademark of Coraid, Inc.
2. The exact rule in the versions of Solaris we support:
  • A 32-bit system allows 18 bits for a minor device number. On a SPARC system, the driver creates 13 minor devices for each target, so 262144/13 = 20164 instances are allowed. On an x86 system, there are 21 devices, so the limit is 262144/21 = 12483.
  • A 64-bit system has 32-bit minor device numbers, so the calculation is 4294967296/13 = 330382099 or 4294967296/21 = 204522252.