This manual describes Version 1.4 of the Solaris ATA-over-Ethernet subsystem, which makes EtherDrive® Storage Blades available as disk devices. It explains how to install and configure the AoE package; how the disks appear to the rest of the Solaris system; and some quirks and implementation details. It is not a manual on how to install EtherDrive hardware.
Parts of the AoE subsystem work differently if the Solaris System Management Facility (SMF) is in use (svcs, svcadm, and svccfg(1M)). Solaris 10 normally uses SMF; Solaris 9 and earlier systems don't have it. When the difference matters, this manual will refer to SMF and non-SMF systems.
Parts may also behave differently on the slightly-different versions of Solaris on SPARC and x86 (Intel/AMD) hardware. Such differences will be annotated `on SPARC' or `on x86.'
This document uses the new binary-multiple prefixes described in the IEEE 1541-2002 standard. Prefixes like kilo, mega, giga, tera (k, M, G, T) refer to powers of ten; when the near-equivalent powers of two are meant, they are called kibi, mebi, gibi, tebi (ki, Mi, Gi, Ti). For example, a gigabyte (GB) is 10^9 bytes, while a gibibyte (GiB) is 2^30 (1073741824) bytes. See http://physics.nist.gov/cuu/Units/binary.html for more details.
The programs described here have been tested on Solaris 7, 8, 9, and 10 on 32- and 64-bit SPARC systems, and on Solaris 10 on 32- and 64-bit x86 (IA32) systems. They are unlikely to function correctly on Solaris 2.6 or older without additional programming work. Solaris 9/x86 may work, but has not been tested. They are likely to work on Open Solaris, but since that is a moving target, no promises can be made.
Coraid EtherDrive® storage blades and Coraid SR RAID controllers have been tested. Any target device conforming to version 1 of the AoE protocol specification should work, provided the attached (or emulated) ATA disk supports logical-block addressing. Any ATA disk manufactured since 1995 ought to work fine.
The driver supports both 24- and 48-bit sector addresses, allowing disks as large as 128 pebibytes. Older versions of Solaris may be limited to 1TiB; see below for the full scoop.
Version 1.4 has these additional features and carries these warnings:
Error occurred with device in use checking: No such device
warning: device in use checking failed: No such device
le network interface. Using a 100Mbps hme network card or a faster CPU (any UltraSPARC should be OK) works fine.
AoE (ATA over Ethernet) is a network protocol for accessing ATA devices over an Ethernet. An initiator system constructs an AoE message containing an ATA command such as `read sector,' and sends it to the Ethernet address of the desired target device; the target responds with an AoE message containing the resulting device status and data (if any). EtherDrive Storage Blades and SR RAID controllers are two examples of such targets, but the protocol is more general. The programs described here will work with any target that correctly implements the protocol.
AoE is defined at the Ethernet level, using an AoE-specific Ethernet protocol type and Ethernet MAC addresses. IP is not involved in any way. There is no inter-network routing: initiators and targets must share an Ethernet network. Neither is there any security built into the protocol: a target will process commands from any node on the same Ethernet. It is usually best to dedicate an Ethernet (or VLAN) to AoE traffic, or at least to share only with other secure traffic, e.g. SNMP or other management protocols. It is quite unwise to run AoE on a network into which a random person may plug his untrustworthy laptop.
Each target is identified by a pair of numbers, aoemaj and aoemin. For an EtherDrive blade, aoemaj is the shelf number, aoemin the slot; for an SR RAID controller, they are set through the configuration interface. The Solaris implementation allows AoE to be active on several network interfaces; each interface is assigned an aoechan number, and targets are addressed by the triplet (aoechan, aoemaj, aoemin). Error messages and support programs often express such target addresses as three numbers separated by slashes: 0/2/8 means the target with aoechan 0, aoemaj 2, aoemin 8.
The AoE subsystem is a single Solaris package distributed as a single (datastream format) file, usually named CORDaoe-version-arch, e.g. CORDaoe-1.4-x86. It may arrive compressed, e.g. CORDaoe-1.4-x86.Z; use uncompress(1) to unpack it.
To install the package, use pkgadd(1M) as super-user:
pkgadd -d CORDaoe-1.4-x86 CORDaoe
The uncompressed package file is about 900kiB; the installed subsystem takes about 1000kiB.
Pkgadd may warn that scripts are to be run as super-user. Installing the package runs add_drv(1M) to add three device drivers to the system, and uses awk to add entries to /etc/devlink.tab. (On Solaris 10, pkgadd claims /etc/devlink.tab is itself a set-userid file, apparently a garbling of the super-user script warning.) On an SMF system, installation runs svccfg(1M) to import the service manifest, and svcadm(1M) to work around a bug in establishing service dependencies.
Pkgadd may also warn that the ownership of the three /usr/share/man directories is being changed. The trouble is that these directories may not always exist, and there is no way both to tell pkgadd to create them if necessary and to tell it to leave existing permissions be.
To remove AoE from the system:
/etc/aoe.conf and /usr/kernel/drv/aoed.conf.
zpool export any ZFS pools on AoE disks; remove any AoE swap areas. Then run svcadm disable aoe (SMF system) or /etc/init.d/aoe stop (non-SMF).
/etc/aoe.conf, or remove or rename the file.
pkgrm CORDaoe. This calls rem_drv(1M) to remove the device drivers from the system configuration, then deletes all files created when the package was installed.
Rem_drv removes the device files in /devices, but may leave the symbolic links in /dev. That's just the way some versions of Solaris work. It's OK to remove the dangling links yourself.
To change from one version of AoE to another, remove the existing CORDaoe package and install the desired version as explained above. Remember to save /usr/kernel/drv/aoed.conf if you've customized it; otherwise it will be reset to its default contents when the new package is installed. If using ZFS, remember to zpool export and zpool import any pools already stored on AoE disks.
The package installs files in the following directories:
/usr/kernel/strmod
/usr/kernel/strmod/sparcv9
/usr/kernel/strmod/amd64
/usr/kernel/drv
/usr/kernel/drv/sparcv9
/usr/kernel/drv/amd64
sparcv9 or amd64) versions.
/usr/sbin
/etc/init.d
/etc/rc2.d
/etc/init.d/aoe is installed on both SMF and non-SMF systems, the /etc/rc2.d/S00aoe link only on a non-SMF system.
/lib/svc/method
/var/svc/manifest/device
/lib/svc/method/device-aoe and service manifest /var/svc/manifest/device/aoe.xml. Installed only on an SMF system.
/etc/aoe
/usr/share/man/man1m
/usr/share/man/man7d
/usr/share/man/man7m
/opt/CORDaoe/doc
/opt/CORDaoe/doc/README for details.
/opt/CORDaoe/lib
Before AoE targets can be accessed, the network channels AoE will use must be defined in file /etc/aoe.conf. Until this file has been set up, starting AoE silently does nothing. To set up a single channel, create aoe.conf with a single line declaring the device to be used. For example:
aoechan 0 ether /dev/bge2
Any Ethernet device Solaris supports may be used. Special configuration may be required to enable jumbo frames; see Sun's documentation for details.
The Ethernet device need not be dedicated to AoE; in particular it may be shared with IP, though as noted above, this is often unwise.
To use more than one network device for AoE, declare additional channels as explained below.
Normally nothing special is needed in aoe.conf to permit jumbo frames, but jumbo support may need to be enabled in Solaris for the network device to be used. See the discussion of jumbo frames for the full scoop.
Once aoe.conf has been set up, AoE must be started. On an SMF system:
svcadm enable svc:/device/aoe
On a non-SMF system:
/etc/init.d/aoe start
AoE disk devices are named similarly to directly attached disks. A block device has a name like /dev/dsk/cad107s2; the corresponding raw device is /dev/rdsk/cad107s2.
What does cad107s2 mean? In AoE's use of standard Sun device-name conventions:
ca: controller a. AoE disks are always ca.
d107: disk 107. The number is a decimal-digit encoding of the target address for this disk: cMMmm for target c/MM/mm, with leading zeroes discarded. Thus d107 is target 0/1/7; d3 is 0/0/3; d10319 is 1/3/19.
s2: partition 2. The eight standard Solaris partitions s0 - s7 (sixteen, s0 - s15, on x86) are allowed.
p1 - p4 refer to the four direct DOS partitions. Only primary DOS partitions are accessible; extended partitions are not supported. Unlike standard Sun drivers, DOS partitions are allowed even on SPARC systems. On x86 systems only, a Sun VTOC label may be encapsulated in a DOS partition; in that case the sn partitions refer to the VTOC partitions contained within that DOS partition, as with standard Sun drivers.
There is also a device name with no partition suffix, e.g. /dev/dsk/cad107; this accesses the whole disk regardless of the disk label, including the parts normally reserved for system use.
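As a rough illustration, the decimal encoding can be undone with a few lines of shell. This is only a sketch: the function name aoe_target_of is invented here, it assumes the default cMMmm convention described above, and it needs a POSIX shell such as ksh or bash rather than the traditional Bourne shell.

```shell
# Sketch: turn an AoE disk name such as cad10319s2 into its target address.
# Assumes the default cMMmm instance-number convention; aoe_target_of is a
# hypothetical helper, not part of the AoE package.
aoe_target_of() {
    name=${1##*/}                   # strip any /dev/dsk/ or /dev/rdsk/ prefix
    case $name in
        cad[0-9]*) ;;
        *) echo "not an AoE disk name: $1" >&2; return 1 ;;
    esac
    inst=${name#cad}                # drop the controller/disk prefix
    inst=${inst%%[sp]*}             # drop any sN or pN partition suffix
    case $inst in
        ''|*[!0-9]*) echo "not an AoE disk name: $1" >&2; return 1 ;;
    esac
    # Remove leading zeroes so the shell does not read the number as octal.
    while [ "${#inst}" -gt 1 ] && [ "${inst#0}" != "$inst" ]; do
        inst=${inst#0}
    done
    # cMMmm: channel, then two digits of aoemaj, then two digits of aoemin.
    echo "$((inst / 10000))/$((inst / 100 % 100))/$((inst % 100))"
}
```

For instance, aoe_target_of /dev/dsk/cad107s2 prints 0/1/7. Remember that the mapping is only a convention; a hand-written aoed.conf may use a different one.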
By default devices are created for targets 0/0/0 - 0/0/14 (i.e. cad0s* - cad14s*) and 0/1/0 - 0/1/14 (cad100s* - cad114s*): enough for two EtherDrive shelves or SR RAID devices numbered 0 and 1, each with up to fifteen blades. Making more devices requires editing a configuration file; in Solaris 9 and earlier, the system usually must be rebooted as well. See More devices for details.
In Solaris, names in /dev are symbolic links to real device files in /devices. For an AoE disk the real device name is of the form /devices/pseudo/aoed@inst:part, where inst is the instance number and part is a /devices-style partition name. For example, /dev/dsk/cad107s2 is a link to /devices/pseudo/aoed@107:c; /dev/rdsk/cad107 to /devices/pseudo/aoed@107:wd,raw.
Before an AoE disk device can be opened, it must be enabled. Normally this is done automatically: a target's check-in message is intercepted by aoemon; aoemon enables its devices, and reports to syslog that it did so.
aoectl list lists enabled targets. Aoectl may also be used to enable or disable specific targets, or to repeat the check-in broadcast so targets will repeat their responses. See the section on tools for details.
If a target isn't enabled but should be, try resetting it; e.g. pull the EtherDrive blade out of the shelf and push it back in, or put the SR lblade offline and then online. If all is well, within a few seconds aoemon should report that it has enabled that target.
Once enabled, AoE /dev/dsk and /dev/rdsk devices work like any other disks. Format(1M) can be used to set up partition tables (the partition submenu) and to run simple read and read-write disk tests (analyze), modulo a few bugs and quirks described below. Low-level formatting, bad-block repair, and defect-list management (the format, repair, and defect menus) are not supported.
A partition table may also be written with fmthard(1M) or read with prtvtoc(1M). On a brand-new huge drive, or one bearing a spurious old label, it may be necessary to use aoelabinit to initialize the label first.
If the disk doesn't have a label (partition table), the driver makes one up. In the made-up label, subdevice s0 (and s2 if the disk supports VTOC labels) is all of the disk intended for normal use, excluding parts reserved for use by diagnostics and the label itself.
Any file system except the root or /usr may be stored on an AoE disk. AoE is started very early in the boot process; AoE file systems may be listed in vfstab(4) and mounted automatically when the system boots. It is not possible to boot from an AoE device.
AoE disks may belong to ZFS storage pools, but beware the reliability problems (not specific to AoE) described below.
AoE disks may be used for swapping. AoE swap areas listed in vfstab will be added automatically. On a non-SMF system, swap areas are added twice, once before AoE has been started and once after. The first attempt produces a harmless message like
/dev/dsk/cad108s1: No such device or address
On an SMF system, if the device/aoe service is enabled, it is automatically started if possible in single-user mode. To use AoE in single-user mode on a non-SMF system, ensure that /usr is mounted if it is a separate file system, then run /etc/init.d/aoe start.
Support for disks larger than 1TiB is somewhat complicated, because of the way Solaris supports such disks.
The following is mostly a summary of Sun's documentation, which should be consulted for details.
The original Solaris disk-labelling scheme (VTOC) stored the size of each partition as a 32-bit signed sector count. This limited the size of a single disk subdevice (partition) to 2^31 sectors, or 1 tebibyte. The UFS file system format had similar limits: no file system could span more than 1TiB.
DOS partition labels are similarly constrained by 32-bit values to disks no larger than 1TiB.
The Solaris 9 4/03 release added a new labelling scheme (EFI), derived from an Intel specification. EFI labels store partition parameters in 64-bit unsigned values; a disk device or any single partition may span 2^64 sectors, or 8 zebibytes, or about eight billion tebibytes. Solaris 9 8/03 updated UFS to match: a new option to newfs(1M) creates a `multi-terabyte [sic] file system.' These upgrades may be installed on an older Solaris 9 system by installing the current recommended patch cluster; they are not available for Solaris 8 or older systems.
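The size limits quoted in this manual all follow from multiplying a power-of-two sector count by the 512-byte (2^9) sector size. The sketch below redoes that arithmetic on exponents, so nothing overflows the shell's 64-bit integers. The function name capacity_of_sector_bits is invented for illustration, and a POSIX shell such as ksh or bash is assumed.

```shell
# Sketch: capacity of a disk with 2^bits sectors of 512 (2^9) bytes each,
# printed with a binary prefix. capacity_of_sector_bits is a hypothetical name.
capacity_of_sector_bits() {
    exp=$(($1 + 9))                 # log2 of the capacity in bytes
    set -- B KiB MiB GiB TiB PiB EiB ZiB YiB
    shift $((exp / 10))             # each prefix step is a factor of 2^10
    echo "$((1 << (exp % 10))) $1"  # what is left over is below 2^10
}

capacity_of_sector_bits 31          # VTOC partition limit: 1 TiB
capacity_of_sector_bits 48          # 48-bit sector addresses: 128 PiB
capacity_of_sector_bits 64          # EFI partition limit: 8 ZiB
```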
EFI labels afford several benefits besides large-disk handling:
One new constraint is that all partitions in a single EFI label must be non-overlapping. In particular the Solaris tradition that subdevice s2 accesses the entire user-data part of the disk regardless of other partitions cannot be honoured under EFI.
To retain compatibility with existing disks and older systems, VTOC labels are still supported. Sun's standard tools write VTOC labels to 1TiB or smaller disks, EFI labels to larger disks. The system looks for a VTOC label first, then an EFI label; thus a small disk may have either type of label.
The Solaris AoE driver supports either labelling scheme.
If an unlabelled disk is no larger than 1TiB, the AoE driver pretends it has a VTOC label in which s0 and s2 both access the entire user-data part of the disk. The regular Solaris tools read and modify the label without trouble.
If an unlabelled disk is larger than 1TiB, the disk driver makes no pretense. The whole-disk subdevice (cadnnn) accesses the entire physical device; none of the standard partition subdevices (cadnnnsn) may be used. Sun's standard tools report no label until one is written; that is the best that can be done within Sun's disk-driver interface. Format(1M) cannot handle an unlabelled disk large enough to require EFI labels; use aoelabinit as a workaround.
If a disk already has a VTOC label, everything works in the standard way. The eight standard partitions s0-s7 work as described in the label; Sun's standard tools display the label and can modify it.
If a disk already has an EFI label, and the system accessing it runs a version of Solaris new enough to have EFI support, everything also just works. The eight standard partitions s0-s7 are defined by the first eight EFI partitions. Sun's standard tools display the label and can modify it, with minor bugs.
If a disk already has an EFI label, pre-EFI Solaris can access it with limitations. The AoE driver reads the label and defines the partitions; the corresponding s0-s7 subdevices work; but Sun's standard tools don't understand the EFI label and report errors if asked to display it. In particular, format on a pre-EFI system doesn't work at all with an EFI-labelled disk. Aoeunlabel will wipe out an existing EFI label, allowing the disk to be partitioned from scratch.
Pre-EFI Solaris running in 32-bit mode can access at most 1TiB from a disk partition, even if the EFI label defines it to be larger. Pre-EFI Solaris in 64-bit mode can access the whole partition no matter how big. Pre-EFI Solaris cannot make a UFS file system larger than 1TiB, nor can such a system read a larger file system written by a newer system.
The format program's format, repair, and defect menus are not supported on AoE disks.
An AoE target is listed in format's disk-selection menu only if it is enabled. If an enabled device is unreachable (broken, removed, mistakenly enabled by hand), format may stall for a few seconds trying to access it, as it would for a configured-but-inaccessible SCSI disk.
Format exhibits a few bugs when working with an AoE disk larger than 1TiB:
We have learned that ZFS is surprisingly non-robust in the face of I/O errors: if a disk reports repeated errors and the ZFS pool doesn't have enough remaining redundancy to allow the disk to be set aside, the whole system panics (crashes).
For example, the system will panic if:
raidz1 pool that is already DEGRADED, i.e. a disk has already failed and has not yet been replaced.
raidz2 pool in which two disks have already failed and neither has yet been replaced.
This is a problem in Solaris, not in the AoE code; it has been reported with iSCSI arrays as well. Rumour has it that Sun are aware of the problem and a fix is in the works.
We urge customers to take special care when using ZFS, unless it is acceptable for the entire system to crash because of a single broken disk. Use redundancy of one sort or another for all pools, whether by using the mirroring and RAID mechanisms built into ZFS or by using an external RAID mechanism (such as that built into Coraid's RAID appliances). Whatever redundancy is used, make use of hot spare mechanisms so that redundancy will be restored automatically as quickly as possible. Check regularly for failures and replace broken disks so that the hot-spare supply won't run out.
All of this is good practice for any modern storage system. That ZFS is crash-prone just sharpens the consequences of error.
AoE disks may belong to metadisks managed by the Solaris Volume Manager in Solaris 9 and newer systems. Concatenation, striping, mirroring, RAID 5, and soft partitions all work just as for any other disk. Solstice DiskSuite, the extra-cost predecessor to Volume Manager, has not been tested; it probably works but this is not a promise. Field reports are welcome.
On an old computer system, initializing a RAID group (metainit dnn -r) with four or more AoE-disk components may lock up the system. The system is actually OK, but so busy initializing RAID data that it cannot do much else. All returns to normal when RAID initialization has finished. A 140MHz Ultra-1 system exhibits the problem; an Ultra-2 with a single 300MHz processor does not. Any current system ought to be OK.
Kernel drivers in the AoE subsystem log error messages to syslog facility kern; in particular this is how disk errors are reported. All messages from the AoE disk driver contain the string aoed and give the device instance number and target address. Disk-error messages always contain the string ATA error and give the ATA error-register value and an ASCII interpretation of its contents. Messages about errors in the AoE part of a target (not the ATA disk part) say AoE error and give both the AoE error code and its conventional meaning.
User-mode programs responsible for starting AoE channels and for monitoring for unusual events (aoestart and aoemon, described in more detail below) log to syslog facility daemon. This log is where aoestart reports that an aoechan could not be started, and where aoemon reports that a target was automatically enabled. Aoemon also logs a packet dump when an error occurs: of the message received for an AoE-protocol or ATA error, of the message that couldn't be sent for a timeout.
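Because the driver's messages carry fixed strings, a log can be searched for them mechanically. The filter below is a sketch that relies only on the strings named above (aoed, ATA error, AoE error); the function name aoe_error_lines is invented here, and the log location in the usage note is an assumption that depends on the local syslog.conf.

```shell
# Sketch: print only AoE disk-driver error lines from a log read on standard
# input. Matches the strings described above; aoe_error_lines is a
# hypothetical helper name.
aoe_error_lines() {
    awk '/aoed/ && (/ATA error/ || /AoE error/)'
}
```

On a system where kern messages go to /var/adm/messages (the usual Solaris default, assumed here), it might be run as aoe_error_lines < /var/adm/messages.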
Iostat(1M) lists AoE disks that have been used at least once since the last boot. Additional AoE-specific statistics may be fetched with kstat(1M).
File /usr/kernel/drv/aoed.conf lists the AoE disk devices to be created. Only the devices listed when the aoed driver last processed the file (usually when the system last booted) will work, even if additional special files happen to exist, and even if other targets exist and have been enabled.
The AoE subsystem comes with a default aoed.conf file declaring an initial set of targets. If more are needed, the file must be edited and the system told to reread it.
aoed.conf is a Solaris driver configuration file, in the form described in driver.conf(4). The easiest way to add to the file is to generate new entries with aoemkconf and edit them in.
You may also write your own entries. To summarize the general Solaris rules: a driver configuration file contains lines of text. Anything following # is a comment. Empty lines are ignored. Non-empty lines declare devices or driver properties (parameters, given as name=number or name="string"). Each line must be terminated with a semi-colon.
Each line in aoed.conf declares a device instance, specifying an instance number and target address:
name="aoed" parent="pseudo" instance=inst aoechan=cno aoemaj=majno aoemin=minno;
name="aoed" parent="pseudo"
instance=inst
instance=102 produces /devices/pseudo/aoed@102:a, /dev/dsk/cad102s0, and so on.
aoechan=cno aoemaj=majno aoemin=minno
/majno/minno.
These optional fields are allowed:
timeout=ms
hd=nhead sec=nsec
Both hd and sec must be specified; if only one is supplied, it is ignored.
maxdata=size
maxbuf=ncmds
The optional parameters must be added by hand; aoemkconf won't put them in.
The declarations in aoed.conf, and nothing else, connect instance numbers to target addresses. That the instance number is a decimal encoding of the target address is only a convention; nowhere in the driver subsystem is this assumed. If you write your own entries you may use whatever mapping you like, as long as no inst value is used for more than one device, and none is too large: 20164 on a 32-bit SPARC system, 330382099 on 64-bit SPARC; 12483 and 204522252 on x86.
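For those writing entries by hand, the following sketch builds one declaration from a target address using the conventional decimal encoding. It follows the entry format shown above but is not aoemkconf and makes no claim to match its exact output; the function name aoed_conf_entry is invented here, and a POSIX shell such as ksh or bash is assumed.

```shell
# Sketch: print one aoed.conf declaration for target chan/maj/min, using the
# conventional cMMmm instance encoding. aoed_conf_entry is a hypothetical
# helper, not the aoemkconf program.
aoed_conf_entry() {
    chan=$1 maj=$2 min=$3
    # The two-digit fields of the convention cannot hold values above 99.
    if [ "$maj" -gt 99 ] || [ "$min" -gt 99 ]; then
        echo "aoemaj and aoemin must be below 100 for the cMMmm encoding" >&2
        return 1
    fi
    inst=$((chan * 10000 + maj * 100 + min))
    printf 'name="aoed" parent="pseudo" instance=%d aoechan=%d aoemaj=%d aoemin=%d;\n' \
        "$inst" "$chan" "$maj" "$min"
}
```

For example, aoed_conf_entry 0 1 7 prints an entry with instance=107. The sketch does not check the per-platform instance limits listed above.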
On Solaris 10, changes to aoed.conf can be made effective while the system is running:
# update_drv -f aoed
This unloads the aoed module if possible, reloads it, and updates /devices and /dev. If some AoE devices are in use (mounted, in use for swapping, device file open), update_drv will complain that the module could not be unloaded; new devices will be added, but existing devices whose aoed.conf entries have changed will not be updated until the next driver reload or system boot.
On Solaris 9 and older systems, changes to aoed.conf have no effect until the next time the aoed driver module is loaded, and /devices and /dev are not updated until Solaris is told to do so. The simplest way to change the AoE configuration is to update aoed.conf and then perform a configuration reboot: reboot -- -r if Solaris is running, boot -r at the Open Boot ok prompt.
It is also possible to make changes effective right away if no AoE device is in use, by unloading the aoed module and explicitly reconfiguring:
# modinfo | awk '$6 == "aoed"'
190 780a4000 a570 229 1 aoed (AoE disk driver v1.2)
# modunload -i 190
# devfsadm -i aoed
The -i argument to modunload is the first field of the line printed by modinfo. If modinfo doesn't list the aoed module, the driver wasn't loaded; skip ahead and run devfsadm.
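The manual steps above can be strung together. The sketch below only prints the commands it would run, so it can be checked before anything is unloaded; it reads modinfo output on standard input. The function name aoed_reconfig_plan is invented for illustration.

```shell
# Sketch: read modinfo output on standard input and print the commands needed
# to reconfigure the aoed driver. Nothing is executed; aoed_reconfig_plan is a
# hypothetical helper name.
aoed_reconfig_plan() {
    # The module id is the first field of the line whose sixth field is aoed.
    id=$(awk '$6 == "aoed" { print $1; exit }')
    if [ -n "$id" ]; then
        echo "modunload -i $id"
    fi
    # If the module was not loaded, only the reconfiguration step is needed.
    echo "devfsadm -i aoed"
}
```

It would be run as modinfo | aoed_reconfig_plan; with the modinfo line shown above it prints modunload -i 190 followed by devfsadm -i aoed.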
On Solaris 7, devfsadm may not exist (it was added by a patch) and in any case appears to be undocumented. The older equivalent, which works in either case, is:
# drvconfig -i aoed
# devlinks
It is wise to plan ahead, especially on a non-SMF system where the driver must be unloaded to reconfigure. For example, when installing a new EtherDrive shelf, add a device instance for each slot, not just for those you plan to use right away; when installing an EtherDrive Storage Appliance, create a few extra devices. To create hundreds of never-used devices would be silly, but the cost of a few extras is modest: a kibibyte or so inside the operating system per device, a handful of inodes for special files and symbolic links, extra entries in the /dev/dsk and /dev/rdsk directories.
If a target is removed while the system is running, it is prudent to disable it manually:
aoectl disable target ...
Removing a target's entry from /usr/kernel/drv/aoed.conf will save a little memory after the system is next booted. There's little point in doing this for a single target; it may make sense if many disks have been permanently removed.
When a target is removed from aoed.conf its entries in /devices or /dev may remain, even after a configuration reboot. Attempts to use such ghost devices return errors. If the same target is later restored to aoed.conf (and the driver reconfigured) the devices will work again. This is just the way some versions of Solaris work, especially older ones.
File /etc/aoe.conf assigns AoE network channels to Solaris Ethernet devices. Every device through which AoE devices will be accessed must be listed, with a distinct aoechan number.
To add network interfaces, add lines to that file, in the same format as the first:
aoechan channum ether device
Each channel must have a distinct channum in the range 0-9. Channels may not share a network device, though of course a multiport device like the Quad Fast Ethernet may support several channels, one to a port. The network device need not be dedicated to AoE; in particular it may also be configured as an IP interface.
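The rules just stated are easy to check mechanically before restarting AoE. The awk sketch below enforces only what this section says: each entry begins aoechan channum ether device, channum is a single digit 0-9, and neither a channum nor a device appears twice. The function name check_aoe_conf is invented here; whether aoe.conf accepts comment lines is not stated in this manual, so only blank lines are skipped. On older Solaris releases nawk may be needed in place of awk.

```shell
# Sketch: validate aoe.conf entries against the rules described above.
# Reads the named files, or standard input if none are given, and exits
# non-zero if any problem is found. check_aoe_conf is a hypothetical helper.
check_aoe_conf() {
    awk '
        BEGIN { bad = 0 }
        NF == 0 { next }                      # skip blank lines
        $1 != "aoechan" || $3 != "ether" || NF < 4 {
            printf "line %d: expected: aoechan channum ether device\n", NR
            bad = 1; next
        }
        $2 !~ /^[0-9]$/ {
            printf "line %d: channum %s is not in the range 0-9\n", NR, $2
            bad = 1
        }
        chan[$2]++ {                          # true from the second use on
            printf "line %d: channum %s is used more than once\n", NR, $2
            bad = 1
        }
        dev[$4]++ {
            printf "line %d: device %s is shared by two channels\n", NR, $4
            bad = 1
        }
        END { exit bad }
    ' "$@"
}
```

A typical use would be check_aoe_conf /etc/aoe.conf; extra arguments after the device name are accepted without inspection.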
Changes to aoe.conf take effect when the system is next booted, or when svcadm refresh svc:/device/aoe (SMF system) or /etc/init.d/aoe restart (non-SMF) is run. Adding an aoechan usually requires adding entries to aoed.conf, so a non-SMF system usually must be booted anyway.
The Ethernet standard allows at most 1500 bytes per frame. The AoE header for an ATA command occupies 22 bytes; transfers must be in whole 512-byte sectors. Thus only two sectors may be read or written at a time using standard Ethernet.
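The two-sector figure is simple arithmetic, and the same sum shows what a larger frame buys. The sketch below uses only the numbers given above (a 22-byte AoE header and 512-byte sectors); the 9000-byte frame size is just an example value, the names are invented for illustration, and a POSIX shell such as ksh or bash is assumed.

```shell
# Sketch: whole sectors that fit in one frame of a given size, using the
# 22-byte header and 512-byte sector sizes stated above. The names are
# hypothetical.
AOE_HEADER=22
SECTOR=512

sectors_per_frame() {
    echo $(( ($1 - AOE_HEADER) / SECTOR ))
}

sectors_per_frame 1500      # standard Ethernet: 2 sectors (1024 bytes)
sectors_per_frame 9000      # an example jumbo size: 17 sectors (8704 bytes)
```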
Per-frame overhead is quite significant in gigabit Ethernet, not just for AoE but for many protocols. Hence many manufacturers of gigabit network cards and switches allow larger frames, so that more data may be packed into a single frame and fewer frames sent. Usually the maximum size is about 9000 bytes, but it varies by product.
Jumbo frames can speed up AoE I/O quite a bit. Usually it is necessary to enable them explicitly in Solaris, and some care may be needed to make everything work right.
For jumbo frames to work, they must be supported in three places:
ce device (e.g. the add-on GigaSwift adapter), but not the bge device (e.g. the embedded Ethernet devices on many Sun motherboards). Beware that jumbo support in early versions of Sun's drivers may not have been well-tested, especially in the ways AoE uses it. In particular, the earliest version of the Solaris 10 ce driver may be prone to crash when used with AoE. It is wise to patch such drivers to their most-recent versions before using jumbos.
To make AoE use jumbo frames:
The AoE subsystem can see the frame size configured in Solaris, and that available in the AoE target, but it can't see any limits imposed by switches. It's up to you to get that right.
By default, AoE uses the largest size supported by the network device. Manual configuration may be needed to handle size limits in switches, or for debugging. It is not a substitute for properly configuring Sun's network drivers.
To specify a maximum frame size, add any of the following arguments to the channel's entry in /etc/aoe.conf:
mtu n
maxdata n
maxdata n means the same as mtu n+22.
jumbo
nojumbo
mtu 0 (jumbo) or mtu 1500 (nojumbo).
The current maximum transfer size for a particular target is shown in the maxdata field of the conf kstat table.
Is bigger always better? Apparently not. Our tests suggest that maxdata values of at least 3-4 kibibytes (6-8 sectors) produce large speed improvements over the default 1kiB (2 sectors); reads and writes through the file system are 2-3 times as fast. With even larger frames, the story is not as clear. For some operations speeds continue to increase, though not as markedly; for others there appears to be an optimal size (varying with type of operation and type of disk) after which speeds drop a little, though again not markedly.
It is not a bad idea just to use the largest value supported by your hardware and software; I/O speed will be much better than with standard 1500-byte frames, and the largest value will cause the least system overhead within Solaris. If it's important to squeeze every possible drop of speed out of your system, you may want to experiment to find the maxdata value that best matches your network, device drivers, disks, and application mix.
Those who do run their own tests are invited to share them with us, or to contact us for details about the simple tests we have used so far.
On Solaris 8 and newer systems, additional statistics can be read via a special kstat(1M) table called aoedinst,stats where inst is the number appearing in device names (as /dev/rdsk/cadinsts0). For example, to check the statistics for target 0/0/5:
# kstat -n aoed5,stats
module: aoed                            instance: 5
name:   aoed5,stats                     class:    misc
        atarcv                          316771
        atasnd                          316771
        crtime                          83.211228318
        erraoe                          0
        errata                          0
        errtimeout                      0
        errtrunc                        0
        errwmsg                         0
        retries                         0
        snaptime                        215074.845901558
        softerr                         0
crtime and snaptime are standard values supplied by the system: when data-gathering began and when the data shown were last updated, in seconds since the system booted. Other values are counters maintained by the AoE disk driver, separately for each AoE target:
atarcv
atasnd
erraoe
errata
errtimeout
errtrunc
errwmsg
retries
timeout
in the
conf
table,
below).
softerr
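When watching many targets it can help to boil a stats table down to one number. The sketch below adds up every counter whose name begins with err; it assumes kstat prints each counter as a name followed by its value on one line, and the function name aoed_err_total is invented here. On older Solaris releases nawk may be needed in place of awk.

```shell
# Sketch: sum the err* counters from "kstat -n aoedN,stats" output read on
# standard input. Assumes one "name value" pair per line; aoed_err_total is a
# hypothetical helper name.
aoed_err_total() {
    awk '$1 ~ /^err/ && NF == 2 { total += $2 } END { print total + 0 }'
}
```

It would be used as kstat -n aoed5,stats | aoed_err_total, printing 0 for a target that has seen no errors.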
Static configuration info may be retrieved from kstat table aoedinst,conf:
# kstat -n aoed5,conf
module: aoed                            instance: 5
name:   aoed5,conf                      class:    misc
        crtime                          83.211229535
        cylsize                         56960
        maxbuf                          3
        maxdata                         1024
        mibsize                         39266
        snaptime                        215076.003383030
        timeout                         200
aoed-specific values are:
cylsize
maxbuf
maxdata
mibsize
timeout
See the manual pages supplied with the software for details. Except as noted, programs are installed in directory /usr/sbin, normally part of the super-user's shell search path.
Aoectl issues control commands to the AoE subsystem:
aoectl list address ...
aoectl enable address ...
aoectl disable address ...
aoectl probe address ...
0/2/4. If an address field is missing or empty, it matches any value. For example:
aoectl list 0
aoectl probe 0/0/5 0/0/7 0//2
0/0/5, 0/0/7, and any target on aoechan 0 with aoemin 2 no matter what its aoemaj.
aoectl disable 0/1
aoectl list
aoectl probe
Aoectl need not be used during normal operation, only in exceptional cases or for monitoring. aoectl list helps spot targets that should be there but aren't; aoectl probe may be useful if the network is rearranged while the system is running; aoectl disable will prevent programs like format from stalling because a target has broken or been removed.
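The rule that a missing or empty address field matches any value can be modelled in a few lines. The sketch below is not how aoectl is implemented; it only illustrates the matching rule as described, with an invented function name, and assumes a POSIX shell such as ksh or bash.

```shell
# Sketch: succeed if target address $2 (chan/maj/min) is matched by address
# pattern $1, where a missing or empty field matches any value.
# aoe_addr_matches is a hypothetical helper, not part of aoectl.
aoe_addr_matches() {
    pat=$1///                       # pad so missing fields read as empty
    tgt=$2
    for field in chan maj min; do
        p=${pat%%/*}; pat=${pat#*/} # peel off the next pattern field
        t=${tgt%%/*}; tgt=${tgt#*/} # and the next target field
        [ -z "$p" ] || [ "$p" = "$t" ] || return 1
    done
    return 0
}
```

Under this rule 0//2 matches 0/3/2, 0 matches every target on aoechan 0, and 0/1 does not match 0/2/7.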
Aoestart starts a single aoechan according to its arguments. It is called automatically for each channel named in /etc/aoe.conf when AoE is started; normally it need not be run by hand. In fact each line of /etc/aoe.conf is simply used as arguments to aoestart.
Aoestop stops one, several, or all (-a) named aoechans. Normally there is no need to do this at all.
Aoemkconf
turns target addresses
into device declarations
suitable for inclusion in
aoed.conf
.
It simply prints the entries on the standard output;
it's up to you to edit them into the configuration file.
An argument like form
0/7/3
names a single target;
0/7/9-11
is shorthand for
0/7/9
0/7/10
0/7/11
.
Thus
aoemkconf 0/10/0-14 0/11/0-14
Arguments that don't look like target addresses are taken to be existing configuration files; aoemkconf silently omits entries whose instance numbers would duplicate any in the files. Hence if /usr/kernel/drv/aoed.conf already has some entries for shelf 10,
aoemkconf /usr/kernel/drv/aoed.conf 0/10/0-15
prints declarations only for those shelf-10 targets whose instance numbers are not already in that file.
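The range shorthand accepted by aoemkconf can be illustrated with a small shell function. This is a sketch only: expand_aoe_range is a hypothetical name, it assumes the range appears in the last (aoemin) field as in the examples above, and it prints bare addresses rather than aoed.conf declarations.

```shell
# Hypothetical helper (not shipped with AoE): expand 0/7/9-11 into
# 0/7/9, 0/7/10 and 0/7/11, one address per line.
# Assumes any range is in the last field, as in the manual's examples.
expand_aoe_range() {
    addr=$1
    prefix=${addr%/*}     # channel/major part, e.g. 0/7
    last=${addr##*/}      # minor number or range, e.g. 9-11
    case $last in
    *-*)
        lo=${last%-*}
        hi=${last#*-}
        ;;
    *)
        lo=$last
        hi=$last
        ;;
    esac
    i=$lo
    while [ "$i" -le "$hi" ]; do
        echo "$prefix/$i"
        i=$((i + 1))
    done
}

# A single address passes through unchanged; a range is expanded.
expand_aoe_range 0/7/3
expand_aoe_range 0/7/9-11
```

Note that a range such as 0/10/0-14 covers 15 targets, since both ends are included.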
Aoelabinit writes a Solaris disk label to the single disk named as an argument:
aoelabinit /dev/rdsk/cad209
-f (force) must be given:
aoelabinit -f /dev/rdsk/cad209
By default, aoelabinit invents a label as follows:
s0 and s2 access the whole user-data part of the disk, excluding system-reserved areas.
s0 spans the whole user-data part of the disk, excluding disk labels and a small mandatory system-reserved area.
The defaults may be overridden in several ways; see the manual page.
Aoelabinit is not meant to be a general-purpose disk labeller, just a workaround for the format bug forbidding access to large unlabelled disks.
Aoemon is a daemon that processes messages from the kernel AoE subsystem. It has two main functions:
Aoemon uses syslog facility daemon. Severity info is used to report that a target has been enabled; notice when an unexpected or ill-formed AoE message arrives; error for error conditions that prevent aoemon from working at all.
Aoemon is started automatically when AoE is started. Only one copy is needed for all aoechans. If aoemon is not running, AoE disk devices will still work, but targets will not be automatically enabled and some errors may go unreported.
Aoeunlabel destroys any Solaris disk labels on the devices named as arguments:
aoeunlabel /dev/rdsk/cad312
Aoeunlabel is unlikely to be needed except when a disk already EFI-labelled by a newer system is to be repartitioned on an older one.
On an SMF system, AoE is represented as service svc:/device/aoe. Normal SMF tools may be used:
svcs svc:/device/aoe
Reports online if all is well; disabled if not enabled; maintenance or offline if the service couldn't be started. In the latter case, log files /var/svc/log/device-aoe:default.log and /etc/svc/volatile/device-aoe:default.log may offer clues.
svcadm enable svc:/device/aoe
Starts AoE as configured in aoe.conf. Remember that AoE should be started automatically when the system boots.
svcadm disable svc:/device/aoe
svcadm clear svc:/device/aoe
Clears the service's maintenance state, e.g. if it failed to start when last enabled.
svcadm refresh svc:/device/aoe
Rereads /etc/aoe.conf. Equivalent to svcadm disable svc:/device/aoe; svcadm enable svc:/device/aoe, but quicker to type and less disruptive to active AoE devices.
Once enabled, the AoE service is started automatically early in the boot process. In particular the services that start swap devices and mount local file systems depend on svc:/device/aoe, and AoE is started even for the single-user milestone, though if AoE fails to start single-user mode still works. Usually it is safe to abbreviate the service name to aoe.
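The service states listed above, and the usual response to each, can be summarized in a short shell function. This is a sketch under stated assumptions: aoe_state_advice is a hypothetical helper, not part of the AoE package, and the commented usage line relies on the standard svcs options -H (no header) and -o state (print only the state column).

```shell
# Hypothetical helper (not shipped with AoE): turn an SMF state name
# into a short hint, following the states described in this manual.
aoe_state_advice() {
    case $1 in
    online)
        echo "AoE is running"
        ;;
    disabled)
        echo "AoE is not enabled; try: svcadm enable svc:/device/aoe"
        ;;
    maintenance|offline)
        echo "AoE could not be started; see /var/svc/log/device-aoe:default.log"
        ;;
    *)
        echo "unexpected state: $1"
        ;;
    esac
}

# On an SMF system this might be used as:
#   aoe_state_advice "$(svcs -H -o state svc:/device/aoe)"
aoe_state_advice maintenance
```

After the underlying problem has been fixed, a service left in the maintenance state still needs svcadm clear before it will start again.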
On a non-SMF system, /etc/init.d/aoe is a conventional boot-and-shutdown shell script to start and shut down the AoE subsystem:
/etc/init.d/aoe start
Starts AoE as configured in aoe.conf.
/etc/init.d/aoe stop
/etc/init.d/aoe restart
Restarts AoE according to /etc/aoe.conf. Equivalent to aoe stop; aoe start but quicker to type and to execute, and less disruptive to active AoE devices.
The script runs after /usr has been mounted; thus any other file system may be put on an AoE disk, but all files used by the aoe script itself (the kernel drivers, aoestart, aoemon) must be in the root file system or /usr.
With or without SMF, parts of the initialization script may be customized by placing shell-variable declarations in file /etc/default/aoe.options. Here are some of the things that may be set:
AOEBIN=dir
Default /usr/sbin.
AOECONF=file
Default /etc/aoe.conf.
AOESTARTOPTS="options"
AOEMONOPTS="options"
AOEDIR=dir
Default /etc/aoe.
AOEDIRPERM=perms
Permissions with which to create $AOEDIR if it doesn't exist; default 700, i.e. only the owner (the super-user) may look within.
AOEPRESTART="shell-commands"
AOEPOSTSTART="shell-commands"
AOEPRESTOP="shell-commands"
AOEPOSTSTOP="shell-commands"
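As an illustration, an /etc/default/aoe.options file using the variables listed above might look like the sketch below. The first four values simply restate the defaults given in this manual. The two hook lines are assumptions: the variable names suggest that the commands run before AoE starts and after it stops, and the logger messages are invented for the example. See the manual pages for the authoritative meaning of each variable.

```shell
# /etc/default/aoe.options (illustrative values only)

# These restate the defaults documented in the manual.
AOEBIN=/usr/sbin
AOECONF=/etc/aoe.conf
AOEDIR=/etc/aoe
AOEDIRPERM=700

# Assumed meaning, inferred from the names: commands run just before
# AoE is started and just after it is stopped. The messages are made up.
AOEPRESTART='logger -p daemon.info "AoE starting"'
AOEPOSTSTOP='logger -p daemon.info "AoE stopped"'
```

Because the file holds ordinary shell-variable declarations, values containing spaces or quotes need normal shell quoting, as in the two hook lines.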
The AoE subsystem comprises three distinct device drivers.
A network device is made into an AoE channel by opening it, issuing Solaris DLPI commands to set device unit number and Ethernet protocol type, and pushing an instance of the aoecomm STREAMS module. There is a little more configuration to tell aoecomm the desired aoechan number, and to protect against denial-of-service attacks (anybody could push aoecomm on a pipe, for example). Aoestart does all this.
The aoed driver affords disk access, through both block and character devices. It uses an internal call to inform aoecomm of its interest in incoming AoE messages, and another call to send messages out. It also calls aoecomm when a device is opened, to see whether the corresponding target has been enabled; if not, the open call fails.
Aoed creates 18 /devices/pseudo/aoed@* devices for each target: the eight standard block subdevices :a-:h, their raw counterparts :a,raw-:h,raw, and whole-disk subdevices :wd and :wd,raw.
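The count of 18 devices per target follows directly from the naming scheme above: eight block subdevices, eight raw counterparts, and two whole-disk subdevices. The sketch below just enumerates the suffixes; aoed_minor_names is a hypothetical helper, and the part of each device name before the colon is deliberately left out.

```shell
# Hypothetical helper (not shipped with AoE): print the subdevice
# suffixes that aoed creates for one target, one per line.
aoed_minor_names() {
    # Eight standard block subdevices :a through :h.
    for s in a b c d e f g h; do
        echo ":$s"
    done
    # Their raw (character) counterparts.
    for s in a b c d e f g h; do
        echo ":$s,raw"
    done
    # Whole-disk subdevices, block and raw.
    echo ":wd"
    echo ":wd,raw"
}

aoed_minor_names | wc -l
```

The final command counts the suffixes and reports 18, matching the figure in the text.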
The aoectl driver provides two control devices:
/devices/pseudo/aoectl@0:ctl
/dev/aoectl
/devices/pseudo/aoectl@0:mon
/dev/aoemon
/dev/aoemon), and to query or change the enabled-target list.
Each of these drivers is loaded into the kernel as needed: when the STREAMS module is pushed, or one of the device files is opened. Because the aoed and aoectl drivers depend on aoecomm, loading the former automatically loads the latter first, and aoecomm cannot be unloaded while either of the other two modules is loaded.