This is a description of the Solaris ATA-over-Ethernet (AoE) device-driver implementation, version 1.4.
This is not a tutorial on the Solaris operating system, device-driver and STREAMS subsystems, or kernel; the AoE protocol; or C programming. Neither does it explain how to install and operate the AoE subsystem. For background on these topics and others, the reader should have access to the following references:
/opt/CORDaoe/doc/aoe-guide.html
in the customer distribution.
http://docs.sun.com
.
There are different editions
for different versions of Solaris.
http://www.coraid.com
.
http://www.opengroup.org
.
http://docs.sun.com
.
Certain parts of this subsystem behave differently when newer Solaris features are present:
This document uses the new binary-multiple prefixes
described in the IEEE 1541-2002 standard.
Prefixes like kilo, mega, giga, tera
(k, M, G, T)
refer to powers of ten;
when the near-equivalent powers of two are meant,
they are called kibi, mebi, gibi, tebi
(ki, Mi, Gi, Ti).
For example,
a gigabyte (GB) is
10^9
bytes,
while
a gibibyte (GiB) is
2^30 bytes
(1073741824).
For more details
see
http://physics.nist.gov/cuu/Units/binary.html
.
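The difference matters when disk capacities are quoted. As a small illustration (the macro and function names here are invented for this example and do not appear in the AoE source), the two units can be written and compared like this:

```c
#include <stdint.h>

/* Decimal (SI) and binary (IEEE 1541) multiples, in bytes. */
#define GB	UINT64_C(1000000000)	/* 10^9 */
#define GiB	(UINT64_C(1) << 30)	/* 2^30 = 1073741824 */

/*
 * Whole gibibytes contained in a capacity given in decimal gigabytes.
 * Hypothetical helper, for illustration only.
 */
static uint64_t
gb_to_whole_gib(uint64_t gigabytes)
{
	return gigabytes * GB / GiB;
}
```

A disk sold as 500 GB therefore holds only 465 whole GiB.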
The AoE subsystem comprises several modules executing in kernel mode and several user-mode utility programs.
These are the kernel parts. Each is a loadable kernel module, supplied in both 32- and 64-bit binary versions.
aoecomm
aoecomm
to send messages,
to arrange to receive messages,
and to query and update a registry of AoE targets
known to exist.
aoectl
/dev/aoectl
,
to which configuration commands may be written;
/dev/aoemon
,
from which error reports and unsolicited AoE messages
may be read.
aoed
/dev/dsk
and
/dev/rdsk
directories
implementing normal Solaris disk semantics,
sufficient to allow file systems to be created and mounted
and to support standard maintenance programs
like
fsck
and
format(1M).
Implements the required Solaris disk-specific
ioctl
calls,
including some Sun doesn't bother to document.
These are the
user-mode parts.
Except as noted,
each is a binary executable program
installed in directory
/sbin
.
Only 32-bit binaries are supplied;
they run fine on 64-bit systems.
aoectl
driver;
e.g. query or tweak the active-target registry,
broadcast a Query-Config request to cause targets
to identify themselves.
aoectl
driver,
logging errors and unusual events
and updating the registry as Query-Config responses
are received.
aoed
.
Entries can be written by hand as well;
this is just a convenience.
aoe.xml
/etc/aoe.conf
.
On a non-SMF system,
aoe
is stored in directory
/etc/init.d
and linked into
/etc/rc2.d
,
so it will be run when the system boots.
If SMF is installed and active,
aoe
is stored in directory
/lib/svc/manifest/device-aoe
and
aoe.xml
added to the SMF inventory,
so that AoE will be started when service
svc:/device/aoe
is enabled.
All of these programs,
along with default configuration files
and manual pages and other documentation,
are bundled into a single Solaris package
named
CORDaoe
for distribution.
To install AoE,
a system administrator runs
pkgadd(1M),
edits a configuration file or two,
and starts the subsystem:
on a non-SMF system,
/etc/init.d/aoe
start
or a reboot;
with SMF,
svcadm
enable
svc:/device/aoe
.
See the
Installation and Operation Guide
for further operational details.
The source code tree contains these directories:
kern
kern
directory,
64-bit binaries
(whether for SPARC or x86)
in subdirectory
kern/bin64
.
user
include
man
-man
format.
doc
pkg
pkg/data
contains files to be assembled into the eventual package,
copied there by
make
pkg
in each of the directories above.
The resulting package is stored in subdirectory
pkg/CORDaoe
in file system format,
and file
pkg/CORDaoe.Z
in compressed stream format.
The build process is controlled by
make(1).
Makefile
s
come ready-to-use;
there is no system-dependent configuration process,
automatic or otherwise.
Sun's compilation tools should be used:
Solaris
make,
included in the standard
SUNWsprot
package,
as
/usr/ccs/bin/make
;
Sun Studio C,
available at no cost from
www.opensolaris.org
.
Gcc
is not recommended,
and probably cannot correctly compile
the current kernel-mode code.
At present the code may be compiled only on Solaris 10, though the resulting binaries will run on systems as old as Solaris 7.
Each source-file directory
kern
,
user
,
man
,
doc
has a
Makefile
with these targets
(except as noted):
make
(default target)
make
install
man/Makefile
or
doc/Makefile
.
make
pkg
pkg/data
.
Create subdirectories within
pkg/data
as needed.
make
clean
pkg/Makefile
has these targets:
make
pkg
pkg/Makefile
to create two copies of the installable package:
one in file system format
(a collection of individual files)
in directory
pkg/CORDaoe
,
one in compressed stream format
(a single file)
in file
CORDaoe.Z
.
make
clean
make
clobber
Makefile
and the control files.
The root directory of the source tree contains an overall
Makefile
with these targets:
make
(default target)
make
in
kern
,
user
,
man
,
and
doc
,
in that order.
make
pkg
make
pkg
in
kern
,
user
,
man
,
doc
,
and
pkg
,
in that order.
make
clean
make
clean
in
kern
,
user
,
man
,
doc
,
and
pkg
,
in that order.
Thus:
make
pkg
at the top level.
make
clean
at the top level.
cd
kern
and edit the files.
To compile what has been changed,
run
make
.
When all seems well
(perhaps after installing the new binary by hand
to test it),
build a new package with
make pkg && cd ../pkg && make pkg
or
cd .. && make pkg
All
Makefile
s
assume that required tools are in the shell's search path;
in particular
cc
if C programs are to be compiled,
ld
if kernel object files are to be linked into loadable modules,
fidl
if long documents are to be rendered.
The stream-format package file resulting from the build process
is named
CORDaoe.Z
.
Societal conventions may dictate giving it another name
in public,
like
aoe-
version.Z
.
The package name
CORDaoe
is encoded in the package data;
the filename doesn't matter.
A fixed name is used
so that
pkg/Makefile
need not be edited just because the version number has changed.
Makefile
s
and associated conventions are designed
to allow the distribution to build without fuss
whether on SPARC or x86;
no manual configuration is required.
It is assumed that only one of those architectures
will be built at a time;
there is no provision to keep SPARC- and x86-specific binaries
separate.
To use the same source tree for both,
build one architecture,
save the resulting binaries,
then run
make
clean
before building the other.
Within each architecture
32- and 64-bit kernel binaries are kept side-by-side.
Compiler flag
-xarch=generic64
is used for 64-bit compilations;
this is recognized by Sun's compiler on either architecture.
64-bit kernel modules are created in subdirectory
bin64
.
When
make
pkg
copies files to
pkg/data
for packaging,
a recursive
make
call selects target directory
usr/kernel/drv/sparcv9
or
usr/kernel/drv/amd64
according to the architecture on which
make
is run.
The user-mode utility programs are built only as 32-bit binaries, since those work fine on 64-bit systems.
pkg/Makefile
calls
pkgmk(1)
with variables declaring the current architecture name
and the corresponding subdirectory name
for 64-bit kernel modules,
so that names need not be hard-coded in
pkginfo
and
prototype
.
Some aspects of the style used in C code and documentation are a bit unusual. The following is meant to aid comprehension, not as a religious tract, despite the occasional descent into sermon.
As a general but not inviolate rule, code is presented top-down: within each file, major routines and entry points come first, followed by local subroutines they call, and so on. Subroutines used only by one or two related major routines immediately follow those routines; common code used by many different routines is grouped after all uses. Routines performing related functions (e.g. different operations on the same type of data structure) are placed near one another.
Data declarations and definitions are placed at the top of the file, except for tables used by only one or two subroutines, which are usually placed next to the relevant code.
Every procedure has a prototype declaration. A global procedure (one called from another file) is declared in a header file. A static procedure is declared near the top of the source file, after all data declarations but before any code. As a special case, if a procedure's address is used as a data initializer (e.g. device-driver entry points), its prototype precedes the relevant data.
Although every procedure has an ISO-C prototype, procedure definitions declare arguments in old-C syntax:
static mblk_t *aoectlintr(int, mblk_t *);
...
static mblk_t *
aoectlintr(chan, mp)
int chan;
mblk_t *mp;
The original C specification intentionally made procedure definitions resemble calls, with type information as a sidebar, in the belief that this makes the code easier to read. The author still agrees. The meaning of such mixed notation is clearly defined in the ISO standard.
Variadic procedures are an exception: their definitions use the ISO syntax, because that is the only way to express them.
Types are always declared explicitly:
static int
doquery(ap, maj, min)
Aoechan *ap;
int maj, min;
{
unsigned int x;
...
rather than
static
doquery(ap, maj, min)
Aoechan *ap;
{
unsigned x;
...
Neither procedures nor data are allowed global scope
unless that is truly required.
(Here the author disagrees with both original and ISO C:
static
scope should have been the default.)
Machine-specific pseudo-optimizations
such as the
register
keyword are not used.
Source files are not made aware of the source-code directory structure.
In particular,
#include
directives for AoE-specific header files
are always of the form
#include "
file.h"
;
if
file.h
might be in directory
../include
,
compiler option
-I../include
should be used.
Compile-time conditionals are used sparingly.
There are no `normal' options;
only one version of the code is officially supported.
#ifdef
s
are used only to hide unsupported experimental code,
to make compiler or system-bug band-aids stand out,
or to work around compile-time environment differences
in different versions of Solaris.
Externally-defined binary data structures (e.g. AoE messages, ATA IDENTIFY data, EFI disk labels) are treated as byte arrays, not as C structures. A constant is defined for the offset to each element. Multi-byte objects other than plain byte arrays (e.g. integers of various sizes) are accessed through machine-independent macros that pack or unpack data with appropriate attention to byte order. This allows source code to be ignorant of per-system byte-order and word-alignment rules.
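A minimal sketch of the idea follows. It is not the code in the AoE headers: the names get_be16, put_be16, get_be32, and put_be32 are invented for this example, and big-endian (network) order is assumed for illustration. The point is only that each byte is fetched or stored individually, so the host's byte order and alignment rules never enter the picture.

```c
#include <stdint.h>

/* Fetch a 16-bit big-endian integer from offset off in buf. */
static uint16_t
get_be16(const void *buf, int off)
{
	const unsigned char *p = (const unsigned char *)buf + off;

	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Store the low-order 16 bits of val, big-endian, at offset off. */
static void
put_be16(void *buf, int off, unsigned int val)
{
	unsigned char *p = (unsigned char *)buf + off;

	p[0] = (unsigned char)(val >> 8);
	p[1] = (unsigned char)val;
}

/* Fetch a 32-bit big-endian integer from offset off in buf. */
static uint32_t
get_be32(const void *buf, int off)
{
	const unsigned char *p = (const unsigned char *)buf + off;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Store a 32-bit value, big-endian, at offset off. */
static void
put_be32(void *buf, int off, uint32_t val)
{
	unsigned char *p = (unsigned char *)buf + off;

	p[0] = (unsigned char)(val >> 24);
	p[1] = (unsigned char)(val >> 16);
	p[2] = (unsigned char)(val >> 8);
	p[3] = (unsigned char)val;
}
```

Because no pointer is ever cast to a wider integer type, the helpers work at odd offsets too, which matters on SPARC where misaligned word access traps.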
These header files,
all stored in directory
include
in the source tree,
define constants and macros
to implement this scheme.
In the descriptions below,
buf
is always a buffer address,
and may be either
char
*
or
unsigned
char
*
.
Off
is an integer offset within the buffer addressed by
buf.
Val
is an integer value;
sometimes only the low-order bits are used.
aoeproto.h
fraoechar(buf, off)
fraoeshort(buf, off)
fraoelong(buf, off)
toaoechar(buf, off, val)
toaoeshort(buf, off, val)
toaoelong(buf, off, val)
etherproto.h
fretherprot(buf, off)
to fetch a protocol-type value from offset
off
in buffer
buf;
macro
toetherprot(buf, off, val)
to set protocol-type value
val.
LEN_EADDR
LEN_ETHER
MINLEN_ETHER
MAXLEN_ETHER
ata.h
frataword(buf,
off)
fratalong(buf,
off)
fratall(buf,
off)
toataword(buf,
off,
val)
toatalong(buf,
off,
val)
toatall(buf,
off,
val)
aoecommproto.h
aoecomm
and
aoectl
drivers.
This header defines offsets and other constants;
integer values
are formatted using the
fraoe
xxx
and
toaoe
xxx
macros from
aoeproto.h
.
efilabel.h
frefiword(buf,
off)
frefilong(buf,
off)
frefill(buf,
off)
toefiword(buf,
off,
val)
toefilong(buf,
off,
val)
toefill(buf,
off,
val)
MINLEN_GPT
LEN_GPT
EFI_GPT_LOC
LEN_GPE
LEN_GUID
toefi
xxx
macros are not used at present;
they are included for completeness.
dospart.h
frdospchar(buf,
off)
frdospword(buf,
off)
frdosplong(buf,
off)
todospchar(buf,
off,
val)
todospword(buf,
off,
val)
todosplong(buf,
off,
val)
todosp
xxx
macros are not used at present;
they are included for completeness.
LEN_BSEC
LEN_DOSP
BSECMAGIC
SYSID_SUNOS
SYSID_SUNOS2
Sun supply headers describing EFI and DOS labels, but they use C structures accompanied by explicit machine-dependent byte-swapping, something we would rather avoid. Having our own headers may also simplify future support for compiling the AoE code on older versions of Solaris.
This code fragment tests whether the AoE message at p is a response message, and if so fetches its tag value:
if (fraoechar(p, AOE_VF) & AOEF_R)
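	tag = fraoelong(p, AOE_TAG);
This fragment sets the target address in the AoE message at p to the wildcard major and minor numbers:
toaoeshort(p, AOE_MAJOR, AOEMAJWILD);
toaoechar(p, AOE_MINOR, AOEMINWILD);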
An internal protocol is used for messages exchanged
between user-mode programs and the
aoecomm
STREAMS module during channel setup,
and for messages
written to
/dev/aoectl
and read from
/dev/aoemon
.
The message format is defined by constants in
aoecommproto.h
;
messages should be accessed using the macros defined in
aoeproto.h
.
Each message comprises a fixed-length header and a variable-length body. The header contains these fields:
AC_TYPE
AC_LEN
AC_CHANNEL
These message types are defined:
ACINITCHAN
/dev/aoectl
,
then to the channel on which
aoecomm
was pushed.
ACINITACK
aoecomm
to
ACINITCHAN
,
announcing that channel setup is complete
or giving an
errno
value explaining why it failed.
ACSEND
/dev/aoectl
:
send the enclosed datagram
(usually an AoE message)
via the channel named.
The MAC header must be included,
but only the destination address is significant;
source address and protocol values are filled in by the system.
ACDEVENAB
/dev/aoectl
:
query or update the active-target registry.
Body gives an AoE target address
(major and minor numbers)
and a command:
enable target, disable, query.
ACLOG
/dev/aoemon
:
body contains an unexpected,
ill-formed,
or unsendable AoE message,
as explained by a type code:
ACLUNSOL
ACLILLFORM
ACLSENDFAIL
ACLTIMEOUT
ACLAOEERR
ACLATAERR
See the source code and the supplied aoecomm(7M) and aoectl(7D) manual entries for details.
Here is a sketch of major operations showing the components and subroutines used.
These user-mode steps use a file descriptor for a network device of suitable type, suitably initialized (e.g. configured to receive the desired Ethernet protocol), to create an active AoE channel:
aoecomm
STREAMS module onto the file descriptor.
ACINITCHAN
AoE control message
containing the desired channel number,
the cookie,
the local MAC address of the network device
(usually discovered by a device-specific call),
the Ethernet protocol type to be used,
and the maximum data-segment size
(usually computed from the Ethernet device MTU).
/dev/aoectl
and write the
ACINITCHAN
message.
Close
/dev/aoectl
.
The
aoectl
module calls
aoecomm_initchan
to register the cookie
for the desired channel.
ACINITCHAN
message to the file descriptor.
aoecomm
compares the cookie stored for the desired channel
with that supplied in the message at hand.
If the cookies match
and the channel number is not already in use,
the protocol number
and MAC address are stored for use when sending messages,
and the channel is made active.
The two-way handshake
using both
aoectl
and
aoecomm
is needed to
make it harder for bad guys
to cause trouble.
aoecomm
instance for this channel is popped,
and the channel shut down.
This initialization dance is normally done
by user-mode program
aoestart
within library routine
comminit.
Aoestart
then calls library routine
achattach,
which uses
fattach(3)
to attach the file descriptor to a file in directory
/etc/aoe
to keep the channel open.
The
aoed
disk driver open routine
initializes the device if necessary,
broadcasting an AoE Query-Config message to
discover the target's Ethernet MAC address
and other characteristics.
Device configuration is semi-automatic.
Every target to be used
must be listed in advance in kernel configuration file
aoed.conf
;
changes to the file
are effective only when the
aoed
module is reloaded,
or
(Solaris 10 only)
when
update_drv
-f
is run.
The memory overhead for a target that is configured
but not used
is modest;
it is unreasonable to declare every possible device
allowed by the protocol
(255*65535 of them),
but prudent to declare every slot
in a new EtherDrive shelf
even if only a few will be used at first.
It takes a few seconds to discover
that an AoE device isn't available on the network
even though its device files exist.
If there are many such unavailable devices,
programs that open every possible disk device in the system
(notably
format(1M))
will run quite slowly.
To prevent this,
aoecomm
maintains a registry of `enabled' devices,
those believed actually to exist.
Daemon
aoemon
listens to
/dev/aoectl
for unsolicited Query-Config response messages,
generated when a target device is powered up
or in response to a broadcast Query-Config command
from
aoestart
or
aoectl.
When such a message arrives,
aoemon
enables the responding target
by writing an
ACDEVENAB
control-protocol message
to
/dev/aoectl
,
triggering a call to
aoecomm_devenab
to enable the device in the registry.
The aoectl command lists and modifies the registry.
The
aoed
driver supplies
read and write entry points
for raw I/O,
and a
strategy routine
for block and file system I/O.
I/O is done by composing an AoE command
in a STREAMS buffer
and calling
aoecomm_send
to send it to the network device.
When an AoE message arrives
on a channel in which
aoed
has registered interest
(any on which some disk device
has been opened),
receive routine
adreceive
is called.
Pending commands are stored in a list.
When a message arrives,
the system searches for
a pending command
for the same AoE device,
with the same AoE tag value.
Timer routine
chantimer
is called periodically to scan the pending-command list
for commands which haven't received responses
within a specified time interval.
If a command has timed out only a few times,
it is retransmitted.
After several timeouts
the command is abandoned,
and the I/O request aborted with an error.
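The retry policy just described can be sketched as follows. This is an illustration, not the driver's chantimer: the Pending structure, the state names, the retry limit MAXRETRY, and the function ex_scan are all assumptions made for this example, and locking is omitted.

```c
#include <stddef.h>

#define MAXRETRY 3	/* assumed limit; the real value is not given here */

enum { P_WAIT, P_RESEND, P_FAILED };

typedef struct Pending Pending;
struct Pending {
	Pending	*next;
	long	sent;	/* time of the most recent transmission */
	int	tries;	/* transmissions so far */
	int	state;	/* P_WAIT, P_RESEND, or P_FAILED */
};

/*
 * Scan the pending-command list.  A command that has waited at least
 * timeout since it was last sent is marked P_RESEND if it may be
 * tried again, P_FAILED if it has used up its retries.
 * Return the number of overdue commands found.
 */
static int
ex_scan(Pending *list, long now, long timeout)
{
	Pending *p;
	int n = 0;

	for (p = list; p != NULL; p = p->next) {
		if (p->state != P_WAIT || now - p->sent < timeout)
			continue;
		n++;
		if (p->tries < MAXRETRY) {
			p->tries++;
			p->sent = now;
			p->state = P_RESEND;	/* caller retransmits */
		} else
			p->state = P_FAILED;	/* caller aborts the I/O */
	}
	return n;
}
```

In this sketch the caller is expected to retransmit each P_RESEND command and return it to P_WAIT, and to complete the I/O request with an error for each P_FAILED command.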
Often an I/O request will involve more data than can be handled in a single AoE command: 1024 bytes by default, more if larger (jumbo) Ethernet frames are allowed. If so, a single request will generate several AoE commands. The corresponding several AoE responses may be returned out of order.
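The arithmetic is simple; a sketch follows. The function name ex_ncommands is invented for this example, and it assumes that data travel in whole 512-byte sectors, so that only a multiple of the sector size within the maximum data-segment size is usable.

```c
#define SECSIZE 512	/* bytes per disk sector (assumed) */

/*
 * Number of AoE commands needed to move nbytes when each command
 * carries at most maxdata bytes of data.  Return -1 if maxdata
 * cannot hold even one sector or nbytes is negative.
 */
static long
ex_ncommands(long nbytes, long maxdata)
{
	long per = maxdata - maxdata % SECSIZE;	/* usable bytes per command */

	if (per <= 0 || nbytes < 0)
		return -1;
	return (nbytes + per - 1) / per;	/* round up */
}
```

With the default 1024-byte limit, an 8 KiB file-system block needs eight commands; with jumbo frames allowing 8704 bytes of data it needs one.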
On a non-SMF system,
startup script
/etc/init.d/aoe
is run at boot time.
When SMF is present,
AoE is represented by service
svc:/device/aoe
,
initially disabled;
when the service is enabled
and on subsequent reboots,
method script
/lib/svc/method/device-aoe
is called.
In either case
the script starts the
aoemon
daemon,
then calls the
aoestart
command for each channel
listed in AoE-specific configuration file
/etc/aoe.conf
.
The startup script is called early in the boot process,
after the root and
/usr
file systems have been mounted,
but before any others.
Hence the AoE subsystem must not rely on access
to other file systems,
and in particular must not require access to
/var
or
/opt
.
The payoff is that any other file system,
including
/var
or
/opt
,
may be stored on an AoE disk.
AoE drivers are normally stored in
/usr/kernel
,
support programs in
/usr/sbin
,
configuration files and the mount points used by
fattach
in
/etc
.
Nothing special need be done
when the system is shut down.
Channels remain active until the very end,
even after all processes have been killed,
because it is the
fattach
operation that keeps them open.
Thus the normal shutdown code
to unmount all file systems
at the last minute
works without fuss.
Notice,
however,
that the file system where AoE channels are attached
may not be unmounted
because the AoE
fattach
calls keep it busy.
That is why AoE uses
/etc/aoe
rather than
/var/aoe
;
the latter directory is sometimes a separate file system,
the former rarely if ever.
AoE may be intentionally shut down
by running
/etc/init.d/aoe
stop
(non-SMF system)
or
svcadm
disable
svc:/device/aoe
(SMF).
This is sometimes useful for maintenance purposes,
but is not done by a normal shutdown.
aoecomm
An instance of
aoecomm
is created whenever the module
is pushed on a stream file;
thus each active AoE channel has a separate instance.
All code is in a single source file,
aoecomm.c
.
It uses AoE-specific header files
aoeproto.h
,
aoecomm.h
,
aoecommproto.h
,
and
etherproto.h
.
None is used outside the
aoecomm
module.
Several data structures
provide global context within
aoecomm
,
but are not accessible to the rest of the kernel:
static
void
(*logger)(int,
int,
mblk_t
*);
static
kmutex_t
loggerlock;
static
Aoechan
*aoechan[MAXCHAN];
static
kmutex_t
chantlock;
Aoechan
structures
indexed by channel number
(assigned when the channel is initialized)
for faster lookup.
MAXCHAN
is defined in
aoecomm.c
;
its present value is 10.
Chantlock
prevents concurrent access to the
aoechan
array,
but not the
Aoechan
s.
static
Aoecookie
pendcookie[MAXCHAN];
static
kmutex_t
pendlock;
pendcookie[
i]
contains a pending initialization cookie
for channel
i,
or
NOCOOKIE
(zero)
if none has arrived yet.
Pendlock
prevents concurrent access to the
pendcookie
array.
Standard Solaris loadable-module and STREAMS-module
entry points and data structures are supplied:
in particular
_init,
_fini,
and
_info
routines,
a
modlinkage
structure (and its many children),
a
streamtab
structure,
and a pair of
qinit
structures.
Open and close
(module push and pop),
read put and service,
and write service routines are supplied;
puts to the write queue use
putq(9F).
Only
_init,
_fini,
and
_info
are global.
_init
calls
mod_install
to tell the system where to find the
modlinkage
structure
through which the other data structures and routines
can be located.
Several routines called by other AoE kernel modules
are made available as global entry points.
Prototypes for these routines are declared in
aoecomm.h
,
constant values in
aoecommproto.h
.
int aoecomm_initchan(chan, cookie)
int chan;
Aoecookie cookie;
Store
cookie
in
pendcookie[
chan]
.
If an
ACINITCHAN
message arrives
on the write queue
with the same
chan
and
cookie
values,
the channel becomes active.
Return > 0 if all is well,
< 0 if an error occurred
(chan
out of range).
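The two halves of the handshake can be sketched like this. It is an illustration under stated assumptions, not the code in aoecomm.c: the Cookie type stands in for Aoecookie (whose definition is not shown here), ex_initchan and ex_activate are invented names, clearing the pending cookie after use is a guess, and the pendlock and chantlock mutexes are omitted.

```c
#define MAXCHAN  10	/* the value quoted for aoecomm.c */
#define NOCOOKIE 0

typedef unsigned long Cookie;	/* stand-in for Aoecookie */

static Cookie pend[MAXCHAN];	/* pending cookie per channel */
static int active[MAXCHAN];	/* nonzero once a channel is in use */

/*
 * First half, reached via /dev/aoectl: remember the cookie.
 * Return > 0 if all is well, < 0 if chan is out of range.
 */
static int
ex_initchan(int chan, Cookie cookie)
{
	if (chan < 0 || chan >= MAXCHAN)
		return -1;
	pend[chan] = cookie;
	return 1;
}

/*
 * Second half, reached when ACINITCHAN arrives on the stream:
 * activate the channel only if it is free and the cookie matches
 * the one registered earlier.
 */
static int
ex_activate(int chan, Cookie cookie)
{
	if (chan < 0 || chan >= MAXCHAN)
		return -1;
	if (active[chan] || cookie == NOCOOKIE || pend[chan] != cookie)
		return -1;
	pend[chan] = NOCOOKIE;	/* assumption: a cookie is good only once */
	active[chan] = 1;
	return 1;
}
```

A process that can write to the stream but not to /dev/aoectl cannot supply a matching cookie, which is the point of the exercise.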
int aoecomm_initdriver(chan, receiver)
int chan;
mblk_t *(*receiver)(int, mblk_t *);
If the receiver routine for channel
chan
is
NULL
,
set it to
receiver.
If
receiver
was already this channel's receiver,
leave it be.
In either case return > 0.
If chan is invalid or already has a different receiver, return < 0.
When an AoE message arrives on the read queue:
int aoecomm_initlog(logfunc)
void (*logfunc)(int, int, mblk_t *);
If
logger
is
NULL
,
set it to
logfunc
and return > 0.
If
logfunc ==
logger,
return > 0.
If
logger
was already nonzero
and is not the same as
logfunc,
leave
logger
unchanged and
return < 0.
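A sketch of these registration rules follows. The names ex_initlog, log_a, and log_b are invented, void * stands in for mblk_t *, and the loggerlock mutex is omitted. One assumption goes beyond the description above: since the aoectl close routine deregisters by passing NULL, the sketch treats a NULL argument as "clear the logger".

```c
#include <stddef.h>

typedef void (*Logfunc)(int, int, void *);	/* void * for mblk_t * */

static Logfunc logger;

/* Two do-nothing loggers, present only so the rules can be exercised. */
static void log_a(int chan, int code, void *mp) { (void)chan; (void)code; (void)mp; }
static void log_b(int chan, int code, void *mp) { (void)chan; (void)code; (void)mp; }

/*
 * Register f as the logger.  Succeed (> 0) if no logger is set or f
 * is already the logger; fail (< 0) if a different one is registered.
 * Assumption: passing NULL deregisters the current logger.
 */
static int
ex_initlog(Logfunc f)
{
	if (f == NULL || logger == NULL || logger == f) {
		logger = f;
		return 1;
	}
	return -1;
}
```

Registering the same routine twice is harmless; a second, different routine is refused until the first has been cleared.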
void aoecomm_log(chan, code, mp)
int chan, code;
mblk_t *mp;
If a logger routine has been registered with
aoecomm_initlog,
call it with the same arguments:
(*logger)(
chan,
code,
mp)
.
The logger becomes responsible for STREAMS message
mp,
and will free it when finished;
neither
aoecomm_log
nor its caller may use it further.
If there is no logger,
free
mp.
int aoecomm_devenab(chan, maj, min, cmd)
int chan, maj, min, cmd;
Query or update the enabled-target registry, according to cmd:
ADENAB
AOEMINWILD
,
enable every possible target with the
chan
and
maj
given.
Neither
chan
nor
maj
may be wild.
Return
ADENAB
if all is well,
ADDISAB
if
chan
or
maj
was invalid.
ADDISAB
AOEMAJWILD
matches any major number,
min
value
AOEMINWILD
,
any minor number.
Chan
may not be wild.
Return
ADDISAB
.
ADQUERY
ADENAB
if at least one target matching the arguments
is registered,
ADDISAB
otherwise.
Chan
value
ACHWILD
matches any channel,
maj
value
AOEMAJWILD
any major number,
min
value
AOEMINWILD
any minor number.
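The wildcard matching used by the query case can be sketched as follows. The Target structure, the flat array registry, the function names, and all numeric values are assumptions made for this example; the real constants live in aoecommproto.h and the real registry layout is not described here.

```c
/* Assumed values, for illustration only. */
#define ACHWILD    (-1)
#define AOEMAJWILD 0xffff
#define AOEMINWILD 0xff
enum { ADDISAB = 0, ADENAB = 1 };

typedef struct {
	int chan, maj, min;
} Target;

/* Does registered target t match the (possibly wild) address given? */
static int
ex_match(const Target *t, int chan, int maj, int min)
{
	return (chan == ACHWILD || chan == t->chan) &&
	    (maj == AOEMAJWILD || maj == t->maj) &&
	    (min == AOEMINWILD || min == t->min);
}

/* Return ADENAB if any of the n registered targets matches. */
static int
ex_query(const Target *reg, int n, int chan, int maj, int min)
{
	int i;

	for (i = 0; i < n; i++)
		if (ex_match(&reg[i], chan, maj, min))
			return ADENAB;
	return ADDISAB;
}
```

A query with every field wild thus answers the question "is any target enabled at all?".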
int aoecomm_maxdata(chan)
int chan;
Return the maximum data-segment size
for channel
chan,
or -1
if
chan
is invalid.
int aoecomm_send(chan, mp)
int chan;
mblk_t *mp;
Send STREAMS message
mp
via channel
chan.
If
chan
is
ACHWILD
,
send a copy to every active channel.
The message must begin with an Ethernet MAC header
with destination MAC address filled in.
Source address and protocol type
are overwritten
with the values from the
Aoechan
.
If the message is sent, return > 0. If an error occurs (invalid chan, no room downstream), free the message and return < 0.
Open routine
aoecommopen
allocates a new
Aoechan
,
marked inactive;
stores its address in
q_ptr
for both the read and write queues;
and calls
qprocson(9F)
to enable queue processing.
A channel number is not assigned yet,
so no data may be sent;
the channel is inactive,
so received data are thrown away.
Arriving messages are added to the write queue by putq(9F).
Write service routine aoecommwsrv loops calling getq(9F), processing each message as follows:
M_DATA
,
pass it downstream without further interpretation.
ACINITCHAN
message,
and if:
ACINITCHAN
message
to this channel's
Aoechan
.
Aoechan
in
aoechan.
ACINITACK
reply reporting success.
ACINITACK
reply containing an appropriate error code.
ACINITCHAN
,
free it and continue.
Read put routine aoecommrput works as follows:
->db_type
>=
QPCTL
),
pass the message downstream.
Read service routine aoecommrsrv loops calling getq(9F), processing each message as follows:
M_DATA
,
pass it downstream.
ACLUNSOL
.
Close routine aoecommclose does the following:
Aoechan
from the
aoechan
table,
making the channel unavailable to
aoecomm_initdriver
and
aoecomm_send.
NULL
to announce that the channel is shutting down.
Aoechan
.
aoectl
A single instance of
aoectl
is configured by driver configuration file
aoectl.conf
,
attached to the
pseudo
device nexus.
There are two minor devices:
/devices/pseudo/aoectl@0:ctl
(/dev/aoectl
)
/devices/pseudo/aoectl@0:mon
(/dev/aoemon
)
Neither device supports polling; neither has any ioctl commands.
All code is in a single source file,
aoectl.c
.
It uses AoE-specific header files
aoeproto.h
,
aoecomm.h
,
aoecommproto.h
,
and
etherproto.h
.
None is used outside the
aoectl
module.
static Aoectlstate aoectlstate;
Aoemon
structures
used to store messages
pending receipt by
/dev/aoemon
.
rmon
,
the next message to be read;
wmon
,
the next to be written.
Aoemon
s
in the ring.
kmutex_t
lock to prevent concurrent use of
/dev/aoectl
.
kmutex_t
lock to prevent concurrent access
to the ring buffer.
kcondvar_t
condition variable
on which reads from
/dev/aoemon
sleep if no messages are available;
a flag indicating that some process is sleeping.
dev_info_t
structure for the sole device instance.
Each
Aoemon
contains:
next
,
a pointer to the next
Aoemon
in the ring.
/dev/aoemon
device
is opened,
and the
next
-Aoemon
pointers set to sew the
Aoemon
s
into a ring.
All
Aoemon
s
(and any unread messages)
are freed when the device is closed.
When
aoectlstate.wmon
==
aoectlstate.rmon
,
the ring is empty;
when
aoectlstate.wmon->next
==
aoectlstate.rmon
,
the ring is full.
Notice that there is always at least one empty
Aoemon
following that pointed to by
aoectlstate.rmon
.
When a message arrives:
*
aoectlstate.wmon
.
*
aoectlstate.wmon
,
and set
aoectlstate.wmon
to
aoectlstate.wmon->next
.
When it is desired to read a message:
aoectlstate.wmon
==
aoectlstate.rmon
,
no message is available.
*
aoectlstate.rmon
and set
aoectlstate.rmon
to
aoectlstate.rmon->next
.
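The ring discipline can be sketched in a few lines. This is an illustration, not the driver code: the Mon structure stands in for Aoemon, void * stands in for a STREAMS message pointer, the ring is a small static array rather than being allocated at open time, and the mutex and condition variable are omitted.

```c
#include <stddef.h>

#define NMON 4	/* ring size for this example */

typedef struct Mon Mon;
struct Mon {
	Mon	*next;
	void	*msg;	/* stand-in for mblk_t * */
};

static Mon ring[NMON];
static Mon *rmon, *wmon;	/* next to read, next to write */

/* Sew the entries into a ring and mark it empty. */
static void
ring_init(void)
{
	int i;

	for (i = 0; i < NMON; i++) {
		ring[i].next = &ring[(i + 1) % NMON];
		ring[i].msg = NULL;
	}
	rmon = wmon = &ring[0];
}

/* Store msg; return 0, or -1 if the ring is full. */
static int
ring_put(void *msg)
{
	if (wmon->next == rmon)
		return -1;
	wmon->msg = msg;
	wmon = wmon->next;
	return 0;
}

/* Return the oldest message, or NULL if the ring is empty. */
static void *
ring_get(void)
{
	void *msg;

	if (rmon == wmon)
		return NULL;
	msg = rmon->msg;
	rmon->msg = NULL;
	rmon = rmon->next;
	return msg;
}
```

Because one entry is always left unused to distinguish full from empty, a ring of NMON entries holds at most NMON-1 messages.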
Standard Solaris loadable-module and device-driver
entry points and data structures are supplied:
in particular
_init,
_fini,
and
_info
routines,
a
modlinkage
structure (and its many children),
dev_ops
and
cb_ops
structures.
A
_depends_on
string declares a dependency on
aoecomm
,
since
aoectl
calls procedures
supplied by that module.
Only
_init,
_fini,
and
_info
are global.
_init
calls
mod_install
to tell the system where to find the
modlinkage
structure
through which the other data structures and routines
can be located.
Apparently
_depends_on
need not be global to work.
Internal routine
aoectlintr
is registered with
aoecomm
as the global message receiver
if
/dev/aoemon
is open.
Open routine aoectlopen acts according to the minor device opened:
/dev/aoectl
/dev/aoemon
EBUSY
.
aoecomm_initlog(aoectlintr)
to register
aoectlintr
as the logger.
If this fails,
return
EBUSY
.
monbufs
exists,
allocate that many messages;
the default is 32.
If this fails,
deregister the logger
(call
aoecomm_initlog(NULL)
)
and return
EBUSY
.
ENXIO
.
Close routine aoectlclose acts according to the minor device closed:
/dev/aoemon
aoecomm_initlog(NULL)
to deregister the logger.
Write routine aoectlwrite acts according to the minor device written:
/dev/aoectl
ACINITCHAN
aoecomm_initchan(
chan,
cookie)
.
On success,
return as if all bytes were successfully written;
on failure,
return
ENXIO
.
Only the channel and cookie values
are used here;
other parameters are ignored.
ACSEND
aoecomm_send(
chan,
message)
.
If all is well,
return success;
if no STREAMS buffer was available,
ENOMEM
;
if
aoecomm_send
failed,
ENXIO
.
ACDEVENAB
aoecomm_devenab(
chan,
maj,
min,
cmd)
.
If the result is
ADENAB
,
return as if all bytes were written;
if
ADDISAB
,
return zero (no error, but nothing written).
ENXIO
.
ENODEV
.
Read routine aoectlread acts according to the minor device read:
/dev/aoemon
aoectlstate.rmon
==
aoectlstate.wmon
,
set the process-sleeping flag,
wait on the condition variable,
and try again.
If a signal arrives while waiting,
return
EINTR
.
aoectlstate.rmon
!=
aoectlstate.wmon
and the user's buffer has room for the message stored in
*
aoectlstate.rmon
,
compose an
ACLOG
control-message header including the channel number,
log-message type,
and lost-message count
stored in that
Aoemon
;
use
uiomove(9F) to
copy the header,
then the data contents of the message,
to the user's buffer.
Free the STREAMS message;
clear the lost-message counter in
*
aoectlstate.rmon
;
set
aoectlstate.rmon
to
aoectlstate.rmon->next
.
aoectlstate.rmon
==
aoectlstate.wmon
,
the user's buffer has no room for the next message,
or
uiomove
returns an error.
In the last case
return the same error as
uiomove;
otherwise return the number of bytes read.
ENODEV
.
static void aoectlintr(chan, code, mp)
int chan, code;
mblk_t *mp;
If there's room in the ring,
store
mp,
code
(the log-message type),
and
chan
in
*
aoectlstate.wmon
and set
aoectlstate.wmon
to
aoectlstate.wmon->next
.
If the ring is full,
increment the lost-messages counter in
*
aoectlstate.wmon
and free
mp.
In either case,
if the process-sleeping flag is set,
signal the condition variable,
awakening any process blocked reading
/dev/aoemon
.
aoed
Each instance of
aoed
represents one AoE disk target,
with
13 or 21 block and 13 or 21 character
minor devices:
Links in directories
/dev/dsk
and
/dev/rdsk
use
fixed `controller' name
ca
.
These devices are created for
aoed
:
/devices/pseudo/aoed@
inst:a
(/dev/dsk/cad
insts0
)
/devices/pseudo/aoed@
inst:p
(/dev/dsk/cad
insts15
)
a
-h
(and on to
p
on x86)
in
/devices
-speak,
or
s0
-s7
(s15
)
in
/dev
-speak.
/devices/pseudo/aoed@
inst:r
(/dev/dsk/cad
instp1
)
/devices/pseudo/aoed@
inst:u
(/dev/dsk/cad
instp4
)
r
-u
in
/devices
-speak,
or
p1
-p4
in
/dev
-speak.
/devices/pseudo/aoed@
inst:wd
(/dev/dsk/cad
inst)
/devices/pseudo/aoed@
inst:a,raw
(/dev/rdsk/cad
insts0
)
/devices/pseudo/aoed@
inst:h,raw
(/dev/rdsk/cad
insts7
)
/devices/pseudo/aoed@
inst:r,raw
(/dev/rdsk/cad
instp1
)
/devices/pseudo/aoed@
inst:u,raw
(/dev/rdsk/cad
instp4
)
/devices/pseudo/aoed@
inst:wd,raw
(/dev/rdsk/cad
inst)
Sun's standard device drivers create the DOS-partition
devices only for certain kinds of disk
and only on Solaris/x86;
when the DOS devices exist,
the whole-disk device in
/dev
is called
p0
.
The AoE driver supports DOS partition tables
on Solaris/SPARC as well,
but omits the
p0
name to avoid a bug in tools like
format
(1M).
The driver calls
ddi_create_minor_node(9F)
with node type
DDI_NT_PSEUDO
,
rather than the
DDI_NT_BLOCK
type normally used for disks.
AoE-specific entries in
/dev/devlink.tab
create the
/dev/dsk
and
/dev/rdsk
links.
This circumvents a
bug in
devfsadm(1M).
The copy of
aoed.conf
installed with the subsystem
declares 30 devices:
instances 0-14 and 100-114
for
aoechan
0,
aoemaj
0 and 1,
aoemin
0-14.
This is enough for two EtherDrive shelves,
each with 15 slots.
The system administrator may edit
the file to add or remove devices
(with
aoemkconf
or by hand)
or to apply his own numbering convention,
as explained in the
Installation and Operation Guide.
By convention,
decimal instance number
instance=
cMMmm
corresponds to target address
aoechan=
c,
aoemaj=
MM,
aoemin=
mm.
Whatever mapping appears in the entries in
aoed.conf
is honoured,
however;
the only program that knows about it
is the
aoemkconf
program for generating configuration entries.
Any desired instance-to-address mapping will work
so long as instance numbers are not reused
and none is too big.
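The default numbering convention amounts to simple decimal packing. The sketch below assumes two decimal digits each for aoemaj and aoemin, as the cMMmm pattern suggests; the structure and function names are hypothetical and are not part of the driver or of aoemkconf.

```c
struct aoeaddr {
    int chan;   /* aoechan */
    int maj;    /* aoemaj  */
    int min;    /* aoemin  */
};

/* Decode a decimal instance number of the form cMMmm. */
static struct aoeaddr inst_to_addr(int inst)
{
    struct aoeaddr a;

    a.min  = inst % 100;
    a.maj  = (inst / 100) % 100;
    a.chan = inst / 10000;
    return a;
}

/* Encode a target address; meaningful only while maj and min are below 100. */
static int addr_to_inst(int chan, int maj, int min)
{
    return chan * 10000 + maj * 100 + min;
}
```

Under this convention instance 114 decodes to aoechan 0, aoemaj 1, aoemin 14, which matches the last entry of the supplied aoed.conf.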
In 32-bit versions of Solaris, 18 bits are available for minor device numbers. 13 (21 on x86) minor devices are created for each instance, allowing 262144/13 = 20164 (262144/21 = 12483) instances. In 64-bit systems, 32 bits are available, hence there may be 4294967296/13 = 330382099 (4294967296/21 = 204522252) instances.
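The instance limits quoted above follow from integer division of the minor-number space by the number of minor devices per instance. A small check, with a hypothetical helper name:

```c
#include <stdint.h>

/* Largest number of instances that fit when `bits` bits of minor number
 * are available and each instance consumes `per_inst` minor numbers. */
static uint64_t max_instances(unsigned bits, unsigned per_inst)
{
    return ((uint64_t)1 << bits) / per_inst;
}
```

Calling max_instances(18, 13) and max_instances(32, 21) reproduces the first and last figures in the paragraph.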
These optional integer-valued properties may be included:
timeout=
ms
hd=
nhead
sec=
nsec
maxdata=
len
maxbuf=
nbuf
disksort=
enable
On Solaris 9 and earlier versions,
changes to
aoed.conf
take effect only when the
aoed
kernel module is loaded.
If AoE disks have already been used,
the file will not be effective
until the driver is unloaded,
which means every AoE disk device
must be unmounted and closed.
Often it is simplest just to reboot the system.
On Solaris 10,
update_drv(1M)
updates the configuration.
With option
-f
it can do so without a module unload.
Device instances presently in use cannot be changed
on the fly,
but new devices may be added
without rebooting.
The
aoed
driver source code
is broken into several files:
aoed.c
addadk.c
aduscsi.c
adio.c
addos.c
adefi.c
advtoc.c
adcmd.c
adsubr.c
Adrive
structure associated with a Solaris device;
error-message helpers.
adtrace.c
aoed.h
dadkio32.h
DIOCTL_RWCMD
ioctl.
For some reason Sun's header files leave these out.
To make it easier to
include optional code only as needed
(trace
in particular),
all object files except main program
aoed.o
are collected into object library
adlib.a
.
The module binary is generated by running
ld -r -o aoed aoed.o adlib.a
None is used outside the
aoed
module.
Adrive
An
Adrive
structure
is allocated for each device instance
when it is attached,
freed when it is detached,
containing:
ADCLOSED
ADWAOE
ADWATA
ADGATA
wd
partition.
ADWPTAB
ADREADY
ep->state
>=
ADGATA
is true only if IDENTIFY information has been received.
Pbits
to hide the number of bits required)
indicating which block and which character minor devices
are open;
a count of the outstanding layered opens.
Used to decide whether this is the first open
or last close.
aoed.conf
.
aoed.conf
:
I/O timeout value,
substitute head and sector counts.
dev_info_t
pointer.
dev_t
as argument.
DKIOCGETEFI
and
DKIOCSETEFI
ioctls
.
AVLABVTOC
,
AVLABEFI
,
AVLABDOS
,
or
(AVLABDOS|AVLABVTOC)
(when the VTOC label is
encapsulated in a DOS partition).
dk_geom
structure describing disk geometry;
meaningful only with label type
AVLABDOS
or
AVLABVTOC
.
vtoc
(volume table-of-contents,
i.e. on-disk label);
valid only under label type
AVLABVTOC
.
Acmd
structures
available for allocation by
acalloc.
Achan
.
diskhd
structure
containing a queue of
buf
structures for pending I/O requests,
suitable for use by
disksort(9F).
buf
structures for this device,
i.e. those whose transfers have been at least partly begun.
kmutex_t
lock to prevent concurrent access to non-static values in this
Adrive
.
kcondvar_t
condition variable
used while waiting for initialization to finish.
kstat_t
of type
KSTAT_TYPE_IO
for the whole drive,
another for each partition
(including the whole-disk partition),
one named
aoed
inst,stats
for a handful of per-drive counters
specific to
aoed
,
another named
aoed
inst,conf
to display per-drive configuration values.
Acmd
Each
Adrive
has a pool of
Acmd
structures,
allocated when the device is first opened,
freed on last close.
The pool is initialized with as many
Acmd
s
as the target allows concurrent commands.
When a command is composed and sent,
an
Acmd
is taken from the pool;
while the command awaits a response,
the
Acmd
is kept in a per-channel pending-command list;
when a response is received
or the command is abandoned after a timeout,
the
Acmd
is returned to the pool.
If the pool is empty
no new commands may be sent
until an outstanding command has completed or timed out
and its
Acmd
has been returned to the pool.
Each
Acmd
contains:
Adrive
.
buf
structure.
If
NULL
this is not an I/O command.
kcondvar_t
condition variable;
NULL
if none.
Acmd
and
previous-Acmd
pointers,
for the several lists in which
Acmd
s
may be stored.
CSFREE
if this
Acmd
is in the
drive's
free-command pool;
CSCHAN
if on the channel's
active-command list;
CSLOOSE
otherwise.
If an
Acmd
in the pool or on the list
is found to be in the wrong state,
or if an
Acmd
being added to either is not in state
CSLOOSE
,
the system panics.
Acmd
that was still on the active-channel list.
It remains because it's cheap insurance:
if another such bug creeps in,
it will be caught early and found more easily.
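The state check can be illustrated with a minimal free-list pool. This is a user-mode sketch, not the driver's code: the numeric values of the states, the structure layouts, and the names pool_get and pool_put are assumptions, and assert stands in for the kernel panic.

```c
#include <assert.h>
#include <stddef.h>

enum cstate { CSFREE, CSCHAN, CSLOOSE };

struct acmd {
    enum cstate  state;
    struct acmd *next;
};

struct pool {
    struct acmd *free;      /* head of the free-command list */
};

/* Take a command from the pool; NULL means no command may be sent now. */
static struct acmd *pool_get(struct pool *p)
{
    struct acmd *c = p->free;

    if (c == NULL)
        return NULL;
    assert(c->state == CSFREE);     /* wrong state in the pool: "panic" */
    p->free = c->next;
    c->next = NULL;
    c->state = CSLOOSE;
    return c;
}

/* Return a command to the pool; it must not be on any list. */
static void pool_put(struct pool *p, struct acmd *c)
{
    assert(c->state == CSLOOSE);    /* still on a list: "panic" */
    c->state = CSFREE;
    c->next = p->free;
    p->free = c;
}
```

Freeing a command that is still marked CSCHAN trips the second assertion at once, which is the kind of early detection the paragraph above describes.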
Achan
Each
aoechan
named at least once in
aoed.conf
has an
Achan
structure.
Achan
s
are allocated as needed as devices are attached,
in a dynamic array handled by
ddi_soft_state_init
and
ddi_get_soft_state(9F).
The whole array is freed
when the module is unloaded.
Each
Achan
contains:
Abuf
Every
buf
structure for a pending or active I/O transfer
has an associated
Abuf
structure,
allocated when the
buf
is accepted by
adstrategy.
The address of the
Abuf
is stored in
bp->b_private
.
The
Abuf
is freed before
biodone(9F)
is called.
Macro
bptoabuf(bp)
is a shorthand for
(Abuf
*
)(bp->b_private)
.
Each
Abuf
contains:
buf
.
buf
structure
is complete when the offset value equals the total-bytes value
and the number of outstanding commands
becomes zero,
or when an error occurs.
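The completion rule can be written as a single predicate. The field names below are hypothetical stand-ins for the offset, total-bytes, outstanding-command and error values; the real Abuf layout is not reproduced here.

```c
#include <stddef.h>

struct abuf_sketch {
    size_t total;   /* total bytes to transfer */
    size_t off;     /* offset: bytes for which commands have been issued */
    int    ncmd;    /* commands still awaiting a response */
    int    error;   /* nonzero errno once any command has failed */
};

/* Nonzero when the transfer is finished and biodone may be called. */
static int abuf_done(const struct abuf_sketch *ab)
{
    if (ab->error != 0)
        return 1;
    return ab->off == ab->total && ab->ncmd == 0;
}
```

Both conditions are needed in the success case: the offset reaches the total as soon as the last command is sent, while responses may still be outstanding.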
Standard Solaris loadable-module and device-driver
entry points and data structures are supplied:
in particular
_init,
_fini,
and
_info
routines,
a
modlinkage
structure (and its many children),
dev_ops
and
cb_ops
structures.
A
_depends_on
string declares a dependency on
aoecomm
,
since
aoed
calls procedures
supplied by that module.
Only
_init,
_fini,
and
_info
are global.
_init
calls
mod_install
to tell the system where to find the
modlinkage
structure
through which the other data structures and routines
can be located.
Apparently
_depends_on
need not be global to work.
Internal routine
adreceive
is registered with
aoecomm
as the message receiver for each
aoechan
that has been used to send at least one message.
Internal routine
chantimer
runs as a self-renewing timer,
with one instance per
aoechan.
Attach routine adattach works as follows:
Adrive
for this device instance.
Initialize it with constants at hand:
instance number and
dev_info_t
pointer;
aoechan
,
aoemaj
,
aoemin
,
other
aoed.conf
-entry
properties,
fetched with
ddi_prop_get_int(9F).
If any required property is missing,
return failure.
DKIOCGETEFI
and
DKIOCSETEFI
ioctls.
Adrive
.
aoed.conf
.
Open routine adopen works as follows:
ENXIO
.
kstat_t
structures
for the drive
and for the partition being opened,
as necessary.
FNDELAY
or
FNONBLOCK
set in
open
flags)
and the
Adrive
state is
ADCLOSED
,
call
driveinit
to start the initialization process.
FNDELAY
nor
FNONBLOCK
was set,
call
openwait
to start initialization if necessary
and wait for completion.
If
openwait
fails,
return the error reported.
Check whether the partition being opened
is of length zero;
if so,
return
ENXIO
.
Drive initialization comprises these steps:
If any step fails,
initialization is completely abandoned:
the drive state is reset to
ADCLOSED
Acmd
for this drive;
broadcast an AoE Query-Config command
with the desired
aoemaj
and
aoemin
addresses to the desired AoE channel,
to locate the target.
Set drive state to
ADWAOE
.
Acmd
;
drive state is
ADWAOE
Acmd
pool to
maxbuf
entries.
If the maximum data-segment size
wasn't specified as a configuration property,
extract it from the response message,
interpreting zero as the default value 1024;
compare it with the value configured for the channel,
returned by
aoecomm_maxdata;
store the lesser in
ep.
Compose and send an ATA IDENTIFY command.
Set state to
ADWATA
.
ADWATA
dk_geom
structure in
ep
and call
fudgegeom
to tweak the numbers if necessary;
if
VTOC-improper,
invent a cylinder size
such that the largest possible cylinder number
(given the size of this drive)
is bounded by
ULONG_MAX
,
for
disksort(9F).
Set up whole-disk partition limits.
Fill in the configuration
kstat_t
structure,
now that dynamically-determined values
like
maxdata
and
mibsize
are known.
Set state to
ADGATA
.
Awaken any processes sleeping in
openwait
for this drive.
ADGATA
discovered within
openwait
ADWPTAB
.
Call
readptab
to read the disk label,
filling in all remaining partition entries,
the label type,
and possibly
other label-specific data
in the
Adrive
.
When all is well,
set state to
ADREADY
.
ADCLOSED
,
the
Acmd
pool is freed.
The next
open
attempt will start over.
Locks are used to assure that at most one process at a time is working on initializing a given drive. Other processes desiring to access the drive sleep until initialization succeeds or is abandoned due to an error.
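The initialization sequence behaves like a linear state machine. The sketch below assumes the states are declared in the order in which they are listed for the Adrive structure, so that ordered comparisons work; the helper names adnext and have_identify are hypothetical.

```c
enum adstate { ADCLOSED, ADWAOE, ADWATA, ADGATA, ADWPTAB, ADREADY };

/* Advance one step when the current step succeeded;
 * abandon initialization and start over when it failed. */
static enum adstate adnext(enum adstate s, int ok)
{
    if (!ok)
        return ADCLOSED;
    return s < ADREADY ? (enum adstate)(s + 1) : ADREADY;
}

/* IDENTIFY data is valid from ADGATA onward. */
static int have_identify(enum adstate s)
{
    return s >= ADGATA;
}
```

Because the states are ordered, a test such as state >= ADGATA is enough to decide whether label reads may use the normal I/O path.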
Close routine adclose works as follows:
Adrive
's
condition variable,
until both the active- and
pending-buf
lists have drained
and the active-command count is zero.
Acmd
pool.
Clear the drive-is-closing flag.
Set drive state to
ADCLOSED
.
Detach routine addetach works as follows.
NULL
).
buf
structs for this drive
is not empty,
log an error message
and return failure.
(This shouldn't happen.)
Acmd
s
for this drive;
clear out the
Acmd
pool.
kstat_t
structures associated with this drive,
including those for all partitions.
kmutex_t
locks and the
kcondvar_t
condition variable.
Adrive
.
The system calls addetach only after all references to this device instance have been closed; adclose blocks until no pending I/O operations remain. Hence addetach need show no mercy to any I/O still unfinished.
Read routine
adread
and write routine
adwrite,
called only for the character (raw) device,
just call
physio(9F)
which allocates a
buf
structure and calls
adstrategy.
Block-device I/O,
whether through the file system
or by direct
read
or
write
call,
is handled by the system;
the device driver sees only a call to
adstrategy
for each block.
All I/O transfers
are therefore queued by strategy routine
adstrategy,
which works as follows:
ENXIO
.
DEV_BSIZE
(512).
If not,
abort the transfer with
ENXIO
.
ADGATA
,
call
openwait
to wait until I/O is possible;
evidently a non-blocking open is in effect.
If
openwait
reports an error,
abort the transfer
with that error code.
(Why
ADGATA
rather than
ADREADY
?
So the normal I/O path may be used
to read the disk label.)
ENXIO
.
->b_resid
set to
bp->b_bcount
(so
read
will return zero),
call
biodone(9F),
and return.
->b_bcount
to fit.
Abuf
for this
buf
structure;
store its address in
bp->b_private
.
Initialize the
count and absolute starting block number.
->b_resid
;
call
disksort(9F)
to insert this transfer in the drive's queue of pending requests.
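The bounds checks in the strategy routine can be sketched as below. The exact conditions are assumptions pieced together from the steps above: in particular, treating a start beyond the end of the partition as ENXIO and a start exactly at the end as end-of-file is a guess at the driver's behaviour, and clip_xfer is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

#define DEV_BSIZE 512

/* Check a transfer of *count bytes starting at block blkno within a
 * partition of partlen blocks.  Returns 0 if the transfer may proceed
 * (with *count possibly trimmed), -1 for end-of-file (caller sets
 * b_resid = b_bcount and calls biodone), or an errno value. */
static int clip_xfer(uint64_t blkno, uint64_t *count, uint64_t partlen)
{
    uint64_t nblk;

    if (*count % DEV_BSIZE != 0)
        return ENXIO;               /* not a whole number of sectors */
    if (blkno > partlen)
        return ENXIO;               /* starts beyond the partition */
    if (blkno == partlen)
        return -1;                  /* starts exactly at the end: EOF */
    nblk = *count / DEV_BSIZE;
    if (nblk > partlen - blkno)     /* runs off the end: trim to fit */
        *count = (partlen - blkno) * DEV_BSIZE;
    return 0;
}
```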
Ioctl routine adioctl calls openwait to check that initialization is complete (in case this was a non-blocking open), then switches on the ioctl command code.
The data-model rules for ioctl in Solaris complicate matters. When a 32-bit program makes ioctl calls under a 64-bit kernel, conversions may be necessary, depending on the data used by the particular ioctl command. Sometimes the 32- and 64-bit forms of a data structure have the same size and layout; sometimes they differ, and the appropriate version must be filled in and returned. In one case they differ but Sun doesn't ship a header file declaring the 32-bit form even though it is needed for format(1M). See Writing Device Drivers for more about this mess.
In another case, handled by efiioc, the argument format differs according to the Solaris version. This is why adattach checks the version and sets a flag.
The following standard disk-device commands are supported. Many, but not all, are listed in the Writing Device Drivers book and described in somewhat more detail in dkio(7D).
DKIOCINFO
DKC_DIRECT
,
with a constant large controller number
(in the hope that it won't conflict with a real one),
the instance number as the disk unit number,
slave number zero.
DKIOCGAPART
DKIOCSAPART
Adrive
is written,
not the disk label.
DKIOCPARTINFO
DKIOCGGEOM
DKIOCG_PHYGEOM
dk_geom
structure.
DKIOCSGEOM
dk_geom
structure in
ep,
without changing the on-disk label.
DKIOCGMEDIAINFO
dk_minfo
structure containing
the media type
(always
DK_FIXED_DISK
),
sector size
(always
DEV_BSIZE
),
and total sector count
of the disk.
DKIOCREMOVABLE
DKIOCGETEFI
DKIOCSETEFI
DKIOCPARTITION
DKIOCGVTOC
DKIOCSVTOC
DKIOCGMBOOT
DKIOCSMBOOT
DIOCTL_RWCMD
DIOCTL_GETMODEL
DIOCTL_GETSERIAL
USCSICMD
To send an AoE command from within the driver:
Acmd
from the pool for this drive.
If none is available,
no more commands may be sent for now.
Acalloc
initializes the
Adrive
pointer,
header-length field,
and the common part of the AoE header
(AoE command,
aoemaj
and
aoemin,
version code and other flags,
unique tag value),
and zeroes all other fields.
buf
pointer,
the starting block number,
and the transfer length and offset.
Acmd
on the pending-command list
for this
aoechan.
Chantimer is called at regular intervals to discover commands that must be retransmitted or expired.
AoE messages arrive by calls to
adreceive,
which calls
adlookup
to locate the corresponding
Acmd
,
then processes the message
according to its command code.
If the message is garbled
or otherwise questionable,
the
Acmd
is returned to the channel's pending-command list;
otherwise it is freed to the pool for its
Adrive
.
A garbled, erroneous, or unsolicited (no
Acmd
)
message is
logged
and discarded.
Here is a summary
of major internal procedures,
grouped by source file.
Unless otherwise noted,
each is global within the
aoed
module,
but is not meant to be called from elsewhere in the system;
hence
aoed.h
supplies both a prototype
and a name-hiding macro,
as explained above.
aoed.c
static int openwait(ep)
Adrive *ep;
Return zero when
ep
has reached
state
ADREADY
,
blocking if necessary;
or return a nonzero
errno
value if initialization fails.
If
ep
is in state
ADCLOSED
(initialization hasn't started yet),
call
driveinit
to get things started.
At each subsequent step prior to
ADREADY
,
call
cv_wait_sig(9F)
to block
on the condition variable in
ep.
If
cv_wait_sig
is interrupted by a
UNIX
signal,
return
EINTR
.
Openwait
is local to
aoed.c
.
static int readptab(ep)
Adrive *ep;
Search device ep for a valid disk label:
<
ADGATA
or the disk size is not known,
return -1.
AVLABVTOC
.
AVLABEFI
.
.
If no proper VTOC label is found,
call
fakevtoc
to invent one.
In either case return 1,
leaving the label-type flags
set to
AVLABDOS|AVLABVTOC
.
AVLABDOS
.
AVLABVTOC
.
Readptab
is local to
aoed.c
.
void fudgegeom(ep)
Adrive *ep;
Invent consistent geometry values for drive ep if necessary. At present this means:
aoed.conf
supplied both
hd
and
sec
values
for this drive,
adopt them.
Compute the number of cylinders
by dividing
the drive's total size
(number of user-accessible sectors)
as reported by ATA IDENTIFY
by
hd*sec.
aoed.conf
.
At present these values are
hd=64
sec=890
;
the resulting very-large cylinders
are required to avoid bugs in
some versions of
newfs(1M)
when making very large file systems.
dk_geom
structure in
ep.
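The cylinder computation is an integer division of the total sector count by the sectors per cylinder. A minimal sketch, with a hypothetical structure and function name, using the default head and sector counts quoted above:

```c
#include <stdint.h>

struct geom_sketch {
    uint32_t hd;    /* heads */
    uint32_t sec;   /* sectors per track */
    uint64_t cyl;   /* whole cylinders */
};

/* Derive a cylinder count from the total number of user-accessible sectors. */
static struct geom_sketch mkgeom(uint64_t nsect, uint32_t hd, uint32_t sec)
{
    struct geom_sketch g = { hd, sec, 0 };

    g.cyl = nsect / ((uint64_t)hd * sec);
    return g;
}
```

Any sectors beyond the last whole cylinder are simply not covered by the invented geometry; this sketch does not try to model how the driver treats them.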
int zeroblock(ep, bno)
Adrive *ep;
Write zeroes to absolute sector
bno
on device
ep;
return zero for success,
nonzero
errno
value on failure.
Calls
adrwkern
to do the real work.
int adrwkern(ep, rw, bno, buf, len)
Adrive *ep;
int rw;
dev_t dev;
diskaddr_t bno;
void *buf;
int len;
int adrwuser(ep, rw, bno, buf, len, resp)
Adrive *ep;
int rw;
dev_t dev;
diskaddr_t bno;
void *buf;
int len;
int *resp;
Read
(rw==B_READ
)
or write
(B_WRITE
)
len
bytes to or from kernel or user address
buf
from or to absolute sector
bno
on drive
ep.
Return zero for success,
nonzero
errno
value on failure.
If the transfer finished without error
but without reading or writing the whole buffer,
adrwkern
returns nonstandard error value
EXDEV
;
adrwuser
returns zero,
but if
resp
is nonzero,
stores the number of bytes not transferred in
*
resp.
Adrwkern
makes a synthetic buffer header
and calls
adstrategy;
adrwuser
makes a
struct
uio
and calls
physio(9F).
Both operate on the whole-disk partition.
addadk.c
int dadkioctl(ep, cmd, arg, mode)
Adrive *ep;
int cmd;
intptr_t arg;
int mode;
Implement these undocumented ATA-specific ioctl calls. Some are used by format(1M).
DIOCTL_RWCMD
DIOCTL_GETMODEL
DIOCTL_GETSERIAL
Adrive
during device initialization.
aduscsi.c
int aduscsi(ep, cmd, arg, mode)
Adrive *ep;
int cmd;
intptr_t arg;
int mode;
Implement the
SCSI-specific
USCSICMD
ioctl,
faking
the SCSI
INQUIRY,
TEST UNIT READY,
READ CAPACITY,
and READ BLOCK LIMITS
commands.
In all cases this comprises
encoding and returning
various data stored in the
Adrive
.
Aduscsi is needed to work around bugs in some versions of format(1M).
Those familiar with SCSI may note that the SCSI standard doesn't define READ BLOCK LIMITS for disk devices. Format calls it anyway.
The ATA IDENTIFY command returns a device-model string; SCSI INQUIRY returns separate vendor and model strings. The two SCSI strings together afford less than half the maximum length of the single ATA string, and in fact some ATA devices return strings too long to fit in the SCSI message. The faking code does the best it can.
addos.c
int fetchdos(ep)
Adrive *ep;
Examine the boot sector (sector 0) of drive
ep
for a valid DOS partition table.
If none is found,
return 0.
Otherwise
scan the four direct partitions
for one of type
SYSID_SUNOS
or
SYSID_SUNOS2
.
If such a partition is found,
set the encapsulated-label partition
start and length to those of the partition,
and record fake geometry parameters in
ep.
If there is more than one such partition,
use the highest-numbered.
Whether an encapsulated-label partition
was found or not,
set the
AVLABDOS
label-type flag
and return 1.
void zapdos(ep)
Adrive *ep;
AVLABDOS
label-type flag.
int dosioc(ep, cmd, arg, mode)
Adrive *ep;
int cmd;
intptr_t arg;
int mode;
Handle DOS-specific disk ioctl cmd, with the given arg and call mode:
DKIOCGMBOOT
DKIOCSMBOOT
adefi.c
int fetchefi(ep)
Adrive *ep;
Search drive
ep
for a valid
EFI label,
trying the primary (sector 1)
and backup (last sector) GPT locations as necessary.
If a label is found,
store the starting sector address and length
of the first eight partitions
in the drive's partition table,
set label-type flag to
AVLABEFI
,
and return 1.
If there is no label,
return 0;
if an I/O error occurred,
return -1.
void zapefi(ep)
Adrive *ep;
Invalidate any existing EFI label on the drive
by zeroing the primary and backup GPT sectors;
clear label-type flag
AVLABEFI
.
int efiioc(ep, cmd, arg, mode)
Adrive *ep;
int cmd;
intptr_t arg;
int mode;
Handle EFI-specific disk ioctl cmd, with the given arg and call mode:
DKIOCGETEFI
DKIOCSETEFI
dk_efi_t
structure addressed by
arg,
via calls to
adrwuser.
Take care to account for the different format
the structure had in Solaris 9.
For
DKIOCSETEFI
only,
call
zaplabels
with flag argument
AVLABEFI
,
then set that flag
in the label-type flags of
ep.
The Solaris
libefi(3LIB)
library uses these calls;
the real label processing is done by the library,
not the kernel.
DKIOCPARTITION
partition64
structure
addressed by
arg
directly from the disk
(not from the driver's in-core partition table).
Copy the partition's starting sector,
length,
and UUID
to the
partition64
and copy it back to the user.
advtoc.c
int fetchvtoc(ep)
Adrive *ep;
Search the
Solaris-labelled part of the disk
for a valid
VTOC label,
checking the primary location
(relative sector 0)
and the backups
(sectors 1, 3, 5, 7, and 9
in the last track)
as necessary.
If a valid label is found,
use its contents to fill in the drive's
vtoc
and
dk_geom
structures and its partition table;
set the drive's label type to
AVLABVTOC
;
and return 1.
If no label is found
or an I/O error occurs,
return -1.
void zapvtoc(ep)
Adrive *ep;
int fakevtoc(ep)
Adrive *ep;
a
and c
)
map all of the disk but the reserved cylinders
(usually the last two);
set the label type to
AVLABVTOC
;
and return 1.
int vtocioc(ep, cmd, arg, mode)
Adrive *ep;
int cmd;
intptr_t arg;
int mode;
If the drive's label type is not
AVLABVTOC
,
return error
ENOTSUP
.
Otherwise
handle VTOC-specific
ioctl
cmd,
with the given
arg
and call
mode:
DKIOCGVTOC
vtoc
structure stored in
ep.
DKIOCSVTOC
vtoc
structure addressed by
arg;
copy that
vtoc
to
ep;
then,
using that
vtoc
and the drive's current
dk_geom
,
update the primary copy and all backups
of the on-disk VTOC label.
Call
zaplabels
with argument type
AVLABVTOC
.
adio.c
int driveinit(ep, ndelay)
Adrive *ep;
int ndelay;
Start initialization for drive ep:
Acmd
,
and immediately use it to send
an AoE Query-Config command.
Initially
ep
should be in state
ADCLOSED
.
If all is well,
return 1
with
ep
in state
ADWAOE
;
if an error occurs,
return -1
with the state unchanged.
static void chantimer(acp)
void *acp;
Acp
really points to an
Achan
,
but the argument type must be
void
*
to satisfy the rules of
timeout(9F).
Examine each
Acmd
awaiting a response from the
aoechan
associated with
acp.
If an
Acmd
has expired
but another retry is allowed,
retransmit it.
If no more retries are allowed,
cancel the command:
buf
structure,
pass it to
biodone(9F)
with error
ETIME
,
and
call
adkillbuf
to free any other
Acmd
s
associated with the same
buf
.
Acmd
to the pool for the corresponding
Adrive
.
A copy of
chantimer
normally runs every
TMO_CLOCK
ticks
(20 ms)
for each AoE channel that has been used at least once
by the
aoed
module.
If a command was retransmitted,
chantimer
quits,
and the next run begins in
TMO_CLOCKHOLD
ticks
(10 ms),
to avoid flooding a congested network
with retries.
Chantimer
is local to
adio.c
,
though it is called from outside the module
via
timeout.
static void adreceive(chan, mp)
int chan;
mblk_t *mp;
AoE message mp has arrived via channel chan:
NULL
,
the channel is shutting down.
Locate the corresponding
Achan
,
clear the channel-is-active flag,
and return.
ACLILLFORM
;
return.
Acmd
for which this is a response,
and to remove it from the active-command list.
If none is found,
log it
with code
ACLUNSOL
and return.
ADWAOE
,
copy the MAC address and
maxbuf
value to the
Adrive
;
adjust the
maximum data-segment size if required;
call
acfree
to free the
Acmd
;
and move to the
next initialization step.
ADWATA
,
this is the response to an ATA IDENTIFY command:
store data;
free the
Acmd
;
and
move to the
next initialization step.
ADGATA
,
ADWPTAB
,
or
ADREADY
,
this is the response to an ATA READ or WRITE request.
Locate the corresponding
buf
structure;
free the
Acmd
.
Store read data if any;
if the transfer is now complete
or an error occurred,
free the
Abuf
and call
biodone(9F).
If the device-is-closing flag is set
and both the active-transfer list
and the pending-transfer queue
are empty,
signal the condition variable
to awaken
adclose.
In any case call
adstart
to start more I/O if any remains pending,
whether for this transfer or another.
ACLILLFORM
or
ACLUNSOL
,
and return the
Acmd
to the active-command list.
Adreceive
is local to
adio.c
,
but is usually registered with
aoecomm
as the receiver
for one or more channels.
void adstart(ep)
Adrive *ep;
Start I/O transfers
on drive
ep,
looping until no
further
Acmd
is available
(acalloc
returns
NULL
)
or no
buf
structure remains queued.
Depending on the transfer length
and the maximum data-segment size for this drive,
several commands may be required
for a single
buf
structure.
Adstart
starts as much as it can.
If it was forced to stop
(ran out of
Acmd
s)
partway through a
buf
,
the next call to
adstart
continues with that buffer
before starting another.
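How one buf is split into several commands can be sketched with two small helpers. The names ncmds and seglen are hypothetical; the only rule taken from the text is that no command carries more than the drive's maximum data-segment size.

```c
#include <stddef.h>

/* Number of commands needed to move len bytes, maxdata bytes at a time. */
static size_t ncmds(size_t len, size_t maxdata)
{
    return (len + maxdata - 1) / maxdata;
}

/* Length of the next segment of a len-byte transfer, starting at off. */
static size_t seglen(size_t len, size_t off, size_t maxdata)
{
    size_t left = len - off;

    return left < maxdata ? left : maxdata;
}
```

With the default maxdata of 1024, an 8192-byte transfer needs eight commands; if only five Acmds are free, the remaining three segments wait for the next call to adstart, which resumes at the saved offset.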
int adstcmd(dp)
Acmd *dp;
Compose the AoE command described in dp in a new STREAMS buffer, and call aoecomm_send to send it. On success, call adstore to add dp to the channel's active-command list and return 1. On error (dp has unreasonable contents, or aoecomm_send failed), return -1.
Before calling
aoecomm_send
adstcmd
locates the
Achan
for the target channel
and checks the channel-is-active flag.
If the flag is clear,
aoecomm_initdriver
is called to re-register
adreceive
as the receiver for this channel;
if
aoecomm_initdriver
succeeds,
the flag is then set.
This affords recovery when a channel is shut down
and then restarted:
adreceive
clears the flag when
aoecomm
reports the shutdown;
adstcmd
re-registers the receiver when next possible,
and sets the flag when it happens.
Pending messages presumably time out and are retried.
If the channel comes back within the timeout,
no data are lost;
if not,
an I/O error is reported as for any other timeout.
These fields must have been filled in in dp:
->ep
Adrive
to which the command should be sent.
Both the target address
(aoechan, aoemaj, aoemin)
and Ethernet MAC address
are significant;
the latter may be the broadcast address.
->tag
->aoehd
->aoelen
aoehd
is preallocated,
with room for
AOEHEADLEN
(32)
bytes,
of which
aoelen
are used by this command.
->exptime
->maxtries
->ntries
maxtries
times,
i.e. at most
maxtries-1
retries;
it has already been sent
ntries
times.
->bp
->len
->off
bp
is not
NULL
and
B_READ
is not set in its flags,
append
len
bytes of data
from that buffer,
starting at offset
off
within the buffer.
Dp->expires
is set to the current time in clock ticks
(read from
ddi_get_lbolt(9F))
plus the expiry interval,
with the latter lengthened a little
for retries:
dp->expires = ddi_get_lbolt() + (dp->exptime * (1 + dp->ntries))
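The effect of the formula is a linear back-off: each retry waits one more expiry interval than the one before. A user-mode sketch, in which the tick type and the helper names are assumptions and the current time is passed in rather than read from ddi_get_lbolt:

```c
typedef long ticks_t;   /* stands in for the kernel's clock-tick type */

/* Expiry time for a command that has already been sent ntries times. */
static ticks_t expiry(ticks_t now, ticks_t exptime, int ntries)
{
    return now + exptime * (1 + ntries);
}

/* A command may be sent at most maxtries times in all. */
static int may_send_again(int ntries, int maxtries)
{
    return ntries < maxtries;
}
```

For an expiry interval of 50 ticks, the first transmission expires 50 ticks after it is sent, the first retry after 100, the second after 150.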
adcmd.c
Extract the tag value and target-address numbers
from the AoE message pointed to by
p.
Search the active-command list for channel
chan
for an
Acmd
with a matching tag,
associated with an
Adrive
with a matching target address.
If a match is found,
remove the
Acmd
from the active-command list,
set its
state
to
CSLOOSE
,
decrement the active-command count
in the
Adrive
,
and return the address of the
Acmd
.
If there is no match,
return
NULL
.
void adstore(chan, dp)
int chan;
Acmd *dp;
Add
dp
to the active-command list for channel
chan;
increment the active-command count
in the corresponding
Adrive
.
The
state
of
dp
must be
CSLOOSE
;
it is set to
CSCHAN
.
void adremove(chan, dp)
int chan;
Acmd *dp;
void adcremove(cp, dp)
Achan *cp;
Acmd *dp;
If
dp
is on the active-command list for channel
chan
or
cp,
remove it;
set its
state
to
CSLOOSE
;
decrement the active-command count
of the corresponding
Adrive
.
Cp
must be locked before calling
adcremove.
void adkilldrive(chan, ep)
int chan;
Adrive *ep;
For each
Acmd
on the active-command list for channel
chan
and associated with
ep:
remove and free
(acfree)
the
Acmd;
decrement the active-command counter for
ep.
void adkillbuf(chan, bp)
int chan;
struct buf *bp;
For each
Acmd
on the active-command list for channel
chan
and associated with
bp:
remove and free
(acfree)
the
Acmd;
decrement the active-command counter for
the corresponding
Adrive.
Acmd *acalloc(ep, aoecmd, hdlen)
Adrive *ep;
int aoecmd, hdlen;
Allocate an
Acmd
from the pool associated with
ep,
and
initialize its contents as follows:
Adrive
pointer to
ep.
->nexttag & ~TAGOFFSET
in the
tag
field of the
Acmd
,
taking care to skip the value zero
(used by AoE targets
for unsolicited broadcast replies).
Increment
ep->nexttag
for next time.
CSLOOSE
.
Acmd
,
or
NULL
if none was available.
Messages with tag values greater than or equal to
TAGOFFSET
(0x80000000
)
are reserved for non-kernel diagnostic programs.
There are none such yet in the Solaris implementation.
Originally the sense was inverted,
with
TAGOFFSET
set in messages generated by the Solaris kernel driver,
lesser values reserved for non-kernel use;
this was changed in version 1.3.3
for consistency with the Linux AoE implementation.
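The tag rule can be sketched as follows. The counter is passed by pointer here rather than living in an Adrive, and the function name is hypothetical; the mask and the skipping of zero follow the description above.

```c
#include <stdint.h>

#define TAGOFFSET 0x80000000u

/* Produce the next kernel-side tag: the high bit is always clear and
 * the value zero is never returned.  *nexttag is advanced for next time. */
static uint32_t tag_alloc(uint32_t *nexttag)
{
    uint32_t tag;

    do {
        tag = (*nexttag)++ & ~TAGOFFSET;
    } while (tag == 0);
    return tag;
}
```

When the counter passes 0x7fffffff the masked value wraps to zero, which is skipped, so the sequence continues with 1.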
void acfree(ep, dp)
Adrive *ep;
Acmd *dp;
Return
dp
to the pool associated with
ep,
setting its state to
CSFREE
.
void acinit(ep, nfree)
Adrive *ep;
int nfree;
Allocate new
Acmd
s
from system memory,
initialize them with
state
CSFREE
,
and add them to the pool associated with
ep
until the pool contains at least
nfree
entries
or the system runs out of memory.
If the pool is not empty,
but has fewer than
nfree
entries,
it is topped up to
nfree.
If it is already larger than
nfree
the system panics.
void acclear(ep)
Adrive *ep;
Empty the
Acmd
pool associated with
ep,
returning memory to the system.
adsubr.c
Adrive *devtodrive(dev, s)
dev_t dev;
char *s;
Return a pointer to the
Adrive
corresponding to Solaris device
dev.
If none exists,
log an error message
(including
s
if
non-NULL
)
and return
NULL
.
char *aoeerrstr(aoecode, buf, len)
int aoecode;
char *buf;
int len;
char *ataerrstr(atacode, buf, len)
int atacode;
char *buf;
int len;
Aoeerrstr returns a string encoding aoecode as a decimal number, with a string explaining its meaning as an AoE error code if one is known.
Ataerrstr returns a string encoding atacode as an eight-bit hexadecimal number, with a string containing the standard abbreviation describing each bit that is set.
Either routine stores the string in
buf,
truncating safely after
len
bytes if necessary.
The return value is
buf.
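A routine in the style of ataerrstr might look like the sketch below. The bit names are the usual ATA error-register abbreviations and are an assumption here; the driver's own table and output format may differ, and ata_err_sketch is a hypothetical name.

```c
#include <stdio.h>

/* Encode the low eight bits of atacode as hex, followed by the name of
 * each bit that is set, highest bit first.  Output is truncated safely
 * to fit len bytes including the terminating NUL.  Returns buf. */
static char *ata_err_sketch(int atacode, char *buf, int len)
{
    static const char *const names[8] = {
        "AMNF", "TK0NF", "ABRT", "MCR", "IDNF", "MC", "UNC", "ICRC"
    };
    int n, bit;

    if (len <= 0)
        return buf;
    n = snprintf(buf, (size_t)len, "0x%02x", atacode & 0xff);
    for (bit = 7; bit >= 0 && n >= 0 && n < len; bit--) {
        if (atacode & (1 << bit))
            n += snprintf(buf + n, (size_t)(len - n), " %s", names[bit]);
    }
    return buf;
}
```

Because snprintf never writes more than the space it is given, the loop stops as soon as the running length reaches len, leaving a terminated, truncated string.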
char *admsg(dev, ep, fmt, ...)
dev_t dev;
Adrive *ep;
char *fmt;
Write the message given by printf-like format string fmt and any following arguments to syslog and the console, prepending a string of the form
aoed
inst
s
part
chan/
aoemaj/
aoemin
If
ep
is
NULL
,
the
Adrive
is determined from
dev
if possible.
If no
Adrive
can be found,
the target address is omitted.
If
dev
is
NODEV
,
the partition name is omitted.
A newline is appended to the message; fmt should contain none.
The message will be truncated safely at 300 characters.
char *adpname(part)
int part;
Return a short string name
for partition
part,
in the style used for names in
/devices
:
a
for
part
0,
b
for 1,
and so on to
h
for 7;
wd
for 8;
???
for anything else.
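The mapping just described fits in a small lookup table. The function name below is hypothetical, and the sketch covers only the partition numbers the description mentions.

```c
/* Name of partition `part` in the style used under /devices. */
static const char *pname_sketch(int part)
{
    static const char *const names[] = {
        "a", "b", "c", "d", "e", "f", "g", "h", "wd"
    };

    if (part < 0 || part > 8)
        return "???";
    return names[part];
}
```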
void zaplabels(ep, f)
Adrive *ep;
int f;
int solversion();
Solversion
works by examining global value
utsname.release
.
This is admittedly a hack,
as is the very notion of this routine,
which should be used only
as a last resort.
int putllprop(dev, dip, name, val)
dev_t dev;
dev_info_t *dip;
char *name;
uint64_t val;
aoed
called it directly
the module wouldn't load on Solaris 7 or 8.
putllprop
does the work through undocumented
but apparently-stable interfaces that exist
even in older systems.
The
Nblocks
and
Size
properties,
at least one of which is required by ZFS,
have 64-bit integer values.
adtrace.c
void trace(id, p0, i0, i1)
int id;
void *p0;
int i0, i1;
Store the arguments and a sequence number in the next slot in a circular buffer of 1024 trace records. There is sufficient locking to prevent concurrent calls from using the same slot.
Trace is meant for collecting real-time event traces during debugging.
No tools are supplied to read the trace buffer; use adb or mdb(1).
A
kmutex_t
lock is initialized when
trace
is first called,
but never destroyed.
Hence if
trace
is used and the
aoed
module is reloaded,
the system loses a little memory
until the next boot.
Since
trace
is used mostly to gather information
just before a reproducible crash or hang,
this is unlikely to cause trouble in practice.
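A circular trace buffer of this kind can be sketched in a few lines. The record layout and names are assumptions, and the locking is left out: in the driver the slot reservation would be made while holding the kmutex_t mentioned above.

```c
#include <stddef.h>

#define NTRACE 1024             /* number of trace records in the ring */

struct trec {
    unsigned long seq;          /* sequence number of this record */
    int           id;
    void         *p0;
    int           i0, i1;
};

static struct trec   tbuf[NTRACE];
static unsigned long tseq;      /* next sequence number to hand out */

/* Record one event, overwriting the oldest record once the ring is full. */
static void trace_sketch(int id, void *p0, int i0, int i1)
{
    unsigned long seq = tseq++; /* the driver would hold its lock here */
    struct trec *t = &tbuf[seq % NTRACE];

    t->seq = seq;
    t->id = id;
    t->p0 = p0;
    t->i0 = i0;
    t->i1 = i1;
}
```

When reading the buffer from a debugger, the record with the highest sequence number is the most recent event, and the slot after it holds the oldest one still present.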
This is a summary of function and implementation; see the manual pages for details of usage.
Each program has a single source file,
but
C programs use the
library
described in its own section below.
Most C programs
also use some of the
network data structure include files
from the
include
directory.
Aoestart starts an AoE channel according to its arguments. A network device and channel number must be given. Optional parameters include Ethernet protocol type and maximum data-segment size; the defaults are, respectively, the type given in the AoE protocol spec and a size derived from the device's MTU setting.
Aoestart
calls
comminit
to open the Ethernet device,
configure it,
and perform the
channel startup handshake
with the
aoecomm
and
aoectl
kernel modules.
If all is well,
it calls
achattach
to keep the channel open.
By default it then writes an
ACSEND
command to
/dev/aoectl
to broadcast a Query-Config command,
in the hope that every AoE target will respond,
and that
aoemon
will hear the responses and enable all the devices.
The source code is file
aoestart.c
.
Aoestop calls achdetach to close the AoE channels named by its arguments.
The source code is file
aoestop.c
.
Aoectl
opens
/dev/aoectl
and writes commands
according to its arguments:
probe
ACSEND
to send to the broadcast address
on the channel named.
list
ACDEVENAB
commands with subcommand
ADQUERY
.
If a target has wildcards,
first let the system discover whether any matching device
is enabled;
if so,
iterate over possible values to find out what they are.
On older hardware this takes a few seconds.
enable
disable
ACDEVENAB
commands with subcommand
ADENAB
or
ADDISAB
.
Wildcards are handled by the system.
The source code is file
aoectl.c
.
Aoemon
opens
/dev/aoemon
and loops forever
reading it.
An
ACLOG
message of type
ACLUNSOL
containing an AoE Query-Config response
causes the responding target device to be enabled,
and the action reported with
logmsg.
Any other message is just logged with
logpkt.
The source code is file
aoemon.c
.
Common code used by more than one user-mode component
or isolating some implementation dependency
is compiled into object library
libcmd.a
,
used with all the programs listed above.
Header file
aoecmd.h
(in the
user
directory,
since no kernel component needs it)
declares prototypes for all library routines,
as well as a few parameter values.
Here is a list of library routines, organized by source file.
comm.c
Communication-channel startup.
Used by
aoestart.
int comminit(chan, name, proto, maxdata, retaddr)
int chan;
char *name;
int proto;
int maxdata;
char retaddr[LEN_EADDR];
Open Ethernet device
name.
Configure it to receive AoE messages
with Ethernet protocol type
proto
and maximum data-segment size
maxdata
(if
maxdata
<=
0
,
a size computed from the device's current MTU).
Make the resulting file into AoE channel
chan.
Copy the Ethernet MAC address of the device
to
retaddr,
and return the resulting file descriptor;
or return -1 if an error occurred.
If
maxdata
is zero,
fetch the device MTU,
compute the corresponding AoE maximum-data size,
and so inform
aoecomm
.
Hidden inside comminit are many Solaris DLPI calls and the AoE startup dance. The channel is initialized, but not attached to its file system mount point; see achattach.
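The MTU-to-data-size computation is not spelled out here. A plausible sketch follows, assuming the 10-byte AoE common header and 12-byte ATA command header of the published AoE protocol, an MTU that excludes the Ethernet header, and data carried only in whole 512-byte sectors. The function name and the rounding rule are assumptions; comminit may compute the value differently.

```c
enum {
    AOE_HDR_LEN = 10,   /* assumed: AoE common header after the Ethernet header */
    ATA_HDR_LEN = 12,   /* assumed: AoE ATA command header */
    SECTOR_LEN  = 512   /* ATA sector size in bytes */
};

/*
 * Return the largest data-segment size, in bytes, that fits in one
 * frame on a device with the given MTU, rounded down to a whole
 * number of sectors; return 0 if not even one sector fits.
 */
int maxdata_from_mtu(int mtu)
{
    int room = mtu - AOE_HDR_LEN - ATA_HDR_LEN;

    if (room < SECTOR_LEN)
        return 0;
    return room / SECTOR_LEN * SECTOR_LEN;
}
```

Under these assumptions a standard 1500-byte MTU allows two sectors per message, and a 9000-byte jumbo MTU allows seventeen.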
achan.c
Arrange that file descriptor fd will remain open, and the corresponding AoE channel active, even if the calling program closes fd or exits. Remember that fd is associated with channel chan so the arrangements may be undone by achdetach.
Return > 0 for success,
< 0 for failure
or if
chan
has an unreasonable value.
int achdetach(chan)
int chan;
Undo what achattach did, closing the file descriptor and shutting down AoE channel chan. If chan is negative, do this for every active channel.
Return > 0 if this was done; zero if there is no file descriptor associated with chan (or none with any channel, if chan is negative); < 0 if an error occurred.
Achattach
calls
fattach(3)
to attach
fd
to file
/etc/aoe/ch
nn
(nn
the channel number,
expressed as a two-digit decimal number),
after creating the file if necessary.
Achdetach
calls
fdetach(3)
on
/etc/aoe/ch
nn.
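The attachment-point name can be built as in this sketch. The helper chanpath is hypothetical, and so is the choice to reject channel numbers that don't fit in two decimal digits; the text above says only that unreasonable values are refused.

```c
#include <stdio.h>

#define ACH_DIR "/etc/aoe"      /* directory named in the text above */

/*
 * Hypothetical helper: write the attachment-point name for channel
 * chan into buf.  Return 0 on success, -1 if chan is out of the
 * assumed range 0..99 or buf is too small.
 */
int chanpath(int chan, char *buf, size_t len)
{
    int n;

    if (chan < 0 || chan > 99)
        return -1;
    n = snprintf(buf, len, "%s/ch%02d", ACH_DIR, chan);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Channel 7 thus maps to /etc/aoe/ch07, the file on which fattach and fdetach would operate.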
trdwr.c
int tread(fd, buf, len, ms)
int fd;
void *buf;
int len, ms;
int twrite(fd, buf, len, ms)
int fd;
void *buf;
int len, ms;
Call
read
or
write,
but return -1
with
errno
set to
EINTR
if the operation hasn't completed in
ms
milliseconds.
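One way to get this behaviour is sketched below. It is not necessarily how trdwr.c does it: this version uses poll(2) to wait for the descriptor to become readable, so it bounds only the wait for data, not the read itself. The name tread_sketch is hypothetical.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd, waiting at most ms milliseconds for
 * data to arrive.  On timeout return -1 with errno set to EINTR,
 * matching the convention described in the text.
 */
int tread_sketch(int fd, void *buf, int len, int ms)
{
    struct pollfd pfd;
    int r;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    r = poll(&pfd, 1, ms);
    if (r < 0)
        return -1;              /* errno already set by poll */
    if (r == 0) {
        errno = EINTR;          /* timed out: report it as an interrupt */
        return -1;
    }
    return (int)read(fd, buf, (size_t)len);
}
```

A twrite counterpart would be the same with POLLOUT and write in place of POLLIN and read.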
log.c
Write to an error log,
using
syslog(3)
with facility
LOG_DAEMON
and a severity level specified as an abstract argument.
Why not just call
syslog
directly?
Partly to make it easier to fine-tune the implementation
(e.g. the best mapping from internal severity levels
to those of
syslog
depends on
syslog.conf
conventions);
partly to avoid stepping on the buffer-overflow
and format-string-trust problems
in some versions of
syslog.
Initialize the logging mechanism.
Should be called before any other logging call.
Name
is a string name to be included in messages.
If
tostderr
is nonzero,
messages are copied to the standard error file
as well as to the log.
void logmsg(type, format, ...)
int type;
char *format;
Write a message to the log. Format is a format string of the sort accepted by printf(3); it may be followed by arguments. Type has one of the following values, listed here from most important to least:
LE
LOG_ERR
.
LN
LOG_WARNING
.
LI
LOG_NOTICE
.
LD
LOG_DEBUG
.
LS
may be or-ed into
type
to request that the message be copied to standard error
regardless of the
tostderr
argument in
loginit.
The mapping between
type
codes and
syslog
severity values is not quite the obvious one
because the default
/etc/syslog.conf
file supplied with Solaris
throws away
daemon.notice
messages.
Each
logmsg
call produces
exactly one line of log text.
Messages accumulated in dribs and drabs
should be composed separately.
A terminating newline is neither necessary nor desired.
A message longer than 250 characters
is silently (but safely) truncated.
void logpkt(type, buf, len)
int type;
char *buf;
int len;
Log the contents of
buf,
interpreting it
as an Ethernet datagram
containing an AoE message.
If the message appears truncated
just go as far as possible;
if ill-formed,
interpret as far as possible,
then log the remainder with
loghex.
void loghex(type, buf, len)
int type;
char *buf;
int len;
Log the contents of
buf
as hexadecimal bytes,
no more than 16 to a line.
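A hex dump of this shape might be written as in the sketch below. The names hexline, loghex_sketch, and emit_stdout are hypothetical, and the callback stands in for logmsg, whose severity codes are not reproduced here.

```c
#include <stdio.h>

#define HEXPERLINE 16           /* bytes per output line, as in the text */

/*
 * Format up to HEXPERLINE bytes of buf into line, which must hold at
 * least 3*HEXPERLINE characters.  Return the number of bytes consumed.
 */
int hexline(const unsigned char *buf, int len, char *line)
{
    int i, n = len < HEXPERLINE ? len : HEXPERLINE;
    char *p = line;

    if (n < 0)
        n = 0;
    for (i = 0; i < n; i++)
        p += sprintf(p, i ? " %02x" : "%02x", buf[i]);
    *p = '\0';
    return n;
}

/* Example emitter: one line of text to standard output. */
void emit_stdout(const char *line)
{
    puts(line);
}

/* Dump len bytes of buf, calling emit once per line; return the line count. */
int loghex_sketch(const unsigned char *buf, int len, void (*emit)(const char *))
{
    char line[3 * HEXPERLINE];
    int lines = 0;

    while (len > 0) {
        int n = hexline(buf, len, line);
        emit(line);
        buf += n;
        len -= n;
        lines++;
    }
    return lines;
}
```

Emitting each line separately fits the rule stated earlier that every logmsg call produces exactly one line of log text.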
char *ethertoa(addr, buf)
char *addr;
char *buf;
Return a printable string representing
the six bytes at address
addr
as an Ethernet MAC address.
If
buf
is nonzero,
put the string there;
at least
LEN_EADDR*3
bytes should be available.
If
buf
is
NULL
,
use a static buffer.
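The buffer-size rule follows from the output format: six two-digit bytes, five separators, and a terminating NUL make 18 characters, which is LEN_EADDR*3 if LEN_EADDR is 6. A minimal sketch, assuming that value and a colon-separated lower-case hex format (the real ethertoa may format differently):

```c
#include <stdio.h>

#define LEN_EADDR 6             /* assumed: bytes in an Ethernet MAC address */

/*
 * Format the six bytes at addr as xx:xx:xx:xx:xx:xx.  Use buf if it
 * is non-NULL (it must hold LEN_EADDR*3 bytes), else a static buffer.
 */
char *ethertoa_sketch(const char *addr, char *buf)
{
    static char sbuf[LEN_EADDR * 3];
    const unsigned char *a = (const unsigned char *)addr;

    if (buf == NULL)
        buf = sbuf;
    snprintf(buf, LEN_EADDR * 3, "%02x:%02x:%02x:%02x:%02x:%02x",
        a[0], a[1], a[2], a[3], a[4], a[5]);
    return buf;
}
```

As with any routine returning a static buffer, a result obtained with a NULL buf is overwritten by the next such call.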
debug.c
int verbose;
int debug;
If verbose has a nonzero value, some programs and library routines chatter a little (via logmsg) as they work. If debug is nonzero, chatter is more copious and more detailed. Normally these are set by command-line options.
Aoelabinit writes an initial disk label to one or more disks. By default the label is VTOC format if the disk size allows that, EFI otherwise, and a label is written only if none (of either format) already exists; different choices may be specified. Writing a label of one type invalidates any existing label of the other.
This program exists only as a workaround for a bug in format(1M), which is unable to cope with an unlabelled ATA disk large enough to require EFI labelling. Probably Sun will fix this eventually, but aoelabinit will remain both to avoid stranding those with older systems and because it seems like a useful tool in its own right.
Aoelabinit uses the Solaris libefi(3LIB) library to read and write EFI labels. Since this is a shared library present only since Solaris 9 4/03, aoelabinit can be run only on newer systems. Static linking isn't practical because the library uses a kernel interface that differs in Solaris 9 and Solaris 10.
Aoeunlabel zeroes the sectors conventionally used for EFI and VTOC labels and their backups. Something of the sort must be done before a disk with an EFI label may be repartitioned on a pre-EFI system, e.g. when a disk used as part of a ZFS pool is recycled to ordinary use on a Solaris 8 system.
Aoeunlabel uses none of the special Solaris label-access libraries; it just overwrites sectors directly. Hence it works even on a pre-EFI system, unlike aoelabinit.
Aoemkconf
emits device-configuration entries
suitable for inclusion in
aoed.conf
.
An argument of the form
0/11/9
names a blade for which an entry is desired;
the
aoemin
part may be an inclusive range,
like
0/11/9-11
as a shorthand for
0/11/9
0/11/10
0/11/11
.
Entries contain only the required properties:
name="aoed"
,
parent="pseudo"
,
aoechan
and
aoemaj
and
aoemin
as specified,
instance
computed in
the standard way.
Any other argument names an existing kernel-configuration file; entries that would duplicate an instance number already declared in the file are omitted.
For example:
aoemkconf /usr/kernel/drv/aoed.conf 0/11/0-14 0/13/0-14
aoemkconf /usr/kernel/drv/aoed.conf `aoectl list`
aoemkconf /usr/kernel/drv/aoed.conf `aoectl list` >>/usr/kernel/drv/aoed.conf
Aoemkconf is an awk program. There are no explicit hooks for customization, but it ought to be easy to adapt it to local ideas of instance numbers or to supply additional device properties.
/etc/init.d/aoe
,
/lib/svc/method/device-aoe
Startup/shutdown shell script,
with the same contents under either name:
/etc/init.d/aoe
for use in the
init.d(4)
mechanism on a non-SMF system,
/lib/svc/method/device-aoe
for use with SMF.
The script acts according to its first argument:
aoe
start
/etc/aoe.conf
to produce a collection of
aoestart
commands,
one for each channel to be started,
and execute them.
In spirit this is just
sed </etc/aoe.conf 's;^;/sbin/aoestart ;' | sh
aoe
stop
[ contract ]
aoe
restart
[ contract ]
aoe
refresh
[ contract ]
aoe
stop;
aoe
start
except that
aoemon
is not restarted.
On a non-SMF system,
/etc/init.d/aoe
is linked to
/etc/rc2.d/S00aoe
so that
aoe
start
will be called very early in the system startup process.
There is no
K
xxaoe
link:
there's no need
to call
aoe
stop
during a normal shutdown.
On an SMF system,
AoE is installed as service
svc:/device/aoe
,
and should be started and stopped
(enabled and disabled)
with
svcadm(1M).
The service manifest is
user/aoe.xml
in the source-code tree,
/opt/CORDaoe/lib/aoe.xml
and
/var/svc/manifest/device/aoe.xml
in the installed package.
The
aoe
service depends on
svc:/filesystems/root
and declares
svc:/filesystems/usr
as an optional dependent.
Thus any file system but the root or
/usr
,
and any swap area,
may be placed on an AoE disk.
The choice of dependencies derives from Solaris implementation details:
svc:/filesystems/usr
enables swap areas,
so
svc:/device/aoe
must be enabled first
to allow swapping to AoE disks.
svc:/filesystems/usr
,
because AoE drivers and tools are stored in
/usr/kernel
and
/usr/sbin
.
In fact,
if
/usr
is a separate file system
(not so common any more)
filesystems/root
mounts
/usr
read-only,
presumably because so much of Solaris itself
is stored in
/usr
.
If file
/etc/default/aoe.options
exists,
its contents are interpolated
into the
aoe
script
(with the shell's
.
operator)
before anything else is done.
Many parameters such as
the location of
aoestart,
aoestop,
and
aoemon,
the directory where channel files are attached
to keep them open,
and
the name of the configuration file
are set within
aoe
by shell variables;
settings in
aoe.options
override the default values.
SMF or
init.d
startup is selected
when the
Here is a collection of notes
about design decisions,
problems encountered with Solaris or elsewhere
and how they have been papered over (or not),
and so on.
Some of the problems described here
will, we hope,
be mended in future versions of the subsystem,
or eased by future versions of Solaris,
though the latter is no panacea
since older Solaris systems must not be abandoned lightly.
Currently all the code is compiled with
Sun's C compiler,
available at no cost
from
Earlier versions of the driver were built with
gcc,
which proved unsatisfactory:
These problems were observed with
gcc
version 3;
perhaps newer versions are better.
Sun's C compiler works fine,
no longer costs an arm and a leg,
and does stricter type-checking
(which has helped prevent a few bugs);
we plan to stick with it.
The compromises required for
gcc
have been removed from the code
and will not be reinstated.
The same
Care is needed in a few places:
The AoE package currently compiles without error
only on Solaris 10,
though the resulting binaries may be run on any supported system.
The original Solaris disk-label format
has limits that have recently become troublesome:
in particular,
it is unable to handle multi-terabyte disks.
A mid-life update to Solaris 9
introduced a new label format,
which removes the old limitations
at the cost of considerable extra complexity.
To top it off,
Solaris/SPARC and Solaris/x86
use different forms of old-style disk label,
and the x86 implementation allows an old-style label
to be encapsulated within a specially-designated DOS partition.
Here is a summary of what Sun did
and how it affects the AoE driver,
derived from Sun documentation,
experiment,
and some analysis of Open Solaris source code.
The original Solaris
volume table of contents (VTOC) disk label
comprises a single sector (512 bytes)
of label information.
The VTOC label records
disk geometry information
(in particular cylinder, head, and sector counts),
a string label or two,
and an array of eight or sixteen partition descriptors.
Values are stored as 16- and 32-bit
native integers.
The label is protected by a magic number
and a simple-minded checksum.
The eight- and sixteen-partition label formats
differ quite a bit;
they are not meant to interoperate.
SPARC systems use the eight-partition variant,
x86 systems the sixteen-partition one.
The primary VTOC label is stored
in the first sector of the disk.
Backups are kept at sector offsets
1, 3, 5, 7, and 9
in the last track.
Many limitations of the VTOC format
are more obvious now than when
the scheme was adopted in the early 1990s:
Solaris/x86 allows
(in some cases requires)
a disk to have a DOS partition label.
A DOS label is stored in the first sector of the disk;
it contains four partition descriptors,
each including the starting and ending sector of the partition
and a partition type (`system ID').
The label is protected by a magic number.
Integer values in the label are always in
Intel (little-endian) order.
There is no backup label.
The four partitions described in the label
are called primary partitions.
There is a scheme to allow a primary partition
to encapsulate a logical disk
containing another DOS label,
affording additional logical partitions.
Solaris supports only primary partitions.
It has its own encapsulation scheme,
however:
partition type 130 or 191 (decimal)
indicates a logical disk with a Sun VTOC label.
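Finding the encapsulated Solaris partition amounts to a scan of the four primary descriptors. The sketch below assumes the conventional DOS label layout (descriptors at byte offset 446, 16 bytes each, type byte at offset 4 within a descriptor, magic bytes 0x55 0xAA at offset 510); the function name is hypothetical and this is not the driver's own code.

```c
enum {
    DOS_PTAB_OFF  = 446,    /* assumed offset of the first partition descriptor */
    DOS_PENT_LEN  = 16,     /* assumed size of one descriptor */
    DOS_NPART     = 4,      /* primary partitions in a DOS label */
    DOS_TYPE_OFF  = 4,      /* assumed offset of the type byte in a descriptor */
    DOS_MAGIC_OFF = 510     /* assumed offset of the two magic bytes */
};

/*
 * Return the index (0..3) of the first primary partition whose type
 * marks a Sun VTOC container (130 or 191 decimal), or -1 if the
 * sector holds no valid DOS label or no such partition.
 */
int find_solaris_part(const unsigned char sec[512])
{
    int i;

    if (sec[DOS_MAGIC_OFF] != 0x55 || sec[DOS_MAGIC_OFF + 1] != 0xAA)
        return -1;
    for (i = 0; i < DOS_NPART; i++) {
        unsigned char type =
            sec[DOS_PTAB_OFF + i * DOS_PENT_LEN + DOS_TYPE_OFF];
        if (type == 130 || type == 191)
            return i;
    }
    return -1;
}
```

Because the descriptor is examined a byte at a time, the scan gives the same answer on big-endian SPARC and little-endian x86 hosts.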
Solaris makes the (relative) partitions
described in the VTOC
available through the conventional subdevices
Sun addressed the problems with
the old VTOC label scheme by adding support for EFI labels,
borrowed from the
Intel Extensible Firmware Interface
standard.
EFI support first appeared in Solaris 9 4/03.
Patches are available to add support to older copies of Solaris 9,
but not to Solaris 8 or older releases.
An EFI label comprises two data structures:
the GUID Partition Table header (GPT)
and a GUID Partition Entry (GPE) array.
One copy of the GPT is stored in the second sector
of the disk,
immediately followed by the GPE array
in contiguous sectors.
A backup GPT is stored in the last sector,
immediately preceded by a backup GPE array.
Sectors from the end of the primary GPE array
to the beginning of the backup
may be allocated to partitions.
EFI partitions may not overlap one another,
nor may an EFI partition overlap the GPT or GPE or sector 0.
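The placement rules above reduce to a little arithmetic on the disk size and the length of the GPE array. The following sketch works it out; the structure and function names are hypothetical, and the GPE array length is taken as a parameter rather than derived from the label.

```c
#include <stdint.h>

struct efilayout {              /* hypothetical: sector addresses of label parts */
    uint64_t gpt;               /* primary GPT header */
    uint64_t gpe;               /* first sector of primary GPE array */
    uint64_t first_usable;      /* first sector available to partitions */
    uint64_t last_usable;       /* last sector available to partitions */
    uint64_t backup_gpe;        /* first sector of backup GPE array */
    uint64_t backup_gpt;        /* backup GPT header */
};

/*
 * Compute the EFI label layout for a disk of nsect sectors whose GPE
 * array occupies gpe_sects sectors.  Return 0 on success, -1 if the
 * disk is too small to hold both labels and one usable sector.
 */
int efi_layout(uint64_t nsect, uint64_t gpe_sects, struct efilayout *lp)
{
    if (nsect < 4 || gpe_sects == 0 || gpe_sects > (nsect - 4) / 2)
        return -1;
    lp->gpt = 1;                            /* second sector of the disk */
    lp->gpe = 2;                            /* immediately after the GPT */
    lp->first_usable = 2 + gpe_sects;
    lp->backup_gpt = nsect - 1;             /* last sector of the disk */
    lp->backup_gpe = lp->backup_gpt - gpe_sects;
    lp->last_usable = lp->backup_gpe - 1;
    return 0;
}
```

With a 1000-sector disk and a 32-sector GPE array, for example, partitions may occupy sectors 34 through 966.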
The EFI label contains no information
about cylinders, heads, or tracks;
the disk is treated as a simple linear array of sectors.
Sector addresses and counts are stored as 64-bit unsigned integers,
allowing for disks and individual partitions
as large as 8 zebibytes
(more than 8 billion tebibytes).
All integer values
are stored in a fixed byte order
defined by the EFI standard;
hence a label written by a big-endian SPARC system
may be read without difficulty
by a little-endian IA32 system
or vice versa.
The original Solaris UFS file system format
also allows only 31 bits
for sector numbers.
In Solaris 9 8/03
(with corresponding patches
for earlier editions of Solaris 9),
Sun added a `multi-terabyte file system' variant.
When there were only VTOC labels,
a Solaris disk driver was expected to support
these label-related
ioctl
commands:
DOS-label support added these
ioctls:
An EFI-compliant driver
supports the VTOC-label and
(if DOS labels are implemented)
DOS-label
ioctls,
but returns error
The
In an EFI-compliant version of Solaris,
utility programs like
format
and
prtvtoc(1M)
use new
libefi
and
libvtoc(3LIB)
libraries
to read and write disk labels.
Libvtoc
is a simple wrapper around the old
ioctls.
Libefi
is more complicated:
when reading a label
it validates all the checksums
(device drivers just check the magic number);
when writing,
it computes correct checksums and validates other GPT and GPE values,
in particular enforcing the EFI-standard rule
that partitions may not overlap
and a Sun-specific rule that exactly one partition
must be of special
To find the label on a disk,
Sun's EFI-compliant drivers
search as follows:
The
The Solaris kernel is pre-emptive:
all processing in the kernel
belongs to a scheduling thread
and may be pre-empted.
Thus locking is important
even on a single-processor system.
There are three global locks:
chantlock
protects the table that maps
aoechan
numbers to
Each
During normal operation,
the data protected by
loggerlock
and the
There are two locks in
Two locks are used throughout the
The driver's main entry points
locate and lock the relevant
Timer routine
chantimer
contains an exception
that illustrates the rule.
Chantimer
locks an
The channel-startup algorithm
is complicated by a potential security problem.
A STREAMS module has no associated permissions.
Anyone can open a stream device of some sort:
a pipe,
a network connection,
his own terminal.
Given an open stream,
anyone may push any module.
In particular,
anyone could push
A malicious user could do this using a channel number
normally assigned to official disk devices.
An active channel number cannot be reused,
but a bad guy might attempt a race with the system administrator
or take advantage of a system problem.
Hence the algorithm that first requires
The
If it were SCSI-over-Ethernet rather than ATA,
that would be possible:
Sun expects third parties to write SCSI HBA drivers,
and documents the details.
Unfortunately that is not true for ATA.
Hence
Perhaps this can be revisited in the future,
if Sun stabilize and publish their
ATA-adapter interface,
and once their ATA code has been fixed
to handle multi-terabyte disks.
Even then,
support for older Solaris versions would
require something like the current driver.
Device instances in Solaris are created in one of two ways:
Because AoE is a software construct
attached to the
In Solaris 9 and earlier versions,
this means
devices can be attached
only when the
In Solaris 10,
a static device configuration can be updated
without unloading the driver;
update_drv(1M)
makes it happen.
Devices that are in use (open or mounted)
cannot be changed,
but new devices can be added
and idle devices removed
or given new properties.
At present,
the system administrator is expected to update
It might seem natural for
An early version of the driver used node type
To avoid these problems,
the
This approach reaps some minor benefits:
Rules are added for 16 Solaris partitions
even on systems supporting only eight Solaris partitions.
This makes installation easier
(one less special case),
and is harmless;
the system ignores any extra rules.
In an early version of
aoed,
the
To avoid this mess,
The
EFI-label
code in
later Solaris 9 releases and in Solaris 10
introduced two new
format
embarrassments:
There is also at least one embarrassment in
DOS-label
support:
When a Solaris Ethernet device is opened,
it sometimes takes a few seconds for the hardware
and software
to initialize.
During this interval,
DLPI attachment and configuration messages
are processed correctly,
and DLPI queries
report that the device is ready,
but it isn't:
messages written to the device
are silently thrown away.
Thus
aoestart
must somehow wait until the device is really ready
before broadcasting a Query-Config message
to discover which devices exist.
Comminit,
the library routine called by
aoestart
to open and initialize the device,
tries two schemes to wait for the hardware:
The
kstat
test is tried first because it's a bit faster
and rather less hacky in implementation
than the
ndd
test.
Recent network-device drivers
no longer support the
ndd
parameter,
but all Sun-supported network devices we know
support the
kstat
scheme,
except the old
Solaris
newfs
gets into trouble when a large disk
has small cylinders,
producing complaints like
`Insufficient space in super block for rotational tables'
and
`inode blocks/cyl group >= data blocks.'
The details are not yet understood;
possibly
newfs
is incorrectly doing some bit of arithmetic
in too small a variable.
Empirically the trouble seems to vanish
when cylinders
are very large.
Hence the phony numbers generated by
fudgegeom,
which make cylinders as big as the system seems to be able to stand.
The
The default timeouts for I/O operations recorded in
aoed.h
are surprisingly long:
200ms for a read or write,
half a second for an AoE Query-Config
or ATA IDENTIFY.
Originally much smaller values were used,
but complicated devices like RAID controllers
really do take as long as 100-150ms
to finish some operations.
Longer timeouts shouldn't cause much grief anyway
except in special circumstances,
since lost messages are unlikely on modern
(switched, flow-controlled) networks
unless something is broken.
The large values may cause grief on
slow, congested networks,
e.g. 10Mbps or 100Mbps,
especially when repeaters or broadcast cables are used
rather than switches.
If that case comes up in real life,
just set the timeouts down manually.
If it happens often enough
it may make sense to invent a way
to set a per-channel I/O timeout default
rather than having to set it for each target.
CORDaoe
package is installed:
/etc/init.d/aoe
is installed in any case.
/etc/init.d/aoe
to
/lib/svc/method/device-aoe
.
/opt/CORDaoe/lib/aoe.xml
to
/var/svc/manifest/device/aoe.xml
,
then runs
svccfg(1M)
to import the latter.
svccfg
refresh
/system/filesystem/usr
as a workaround for a bug in
early versions of Solaris 10,
which sometimes didn't register
dependent
declarations when a new service was added.
/etc/init.d/aoe
is linked to
/etc/rc2.d/S00aoe
.
9.
Design details,
compromises,
bugs,
and other concessions to reality
9.1.
Gcc
versus
Sun C
www.opensolaris.org
.
aoed
driver was modified to keep certain values
as 32-bit numbers
even though 64 bits would have been more appropriate;
in particular,
the size of an AoE disk was limited to 32 bits' worth of sectors,
i.e. 2 tebibytes.
9.2.
Solaris version differences
CORDaoe
binary package may be installed on
Solaris 7, 8, 9, or 10.
Surprisingly little magic is required
to make this work:
despite marked changes in the underlying operating system,
the interfaces visible to device drivers
have been quite stable since Solaris 7.
DKIOCGETEFI
and
DKIOCSETEFI
ioctls,
introduced in Solaris 9,
was changed for Solaris 10,
apparently for the convenience of the library code
in which Sun use that call.
This call is used by
prtvtoc
and
format(1M),
so it is important that it work properly.
To paper this over,
the
aoed
module makes a somewhat-hacky explicit
runtime Solaris-version test.
Nblocks
defined.
A call to define a 64-bit integer property
wasn't added to the Sun driver interface
until Solaris 9.
A
custom interface routine
is required to paper over this,
so that the driver may support ZFS
but will still load on Solaris 7 and 8.
9.3.
Disk labels
9.3.1.
VTOC (old-style) labels
9.3.2.
DOS (fdisk) labels
s0
-s15
;
DOS primary partitions are accessed through
new devices
p1
-p4
,
whether or not there is an encapsulated VTOC.
9.3.3.
EFI (new-style) labels
9.3.4.
Bigger file systems
9.3.5.
Sun implementation details
DKIOCGGEOM
DKIOCSGEOM
struct
dk_geom
containing disk-geometry information.
DKIOCGVTOC
DKIOCSVTOC
struct
vtoc
containing disk-partition information.
The on-disk VTOC contains a merger of
struct
vtoc
and
struct dk_geom
.
DKIOCGAPART
DKIOCSAPART
struct
dk_allmap
containing the starting cylinder number
and size in sectors
of all eight or sixteen partitions.
DKIOCG_PHYGEOM
DKIOCG_VIRTGEOM
DKIOCPARTINFO
DKIOCGMBOOT
DKIOCSMBOOT
ENOTSUP
if the disk has an EFI label,
or if
DKIOCSVTOC
is called on a VTOC-improper disk.
An EFI-compliant driver
also supports these
ioctls:
DKIOCGETEFI
DKIOCSETEFI
DKIOCPARTITION
struct
partition64
containing the starting sector address,
length,
and 128-bit partition-type code
for a designated partition.
DKIOCGMEDIAINFO
struct
dk_minfo
giving the sector size in bytes,
device size in sectors,
and a device-type code
(removable disk,
fixed disk,
CD-ROM,
CD-R or CD-RW,
floppy,
etc.).
DKIOCGETEFI
and
DKIOCSETEFI
ioctls
really just perform I/O to
absolute disk addresses,
regardless of the partition table.
reserved
type,
presumably as a stand-in for the reserved cylinders
in the old VTOC scheme.
:h
with one called
:wd
mapping the entire physical disk,
including the label areas.
:a
and
:c
span the whole disk.
:wd
partition.
Apparently it is expected that
format
will be used to label the disk before use.
(But what if the disk was written
by another operating system
with its own label scheme?)
9.3.6.
AoE implementation details
aoed
driver
recognizes VTOC, DOS (and encapsulated-VTOC), and EFI labels,
but with some differences from the Sun convention:
ENOTSUP
unless the disk already has a VTOC label,
or has no label and is VTOC-proper.
An existing VTOC label
may be updated with
DKIOCSVTOC
even if the disk is VTOC-improper;
Sun's drivers don't allow this.
:wd
partition
uses a ninth (or seventeenth) partition,
rather than stealing one of the normal ones.
In fact that partition is valid
regardless of label type.
aoed
checks for backup VTOC labels
if none is found in sector 0.
See the description of
readptab
for the full scoop.
aoed
always supports DOS and EFI labels
even when the surrounding system doesn't.
In practice this means an older system
may access existing partitions on a disk
that already has an EFI label,
or a SPARC system may access DOS partitions,
but tools like
prtvtoc
and
format
cannot print the partition table or update it.
Encapsulated VTOC labels are allowed only on Solaris/x86,
however,
so that the difference between SPARC and x86 VTOC-label formats
won't cause trouble.
9.4.
Concurrency issues
9.4.1.
Locks in
aoecomm
Aoechan
structures;
pendlock
protects the pending-cookie table;
loggerlock
protects the message-logger pointer.
Each of these locks is held for only a few lines of code at a time;
protected code sections contain no procedure calls,
and in particular no lock calls.
Aoechan
structure
includes a lock,
protecting its contents.
These locks are also held only during short
code sequences
that cannot provoke other locks.
Aoechan
lock are written rarely
but read constantly.
These locks proved to be hot spots;
changing them from the
kmutex_t
type to
krwlock_t
made raw disk I/O measurably faster.
9.4.2.
Locks in
aoectl
aoectl
.
One serializes writes to the
/dev/aoectl
device;
the other prevents concurrent access to the buffer ring
feeding the
/dev/aoemon
device.
The two are entirely disjoint:
code using one lock
never calls code using the other.
9.4.3.
Locks in
aoed
aoed
module:
one protecting the
Adrive
structure,
one the
Achan
structure.
To avoid nested-lock hangups,
there is a rule that
code in which an
Adrive
is locked
may lock
(or call code that locks)
an
Achan
,
but not vice versa.
Adrive
early on,
and unlock it just before returning.
Most subroutines
that take an
Adrive
argument assume it was already locked,
and leave it that way.
Achan
s
are used
(hence locked)
only here and there,
and for short periods;
usually just for long enough to search
or to make a single change to
the active-command list.
Only rarely need an
Adrive
be accessed while an
Achan
is locked.
Achan
while walking its active-command list
looking for expired commands.
If one is found,
the
Achan
is unlocked and the
Adrive
associated with the command locked
while the command is retransmitted or cancelled.
Then the
Adrive
is unlocked
and the
Achan
locked again;
and because the active-command list
may have changed while the
Achan
was unlocked,
chantimer
starts over from the beginning of the list.
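The unlock, relock, and restart sequence can be illustrated in user-mode C. Everything in this sketch is hypothetical: the structures are pared down to the fields needed, POSIX mutexes stand in for kernel locks, and "retransmission" is reduced to bumping a counter and pushing the deadline forward. A real driver must also cope with the command disappearing while the channel is unlocked, which this sketch ignores.

```c
#include <pthread.h>
#include <stddef.h>

struct adrive {                     /* hypothetical, much simplified */
    pthread_mutex_t lock;
    int resent;                     /* commands retransmitted so far */
};

struct acmd {
    struct acmd *next;
    struct adrive *drive;           /* drive this command belongs to */
    long deadline;                  /* time at which the command expires */
};

struct achan {
    pthread_mutex_t lock;
    struct acmd *active;            /* active-command list */
};

#define RETRY_INTERVAL 200          /* assumed retry period, arbitrary units */

void chantimer_sketch(struct achan *cp, long now)
{
    struct acmd *cmd;
    struct adrive *dp;

    pthread_mutex_lock(&cp->lock);
restart:
    for (cmd = cp->active; cmd != NULL; cmd = cmd->next) {
        if (cmd->deadline > now)
            continue;               /* not expired */
        dp = cmd->drive;
        /* Rule: never lock a drive while holding a channel lock. */
        pthread_mutex_unlock(&cp->lock);
        pthread_mutex_lock(&dp->lock);
        dp->resent++;               /* stand-in for retransmit or cancel */
        cmd->deadline = now + RETRY_INTERVAL;
        pthread_mutex_unlock(&dp->lock);
        pthread_mutex_lock(&cp->lock);
        goto restart;               /* list may have changed; rescan */
    }
    pthread_mutex_unlock(&cp->lock);
}
```

The rescan terminates because each handled command gets a deadline in the future, so every pass finds at least one fewer expired command.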
9.5.
The startup dance
aoecomm
onto one end of a pipe,
handle AoE messages on the other,
and the system would think it a valid
aoechan.
/dev/aoectl
to be opened:
that device file can be given whatever permissions
local policy dictates.
Someone who may open
/dev/aoectl
can set up AoE channels;
someone who may not, cannot.
9.6.
How
aoed
fits into the system
aoed
driver is in a sense half-redundant.
Since the AoE protocol is just a way
to bundle ATA commands
into Ethernet packets,
one might think it would be possible
to write not a complete disk driver
but just an ATA host-adapter driver,
using the existing Solaris ATA code to do the rest of the work.
aoed
is a standalone disk driver
attached to the
pseudo
nexus.
9.7.
Device-creation static
dev_info_t
structure
and calls the driver
attach
routine.
pseudo
nexus,
only static configuration is allowed.
Some versions of Solaris even complain
if such a driver has a real
probe
entry point.
aoed
module is loaded.
To create more devices
requires shutting down the driver,
unloading the module,
and loading it afresh.
aoed.conf
and run
update_drv.
Perhaps devices can be detected automatically
in a future version,
though that might forbid
custom target-address-to-instance-number mappings.
9.8.
Device types,
and device-name horrors
aoed
to create device nodes by calling
ddi_create_minor_node(9F)
with node type
DDI_NT_BLOCK_CHAN
,
using the
aoemaj
number
(or some combination of
aoechan
and
aoemaj)
as the device instance and
aoemin
as the target number.
DDI_NT_BLOCK_CHAN
is listed in the documentation,
but
nowhere do the manuals say how to supply a target number.
Apparently there are special hooks
for use by SCSI and ATA host-bus adapter drivers,
but no mechanism for general use.
Worse,
if one uses
DDI_NT_BLOCK_CHAN
anyway,
the system apparently picks a
garbage number out of uninitialized memory.
DDI_NT_BLOCK
.
This ran afoul of a bug in
devfsadm(1M)
in Solaris 7, 8, and 9.
Calling
ddi_create_minor_node
with the
dev_info_t
for instance 15
created devices named
/devices/pseudo/aoed@15:*
,
as expected;
but the links to
/dev/dsk
and
/dev/rdsk
were named
c
Nd21*
.
Apparently the code
that creates names in
/devices
believes instance numbers are decimal,
but the code that reads those names
and creates names in
/dev
believes the
/devices
names are hexadecimal
and `corrects' them.
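The arithmetic of the mix-up is easy to reproduce. This sketch merely demonstrates the suspected effect (a number printed in decimal and parsed back as hexadecimal); it is not taken from devfsadm, and the function name is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Print instance in decimal, as the /devices name does, then read
 * the text back as hexadecimal, as the /dev link code apparently does.
 */
long misread_instance(int instance)
{
    char text[32];

    snprintf(text, sizeof text, "%d", instance);
    return strtol(text, NULL, 16);
}
```

Instance 15 comes back as 21 (hexadecimal 15), matching the link names observed; instances 0 through 9 survive unchanged, which may be why the bug went unnoticed.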
aoed
driver does everything the hard way.
Device nodes are created with type
DDI_NT_PSEUDO
,
for which
devfsadm
does no automatic processing.
Installing the AoE subsystem
adds explicit rules to
/etc/devlink.tab
to generate
/dev
links
for
aoed
:
type=ddi_pseudo;name=aoed;minor=a dsk/cad\A0s0
type=ddi_pseudo;name=aoed;minor=a,raw rdsk/cad\A0s0
type=ddi_pseudo;name=aoed;minor=b dsk/cad\A0s1
type=ddi_pseudo;name=aoed;minor=b,raw rdsk/cad\A0s1
...
type=ddi_pseudo;name=aoed;minor=p dsk/cad\A0s15
type=ddi_pseudo;name=aoed;minor=p,raw rdsk/cad\A0s15
type=ddi_pseudo;name=aoed;minor=r dsk/cad\A0p1
type=ddi_pseudo;name=aoed;minor=r,raw rdsk/cad\A0p1
...
type=ddi_pseudo;name=aoed;minor=u dsk/cad\A0p4
type=ddi_pseudo;name=aoed;minor=u,raw rdsk/cad\A0p4
type=ddi_pseudo;name=aoed;minor=wd dsk/cad\A0
type=ddi_pseudo;name=aoed;minor=wd,raw rdsk/cad\A0
wd
partition.
The explicit rules generate a single consistent name.
/dev/dsk/c1d*
on one system,
/dev/dsk/c2d*
on another with different hardware or a different
Solaris version;
the names might even change after an OS reinstall.
Again,
the explicit rules
generate a single consistent name.
9.9. Format woes
DKIOCINFO
ioctl
returned a customer controller type value
(>=DKC_CUSTOMER
).
This causes
format(1M)
to misbehave quite badly:
DKIOCINFO
.
aoed
attempts to mimic a directly attached ATA disk driver,
even though that is an undocumented interface.
DKIOCINFO
reports controller type
DKC_DIRECT
;
the driver
implements a subset of
DIOCTL_RWCMD
format
uses for direct-address access to ATA disks.
The details are not officially documented;
they were
worked out by tracing
format,
reading
<sys/dktp/dadkio.h>
,
and applying a mix of imagination and common sense.
ioctl
calls.
This wouldn't work even with Sun's ATA-disk driver;
it's just a bug.
aoed
implements a fake
USCSICMD
ioctl.
Since ATA allows longer device-model strings than SCSI,
the string is sometimes truncated;
there's nothing we can do about that.
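The truncation follows from the field sizes: the ATA IDENTIFY DEVICE model number is 40 ASCII characters, while the product identification field in standard SCSI INQUIRY data is 16 bytes. The sketch below shows one way the copy could be done. It assumes the whole model string goes into the product field; how aoed actually divides the string between the vendor and product fields is not stated here, and model_to_pid is a hypothetical name.

```c
#include <stddef.h>

#define ATA_MODEL_LEN	40	/* IDENTIFY DEVICE model number, ASCII bytes */
#define SCSI_PID_LEN	16	/* INQUIRY product identification, bytes */

/*
 * Copy an ATA model string into a SCSI-sized product field.
 * The field is space-padded and not NUL-terminated, as in INQUIRY data.
 * Returns the number of model characters that did not fit.
 * Hypothetical helper; the mapping used by aoed is an assumption.
 */
static size_t
model_to_pid(const char *model, char pid[SCSI_PID_LEN])
{
	size_t len = 0;
	size_t i;

	while (len < ATA_MODEL_LEN && model[len] != '\0')
		len++;
	while (len > 0 && model[len - 1] == ' ')	/* ATA pads with spaces */
		len--;

	for (i = 0; i < SCSI_PID_LEN; i++)
		pid[i] = (i < len) ? model[i] : ' ';

	return ((len > SCSI_PID_LEN) ? len - SCSI_PID_LEN : 0);
}
```

A nonzero return value tells the caller that the reported name is only a prefix of the real one.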
Some of these bugs may have been cured in
Solaris 10 11/06.
:q
(p0
in
/dev
)
accessing the whole disk.
If this device exists,
format
insists that the disk have a DOS label;
if none is present,
format
refuses even to print the existing partition table,
commanding
Please use fdisk first
.
AoE therefore doesn't make the
:q
/p0
device.
The
:wd
(no suffix in
/dev
)
device,
created by AoE regardless of label type,
affords equivalent access if needed.
9.10. Waiting for the network
If neither test is possible
or the test chosen fails,
comminit
returns an error.
link_up
parameter.
If found,
wait up to 30 seconds for the value
to become positive;
when it does,
return success.
link_status
value for the device.
Wait up to 30 seconds for the value
to become positive;
when it does,
return success.
A global state variable in the kernel must be changed
to select the device to be polled;
good citizenship suggests its original value should be saved
just before every test
and restored immediately after,
though races are still possible.
le
(10Mbps LANCE Ethernet)
driver.
The latter device isn't a good choice for AoE,
but support will remain for now
because it is sometimes useful during our own testing.
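Both tests described above share the same shape: poll a value until it becomes positive, and give up after 30 seconds. The sketch below shows that loop in isolation. The one-second polling interval, the callback types, and all names are assumptions made for illustration; in the kernel the pause would use the driver's own delay primitive, and the save and restore of the global state variable is left out.

```c
#define LINK_WAIT_SECONDS	30	/* limit taken from the description above */

/* Hypothetical callbacks, not part of aoecomm. */
typedef int (*link_probe_fn)(void *arg);	/* returns the current link value */
typedef void (*pause_fn)(void);			/* waits about one second */

/*
 * Poll once per interval until the probed value is positive.
 * Returns 0 on success, -1 if the time limit expires first.
 */
static int
wait_for_link(link_probe_fn probe, void *arg, pause_fn pause_1s)
{
	int elapsed;

	for (elapsed = 0; ; elapsed++) {
		if (probe(arg) > 0)
			return (0);
		if (elapsed >= LINK_WAIT_SECONDS)
			return (-1);
		pause_1s();
	}
}

/* Stand-in probe for experiments: reports link-up after a set number of polls. */
struct fake_link {
	int polls_before_up;
	int polls;
};

static int
fake_probe(void *arg)
{
	struct fake_link *fl = arg;

	return (fl->polls++ >= fl->polls_before_up ? 1 : 0);
}

static int pauses;	/* counts simulated seconds */

static void
fake_pause(void)
{
	pauses++;
}
```

With fake_probe set to come up after three polls, wait_for_link returns 0 after three simulated seconds; with a probe that never comes up, it returns -1 after thirty.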
9.11. Newfs woes
-T
(new multi-terabyte file system)
newfs
option in Solaris 10
and in newer editions of Solaris 9
also avoids the problem,
but that's no help in older versions of Solaris,
nor on 32-bit hardware.
9.12. I/O timeouts
Footnotes