struct ucred
is the kernel's internal credential
structure, and is generally used as the basis for process-driven access control
within the kernel. BSD-derived systems use a “copy-on-write” model
for credential data: multiple references may exist for a credential structure,
and when a change needs to be made, the structure is duplicated, modified, and then
the reference replaced. Because credentials are widely cached to implement access
control at the time of open, this model results in substantial memory savings.
With a move to fine-grained SMP, this model also saves substantially on locking
operations by requiring that modification only occur on an unshared credential,
avoiding the need for explicit synchronization when consuming a known-shared
credential.
Credential structures with a single reference are considered mutable; shared
credential structures must not be modified, or a race condition is risked. A mutex,
cr_mtxp, protects the reference count of struct ucred
so as to maintain consistency. Any use of the
structure requires a valid reference for the duration of the use, or the structure
may be released out from under the illegitimate consumer.
The struct ucred
mutex is a leaf mutex and is
implemented via a mutex pool for performance reasons.
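To illustrate, the following sketch shows the reference acquire/release pattern using the crhold(9) and crfree(9) interfaces; the wrapper function is hypothetical, and the reference count manipulation happens inside those interfaces as described above.

#include <sys/param.h>
#include <sys/ucred.h>

/*
 * Hold a reference on a credential for the duration of its use so that
 * it cannot be released out from under the consumer.
 */
static void
example_cred_use(struct ucred *cr)
{
	crhold(cr);	/* take a reference; the count is protected as described above */
	/* ... read-only use of cr for access control decisions ... */
	crfree(cr);	/* drop the reference; cr is freed when the last reference goes */
}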
Usually, credentials are used in a read-only manner for access control decisions,
and in this case td_ucred
is generally preferred
because it requires no locking. When a process's credential is updated, the proc lock
must be held across the check and update operations to avoid races. The process
credential p_ucred must be used for check and update operations to prevent
time-of-check, time-of-use races.
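The check-and-update pattern described above might look roughly as follows, assuming the crget(9), crcopy(9), and crfree(9) interfaces; the helper is hypothetical, error handling is elided, and the direct assignment to p_ucred is a simplification of the real kernel code.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/ucred.h>

static void
example_cred_update(struct proc *p)
{
	struct ucred *newcred, *oldcred;

	newcred = crget();		/* fresh, unshared credential */
	PROC_LOCK(p);
	oldcred = p->p_ucred;
	crcopy(newcred, oldcred);	/* duplicate the current credential */
	/* ... perform checks against oldcred and modify newcred here ... */
	p->p_ucred = newcred;		/* replace the reference */
	PROC_UNLOCK(p);
	crfree(oldcred);		/* drop the reference to the old credential */
}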
If system call invocations will perform access control after an update to the
process credential, the value of td_ucred
must also
be refreshed to the current process value. This will prevent use of a stale
credential following a change. The kernel automatically refreshes the td_ucred
pointer in the thread structure from the process
p_ucred
whenever a process enters the kernel,
permitting use of a fresh credential for kernel access control.
Details to follow.
struct prison
stores administrative details
pertinent to the maintenance of jails created using the jail(2) API. This
includes the per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of the structure are
shared by many credential structures. A single mutex, pr_mtx, protects read and
write access to the reference count and all mutable variables inside
struct prison. Some variables are set only
when the jail is created, and a valid reference to the struct prison
is sufficient to read these values. The
precise locking of each entry is documented via comments in sys/jail.h.
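As a rough sketch (not the actual kernel implementation) of how the reference count is manipulated under pr_mtx, hypothetical stand-ins for the hold and release operations might look like this:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/jail.h>

static void
example_prison_hold(struct prison *pr)
{
	mtx_lock(&pr->pr_mtx);
	pr->pr_ref++;			/* the reference count is mutable: lock required */
	mtx_unlock(&pr->pr_mtx);
}

static void
example_prison_free(struct prison *pr)
{
	mtx_lock(&pr->pr_mtx);
	if (--pr->pr_ref > 0) {
		mtx_unlock(&pr->pr_mtx);
		return;
	}
	mtx_unlock(&pr->pr_mtx);
	/* ... last reference has been dropped: tear down and free pr ... */
}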
The TrustedBSD MAC Framework maintains data in a variety of kernel objects, in
the form of struct label
. In general, labels in
kernel objects are protected by the same lock as the remainder of the kernel
object. For example, the v_label
label in
struct vnode
is protected by the vnode lock on the
vnode.
In addition to labels maintained in standard kernel objects, the MAC Framework
also maintains a list of registered and active policies. The policy list is
protected by a global mutex (mac_policy_list_lock
) and
a busy count (also protected by the mutex). Since many access control checks
may occur in parallel, entry to the framework for a read-only access to the policy
list requires holding the mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the MAC entry operation,
since some operations, such as label operations on file system objects, are
long-lived. To modify the policy list, such as during policy registration and
de-registration, the mutex must be held and the busy count must be zero,
to prevent modification of the list while it is in use.
A condition variable, mac_policy_list_not_busy
, is
available to threads that need to wait for the list to become unbusy, but this
condition variable must only be waited on if the caller is holding no other locks,
or a lock order violation may be possible. The busy count, in effect, acts as a form
of shared/exclusive lock over access to the framework; the difference is that,
unlike an sx lock, consumers waiting for the list to become unbusy may be starved,
which avoids lock order problems between the busy count and other locks that may be
held on entry to (or inside) the MAC Framework.
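The busy-count discipline might be sketched as follows; the lock and condition variable names mirror the text above, but the counter variable and helper functions are hypothetical:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

static struct mtx mac_policy_list_lock;		/* initialized elsewhere with mtx_init() */
static struct cv mac_policy_list_not_busy;	/* initialized elsewhere with cv_init() */
static int mac_policy_list_busy;

/* Read-only entry to the framework: bump the busy count under the mutex. */
static void
example_policy_list_busy(void)
{
	mtx_lock(&mac_policy_list_lock);
	mac_policy_list_busy++;
	mtx_unlock(&mac_policy_list_lock);
}

/* Exit from the framework: drop the busy count and wake exclusive waiters. */
static void
example_policy_list_unbusy(void)
{
	mtx_lock(&mac_policy_list_lock);
	mac_policy_list_busy--;
	if (mac_policy_list_busy == 0)
		cv_signal(&mac_policy_list_not_busy);
	mtx_unlock(&mac_policy_list_lock);
}

/* Policy registration/de-registration: wait until the list is unbusy. */
static void
example_policy_list_exclusive(void)
{
	mtx_lock(&mac_policy_list_lock);
	while (mac_policy_list_busy != 0)
		cv_wait(&mac_policy_list_not_busy, &mac_policy_list_lock);
	/* ... safe to modify the policy list here ... */
	mtx_unlock(&mac_policy_list_lock);
}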
The module subsystem has a single lock that is used to protect the shared data.
This lock is a shared/exclusive (sx) lock and is likely to be acquired frequently,
shared or exclusive, so a few macros have been added to make access to the lock
easier. These macros are located in sys/module.h and are quite basic in terms
of usage. The main structures protected by this lock are the module_t
structures (when shared) and the global modulelist_t
structure, modules. One should review the
related source code in kern/kern_module.c to further
understand the locking strategy.
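As an illustration of the pattern only (the macro and variable names below are invented, not those of the real implementation):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/sx.h>

static struct sx example_modules_sx;	/* initialized elsewhere with sx_init() */

#define	EXAMPLE_MOD_SLOCK()	sx_slock(&example_modules_sx)
#define	EXAMPLE_MOD_SUNLOCK()	sx_sunlock(&example_modules_sx)
#define	EXAMPLE_MOD_XLOCK()	sx_xlock(&example_modules_sx)
#define	EXAMPLE_MOD_XUNLOCK()	sx_xunlock(&example_modules_sx)

/* Read-only traversal of the module list holds the shared lock. */
static void
example_module_lookup(void)
{
	EXAMPLE_MOD_SLOCK();
	/* ... walk the global module list ... */
	EXAMPLE_MOD_SUNLOCK();
}

/* Registration and unregistration hold the exclusive lock. */
static void
example_module_register(void)
{
	EXAMPLE_MOD_XLOCK();
	/* ... modify the global module list ... */
	EXAMPLE_MOD_XUNLOCK();
}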
The newbus system will have one sx lock. Readers will hold a shared (read) lock (sx_slock(9)) and writers will hold an exclusive (write) lock (sx_xlock(9)). Internal functions will not do locking at all; externally visible ones will lock as needed. Items for which it does not matter whether a race is won or lost will not be locked, since they tend to be read all over the place (e.g., device_get_softc(9)). There will be relatively few changes to the newbus data structures, so a single lock should be sufficient and should not impose a performance penalty.
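A sketch of the intended division of labor follows, with hypothetical function names; only the externally visible function acquires the lock, while the internal helper assumes its caller holds it:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/sx.h>

static struct sx example_newbus_sx;	/* initialized elsewhere with sx_init() */

/* Internal helper: performs no locking; the caller must hold the lock. */
static int
example_devclass_count_internal(void)
{
	/* ... walk newbus data structures read-only ... */
	return (0);
}

/* Externally visible function: acquires the shared lock as needed. */
int
example_devclass_count(void)
{
	int count;

	sx_slock(&example_newbus_sx);
	count = example_devclass_count_internal();
	sx_sunlock(&example_newbus_sx);
	return (count);
}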
...
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system calls, including td_ucred
- inter-process operations
- process groups and sessions
Lots of references to sched_lock
and notes pointing
at specific primitives and related magic elsewhere in the document.
The select
and poll
functions permit threads to block waiting on events on file descriptors--most
frequently, whether or not the file descriptors are readable or writable.
...
The SIGIO service permits a process to request the delivery of a SIGIO signal to
itself or to its process group when the read/write status of specified file
descriptors changes. At most one process or process group is permitted to register
for SIGIO from any given kernel object, and that process or group is referred to as
the owner. Each object supporting SIGIO registration contains a pointer field that is NULL
if the object is not registered, or points to a struct sigio
describing the registration. This field is
protected by a global mutex, sigio_lock
. Callers to
SIGIO maintenance functions must pass in this field “by reference”
so that local register copies of the field are not made when unprotected by the
lock.
One struct sigio
is allocated for each registered
object associated with any process or process group, and contains back-pointers to
the object, owner, signal information, a credential, and the general disposition of
the registration. Each process or process group contains a list of registered
struct sigio
structures, p_sigiolst
for processes, and pg_sigiolst
for process groups. These lists are protected
by the process or process group locks respectively. Most fields in each struct sigio
are constant for the duration of the
registration, with the exception of the sio_pgsigio
field which links the struct sigio
into the process or process group list.
Developers implementing new kernel objects supporting SIGIO will, in general, want
to avoid holding structure locks while invoking SIGIO supporting functions,
such as fsetown or funsetown, in order to avoid defining a lock order between structure
locks and the global SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance on a file
descriptor reference to a pipe during a pipe operation.
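For example, a hypothetical object might perform its SIGIO registration as follows; the object type, its fields, and the helper functions are invented for illustration:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/sigio.h>

/* Hypothetical kernel object supporting SIGIO registration. */
struct example_obj {
	struct mtx	 eo_mtx;	/* protects object state (not eo_sigio) */
	struct sigio	*eo_sigio;	/* SIGIO registration, or NULL */
};

static int
example_obj_setown(struct example_obj *obj, pid_t pgid)
{
	/*
	 * The sigio pointer is passed by reference so that fsetown() can
	 * examine and update it under the global sigio_lock.  No object
	 * lock is held across the call, so no lock order is established
	 * between eo_mtx and sigio_lock.
	 */
	return (fsetown(pgid, &obj->eo_sigio));
}

static void
example_obj_close(struct example_obj *obj)
{
	funsetown(&obj->eo_sigio);	/* likewise called without eo_mtx held */
}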
The sysctl
MIB service is invoked from both within
the kernel and from userland applications using a system call. At least two issues
are raised in locking: first, the protection of the structures maintaining the
namespace, and second, interactions with kernel variables and functions that are
accessed by the sysctl interface. Since sysctl permits the direct export (and
modification) of kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics for those
variables. Currently, sysctl makes use of a single global sx lock to serialize use
of sysctl
; however, it is assumed to operate
under Giant, and other protections are not provided. The remainder of this
section speculates on locking and semantic changes to sysctl.
- Need to change the order of operations for sysctls that update values from "read old, copyin and copyout, write new" to "copyin, lock, read old and write new, unlock, copyout". Normal sysctls that just copyout the old value and set a new value that they copyin may still be able to follow the old model. However, it may be cleaner to use the second model for all of the sysctl handlers to avoid lock operations.
- To allow for the common case, a sysctl could embed a pointer to a mutex in the SYSCTL_FOO macros and in the struct. This would work for most sysctls. For values protected by sx locks, spin mutexes, or other locking strategies besides a single sleep mutex, SYSCTL_PROC nodes could be used to get the locking right; a sketch of this approach follows the list.
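A sketch of a locked handler along these lines follows, using SYSCTL_PROC and sysctl_handle_int(9). The variable foo_limit, its mutex, and the oid placement are hypothetical; the old value is read under the lock before the copy out/in, and the new value is written back under the lock.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/sysctl.h>

static struct mtx foo_mtx;	/* protects foo_limit; initialized elsewhere */
static int foo_limit;

static int
sysctl_foo_limit(SYSCTL_HANDLER_ARGS)
{
	int error, val;

	mtx_lock(&foo_mtx);
	val = foo_limit;		/* read the old value under the lock */
	mtx_unlock(&foo_mtx);
	error = sysctl_handle_int(oidp, &val, 0, req);
	if (error != 0 || req->newptr == NULL)
		return (error);
	mtx_lock(&foo_mtx);
	foo_limit = val;		/* write the new value under the lock */
	mtx_unlock(&foo_mtx);
	return (0);
}
SYSCTL_PROC(_kern, OID_AUTO, foo_limit, CTLTYPE_INT | CTLFLAG_RW, NULL, 0,
    sysctl_foo_limit, "I", "Hypothetical limit protected by foo_mtx");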
The taskqueue interface has two basic locks associated with it in order to protect
the related shared data. The taskqueue_queues_mutex is meant to serve as a lock to
protect the taskqueue_queues TAILQ. The other mutex lock associated with this
system is the one in the struct taskqueue data structure, and the synchronization
primitive there protects the integrity of the data in the struct taskqueue. It
should be noted that there are no separate macros to assist the user in locking
down his or her own work, since these locks are most likely not going to be used
outside of kern/subr_taskqueue.c.
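As a rough illustration of the per-queue locking (the structure layout and enqueue function below are simplified stand-ins, not the actual kern/subr_taskqueue.c code):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>
#include <sys/taskqueue.h>

/* Simplified stand-in for the real structure in kern/subr_taskqueue.c. */
struct example_taskqueue {
	STAILQ_HEAD(, task)	tq_queue;	/* pending tasks */
	struct mtx		tq_mutex;	/* protects this structure */
};

static void
example_taskqueue_enqueue(struct example_taskqueue *queue, struct task *task)
{
	mtx_lock(&queue->tq_mutex);
	STAILQ_INSERT_TAIL(&queue->tq_queue, task, ta_link);	/* queue the task */
	mtx_unlock(&queue->tq_mutex);
}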