The Neutrino Microkernel

Introduction

Neutrino is a microkernel implementation of the core POSIX 1003.1, 1003.1a, 1003.1b, 1003.1c, and 1003.1d features used in embedded realtime systems, along with the fundamental QNX message-passing services. The POSIX features that aren't implemented in the microkernel (file and device I/O, for example) are provided by optional processes and DLLs (dynamically linked libraries).

Successive QNX microkernels have seen a reduction in the code required to implement a given kernel call. The object definitions at the lowest layer in the kernel code have become more specific, allowing greater code reuse (such as folding various forms of POSIX signals, realtime signals, and QNX pulses into common data structures and code to manipulate those structures).

At its lowest level, Neutrino contains a few fundamental objects and the highly tuned routines that manipulate them. The Neutrino microkernel is built from this foundation.

 


fig: images/overview.gif
The Neutrino microkernel.

Some developers have assumed that the QNX microkernel is implemented in assembly code for size or performance reasons. In fact, our implementation is coded mostly in C; size and performance goals are achieved through successively refined algorithms and data structures, rather than via assembly-level peep-hole optimizations.

 

The implementation of Neutrino

Historically, the ``application pressure'' on QNX has been from both ends of the computing spectrum - from memory-limited embedded systems all the way up to high-end SMP machines with gigabytes of physical memory. Accordingly, the design goals for Neutrino accommodate both seemingly exclusive sets of functionality. Pursuing these goals is intended to extend the reach of systems well beyond what the QNX 4.x implementation could address.

 


fig: images/grail.gif
The "Holy Grail" of operating systems.

1003.1b, 1003.1c, and 1003.1d extensions

Since Neutrino implements the majority of the realtime and thread services directly in the microkernel, these services are available even without the presence of additional OS modules.

In addition, some of the profiles defined by POSIX 1003.13 suggest that these services be present without necessarily requiring a process model. In order to accommodate this, Neutrino provides direct support for threads, but relies on an add-on process manager to extend this functionality to processes containing multiple threads.

Note that many realtime executives and kernels provide only a non-memory-protected threaded model, with no process model and/or protected memory model at all. Without a process model, full POSIX compliance cannot be achieved.

Scalable for very small embedded systems

To target very small embedded systems, Neutrino goes a step further than earlier versions of QNX by allowing even the process manager to be an optional component. When running without a process manager, Neutrino lacks the ability to create processes and has only limited, library-implemented pathname space management. As a result, a minimal Neutrino embedded system must boot with an initial code set and must not attempt to create processes during run time (although additional threads can be created).

This is a reasonable tradeoff, since many embedded systems don't need to dynamically create processes; they boot with an initial code set and run ``as is.'' Support for dynamic process creation would go unused in such systems.

 


fig: images/sysproc.gif
Neutrino system process containing threads.

Since embedded systems are equipped with various combinations of memory-protection hardware (or none at all), an OS with a strict dependence on an MMU would be limited. The solution for Neutrino was to implement an optional code module that could be included to perform MMU initialization for the rest of the OS. As a result, Neutrino can run with or without an MMU. This means the system developer can choose to do without memory protection if the application doesn't warrant it.

 

Neutrino services

The Neutrino microkernel has kernel calls to support the following:
  • threads
  • message passing
  • signals
  • timers
  • interrupt handlers
  • semaphores
  • mutual exclusion locks (mutexes)
  • condition variables (condvars)

The entire OS is built upon these calls. Neutrino is fully preemptable, even while passing messages between processes; it resumes the message pass where it left off before preemption.

The minimal complexity of the Neutrino microkernel significantly helps place an upper bound on the longest non-preemptable code path through the kernel, while the small code size makes addressing complex multiprocessor issues a tractable problem. Services were chosen for inclusion in the microkernel on the basis of having a short execution path. Operations requiring significant work (e.g. process loading) were assigned to external processes/threads, where the effort to enter the context of that thread would be insignificant compared to the work done within the thread to service the request.

Applying this rule rigorously when dividing functionality between the kernel and external processes dispels the myth that a microkernel OS must incur higher runtime overhead than a monolithic kernel OS. Given the work done between context switches (implicit in a message pass), and the very quick context-switch times that result from the simplified kernel, the time spent performing context switches becomes ``lost in the noise'' of the work done to service the requests communicated by the message passing between the processes that make up the QNX OS.

The following diagram illustrates the preemptability and interruptibility of the kernel through various stages of processing a message-pass request.

 


fig: images/preempt.gif
Neutrino preemption details for the non-SMP kernel. The SMP version of the kernel has more opcodes.

Interrupts are disabled, or preemption is held off, for only very brief intervals. The exit path varies from 14 to 40 opcodes only when ``error'' cases (page faults and memory violations) occur in application processing. For the normal, non-error case, the work to exit the microkernel is only 14 opcodes.

 

Threads and processes

When building an application (realtime, embedded, graphical, or otherwise), the developer may want several algorithms within the application to execute concurrently. Within QNX/Neutrino, this concurrency is achieved by using the POSIX thread model, which defines a process as containing one or more threads of execution.

A thread can be thought of as the minimum ``unit of execution,'' the unit of scheduling and execution in the microkernel. A process, on the other hand, can be thought of as a ``container'' for threads, defining the ``address space'' within which threads will execute. A process will always contain at least one thread.

Depending on the nature of the application, threads might execute independently with no need to communicate between the algorithms (unlikely), or they may need to be tightly coupled, with high-bandwidth communications and tight synchronization. To assist in this communication and synchronization, QNX provides a rich variety of IPC and synchronization services.

The following pthreads (POSIX Threads) library calls don't involve any microkernel thread calls:

  • pthread_attr_destroy()
  • pthread_attr_getdetachstate()
  • pthread_attr_getinheritsched()
  • pthread_attr_getschedparam()
  • pthread_attr_getschedpolicy()
  • pthread_attr_getscope()
  • pthread_attr_getstackaddr()
  • pthread_attr_getstacksize()
  • pthread_attr_init()
  • pthread_attr_setdetachstate()
  • pthread_attr_setinheritsched()
  • pthread_attr_setschedparam()
  • pthread_attr_setschedpolicy()
  • pthread_attr_setscope()
  • pthread_attr_setstackaddr()
  • pthread_attr_setstacksize()
  • pthread_cleanup_pop()
  • pthread_cleanup_push()
  • pthread_equal()
  • pthread_getspecific()
  • pthread_setspecific()
  • pthread_testcancel()
  • pthread_key_create()
  • pthread_key_delete()
  • pthread_once()
  • pthread_self()
  • pthread_setcancelstate()
  • pthread_setcanceltype()

The following table lists the POSIX thread calls that have a corresponding microkernel thread call, allowing the developer to elect to use either interface:

POSIX call | Microkernel call | Description
pthread_create() | ThreadCreate() | Create a new thread of execution.
pthread_exit() | ThreadDestroy() | Destroy a thread.
pthread_detach() | ThreadDetach() | Detach a thread so it doesn't need to be joined.
pthread_join() | ThreadJoin() | Join a thread waiting for its exit status.
pthread_cancel() | ThreadCancel() | Cancel a thread at the next cancellation point.
N/A | ThreadCtl() | Change a thread's Neutrino-specific thread characteristics.
pthread_mutex_init() | SyncCreate() | Create a mutex.
pthread_mutex_destroy() | SyncDestroy() | Destroy a mutex.
pthread_mutex_lock() | SyncMutexLock() | Lock a mutex.
pthread_mutex_trylock() | SyncMutexLock() | Conditionally lock a mutex.
pthread_mutex_unlock() | SyncMutexUnlock() | Unlock a mutex.
pthread_cond_init() | SyncCreate() | Create a condition variable.
pthread_cond_destroy() | SyncDestroy() | Destroy a condition variable.
pthread_cond_wait() | SyncCondvarWait() | Wait on a condition variable.
pthread_cond_signal() | SyncCondvarSignal() | Signal a condition variable.
pthread_cond_broadcast() | SyncCondvarSignal() | Broadcast a condition variable.
pthread_getschedparam() | SchedGet() | Get scheduling parameters and policy of thread.
pthread_setschedparam() | SchedSet() | Set scheduling parameters and policy of thread.
pthread_sigmask() | SignalProcMask() | Examine or set a thread's signal mask.
pthread_kill() | SignalKill() | Send a signal to a specific thread.

Neutrino and the process manager can be configured to provide a mix of threads and processes (as defined by POSIX) to create at least the following "concurrent execution" environments:

  • A team of threads, all running within the address space of a single process. This is comparable to the classical "realtime kernel" runtime model. Without the assistance of the process manager, Neutrino can create this runtime environment.
  • A team of processes, each containing one thread, with no memory protection between processes.
  • A team of processes, each with one thread, with each process running in a separate, MMU-protected address space. This is the typical UNIX-without-threads runtime environment. The optional process manager is required for this environment.
  • A team of processes, each containing a team of cooperating threads, with all the processes MMU-protected from each other. This is the modern runtime environment found in UNIX, Windows NT, etc. The optional process manager is required for this environment.

The environment you choose affects not only the concurrency capabilities of the application, but also the IPC and synchronization services the application might make use of.

 


Note: Even though the common term ``IPC'' refers to communicating processes, we use it here to describe the communication between threads, whether they're within the same process or separate processes.

 


Thread attributes

Although threads within a process share everything within the process's address space, each thread still has some ``private'' data. In some cases, this private data is protected within the kernel (e.g. the tid or thread ID), while other private data resides unprotected in the process's address space (e.g. each thread has a stack for its own use). Some of the more noteworthy thread-private resources are:
tid
Each thread is identified by an integer thread ID, starting at 1. The tid is unique within the thread's process.
register set
Each thread has its own program counter (PC), stack pointer (SP), and other processor-specific register context.
stack
Each thread executes on its own stack, stored within the address space of its process.
signal mask
Each thread has its own signal mask.
thread local storage
A thread has a system-defined data area called ``thread local storage'' (TLS). The TLS is used to store ``per-thread'' information (such as tid, pid, stack base, errno, and thread-specific key/data bindings). The TLS doesn't need to be accessed directly by a user application. A thread can have user-defined data associated with a thread-specific data key.
cancellation handlers
Callback functions that are executed when the thread terminates.

Thread-specific data, implemented in the pthread library and stored in the TLS, provides a mechanism for associating a process global integer key with a unique per-thread data value. To use thread-specific data, a new key is first created and a unique data value is then bound to the key (per thread). The data value may, for example, be an integer or a pointer to a dynamically allocated data structure. Subsequently, the key is used to return the bound data value per thread.

A typical application of thread-specific data is for a thread-safe function that needs to maintain a context for each calling thread.

 


fig: images/matrix.gif
Sparse matrix (tid,key) to value mapping.

The following functions are used to create and manipulate this data:

Function | Description
pthread_key_create() | Create a data key with destructor function
pthread_key_delete() | Destroy a data key
pthread_setspecific() | Bind a data value to a data key
pthread_getspecific() | Return the data value bound to a data key
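
For example, here's a minimal sketch of a thread-safe routine that keeps a per-thread context using these calls (the context structure, its size, and the function names are purely illustrative):

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t  ctx_key;
static pthread_once_t ctx_once = PTHREAD_ONCE_INIT;

static void ctx_destroy( void *ctx ) {      /* destructor runs at thread exit */
    free( ctx );
}

static void ctx_key_init( void ) {
    pthread_key_create( &ctx_key, ctx_destroy );
}

/* Return the calling thread's private context, creating it on first use. */
void *get_context( void ) {
    void *ctx;

    pthread_once( &ctx_once, ctx_key_init );
    if( (ctx = pthread_getspecific( ctx_key )) == NULL ) {
        ctx = calloc( 1, 64 );              /* hypothetical per-thread context */
        pthread_setspecific( ctx_key, ctx );
    }
    return ctx;
}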

Thread life cycle

The number of threads within a process can vary widely, with threads being created and destroyed dynamically. Thread creation (pthread_create()) involves allocating and initializing the necessary resources within the process's address space (thread stack, etc.) and starting the execution of the thread at some function in the address space.
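
As a sketch, creating a thread with default attributes and later waiting for it to terminate might look like this (the worker function is hypothetical):

#include <pthread.h>

void *worker( void *arg ) {
    /* ... run the algorithm ... */
    return arg;                             /* value collected by pthread_join() */
}

void spawn_and_wait( void ) {
    pthread_t  tid;
    void      *status;

    pthread_create( &tid, NULL, worker, NULL );   /* NULL = default attributes */
    /* ... do other work concurrently ... */
    pthread_join( tid, &status );                 /* block until worker terminates */
}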

Thread termination (pthread_exit() or pthread_cancel()) involves stopping the thread and reclaiming the thread's resources. As a thread executes, its state can generally be described as either "ready" or "blocked." More specifically, it can be one of the following:

 


fig: images/lifecycl.gif
Possible thread states in a Neutrino system.

RUNNING
The thread is being executed by a processor.
READY
The thread is waiting to be executed while the processor executes another thread of equal or higher priority.
STOPPED
The thread is blocked waiting for a SIGCONT signal.
SEND
The thread is blocked on a message send (called MsgSendv()).
REPLY
The thread is blocked on a message reply (called MsgSendv()).
RECEIVE
The thread is blocked on a message receive (called MsgReceivev()).
SIGSUSPEND
The thread is blocked waiting for a signal (called sigsuspend()).
SIGWAIT
The thread is blocked waiting for a signal (called sigwaitinfo()).
NANOSLEEP
The thread is sleeping for a short time interval (called nanosleep()).
MUTEX
The thread is blocked on a mutual exclusion lock (called pthread_mutex_lock()).
CONDVAR
The thread is blocked on a condition variable (called pthread_cond_wait()).
JOIN
The thread is blocked waiting to join another thread (called pthread_join()).
INTERRUPT
The thread is blocked waiting for an interrupt (called InterruptWait()).
DEAD
The thread has terminated and is waiting for a join by another thread.

Thread scheduling

When scheduling decisions are made

The execution of a running thread is temporarily suspended whenever the microkernel is entered as the result of a kernel call, exception, or hardware interrupt. A scheduling decision is made whenever the execution state of any thread changes - it doesn't matter which processes the threads might exist within. Threads are scheduled globally across all processes.

Normally, the execution of the suspended thread will resume, but the scheduler will perform a context switch from one thread to another whenever the running thread:

  • is blocked
  • is preempted
  • yields.

When thread is blocked

The running thread will block when it must wait for some event to occur (response to an IPC request, wait on a mutex, etc.). The blocked thread is removed from the ready queue and the highest priority ready thread is then run. When the blocked thread is subsequently unblocked, it is placed on the end of the ready queue for that priority level.

When thread is preempted

The running thread will be preempted when a higher-priority thread is placed on the ready queue (it becomes READY, as the result of its block condition being resolved). The preempted thread remains at the start of the ready queue for that priority and the higher-priority thread runs.

When thread yields

The running thread voluntarily yields the processor (sched_yield()) and is placed on the end of the ready queue for that priority. The highest-priority thread then runs (which may still be the thread that just yielded).

Scheduling priority

Every thread is assigned a priority. The scheduler selects the next thread to run by looking at the priority assigned to every thread that is READY (i.e. capable of using the CPU). The thread with the highest priority is selected to run.

 


fig: images/readyq.gif
The ready queue for six threads (A-F) that are READY. All other threads (G-Z) are BLOCKED. Thread A is currently running. Threads A, B, and C are at the highest priority, so they will share the processor based on the running thread's scheduling algorithm.

Each thread can have a scheduling priority ranging from 0 to 31 (the highest priority), independent of scheduling policy. A thread inherits the priority of its parent thread by default. The idle thread has priority 0 and is always ready to run.

A thread has both a real priority and an effective priority, and is scheduled in accordance with its effective priority. The real and effective priority may be changed together by the thread itself. The effective priority may change due to priority inheritance or the scheduling policy (see below). Normally the effective priority is the same as the real priority.

The threads on the ready queue are ordered by priority. The ready queue is actually implemented as 32 separate queues, one for each priority. Threads are queued in FIFO order in the queue of their priority. The first thread in the highest priority queue is selected to run.

Scheduling algorithms

To meet the needs of various applications, QNX provides three scheduling algorithms:
  • FIFO scheduling
  • round-robin scheduling
  • adaptive scheduling

Each thread in the system may run using any one of these methods. They are effective on a per-thread basis, not on a global basis for all threads and processes on a node.

Remember that these scheduling algorithms apply only when two or more threads that share the same priority are READY (i.e. the threads are directly competing with each other). If a higher-priority thread becomes READY, it immediately preempts all lower-priority threads.

In the following diagram, three threads of equal priority are READY. If Thread A blocks, Thread B will run.

 


fig: images/ablocks.gif
Thread A blocks, Thread B runs.

Although a thread inherits its scheduling algorithm from its parent process, the thread can request to change the algorithm applied by the kernel.

FIFO scheduling

In FIFO scheduling, a thread selected to run continues executing until it:
  • voluntarily relinquishes control (e.g. it blocks)
  • is preempted by a higher-priority thread

 


fig: images/fifo.gif
FIFO scheduling. Thread A runs until it blocks.

Round-robin scheduling

In round-robin scheduling, a thread selected to run continues executing until it:
  • voluntarily relinquishes control
  • is preempted by a higher-priority thread
  • consumes its timeslice

 


fig: images/robin.gif
Round-robin scheduling. Thread A ran until it consumed its timeslice; the next READY thread (Thread B) now runs.

A timeslice is the unit of time assigned to every thread. Once a thread consumes its timeslice, it is preempted and the next READY thread at the same priority level is given control. A timeslice is 50 milliseconds.


Note: Apart from time slicing, round-robin scheduling is identical to FIFO scheduling.

Adaptive scheduling

In adaptive scheduling, a thread behaves as follows:
  • If the thread consumes its timeslice (i.e. it doesn't block), its priority is reduced by 1. This is known as priority decay. Note that a "decayed" thread won't continue decaying, even if it consumes yet another timeslice without blocking - it will drop only one level below its original priority.
  • If the thread blocks, it immediately reverts to its original priority.

 


fig: images/adaptive.gif
Adaptive scheduling. Thread A consumed its timeslice; its priority was then dropped by 1. The next READY thread (Thread B) runs.

Adaptive scheduling can be used in environments where potentially compute-intensive background threads are sharing the computer with interactive users. You should find that adaptive scheduling gives the compute-intensive threads sufficient access to the CPU, yet retains fast interactive response for other threads. It is rarely used for most realtime control systems.

 


Note: For release 1.01, adaptive scheduling is treated the same as round-robin scheduling.

Manipulating priority and scheduling algorithms

A thread's priority can vary during its execution, either from direct manipulation by the thread itself or from the kernel adjusting the thread's priority as it receives a message from a higher-priority thread.

In addition to priority, the scheduling algorithm employed by the kernel as the thread runs can also be selected. Here are the POSIX calls for performing these manipulations, along with the microkernel calls used by these library routines:

POSIX call | Microkernel call | Description
sched_getparam() | SchedGet() | get scheduling priority
sched_setparam() | SchedSet() | set scheduling priority
sched_getscheduler() | SchedGet() | get scheduling policy
sched_setscheduler() | SchedSet() | set scheduling policy
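
As a sketch, a thread could raise its own priority by one level using the POSIX thread calls listed earlier (error checking omitted):

#include <pthread.h>
#include <sched.h>

void bump_my_priority( void ) {
    int                policy;
    struct sched_param param;

    pthread_getschedparam( pthread_self(), &policy, &param );
    param.sched_priority++;                 /* one level higher, same policy */
    pthread_setschedparam( pthread_self(), policy, &param );
}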

IPC issues

Since all the threads in a process have unhindered access to the shared data space, wouldn't this execution model ``trivially'' solve all of our IPC problems? Can't we just communicate the data through shared memory and dispense with any other execution models and IPC mechanisms?

If only it were that simple!

One issue is that the access of individual threads to common data must be synchronized. Having one thread read inconsistent data because another thread is part way through modifying it is a recipe for disaster. For example, if one thread is updating a linked list, no other threads can be allowed to traverse or modify the list until the first thread has finished. A code passage that must execute ``serially'' (i.e. by only one thread at a time) in this manner is termed a ``critical section.'' The program would fail (intermittently, depending on how frequently a ``collision'' occurred) with irreparably damaged links unless some synchronization mechanism ensured serial access.

Mutexes, semaphores, and condition variables (condvars) are examples of synchronization tools that can be used to address this problem. These tools are described later in this section.

Although synchronization services can be used to allow threads to cooperate, shared memory per se can't address a number of IPC issues. For example, although threads can communicate through the common data space, this works only if all the threads communicating are within a single process. What if our application needs to communicate a query to a database server? We need to pass the details of our query to the database server, but the thread we need to communicate with lies within a database server process and the address space of that server isn't addressable to us. If an MMU is enabled, an attempt to directly address that data space would result in the OS aborting our thread (or process, depending on the signal-handling configuration)!

This problem can be resolved by creating shared-memory regions between the client and server, with the associated increase in complexity to manage these areas. However, if that database server process is running on a different processor connected via some network interface, then directly accessing the memory of that server isn't an option. Clearly, shared-memory IPC between threads doesn't address all our needs.

 


Note: Although some OSs utilize the MMU to implement network-distributed shared memory and can accomplish IPC by passing ownership of this memory around the network, there are related issues of complexity, robustness, and performance that may not be appropriate to mission-critical and/or realtime applications. Also, if the CPU lacks a paged MMU, then the OS won't be able to obtain the "Not Present" faults needed. For these reasons, we restrict our discussion of network IPC to explicit IPC across the network, rather than rely on distributed shared-memory services and the use of synchronization primitives within this memory.

QNX message passing addresses this network-distributed IPC issue because the one interface - message passing - operates identically in both the local and network-remote cases, and can be used to access all OS services. Also, because messages can be exactly sized, and because most messages tend to be quite tiny (e.g. the error status on a write request, or a tiny read request), the data moved around the network can be far less with message passing than with network-distributed shared memory, which would tend to copy 4K pages around.

Thread complexity issues

Although threads are very appropriate for some system designs, it's important to respect the Pandora's box of complexities their use unleashes. In some ways, it's ironic that while MMU-protected multitasking has become common, computing fashion has made popular the use of multiple threads in an unprotected address space. This not only makes debugging difficult, but also hampers the generation of reliable code.

Threads were initially introduced to UNIX systems as a ``light-weight'' concurrency mechanism to address the problem of slow context switches between ``heavy weight'' processes. Although this is a worthwhile goal, an obvious question arises: Why are process-to-process context switches slow in the first place?

Architecturally, QNX addresses the context-switch performance issue first. In fact, QNX threads and processes provide nearly identical context-switch performance numbers. QNX process-switch times are faster than UNIX thread-switch times. As a result, QNX threads don't need to be used to solve the IPC performance problem; instead, they're a tool for achieving greater concurrency within application and server processes.

Without resorting to threads, fast process-to-process context switching makes it reasonable to structure an application as a team of cooperating processes sharing an explicitly allocated shared-memory region. An application thus exposes itself to bugs in the cooperating processes only insofar as those bugs affect the contents of the shared-memory region. The private memory of each process is still protected from the other processes. In the purely threaded model, by contrast, the private data of all threads is openly accessible, vulnerable to stray pointer errors in any thread in the process. For example, each cooperating process has an MMU-protected stack, while threads within a process run with unprotected stacks.

Nevertheless, threads can also provide concurrency advantages that a pure process model cannot address. For example, a filesystem server process that executes requests on behalf of many clients (where each request takes significant time to complete) definitely benefits from having multiple threads of execution. If one client process requests a block from disk while another client requests a block already in cache, the filesystem process can utilize a pool of threads to concurrently service client requests, rather than remain "busy" until the disk block is read for the first request.

As requests arrive, each thread can either respond directly from the buffer cache or block and wait for disk I/O without increasing the response latency seen by other client processes. The filesystem server can "precreate" a team of threads, ready to respond in turn to client requests as they arrive. Although this complicates the architecture of the filesystem manager, the gains in concurrency are significant.

Synchronization services

QNX/Neutrino provides the POSIX-standard thread-level synchronization primitives, some of which are useful even between threads in different processes. The synchronization services provided include at least the following:
Synchronization service | Supported between processes | Supported across a QNX LAN
Mutexes | Yes | No
Condvars | Yes | No
Sleepon locks | No | No
Reader/writer locks | No | No
Semaphores | Yes | Yes (named only)
FIFO scheduling | Yes | No
Send/Receive/Reply | Yes | Yes

Note: These synchronization primitives are implemented directly by Neutrino, except for sleepon locks and reader/writer locks, which are built from mutexes and condvars.

Mutual exclusion locks

Mutual exclusion locks, or mutexes, are the simplest of the Neutrino synchronization services. A mutex is used to ensure exclusive access to data shared between threads. It is typically acquired (pthread_mutex_lock()) and released (pthread_mutex_unlock()) around the code that accesses the shared data (usually a critical section).

Only one thread may have the mutex locked at any given time. Threads attempting to lock an already locked mutex will block until the owner unlocks it. When the owner unlocks the mutex, the highest-priority thread waiting to lock the mutex will unblock and become the new owner. In this way, threads sequence through a critical region in priority order.

Acquisition of the mutex requires only a single opcode (compare and swap) if the mutex isn't already held by another thread, and a single opcode to release it. The kernel is entered at acquisition time only if the mutex is already held (so that the thread can be placed on a blocked list), and at release time only if other threads are waiting to be unblocked on that mutex. This allows acquisition and release of an uncontested critical section or resource to be very quick, incurring work by the OS only to resolve contention.

A non-blocking lock function (pthread_mutex_trylock()) can be used to test whether the mutex is currently locked or not. For best performance, the execution time of the critical section should be small and of bounded duration. A condvar should be used if the thread may block within the critical section.
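
A minimal sketch of a critical section protected by a mutex (the shared counter is hypothetical):

#include <pthread.h>

static pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
static int             shared_count;        /* data shared between threads */

void increment_count( void ) {
    pthread_mutex_lock( &count_mutex );     /* enter the critical section */
    shared_count++;                         /* only one thread at a time runs here */
    pthread_mutex_unlock( &count_mutex );   /* highest-priority waiter runs next */
}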

Priority inheritance

If a thread with a higher priority than the mutex owner attempts to lock a mutex, then the effective priority of the current owner will be increased to that of the higher-priority blocked thread waiting for the mutex. The owner will return to its real priority when it unlocks the mutex. This scheme ensures that the higher-priority thread will be blocked waiting for the mutex for the shortest possible time and solves the classic priority-inversion problem.

The attributes of the mutex can also be modified (using pthread_mutex_setrecursive()) to allow a mutex to be recursively locked by the same thread. This can be useful to allow a thread to call a routine that might attempt to lock a mutex that the thread already happens to have locked.


Note: Recursive mutexes are non-POSIX services - they don't work with condvars.

Condition variables

A condition variable, or condvar, is used to block a thread within a critical section until some condition is satisfied. The condition can be arbitrarily complex and is independent of the condvar. However, the condvar must always be used with a mutex lock in order to implement a monitor.

A condvar supports three operations:

  • wait (pthread_cond_wait())
  • signal (pthread_cond_signal())
  • broadcast (pthread_cond_broadcast()).

Note: Note that there's no connection between a condvar signal and a POSIX signal.

Here's a typical example of how a condvar can be used:

pthread_mutex_lock( &m );
. . .
while (!arbitrary_condition) {
    pthread_cond_wait( &cv, &m );
    }
. . .
pthread_mutex_unlock( &m );

In this code sample, the mutex is acquired before the condition is tested. This ensures that only this thread has access to the arbitrary condition being examined. While the condition is true, the code sample will block on the wait call until some other thread performs a signal or broadcast on the condvar.

The while loop is required for two reasons. First of all, POSIX cannot guarantee that false wakeups will not occur (e.g. multi-processor systems). Second, when another thread has made a modification to the condition, we need to retest to ensure that the modification matches our criteria. The associated mutex is unlocked atomically by pthread_cond_wait() when the waiting thread is blocked to allow another thread to enter the critical section.

A thread that performs a signal will unblock the highest-priority thread queued on the condvar, while a broadcast will unblock all threads queued on the condvar. The associated mutex is locked atomically by the highest-priority unblocked thread; the thread must then unlock the mutex after proceeding through the critical section.
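
The thread that makes the condition true would, as a sketch, hold the same mutex while changing the condition and then signal the condvar (using the same names as the wait example above):

pthread_mutex_lock( &m );
. . .
arbitrary_condition = 1;            /* make the condition true */
pthread_cond_signal( &cv );         /* wake the highest-priority waiting thread */
pthread_mutex_unlock( &m );         /* the unblocked thread can now reacquire the mutex */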

A version of the condvar wait operation allows a timeout to be specified (pthread_cond_timedwait()). The waiting thread can then be unblocked when the timeout expires.

Sleepon locks

Sleepon locks are very similar to condvars, with a few subtle differences. Like condvars, sleepon locks (pthread_sleepon_lock()) can be used to block until a condition becomes true (like a memory location changing value). But unlike condvars (which must be allocated for each condition to be checked), sleepon locks multiplex their functionality over a single mutex and dynamically allocated condvar, regardless of the number of conditions being checked. The maximum number of condvars ends up being equal to the maximum number of blocked threads. These locks are patterned after the sleepon locks commonly used within the UNIX kernel.
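
As a sketch, assuming the library calls pthread_sleepon_lock(), pthread_sleepon_wait(), pthread_sleepon_signal(), and pthread_sleepon_unlock(), a waiting thread and a waking thread might look like this; the address of the (hypothetical) flag is the key multiplexed over the single lock:

#include <pthread.h>

volatile int data_ready;                    /* hypothetical condition */

void consumer( void ) {
    pthread_sleepon_lock();
    while( !data_ready )
        pthread_sleepon_wait( &data_ready );    /* block, keyed on this address */
    /* ... consume the data ... */
    pthread_sleepon_unlock();
}

void producer( void ) {
    pthread_sleepon_lock();
    data_ready = 1;
    pthread_sleepon_signal( &data_ready );      /* wake one thread waiting on this address */
    pthread_sleepon_unlock();
}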

Reader/writer locks

More formally known as "Multiple readers, single writer locks," these locks are used when the access pattern for a data structure consists of many threads reading the data, and (at most) one thread writing the data. These locks are more expensive than mutexes, but can be useful for this data access pattern.

This lock works by allowing all the threads that request a read-access lock (pthread_rwlock_shared()) to succeed in their request. But when a thread wishing to write asks for the lock (pthread_rwlock_exclusive()), the request is denied until all the current reading threads release their reading locks (pthread_rwlock_unlock()).

Multiple writing threads can queue (in priority order) waiting for their chance to write the protected data structure, and all the blocked writer-threads will get to run before reading threads are allowed access again. The priorities of the reading threads are not considered.

There are also calls (pthread_rwlock_tryexclusive() and pthread_rwlock_tryshared()) that let a thread attempt to acquire the requested lock without blocking. These calls return with a successful lock or a status indicating that the lock couldn't be granted immediately.

Reader/writer locks aren't implemented directly within the kernel, but are instead built from the mutex and condvar services provided by the kernel.
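
A sketch of this access pattern using the lock calls named above (the table being protected is hypothetical, and the lock is assumed to be initialized elsewhere; later POSIX drafts name the equivalent calls pthread_rwlock_rdlock() and pthread_rwlock_wrlock()):

#include <pthread.h>

static pthread_rwlock_t table_lock;         /* initialized elsewhere */
static int              table[256];         /* hypothetical shared data */

int lookup( int key ) {
    int value;

    pthread_rwlock_shared( &table_lock );       /* many readers may hold this at once */
    value = table[key];
    pthread_rwlock_unlock( &table_lock );
    return value;
}

void update( int key, int value ) {
    pthread_rwlock_exclusive( &table_lock );    /* granted once all readers release */
    table[key] = value;
    pthread_rwlock_unlock( &table_lock );
}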

Semaphores

Semaphores are another common form of synchronization that allows threads to ``post'' (sem_post()) and ``wait'' (sem_wait()) on a semaphore to control when threads wake or sleep. The post operation increments the semaphore; the wait operation decrements it.

If you wait on a semaphore that is positive, you will not block. Waiting on a non-positive semaphore will block until some other thread executes a post. It is valid to post one or more times before a wait. This use will allow one or more threads to execute the wait without blocking.

A significant difference between semaphores and other synchronization primitives is that semaphores are ``async safe'' and can be manipulated by signal handlers. If the desired effect is to have a signal handler wake a thread, semaphores are the right choice.
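
For example, here's a sketch of a signal handler waking a worker thread through a semaphore (initializing the semaphore and installing the handler are assumed to happen elsewhere):

#include <semaphore.h>
#include <errno.h>

static sem_t wake_sem;                  /* elsewhere: sem_init( &wake_sem, 0, 0 ) */

void handler( int signo ) {             /* signal handler: sem_post() is async safe */
    sem_post( &wake_sem );
}

void worker_loop( void ) {
    for( ;; ) {
        while( sem_wait( &wake_sem ) == -1 && errno == EINTR )
            ;                           /* retry if interrupted by a signal */
        /* ... handle the wakeup ... */
    }
}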

Another useful property of semaphores is that they were defined to operate between processes. Although QNX/Neutrino mutexes work between processes, the POSIX thread standard considers this an optional capability and as such may not be portable across systems. For synchronization between threads in a single process, mutexes will be more efficient than semaphores.

As a useful variation, a named semaphore service is also available. It uses a resource manager and as such allows semaphores to be used between processes on different machines connected by a network.


Note: Note that named semaphores are slower than the unnamed variety.

Since semaphores, like condition variables, can legally return a non-zero value because of a false wakeup, correct usage requires a loop:

while (sem_wait(&s) == -1 && errno == EINTR)
    ;                        /* retry if interrupted by a signal */
do_critical_region();        /* Semaphore was decremented */

Synchronization via scheduling algorithm

By selecting the POSIX FIFO scheduling algorithm, we can guarantee that no two threads of the same priority execute the critical section concurrently. The FIFO scheduling algorithm dictates that all FIFO-scheduled threads in the system at the same priority will run, when scheduled, until they voluntarily release the processor to another thread.

This "release" can also occur when the thread blocks as part of requesting the service of another process, or when a signal occurs. The critical region must therefore be carefully coded and documented so that later maintenance of the code doesn't violate this condition.

In addition, higher-priority threads in that (or any other) process could still preempt these FIFO-scheduled threads. So, all the threads that could ``collide'' within the critical section must be FIFO-scheduled at the same priority. Having enforced this condition, the threads can then casually access this shared memory without having to first make explicit synchronization calls.


Note: Note that this exclusive-access relationship doesn't apply in multi-processor systems, since each CPU could run a thread simultaneously through the region that would otherwise be serially scheduled on a single-processor machine.
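
As a sketch, a thread taking part in such a scheme could be created FIFO-scheduled at an explicit priority (the priority value 10 is arbitrary; error checking omitted):

#include <pthread.h>
#include <sched.h>

pthread_t create_fifo_thread( void *(*func)( void * ), void *arg ) {
    pthread_t          tid;
    pthread_attr_t     attr;
    struct sched_param param;

    pthread_attr_init( &attr );
    pthread_attr_setinheritsched( &attr, PTHREAD_EXPLICIT_SCHED );
    pthread_attr_setschedpolicy( &attr, SCHED_FIFO );
    param.sched_priority = 10;              /* arbitrary priority level */
    pthread_attr_setschedparam( &attr, &param );
    pthread_create( &tid, &attr, func, arg );
    return tid;
}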

Synchronization via message passing

The Neutrino Send/Receive/Reply message-passing IPC services (described later) implement an implicit synchronization by their blocking nature. These IPC services can, in many instances, render other synchronization services unnecessary. They are also the only synchronization and IPC primitives (other than named semaphores, which are built on top of messaging) that can be used across the network.

Synchronization services implementation

The following table lists the various microkernel calls and the higher-level POSIX calls constructed from them:
Microkernel call | POSIX call | Description
SyncCreate() | pthread_mutex_init(), pthread_cond_init(), sem_init() | Create object for mutex, condvars, and semaphore.
SyncDestroy() | pthread_mutex_destroy(), pthread_cond_destroy(), sem_destroy() | Destroy synchronization object.
SyncCondvarWait() | pthread_cond_wait(), pthread_cond_timedwait() | Block on a condvar.
SyncCondvarSignal() | pthread_cond_broadcast(), pthread_cond_signal() | Wake up condvar-blocked threads.
SyncMutexLock() | pthread_mutex_lock(), pthread_mutex_trylock() | Lock a mutex.
SyncMutexUnlock() | pthread_mutex_unlock() | Unlock a mutex.
SyncSemPost() | sem_post() | Post a semaphore.
SyncSemWait() | sem_wait(), sem_trywait() | Wait on a semaphore.

QNX/Neutrino IPC

IPC plays a fundamental role in the transformation of QNX/Neutrino from an embedded realtime kernel into a full-scale POSIX operating system. As various service-providing processes are added to the Neutrino microkernel, IPC is the ``glue'' that connects those components into a cohesive whole.

Although message passing is the primary form of IPC in QNX/Neutrino, several other forms are available as well. Unless otherwise noted, those other forms of IPC are built over QNX message passing. The strategy is to create a simple, robust IPC service that can be tuned for performance through a simplified code path in the microkernel; more ``feature cluttered'' IPC services can then be implemented from these.

Benchmarks comparing higher-level IPC services (like pipes and FIFOs implemented over QNX messaging) with their monolithic kernel counterparts show comparable performance.

QNX/Neutrino offers at least the following forms of IPC:

Service | Implemented in
Message passing | kernel
Signals | kernel
POSIX message queues | external process
Shared memory | kernel

These services can be selected by the designer on the basis of bandwidth requirements, the need for queuing, network transparency, etc. The tradeoff can be complex, but the flexibility is useful.

As part of the engineering effort that went into defining the Neutrino microkernel, the focus on message passing as the fundamental IPC primitive was deliberate. As a form of IPC, message passing (as implemented in MsgSendv(), MsgReceivev(), and MsgReplyv()) is synchronous and copies data. Let's explore these two attributes in more detail.

Synchronous message passing

A thread that does a MsgSendv() to another thread (which could be within another process) will be blocked until the target thread does a MsgReceivev(), processes the message, and executes a MsgReplyv(). If a thread executes a MsgReceivev() without a previously sent message pending, it will block until another thread executes a MsgSendv().

 


fig: images/states.gif
A thread undergoing state changes in a typical send-receive-reply transaction.

This inherent blocking synchronizes the execution of the sending thread, since the act of requesting that the data be sent also causes the sending thread to be blocked and the receiving thread to be scheduled for execution. This happens without requiring explicit work by the kernel to determine which thread to run next (as would be the case with most other forms of IPC). Execution and data move directly from one context to another.

Data queuing capabilities are omitted from these messaging primitives because queueing could be implemented when needed within the receiving thread. The sending thread is often prepared to wait for a response; queueing is unnecessary overhead and complexity (i.e. it slows down the non-queued case). As a result, the sending thread doesn't need to make a separate, explicit blocking call to wait for a response, as it would if some other IPC form had been used.

While the send and receive operations are blocking and synchronous, MsgReplyv() (or MsgError()) doesn't block. Since the client thread is already blocked waiting for the reply, no additional synchronization is required, so a blocking MsgReplyv() isn't needed. This allows a server to reply to a client and continue processing while the kernel and/or networking code asynchronously passes the reply data to the sending thread and marks it ready for execution. As most servers will tend to do some processing to prepare to receive the next request (at which point they block again), this works out well.

MsgReplyv() vs MsgError()

The MsgReplyv() function is used to return zero or more bytes to the client. MsgError(), on the other hand, is used to return only a status to the client. Both functions will unblock the client from its MsgSendv().


Note: MsgError() is supported in Neutrino 1.1

Message copying

Because the Neutrino kernel messaging services copy a message directly from the address space of one thread to another without intermediate buffering, the message-delivery performance approaches the memory bandwidth of the underlying hardware. Neutrino attaches no special meaning to the content of a message - the data in a message has meaning only as mutually defined by sender and receiver. However, ``well-defined'' message types are also provided so that user-written processes or threads can augment or substitute for system-supplied services.

The messaging primitives in QNX support multipart transfers, so that a message delivered from the address space of one thread to another needn't pre-exist in a single, contiguous buffer. Instead, both the sending and receiving threads can specify a vector table that indicates where the sending and receiving message fragments reside in memory. Note that the size of the various parts can be different for the sender and receiver.

Multipart transfers allow messages that have a header block separate from the data block to be sent without performance-consuming copying of the data to create a contiguous message. In addition, if the underlying data structure is a ring buffer, specifying a three-part message will allow a header and two disjoint ranges within the ring buffer to be sent as a single atomic message. A hardware equivalent of this concept would be that of a scatter/gather DMA facility.
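
As a sketch, a three-part send covering a header plus a wrapped range of a ring buffer could be set up as follows (the header, ring buffer, reply layout, and connection are all hypothetical):

iov_t siov[3], riov[1];

SETIOV( siov + 0, &hdr, sizeof hdr );                   /* header block */
SETIOV( siov + 1, &ring[tail], ringsize - tail );       /* data up to the wrap point */
SETIOV( siov + 2, &ring[0], head );                     /* data after the wrap */
SETIOV( riov + 0, &reply, sizeof reply );

MsgSendv( coid, siov, 3, riov, 1 );                     /* delivered as one atomic message */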

 


fig: images/messcopy.gif
When sent or received, these parts are treated as one contiguous sequence of bytes. This is ideal for scatter/gather buffers and caches.

The multipart transfers are also used extensively by filesystems. On a read, the data is copied directly from the filesystem cache into the application using a message with one part for the reply status and n parts for the data. Each part points into the cache and compensates for the fact that cache blocks aren't contiguous in memory with a read starting or ending within a block.

For example, with a cache block size of 512 bytes, a read of 1454 bytes can be satisfied with a 5-part message:

 


fig: images/scatter.gif
Scatter/gather of a read of 1454 bytes.

Because message data is explicitly copied between address spaces (rather than by doing page table manipulations), messages can be easily allocated on the stack instead of from a special pool of page-aligned memory for MMU ``page flipping.'' As a result, many of the library routines that implement the API between client and server processes can be trivially expressed, without elaborate IPC-specific memory allocation calls.

For example, the code used by a client thread to request that the filesystem manager execute lseek on its behalf is implemented as follows:

#include <unistd.h>
#include <errno.h>
#include <sys/iomsg.h>
 
 
off64_t _lseek(int fd, off64_t offset, int whence) {
    union {
        struct _io_lseek        s;
        struct _io_lseek_reply  r;
    }                         msg;
    iov_t                  iov[2];
 
    msg.s.type = _IO_LSEEK;
    msg.s.combine_len = _IO_NO_COMBINE;
    msg.s.offset = offset; 
    msg.s.whence = whence; 
    msg.s.zero = 0;
    SETIOV(iov + 0, &msg.s, sizeof msg.s);
    SETIOV(iov + 1, &msg.r, sizeof msg.r);
    if(MsgSendv(fd, iov + 0, 1, iov + 1, 1) == -1) {
        offset.lo = offset.hi = -1;
        return offset;
    }
    if(msg.r.status != EOK) {
        errno = msg.r.status;
        offset.lo = offset.hi = -1;
        return offset;             
    }
    return msg.r.offset;
}
 
off_t lseek(int fd, off_t offset, int whence) {
    off64_t             off;
 
    off.hi = (offset < 0) ? -1 : 0;
    off.lo = offset;
    off = _lseek(fd, off, whence);
    return off.lo;
}
 
off_t tell(int fd) {
    return lseek(fd, 0, SEEK_CUR);
}
 

This code essentially builds a message structure on the stack, populates it with various constants and passed parameters from the calling thread, and sends it to the filesystem manager associated with fd. The reply indicates the success or failure of the operation.

 


Note: This implementation doesn't prevent the kernel from detecting large message transfers and choosing to implement ``page flipping'' for those cases. Since most messages passed are quite tiny, copying messages is often faster than manipulating MMU page tables. For bulk data transfer, shared memory between processes (with message-passing or the other synchronization primitives for notification) is also a viable option.

Channels and connections

In Neutrino, message passing is directed towards channels and connections, rather than targeted directly from thread to thread. A thread that wishes to receive messages first creates a channel, and another thread that wishes to send a message to that thread must first make a connection to that channel by ``attaching'' to the channel.

Channels are required by the message kernel calls and are used by servers to MsgReceivev() messages on. Connections are created by client threads to ``connect'' to the channels made available by servers. Once connections are established, clients can MsgSendv() messages over them. If a number of threads in a process all attach to the same channel, then the one connection is shared between all the threads. Channels and connections are named within a process by a small integer identifier. Client connections map directly into file descriptors.

Architecturally, this is a key point. By having client connections map directly into FDs, we have eliminated yet another layer of translation. We don't need to "figure out" where to send a message based on the file descriptor (e.g. via a read(fd) call). Instead, we can simply send a message directly to the "file descriptor" (i.e. connection ID).

Function | Description
ChannelCreate() | Create a channel to receive messages on.
ChannelDestroy() | Destroy a channel.
ConnectAttach() | Create a connection to send messages on.
ConnectDetach() | Detach a connection.

fig: images/connect.gif
Connections map elegantly into file descriptors (i.e. coid 2 == fd 2).

A process acting as a server would implement an event loop to receive and process messages as follows:

chid = ChannelCreate(flags);
SETIOV(&iov, &msg, sizeof(msg));
for(;;) {
    rcv_id = MsgReceivev( chid, &iov, parts, &info );
 
    switch( msg.type ) {
        /* Perform message processing here */
        }
 
    MsgReplyv( rcv_id, &iov, rparts );
    }

This loop allows the thread to receive messages from any thread that had a connection to the channel.
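
The client side of the same exchange, as a sketch, attaches a connection to that channel and then sends over it. The values identifying the server's node, process, and channel are assumed to be known, and ConnectAttach() is assumed to take index and flags arguments (shown here as zero):

coid = ConnectAttach( nd, pid, chid, 0, 0 );    /* connect to the server's channel */
SETIOV( &siov, &request, sizeof(request) );
SETIOV( &riov, &reply, sizeof(reply) );

MsgSendv( coid, &siov, 1, &riov, 1 );           /* SEND-blocked, then REPLY-blocked */
/* ... use the reply ... */

ConnectDetach( coid );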

The channel has three queues associated with it:

  • one queue for threads waiting for messages
  • one queue for threads that have sent a message that hasn't yet been received
  • one queue for threads that have sent a message that has been received, but not yet replied to.

While in any of these queues, the waiting thread is blocked (i.e. RECEIVE, SEND, or REPLY blocked).

 


fig: images/priority.gif
Waiting threads are blocked while in a channel queue.

Pulses

In addition to the synchronous Send/Receive/Reply services, Neutrino also supports fixed-size, non-blocking messages. These are referred to as pulses and carry a small payload (four bytes of data plus a single byte code).

Pulses are often used as a notification mechanism within interrupt handlers. They also allow servers to signal clients without blocking on them.
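
For example, here's a sketch of a server notifying a client over an existing connection, assuming MsgSendPulse() takes a connection ID, a priority, the 8-bit code, and the 32-bit value (the code constant is application-defined and hypothetical):

#define PULSE_CODE_DATA_READY   1       /* hypothetical application-defined code */

MsgSendPulse( coid, priority, PULSE_CODE_DATA_READY, bytes_available );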

 


fig: images/pulse.gif
Pulses pack a small payload - 8 bits of code and 32 bits of data.

Priority inheritance

A server process receives messages in priority order. As the threads within the server receive requests, they then inherit the priority of the sending thread (but not the scheduling algorithm). As a result, the relative priorities of the threads requesting work of the server are preserved, and the server work will be executed at the appropriate priority. This message-driven priority inheritance avoids priority-inversion problems.

The QNX/Neutrino message-passing API

The message-passing API consists of the following functions:

Function | Description
MsgSendv() | Send a message and block until reply.
MsgReceivev() | Wait for a message.
MsgReplyv() | Reply to a message.
MsgError() | Reply only with an error status. No message bytes are transferred.
MsgReadv() | Read additional data from a received message.
MsgWritev() | Write additional data to a reply message.
MsgInfo() | Obtain info on received message.
MsgSendPulse() | Send tiny, non-blocking message (pulse).
MsgDeliverEvent() | Deliver an event to a client.
MsgKeyData() | Key a message to allow security checks.