Robust implementations with Send/Receive/Reply

Architecting a QNX application as a team of cooperating threads and processes via Send/Receive/Reply results in a system that uses synchronous notification. IPC thus occurs at specified transitions within the system, rather than asynchronously.

A significant problem with asynchronous systems is that event notification requires signal handlers to be run. Asynchronous IPC can make it difficult to thoroughly test the operation of the system and to verify that processing will continue as intended no matter when a signal handler runs. Applications often try to avoid this scenario by relying on an explicitly opened and shut ``window'' during which signals will be tolerated.

With a synchronous, non-queued system architecture built around Send/Receive/Reply, robust application architectures can be very readily implemented and delivered.

Avoiding deadlock situations is another difficult problem when constructing applications from various combinations of queued IPC, shared memory, and miscellaneous synchronization primitives. For example, suppose thread A doesn't release mutex 1 until thread B releases mutex 2. Unfortunately, if thread B is in the state of not releasing mutex 2 until thread A releases mutex 1, a standoff results. Simulation tools are often invoked in order to ensure that deadlock won't occur as the system runs.

The Send/Receive/Reply IPC primitives allow the construction of deadlock-free systems by observing only a couple of simple rules:

  1. Never have two threads send to each other.
  2. Always arrange your threads in a hierarchy, with sends going up the tree.

The first rule is an obvious avoidance of the standoff situation, but the second rule requires further explanation. The team of cooperating threads and processes is arranged as follows:

 


fig: images/thtree.gif

Threads should always send up to higher-level threads.

Here the threads at any given level in the hierarchy never send to each other, but send only upwards instead.

One example of this might be a client application that sends to a database server process, which in turn sends to a filesystem process. Since the sending threads block and wait for the target thread to reply, and since the target thread isn't send-blocked on the sending thread, deadlock cannot result.

But how does a higher-level thread notify a lower-level thread that it has the results of a previously requested operation? (Assume the lower-level thread didn't want to wait for the replied results when it last sent.)

QNX/Neutrino provides a very flexible architecture with the MsgDeliverEvent() kernel call to deliver non-blocking events. All of the common asynchronous services can be implemented with this. For example, select() is an API call that an application can use to allow a thread to wait for an I/O event to complete on a set of file descriptors; its server side is built on this event-delivery mechanism. In addition to serving as a ``back channel'' for notifications from higher-level threads to lower-level threads, this mechanism can also be used to build a reliable notification system for timers, hardware interrupts, and other event sources.

 


fig: images/pulseven.gif

A higher-level thread can "send" a pulse event in order to notify a lower-level thread.

A related issue is the problem of how a higher-level thread can request work of a lower-level thread without sending to it, risking deadlock. The lower-level thread is present only to serve as a ``worker thread'' for the higher-level thread, doing work on request. The lower-level thread would send in order to ``report for work,'' but the higher-level thread wouldn't reply then. It would defer the reply until the higher-level thread had work to be done, and it would reply (which is a non-blocking operation) with the data describing the work. In effect, the reply is being used to initiate work, not the send, which neatly side-steps rule #1.
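
The following is a minimal sketch of this reply-driven model; the work_t message layout and do_work() are hypothetical, while MsgSend(), MsgReceive(), and MsgReply() are the native Send/Receive/Reply primitives:

/* Sketch of the reply-driven worker model. The work_t layout and
   do_work() are hypothetical illustrations. */
#include <sys/neutrino.h>

typedef struct { int opcode; } work_t;      /* describes the work */

extern void do_work( work_t *work );        /* hypothetical */

/* Lower-level worker: "reports for work" by sending up. */
void worker_loop( int coid ) {
    work_t work;

    for( ;; ) {
        /* Blocks until the higher-level thread replies with work */
        if( MsgSend( coid, NULL, 0, &work, sizeof( work ) ) == -1 )
            break;
        do_work( &work );
        }
    }

/* Higher-level thread: receive the report, but defer the reply. */
void dispatch( int chid, work_t *work ) {
    int rcvid;

    rcvid = MsgReceive( chid, NULL, 0, NULL );
    /* ... later, when work is available ... */
    MsgReply( rcvid, 0, work, sizeof( *work ) );    /* non-blocking */
    }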

Events

A significant advance in the kernel design for Neutrino is the event-handling subsystem. POSIX and its realtime extensions define a number of asynchronous notification methods (e.g. UNIX signals that don't queue or pass data, POSIX realtime signals that may queue and pass data, etc.).

Neutrino also defines additional, QNX-specific notification techniques such as pulses. Implementing all of these event mechanisms could have consumed significant code space, so our implementation strategy was to build all of these notification methods over a single, rich, event subsystem.

A benefit of this approach is that capabilities exclusive to one notification technique can become available to others. For example, a Neutrino application can apply the same queueing services of POSIX realtime signals to UNIX signals. This can simplify the robust implementation of signal handlers within applications.

The events encountered by an executing thread can come from any of three sources:

  • MsgDeliverEvent() kernel call invoked by a thread
  • an interrupt handler
  • the expiry of a timer.

The event itself can be any of a number of different types: QNX pulses, interrupts, various forms of signals, and forced ``unblock'' events. ``Unblock'' is a means by which a thread can be released from a deliberately blocked state without any explicit event actually being delivered.

Given this multiplicity of event types, and applications needing the ability to request whichever asynchronous notification technique best suits their needs, it would be awkward to require that server processes (the higher-level threads from the previous section) carry code to support all these options.

Instead, the client thread can give a data structure, or ``cookie,'' to the server to hang on to until later. When the server needs to notify the client thread, it will invoke MsgDeliverEvent() and the microkernel will deliver the event encoded within the cookie to the client thread.


fig: images/sigevent.gif

The client sends a sigevent to the server, who saves the event structure. When conditions are met, the server delivers the event via MsgDeliverEvent().
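
As a sketch of the two sides (SIGEV_PULSE_INIT() and _PULSE_CODE_MINAVAIL follow later QNX naming conventions, and the pulse priority of 10 is an arbitrary choice):

/* Client: fill in the "cookie" before handing it to the server
   inside an application-defined request message. coid is the
   client's connection on which it wants the pulse delivered. */
#include <sys/neutrino.h>
#include <sys/siginfo.h>

void init_cookie( struct sigevent *event, int coid ) {
    SIGEV_PULSE_INIT( event, coid, 10, _PULSE_CODE_MINAVAIL, 0 );
    }

/* Server: when the awaited condition occurs, fire the saved event.
   rcvid was returned by the MsgReceive() of the original request. */
void notify_client( int rcvid, const struct sigevent *event ) {
    MsgDeliverEvent( rcvid, event );    /* non-blocking */
    }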

I/O notification

The ionotify() function is a means by which a client thread can request asynchronous event delivery. Many of the POSIX asynchronous services (e.g. mq_notify() and the client-side of select()) are built on top of it. When performing I/O on a file descriptor (fd), the thread may choose to wait for an I/O event to complete (for the write() case) or for data to arrive (for the read() case). Rather than have the client thread block on the resource manager process that's servicing the read/write request, ionotify() lets the client thread post an event to the resource manager, to be delivered back when the indicated I/O condition occurs. The thread can then continue executing and responding to event sources other than the single I/O request.

The select() call is implemented using I/O notification and allows a thread to block and wait for a mix of I/O events on multiple fd's while continuing to respond to other forms of IPC.

Here are the conditions upon which the requested event can be delivered:

  • _NOTIFY_COND_OUTPUT - there's room in the output buffer for more data.
  • _NOTIFY_COND_INPUT - resource-manager-defined amount of data is available to read.
  • _NOTIFY_OUT_OF_BAND - resource-manager-defined ``out of band'' data is available.
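
Arming one of these conditions might look like the following sketch; the _NOTIFY_ACTION_POLLARM constant follows later QNX naming and may differ in this version of the API, and the event would be a ``cookie'' as in the previous section:

/* Ask the resource manager to deliver *event when input arrives.
   _NOTIFY_ACTION_POLLARM also reports conditions that are already
   true, so data that arrived earlier isn't missed. Error checking
   is omitted. */
#include <unistd.h>
#include <sys/siginfo.h>

void arm_input( int fd, struct sigevent *event ) {
    if( ionotify( fd, _NOTIFY_ACTION_POLLARM, _NOTIFY_COND_INPUT,
                  event ) & _NOTIFY_COND_INPUT ) {
        /* data is already waiting - read it now */
        }
    }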

Signals

Neutrino supports the 32 standard POSIX signals (as in UNIX) as well as the POSIX realtime signals, both numbered from a kernel-implemented set of 64 signals with uniform functionality. The POSIX standard defines realtime signals as differing from UNIX-style signals in that they may contain four bytes of data and a byte code, and may be queued for delivery. In Neutrino, this functionality can be explicitly selected or deselected on a per-signal basis, allowing the converged implementation to remain compliant with the standard.

Incidentally, the UNIX-style signals can select POSIX realtime signal queuing, should the application desire it. Neutrino also extends the signal-delivery mechanisms of POSIX by allowing signals to be targeted at specific threads, rather than simply at the process containing the threads. Since signals are asynchronous events, they're also implemented with the event-delivery mechanisms within Neutrino.

Microkernel call    POSIX call                                    Description
SignalKill()        kill(), pthread_kill(), raise(), sigqueue()   Set a signal on a process group, process, or thread.
SignalReturn()      N/A                                           Return from a signal handler.
SignalAction()      sigaction()                                   Define action to take on receipt of a signal.
SignalProcmask()    sigprocmask()                                 Change signal blocked mask of a thread.
SignalSuspend()     sigsuspend(), pause()                         Block until a signal invokes a signal handler.
SignalWaitinfo()    sigwaitinfo()                                 Wait for signal and return info on it.

The original POSIX specification defined signal operation on processes only. In a multi-threaded process, the following rules are followed:

  • The signal actions are maintained at the process level. If a thread ignores or catches a signal, it affects all threads within the process.
  • The signal mask is maintained at the thread level. If a thread blocks a signal, it affects only that thread.
  • An un-ignored signal targeted at a thread will be delivered to that thread alone.
  • An un-ignored signal targeted at a process is delivered to the first thread that doesn't have the signal blocked. If all threads have the signal blocked, the signal will be queued on the process until any thread ignores or unblocks the signal. If ignored, the signal on the process will be removed. If unblocked, the signal will be moved from the process to the thread that unblocked it.

When a signal is targeted at a process with a large number of threads, the thread table must be scanned, looking for a thread with the signal unblocked. Standard practice for most multi-threaded processes is to mask the signal in all threads but one, which is dedicated to handling them. To increase the efficiency of process-signal delivery, the kernel will cache the last thread that accepted a signal and will always attempt to deliver the signal to it first.
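
In code, the standard pattern is plain POSIX (the choice of SIGUSR1 is arbitrary):

/* Standard practice: mask the signal everywhere, dedicate one
   thread to handling it. Signal masks are inherited, so blocking
   the signal before creating other threads covers the process. */
#include <pthread.h>
#include <signal.h>

static void *signal_thread( void *arg ) {
    sigset_t set;
    siginfo_t info;

    sigemptyset( &set );
    sigaddset( &set, SIGUSR1 );         /* arbitrary choice */
    for( ;; ) {
        sigwaitinfo( &set, &info );     /* only this thread takes it */
        /* ... handle the signal ... */
        }
    return NULL;
    }

int main( void ) {
    sigset_t set;
    pthread_t tid;

    sigemptyset( &set );
    sigaddset( &set, SIGUSR1 );
    pthread_sigmask( SIG_BLOCK, &set, NULL );   /* before any threads */
    pthread_create( &tid, NULL, signal_thread, NULL );
    /* ... rest of the application ... */
    pthread_join( tid, NULL );
    return 0;
    }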


fig: images/signal.gif

Signals delivered to a process are given to the first thread with an interest or queued on the process until a thread expresses an interest.

The POSIX standard includes the concept of queued realtime signals (first introduced in 1003.1b). QNX/Neutrino supports optional queuing of any signal, not just realtime signals. The queuing can be specified on a signal-by-signal basis within a process. Each signal can have an associated 8-bit code and a 32-bit value.

This is very similar to message pulses described earlier. The kernel takes advantage of this similarity and uses common code for managing both signals and pulses. The signal number is mapped to a pulse priority using _SIGMAX - signo. As a result, signals are delivered in priority order with lower signal numbers having higher priority. This conforms with the POSIX standard, which states that existing signals (which encompass the first 32) have priority over the new realtime signals.

Neutrino special signals

As mentioned earlier, Neutrino defines a total of 64 signals. Their range is as follows:
Signal range    Description
1 ... 32        32 POSIX 1003.1a signals (including traditional UNIX signals)
33 ... 56       24 POSIX 1003.1b realtime signals (SIGRTMIN to SIGRTMAX)
57 ... 64       8 special-purpose Neutrino signals (SIGSPECIALMIN to SIGSPECIALMAX)

The 8 special signals cannot be ignored or caught. An attempt to call the signal() or sigaction() functions or the SignalAction() kernel call to change them will fail with an error of EINVAL.

In addition, these signals are always blocked and have signal queuing enabled. An attempt to unblock these signals via the sigprocmask() function or SignalProcmask() kernel call will be quietly ignored.

A regular signal can be programmed to this behavior using the following standard signal calls. The special signals save the programmer from writing this code and protect the signal from accidental changes to this behavior.

sigset_t set;
struct sigaction action;
 
/* Block the signal so it's never delivered asynchronously */
sigemptyset(&set);
sigaddset(&set, signo);
sigprocmask(SIG_BLOCK, &set, NULL);
 
/* SA_SIGINFO enables queuing of the signal with its code and value */
sigemptyset(&action.sa_mask);
action.sa_handler = SIG_DFL;
action.sa_flags = SA_SIGINFO;
sigaction(signo, &action, NULL);

This configuration makes these signals suitable for synchronous notification using the sigwaitinfo() function or SignalWaitinfo() kernel call. The following code will block until the 8th special signal is received:

sigset_t set;
siginfo_t info;
 
sigemptyset(&set);
sigaddset(&set, SIGSPECIALMAX);     /* the 8th special signal */
sigwaitinfo(&set, &info);           /* blocks until it's delivered */
printf("Received signal %d with code %d and value %d\n",
            info.si_signo,
            info.si_code,
            info.si_value.sival_int);

Since the signals are always blocked, the program cannot be interrupted or killed if the special signal is delivered outside of the sigwaitinfo() function. Since signal queuing is always enabled, signals won't be lost - they'll be queued for the next sigwaitinfo() call.

These signals were designed to solve a common IPC requirement where a server wishes to notify a client that it has information available for the client. The server will use the MsgDeliverEvent() call to notify the client. There are two reasonable choices for the event within the notification: pulses or signals.

A pulse is the preferred method for a client that may also be a server to other clients. In this case, the client will have created a channel for receiving messages and can also receive the pulse.

This won't be true for most simple clients. In order to receive a pulse, a simple client would be forced to create a channel for this express purpose. A signal can be used in place of a pulse if the signal is configured to be synchronous (i.e. the signal is blocked) and queued - this is exactly how the special signals are configured. The client would replace the MsgReceivev() call used to wait for a pulse on a channel with a simple sigwaitinfo() call to wait for the signal.

This signal mechanism is used by Photon to wait for events and by the select() function to wait for I/O from multiple servers. Of the 8 special signals, the first two have been given special names for this use.

#define SIGSELECT	(SIGSPECIALMIN + 0)
#define SIGPHOTON	(SIGSPECIALMIN + 1)

Summary of signals

Signal      Description
SIGABRT     Abnormal termination signal such as issued by the abort() function.
SIGALRM     Timeout signal such as issued by the alarm() function.
SIGBUS      Indicates a memory parity error (QNX-specific interpretation). Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated.
SIGCHLD     Child process terminated. The default action is to ignore the signal.
SIGCONT     Continue if HELD. The default action is to ignore the signal if the process isn't HELD.
SIGEMT      EMT instruction (emulator trap).
SIGFPE      Erroneous arithmetic operation (integer or floating point), such as division by zero or an operation resulting in overflow. Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated.
SIGHUP      Death of session leader, or hangup detected on controlling terminal.
SIGILL      Detection of an invalid hardware instruction. Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated.
SIGINT      Interactive attention signal (Break).
SIGIOT      IOT instruction (not generated on x86 hardware).
SIGKILL     Termination signal - should be used only for emergency situations. This signal cannot be caught or ignored.
SIGPIPE     Attempt to write on a pipe with no readers.
SIGPOLL     Pollable event occurred.
SIGQUIT     Interactive termination signal.
SIGSEGV     Detection of an invalid memory reference. Note that if a second fault occurs while your process is in a signal handler for this fault, the process will be terminated.
SIGSTOP     HOLD process signal. The default action is to hold the process.
SIGSYS      Bad argument to a system call.
SIGTERM     Termination signal.
SIGTRAP     Unsupported software interrupt.
SIGTSTP     Not supported by QNX/Neutrino.
SIGTTIN     Not supported by QNX/Neutrino.
SIGTTOU     Not supported by QNX/Neutrino.
SIGURG      Urgent condition present on socket.
SIGUSR1     Reserved as application-defined signal 1.
SIGUSR2     Reserved as application-defined signal 2.
SIGWINCH    Window size changed.

POSIX message queues

POSIX defines a set of non-blocking message-passing facilities known as message queues. Like pipes, message queues are named objects that operate with "readers" and "writers." As a priority queue of discrete messages, a message queue has more structure than a pipe and offers applications more control over communications.

POSIX message queues are implemented in QNX/Neutrino via an optional resource manager (Mqueue). Unlike QNX/Neutrino's inherent message-passing primitives, the POSIX message queues reside outside the kernel. For information about resource managers, see Chapter 4 in this book.

Why use POSIX message queues?

POSIX message queues provide a familiar interface for many realtime programmers. They are similar to the "mailboxes" found in many realtime executives.

There's a fundamental difference between QNX messages and POSIX message queues. QNX messages block the sender and copy their data directly between the address spaces of the processes involved. POSIX message queues, on the other hand, implement a store-and-forward design in which the sender need not block and may have many outstanding messages queued. POSIX message queues exist independently of the processes that use them. You would likely use message queues in a design where a number of named queues will be operated on by a variety of processes over time.

For raw performance, POSIX message queues will be slower than QNX native messages for transferring data. However, the flexibility of queues may make this small performance penalty worth the cost.

File-like interface

Message queues resemble files, at least as far as their interface is concerned. You open a message queue with mq_open(), close it with mq_close(), and destroy it with mq_unlink(). And to put data into ("write") and take it out of ("read") a message queue, you use mq_send() and mq_receive().
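
A minimal round trip through this interface might look like the following sketch (the queue name, sizes, and priority are arbitrary choices):

/* Create a queue, write one message, read it back, and clean up. */
#include <mqueue.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main( void ) {
    struct mq_attr attr;
    char buf[128];
    unsigned prio;
    mqd_t mq;

    attr.mq_flags = 0;
    attr.mq_maxmsg = 8;                 /* at most 8 queued messages */
    attr.mq_msgsize = sizeof( buf );    /* of up to 128 bytes each */

    mq = mq_open( "/data", O_RDWR | O_CREAT, 0666, &attr );
    if( mq == (mqd_t)-1 )
        return 1;

    mq_send( mq, "hello", strlen( "hello" ) + 1, 10 );  /* priority 10 */
    mq_receive( mq, buf, sizeof( buf ), &prio );
    printf( "got \"%s\" at priority %u\n", buf, prio );

    mq_close( mq );
    mq_unlink( "/data" );       /* destroy the named queue */
    return 0;
    }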

For strict POSIX conformance, you should create message queues that start with a single slash (/) and contain no other slashes. But note that QNX/Neutrino extends the POSIX standard by supporting pathnames that may contain multiple slashes. This allows, for example, a company to place all its message queues under its company name and distribute a product with increased confidence that a queue name will not conflict with that of another company.

In QNX/Neutrino, all message queues created will appear in the filename space under the directory /dev/mqueue.

mq_open() name    Pathname of message queue
/data             /dev/mqueue/data
/acme/data        /dev/mqueue/acme/data
/qnx/data         /dev/mqueue/qnx/data

You can display all message queues in the system using the ls command as follows:

ls -Rl /dev/mqueue

The size printed will be the number of messages waiting.

Message queue functions

POSIX message queues are managed via the following functions:
Function        Description
mq_open()       Open a message queue.
mq_close()      Close a message queue.
mq_unlink()     Remove a message queue.
mq_send()       Add a message to the message queue.
mq_receive()    Receive a message from the message queue.
mq_notify()     Tell the calling process that a message is available on a message queue.
mq_setattr()    Set message queue attributes.
mq_getattr()    Get message queue attributes.

Shared memory

Shared memory offers the highest bandwidth IPC available. Once a shared memory object is created, processes with access to the object can use pointers to directly read and write into it. This means that access to shared memory is in itself unsynchronized. If a process is updating an area of shared memory, care must be taken to prevent another process from reading or updating the same area. Even in the simple case of a read, the other process may get information that is in flux and inconsistent.

To solve these problems, shared memory is often used in conjunction with one of the synchronization primitives to make updates atomic between processes. If the granularity of updates is small, then the synchronization primitives themselves will limit the inherently high bandwidth of using shared memory. Shared memory is therefore most efficient when used for updating large amounts of data as a block.

Both semaphores and mutexes are suitable synchronization primitives for use with shared memory. Semaphores were introduced with the POSIX realtime standard for interprocess synchronization. Mutexes were introduced with the POSIX threads standard for thread synchronization. Mutexes may also be used between threads in different processes. POSIX considers this an optional capability; Neutrino supports it. In general, mutexes are more efficient than semaphores.
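
For example, a mutex placed inside a shared memory region can be initialized for interprocess use with the process-shared attribute (a sketch; error checking omitted):

/* Initialize a mutex that lives inside a shared memory region so
   threads in different processes can lock it. Only the creating
   process runs this; the others simply use the mutex in place. */
#include <pthread.h>

int init_shared_mutex( pthread_mutex_t *m ) {   /* m points into shm */
    pthread_mutexattr_t attr;

    pthread_mutexattr_init( &attr );
    pthread_mutexattr_setpshared( &attr, PTHREAD_PROCESS_SHARED );
    return pthread_mutex_init( m, &attr );
    }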

Shared memory with message passing

Shared memory and message passing can be combined to provide IPC that offers:
  • very high performance (shared memory)
  • synchronization (message passing)
  • network transparency (message passing).

Using message passing, a client sends a request to a server and blocks. The server receives the messages in priority order from clients, processes them, and replies when it can satisfy a request. At this point, the client is unblocked and continues. The very act of sending messages provides natural synchronization between the client and the server. Rather than copy all the data through the message pass, the message can contain a reference to a shared memory region, so the server could read or write the data directly. This is best explained with a simple example.

Let's assume a graphics server accepts draw image requests from clients and renders them into a frame buffer on a graphics card. Using message passing alone, the client would send a message containing the image data to the server. This would result in a copy of the image data from the client's address space to the server's address space. The server would then render the image and issue a short reply.

If the client didn't send the image data inline with the message, but instead sent a reference to a shared memory region that contained the image data, then the server could access the client's data directly.

Since the client is blocked on the server as a result of sending it a message, the server knows that the data in shared memory is stable and will not change until the server replies. This combination of message passing and shared memory achieves natural synchronization and very high performance.
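
A sketch of the client side of such an exchange follows; the request layout is hypothetical, and the shared memory region itself would be created as described later in this chapter:

/* Hypothetical request: rather than the image bits themselves, the
   message carries the name of the shared memory object that holds
   them. The client stays reply-blocked, so the region is stable
   while the server reads it. */
#include <sys/neutrino.h>

typedef struct {
    char     shm_name[64];      /* e.g. "/imgbuf" */
    unsigned offset, size;      /* where the image lives in the region */
} draw_req_t;

int draw_image( int coid, const draw_req_t *req ) {
    int status;

    /* Blocks until the server has rendered the image and replied */
    return MsgSend( coid, req, sizeof( *req ), &status, sizeof( status ) );
    }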

This model of operation can also be reversed - the server can generate data and give it to a client. For example, suppose a client sends a message to a server that will read video data directly from a CD-ROM into a shared memory buffer provided by the client. The client will be blocked on the server while the shared memory is being changed. When the server replies and the client continues, the shared memory will be stable for the client to access. This type of design can be pipelined using more than one shared memory region.

Simple shared memory can't be used between processes on different computers connected via a network. Message passing, on the other hand, is network transparent. A server could use shared memory for local clients and full message passing of the data for remote clients. This allows you to provide a high-performance server that is also network transparent.

In practice, the message-passing primitives are more than fast enough for the majority of IPC needs. The added complexity of a combined approach need only be considered for special applications with very high bandwidth.

Creating a shared memory object

Multiple threads within a process share the memory of that process. To share memory between processes, you must first create a shared memory region and then map that region into your process's address space. Shared memory regions are created and manipulated using the following calls:
Function        Description
shm_open()      Open (or create) a shared memory region.
shm_close()     Close a shared memory region.
mmap()          Map a shared memory region into a process's address space.
munmap()        Unmap a shared memory region from a process's address space.
mprotect()      Change protections on a shared memory region.
shm_unlink()    Remove a shared memory region.

POSIX shared memory is implemented in QNX/Neutrino via the Process Manager (ProcNto). The above calls are implemented as messages to ProcNto. For information about the Process Manager, see Chapter 3 in this book.

The shm_open() function takes the same arguments as open() and returns a file descriptor to the object. As with a regular file, this function lets you create a new shared memory object or open an existing shared memory object.

When a new shared memory object is created, the size of the object is set to zero. To set the size, you use the ftruncate() function. Note that this is the very same function used to set the size of a file.
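
Creating and sizing a new object therefore takes two calls (a sketch; the error handling is illustrative):

/* Create a shared memory object and give it a size; new objects
   start at size zero. */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int create_shm( const char *name, size_t len ) {
    int fd = shm_open( name, O_RDWR | O_CREAT, 0666 );

    if( fd != -1 )
        ftruncate( fd, len );
    return fd;
    }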

mmap()

Once you have a file descriptor to a shared memory object, you use the mmap() function to map the object, or part of it, into your process's address space. The mmap() function is the cornerstone of memory management within Neutrino and deserves a detailed discussion of its capabilities.

The mmap() function is defined as follows:

void * mmap(void *where_i_want_it, size_t length, int memory_protections,
            int mapping_flags, int fd, off_t offset_within_shared_memory);

In simple terms this says: "Map in length bytes of shared memory at offset_within_shared_memory in the shared memory object associated with fd."

The mmap() function will try to place the memory at the address where_i_want_it in your address space. The memory will be given the protections specified by memory_protections and the mapping will be done according to the mapping_flags.

The three arguments fd, offset_within_shared_memory, and length define a portion of a particular shared object to be mapped in. It's common to map in an entire shared object, in which case the offset will be zero and the length will be the size of the shared object in bytes. On an Intel processor, the mapped length will be a multiple of the page size, which is 4096 bytes.


fig: images/mmap.gif

How arguments to the mmap() function refer to the mapped region.

The return value of mmap() will be the address in your process's address space where the object was mapped. The argument where_i_want_it is used as a hint by the system to where you want the object placed. If possible, the object will be placed at the address requested. Most applications specify an address of zero, which gives the system free rein to place the object where it wishes.

The following protection types may be specified for memory_protections:

Manifest        Description
PROT_NONE       No access allowed.
PROT_READ       Memory may be read.
PROT_WRITE      Memory may be written.
PROT_EXEC       Memory may be executed.
PROT_NOCACHE    Memory should not be cached.

The PROT_NOCACHE manifest should be used when a shared memory region is used to gain access to dual-ported memory that may be modified by hardware (e.g. a video frame buffer or a memory-mapped network or communications board). Without this manifest, the processor may return "stale" data from a previously cached read.

The mapping_flags determine how the memory is mapped and are broken down into two parts. The first part is a type and must be specified as one of the following:

Map type       Description
MAP_SHARED     The mapping is shared by the calling processes.
MAP_PRIVATE    The mapping is private to the calling process. It allocates system RAM and makes a copy of the object.
MAP_ANON       Similar to MAP_PRIVATE, but the fd parameter isn't used (should be set to NOFD), and the allocated memory is zero-filled.

The MAP_SHARED type is the one to use for setting up shared memory between processes. The other types have more specialized uses. For example, MAP_ANON can be used as the basis for a page-level memory allocator.

A number of flags may be ORed into the above type to further define the mapping. These are described in detail in the mmap() library reference. A few of the more interesting flags are:

Map type modifier    Description
MAP_FIXED            Map the object to the address specified by where_i_want_it. If a shared memory region contains pointers within it, you may need to force the region to the same address in all processes that map it. This can be avoided by using offsets within the region in place of direct pointers.
MAP_PHYS             This flag indicates that you wish to deal with physical memory. The fd parameter should be set to NOFD. When used with MAP_SHARED, the offset_within_shared_memory specifies the exact physical address to map (e.g. for video frame buffers). If used with MAP_ANON, then physically contiguous memory is allocated (e.g. for a DMA buffer). MAP_NOX64K and MAP_BELOW16M are used to further define the MAP_ANON allocated memory and address limitations present in some forms of DMA.
MAP_NOX64K           Used with MAP_PHYS | MAP_ANON. The allocated memory area will not cross a 64K boundary. This is required for the old 16-bit PC DMA.
MAP_BELOW16M         Used with MAP_PHYS | MAP_ANON. The allocated memory area will reside in physical memory below 16M. This is necessary when using DMA with ISA bus devices.

Using the mapping flags described above, processes can easily share memory:

/* Map in a shared memory region */
fd = shm_open("/datapoints", O_RDWR, 0);
addr = mmap(0, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Or share memory with hardware such as video memory:

/* Map in VGA display memory */
addr = mmap(0, 65536, PROT_READ|PROT_WRITE, MAP_PHYS|MAP_SHARED, NOFD, 0xa0000);

Or allocate a DMA buffer for a bus-mastering PCI network card:

/* Allocate a physically contiguous buffer */
addr = mmap(0, 262144, PROT_READ|PROT_WRITE|PROT_NOCACHE, MAP_PHYS|MAP_ANON, NOFD, 0);

You can unmap all or part of a shared memory object from your address space using munmap(). This primitive isn't restricted to unmapping shared memory - it can be used to unmap any region of memory within your process. When used in conjunction with the MAP_ANON flag to mmap(), you can easily implement a private page-level allocator/deallocator.
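
A sketch of such an allocator pair (NOFD is the Neutrino convention noted above; a portable version would use a different anonymous-mapping spelling):

/* Page-level allocate/free built on anonymous private mappings. */
#include <sys/mman.h>
#include <stddef.h>

void *page_alloc( size_t len ) {
    void *p = mmap( 0, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANON, NOFD, 0 );

    return ( p == MAP_FAILED ) ? NULL : p;
    }

void page_free( void *p, size_t len ) {
    munmap( p, len );
    }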

You can change the protections on a mapped region of memory using mprotect(). Like munmap(), mprotect() isn't restricted to shared memory regions - it can change the protection on any region of memory within your process.

Clock and timer services

Clock services are used to maintain the time of day, which is in turn used by the kernel timer calls to implement interval timers.

The ClockCycles() function is implemented upon a 64-bit, free-running, high-precision counter. On an Intel Pentium processor, this is implemented directly with the RDTSC instruction. For processors that don't support this opcode, an instruction fault is generated - the kernel catches and emulates this using the counter timer chip.

The ClockPeriod() function allows a thread to set the system timer to some multiple of nanoseconds; the OS kernel will do the best it can to satisfy the precision of the request with the hardware available to it. On a PC-architecture machine, the precision of this timer setting can be as fine as 838 nanoseconds.

The interval selected is always rounded down to an integral multiple of the precision of the underlying hardware timer. Of course, setting it to an extremely low value can result in a significant portion of CPU performance being consumed servicing timer interrupts.

The ClockTick() call is provided as an entry point to be used by an external timer interrupt handler. If the system has custom timer hardware, a thread external to the kernel can use this call to explicitly indicate the advance of time to the kernel.

Microkernel call    POSIX call                          Description
ClockTime()         clock_gettime(), clock_settime()    Get or set the time of day.
ClockAdjust()       N/A                                 Apply small time adjustments to synchronize clocks.
ClockCycles()       N/A                                 Read a 64-bit free-running high-precision counter.
ClockPeriod()       clock_getres()                      Get or set the period of the clock.
ClockTick()         N/A                                 Simulate a clock interrupt from an external timer interrupt handler.

Time correction

In order to facilitate applying time corrections without having the system experience abrupt ``steps'' in time (or even having time jump backwards), the ClockAdjust() call provides the option to specify an interval over which the time correction is to be applied. This has the effect of speeding or retarding time over a specified interval until the system has synchronized to the indicated current time. This service can be used to implement network-coordinated time averaging between multiple nodes on a network.

Timers

Neutrino directly provides the full set of POSIX timer functionality. Since these timers are quick to create and manipulate, they're an inexpensive resource in the kernel.

The POSIX timer model is quite rich, providing the ability to have the timer expire on:

  • an absolute date
  • a relative date (i.e. n nanoseconds from now)
  • a cyclical basis (i.e. every n nanoseconds).

The cyclical mode is very significant, because the most common use of timers tends to be as a periodic source of events to ``kick'' a thread into life to do some processing and go back to sleep until the next event. If the thread had to re-program the timer for every event, there would be the danger that time would slip unless the thread was programming an absolute date. Worse, if the thread doesn't get to run on the timer event because a higher-priority thread is running, the date next programmed into the timer could be one that has already elapsed!

The cyclical mode circumvents these problems by requiring that the thread set the timer once and then simply respond to the resulting periodic source of events.
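
Setting up such a periodic timer is a one-time operation, as in this sketch (the sigevent setup is elided, and the 10 ms period is arbitrary):

/* Arm a 10 ms cyclical timer once; the kernel then delivers the
   given event every period - no re-programming, no drift. */
#include <signal.h>
#include <time.h>

void start_periodic( struct sigevent *event ) {
    struct itimerspec t;
    timer_t id;

    timer_create( CLOCK_REALTIME, event, &id );

    t.it_value.tv_sec = 0;              /* first expiry */
    t.it_value.tv_nsec = 10000000;      /* 10 ms from now */
    t.it_interval = t.it_value;         /* then every 10 ms */
    timer_settime( id, 0, &t, NULL );
    }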

Since timers are another source of events in QNX/Neutrino, they also make use of its event-delivery system. As a result, the application can request that any of the Neutrino-supported events be delivered to the application upon occurrence of a timeout.

An often-needed timeout service provided by Neutrino is the ability to specify the maximum time the application is prepared to wait for any given kernel call or request to complete. A problem with using generic OS timer services in a preemptive realtime OS is that in the interval between the specification of the timeout and the request for the service, a higher-priority process might have been scheduled to run and preempted long enough that the specified timeout will have expired before the service is even requested. The application will then end up requesting the service with an already lapsed timeout in effect (i.e. no timeout). This timing window can result in ``hung'' processes, inexplicable delays in data transmission protocols, and other problems.

alarm(...);
   :
   :
blocking_call();

Neutrino's solution is a form of timeout request atomic to the service request itself. One approach might have been to provide an optional timeout parameter on every available service request, but this would overly complicate service requests with a passed parameter that would often go unused.

Neutrino provides a TimerTimeout() kernel call that allows an application to specify a list of blocking states for which to start a specified timeout. Later, when the application makes a request of the kernel, the kernel will atomically enable the previously configured timeout if the application is about to block on one of the specified states.

Since Neutrino has a very small number of blocking states, this mechanism works very concisely. At the conclusion of either the service request or the timeout, the timer will be disabled and control will be given back to the application.

TimerTimeout(...);
   :
   :
   :
blocking_call();
   :
Microkernel call     POSIX call            Description
TimerAlarm()         alarm()               Set a process alarm.
TimerCreate()        timer_create()        Create an interval timer.
TimerDestroy()       timer_delete()        Destroy an interval timer.
TimerGettime()       timer_gettime()       Get time remaining on an interval timer.
TimerGetoverrun()    timer_getoverrun()    Get number of overruns on an interval timer.
TimerSettime()       timer_settime()       Start an interval timer.
TimerTimeout()       sleep(), nanosleep(), sigtimedwait(), pthread_cond_timedwait(), pthread_mutex_trylock(), intr_timed_wait()    Arm a kernel timeout for any blocking state.

Interrupt handling

No matter how much we wish it were so, computers are not infinitely fast. In a realtime system, it's absolutely crucial that CPU cycles aren't unnecessarily spent. It's also crucial to minimize the time from the occurrence of an external event to the actual execution of code within the thread responsible for reacting to that event. This time is referred to as latency.

The two forms of latency that most concern us are interrupt latency and scheduling latency.

Interrupt latency

Interrupt latency is the time from the assertion of a hardware interrupt until the first instruction of the device driver's interrupt handler is executed. QNX leaves interrupts fully enabled almost all the time, so that interrupt latency is typically insignificant. But certain critical sections of code do require that interrupts be temporarily disabled. The maximum such disable time usually defines the worst-case interrupt latency - in QNX this is very small.

The following diagrams illustrate the case where a hardware interrupt is processed by an established interrupt handler. The interrupt handler either will simply return, or it will return and cause an event to be delivered.


fig: images/intlat.gif

Interrupt handler simply terminates.

The interrupt latency (Til) in the above diagram represents the minimum latency - that which occurs when interrupts were fully enabled at the time the interrupt occurred. Worst-case interrupt latency will be this time plus the longest time in which QNX, or the running QNX process, disables CPU interrupts.

Til on various CPUs

The following table shows typical interrupt-latency times (Til) for a range of processors:
Interrupt latency (Til)    Processor
1.38 microsec              200 MHz Pentium
1.84 microsec              100 MHz Pentium
7.54 microsec              33 MHz 486
14.25 microsec             33 MHz 386EX

Scheduling latency

In some cases, the low-level hardware interrupt handler must schedule a higher-level thread to run. In this scenario, the interrupt handler will return and indicate that an event is to be delivered. This introduces a second form of latency - scheduling latency - which must be accounted for.

Scheduling latency is the time between the last instruction of the user's interrupt handler and the execution of the first instruction of a driver thread. This usually means the time it takes to save the context of the currently executing thread and restore the context of the required driver thread. Although larger than interrupt latency, this time is also kept small in a QNX system.


fig: images/schedlat.gif

Interrupt handler terminates, returning an event.

It's important to note that most interrupts terminate without delivering an event. In a large number of cases, the interrupt handler can take care of all hardware-related issues. Delivering an event to wake up a higher-level driver thread occurs only when a significant event occurs. For example, the interrupt handler for a serial device driver would feed one byte of data to the hardware upon each transmit interrupt, and would trigger the higher-level thread within the serial driver (Devc.ser) only when the output buffer is nearly empty.

Tsl on various CPUs

This table shows typical scheduling-latency times (Tsl) for a range of processors:
Scheduling latency (Tsl)    Processor
2.93 microsec               200 MHz Pentium
4.73 microsec               100 MHz Pentium
12.57 microsec              33 MHz 486
38.55 microsec              33 MHz 386EX

Nested interrupts

Since microcomputer architectures allow hardware interrupts to be given priorities, higher-priority interrupts can preempt a lower-priority interrupt.

This mechanism is fully supported by Neutrino. The previous scenarios describe the simplest - and most common - situation, where only one interrupt occurs. This is usually the case for the highest-priority interrupt. Worst-case timing considerations for lower-priority interrupts must take into account the time for all higher-priority interrupts to be processed, because a higher-priority interrupt will preempt a lower-priority interrupt.


fig: images/stackint.gif

Thread A is running. Interrupt IRQx causes interrupt handler Intx to run, which is preempted by IRQy and its handler Inty. Inty returns an event causing Thread B to run; Intx returns an event causing Thread C to run.

Interrupt calls

Neutrino implements an interrupt-handling API closely modeled after the POSIX realtime extensions (draft status at time of printing).

Microkernel call      POSIX call           Description
InterruptAttach()     intr_capture()       Attach a local function to an interrupt vector.
InterruptDetach()     intr_release()       Detach an interrupt handler.
InterruptWait()       intr_timed_wait()    Wait for an interrupt.
InterruptDisable()    N/A                  Disable hardware interrupts.
InterruptEnable()     N/A                  Enable hardware interrupts.
InterruptMask()       intr_lock()          Mask a hardware interrupt.
InterruptUnmask()     intr_unlock()        Unmask a hardware interrupt.

Using this API, a suitably privileged user-level thread can call InterruptAttach(), passing a hardware interrupt number and the address of a function in the thread's address space to be called when the interrupt occurs. Neutrino allows multiple ISRs (Interrupt Service Routine) to be attached to each hardware interrupt number - higher-priority interrupts can be serviced during the execution of lower-priority interrupt handlers.

The following code sample shows how to attach an ISR to the hardware timer interrupt on the PC (which Neutrino also uses for the system clock). Since the kernel's timer ISR is already dealing with clearing the source of the interrupt, this ISR can simply increment a counter variable in the thread's data space and return to the kernel:

#include <stdio.h>
#include <stdlib.h>
#include <sys/neutrino.h>
 
struct sigevent event;
volatile unsigned counter;
 
struct sigevent *handler( void *area ) {
    // Return the event on every 100th interrupt
    if ( ++counter == 100 ) {
        counter = 0;
        return( &event );
        }
    else
        return( NULL );
    }
 
int main( void ) {
    int i;
 
    // Initialize event structure: unblock our InterruptWait()
    event.sigev_notify = SIGEV_INTR;
 
    // Attach ISR to the timer interrupt vector
    InterruptAttach( _NTO_INTR_FIRST, &handler, NULL, 0, 0 );
 
    for( i = 0; i < 10; ++i ) {
        // Wait for the ISR to return the event
        InterruptWait( 0, NULL );
        printf( "100 events\n" );
        }
 
    // Disconnect the ISR handler
    InterruptDetach( _NTO_INTR_FIRST, &handler );
    exit( 0 );
    }

With this approach, appropriately privileged user-level threads can dynamically attach (and detach) interrupt handlers to (and from) hardware interrupt vectors at run time. These threads can be debugged using regular source-level debug tools; the ISR itself can be debugged by calling it at the thread level and source-level stepping through it, or by using the kernel debugger to single-step the ISR as invoked by the hardware interrupt.

When the hardware interrupt occurs, the processor will enter the interrupt redirector in the microkernel. This code pushes the registers for the context of the currently running thread into the appropriate thread table entry and sets the processor context such that the ISR has access to the code and data that are part of the thread the ISR is contained within. This allows the ISR to use the buffers and code in the user-level thread to resolve the interrupt and, if higher-level work by the thread is required, to queue an event to the thread the ISR is part of, which can then work on the data the ISR has placed into thread-owned buffers.

Since it runs with the memory-mapping of the thread containing it, the ISR can directly manipulate devices mapped into the thread's address space, or directly perform I/O instructions. As a result, device drivers that manipulate hardware don't need to be linked into the kernel.

The interrupt redirector code in the microkernel will call each ISR attached to that hardware interrupt. If the value returned indicates that a process is to be passed an event of some sort, the kernel will queue the event. When the last ISR has been called for that vector, the kernel interrupt handler will finish manipulating the interrupt control hardware (the i8259 on a PC) and then ``return from interrupt.''

This interrupt return won't necessarily be into the context of the thread that was interrupted. If the queued event caused a higher-priority thread to become READY, the microkernel will then interrupt-return into the context of the now-READY thread instead.

This approach provides a well-bounded interval from the occurrence of the interrupt to the execution of the first instruction of the user-level ISR (measured as interrupt latency), and from the last instruction of the ISR to the first instruction of the thread readied by the ISR (measured as thread or process scheduling latency).

The worst-case interrupt latency is well-bounded, because Neutrino disables interrupts only for a couple of opcodes in a few critical regions. Those intervals when interrupts are disabled have deterministic runtimes, because they're not data dependent.

The microkernel's interrupt redirector executes only a few instructions before calling the user's ISR. Since the microkernel's call interface is implemented via software interrupts (which work exactly like hardware interrupts), kernel call processing works essentially the same as interrupt processing. As a result, process preemption for hardware interrupts or kernel calls is equally quick and exercises essentially the same code path.

While the ISR is executing, it has full hardware access (since it's part of a privileged thread), but can't issue other kernel calls. The ISR is intended to respond to the hardware interrupt in as few microseconds as possible, do the minimum amount of work to satisfy the interrupt (read the byte from the UART, etc.), and if necessary, cause a thread to be scheduled at some user-specified priority to do further work.

Worst-case interrupt latency is directly computable for a given hardware priority from the kernel-imposed interrupt latency and the maximum ISR runtime for each interrupt higher in hardware priority than the ISR in question. Since hardware interrupt priorities can be reassigned, the most important interrupt in the system can be made the highest priority. Also, ISRs can be written to do no work, always readying the user-level thread to do work. This allows the priority of hardware-interrupt-generated work to be performed at OS-scheduled priorities rather than hardware-defined priorities. Since the interrupt source won't re-interrupt until serviced, the effect of interrupts on the runtime of critical code regions for hard-deadline scheduling can be controlled.

In addition to hardware interrupts, various ``events'' within the microkernel can also be ``hooked'' by user processes and threads. When one of these events occurs, the kernel can upcall into the indicated function in the user thread to perform some specific processing for this event. For example, the processor's non-maskable interrupt (NMI) is available for system watchdog threads and similar applications. Also, whenever the idle thread in the system is called, a user thread can have the kernel upcall into the thread so that hardware-specific low-power modes can be readily implemented.

Upcall             Description
_NTO_INTR_NMI      Watchdog timer hardware is often configured to generate NMIs (non-maskable interrupts) whenever the timeout expires. This event would be used by the thread that would deal with these watchdog events.
_NTO_INTR_TRACE    Neutrino can be configured to generate trace events representing significant occurrences within the kernel (hardware interrupts, context switches, etc.). Trace events generated by explicit trace calls inserted into applications also end up moving the trace data out through this interface. A thread prepared to log these events for diagnostic purposes would attach to this upcall in order to extract the events.
_NTO_INTR_IDLE     When the kernel has no active thread to schedule, it will run the idle thread, which can upcall to a user handler. This handler can perform hardware-specific power-management operations.