Patterns in register map design

If you've ever had to write a program which interfaces directly with hardware — perhaps firmware for an MCU or embedded system, or a kernel driver — you may have noticed a few common patterns in register map behaviour and design. I'm not sure anyone has ever really collected them together, so I decided to make a list of all the ones I can think of.

Register behaviours

A plain 8m-bit register. It's just a register. It's 8, 16, 32 or 64 bits, read-write. No tricks here. Either it:

  • contains a single data value, or
  • contains a range of subfields, of potentially varied sizes in bits.
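
Subfields are conventionally manipulated with masks and shifts. The sketch below shows one common idiom; the register layout (a baud divisor and parity field) and all names are hypothetical, purely for illustration.

```c
#include <stdint.h>

/* Hypothetical 32-bit control register with two subfields:
 *   bits 7:0 - BAUD_DIV (clock divisor)
 *   bits 9:8 - PARITY   (0 = none, 1 = odd, 2 = even)
 * Remaining bits reserved. */
#define BAUD_DIV_SHIFT 0
#define BAUD_DIV_MASK  (0xFFu << BAUD_DIV_SHIFT)
#define PARITY_SHIFT   8
#define PARITY_MASK    (0x3u << PARITY_SHIFT)

/* Extract a subfield from a register value. */
static inline uint32_t field_get(uint32_t reg, uint32_t mask, unsigned shift)
{
    return (reg & mask) >> shift;
}

/* Return the register value with one subfield replaced. */
static inline uint32_t field_set(uint32_t reg, uint32_t mask, unsigned shift,
                                 uint32_t value)
{
    return (reg & ~mask) | ((value << shift) & mask);
}
```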

A high-truncated 8m-bit register. Since CPUs only support 8m-bit loads and stores, but some registers may naturally have a different size, only the low n bits are actually wired up and “work”. The other bits always read as 0, or always as 1.

A low-truncated 8m-bit register. A similar construct, but the low n bits always read as 0 (or 1). Often used for registers which take memory addresses but have an alignment requirement.

A register that doesn't do anything (scratch register). A register which can be read and written normally, but which doesn't have any actual effect on the hardware. It can often be used to store data temporarily through a CPU reset, or as a primitive form of communication between multiple processors, as a mailbox. Scratch registers are also commonly found in RISC ISAs as temporary storage used by interrupt handler prologues and epilogues, which would otherwise be unable to save all userspace register values without clobbering at least one of them.

Read-only register. Self-explanatory — a register to which writes are ignored or result in an exception. The value may change independently of the host taking any action, however.

Read-only constant register. A read-only register which never changes. This might be used to indicate a chip version, for example. Another example is the “always zero” general-purpose register (GPR) found in many RISC ISAs.

Hybrid read-only/read-write register. Some registers may combine read-only and read-write fields. Writes are permitted but only affect the read-write bits.

Write-only register. A register for which only write semantics are useful. Attempting to read the register might fail, or might return all-ones or all-zeroes.

Partial write compatibility. A register of n bits might be designed to be compatible with CPUs which can only generate n/2 bit loads or stores (or smaller), allowing it to be written in two halves. (See Doorbell register below for a more concrete example.)

Vectors of registers. An array of registers of similar function; for example, a slave supporting 32 different kinds of interrupts might map a separate control register for each one in a linear array.

When each addressable “function” in the vector has multiple registers associated with it, this can be implemented either with multiple vectors of registers coming one after the other (for example, interrupt control registers at 0x4000 and interrupt status registers at 0x4400), or by interleaving (for example interrupt 0's control register at 0x4000, then interrupt 0's status register at 0x4004, then interrupt 1's control register at 0x4008, etc.)

The registers of a given type might be packed, meaning that one comes immediately after another (also implying they are non-interleaved), or have some kind of stride expressed in bytes between the addresses of consecutive registers of a given type. The latter is a useful approach because it can also describe the address of any interleaved register in a vector systematically without having to know or care about the other registers.

Occasionally, some slaves might actually have a read-only register describing the stride in bytes between units of an interleaved vector of registers. This has the advantage that the layout can change in later device versions, for example by adding more registers, while retaining compatibility with older devices, which are expected to use the stride value to compute register offsets.
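The stride-based addressing described above reduces to one formula. This sketch uses the hypothetical interrupt example from earlier (control registers at 0x4000, status registers interleaved at +4, stride 8); only the formula itself is the pattern.

```c
#include <stdint.h>

/* Address of register `offset` for unit `index` in an interleaved
 * vector of registers, given the per-unit stride in bytes (which may
 * itself have been read from a hardware stride register). */
static inline uint64_t vec_reg_addr(uint64_t base, uint32_t stride,
                                    uint32_t index, uint32_t offset)
{
    return base + (uint64_t)index * stride + offset;
}
```

Because the computation only needs the base, stride and per-register offset, it keeps working even if a later device revision grows the stride to add more registers per unit.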

Doorbell register. A doorbell register accepts writes. It doesn't necessarily have any meaningful read semantics, so it might (or might not) be a write-only register. What makes a doorbell register special is that writing to it sets off some asynchronous event in the background. Doorbell registers are frequently used on PCI/PCIe devices by a host to inform a device that there is new work queued and available for processing.

Variants include:

  • Simple interrupt or event trigger. A doorbell register where the value written is ignored, and all doorbell writes simply trigger some kind of interrupt or event.

  • Doorbell register with argument. A doorbell register where the value written is significant and is processed as part of the event raised.

    Example. Examples can be found in almost any modern PCIe device register map; NVMe's queue doorbell registers are a good place to look.

  • Doorbell register with special address bits. The observation that a register write is essentially a command comprising the tuple (address, data) can lead to doorbell registers which also encode part of their command into some of the address bits. One reason to do this is if the data field is already fully utilised. In this case, a region of the address space must be mapped to the register.

    Ordinarily one might consider these to be different registers — like the vectors of registers described above. In many cases, however, it makes no sense to consider them different registers, as the address bits are very clearly just being used to encode part of the command.

    Example. The x86 platform offers an example of such a register. Originally, PCI devices signalled interrupts using a physical interrupt pin wired from the PCI slot to the host's interrupt controller, which was level-triggered. This mechanism has long been deprecated in favour of message-signalled interrupts (MSIs). In actuality, a PCI/PCIe MSI is simply a 32-bit write sent by a device to the host; in other words, modern PCIe devices actually signal interrupts via DMA. The host can configure the physical address to which such writes should be directed, and the value which should be set in the low 16 bits of the write for a given interrupt (which is of course used to specify an interrupt number). (The high 16 bits of an MSI-triggered write are set to zero.)

    This wouldn't be very useful if a host didn't have a way to turn a DMA write into a CPU interrupt; on x86 platforms, this functionality is defined as part of the APIC interrupt controller (Intel SDM Vol. 3A § 10.11). Interestingly, the area of the physical address space set aside for MSI triggers is architecturally fixed at 0xFEExxxxx rather than varying by platform. Depending on the exact address written to and the low 16 bits of the value written, an interrupt can be directed to different interrupt controllers, and different interrupt vectors can be triggered, either in level or edge triggered mode, and with various different priorities. Thus, both the low address bits and the value written form part of the interrupt “command”.

  • Doorbell register with 32-bit compatibility. Some device specifications define 64-bit doorbell registers but specify that, to accommodate hosts which cannot generate native 64-bit accesses, a doorbell may also be written as two 32-bit writes. This creates the question of which of those writes triggers the “event”:

    • One option is to define either the low half or the high half of the register as actually triggering the “event”, with the other half acting as a normal register storing the last written value until its sibling is also written. With this design, the two halves must be written in a certain order.

    • Another option which is mandated by at least one PCI device specification is a statement that the two halves of the doorbell may be written in either order. This seems suboptimal as it raises ambiguities as to when the doorbell event is triggered. The implication is that writing each half sets an internal flag bit for that half, and once both halves have their flag bit set, the doorbell event is triggered and both flag bits cleared. The trouble with this is that it raises the question of what happens if one half of the register is written and then the host suddenly changes its mind and goes and does something else — for how long will the internal flag bit (which cannot be observed directly) persist? There is the potential for a problem if some other code then tries to also trigger the doorbell with the opposite order of writes after something like this occurs. (However, this design is safe so long as all code on the host always accesses the two halves in the same order; in this case if the host was previously interrupted halfway through a doorbell write, the first of the two writes will simply set the flag bit which is already set.)
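
The implied flag-bit behaviour of the second, “either order” option can be modelled in a few lines. This is a simulation for illustration only — on a real device this logic lives in hardware, and the struct and names here are invented.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simulated "either order" 64-bit doorbell: each 32-bit half-write sets
 * an internal flag; once both flags are set, the doorbell event fires
 * with the combined 64-bit argument, and both flags clear. */
struct doorbell64 {
    uint32_t lo, hi;
    bool lo_written, hi_written;
    unsigned events;       /* number of doorbell events triggered */
    uint64_t last_value;   /* argument of the most recent event */
};

static void doorbell_check(struct doorbell64 *db)
{
    if (db->lo_written && db->hi_written) {
        db->last_value = ((uint64_t)db->hi << 32) | db->lo;
        db->events++;
        db->lo_written = db->hi_written = false;
    }
}

static void doorbell_write_lo(struct doorbell64 *db, uint32_t v)
{
    db->lo = v;
    db->lo_written = true;
    doorbell_check(db);
}

static void doorbell_write_hi(struct doorbell64 *db, uint32_t v)
{
    db->hi = v;
    db->hi_written = true;
    doorbell_check(db);
}
```

Stepping through this model makes the hazard above concrete: a lone half-write leaves a flag set invisibly, which a later writer using the opposite order would trip over.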

Indirect access registers. From time to time, it is desirable to provide access to a register space but not practical or economical to provide direct access to that register space in a host's physical address space. In this case an indirect access scheme is common. A typical scheme involves a pair of registers, one named the address register, one named the data register.

To write to an indirect register, the host writes the desired register address to the address register, and then writes the data to be written to the data register. The register write is then carried out; the register write is triggered by the write to the data register, and uses whatever address was last written to the address register. In the same way, reads are performed by writing to the address register and then reading from the data register, which triggers a read of the specified register. If the hardware can support loads and stores of different sizes, usually the size of the load or store performed dictates the size of the register access performed. Occasionally instead the size of the load or store to be performed will be indicated using some extra control bits, which might be set using another register, for example.
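
The scheme can be sketched as follows. This models the register space as an in-memory array; in a real driver the address and data registers would be volatile pointers into a device BAR (or I/O port accesses), and all names here are hypothetical.

```c
#include <stdint.h>

#define NREGS 64

/* Simulated device with an address/data indirect access window. */
struct indirect_dev {
    uint32_t regs[NREGS];   /* the register space behind the window */
    uint32_t addr_reg;      /* last value written to the address register */
};

/* Write to the address register: just latches the target address. */
static void ind_set_addr(struct indirect_dev *d, uint32_t addr)
{
    d->addr_reg = addr;
}

/* Write to the data register: triggers the indirect register write. */
static void ind_write_data(struct indirect_dev *d, uint32_t value)
{
    d->regs[d->addr_reg % NREGS] = value;
}

/* Read from the data register: triggers the indirect register read. */
static uint32_t ind_read_data(struct indirect_dev *d)
{
    return d->regs[d->addr_reg % NREGS];
}

/* Host-side helpers: each indirect access costs two primitive accesses. */
static void ind_write(struct indirect_dev *d, uint32_t addr, uint32_t value)
{
    ind_set_addr(d, addr);
    ind_write_data(d, value);
}

static uint32_t ind_read(struct indirect_dev *d, uint32_t addr)
{
    ind_set_addr(d, addr);
    return ind_read_data(d);
}
```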

An example of an indirect access register scheme can be found in the NVMe specification, which provides such a scheme to allow the usage of just two registers in the x86 I/O port address space to access the much larger NVMe register map. This mainly exists for compatibility during system boot if the full address space is not yet available. Another example of an indirect access scheme is found in the next item of discussion:

Start/wait/done. While more sophisticated systems to trigger, and detect completion of, asynchronous work by hardware will involve interrupts, in some cases a simpler system is called for. In this model you write a register to trigger some operation (possibly with a 'start' bit set to 1), and must then spin-poll it until a 'wait' bit found in the register is cleared (or equivalently, until a 'done' bit is asserted.) There may also be bits to indicate completion with an error condition.

An example of such a register is the MII Communication Register (0x44C) on a BCM5719 PCIe NIC, a device I am altogether much too familiar with. This register is used to perform MDIO register access transactions against an Ethernet PHY. (As such, it's also an example of a register which provides indirect access to another address space.)1

bit     29: Start/Busy
bit     28: Read Failed
bits 27:26: Command
              0b01: Write
              0b10: Read
bits 25:21: PHY Address
bits 20:16: Register Address
bits  15:0: Transaction Data

To initiate an MDIO read, you do a single write with Start/Busy=1, Command=Read, and PHY Address and Register Address set correctly, then poll the same register until Start/Busy=0. At this point the register either reads with Read Failed=1, or the result of the read is in the Transaction Data field. Writes work similarly, in which case you specify the data in the Transaction Data field.
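
The command word construction and field tests can be sketched as pure helpers; the bit positions follow the layout above, while the helper names are invented, and a real driver would wrap these around volatile MMIO accesses and a poll loop with a timeout.

```c
#include <stdint.h>
#include <stdbool.h>

/* Field encodings for the MII Communication Register layout above. */
#define MII_START_BUSY  (1u << 29)
#define MII_READ_FAILED (1u << 28)
#define MII_CMD_WRITE   (1u << 26)   /* Command = 0b01 */
#define MII_CMD_READ    (2u << 26)   /* Command = 0b10 */

/* Build the word written to start an MDIO read. */
static inline uint32_t mii_read_cmd(uint32_t phy, uint32_t reg)
{
    return MII_START_BUSY | MII_CMD_READ |
           ((phy & 0x1Fu) << 21) | ((reg & 0x1Fu) << 16);
}

/* Build the word written to start an MDIO write. */
static inline uint32_t mii_write_cmd(uint32_t phy, uint32_t reg, uint16_t data)
{
    return MII_START_BUSY | MII_CMD_WRITE |
           ((phy & 0x1Fu) << 21) | ((reg & 0x1Fu) << 16) | data;
}

/* Tests applied to the polled register value. */
static inline bool mii_busy(uint32_t val)     { return val & MII_START_BUSY; }
static inline bool mii_failed(uint32_t val)   { return val & MII_READ_FAILED; }
static inline uint16_t mii_data(uint32_t val) { return val & 0xFFFFu; }
```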

Because this design requires polling, it's mainly only suited to low-performance interfaces related to management functions that will be infrequently used. If needed, it can be combined with an interrupt to enable higher performance.

Write-1-to-clear registers (W1C). These kinds of registers are commonly used to track various events which may occur over time, such that a host should be able to quickly determine if any of those events have occurred. For example, suppose a slave exposes a register like this:

bit  0: Parity Error Occurred
bit  1: Internal Error Occurred

A host reads the register, then clears it by writing to the register with the bits it wishes to clear set to 1. By writing an all-ones value, the entire register is cleared. Thus the next time the host reads the register, any set bits reflect events which have occurred since the last time the register was cleared. The host can also clear only some bits by setting only those bits to 1 in its write to the register. There is no way for the host to set any of the bits in the register, save possibly by some other register.
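
A subtlety worth making explicit: the host should acknowledge exactly the bits it observed, not blindly write all-ones, so that events arriving between the read and the write are not silently lost. A sketch, with the W1C semantics simulated in memory (real code would use a volatile MMIO pointer):

```c
#include <stdint.h>

#define PARITY_ERR   (1u << 0)
#define INTERNAL_ERR (1u << 1)

/* Simulated device side: a write W1C-clears exactly the set bits. */
static void w1c_write(uint32_t *status, uint32_t value)
{
    *status &= ~value;
}

/* Host side: read the accumulated event bits, then acknowledge only
 * the bits observed, so later-arriving events survive. */
static uint32_t w1c_read_and_ack(uint32_t *status)
{
    uint32_t seen = *status;
    w1c_write(status, seen);
    return seen;
}
```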

Write-to-clear registers. Similar to the write-1-to-clear register, but any write clears the entire register; the value written doesn't matter.

  • Commonly used for counter registers. A counter register counts the number of events of a given type which have been handled, or a number of bytes. For example, an Ethernet controller might count the number of CRC errors it has seen in total, or the number of bytes it has transmitted in total.

Write-to-set & write-to-clear register pairs. A pair of registers both of which control the same underlying control register. A write to the “write-to-set” register with bit n set causes that bit to be set in the underlying register; a write to the “write-to-clear” register with bit n set causes that bit to be cleared in the underlying register. Generally, reading either of these registers reads the current value of the underlying register.

  • This pattern is very commonly used to manage interrupt masking registers, where each bit in the register masks a given interrupt, as it allows a host to mask and unmask a given interrupt using only a single store instruction, rather than three (load/or/store). It is also less susceptible to race conditions if multiple host CPUs are masking/unmasking interrupts. It is also commonly used for interrupt pending registers in the same way.

    Usually a bit being set masks an interrupt. Where the opposite is the case, a register may be referred to as an interrupt enable register instead.

  • This pattern may also be used to manage interrupt pending registers; if writeable, such a register allows an interrupt to be artificially marked as pending, or cancelled without being raised if it is already pending (perhaps if it is currently masked). If readable, allows the pending status of an interrupt to be detected, for example if interrupts are currently disabled.
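
The single-store property is easy to see in a sketch. The set/clear semantics are simulated on a plain variable here; on real hardware each helper would be one volatile store to the respective register's address, and the names are illustrative.

```c
#include <stdint.h>

/* Simulated write-to-set / write-to-clear pair over one underlying
 * interrupt mask register. */
struct irq_mask {
    uint32_t value;   /* the underlying register */
};

static void mask_set(struct irq_mask *m, uint32_t bits)   { m->value |= bits; }
static void mask_clear(struct irq_mask *m, uint32_t bits) { m->value &= ~bits; }

/* Masking or unmasking interrupt n is one store each — no
 * load/or/store read-modify-write sequence, hence no race window
 * between CPUs doing the same thing concurrently. */
static void irq_mask_one(struct irq_mask *m, unsigned n)   { mask_set(m, 1u << n); }
static void irq_unmask_one(struct irq_mask *m, unsigned n) { mask_clear(m, 1u << n); }
```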

Write-1-to-flip registers. In this pattern, writing to a register with bit n set to 1 flips that bit in the register. Not too common, but sometimes seen. It is not possible to set the value of the register “normally”; if you want to completely change the contents of the register, you must read it, XOR the value read with the desired new value, and write the result.
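
The read/XOR sequence works because XORing the current and desired values yields exactly the bits that differ. A sketch, with the W1F semantics simulated in memory:

```c
#include <stdint.h>

/* Simulated write-1-to-flip register: each written 1 bit toggles the
 * corresponding stored bit. */
static void w1f_write(uint32_t *reg, uint32_t bits)
{
    *reg ^= bits;
}

/* Force a W1F register to an arbitrary value: read it, then flip
 * exactly the bits that differ from the desired value. */
static void w1f_set_to(uint32_t *reg, uint32_t desired)
{
    uint32_t current = *reg;
    w1f_write(reg, current ^ desired);
}
```

Note that on real hardware this sequence is not atomic: if the register changes between the read and the write, the result is wrong.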

Time register. Not exactly a pattern in register design, but common enough to be worth mentioning. A register whose returned value changes linearly over time; a clock register. A time register is generally the basis of timekeeping on all modern OSes, and most modern ISAs expose a time register accessible to userspace. (On modern Linux on x86 platforms, calling clock_gettime(2) does not trigger a syscall but jumps into a VDSO handler which uses the CPU's TSC register or similar to determine the current time.)

FIFO push/pop register. A register which, when written, does not set the value of a register but instead appends the data written to a queue; or which, when read, pops data from a queue and returns it. These are separate functions which need not be combined (e.g. for a unidirectional channel), but often are for a bidirectional communications channel; note that in that case the read and write operations on such a register affect separate FIFOs. This kind of register is commonly found in UART register maps and used to read from and write to the UART. The behaviour if the FIFO is full (for writes) or empty (for reads) must be defined; typically a separate status register will expose bits indicating whether each FIFO is full or empty.
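
A simulated model of one such FIFO with its full/empty status bits; the depth and names are invented, and a real UART driver would of course poll the status register via MMIO rather than call these functions.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 16

/* Simulated UART-style FIFO behind a push/pop data register. */
struct fifo {
    uint8_t buf[FIFO_DEPTH];
    unsigned head, count;
};

/* The bits a status register would expose. */
static bool fifo_full(const struct fifo *f)  { return f->count == FIFO_DEPTH; }
static bool fifo_empty(const struct fifo *f) { return f->count == 0; }

/* "Write to the data register": append, unless full. */
static bool fifo_write(struct fifo *f, uint8_t byte)
{
    if (fifo_full(f))
        return false;
    f->buf[(f->head + f->count) % FIFO_DEPTH] = byte;
    f->count++;
    return true;
}

/* "Read from the data register": pop the oldest byte, unless empty. */
static bool fifo_read(struct fifo *f, uint8_t *byte)
{
    if (fifo_empty(f))
        return false;
    *byte = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```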

Semaphore/mutex register. A register used to provide simple hardware coordination of multiple independent processes accessing some kind of hardware state. (Not to be confused with the unrelated Lock bit pattern below.)

An example of this can be found in the BCM5718's Mutex Request and Mutex Grant register pairs (0x365C, 0x3660). Each bit corresponds to a processing element which might want to request the lock. A processing element writes a 1 to its assigned bit in the Mutex Request register to request the mutex, then waits for that bit to become asserted in the Mutex Grant register. Only 1 bit is ever set at a time in the Mutex Grant register when read; if it is read with a different bit set, it means another processing element currently holds the mutex, and that the pending request will be fulfilled when the existing hold on the mutex is released. Finally, once obtained, the mutex is freed by writing that bit as 1 in the Mutex Grant register.

Mutex registers can be either enforced or unenforced; if enforced, they protect access to some given set of hardware resources (for example, a set of registers) and attempts to use those registers do not succeed unless the mutex is held.
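
The request/grant protocol above can be modelled as follows. The grant arbitration (picking one pending requester) is done by hardware on the real device; the arbitration policy and all names here are invented for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simulated Mutex Request / Mutex Grant register pair. */
struct hw_mutex {
    uint32_t request;   /* pending request bits */
    uint32_t grant;     /* at most one bit set at a time */
};

/* Hardware-side arbitration: if free, grant one pending requester
 * (here, arbitrarily, the lowest set request bit). */
static void mutex_update(struct hw_mutex *m)
{
    if (m->grant == 0 && m->request != 0)
        m->grant = m->request & (0u - m->request);
}

/* Write our bit to Mutex Request: ask for the lock. */
static void mutex_request(struct hw_mutex *m, unsigned id)
{
    m->request |= 1u << id;
    mutex_update(m);
}

/* Poll Mutex Grant: true once our bit is the granted one. */
static bool mutex_granted(const struct hw_mutex *m, unsigned id)
{
    return m->grant == (1u << id);
}

/* Write our bit to Mutex Grant: release, letting a waiter in. */
static void mutex_release(struct hw_mutex *m, unsigned id)
{
    m->request &= ~(1u << id);
    m->grant = 0;
    mutex_update(m);
}
```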

Lock bits. It is sometimes desirable to be able to lock a register against further changes until some defined event occurs, usually a system reset. A lock bit in a register can be set to 1 in the ordinary way, but cannot be cleared other than in a specially defined way. Setting the lock bit to 1 restricts system functionality in some arbitrarily defined way: it may preclude further writes to that register, or to a group of registers, or restrict read or write access to some region of memory or storage, or have other effects.

There are various lock bits found in the x86 platform, such as those used by BIOSes to prevent an OS from changing the configuration of System Management Mode (SMM) after the BIOS has configured it, for better or worse. Another typical example is to lock down a region of non-volatile storage used for booting after booting is complete, in order to facilitate a secure boot system.

Nonexistent registers. The amount of space allocated to a slave in a memory map will only rarely exactly equal the amount of space required for all the registers, therefore a behaviour for the unmapped space must be defined; for example, returning all-ones (see below).

Failure handling: all-ones on fail. While not a behaviour of a specific register, it is worth noting here: PCI and PCIe devices return all-ones values for failed register reads. A host (e.g. a PCI device driver) performing an MMIO read can check if the loaded value is the all-ones value, as this might indicate a device failure (but also might not; it is obviously ambiguous.)

Failure handling: trap on fail. Another option is for failed reads or writes to result in some kind of architectural trap. For reads, generally this involves returning a special error result to a CPU instead of a valid register value, causing an architecturally defined error handling flow. (Note that for writes, behaviour may be less predictable, as it is common for CPUs to continue executing before a write is signalled as completed by the slave device, if indeed any completion of the write is signalled at all. Thus if a trap does occur, it may be “asynchronous” and occur some undefined time after the instruction causing the trap was executed.)

Aliased register blocks for security. Some devices duplicate certain register blocks so that they have both “non-secure” and “secure” aliases with different addresses, with access to the latter only being possible from a secure context (see Ambient authority below). Not all registers may be present in both regions, and different restrictions may apply to the same registers when accessed via each alias. Examples include some ARM TrustZone devices.

Register access methods

The above describes register behaviours. Another question is whether and how a register is addressed, and how it is accessed. The following patterns arise:

Architectural, direct. These registers are part of an ISA and are accessed directly by machine code instructions; for example, general-purpose registers.

Architectural, indirect addressed. These registers are part of an ISA but are addressed. They are generally accessed by special machine code instructions which essentially access a special address space containing CPU control registers. Examples include x86 MSRs (accessed using rdmsr/wrmsr), Power SPRs, RISC-V CSRs, etc.

Non-architectural, addressed, memory-mapped, direct. These registers are mapped into a host's physical address space2 and can be accessed directly in an ordinary way.

The memory-mapping of registers creates some interesting issues:

  • Byte-addressing. Since most computer architectures use byte-addressed memory, it is conventional for registers to be mapped into memory according to their size in bytes; i.e., if a set of registers are each 32 bits in size, register 0 is mapped at X+0, register 1 is mapped at X+4 and so on.

    Where non-memory-mapped access methods are used, integral addressing is more suitable, where each register is addressed by only one integer address value (register 0, register 1, etc.) regardless of its size.

  • Alignment requirements. However, while this mapping allows unaligned access or partial access (using smaller loads/stores) to be supported if desired, a register does not necessarily support such unaligned or partial access. If such accesses are performed, the behaviour might be undefined, or might result in a failure handling case of a kind defined above.

  • Overlapping registers. Due to the mapping of registers to a byte-addressed memory subsystem according to their size, assuming no cache subsystem is involved in register access, it may be possible to assign registers to consecutive byte addresses so that they are essentially “overlapping”. In other words, suppose 32-bit registers A, B and C are given addresses X+0, X+1 and X+2 respectively. This would be quite peculiar from the perspective of how a byte-addressed memory subsystem is supposed to work, but in principle would work. I can see no reason to advocate such a bizarre design, however, and am unaware of any examples of this being done.

  • Caching. Memory-mapped registers are almost always intended to be accessed via uncached memory accesses, and an accessing host must therefore take care to ensure that caching is disabled for the relevant accesses. However, it is possible for a register to participate in caching, or in a cache coherency protocol, if desired.

Non-architectural, addressed, indirect. These registers have numerical addresses assigned but cannot be accessed directly from the host. Instead, there is a layer of indirection between the host's physical address space and the given register, accessed via an indirect access register scheme (see Indirect access registers above). Occasionally, there may even be multiple layers of indirection, leading to progressively less efficient and less direct methods of access. For an example, see again footnote 1.

Architectural, addressed, memory-mapped, direct. Finally, it is worth noting that some CPU ISAs actually memory-map some architectural registers to a fixed region of the CPU's memory map (for example, ARMv7-M).

Register service primitives

The discussion so far has assumed that registers are accessible via only two basic service primitives:

Read(address) → (value: uint<N>) | :error
Write(address, value: uint<N>) → :ok | :error

By definition, all service primitives are invoked by a master and completed by a slave. It is possible for a device to possess both master and slave roles.

Here, address might be an address (in the case of memory-mapped access), or a register number, or other symbolic register identifier.

Atomics. However, memory and I/O subsystems have been evolving and in many cases now support atomic operations; this support extends to I/O interconnects such as PCIe Gen3, whose specification defines three types of atomic operation: Fetch and Add, (Unconditional) Swap, and CAS:

AtomicFetchAdd(address, addValue: uint<N>) → (oldValue: uint<N>) | :error
AtomicSwap(address, newValue: uint<N>) → (oldValue: uint<N>) | :error
AtomicCAS(address, compareValue: uint<N>, swapValue: uint<N>) → (oldValue: uint<N>) | :error

Though a slave's control registers will typically not support atomics, such atomics could be supported if desired. However, device support remains an issue, as many PCIe host subsystem implementations do not support generating atomic operations.3

Synchronisation requirements. On some host systems accessing memory-mapped I/O devices, all memory-mapped I/O loads/stores must be explicitly synchronised using a synchronisation instruction. For example, on some Power ISA systems (such as POWER9), attempting to perform multiple MMIO accesses to a PCIe device without executing the correct synchronisation instruction (such as sync or eieio) will actually be detected by the hardware as an error and result in a freeze of further MMIO accesses.

Ambient authority. The service primitives above do not include any kind of authorization information. It is assumed that the identity of the invoking master is understood implicitly. In particular, arbitrary security metadata can be associated with a primitive, and neither this fact nor the nature of that metadata is generally visible to the program executing a load or store instruction. For example, platforms may be extended over time to add new privilege levels (see the addition of TrustZone to the ARM ISA) while retaining binary compatibility with existing user-mode and kernel-mode code. A register may predicate its behaviour on such metadata, for example by only allowing accesses marked as originating from a “secure” context.


1. MDIO is a two-wire (clock, data) interface used by an Ethernet MAC to access control registers on an Ethernet PHY. The interface is defined by the IEEE 802.3 standard, as are some of the registers, but pretty much every PHY also has vendor-specific registers. Even if a NIC has the MAC and PHY integrated on the same chip, this interface may still be used internally to manage the control registers of the on-board PHY. The commands supported by MDIO are READ and WRITE, each taking a 5-bit PHY address and a 5-bit register number. The register values read and written are 16 bits. The PHY address allows the same MDIO pins to be used for multiple Ethernet PHYs as a shared bus. However, the 5-bit register number turns out to be cripplingly small and isn't nearly big enough for all the registers exposed by a modern PHY. This means that vendor-specific, and indeed some standard, PHY registers accessed via MDIO actually have to be accessed via an indirect scheme in turn. Given that MDIO is typically accessed via an indirect system of access itself, this means many Ethernet PHY registers, when accessed by a host system, have to go through two layers of register indirect access.

One example of such an indirect scheme is IEEE 802.3 § 22.2's Registers 13 and 14 (“Clause 22” indirect access). This scheme is particularly inefficient. The user must first write 0 to register 13, which causes subsequent accesses to register 14 to access the address register; then write the desired register address to register 14; then write 0x4000 to register 13, which causes accesses to register 14 to access the data of the register addressed by the value last written to the address register; and finally read or write register 14 as desired. Thus every indirect register access requires not two but four register accesses.

2. Note that some architectures have multiple address spaces. x86 has the legacy I/O port address space. The SPARC architecture supports up to 256 different address spaces in its load and store instructions, via the “Load/store from alternate space” instructions.

3. I am told that when AMD adopted PCIe Gen3 for its graphics cards, it was perhaps overly optimistic about the uptake of PCIe atomics support by CPU vendors, and apparently required support for atomics in order to make full use of its GPGPU functionality (e.g. ROCm). This apparently led to such functionality being generally unavailable except for those using AMD CPUs. Ordinary graphics usage was unaffected. I assume AMD no longer uses atomics as part of its host interface for GPGPU operations, but would welcome any details.