The Talos II, Blackbird POWER9 systems support tagged memory

Nowadays, there is increasing interest in adding tagged memory functionality to CPU architectures. Such architectures associate one or more tag bits with each quantum of a system's memory. There are many motivations for this ability to associate “metadata” with individual memory words, but one of the clearest is the potential security benefit of tracking the provenance of certain kinds of data in memory, such as pointers. This can be used to create capability-based architectures such as CHERI.

Less well known is the fact that IBM's POWER CPUs have long had support for memory tagging, and can associate one tag bit with every aligned 128-bit quadword. This has historically been used by the IBM i operating system, which runs on IBM's own Power servers. The basis of this memory tagging functionality is an undocumented and proprietary extension to the Power ISA known as PowerPC AS, which adds instructions to manipulate tagged memory. These extensions are only used internally by the IBM i operating system, hence their undocumented nature.

More recently, the release of the Talos II and Blackbird systems by Raptor Computing Systems represented the first time contemporary IBM POWER CPUs became accessible to a wider audience. Since IBM only supports their IBM i operating system on their own proprietary PowerVM hypervisor, which itself is only found on their own servers, these systems are obviously not intended to run IBM i, and it was unclear for many years whether the CPUs sold by Raptor even had the memory tagging functionality enabled. These instructions are after all undocumented, and were never intended for use other than by IBM's own IBM i operating system.

Thus for many years, getting memory tagging working on these systems was an intriguing possibility, but nobody knew whether it was actually feasible or whether IBM fused off this functionality in the CPUs it sells to third parties. The POWER CPUs IBM sells to third parties are fused slightly differently to those it uses in most of its own servers, being fused for 4-way multithreading (SMT4) rather than 8-way multithreading (SMT8); it would be entirely plausible for the tagging functionality to be fused off in the SMT4 parts, given that IBM i was only ever intended to run on SMT8 systems. While the ISA extension is undocumented, fairly complete knowledge about it has already been pieced together from bits and pieces, so this was not actually the major obstacle. However, nobody knew whether use of the memory tagging functionality might require some kind of special initialization of the CPU. In theory, one need simply set a single undocumented bit (“Tags Active”) in the Power Machine State Register (MSR). However, simple attempts to enable Tags Active mode on OpenPOWER systems such as the Talos II did not succeed.

This all changed when someone discovered that it was in fact possible to enable Tags Active mode on Talos II and Blackbird systems. This discovery was made by Jim Donoghue and all credit for this discovery goes to him; I publish this finding with his permission.

The structure of this article is threefold:

Firstly, I explain the background of this memory tagging functionality and IBM i (you may skip this if you wish);
Secondly, I explain how the memory tagging functionality can be used on Talos II and Blackbird systems, or on any IBM POWER system;
Thirdly, I discuss how the memory tagging functionality is implemented using clever shenanigans based around a server's ECC memory.

Background

Nowadays, there is increasing interest in tagged memory architectures such as CHERI. These architectures associate one or more tag bits with each quantum of a system's memory, allowing the creation of capability-based architectures with unforgeable pointers.

The IBM AS/400 machine, introduced in 1988, and its successor, IBM i, which runs on IBM POWER servers, make use of a memory tagging architecture to anoint pointers as valid and create a capability-based architecture. Pointers are 128-bit values which are tagged in memory, rendering them unforgeable. A single bit is associated with every aligned 128-bit quadword; if it is set, the quadword contains a valid pointer. When programming in C for example, it is not possible to cast an integer to a pointer. uintptr_t is not even defined.

Use of trusted translators. IBM's capability-based design is implemented using a hybrid of both hardware and software. The central premise of the design is that you cannot write native code for the platform directly; instead, all programs for the system must be written in an intermediate language which is translated internally (at installation time) to Power ISA machine code by the system's “kernel”. This trusted translator maintains the desired security invariants of the system, just like the modern sandboxed JIT designs used in Java, .NET, JavaScript and WebAssembly, and without relying on hardware memory protection.

All of these systems also enjoy ISA independence; in the IBM i case, IBM took advantage of this to port the OS from an undocumented CISC ISA¹ to the Power ISA without breaking the compatibility of any user program.

Comparisons can be drawn to Microsoft's subsequent Singularity research project, which wrote an entire OS in C#; since the desired security controls were enforced by the trusted translator which converts .NET IL to machine code, Microsoft researchers were able to disable the MMU as it was not needed. (Notably, they observed increased performance as TLB misses were no longer a factor.) ²

Memory tagging ISA extension. In order to support this capability-based OS, IBM implemented an undocumented extension to the Power ISA known as PowerPC AS. This ISA extension provides a memory tagging system which associates one tag bit with every aligned 16 bytes of system memory (a quadword). In a 128-bit pointer, 64 bits are used for the typical memory address; the other half stores a few bits of metadata and is otherwise reserved (the degree of futureproofing is rather extreme).

A quadword in memory is considered to hold a valid pointer only if its tag bit is set. Overwriting the quadword, or any part of it, in an ordinary fashion clears the tag bit automatically.

Security invariants. It's worth noting that this memory tagging system doesn't enforce any security invariant by itself. It can't sandbox arbitrary machine code. What it does do is provide a way to associate a tag bit with every aligned quadword of system memory, as well as some ISA enhancements to allow IBM i's trusted translator to generate more performant machine code. Thus, the entire memory tagging system can to some extent be seen as a hardware acceleration assist for the IBM i OS.

To illustrate the point, here is a typical Power ISA instruction sequence which the IBM i kernel might generate when translating a user program from IBM i bytecode:

lq    r3, 0(r4)
txer  0, 0, 43

What this does is as follows:

the first instruction loads a tagged quadword from aligned memory into two consecutive 64-bit CPU registers, and puts the associated tag bit in XER (the Power ISA Fixed-Point Exception Register);
the second instruction, “trap on XER”, traps if the TAG bit is not set in the XER register; in other words, if what has been loaded is not a valid pointer with the tag bit set, an exception handler is jumped to.
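The behaviour described above can be modelled in a few lines of Python. This is only a sketch of the semantics as described in this article, not of the real hardware: the class `TaggedMemory`, the exception `TagTrap` and all method names are invented for illustration, and `store_pointer` is a stand-in for whatever tagged store the trusted translator would emit.

```python
class TagTrap(Exception):
    """Hypothetical stand-in for the trap taken by txer."""


class TaggedMemory:
    QUADWORD = 16  # bytes covered by one tag bit

    def __init__(self, size):
        if size % self.QUADWORD:
            raise ValueError("size must be a multiple of 16 bytes")
        self.data = bytearray(size)
        self.tags = [False] * (size // self.QUADWORD)

    def _check_quadword(self, addr):
        if addr % self.QUADWORD:
            raise ValueError("quadword access must be 16-byte aligned")
        if addr < 0 or addr + self.QUADWORD > len(self.data):
            raise IndexError("access out of range")

    def store(self, addr, payload):
        """Ordinary store: clears the tag of every quadword it touches."""
        end = addr + len(payload)
        if addr < 0 or end > len(self.data):
            raise IndexError("store out of range")
        self.data[addr:end] = payload
        first = addr // self.QUADWORD
        last = (end + self.QUADWORD - 1) // self.QUADWORD
        for q in range(first, last):
            self.tags[q] = False

    def store_pointer(self, addr, payload):
        """Tagged store (assumed): writes a quadword and sets its tag."""
        self._check_quadword(addr)
        if len(payload) != self.QUADWORD:
            raise ValueError("a pointer is exactly 16 bytes")
        self.data[addr:addr + self.QUADWORD] = payload
        self.tags[addr // self.QUADWORD] = True

    def load_quadword(self, addr):
        """Models lq: returns the quadword and its tag bit."""
        self._check_quadword(addr)
        value = bytes(self.data[addr:addr + self.QUADWORD])
        return value, self.tags[addr // self.QUADWORD]

    def load_pointer(self, addr):
        """Models lq followed by txer: trap unless the tag is set."""
        value, tag = self.load_quadword(addr)
        if not tag:
            raise TagTrap(f"quadword at {addr:#x} is not a valid pointer")
        return value
```

Storing a pointer with `store_pointer` and then overwriting even one byte of it with `store` makes a later `load_pointer` raise `TagTrap`, which is the property the translator relies on.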

In other words, what makes the system secure is that IBM i's trusted translator always (for example) produces a txer instruction to validate a pointer after loading it, enforcing the desired security invariant. These extensions to the ISA allow the trusted translator to produce faster code, but it is ultimately the translator that enforces the desired security invariant. Consider that it would be entirely possible to implement IBM i on vanilla x86, which has no corresponding memory tagging functionality, by storing tags in some kind of MMU-like radix table managed and checked manually by the emitted machine code; it would simply be less performant. Thus these ISA extensions should be seen as a kind of hardware accelerator more than a security control in themselves.
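To make the "vanilla x86" thought experiment concrete, here is a minimal sketch of a software-managed tag table. Everything in it is an assumption made for illustration: the names (`SoftwareTagTable`, `checked_pointer_load`), the two-level layout (a Python dict standing in for the upper levels of a radix table, with bitmap leaves), and the leaf size. It is not how IBM i or any real system stores tags.

```python
QUADWORD_SHIFT = 4   # one tag per 16-byte quadword
LEAF_BITS = 15       # tag bits per leaf (a 4 KiB bitmap); arbitrary choice


class SoftwareTagTable:
    def __init__(self):
        # Upper level: leaf index -> bitmap. Leaves are allocated lazily,
        # so untouched regions of memory cost nothing.
        self.leaves = {}

    @staticmethod
    def _split(addr):
        quadword = addr >> QUADWORD_SHIFT
        return quadword >> LEAF_BITS, quadword & ((1 << LEAF_BITS) - 1)

    def set_tag(self, addr):
        top, bit = self._split(addr)
        if top not in self.leaves:
            self.leaves[top] = bytearray((1 << LEAF_BITS) // 8)
        self.leaves[top][bit >> 3] |= 1 << (bit & 7)

    def clear_tag(self, addr):
        top, bit = self._split(addr)
        leaf = self.leaves.get(top)
        if leaf is not None:
            leaf[bit >> 3] &= ~(1 << (bit & 7)) & 0xFF

    def is_tagged(self, addr):
        top, bit = self._split(addr)
        leaf = self.leaves.get(top)
        return leaf is not None and bool(leaf[bit >> 3] & (1 << (bit & 7)))


def checked_pointer_load(table, memory, addr):
    """What translator-emitted code would do by hand instead of lq + txer."""
    if not table.is_tagged(addr):
        raise RuntimeError("not a valid pointer")
    return bytes(memory[addr:addr + 16])
```

Every pointer load pays for an extra table lookup and every ordinary store must call `clear_tag`, which is exactly the overhead the hardware extension removes.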

How to use memory tagging on Talos II and Blackbird systems

Enabling the Tags Active mode, in which the memory tagging extensions can be used, on POWER9 systems, turns out to be absurdly easy:

1. Disable the Radix MMU.
2. Run in Big Endian mode.
3. Set the Tags Active bit in the Power ISA's architectural Machine State Register (MSR) (see here).

It's point 1 here that was the breakthrough; again, all credit for this discovery goes to Jim Donoghue. While various people had tried to set the Tags Active bit in the MSR on Talos II systems, it would never work and the bit would always read back as unset. But it turns out there was a very simple reason for this: the Power ISA has traditionally used an unusual Hashed Page Table (HPT) MMU design, in which page table entries are looked up by hashing the virtual address. This is unlike almost every other ISA out there, most of which use a radix tree-based design in which specific bits of a virtual address index a table at a particular level of the page table hierarchy. POWER9 introduced a new (and much more normal) Radix MMU which behaves much like the MMUs of other ISAs, but also continues to support the traditional HPT MMU. An operating system can choose which MMU model to use. Operating systems such as IBM i running on POWER9 continue to use the HPT MMU, whereas if you boot Linux on POWER9, it will use the shiny new Radix MMU by default.
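For readers unfamiliar with the distinction, the two lookup styles can be contrasted with a toy model. This is purely illustrative: the page size, the number of levels, the table sizes and above all the hash function are assumptions chosen for brevity, and none of them reflect the actual Power ISA page table formats.

```python
PAGE_SHIFT = 12       # assumed 4 KiB pages
BITS_PER_LEVEL = 9    # assumed fan-out per radix level
LEVELS = 4            # assumed depth of the radix tree
NUM_GROUPS = 1024     # assumed number of hash buckets
GROUP_SIZE = 8        # assumed entries per bucket


def radix_indices(vaddr):
    """Radix style: fixed bit fields of the address index each level."""
    vpn = vaddr >> PAGE_SHIFT
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(vpn >> (BITS_PER_LEVEL * lvl)) & mask
            for lvl in reversed(range(LEVELS))]


def radix_map(root, vaddr, frame):
    *upper, last = radix_indices(vaddr)
    node = root
    for idx in upper:
        node = node.setdefault(idx, {})
    node[last] = frame


def radix_lookup(root, vaddr):
    node = root
    for idx in radix_indices(vaddr):
        node = node.get(idx)
        if node is None:
            return None
    return node


def hpt_group(vpn):
    """Hashed style: a toy hash (not the real one) selects a bucket."""
    return (vpn ^ (vpn >> 10)) % NUM_GROUPS


def hpt_insert(table, vaddr, frame):
    vpn = vaddr >> PAGE_SHIFT
    group = table[hpt_group(vpn)]
    if len(group) >= GROUP_SIZE:
        raise MemoryError("bucket full; a real OS would evict an entry")
    group.append((vpn, frame))


def hpt_lookup(table, vaddr):
    vpn = vaddr >> PAGE_SHIFT
    for entry_vpn, frame in table[hpt_group(vpn)]:
        if entry_vpn == vpn:   # entries carry the page number to compare
            return frame
    return None
```

The radix walk touches one table per level, while the hashed lookup goes straight to one bucket and searches it; both return the same frame for the same mapping.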

Since IBM probably has no intention of ever trying to use the Radix MMU, which was intended for modern Linux workloads, with IBM i, it's entirely unsurprising that Tags Active mode doesn't work with it. Likewise, IBM i always runs in Big Endian mode, so it's not that surprising that they didn't bother to make the tagging functionality work in Little Endian mode.

Linux of course still supports the traditional HPT MMU: in the end, getting Tags Active mode working literally just turns out to be as simple as adding disable_radix to the kernel command line. So long as the CPU is running in HPT mode and in Big Endian mode, Tags Active mode can be enabled freely. Fears that some arcane system initialization sequence might be needed to get the functionality working turned out to be unfounded; it really was that simple. (After the intensive efforts one has seen in other reverse engineering efforts you might almost call it anticlimactic.)
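Before experimenting, it can be handy to confirm that the host really was booted with the argument. The following sketch simply looks for the bare `disable_radix` token mentioned above in the kernel command line; the function names are made up for this example, and it deliberately does not try to interpret any other spelling of the option.

```python
from pathlib import Path


def radix_disabled(cmdline):
    """True if the bare disable_radix token appears on the command line."""
    return "disable_radix" in cmdline.split()


def host_radix_disabled(path="/proc/cmdline"):
    """Check the running Linux host (reads the kernel command line)."""
    return radix_disabled(Path(path).read_text())
```

Calling `host_radix_disabled()` on the host after rebooting should return `True` if the argument took effect.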

Demonstration program

The Tags Active bit of the MSR can only be changed in kernel mode, which makes it hard to play with; however, the IBM i OS is run as a guest virtual machine on IBM Power servers, and thus Tags Active mode can be readily used inside a virtual machine. This makes it easy to experiment with.

A demonstration program can be found here. This is a “bare metal” (well, bare VM) program which runs directly inside a virtual machine. See the instructions in the repository for information on how to build and run it. Obviously, it must be run on a Talos II or Blackbird system (though an IBM Power server should also work).

You will need to reboot the host system to boot Linux with the disable_radix kernel argument. A little endian host system (i.e. a ppc64el Linux distro) is fine; only the guest VM needs to run in Big Endian mode, and that is taken care of by the code linked above. Read the README for more information on the requirements.

The code is released into the public domain. Feel free to use it as a basis for whatever you like.

Where are the tags stored?

You may have been thinking while reading all of this: where are these tags stored, anyway? The answer proves to be quite fascinating.

We don't officially know how IBM stores these tags. Surprisingly I haven't even found any IBM patents talking about it. However, with some educated guessing, we can rule out possibilities until there is only one possibility remaining.

The original AS/400 memory tagging system used custom, extra-wide memory modules to store the tag bits. However, modern IBM i, running on IBM Power servers, is able to store these memory tags using only industry standard ECC DIMMs. Moreover, no RAM is reserved to store these tags; in other words, given a 32 GiB ECC DIMM, you get 32 GiB plus the corresponding tag storage (which is 256 MiB; one bit for every 16 bytes). The Talos II supports up to 2 TiB of RAM, which would mean 16 GiB of tags. This raises the question of how IBM has found a way to magically store the memory tags without any extra overhead.
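The figures above follow directly from "one bit for every 16 bytes". As a quick check, here is the arithmetic as a small Python helper (the function name is just for illustration):

```python
MIB = 1 << 20
GIB = 1 << 30
TIB = 1 << 40


def tag_storage_bytes(ram_bytes):
    """One tag bit per 16-byte quadword, eight tag bits per byte."""
    quadwords = ram_bytes // 16
    return quadwords // 8


print(tag_storage_bytes(32 * GIB) // MIB)  # 256 (MiB)
print(tag_storage_bytes(2 * TIB) // GIB)   # 16 (GiB)
```

In other words, tag storage is always 1/128 of the memory it describes.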

Non-possibilities. We can immediately rule out a few things:

Could the tags be simply stored in an L3 “cache” on the CPU and never flushed out to memory? The Talos II supports up to 1 TiB of RAM per socket, which means the CPU would need to be able to store 8 GiB of tags on-die. Of the POWER9 CPUs sold by IBM, none has an L3 cache more than 110 MiB. If IBM thinks 110 MiB is about the limit of the L3 cache it can reasonably accommodate given process limitations, it seems safe to say that it hasn't tacked on an extra 8 GiB cache to accommodate an OS most people don't use.
Could the tags be being handled by a trap into system firmware? The Talos II and Blackbird platforms feature entirely open source firmware. If tagging functionality were being supported via firmware in any way, this would appear in the firmware source code. However, there are almost no references to memory tagging functionality in the POWER9 system firmware. This suggests that memory tagging really is handled only in hardware, and needs no support from system firmware to work. Moreover, if a trap to firmware were to be required every single time a tagged memory quadword was loaded or stored, this obviously would not meet the performance requirements for the IBM i OS. Plus since the tagging extensions can be used in a VM, if they were handled by firmware on bare metal, the hypervisor would have to be involved in making them work in virtual machines. Since tagging functionality has been proven to work running under Linux KVM, and the entirety of Linux is open source and contains no mention of or special handling for tagging functionality, this is impossible. It really is implemented fully in hardware.

Could the tags be stored in a region of DRAM which is specially reserved? This can also be ruled out by simple measurements of how much RAM people are missing. It's normal for some RAM to be missing in the amounts reported by an OS because some will be reserved for system firmware and kernel overheads, etc. However, is this missing amount large enough to store the full tag data?

I asked a variety of people with Talos II or Blackbird systems to report the output of free -m and the amount of RAM they have installed in their systems. They reported:

	Amount Installed	Total Memory Reported	Missing	Needed for Tags
Person A	64 GiB	65219 MiB (63.69 GiB)	317 MiB	509 MiB
Person B	184 GiB	187706 MiB (183.31 GiB)	710 MiB	1.43 GiB
Person C	248 GiB	253193 MiB (247.26 GiB)	759 MiB	1.93 GiB

We can see here that the amount of memory missing and not available to the OS is far less than that which would be needed to store tags. Moreover, the amount of memory Person C's machine would need for tags is 1.35x that of Person B's, but the amount of memory missing is just 1.07x the size.
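The table can be reproduced with a few lines of Python. Note one assumption: the "Needed for Tags" column appears to be computed from the reported total rather than from the installed amount, which is what this sketch does; the names used are illustrative only.

```python
# (installed GiB, total reported by free -m in MiB), from the table above
REPORTS = {
    "Person A": (64, 65219),
    "Person B": (184, 187706),
    "Person C": (248, 253193),
}


def missing_mib(installed_gib, reported_mib):
    return installed_gib * 1024 - reported_mib


def tags_needed_mib(reported_mib):
    # One tag bit per 16 bytes is 1/128 of the memory described.
    return reported_mib / 128


for name, (installed, reported) in REPORTS.items():
    print(name, missing_mib(installed, reported),
          round(tags_needed_mib(reported)))
```

The comparison in the text is then `253193 / 187706` (about 1.35) for the tag requirement against `759 / 710` (about 1.07) for the missing memory.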

It is not possible that memory is being reserved for tags only on machines which use tags, because the machine is not informed in advance (when booting) if tags are going to be used, thus any memory reserved for tags would have to be reserved in every instance on all POWER9 machines. But this is clearly not the case.

This leaves us with one, and only one possibility for how the tags are being stored: IBM is doing something funny with ECC RAM.

Fun with information theory. Recall that an ECC RAM DIMM adds an extra RAM chip for every 8 RAM chips, which is intended to hold error correction data. In short, 64 bits of user data is mapped to a 72 bit error-correctable word. In particular, note that there is nothing “intelligent” about ECC RAM; it's just a slightly wider memory module which stores 72 bit words rather than 64 bit ones. The DIMM doesn't know or care what is done with these extra bits, and there's nothing forcing you to use them for ECC.

Of course, these extra bits in themselves would be more than enough to store the tag bits. A 64-bit block contains only half a 128-bit quadword, so only half a tag bit is required per 64-bit block, which is far less than the 8 ECC check bits available per 64 bits. However, using the ECC check bits for the tag bits alone would mean forgoing any kind of error correction functionality, which IBM is obviously not doing here. In fact IBM has always liked to play up the resilience of its servers to hardware faults and memory failures. Thus, IBM is clearly using the ECC check bits in a way that allows it to obtain both the advantage of ECC protection and somehow store tag bits.

An interesting point here is the distinction between error correction coding and erasure coding. Error correction coding, of which ECC memory (or more accurately how CPUs make use of it) is an example, allows a certain number of corrupted bits to be corrected, and often a slightly higher number of corrupted bits to be detected but not corrected. The typical example of this is SECDED: single error correction, double error detection, which is basically standard.
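For reference, here is a minimal textbook SECDED code (an extended Hamming code) in Python. It is a sketch of the classic construction only, with function names chosen for this example; real memory controllers use their own codes and bit layouts, and nothing here is specific to POWER9.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits as an extended Hamming codeword.

    Index 0 holds the overall parity bit, power-of-two indices hold the
    Hamming check bits, and the remaining indices hold the data.
    """
    k = len(data_bits)
    r = 0
    while (1 << r) < k + r + 1:
        r += 1
    n = k + r
    code = [0] * (n + 1)
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = next(data)
    for i in range(r):
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code[1:]) % 2          # makes the total parity even
    return code


def decode(code):
    """Return (status, codeword); status is ok, corrected or uncorrectable."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of set bits
    odd = sum(code) % 2
    if syndrome == 0 and not odd:
        return "ok", list(code)
    if odd and syndrome <= n:
        fixed = list(code)
        fixed[syndrome] ^= 1             # syndrome 0 means the parity bit itself
        return "corrected", fixed
    return "uncorrectable", list(code)   # even parity, nonzero syndrome


word = encode([1, 0] * 32)
print(len(word))                         # 72: the familiar 64/72 layout
```

A single flipped bit yields an odd overall parity and a syndrome equal to its position, so it can be repaired; two flipped bits leave the parity even with a nonzero syndrome, so they can only be detected.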

However, information theory also offers something called erasure coding. Erasure coding differs from error correction coding in that one has reliable knowledge of when a bit has been lost, or “erased”. It is not useful to secure the contents of your RAM from cosmic rays, because when loading data from DRAM, you obviously don't know which of the bits, if any, might have become corrupted; however, if you have some application in which it is contextually known which bits are unknown, erasure coding can be used. An important point here is that erasure coding can be more performant than error correction coding because there are fewer unknowns to work with (the positions of the bits which have been erased, and the number of bits which have been erased, are known, which is not the case with error correction coding).

The memory tags used in IBM POWER are an ideal candidate for erasure coding: when storing a 128-bit quadword to DRAM, simply “forget” the tag bit, and when loading it back from memory, use erasure coding to recover the tag bit. It is known which bit has been “erased” — it is always the tag bit.

Still, this is just a demonstration of the principle. Since IBM servers also obviously offer error correction for data stored in RAM, what IBM is doing to store tag bits in DRAM is presumably some kind of hybrid application of both error correction coding and erasure coding. Such schemes do exist.
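The bare principle (erasure recovery only, with no error correction) fits in a few lines of Python. This is a deliberately naive illustration using a single parity bit, with made-up function names; it is not a claim about the code IBM actually uses.

```python
from functools import reduce
from operator import xor


def parity(bits):
    return reduce(xor, bits, 0)


def store(data_bits, tag):
    """Return what goes to DRAM: the data plus one check bit.

    The check bit covers the data and the tag, but the tag itself is
    "forgotten" and never written.
    """
    return list(data_bits), parity(data_bits) ^ tag


def load(data_bits, check):
    """Recover the erased tag: its position is known, so one bit suffices."""
    return parity(data_bits) ^ check
```

The weakness is equally easy to see: if any stored data bit flips, `load` silently returns the wrong tag, because a lone parity bit cannot tell an erasure from an error. That is why a practical scheme has to combine the two kinds of coding.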

A smoking gun. This now brings me to an interesting part of the POWER9 User Manual, which talks about the processor's reliability features. In fact, once you notice it, it's really quite glaring. IBM is essentially admitting this is how they're pulling it off, and it was in plain sight the entire time:

64-byte memory ECC

Dual packet analysis for 128-byte reads

Address parity encoded into ECC code

Correction of up to one symbol in a known location plus up to two unknown symbol errors

Correction of up to one symbol in a known location plus a new ×4 chip kill

Correction of one ×4 chip in a known location plus either a known symbol or one unknown symbol error

Flexible chip and symbol marking storage

Read this line very carefully:

Correction of up to one symbol in a known location plus up to two unknown symbol errors

That's not SECDED, Ted! Notice what's strange here. Normally a server with SECDED error correction would never need to perform “correction of up to one symbol in a known location” (which is erasure coding). This is a tell; IBM is admitting their ECC implementation supports erasure coding of a single extra bit, in addition to doing error correction.³

(It amused me when I saw the above; it really was a smoking gun hiding in plain sight, in a seemingly innocuous list of POWER9 reliability features.)

Ordinarily, the 8 bits of ECC memory provided for every 64 bits of normal memory by an ECC DIMM are just enough to implement SECDED error correction coding. How can IBM offer better-than-SECDED performance and erasure code some tag bits into memory at the same time?

Read the above list closely again:

64-byte memory ECC

64-byte reads; not 64-bit reads. This implies that IBM is reading memory in 512-bit blocks and doing ECC coding, mapping a 512-bit block to a 576-bit ECC-encoded block. This maintains the same ratio of 64 bits/8 bits but uses a much larger ECC block size than the 64/72 bit blocks traditionally used. This turns out to be a powerful trick, because error correction coding becomes more efficient the larger you make the block size. (The downside is likely to be increased power consumption and an increased silicon footprint.)

By making the block size larger, efficiency is improved, so that you don't actually need the whole 8 bits set aside for ECC per 64 bits. This, in turn, creates some headroom to encode other data, like tag bits.
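The effect of block size is easy to quantify for plain Hamming-style SECDED, where the number of check bits r must satisfy 2^r >= k + r + 1 for k data bits, plus one more bit for double error detection. The sketch below (function names are illustrative) compares that requirement with the 12.5% budget an ECC DIMM provides. It only counts bits for textbook SECDED; a stronger symbol-based code such as the one IBM describes will consume more of the budget, so treat the spare figures as an upper bound on headroom, not as IBM's numbers.

```python
def secded_check_bits(data_bits):
    """Check bits needed for Hamming SEC plus one overall parity bit."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def spare_bits(data_bits):
    """Bits left over from the 8-per-64 budget of an ECC DIMM."""
    return data_bits // 8 - secded_check_bits(data_bits)


for k in (64, 128, 512):
    print(k, secded_check_bits(k), spare_bits(k))
# 64 8 0
# 128 9 7
# 512 11 53
```

At 64 data bits the budget is used up exactly, while at 512 data bits SECDED alone would need only 11 of the 64 available bits.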

An IBM whitepaper which discusses some of the reliability features of IBM hardware provides a useful example of this:

On a 2-way interleaved memory system, two ECC (Error Checking and Correcting) DIMMs contain 144 bits, but only 140 bits are used for data and checksums. The remaining four bits are unused. Standard ECC memory can detect two-bit errors, but it corrects only single-bit errors. If multiple bits of a memory chip fail at once, the whole DIMM fails, crashing the server and temporarily leaving the system with reduced memory capacity (until the module is replaced).
Memory ProteXion, instead of immediately disabling the DIMM, merely rewrites the data to some of those spare bits.

IBM definitely has a long history of doing interesting things with ECC memory, and finding spare room in the ECC bits.

A DDR4 DIMM has a 64-bit data bus, plus an 8-bit bus for the extra ECC bits. When two memory channels are used, a CPU reads 128 data bits at a time, plus 16 ECC bits, for a total of 144 bits. Notice what the above paragraph says: “only 140 bits are used for data and checksums”. In other words, by increasing our block size from 64/72 to 128/144, we have freed up 4 bits spare per 128 data bits. As mentioned above, error correction coding becomes more efficient with larger block sizes.⁴

We can obviously use one of these to store the tag bit for the 128-bit quadword, and have 3 bits left over (and we didn't even have to use erasure coding). Of course, this would provide no ECC protection for the tag bit itself, but there are three bits remaining which can be used to provide this.

For those interested in more background reading, there are also other papers discussing DRAM ECC schemes which are more efficient than the conventional 64/72 bit Hamming code scheme, which has a redundancy overhead of 12.5%. For example, this paper proposes a scheme with a reduced redundancy of 6.25%, which like the above scheme hinted at by IBM, requires only 4 ECC bits to every 64 bits of data. This paper discusses possible use of erasure coding with ECC DRAM.

To be clear, I don't know what specific ECC scheme IBM is using on their servers nowadays; indeed while textbooks will tell you that ECC memory is typically based around a 64/72 bit Hamming code, it seems likely that all of the major server CPU vendors have long since moved to schemes which offer better performance. What the above shows is that it is feasible for IBM to encode memory tags using ECC bits, and that they can't possibly be doing it any other way.

Since memory tagging schemes are rapidly becoming all the rage, competing CPU vendors should take note. Any patents associated with this scheme should be long expired, so in principle there's nothing to stop Intel, AMD or ARM from implementing a memory tagging extension which can make use of ECC in this way. The requirement for ECC RAM is a drawback, but is not a problem for servers. To my knowledge, research projects such as the CHERI ISA have mainly used MMU-like tag tables with an in-CPU cache for storing tags without requiring ECC RAM. It would be interesting to see an ISA extension which can be agnostic to the storage mechanism, and support both ECC-less designs using tables for embedded and client environments, and ECC-based designs for high performance server environments.

In Closing

This article was published with the approval of Jim Donoghue, who is entirely to credit for discovering that Tags Active mode could be unlocked on Talos II and Blackbird systems. I asked him to let me make these findings public, as neither of us is currently doing anything with the information, and I imagine people will find interesting uses for this memory tagging capability.

Again, the demo program can be found here, and some information on the PowerPC AS ISA extensions can be found in my previous article here.

Feel free to contact me if you have any questions or come up with any interesting further discoveries about this functionality.

As an aside, it might also be possible to get memory tagging working on the PS3, due to its use of IBM's Cell Broadband Engine CPU. It is thought that the Cell has IBM memory tagging support, and IBM has mentioned running IBM i on a PlayStation in a lab in connection with the Cell project (source 1, source 2). (Mentions of the PSone or PS2 in these articles are probably wrong; it seems hard to imagine IBM investing the substantial amount of effort that would be required to port IBM i to MIPS, whereas IBM running IBM i on the Power ISA IBM Cell processor in the PS3 would make much more sense. Indeed the association with the Cell project is explicitly mentioned in these reports. Probably this confusion lies in part with the fact that the PS3 hadn't been publicly announced yet at the time of these news items.)

1. This point is worth drawing attention to; the abstraction of the machine's ISA was so effective in the case of the AS/400 that IBM never even documented the CPU's ISA, nor did they need to. Supposedly, the ISA was System/360-like.

2. The fact that AS/400 and its predecessor System/38 significantly predate Java demonstrates that there really is no “new” idea in computer science that IBM wasn't doing in the 70s, an observation I've become fond of. The explosion of “virtualisation” in IT in the 2000s/2010s is another example; to my knowledge, hardware virtualisation was first invented by IBM in the System/370.

3. As an aside, note also that IBM claims “correction of [...] up to two unknown symbol errors”, which is better than SECDED (single error correction, double error detection). In fact, IBM has long used better-than-SECDED ECC as a marketing point:

Chipkill™ is IBM's trademark for a form of advanced error checking and correcting (ECC) computer memory technology. It protects computer memory systems from any single memory chip failure as well as multi-bit errors from any portion of a single memory chip. One simple scheme to perform this function scatters the bits of a Hamming code ECC word across multiple memory chips. Hence, failure of any single memory chip affects only one ECC bit per word. This technology allows memory contents to be reconstructed despite the complete failure of one chip. Typical implementations use more advanced codes that can correct multiple bits with less load on the hardware. [Wikipedia]

IBM Chipkill memory — IBM Chipkill ECC memory (now in its third generation in industry-standard computers) comes into play only if a server encounters so many errors in a short span of time that Memory ProteXion can’t handle them all. This should be a rare occurrence, but if it does happen you are still protected. Like Memory ProteXion, Chipkill memory goes well beyond the error-correction afforded by standard ECC memory, providing correction for up to four bits per DIMM (eight bits per memory controller), whether on a single chip or multiple. Also like Memory ProteXion, Chipkill support is provided by the memory controller, so it is implemented using standard ECC DIMMs and is transparent to the OS. [IBM whitepaper]

This document providing a more general discussion of POWER9 RAS features may also be of interest.

4. Thus the 8 bits set aside for conventional SECDED ECC in a standard server DIMM are actually only fully necessary in the case where the CPU reads in units of 64 bits, which one cannot imagine is very common for server platforms, with cache lines of 512 or 1024 bits!