01 - ARM CPUs & SoCs

The handheld emulator on the final slide is an Anbernic RG35XX. Check out a review if you're interested.

Contents

Lecture Notes

As you well know, the CPU (Central Processing Unit) is the brain of all of our contemporary, so-called smart, devices: PCs, laptops, mobile phones, TVs, cars, and now even our household appliances such as fridges and microwaves (with questionable utility). But how (and what) are they made (of)? How do they work? What does their firmware / kernel code look like?

Those questions are best answered by embarking on a multi-year study in the Computer Engineering field, following disciplines ranging from Computer Science, Software Programming and Operating Systems to Robotics, Electronics and even quantum physics.

But, for our goals here, some quick background info should suffice.

Processors, Chips and Systems

A CPU is an integrated circuit made of billions (some are even close to a trillion!) of miniature transistors (plus other types of electronic components such as capacitors, resistors and inductors) closely packed into a substrate using Very Large Scale Integration (VLSI) technologies. Actually, this is the definition of an Integrated Circuit (IC), the class of devices to which the processor belongs! There are many other types of ICs, such as amplifiers, voltage regulators, sensors, high-power motor drivers etc.

So let's put it this way: the processor is an IC which receives some input (instructions and data) and emits some outputs according to a well-known logic design, accomplishing logical & arithmetic calculations, instruction flow control, communication with peripherals using various electrical protocols etc.

An important consideration, though: a CPU cannot function alone! For starters, we need separate chips and circuits for memories, input / output peripherals and communication buses to connect them all. Plus, every electronic device also requires some carefully controlled power supplies (whose voltage deviations might permanently damage it), external oscillators to keep the clocks steady and several other components which cannot be integrated due to electrical / physical considerations (e.g., minimum distances and size constraints).

For these reasons, we rarely see CPUs alone in the wilderness. More often, they are mounted – either soldered or socketed – onto Printed Circuit Boards (PCBs) which bring together all of those extra components. This makes a computing system.

Which brings us to the latest idea: the System-on-Chip (SoC), which goes all in on the miniaturization trend and tries to include as many of these pieces as possible into a single silicon package. This made devices such as smartphones possible!

Note that a PCB is still necessary to solder the SoC on (since it's pretty fragile on its own), mainly due to the fact that we need its small pins exposed as connectors to enable communication with the outside world! Hence, the Single Board Computer (SBC) category is often used for describing such computers (although the majority of boards make up the innards of various commercial electronic products).

CPU Architectures

There is one more concept we need to familiarize ourselves with before beginning to discuss the specifics of a single CPU family: what is a CPU architecture?

Recall that the computer programs you write using a programming language (just ignore the interpreted ones) will usually get compiled to an assembly language, which will ultimately get translated into machine code. The machine code is made of specific binary instructions that the processor needs to understand in order to properly execute them. It is highly desirable (and enforced by market pressure) that already-compiled programs continue to work the same on all newer models and variations of a CPU (there are some who even upgrade their components yearly!). Engineers answered this problem with a standardization effort resulting in an Instruction Set Architecture (ISA). These are well documented and, very often, made freely available to all developers in the form of PDFs totalling tens of thousands of pages (although they remain protected by copyright and patents).
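To make the idea of "binary instructions the processor decodes" concrete, here is a toy sketch: a made-up, drastically simplified "ISA" where each instruction is an opcode plus an argument, and a loop that decodes and executes them. None of the opcodes or encodings below correspond to any real architecture.

```python
# Toy "machine code" interpreter -- opcodes and encoding are invented.
def run(program):
    """Execute a tiny two-register machine; each instruction is (opcode, arg)."""
    regs = {"r0": 0, "r1": 0}
    pc = 0                          # program counter
    while pc < len(program):
        op, arg = program[pc]       # fetch + decode
        if op == 0x01:              # LOAD r0, imm
            regs["r0"] = arg
        elif op == 0x02:            # LOAD r1, imm
            regs["r1"] = arg
        elif op == 0x03:            # ADD  r0, r1
            regs["r0"] += regs["r1"]
        elif op == 0xFF:            # HALT
            break
        pc += 1
    return regs["r0"]

print(run([(0x01, 40), (0x02, 2), (0x03, 0), (0xFF, 0)]))  # 42
```

A real CPU does essentially this fetch-decode-execute loop in hardware, billions of times per second; the ISA is the document that pins down exactly what each bit pattern must do.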

Now, in order for CPU producers to update their models and keep up with Moore's Law, they usually employ some ingenious tricks to optimize the inner workings of the chips while keeping the ISA unchanged (at most, new features may be added, but always in a backward-compatible manner). As such, the internal architecture changes with every new release but, from a user / developer perspective, it can be regarded as a black box and is safe to ignore.

To summarize, the ISA is a contract that processor manufacturers must respect to the letter in order to ensure compatibility with already-written software. Each chip might have a different internal architecture (it may even translate instructions into a different internal micro-instruction set), but the external interface must be well documented.

There are many different CPU architectures available (either now, or historically) on the market; some of the most popular ones:

  • Intel's x86, the first commercially successful (and still popular) CPU family which is synonymous with Personal Computing;
  • AMD's x86-64, which brings several welcome upgrades to Intel's architecture (primarily, a 64-bit addressing scheme, hence its name); remember the backwards-compatibility requirements of an ISA? We still speak of it as being x86 because it can still run, unmodified, programs written 40 years ago for the original models!
  • The PowerPC ISAs, developed by an Apple-IBM-Motorola alliance and, most notably, used in the original Macintoshes (which switched to Intel quite a long time ago);
  • RISC-V: a somewhat newer, Open-Source ISA, it's less popular than its competitors, though it's receiving a lot of love from ongoing research projects, so it might see a bright future ahead;
  • And finally, we introduce our main focus: the ARM family! We describe it more thoroughly in the next section; for now, suffice it to say: it's the de-facto standard for embedded and mobile computing, especially in low-power applications!

The ARM Family

At the time of its inception (1983), ARM stood for “Acorn RISC Machine”, referencing the Cambridge, England-based company Acorn Computers. Their Acorn Archimedes was one of the first personal computers (PCs) to use a RISC processor. Later, in 1990, ARM Ltd. was founded as a joint venture between Acorn Computers, VLSI Technology and Apple. The meaning of the acronym changed to “Advanced RISC Machines” at Apple's request, as they did not want to advertise any ties with their former competitor.

Today, ARM Ltd. owns the rights to the ARM instruction set architecture (ISA) and to individual components used in processor implementations. These Intellectual Properties (IPs) are licensed to other companies such as NXP and Microchip so they can develop their own implementations of the ARM standard (e.g.: the i.MX8M SoC that we are working on). While certain elements such as the Generic Interrupt Controller (GIC) are almost guaranteed to be found in each CPU implementation, others, such as the Memory Controllers, are often vendor-specific. Meanwhile, components such as the System MMU (commonly known as the IOMMU on x86) may be absent altogether.

ARM processors can usually be classified according to type and generation. The type refers to their intended use:

  • Cortex-A (Applications): Focused on performance and capable of running an Operating System; this has been the go-to architecture for mobile devices and, very recently, it has become an important contender in the personal computing (laptops & workstations) and even server markets!
  • Cortex-M (Microcontrollers): Focused on simplicity, reduced power consumption and, of course, small prices! Used for accomplishing simple automation tasks in all sorts of electronic equipment (e.g., coffee machines and all sorts of appliances, Internet of Things devices like smart lightbulbs etc.);
  • Cortex-R (Realtime): Focused on real-time applications. Needs to guarantee a bounded response time and, most importantly, fault tolerance (with important applicability in cars, medical devices, oh, and let's not forget: space!).

In terms of generational versioning, we will only present the three most recent:

  • ARMv7: Older, 32-bit architecture. Still possible to encounter it;
  • ARMv8: Introduced in 2011, it is 64-bit but has a 32-bit compatibility mode; the most common today.
  • ARMv9: A very recent (2021) addition, not yet widely used; introduces some interesting security & tracing features.

In terms of software support: although traditional Operating Systems (Windows), programs and games written for the x86 ISA will not run as-is on an ARM device, both architectures mostly implement the same features (arithmetic / logic operations, memory reads / writes, even cryptographic and graphics acceleration), and modern compilers (GCC, Clang) have extensive support for almost all of them, so an application may often be ported to another platform by a simple re-compilation (there are, of course, exceptions in some code bases). Additionally, if this is not feasible (no access to the source code), emulator software can be employed to automatically translate instructions between different ISAs (with some – more or less significant – performance hit; see Apple's Rosetta for a nice example).
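What such an emulator does can be caricatured in a few lines: map each guest instruction to an equivalent host operation and execute those instead. Both "ISAs" below are invented for illustration; real binary translators work on actual machine code and cache the translated blocks for performance.

```python
# Toy sketch of binary translation -- both instruction sets are invented.
def translate_and_run(guest_code):
    """Run 'guest' instructions by mapping each one to a host-side routine."""
    host_equivalents = {
        "GUEST_SET": lambda regs, r, v: regs.__setitem__(r, v),
        "GUEST_ADD": lambda regs, r, s: regs.__setitem__(r, regs[r] + regs[s]),
    }
    regs = {"x0": 0, "x1": 0}
    for op, *args in guest_code:
        host_equivalents[op](regs, *args)   # execute the translated operation
    return regs["x0"]

print(translate_and_run([("GUEST_SET", "x0", 6),
                         ("GUEST_SET", "x1", 7),
                         ("GUEST_ADD", "x0", "x1")]))  # 13
```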

Recall from earlier that a CPU's ISA will make sure that applications remain compatible with newer models. For x86, this holds even for low-level software: entire Operating Systems are able to run unchanged across tens of years of CPU model advances. This is possible because not only the ISA was standardized, but the entire System Architecture as well (which is amazing in its own right)!

Unfortunately, this is the most important shortcoming of the ARM platform: every System-on-Chip manufacturer changes the inner workings of a system – the set of components, their behaviour and the addresses they are mapped to – even between versions and very small variations of a model! Although compiled programs will still run on the CPU, those that use hardcoded addresses for a specific peripheral will often crash when trying to communicate with it. This issue is somewhat eased by a well-designed software architecture and development model for low-level components (firmware, bootloader, OS kernel), of which you will see examples throughout the labs.

Processor protection domains

In this section, we try to summarize some of the traditional responsibilities of a CPU. If you are already acquainted with them (or some of them), feel free to skim past!

x86 protection modes

Primarily developing for x86, you may be familiar with its protection rings, or at least with two of them: the ones traditionally used for the separation between user space and kernel space. When the x86 architecture was first introduced in the late '70s, the designers expected OS developers to need a mechanism to isolate critical components (e.g.: drivers) from regular applications. As a result, they implemented four levels of isolation (which the OS developers more or less ignored :D ):

  • Ring 3: the User Space, where access to certain privileged instructions such as RDMSR (Read Model Specific Register) is restricted; others, like RDPMC (Read Performance-Monitoring Counters), can be enabled here, but this is more the exception than the norm;
  • Ring 2: originally meant to host I/O drivers; while it was used for that in some operating systems (e.g. IBM's OS/2, in 1987), it no longer serves any purpose today;
  • Ring 1: this was supposed to host drivers, separating them from the core systems of the kernel; although similar to Ring 2 in purpose (i.e., not having one anymore), some virtual machine software, such as VirtualBox, uses it to run the guest operating system;
  • Ring 0: also known as kernel space; this is where the most privileged code runs, with (almost) unrestricted access to the hardware.

A special System Call instruction is required to get from a lower privilege level to a higher one. It is usually implemented as a software interrupt / trap by the CPU: when invoked by, e.g., a user program, the CPU will pause it, save its state (program counter, flags) on the stack and invoke a special routine registered by the Operating System. Control is passed to the OS kernel, which will analyze (via a standardized calling convention) and execute the request (such as reading from / writing to the filesystem / disk / network / USB devices etc. – which the application would not normally be privileged enough to accomplish; remember: hardware access is quite restricted from the upper rings).
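The control flow described above can be sketched as a toy model. The syscall number, handler and state fields below are all invented for illustration; real kernels do this in assembly, with hardware support for the mode switch.

```python
# Toy model of a system-call trap -- names and numbers are made up.
SYS_WRITE = 1                      # hypothetical syscall number

def sys_write(buf):
    return len(buf)                # the "kernel" does the privileged work

syscall_table = {SYS_WRITE: sys_write}

def trap(user_state, number, arg):
    saved = dict(user_state)               # save PC, flags, ... on the stack
    user_state["ring"] = 0                 # CPU switches to kernel mode
    result = syscall_table[number](arg)    # kernel dispatches the request
    user_state.update(saved)               # restore state, back to Ring 3
    return result

state = {"ring": 3, "pc": 0x1000}
print(trap(state, SYS_WRITE, b"hello"))    # 5
assert state["ring"] == 3                  # user state restored on return
```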

Virtual Memory, paging and address translation

Paging (see Fig. 1) is an architectural feature of all modern processors that allows the Operating System to present each process with a different view of its memory. When said process tries to access a page (e.g., a 4KB block), the address is translated by the Memory Management Unit (MMU), a hardware component, by means of a data structure called the Page Table, unique to each process and residing in kernel memory. This allows the kernel to hide parts of memory (e.g.: that of other processes) as a means of isolation. Or to over-commit resources, only actually allocating them when needed (e.g.: malloc()-ed memory is assigned to the process only after it is first accessed).

Figure 1: Translation of a virtual address to its physical equivalent via the Memory Management Unit. Each process has its own unique view of how objects were loaded in memory. At the same time, the kernel appears the same to all processes.
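Ignoring the multi-level structure of real page tables, the translation step itself can be sketched like this (the page-table contents are made up; real x86-64 and ARMv8 tables have several levels walked in sequence):

```python
# Toy single-level MMU walk with 4 KiB pages -- table contents are invented.
PAGE_SIZE = 4096

page_table = {            # virtual page number -> physical frame number
    0x10: 0x80,
    0x11: 0x2F,
}

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # split: page number + offset
    if vpn not in page_table:
        raise MemoryError("page fault at 0x%x" % vaddr)
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x10123)))  # 0x80123
```

An access to an unmapped page raises a page fault, which is exactly the hook the kernel uses to implement the lazy allocation mentioned above: on fault, it can map a fresh frame and retry the access.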

Because changing the active Page Table is an expensive operation (mostly due to CPU cache invalidation), the Virtual Address Space of each process also contains the kernel memory mapped in the higher half. Yes, the kernel also uses virtual addressing for its own memory. When the kernel needs to intervene on behalf of the unprivileged process (e.g.: when a System Call is performed), the CPU state transitions from Ring 3 to Ring 0, but the Page Table does not need to be switched out. Although this technique increases the overall system performance, it also raises a question: how do we stop an unprivileged process from accessing kernel memory, since it's already mapped in its virtual address space?

The answer is that, aside from information relevant to the address translation itself, the Page Table also contains access restrictions. Each memory transaction that leads to an address translation also presents its intent, e.g.: whether it wants to write data to memory or fetch an instruction to execute. This allows the MMU to block such accesses depending on the Read-Write-Execute permissions associated with each page. However, this is only one example of a restriction that can be enforced. The Page Table can also restrict access based on privilege levels. Unfortunately, the architecture defines only two such levels: privileged (Rings 0-2) and unprivileged (Ring 3).
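A sketch of how these checks fit into translation (the entry layout below is invented; real page-table entries pack the permissions and the user/supervisor flag into hardware-defined bits):

```python
# Toy per-page access control -- entry format is made up for illustration.
PAGE_SIZE = 4096

page_table = {
    # vpn: (physical frame, perms, user_accessible)
    0x10: (0x80, "rw-", True),    # ordinary user data page
    0x7F: (0xFF, "rw-", False),   # kernel page mapped in the higher half
}

def access(vaddr, intent, ring):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    frame, perms, user_ok = page_table[vpn]
    if ring == 3 and not user_ok:             # the user/supervisor check
        raise PermissionError("user access to kernel page")
    if intent not in perms:                   # intent is 'r', 'w' or 'x'
        raise PermissionError("%s not allowed on this page" % intent)
    return frame * PAGE_SIZE + offset

print(hex(access(0x10004, "w", ring=3)))      # 0x80004 -- allowed
# access(0x7F000, "r", ring=3) would raise PermissionError,
# while the same access with ring=0 succeeds.
```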

Fun fact, if you're wondering why Rings 1 and 2 were abandoned: originally, when the x86 CPU was designed, it didn't have the memory paging feature, but a precursor called segmentation (which, thankfully, we'll not cover here). A segment stored the maximum privilege level it could be accessed from using two bits (thus values 0–3), so the CPU could fully check permissions for any of its Rings. But, because sacrificing performance only to restrict access to a few privileged instructions was simply not worth it, these mechanics were not carried over to the page table, thus the other two rings lost their advantages and were forsaken.

Nonetheless, the two inner protection rings are still implemented on x86 CPUs to this day. The question is: why? Based on the announcement of the new x86S architecture, which is supposed to eliminate the 16-bit and 32-bit modes of 64-bit processors, it's safe to rule out backward compatibility. The real reason is probably that it's just cheaper this way (changing the logic design of a processor is risky, and requires extensive testing and very costly prototyping).

Want to know more curiosities about the x86 privilege levels? Check this out!


Because a CPU architecture never stops evolving, new protection modes and extensions had to be added along the way. Some more unnerving than others:

  1. Ring -1: The Hypervisor Mode. A CPU state that integrates with other extensions (e.g.: Two-Stage Address Translation, IOMMU) in order to manage guest Virtual Machines more efficiently.
  2. Ring -2: System Management Mode. When entering this mode, the execution of any currently running program (including the Hypervisor) is suspended. Control is passed either to an alternate OS usually residing in proprietary firmware, or to a hardware debugger.
  3. Ring -3: The Intel Management Engine is a co-processor that is always active as long as the motherboard has a power source (line power is not even needed; the internal battery is sufficient). Although its functionality is not publicly documented, reverse engineers have figured out that it enforces Verified Boot and has DRM and TPM functionalities.
  4. Ring -4: A deeply embedded core that was discovered in some Intel CPUs and was presented at BlackHat 2018. This is essentially a hidden co-processor that shares an execution pipeline and some of its registers with the main processor. A transition to this mode can be performed by a knowledgeable attacker from any privilege level, including Ring 3. While in Ring -4, the executing (normally unprivileged) code presumably has access to all system resources, similarly to Ring -3.


Finally, please note that, although we described the virtual memory mechanisms of x86, the concepts are really the same for all other architectures (of course, the configuration registers and page entry structure will differ, but they all share a common feature set)!

ARM exception levels

In ARM's nomenclature, the CPU protection modes are called Exception Levels. Although they are analogous to x86's rings, they feature two significant improvements: first, the standardization of the most important modes for user space, kernel space and hypervisor (for running multiple OSes in Virtual Machines); second, a hardware-enforced separation between the Secure and Non-Secure Worlds, but this will be discussed in Lecture 03.

Figure 2: ARM Exception Levels.

Usually, there are three exception levels:

  • EL0: User Space (equivalent to Ring 3 from x86);
  • EL1: Kernel Space (~Ring 0);
  • EL2: Hypervisor (~Ring -1); notably absent from the Secure World;

But, with the introduction of the ARM TrustZone security extensions, [almost] all of these modes were vertically partitioned into two security domains. To make it possible to switch between them, a new Exception Level – EL3 (the Secure Monitor) – was added.

Bonus: if you can't wait until Lecture 03 to find out about ARM's Trusted Execution features, expand!


On the Secure World side, we've got:

  • S-EL0: In this mode, Trusted Applications (TA) are being executed. TAs are system-critical functions that can be invoked from anywhere in the Non-Secure World (e.g.: encrypting application data when it needs to be saved to persistent storage; can be done explicitly by the application or implicitly by the kernel driver). We'll take a look at how TAs are written in the second lab.
  • S-EL1: This is the Trusted OS. Similarly to a regular OS, it manages access to (some) devices such as the Trusted Platform Module (TPM). Additionally, it must be able to interpret TA invocations from the Non-Secure World while providing the TAs functionalities similar to what Linux offers its processes via system calls. For example, if a TA wants to open a file in a secure disk partition, running in user space it will not have direct access to the File System Layer or the underlying block device. Secure or Non-Secure, it still runs in an unprivileged processor state, without direct access to the hardware.
  • EL3: The Secure Monitor acts as a bridge between the Secure and Non-Secure Worlds. All TA invocations are performed by means of a Secure Monitor Call (SMC). Think of SMCs as system calls that, instead of transitioning from EL0 to EL1, reach the TAs the following way: NS-EL0 → NS-EL1 → NS-EL2 → EL3 → S-EL1 → S-EL0.
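The invocation path above can be sketched as follows (the levels come from the text; the "hop" logic is, of course, a stand-in for the real trap handlers at each level):

```python
# Sketch of the exception-level round trip of a TA invocation.
SMC_PATH = ["NS-EL0", "NS-EL1", "NS-EL2", "EL3", "S-EL1", "S-EL0"]

def invoke_trusted_app(request):
    trace = list(SMC_PATH)               # each hop traps into the next level
    response = "handled(%s)" % request   # the TA does its work at S-EL0
    trace += reversed(SMC_PATH[:-1])     # ...and the result travels back down
    return response, trace

response, trace = invoke_trusted_app("encrypt-block")
print(trace[0], "->", trace[5], "->", trace[-1])  # NS-EL0 -> S-EL0 -> NS-EL0
```

Note how every call crosses EL3 twice: the Secure Monitor is the only gate between the two Worlds.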

Although chances are you haven't heard of it, Intel had a similar solution called the Software Guard Extensions (SGX). This extension was meant to protect small amounts (~72MB) of sensitive (user-space) application data and code from a potentially malicious OS. This was realized by restricting access to the protected memory ranges (Enclaves) to code already residing in the Enclave. Additionally, calls to Enclave functions could be made only via a strictly enforced API defined by the user at compile time; so no arbitrary jumps after a return-to-libc. There are numerous reasons why this technology failed. The main one would be that it did not work: researchers have found dozens of ways to break the isolation guarantees that SGX was supposed to offer, most of them relying on side-channel attacks (i.e.: deducing user secrets by observing how the target process influences the system). This, coupled with the lack of an isolated privileged mode like ARM's S-EL1 and the fact that Intel's remote attestation of SGX-capable CPUs and secure applications could not be offloaded to third parties, more or less guaranteed its fade from relevance.



The boot process

Remember the simplicity of x86's boot process? When you turned on the computer, the BIOS would initialize all required components and peripherals (RAM, keyboard and disks). After that, it would iterate through all persistent storage devices (in a configurable order), pick the first one where a bootloader was detected in the first 512-byte sector and continue execution from there! The bootloader would, optionally, present a menu to the user to choose an operating system (with a timeout auto-selection), load the kernel into memory and voila, startup process complete!
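The BIOS's device-selection step can be sketched in a few lines (the 0x55AA signature at offset 510 of the first sector is the real MBR convention; the "disks" here are obviously fabricated byte strings):

```python
# Sketch of the BIOS bootability check on the first 512-byte sector.
def is_bootable(sector):
    """An MBR boot sector ends with the magic bytes 0x55 0xAA."""
    return len(sector) == 512 and sector[510:512] == b"\x55\xaa"

blank_disk = bytes(512)                   # no bootloader installed
boot_disk  = bytes(510) + b"\x55\xaa"     # bootloader code would precede this

# The BIOS walks the disks in the configured order and jumps into the
# first sector that carries the signature.
disks = [blank_disk, boot_disk]
chosen = next(d for d in disks if is_bootable(d))
print(is_bootable(chosen))  # True
```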

Unfortunately, things are not that simple in an ARM ecosystem:

Figure 3: ARM Trusted Firmware booting process.