Author: Lucian-Ioan Popescu (SRIC 1)
State-of-the-art compilers use Undefined Behavior (UB) to trigger various optimizations. The effects of these optimizations range from changes in code size and execution speed to changes in power consumption. All of these aspects are critical for IoT applications with limited resources. In this work we examine the impact of UB optimizations on IoT devices using Nuttx, a popular real-time operating system for IoT devices; Coremark, a benchmark that measures the performance of microcontrollers; and Clang, the compiler used to build both Nuttx and Coremark.
In C/C++, the standard imposes no requirements on the behavior of a program that triggers UB. The compiler is therefore free to assume that UB never occurs during the transformation passes from C/C++ source code to target-specific assembly code. This example [1] shows how signed overflow UB can lead to better code inside loops. In short, on 64-bit platforms the int type is usually still 32 bits wide, which causes problems when computing 64-bit addresses with 32-bit offsets. Because signed overflow is UB, compilers are free to promote 32-bit int loop counters to 64 bits, leading to shorter and faster code.
Even though the generated code is shorter and faster for this specific example, in the general case we do not know what the final generated code will look like. Transformation passes can interact in unpredictable ways that change the size and speed of the final code.
The first goal of this project was to compile Nuttx with Clang. I chose Clang because much of my earlier work on the impact of UB in other use cases was already done with it. There were already some efforts in this area [2], but I could not use them because they did not provide a complete toolchain for compiling Nuttx.
The ISA used by ESP32 boards is called Xtensa [4] and is designed by Cadence. Much of the work of targeting this architecture in LLVM had already been started by Espressif [3]. The first step in integrating Espressif's LLVM fork into Nuttx was to modify the Nuttx build system so that it could compile with Clang. The patches that I introduced can be found in my fork of Nuttx [5].
In summary, the changes that I needed to make were the following:
After this step was finished, I had to patch Xtensa LLVM so that it could compile all of the Nuttx source code. You can find my fork of Xtensa LLVM here [6]. The modification required in this step was rather simple: fixing a typo in the register info tablegen, where the name of the `intset` register was wrongly typed as `interrupt`. However, the process was time-consuming because I had to debug various parts of the Xtensa backend before getting to the root cause of the problem.
At this point, I successfully compiled Nuttx with Xtensa LLVM.
To run the benchmarks and fetch the results I used Coremark [7]. From their website: “EEMBC’s CoreMark® is a benchmark that measures the performance of microcontrollers (MCUs) and central processing units (CPUs) used in embedded systems. Replacing the antiquated Dhrystone benchmark, Coremark contains implementations of the following algorithms: list processing (find and sort), matrix manipulation (common matrix operations), state machine (determine if an input stream contains valid numbers), and CRC (cyclic redundancy check). It is designed to run on devices from 8-bit microcontrollers to 64-bit microprocessors.”
The results I was interested in are the following: Coremark score (speed of execution), code size, and power consumption. For the first metric I used the output of Coremark, which I will present later. For the second metric I measured the binary size of Nuttx and Coremark after compiling them. For the third metric I used a USB tester [8] that displays the voltage and the current drawn by my ESP32 board [10].
The following is a sample output for coremark:
    2K performance run parameters for coremark.
    CoreMark Size : 666
    Total ticks : 59040
    Total time (secs): 59.040000
    Iterations/Sec : 338.753388
    Iterations : 20000
    Compiler version : Clang 15.0.0 (git@github.com:lucic71/llvm-project-espressif.git ae7b70b2d0097fd6745ebf2ade6fdffccc879142)
    Compiler flags : -fomit-frame-pointer -ffunction-sections -fdata-sections -O2 -fwrapv
    Memory location : Stack
    seedcrc : 0xe9f5
    [0]crclist : 0xe714
    [0]crclist : 0xe714
    [0]crcmatrix : 0x1fd7
    [0]crcstate : 0x8e3a
    [0]crcfinal : 0x382f
    Correct operation validated. See README.md for run and reporting rules.
    CoreMark 1.0 : 338.753388 / Clang 15.0.0 (git@github.com:lucic71/llvm-project-espressif.git ae7b70b2d0097fd6745ebf2ade6fdffccc879142) -fomit-frame-pointer -ffunction-sections -fdata-sections -O2 -fwrapv / Stack
From this output we are interested in the last line. It displays the Coremark score and the compiler configuration used to generate that score. For this benchmark, higher scores represent better results.
To change the compiler configuration, ARCHOPTIMIZATION needs to be changed accordingly in arch/xtensa/src/lx6/Toolchain.defs.
Next I will present all configurations used for this experiment. I used a total of 13 configurations based on various flags that change how the compiler exploits UB.
No | UB flag | Description |
---|---|---|
1 | -fwrapv | Treat signed overflow as two's complement |
2 | -fno-strict-aliasing | Don't use type based alias analysis |
3 | -fstrict-enums | Enable optimizations that take advantage of enum's value range |
4 | -fno-delete-null-pointer-checks | Assume that programs can safely dereference null pointers |
5 | -fno-finite-loops | Don't assume that all loops are finite |
6 | -fconstrain-shift-value | Constrain shift RHS so it doesn't produce undefined results when RHS >= bitwidth |
7 | -fno-constrain-bool-value | Don't constrain bool values to {0,1} |
8 | all + -O2 | All flags from above + -O2 |
9 | all + -Os | All flags from above + -Os |
10 | base + -O2 | No flag from above + -O2 |
11 | base + -Os | No flag from above + -Os |
12 | -fno-use-default-alignment | Use alignment of one for all memory operations |
The first set of results covers power consumption. For all benchmark configurations the measured current was 90mA and the measured voltage was 5.11V. Because the USB tester that I used has a resolution of 10mA, it cannot distinguish between values in the range from 85mA to 95mA, so all configurations show the same value, i.e. 90mA. In idle mode, the board consumed 70mA.
Note that during the experiments, the board was put in Modem-sleep power mode [9] at normal speed (80MHz). The datasheet states that in this configuration, the board should consume between 20mA and 30mA, not 70mA as presented on the USB tester. The reasons for this discrepancy are unknown at this moment.
The next set of results concerns code size. After each compilation with a particular compiler configuration, the size of the generated binaries, nuttx and nuttx.bin, was recorded. The results are presented in the following plot:
`all + -O2` increases the code size by a small percentage compared to the configurations to its left, i.e. the configurations where a single flag is used. Furthermore, `all + -O2` and `all + -Os` both increase the code size compared to their counterparts, `base + -O2` and `base + -Os`, by 1% in both cases for nuttx.bin. Thus, for Nuttx and Coremark, the code size increases when the UB-related flags from the table are enabled.
The final set of results covers the coremark score for each configuration. The score is extracted from the coremark output presented in the last section.
There is no noticeable difference between `base + -O2` and the configurations that use the UB-related flags. What is interesting, however, is the impact of `all + -Os` compared with `base + -Os`: performance decreases by 1%.
Note that no result set contains numbers for the -fno-use-default-alignment configuration. Nuttx crashes when compiled with this flag, so no benchmark can be run. Compared to x86, for which this flag was initially targeted, Xtensa has stricter alignment rules that cannot be modified.
The results show that there is not much difference in terms of code size, code speed and power consumption in the context of undefined behavior for Nuttx and Coremark. One reason for that might be that while developing those systems, the developers made little to no use of undefined behavior. Another reason might be that Xtensa LLVM cannot take proper advantage of undefined behavior when triggering optimizations for Nuttx and Coremark.
These are interesting paths worth researching further. Moreover, rerunning the experiments with a higher-resolution USB tester could give more accurate results regarding the power consumption of the Xtensa processor.
[1] A bit of background on compilers exploiting signed overflow
[2] NuttX and Clang
[3] Fork of LLVM targeted at Xtensa
[4] Xtensa ISA
[5] My fork of Nuttx
[6] My fork of Xtensa LLVM
[7] Coremark
[8] Uni-T UT658 USB tester, LCD display, 9999 mAh
[9] ESP32 datasheet
[10] ESP32 development board, DEVKIT V1