Power Consumption Impact of Undefined Behavior Optimizations on Nuttx

Introduction

State-of-the-art compilers use Undefined Behavior (UB) to trigger various optimizations. Their effects range from changes in code size and execution speed to changes in power consumption. All of these aspects are critical for IoT applications with limited resources. In this work we examine the impact of UB optimizations on IoT devices using Nuttx, a popular real-time operating system for IoT devices; Coremark, a benchmark that measures the performance of microcontrollers; and Clang, the compiler used to build both Nuttx and Coremark.

Background

In C/C++, UB imposes no requirements on the behavior of the program. The compiler is therefore free to assume that UB never occurs while it transforms C/C++ source code into target-specific assembly code. This example [1] shows how signed overflow UB can lead to better code inside loops. In short, on 64-bit platforms the int type is still 32 bits wide, which causes extra work when computing 64-bit addresses with 32-bit offsets. Compilers use signed overflow UB as a free ticket to promote 32-bit loop counters to 64 bits, leading to shorter and faster code.
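
The kind of loop this concerns can be sketched as follows. This is an illustrative example, not the code from [1]; the function name `sum_strided` and its parameters are assumptions made for the sketch. Because `i += stride` on a signed int cannot overflow in a valid program, the compiler may keep `i` in a 64-bit register instead of sign-extending it before every address computation. With `-fwrapv` it has to preserve 32-bit wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sums every `stride`-th element of `data` in [0, n). */
int64_t sum_strided(const int32_t *data, int n, int stride)
{
    assert(stride > 0);
    int64_t total = 0;
    /* `i` is a 32-bit signed int used to index memory through a pointer
     * that is 64 bits wide on a 64-bit target. Overflow of `i += stride`
     * is UB, so the compiler may assume it never happens and widen `i`
     * once rather than on every iteration. */
    for (int i = 0; i < n; i += stride) {
        total += data[i];
    }
    return total;
}
```

Whether the widening actually happens depends on the target and on the other passes that run, which is the point made in the next paragraph.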

Even though the generated code is shorter and faster for this specific example, in the general case we do not know what the final generated code will look like. Transformation passes can interact in unpredictable ways that change the size and speed of the final code.

Compiling Nuttx with Clang

The first goal of this project was to compile Nuttx with Clang. I chose Clang because much of my earlier work on the impact of UB in other use cases was already done with it. There were already some efforts in this area [2], but I could not use them because they did not provide the complete toolchain for compiling Nuttx.

The ISA used by ESP32 boards is designed by Cadence and is called Xtensa [4]. Much of the work of targeting this architecture in LLVM was already started by Espressif [3]. The first step of integrating Espressif's LLVM fork into Nuttx was to modify the Nuttx build system so that it could compile with Clang. The patches that I introduced can be found in my fork of Nuttx [5].

In summary, the changes I had to make were the following:

Add `-target xtensa` to ARCHCFLAGS and ARCHCXXFLAGS
Use binutils' linker and libraries
Patch _bbci and srli assembler instructions in source files because they were not correctly handled
Modify the generated .config file to replace CONFIG_XTENSA_TOOLCHAIN_ESP with CONFIG_XTENSA_TOOLCHAIN_XCLANG

After this step was finished, I had to patch Xtensa LLVM so that it could compile all of the Nuttx source code. You can find my fork of Xtensa LLVM here [6]. The modification itself was rather simple: fixing a typo in the register info tablegen, where the name of the `intset` register was wrongly written as `interrupt`. However, the process was time consuming because I had to debug various parts of the Xtensa backend before getting to the root cause of the problem.

At this point, I successfully compiled Nuttx with Xtensa LLVM.

Running the Benchmarks

To run the benchmarks and fetch the results I used Coremark [7]. From their website: “EEMBC’s CoreMark® is a benchmark that measures the performance of microcontrollers (MCUs) and central processing units (CPUs) used in embedded systems. Replacing the antiquated Dhrystone benchmark, Coremark contains implementations of the following algorithms: list processing (find and sort), matrix manipulation (common matrix operations), state machine (determine if an input stream contains valid numbers), and CRC (cyclic redundancy check). It is designed to run on devices from 8-bit microcontrollers to 64-bit microprocessors.”

The results I was interested in are the following: Coremark score (speed of execution), code size and power consumption. For the first metric I used the output of Coremark, which I will present later. For the second metric I measured the binary size of Nuttx and Coremark after compiling them. For the third metric I used a USB tester [8] that displays the voltage and the current drawn by my ESP32 board [10].

The following is a sample output for coremark:

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 59040
Total time (secs): 59.040000
Iterations/Sec   : 338.753388
Iterations       : 20000
Compiler version : Clang 15.0.0 (git@github.com:lucic71/llvm-project-espressif.git ae7b70b2d0097fd6745ebf2ade6fdffccc879142)
Compiler flags   : -fomit-frame-pointer -ffunction-sections -fdata-sections -O2 -fwrapv
Memory location  : Stack
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x382f
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 338.753388 / Clang 15.0.0 (git@github.com:lucic71/llvm-project-espressif.git ae7b70b2d0097fd6745ebf2ade6fdffccc879142) -fomit-frame-pointer -ffunction-sections -fdata-sections -O2 -fwrapv / Stack

From this output we are interested in the last line. It displays the Coremark score and the compiler configuration used to generate it. For this benchmark, higher scores represent better results.
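
The score on that line is the `Iterations/Sec` value, which follows from two other fields in the output. Below is a minimal sketch of the arithmetic, assuming 1000 ticks per second (which is what the sample's `Total ticks` and `Total time` values imply). The function name `coremark_score` is hypothetical and is not part of Coremark.

```c
#include <assert.h>

/* Hypothetical helper, not part of Coremark: iterations per second
 * computed from the iteration count and the elapsed tick count. */
double coremark_score(unsigned long iterations, unsigned long total_ticks,
                      unsigned long ticks_per_sec)
{
    assert(total_ticks > 0 && ticks_per_sec > 0);
    double seconds = (double)total_ticks / (double)ticks_per_sec;
    return (double)iterations / seconds;
}
```

With the sample values, `coremark_score(20000, 59040, 1000)` gives roughly 338.75, which matches the reported score.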

To change the compiler configuration, ARCHOPTIMIZATION needs to be changed accordingly in arch/xtensa/src/lx6/Toolchain.defs.

Next I will present all configurations used for this experiment. I used a total of 12 configurations based on various flags that change the behavior of the compiler with regard to exploiting UB.

No	UB flag	Description
1	-fwrapv	Treat signed overflow as two's complement
2	-fno-strict-aliasing	Don't use type based alias analysis
3	-fstrict-enums	Enable optimizations that take advantage of enum's value range
4	-fno-delete-null-pointer-checks	Assume that programs can safely dereference null pointers
5	-fno-finite-loops	Don't assume that all loops are finite
6	-fconstrain-shift-value	Constrain shift RHS so it doesn't produce undefined results when RHS >= bit width
7	-fno-constrain-bool-value	Don't constrain bool values in {0,1}
8	all + -O2	All flags from above + -O2
9	all + -Os	All flags from above + -Os
10	base + -O2	No flag from above + -O2
11	base + -Os	No flag from above + -Os
12	-fno-use-default-alignment	Use alignment of one for all memory operations
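
To make the table more concrete, the following sketch shows the kind of source pattern that three of these flags affect. The function names are hypothetical, and the comments describe what a compiler is permitted to do, not what Xtensa LLVM is known to do for Nuttx or Coremark.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Signed overflow (-fwrapv): without the flag the compiler may fold this
 * to `return 1`, because x + 1 overflowing would be UB. With -fwrapv the
 * addition wraps, so the comparison is false for x == INT_MAX. */
int next_is_greater(int x)
{
    return x + 1 > x;
}

/* Null pointer checks (-fno-delete-null-pointer-checks): `p` is
 * dereferenced before the check, so by default the compiler may assume
 * p != NULL and delete the check. */
int read_then_check(const int *p)
{
    int value = *p;
    if (p == NULL) {
        return -1;
    }
    return value;
}

/* Shift amounts: shifting by the bit width or more is UB in standard C,
 * so this sketch requires the caller to keep `amount` below it. */
unsigned int shift_left(unsigned int value, unsigned int amount)
{
    assert(amount < sizeof(value) * CHAR_BIT);
    return value << amount;
}
```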

Results

The first set of results covers power consumption. For all benchmark configurations the measured current was 90mA and the voltage was 5.11V. Because the USB tester I used has a resolution of 10mA, it cannot distinguish values between 85mA and 95mA, so every configuration reads 90mA. In idle mode, the board consumed 70mA.
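
These readings translate into power as P = V × I. The sketch below shows that arithmetic using the values reported above. The helper name `power_mw` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: power in milliwatts from volts and milliamps. */
double power_mw(double volts, double milliamps)
{
    assert(volts >= 0.0 && milliamps >= 0.0);
    return volts * milliamps;
}
```

`power_mw(5.11, 90.0)` is about 460mW under load and `power_mw(5.11, 70.0)` is about 358mW in idle mode. A 10mA band at 5.11V spans about 51mW, so any difference between configurations smaller than that is invisible to this tester.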

Note that during the experiments, the board was put in Modem-sleep power mode [9] at normal speed (80MHz). The datasheet states that in this configuration, the board should consume between 20mA and 30mA, not 70mA as presented on the USB tester. The reasons for this discrepancy are unknown at this moment.

The next set of results concerns code size. After each compilation with a particular compiler configuration, the sizes of the generated binaries, nuttx and nuttx.bin, were recorded. The results are presented in the following plot:

`all + -O2` increases the code size by a small percentage compared to the configurations to its left, i.e. the configurations where a single flag is used. Furthermore, `all + -O2` and `all + -Os` both increase the code size compared to their counterparts, `base + -O2` and `base + -Os`, by 1% in each case for nuttx.bin. Thus, for Nuttx and Coremark, the code size increases when the UB-related flags from the table are applied.

The final set of results covers the Coremark score for each configuration. The score is extracted from the Coremark output presented in the previous section.

There is no notable improvement between `base + -O2` and the configurations that use the UB flags. What is interesting, however, is the impact of `all + -Os` compared with `base + -Os`: performance decreases by 1%.

Note that no result set contains numbers for the -fno-use-default-alignment configuration. Nuttx crashes when compiled with this flag, so no benchmark can be run. Compared to x86, for which this flag was initially targeted, Xtensa has stricter alignment rules that cannot be relaxed.
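
As an illustration of why alignment matters, the sketch below reads a 32-bit value from an address that carries no alignment guarantee. It is a generic C example, not code from Nuttx, and the function name `load_u32_unaligned` is hypothetical. Casting a misaligned address to `uint32_t *` and dereferencing it is UB in C and can fault on targets that require aligned word accesses, whereas `memcpy` lets the compiler pick an access sequence that is valid for the target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a 32-bit value from an address with no
 * alignment guarantee. */
uint32_t load_u32_unaligned(const unsigned char *src)
{
    assert(src != NULL);
    uint32_t value;
    /* memcpy makes no alignment assumption about src. */
    memcpy(&value, src, sizeof value);
    return value;
}
```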

Conclusions and Further Work

The results show that there is not much difference in terms of code size, code speed and power consumption in the context of undefined behavior for Nuttx and Coremark. One reason for that might be that while developing those systems, the developers made little to no use of undefined behavior. Another reason might be that Xtensa LLVM cannot take proper advantage of undefined behavior when triggering optimizations for Nuttx and Coremark.

These are interesting paths worth researching further. Moreover, rerunning the experiments with a more precise USB tester could lead to more accurate results regarding the power consumption of the Xtensa processor.