Lab 04 - Memory Monitoring

Objectives

Offer an introduction to Virtual Memory.
Get you acquainted with relevant commands and their outputs for monitoring memory-related aspects.
Introduce the concept of page de-duplication.
Present a step-by-step guide to Intel PIN for dynamic instrumentation.

Tasks

Introduction

When talking about memory, one can be referring either to the CPU's cache or to main memory (i.e., RAM). Since the former has been discussed (hopefully exhaustively) during other courses such as ASC, today we'll be focusing on the latter. If you feel that there's still more for you to learn about the CPU cache, check out this very well-known article. With that out of the way, here are a few things to keep in mind moving forward:

Virtual Memory

Reminding you of this concept may be redundant at this point, but here goes. The programs that you write do not have direct access to physical memory. All addresses that you access from user space are translated to physical addresses by the Memory Management Unit (MMU) of the CPU. The MMU caches as many virtual-to-physical address pairs as it can in its Translation Lookaside Buffer (TLB). When the TLB fills up, some of the existing entries are evicted to make room for new ones. When a virtual address that is not in the TLB is encountered, the CPU looks up its physical counterpart in a structure managed by the kernel: the page table. This structure is in fact a 4-level tree where each node is a table of 512 entries pointing to the next node. The leaf nodes yield the physical page address.

Some of you might have already noticed something strange. An offset in the range [0; 511] can be represented using only 9 bits. Having a 4-level page table means that the four offsets fit into 36 bits of the 64-bit virtual address. If we add the size of a page offset (12 bits), we reach only 48 bits, so we're still 16 bits short. Good catch! Modern x64 CPUs, while technically using 64-bit addresses, don't support 2^64 bytes of addressable virtual memory. That being said, almost nobody has ever complained about this, since 2^48 bytes (256 TiB) is still more than almost anyone needs.
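The bit arithmetic above can be checked with a short sketch. It assumes 4-level paging with 4 KiB pages, exactly as described in this section; the function name split_vaddr and the sample address are made up for illustration.

```python
# Split a 48-bit x86-64 virtual address into its four 9-bit table indices
# and its 12-bit page offset (assumes 4-level paging with 4 KiB pages).

PAGE_OFFSET_BITS = 12   # 4 KiB pages
INDEX_BITS = 9          # 512 entries per table
LEVELS = 4


def split_vaddr(vaddr: int) -> tuple[list[int], int]:
    """Return ([root index, ..., leaf index], page offset)."""
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = []
    # Walk from the root table (highest bits) down to the leaf table.
    for level in range(LEVELS - 1, -1, -1):
        shift = PAGE_OFFSET_BITS + level * INDEX_BITS
        indices.append((vaddr >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


indices, offset = split_vaddr(0x00007F32DED94123)
print("table indices:", indices)
print("page offset:  ", hex(offset))
print("bits used:    ", LEVELS * INDEX_BITS + PAGE_OFFSET_BITS)
```

Bits 48 to 63 of the address are never consulted, which is why only 2^48 bytes are addressable.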

So what are the reasons for implementing virtual memory? Simple: security, performance and convenience. Let's tackle these one by one:

Security: User space processes should not have direct access to the physical address space. If they did, they could inspect and change the memory of other processes, and possibly even the kernel's. Moreover, not every physical address that the kernel can access is backed by RAM. Some devices have memory-mapped registers that software can interact with by reading from / writing to them. E.g., a serial device driver can put a char on the wire by writing it to a certain 32-bit aligned address. Similarly, it can check whether the serial device is still busy writing the previous character by reading a register that said device constantly updates with its status. Normally, you'd abstract the hardware from user space programs by having drivers interpret requests presented by the process via system calls. By using virtual memory, even if the process has knowledge of the underlying hardware, it won't be able to access those device registers.

“But I really want to access those registers…” you may be thinking. No worries, then: Userspace I/O (UIO) is a kernel module that allows mapping device registers into your user space process, thus enabling you to implement drivers without actually knowing anything about how kernel modules work :p. If that's not convenient enough for you, there's also /dev/mem. This device can essentially be opened as a regular file (i.e., with open()) and allows you to read / write physical memory. This is usually done with the pread() and pwrite() syscalls, respectively. Needless to say, using either of these mechanisms requires your process to have the CAP_SYS_ADMIN capability (if you don't know what that means, just run it with sudo :p). One example where mapping devices in the virtual address space of a process is useful is the Intel Data Plane Development Kit (DPDK), a user space implementation of network drivers using UIO. DPDK is used on servers with a high traffic load to avoid performing too many context switches only to receive packets in user space. Note, however, that using UIO essentially makes the device inaccessible to kernel drivers. In the case of DPDK, the Network Interface Controller (NIC) becomes inaccessible system-wide, with the exception of the processes using it.

In some very particular cases, you might want to know the physical addresses of your pages. On the surface, this might seem reasonable. After all, you can access them via virtual addressing, so why not? This could be done via /proc/<pid>/pagemap, but since Linux 4.0 obtaining the physical frame numbers this way also requires CAP_SYS_ADMIN. The reason for this is that knowing the physical address of your memory pages can allow you to mount cache-based side-channel attacks against other processes. This is not a trivial threat; cache side-channels are the most common class of hardware side-channels and among the only practical ones, even in a research context.
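To make the pagemap interface more concrete, here is a minimal sketch of how an entry could be read and decoded. The bit layout (PFN in bits 0-54, "swapped" in bit 62, "present" in bit 63) follows the kernel's pagemap documentation; the function names are hypothetical, and without CAP_SYS_ADMIN the kernel reports the PFN field as zero.

```python
import mmap
import struct

PAGE_SIZE = mmap.PAGESIZE
PFN_MASK = (1 << 55) - 1        # bits 0-54 hold the page frame number


def decode_pagemap_entry(entry: int) -> dict:
    """Decode one 64-bit pagemap entry (only a few of its fields)."""
    present = bool((entry >> 63) & 1)
    swapped = bool((entry >> 62) & 1)
    # The PFN is meaningful only for pages that are present in RAM.
    pfn = entry & PFN_MASK if present else None
    return {"present": present, "swapped": swapped, "pfn": pfn}


def read_pagemap_entry(vaddr: int, pid: str = "self") -> int:
    """Read the raw entry describing the page that contains `vaddr`."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        # One 8-byte entry per virtual page, indexed by virtual page number.
        f.seek((vaddr // PAGE_SIZE) * 8)
        return struct.unpack("=Q", f.read(8))[0]
```

If the PFN is available, the physical address is pfn * PAGE_SIZE plus the offset of the virtual address inside its page.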

Performance: You should already be fairly familiar with this: processes that use the same library don't in fact have their own copy in RAM. Instead, virtual addresses of the read-only pages of a library usually point to the same physical addresses in RAM. The advantage here is that you don't have to load a dozen different libraries from persistent storage (e.g., HDD, SSD) every time you start up a process. Let's say that you have 1000 processes, each using libc.so. Having ~1.8MB of read-only pages backed by libc.so copied over in RAM for each process would easily exhaust ~2GB of your RAM. And that's just one library… That being said, even mapping these libraries in the virtual address space (using mmap(), usually taken care of by ld-linux.so for you) is a costly operation. Looking at the American Fuzzy Lop (AFL) fuzzer, we can find an interesting optimization called Fork Server that bypasses the cost of re-mapping all libraries in the address space of every newly spawned instance of the same program. It hooks the main() function and, instead of exec()-ing thousands of times per second, simply fork()s the process so that the children start off with a copy of the original's address space. Fun stuff!
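If you want to check how much of a library is mapped read-only in a given process, the figure can be estimated from /proc/<pid>/maps. The sketch below is only an illustration: mapped_bytes is a hypothetical helper and SAMPLE is hand-written text in the maps format, not output captured from a real system.

```python
# Hand-written sample in the format of /proc/<pid>/maps (illustrative only).
SAMPLE = """\
7f0000000000-7f0000028000 r--p 00000000 08:01 100 /usr/lib/libc.so.6
7f0000028000-7f00001b0000 r-xp 00028000 08:01 100 /usr/lib/libc.so.6
7f00001b0000-7f00001b4000 rw-p 001b0000 08:01 100 /usr/lib/libc.so.6
7f00001b4000-7f00001c0000 rw-p 00000000 00:00 0
"""


def mapped_bytes(maps_text: str, needle: str, writable: bool = False) -> int:
    """Sum the sizes of mappings whose path contains `needle`.

    By default only non-writable mappings are counted, i.e. the ones
    that can be shared between all processes using the library.
    """
    total = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or needle not in fields[5]:
            continue                      # anonymous or unrelated mapping
        if ("w" in fields[1]) != writable:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        total += end - start
    return total


shared = mapped_bytes(SAMPLE, "libc.so")
print(f"shareable libc pages: {shared / 2**20:.2f} MiB")
print(f"1000 private copies:  {1000 * shared / 2**30:.2f} GiB")
```

On a real system, pass the contents of /proc/self/maps (or that of any process you are allowed to inspect) instead of SAMPLE.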

Convenience: Many of the points made previously can be used to justify how convenient virtual memory is. One thing to add would be that even the kernel uses it. The reason for this is the ability to remap physical devices at different addresses. ARM takes this a step further with its Two-Stage Address Translation, allowing the Hypervisor (running at Exception Level 2) to fake the existence of certain devices or to more accurately emulate certain platforms. Note, however, that ARM communicates the layout of hardware components in the address space to the kernel via a Flattened Device Tree (FDT). E.g., here the address and size of the uart1 device are given by the reg property, containing a tuple representing the base address (0x30860000) and the size of the memory region reserved for said device (0x10000, i.e., 16 pages, not all of them used in reality). On x86-64, FDTs are not used; other mechanisms are used to probe for available hardware.

Out Of Memory Killer

What happens when you start running out of RAM on your system? The default behavior is that the kernel chooses one or more processes to kill, thus freeing up some RAM. This mechanism is known as the Out Of Memory (OOM) Killer. In order to choose, each process is assigned an OOM score. A higher score is indicative of a higher chance of getting killed once the OOM Killer is woken up. The primary factor that influences this score is the amount of memory used. Modifiers that raise this value include the niceness value of the process and the number of fork()s. On the other hand, being privileged, having run for a long time or performing hardware I/O reduce the likelihood of being killed. Then comes the user's preference; writing a value to /proc/<pid>/oom_score_adj (within certain limits, fixed at kernel compile time) will also tip the scales, one way or another. Writing the lowest permitted value (-1000) will instead categorically prevent the process from being chosen. All this being said, is there an alternative to killing processes?
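To see which processes the OOM Killer would currently favor, the scores can be collected from procfs. This is a minimal sketch: read_scores and rank are hypothetical helper names, and the only interface assumed is the /proc/<pid>/oom_score file.

```python
from pathlib import Path


def read_scores(proc_root: str = "/proc") -> dict:
    """Map each readable PID to its current oom_score."""
    scores = {}
    try:
        entries = list(Path(proc_root).iterdir())
    except OSError:
        return scores                  # no procfs available here
    for entry in entries:
        if not entry.name.isdigit():
            continue                   # skip non-process entries
        try:
            scores[int(entry.name)] = int((entry / "oom_score").read_text())
        except (OSError, ValueError):
            continue                   # the process exited in the meantime
    return scores


def rank(scores: dict, count: int = 5) -> list:
    """Return the `count` (pid, score) pairs most likely to be killed."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:count]


if __name__ == "__main__":
    for pid, score in rank(read_scores()):
        print(f"pid {pid:>7}  oom_score {score}")
```

Comparing the output before and after writing to a process's oom_score_adj shows how the user's preference shifts the ranking.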

Swap Space

The system can reserve a portion of the persistent storage devices (e.g., HDD, SSD) for the express purpose of storing RAM pages when memory starts running low. For a long time, a dedicated partition was needed to serve as swap space. Now, users can also create swap files on top of an existing file system and activate them in the same way as a swap partition. This allows easily resizing the swap space without modifying partitions. When the amount of used memory exceeds a certain threshold (high watermark), the kernel's Page Frame Reclamation system begins copying the least recently used pages to swap. This goes on until the amount of used memory decreases below another threshold (low watermark). When a page is evicted to swap, the corresponding Page Table Entry (PTE) is modified to indicate the page's location in swap, instead of its (previous) physical address in RAM.

We note that Swap Space is an optional feature, but having it can increase system performance even if you don't have low memory issues. Nowadays, the kernel will try to evict pages from RAM proactively if they have not been accessed for a prolonged period of time. Evicting them to swap is not the only option. If a file is mapped in memory (via mmap()), then the kernel has a known copy of it in your filesystem, should it ever be needed. So evicting libc.so's pages to swap is unnecessary since there's already a copy of it in /usr/lib/. This form of proactive eviction is implemented for two main reasons: 1) to avoid reaching a point where kswapd (the kernel swap daemon) needs to aggressively evict pages, or where the OOM Killer needs to kill processes (in the absence of any swap), and 2) to maximize the amount of memory available for file caching without overcommitting CPU cycles to this task. The problem with not having any Swap Space is that you can only evict file-backed pages. Memory buffers (e.g., the malloc() memory pool) are in fact created as anonymous mmap()-ed pages. Normally you would think that anonymous pages don't have a backing file but internally, the swap device is considered their point of origin. Not having any swap device present on your system will automatically disqualify any anonymous mappings from being evicted. Since most anonymous pages are part of memory allocation pools that are largely underutilized, evicting (mostly) code pages of less utilized libraries in their place can result in performance loss due to unnecessary I/O in the long run.
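The difference between the two kinds of mappings can be observed from user space. The sketch below uses Python's mmap module; the helper names are made up for illustration, and the temporary file merely stands in for something like libc.so.

```python
import mmap
import tempfile


def anonymous_buffer(pages: int) -> mmap.mmap:
    """Anonymous mapping: no backing file, so it can only go to swap."""
    return mmap.mmap(-1, pages * mmap.PAGESIZE)


def file_backed_view(path: str) -> mmap.mmap:
    """Read-only file mapping: its pages can be dropped and re-read later."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping outlives the open file.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def demo() -> tuple:
    anon = anonymous_buffer(1)
    anon[:5] = b"hello"                  # dirties an anonymous page
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"backed by a file")
        tmp.flush()
        view = file_backed_view(tmp.name)
        data = bytes(view[:6])
        view.close()
    return bytes(anon[:5]), data


if __name__ == "__main__":
    print(demo())
```

While such a process is running, the file-backed mapping appears in /proc/<pid>/maps with its path, whereas the anonymous one has no path at all.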

Tasks

The skeleton for this lab can be found in this repository. Clone it locally before you start.

01. [10p] Valgrind

Dynamic analysis tools can observe a running process and report memory-related issues that static analysis would miss entirely. In this exercise you will use Valgrind to detect memory leaks in a small C program – and get a first taste of the dynamic instrumentation concept that will be developed further in Task 04 with Intel Pin.

[5p] Task A - Writing a leaky program

Read the contents of leak.c and compile it:

$ gcc -g -o leak leak.c

The -g flag includes debug symbols so Valgrind can report exact file names and line numbers.

Now run it normally and observe that nothing seems wrong from the outside:

$ ./leak
$ echo "exit code: $?"

[5p] Task B - Detecting leaks with Valgrind

Run the same binary under Valgrind's memory error detector:

$ valgrind --leak-check=full --show-leak-kinds=all ./leak

Examine the output and answer the following questions:

How many bytes are reported as definitely lost? Does this match what you would expect from reading the source?
What is the difference between definitely lost and indirectly lost in Valgrind's terminology?
What line number does Valgrind point to as the origin of the leak? Why is that line significant, rather than the line where the pointer goes out of scope?
Re-compile without the -g flag and run Valgrind again. What information is now missing from the report, and why?

Troubleshooting

On certain distributions such as CachyOS, you may get the following error:

valgrind:  Fatal error at startup: a function redirection
valgrind:  which is mandatory for this platform-tool combination
valgrind:  cannot be set up.  Details of the redirection are:

Valgrind needs the DWARF debug info for libc in order to function properly. If the ELF file itself doesn't have it, Valgrind will try to use debuginfod-find to download it using the Build ID stored in the .note.gnu.build-id section. If the debuginfod server doesn't have it either, your only options for getting it to work are:

recompiling glibc with debug symbols (out of the question)
starting a docker container with Ubuntu, Debian, Arch Linux, etc.

02. [20p] Swap space

Before starting this task, call the assistant to show them your progress. If you manage to freeze your PC, it might prove tricky to do so afterwards.

[10p] Task A - Swap File

First, let us check what swap devices we have enabled. Check the NAME and SIZE columns of the following command:

$ swapon --show

No output means that there are no swap devices available.

If you ever installed a Linux distro, you may remember creating a separate swap partition. This, however, is only one method of creating swap space. The other is by adding a swap file. Run the following commands:

$ sudo swapoff -a
$ sudo dd if=/dev/zero of=/swapfile bs=1024 count=$((4 * 1024 * 1024))
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile

$ swapon --show

Just to clarify what we did:

disabled all swap devices
created a 4GB zero-initialized file
set the permissions of the file so that only root can read or write it
created a swap area from the file using mkswap (works on devices too)
activated the swap area

The new swap area is temporary and will not survive a reboot. To make it permanent, we need to register it in /etc/fstab by adding a line such as this:

/swapfile swap swap defaults 0 0

[10p] Task B - Does it work?

In one terminal run vmstat and look at the swpd and free columns.

$ vmstat -w 1

In another terminal, open a python shell and allocate a bit more memory than the available RAM. Identify the moment when the newly created swap space is being used.
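One possible way to perform the allocation from the Python shell is sketched below. The allocate helper is hypothetical; it works in chunks so that you can watch vmstat react gradually, and it touches one byte per page to make sure the memory is actually backed by RAM (or swap) regardless of how the interpreter zero-fills it.

```python
def allocate(total_mib: int, chunk_mib: int = 256) -> list:
    """Allocate roughly `total_mib` MiB and keep it referenced."""
    chunks = []
    remaining = total_mib
    while remaining > 0:
        size = min(chunk_mib, remaining)
        buf = bytearray(size * 1024 * 1024)
        # Touch one byte in every 4 KiB so that each page becomes resident.
        for off in range(0, len(buf), 4096):
            buf[off] = 1
        chunks.append(buf)
        remaining -= size
    return chunks


# Small demonstration only; scale total_mib up to exceed your free RAM.
hog = allocate(8, chunk_mib=4)
print(len(hog), "chunks,", sum(len(c) for c in hog) // 2**20, "MiB")
```

Keep the returned list referenced for as long as you want the memory to stay allocated; deleting it (or closing the shell) releases everything.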

One thing you might notice is that the value in vmstat's free column is lower than before. This does not mean that you have less available RAM after creating the swap file. Remember using the dd command to create a 4GB file? A big chunk of RAM was used to buffer the data that was written to disk. If free drops to unacceptable levels, the kernel will make sure to reclaim some of this buffer/cache memory. To get a clear view of how much available memory you actually have, try running the following command:

$ free -h
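The numbers reported by free come from /proc/meminfo, which is also easy to parse yourself. The following is a minimal sketch with hypothetical helper names; it relies only on the "Key: value kB" line format of that file.

```python
def parse_meminfo(text: str) -> dict:
    """Parse the contents of /proc/meminfo into {key: integer value}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])   # kB for most keys
    return info


def swap_used_kib(info: dict) -> int:
    """Amount of swap currently in use, in KiB."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
    except OSError:
        info = {}                                # no procfs available
    for key in ("MemTotal", "MemFree", "MemAvailable"):
        print(key, info.get(key), "kB")
    print("swap in use:", swap_used_kib(info), "kB")
```

MemAvailable is the value to look at when judging how much memory new processes can still obtain, since it accounts for reclaimable buffer/cache memory.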

Observe that once you close the python shell and the memory is freed, swpd still displays a non-zero value. Why? Pages of other processes that were evicted in the meantime stay in swap until they are accessed again; there simply isn't a reason to bring them back preemptively. If you really want to clean up the used swap space, try the following:

$ vmstat
$ sudo swapoff -a && sudo swapon -a
$ vmstat

Create two swap files. Set their priorities to 10 and 20, respectively.
Include the commands (copy+paste) or a screenshot of the terminal.
Also list 2 advantages and 2 disadvantages of using a swap file compared to a swap partition.

03. [30p] Kernel Samepage Merging

KSM is a page de-duplication strategy introduced in kernel version 2.6.32. In case you are wondering, it's not the same thing as the file page cache. KSM was originally developed in tandem with KVM in order to detect data pages with exactly the same content and make their page table entries point to the same physical address (marked Copy-On-Write). The end goal was to allow more VMs to run on the same host. Since each page must be scanned for identical content, this solution had no chance of scaling well with the available quantity of RAM. So, the developers compromised by scanning only the private anonymous pages that were marked as likely candidates via madvise(addr, length, MADV_MERGEABLE).
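As an illustration of the madvise() step, here is a sketch using Python's mmap module (Python 3.8 or newer on Linux). It is not the lab's sample program; mergeable_region is a hypothetical helper that fills a private anonymous mapping with identical bytes and marks it as a KSM candidate.

```python
import mmap


def mergeable_region(pages: int, fill: int = 0x41) -> tuple:
    """Return (region, advised) for `pages` identical, KSM-eligible pages."""
    length = pages * mmap.PAGESIZE
    # KSM only considers private anonymous memory, hence MAP_PRIVATE.
    region = mmap.mmap(-1, length,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    region[:] = bytes([fill]) * length       # every page has the same content
    advised = False
    madv = getattr(mmap, "MADV_MERGEABLE", None)
    if madv is not None:
        try:
            region.madvise(madv)             # madvise(addr, len, MADV_MERGEABLE)
            advised = True
        except OSError:
            pass                             # e.g., kernel built without KSM
    return region, advised


if __name__ == "__main__":
    region, advised = mergeable_region(1024)
    print("pages:", len(region) // mmap.PAGESIZE, "advised:", advised)
```

The pages are merged only while the mapping is alive and ksmd is running, so the process must keep running long enough for the daemon to scan it.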

[10p] Task A - Check kernel support & enable ksmd

First things first, you need to verify that KSM was enabled during your kernel's compilation. For this, you need to check the Linux build configuration file. Hopefully, you should see something like this:

# on Ubuntu you can usually find it in your /boot partition
$ grep CONFIG_KSM /boot/config-$(uname -r)
CONFIG_KSM=y
 
# otherwise, you can find a gzip compressed copy in /proc
$ zcat /proc/config.gz | grep CONFIG_KSM
CONFIG_KSM=y

If you don't have KSM enabled, you could recompile the kernel with the CONFIG_KSM flag and try it, but you don't have to :)

Moving forward, the next thing on the list is to check that the ksmd daemon is functioning. Any configuration that we'll do will be through the sysfs files in /sys/kernel/mm/ksm. Consequently, you should switch to the root user: prefixing echo with sudo will not help, since the output redirection is still performed by your unprivileged shell.

/…/run : this is 1 if the daemon is active; write 1 to it if it's not
/…/pages_to_scan : this is how many pages will be scanned before going to sleep; you can increase this to 1000 if you want to see faster results
/…/sleep_millisecs : this is how many ms the daemon sleeps in between scans; since you've modified pages_to_scan, you can leave this be
/…/max_page_sharing : this is the maximum number of virtual pages that can share a single de-duplicated physical page; in cases like this it's better to go big or go home; so set it to something like 1000000, just to be sure

There are a few more files in the ksm/ directory. We will still use one or two later on. But for now, configuring the previous ones should be enough. Google the rest if you're interested.
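A quick way to review the current configuration is to read all the numeric files in the ksm/ directory at once. The sketch below uses hypothetical helper names; WANTED simply restates the values suggested in the list above.

```python
from pathlib import Path

KSM_DIR = "/sys/kernel/mm/ksm"

# Values suggested in this task; adjust them to your own system.
WANTED = {"run": 1, "pages_to_scan": 1000, "max_page_sharing": 1000000}


def read_ksm(ksm_dir: str = KSM_DIR) -> dict:
    """Return {file name: integer value} for the files in `ksm_dir`."""
    values = {}
    try:
        entries = sorted(Path(ksm_dir).iterdir())
    except OSError:
        return values                   # KSM not available on this system
    for entry in entries:
        try:
            values[entry.name] = int(entry.read_text())
        except (OSError, ValueError):
            continue                    # unreadable or non-numeric file
    return values


def pending_changes(current: dict, wanted: dict = WANTED) -> dict:
    """Settings whose current value differs from the wanted one."""
    return {k: v for k, v in wanted.items() if current.get(k) != v}


if __name__ == "__main__":
    current = read_ksm()
    print("current:", current)
    print("to write (as root):", pending_changes(current))
```

Reading these files does not require root; only writing the new values does.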

[10p] Task B - Watch the magic happen

For this step it would be better to have a few terminals open. First, let's start a vmstat. Keep your eyes on the active memory column when we run the sample program.

$ vmstat -wa -S m 1

Next would be a good time to introduce two more files from the ksm/ sysfs directory:

/…/pages_shared : this file reports how many de-duplicated (shared) physical pages are in use at the moment
/…/pages_sharing : this file reports how many virtual page table entries point to the aforementioned physical pages
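These two counters can be turned into a rough estimate of how much memory KSM is saving. The helper below is a sketch with a made-up name; it assumes 4 KiB pages by default and treats every entry counted in pages_sharing as one physical page that no longer needs to exist.

```python
def ksm_savings(pages_shared: int, pages_sharing: int,
                page_size: int = 4096) -> tuple:
    """Return (bytes saved, average sharers per shared page)."""
    saved = pages_sharing * page_size
    ratio = pages_sharing / pages_shared if pages_shared else 0.0
    return saved, ratio


saved, ratio = ksm_savings(pages_shared=2, pages_sharing=1000)
print(f"saved {saved / 2**20:.1f} MiB, {ratio:.0f} sharers per page")
```

A high ratio means that a few physical pages stand in for many virtual ones, which is what you should see when the sample program fills all its pages with the same value.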

For this experiment we will also want to monitor the number of de-duplicated virtual pages, so have at it:

$ watch -n 0 cat /sys/kernel/mm/ksm/pages_sharing

Finally, look at the provided code, compile it, and launch the program. As an argument you will need to provide the number of pages that will be allocated and initialized with the same value. Note that not all pages will be de-duplicated instantly. So keep in mind your system's RAM limitations before deciding how much you can spare (1-2GB should be ok, right?)

The result should look something like Figure 1:

Figure 1: vmstat output during the execution of our sample program (unit of measure: MB). The free memory steadily decreases from a baseline value of ~4.5GB to a minimum of ~2.5GB after the process starts. As ksmd begins scanning and merging pages, the free memory steadily increases. When the process eventually terminates, the amount of free memory reverts to its initial value.

If you ever want to make use of this in your own experiments, remember to adjust the configuration of ksmd. Waking too often or scanning too many pages at once could end up doing more harm than good. See what works for your particular system.

Include a screenshot with the same output as the one in the spoiler above.
Edit the screenshot or note in writing at what point you started the application, where it reached max memory usage, the interval where the KSM daemon was doing its job (in the 10s sleep interval) and where the process died.

[10p] Task C - Plot results

Now that you’ve observed the effects of KSM using vmstat, it’s time to visualize them. Solve the TODOs in plot.py from the skeleton to generate a real-time plot that shows free memory, used memory, and memory used as a buffer over time, based on the freemem column from the output of the vmstat command.

Troubleshooting

If you get something resembling Could not load the Qt platform plugin “xcb” in ”” even though it was found. on either WSL or certain Linux environments (e.g., having Hyprland as a Wayland compositor), check out this post.

04. [40p] Intel PIN

Broadly speaking, binary analysis is of two types:

Static analysis - used in an offline environment to understand how a program works without actually running it.
Dynamic analysis - applied to a running process in order to highlight interesting behavior, bugs or performance issues.

In case you are still wondering, in this exercise we are going to look at (one of) the best dynamic analysis tools available: Intel Pin. Specifically, what Pin does is called program instrumentation, meaning that it inserts user-defined code at arbitrary locations in the executable. The code is inserted at runtime, meaning that Pin can attach itself to a process, just like gdb.

Although Pin is closed source, the concepts that serve as its foundation are described in this paper. Since we don't have time to go through the whole material, we will offer a bird's eye view of its architecture, just enough to get you started with the tasks.

Figure 2: Simplified view of the memory layout of a process being instrumented by Intel Pin. The Pin-specific memory mapped regions contain our pintool, the instrumentation API of the framework and a sandbox region where instrumented code is being reconstructed as per our tool's specification. This reconstruction phase is costly but the process attains near-native speeds afterwards.

When a process is started via Pin, the very first instruction is intercepted and new mappings are created in the virtual space of the process. These mappings contain libraries that Pin uses, the tool that the user wrote (which is compiled as a shared object) and a small sandbox that will act as a VM. During the execution, Pin will translate the original instructions into the sandbox on an as-needed basis and, according to the rules defined in the tool, insert arbitrary code. This code can be inserted at different levels of granularity:

instruction
basic block
function
image

The immediate advantages should be clear. From a performance evaluation standpoint alone, a few applications could be:

obtaining metrics from programs that were not designed with this in mind
hotpatching bugs without stopping the process
detecting the most accessed code regions to prioritize manual optimization

Although this sounds great, we should not ignore some of the glaring disadvantages:

overhead
- this is highly dependent on the amount of instrumentation and the instrumented code itself
- overall, this seems to have a bit more impact on ARM than on other architectures
volatile
- remember that the instrumented code shares things like the virtual memory space and file descriptors with the original process
- while something like in-memory fuzzing is possible, the risk of breaking the process is very high
limited use cases
- Pin works directly on a regular executable (i.e., native machine code)
- Pin will not work (as intended) on interpreted languages and variations of these

In case you are wondering what else you can do with Intel Pin, check out TaintInduce. The authors of this paper wrote an architecture agnostic taint analysis tool that successfully found 24 CVEs, 17 missing or wrongly emulated instructions in unicorn and 1 mistake in the Intel Developer Manual.

For reference, use the Intel Pin 4.2 User Guide (also contains examples).

[5p] Task A - Setup

In this tutorial we will build a Pin tool with the goal of instrumenting any memory reads/writes. For reads, we output the source buffer state before the operation takes place. For writes, we output the destination buffer states both before and after.

Download the skeleton for this task. First thing you will need to do is run setup.sh. This will download the Intel Pin 4.2 framework into the newly created third_party/ directory and create a stable symlink at third_party/pin.

$ bash setup.sh

Next, open src/minspect.cpp in an editor of your choice, but avoid modifying the code. In between tasks, we will apply diff patches to this file. This will allow us to gradually build our tool and observe its behavior at different stages during its development. However, altering the source in any significant manner may cause the patch to fail.

Let us apply the first patch before proceeding to the following task:

$ patch src/minspect.cpp patches/Task-A.patch

Troubleshooting

If you get a lot of compilation errors, the easiest solution we have right now is to boot up an Arch Linux container and use g++ 15.2.1 instead of 13.3. Here's how you do it and how you install the dependencies once it's up and running:

[student@host ~]$ docker run -ti archlinux:latest
 
# sync package database with remote server & install dependencies
[root@arch ~]$ pacman -Sy
[root@arch ~]$ pacman -S base-devel git wget neovim

After you clone the EP-labs repo, run setup.sh again and try to compile the project, you may still get one error regarding an undefined field called m_base. This is an error in their source code; just find the file and delete the m in m_base. That's why you have nvim installed ;)

[10p] Task B - Instrumentation Callbacks

Looking at main(), most Pin API calls are self explanatory. The only one that we're interested in is the following:

INS_AddInstrumentFunction(ins_instrum, NULL);

This call instructs Pin to trap on each instruction in the binary and invoke ins_instrum(). However, this happens only once per instruction. The role of the instrumentation callback that we register is to decide if a certain instruction is of interest to us. “Of interest” can mean basically anything. We can pick and choose “interesting” instructions based on their class, registers / memory operands, functions or objects containing them, etc.

Let's say that an instruction has indeed passed our selection. Now, we can use another Pin API call to insert an analysis routine before or after said instruction. While the instrumentation routine will never be invoked again for that specific instruction, the analysis routine will execute seamlessly for each pass.

For now, let us observe only the instrumentation callback and leave the analysis routine registration for the following task. Take a look at ins_instrum(). Then, compile the tool and run any program you want with it. Waiting for it to finish is not really necessary. Stop it after a few seconds.

$ make
$ ./third_party/pin/pin -t obj-intel64/minspect.so -- ls -l 1>/dev/null

Just to make sure everything is clear: the default rule for make will generate an obj-intel64/ directory and compile the tool as a shared object. The way to start a process with our tool's instrumentation is by calling the pin utility. -t specifies the tool to be used. Everything after -- should be the exact command that would normally be used to start the target process.

Note: here, we output information to stderr from our instrumentation callback. This is not good practice. The Pin tool and the target process share pretty much everything: file descriptors, virtual memory, etc. Normally, you will want to output these things to a log file. However, let's say we can get away with it for now, under the pretext of convenience.

Remember to apply the Task-B.patch before proceeding to the next task.

Figure 3: Each instruction is instrumented with a routine that outputs the memory mapped object containing it (red), the section inside that object (green), the function name (blue; defaults to section name if no symbol available), and the runtime address (yellow). The same routine also prints the instruction itself, using Pin's built-in disassembler.

[10p] Task C - Analysis Callbacks (Read)

Going forward, we got rid of some of the clutter in ins_instrum(). As you may have noticed, the most recent addition to this routine is the for loop iterating over the memory operands of the instruction. We check whether each operand is the source of a read using INS_MemoryOperandIsRead(). If this check succeeds, we insert an analysis routine before the current instruction using INS_InsertPredicatedCall(). Let's take a closer look at how this API call works:

INS_InsertPredicatedCall(
    ins, IPOINT_BEFORE, (AFUNPTR) read_analysis,
    IARG_ADDRINT,       ins_addr,
    IARG_PTR,           strdup(ins_disass.c_str()),
    IARG_MEMORYOP_EA,   op_idx,
    IARG_MEMORYREAD_SIZE,
    IARG_END);

The first three parameters are:

ins: reference to the INS argument passed to the instrumentation callback by default.
IPOINT_BEFORE: instructs to insert the analysis routine before the instruction executes (see Instrumentation arguments for more details.)
read_analysis: the function that is to be inserted as the analysis routine.

Next, we pass the arguments for read_analysis(). Each argument is represented by a type macro and the actual value. When we don't have any more parameters to send, we end by specifying IARG_END. Here are all the arguments:

IARG_ADDRINT, ins_addr: a 64-bit integer containing the absolute address of the instruction.
IARG_PTR, strdup(ins_disass.c_str()): all objects in the callback's local context will be lost after we return; thus, we need to duplicate the disassembled code's string and pass a pointer to the copy.
IARG_MEMORYOP_EA, op_idx: effective address of a specific memory operand; this argument is not passed by value, but is instead recalculated each time the instruction executes and passed to the analysis routine seamlessly.
IARG_MEMORYREAD_SIZE: size in bytes of the memory read; check the documentation for some important exceptions.

Take a look at what read_analysis() does. Recompile the tool and run it again (just as in task B). Finally, apply Task-C.patch and move on to the next task.

Figure 4: Another instruction-level instrumentation routine, targeting only instructions that perform memory reads. It prints the runtime address and the disassembly of each instruction. Additionally, it outputs the value of the read memory (green). The reads can be direct (e.g.: mov) or indirect (e.g.: ret -- obtains the return address from the stack).

[10p] Task D - Analysis Callbacks (Write)

For the memory write analysis routine, we need to add instrumentation both before and after each instruction. The former needs to save the original buffer state while the latter displays the information in its entirety. Since an instruction may write to more than one memory location, we push the hexdumps of the initial buffer states onto a stack. Consequently, we need to add the post-write instrumentation in reverse order to ensure that the elements are popped from the stack in the correct succession. Let's take a look at the pre-write instrumentation insertion:
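The ordering argument can be checked with a small model that lives entirely outside Pin. In the sketch below, simulate is a hypothetical function: memory is a plain dictionary and each (address, value) pair plays the role of one memory operand of a single instruction.

```python
def simulate(memory: dict, writes: list) -> list:
    """Model the pre-write / post-write analysis calls of one instruction.

    `writes` holds one (address, new value) pair per memory operand.
    Returns (address, old value, new value) tuples in post-write order.
    """
    saved = []                            # stack of (address, old value)
    report = []
    for addr, _ in writes:                # pre-write calls, operand order
        saved.append((addr, memory[addr]))
    for addr, value in writes:            # the instruction itself executes
        memory[addr] = value
    for addr, _ in reversed(writes):      # post-write calls, reverse order
        old_addr, old = saved.pop()
        assert old_addr == addr, "popped state belongs to another operand"
        report.append((addr, old, memory[addr]))
    return report


print(simulate({0x10: 1, 0x20: 2}, [(0x10, 7), (0x20, 9)]))
```

Iterating in operand order in the last loop would trip the internal assertion for any instruction with two or more written operands, which is the mismatch that the reverse insertion order avoids.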

INS_InsertPredicatedCall(
    ins, IPOINT_BEFORE, (AFUNPTR) pre_write_analysis,
    IARG_CALL_ORDER,    CALL_ORDER_FIRST + op_idx + 1,
    IARG_MEMORYOP_EA,   op_idx,
    IARG_MEMORYWRITE_SIZE,
    IARG_END);

We notice a new set of parameters:

IARG_CALL_ORDER, CALL_ORDER_FIRST + op_idx + 1: specifies the call order when multiple analysis routines are registered; see CALL_ORDER enum's documentation for details.

Recompile the tool. Test to see that the write analysis routines work properly. Apply Task-D.patch and let's move on to applying the finishing touches.

Figure 5: An extension of the instrumentation routine in the previous sub-task, accounting for memory writes in addition to memory reads. For these, it prints the state of the written memory both before (yellow) and after (red) the instruction retires.

[5p] Task E - Finishing Touches

This is only a minor addition. Namely, we want to add a command line option -i that can be used multiple times to specify multiple image names (e.g.: ls, libc.so.6, etc.) The tool must forego instrumentation for any instruction that is not part of these objects. As such, we declare a Pin KNOB:

static KNOB<string> knob_img(KNOB_MODE_APPEND,    "pintool",
        "i", "", "names of objects to be instrumented for branch logging");

We should not use argp or other alternatives. Instead, let Pin use its own parser for these things. knob_img will act as an accumulator for any argument passed with the flag -i. Observe its usage in ins_instrum().

Determine the shared object dependencies of your target binary of choice. Then try to recompile and rerun the Pin tool while specifying some of them as arguments.

$ ldd /bin/ls
        linux-vdso.so.1 (0x00007ffd0d19b000)
        libgtk3-nocsd.so.0 => /usr/lib/x86_64-linux-gnu/libgtk3-nocsd.so.0 (0x00007f32df3ad000)
        libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f32df185000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f32ded94000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f32deb90000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f32de971000)
        libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007f32de6ff000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f32df7d6000)
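If you want to collect candidate image names automatically, the ldd output can be parsed with a few lines of Python. This is only a sketch: parse_ldd is a hypothetical helper, SAMPLE reuses three lines of the output above, and whether the tool matches base names or full paths is an assumption that you should check against ins_instrum().

```python
import os

# Three lines taken from the ldd output shown above.
SAMPLE = """\
        linux-vdso.so.1 (0x00007ffd0d19b000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f32ded94000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f32df7d6000)
"""


def parse_ldd(output: str) -> list:
    """Return the base names of the objects that ldd resolved to a path."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "=>" in fields and fields.index("=>") + 1 < len(fields):
            path = fields[fields.index("=>") + 1]   # resolved location
        else:
            path = fields[0]                        # e.g., the loader
        if path.startswith("/"):                    # skips the vDSO
            names.append(os.path.basename(path))
    return names


print(parse_ldd(SAMPLE))
```

The resulting names can then be passed to the tool, one -i flag per name.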

This concludes the tutorial. The resulting Pin tool can now be used as a starting point for developing a Taint analysis engine. Discuss more with your lab assistant if you're interested.

Patch your way through all the tasks and run the pin tool only for the base object of any binutil.
Include a screenshot of the output.

05. [10p] Feedback

Please take a minute to fill in the feedback form for this lab.


ep/labs/04.txt · Last modified: 2026/03/12 10:57 by radu.mantu
