This shows you the differences between two versions of the page.
ep:labs:01 [2022/10/10 10:38] radu.mantu [Proof of Work]
ep:labs:01 [2025/02/12 00:00] (current) cezar.craciunoiu
Line 1: | Line 1:
- | ~~NOTOC~~ | + | ====== Lab 01 - Plotting ====== |
- | + | ||
- | ====== Lab 01 - CPU Monitoring (Linux) ====== | + | |
===== Objectives ===== | ===== Objectives ===== | ||
- | * Offer an introduction to Performance Monitoring | + | * Offer an introduction to Numpy & matplotlib |
- | * Present the main CPU metrics and how to interpret them | + | * Get you familiarised with the numpy API |
- | * Get you to use various tools for monitoring the performance of the CPU | + | * Understand basic plotting with matplotlib |
- | * Familiarize you with the x86 Hardware Performance Counters | + | |
- | ===== Contents ===== | ||
- | {{page>:ep:labs:01:meta:nav&nofooter&noeditbutton}} | ||
- | ===== Proof of Work ===== | ||
- | Before you start, create a [[http://docs.google.com/|Google Doc]]. Here, you will add screenshots / code snippets / comments for each exercise. Whatever you decide to include, it must prove that you managed to solve the given task (so don't show just the output, but how you obtained it and what conclusion can be drawn from it). If you decide to complete the feedback for bonus points, include a screenshot with the form submission confirmation, but not with its contents. | ||
- | When done, export the document as a //pdf// and upload in the appropriate assignment on [[https://curs.upb.ro/2022/course/view.php?id=5113#section-2|moodle]]. The deadline is 23:55 on Friday. | + | ===== Python Scientific Computing Resources ===== |
- | ===== Introduction ===== | + | |
- | Performance Monitoring is the process of checking a set of metrics in order to ascertain the health of the system. Normally, the information gleaned from these metrics is in turn used to fine tune the system in order to maximize its performance. As you may imagine, both acquiring and interpreting this data requires at least //some// knowledge of the underlying operating system. | + | In this lab, we will study a new library in python that offers fast, memory efficient manipulation of vectors, matrices and tensors: **numpy**. We will also study basic plotting of data using the most popular data visualization libraries in the python ecosystem: **matplotlib**. |
- | In the following four labs, we'll discuss the four main subsystems that are likely to have an impact either on a single process, or on the system as a whole. These are: CPU, memory, disk I/O and networking. Note that these subsystems are not independent of one another. For example, a web application may be dependent on the network stack of the kernel. Its implementation determines the number of packets processed in a given amount of time. However, protocols that require checksum calculation (e.g.: TCP) will want to use a highly optimized implementation of this function (which is written directly in assembly). If your architecture does not have such an implementation and falls back to using the one written in C, you may prefer changing your choice of protocol. | + | For scientific computing we need an environment that is easy to use and provides tools for manipulating data and visualizing results. |
+ | Python is very easy to use, but the downside is that it's not fast at numerical computing. Luckily, we have very efficient libraries for all our use cases. | ||
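To give a flavour of what the new lab covers, below is a minimal sketch that uses ''numpy'' to build an array in a vectorized way and ''matplotlib'' to plot it (assuming both libraries are installed, e.g. via ''pip install numpy matplotlib''):

<code python>
import numpy as np
import matplotlib.pyplot as plt

# generate 100 evenly spaced samples and evaluate a function over them, vectorized
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# plot the result and label the axes
plt.plot(x, y, label="sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.show()
</code>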
- | When dealing strictly with the CPU, these are a few things to look out for: | + | **Core computing libraries** |
- | === Context Switches === | + | * numpy and scipy: scientific computing |
+ | * matplotlib: plotting library | ||
- | A context switch is a transition from one runtime environment to another. The most common example is performing a privileged call to kernel space via a system call, then returning from it. | + | **Machine Learning** |
- | Whenever this happens, a copy of your register state must be (re)stored, which takes up some time. | + | * sklearn: machine learning toolkit |
+ | * tensorflow: deep learning framework developed by google | ||
+ | * keras: deep learning framework on top of `tensorflow` for easier implementation | ||
+ | * pytorch: deep learning framework developed by facebook | ||
- | Note, however, how each individual process has its own address space, but in every address space, the only constant is the kernel. Why is that? Well, when the time slice of a process runs out and another is scheduled in, the kernel must perform a Translation Lookaside Buffer (TLB) flush. Otherwise, memory accesses in the new process might erroneously end up targeting the memory of the previous process. Yes, some shared objects (libraries) //could// have been mapped at the same virtual addresses and deleting those entries from the TLB is a shame, but there's no workaround for that. Now, back to our original question: why is the kernel mapped identically in each virtual address space? The reason is that when you perform a context switch into the kernel after calling ''open()'' or ''read()'', a TLB flush is not necessary. If you wanted to write your own kernel, you could theoretically isolate the kernel's address space (like any other process), but you would see a huge performance drop. | ||
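One concrete way to observe this cost piling up: the kernel exports per-process counters for voluntary and non-voluntary context switches in ''/proc/[pid]/status'' (see **man 5 proc**). A minimal Python sketch, assuming a Linux system:

<code python>
import time

def ctxt_switches(pid="self"):
    """Return (voluntary, nonvoluntary) context switch counts from /proc/<pid>/status."""
    counts = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith(("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")):
                key, value = line.split(":")
                counts[key] = int(value)
    return counts["voluntary_ctxt_switches"], counts["nonvoluntary_ctxt_switches"]

vol1, nonvol1 = ctxt_switches()
time.sleep(0.5)          # blocking voluntarily yields the CPU -> voluntary context switches
vol2, nonvol2 = ctxt_switches()
print(f"voluntary: {vol1} -> {vol2}, non-voluntary: {nonvol1} -> {nonvol2}")
</code>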
- | The takeaway is that some context switches are more expensive than others. Not being able to schedule a process to a single core 100% of the time comes with a huge cost (flushing the TLB). This being said, context switches from user space to kernel space are still expensive operations. As Terry Davis once demonstrated in his Temple OS, running everything at the same privilege level can reduce the cost of context switches by orders of magnitude. | + | **Statistics and data analysis** |
- | === CPU Utilization === | + | * pandas: very popular data analysis library |
+ | * statsmodels: statistics | ||
- | Each process is given a time slice for it to utilize however it sees fit. The way that time is utilized can prove to be a meaningful metric. There are two ways that we can look at this data: system level or process level. | + | We also have advanced interactive environments: |
- | At system level, the data is offered by the kernel in ''/proc/stat'' (details in **man 5 proc**; look for this file). For each core, we get the number of time units (a unit is 1/''USER_HZ'', which typically works out to 10ms) the core has spent on a certain type of task. The more commonly encountered are of course: | + | * IPython: advanced python console |
- | * **user:** Running unprivileged code in ring3. | + | * Jupyter: notebooks in the browser |
- | * **system:** Running privileged code in ring0. | + | |
- | * **idle:** Not running anything. In this case, the core voltage & frequency is usually reduced. | + | |
- | * **nice:** Same as **user**, but refers to processes with a //nice// value greater than 0. More details [[https://www.kernel.org/doc/html/next/scheduler/sched-nice-design.html|here]]. | + | |
- | The less reliable / relevant ones are: | + | There are many more scientific libraries available. |
- | * **iowait:** Time waiting for I/O. Not reliable because this is usually done via Direct Memory Access at kernel level and processes that perform blocking I/O operations (e.g.: ''read()'' -- with the exception of certain types of files, such as sockets, opened with ''O_NONBLOCK'') automatically yield their remaining time for another CPU bound process to be rescheduled. | + | |
- | * **(soft)irq:** Time servicing interrupts. This has nothing to do with user space processes. A high number can indicate high peripheral activity. | + | |
- | * **steal:** If the current system runs under a Hypervisor (i.e.: you are running in a Virtual Machine), know that the HV has every right to steal clock cycles from any VM in order to satisfy its own goals. Just like the kernel can steal clock cycles from a regular process to service an interrupt from, let's say, the Network Interface Controller, so can the HV steal clock cycles from the VM for exactly the same purpose. | + | |
- | * **guest:** The opposite of **steal**. If you are running a VM, then the kernel can take the role of a HV in some capacity (see **kvm**). This is the amount of time the CPU was used to run the guest VM. | + | |
- | At process level, the data can be found in ''/proc/[pid]/stat'' (see **man 5 proc**). Note that in this case, the amount of information the kernel interface provides is much more varied. While we still have **utime** (user time) and **stime** (system time), note that we also have statistics for child processes that have not been orphaned: **cutime**, **cstime**. | ||
- | Although you may find many tools that offer similar information, remember that these files are the origin. Another thing to keep in mind is that this data covers the entire session, i.e.: from system boot or from process launch. If you want to interpret it in a meaningful manner, you need to get two data points and know the time interval between their acquisition. | + | Check out these cheat sheets for fast reference to the common libraries: |
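As an illustration, here is a minimal Python sketch of how a tool could derive overall CPU utilization from two ''/proc/stat'' samples taken one second apart (field order as documented in **man 5 proc**):

<code python>
import time

def cpu_times():
    """Read the aggregate 'cpu' line from /proc/stat; values are ticks of 1/USER_HZ."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]   # drop the leading "cpu" label
    return [int(x) for x in fields]

t1 = cpu_times()
time.sleep(1)                               # the interval between the two data points
t2 = cpu_times()

deltas = [b - a for a, b in zip(t1, t2)]
total = sum(deltas) or 1                    # guard against a zero-length interval
busy = total - deltas[3] - deltas[4]        # everything except idle and iowait (4th and 5th values)
print("CPU utilization over the last second: %.1f%%" % (100.0 * busy / total))
</code>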
- | === Scheduling === | + | **Cheat sheets:** |
- | When a CPU frees up, the kernel must decide which process gets to run next. To this end, it uses the [[https://www.kernel.org/doc/html/v5.7/scheduler/sched-design-CFS.html|Completely Fair Scheduler (CFS)]]. Normally, we don't question the validity of the scheduler's design; that's a few levels above our pay grade. What we can do is adjust the value of ''/proc/sys/kernel/sched_min_granularity_ns'' (a short sketch of inspecting it follows below). This virtual file contains the minimum number of nanoseconds that a task is allocated when scheduled on the CPU. A lower value guarantees that each process will be scheduled sooner rather than later, which is a good trait of a real-time system (e.g.: Android -- you don't want unresponsive menus). A greater value, however, is better when you are doing batch processing (e.g.: rendering a video). We noted previously that switching active processes on a CPU core is an expensive operation, so allowing each process to run for longer will reduce the CPU dead time in the long run. | + | - [[https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf|python]] |
+ | - [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf|numpy]] | ||
+ | - [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf|matplotlib]] | ||
+ | - [[https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf|sklearn]] | ||
+ | - [[https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf|pandas]] | ||
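As mentioned in the Scheduling paragraph above, here is a minimal Python sketch for inspecting (and, as root, adjusting) the scheduler granularity; note that on newer kernels this knob may have moved under ''/sys/kernel/debug/sched/'', and the 10 ms value below is purely illustrative:

<code python>
PATH = "/proc/sys/kernel/sched_min_granularity_ns"

# inspect the current minimum time slice, in nanoseconds
with open(PATH) as f:
    print("minimum granularity:", f.read().strip(), "ns")

# writing requires root; a larger slice favours batch workloads (illustrative value: 10 ms)
# with open(PATH, "w") as f:
#     f.write("10000000")
</code>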
- | Another aspect that's not talked about quite as much is core scheduling. Given that you have more available cores than active tasks, on what core do you schedule a task? When answering this question, we need to keep in mind a few things: the CPU does not operate at a constant frequency. The voltage of each core, and consequently its frequency, varies based on the amount of active time. That being said, if a core has been idle for quite some time and a new task is suddenly scheduled on it, it will take some time to get from its low-power frequency to its maximum. But now consider this: what if the workload is not distributed among all cores and a small subset of cores overheats? The CPU is designed to forcibly reduce the frequency in such cases, and if the overall temperature exceeds a certain point, shut down entirely. | + | **Other:** |
- | At the moment, CFS likes to spread out the tasks to all cores. Of course, each process has the right to choose the cores it's comfortable to run on (more on this in the exercises section; a short affinity sketch also follows below). Another reason why this may be preferable, which we haven't mentioned before, is not invalidating the CPU cache. L1 and L2 caches are specific to each physical core, while L3 is accessible to all cores. However, L1 and L2 have an access time of 1-10ns, while L3 can go as high as 30ns. If you have some time, read a bit about [[https://www.phoronix.com/news/Nest-Linux-Scheduling-Warm-Core|Nest]], a newly proposed scheduler that aims to keep scheduled tasks on "warm cores" until it becomes necessary to power up idle cores as well. Can you come up with situations when Nest may be better or worse than CFS? | + | - [[https://stanford.edu/~shervine/teaching/cs-229/refresher-probabilities-statistics|Probabilities & Stats Refresher]] |
+ | - [[https://stanford.edu/~shervine/teaching/cs-229/refresher-algebra-calculus|Algebra]] | ||
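As for a process choosing its cores, Linux exposes the CPU affinity mask via ''taskset'' on the command line and via ''os.sched_getaffinity''/''os.sched_setaffinity'' in Python. A minimal sketch (assumes at least two available cores):

<code python>
import os

pid = 0  # 0 means "the calling process"

# query the set of cores this process is currently allowed to run on
print("allowed cores:", os.sched_getaffinity(pid))

# pin the process to cores 0 and 1 only, keeping its working set warm in their L1/L2 caches
os.sched_setaffinity(pid, {0, 1})
print("after pinning:", os.sched_getaffinity(pid))
</code>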
+ | |||
+ | |||
+ | <note>This lab is organized as a Jupyter Notebook hosted on Google Colab, where you will find intuitions and applications for numpy and matplotlib. Check out the Tasks section below.</note> |
===== Tasks ===== | ===== Tasks ===== | ||
{{namespace>:ep:labs:01:contents:tasks&nofooter&noeditbutton}} | {{namespace>:ep:labs:01:contents:tasks&nofooter&noeditbutton}} | ||
+ | |||
+ | |||
+ | |||