===== Lab 05: Multithreading =====
==== Parallelism ====

Most modern microprocessors consist of more than one core, each of which can operate as an individual processing unit. They can execute different parts of different programs at the same time.

A flow of execution through certain parts of a program is called a **thread of execution** or a **thread**. Programs can consist of multiple threads that are actively executed at the same time. The operating system starts and executes each thread on a core and then suspends it to execute other threads; as a result, each thread competes with the other threads in the system for computational time on the processor. The execution of each thread may involve many cycles of starting and suspending.

All of the threads of all of the programs that are active at a given time are executed on the cores of the microprocessor. The operating system decides when and under what conditions to start and suspend each thread. We call this //start/suspend/swap// process a **context switch**.

The features of the **std.parallelism** module make it possible for programs to take advantage of all of the cores in order to run faster.

=== std.parallelism.Task ===

Operations that are executed in parallel with other operations of a program are called tasks.
Tasks are represented by the type **std.parallelism.Task**.

**Task** represents the fundamental unit of work. A **Task** may be executed in parallel with any other **Task**. Using this struct directly allows future/promise parallelism. In this paradigm, a function (or delegate or other callable) is executed in a thread other than the one it was called from. The calling thread does not block while the function is being executed.

For simplicity, the [[https://dlang.org/phobos/std_parallelism.html#.task|std.parallelism.task]] and [[https://dlang.org/phobos/std_parallelism.html#.scopedTask|std.parallelism.scopedTask]] functions are generally used to create an instance of the **Task** struct.

Using the **Task** struct has three steps:

**1.** First, we need to create a task instance.
<code d>
import std.stdio;
import std.parallelism;
import core.thread;

int anOperation(string id) {
  writefln("Executing %s", id);
  Thread.sleep(1.seconds);
  return 42;
}

void main() {
  /* Construct a task object that will execute
   * anOperation(). The function parameters that are
   * specified here are passed to the task function as its
   * function parameters. */
  auto theTask = task!anOperation("theTask");
  /* the main thread continues to do stuff */
}
</code>
**2.** Now we've just created a new Task instance, but the task isn't running yet. Next we'll launch the task execution.
<code d>
  /* ... */
  auto theTask = task!anOperation("theTask");

  theTask.executeInNewThread(); // start task execution
  /* ... */
</code>
**3.** At this point we are sure that the operation has been started, but we cannot be sure whether **theTask** has completed its execution. **yieldForce()** waits for the task to complete its operations; it returns only when the task has been completed. Its return value is the return value of the task function, i.e. **anOperation()**.

<code d>
  /* ... */
  immutable taskResult = theTask.yieldForce();
  writefln("All finished; the result is %s\n", taskResult);
  /* ... */
</code>
<note tip>
The **Task** struct has two other methods, [[https://dlang.org/phobos/std_parallelism.html#.Task.workForce|workForce]] and [[https://dlang.org/phobos/std_parallelism.html#.Task.spinForce|spinForce]], that are used to ensure that the Task has finished executing and to obtain the return value, if any. Read their docs and discover the differences in behaviour and when their usage is preferred.
</note>
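Putting the three steps together, a minimal complete program might look like the sketch below; it simply reuses the **anOperation()** function and task from the snippets above.

<code d>
import std.stdio;
import std.parallelism;
import core.thread;

int anOperation(string id) {
  writefln("Executing %s", id);
  Thread.sleep(1.seconds);
  return 42;
}

void main() {
  // Step 1: create the task.
  auto theTask = task!anOperation("theTask");

  // Step 2: start it on a new thread.
  theTask.executeInNewThread();

  // The main thread is free to do other work here.
  writeln("Doing other work in main...");

  // Step 3: wait for the result.
  immutable taskResult = theTask.yieldForce();
  writefln("All finished; the result is %s", taskResult);
}
</code>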
=== std.parallelism.TaskPool ===

As we've previously stated: **all of the threads** of all of the programs that are active at a given time **are** executed on the cores of the microprocessor, **competing** for computational time **with each other**.

This observation has the following implication: on a system that has **N** cores, we can have at most N threads running in parallel at a given time. This means that in our application we should create at most N worker threads that will execute tasks (from a task queue) for us; our N worker threads thus form a **thread pool**, a common pattern used in concurrent applications.

The **std.parallelism.TaskPool** gives us access to a task pool implementation to which we can submit **std.parallelism.Task**s to be executed by the worker threads.

The **std.parallelism** module gives us access to a ready-to-use **std.parallelism.TaskPool** instance, named **std.parallelism.taskPool**. **std.parallelism.taskPool** has **totalCPUs - 1** worker threads available, where //totalCPUs// is the total number of CPU cores available on the current machine, as reported by the operating system.
The minus 1 is included because the main thread will also be available to do work.
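As a quick check, the small sketch below prints how many cores the operating system reports and how many worker threads the default pool uses; it relies on **totalCPUs** and the **TaskPool.size** property from **std.parallelism**.

<code d>
import std.stdio;
import std.parallelism;

void main() {
  // Number of cores reported by the operating system.
  writefln("Cores reported: %s", totalCPUs);
  // Number of worker threads in the ready-to-use pool.
  writefln("Worker threads in taskPool: %s", taskPool.size);
}
</code>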
=== std.parallelism.taskPool.parallel ===

Let's start with a simple example:
<code d>
import std.stdio;
import core.thread;

struct Student {
  int number;
  void aSlowOperation() {
    writefln("The work on student %s has begun", number);
    // Wait for a while to simulate a long-lasting operation
    Thread.sleep(1.seconds);
    writefln("The work on student %s has ended", number);
  }
}

void main() {
  auto students = [ Student(1), Student(2), Student(3), Student(4) ];

  foreach (student; students) {
    student.aSlowOperation();
  }
}
</code>
In the code above, as the foreach loop normally operates on elements one after the other, **aSlowOperation()** would be called for each student sequentially. However, in many cases it is not necessary for the operations of preceding students to be completed before starting the operations of successive students. If the operations on the **Student** objects were truly independent, it would be wasteful to ignore the other microprocessor cores, which might potentially be waiting idle on the system.

Meet **taskPool.parallel**. This function can also be called simply as **parallel()**. **parallel()** accesses the elements of a range in parallel. An effective usage is with foreach loops. Merely importing the std.parallelism module and replacing **students** with **parallel(students)** in the program above is sufficient to take advantage of all of the cores of the system.

This simple change
<code d>
  /* ... */
  foreach (student; parallel(students)) {
  /* ... */
</code>
is enough to drop our application's total running time from //4 seconds// to just //1 second//.

In Ali's [[http://ddili.org/ders/d.en/foreach_opapply.html|foreach for structs and classes chapter]] we can see that the expressions that are in **foreach** blocks are passed to **opApply()** member functions as delegates. **parallel()** returns a range object that knows how to distribute the execution of the delegate to a separate core for each element.

**parallel()** constructs a new **Task** object for every worker thread and starts that task automatically. **parallel()** then waits for all of the tasks to be completed before finally exiting the loop. **parallel()** is very convenient as it constructs, starts, and waits for the tasks automatically.
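**parallel()** also accepts an optional second argument, the work unit size, which controls how many consecutive elements each task processes before grabbing more work. A minimal sketch (the range and the work unit size of 2 are chosen arbitrarily for illustration):

<code d>
import std.parallelism;
import std.range : iota;
import std.stdio;

void main() {
  // Each task takes 2 consecutive elements at a time from the range.
  foreach (i; parallel(iota(10), 2)) {
    writefln("processing element %s", i);
  }
}
</code>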
<note tip>
We strongly encourage you to have a look at [[https://dlang.org/phobos/std_parallelism.html#.TaskPool.map|taskPool.map]], [[https://dlang.org/phobos/std_parallelism.html#.TaskPool.amap|taskPool.amap]] and [[https://dlang.org/phobos/std_parallelism.html#.TaskPool.reduce|taskPool.reduce]] to unlock your full parallelism potential.
</note>
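As a taste of what these offer, the sketch below uses **taskPool.amap** and **taskPool.reduce**; the choice of //sqrt// and the summation are arbitrary examples.

<code d>
import std.array : array;
import std.math : sqrt;
import std.parallelism;
import std.range : iota;
import std.stdio;

void main() {
  auto numbers = iota(1.0, 100_000.0).array;

  // amap: eagerly map a function over the range in parallel,
  // returning an array with the results.
  auto roots = taskPool.amap!sqrt(numbers);

  // reduce: parallel fold, here summing all the square roots.
  auto total = taskPool.reduce!"a + b"(0.0, roots);

  writefln("roots[0] = %s, total = %s", roots[0], total);
}
</code>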
==== Concurrency ====

The concepts and mechanics exposed by the **std.concurrency** module are similar to, but different from, the ones discussed in relation to the **std.parallelism** module. Both involve executing operations on threads, and since parallelism is based on concurrency, the two are sometimes confused with each other.
There are a few key insights regarding the two programming models:
  * Probably the most notable difference between the two is that although both programming models use operating system threads, in parallelism threads are encapsulated by the concept of a task, whereas concurrency makes use of threads explicitly.
  * Another important aspect is that in parallelism, tasks are independent from each other. In fact, it would be a bug if they depended on results of other tasks that are running at the same time. In concurrency, it is normal for threads to depend on results of other threads.
  * Parallelism is easy to use, and as long as tasks are independent it is easy to produce programs that work correctly. Concurrency is easy only when it is based on message passing. It is very difficult to write correct concurrent programs if they are based on the traditional model of concurrency that involves lock-based data sharing.

Think of it this way: parallelism is most suited for solving **embarrassingly parallel problems**; concurrency will most likely be used to solve intricate problems.
=== Starting threads ===

**spawn()** takes a function pointer as a parameter and starts a new thread from that function. Any operations that are carried out by that function, including other functions that it may call, are executed on the new thread. The main difference between a thread that is started with [[https://dlang.org/phobos/std_concurrency.html#.spawn|spawn()]] and a thread that is started with [[https://dlang.org/phobos/std_parallelism.html#.task|task()]] is the fact that **spawn()** makes it possible for threads to send messages to each other.

As soon as a new thread is started, the owner and the worker start executing separately as if they were independent programs:
<code d>
import std.stdio;
import std.concurrency;
import core.thread;

void worker() {
  foreach (i; 0 .. 5) {
    Thread.sleep(500.msecs);
    writeln(i, " (worker)");
  }
}

void main() {
  spawn(&worker);
  foreach (i; 0 .. 5) {
    Thread.sleep(300.msecs);
    writeln(i, " (main)");
  }
  writeln("main is done.");
}
</code>
The program automatically waits for all of the threads to finish executing. We can see this in the output of the above program by the fact that **worker()** continues executing even after **main()** exits after printing //"main is done."//.
<code bash>
0 (main)
0 (worker)
1 (main)
2 (main)
1 (worker)
3 (main)
2 (worker)
4 (main)
main is done.
3 (worker)
4 (worker)
</code>
The parameters that the thread function takes are passed to **spawn()** as its second and later arguments.

<code d>
void worker(int x) {
  /* ... */
}

void main() {
  spawn(&worker, 42);
  /* ... */
}
</code>
<note tip>
Every operating system puts limits on the number of threads that can exist at one time. These limits can be set per user, per system, or in other ways. The overall performance of the system can be reduced if there are more busily working threads than cores in the system.

A thread that is busily working at a given time is said to be //CPU bound// at that point in time. On the other hand, some threads spend a considerable amount of their time waiting for some event to occur, like input from a user, data from a network connection, the completion of a Thread.sleep call, etc. Such threads are said to be //I/O bound// at those times.

If the majority of its threads are I/O bound, then a program can afford to start more threads than the number of cores without any degradation of performance. As with every design decision that concerns program performance, one must take actual measurements to be sure whether that really is the case.
</note>
=== Message passing ===

**send()** sends messages and **receiveOnly()** waits for a message of a particular type. (There are also **prioritySend()**, **receive()** and **receiveTimeout()**, which we encourage you to read about in the docs.)

The owner in the following program sends its worker a message of type **int** and waits for a message from the worker of type **double**. The threads continue sending messages back and forth until the owner sends a negative **int**. This is the owner thread:
<code d>
import std.stdio;
import std.concurrency;

void main() {
  Tid worker = spawn(&workerFunc);
  foreach (value; 1 .. 5) {
    worker.send(value);
    double result = receiveOnly!double();
    writefln("sent: %s, received: %s", value, result);
  }
  /* Sending a negative value to the worker so that it
   * terminates. */
  worker.send(-1);
}
</code>
The return value of **spawn()** is the id of the worker thread. **main()** stores the return value of **spawn()** under the name **worker** and uses that variable when sending messages to the worker.

On the other side, the worker receives the message that it needs as an **int**, uses that value in a calculation and sends the result as type **double** to its owner:
<code d>
import std.concurrency;
import std.conv : to;

void workerFunc() {
  int value = 0;

  while (value >= 0) {
    value = receiveOnly!int();
    double result = to!double(value) / 5;
    ownerTid.send(result);
  }
}
</code>
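Besides **receiveOnly()**, the **receive()** function lets a thread accept several message types, dispatching on the type of the incoming message. The sketch below is only illustrative: the message types and the empty **Stop** struct used as a "please terminate" message are arbitrary choices.

<code d>
import std.stdio;
import std.concurrency;
import core.thread;

struct Stop {}

void worker() {
  bool done = false;
  while (!done) {
    // receive() runs the handler that matches the type of the message.
    receive(
      (int i)    { writeln("got an int: ", i); },
      (string s) { writeln("got a string: ", s); },
      (Stop m)   { done = true; }
    );
  }
}

void main() {
  auto tid = spawn(&worker);
  tid.send(42);
  tid.send("hello");
  tid.send(Stop()); // ask the worker to stop
  thread_joinAll();
}
</code>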
<note tip>
We strongly encourage you to read more about message passing concurrency in [[http://ddili.org/ders/d.en/concurrency.html|this chapter]] from Ali's book.
</note>
=== Data sharing ===

We gave you a small insight into data sharing in D in [[https://ocw.cs.pub.ro/courses/dss/laboratoare/03#shared|lab03]].

Unlike most other programming languages, data is not automatically shared in D; data is thread-local by default.
Although module-level variables may give the impression of being accessible by all threads, each thread actually gets its own copy:
<code d>
import std.stdio;
import std.concurrency;
import core.thread;

int variable;
void printInfo(string message)
{
  writefln("%s: %s (@%s)", message, variable, &variable);
}

void worker()
{
  variable = 42;
  printInfo("Before the worker is terminated");
}

void main()
{
  spawn(&worker);
  thread_joinAll();
  printInfo("After the worker is terminated");
}
</code>
The variable that is modified inside **worker()** is not the same variable that is seen by **main()**.

**spawn()** does not allow passing references to thread-local variables.
Attempting to do so will result in a compilation error:
<code d>
void worker(bool* isDone) { /* ... */ }

void main() {
  bool isDone = false;
  spawn(&worker, &isDone); // Error: Aliases to mutable thread-local data not allowed.
}
</code>
Mutable variables that need to be shared must be defined with the **shared** keyword.

<code d>
void worker(shared(bool)* isDone) { /* ... */ }

void main() {
  shared(bool) isDone = false;
  spawn(&worker, &isDone); // OK: isDone is shared
  /* ... */
}
</code>
On the other hand, since **immutable** variables cannot be modified, there is no problem with sharing them directly. For that reason, **immutable** implies **shared**.
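For instance, immutable data can be passed to **spawn()** directly; the sketch below is just an illustration, with arbitrary function and variable names.

<code d>
import std.stdio;
import std.concurrency;
import core.thread;

void worker(immutable(int)[] data) {
  writeln("worker received: ", data);
}

void main() {
  immutable(int)[] numbers = [1, 2, 3];
  // Sharing immutable data is safe because it can never be mutated.
  spawn(&worker, numbers);
  thread_joinAll();
}
</code>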
== shared is transitive ==

As you may remember, in the D programming language the **const** and **immutable** type qualifiers are transitive.
The same is true for the **shared** type qualifier.

<code d>
shared int* pInt;
shared(int*) pInt;
</code>

The two declarations above are equivalent.
The correct meaning of pInt is "the pointer is shared and the data pointed to by the pointer is also shared".

There is also a notion of an "unshared pointer to shared data" that does hold water: some thread holds a private pointer, and the pointer "looks" at shared data. That is easily expressible syntactically as
<code d>
shared(int)* pInt;
</code>
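A small sketch of the difference between the two forms (the variable names are arbitrary):

<code d>
void main() {
  shared int x = 42;

  // Thread-local pointer to shared data: the pointer itself is not
  // shared, but the int it points to is.
  shared(int)* p = &x;

  // int* q = &x; // error: cannot implicitly convert shared(int)* to int*
}
</code>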
== Race conditions ==

The correctness of the program requires extra attention when mutable data is shared between threads.
<code d>
import std.concurrency;
import core.thread;

void inc(shared(int)* val) {
  ++*val;
}

void main() {
  shared int x = 0;
  foreach (i; 0 .. 10) {
    spawn(&inc, &x);
  }
  thread_joinAll();
}
</code>
The code above exemplifies a simple race condition: it is called a race because any thread can access (read and/or write) the shared variable at any given time. As the threads run in a nondeterministic order, the result of the operation is also nondeterministic. Although it is possible for the program to produce the expected result (10, in this case), most of the time the actual outcome will be wrong (corrupted).
== synchronized ==

The incorrect program behavior above is due to more than one thread accessing the same mutable data (and at least one of them modifying it). One way of avoiding these race conditions is to mark the common code with the **synchronized** keyword. The program would work correctly with the following change:
<code d>
void inc(shared(int)* val) {
  synchronized {
    ++*val;
  }
}
</code>
A synchronized block will create an anonymous lock and use it to serialize the critical section.
If we need to synchronize access to a shared variable in multiple **synchronized** blocks, we need to create a lock object and pass it to the **synchronized** statement.

There is no need for a special lock type in D because any class object can be used as a **synchronized** lock. The following program defines an empty class named **Lock** to use its objects as locks:
<code d>
import std.stdio;
import std.concurrency;
import core.thread;

enum count = 1000;

class Lock {}

void incrementer(shared(int)* value, shared(Lock) lock) {
  foreach (i; 0 .. count) {
    synchronized (lock) {
      *value = *value + 1;
    }
  }
}

void decrementer(shared(int)* value, shared(Lock) lock) {
  foreach (i; 0 .. count) {
    synchronized (lock) {
      *value = *value - 1;
    }
  }
}

void main() {
  shared(Lock) lock = new shared(Lock)();
  shared(int) number = 0;
  foreach (i; 0 .. 100) {
    spawn(&incrementer, &number, lock);
    spawn(&decrementer, &number, lock);
  }

  thread_joinAll();
  writeln("Final value: ", number);
}
</code>
Because both synchronized blocks are connected by the same lock, only one of them is executed at a given time and the result is zero, as expected.

<note tip>
It is a relatively expensive operation for a thread to wait for a lock, which may slow down the execution of the program noticeably. Fortunately, in some cases program correctness can be ensured without the use of a synchronized block, by taking advantage of [[http://ddili.org/ders/d.en/concurrency_shared.html|atomic operations]].
</note>
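For example, the racy increment from above can likely be fixed without a lock by using **atomicOp** from **core.atomic**; a minimal sketch:

<code d>
import core.atomic : atomicOp;
import core.thread : thread_joinAll;
import std.concurrency : spawn;
import std.stdio;

void inc(shared(int)* val) {
  // The read-modify-write happens as one indivisible operation,
  // so no lock is needed.
  atomicOp!"+="(*val, 1);
}

void main() {
  shared int x = 0;
  foreach (i; 0 .. 10) {
    spawn(&inc, &x);
  }
  thread_joinAll();
  writeln("x = ", x); // always 10
}
</code>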
==== Fibers ====

As we've previously discussed, modern operating systems implement multitasking through the use of threads and context switching, also known as preemptive multitasking.

A thread is given, by the kernel, a slice of time to run on the physical core; when its time has elapsed or if the thread is doing a blocking operation (waiting for an I/O operation to complete), the thread is preempted and the kernel chooses another thread to run.

Ideally, from a thread's point of view, it would run until its time slice has elapsed. For HPC applications this might very well be the case, but for applications and services that interact with users and/or the disk, this means a lot of blocking I/O operations that will result in an early context switch. Since every thread is competing with all the other threads in the system for its time slice, being preempted after only a third of that slice is not ideal: it might take significantly more time until the thread gets scheduled again than it took for the I/O operation to complete. To mitigate this problem, developers use asynchronous operating system APIs to achieve [[http://vibed.org/features#aio|Asynchronous I/O operations]].

With the asynchronous I/O model (AIO), it can become tedious and confusing to write sequences of code (e.g. performing multiple consecutive database queries). Each step introduces a new callback with a new scope, and error callbacks often have to be handled separately. Especially the latter is a reason why it is tempting to just perform lax error handling. Another consequence of asynchronous callbacks is that there is no meaningful call stack. Not only can this make debugging more difficult, but features such as exceptions cannot be used effectively in such an environment.

A success story in the D community is the vibe.d framework, which achieves AIO through a simple interface. The approach of vibe.d is to use asynchronous I/O under the hood, but at the same time make it seem as if all operations were synchronous and blocking, just like ordinary I/O.

What makes this possible is D's support for so-called fibers (also often called co-routines). Fibers behave a lot like threads, except that they actually all run in the same thread. As soon as a running fiber calls a special **yield()** function, it returns control to the function that started the fiber. The fiber can then later be resumed at exactly the position and with the same state it had when it called **yield()**. This way fibers can be multiplexed together, running quasi-parallel and using each thread's capacity as much as possible.

A fiber is a thread of execution that enables a single thread to achieve multiple tasks.
Compared to regular threads, which are commonly used in parallelism and concurrency, it is more efficient to switch between fibers. Fibers are similar to //coroutines// and //green threads//.

Fibers are a form of cooperative multitasking. As the name implies, cooperative multitasking requires some help from the user functions. A function runs up to a point that the developer decides is a good place to run another task. Usually, a library function named yield() is called, which continues the execution of another function. This is best shown with an example. Here is a simplified version of the classic producer-consumer pattern:
<code d>
import std.stdio;
import core.thread;

private int goods;
private bool exit;

void producerFiber()
{
  foreach (i; 0 .. 3)
  {
    goods = i ^^ 2;
    writefln("Produced %s", goods);
    Thread.sleep(500.msecs);
    Fiber.yield();
  }
}

void consumerFiber()
{
  while (!exit)
  {
    /* do something */
    writefln("Consumed %s", goods);
    Thread.sleep(500.msecs);
    Fiber.yield();
  }
}

void main()
{
  auto producer = new Fiber(&producerFiber);
  auto consumer = new Fiber(&consumerFiber);
  while (producer.state != Fiber.State.TERM)
  {
    producer.call();
    exit = producer.state == Fiber.State.TERM;
    consumer.call();
  }
}
</code>
We know this looks like a lot to process, but it's actually not that complicated to understand.
First, we create two fiber instances, **producer** and **consumer**, each of which receives a **function** or **delegate** with the code it will execute. When **main()** issues the **producer.call()** method, control is passed to the producer and the code from **producerFiber** starts executing. Control is transferred back to **main()** by the **Fiber.yield()** call from **producerFiber**; when a future **producer.call()** is made, the code resumes right after the **Fiber.yield()** call.
Next, **main()** checks whether the producer has finished executing and then passes control to the **consumer** fiber through the same API.
<note tip>
For a detailed and thorough discussion about fibers, have a read [[http://ddili.org/ders/d.en/fibers.html|here]].
</note>
==== Exercises ====

The lab can be found at this [[https://github.com/RazvanN7/D-Summer-School/tree/master/lab-05|link]].
=== 1. Parallel programming ===

Navigate to the 1-parallel directory. Read and understand the source file students.d. Compile and run the program, and explain the behaviour.

  - What is the issue, if any?
  - We want to fix the issue, but we want to continue using **Task**s.
  - Do we really have to manage all of this ourselves? I think we can do a better **parallel** job.
  - Increase the number of students by a factor of 10, then 100. Does the code scale?
=== 2. Getting functional with parallel programming ===

Navigate to the 2-parallel directory. Read and understand the source file students.d.

  - The code looks simple enough, but always ask yourselves: can we do better? Can we change the **foreach** into a one-liner?
  - Increase the number of students by a factor of 10, then 100. Does the code scale?
  - Depending on the size of our data, we might gain performance by tweaking the **workUnitSize** parameter. Let's try it out.
=== 3. Heterogeneous tasks ===

Until now we've been using **std.parallelism** on sets of homogeneous tasks.
Q: What happens when we want to perform parallel computations on distinct, unrelated tasks?
A: We can use [[https://dlang.org/phobos/std_parallelism.html#.TaskPool|taskPool]] to run our tasks on a pool of worker threads.

Navigate to the 3-taskpool directory. Write a program that performs three tasks in parallel:
  - One reads the contents of **in.txt** and writes to stdout the total number of lines in the file
  - One calculates the average from the previous exercise
  - One does a task of your choice

To submit tasks to the **taskPool** use [[https://dlang.org/phobos/std_parallelism.html#.TaskPool.put|put]].
<note>
Don't forget to wait for your tasks to finish.
</note>
=== 4. I did it My way ===

Let's implement our own concurrent **map** function.
Navigate to the 4-concurrent-map directory. Starting from the serial implementation found in **mymap.d**, modify the code such that the call to the **mymap** function executes on multiple threads. You are required to use the **std.concurrency** module for this task.

Creating a thread implies some overhead, so we don't want to create a thread for each element, but rather have a thread process chunks of elements; basically, we need a **workUnitSize**.
=== 5. Don't stop me now ===

Since we just got started, let's also implement our own concurrent **reduce** function. **reduce** must take the initial accumulator value as its first parameter, and then the list of elements to reduce.

<note>
Be careful about those race conditions.
</note>
=== 6. Under pressure ===

The implementations we wrote in exercises 4 and 5 are great and all, but they have the following shortcoming: they will each spawn a number of threads (most likely equal to the number of physical cores), so calling them both in parallel will spawn twice the number of threads that can run in parallel.

Change your implementations to use a thread pool. The worker threads will consume jobs from a queue. The map and reduce implementations will push job abstractions into the queue.

Now we're talking!