### Parallelism

Most modern microprocessors consist of more than one core, each of which can operate as an individual processing unit. They can execute different parts of different programs at the same time.

A flow of execution through certain parts of a program is called a thread of execution, or simply a thread. Programs can consist of multiple threads that are actively executed at the same time. The operating system starts each thread on a core, runs it for a while, and then suspends it so that other threads can run; each thread therefore competes with the other threads in the system for computational time on the processor. The execution of each thread may involve many cycles of starting and suspending.

All of the threads of all of the programs that are active at a given time are executed on the same cores of the microprocessor. The operating system decides when and under what conditions to start and suspend each thread. We call this start/suspend/swap process a context switch.

The features of the std.parallelism module make it possible for programs to take advantage of all of the cores in order to run faster.

Operations that are executed in parallel with other operations of a program are called tasks. Tasks are represented by the type std.parallelism.Task.

Task represents the fundamental unit of work. A Task may be executed in parallel with any other Task. Using this struct directly allows future/promise parallelism. In this paradigm, a function (or delegate or other callable) is executed in a thread other than the one it was called from. The calling thread does not block while the function is being executed.

For simplicity, the std.parallelism.task and std.parallelism.scopedTask functions are generally used to create an instance of the Task struct.

Using the Task struct has three steps:

1. First, we need to create a task instance.

import std.parallelism;
import std.stdio;

int anOperation(string id) {
    writefln("Executing %s", id);
    return 42;
}

void main() {
    /* Construct a task object that will execute
     * anOperation(). The function parameters that are
     * specified here are passed to the task function as its
     * function parameters. */
    auto theTask = task!anOperation("theTask");
    /* the main thread continues to do stuff */
}

2. Now we've just created a new Task instance, but the task isn't running yet. Next we'll launch the task execution.

  /* ... */

/* ... */

3. At this point we are sure that the operation has been started, but we cannot be sure whether theTask has completed its execution. yieldForce() waits for the task to complete its operations; it returns only when the task has been completed. Its return value is the return value of the task function, i.e. anOperation().

    /* ... */
    immutable taskResult = theTask.yieldForce();
    writefln("All finished; the result is %s\n", taskResult);
    /* ... */
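The three steps above can be put together in one small program. This is only a sketch: it assumes the task is launched with executeInNewThread(), which is one of the ways Task can be started (submitting it to a pool with taskPool.put(theTask) is another), and the lab's own code may use either.

```d
import std.parallelism : task;
import std.stdio : writefln;

int anOperation(string id) {
    writefln("Executing %s", id);
    return 42;
}

void main() {
    // Step 1: create the task; nothing is running yet.
    auto theTask = task!anOperation("theTask");

    // Step 2: launch it. Assumption: a dedicated new thread is used here;
    // taskPool.put(theTask) would queue it on the shared pool instead.
    theTask.executeInNewThread();

    // The main thread is free to do other work at this point.

    // Step 3: wait for completion and fetch the return value.
    immutable taskResult = theTask.yieldForce();
    writefln("All finished; the result is %s", taskResult);
}
```

yieldForce() lets the waiting thread be suspended while the task runs, which is usually the right default when the task may take a while.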

The Task struct has two other methods, workForce and spinForce, that are used to ensure that the Task has finished executing and to obtain the return value, if any. Read their docs and discover the differences in behaviour and when their usage is preferred.

As we've previously stated, all of the threads of all of the programs that are active at a given time are executed on the same cores of the microprocessor, competing for computational time with each other.

This observation has the following implication: on a system that has N cores, at most N threads can run in parallel at a given time. In our application we should therefore create at most N worker threads that execute tasks from a task queue for us. These N worker threads form a thread pool, a common pattern in concurrent applications.

The std.parallelism module gives us access to a ready-to-use std.parallelism.TaskPool instance, named std.parallelism.taskPool. std.parallelism.taskPool has totalCPUs - 1 worker threads available, where totalCPUs is the total number of CPU cores available on the current machine, as reported by the operating system. The minus 1 is included because the main thread will also be available to do work.
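To see these numbers on your own machine, a minimal sketch along the following lines can help. The square function is a hypothetical stand-in for real work; totalCPUs, taskPool.size and taskPool.put are taken from std.parallelism.

```d
import std.parallelism : task, taskPool, totalCPUs;
import std.stdio : writefln;

// Hypothetical stand-in for a real unit of work.
int square(int x) {
    return x * x;
}

void main() {
    // The values printed here depend on the machine running the program.
    writefln("cores: %s, pool workers: %s", totalCPUs, taskPool.size);

    // Queue a task on the default pool instead of starting a new thread.
    auto t = task!square(7);
    taskPool.put(t);
    writefln("7 squared is %s", t.yieldForce());
}
```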

import std.stdio;
import core.thread;

struct Student {
    int number;

    void aSlowOperation() {
        writefln("The work on student %s has begun", number);
        // Wait for a while to simulate a long-lasting operation
        Thread.sleep(1.seconds);
        writefln("The work on student %s has ended", number);
    }
}

void main() {
    auto students = [ Student(1), Student(2), Student(3), Student(4) ];

    foreach (student; students) {
        student.aSlowOperation();
    }
}

In the code above, as the foreach loop normally operates on elements one after the other, aSlowOperation() would be called for each student sequentially. However, in many cases it is not necessary for the operations of preceding students to be completed before starting the operations of successive students. If the operations on the Student objects were truly independent, it would be wasteful to ignore the other microprocessor cores, which might potentially be waiting idle on the system.

Meet taskPool.parallel. This function can also be called simply as parallel(). parallel() accesses the elements of a range in parallel. An effective usage is with foreach loops. Merely importing the std.parallelism module and replacing students with parallel(students) in the program above is sufficient to take advantage of all of the cores of the system.

This simple change

    /* ... */
    foreach (student; parallel(students)) {
    /* ... */

is enough to drop our application's total running time from 4 seconds to roughly 1 second on a machine with at least four cores.

In Ali's foreach for structs and classes chapter we can see that the expressions that are in foreach blocks are passed to opApply() member functions as delegates. parallel() returns a range object that knows how to distribute the execution of the delegate to a separate core for each element.

parallel() constructs a new Task object for every worker thread and starts that task automatically. parallel() then waits for all of the tasks to be completed before finally exiting the loop. parallel() is very convenient as it constructs, starts, and waits for the tasks automatically.
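As a further illustration, the sketch below uses parallel() with an index and a ref element so that each iteration writes only to its own slot. The helper name squaresInParallel is hypothetical, and the second argument to parallel() is the optional work unit size (how many elements each task processes at a time), which is not discussed above.

```d
import std.parallelism : parallel;
import std.stdio : writeln;

// Hypothetical helper: returns [0*0, 1*1, ..., (n-1)*(n-1)].
int[] squaresInParallel(size_t n) {
    auto results = new int[n];

    // Each iteration touches only its own element, so the iterations are
    // independent and no locking is needed. The 2 is the work unit size.
    foreach (i, ref slot; parallel(results, 2)) {
        slot = cast(int)(i * i);
    }
    return results;
}

void main() {
    writeln(squaresInParallel(5)); // prints [0, 1, 4, 9, 16]
}
```

Although the iterations may run in any order, the result is deterministic because every slot is written exactly once.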

### Concurrency

The concepts and mechanics exposed by the std.concurrency module are similar to, but different from, the ones discussed in relation to the std.parallelism module. They both involve executing operations on threads, and as parallelism is based on concurrency, they are sometimes confused with each other. There are a few key insights regarding the two programming models:

• Probably the most notable difference between the two is that although both programming models use operating system threads, in parallelism threads are encapsulated by the concept of a task, whereas concurrency makes use of threads explicitly.
• Another important aspect is that in parallelism, tasks are independent from each other. In fact, it would be a bug if they did depend on results of other tasks that are running at the same time. In concurrency, it is normal for threads to depend on results of other threads.
• Parallelism is easy to use, and as long as tasks are independent it is easy to produce programs that work correctly. Concurrency is easy only when it is based on message passing. It is very difficult to write correct concurrent programs if they are based on the traditional model of concurrency that involves lock-based data sharing.

Think of it this way: parallelism is most suited for solving embarrassingly parallel problems; concurrency will most likely be used to solve intricate problems.

spawn() takes a function pointer as a parameter and starts a new thread from that function. Any operations that are carried out by that function, including other functions that it may call, would be executed on the new thread. The main difference between a thread that is started with spawn() and a thread that is started with task() is the fact that spawn() makes it possible for threads to send messages to each other.

As soon as a new thread is started, the owner and the worker start executing separately as if they were independent programs:

import std.stdio;
import std.concurrency;

void worker() {
    foreach (i; 0 .. 5) {
        writeln(i, " (worker)");
    }
}

void main() {
    spawn(&worker);
    foreach (i; 0 .. 5) {
        writeln(i, " (main)");
    }
    writeln("main is done.");
}

The program automatically waits for all of the threads to finish executing. We can see this in the output of the above program by the fact that worker() continues executing even after main() exits after printing “main is done.”.

0 (main)
0 (worker)
1 (main)
2 (main)
1 (worker)
3 (main)
2 (worker)
4 (main)
main is done.
3 (worker)
4 (worker)

The parameters that the thread function takes are passed to spawn() as its second and later arguments.

void worker(int x) {
/* ... */

void main() {
spawn(&worker, 42);
/* ... */

Every operating system puts limits on the number of threads that can exist at one time. These limits can be set for each user, for the whole system, or for something else. The overall performance of the system can be reduced if there are more threads that are busily working than the number of cores in the system.

A thread that is busily working at a given time is said to be CPU bound at that point in time. On the other hand, some threads spend a considerable amount of their time waiting for some event to occur, such as input from a user, data from a network connection, or the completion of a Thread.sleep call. Such threads are said to be I/O bound at those times.

If the majority of its threads are I/O bound, then a program can afford to start more threads than the number of cores without any degradation of performance. As with every design decision that concerns program performance, one must take actual measurements to be sure that this really is the case.

#### Message passing

send() sends messages and receiveOnly() waits for a message of a particular type. (There are also prioritySend(), receive() and receiveTimeout(), which we encourage you to read about in the docs.)

The owner in the following program sends its worker a message of type int and waits for a message from the worker of type double. The threads continue sending messages back and forth until the owner sends a negative int. This is the owner thread:

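import std.stdio;
import std.concurrency;
import std.conv;

void main() {
    Tid worker = spawn(&workerFunc);

    foreach (value; 1 .. 5) {
        worker.send(value);
        // Wait for the worker's reply, which arrives as a double
        double result = receiveOnly!double();
        writefln("sent: %s, received: %s", value, result);
    }

    /* Sending a negative value to the worker so that it
     * terminates. */
    worker.send(-1);
}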

The return value of spawn() is the id of the worker thread. main() stores the return value of spawn() under the name worker and uses that variable when sending messages to the worker.

On the other side, the worker receives the message that it needs as an int, uses that value in a calculation and sends the result as type double to its owner:

void workerFunc() {
    int value = 0;

    while (value >= 0) {
        // Wait for the next int sent by the owner
        value = receiveOnly!int();
        double result = to!double(value) / 5;
        ownerTid.send(result);
    }
}
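When a thread has to handle more than one message type, receive() can be given one handler per type. The following is a minimal sketch of that idea; the dispatcher function and its doubling and string-length behaviour are made up for illustration.

```d
import std.concurrency : ownerTid, receive, receiveOnly, send, spawn;
import std.stdio : writeln;

// Hypothetical worker: answers ints and strings until it gets a negative int.
void dispatcher() {
    bool done = false;
    while (!done) {
        receive(
            (int value) {
                if (value < 0) {
                    done = true;               // stop request
                } else {
                    ownerTid.send(value * 2);  // reply with the doubled value
                }
            },
            (string text) {
                ownerTid.send(cast(int) text.length); // reply with the length
            }
        );
    }
}

void main() {
    auto worker = spawn(&dispatcher);

    worker.send(21);
    writeln(receiveOnly!int()); // prints 42

    worker.send("hello");
    writeln(receiveOnly!int()); // prints 5

    worker.send(-1); // ask the worker to terminate
}
```

receive() picks the handler whose parameter type matches the incoming message, so the order in which the handlers are listed does not matter here.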

We strongly encourage you to read more about message passing concurrency in this chapter from Ali's book.

#### Data sharing

We gave you a small insight about data sharing in D in lab03.

Unlike most other programming languages, data is not automatically shared in D; data is thread-local by default. Although module-level variables may give the impression of being accessible by all threads, each thread actually gets its own copy:

import std.stdio;
import std.concurrency;
import core.thread;

int variable;

void printInfo(string message)
{
    writefln("%s: %s (@%s)", message, variable, &variable);
}

void worker()
{
    variable = 42;
    printInfo("Before the worker is terminated");
}

void main()
{
    spawn(&worker);
    // Wait for the worker to finish before inspecting main's copy
    thread_joinAll();
    printInfo("After the worker is terminated");
}

The variable that is modified inside worker() is not the same variable that is seen by main().

spawn() does not allow passing references to thread-local variables. Attempting to do so will result in a compilation error:

void worker(bool* isDone) { /* ... */ }

void main() {
    bool isDone = false;
    spawn(&worker, &isDone); // Error: Aliases to mutable thread-local data not allowed.
}

Mutable variables that need to be shared must be defined with the shared keyword.

void worker(shared(bool)* isDone) { /* ... */ }

void main() {
    shared(bool) isDone = false;
    spawn(&worker, &isDone); // OK: the pointed-to data is shared
}

On the other hand, since immutable variables cannot be modified, there is no problem with sharing them directly. For that reason, immutable implies shared.
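As a small sketch of this rule, the program below passes a slice of immutable data to a new thread without any shared qualifier. The summer function is a hypothetical example, not part of the lab code.

```d
import std.algorithm : sum;
import std.concurrency : ownerTid, receiveOnly, send, spawn;
import std.stdio : writeln;

// Hypothetical worker: the immutable elements can be read from any thread.
void summer(immutable(int)[] numbers) {
    ownerTid.send(numbers.sum);
}

void main() {
    immutable(int)[] numbers = [1, 2, 3, 4];

    // Accepted by spawn(): immutable data cannot be modified by anyone.
    spawn(&summer, numbers);

    writeln(receiveOnly!int()); // prints 10
}
```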

##### shared is transitive

As you may remember, in the D programming language, the const and immutable type qualifiers are transitive. The same is true for the shared type qualifier.

shared int* pInt;
shared(int*) pInt;

The statements above are equivalent. The correct meaning of pInt is “The pointer is shared and the data pointed to by the pointer is also shared.”

There is also a notion of “unshared pointer to shared data” that does hold water. Some thread holds a private pointer, and the pointer “looks” at shared data. That is easily expressible syntactically as

shared(int)* pInt;
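The difference between the two spellings can be checked at compile time. The sketch below uses is() expressions; the alias and enum names are hypothetical and exist only for this illustration.

```d
import std.stdio : writeln;

alias FullyShared = shared(int*);   // same type as `shared int*`
alias PrivatePointer = shared(int)*; // unshared pointer to shared data

// shared is transitive: sharing the pointer also shares what it points to.
enum bool transitive = is(FullyShared == shared(shared(int)*));

// Is the pointer itself (the outermost part of the type) shared?
enum bool fullHeadShared = is(FullyShared == shared);
enum bool privateHeadShared = is(PrivatePointer == shared);

void main() {
    writeln(transitive);        // prints true
    writeln(fullHeadShared);    // prints true
    writeln(privateHeadShared); // prints false
}
```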

##### Race conditions

The correctness of the program requires extra attention when mutable data is shared between threads.

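void inc(shared(int)* val) {
    // Unsynchronized read-modify-write: this is where the race happens.
    // (Current compilers reject ++*val on shared data outright.)
    *val = *val + 1;
}

void main() {
    shared int x = 0;
    foreach (i; 0 .. 10) {
        spawn(&inc, &x);
    }
}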

The code above exemplifies a simple race condition: it's called a race because any thread can access (read and/or write) the shared variable at any given time. As the threads are run in a nondeterministic order, the result of the operation is also nondeterministic. Although it is possible for the program to produce the expected result (x ending up as 10), nothing guarantees it, and the actual outcome may be wrong (corrupted).

##### synchronized

The incorrect program behavior above is due to more than one thread accessing the same mutable data (and at least one of them modifying it). One way of avoiding these race conditions is to mark the common code with the synchronized keyword. The program would work correctly with the following change:

void inc(shared(int)* val) {
    synchronized {
        *val = *val + 1;
    }
}

A synchronized block will create an anonymous lock and use it to serialize the critical section. If we need to synchronize access to a shared variable in multiple synchronized blocks, we need to create a lock object and pass it to the synchronized statement.

There is no need for a special lock type in D because any class object can be used as a synchronized lock. The following program defines an empty class named Lock to use its objects as locks:

import std.stdio;
import std.concurrency;
import core.thread;

enum count = 1000;

class Lock {}

void incrementer(shared(int) * value, shared(Lock) lock) {
    foreach (i; 0 .. count) {
        synchronized (lock) {
            *value = *value + 1;
        }
    }
}

void decrementer(shared(int) * value, shared(Lock) lock) {
    foreach (i; 0 .. count) {
        synchronized (lock) {
            *value = *value - 1;
        }
    }
}

void main() {
    shared(Lock) lock = new shared(Lock)();
    shared(int) number = 0;

    foreach (i; 0 .. 100) {
        spawn(&incrementer, &number, lock);
        spawn(&decrementer, &number, lock);
    }

    // Wait for all workers; otherwise the value is read too early
    thread_joinAll();
    writeln("Final value: ", number);
}

Because both synchronized blocks are connected by the same lock, only one of them is executed at a given time and the result is zero as expected.

It is a relatively expensive operation for a thread to wait for a lock, which may slow down the execution of the program noticeably. Fortunately, in some cases program correctness can be ensured without the use of a synchronized block, by taking advantage of atomic operations.
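As a sketch of the atomic alternative, the incrementer/decrementer program can be written with core.atomic instead of a lock. atomicOp and atomicLoad are part of core.atomic; the smaller thread count used here is an arbitrary choice for the example.

```d
import core.atomic : atomicLoad, atomicOp;
import core.thread : thread_joinAll;
import std.concurrency : spawn;
import std.stdio : writeln;

enum count = 1000;

void incrementer(shared(int)* value) {
    foreach (i; 0 .. count) {
        atomicOp!"+="(*value, 1); // indivisible read-modify-write
    }
}

void decrementer(shared(int)* value) {
    foreach (i; 0 .. count) {
        atomicOp!"-="(*value, 1);
    }
}

void main() {
    shared(int) number = 0;

    foreach (i; 0 .. 10) {
        spawn(&incrementer, &number);
        spawn(&decrementer, &number);
    }

    thread_joinAll(); // wait for every worker before reading the result
    writeln("Final value: ", atomicLoad(number)); // prints "Final value: 0"
}
```

Atomic operations suit single-variable updates like this counter; when several variables must change together, a synchronized block is still the simpler tool.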

### Fibers

As we've previously discussed, modern operating systems implement multitasking through the use of threads and context switching, also known as preemptive multitasking.

A thread is given, by the kernel, a slice of time to run on the physical core; when its time has elapsed or if the thread is doing a blocking operation (waiting for an I/O operation to complete), the thread is preempted and the kernel chooses another thread to run.

Ideally, from a thread's point of view, a thread would run until its time slice has elapsed. For HPC applications this might very well be the case, but applications and services that have to interact with users and/or the disk perform a lot of blocking I/O operations, each of which results in an early context switch. Since every thread competes with all the other threads in the system for its time slice, being preempted a third of the way into a slice is not ideal: it might take significantly more time for the thread to be scheduled again than it took for the I/O operation to complete. To mitigate this problem, developers use asynchronous operating system APIs to perform asynchronous I/O operations.

Working with the asynchronous I/O model (AIO) can become tedious and confusing when writing sequences of operations (e.g. performing multiple consecutive database queries). Each step introduces a new callback with a new scope, and error callbacks often have to be handled separately. The latter in particular makes it tempting to perform only lax error handling. Another consequence of asynchronous callbacks is that there is no meaningful call stack. Not only can this make debugging more difficult, but features such as exceptions cannot be used effectively in such an environment.

A success story in the D community is the vibe.d framework, which provides AIO through a simple interface. The approach of vibe.d is to use asynchronous I/O under the hood, but at the same time make it seem as if all operations were synchronous and blocking, just like ordinary I/O.

What makes this possible is D's support for so-called fibers (also often called co-routines). Fibers behave a lot like threads, except that they all run in the same thread. As soon as a running fiber calls a special yield() function, it returns control to the function that started the fiber. The fiber can then later be resumed at exactly the position and with the same state it had when it called yield(). This way fibers can be multiplexed together, running quasi-parallel and using each thread's capacity as much as possible.

A fiber is a thread of execution that enables a single thread to achieve multiple tasks. Compared to regular threads that are commonly used in parallelism and concurrency, it is more efficient to switch between fibers. Fibers are similar to coroutines and green threads.

Fibers are a form of cooperative multitasking. As the name implies, cooperative multitasking requires some help from the user functions. A function runs up to a point where the developer decides would be a good place to run another task. Usually, a library function named yield() is called, which continues the execution of another function. This is best shown with an example. Here is a simplified version of the classic producer-consumer pattern:

import std.stdio;
import core.thread;

private int goods;
private bool exit;

void producerFiber()
{
    foreach (i; 0..3)
    {
        goods = i^^2;
        writefln("Produced %s", goods);
        Fiber.yield();
    }
}

void consumerFiber()
{
    while (!exit)
    {
        /* do something */
        writefln("Consumed %s", goods);
        Fiber.yield();
    }
}

void main()
{
    auto producer = new Fiber(&producerFiber);
    auto consumer = new Fiber(&consumerFiber);
    while (producer.state != Fiber.State.TERM)
    {
        producer.call();
        exit = producer.state == Fiber.State.TERM;
        consumer.call();
    }
}

We know this looks like a lot to process, but it's actually not that complicated to understand. First, we create two fiber instances, producer and consumer, each of which receives the function or delegate that it will execute. When main() calls producer.call(), the “control” is passed to the producer and the code from producerFiber starts executing. The control is transferred back to main() by the Fiber.yield() call from producerFiber; when a future producer.call() is made, the code will resume after the Fiber.yield() call. Next, main() checks if the producer has finished executing and then passes the control to the consumer fiber through the same API.
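The same call()/yield() mechanics can also be used to build a simple generator. The sketch below is an illustration only: the firstFibonacci helper is hypothetical, and the fiber communicates with its caller through a captured local variable rather than a module-level one.

```d
import core.thread : Fiber;
import std.stdio : writeln;

// Hypothetical helper: collects the first n Fibonacci numbers from a fiber.
int[] firstFibonacci(size_t n) {
    int current = 0; // the fiber stores each new value here

    auto generator = new Fiber({
        int a = 0, b = 1;
        while (true) {
            current = a;
            Fiber.yield(); // pause here until the next call()
            immutable next = a + b;
            a = b;
            b = next;
        }
    });

    int[] result;
    foreach (i; 0 .. n) {
        generator.call(); // resume the fiber until its next yield()
        result ~= current;
    }
    return result;
}

void main() {
    writeln(firstFibonacci(8)); // prints [0, 1, 1, 2, 3, 5, 8, 13]
}
```

The local variables a and b live on the fiber's own stack, which is why they keep their values between calls.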

For a detailed and thorough discussion about fibers, have a read here.

### Exercises

The lab can be found at this link.

#### 1. Parallel programming

Navigate to the 1-parallel directory. Read and understand the source file students.d. Compile and run the program, and explain the behaviour.

1. What is the issue, if any?
2. We want to fix the issue, but we want to continue using Tasks.
3. Do we really have to manage all of this ourselves? I think we can do a better parallel job.
4. Increase the number of students by a factor of 10, then 100. Does the code scale?

#### 2. Getting functional with parallel programming

Navigate to the 2-parallel directory. Read and understand the source file students.d.

1. The code looks simple enough, but always ask yourselves: can we do better? Can we change the foreach into a oneliner?
2. Increase the number of students by a factor of 10, then 100. Does the code scale?
3. Depending on the size of our data, we might gain performance by tweaking the workUnitSize parameter. Let's try it out.

Until now we've been using std.parallelism on sets of homogeneous tasks. Q: What happens when we want to perform parallel computations on distinct, unrelated tasks? A: We can use taskPool to run our task on a pool of worker threads.

Navigate to the 3-taskpool directory. Write a program that performs three tasks in parallel:

1. One reads the contents of in.txt and writes to stdout the total number of lines in the file
2. One calculates the average from the previous exercise

#### 4. I did it My way

Let's implement our own concurrent map function. Navigate to the 4-concurrent-map directory. Starting from the serial implementation found in mymap.d, modify the code so that the call to the mymap function executes on multiple threads. You are required to use the std.concurrency module for this task.

Creating a thread implies some overhead, thus we don't want to create a thread for each element, but rather have a thread process chunks of elements; basically we need a workUnitSize.

#### 5. Don't stop me now

Since we just got started, let's implement our own concurrent reduce function. reduce must take the initial accumulator value as its first parameter, and then the list of elements to reduce.

Be careful about those race conditions.

#### 6. Under pressure

The implementations we did at ex. 4 and ex. 5 are great and all, but they have the following shortcoming: they will each spawn a number of threads (most likely equal to the number of physical cores), so calling them both in parallel will spawn twice the number of threads that can run in parallel.

Change your implementations to use a thread pool. The worker threads will consume jobs from a queue. The map and reduce implementations will push job abstractions into the queue.

Now we're talking!