Parallelization of Sequential Applications using .NET Framework 4.5

An evaluation, application and discussion about the parallel extensions of the .NET managed concurrency library.

JOHAN LITSFELDT

Master of Science Thesis
Degree project in Program System Technology at KTH Information and Communication Technology
Supervisor: Mats Brorsson, Marika Engström
Examiner: Mats Brorsson

Stockholm, Sweden 2013
TRITA-ICT-EX-2013:160
Abstract
Modern processor construction has taken a new turn in that adding more cores to processors appears to be the norm instead of simply relying on clock speed improvements. Much of the responsibility for writing efficient applications has thus been moved from the hardware designers to the software developers. However, issues related to scalability, synchronization, data dependencies and debugging make this troublesome for developers.
    By using the .NET Framework 4.5, many of the mentioned issues are alleviated through the use of the parallel extensions, including TPL, PLINQ and other constructs designed specifically for highly concurrent applications. Analysis, profiling and debugging of parallel applications have also been made less problematic in that the Visual Studio 2012 IDE provides such functionality to a great extent.
    In this thesis, the parallel extensions as well as explicit threading techniques are evaluated along with a parallelization attempt on an application called the LTF/Reduce Interpreter by the company Norconsult Astando AB. The application turned out to be highly I/O dependent but even so, the parallel extensions proved useful as the running times of the parallelized parts were lowered by a factor of about 3.5–4.1.
Sammanfattning
Parallellisering av Sekventiella Applikationer med .NET Framework 4.5

Modern processor construction has taken a turn in that the norm now appears to be adding more cores to processors instead of relying on increases in clock speed. Much of the responsibility has thus been moved from the hardware manufacturers to the software developers. Scalability, synchronization, data dependencies and debugging can, however, make this troublesome for developers.
    Many of the problems mentioned above can be alleviated through the use of .NET Framework 4.5 and the new parallel extensions, which include TPL, PLINQ and other concepts designed for highly concurrent applications. Analysis, profiling and debugging have also been made less problematic in that the Visual Studio 2012 IDE provides such functionality to a great extent.
    The parallel extensions, as well as techniques for explicit threading, are evaluated in this thesis together with a parallelization attempt on an application called the LTF/Reduce Interpreter by the company Norconsult Astando AB. The application turned out to be highly I/O dependent, but even so the parallel extensions proved useful, as the running times of the parallelized parts could be reduced by a factor of about 3.5–4.1.
Contents

List of Tables

List of Figures

I Background

1 Introduction
  1.1 The Modern Processor
  1.2 .NET Framework 4.5 Overview
      1.2.1 Common Language Runtime
      1.2.2 The Parallel Extension
  1.3 Problem Definition
  1.4 Motivation

II Theory

2 Modern Processor Architectures
  2.1 Instruction Handling
  2.2 Memory Principles
  2.3 Threads and Processes
  2.4 Multi Processor Systems
      2.4.1 Parallel Processor Architectures
      2.4.2 Memory Architectures
      2.4.3 Simultaneous Multi-threading

3 Parallel Programming Techniques
  3.1 General Concepts
      3.1.1 When to go Parallel
      3.1.2 Overview of Parallelization Steps
  3.2 .NET Threading
      3.2.1 Using Threads
      3.2.2 The Thread Pool
      3.2.3 Blocking and Spinning
      3.2.4 Signaling Constructs
      3.2.5 Locking Constructs
  3.3 Task Parallel Library
      3.3.1 Task Parallelism
      3.3.2 The Parallel Class
  3.4 Parallel LINQ
  3.5 Thread-safe Data Collections
  3.6 Cancellation
  3.7 Exception Handling

4 Patterns of Parallel Programming
  4.1 Parallel Loops
  4.2 Forking and Joining
      4.2.1 Recursive Decomposition
      4.2.2 The Parent/Child Relation
  4.3 Aggregation and Reduction
      4.3.1 Map/Reduce
  4.4 Futures and Continuation Tasks
  4.5 Producer/Consumer
      4.5.1 Pipelines
  4.6 Asynchronous Programming
      4.6.1 The async and await Modifiers
  4.7 Passing Data

5 Analysis, Profiling and Debugging
  5.1 Application Analysis
  5.2 Visual Studio 2012
      5.2.1 Debugging Tools
      5.2.2 Concurrency Visualizer
  5.3 Common Performance Sinks
      5.3.1 Load Balancing
      5.3.2 Data Dependencies
      5.3.3 Processor Oversubscription
      5.3.4 Profiling

III Implementation

6 Pre-study
  6.1 The Local Traffic Prescription (LTF)
  6.2 Application Design Overview
  6.3 Choice of Method
  6.4 Application Profile
      6.4.1 Hardware
      6.4.2 Performance Overview
      6.4.3 Method Calls

7 Parallel Database Concepts
  7.1 I/O Parallelism
  7.2 Interquery Parallelism
  7.3 Intraquery Parallelism
  7.4 Query Optimization

8 Solution Design
  8.1 Problem Decomposition
  8.2 Applying Patterns
      8.2.1 Explicit Threading Approach
      8.2.2 Thread Pool Queuing Approach
      8.2.3 TPL Approaches
      8.2.4 PLINQ Approach

IV Execution

9 Results
  9.1 Performance Survey
  9.2 Discussion
      9.2.1 Implementation Analysis
      9.2.2 Parallel Extensions Design Analysis
  9.3 Conclusion
  9.4 Future Work

V Appendices

A Code Snippets
  A.1 The ReadData Method
  A.2 The ThreadObject Class

B Raw Data
  B.1 Different Technique Measurements
  B.2 Varying the Thread Count using Explicit Threading
  B.3 Varying the Core Count using TPL

Bibliography
List of Tables

2.1   Specifications for some common CPU architectures. [1]
2.2   Typical specifications for different memory units (as of 2013). [2]
2.3   Specifications for some general processor classifications.

3.1   Typical overheads for threads in .NET. [3]
3.2   Typical overheads for signaling constructs. [4]
3.3   Properties and typical overheads for locking constructs. [4]

6.1   Processor specifications of the hardware used for evaluating the application.
6.2   Other hardware specifications of the hardware used for evaluating the application.
6.3   Methods with respective exclusive instrumentation percentages.
6.4   Methods with respective CPU exclusive samplings.

9.1   Methods with the highest elapsed exclusive time spent executing for the sequential and TPL based approaches.
9.2   Perceived difficulty and applicability levels of implementing the techniques of parallelization.

B.1   Running times measured using the different techniques of parallelization.
B.2   Running time measurements using explicit threading with different amounts of threads.
B.3   Running time measurements using TPL with different amounts of cores.
List of Figures

1.1   Overview of .NET threading concepts. Concepts in white boxes with dotted borders are not covered to great extent in this thesis. [3]
1.2   A high level flow graph of typical parallel execution in the .NET Framework. [5]

2.1   Difference between UMA (left) and NUMA (right). Processors are denoted P1, P2, ..., Pn and memories M1, M2, ..., Mn. [6]

3.1   Thread 1 steals work from thread 2 since both its local queue and the global queue are empty. [7]
3.2   The different states of a thread. [3]
3.3   A possible execution plan for a PLINQ query. Note that the ordering may change after execution.

4.1   Example of dynamic task parallelism. New child tasks are forked from their respective parent tasks.
4.2   An illustration of typical Map/Reduce execution.

5.1   Load imbalance as shown by the concurrency visualizer.
5.2   Data dependencies as shown by the concurrency visualizer.
5.3   Processor oversubscription as shown by the concurrency visualizer.

6.1   The internal steps of the application showing database accesses.
6.2   The figure shows how the LTF/Reduce Interpreter is used by other applications.
6.3   The CPU usage of the application over time.
6.4   The call tree of the application showing inclusive percentages of time spent executing methods (instrumentation).
6.5   The call tree of the application showing inclusive percentages of CPU samplings of methods.

9.1   Average total running times for different techniques.
9.2   Total running times for the techniques represented as a box-and-whiskers graph.
9.3   Box-and-whiskers diagrams for the running times of the methods MapLastzon, Map, Geolocate and Stretch.
9.4   Average total running times for TPL using different amounts of cores.
9.5   Total running times for different amounts of threads using the explicit threading approach.
List of Listings

3.1   Explicit thread creation.
3.2   Locking causes threads unable to be granted the lock to block.
3.3   Explicit task creation (task parallelism).
3.4   The Parallel.Invoke method.
3.5   The Parallel.For loop.
3.6   An example of a PLINQ query.
4.1   Applying the aggregation pattern using PLINQ.
4.2   Example of the future pattern for calculating f4(f1(4), f3(f2(7))).
4.3   Usage of the BlockingCollection class for a producer/consumer problem.
4.4   An example of the async/await pattern for downloading HTML.
4.5   Downloading HTML asynchronously without async/await.
5.1   A loop with unbalanced work among iterations (we assume that the factorial method is completely independent between calls).
8.1   Bottleneck #1: Reducing LTFs as they become available.
8.2   Bottleneck #2.1: Geolocating service days.
8.3   Bottleneck #2.2: Stretching of LTFs.
8.4   Bottleneck #3: Storage/Persist of modified LTFs.
8.5   Bottleneck #1: Reduce using explicit threading.
8.6   Bottleneck #2.1: Geolocate using explicit threading.
8.7   Bottleneck #2.2: Stretch using explicit threading.
8.8   Bottleneck #1: Reduce using thread pool queuing.
8.9   Bottleneck #2.1: Geolocate using thread pool queuing.
8.10  Bottleneck #2.2: Stretch using thread pool queuing.
8.11  Bottleneck #1: Reduce using TPL.
8.12  Bottleneck #2.1: Geolocate using TPL.
8.13  Bottleneck #2.2: Stretch using TPL.
8.14  Bottleneck #1: Reduce using PLINQ.
8.15  Bottleneck #2.1: Geolocate using PLINQ.
8.16  Bottleneck #2.2: Stretch using PLINQ.
A.1   The ReadData method used in bottleneck #1.
A.2   The ThreadObject class used for the explicit threading technique.
Part I

Background

Chapter 1

Introduction

The art of parallel programming has long been considered difficult and not worth
the investment. This chapter describes why parallel programming has become ever
so important and why there is a need for modern parallelization technologies.

1.1    The Modern Processor
A system composed of a single core processor executes instructions, performs calculations and handles data sequentially, switching between threads when necessary. Moore's law states that the number of transistors used in single processor cores doubles approximately every two years. Even so, this law, which is predicted to hold only for a few more years, has been subject to debate. [8]
    Recent developments in processor construction have enabled a paradigm shift. Instead of naively relying on Moore's law for future clock speeds, other technologies have emerged. After all, a transistor cannot be made infinitely small, as cost efficiency, heat dissipation and other physical factors impose limits. [9]
    In recent years we have witnessed a turn towards usage of multi core processors as well as systems consisting of multiple processors. There are a number of benefits that come with such an approach, including less context switching and the ability to solve certain problems more efficiently. Unfortunately, the increased complexity also introduces problems with synchronization, data dependencies and large overheads.

        “The way the processor industry is going, is to add more and more
       cores, but nobody knows how to program those things. I mean, two,
         yeah; four, not really; eight, forget it.” - Steve Jobs, Apple. [10]

    Even though the subject is regarded as problematic, all hope is not lost. Developed by Microsoft, the .NET Framework (pronounced dot net) is a software framework which includes large libraries while being highly interoperable, portable and scalable. A feature called the Parallel Extensions was introduced in version 4.0, released in 2010, targeting parallel and distributed systems. With the parallel extensions, writing parallel applications using .NET has been made tremendously less painful.


1.2     .NET Framework 4.5 Overview
The .NET Framework is a large and complex system framework with several layers
of abstraction. This section scratches the surface by providing the basics of the
Common Language Runtime as well as the new parallel extensions of .NET 4.0.

1.2.1    Common Language Runtime
The Common Language Runtime (CLR) is one of the main components in .NET.
It constitutes an implementation of the Common Language Infrastructure (CLI)
and represents a virtual machine as well as an execution environment. After code
has been written in a programming language supported by .NET (e.g. C#, F#
or VB.NET) it is compiled into an assembly consisting of Common Intermediate
Language (CIL) code. When the assembly is executed, the CIL code is compiled
into machine code by the Just In Time (JIT) compiler at runtime (alternatively at
compile time for performance reasons). [11] See figure 1.2 for an overview.
    The CLR also supports handling of issues regarding memory, threads and ex-
ceptions as well as garbage collection ∗ and enforcement of security and robustness.
Code executed by the CLR is called managed code, which means that the framework handles the above mentioned issues so that the programmer does not have to tend to them. Native code, in contrast, represents machine specific code executed directly on the operating system. This is typically associated with C/C++ and has certain potential performance benefits since such applications are of a lower level than managed ones. [12]

1.2.2    The Parallel Extension
The Task Parallel Library (TPL) is made to simplify the process of parallelizing code
by offering a set of types and application programming interfaces (APIs). The TPL
has many useful features including work partitioning and proper thread scheduling.
TPL may be used to solve issues related to data parallelism, task parallelism and
other patterns (see chapter 4). Another parallelization method introduced in the
.NET Framework 4.0 is called Parallel LINQ (PLINQ) which is a parallel extension
to the regular LINQ query. PLINQ offers a declarative and typically cleaner style
of programming than that of TPL usage.
    The parallel extensions further introduce a set of concurrent lock-free data struc-
tures as well as slimmed constructs for locking and event handling specifically de-
signed to meet the new standards of highly concurrent programming models. See
figure 1.1 for an overview.
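    As a brief illustration (this summation example is hypothetical and not taken from the thesis), the following C# sketch contrasts the imperative TPL style with the declarative PLINQ style for the same aggregation:

using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelExtensionsSketch
{
    static void Main()
    {
        int[] numbers = Enumerable.Range(1, 1000000).ToArray();
        object sumLock = new object();

        // TPL: imperative data parallelism; each thread accumulates a local
        // partial sum which is merged under a lock at the end.
        long tplSum = 0;
        Parallel.For(0, numbers.Length,
            () => 0L,                                    // per-thread initial value
            (i, loopState, local) => local + numbers[i], // per-iteration work
            local => { lock (sumLock) { tplSum += local; } });

        // PLINQ: the same aggregation expressed declaratively.
        long plinqSum = numbers.AsParallel().Sum(n => (long)n);

        Console.WriteLine("{0} {1}", tplSum, plinqSum);
    }
}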
   ∗
     The garbage collector reclaims objects no longer in use from memory when certain conditions
are met and thus prevents memory leaks and other related problems.


1.3    Problem Definition
This thesis presents the steps and best practices for parallelizing a sequential appli-
cation using the .NET Framework. Parallelization techniques, patterns and analysis
are discussed and evaluated in detail along with an overview of modern processor
design and profiling/analysis methods of Visual Studio 2012. The thesis also in-
cludes a parallelization attempt of an industry deployed application developed by
the company Norconsult Astando AB.

This thesis does cover:
  •   Decomposition and Scalability (potential parallelism, granularity, etc.).
  •   Coordination (data races, locks, thread safety, etc.)
  •   Regular Threading Concepts in .NET (incl. thread pool usage).
  •   Parallel Extensions (PLINQ, TPL, etc.).
  •   Patterns of Parallel Programming (i.e. best practices).
  •   Profiling and debugging using Visual Studio 2012.
  •   Implementation details, results and conclusions.
This thesis does not cover:
  • Basics of .NET programming (C#, delegates, LINQ, etc.).
  • Advanced parallel algorithms.
  • Thread pool optimizations (e.g. custom thread schedulers).

Note that this thesis is based on the C# programming language although the con-
cepts are more or less the same when written in one of the other .NET languages
such as VB.NET or F#.
    The goal of this thesis is to provide an evaluation of .NET parallel programming concepts and, in doing so, to give insights into how sequential applications should be parallelized, especially using .NET technology. The theory presented along with
experiments, results and insights should provide a useful reference, both for Nor-
consult Astando and for future academic research.

1.4    Motivation
With the rapid advancements in multi core application programming, companies using .NET technology may not be able to catch up with emerging techniques for parallel programming, leaving them with inefficient software. It is therefore of utmost importance that best practices for identifying, applying and examining parallel code using modern technologies are investigated and thoroughly evaluated.
    Applying patterns of parallel programming to already existing code is often more difficult than writing parallel code from scratch, as extensive profiling and possible redesign of the system architecture may be required. Because many .NET applications in use today are designed for sequential execution, this thesis targets the iterative approach of parallelization, which also includes thorough analysis and decomposition of sequential code.


     Figure 1.1. Overview of .NET threading concepts. Concepts in white boxes with dotted borders are not covered to great extent in this thesis. [3]

     Figure 1.2. A high level flow graph of typical parallel execution in the .NET Framework. [5]

Part II

Theory

Chapter 2

Modern Processor Architectures

The term processor has been used since the early 1960s and has since undergone
many changes and improvements to become what it is today [13]. This chapter
includes an overview of modern processor technology with a focus on multi core
processors.

2.1    Instruction Handling
The processor architecture states how data paths, control units, memory components and clock circuitries are composed. The main goal of a processor is to fetch and execute instructions to perform calculations and handle data. The only part of the processor normally visible to the programmer is the register file, where variables and results of calculations are stored. [14]
    Instructions to be carried out by the processor include arithmetic, load/store and jump instructions, among others. After an instruction has been fetched from memory using the value of the program counter (PC) register, the program counter needs to be updated and the instruction decoded. This is followed by fetching the operands of the instruction from the registers (or from the instruction itself) so that the instruction can be executed. The execution is typically an ALU operation, a memory reference or a jump in the program flow. If there is a result from the execution, it will be stored in a register. [14]
    For a processor to be able to execute instructions, certain hardware is needed. A memory holding the instructions is needed as well as space for the registers. An adder needs to be in place for incrementing the program counter, along with a clock signal for synchronization. The arithmetic logic unit (ALU) is used for performing arithmetic and logical operations on binary numbers with operands fetched from the registers. Combining the ALU with multiplexers and control units will allow for the different instructions to be carried out. [14]
    RISC stands for Reduced Instruction Set Computer and has certain properties such as easy-to-decode instructions, many registers without special functions and only allowing special load/store instructions to reference memory. The contrasting architecture is called CISC, which stands for Complex Instruction Set Computer and is built on the philosophy that more complex instructions would make it easier for programmers and compilers to write assembly code. [14]

               Architecture        Bits     Design      Registers       Year
               x86                 32       CISC        8               1978
               x86-64              64       CISC        16              2003
               MIPS                64       RISC        32              1981
               ARMv7               32       RISC        16              1983
               ARMv8               64       RISC        30              2011
               SPARC               64       RISC        31              1985
             Table 2.1. Specifications for some common CPU architectures. [1]

2.2    Memory Principles
Memories are often divided into two groups: dynamic random access memory
(DRAM) and static random access memory (SRAM). SRAMs are typically easier
to use and faster while DRAMs are cheaper with less complex design. [14]
    When certain memory cells are used more often than others, the cells show locality of reference. Locality of reference can further be divided into temporal locality, where cells recently accessed are likely to be accessed again, and spatial locality, where cells close to recently accessed cells are more likely than others to be accessed. Locality of reference can be taken advantage of by placing the working set in a smaller and faster memory. [14]
    The CLR of .NET improves locality of reference automatically. Examples of this include that objects allocated consecutively are allocated adjacently and that the garbage collector defragments memory so that objects are kept close together. [15]
    The memory hierarchy of a system indicates the different levels of memories
available to the processor. Closest to the processor are the SRAM cache memories
which may further be divided into several sub levels (L1, L2, etc.). At the next level
is the DRAM primary memory followed by the secondary hard drive memory. [14]

                    Memory                    Size       Access time
                    Processor registers       128B       1 cycle
                    L1 Cache                  32KB       1-2 cycles
                    L2 Cache                  256KB      8 cycles
                    Primary Memory            GB         30-100 cycles
                    Secondary Memory          GB+        > 500 cycles
        Table 2.2. Typical specifications for different memory units (as of 2013). [2]


    When data is requested from memory, the entire cache line in which it is located is fetched, for spatial locality reasons. Cache lines may vary in size depending on the level of memory but are always aligned at multiples of their size. After the line has been fetched it will be stored in a lower level cache for temporal locality. Different levels of cache typically have different sizes, the reason being that finding an item in a smaller cache is faster when it is temporally referenced. [16]
    The associativity of a cache refers to the number of positions in memory that map to a position in the cache. Increasing associativity may lead to a reduced possibility of conflicts in the cache. A directly mapped cache is a cache where each line in memory maps to exactly one position. In a fully associative cache, any position in memory can map to any line of the cache. This approach is however very complex and is thus rarely implemented. [16]
    On multi core processors, cores typically share a single cache. Two issues related to these caches are capacity misses and conflict misses.
    In a conflict cache miss, one thread causes data needed by another thread to be evicted. This can potentially lead to thrashing, where multiple threads map their data to the same cache line. This is usually not an issue but may cause problems on certain systems. Conflict cache misses can be solved by using padding and spacing of data. [16]
    Capacity misses occur when the cache only fits the working sets of a certain number of threads. Adding more threads then causes data to be fetched from higher level caches or from main memory; the threads are thus no longer cache resident. Because of these issues, a high level of associativity should be preferred. [16]
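    As a small illustration of the padding idea (a hypothetical C# sketch, not code from the thesis; the 64-byte cache line size is an assumption about the target hardware), per-thread counters can be padded so that two threads never write to the same cache line:

using System.Runtime.InteropServices;

// Each counter occupies a full (assumed) 64-byte cache line, so threads
// updating different counters do not evict each other's cached data.
[StructLayout(LayoutKind.Explicit, Size = 64)]
struct PaddedCounter
{
    [FieldOffset(0)]
    public long Value;
}

class PerThreadCounters
{
    private readonly PaddedCounter[] counters;

    public PerThreadCounters(int threadCount)
    {
        counters = new PaddedCounter[threadCount];
    }

    // Each thread only increments its own slot; no two slots share a cache line.
    public void Increment(int threadIndex)
    {
        counters[threadIndex].Value++;
    }

    public long Total()
    {
        long sum = 0;
        for (int i = 0; i < counters.Length; i++)
            sum += counters[i].Value;
        return sum;
    }
}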

2.3    Threads and Processes
A software thread is a stream of instructions for the processor to execute whereas
a hardware thread represents the resources that execute a single software thread.
Processors usually have multiple hardware threads (also called virtual CPUs or
strands) which are considered equal performance wise. [16]
    Support for multiple threads on a single chip can be achieved in several ways.
The simplest way is to replicate the cores and have each of them share an interface
with the rest of the system. An alternative approach is to have multiple threads
run on a single core, cycling between the threads. Having multiple threads share a
core means that they will get a fair share of the resources depending on activity and
currently available resources. Most modern multi core processors use a combination of the two techniques, e.g. a processor with two cores, each capable of running two threads. From the perspective of the user, the system appears to have many virtual CPUs running multiple threads; this is called chip multi threading (CMT). [16]
    A process is a running application and consists of instructions, data and a state. The state consists of processor registers, currently executing instructions and other values that belong to the process. Multiple threads may run in a single process but not the other way around. A thread also has a state, although it is much simpler than that of a process. Advantages of using threads over processes are that they can perform a high degree of communication via the shared heap space, and that it is often very natural to decompose problems into multiple threads with low costs of data sharing. Processes have the advantage of isolation, although it also means that they require their own TLB entries. If one thread fails, however, the entire application might fail. [16]

2.4     Multi Processor Systems
2.4.1   Parallel Processor Architectures
Proposed by M. J. Flynn, the taxonomy of computer systems can be divided into
four different classifications (see table 2.3) based on the number of concurrent in-
struction and data streams available. [17]

  Classification     I-Parallelism        D-Parallelism         Example
  SISD               No                   No                    Uniprocessor machines.
  SIMD               No                   Yes                   GPU or array processors.
  MISD               Yes                  No                    Fault tolerant systems.
  MIMD               Yes                  Yes                   Multi core processors.
             Table 2.3. Specifications for some general processor classifications.

    The Single Instruction, Single Data (SISD) architecture cannot constitute a
parallel machine as there is only one instruction carried out per clock cycle which
only operates on one data element. Using multiple data streams however (SIMD),
the model is extended to having instructions operate on several data elements. The
instruction type is however still limited to one per clock cycle. This is useful for
graphics and array/matrix processing. [17]
    The Multiple Instruction, Single Data stream (MISD) architecture allows different types of instructions to be carried out each clock cycle, but only on the same data elements. The MISD architecture is rarely used due to its limitations but can provide fault tolerance for usage in aircraft systems and the like. Most modern multi processor computers however fall into the Multiple Instruction, Multiple Data (MIMD) stream category, where both the instruction and the data streams are parallelized. This means that every core can have its own instructions operating on its own data. [17]

2.4.2   Memory Architectures
When more than one processor is present in the system, memory handling becomes
much more complex. Not only can data be stored in main memory but also in
caches in one of the other processors. An important concept is cache coherence
which essentially means that data requested from some memory should always be
the most up-to-date version of it.


      Figure 2.1. Difference between UMA (left) and NUMA (right). Processors are denoted P1, P2, ..., Pn and memories M1, M2, ..., Mn. [6]

    There are several methods to maintain cache coherence often built on the concept
of tracking the state of sharing of data blocks. One method is called directory based
tracking in which the state of blocks is kept at a single location called the directory.
Another approach is called snooping in which no centralized state is kept. Instead,
every cache has its own sharing status of blocks and snoops other caches to determine
whether the data is relevant. [18]

Shared Memory
When every processor has access to the same global memory space, it constitutes a shared memory architecture. The advantages of such an approach are that data sharing is fast due to the short distance between processors and memory. Another advantage is that programming applications for such an architecture is rather simple, even though it is the programmer's responsibility to provide synchronization between processors. The shared memory approach however lacks scalability between memory and processor count and is rather difficult and expensive to design. [6]
    Memory may be attached to CPUs in different constellations, and for every link between the CPU and the memory requested there is some latency involved. Typically one wants to have similar memory latencies for every CPU. This can be achieved through an interconnection network through which every processor communicates with memory. This is what uniform memory access (UMA) is based on and what is typically used in modern multi processor machines. [6]
    The other approach to shared memory is non-uniform memory access (NUMA), in which processors have local memory areas. This results in constant access times when requested data is present in the local memory, but slightly slower access than UMA when it is not. [6] See figure 2.1 for an illustration of UMA and NUMA designs.


Distributed Memory
Using distributed memory, every processor has its own memory space and addresses. Advantages of the distributed memory approach are that the processor count scales with memory and that it is rather cost effective. Each processor typically has instant access to its own memory, as no external communication is needed in such cases. [6]
    The disadvantages of distributed memory are that the programmer is responsible for inter-processor communication, as there is no straightforward way of sharing data. It may also prove difficult to properly distribute data structures between processor memories. [6]

2.4.3   Simultaneous Multi-threading
Simultaneous multi-threading (often associated with Intel's hyper-threading technology) is a technique in which cores are able to execute two streams of instructions concurrently, improving CPU efficiency. This means that each core is divided into two logical ones with their own states and instruction pointers but with the same memory. By switching between the two streams, the instruction stream of the pipeline can be made more efficient. When one of the instruction streams is stalled waiting for other steps to finish (e.g. memory is being fetched), the CPU may execute instructions from the other instruction stream until the stalling is complete. Studies have shown that this technology may improve performance by up to 40 percent. [15]
    One of the key elements when optimizing code for simultaneously multi-threaded hardware is to make good use of the CPU cache. Storing data which is often accessed in close proximity is highly important due to locality of reference. Memory usage should generally be kept to a minimum. Instead of caching easily calculated values, it may be more efficient to just recalculate them. Besides the issues related to memory, all of the problems associated with multi core programming apply to simultaneous multi-threading as well. Note also that the CLR in .NET is able to optimize code for simultaneous multi-threading and different cache sizes to a certain extent. [15]

Chapter 3

Parallel Programming Techniques

Many techniques have been developed to make parallel programming a more ac-
cessible subject for developers. This chapter discusses topics related to identifying
problems suited for parallelization as well as modern techniques used in parallel
programming.

3.1     General Concepts
Identifying, decomposing and synchronizing units of work are all examples of general
problems related to parallel programming. This section gives an overview of these
subjects.

3.1.1   When to go Parallel
The main reason for writing parallel code is performance. Parallelizing an application so that it runs on four cores instead of one can potentially cut down the computation time by a factor of 4. The parallel approach is however not always suitable, and one should always weigh the gains of parallelization against the introduced costs of increased complexity and overhead.
    Amdahl’s law is an approximation of potential runtime gains when parallelizing
applications. Let S represent time spent executing serial code and P time spent
executing parallelized code. Amdahl’s law states that the total runtime is S + P/N
where N is the number of threads executing the parallel code. This is obviously
very unrealistic as the overheads (synchronization, thread handling etc.) of using
multiple threads are not taken into account. [16]
    Let the overhead of N threads be denoted F(N). The estimate of F(N) could vary from constant to linear or even exponential running time depending on the implementation. A fair estimate is to let F(N) = K · ln(N), where K is some constant communication latency. The logarithm could for example represent the communication costs when threads form a tree structure. The total runtime including the overhead cost would now be updated to

                             S + P/N + K · ln(N).                          (3.1)

   By plotting the running time over an increasing number of threads it is apparent
that the performance at some point will start decreasing. By differentiating the
running time over thread count, this exact point may be calculated as
                             −P/N² + K/N = 0                               (3.2)

and when solved for N,

                             N = P/K.                                      (3.3)
    In other words, an appropriate number of threads for a particular task is determined by the runtime of the parallelizable code divided by the communication latency K. This also means that the scalability of an application can be increased by finding larger proportions of code to parallelize or by minimizing the synchronization costs. [16]
    A side note is that lower communication latencies can be achieved when threads share the same level of cache than when they communicate through memory. Multi core processors therefore have the opportunity to lower the value of K through efficient memory handling. [16]
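    To make the model concrete, the following sketch evaluates the estimated runtime of equation (3.1) for a growing thread count and prints the break-even point from equation (3.3). The values of S, P and K are illustrative assumptions, not measurements from the thesis:

using System;

class AmdahlWithOverhead
{
    // Estimated total runtime from equation (3.1): S + P/N + K * ln(N).
    static double EstimatedRuntime(double s, double p, double k, int n)
    {
        return s + p / n + k * Math.Log(n);
    }

    static void Main()
    {
        // Assumed example values (in seconds): serial part, parallelizable part
        // and per-thread communication latency.
        double s = 2.0, p = 16.0, k = 0.5;

        for (int n = 1; n <= 64; n *= 2)
            Console.WriteLine("N = {0,2}: {1:F2} s", n, EstimatedRuntime(s, p, k, n));

        // Equation (3.3): the runtime is minimized around N = P/K threads.
        Console.WriteLine("Estimated optimal thread count: {0:F1}", p / k);
    }
}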

3.1.2   Overview of Parallelization Steps
When the sequential application has been properly analyzed and profiled (see chap-
ter 5), the steps for parallelizing the application typically are as follows:

  1. Decomposition of code into units of work.
  2. Distribution of work among processor cores.
  3. Synchronizing work.

Decomposition and Scalability
An important concept when parallelizing an application is that of potential paral-
lelism. Potential parallelism means that an application should utilize the cores of
the hardware regardless of how many of them are present. The application should be able to run on both single core systems as well as multi-core ones and perform accordingly. For some applications the level of parallelism may be hard coded based
on the underlying hardware. This approach should typically be used only when the
hardware running the application is known beforehand and is guaranteed to not
change over time. Such cases are rarely seen in modern hardware but still exist to
some extent, for example in gaming consoles. [7]
    To provide potential parallelism in a proper way, one might use the concept of a task. Tasks are mostly independent units of work into which an application can be divided. They are typically distributed among threads and work towards fulfilling a common goal. The size of a task is called its granularity and should be carefully chosen. If the granularity is too fine grained, the overheads of managing threads might dominate, while a too coarse grained granularity leads to a possible loss of potential parallelism. The general guideline for choosing task granularity is that the tasks should be as large as possible while properly occupying the cores and being as independent of one another as possible. Making this choice requires good knowledge of the underlying algorithms, data structures and overall design of the code to be parallelized. [7]

Data Dependencies and Synchronization
Tasks of a parallel program are usually created to run in parallel. In some cases there is no synchronization between tasks. Such problems are called embarrassingly parallel and, as the name suggests, impose very few problems while providing good potential parallelism. One does not always have the luxury of encountering such problems, which is why the concept of synchronization is important. Task synchronization typically has different designs depending on the pattern used for parallelization (see chapter 4). Common for all patterns is however that the tasks must be coordinated in one way or another.
    When data needs to be shared between tasks, the problem of data races becomes prevalent. Data races can be solved in numerous ways. One way is to use a locking construct around the variable being raced for (see section 3.2.5). Another solution is to make variables immutable, which can be enforced by using copies of data instead of references where possible. A final approach, to be used when all else fails, is to redesign the code for lower reliance on shared variables.
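    As a small, hypothetical example of the locking approach (this is not code from the LTF/Reduce Interpreter), the shared counter below is protected by the C# lock statement so that concurrent increments from parallel iterations are not lost:

using System;
using System.Threading.Tasks;

class RaceFreeCounter
{
    private readonly object gate = new object(); // dedicated lock object
    private int count;

    public void Increment()
    {
        lock (gate)   // only one thread at a time may update the shared state
        {
            count++;
        }
    }

    public int Count
    {
        get { lock (gate) { return count; } }
    }
}

class Program
{
    static void Main()
    {
        var counter = new RaceFreeCounter();

        // Without the lock, some of these concurrent increments could be lost.
        Parallel.For(0, 100000, i => counter.Increment());

        Console.WriteLine(counter.Count); // always prints 100000
    }
}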
    It is important to note that synchronization limits parallelism and may in the worst case serialize the application. There is also the possibility of deadlocks when using locking constructs. As an alternative to explicit locking, a number of lock-free, concurrent collections were introduced in .NET Framework 4.0 to minimize synchronization for certain data structures such as queues, stacks and dictionaries (see section 3.5). Note that these constructs come with a set of quite heavy limitations which should be studied before designing an application based on them.
    One should generally try to eliminate as much synchronization as possible al-
though it is important to note that it is not always possible to do so. Choosing the
simplest, least error prone solution to the problem is then recommended, as parallel
programming in itself is difficult enough as it is.

3.2    .NET Threading
Before the introduction of .NET Framework 4.0, developers were limited to using
explicit threading methods. Even though the new parallel extensions have been
introduced, explicit threading is still used to great extent. This section describes
how threading is carried out in .NET along with other relevant subjects.


    3.2.1   Using Threads
Threads in .NET are handled by the thread scheduler provided by the CLR and
are typically pre-empted after a certain time slice which depends on the underlying
operating system (typically between 10 and 15 ms [19]). On multi-core
systems, time slicing is mixed with true concurrency, as multiple threads may run
simultaneously on different cores.

                             Listing 3.1. Explicit thread creation.
new Thread(() => Work()).Start();

Threads created explicitly are foreground threads unless the property
IsBackground is set to true, in which case the thread is a background thread.
When every foreground thread has terminated the application ends, and any remaining
background threads are terminated as a result. Waiting for threads to finish is typically done
using event wait handles (see section 3.2.4). Note that exception handling should
be carried out within the thread itself; the catch block is then typically used to
signal another thread or to log the error. [3]
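
    A hypothetical sketch of this practice (Work and the wait handle are made up for
the example); the exception is caught inside the thread delegate and only stored and
signalled, since it cannot be caught by the code that started the thread:

using System;
using System.Threading;

Exception error = null;
var failed = new ManualResetEvent(false);

new Thread(() =>
{
    try
    {
        Work();                  // stand-in for the actual work
    }
    catch (Exception ex)
    {
        error = ex;              // store or log the error...
        failed.Set();            // ...and signal an observing thread
    }
}) { IsBackground = true }.Start();
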
        As mentioned in previous sections, writing threaded code has issues regarding
    complexity. A good practice to follow is to encapsulate as much of the threaded
    code as possible for unit testing. The other main problem is the overhead of creating
    and destroying threads as seen in table 3.1 below. These overheads can however be
    limited by using the .NET thread pool as described in the following section. [3]

                     Action                         Overhead
                     Allocating a thread            1MB stack space
                     Context switch                 6000-8000 CPU cycles
                     Creation of a thread           200 000 CPU cycles
                     Destruction of a thread        100 000 CPU cycles
                      Table 3.1. Typical overheads for threads in .NET. [3]

    3.2.2   The Thread Pool
The thread pool consists of a set of worker threads waiting for work to execute. The easiest
way of using the thread pool is to add units of work to its global queue by calling
the ThreadPool.QueueUserWorkItem method.
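
    A minimal sketch (Work is a placeholder); the delegate is queued on the global queue
and runs on whichever pool thread picks it up:

using System.Threading;

ThreadPool.QueueUserWorkItem(state =>
{
    Work();   // runs on a thread-pool worker thread
});
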
        The upper limit of threads for the thread pool in .NET 4.0 is 1023 and 32768
    for 32-bit and 64-bit systems respectively while the lower limit is determined by the
    number of processor cores. The thread pool is dynamic in that it typically starts
    out with few threads and injects more as long as performance is gained. [3]
    The .NET Framework is designed to run applications using millions of tasks,
each possibly as small as a couple of hundred clock cycles. This would normally
not be possible with a single, global thread pool because of synchronization
overheads. However, the .NET Framework solves this issue by using a
decentralized approach. [7]

      Figure 3.1. Thread 1 steals work from thread 2 since both its local queue and the
      global queue are empty. [7]

    In the .NET Framework, every thread of the thread pool is assigned a local task
queue in addition to having access to the global queue. When new tasks are added,
they are sometimes put on local queues (sub-level) and sometimes on the global one
(top-level). Threads not in the thread pool always have to place tasks on the global
queue. [7]
    The local queues are typically double-ended and lock-free, which opens up for
a technique called work stealing. A thread with a local queue operates
on one end of that queue while other threads may pull work from the other, public end
(see figure 3.1). Work stealing has also been shown to provide good cache properties and
fair distribution of work. [7]
    In certain scenarios, a task has to wait for another task to complete. If threads
have to wait for other tasks to be carried out, it might lead to long waits and in the
worst case deadlocks. The task scheduler can detect such situations and let the
waiting thread run the awaited task inline. It is important to note that top-level as
well as long-running (see section 3.3.1) tasks cannot be inlined. [7]
    The number of threads in the pool is automatically managed in .NET using
complex heuristics. Two main approaches should be noted. The first tries
to reduce starvation and deadlocks by injecting more threads if little progress is
made. The other is hill-climbing, which maximizes throughput while keeping
the number of threads to a minimum. This is typically done by monitoring
whether injected threads increase throughput or not. [7]
    Threads can be injected when a task completes or at 500 ms intervals. A
good reason for keeping tasks short is therefore that the scheduler gets more
frequent opportunities for optimization. A second approach is to implement a custom task
scheduler that injects threads in whatever way is desired. Note that the lower and upper
limits of the number of threads in the pool may also be set explicitly (within certain bounds). [7]


                     Figure 3.2. The different states of a thread (Unstarted, Running,
                     WaitSleepJoin, Stopped, Abort Requested, Aborted). [3]

3.2.3   Blocking and Spinning
When a thread is blocked, its execution is paused and its time slice is yielded, resulting
in a context switch. The thread is unblocked, and context switched back in, when
the blocking condition is met, when the operation times out, or when the thread is
interrupted or aborted. To have a thread block until a certain condition is met, signaling
and/or locking constructs may be used.
    Instead of having a thread block and pay for the context switch, it may spin
for a short amount of time, constantly polling for the signal or lock. This obviously
wastes processing time but may be effective when the condition is expected to be
met very soon. A set of slimmed variants of the locks and signaling constructs
was introduced in .NET Framework 4.0 targeting this issue. The slimmed constructs
can be used between threads but not between processes. [3]
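
    A rough sketch of this hybrid strategy (the class and the flag are made up for the
example); SpinWait.SpinUntil spins briefly and then starts yielding the time slice,
giving up after the supplied timeout:

using System.Threading;

class Waiter
{
    private volatile bool ready;    // set to true by another thread

    public bool WaitBriefly()
    {
        // Spin for a short while, then yield; give up after 100 ms.
        return SpinWait.SpinUntil(() => ready, 100);
    }

    public void Signal()
    {
        ready = true;
    }
}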

3.2.4   Signaling Constructs
Signaling is the concept of having one thread wait until it receives a notification
from one or more other threads. Event wait handles are the simplest of the
signaling constructs and come in a number of forms. One particularly useful feature
is waiting for multiple threads to finish using the
WaitAll(WaitHandle[]) method, where the wait handles are usually distributed
among the threads being waited for (a sketch follows the table below). [3]

  • Using a ManualResetEvent allows communication between threads using a
    Set call. Threads waiting for the signal are all unblocked until the event is
    reset manually. If threads wait using the WaitOne call, they will form a queue.

  • The AutoResetEvent is similar to ManualResetEvent with the difference that
    the event is automatically reset when a thread has been unblocked by it.


      • A CountdownEvent unblocks waiting threads once its counter has reached
        zero. The counter is initially set to some number and is decremented by one
        for each signal.

             Construct                     Cross-process        Overhead
             AutoResetEvent                Yes                  1000 ns
             ManualResetEvent              Yes                  1000 ns
             ManualResetEventSlim          No                   40 ns
             CountdownEvent                No                   40 ns
             Barrier                       No                   80 ns
             Wait and Pulse                No                   120 ns (for Pulse)
                     Table 3.2. Typical overheads for signaling constructs. [4]
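
    The sketch below (the worker count and Work are made up for the example) gives each
thread its own ManualResetEvent and then blocks until all of them have been set:

using System.Threading;

const int workers = 4;
var done = new ManualResetEvent[workers];

for (int w = 0; w < workers; w++)
{
    done[w] = new ManualResetEvent(false);
    int id = w;                         // copy the loop variable before capturing it
    new Thread(() =>
    {
        Work(id);                       // stand-in for the actual work
        done[id].Set();                 // signal that this worker is finished
    }).Start();
}

WaitHandle.WaitAll(done);               // blocks until every worker has signalled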

    3.2.5   Locking Constructs
Locking is used to ensure that at most a set number of threads can be granted access
to some resource at the same time. To ensure thread safety, locking should
be performed around any writable shared field regardless of its complexity.
Writing thread-safe code is, however, usually more time consuming and typically
incurs performance costs. [3]
        It is important to note that making one method of a class thread-safe does not
    mean that the whole object is thread-safe. Locking a complete, thread-unsafe object
    with one lock may prove to be inefficient. It is therefore important to choose the
    right level of locking so that the program may utilize multiple threads in a safe way
    while being as efficient as possible.

    Exclusive Locks
The constructs for exclusive locking in .NET are lock and Mutex, which
both let at most one thread hold the lock at a time. The lock construct
is typically faster but, in contrast to Mutex, cannot be shared between
processes. [3]

            Listing 3.2. Locking causes threads unable to be granted the lock to block.
lock (syncObj) {
    // thread-safe area
}

A SpinLock continuously polls the lock to see whether it has been released instead of
blocking. This consumes processor resources but typically gives good performance when
locks are expected to be held only for very short amounts of time.


Non-Exclusive Locks
A variant of the mutex is the Semaphore, which lets up to n threads hold the lock
simultaneously. Typical uses of the semaphore are to limit the maximum number of
database connections or to throttle certain CPU- or memory-intensive operations to
avoid starvation.
    The ReaderWriterLock allows multiple threads to read a value simultaneously
while at most one thread may update it. Two locks are used, one for reading
and one for writing. The writer acquires both locks when writing to ensure
that no reader sees inconsistent data. The ReaderWriterLock should be used when
the protected variables are read far more often than they are updated (a sketch follows
table 3.3). [3]

               Lock                           Cross-process        Overhead
               Mutex                          Yes                  1000 ns
               Semaphore                      Yes                  1000 ns
               SemaphoreSlim                  No                   200 ns
               ReaderWriterLock               No                   100 ns
               ReaderWriterLockSlim           No                   40 ns
            Table 3.3. Properties and typical overheads for locking constructs. [4]
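
    A minimal sketch of a read-mostly structure protected by the slim variant from
table 3.3 (the cache itself and its names are made up for the example):

using System.Collections.Generic;
using System.Threading;

class PriceCache
{
    private readonly Dictionary<string, decimal> prices = new Dictionary<string, decimal>();
    private readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();

    public bool TryGetPrice(string key, out decimal price)
    {
        rwLock.EnterReadLock();            // many readers may hold this at once
        try { return prices.TryGetValue(key, out price); }
        finally { rwLock.ExitReadLock(); }
    }

    public void SetPrice(string key, decimal price)
    {
        rwLock.EnterWriteLock();           // exclusive: blocks readers and writers
        try { prices[key] = price; }
        finally { rwLock.ExitWriteLock(); }
    }
}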

Thread-Local Storage
Data is typically shared among threads but may be isolated using thread-local
storage; this is highly useful for parallel code. One example is the
Random class, which is not thread-safe. To use this class properly,
the object either has to be protected by a lock or be local to every thread. The latter
is typically preferred for performance reasons and may be implemented using
thread-local storage. [3]
    Introduced in .NET 4.0, the ThreadLocal<T> class offers thread-local storage
for both static and instance fields. A useful feature of the ThreadLocal<T>
class is that the data it holds is lazily ∗ evaluated. A second way of implementing
thread-local storage is to use the GetData and SetData methods of the Thread
class. These are used to access thread-specific slots in which data can be stored and
retrieved. [3]
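
    A minimal sketch of the Random example (the seeding scheme and the loop are
illustrative only); each thread that touches rng.Value gets its own lazily created
Random instance:

using System;
using System.Threading;
using System.Threading.Tasks;

var rng = new ThreadLocal<Random>(() => new Random(Guid.NewGuid().GetHashCode()));

Parallel.For(0, 1000, i =>
{
    int sample = rng.Value.Next(100);   // safe: each thread uses its own Random
    // ... use the sample ...
});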

3.3     Task Parallel Library
The Task Parallel Library (TPL) consists of two techniques: task parallelism and
the Parallel class. The two are quite similar but typically have different
areas of use, and both are described in this section.
    ∗
      Lazy evaluation delays the evaluation of an expression until its value is needed. In some
circumstances, lazy evaluation can also be used to avoid repeated evaluations among shared data.


    3.3.1     Task Parallelism
A Task represents an object responsible for carrying out some independent unit of
work. Both the Parallel class and PLINQ are built on top of task parallelism.
Task parallelism offers the lowest level of parallelization that does not use threads
explicitly, while still offering a simple way of utilizing the thread pool. It may be used for
any concurrent application even though it was designed for multi-core applications.
                        Listing 3.3. Explicit task creation (task parallelism).
Task.Factory.StartNew(() => Work());

Tasks in .NET have several features which make them highly useful, a few of which are sketched below:
  • The scheduling of tasks may be altered.
  • Relationships between tasks can be established.
  • Efficient cancellation and exception handling.
  • Waiting on tasks and continuations.
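
    A hypothetical sketch of waiting and continuations (ExpensiveCount is a placeholder
for some work returning an int); the continuation is scheduled automatically on a pool
thread once the first task has produced its result:

using System;
using System.Threading.Tasks;

Task<int> compute = Task.Factory.StartNew(() => ExpensiveCount());

// Continuation: runs when compute finishes.
Task report = compute.ContinueWith(t => Console.WriteLine("Count: " + t.Result));

report.Wait();   // block until the whole chain has completed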

    Parallel Options
If no options are provided during task creation there is no explicit fairness,
no logical parent, and the task is assumed to be short-running.
These options may be specified using TaskCreationOptions. [3]
    The PreferFairness task creation option forces the default task scheduler to
place the task in the global (top-level) queue. This is also the case when a task is
created from a thread which does not belong to one of the worker threads of the
thread pool. Tasks created under this option typically follow a FIFO ordering if
scheduled by the default task scheduler. [3]
    Sometimes it is not desirable to have tasks use worker threads from the thread
pool. This is often the case when there are a few tasks that are known to run
for a long period of time (e.g. long I/O and other background work). In such cases,
the LongRunning task creation option creates a new thread
which bypasses the thread pool. As usual with parallel programming, one should
generally carry out performance tests before deciding whether to use this option
or not. [3]
    A parent-child relationship may be established using the AttachedToParent
task creation option. When this option is provided, the parent is guaranteed not to
finish until all of its attached children have finished. [20]
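
    A minimal sketch of the parent-child relationship (the child work is a placeholder);
waiting on the parent implicitly waits for the attached children as well:

using System.Threading.Tasks;

var parent = Task.Factory.StartNew(() =>
{
    for (int i = 0; i < 3; i++)
    {
        int child = i;                   // copy before capturing
        Task.Factory.StartNew(
            () => Work(child),           // stand-in for the child's work
            TaskCreationOptions.AttachedToParent);
    }
});

parent.Wait();   // returns only after the parent and all attached children finish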

    3.3.2     The Parallel Class
The Parallel class includes the methods Parallel.Invoke, Parallel.For and
Parallel.ForEach. These methods are highly useful for both data and task
parallelism and, in contrast to explicit tasks, they all block until all of the work
has completed.


    Parallel.Invoke
The static Parallel.Invoke method of the Parallel class executes
multiple Action delegates in parallel and then waits for the work to complete. The
main difference compared to explicit task creation and waiting is that the work is
partitioned into properly sized batches. Note that the units of work need to be known
beforehand to properly utilize the method. [3]

                           Listing 3.4. The Parallel.Invoke method.
Parallel.Invoke(
    () => Work1(),
    () => Work2());

    Parallel.For and Parallel.ForEach
The Parallel.For and Parallel.ForEach methods are used for performing the
iterations of looping constructs in parallel. Just like with the Parallel.Invoke method,
the work is properly partitioned. Parallel.ForEach iterates over an
enumerable data set just like its sequential counterpart, with the exception that
the parallel method will use multiple threads. The data set should implement the
IEnumerable† interface. [3]

                              Listing 3.5. The Parallel.For loop.
Parallel.For(0, 100, i => Work(i));

    The two important concepts parallel break and parallel stop are used for exiting
a loop early. A parallel break at index i guarantees that every iteration with an index
lower than i will be, or already has been, executed. It makes no guarantee about whether
iterations with indices higher than i have been executed. A parallel stop guarantees
nothing beyond the fact that the iteration with index i has reached the stop statement;
it is typically used when the parallel loop searches for some particular condition.
Canceling a loop externally is typically done using cancellation tokens (see
section 3.6). [3]
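
    A hypothetical sketch of a parallel break (GetData and Process are placeholders);
ParallelLoopState.Break still lets all lower-indexed iterations run, and the lowest break
index can be inspected afterwards:

using System;
using System.Threading.Tasks;

int[] data = GetData();                  // stand-in for the real input

ParallelLoopResult result = Parallel.For(0, data.Length, (i, loopState) =>
{
    if (data[i] < 0)
    {
        loopState.Break();               // parallel break at index i
        return;
    }
    Process(data[i]);                    // stand-in for the per-iteration work
});

if (result.LowestBreakIteration.HasValue)
    Console.WriteLine("Broke at index " + result.LowestBreakIteration.Value);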

    3.4     Parallel LINQ
    The Parallel LINQ technique offers the highest level of parallelism by automat-
    ing most of the implementation such as partitioning of work into tasks, execution
    of the tasks using threads and collation of results into a single output sequence.
        †
          The IEnumerable interface ensures that the underlying type implements the
    GetEnumerator method which in turn should return an enumerator for iterating through some
    collection.
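
    A minimal PLINQ sketch (the query and the data are made up for the example);
AsParallel turns the remainder of the query into a parallel query whose results are
collated back into a single sequence:

using System.Linq;

int[] numbers = Enumerable.Range(0, 1000000).ToArray();

var squaresOfEvens = numbers
    .AsParallel()
    .Where(n => n % 2 == 0)
    .Select(n => n * n)
    .ToArray();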
