Daniel J. Duffy
In the previous blog (Part I) I gave a
high-level overview of the OpenMP Application Programming Interface (API).
OpenMP consists of library functions, directives and environment variables that
allow developers to create multi-threaded code on shared-memory architectures.
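To make the three ingredients concrete, here is a minimal sketch that uses one library function and one directive. The helper names MaxThreads and CountThreads are hypothetical and are not part of the original test program. The _OPENMP guard is an assumption added so that the code also builds when OpenMP support is switched off.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>   // OpenMP library functions
#endif

// Hypothetical helper: upper bound on the number of threads that a
// parallel region may use (1 when OpenMP is not enabled).
int MaxThreads()
{
#ifdef _OPENMP
    return omp_get_max_threads();   // library function
#else
    return 1;
#endif
}

// Hypothetical helper: count how many threads actually execute a
// parallel region. Each thread increments the shared counter once.
int CountThreads()
{
    int count = 0;

    #pragma omp parallel            // directive: fork a team of threads
    {
        #pragma omp atomic          // protect the shared counter
        ++count;
    }
    // implicit barrier here; only the master thread continues

    assert(count >= 1);
    return count;
}
```

The third ingredient, an environment variable, needs no code: setting OMP_NUM_THREADS before starting the program changes the value that omp_get_max_threads() reports.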
In this blog I would like to give a
concrete example of using OpenMP in C++ code. The example is simple but it does
show how serial code can be made parallel. We concentrate on a problem that
creates two STL vectors, calculates their inner product and then prints them on
the console. The code was compiled under Microsoft Visual Studio 2005 (VS2005),
which supports OpenMP when the /openmp compiler option is enabled.
We have created a Win32 console application and all code is placed in one file.
The first statements tell the compiler that we are using STL and OpenMP:
#include <vector>
#include <iostream>
#include <omp.h>
using namespace std;
After having done this we can then use the
OpenMP directives and library functions to parallelise the code. The main
program is very simple:
int main()
{
    // Preprocessing: Input
    cout << "Give size of the arrays: ";
    int N; cin >> N;
    cout << "Give value in the first array: ";
    double val1; cin >> val1;
    cout << "Give value in the second array: ";
    double val2; cin >> val2;

    // Processing: Data and algorithms
    vector<double> v1(N, val1);
    vector<double> v2(N, val2);
    double result = InnerProduct(v1, v2); // Sum of products

    // Postprocessing: Output
    print(v1);
    print(v2);
    cout << endl << "Inner product is: " << result << endl;

    return 0;
}
This program prompts for input and then
creates two STL vectors. It then calculates their inner product and prints both
vectors on the console. The main program itself is serial, but the functions for
calculating the inner product and printing use loop-level parallel pragmas.
Since all code is in one file, these two functions must be defined (or at least
declared) before main. First, the code for the inner product is:
double InnerProduct(const vector<double>& v1, const vector<double>& v2)
{
    // Assume sizes of v1 and v2 are equal.
    // Start from zero so that empty vectors are handled safely.
    double result = 0.0;

    // Perform a reduction
    #pragma omp parallel for reduction (+: result)
    for (int j = 0; j < (int)v1.size(); ++j)
    {
        result += v1[j]*v2[j];
    }
    // implicit barrier here

    return result;
}
The presence of an OpenMP directive ensures
that the master thread forks a number of child threads. Each thread is
allocated part of the loop iterations needed to calculate the inner product.
The reduction clause gives each thread its own private copy of result; when the
loop ends, the private copies are added to the shared variable result. In this
way the individual contributions are combined and race conditions are avoided
at the same time.
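To see what the reduction clause saves us from writing, the following sketch spells out roughly the same idea by hand: a private partial sum per thread, merged one thread at a time. The function name InnerProductManual is hypothetical, and this is only an illustration of the idea, not a description of how any particular OpenMP runtime implements reductions.

```cpp
#include <cassert>
#include <vector>

// Hand-written counterpart of reduction(+: result): each thread sums
// into its own private variable, then the partial sums are merged.
double InnerProductManual(const std::vector<double>& v1,
                          const std::vector<double>& v2)
{
    assert(v1.size() == v2.size());   // sizes are assumed to be equal
    const int n = static_cast<int>(v1.size());
    double result = 0.0;              // shared between the threads

    #pragma omp parallel
    {
        double partial = 0.0;         // private: declared inside the region

        #pragma omp for               // split the iterations over the team
        for (int j = 0; j < n; ++j)
        {
            partial += v1[j] * v2[j];
        }

        #pragma omp critical          // one thread at a time updates result
        result += partial;
    }

    return result;
}
```

The version with the reduction clause is shorter and leaves the runtime free to choose how the partial sums are combined, which is why it is usually preferred.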
The code for printing a vector is given by:
void print(const vector<double>& vec)
{
    cout << endl;

    // Since we only read the values of vec, the default shared
    // variable access is OK
    #pragma omp parallel for
    for (int j = 0; j < (int)vec.size(); ++j)
    {
        cout << "vec[" << j << "] = " << vec[j] << endl;
    }
    // implicit barrier here

    cout << endl;
}
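In this case, multiple threads are created
and each thread is responsible for printing one block of the vector. Because
all threads write to the same cout stream, the blocks are not guaranteed to
appear in index order and output from different threads may interleave. We
mention that a so-called implicit barrier is defined in both functions at the
end of the loop. Each thread waits at the barrier until all threads have
finished their iterations; only the master thread then continues and the code
goes back into serial mode.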
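Threads that write to cout from inside a parallel loop may produce their lines out of index order. One possible way to keep the order, sketched here under the assumption that formatting is the part worth doing in parallel, is to let each iteration format its line into its own slot and then print the slots serially. The names FormatLines and PrintInOrder are hypothetical and do not appear in the original test program.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Format every element in parallel. Iteration j writes only to lines[j],
// so no two threads ever touch the same string.
std::vector<std::string> FormatLines(const std::vector<double>& vec)
{
    const int n = static_cast<int>(vec.size());
    std::vector<std::string> lines(vec.size());

    #pragma omp parallel for
    for (int j = 0; j < n; ++j)
    {
        std::ostringstream os;        // local to this iteration
        os << "vec[" << j << "] = " << vec[j];
        lines[j] = os.str();
    }
    // implicit barrier here

    return lines;
}

// Print the formatted lines serially, so they appear in index order.
void PrintInOrder(const std::vector<double>& vec)
{
    const std::vector<std::string> lines = FormatLines(vec);
    for (std::size_t j = 0; j < lines.size(); ++j)
    {
        std::cout << lines[j] << '\n';
    }
}
```

For output as cheap as this the serial print loop will dominate, so the sketch is about correctness of ordering rather than speed.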
We can draw some conclusions: first, we see
that it is easy to incorporate parallel directives in serial code in order to
improve performance. Second, the OpenMP API takes care of thread creation and
destruction and this fact lessens the burden on the developer. Finally, the
loop-level code can be used to improve the performance of matrix-based
computations in finance. For example, in some cases speedup of 80% is possible
on dual-core machines.
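Speedup figures depend on the machine and the problem size, so it is worth measuring them. The following is a minimal timing sketch, not the measurement setup behind the figure quoted above. The names InnerProductSerial, InnerProductParallel and SecondsFor are hypothetical, and the sketch assumes a compiler with std::chrono (C++11), which is newer than the VS2005 compiler used for the test program.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Serial reference version (no pragma).
double InnerProductSerial(const std::vector<double>& v1,
                          const std::vector<double>& v2)
{
    double result = 0.0;
    for (std::size_t j = 0; j < v1.size(); ++j)
    {
        result += v1[j] * v2[j];
    }
    return result;
}

// Loop-level parallel version using a reduction.
double InnerProductParallel(const std::vector<double>& v1,
                            const std::vector<double>& v2)
{
    const int n = static_cast<int>(v1.size());
    double result = 0.0;

    #pragma omp parallel for reduction(+: result)
    for (int j = 0; j < n; ++j)
    {
        result += v1[j] * v2[j];
    }
    return result;
}

// Wall-clock seconds taken by one call of f(v1, v2). The computed value
// is stored in 'value' so that the call cannot be optimised away.
template <typename F>
double SecondsFor(F f,
                  const std::vector<double>& v1,
                  const std::vector<double>& v2,
                  double& value)
{
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    value = f(v1, v2);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling SecondsFor once with InnerProductSerial and once with InnerProductParallel on the same vectors gives two times; their ratio (serial divided by parallel) is the speedup. For small vectors the cost of starting the threads can outweigh the gain, so large N and repeated runs give more reliable numbers.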
The full source code for
the test program can be found at www.datasimfinancial.com
(where you can register and log into the forum, see the OpenMP thread).
In the next blog I shall
discuss the application of coarse-grained
techniques to the development of efficient code for Monte Carlo applications.