## C# Optimization Revisited Part 2: Concurrency

Wednesday, March 25, 2009 – 3:00 AM In Part 1 of Optimization Revisited I considered, with the help of Bill Clinton, improving performance by using a fundamentally better algorithm rather than micro-optimizing an existing one.

One thing that’s pretty obvious looking at the CPU usage on a  machine with more than one core is that even the Leapfrog algorithm isn’t making good use of the available computer power. Looking at the resource monitor (right) you can see that only one of the available CPU cores is being used. Given that I gave Intel a lot of money for that nice new i7 920 processor it would be nice to make use of it.

The trick here is to take advantage of the additional cores by running different parts of the problem on different threads which can then execute on different cores. Rewriting the original code using the .NET threading libraries isn’t for the faint hearted but the Parallel Extensions for .NET 3.5 comes to the rescue with various . The one we’ll use here is Parallel.For. A parallelized Leapfrog implementation (see part 1 for the original serial version):

```    public void Integrate(double dT, Body[] bodies)
{
n = bodies.Length;        Parallel.For(0, n, i =>
{
bodies[i].Velocity += bodies[i].Acceleration * (dT / 2.0);
bodies[i].Position += bodies[i].Velocity * dT;
});

Parallel.For(0, n, i =>
{
Vector f = new Vector();
for (int j = 0; j < bodies.Length; j++)
{
if (i == j)
continue;
f += Body.Force(bodies[i], bodies[j]);
}
bodies[i].Acceleration = f / bodies[i].Mass;
bodies[i].Velocity += bodies[i].Acceleration * (dT / 2.0);
});
}```

Comparing this to the serial implementation in part 1 you’ll see that the looping structure has been modified.

```    for (int i = 0; i < bodies.Length; i++)
{
for (int j = (i + 1); j < bodies.Length; j++)
{
Vector f = Body.Force(bodies[i], bodies[j]);
bodies[i].Acceleration += f / bodies[i].Mass;
bodies[j].Acceleration -= f / bodies[j].Mass;
}
bodies[i].Velocity += bodies[i].Acceleration * (dT / 2.0);
}``` The reason for this change is because the serial algorithm attempts to modify the contents of the array at bodies[i] and bodies[j] in the inner loop. This isn’t going to behave well in the parallel version where, for example, the first thread executes with i = 0 and starts by updating bodies[j] where j = i + 1 (1) and the second thread executes with i = 1 and updates bodies[i] so both threads try and update bodies at the same time.

This is the single biggest pitfall when parallelizing an existing algorithm to take advantage of data parallelism. Dealing with shared data or state and ensuring that different threads don’t attempt to update the same piece of data at the same time.

So after all that work is this faster? Well the maxed out CPU usage (right) suggests that it will be. Now all eight cores on the machine are being used pretty much 100%. Does it run eight times as fast? Well alas no. There are several reasons for this. First and foremost the i7 isn’t really an eight core processor. Actually it’s a four core processor but each core is hyper threaded so appears as two hardware threads. In reality the hyper threading gives you some improvement but it’s not the same as having twice as many physical cores. So you see a four times speed up on this type of hardware. You can see this in the output from a timing run written with code timer library (left). The serial Leapfrog executes its iterations in about 8.6s whereas the parallel implementation takes about 2.0s. A speedup of 4.1. Numbers for the simpler Forward Euler are also shown here (covered in part 1).

There’ll be more on this as I work to parallelize an Octree based N-body code in the future.

1. ## 5 Responses to “C# Optimization Revisited Part 2: Concurrency”

2. Very interesting Ade! Have you looked at what the LINQ code is compiled to in Reflector? I wonder if you could look at the compiler generated stuff and then optimize that.

By Sam on May 10, 2009

3. Sam,

You might be able to take the threading code generated by the Parallel.For statement using Reflector and tweak it. I’m familiar with what would get produced (the parallel library team show this in some of their demos). My worry with doing that is that the source code will be much harder to read and maintain.

I actually took a different approach and re-wrote the code code in C++. I don’t think this was any harder and results in significant performance improvement.

More on this in part 3!