In part 1 and part 2 of my thread on C# optimization there was a lot of talk about algorithms and the like. The next two posts take a different tack… first re-implementing the same algorithm in a different language, and then moving it to different hardware altogether. It turns out that there are some dramatic performance improvements to be had by doing this.
So first let’s implement the Leapfrog Integrator in native code using C++ and C++/CLI…
Wrapping native code with C++/CLI
The first challenge is how to glue a native C++ class into your C# application. Luckily MSDN has a pretty good example of how to do this, “How to: Wrap Native Class for Use by C#”. Essentially the approach is to write a managed class—here it’s LeapfrogNativeIntegrator—which wraps the unmanaged code implemented in a second class—in this case LeapfrogNativeIntegratorImpl.
Caveat Emptor: My C++ is sort of rusty, treat accordingly!
The managed class simply implements the IIntegrate interface and handles the lifetime of the contained unmanaged class:
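Following the MSDN pattern, the wrapper might look something like this. This is a sketch only; the IIntegrate method signatures and member names are my assumptions, while the class names come from the post:

```cpp
// C++/CLI: a managed wrapper that owns the native implementation.
// LeapfrogNativeIntegrator is visible to C#; LeapfrogNativeIntegratorImpl
// is a plain native C++ class (declared elsewhere).
public ref class LeapfrogNativeIntegrator : IIntegrate
{
public:
    LeapfrogNativeIntegrator()
        : m_impl(new LeapfrogNativeIntegratorImpl()) { }

    // Destructor (maps to Dispose in C#) and finalizer both free the
    // native object, so it is cleaned up deterministically or by the GC.
    ~LeapfrogNativeIntegrator() { this->!LeapfrogNativeIntegrator(); }
    !LeapfrogNativeIntegrator() { delete m_impl; m_impl = nullptr; }

    // Pass-through methods; signatures are assumed.
    virtual void Initialize(array<Body>^ bodies);
    virtual void Integrate(double dt, int steps);

private:
    LeapfrogNativeIntegratorImpl* m_impl;  // owned native implementation
};
```

The destructor/finalizer pair is the standard C++/CLI idiom for owning unmanaged resources from a managed class.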
The calls to the Initialize and Integrate methods are simply pass-through methods to the corresponding native implementations.
In the previous discussions the Body struct was left to your imagination; its details really didn’t matter that much. Now that we’re going to pass these structs to C++ it’s time to get serious about how they look and, most importantly, how they are laid out in memory. Adding the StructLayout attribute to each of the C# structs allows you to specify an exact, sequential memory layout.
Similarly with the Vector struct:
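Here’s a sketch of both managed structs. The exact fields (mass, position, velocity) are my assumption; the StructLayout attribute is the important part:

```csharp
using System.Runtime.InteropServices;

// Sequential layout guarantees the fields sit in memory in declaration
// order, matching the native structs field for field.
[StructLayout(LayoutKind.Sequential)]
public struct Body
{
    public double Mass;
    public Vector Position;
    public Vector Velocity;
}

[StructLayout(LayoutKind.Sequential)]
public struct Vector
{
    public double X;
    public double Y;
    public double Z;
}
```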
In this case the individual members of the struct are all the same size. If your struct has members of varying sizes then placing larger members first in the layout may result in a smaller overall size as it allows the compiler to pack the data better.
I’m using structs because an array of structs is laid out linearly in memory. This, plus the fact that double is a blittable type, means that moving between these managed structs and the following unmanaged ones is trivial:
Similarly for the Vector struct:
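Together, the native mirrors might look like this. Every field is a double, so there’s no padding and the layout matches the managed versions exactly:

```cpp
// Native mirrors of the managed structs. Because every field is a
// blittable double and both sides use sequential layout, the bit
// patterns line up exactly and no marshalling is needed.
struct VectorNative
{
    double x, y, z;
};

struct BodyNative
{
    double mass;
    VectorNative position;
    VectorNative velocity;
};
```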
The next problem is that the C# structures’ location in memory is controlled by the CLR’s garbage collector. When passing a managed array to native code it must be “pinned” in memory by wrapping it in a pin_ptr. This prevents the garbage collector from moving the array during a collection. Lastly the C# type is cast to the C++ type:
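As a sketch, the Initialize pass-through might look like this (the native Initialize signature is my assumption):

```cpp
// C++/CLI: pin the managed array, then reinterpret it as the native layout.
void LeapfrogNativeIntegrator::Initialize(array<Body>^ bodies)
{
    // While "pinned" is in scope the GC will not move the array.
    pin_ptr<Body> pinned = &bodies[0];

    // Safe only because Body and BodyNative have identical sequential
    // layouts of blittable doubles.
    BodyNative* native = reinterpret_cast<BodyNative*>(pinned);

    m_impl->Initialize(native, bodies->Length);  // assumed native signature
}   // pinning ends here, when pin_ptr goes out of scope
```

Note that the pin only lasts for the duration of the call; the native code must not hold onto the pointer afterwards.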
Note the use of ^ to denote a handle to a managed object in C++. C++/CLI adds two new decorators for managed objects. We’re running out of symbols here people!
A native implementation
Now all that’s left is to implement the Leapfrog algorithm in C++ and make sure it compiles as unmanaged by using the unmanaged pragma.
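Here’s a hedged sketch of such a native implementation: one kick-drift-kick leapfrog step using plain array indexers. The gravitational constant, softening term, and function names are my assumptions, and the structs are repeated so the sketch is self-contained:

```cpp
#include <cmath>
#include <cstddef>

#ifdef _MSC_VER
#pragma unmanaged  // under /clr, force native code generation for this file
#endif

struct VectorNative { double x, y, z; };
struct BodyNative  { double mass; VectorNative position, velocity; };

// Sum gravitational accelerations on each body (O(n^2) pairwise pass).
static void Accelerations(const BodyNative* b, VectorNative* acc, std::size_t n)
{
    const double G = 1.0;      // units chosen so G = 1; an assumption
    const double soft = 1e-9;  // softening term to avoid division by zero
    for (std::size_t i = 0; i < n; ++i)
    {
        acc[i].x = acc[i].y = acc[i].z = 0.0;
        for (std::size_t j = 0; j < n; ++j)
        {
            if (i == j) continue;
            double dx = b[j].position.x - b[i].position.x;
            double dy = b[j].position.y - b[i].position.y;
            double dz = b[j].position.z - b[i].position.z;
            double r2 = dx * dx + dy * dy + dz * dz + soft;
            double s  = G * b[j].mass / (r2 * std::sqrt(r2));
            acc[i].x += s * dx;
            acc[i].y += s * dy;
            acc[i].z += s * dz;
        }
    }
}

// One kick-drift-kick leapfrog step over all bodies, plain array indexing.
void IntegrateStep(BodyNative* b, VectorNative* acc, std::size_t n, double dt)
{
    Accelerations(b, acc, n);
    for (std::size_t i = 0; i < n; ++i)  // half kick, then drift
    {
        b[i].velocity.x += 0.5 * dt * acc[i].x;
        b[i].velocity.y += 0.5 * dt * acc[i].y;
        b[i].velocity.z += 0.5 * dt * acc[i].z;
        b[i].position.x += dt * b[i].velocity.x;
        b[i].position.y += dt * b[i].velocity.y;
        b[i].position.z += dt * b[i].velocity.z;
    }
    Accelerations(b, acc, n);
    for (std::size_t i = 0; i < n; ++i)  // second half kick
    {
        b[i].velocity.x += 0.5 * dt * acc[i].x;
        b[i].velocity.y += 0.5 * dt * acc[i].y;
        b[i].velocity.z += 0.5 * dt * acc[i].z;
    }
}
```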
There’s not a lot to say about this code, other than that it doesn’t try to do anything “clever” with pointers; it just uses array indexers. Pointer arithmetic was a common optimization when I first started C/C++ development a long time ago, but these days it just prevents the compiler from optimizing the code (see AMD’s optimization notes). Altering the code to use array indexers resulted in a 40% improvement.
The C++ code implements exactly the same Leapfrog algorithm. It uses macros to inline various VectorNative operations like addition, assignment, etc. Compiler inlining would have been preferable, but inline and even __forceinline are only treated as suggestions by the compiler, and in this case adding methods to the VectorNative struct actually made the code a lot slower. Manually inlining with macros like VECTOR_SUB doesn’t read nearly as well but gets the job done.
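For illustration, VECTOR_SUB might be defined like this. VECTOR_ADD and VECTOR_SCALE are my stand-ins for the other macro-inlined operations, and the struct is repeated so the sketch is self-contained:

```cpp
struct VectorNative { double x, y, z; };

// Macro-based "manual inlining": the operation is expanded in place at
// every call site, so there is no function-call overhead to rely on the
// compiler to remove.
#define VECTOR_SUB(r, a, b) \
    { (r).x = (a).x - (b).x; (r).y = (a).y - (b).y; (r).z = (a).z - (b).z; }

#define VECTOR_ADD(r, a, b) \
    { (r).x = (a).x + (b).x; (r).y = (a).y + (b).y; (r).z = (a).z + (b).z; }

#define VECTOR_SCALE(r, a, s) \
    { (r).x = (a).x * (s); (r).y = (a).y * (s); (r).z = (a).z * (s); }
```

The parenthesized arguments guard against operator-precedence surprises when the macro is expanded with expressions rather than plain variables.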
This is really, really ugly. I’m not entirely happy with the code but unless I can use (inline) methods with little or no performance penalty, that’s the way it is. In part 4 we’ll see that we can at least reuse these macros in the Nvidia CUDA C implementation.
So now we have a native implementation of the Leapfrog algorithm. How about having it run on all the available cores on the processor?
Parallelizing with OpenMP
So how can you parallelize this C++ implementation? Obviously the Task Parallel Library (TPL) is only available to managed code. The next version of Visual C++ (Visual C++ 2010) will ship with the Parallel Patterns Library (PDC talk video) but that’s not much help today (I’ll be trying it out as soon as Visual Studio 2010 Beta 2 ships). Luckily Visual Studio 2008 supports OpenMP 2.0, an API specifically designed to support loop parallelism.
The end result is very similar looking code to the parallel C# implementation we saw in part 2:
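As a sketch (with illustrative names), the pattern is a single pragma on the outer loop. Note that OpenMP 2.0 requires a signed integer loop variable, and if OpenMP support is disabled the pragma is simply ignored and the loop runs serially:

```cpp
#include <cstddef>
#include <vector>

// Each iteration touches only its own element, so the outer loop is
// embarrassingly parallel: one pragma splits it across the available cores.
// Build with /openmp (MSVC) or -fopenmp (GCC/Clang) to enable threading.
void DriftAll(std::vector<double>& position,
              const std::vector<double>& velocity, double dt)
{
    const int n = static_cast<int>(position.size());  // OpenMP 2.0 wants a signed index
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        position[i] += velocity[i] * dt;
}
```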
The OpenMP parallel for pragma results in something like the TPL’s Parallel.For (in very simplistic terms).
Of course the real question is: is it faster?
Let’s be clear: this isn’t a real benchmark. It’s a very special case in some respects; a tight series of loops containing a sqrt operation (which accounts for a sizable chunk of the CPU cycles). It also doesn’t involve any significant memory allocation, which seems to be where it’s easier for managed code to produce better performance.
Some of this makes sense but it also raises some big questions. Running on a quad-core Intel i7 (eight hardware threads) the C# parallel implementation runs four times as fast as it does on a single core. This is largely what you’d expect.
Firstly, you would expect a similar 4x speedup when comparing the single-core and parallel C++ implementations, but that’s not what’s observed. The OpenMP implementation is only twice as fast as the serial one. Clearly something is limiting performance, though at this stage it’s hard to determine what.
Secondly, the serial C++ implementation is much faster than I’d expect when compared to the serial C#; a 4x speedup just seems odd. This suggests that my C# code could be more efficient, though possibly at the expense of readability. The C++ code uses macros to modify the BodyNative struct’s data directly whereas the C# code does so via methods on the Body struct. If these method calls aren’t inlined then this could well be the cause (see the previous post on C# and inline methods).
Caveat Emptor: In both cases some more investigation is required; this post doesn’t have all the answers. Maybe an evening playing with the profiler and another post on the results is in order? If nothing else, the thing I learnt from this is that you have to think a lot harder about performance than you’d expect.