C# Optimization Revisited Part 3: The “Native Option” C++

Tuesday, September 1, 2009 – 10:21 AM

In part 1 and part 2 of my thread on C# optimization there was a lot of talk about algorithms and the like. The next two posts take a different tack… first re-implementing the same algorithm using different languages and then in a different language and hardware. It turns out that there are some dramatic improvements in performance to be had in doing this.

So first let’s implement the Leapfrog Integrator in native code using C++ and C++/CLI…


Wrapping native code with C++/CLI

The first challenge is how to glue a native C++ class into your C# application. Luckily MSDN has a pretty good example of how to do this, “How to: Wrap Native Class for Use by C#”. Essentially the approach is to write a managed class—here it’s LeapfrogNativeIntegrator—which wraps the unmanaged code implemented in a second class—in this case LeapfrogNativeIntegratorImpl.

Caveat Emptor: My C++ is sort of rusty, treat accordingly!

The managed class simply implements the IIntegrate interface and handles the lifetime of the contained unmanaged class:

#pragma once
 
#include "..\Include\BodyNative.h"
#include "LeapfrogNativeIntegratorImpl.h"
 
using namespace System;
using namespace System::ComponentModel;
using namespace NBody::DomainModel;
 
namespace NBody
{
    namespace DomainModel
    {
        namespace Integrators
        {
            [Description("Leapfrog (native)")]
            public ref class LeapfrogNativeIntegrator : public IIntegrate
            {
            private:
                LeapfrogNativeIntegratorImpl *m_integrator;
                bool m_isInitialized;
 
            public:
                LeapfrogNativeIntegrator() { m_integrator = new LeapfrogNativeIntegratorImpl(); }
                ~LeapfrogNativeIntegrator() { delete m_integrator; }
 
                virtual void Initialize(cli::array<Body, 1> ^bodies);
                virtual void Integrate(Double dT, cli::array<Body, 1> ^bodies);
            };
        }
    }
}

The calls to the Initialize and Integrate methods are simply pass-through methods to the corresponding native implementations.

In the previous discussions the Body struct was left to your imagination, it’s details really didn’t matter that much. Now that’s we’re going to pass these to C++ it’s time to get serious about how these look and, most importantly, how they are laid out in memory. Adding the StructLayout attribute to each of the C# structs allows you to specify the exact sequential memory layout.

[StructLayout(LayoutKind.Sequential)]
public struct Body
{
  public Vector Position { get; set; }
  public Vector Velocity { get; set; }
  public Vector Acceleration { get; set; }
 
  // ...
}

Similarly with the Vector struct:

[StructLayout(LayoutKind.Sequential)]
public struct Vector : IPosition
{
  public double X { get; set; }
  public double Y { get; set; }
  public double Z { get; set; }
 
  // ...
}

In this case the individual members of the struct are all the same size. If your struct has members of varying sizes then placing larger members first in the layout may result in a smaller overall size as it allows the compiler to pack the data better.

I’m using structs because an array of structs is laid out linearly in memory. This, plus the fact that double is a blittable type, means that moving between these managed structs and the following unmanaged ones is trivial:

struct BodyNative
{
public:
  VectorNative Position;
  VectorNative Velocity;
  VectorNative Acceleration;
 
  // ...
}

Similarly for the Vector class:

struct VectorNative    
{
public:
  double x;  
  double y;
  double z;
 
  // ...
}

The next problem is that the C# structure’s location in memory is controlled by the CLR’s garbage collection. When passing the managed structure to native code it must be “pinned” in memory by wrapping it in an pin_ptr. This prevents the garbage collector from moving the array as it runs its collection and manages memory. Lastly the C# type is cast to the C++ type:

#include "stdafx.h"
#include "LeapfrogNativeIntegrator.h"
 
using namespace System;
using namespace NBody::DomainModel;
using namespace NBody::DomainModel::Integrators;
 
void LeapfrogNativeIntegrator::Initialize(cli::array<Body, 1>^ bodies)
{
  pin_ptr<Body> pBodies = &bodies[0];
  m_integrator->Initialize((BodyNative*)pBodies, bodies->Length);
  m_isInitialized = true;
}
 
void LeapfrogNativeIntegrator::Integrate(System::Double dT, cli::array<Body, 1> ^bodies)
{
  if (!m_isInitialized)
    throw gcnew InvalidOperationException();
  pin_ptr<Body> pBodies = &bodies[0];
  m_integrator->Integrate(dT, (BodyNative*)pBodies, bodies->Length);
}

Note the use of ^ to denote a handle to a managed object in C++. C++/CLI adds two new decorators for managed objects. We’re running out of symbols here people!

A native implementation

Now all that’s left is to implement the Leapfrog algorithm in C++ and make sure it compiles as unmanaged by using the unmanaged pragma.

#include "stdafx.h"
#include "LeapfrogNativeIntegratorImpl.h"
 
using namespace NBody::DomainModel::Integrators;
 
#pragma unmanaged
 
void LeapfrogNativeIntegratorImpl::Initialize(BodyNative* __restrict pBodies, const size_t lenBodies)
{
  for (size_t i = 0; i < lenBodies; i++)
  {
    VECTOR_ZERO(pBodies[i].Acceleration);
  }
 
  for (size_t i = 0; i < lenBodies; i++)
  {
    const double Mi = pBodies[i].Mass;
 
    for (size_t j = (i + 1); j < lenBodies; j++)
    {
      VectorNative r;
      VECTOR_SUB(r, pBodies[j].Position, pBodies[i].Position);
 
      const double rm = VECTOR_MAG_SQUARED(r);
 
      VectorNative f;
      const double k = Mi * pBodies[j].Mass / (sqrt(rm) * rm);
      VECTOR_SCALAR_MUL(f, r, k);
      
      VECTOR_ADD_SCALAR_DIV(pBodies[i].Acceleration, f, Mi);
      VECTOR_SUB_SCALAR_DIV(pBodies[j].Acceleration, f, pBodies[j].Mass);
    }
  }
}
 
void LeapfrogNativeIntegratorImpl::Integrate(const double dT, BodyNative*  __restrict pBodies, const size_t lenBodies)
{
  const double dT2 = dT / 2.0;
 
  for (size_t i = 0; i < lenBodies; i++)
  {
    VECTOR_ADD_SCALAR_MUL(pBodies[i].Velocity, pBodies[i].Acceleration, dT2);
    VECTOR_ADD_SCALAR_MUL(pBodies[i].Position, pBodies[i].Velocity, dT);
    VECTOR_ZERO(pBodies[i].Acceleration);
  }
 
  for (size_t i = 0; i < lenBodies; i++)
  {
    const double Mi = pBodies[i].Mass;
 
    for (size_t j = (i + 1); j < lenBodies; j++)
    {
      VectorNative r;
      VECTOR_SUB(r, pBodies[j].Position, pBodies[i].Position);
 
      const double rm = VECTOR_MAG_SQUARED(r);
 
      VectorNative f;
      const double k = Mi * pBodies[j].Mass / (sqrt(rm) * rm);
      VECTOR_SCALAR_MUL(f, r, k);
 
      VECTOR_ADD_SCALAR_DIV(pBodies[i].Acceleration, f, Mi);
      VECTOR_SUB_SCALAR_DIV(pBodies[j].Acceleration, f, pBodies[j].Mass);
    }
    VECTOR_ADD_SCALAR_MUL(pBodies[i].Velocity, pBodies[i].Acceleration, dT2);
  }
}

There’s not a lot to say about this code. Other than the fact that it doesn’t try and do anything “clever” with pointers, it just uses array indexers. Using pointers was a common optimization when I first started C/C++ development a long time ago but these days it just prevents the compiler from optimizing the code (see AMD’s optimization notes). Altering the code to use array indexers resulted in a 40% improvement.

The C++ implements exactly the same Leapfrog algorithm. It uses macros to inline various VectorNative operations like addition, assignment etc. Inlining would have been preferable but inline and even __forceinline are only treated as suggestions by the compiler and in this case adding methods to the VectorNative struct actually make the code a lot slower. Manually inlineing with macros, like VECTOR_SUB doesn’t read nearly as well but gets the job done.

#define VECTOR_SUB(v, a, b)     \
{                               \
  (v).x = (a).x - (b).x;        \
  (v).y = (a).y - (b).y;        \
  (v).z = (a).z - (b).z;        \
}

This is really, really ugly. I’m not entirely happy with the code but unless I can use (inline) methods with little or no performance penalty that’s the way it is. In part 4 we’ll see that we can at least reuse these macros with the Nvidia CUDA C implementation.

So now we have a native implementation of the Leapfrog algorithm. How about having it run on all the available cores on the processor?

Parallelizing with OpenMP

So how do can you parallelize this C++ implementation? Obviously the Task Parallel Library (TPL) is only available to managed code. The next version of Visual C++ (Visual C++ 0x) will ship with the Parallel Pattern Library (PDC talk video) but that’s not much help today (I’ll be trying it out as soon as Visual Studio 2010 Beta 2 ships). Luckily Visual Studio 2008 supports the OpenMP 2.0, a library specifically designed to support loop parallelism.

The end result is very similar looking code to the parallel C# implementation we saw in part 2:

 
void LeapfrogNativeParallelIntegratorImpl::Integrate(const double dT, BodyNative* __restrict pBodies, const size_t lenBodies)
{
  const double dT2 = dT / 2.0;
  const int len = static_cast<int>(lenBodies);
 
  #pragma omp parallel for schedule(static) 
  for (int i = 0; i < len; i++)
  {
    VECTOR_ADD_SCALAR_MUL(pBodies[i].Velocity, pBodies[i].Acceleration, dT2);
    VECTOR_ADD_SCALAR_MUL(pBodies[i].Position, pBodies[i].Velocity, dT);
  }
 
  #pragma omp parallel for schedule(static) 
  for (int i = 0; i < len; i++)
  {
    const double Mi = pBodies[i].Mass;
    VectorNative f;
    VECTOR_ZERO(f);
 
    for (int j = 0; j < len; j++)
    {
      if (i == j)
        continue;
 
      VectorNative r;
      VECTOR_SUB(r, pBodies[j].Position, pBodies[i].Position);
 
      const double rm = VECTOR_MAG_SQUARED(r);
      const double k = Mi * pBodies[j].Mass / (sqrt(rm) * rm);
      VECTOR_ADD_SCALAR_MUL(f, r, k);
    }
    VECTOR_SCALAR_DIV(pBodies[i].Acceleration, f, Mi);
    VECTOR_ADD_SCALAR_MUL(pBodies[i].Velocity, pBodies[i].Acceleration, dT2);
  }
}

The OpenMP parallel for pragma results in something like the TPC’s Parallel.For (in very simplistic terms).

Of course the real question, is it faster?

Let’s be clear this isn’t a real benchmark. It’s a very special case on some respects; a tight series of loops containing a SQRT operation (which accounts for a sizable chunk of the CPU cycles). It also doesn’t involve any significant memory allocation which seems to be where it can be easier for the managed code to produce better performance.

integrator_perf

Some of this makes sense but it also raises big some questions. Running on a quad core Intel i7 (eight hardware threads) the C# parallel implementation runs four times as fast as it does on a single core. This is largely what you’d expect.

Firstly, you would expect a similar 4x speedup when comparing the single core and parallel C++ implementations but that’s not what’s observed. The OpenMP implementation is only twice as fast, not what you’d expect. Clearly something is limiting performance. At this stage it is hard to determine. 

Secondly, the serial C++ implementation is much faster than I’d expect when compared to the C# native C++. 4x speedup just seems odd. This suggests that my C# code could be more efficient but this might be at the expense of readability. The C++ code uses macros to modify the BodyNative struct’s data directly whereas the C# code does so through via methods on the Body class. If these method calls aren’t placed inline then this could well be the cause (see previous post on C# and inline methods).

Caveat Emptor: In both cases some more investigation is required. This post doesn’t have all the answers. Maybe an evening playing with the profiler and another post on the results is in order? If nothing else the thing I learnt from this is you have to think a lot harder about performance than you’d expect.

You’ll also see I’ve finally upgraded to a slightly better code snippet plug-in for Windows Live Writer.

  1. 3 Responses to “C# Optimization Revisited Part 3: The “Native Option” C++”

  2. About the relatively (twice is not so bad) poor performance gain of using OpenMP: it’s very likely that your code snippet is too small to counterbalance the tradeoff of being parallelized (synchronization and all), going in/out math core, and the usual various cache miss :)
    My two cents is that at this (low) level of code, you’re hitting a lot of hardware buttons…

    By Awen on May 18, 2010

  3. Awen,

    Yes. I have a further post showing a cache aware algorithm which shows considerable improvement over this version.

    Ade

    By Ade Miller on May 18, 2010

  1. 1 Trackback(s)

  2. Mar 26, 2010: Implementing a Parallelized Octree in F# | #2782 - Thinking about agile (small 'a') software development, patterns and practices for building Microsoft .NET applications.

Sorry, comments for this entry are closed at this time.