Using CUDA/Thrust with the Parallel Patterns Library

Friday, December 31, 2010 – 4:48 PM

I’ve been working on getting my CUDA/Thrust N-body code working with multiple GPUs. The following is a quick code spike showing how to use the Parallel Patterns Library (PPL) to create a task for each CUDA device and execute a CUDA kernel on it using the Thrust library for CUDA.

Building a solution using the PPL and CUDA together is a little tricky. While the CUDA 3.2 SDK supports VS 2010, it requires that the project is set up to target the VC++ 9 compiler (from VS 2008) in order to use the NVCC compiler. This means that you need VS 2008 installed and that the project has its Platform Toolset property set to “v90”. The PPL requires the VC++ 10 compiler, so you need two projects in the solution: a DLL built with the v90 toolset and the CUDA 3.2 SDK, which compiles the CUDA kernel code, and the main EXE built with the v100 toolset.
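For example, the CUDA DLL project's .vcxproj would carry a configuration fragment along these lines (a hypothetical sketch; the property names are standard VS 2010 MSBuild elements, but your configuration will differ):

```xml
<!-- CUDA DLL project (.vcxproj): build as a DLL with the VS 2008 toolset -->
<PropertyGroup Label="Configuration">
  <ConfigurationType>DynamicLibrary</ConfigurationType>
  <PlatformToolset>v90</PlatformToolset>
</PropertyGroup>
```

The EXE project would use the same element with `v100` (the VS 2010 default), so the two toolsets never compile the same translation unit.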

The spike calculates the sum of a vector of integers by partitioning the range across multiple GPUs and then adding the results from each GPU to get the final total.

The following is the body of the EXE compiled with the v100 (VS 2010) toolset:

    // Setup data

    const size_t N = 100000;
    thrust::host_vector<ULONG> numbers(N);
    ULONG i = 1;
    generate(numbers.begin(), numbers.end(), [&i]()->ULONG{ return i++; } );

    // Get device count

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    vector<ULONG> results(deviceCount);

    // Calculate partitions

    vector<tuple<host_vector_citer, host_vector_citer>> ranges(deviceCount);
    host_vector_citer start = numbers.begin();
    size_t stride = N / deviceCount;
    size_t remainder = N % deviceCount;
    for (int d = 0; d < deviceCount; ++d)
    {
        get<0>(ranges[d]) = start;
        get<1>(ranges[d]) = start + stride;
        if (remainder > 0)
        {
            get<1>(ranges[d]) += 1;
            --remainder;
        }
        start = get<1>(ranges[d]);
    }

    // Run jobs

    for (int j = 0; j < 2; j++)
    {
        cout << "Job " << j << endl;
        task_group tasks;
        for (int d = 0; d < deviceCount; ++d)
        {
            tasks.run(
                [d, &ranges, &results]()
            {
                cudaSetDevice(d);
                results[d] = Sum::DoWork(get<0>(ranges[d]),
                                         get<1>(ranges[d]));
                cudaThreadExit();
            });
        }
        tasks.wait();
        ULONG result = 0;
        for_each(results.cbegin(), results.cend(), [&result](ULONG i)
            {
                cout << "  Subtotal = " << i << endl;
                result += i;
            });
        cout << "  Total = " << result << endl;
    }

Note the calls to cudaSetDevice and cudaThreadExit; these ensure that each CUDA device is associated with a specific thread and that the context is freed so it can be reused by the second job.

The CUDA/Thrust code is compiled in a separate DLL with the v90 toolset and imported using the following header:

    typedef thrust::host_vector<unsigned long>::iterator
        host_vector_iter;
    typedef thrust::host_vector<unsigned long>::const_iterator
        host_vector_citer;

    class __declspec(dllexport) Sum
    {
    public:
        static unsigned long DoWork(host_vector_citer cbegin,
            host_vector_citer cend);
    };

The .CU file containing the code for the CUDA C compiler copies the appropriate range to the device and executes a reduction on it, adding all the numbers together.

    unsigned long Sum::DoWork(host_vector_citer cbegin, host_vector_citer cend)
    {
        thrust::device_vector<unsigned long>
            numbers(thrust::distance(cbegin, cend));
        thrust::copy(cbegin, cend, numbers.begin());
        // Pass 0UL, not a plain 0: the type of the initial value
        // determines the accumulator type used by thrust::reduce.
        return thrust::reduce(numbers.cbegin(), numbers.cend(),
            0UL, thrust::plus<unsigned long>());
    }
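On the host, the same reduction is just std::accumulate. A CPU-only sketch for comparison (using unsigned long long here, an assumption on my part, since unsigned long is only 32 bits on Windows and the full sum of 1..100000 is 5,000,050,000):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side equivalent of the DoWork reduction for the values 1..n.
unsigned long long host_sum(std::size_t n)
{
    std::vector<unsigned long long> numbers(n);
    unsigned long long i = 1;
    std::generate(numbers.begin(), numbers.end(), [&i]() { return i++; });
    // As with thrust::reduce, the initial value's type fixes the
    // accumulator type, so pass 0ULL rather than a plain int 0.
    return std::accumulate(numbers.cbegin(), numbers.cend(), 0ULL);
}
```

The result should match the closed form n(n+1)/2, which is a quick sanity check for the GPU version's combined total.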

This is a very trivial example. It doesn’t address load balancing, error handling or performance. It’s probably far faster to do this on the CPU for the example shown. I should have another example soon that uses the Asynchronous Agents Library to create a pool of persistent worker threads and assign them work. It may well address some of these issues.

If you want to look at a fully fledged example of CUDA/Thrust working with a VS 2010 solution the current codebase for NBody.GPU is on Bitbucket.
