C++ AMP Extras: String Reversal

Tuesday, October 16, 2012 – 1:52 PM

As part of a series of follow ups to the C++ AMP book I’m writing some blog posts highlighting other examples of C++ AMP at work. Most of these are going to take the form of interview style questions that highlight various things people have asked me about or discussions of data-parallel programming patterns.

This one takes the form of a coding interview question beloved by interviewers, “Given a pointer to a null terminated array of char, reverse the order of the string”. It highlights how to deal with types not directly supported by C++ AMP.

How would you do this in C++ AMP? It turns out this isn’t quite as easy as you might think. C++ AMP lock of a char type is pretty easy to solve by packing four chars into an unsigned long and declaring a container, CharBlock, to handle packing and unpacking the individual values. The Cartoonizer case study, described in Chapter 10, shows how to use a similar trick for handling 32-bit ARGB image values.

   1: typedef unsigned long PackedChars;
   2:  
   3: struct CharBlock
   4: {
   5:     unsigned A, B, C, D;
   6:  
   7:     CharBlock(const PackedChars& chrs) restrict(cpu, amp)
   8:     {
   9:         A =  chrs & 0xFF;
  10:         B = (chrs & 0xFF00) >> 8;
  11:         C = (chrs & 0xFF0000) >> 16;
  12:         D = (chrs & 0xFF000000) >> 24; 
  13:     }
  14:  
  15:     inline PackedChars ReversePack() restrict(cpu, amp)
  16:     {
  17:         return (D | (C << 8) | (B << 16) | (A << 24));
  18:     }
  19: };

Note that ReversePack() places the values AD into the 32 bit value in reverse order (line 17). This allows you to define a Swap() function which swaps the left and right PackedChars values and also uses ReversePack() to reverse the order of the chars packed into both the left and right values.

   1: inline void Swap(PackedChars& left, PackedChars& right) restrict(amp)
   2: {
   3:     PackedChars tmp = right;
   4:     right = CharBlock(left).ReversePack();
   5:     left = CharBlock(tmp).ReversePack();
   6: }

Swapping all the elements in the array of PackedChars still needs a little more thought. The code needs to deal with cases where the string length is not exactly divisible by the number of characters in a CharBlock (four). It creates an additional array, strData, to hold a padded version of the original string.

   1: void ReverseStrAmp(char* const pStr)
   2: {
   3:     const unsigned blkSize = 4 * sizeof(char);
   4:     const unsigned charLen = 
   5:         static_cast<unsigned>(std::distance(pStr, FindEnd(pStr)));
   6:     unsigned blkLen = charLen / blkSize;
   7:     blkLen += ((blkLen % blkSize) > 0) ? 1 : 0; // Pad blocks.
   8:     blkLen += (blkLen % 2);                     // Even out block count.
   9:     const unsigned blkChars = blkLen * blkSize;
  10:  
  11:     std::vector<char> strData(blkChars);
  12:     std::copy(pStr, pStr + charLen, strData.begin());
  13:     array_view<PackedChars, 1> str(extent<1>(blkLen), 
  14:         reinterpret_cast<PackedChars*>(strData.data()));
  15:  
  16:     const unsigned offset = blkLen - 1;
  17:     parallel_for_each(extent<1>(blkLen / 2), [str, offset](index<1> idx) 
  18:         restrict(amp)
  19:     {
  20:         Swap(str[idx], str[offset - idx]);
  21:     });
  22:     str.synchronize();
  23:  
  24:     copy(strData.cbegin() + blkChars - charLen, strData.cend(), 
  25:         stdext::make_checked_array_iterator<char*>(pStr, charLen));
  26: }

If this were really an interview there are some other observations you might make:

This isn’t very efficient, the cost of copying data would likely outweigh any gains made from running this on a GPU. It might be useful as part of a series of operations that all took place on the GPU.

The *pStr points to an unpadded string. This cannot be used as the backing store to an array_view which must be padded to an even number of 4 character blocks. This results in further copy overhead to create the strData buffer. The API could be reframed to reduce this overhead. For example pStr could always be guaranteed have the additional padding, not to mention using a more up-to-date container like std::string!

You could use texture<uint_4, 1>. This would simplify the code by using the packing/unpacking and swizzling support provided by textures but would limit the size of the string to 16,384 x 4 characters. You can learn more about this in the Using Textures and Short Vectors, section of Chapter 11. This shows how to extend the Cartoonizer Case study described in Chapter 10 to use textures.

The solution doesn’t use tiling. Firstly, because it’s an interview and you should focus on an answer you think you can code in the time available. Secondly, the C++ AMP kernel only accesses each memory location once so using faster tile memory to reduce the cost of multiple accesses to global memory is unlikely to make a big difference here. This is not always the case, for example the matrix transpose example from Chapter 7 shows how smart use of tile static memory can improve coalescing of global memory accesses and improve performance.

Certainly you could experiment here and try a tiled and texture based approaches but this is a reasonable solution. You can find the complete code for this, along with unit tests and a CPU based version on the C++ AMP CodePlex site.

  1. 1 Trackback(s)

  2. Oct 17, 2012: Dew Drop – October 17, 2012 (#1,423) | Alvin Ashcraft's Morning Dew

Sorry, comments for this entry are closed at this time.