triperino_cl (github)

triperino_cl is essentially a direct port of crypt(3)--specifically the FreeSec implementation of DES--to OpenCL for GPGPU acceleration. Ostensibly, it's a tripcode brute-forcer, though I'll justify its lackluster performance by saying it's mostly just for fun. Current performance is around 14 million tripcodes per second on a reference Nvidia GTX 660 GPU.

As is, the FreeSec implementation of DES appears to be quite inefficient. I tested a naive Tripcode Explorer-like program based on this implementation of DES; it yielded only around 180 thousand tripcodes per second per thread on an Intel i5-2520M, where the expected performance under SSE2 acceleration with Tripcode Explorer would be 2+ million tripcodes per second.

However, the simplicity of the FreeSec implementation makes it easy to port to OpenCL. It relies on essentially no outside libraries (as compared to, say, DES fcrypt) and is easier for a complete DES newbie like me to understand than some bitslice versions of DES.

The Porting Process

The FreeSec implementation is largely lookup table based (this is also important when talking about performance on GPUs). The issue is that these lookup tables are mostly two-dimensional, and two-dimensional arrays are not supported out of the box in OpenCL kernels as global memory. To solve this issue, the first step of the ported implementation is to flatten each of the two-dimensional lookup tables before copying them to device memory.
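
A minimal sketch of that flattening step, assuming a psbox-like 4 x 256 table (the function name and dimensions here are illustrative, not the actual triperino_cl code):

#include <stddef.h>
#include <stdint.h>

/* Copy a two-dimensional table into a flat 1D buffer that can be handed
   to clCreateBuffer as ordinary global memory. A statically allocated C
   2D array is already contiguous, so this mostly amounts to committing
   to row * cols + col indexing on the device side. */
void flatten_table(const uint32_t table[4][256], uint32_t flat[4 * 256])
{
    for (size_t r = 0; r < 4; r++)
        for (size_t c = 0; c < 256; c++)
            flat[r * 256 + c] = table[r][c];
}

In the kernel, what was table[i][j] on the host is then read as flat[i * 256 + j].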

The rest of the porting process was more mechanical and mainly involved adapting data types to their OpenCL counterparts: u_char becomes uchar, uint32_t becomes uint, and so on. Other changes included breaking up structs into their individual fields to make passing things around within the kernel easier, as sketched below.
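
To give the flavor of the struct-flattening idea (the buffer and field names are hypothetical, not the actual triperino_cl identifiers):

/* Host side: instead of one struct argument, each former field becomes
   its own kernel argument; en_keysl_buf and en_keysr_buf are cl_mem
   buffers holding the two halves of an expanded key schedule. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &en_keysl_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &en_keysr_buf);

/* Device side: the kernel then takes plain pointers. */
__kernel void triperino(__global const uint *en_keysl,
                        __global const uint *en_keysr /* , ... */);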

Each OpenCL work-item brute forces a pre-determined number of passwords, say 256 or 512. If it finds a match, it writes the password and corresponding tripcode to preallocated regions of global memory. There is a bit of cheating in this last step; it is explained in detail under Performance Optimizations.
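
In outline, each work-item does something like the following (a sketch with hypothetical helper names; TRIES_PER_ITEM stands in for the 256 or 512 above):

__kernel void brute_force(__global const uchar *search,
                          __global uchar *found_passwords, /* one slot per work-item */
                          __global uchar *found_tripcodes,
                          uint seed)
{
    uint gid = (uint) get_global_id(0);
    uint state = seed ^ gid; /* per-work-item PRNG state */

    for (uint i = 0; i < TRIES_PER_ITEM; i++) {
        uchar password[9];
        uchar tripcode[11];
        state = next_password(state, password); /* xorshift-driven */
        make_tripcode(password, tripcode);      /* the ported DES crypt(3) */
        if (matches(tripcode, search))
            store_result(found_passwords, found_tripcodes, gid,
                         password, tripcode);
    }
}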

Performance Optimizations

So begins the fun part.

Random Password Generation

The first performance optimization actually took place before the code ever ran on the GPU: reasonably efficient random number generation. As it's pretty tough (is it even possible?) to call code from outside libraries in OpenCL, random numbers--and from them, passwords--are generated with an xorshift algorithm, shamelessly lifted directly from the Xorshift Wikipedia article.
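
The generator itself is tiny; this is the 32-bit variant as given in the Wikipedia article, which drops straight into OpenCL C:

/* 32-bit xorshift using the (13, 17, 5) shift triple from the Wikipedia
   article. The state must be seeded to a nonzero value. */
uint xorshift32(uint state)
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

One simple way to keep work-items from producing identical streams is to seed each one with a host-supplied seed mixed with its global ID.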

Lookup Tables (LUTs)

LUTs are notoriously terrible for GPGPU performance, as described in this Intel OpenCL article. The main issue is that LUTs are by definition memory access-bound, whereas GPGPU programs really, really like to be compute-bound. My initial hope was that the magic of thread scheduling could alleviate some of these issues by hiding latency, but having a bunch of LUTs sitting in low-bandwidth global memory on a GPU nonetheless means your work-items are going to spend most of their time bogged down reading memory. This issue was reflected early on in the performance figures: the initial throughput on the GTX 660 was 1-2 million tripcodes per second. That's slower than Tripcode Explorer's single-threaded performance on an i5-2520M!

The Intel article provides a reasonable workaround for when LUTs are absolutely necessary, or when the programmer is too lazy to use another algorithm: try to stuff the LUTs into local memory on the GPU. Local memory on GPUs is quite small--usually something like 16, 32, 48, or 64 KiB--but, like L1 cache on a CPU, it's much, much faster than main/global device memory. Luckily, the most heavily abused LUTs in the FreeSec implementation, m_sbox and psbox, fit in 32 KiB. As the GTX 660 has around 48 KiB of local memory per workgroup, I was in business. This single change yielded a 10x improvement in performance, taking throughput from around 1 million to 10+ million tripcodes per second.
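
A common pattern for this, sketched below with an illustrative table size, is a cooperative copy at kernel start followed by a barrier:

__kernel void brute_force(__global const uint *psbox_g /* , ... */)
{
    __local uint psbox_l[4 * 256]; /* ~4 KiB of local memory */

    /* Stride the copy across the workgroup so every work-item moves
       a share of the table. */
    for (uint i = get_local_id(0); i < 4 * 256; i += get_local_size(0))
        psbox_l[i] = psbox_g[i];

    /* No one may read the table until everyone has finished copying. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... the DES rounds then index psbox_l instead of psbox_g ... */
}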

Tuning global_work_size and local_work_size

Every OpenCL-capable device is different, and this is one of the more device-specific optimizations you can do. Basically, I just tried a bunch of different global worksizes (make sure your local worksize evenly divides this value for best performance). I left the local worksize up to the OpenCL runtime by passing a NULL pointer for it when firing up the kernel, as I found this yielded the best performance. However, there are some issues with playing around with these values--see Gotchas. Tuning these values brought performance from 10 million to around 13 million tripcodes per second.
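
Concretely, launching with a runtime-chosen local worksize looks something like this (the global worksize value is just an example):

size_t global_worksize[1] = { 1 << 16 }; /* tune per device */
/* Passing NULL for local_work_size lets the runtime pick the
   workgroup dimensions itself. */
status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                global_worksize, NULL, 0, NULL, NULL);
assert(!status);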

Cheating by Allocating and Copying Less Memory than May be Needed

The last major performance optimization was to recognize that most passwords do not yield a matching tripcode, especially as the search string grows; it's not unreasonable to assume that fewer than one in 256 or one in 1024 randomly guessed passwords will generate a match. triperino_cl exploits this property by allocating only enough space for a single password/tripcode pair (~20 bytes total) per work-item, despite the fact that each work-item tries many more passwords. Cutting back on the memory allocated--and, by extension, the amount that eventually gets copied back to the host to read out the found passwords and tripcodes--brings performance from 13 to 14 million tripcodes per second.
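
With the ~20-byte estimate above, the result buffer then scales with the number of work-items rather than the number of passwords tried (a sketch; SLOT_BYTES and num_work_items are illustrative names):

#define SLOT_BYTES 20 /* one password/tripcode pair per work-item */
cl_mem results = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                num_work_items * SLOT_BYTES, NULL, &status);
assert(!status);

The cheat, then, is that a work-item which finds more than one match can only report one of them--given the odds above, a rare and acceptable loss.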

Where does triperino_cl fit in performance-wise?

I'm fairly satisfied with this few-days-long hack. 14 million tripcodes per second is on par with the multithreaded, SSE2-accelerated version of Tripcode Explorer running on an i5-3570K. I guess a ~1 billion transistor advantage (for the GTX 660) makes up for a poor choice of algorithm implementation.

There has been previous work on GPU-accelerated DES brute-force cracking, with some implementations on AMD GPUs achieving 40+ million tripcodes per second. An optimized bitslice implementation for the PS3's Cell processor yielded 11.5 million passwords per second.

Epilogue: C and OpenCL Gotchas

It's always fun to document the various stupid or frustrating (or both) bugs I run into whenever I'm working in a language like C that requires you to manage your own memory. Here are some of the more memorable ones, in no particular order.

What's a Pointer to an Array?

For some reason, compiling an OpenCL kernel from source requires you to pass a pointer to a pointer (char **) to a character array containing the kernel source. However, I forgot that for a character array my_array, writing &my_array does not give you a pointer to a pointer--it gives you a pointer to the array itself, with the same address as &my_array[0]. In fact, there's no way to take the address of the pointer to the first element of a statically allocated array in C, because that pointer only exists momentarily when the array decays; you need a workaround like declaring a separate pointer, setting it to my_array, and taking the address of that pointer instead. In my code, that looks something like:

/* The named pointer exists only so we can take its address for the
   const char ** parameter; triperino_kernel_cl is the kernel source
   embedded in the binary as a byte array. */
const char *kernel_loc = (const char *) triperino_kernel_cl;
size_t kernel_len = triperino_kernel_cl_len;
triperino_program = clCreateProgramWithSource(context, 1,
                                              (const char **) &kernel_loc,
                                              &kernel_len, &status);
assert(!status);

Very ugly indeed, but it was a relief to figure out why compiling the kernel was causing a segfault.

No Work Done

For a while, running my kernel did absolutely nothing. The source of the problem turned out to be passing global_worksize[1] instead of global_worksize[0]: global_worksize was allocated as a single-entry array, so index 1 read past its end and happened to contain 0.

Printing a char

Another embarrassing mistake: always remember to pass a char * to printf's %s specifier, not a char.
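
A contrived illustration:

#include <stdio.h>

int main(void)
{
    char tripcode[] = "abcdefghij";
    printf("%s\n", tripcode);     /* correct: %s expects a char * */
    printf("%s\n", tripcode[0]);  /* wrong: passes a char; undefined behavior */
    return 0;
}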

Abusing Global Memory too Much

This was a fun one to debug--I was repurposing device memory as space to store parameters passed to the kernel. Unfortunately, the same device memory is used to store the eventual password/tripcode combinations returned by the kernel. This conflict meant that some work-items would receive the correct parameters, but others--those scheduled to run later--would have their parameters clobbered by the results of earlier work-items.

Display Timeout??

According to word of mouth, the Nvidia OpenCL implementation doesn't like it if you keep the GPU busy for more than 5 seconds at a time when the GPU is also being used to drive a display. This led to a lot of time spent hunting for memory bugs that weren't there when I increased the number of tripcodes each work-item would process, as OpenCL tends to throw the same status (-5, CL_OUT_OF_RESOURCES) for a variety of wildly different problems.

wtf printf

Not everyone implements printf, and even those who do don't necessarily implement it the way you expect. It took me a while to realize that my kernel build was failing because printf simply isn't implemented in Nvidia's OpenCL compiler.