triperino_cl is essentially a direct port of crypt(3)--or rather the FreeSec implementation of DES--to OpenCL for GPGPU acceleration. Ostensibly, it's a tripcode brute-forcer, though I'll justify its lackluster performance by saying it's mostly just for fun. Current performance is around 14 million tripcodes per second on a reference Nvidia GTX 660 GPU.
As is, the FreeSec implementation of DES appears to be quite inefficient. I tested a naive Tripcode Explorer-like program based on this implementation of DES. It yielded only around 180 thousand tripcodes per second per thread on an Intel i5-2520M, where the expected performance with SSE2 acceleration in Tripcode Explorer would be 2+ million tripcodes per second.
However, the simplicity of the FreeSec implementation makes it easy to port to OpenCL. It relies on essentially no outside libraries--as compared to, say, DES fcrypt--and is easier for a complete DES newbie like me to understand than some bitslice versions of DES.
The FreeSec implementation is largely lookup table based (this is also important when talking about performance on GPUs). The issue is that these lookup tables are mostly two-dimensional, and two-dimensional arrays are not supported out of the box in OpenCL kernels as global memory. To solve this issue, the first step of the ported implementation is to flatten each of the two-dimensional lookup tables before copying them to device memory.
The rest of the porting process was more mechanical and mainly involved adapting data types to their OpenCL counterparts (uint and so on). Other changes included breaking up structs into their individual fields to make passing things around easier within the kernel.
Each OpenCL work-item brute forces a pre-determined number of passwords, say 256 or 512. If it finds a match, it writes the password and corresponding tripcode to preallocated regions of global memory. There is a bit of cheating in this last step; it is explained in detail under Performance Optimization.
LUTs are notoriously terrible for GPGPU performance, as described in this Intel OpenCL article. The main issue is that LUTs are by definition memory access-bound, whereas GPGPU programs really, really like to be only compute-bound. My initial hope was that the magic of thread scheduling could alleviate some of these issues by hiding latency, but nonetheless, having a bunch of LUTs sitting in low-bandwidth global memory on a GPU means your work-items are going to be bogged down trying to read memory most of the time. This issue was reflected early on in the performance figures, with 1-2 million tripcodes per second being the initial performance on a GTX 660. That's slower than the singlethreaded performance on an i5-2520M!
The Intel article provides a reasonable workaround for when LUTs are absolutely necessary, or when the programmer is too lazy to use another algorithm: try to stuff the LUTs into local memory on the GPU. Local memory on GPUs is quite small--usually something like 16, 32, 48, or 64 KiB--but like L1 cache on a CPU, it's much, much faster than main/global device memory. Luckily, the most heavily abused LUTs in the FreeSec implementation will fit in 32 KiB. As the GTX 660 has around 48 KiB of local memory per workgroup, I was in business. This single change yielded a 10x improvement in performance, taking throughput from around 1 million to 10+ million tripcodes per second.
Every OpenCL-capable device is different, and this is one of the more device-specific optimizations you can do. Basically, I just tried a bunch of different global worksizes (make sure that your local worksize evenly divides this value for best performance). I left the local worksize up to the OpenCL runtime by passing a NULL pointer for the local worksize when firing up the kernel, as I found this to yield the best performance. However, there are some issues with playing around with these values--see Gotchas. Tuning these values took performance from 10 million to around 13 million tripcodes per second.
The last major performance optimization was to recognize that most passwords do not yield a matching tripcode, especially as the search string grows. It's not unreasonable to assume that fewer than one in 256 or one in 1024 randomly guessed passwords will generate a matching tripcode. triperino_cl exploits this property by only allocating enough space for a single password/tripcode pair (~20 bytes total) for each work-item, despite the fact that each work-item tries many more passwords. Cutting back on the memory allocated--and, by extension, the amount that eventually gets copied back to the host when reading out the found passwords and tripcodes--brings performance from 13 to 14 million tripcodes per second.
I'm fairly satisfied with this few-days-long hack. 14 million tripcodes per second is on par with the multithreaded and SSE2-accelerated version of Tripcode Explorer running on an i5-3570K. I guess a ~1 billion transistor count advantage (for the GTX 660) makes up for a poor choice of algorithm implementation.
There has been previous work on GPU-accelerated DES brute force cracking--with some implementations on AMD GPUs achieving 40+ million tripcodes per second. An optimized bitslice implementation for the PS3's cell processor yielded 11.5 million passwords per second.
It's always fun to document the various stupid or frustrating (or both) bugs that I run into whenever I'm working with a language such as C that requires you to manage your own memory. Here are some of the more memorable ones. I describe them in no particular order.
For some reason, compiling an OpenCL kernel from source requires you to pass a pointer to a pointer (char **) to a character array containing the kernel's source code. However, I forgot that when my_array is a statically allocated character array, my_array is equivalent to &my_array: both evaluate to the address of the first element. It's therefore not possible to get a pointer to a pointer to the first element of a statically allocated array in C unless you do a dumb workaround like creating another pointer, setting it to my_array, and then using the address of that other pointer. In my code, that looks something like:
const char *kernel_loc = (const char *) triperino_kernel_cl;
size_t kernel_len = triperino_kernel_cl_len;
triperino_program = clCreateProgramWithSource(context, 1,
                                              (const char **) &kernel_loc,
                                              &kernel_len, &status);
assert(!status);
Very ugly, indeed, but it was a relief to figure out the reason why compiling the kernel was causing a segfault.
For a while, running my kernel did absolutely nothing. The source of my problem turned out to be passing global_worksize with the wrong level of indirection, which meant the kernel saw a global worksize of 0 even though global_worksize was only allocated as a single-entry array.
Another embarrassing mistake: always remember to pass char *s to the functions that actually expect them.
This was a fun one to debug--I was repurposing device memory as space to store parameters passed to the kernel. Unfortunately, the same device memory is used to store the eventual password/tripcode combinations returned by the kernel. This conflict meant that some work-items would receive the correct parameters, but others--those scheduled to run later--would have their parameters clobbered by the results of earlier work-items.
According to word of mouth, the Nvidia OpenCL implementation doesn't like it if you keep the GPU too busy for more than 5 seconds at a time if the GPU is also being used to drive a display. This led to a lot of time spent trying to find memory bugs when I was increasing the number of tripcodes each work-item would process, as OpenCL tends to throw the same status (CL_OUT_OF_RESOURCES, -5) for a variety of wildly different problems.
Not everyone implements printf, and even if they do, not everyone does it according to your expectations. It took me a while to realize that the reason my kernel build was failing was that printf is just not implemented by Nvidia.