CryptoComputeLab is very proud to release Proofs v10.0.0! A lot has changed under the hood in this release, but ultimately it comes down to additional GPU support options and better performance during proving. The far less visible highlights include moving a number of our internal dependencies, which started life as forks of other projects, back onto their upstream sources. This kind of tooling update is beneficial for several reasons: we pick up upstream optimizations that our forks missed as the code diverged over time, and future updates (e.g. Halo2 work) become more realistic. But that is a more involved topic for another time, so let’s stick to the more visible highlights of this release.
This release includes major performance benefits in the GPU code paths. The biggest change is that it is now possible to run proofs with either OpenCL or CUDA support. When using the proofs library, it is as easy as enabling a feature flag called cuda. This is a simple toggle now, but a lot of changes were needed under the hood to make it an overall smooth experience.
Unified GPU API
CUDA and OpenCL have a lot in common, but they also have subtle differences, not only at the API level but also in the kernels themselves. Our idea was to create a library where you write your code once and it then works with both CUDA and OpenCL. This new library is called rust-gpu-tools, and it allows interaction with GPU kernels and buffers through a unified API.
The biggest difference is in how memory that is accessible only on the GPU is handled. OpenCL calls it local memory, while CUDA calls it shared memory. CUDA has a limitation where you can only have a single such memory space per kernel, and that limitation is inherited by rust-gpu-tools. Although the API is based on OpenCL’s approach, only a single LocalBuffer can be defined, which is then passed into the Kernel as an argument.
The next hurdle to overcome is that in CUDA it’s not possible to pass in shared memory as a kernel parameter. Not only do the APIs of CUDA and OpenCL differ, but the kernel code does as well. Most of the differences can be solved with simple #defines, as can be seen in the shared kernel code in ec-gpu. That leaves the problem of shared memory. Unfortunately, this cannot be fully unified between CUDA and OpenCL, because the kernel needs CUDA-specific changes. The exception is arguably minimal, though; an example can be found in bellperson’s FFT kernel. The end result is that the kernel still has the same number of arguments for both CUDA and OpenCL. In rust-gpu-tools the API is the one you would expect from OpenCL, where you pass in a LocalBuffer. For CUDA, it automatically does the right thing and passes on the amount of memory that should be allocated when the kernel is executed. Ultimately, we now have a unified API.
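To make the LocalBuffer handling described above more concrete, here is a minimal, self-contained sketch of the idea. It does not use the real rust-gpu-tools API: the types `Backend`, `KernelArg`, `Launch` and the function `prepare_launch` are hypothetical names invented for illustration, and replacing the local buffer with a placeholder argument on CUDA is an assumption that simply keeps the argument count identical on both backends.

```rust
/// Hypothetical description of GPU-local scratch memory:
/// `len` elements of `elem_size` bytes each.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LocalBuffer {
    pub len: usize,
    pub elem_size: usize,
}

impl LocalBuffer {
    pub fn new<T>(len: usize) -> Self {
        LocalBuffer { len, elem_size: std::mem::size_of::<T>() }
    }

    /// Total size in bytes.
    pub fn bytes(&self) -> usize {
        self.len * self.elem_size
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    OpenCl,
    Cuda,
}

/// Hypothetical kernel arguments; `u32` values stand in for real handles.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelArg {
    Buffer(u32),
    Scalar(u32),
    Local(LocalBuffer),
    /// Assumed stand-in that keeps the argument count stable on CUDA.
    Placeholder,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Launch {
    /// Arguments handed to the kernel.
    pub args: Vec<KernelArg>,
    /// Shared memory requested at launch time (non-zero only on CUDA).
    pub shared_mem_bytes: usize,
}

/// Translate backend-independent arguments into a backend-specific launch.
pub fn prepare_launch(backend: Backend, args: &[KernelArg]) -> Result<Launch, String> {
    let locals: Vec<LocalBuffer> = args
        .iter()
        .filter_map(|arg| match arg {
            KernelArg::Local(buf) => Some(*buf),
            _ => None,
        })
        .collect();
    // The single-memory-space limitation applies to both backends.
    if locals.len() > 1 {
        return Err(format!(
            "only one LocalBuffer is allowed per kernel, got {}",
            locals.len()
        ));
    }
    match backend {
        // OpenCL: the local buffer is an ordinary kernel argument.
        Backend::OpenCl => Ok(Launch { args: args.to_vec(), shared_mem_bytes: 0 }),
        // CUDA: the size is passed at launch instead of as real memory.
        Backend::Cuda => {
            let shared_mem_bytes = locals.first().map_or(0, |buf| buf.bytes());
            let args = args
                .iter()
                .map(|arg| match arg {
                    KernelArg::Local(_) => KernelArg::Placeholder,
                    other => *other,
                })
                .collect();
            Ok(Launch { args, shared_mem_bytes })
        }
    }
}
```

Calling `prepare_launch` with the same argument list for both backends yields launches with the same number of arguments; only the CUDA launch carries a shared memory size. The caller never has to branch on the backend, which is the point of the unified API.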
This GPU work started earlier this year with optimizations for Nvidia cards, developed by Supranational. The optimizations target finite-field arithmetic over prime fields and elliptic curve arithmetic in general, hence they live in a separate library called ec-gpu. The generated source code works for both CUDA and OpenCL, so you can use it in your own kernels and automatically get maximum performance on CUDA. Some optimizations are not enabled on Nvidia with OpenCL because compiling the resulting kernel would take a very long time (dozens of minutes), whereas the CUDA compiler handles the same code in a few minutes. All of our kernels (i.e. FFT, Multiexp, Poseidon) make use of those optimizations, and we see about 10-20% better performance. On synthetic tests the proving time was up to 30% faster.
While adding support for CUDA, a lot of code has changed, even on the OpenCL code path. As a result, we decided to disable the cuda feature by default in order to make sure there are no performance regressions in the existing OpenCL implementation. However, we strongly encourage everyone to give the CUDA implementation a try, and early testing looks promising. Currently, testing requires compiling the proofs from source and setting FFI_USE_CUDA=1 in the environment at build time. For more information, see the Experimental CUDA build support section of the Filecoin FFI README.
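As a rough illustration of how an opt-in Cargo feature such as cuda can select a code path at compile time, consider the sketch below. It is not taken from the proofs code base: the function names `backend_for` and `selected_backend` are hypothetical, and the only thing assumed from the text is that a feature named cuda exists and is off by default.

```rust
/// Pick a backend name from a flag; kept separate so it is easy to test.
/// (Hypothetical helper, not part of the proofs library.)
pub fn backend_for(cuda_enabled: bool) -> &'static str {
    if cuda_enabled {
        "cuda"
    } else {
        "opencl"
    }
}

/// Resolve the backend at compile time: `cfg!` evaluates to `true` only
/// when the crate is built with the `cuda` feature enabled, so the
/// default build falls back to OpenCL.
pub fn selected_backend() -> &'static str {
    backend_for(cfg!(feature = "cuda"))
}
```

Because the feature is off by default, a plain build takes the OpenCL branch, and nothing changes for existing users until they opt in.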
Improved proving performance
By focusing on very specific optimizations, we’ve increased proving performance in v10.0.0 by as much as 40-50%. Some of the key improvements were in the bellperson library and include the ability to use multiple GPUs for FFTs, optimized linear combination routines, further parallelized parameter loading, and optimized generation of exponents for GPU multiexp steps. The result is that the overall Filecoin mining Commit Phase 2 stage is now ~1.7x faster compared to the 9.0.2 release. Specifically, on our test hardware the time decreased from about 16 minutes to under 9.5 minutes when sealing a 32 GiB sector. We are still gathering performance numbers from the community members helping us test this, so stay tuned for more information!
Here are more complete results that our team has gathered for the v10.0.0 release compared to the v9.0.2 release:
| Mining operation (32 GiB sector) | Proofs v10.0.0 CUDA | Proofs v10.0.0 OpenCL | Proofs v9.0.2 OpenCL |
|---|---|---|---|
| Commit Phase 2 (milliseconds) | 564,797 | 680,153 | 974,893 |
| Window PoSt (milliseconds) | 110,875 | 124,998 | 159,908 |
| Verify Window PoSt (milliseconds) | 6 | 7 | 1,959 |
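For readers who want to check the headline figures against the table, the following small sketch derives speedups and minute values from the measured milliseconds. The `Timing` struct and the helper names are ours, introduced only for this calculation; the numbers are copied from the table above.

```rust
/// One row of the results table, all values in milliseconds.
pub struct Timing {
    pub name: &'static str,
    pub v10_cuda_ms: u64,
    pub v10_opencl_ms: u64,
    pub v9_opencl_ms: u64,
}

/// Measurements copied from the table above (32 GiB sector).
pub const TIMINGS: [Timing; 3] = [
    Timing { name: "Commit Phase 2", v10_cuda_ms: 564_797, v10_opencl_ms: 680_153, v9_opencl_ms: 974_893 },
    Timing { name: "Window PoSt", v10_cuda_ms: 110_875, v10_opencl_ms: 124_998, v9_opencl_ms: 159_908 },
    Timing { name: "Verify Window PoSt", v10_cuda_ms: 6, v10_opencl_ms: 7, v9_opencl_ms: 1_959 },
];

/// How many times faster the new run is than the baseline.
pub fn speedup(baseline_ms: u64, new_ms: u64) -> f64 {
    baseline_ms as f64 / new_ms as f64
}

/// Convert milliseconds to minutes.
pub fn ms_to_minutes(ms: u64) -> f64 {
    ms as f64 / 60_000.0
}

/// Format one line per operation with the v9.0.2 -> v10.0.0 speedups.
pub fn report() -> Vec<String> {
    TIMINGS
        .iter()
        .map(|t| {
            format!(
                "{}: CUDA {:.2}x, OpenCL {:.2}x",
                t.name,
                speedup(t.v9_opencl_ms, t.v10_cuda_ms),
                speedup(t.v9_opencl_ms, t.v10_opencl_ms)
            )
        })
        .collect()
}
```

For Commit Phase 2, the CUDA speedup works out to roughly 1.73x (974,893 ms is about 16.2 minutes, 564,797 ms is about 9.4 minutes), which matches the ~1.7x figure quoted earlier, while the OpenCL path alone comes to roughly 1.43x.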