Christoph M. Hoeppke

Rust order book implementation using SIMD instructions

Implementation

In this article we explore how single instruction, multiple data (SIMD) vectorisation can accelerate price calculation algorithms on order book data structures. Please see here for the Rust project that this article is based upon.

Order book structure

During market making, high-frequency trading (HFT), and general quantitative trading strategies it is important to have an accurate view of how orders are executed in the order book. Assuming we know the complete state of the order book at a given time, we can calculate the price at which our orders will execute given their volume. Let us first consider how prices form when a market order hits the order book, commencing with the example order book: \[ \begin{array}{l|l} \text{Price}& \text{Volume} \\\hline £103.00 & 25 \\ £102.50 & 45 \\ £102.00 & 35 \\ £101.50 & 50 \end{array} \] We notice a couple of things about this order book representation:
  • Prices are quantised (in this example to the nearest £0.50), so not every price can be selected when submitting an order.
  • Orders are sorted by price, descending for the buy orders shown here (ascending for sell orders).
Now when we calculate the final execution price for selling 100 units of our asset, assuming no further transaction fees, we observe that we will consume the 25 units available at £103.00 and the 45 units at £102.50 entirely. The remaining 30 units of our order will be filled by partially executing against the order at £102.00.
In summary our order is executed by consuming \[ \begin{array}{l|l} \text{Execution Price} & \text{Units} \\\hline £103.00 & 25 \\ £102.50 & 45 \\ £102.00 & 30 \end{array} \] for total proceeds of £10,247.50, i.e. an average execution price of £102.475 per unit. Immediately after executing our order the order book will look like \[ \begin{array}{l|l} \text{Price}& \text{Volume} \\\hline £102.00 & 5 \\ £101.50 & 50 \end{array} \] Algorithms to compute execution prices are fundamental to many trading strategies, whether to exploit arbitrage opportunities between exchanges or to execute on a signal predicted by an alpha engine. As execution speed for such opportunities is key, we want our pricing algorithm to be as performant as possible.
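As a concrete reference, the following is a minimal scalar sketch of this walk in Rust; the function name and the `(price, volume)` slice representation are illustrative and not taken from the repository.

```rust
/// Walk the price levels (best first) until `demand` is filled and
/// return the total amount spent. Assumes the book holds enough volume.
fn execution_cost(levels: &[(f64, f64)], mut demand: f64) -> f64 {
    let mut cost = 0.0;
    for &(price, volume) in levels {
        // Take as much as we still need, capped by this level's volume.
        let take = demand.min(volume);
        cost += take * price;
        demand -= take;
        if demand <= 0.0 {
            break;
        }
    }
    cost
}
```

For the example book above, `execution_cost(&[(103.0, 25.0), (102.5, 45.0), (102.0, 35.0), (101.5, 50.0)], 100.0)` returns 10247.5, matching the hand calculation.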

A brief introduction to SIMD instructions

Due to cache locality benefits, using simple vectors in Rust often outperforms more complicated data structures such as trees or dictionaries. We now want to present one further method to speed up such price finding algorithms: exploiting the single instruction, multiple data (SIMD) vectorisation capabilities of modern CPUs. Many modern CPUs contain special registers that are wider than the standard CPU word size. For example, AVX2-capable CPUs contain registers which are 256 bits wide, and thus capable of storing four double precision (64 bit) floating point numbers. Once values are loaded into these wide registers we can execute vectorised operations, allowing us to compute, for example, four additions with a single instruction. Such instructions provide an opportunity for parallelism on a single CPU core.
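As a minimal illustration using Rust's portable SIMD API (`std::simd`, nightly-only behind the `portable_simd` feature), a single lane-wise addition produces all four sums at once:

```rust
#![feature(portable_simd)] // nightly-only portable SIMD
use std::simd::f64x4;

fn main() {
    let a = f64x4::from_array([1.0, 2.0, 3.0, 4.0]);
    let b = f64x4::splat(0.5);
    // One SIMD addition computes all four lane-wise sums.
    assert_eq!((a + b).to_array(), [1.5, 2.5, 3.5, 4.5]);
}
```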
Let us now examine how we can improve the performance of our execution price algorithm by applying SIMD vectorisation. To achieve this we group four subsequent orders together into an order block, allowing us to represent prices and volumes as u64x4 and f64x4 values respectively.
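A sketch of such a block using the nightly `std::simd` types; the struct and field names here are illustrative, and the repository's actual layout may differ.

```rust
#![feature(portable_simd)]
use std::simd::{f64x4, u64x4};

/// Four consecutive price levels packed into SIMD-friendly registers.
struct OrderBlock {
    prices: u64x4,  // quantised prices, e.g. in ticks of £0.50
    volumes: f64x4, // remaining volume at each price level
}
```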

Parallel order execution algorithm

We previously removed an order from our book once its volume was fully consumed (the remaining volume is 0). Now that we group orders in sets of four, we drop an entry from our order vector only once all four volumes are zeroed, thus reducing the number of memory allocations and de-allocations.
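One way this compaction could look, reusing the hypothetical `OrderBlock` from the sketch above (note that the `std::simd::cmp` module path is nightly-only and may shift between toolchains):

```rust
use std::simd::{f64x4, cmp::SimdPartialEq};

/// Drop blocks whose four volumes are all zero, in a single pass.
fn compact(blocks: &mut Vec<OrderBlock>) {
    blocks.retain(|b| !b.volumes.simd_eq(f64x4::splat(0.0)).all());
}
```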
To calculate the execution price on a single block of four orders, we first calculate the cumulative volume offered at the four subsequent prices; this can be done efficiently using only two additions and SIMD vector rotations. With this we then calculate the volume consumed at each price and, using a final inner product, obtain the volume gained and the amount spent on the current order block. Here we profit greatly from Rust's first-class support for SIMD vectorised operations. While this pricing algorithm appears, at first glance, to be entirely sequential, we see that it can be vectorised, and it even helps us reduce the number of branches in our code.
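The following sketch shows how this per-block step could look under the same assumptions as before; for brevity we hold prices as `f64x4` (the repository uses `u64x4`), and zero-filling lane shuffles stand in for the rotations described above.

```rust
#![feature(portable_simd)]
use std::simd::{f64x4, num::SimdFloat, simd_swizzle};

/// Fill up to `demand` units from one block of four price levels and
/// return (units filled, amount spent), updating the volumes in place.
fn execute_block(prices: f64x4, volumes: &mut f64x4, demand: f64) -> (f64, f64) {
    let zero = f64x4::splat(0.0);
    // Prefix sum over the four lanes in two shifted additions. In
    // simd_swizzle! indices 0..=3 select from the first vector and
    // 4..=7 from the second, so these shuffles shift lanes right
    // while filling with zeros.
    let c1 = *volumes + simd_swizzle!(zero, *volumes, [0, 4, 5, 6]);
    let cum = c1 + simd_swizzle!(zero, c1, [0, 1, 4, 5]);
    // Volume sitting strictly in front of each price level.
    let before = cum - *volumes;
    // Units consumed per level: clamp(demand - before, 0, volume).
    let used = (f64x4::splat(demand) - before).simd_clamp(zero, *volumes);
    *volumes -= used; // branch-free partial fills
    (used.reduce_sum(), (used * prices).reduce_sum())
}
```

On the example book, `execute_block` with `demand = 100.0` fills 100 units for 10247.5, matching the scalar walk, without any data-dependent branches inside the block.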

Benchmarks and Rust implementation

The implementation of this project can be found here. To benchmark the improvements offered by SIMD vectorised order books, we compare the standard vector based order book implementation with an equivalent one holding prices and volumes in u64x4 and f64x4 SIMD registers. In both cases we create an order book with depth 5000. We measure the time required to query prices at volumes placed equidistantly across the total volume available, and at volumes placed equidistantly in the lowest 5 per cent of the order book. When analysing hardware oriented topics such as SIMD, it is often interesting to compare effectiveness across different processors. For this purpose we compare the performance benefit on an ARM based MacBook Pro with M2 (12P, 3E) to that on a 12-core AMD Ryzen 9 3900X. The key differentiator between these two for our purpose is that the AMD processor offers full 256-bit AVX2 vectorisation, whereas the M2's NEON units operate on 128-bit registers, so a four-lane double precision operation maps onto a single instruction only on the AMD chip. \[ \begin{array}{l|l} \text{System}& \text{Benchmark}& \text{Baseline duration}& \text{SIMD duration} & \text{SIMD speed-up} \\\hline \text{M2 MacBook} & \text{Full Depth} & 5.956 \text{s} & 4.481\text{s} & 1.33 \times \\ \text{M2 MacBook} & \text{5% Depth} & 0.317 \text{s} & 0.212\text{s} & 1.49 \times \\ \text{AMD 3900x} & \text{Full Depth} & 7.957 \text{s} & 4.254\text{s} & 1.87 \times \\ \text{AMD 3900x} & \text{5% Depth} & 0.430 \text{s} & 0.226\text{s} & 1.90 \times \\ \end{array} \] We observe that both systems benefit from the SIMD implementation for the same underlying algorithm and data structure. Additionally, we observe a greater, almost 2x, improvement when applying the SIMD optimisations on the AMD processor, which is in line with our expectations given the wider vector registers on this platform.
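For reference, the measurement loop described above could look like the following sketch; `OrderBook` and `execution_price` are hypothetical stand-ins for the repository's types, and `black_box` keeps the compiler from optimising the queries away.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Time `n` price queries at volumes spread evenly across `max_volume`
/// (pass 5% of the book's total volume for the shallow benchmark).
fn bench(book: &OrderBook, max_volume: f64, n: usize) -> Duration {
    let start = Instant::now();
    for i in 1..=n {
        let demand = max_volume * i as f64 / n as f64;
        black_box(book.execution_price(black_box(demand)));
    }
    start.elapsed()
}
```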