Playing the game

A treatise on threads and threading

     One way to achieve low latency computation is to match binary patterns directly. Data packets are routed to a listener which compares the sequence of bits in a packet to pre-defined bit patterns, then decides what to do as a result of a match. The whole process takes place in the L1 cache; in other words, nothing is stored in main memory. Once you have the data, it couldn't get much faster than that.
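     As a sketch of the idea (the patterns, masks and handler names below are invented for illustration, not taken from any real protocol), the listener boils down to masking and comparing a packet's leading bits:

// Hypothetical sketch of bit-pattern dispatch; the patterns, masks and
// handlers are invented for illustration only.
public final class PatternMatcher {

    // Pre-defined bit patterns and the mask that selects the relevant bits.
    private static final long TYPE_MASK     = 0xFF00_0000_0000_0000L;
    private static final long ORDER_PATTERN = 0x0100_0000_0000_0000L;
    private static final long QUOTE_PATTERN = 0x0200_0000_0000_0000L;

    /** Compare the leading word of a packet against the known patterns. */
    public void onPacket(long header) {
        if ((header & TYPE_MASK) == ORDER_PATTERN) {
            handleOrder(header);
        } else if ((header & TYPE_MASK) == QUOTE_PATTERN) {
            handleQuote(header);
        }
        // No match: drop the packet.
    }

    private void handleOrder(long header) { /* ... */ }
    private void handleQuote(long header) { /* ... */ }
}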

L1 cache

     If we look at inter-thread cost and performance we have to look at a CPU core: registers, arithmetic and floating point units, L1 and L2 cache, and then the uncore functions such as QPI controllers, L3 (or other future) cache, memory controllers, buses to other cores, and finally IO sockets to clustered CPUs. Consider a primitive command used to manage multiple threads, like the volatile keyword: its purpose is a synchronised read. It tells the CPU core to wait for x clock cycles - for as long as the core's memory buffers take to empty to the shared L3 cache - so that all CPU cores, and hence all threads, can see the latest copy of the data. It matters where several threads use several cores; in other words, core 1's data isn't yet synchronised with core 2's data, and so on. That's all about cache coherence, as below:

CPU registers, execution units, buffers and caches
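     To make the volatile behaviour concrete, here is an illustrative hand-off between two threads; the class and field names are mine, not from any particular library:

// Illustrative only: a volatile flag publishing a value from one thread to another.
public final class VolatileHandOff {

    private int payload;              // plain field
    private volatile boolean ready;   // volatile: writes before it become visible

    // Writer thread (say, running on core 1)
    public void produce(int value) {
        payload = value;   // ordinary store, buffered in the core
        ready = true;      // volatile store: buffers drain, other cores see both writes
    }

    // Reader thread (say, running on core 2)
    public int consume() {
        while (!ready) {   // volatile load: always observes the latest published value
            Thread.onSpinWait();
        }
        return payload;    // guaranteed to see the value written before ready = true
    }
}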

     The way that the volatile primitive's L3 cache flush actually works is through the use of memory fences (or barriers). A store fence waits for all store (or write) buffers to empty, so the data becomes visible to all CPU cores. A load fence waits for all load (or read) buffers to drain. A full memory barrier is a composite of the store and load fences, i.e. it ensures all buffers drain before further CPU instructions execute. In Java, volatile data orders a load fence before a read, and a store fence after a write.

store, load and full memory barriers
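     Java 9 onwards exposes these fences directly as static methods on java.lang.invoke.VarHandle; a rough mapping of the three kinds, with invented field names, might look like this:

import java.lang.invoke.VarHandle;

// Rough illustration of the three fence kinds using the explicit fences
// added in Java 9 (java.lang.invoke.VarHandle). Field names are invented.
public final class Fences {

    int data;
    boolean published;

    void write(int value) {
        data = value;
        VarHandle.storeStoreFence(); // store fence: drain store buffers before the next write
        published = true;
    }

    int read() {
        boolean seen = published;
        VarHandle.loadLoadFence();   // load fence: drain load buffers before the next read
        return seen ? data : -1;
    }

    void sync() {
        VarHandle.fullFence();       // full barrier: both store and load buffers drain
    }
}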

     So the primitives used to manage threads involve memory fences. In the Java Memory Model, a final variable requests a store fence after initialisation.
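     For example (a minimal sketch, assuming the usual safe-publication idiom), a class whose fields are all final can be handed to another thread without further synchronisation:

// Sketch: final fields are frozen at the end of the constructor, so a reader
// that obtains a reference to Config sees host and port fully initialised.
public final class Config {

    private final String host;
    private final int port;

    public Config(String host, int port) {
        this.host = host;
        this.port = port;
        // The JMM inserts the equivalent of a store fence here, before the
        // reference to this object can be published to other threads.
    }

    public String host() { return host; }
    public int port()    { return port; }
}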

     Let's move on to locks. When a lock is placed on an object, a full memory fence is employed around the lock, blocking the memory subsystem across cores, and even across CPUs. By these methods we see that, with thread primitives and locks, inter-thread visibility and program order are preserved.
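     As an illustration (using an explicit ReentrantLock here, since an intrinsic synchronized version appears in the simple lock example further down), a counter guarded by a lock picks up full-fence semantics on entry and exit:

import java.util.concurrent.locks.ReentrantLock;

// Illustrative counter guarded by an explicit lock: acquiring the lock acts
// as a barrier on entry, releasing it as a barrier on exit.
public final class LockedCounter {

    private final ReentrantLock lock = new ReentrantLock();
    private long count;

    public void increment() {
        lock.lock();          // fence: see all writes made before the last unlock
        try {
            count++;
        } finally {
            lock.unlock();    // fence: publish this thread's writes
        }
    }

    public long current() {
        lock.lock();
        try {
            return count;
        } finally {
            lock.unlock();
        }
    }
}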

     Yet the complexity is high, and the processing costs begin with fences around each executing core's memory buffers. So really, the way to write multi-threaded code and avoid such costs is not to. The most efficient way to code is to use one thread on one core, requiring no thread primitives and no locks, so as to minimise the use of memory fences. But that is not always practical, or even possible.

     Consider the following picture from Mechanical Sympathy that shows estimated timings to access CPU registers, caches, the motherboard network (for now?) and main memory. It shows that the QPI bus is faster than the bus to main memory.

Cycle times

     So if the use case requires many processes (and which doesn't?), then what we want is single-threaded processes, each assigned to one CPU core. For the inter-thread communication we then want lock-free code that uses the QPI bus as a point-to-point interconnect. To do this, what we want is a circular buffer solution like the Disruptor to handle the ITC, one that sits in the L3 cache. Consider that accessing RAM is roughly six times slower than accessing the shared L3 cache, so if you can get your multi-threaded execution code small enough to stay within the L3 cache, you maximise TPS (transactions per second).
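     To give a flavour of the circular buffer idea (this is not the Disruptor, just a minimal single-producer, single-consumer ring I've sketched to show the shape of it):

// Minimal single-producer / single-consumer ring buffer, loosely in the spirit
// of the Disruptor but greatly simplified; capacity must be a power of two.
public final class SpscRing<T> {

    private final Object[] slots;
    private final int mask;
    private volatile long head; // next slot to read  (consumer owned)
    private volatile long tail; // next slot to write (producer owned)

    public SpscRing(int capacityPowerOfTwo) {
        this.slots = new Object[capacityPowerOfTwo];
        this.mask = capacityPowerOfTwo - 1;
    }

    /** Producer thread only. Returns false if the ring is full. */
    public boolean offer(T value) {
        long t = tail;
        if (t - head == slots.length) return false;   // full
        slots[(int) (t & mask)] = value;
        tail = t + 1;                                  // volatile store publishes the slot
        return true;
    }

    /** Consumer thread only. Returns null if the ring is empty. */
    @SuppressWarnings("unchecked")
    public T poll() {
        long h = head;
        if (h == tail) return null;                    // empty
        T value = (T) slots[(int) (h & mask)];
        slots[(int) (h & mask)] = null;
        head = h + 1;                                  // volatile store frees the slot
        return value;
    }
}

     The power-of-two capacity lets the sequence numbers wrap with a cheap mask instead of a modulo, which is one of the tricks the real Disruptor also relies on.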

     However, to get a transaction there have to be at least two messages, so the process is IO bound. The picture above doesn't show network interface buffer times. I am interested to see what Aeron is all about.

     So those are some ideas for a multi-threaded environment.

     As an aside, it is said to be easier to write and maintain code that uses the primitives and locks, which is why many do. If your use case suits locks rather than a circular buffer, then maximum performance is obtained by modelling the algorithm so that the necessary memory fences occur at the boundaries, before and after a unit of work is completed. According to Norvig, locking or unlocking a mutex takes a modern CPU approximately 25 ns.

     In other words, a very simple way to write thread-safe code is:

Simple lock
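     The figure isn't reproduced here, but a representative version of such a simple lock in Java might be an intrinsic synchronized method:

// A plausible version of the "Simple lock" idea: intrinsic locking via
// synchronized, trading throughput for simplicity and safety.
public final class SimpleCounter {

    private long count;

    public synchronized void increment() {
        count++;                 // the monitor's fences make this atomic and visible
    }

    public synchronized long current() {
        return count;
    }
}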

     The thing is that in a high throughput environment, besides the memory fencing costs, if the CPU cores are left open then the OS can schedule other application or kernel threads to run on the same core. An OS thread assignment means a context switch, which means all of the core's registers are emptied out to the L3 cache, as opposed to a lock on just one CPU register. If your use case requires a wait then a spinlock prevents context switches. For further investigation, see Stack Overflow on spinlocks. A textbook example of a spinlock is:

CPU spin lock
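     Again the figure isn't reproduced, but a textbook test-and-set spinlock in Java can be sketched with an AtomicBoolean:

import java.util.concurrent.atomic.AtomicBoolean;

// Textbook test-and-set spinlock: a waiting thread burns CPU cycles instead of
// being descheduled, avoiding the context-switch cost described above.
public final class SpinLock {

    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // Spin until we flip the flag from false to true.
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();  // hint to the CPU that we are busy-waiting
        }
    }

    public void unlock() {
        locked.set(false);
    }
}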

     So these days the industry is moving to writing lock-free code using just one thread running on one core with no context switches, a process which sits in the L3 cache and minimises memory fencing. Finally, since memory fetch times increase with each successive CPU cache and again with main memory, then, as with the binary pattern matching above, the most efficient use of the CPU is through the L1 cache. The way that is done is tricky because "CPU cache is not addressable". Several techniques exist to achieve the same end, including spatial locality (see CPU cache coding) and pre-fetching (see __builtin_prefetch).
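     Java has no direct equivalent of __builtin_prefetch, but the spatial locality point can be sketched: a contiguous primitive array pulls whole cache lines of useful data into L1 on a linear scan, whereas a pointer-chasing structure scatters its nodes across the heap and defeats the hardware prefetcher. The example below is mine, for illustration only:

// Spatial locality in Java terms: linear scans over contiguous memory are
// cache friendly; chasing references from node to node is not.
public final class Locality {

    /** Cache friendly: sequential access over contiguous memory. */
    static long sumArray(long[] values) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    /** Cache hostile: each hop may be a fresh cache miss. */
    static final class Node { long value; Node next; }

    static long sumList(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) {
            sum += n.value;
        }
        return sum;
    }
}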

References: Malcolm Brown, Mechanical Sympathy

Tell me what you think - please get in touch!