Monday 28 December 2015

Oracle In Memory and CPU Execution Efficiency

Tanel Poder has been doing a series of posts on "RAM is the new disk" trying to show how the new "Oracle Database In-Memory" feature in Oracle 12c uses memory in a different and more efficient way than the normal, traditional buffer cache does. He hasn't finished the series of posts yet, but I started to draw some of my own conclusions from the data he published in his last post and I'd thought I'd publish my thoughts.

I know a reasonable amount about how CPU's work internally, so I assumed that the main performance improvement of Oracle Database In-Memory would be from a better memory access pattern that would significantly reduce the number of execution stalls inside the CPU. Tanel is one of the good Oracle bloggers, and he has been sharing the details of the tests he has been doing in this series of blog posts as well as the results from them so I can see if my hypothesis on the new Database In-Memory holds up.

Tanels' Tests

The main results are in this post where he gives the 6 SQL queries he ran and their results in terms of elapsed time and CPU execution cycles - some against the traditional buffer cache and some against the new In-Memory cache. No disk accesses were involved as all the data was already in memory in the Oracle SGA.

I'm only interested in the main 2 SQL queries that are the same as each other, except the first one gets its data from the traditional data block based buffer cache in memory while the second gets its data from the new In Memory column storage based cache i.e. both execute the same "SELECT" against the same table and all of the data needed is already in memory in the SGA.
3. FULL TABLE SCAN BUFFER CACHE (NO INMEMORY)
SELECT /*+ MONITOR FULL(c) NO_INMEMORY */ COUNT(cust_valid)  
FROM customers_nopart c WHERE cust_id > 0;

4. FULL TABLE SCAN IN MEMORY WITH WHERE cust_id > 0
SELECT /*+ MONITOR FULL(c) INMEMORY */ COUNT(cust_valid) 
FROM customers_nopart c WHERE cust_id > 0;

Tanel has also made all the low level CPU execution measurements available in a Google online spreadsheet. Here is a cut and paste of the CPU measurements for just the 2 queries I am interested in:

MetricTABLE BUFCACHETABLE INMEMORY PRED
task-clock-ms273741578
cycles864286530404573793724
instructions321154128777080326242
branches7386220210940579984
branch-misses220563974637243
stalled-cycles-frontend766970494202251325295
stalled-cycles-backend586273933951328333827
cache-references25644038411507915
cache-misses2220369817316366
LLC-loads2343611899712269
LLC-load-misses2185702947272805
LLC-stores184935821697666
LLC-store-misses323323127797
L1-dcache-loads73249460421069917316
L1-dcache-load-misses30527634185368159
L1-dcache-prefetches3689030225169253

The most obvious observation is that the In-Memory query execution is 17.34 times faster than the traditional buffer cache query execution (elapsed task-clock-ms time ratio of 27374 / 1578). I'm most interested in trying to explain what has caused this difference in elapsed time.

My Hypothesis & CPU / Memory Access

At the beginning of this post I stated that I assumed that the main performance improvement would be a result of a reduction in the number of execution stalls inside the CPU. Let me try and explain what I mean by that.

In the very old days of Intel CPU's (from the original 8086 up to the first '486 I think) the internal clock of the CPU (the rate at which it executed instructions) was the same as the external clock of the system which also governed memory access times. A read from the CPU of a memory location would take something like 3 or 4 external clock cycles (put the address on the system bus, the memory to process that to its output, put the data out on the bus to the CPU). During these clock cycles the CPU cannot proceed with the execution of that instruction i.e. it stalls until the data it needs from memory has been read into the CPU.

During the '486 lifetime Intel introduced technology that allowed the internal clock of the CPU to run faster than the external clock i.e. at a multiple of the external system clock. Now we have external clocks of 100 MHz typically and CPU internal clocks of 3 GHz and more i.e. the CPU is running internally at 30 or more times the external system clock rate.

A side effect of this change is that now when a CPU "stalls" to read needed data from memory, the stall is for much longer in terms of CPU instruction cycles. As the external clock is now 30 or more times slower than the internal clock of the CPU, the CPU may end up waiting doing nothing for over 100 execution cycles. NB. This is a simplification of what really happens, but it does describe a real effect.

We can see this effect in the separately reported statistics for "CPU cycles" and "CPU instructions completed" - the first one will be much higher than the second one. The "efficiency" of how a CPU executed a given piece of program code can be seen from the calculated "instructions per cycle" value i.e. "CPU instructions completed" divided by "CPU cycles". The closer this value is to 1 the more efficient the execution has been within the CPU. The lower this value is then the more "stalls" occurred within the CPU during execution that stopped it completing execution of an instruction every clock cycle.

Of course CPU designers have tried to work around or reduce the effect of such stalls in various ways, the details of which are not relevant here. It just means that a) there is various technology in the CPU to try and minimise the occurrence and impact of such "memory load stalls", and b) the net impact of such "memory load stalls" is much less than the 100 or so wasted execution cycles I've described. In fact, and this is also important, modern CPU's can actually issue more than one instruction at the same time for execution within the same clock cycle e.g. an integer operation and a floating point operation. This can result in an "instructions completed per clock cycle" value larger than 1, which can occur when executing well optimised program code.

If my hypothesis is correct then we should see a difference in the number of "stalls" in the CPU execution statistics between the 2 SQL query executions, and the ratio between these 2 sets of "stall" statistics should be close to the observed ratio in execution times.

My Initial Conclusions

To cut to the chase a bit, the observed difference in execution times between the 2 SQL queries is not wholly explained by just a reduction in the number of "stalls" inside the CPU, but the reduction in "stalls" is a significant contributor to this reduction in elapsed time. The reduction in elapsed time is roughly a 50:50 split between a reduction in the total number of instructions executed (code efficiency) and CPU memory access stalls (memory access efficiency).

We can see from the measurements that Tanel has reported that the number of instructions executed by the CPU decreased by a factor of 4.54 (32115412877 / 7080326242) to process the same amount of data (rows in the table). This is a significant improvement in code efficiency to process this volume of data - probably a result of the In-Memory cache using far simpler internal data structures than the traditional buffer cache that can in turn be processed much easier.

This leaves an improvement factor of 3.82 (17.34 / 4.54) that needs to be explained by something else. I propose that this other factor is due to a reduction in the number of "stalls" within the CPU when executing these instructions as described earlier. This will have the net effect of increasing the "instructions per cycle" value during the execution of the SQL query, and indeed we can see this from the measurements reported by Tanel:
  • Buffer cache SQL query execution has CPU instructions per cycle of 32115412877 / 86428653040 = 0.37
  • In-Memory cache SQL query execution has CPU instructions per cycle of 7080326242 / 4573793724 = 1.55
  • This gives an improvement ratio of 1.55 / 0.37 = 4.17
And this value is very close to the 3.82 value for the other performance improvement factor.

Trying to drill down further into this difference in "instructions per cycle" and identifying "CPU memory stalls" due to data access is difficult if not impossible for a number of reasons:
  • Above the L1 cache on the CPU the other CPU caches are shared by both instructions and data. We cannot tell which cache reads and which cache misses were due to fetching an instruction to execute or fetching a piece of data needed by an instruction being executed. Both lead to CPU stalls but for different reasons.
  • The In-Memory cache SQL query is executing a completely different code path within Oracle with almost 78% fewer instructions being executed in total (100 / 4.54). So we cannot assume that the instruction execution profile and also instruction cache misses are similar in both cases.
We can however see other evidence that this other factor for the improvement in performance is due to a reduction in CPU stalls. Tanel has reported two other CPU execution statistics - "stalled-cycles-frontend" and "stalled-cycles-backend". Tanel gives descriptions for these in his part 2 post, which I will summarise as follows:
  • Stalled cycles front end occurs before an instruction gets executed by the CPU and is generally due to fetching an instruction from memory
  • Stalled cycles back end occurs during instruction execution and is generally due to fetching data from memory needed by that instruction
So the most relevant statistic for stalls due to reading data from memory is "stalled-cycles-backend". We cannot compare these directly between the two SQL query executions because they executed a different number of instructions (remember the 4.54 reduction factor). What we can do is normalise these to a "stalled cycle per cycle" value and then compare them:
  • Buffer cache SQL query execution has backend stalls per cycle of 58627393395 / 86428653040 = 0.68
  • In-Memory cache SQL query execution has backend stalls per cycle of 1328333827 / 4573793724 = 0.29
  • This gives an improvement ratio of 0.68 / 0.29 = 2.34
Again, this is a significant improvement and shows that the number of stalls due to memory accesses by the CPU for data has been reduced.

Why Have Stalls Reduced?

I hope that I have shown that part of the improvement in the performance of the In-Memory cache has been the reduction in the number of "memory access stalls" in the CPU when executing the SQL query i.e. less time wasted waiting to read data in from main memory with a corresponding increase in the number of instructions executed per clock cycle and less total elapsed to execute those instructions. But how has Oracle actually managed to achieve this? How has it managed to reduce the number of "memory access stalls" while executing the same SQL query i.e. if it needs to read the same "set of data" to perform the same SQL query, how has it managed to read it more efficiently?

I am guessing that this is due to two specific factors:
  1. Data layout in memory - the layout is more "compact" which decreases the total amount of memory that must be read
  2. Sequential access to that memory, causing the CPU to "prefetch" the next memory line into the CPU cache, so it is already there when it is needed i.e. overlapping the processing of one memory line with the fetching of the next memory line
Note that this is based on what I have read on In-Memory as I've not yet had a chance to do any real testing of it yet. Again, thanks to Tanel for both doing some real testing and for sharing all the results he got from those tests.

In the traditional buffer cache a data block will span multiple memory lines (assuming a 64 byte memory cache line, as Tanel states) and this will contain data for all the columns in a row. When you only need the data from one column it means that you are skipping about in memory quite a lot, jumping from row to row, and not using most of the data in each block or memory line. In the new In-Memory cache the data is stored by column, so all the data for one column is stored together, next to each other in memory. This immediately reduces the total amount of memory that must be read to access the data for one column across all the rows in the table.

It also will increase the cache efficiency within the CPU, because one memory line in the cache will contain data for multiple rows i.e. the CPU can process the data for that column from multiple rows before it needs to read in another memory line into cache to get the next set of rows' data.

If the data for a column is physically laid out sequentially and contiguously in the system's memory, then it means that Oracle will be doing a series of sequential memory reads to process all the data for one column. Modern CPU's can detect this kind of sequential memory access and have an optimisation to "prefetch" the next data line from memory while the previous data line is still being processed. The net result is that stalls are either eliminated because the next line of data is already in the CPU cache or the impact of stalls is significantly reduced because the data is already on its way from memory to the CPU cache.

We can see evidence for this occurring in the "L1-dcache-prefetches" CPU statistic. The "L1 data cache" is the first on chip cache immediately used by the CPU core, and is small but very fast. All data needed by the CPU goes through this cache. Again, normalising to a "prefetch per instruction executed" we can see that the number of data prefetches by the CPU increased significantly during the In-Memory cache SQL query execution:
  • Buffer cache SQL query execution has L1 dcache prefetch per instruction executed of 36890302 / 32115412877 = 0.00115
  • In-Memory cache SQL query execution has L1 dcache prefetch per instruction executed of 25169253 / 7080326242 = 0.00355
  • This gives an improvement ratio of 0.00355 / 0.00115 = 3.09

The End

Well that's been a longer post than I wanted, but I wanted to get everything together into one post as I'm sure Tanel will finish his series of posts soon and probably make similar observations. And I wanted to see if he would come to exactly the same conclusions or slightly different ones, and doing this post was the only way to do that. Thanks again to Tanel for doing the tests and sharing all the data with us.

No comments: