Tuesday 28 June 2011

Histograms are not what I assumed

One of the themes I'll keep coming back to is how things are often not how you have assumed they are, and how there are always opportunities to learn something new about Oracle. I am quite familiar with histograms in Oracle from reading the manuals and white papers, but have always left it to Oracle to decide what columns histograms were needed. This is mainly due to the databases I have worked on not being large or complicated enough to need further analysis so far. But knowing the day would come when I would have to do something I thought I would review what I knew, and in doing so learnt a few new things - or rather realised that some of my assumptions had been incorrect.

Rather than re-read the manual I decided to go to a better source of knowledge rather than just raw information - Jonathan Lewis's Cost Based Oracle Fundamentals. I already knew that Oracle had 2 types of histogram - Frequency and Height Balanced. And I knew that Frequency stored a row count per individual data value, while a Height Balanced stores a row count per range of values (strictly the row count is the same per range, and it is the data range size that varies). And I knew that the Optimizer in Oracle used this extra row count data in the histogram to calculate better estimates for row counts that would match constraints in queries. In reading the chapter on Histograms and working through the examples and doing my own tests I realised that some of my assumptions had been wrong about how Oracle uses these histograms.

Assumption 1 - Oracle uses all the entries in a histogram to calculate its row count estimate

I knew this to be true for a Frequency Histogram, and had assumed something similar for a Height Balanced one. But I was wrong. In a height balanced histogram although Oracle may have up to 254 buckets, it only uses the ones that contain the same end point values to identify "popular values" - those that occur more frequently than the width of a single bucket, and so span more than one bucket in size. For all other values it ignores the histogram completely and uses the column level Density statistic in the same manner as if there were no histogram.

In other words there is a threshold, being the number of rows covered by a single bucket. For data values that occur more often than this and are also recorded in two or more buckets in a height balanced histogram, the Optimizer uses the number of buckets to calculate the estimated row count, along with the number of rows per bucket. For all other values Oracle assumes a uniform distribution and uses the column level Density statistic. This will not be the same as one over the Number of Distinct Values (NDV), but is calculated differently to remove the effect of the rows for those popular values i.e. the Density is lowered to the average of the "unpopular" values.

Assumption 2 - A Frequency Histogram has an entry for all data values occurring in the table

This is true if either the table is relatively small, or you force Oracle to read all the data rows when gathering statistics. If you leave Oracle to use its default sampling method and you have a large table then some values may not be sampled, and they will be missing from the Frequency Histogram produced.

What Oracle does is to calculate and store a Density value for the column that is half that of the least frequent occurring value in the histogram. Values that appear in query constraints that do not appear in the Frequency Histogram are therefore assumed to occur at half the row count of the value with the smallest row count in the histogram. So again, the Optimizer may end up not using a histogram that exists and instead use the Density statistic of the column when executing a particular query.

None of this is new, and I was able to double check this via various other sources. It was just another thing to add to the list of assumptions I've made in the past that turn out not to be true. Even with histograms in place there is a reasonable chance that the Optimizer will actually not be using the histogram itself but instead the Density statistic value of the column. And also changing the value of the Density for the column can have an impact on queries, even when there is a histogram on that column.

No comments: