Saturday, 21 April 2018

Top Oracle Monitoring Views & Tables


I can be a bit "old school" at times on some things, and I like to be familiar with the underlying technology being used by advanced features. So while using Enterprise Manager to see what is happening on your database can be easier and quicker than running queries manually, there may be times when EM is not available on a particular database or it doesn't provide a particular piece of information you are after. Knowing where this information is held in the database instance lets you go direct to the source to get what you need, which can either be quicker in some circumstances or the only option in other cases.

The key information about activity on a database instance is available in a handful of dynamic performance views. While using Enterprise Manager can be quicker and easier, don't be afraid of these dynamic performance views. Learn about them and what is in them and they can be your friends, and not your enemies.

Generally I break these activity related views down into current, real time activity happening now and historical activity that happened some time ago. And the historical activity can be further split into very recent as opposed to long ago activity, where "long ago" is probably more than an hour ago.

Current Session Activity

The main view has to be V$SESSION, which has a record per connected session. You can query V$SESSION to see what each session is currently doing (the SQL it is executing (if any), or if it is waiting or blocked, or idle), or query it repeatedly to see how a session changes over time (blocked or moving through different SQL).

This view has been expanded over the different versions of Oracle and now includes a richer set of data about what each session is doing spread across a number of useful columns e.g. STATUS, TYPE, SQL_ID, SQL_CHILD_NUMBER, ROW_WAIT_*, BLOCKING_SESSION, EVENT, WAIT_TIME, SECONDS_IN_WAIT, STATE.

When a session waits for something to complete, such as a disk wait, you can see this wait in V$SESSION. But this is transitory, being about the current wait that session is experiencing. These session wait times are also cumulatively recorded in V$SESSION_EVENT, which records both the wait event and the total wait time since the session connected to Oracle. You cannot just use the current, total values in this, as they are cumulative since the session was created. However you can compare the changes to this over a short period of time, which will tell you the events that the session waited on during that period of time, and how long it waited for them. Tanel Poder has given us the very useful snapper.sql script that snapshots this dynamic performance view in real time and reports on what changed i.e. the waits a session is experiencing.

Recent Active Session Activity

In version 10 (10.1) Oracle added in the new dynamic performance view of V$ACTIVE_SESSION_HISTORY (ASH), which contains point in time snapshot copies of data from V$SESSION. The idea is that you can now look back over recent history to see what was happening recently in individual sessions, letting you investigate issues soon after they have happened. Potentially it also lets you investigate temporary problems that do not persist long enough to investigate in real time using V$SESSION directly.

The manual says "It contains snapshots of active database sessions taken once a second. A database session is considered active if it was on the CPU or was waiting for an event that didn't belong to the Idle wait class." The one snapshot every second is to minimise any performance impact on the system and the storage requirements, and it also only saves the data for active sessions i.e. nothing is copied for inactive, idle sessions. The data from each snapshot is then held in memory within the SGA, and accessible via the dynamic performance view of V$ACTIVE_SESSION_HISTORY. There is a limit to how much memory it will use, but it is intended to cover the past 30 minutes.

Note that Active Session History and AWR (Automatic Workload Repository) are part of the "Oracle Diagnostic Pack", and a license for this must be purchased before you have the right to use either of these i.e. these features are not part of the core Oracle Database license. And such additional "Packs" are only available for the Enterprise Edition of the Oracle Database.

Most of the columns in V$ACTIVE_SESSION_HISTORY are the same as in V$SESSION, with the addition of a SAMPLE_TIME column. You can query this in similar ways to how you might query V$SESSION, but with the addition of a SAMPLE_TIME column, and that the same SESSION_ID can be repeated across different samples. Using ASH you can do things such as identifying the most frequent wait events, or sessions experiencing a specific wait event, or any session experiencing long waits.

Historical Session Activity

As stated, the data in V$ACTIVE_SESSION_HISTORY is held in memory in the SGA and is limited in size. Oracle will also automatically save 10% of this data to a table on disk in the AWR (Automatic Workload Repository), so you have an even longer period of history to query (up to the AWR retention limits) but at a lower sample rate. Even though the sample rate is much lower, the idea is that any peaks in activity or contention will still be captured and be available for analysis later on in DBA_HIST_ACTIVE_SESS_HISTORY.

The actual way it decides what samples to save is somewhat complicated i.e. not just a straight "1 in 10" of the samples from the in-memory ASH. It saves 1 in 10 of the sample sets when there was activity, but saves all of the session data from such a saved sample. This way you get all of the active session data from a saved sample, rather than just 1 in 10 of the active sessions and so be missing a complete picture of a sample set.

You can query DBA_HIST_ACTIVE_SESS_HISTORY in much the same way as you would query V$ACTIVE_SESSION_HISTORY, except that there are fewer samples covering longer periods of time.

Historical System Activity

The previous views are all about individual sessions, but Oracle also has a lot of information about other activity on the database as a whole in the other dynamic performance views. Most of these hold cumulative values - the data values only ever go up as new measurements are added in. In many cases their data is essentially a sum of the activity across all of the sessions in the system. These views can provide other useful information about what has been happening on a system. Oracle also takes snapshots of these data sets into the AWR, into a series of DBA_HIST_* tables, which you again need the Diagnostics Pack license for the right to query them.

The default snapshot interval is every hour, and they are retained for 7 days (these can be changed if desired, which can be useful). An AWR snapshot copies all data in all of the relevant dynamic performance views into their corresponding DBA_HIST_ table with a SNAP_ID column added in to identify the relevant snapshot. Unlike ASH, which is only a sampled subset, all of the data in the corresponding dynamic performance views is saved on each snapshot, so there are no issues over any missing data within a snapshot.

You can make use of this AWR data either by running a standard AWR report which summarises all activity between two snapshots, or by writing your own queries directly against these DBA_HIST_* tables for things of interest to you. The latter can be useful if you don't want to have to produce a set of AWR reports and then read through them manually to find something of interest. The right kind of queries can summarise key activity data across multiple snapshots, helping you identify any abnormalities or peak activity periods or anything else of interest.

In the same way that you might look at a dynamic performance view to see changes in real time, you can query the corresponding DBA_HIST_* table to see what changed over longer periods of time and between multiple snapshots. But your query will need to get two sets of snapshots (using the SNAP_ID column probably) and subtract the cumulative values of one from the other to get the change in value over that time i.e. the actual measure of the activity over that period of time. I've blogged about querying the AWR DBA_HIST_ tables directly before.

For reference here are a few dynamic performance views and the corresponding DBA_HIST_ table they get saved to by an AWR snapshot:
  • V$SYSSTAT (individual named system statistic counters) maps to DBA_HIST_SYSSTAT
  • V$SYSTEM_EVENT (wait event occurrence and time counters) maps to DBA_HIST_SYSTEM_EVENT and to DBA_HIST_BG_EVENT_SUMMARY for Background events
  • V$SQL has no direct map, but V$SQLSTATS maps to DBA_HIST_SQLSTAT

SQL Statements Executed

The other approach to investigating performance problems is to look at the SQL statements being executed, as opposed to what individual sessions are doing or the system as a whole. The most relevant dynamic performance views for SQL execution information are:
  • V$SESSION - contains the SQL_ID of the SQL statement a session is currently executing (if any)
  • V$SQL and V$SQLSTATS for all SQL statements recently executed, with execution statistics, etc.
    • Remember that these activity counters are cumulative and simply increase in value on each execution of that SQL statement
    • SQL statements are retained in V$SQL while they are being executed, but after that they can be removed for that slot to be reused by another SQL statement. So there is no guarantee that V$SQL contains all SQL statements recently executed
    • However, the most frequently executed SQL statements will tend to remain in V$SQL


The following dynamic performance views and AWR tables can directly provide you with information about what is and what has happened on an Oracle database instance, and are extremely useful for investigating performance problems. Remember that to use some of these you must have purchased the Diagnostics Pack for your Enterprise Edition Oracle database software.
  • V$SESSION for what is happening now in each session
  • V$SESSION_EVENT for total waits a session has experienced
  • V$ACTIVE_SESSION_HISTORY for recent session history
  • DBA_HIST_ACTIVE_SESS_HISTORY for a sampled subset of session history over a longer period of time
  • DBA_HIST_SYSSTAT, DBA_HIST_SYSTEM_EVENT, DBA_HIST_SQL_STAT and DBA_HIST_SYS_TIME_MODEL in the AWR for full copies of the corresponding dynamic performance view taken at regular intervals
  • V$SQL and V$SQLSTATS for the most recently executed SQL statements, still present in the shared pool in the SGA

Wednesday, 28 March 2018

Hash Join Overflow to Disk Summary

[I'm a bit busy at the moment working with a new client on a large and complex database, so I'm afraid I've only got time to do a quick post. I have a new post on a different topic planned, but am struggling with the time needed to write it up properly, so a short post is all I can squeeze in]

This is just a summary of the key results from my findings on how Oracle is costing a Hash Join when it overflows to disk, pulling all the conclusions together into one place.

In the first post I arrived at the following formula:
  • ( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475
Where the "Columns Size" is the sum of the hash table storage for each column i.e. column data storage + 2 bytes per column, and the "Row Count" values are per 1,000 rows.

In the second post I showed that the costing was unaffected by the value of the Multi-Block Read Count parameter i.e. the costing was done only using single block disk I/O's.

In the third post I proposed a simpler formula using only values shown in an execution plan of:
  • ( (Build KBytes + (12 * Build Rows / 1000) ) + (Probe KBytes + (12 * Probe Rows / 1000) ) ) * 0.0485
This is only slightly less accurate than the original formula, but is much simpler to use as it only needs values reported in the execution plan, and no other calculations or derived values by you.

In the fourth post I showed that the reported size of the Temporary Space (TempSpc) in the execution plan is actually the size of the hash table itself, being dependent only on the size of the Build data set. And that knowing this hash table size will help you in deciding how much larger it is than the available memory in the PGA used to store such hash tables.

Wednesday, 14 February 2018

Hash Join Overflow Costing #4 - Temporary Space + PGA Use

Now I want to look at the value reported for "Temporary Space" (TempSpc) and how to check whether the hash table is close to fitting in memory within the PGA or not i.e. just a bit too big or way too big.

Temporary Space

It seems that this is simply the size of the hash table itself, which is in turn the size of the Build data set plus an additional 12 bytes per row. Which means that we can use part of the cost formula posted earlier for this to estimate the size in KiloBytes:
  • Build KBytes + (12 * Build Rows / 1000)
The evidence I have for this is:
  • The "TempSpc" value only changes when the Build size changes - never when the Probe size changes
  • When additional columns are added to Build data set in the SELECT the "TempSpc" value increases by the same amount as the Build size increases (for the same number of rows)
  • When the number of rows in the Build data set increases, the "TempSpc" value increases by an additional 11.76 bytes per additional row
    • Which is almost 12 bytes per row, as mentioned before as overhead in the hash table
You could of course "refine" the above formula and use "11.76" instead of "12" and that would give you a value for the hash table that is much closer to what is reported for the "TempSpc" in the execution plan. But I think that using 12 is good enough for only an estimate.

Hash Table Size vs PGA Size

I mentioned in the first post that the Optimizer uses the value of the internal initialization parameter "_smm_min_size" for the memory it can use as the work area for the hash table within the PGA. Thus if the size of the hash table is larger than this the Optimizer will cost the Hash Join as an overflow to disk operation. Using the formula posted above for the Temporary Space we can now estimate the size of the hash table needed and compare it to this minimum work area size value.

The notes I have on "_smm_min_size" say that it will be 0.1% of the PGA when the size of the PGA is greater than 128 MB, otherwise it will be 128 KB as a minimum. Assuming your PGA is larger than 128 MB then you can calculate what this is yourself. Note also that its value is reported in 10053 trace files along with all other initialization parameters, so you can directly verify its value if you want to.

At run time, though, the PGA may not be heavily used and may have a lot of unused memory available. An executing query can allocate and use more than the minimum work area value at run time for execution plan operations, up to a limit specified by the "_smm_max_size" initialization parameter. For PGAs larger than 1 GB this is set to be 10% of the PGA size. For smaller PGAs it is 100 MB, or 20% when the PGA is less than 500 MB.

Knowing how large the hash table needed will be, lets you compare it to the maximum possible work area size in your PGA and see what the likelihood is of the hash table being able to fit within the PGA at run time. As the ratio between the minimum and maximum possible work area sizes is 100 (0.1% to 10%) this is a large range and a hash table may have a reasonable chance of fitting in the PGA at run time, depending on how active your system is and the other queries running.

Equally it lets you consider what the benefit might be from increasing the size of the PGA significantly if you have a critical query involving a large hash join operation that is overflowing to disk. That said, there are some ways of rewriting queries that can minimise the size of the data being pulled through into the hash table used by the hash join, and so potentially get a smaller hash table to fit into memory in the PGA. But this depends on multiple factors, mainly being the number of columns of data coming through in the Build data set. If this is only a few columns then there won't anything that can be saved by such a query rewrite.

Tuesday, 6 February 2018

Hash Join Overflow Costing #3 - Simpler Formula

So far I have offered a formula for how a Hash Join that overflows to disk is costed and confirmed that this is only costed in terms of single block disk operations. While that formula produces very accurate results (less than 1% difference to the reported cost for the Hash Join operation) it requires you to obtain size information on each individual column being retrieved from both the Build and Probe data sets. And this could be quite tedious when there are many columns involved, or you don't have an easy way to work out the source columns involved in each data set. There is a simplification we can make to eliminate this detailed per column information and just use the information reported in the execution plan itself.

All of the columns being retrieved for the Build and Probe data sets are actually included within the "Bytes" value reported in the execution plan for each data set. And this value includes both the individual column data and any per-column overhead as well. What it does not include is the extra 12 bytes per row overhead needed in the hash table itself. We can approximate the size of this second part of the data by using the "Rows" value reported in the execution plan.

Examining the "Bytes" and "Rows" values in the execution plans for my test queries (see first post for details of these queries), I can see that the increase in the Hash Join cost is about 0.0485 per KB of increase in either data set (Build or Probe) for the same number of rows. Previously I determined that there was a 12 byte overhead within the hash table per data row expected.

This produces a revised Hash Join cost formula using only values from the execution plan of:
  • ( (Build KBytes + (12 * Build Rows / 1000) ) + (Probe KBytes + (12 * Probe Rows / 1000) ) ) * 0.0485
Note that the "Bytes" values used need to be in "K" units, whereas the "Rows" is not and so is divided by 1000. Checking this formula against the actual Hash Join costs of the test queries I ran I can see that it has an error of about 1% i.e. it is not as accurate as the previous formula, but is still accurate enough I believe.

Lets check how close this from one of the test queries. Here is the execution plan produced:
| Id  | Operation     | Name   | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT    |      |      |      |      | 1266 (100)|          |
|   1 |  SORT AGGREGATE     |      |    1 |   30 |      |           |          |
|*  2 |   HASH JOIN     |      | 50000 | 1464K| 1712K| 1266   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| HJ50K  | 50000 | 1123K|      |  365   (1)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| HJ100K |  100K|  683K|      |  728   (1)| 00:00:01 |
We can see:
  • The Build and Probe data access cost is 365 + 728 = 1093
  • The cost of the Hash Join operation itself is 1266 - 1093 = 173
  • The calculated approximate cost is (1123 + (12 * 50) + 683 + (12 * 100)) * 0.0485
    • = (1123 + 600 + 683 + 1200) * 0.0485
    • = 3606 * 0.0485 
    • = 174.891
  • The difference is +1.891, which is 1.1% of the actual cost
This formula can therefore be a useful check against the cost reported when a Hash Join operation overflows to disk, and for determining which is the biggest cost factor i.e. the Build or the Probe data sets.

Tuesday, 23 January 2018

Hash Join Overflow Costing #2

In my previous post on costing a Hash Join overflow to disk I came up with a formula for the cost of a Hash Join operation that overflows to disk based on tests I had done, but I also mentioned that there might be other factors involved not yet exposed by my testing. My main concern was whether the disk I/O's involved were all of the same type, or a mix of single block and multi-block disk I/O's. I've now done some further testing, and have something to share.

The tests show that all the disk I/O's are costed by the Optimizer as single block disk I/O's. I say this because I changed the cost of multi-block disk I/O's and the cost reported for the Hash Join operation itself did not change. Hence it must be costed as single block disk I/O's by the Optimizer.

The tests I was doing were a join between two tables that used a Hash Join in the execution plan (see previous post for the SQL used in this query). The key points are:
  • The two source tables are accessed by Full Table Scans (TABLE ACCESS FULL) in the execution plan
  • These feed into a HASH JOIN operation
  • Which in turn feeds into a final operation - a SORT AGGREGATE for the SUM's in the SELECT output
  • By adding together the costs of the source data accesses and subtracting from the total cost reported for the Hash Join, we get the net cost of the Hash Join operation itself
Oracle costs disk I/O operations using a set of statistics it has about the computer system it is running on - termed System Statistics and stored in the database itself. These can be a minimal set of just a few statistics or a more complete set of statistics providing a more detailed breakdown of different types of operations on the computer.

In my Oracle 12 instance ( I only have the "No Workload" default system statistics set in the database, which is the minimal set of statistics. This provides values for the following:
  • IOSEEKTIM - Time in milliseconds for a disk to move the read / write head to the track you want to access - default value of 10
  • IOTFRSPEED - Bytes transferred per millisecond once a transfer starts - default value of 4096
  • MBRC - Multi-block read count in blocks achieved on average - default value of 8
  • CPUSPEEDNW - Speed of CPU in units of millions of cycles per second (not relevant here)
Note that the MBRC here is not the same as the initialization parameter db_file_multiblock_read_count, even though their definitions seem to be the same. This MBRC is not set from the initialization parameter, and remains the same within the system statistics until explicitly changed (one way or another). MBRC is intended to represent the actual size of multi-block reads that were achieved on your computer system, and is set when Workload statistics are collected or when set manually.

The Optimizer uses these minimal system statistics to derive the values for the elapsed times for single and multi-block reads (SREADTIM and MREADTIM respectively). The formulae it uses are:
With the default values for System Statistics as stated, and a database block size of 8192 (8 KB) this gives values of SREADTIM = 12 ms, MREADTIM = 26 ms.

What this really means for the Optimizer is that a multi-block read takes 26 / 12 = 2.167 times longer than a single block read i.e. its "cost" is 2.167 times that of a single block read. This is used by the Optimizer when costing multi-block read based operations - it "converts" the multi-block read time into an equivalent number of single block read operations, which all have a cost unit of 1 and are the basis for how the Optimizer reports and compares execution plan costs.

If I were to increase the MBRC from 8 to 16, then the MREADTIM would increase to 42 ms, and a cost factor of 42 / 12 = 3.5 times that of a single block read. The relative net cost though would actually decrease, because each multi-block read would now be reading 16 blocks in one disk I/O rather than 8 blocks i.e. half the number of disk reads are needed to read the same number of blocks in from disk, at a relative cost ratio of 1.615 (3.5 / 2.167). So the decrease in the number of disk reads is greater than the relative increase in cost per disk read.

If the Hash Join operation involved multi-block disk reads then changing MBRC would change the relative cost of those disk reads and we would see a change in the net cost of the Hash Join as reported in the execution plan.

I changed the value of MBRC manually and then shutdown and restarted the database:
exec dbms_stats.set_system_stats ('MBRC', 16)
shutdown immediate
This was to ensure that nothing was still present in memory in the SGA - neither any existing execution plans, or any system statistics used to cost those execution plans.

When I ran the same test queries again, the total costs decreased because the source data accesses are full table scans which use multi-block reads i.e. this was expected, and so the total cost reported at the Hash Join step decreased. However, when the costs of the source data accesses were subtracted from the total cost after the Hash Join step, the net cost of the Hash Join operation itself was exactly the same as it was before for the same query.

Likewise when I increased the MBRC value to 32 there was no change in the net cost of the Hash Join operation when it overflowed to disk.

The conclusion then is that no multi-block disk reads are used within the costing of the Hash Join operation by the Optimizer when the hash table it uses overflows to disk.

Which means that the formula I posted before for an overflowing Hash Join cost is not dependent on multi-block reads at all:-
  • ( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475
Where the "Columns Size" is the sum of the hash table storage for each column i.e. data storage + 2 bytes per column.

Thursday, 11 January 2018

Hash Join Overflow Cost Formula #1


The Hash Join join method was introduced in Oracle version 7 (7.3 specifically I believe), and one of its main goals was to be a method that lent itself well to being parallelisable. However, it is such an efficient join method for larger data volumes even in serial execution that it is often picked by the Optimizer over Nested Loops or Sort Merge because of its lower execution cost. This makes the Hash Join method probably the most frequently used method by the Oracle Optimizer, often appearing in execution plans for SQL queries.

A Hash Join works by building a table in memory containing the data from the first data set (termed the Build data set), and then reading the second data set (the Probe data set) to lookup each data row into the in-memory table for any match (join). The in-memory table is structured and accessed by applying a "hash function" to the relevant join data columns. This hashing has various benefits around performance, handling any kind of input data value range, and distributing the input values within the table (minimising bunching values together).

When this hash table fits in memory the additional cost of the Hash Join operation is negligible because it only involves CPU and memory operations and these are multiple orders of magnitude faster than disk accesses i.e. there is often little or no difference to the total reported cost of the Hash Join over the sum of the costs of access of each source data set. However, when the hash table needed is larger than can fit in available memory in the PGA then it must overflow to disk, which in turn significantly increases the cost of the Hash Join operation itself.

A question I have had for a long time is "How does Oracle cost this overflowing Hash Join operation"? Can I replicate this cost formula and understand what the main factors are within this reported cost? Which is a bigger factor - the size of the Build data set or the Probe data set? Knowing such things it might offer the possibility of gaining some insights into ways of tuning such large hash joins. At the least I would know more about how the overflowing hash join actually works in practice.

Jonathan Lewis gives a description in his Cost Based Oracle Fundamentals book of how the overflowing Hash Join operation works, with a formula for the major cost components involved. However, I have always found this formula to be more descriptive than quantitative, and to not be easy to use to arrive at a comparable value to what the Optimizer has reported.

I would like a more straightforward quantitative formula that I could use myself to estimate whether a Hash Join will overflow to disk or not, and how much its cost will be. After some research I believe I have arrived at such a formula which I will share here. Note that I'm not saying this is a "perfect" formula, just that this is the conclusion I have arrived at so far as a result of the tests I have done, and it seems to fit the results I have very well. I'll continue to post more details when I refine or revise this in the future

Why Hash Join Overflows

The limit to the size of a single Hash Table or other "work area" in the PGA is determined by a hidden, internal initialization parameter of "_smm_min_size". If the size of the Hash Table needed would be larger than this, then the Optimizer assumes that it will overflow to disk and costs it accordingly. My notes say that the value of "_smm_min_size" is the larger of 128 KB or 0.1% of PGA_AGGREGATE_TARGET. I cannot find exactly where I got this information from, but my memory is telling me that it was from Randolf Geist, possibly in a response to a question on one of the Oracle forums.

The main reason Oracle limits the potential size of a Hash Table is to ensure that a high number of other queries can run concurrently and not be starved of memory within the PGA. The Optimizer is assuming a worst case scenario to make sure other sessions do not suffer when a very large query is executed by one session. However, it is possible that when executed the hash table will not overflow to disk i.e. if at the moment that query is executed there is enough free memory in the PGA, then Oracle will let that session have a larger "work area" than the value of "_smm_min_size". So even though the Optimizer has costed the Hash Join operation as an overflow to disk and costed it accordingly, it does not mean that it will always overflow to disk when executed.

How Overflow Hash Join Works

Jonathan Lewis gives a description in his book of how the Hash Join works when it overflows to disk. I won't repeat the details here as they are not particularly relevant at the end of the day. But a crude summary would be that:
  • The first data set is read once and broken into "chunks" that are written out to disk (temporary storage), where each chunk is a contiguous sub-set of the overall hash table
  • One or more of these chunks are then kept in memory ready for the pass over the second data set
  • The second data set is read in and hashed in the normal way:-
    • If it hashes to an in-memory chunk then it is matched as normal
    • Otherwise it is written out to disk (temporary storage) split out into chunks on the same basis as the first data set
  • Then remaining chunks of the first data set are read into memory, and the corresponding chunks of the second data set read again and matched as normal. This is repeated until all of both data sets have been processed.
Essentially this is saying that there will be an additional pass over both sets of data - after the first read of each data set (already costed within the execution plan), there is an additional write out to disk of the majority of each data set, followed by a read back of each data set. Also extra reads and writes may be needed in the first pass of each data set, to keep the data in the "chunks" properly grouped together on disk.

It is not clear whether these write and read operations will be single block or multi-block disk read operations, or how many of them there will be. Potentially multi-block reads could be used when reading back in the pre-hashed chunks from disk. Luckily though this turns out to be irrelevant to the cost formula I have arrived at.

Deriving Hash Join Overflow Cost

Here is how I went about it. I created a set of standard test tables (see later for SQL DDL), each with a mix of NUMBER and VARCHAR2 columns to pad them out a bit, and populated them using a repeatable "connect by" data generator with a different number of rows in each test table. I then ran a query joining 2 of these tables together (see later for SQL), immediately examined the execution plan (from dbms_xplan.display_cursor) and noted the costs of each operation.

Without any indexes on any of these tables the execution plan was always 2 Full Table Scans feeding into a Hash Join operation. When the smaller, build table became large enough the Hash Join operation would overflow to disk, causing its cost to rise significantly, and a "TempSpc" column to appear in the execution plan with a reported value.

By varying only one thing at a time between queries I could see how the Hash Join cost changed when it was overflowing to disk. I was not interested in those executions where the Hash Join did not overflow to disk i.e. where the hash table did fit in memory. Only those executions that involved the Optimizer assuming it would overflow to disk. By examining the change in cost for the Hash Join operation for a corresponding change in only one of the joined tables I could deduce a multiplying factor being used within the underlying Hash Join cost calculation.

My Oracle version is 12.1 on Oracle Linux, so my results are only guaranteed to be accurate for that version. I would assume the results should be the same for 11g, as I don't think anything significant has changed in how the Hash Join operation is costed, but that would need to be verified.
Oracle Database 12c Enterprise Edition Release - 64bit Production
PL/SQL Release - Production
CORE Production
TNS for Linux: Version - Production
NLSRTL Version - Production

Hash Join Overflow Formula and its Accuracy

I started by comparing the Hash Join cost when the number of columns in a data set changed i.e. the size of the data set increased for the same number of rows. Jonathan Lewis states that the memory needed per column is the storage size of the column itself plus 2 additional bytes. I observed that the Hash Join cost changed by a factor of 0.0475 per byte per 1000 rows. And that this multiplying factor was the same under different row counts in either table in the query.

This only involves the columns needed by the query itself, which the Optimizer must extract and process, and not all of the columns in the table. In this case it is the columns referenced in the "SELECT" list and those referenced in the "WHERE" clause. And the "storage size" is the number of bytes that Oracle uses to store that data on disk, which is not necessarily 1 byte per value or 1 byte per character or digit. For instance, a NUMBER is stored as 2 digits per byte.

My other observation was that when only the row count in one table changed the Hash Join cost changed by a factor of 0.5675 per 1000 rows. As this was a constant per row I wondered if this was something to do with some extra data per row causing extra disk I/Os. And 0.5675 divided by 0.0475 gives 11.9474 which is almost 12, implying a 12 byte overhead per data row within the Hash Table.

Based on this, I arrived at the following formula for an overflowing Hash Join cost:
  • ( ((Build Columns Size + 12) * Build Row Count) + ((Probe Columns Size + 12) * Probe Row Count) ) * 0.0475
Where the "Columns Size" is the sum of the hash table storage for each column i.e. data storage + 2 bytes per column.

I then checked the calculated costs from this formula against the series of test queries I had been using, and the resultant cost for the overflowing Hash Join came out almost the same in all cases. The percentage difference was under 1% in almost all cases, which I take to be a high degree of accuracy. The only anomalies are for when the "build data set" is only just bigger than can fit in the PGA, but even then it is only a 3% difference. As the size of the build data set increased so the percentage difference decreased.

On the one hand it might not be surprising to some people that my derived formula produces the same results using the same inputs that were used to create the formula in the first place. However, given that the queries I tested varied both the number of columns being selected, the number of columns being joined on, and the number of rows in each table, these would appear to cover the only variables relevant to this formula. And in each of these cases the change in the reported Hash Join cost from Oracle was always a multiplier of this fixed constant of 0.0475.

Other Factors

While I do believe that this formula is true and valid for the system I was testing on, it may not be true for all other systems. It is likely that the multiplying factor of 0.0475 will be different on other systems. Given that this additional cost for the overflowing Hash Join is due to the additional disk I/Os involved, then it would seem likely that changes to the system statistics inside Oracle for disk read times would result in a change in the value of this multiplying factor. I will investigate this in my next series of tests.

There may or may not be some small "constant cost value" involved as well within the formula, for some kind of constant overhead within the overflowing Hash Join operation. This "constant cost value" would become negligible at higher data volumes compared to the costs for the build and probe data sets, but it might explain the slightly larger difference in calculated cost at the smallest overflowing data set size.

There is also the concept of "one pass" and "multi-pass" hash joins within Oracle, as well as "optimal" hash joins. I don't understand the real difference between these, other than "optimal" is when it fits in memory and the other two are when it overflows to disk. It is possible that what I've seen has been the cost for "one pass" overflowing hash joins, and for even larger data sets a "multi-pass" hash join would be used that would involve a different cost formula.

The SQL for the table and query

Here is the SQL to create one example table - they are all the same but for name and row counts - and the query used.

Create Table - run from SQL*Plus with 2 command line arguments of the row count and a table name suffix e.g. "@crhjtab 1000000 100k".
create table hj&2
tablespace testdata
select r pid
     , 1 one
     , 2 two
     , 3 three
     , 4 four
     , 5 five
     , 10 ten
     , trunc (r / 10) per10
     , trunc (r / 100) per100
     , mod (r, 10) mod10
     , mod (r, 100) mod100
     , mod (r, 1000) mod1000
     , mod (r, 10000) mod10000
     , mod (r, 100000) mod100000
     , 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' filler2
     , 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' filler3
  from (select rownum r
          from (select rownum r from dual connect by level <= 1000) a,
               (select rownum r from dual connect by level <= 1000) b,
               (select rownum r from dual connect by level <= 1000) c
         where rownum <= &1) 
exec dbms_stats.gather_table_stats (user, upper ('hj&2') )
Query - run from SQL*Plus with 2 command line arguments of table name suffixes e.g. "@hj0 100k 200k"
select /* HashTest0 */ sum ( sum_b1
  from hj&1 hj1, hj&2 hj2
 where = hj2.per10 
   and hj1.mod10 = hj2.mod100 ;
select * from table (dbms_xplan.display_cursor) ;
I've only shown one query here, as the others I used are almost the same but for the columns in the "SELECT" list. The variations of this query had different numbers of columns in the "SELECT" list, to increase the number of columns from the build and/or the probe tables.

Wednesday, 3 January 2018

Reading AWR Reports #2 - Report Overview

In the first post on Reading AWR Reports I made the point that you should first be clear on the details of the "performance problem" you are investigating. If there is no specific problem reported then there is no point looking for something that is not there in an AWR report. I also stated that an AWR Report is one amongst several tools available to you that you can use to investigate Oracle database performance problems, and you should make sure that a database wide AWR Report is the most suitable tool to be using for the specific performance problem you are currently tasked with investigating.

Assuming that has all been addressed, then the first thing I do with an AWR Report is a few high level checks - a kind of sanity check to get a feel for whether there might be some kind of performance issue there or not.

A performance problem is normally when a task takes too long to complete, and that is affected by the various resources it uses while doing its work. The key measurements are therefore both sides of this - the input work requests, and the system resource usage while doing that work. I basically want to check:
  • The resource usage as a percentage of system capacity i.e. utilisation
  • The amount of work requested / done (assuming they are the same), which is really SQL statements executed for a database
  • Amount of "wait time" within "SQL execution time" i.e. wait as percentage of work time
  • Top wait events to see how these correlate with the previous observations
That's it. Nothing more detailed for the first pass. If the input workload is high enough and the resource utilisation high enough and something looks wrong then I follow up with a second pass of the AWR Report diving down into more details based on what the first pass showed up.

Lets use the following AWR Report to show what I mean:

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
O12DB         3429470280 o12db               1 05-Dec-17 06:59  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
xxxxxxxxxx.local Linux x86 64-bit                    2     2       1       7.80

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:        79 05-Dec-17 09:53:43       142       3.0
  End Snap:        80 05-Dec-17 10:03:44       142       4.1
   Elapsed:               10.03 (mins)
   DB Time:                0.39 (mins)

Top 10 Foreground Events by Total Wait Time
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                           12.4              53.8
log file sync                        2,727        9.3       3.42   40.4 Commit
db file sequential read                177        1.2       6.82    5.2 User I/O
control file sequential read           827         .1       0.12     .4 System I
The snapshot duration was 10 minutes, and the system had 2 CPU Cores, so there were 20 minutes of CPU capacity available. The database processing time is reported as "DB Time" and is 0.39 minutes, which is about 1.95% resource utilisation. From this I can see that the database was doing very little work at all, so there is not a database wide performance problem worth investigating. Even though the Top Wait Events seem to show that 40% of the time was spent waiting on the "log file sync" event, the actual amount of Time is very trivial - 9.3 seconds of waiting out of a 600 second period (10 minutes). Such low workloads and resource utilisation can lead to various measurement anomalies, so it is not worth trying to drill down further into the wait event details. Any actual problem is probably specific to a single session, and should be investigated using session specific tools.

Another AWR Report:

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
O12DB         3429470280 o12db               1 05-Dec-17 06:59  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
xxxxxxxxx.local Linux x86 64-bit                    2     2       1       7.80

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:        71 05-Dec-17 07:55:35        39       1.8
  End Snap:        72 05-Dec-17 08:05:39        40       2.0
   Elapsed:               10.06 (mins)
   DB Time:                9.35 (mins)

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
      Redo size (bytes):       3,470,831.7           5,403.2
  Logical read (blocks):          61,507.4              95.8
          Block changes:          20,563.9              32.0
             User calls:             495.9               0.8
           Parses (SQL):             302.6               0.5
         Executes (SQL):          13,287.8              20.7
           Transactions:             642.4

Top 10 Foreground Events by Total Wait Time
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                          372.9              66.5
log file sync                      272,633      196.8       0.72   35.1 Commit
db file sequential read                472        2.2       4.75     .4 User I/O
Again, same database instance and a 10 minute period of time but a different workload. Now we see that the resource utilisation is up at 46.47%, being 9.35 minutes of DB Time out of 20 minutes of potential CPU capacity. From the Load Profile we can see that the database was executing over 13,000 SQL statements per second. From the wait events we see that "DB CPU" is 66.5% of the "DB Time" i.e. doing real work. So waiting as a percentage of SQL execution time would be about 33.5%. And we can see that almost all of this wait time is due to a single wait event - "log file sync".

Again, this does not directly tell us whether there is a performance problem or not. In all databases doing work there will be some wait events occurring, and one of these will always be top when ordered by wait time. That does not mean that it is necessarily a problem, or that it can be "fixed" in any way. In this case the workload is doing a lot of inserts and updates in small transactions, so there is a high frequency of commits which all involve writes to the redo logs. And we would expect the "log file sync" wait event to be there in the set of top 10 wait events in this scenario i.e. it is not a "problem" but a "result" of the workload running on the database.