Wednesday 3 January 2018

Reading AWR Reports #2 - Report Overview

In the first post on Reading AWR Reports I made the point that you should first be clear on the details of the "performance problem" you are investigating. If there is no specific problem reported then there is no point looking for something that is not there in an AWR report. I also stated that an AWR Report is one amongst several tools available to you that you can use to investigate Oracle database performance problems, and you should make sure that a database wide AWR Report is the most suitable tool to be using for the specific performance problem you are currently tasked with investigating.

Assuming that has all been addressed, then the first thing I do with an AWR Report is a few high level checks - a kind of sanity check to get a feel for whether there might be some kind of performance issue there or not.

A performance problem is normally when a task takes too long to complete, and that is affected by the various resources it uses while doing its work. The key measurements are therefore both sides of this - the input work requests, and the system resource usage while doing that work. I basically want to check:
  • The resource usage as a percentage of system capacity i.e. utilisation
  • The amount of work requested / done (assuming they are the same), which is really SQL statements executed for a database
  • Amount of "wait time" within "SQL execution time" i.e. wait as percentage of work time
  • Top wait events to see how these correlate with the previous observations
That's it. Nothing more detailed for the first pass. If the input workload is high enough and the resource utilisation high enough and something looks wrong then I follow up with a second pass of the AWR Report diving down into more details based on what the first pass showed up.

Lets use the following AWR Report to show what I mean:
WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
O12DB         3429470280 o12db               1 05-Dec-17 06:59 12.1.0.2.0  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
xxxxxxxxxx.local Linux x86 64-bit                    2     2       1       7.80

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:        79 05-Dec-17 09:53:43       142       3.0
  End Snap:        80 05-Dec-17 10:03:44       142       4.1
   Elapsed:               10.03 (mins)
   DB Time:                0.39 (mins)

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                           12.4              53.8
log file sync                        2,727        9.3       3.42   40.4 Commit
db file sequential read                177        1.2       6.82    5.2 User I/O
control file sequential read           827         .1       0.12     .4 System I
The snapshot duration was 10 minutes, and the system had 2 CPU Cores, so there were 20 minutes of CPU capacity available. The database processing time is reported as "DB Time" and is 0.39 minutes, which is about 1.95% resource utilisation. From this I can see that the database was doing very little work at all, so there is not a database wide performance problem worth investigating. Even though the Top Wait Events seem to show that 40% of the time was spent waiting on the "log file sync" event, the actual amount of Time is very trivial - 9.3 seconds of waiting out of a 600 second period (10 minutes). Such low workloads and resource utilisation can lead to various measurement anomalies, so it is not worth trying to drill down further into the wait event details. Any actual problem is probably specific to a single session, and should be investigated using session specific tools.

Another AWR Report:
WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
O12DB         3429470280 o12db               1 05-Dec-17 06:59 12.1.0.2.0  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
xxxxxxxxx.local Linux x86 64-bit                    2     2       1       7.80

              Snap Id      Snap Time      Sessions Curs/Sess
            --------- ------------------- -------- ---------
Begin Snap:        71 05-Dec-17 07:55:35        39       1.8
  End Snap:        72 05-Dec-17 08:05:39        40       2.0
   Elapsed:               10.06 (mins)
   DB Time:                9.35 (mins)

Load Profile                    Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~            ---------------   --------------- --------- ---------
      Redo size (bytes):       3,470,831.7           5,403.2
  Logical read (blocks):          61,507.4              95.8
          Block changes:          20,563.9              32.0
             User calls:             495.9               0.8
           Parses (SQL):             302.6               0.5
         Executes (SQL):          13,287.8              20.7
           Transactions:             642.4

Top 10 Foreground Events by Total Wait Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           Total Wait       Wait   % DB Wait
Event                                Waits Time (sec)    Avg(ms)   time Class
------------------------------ ----------- ---------- ---------- ------ --------
DB CPU                                          372.9              66.5
log file sync                      272,633      196.8       0.72   35.1 Commit
db file sequential read                472        2.2       4.75     .4 User I/O
Again, same database instance and a 10 minute period of time but a different workload. Now we see that the resource utilisation is up at 46.47%, being 9.35 minutes of DB Time out of 20 minutes of potential CPU capacity. From the Load Profile we can see that the database was executing over 13,000 SQL statements per second. From the wait events we see that "DB CPU" is 66.5% of the "DB Time" i.e. doing real work. So waiting as a percentage of SQL execution time would be about 33.5%. And we can see that almost all of this wait time is due to a single wait event - "log file sync".

Again, this does not directly tell us whether there is a performance problem or not. In all databases doing work there will be some wait events occurring, and one of these will always be top when ordered by wait time. That does not mean that it is necessarily a problem, or that it can be "fixed" in any way. In this case the workload is doing a lot of inserts and updates in small transactions, so there is a high frequency of commits which all involve writes to the redo logs. And we would expect the "log file sync" wait event to be there in the set of top 10 wait events in this scenario i.e. it is not a "problem" but a "result" of the workload running on the database.

No comments: