Wednesday, 6 November 2013

AWR Reporting #6 - SQL Statements

This is the last of the main set of posts on this topic. As for the last post, I'll try and keep this brief and post the SQL involved.

Another data set we can look at is on SQL statements i.e. statistics collected by AWR on individual SQL statements, not overall totals. AWR snapshots SQL statements from V$SQLSTAT to DBA_HIST_SQLSTAT for per SQL statement statistics.

As before we could use the technique of subtracting the values from the previous snapshot from the values of the current snapshot to get the change in value between the snapshots. Unfortunately this runs into issues as not all SQL statements may be present in both snapshots, and SQL statements can be flushed out of the library cache in the SGA and then added back in again later so their statistics have been reset in-between.

Luckily Oracle has solved this problem for us and provided a set of DELTA columns for the most important statistics giving us just what we want. Also this avoids the need to join back to the previous snapshot to calculate the change between the snapshots - the DELTA columns are already the change in value from the previous snapshot.

The main query then to extract useful statistics per SQL statement per database per snapshot is:
select sqlstat.snap_id
     , sqlstat.dbid
     , sqlstat.sql_id
     , sum (sqlstat.elapsed_time_delta)  sql_time
     , sum (sqlstat.executions_delta) sql_execs
     , sum (sqlstat.disk_reads_delta) sql_reads
  from dba_hist_sqlstat sqlstat
 group by sqlstat.snap_id
     , sqlstat.dbid
     , sqlstat.sql_id
We cannot use this query directly in our current main AWR summary query as we are getting multiple data rows per snapshot - one per SQL statement captured by that snapshot. The obvious thing to do is to sum the values over all of the SQL statements executed in that snapshot to produce one set of values per snapshot. Unfortunately this doesn't really give us anything that useful - we already have data values from AWR for these statistics across the whole system (SQL execution time, number of SQL statements executed, and number of disk reads), and when added up across all the SQL statements it doesn't help us see if there are any anomalies within the SQL statement themselves.

A better use for this query is to run it separately within another query that instead groups by SQL_ID and sums over multiple snaphots e.g. all snapshots in one day. This is a useful way of seeing which particular SQL statements put a greater load on the system than other SQL statements. "Load" could be number of executions, CPU used, disk reads, or some other measurement.

For example, one system I was looking at recently was doing a lot of disk reads and I could see that there were a number of table scans occurring. So I wanted to identify the SQL statements causing these table scans i.e. the SQL statements with the highest disk reads. By summarising over a whole day I could ensure that I was looking at the worst offenders who were executed multiple times during the day, and not a bad query only executed once.

The following query reports SQL statements captured by AWR yesterday ("sysdate - 1" truncated to midnight) sorted by total number of disk reads. The "sql_reads > 100000" is a filter so that not all SQL statements are listed, only those with a significant number of disk reads - you can increase or decrease this threshold based on how active your system is. You may want to start higher at a million and then reduce it by a factor of 10 until you get enough SQL statements listed.
snaps as 
(select csnaps.snap_id
      , csnaps.dbid
      , min (csnaps.end_interval_time) end_snap_time
      , min (csnaps.end_interval_time) - min (csnaps.begin_interval_time) snap_interval
   from dba_hist_snapshot csnaps
  group by csnaps.snap_id, csnaps.dbid
) -- snaps
, sqlstats as
[insert previous query here]
select sqlstats.sql_id
     , sum (sqlstats.sql_reads)  sql_reads
     , sum (sqlstats.sql_execs)  sql_execs
     , sum (sqlstats.sql_time) / 1000000 sql_time
  from snaps
     join sqlstats
       on snaps.snap_id = sqlstats.snap_id and snaps.dbid = sqlstats.dbid
 where snaps.end_snap_time between 
       (trunc (sysdate) - 1) and (trunc (sysdate))
   and sql_reads > 100000
 group by sqlstats.sql_id
 order by sql_reads desc
Note that the time values are in microseconds and so must be divided by one million to output them as seconds.

A further optimization to this query is to restrict it to the main working hours, say 8am to 6pm:
   and extract (hour from snaps.end_snap_time) between 8 and 17 -- 8:00 to 17:59
This avoids issues with any overnight batch jobs you may be running or the Oracle default job to update stale statistics on database objects (which does do a lot of disk reads).

When I used this I was able to identify about 4 SQL statements responsible for about 25% of the disk reads on the system, each having a relatively low execution count - under 100 each for millions of disk reads in total. Each was clearly doing full table scans, which was validated by checking the execution plans of each - there is an AWR report supplied with Oracle that reports this information for a SQL statement across a number of snapshots (awrsqrpt.sql).

Now that I knew the top contributors to disk reads on the system I was able to investigate each SQL statement individually and work out changes to improve their execution. This was a mixture of extra indexes, SQL rewrites and other database changes.

1 comment:

OracleMan Consulting said...

when i put it all together, i got an error on line 11 re parenthesis