Sunday, July 27, 2008

Do we need stats on all columns?

Recently I had visited one of the customers for performance review. As per the current stats gathering policy, they use following method to gather stats on the tables.

dbms_stats.gather_table_stats (user, tabname, cascade=>true, method_opt=>'FOR ALL INDEXED COLUMNS 75');

As can be seen from the command, the input from method_opt is to gather stats on all the indexed columns with bucket size as 75. This means the stats on the other non-index columns will not be collected. This is a wrong practice and stats on the columns of the table are crucial, particularly if these are use in queries.

When I was discussed with the customer, the customer had following misconception.

1. To make decision for index scan; the stats on indexed columns are available.
2. When there is no index available on column, why oracle needs stats on it. It will use FTS for such queries.

This is purely a misconception, even if a column, use in a query, is not indexed; the correct cardinality is very crucial in determining a correct join order. To prove this point, I have created following small test case.

I created two tables (test1 and test2) and indexes on it.

SQL> create table test1 as
2 select mod(LVL, 3) ID, LVL VAL from
3 (select level "LVL" from dual connect by level <= 100000);

Table created.

SQL> create index ind_test1 on test1 (id);

Index created.

SQL> create table test2 as
2 select lvl "ID", dbms_random.string('A',10) "STR" from (select
3 level "LVL" from dual connect by level <=50000);

Table created.

SQL> create index ind_test2 on test2 (id);

Index created.

Let us generate optimizer statistics based on the policy as the customer used.

SQL> exec dbms_stats.gather_table_stats (user, 'TEST1', cascade=>true, method_opt=>'FOR ALL INDEXED COLUMNS SIZE 75');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats (user, 'TEST2', cascade=>true, method_opt=>'FOR ALL INDEXED COLUMNS SIZE 75');

PL/SQL procedure successfully completed.

The stats on the table are


SQL> select owner, table_name, partitioned, num_rows, blocks
2 from dba_tables
3 where table_name in ('TEST1','TEST2')
4 and owner = 'TEST';

OWNER TABLE_NAME PAR NUM_ROWS BLOCKS
---------- ---------- --- ---------- ----------
TEST TEST1 NO 100000 186

TEST TEST2 NO 50000 156


break on table_name skip 1
select table_name, column_name,num_distinct, num_nulls, density
from dba_tab_columns
where table_name in ('TEST1','TEST2')
and owner='TEST'
order by table_name;

TABLE_NAME COLUMN_NAM NUM_DISTINCT NUM_NULLS DENSITY
---------- ---------- ------------ ---------- ----------
TEST1 ID 3 0 .000005
VAL

TEST2 ID 50000 0 .00002
STR

Based on the input to dbms_stats, stats are collected only on indexed columns and not for other non-indexed columns.
The following query is used to check the execution plan.

SQL> set autot trace
SQL> select a.id, a.val, b.str
2 from test1 a,
3 test2 b
4 where a.id = b.id
5 and a.val = 40;

Elapsed: 00:00:00.02

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=46 Card=667 Bytes=77372)
1 0 HASH JOIN (Cost=46 Card=667 Bytes=77372)
2 1 TABLE ACCESS (FULL) OF 'TEST1' (Cost=19 Card=1000 Bytes=16000)
3 1 TABLE ACCESS (FULL) OF 'TEST2' (Cost=17 Card=50000 Bytes=5000000)


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
327 consistent gets
315 physical reads
0 redo size
489 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed


Now, let’s delete and gather new statistics on the same tables.

SQL> exec dbms_stats.delete_table_stats (user, 'TEST1');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_table_stats (user, 'TEST2');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats (user, 'TEST1', cascade=>true, method_opt=>'FOR ALL COLUMNS');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats (user, 'TEST2', cascade=>true, method_opt=>'FOR ALL COLUMNS');

PL/SQL procedure successfully completed.

Let us check the stats again.

TABLE_NAME COLUMN_NAM NUM_DISTINCT NUM_NULLS DENSITY
---------- ---------- ------------ ---------- ----------
TEST1 ID 3 0 .000005
VAL 100000 0 .00001

TEST2 ID 50000 0 .00002
STR 50000 0 .00002

The plan by the same query is

SQL> select a.id, a.val, b.str
2 from test1 a,
3 test2 b
4 where a.id = b.id
5 and a.val = 40;

Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=21 Card=1 Bytes=22)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'TEST2' (Cost=2 Card=1 Bytes=15)
2 1 NESTED LOOPS (Cost=21 Card=1 Bytes=22)
3 2 TABLE ACCESS (FULL) OF 'TEST1' (Cost=19 Card=1 Bytes=7)
4 2 INDEX (RANGE SCAN) OF 'IND_TEST2' (NON-UNIQUE) (Cost=1 Card=1)


Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
183 consistent gets
174 physical reads
0 redo size
489 bytes sent via SQL*Net to client
499 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

The comparison between both of the above plans is -

Plan when stats are gathered
Cardinality of TEST1 table
Consistent Gets
Only on indexed columns 1000 327
On all columns 1 183


Why such difference in both the execution plans?

Cardinality plays a very big role in opting an execution plan. In first case, as the stats are not available on VAL column of the table TEST1.

The cardinality when statistics are available only on indexed columns

The filter condition is on ID column of table TEST1 and as there are no stats available on this column, oracle will take the default selectivity as 1/100 and calculated cardinality as

num_rows*selectivity = 100000*(1/100) = 1000

As per the plan, the TEST1 is driving table and cardinality is very high and hence oracle opted the FTS for TEST2 table and cardinality is calculated as

num_rows = 50000

When stats are available on both the all column of the tables, for TEST1 table cardinality is calculated as

Num_rows*(“ID column selectivity” * “VAL column Selectivity)
= round (100000 * (1/3 * 1/100000))
= 0

As cardinality cannot be 0 it will consider as 1.

As the cardinality for TEST1 table is very low i.e. 1, TEST2 will be opted for index scan and hence cardinality is calculated as

Num_rows * (1/distinct)
= round (50000 * (1/50000))
= 1

From above discussion it is concluded that the stats are necessary on all columns. In the above-mentioned test case, we have calculated the stats with histograms but the same result can be found without histograms, as the column stats are getting updated.

To sum up, the stats on all columns are required for optimal execution plan.

No comments: