Null is Odd…or Things I Used to Know about SQL Aggregate Functions

Brendan McCullum recently played his final Test for New Zealand.
That’s something of an understatement. In his last game he made a century in a mere 54 balls, a feat unmatched in 139 years of test cricket.
From the outside looking in, it seemed that McCullum had come to realise something he’d always known. Playing cricket is supposed to be fun.
What’s more, you can consider yourself quite fortunate if you get paid for doing something you enjoy, especially when that something is hitting a ball with a stick.

With the help of Mr McCullum, what follows will serve to remind me of something I’ve always known but may forget from time to time.
In my case, it’s the fact that NULL is odd. This is especially true when it comes to basic SQL aggregation functions.

Some test data

We’ve got a simple table that holds the number of runs scored by McCullum in each of his Test innings together with a nullable value to indicate whether or not he was dismissed in that innings.

This is relevant because one of the things we’re going to do is calculate his batting average.

In Cricket, the formula for this is :

Runs Scored / (Innings Batted – Times Not Out)

Anyway, here’s the table :

create table mccullum_inns
(
    score number not null,
    not_out number
)
/

…and the data…

insert into mccullum_inns( score,not_out) values (57, null);
insert into mccullum_inns( score,not_out) values (19, 1);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (55, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (96, null);
insert into mccullum_inns( score,not_out) values (54, null);
insert into mccullum_inns( score,not_out) values (20, null);
insert into mccullum_inns( score,not_out) values (21, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (143, null);
insert into mccullum_inns( score,not_out) values (17, 1);
insert into mccullum_inns( score,not_out) values (10, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (10, null);
insert into mccullum_inns( score,not_out) values (36, null);
insert into mccullum_inns( score,not_out) values (29, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (99, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (111, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (19, null);
insert into mccullum_inns( score,not_out) values (74, null);
insert into mccullum_inns( score,not_out) values (23, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (33, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (14, 1);
insert into mccullum_inns( score,not_out) values (43, null);
insert into mccullum_inns( score,not_out) values (17, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (26, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (21, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (40, null);
insert into mccullum_inns( score,not_out) values (51, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (85, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (42, null);
insert into mccullum_inns( score,not_out) values (97, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (71, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (2, null);
insert into mccullum_inns( score,not_out) values (66, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (30, null);
insert into mccullum_inns( score,not_out) values (84, 1);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (19, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (84, null);
insert into mccullum_inns( score,not_out) values (115, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (6, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (29, null);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (78, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (89, null);
insert into mccullum_inns( score,not_out) values (185, null);
insert into mccullum_inns( score,not_out) values (19, 1);
insert into mccullum_inns( score,not_out) values (24, null);
insert into mccullum_inns( score,not_out) values (104, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (51, null);
insert into mccullum_inns( score,not_out) values (65, null);
insert into mccullum_inns( score,not_out) values (11, 1);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (225, null);
insert into mccullum_inns( score,not_out) values (40, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (56, null);
insert into mccullum_inns( score,not_out) values (35, null);
insert into mccullum_inns( score,not_out) values (2, null);
insert into mccullum_inns( score,not_out) values (64, null);
insert into mccullum_inns( score,not_out) values (14, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (34, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (16, null);
insert into mccullum_inns( score,not_out) values (12, null);
insert into mccullum_inns( score,not_out) values (83, null);
insert into mccullum_inns( score,not_out) values (48, null);
insert into mccullum_inns( score,not_out) values (58, 1);
insert into mccullum_inns( score,not_out) values (61, null);
insert into mccullum_inns( score,not_out) values (5, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (84, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (19, null);
insert into mccullum_inns( score,not_out) values (22, null);
insert into mccullum_inns( score,not_out) values (42, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (23, null);
insert into mccullum_inns( score,not_out) values (68, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (35, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (51, null);
insert into mccullum_inns( score,not_out) values (13, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (74, null);
insert into mccullum_inns( score,not_out) values (69, null);
insert into mccullum_inns( score,not_out) values (38, null);
insert into mccullum_inns( score,not_out) values (67, 1);
insert into mccullum_inns( score,not_out) values (2, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (20, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (21, null);
insert into mccullum_inns( score,not_out) values (22, null);
insert into mccullum_inns( score,not_out) values (11, null);
insert into mccullum_inns( score,not_out) values (113, null);
insert into mccullum_inns( score,not_out) values (9, null);
insert into mccullum_inns( score,not_out) values (37, null);
insert into mccullum_inns( score,not_out) values (12, null);
insert into mccullum_inns( score,not_out) values (224, null);
insert into mccullum_inns( score,not_out) values (1, null);
insert into mccullum_inns( score,not_out) values (8, null);
insert into mccullum_inns( score,not_out) values (302, null);
insert into mccullum_inns( score,not_out) values (7, null);
insert into mccullum_inns( score,not_out) values (17, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (3, null);
insert into mccullum_inns( score,not_out) values (31, null);
insert into mccullum_inns( score,not_out) values (25, null);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (39, null);
insert into mccullum_inns( score,not_out) values (43, null);
insert into mccullum_inns( score,not_out) values (45, null);
insert into mccullum_inns( score,not_out) values (202, null);
insert into mccullum_inns( score,not_out) values (195, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (22, null);
insert into mccullum_inns( score,not_out) values (42, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (41, null);
insert into mccullum_inns( score,not_out) values (55, null);
insert into mccullum_inns( score,not_out) values (6, null);
insert into mccullum_inns( score,not_out) values (80, null);
insert into mccullum_inns( score,not_out) values (27, null);
insert into mccullum_inns( score,not_out) values (4, null);
insert into mccullum_inns( score,not_out) values (20, null);
insert into mccullum_inns( score,not_out) values (75, null);
insert into mccullum_inns( score,not_out) values (17, 1);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (18, null);
insert into mccullum_inns( score,not_out) values (0, null);
insert into mccullum_inns( score,not_out) values (10, null);
insert into mccullum_inns( score,not_out) values (145, null);
insert into mccullum_inns( score,not_out) values (25, null);
commit;

I’ve loaded this into my Oracle 11gXE Enterprise Edition database.

Don’t count on COUNT()

Let’s just check the number of rows in the table :

select count(*), count(score), count(not_out)
from mccullum_inns
/

  COUNT(*) COUNT(SCORE) COUNT(NOT_OUT) 
---------- ------------ --------------
       176          176             9
       

Hmmm, that’s interesting. Whilst there are 176 rows in the table, a count of the NOT_OUT column only returns 9, which is the number of rows with a non-null value in this column.

The fact is that COUNT(*) behaves a bit differently from COUNT(some_column)…

with stick as
(
    select 1 as ball from dual
    union all select 2 from dual
    union all select null from dual
)    
select count(*), count(ball)
from stick
/

COUNT(*)                             COUNT(BALL)
---------- ---------------------------------------
         3                                       2

Tanel Poder provides the explanation as to why this happens here.
Due to this difference in behaviour, you may well consider that COUNT(*) is a completely different function to COUNT(column), at least where NULLS are concerned.

When all else fails, Read the Manual

From very early on, database developers learn to be wary of columns that may contain null values and code accordingly, making frequent use of the NVL function.
However, aggregate functions can prove to be something of a blind spot. This can lead to some interesting results.
Whilst we know ( and can prove) that NULL + anything equals NULL…

select 3 + 1 + 4 + 1 + null as ball
from dual
/

     BALL
----------
         

…if we use an aggregate function…

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select sum(ball) 
from stick
/

SUM(BALL)
----------
         9

…so, calculating an average may well lead to some confusion…

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select avg(ball)
from stick
/

AVG(BALL)
----------
      2.25

…which is not what we would expect given :

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select sum(ball)/count(*) as Average
from stick
/

   AVERAGE
----------
       1.8 

You can see similar behaviour with the MAX and MIN functions :

with stick as 
(
    select 3 as ball from dual
    union all select 1 from dual
    union all select 4 from dual
    union all select 1 from dual
    union all select null from dual
)
select max(ball), min(ball)
from stick
/

 MAX(BALL)  MIN(BALL)
---------- ----------
         4          1

Looking at the documentation, we can see that :

“All aggregate functions except COUNT(*), GROUPING, and GROUPING_ID ignore nulls. You can use the NVL function in the argument to an aggregate function to substitute a value for a null. COUNT and REGR_COUNT never return null, but return either a number or zero. For all the remaining aggregate functions, if the data set contains no rows, or contains only rows with nulls as arguments to the aggregate function, then the function returns null.”

So, if we want our aggregate functions to behave themselves, or at least, behave as we might expect, we need to account for situations where the column on which they are operating may be null.
Returning to COUNT…

select count(nvl(not_out, 0)) 
from mccullum_inns
/

                 COUNT(NVL(NOT_OUT,0))
---------------------------------------
                                    176
                    

Going back to our original task, i.e. finding McCullum’s final batting average, we could do this :

select count(*) as Inns, 
    sum(score) as Runs,
    sum(nvl(not_out,0)) as "Not Outs",
    round(sum(score)/(count(*) - count(not_out)),2) as Average
from mccullum_inns
/

INNS  RUNS   Not Outs AVERAGE
----- ----- ---------- -------
  176  6453          9   38.64
  

However, now we’ve re-learned how nulls are treated by aggregate functions, we could save ourselves a bit of typing…

select count(*) as Inns, 
    sum(score) as Runs,
    count(not_out) as "Not Outs",
    round(sum(score)/(count(*) - count(not_out)),2) as Average
from mccullum_inns
/

INNS  RUNS   Not Outs AVERAGE
----- ----- ---------- -------
  176  6453          9   38.64

Time to draw stumps.

Advertisements

2 thoughts on “Null is Odd…or Things I Used to Know about SQL Aggregate Functions

  1. Nicely laid out and explained.

    It’s probably important to point out that “logically”, a NULL is (or should generally be) a marker that indicates either that a value is Unknown, or Inapplicable. If the NULL is merely used as an alternative representation of the value zero (0), then using NVL(not_out,0) for aggregation makes sense (of course, it would be even better to actually store zeroes, but then you wouldn’t be able to demonstrate the oddities of NULL in blog articles 🙂

    If the NULL is a marker for Inapplicable, Oracle’s treatment of NULLs for aggregations makes sense: if the value is not applicable to that row, we most probably would not expect it to affect the calculation of an average, sum, or other aggregate. Personally, in this case I might try to make my intent even clearer by explicitly omitting the NULL results from my query before the aggregate operator is applied.

    If, on the other hand, the NULL is a marker for Unknown, we leave Oracle between a rock and a hard place: it cannot, by definition, give an absolutely “correct” answer to questions like “what is the sum” or “what is the average” in cases where a NULL is found – if we don’t know one of the values, we can’t know the total! So Oracle should really raise an exception in this case. However, this would preclude the previous logic we wanted for the Inapplicable interpretation of NULL; so it makes sense that Oracle would prefer the non-exception-raising behaviour.

    • Jeff,

      your absolutely right as usual.

      In the “Real World” I would indeed populate the NOT_OUT column with 0 as a default value.
      So the table definition would be more like :

      create table mccullum_inns
      (
          scores number not null,
          not_outs number not null default 0
      )
      /
      

      Like you, I would explicitly exclude nulls in the predicate of a query where the value is not applicable to the row.
      I think that the efforts we take to avoid any NULL strangeness is one of the reasons why I sometimes forget exactly how these functions behave when confronted with null values.
      For example, I would instinctively write a query like :

      select avg(pct_commission)
      from hr.employees
      where pct_commission is not null
      /
      

      …although I will get the same result with :

      select avg(pct_commission)
      from hr.employees
      /
      

      I was going to say that which of these is preferrable is probably a bit of a grey area.
      However, if the execution plan for each query is surprisingly different. The first one has a cost of 35 with the second having 107.
      OK, so I checked this on a table with just over 100 rows so I wouldn’t draw too many conclusions, but it’s probably worth checking.

      Mike

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s