Use the GROUP BY clause in SQL for statistics
Part two of a four-part series of blogs

If you want to average, count, sum or perform any other statistic on values in SQL, you need to learn the GROUP BY and HAVING parts of a SQL statement. That's where this online tutorial comes in!

1. Grouping and Aggregating Data using GROUP BY
2. Using criteria in a GROUP BY query to restrict results (this blog)
3. Special considerations for aggregation statistics
4. Varying Grouping Results: Rollup, Cube and Compute

This blog is part of our much longer full SQL tutorial (although we also run training courses in SQL, for up to 6 people at a time).

Posted by Andy Brown on 14 December 2012

You need a minimum screen resolution of about 700 pixels width to see our blogs. This is because they contain diagrams and tables which would not be viewable easily on a mobile phone or small laptop. Please use a larger tablet, notebook or desktop computer, or change your screen resolution settings.

# Using criteria in a GROUP BY query to restrict results

All of our queries so far have worked on an entire table; I'll now show how to apply criteria to work only on certain records.

If you're not sure about using criteria, I'd recommend first reading my earlier blog on using criteria in queries before continuing.

## The difference between WHERE and HAVING

You can apply criteria to a query using WHERE to restrict the set of rows on which it operates.  Alternatively, once SQL has assembled the aggregated data it can apply a HAVING clause to restrict the rows returned.

Here's an example of this:

-- show the number of Oscar-winning films

-- for each director (but only show directors

-- who have made more than one Oscar-winning film)

SELECT

d.DirectorName,

COUNT(f.FilmId) AS 'Number films'

FROM

tblFilm AS f

INNER JOIN tblDirector AS d

ON f.FilmDirectorId=d.DirectorId

INNER JOIN tblStudio AS s

ON f.FilmStudioId=s.StudioId

WHERE

f.FilmOscarWins > 0

GROUP BY

d.DirectorName

HAVING

COUNT(f.FilmId) > 1

ORDER BY

'Number films' DESC

This query:

1. Creates a list of films which have won at least one Oscar;
2. Groups these by director name, showing the number of films for each;
3. Returns the results, excluding any directors with only one film.

For the Wise Owl movies database, here's what this query returns:

There may be lots of directors who have made only one Oscar-winning film, but this query doesn't show them.

## Remembering the order of commands in SQL

The above query uses all 6 SQL keywords.  To remember the order, just say to yourself that:

"Sweaty Feet Will Give Horrible Odours"

In other words:

SELECT | FROM | WHERE | GROUP BY | HAVING | ORDER BY

## The pointlessness of the HAVING clause

I hardly ever use HAVING.  The reason is that it doesn't save SQL any processing time, as it must compile all of the dataset for a query in order to find out which rows to include/exclude.  For the above query, SQL must calculate the full set of data:

The first few of the 39 rows which the above query would return without the HAVING clause.

All that the HAVING clause does is to make the results easier to read, by removing those which aren't of interest.

## Using aliases in the HAVING clause

An irritation of the HAVING clause is that you can't refer to columns by their aliases.  The following query shows a list of directors who have made at least 3 films, but whose longest film was still less than 2 hours long:

-- show directors who have made at least 3

-- films, none of them lasting more than 2 hours

SELECT

d.DirectorName,

MAX(f.FilmRunTimeMinutes) AS Longest,

COUNT(f.FilmId) AS 'Number films'

FROM

tblFilm AS f

INNER JOIN tblDirector AS d

ON f.FilmDirectorId=d.DirectorId

GROUP BY

d.DirectorName

HAVING

MAX(f.FilmRunTimeMinutes) <= 120="">AND

COUNT(f.FilmId) > 2

Out of interest, here's what this query returns for the Wise Owl movies database:

Are Brett Ratner's audience getting value for money? Three films, and none of them longer than 97 minutes.

In this query, you couldn't use column aliases in the HAVING clause.  The following wouldn't work:

HAVING

Longest <= 120="">AND 'Number films' > 2

On our SQL training courses I always find this hard to explain.  By the time SQL gets to this clause it's done all of the processing it needs to do, and it knows the final column names.  Why can't it use them?

If you do run the above query with the HAVING clause shown, you'll get this error:

SQL doesn't recognise the column, even though I think it should!

Having looked at how you can apply criteria to filter out unwanted records from a grouping query (both before and after aggregation), let's now turn our attention to some special cases - including dealing with nulls and counting and averaging.