Use the GROUP BY clause in SQL for statistics
Part one of a four-part series of blogs

If you want to average, count, sum or perform any other statistic on values in SQL, you need to learn the GROUP BY and HAVING parts of a SQL statement. That's where this online tutorial comes in!

1. Grouping and Aggregating Data using GROUP BY (this blog)
2. Using criteria in a GROUP BY query to restrict results
3. Special considerations for aggregation statistics
4. Varying Grouping Results: Rollup, Cube and Compute

This blog is part of our much longer full SQL tutorial (although we also run training courses in SQL, for up to 6 people at a time).

Posted by Andy Brown on 14 December 2012

You need a minimum screen resolution of about 700 pixels width to see our blogs. This is because they contain diagrams and tables which would not be viewable easily on a mobile phone or small laptop. Please use a larger tablet, notebook or desktop computer, or change your screen resolution settings.

Grouping and Aggregating Data using GROUP BY

This part of our SQL training series of blogs looks at the GROUP BY clause.

What does aggregating data involve?

Aggregating data involves taking a set of rows and summarising it.  For example, you might take a set of films (movies, to those in the US) ...

A list of films without any aggregation.

... and show the total number of Oscars (wins and nominations) by country:

The aggregated data, showing the total number of Oscar wins and nominations for each film-making country in our database.

One way to think of grouping by a column is that you start with a bucket for each country's data, and then add, count, average or perform some other statistic on all of the films in each country's bucket.  Then again, this may not help you!

The above query would group the data by one column (the country).  If you don't group it by any columns, you'll get a single row:

The total figure for all films.

What aggregation functions exist?

Here are the possible functions that you can use:

Function What it does
SUM Returns the total value of a column or field.
AVG Returns the average value of a column or field.
MIN Returns the smallest number or date.
MAX Returns the largest number or date.
COUNT Returns the number of records.
COUNT_BIG Returns the number of records as a number of data type bigint.
STDEV Returns the standard deviation.
STDEP Returns the standard deviation for the population.
VARP Returns the variance for the population.

Syntax of GROUP BY

The basic syntax of the GROUP BY clause is this (where XXX is any of the aggregation functions listed above, such as SUM, COUNT, etc):

SELECT

-- zero, one or more group by columns

FirstColumnGroupingBy,

SecondColumnGroupingBy,

...

-- one or more aggregated columns

XXX(FirstColumnAggregating),

XXX(SecondColumnAggregating),

...

FROM

TableName

GROUP BY

FirstColumnGroupingBy,

SecondColumnGroupingBy,

...

This doesn't mean much on its own, so let's look at a practical example: the query which shows the total number of Oscars by country at the top of this page.  Here it is:

SELECT

c.CountryName,

SUM(f.FilmOscarWins) AS 'Wins',

SUM(f.FilmOscarNominations) AS 'Nominations'

FROM

tblFilm AS f

INNER JOIN tblCountry AS c

ON f.FilmCountryId=c.CountryId

GROUP BY

c.CountryName

In a grouping query, the rule is:

Every non-aggregated column in your query must be repeated in the GROUP BY clause.

What does this mean?  In our example, we have 3 columns of output:

The three columns include two aggregated fields plus the column that we're grouping by (the country name).

The following query would not work, and would generate an error:

SELECT

c.CountryName,

SUM(f.FilmOscarWins) AS 'Wins',

SUM(f.FilmOscarNominations) AS 'Nominations'

FROM

tblFilm AS f

INNER JOIN tblCountry AS c

ON f.FilmCountryId=c.CountryId

This is because we're trying to show the country alongside a couple of summed fields, and we haven't said that we want to group by this column.

Grouping by multiple columns

You can group by as many columns as you like.  This query shows the number of films in our database for each director and studio (so we're grouping by two columns):

SELECT

d.DirectorName,

s.StudioName,

COUNT(f.FilmId) AS 'Number films'

FROM

tblFilm AS f

INNER JOIN tblDirector AS d

ON f.FilmDirectorId=d.DirectorId

INNER JOIN tblStudio AS s

ON f.FilmStudioId=s.StudioId

GROUP BY

d.DirectorName,

s.StudioName

ORDER BY

'Number films' DESC

Here are the first few rows produced by this query:

What does this show?  That Richard Donner is the most loyal person, directing no less than 5 fllms for Warner Brothers?

Grouping by expressions

You can group by any calculated expression, as this example shows:

SELECT

YEAR(FilmReleaseDate) AS 'Release year',

COUNT(FilmId) AS 'Number films'

FROM

tblFilm

GROUP BY

YEAR(FilmReleaseDate)

This would show the number of films released in a given year:

The query shows how many films there are for each year.

Grouping by no columns at all

If you don't group by any columnns, you get a single summary statistics row (as mentioned above):

SELECT

SUM(FilmOscarWins) AS 'Total wins',

SUM(FilmOscarNominations) AS 'Total nominations'

FROM

tblFilm AS f

This query doesn't have any grouping, so produces one row:

if you don't group, you only get one row of data in the results set!

Now that I've covered the basics of grouping, let's look at how you can apply criteria to a grouped query using WHERE and HAVING (then we'll have a look at how to deal with some special cases).