by David Goodmanson
In a world which has more and more data, statistics are really important. They help you distill lots of numbers down so that you can see the overall pattern.
And yes, alas, they can also be used to distort and obfuscate. Good tools can be used for evil.
As you’d expect, Excel has lots of statistical functions from the basic to the downright obscure ones that even some maths geeks don’t understand.
The New York Times recently has some career advice for kids starting college – study statistics. For those of us beyond college years, this series of Office for Mere Mortals articles will cover the range of statistical functions in Excel.
We’ll look in depth at the most used functions and include links to sites were you can get more information on the maths behind the function.
Over the years Excel has added not only more complex stats functions but also useful ones that will make it easier for you to include or exclude data from your analysis.
For example, if you thought that the Excel Average() function was, well, average, then check out the many useful variants in recent versions of Excel.
This article will cover what the stats gurus call ‘Measures of Central Tendency’ – in other words some indication of the ‘middle ground’ or ‘center’ for a set of numbers.
In Excel that means these functions:
Average() and the related functions AverageA(), AverageIF(), AverageIFS() and TrimMean()
Then there’s the much misunderstood and misused Median() plus Mode() that I'll cover in a future article.
The average gives us an idea of what to expect. If we are operating a retail business and we know the average daily sale is 3, then we have an idea of what to expect, what is likely to happen, how to cater for that level of demand. The average is usually the first statistic calculated when doing data analysis on a set of data.
Average or Mean is the most commonly used and widely understood statistical function. It’s simple – add all the values together and divide by the number of values. That’s all the Average() function does.
But Excel developers realized that lists of numbers aren’t always as nicely arranged as you’d like – there are missing values, extreme values that distort the average or you want to average only some of the list. These days Excel can handle all those situations with in-built variations on Average.
The following table shows a collection of orders data for each day of the business week. The data is the same in each column except for Wednesday. On Wednesday there were no orders.
The Wednesday data contains 0 for Orders1, “None” for Orders2 and an empty cell for Orders3 which means the same thing to us humans but means different things depending on the Excel function.
Figure 1 - Orders data using Average and AverageA
Row 7 shows the formula =Average(B2:B6) which is copied across the three columns. Row 8 contains the formula =AverageA(B2:B6) copied across the same columns. Wednesday’s data says that no orders were taken, but it is expressed in three different ways.
The AVERAGE() function ignores cells that do not contain valid numbers. Non-number cells, including blanks, aren’t included in the count of values used to calculate the average.
- Cell B7. Wednesday has a numeric 0 and therefore AVERAGE() correctly computes the average to be 3. The calculation is based on a count of 5 days for the week
- Cell C7. Wednesday is now denoted by the text “None”. AVERAGE() ignores this cell and computes the average based on the other 4 days of the week, giving 3.75.
- Cell D7. This time Wednesday’s data is defined by a blank cell. Once again AVERAGE() ignores cells that do not contain valid numbers. The number of days is considered to be 4. It treats empty cells as non-numeric and therefore ignores them and incorrectly computes the average at 3.75.
AVERAGEA() is the same maths but considers non-numeric cells to have a value of zero and are included in the count of values.
However AVERAGEA() ignores cells that are not used, i.e. blank or empty cells. Let’s look at the six cells where the AVERAGE() and AVERAGEA() functions are deployed – row 7 and row 8 – and examine how these functions behave.
- Cell B8. Wednesday is a numeric 0 and AVERAGEA() calculates exactly the same as AVERAGE(), i.e. equal to 3 based on a count of 5 days
- Cell C8. AVERAGEA() treats the text “None” as having a value of zero. AVERAGEA() computes the average as 3 based on a count of 5 days.
- Cell D8. AVERAGEA(), does not consider the empty cell to have a value of zero and therefore ignores the empty cell, calculating the average of 3.75 based on 4 days.
Filtering before averaging
Excel has various ways to quickly filter a list of values then work out the average without changing the original list.
Is a new function introduced with Office 2007. It relies on a set of criteria to select the data to be averaged. The criteria can be used in two ways. 1) Applying a criterion to a column of data and averaging the data in the column of the criterion or, 2) Applying the criterion to a column of data and taking an average of another column of data for each row that meets that criterion.
The basic format is:
AVERAGEIF(range to filter,criteria,[average_range])
The first two parameters are compulsory, add the third parameter if you want to average a different range from the numbers being averaged.
Is also a new function introduced with Office 2007. The only difference between this and AVERAGEIF() is that AVERAGEIFS() uses multiple criteria.
It is also worth noting that both of the AVERAGEIF() functions can be calculated by using SUMIF() divided by COUNTIF(), provided both functions use the same criteria.
The DAVERAGE function is a hangover from earlier versions of Excel. It evolved from worksheet areas of data being called “database” (Hence the “D” in Daverage), which was basically just a flat file database which could be queried using DAverage. It has largely been superseded by AverageIF and AverageIFS for Excel 2007 users. It relies on:
- a range of numbers (in its first argument),
- a field which can be denoted by a numerical position or the name at the top of the column (2nd argument), and
- a criteria set in a range of cells in another “neutral” area of the worksheet (final argument).
More formally, the syntax is:
DAverage( range, field, criteria )
The Excel help files have good explanations of AverageIF(), AverageIFS() and Daverage().
Trimmean() is a slightly more advanced measure used to enhance the average by excluding extreme values at either end of the data set. It saves you the trouble of excluding extreme values by doing it automatically.
Trimmean() is similar to the average except that it gives an additional option of excluding a percentage of the extreme values that distort the average. In the salaries example the formula for TRIMMEAN() is
This computes the average for the given range of data, while excluding 0.2 or 20% of the data 10% from each end. The formula in the example for the next section excludes 2 of the 10 values, the first value ($200,000,000) and the last value ($50,000).
Notes about trimming:
- TrimMean is symmetrical – in other words it will exclude the same number of values from the high and low end of the list.
- The trim value you give is the total to exclude – NOT the amount trimmed at both ends.
- The amount of trimming works off the value you give and then rounded to the nearest multiple of two with half of that rounded number trimmed off each end of the list.
- If you have 50 values and want 10% trim (eg TrimMean(A1:A30, 0.1) ) that means 5 values, rounded down to 4 values in total – so 2 value are trimmed from each end of the list.
- This rounding means you should be careful when using a small number of values or very small trim value. For example TrimMean(A1:A7, 0.1) would return the same result as Average(A1:A7) because no values are trimmed.
Finally, we don’t know why all the Average/Mean functions are called ‘Average’ except for TrimMean() which should have been called TrimAverage for consistency.
And there’s more …
For the sake of completeness, there are some other Average/Mean related functions available in recent versions of Excel:
The technical definition of Geomean is the nth root of n number of variables. Basically, if you have 5 numbers, you multiply the five numbers and take the 5th root of those numbers. The syntax is Geomean(data range) for an explanation go to the Geometric Mean page at Wikipedia.
The harmonic mean is defined as the mean of a set of positive variables, it is calculated by dividing the number of observations by the reciprocal of each number in the series.
This is a very technical mean which I’m not sure you’ll use that often. It is often used to calculate the average of rates e.g. interest rates. Again see the Wikipedia Harmonic Mean page.
Moving Averages will be part of our future article in this series when we look at the forecasting statistical functions in Excel.
Article posted: Tuesday, 22 September 2009
there's more ...
If you liked this article you'll LOVE our new ebooks.
Windows 8 for Microsoft Office users A practical guide the new, changed and unfamiliar in Windows 8
A focused and unvarnished look at Windows 8, especially written for
the many people who use Microsoft Office Get it today
- click here.
ORGANIZING OUTLOOK EMAIL - tame your Outlook 2010 Inbox
100+ pages of practical tips and help to streamline,
automate and search your Inbox. Get more
than you ever thought possible from Outlook. Read it today
- click here.
More from Office Watch: