Category Archives: Data


Mean vs Median – What’s the Difference?

When reporting ‘central tendencies’ in data (that is, finding the ‘middle’ values), you often find that two measures are used:-

Mean: This is the sum of all values, divided by the number of available values.
Median: If all values are placed in order, this is the midpoint – the central value in the list, or the average of the central two values if there are an even number of values.

Let me explain further. Look at the following simple set of data:-

1, 8, 10, 7, 4, 8, 6, 2, 3, 7

To work out the mean of this set, we add all of the figures together and divide by 10 (the number of values available). This would give us:-

1+8+10+7+4+8+6+2+3+7 = 56, which divided by 10 returns a value of 5.6. So in this case, 5.6 is the mean.

To get the median, we first need to sort the values in ascending order, which would give us:-

1, 2, 3, 4, 6, 7, 7, 8, 8, 10

As there are an even number of figures here, the median is the average of the two middle figures in the list (6 and 7), which is 6.5.

Pretty easy, huh?
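If you want to check figures like these in R (which crops up later in this archive), the built-in mean() and median() functions do the work for us:

```r
# The example data set from above
values <- c(1, 8, 10, 7, 4, 8, 6, 2, 3, 7)

mean(values)    # sum of values divided by the count: 5.6
median(values)  # average of the two middle sorted values: 6.5
```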

So, which should you use and when? It depends on the data you have to hand and how you want to report on it. In our case the two are pretty close, mainly because there is only a small amount of data and the range of figures only runs from 1 to 10.

What if the data were more like this?

1, 4, 8, 2, 9, 3, 7, 9, 2, 1026

Let’s quickly work these out:-

Mean: 107.1
Median: 5.5
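The same two R functions confirm just how far apart the measures now are:

```r
# The same kind of data, but with 1026 as an extreme value
skewed <- c(1, 4, 8, 2, 9, 3, 7, 9, 2, 1026)

mean(skewed)    # dragged up by the outlier: 107.1
median(skewed)  # barely affected: 5.5
```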

Which would you use, taking the data into account, and why?

Well… take a look at the data. The value of 1026 is much, much higher than the rest. It’s what we would refer to as an outlier – a value that falls outside of the expected or observed range. Using the mean in these cases takes that outlying value into consideration and ‘drags’ the mean towards the larger value, giving a skewed view of ‘middle’. The median, however, looks at the values as a whole and finds the value in the middle, which, given the data presented, gives us a much more reasonable measure.

Which to use, and in what circumstances, does depend on the data. If the data you’re reporting on is ‘normal’ in nature (that is, it has no outliers and, when plotted on a graph, follows a ‘normal’ (or symmetrical) distribution of values), then I would say go for the mean. If there are outliers present, the median can give a much better view of ‘middle’, as the distribution might be skewed one way or the other. As with all analysis, it’s often useful to use a little common sense when looking at the figures you have and choose appropriately.

As a real world example we could ask why, in the BBC article I posted about yesterday, they used the median age of users for reporting rather than the mean.

Why? It’s quite simple.  Because of the nature of Facebook it’s pretty much a given that younger people make up the majority of users, but the presence of ‘middle-aged and over’ users might actually skew the reported ‘middle’ value if the mean were used.

As with most things, both figures can be useful, but think about the data you’re using.

Here’s an interesting article about the problems that might crop up if the wrong measure of central tendency is used.


On consistency in reporting

When reporting any kinds of figures, consistency in the data reported is paramount.  It gives a much-needed clarity to your data, making comparisons between data sets simple and intuitive, and allows the user to easily see trends and changes in data over time and by proportion.

When reporting lacks consistency however, things start to fall apart.  I’ve a very recent example of this, which I want to talk about now.

On the 5th October, the BBC reported on Facebook surpassing 1 billion users per month.  Quite an achievement (if only 2toria had a billionth of that usage…).  I don’t have a big problem with the graphic used in this article (although I could find some if I tried harder, I’m sure.)

The big problem I have is with the data presented in the side panel entitled ‘Evolution of a network’.  For a change, this data is presented in text rather than a graphic, but I have issues with it.  Here it is:-

On first glance the information is relatively useful and offers a comparison of Facebook usage at key stages (25 million users, 50 million, 100 million, 500 million and 1 billion).  The problem I have with the figures reported however is that of consistency.  If you look at the detail for 25, 50, 100 and 500 million you can see the reports have been about the following:-

  • Median user age
  • Top countries
  • Average friends for users joining the site at this stage

When we get to the 1bn point however, things change.  Instead of being consistent with previous reporting, the figures shift:-

  • Median user age
  • Top countries
  • Number of mobile users

What?  There are two things wrong here.  Not only is there no continuation of the comparison of numbers of friends, but a whole new metric has been introduced to highlight the number of mobile users.

For me, this fails…  I can maybe understand that there was no data for numbers of friends at the 1 billion mark, and this is fine.  It can happen, but it should have left the reporters with one of two choices: 1) don’t report the metric at all, or 2) explain that the data was not available at the time of reporting.  I might have forgiven that.

Instead, the report shifts to focus on the number of mobile users without that data having been present throughout the other milestones.  If they were going to report on this metric it would have been useful to see the numbers of mobile users from the beginning too, wouldn’t it?

Or am I being picky because it’s Sunday afternoon and I’m grumpy?

Basically, my takeaway from this is that the report would have been a lot more meaningful for those wanting a true comparison if consistency in the figures reported had been taken into account.  I would urge you to do the same when reporting yourself.  If you are reporting a metric over time or over a key set of milestones, make sure that you use the same metrics for each, otherwise the data means very little.


The perils of impossible target setting – a true story

Some time ago, in a meeting we were looking at some RAG (Red, Amber, Green) style reporting that measured the data quality of input into one of our systems.  I can’t be more specific, or I’d have to kill you all.

The report we were observing had thresholds set for each of the coloured indicators.  The report ran green for zero errors, amber for between 1 and 20, and red for anything over 20.  That’s fine.  Absolutely nothing wrong with that.

The only thing was, upon looking at the reports, the measure had been red for the past twelve months reported on, and was nowhere near getting close to amber, let alone green.  When asked about whether it would be possible at this moment to ever get to green, the person responsible for the reporting and analysis simply exclaimed “No”.  The report was based on the analysis of tens of thousands of entries, of which there were generally always about 5% containing some kind of error at any time (that’s over 500 errors each month – nowhere near 20).  This is the way it had always been, and would likely be for the near future.

This got me thinking…  If it isn’t possible to ever get to green at this moment in time, or even amber, then why have them in the report in the first place?  Red indicates bad, yes, that’s a given.  But what if bad is never going to get better?

Here’s my suggestion.  Find an acceptable level of good (or even great) which will work for the short to medium term and set that to green.  If you are always hitting 5% errors, then set green for anything under 3.5%, amber for 3.5% to 5%, and red for 5% and above, or something to that effect.  You should ideally do some analysis to look at how the figures vary and set your new thresholds to something sensible and realistic.  This way you aren’t expecting the impossible from your team; rather, you are setting them a goal which might actually be achievable.

If, and only if, you ever get to the point that staff are consistently hitting green, you might then want to think about tightening your thresholds and maybe, just maybe, at some point you will get close to your initial ones.  If, on the other hand, you keep them as they are, your staff will ignore the results of the report because “we never get anything but red, so why bother?”  Using the current situation/context as a baseline for target setting makes it much more likely that over time the number of amber and red figures will reduce and your figures will hopefully start to look much healthier.
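To make the idea concrete, here’s a minimal R sketch of this kind of scheme (the function name and the exact bands are mine, purely for illustration):

```r
# Hypothetical RAG rating based on error rate rather than a raw count:
# green below 3.5%, amber from 3.5% up to 5%, red at 5% and above
rag_status <- function(error_rate) {
  if (error_rate < 0.035) {
    "Green"
  } else if (error_rate < 0.05) {
    "Amber"
  } else {
    "Red"
  }
}

rag_status(0.05)  # "Red"   - the typical 5% error level
rag_status(0.03)  # "Green" - an achievable target
```

Adjusting the two cut-off numbers is then all it takes to tighten the thresholds later on.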

Also, ensuring that red does actually mean bad based on your current thresholds means that you can put measures in place to prevent bad from happening next month.  If it is always red, what actions will you ever put in place to see some kind of noticeable improvement?  None.  I know.  I’ve seen it happen.  Momentum always seems to be lost when you are continually fighting a losing battle.  Provide an opportunity to win and you will more likely see positive results.

I didn’t speak up about it in the meeting, but others did, and the reporting team said they would look into it.

Let’s hope common sense prevails.


Quick R Tip – easily read files using file.choose()

I’ve been using R recently for a few things and the one thing I’ve often struggled with is getting the software to read external datasets (such as csv files) to work on.  The general code I’ve been using to do this is:-

data <- read.csv("My file path and file name")

but it’s a real pain in the backside to do and you can get some pretty weird errors if you don’t get the path and filename correct.

Anyway, today I found a much nicer way of loading in my data:-

data <- read.csv(file.choose())

This pops up a standard Windows file dialog so that I can just select a file to read in.  Much easier and less prone to errors and the resulting frustrated hair-pulling, I think.
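One caveat: file.choose() needs an interactive session, so it won’t help in a script run unattended.  For that case, a small hypothetical helper of my own (just wrapping base R functions) can at least make a bad path fail with a readable message rather than a cryptic one:

```r
# Hypothetical wrapper: check the path before handing it to read.csv(),
# so a typo produces a clear error instead of a confusing one
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path)
}

# In an interactive session you can still combine it with file.choose():
# data <- read_csv_checked(file.choose())
```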