Tag Archives: data

Data Statistics

Mean vs Median – What’s the Difference?

When reporting ‘central tendencies’ in data (that is, finding the ‘middle’ values), you often find that two measures are used:-

Mean: This is the sum of all values, divided by the number of available values.
Median: If all values are placed in order, this is the midpoint – the central value in the list, or the average of the central two values if there is an even number of values.

Let me explain further. Look at the following simple set of data:-

1, 8, 10, 7, 4, 8, 6, 2, 3, 7

To work out the mean of this set, we add all of the figures together and divide by 10 (the number of values available). This would give us:-

1+8+10+7+4+8+6+2+3+7 = 56, which divided by 10 returns a value of 5.6. So in this case, 5.6 is the mean.

To get the median, we first need to sort the values in ascending order, which would give us:-

1, 2, 3, 4, 6, 7, 7, 8, 8, 10

As there are an even number of figures here, the median is the average of the two middle figures in the list (6 and 7), which is 6.5.
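If you want to check the working yourself, here is a minimal sketch in R. It does the sums by hand and then compares them with R’s built-in mean() and median() functions. The variable names (scores, mean_by_hand and so on) are made up purely for illustration.

```r
# The ten values from the example above ('scores' is a made-up name)
scores <- c(1, 8, 10, 7, 4, 8, 6, 2, 3, 7)

# Mean: the sum of the values divided by how many there are
mean_by_hand <- sum(scores) / length(scores)

# Median: sort first, then average the two middle values (even count)
sorted <- sort(scores)
n <- length(sorted)
median_by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

# The built-in functions should agree with the manual working
print(c(mean = mean(scores), median = median(scores)))
```

The manual median shown here only covers an even number of values; median() handles the odd case for you.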

Pretty easy, huh?

So, which should you use and when? It depends on the data you have to hand and how you want to report on it. In our case the two are pretty close, mainly because there is only a small amount of data and the figures only range from 1 to 10.

What if the data were more like this?

1, 4, 8, 2, 9, 3, 7, 9, 2, 1026

Let’s quickly work these out:-

Mean: 107.1
Median: 5.5

Which would you use, taking the data into account, and why?

Well... take a look at the data. The value of 1026 is much, much higher than the rest. It’s actually what we would refer to as an outlier, which means it is one that possibly falls outside of the expected or observed set of values. The mean takes that outlying value into account and is ‘dragged’ towards it, giving a skewed view of ‘middle’. The median, however, depends only on the order of the values: it finds the one in the middle, which, given the data presented, gives us a much more reasonable measure.
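To see how much that single value matters, here is a rough sketch in R that works out both measures with and without the outlier. Dropping the largest value is only done here to illustrate the effect; it is not a recommendation to delete outliers from real data. The names visits, trimmed, with_outlier and without_outlier are made up for the example.

```r
# The second data set, including the outlier ('visits' is a made-up name)
visits <- c(1, 4, 8, 2, 9, 3, 7, 9, 2, 1026)

# Both measures with the outlier included
with_outlier <- c(mean = mean(visits), median = median(visits))

# Drop the single largest value and recalculate (illustration only)
trimmed <- visits[visits != max(visits)]
without_outlier <- c(mean = mean(trimmed), median = median(trimmed))

# One row per case, one column per measure
print(rbind(with_outlier, without_outlier))
```

Without the 1026, the mean falls from 107.1 to 5, while the median only moves from 5.5 to 4.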

Which to use, and in what circumstances, does depend on the data. If the data you’re reporting on is ‘normal’ in nature (that is, there are no outliers and, when plotted on a graph, the values follow a ‘normal distribution’ or another symmetrical distribution), then I would say go for the mean. If there are outliers present, then the median can give a much better view of ‘middle’, as the distribution might be skewed one way or the other. As with all analysis, it’s often useful to use a little common sense when looking at the figures you have, and choose appropriately.

As a real world example we could ask why, in the BBC article I posted about yesterday, they used the median age of users for reporting rather than the mean.

Why? It’s quite simple. Because of the nature of Facebook it’s pretty much a given that younger people will make up the majority of users, but the presence of ‘middle age and over’ users might actually pull the reported ‘middle’ value upwards if the mean were used.

As with most things, both figures can be useful, but think about the data you’re using.

Here’s an interesting article about the problems that might crop up if the wrong measure of central tendency is used.

Data

The perils of impossible target setting – a true story

Some time ago, in a meeting we were looking at some RAG (Red, Amber, Green) style reporting that measured the data quality of input into one of our systems.  I can’t be more specific, or I’d have to kill you all.

The report we were observing had thresholds set for each of the coloured indicators.  The report ran green for zero errors, amber for between 1 and 20, and red for anything over 20.  That’s fine.  Absolutely nothing wrong with that.

The only thing was, upon looking at the reports, the measure had been red for the past twelve months reported on, and was nowhere near getting close to amber, let alone green.  When asked about whether it would be possible at this moment to ever get to green, the person responsible for the reporting and analysis simply exclaimed “No”.  The report was based on the analysis of tens of thousands of entries, of which there were generally always about 5% containing some kind of error at any time (that’s over 500 errors each month – nowhere near 20).  This is the way it had always been, and would likely be for the near future.
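To make the gap concrete, here is a small sketch in R of the thresholds as described: green for zero errors, amber for 1 to 20, red for anything over 20. The function name rag_status is hypothetical and not taken from the actual reporting system, which I can’t show.

```r
# Map a monthly error count to a RAG status using the report's thresholds:
# green for zero errors, amber for 1 to 20, red for anything over 20.
# 'rag_status' is a hypothetical helper, not part of any reporting tool.
rag_status <- function(errors) {
  stopifnot(is.numeric(errors), all(errors >= 0))
  ifelse(errors == 0, "green",
         ifelse(errors <= 20, "amber", "red"))
}

# A few sample counts, including a month with around 500 errors
statuses <- rag_status(c(0, 1, 20, 21, 500))
print(statuses)
```

A month with around 500 errors sits in the same band as one with 21, so the report can’t show whether things are improving or getting worse.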

This got me thinking.. If it isn’t possible to ever get to green at this moment in time, or even amber, then why have them in the report in the first place?  Red indicates bad, yes, that’s a given.  But what if bad is never going to get better?

Here’s my suggestion. Find an acceptable level of good (or even great) which will work for the short to medium term and set that to green. If you are always hitting 5% errors, then set green for 3.5%, amber for 4% and red for 5% and above, or something to that effect. You should ideally do some analysis to look at how the figures vary and set your new thresholds to something sensible and realistic. This way you aren’t expecting the impossible from your team; rather, you are setting them a goal which might actually be achievable.

If, and only if, you ever get to the point that staff are consistently hitting green, you might then want to think about adjusting your thresholds accordingly and maybe, maybe, at some point you might get close to your initial thresholds. If, on the other hand, you keep them as they are, your staff will ignore the results of the report because “we never get anything but red, so why bother?” Using the current situation/context as a baseline for target setting will make it much more likely that over time the number of amber and red figures will reduce and your figures will hopefully start to look much more healthy.
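As a sketch of what rate-based thresholds might look like in R, the function below takes an error count and a number of entries and returns a colour. The cut-offs are the example figures from above, with one assumption added for the sake of the code: anything between the green and red cut-offs is treated as amber. The name rag_by_rate, its parameters and the monthly figures are all made up for illustration.

```r
# Rate-based RAG status. The default cut-offs follow the example figures
# (green at 3.5%, red at 5% and above). Assumption: everything in between
# counts as amber. 'rag_by_rate' is a hypothetical helper.
rag_by_rate <- function(errors, entries, green_max = 0.035, red_min = 0.05) {
  stopifnot(all(entries > 0), all(errors >= 0), green_max < red_min)
  rate <- errors / entries
  ifelse(rate <= green_max, "green",
         ifelse(rate < red_min, "amber", "red"))
}

# Made-up monthly figures: 10,000 entries with 300, 420 and 500 errors
monthly <- rag_by_rate(c(300, 420, 500), 10000)
print(monthly)
```

Keeping the cut-offs as parameters means they can be tightened later, once green is being hit consistently, without rewriting the report logic.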

Also, ensuring that red does actually mean bad based on your current thresholds means that you can put measures in place to prevent bad from happening next month.  If it is always red, what actions will you ever put in place to see some kind of noticeable improvement?  None.  I know.  I’ve seen it happen.  Momentum always seems to be lost when you are continually fighting a losing battle.  Provide an opportunity to win and you will more likely see positive results.

I didn’t speak up about it in the meeting, but others did, and the reporting team said they would look into it.

Let’s hope common sense prevails.

Data R

Quick R Tip – easily read files using file.choose()

I’ve been using R recently for a few things and the one thing I’ve often struggled with is getting the software to read external datasets (such as csv files) to work on.  The general code I’ve been using to do this is:-

data <- read.csv("My file path and file name")

but it’s a real pain in the backside to do and you can get some pretty weird errors if you don’t get the path and filename correct.

Anyway, today I found a much nicer way of loading in my data:-

data <- read.csv(file.choose())

Which pops up a standard Windows file dialog so that I can just select a file to read in.  Much easier and less prone to errors and resulting frustrated hair-pulling, I think.
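One caveat: as far as I can tell, file.choose() needs an interactive session, so a script that runs unattended still needs a real path. Here is a minimal sketch of a wrapper that takes a path if one is given and only falls back to the file picker otherwise. The function name read_csv_or_choose is hypothetical, and the temporary file is only there so the example runs without a dialog.

```r
# 'read_csv_or_choose' is a hypothetical helper: use the path if one is
# given, otherwise fall back to the file picker in an interactive session.
read_csv_or_choose <- function(path = NULL) {
  if (is.null(path)) {
    if (!interactive()) {
      stop("No path supplied and no interactive session for file.choose()")
    }
    path <- file.choose()
  }
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  read.csv(path)
}

# Demonstration with a temporary file, so it runs without a dialog
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), tmp, row.names = FALSE)
loaded <- read_csv_or_choose(tmp)
print(loaded)
```

Calling read_csv_or_choose() with no arguments at the console behaves like read.csv(file.choose()), while the explicit file.exists() check gives a clearer message than the weird errors you get from a mistyped path.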

Infographics

Infographics resume: Design heaven, or visual data hell?

As I learn more about infographics and the visual communication of data, I often find myself torn between what I like to look at and what is functional in terms of getting a message across.

My post the other day about debts was a perfect example of the feeling I get.  Whilst the aesthetics of the piece were ok and made the point clear, I also felt that the actual data behind the graphic could be displayed in a simpler and clearer format.

Today, I came across this infographic posing as a resume/resume posing as an infographic:-

[Image: infographics resume]

Would this impress you if you were wanting to employ a graphic designer?  What would you be focusing on, form, or function?

As a person with little graphic design experience, but one who knows what he likes in terms of both aesthetics and communication of information, this does very little for me. The use of three dimensions doesn’t make anything particularly clear. I think the donut in the bottom right is difficult to interpret for too many reasons to list, and the right-hand section takes more time to read than I feel it should, when I think describing your career path to date would maybe serve better as text than graphics.

The ‘Others Skill’ section bugs me, too. The chart axes are in two dimensions, but the bars are represented in 3d.  To me it doesn’t work – stick with one or the other.  In fact, stick with 2d, because adding a third dimension adds nothing to the information presented other than confusion.

Having said all this, as an overall piece it does demonstrate the creator’s ability to put together infographics/design work, but I don’t know if it’s good from a graphic design perspective or not. I’m just speaking from the point of view of somebody who wants to get information from these kinds of things, and I feel some of the skills demonstrated don’t lend themselves particularly well to that purpose.

I feel like I’m being over-picky.  Am I being over-picky?  I’m definitely being over-picky, aren’t I?