Frequency distributions are common in research and statistics. Here are some useful notes about them.
A frequency distribution shows how many items fall into each of a set of related groups, for example how many people in a room fall into different weight bands. It is often displayed as a set of bars (a histogram) or as a smoothed line.
The shape of the distribution can often be described mathematically. A very common shape is the Normal (or Gaussian) distribution, which is bell-shaped and symmetrical.
Knowing where the middle of the distribution lies is useful, as it gives a single number that can be used to represent the whole data set.
Three common measures of centrality are the mean, the median and the mode. With a symmetrical distribution (e.g. the Normal distribution), these are all equal.
The mean, or average, is calculated as the sum of the scores divided by the number of data items (SUM(X)/N). It takes every data item into account, but can be distorted by distant outliers, especially when the data set is small.
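As a small sketch of this outlier sensitivity, using Python's standard statistics module (the scores here are made-up illustration data):

```python
from statistics import mean

scores = [4, 5, 5, 6, 5, 6, 4]
print(mean(scores))  # 5

# A single distant outlier drags the mean well away from the bulk of the data:
with_outlier = scores + [100]
print(mean(with_outlier))  # 16.875
```

One extreme value among eight has more than tripled the mean, even though seven of the eight scores sit between 4 and 6.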
The median is found by arranging the data in numeric order, then selecting the middle number (or the average of the two middle numbers when there is an even number of data items).
This is useful when you want to divide items into equal sized groups, for example to be able to select the 'top half' of scores.
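Both the odd and even cases can be seen with Python's standard statistics module (the numbers are illustrative):

```python
from statistics import median

odd = [3, 1, 7, 5, 9]   # sorted: 1 3 5 7 9 -> the middle number is 5
even = [3, 1, 7, 5]     # sorted: 1 3 5 7  -> average of the two middle numbers
print(median(odd))   # 5
print(median(even))  # 4.0
```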
The mode is the most common score and is useful when you want to answer questions about 'most popular' or 'most common' items. When scores are on a continuous scale, then (as with histograms) this is calculated as the most common range of scores.
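A quick sketch of both cases, again with the standard statistics module (the colour and weight data are made up; the 5-unit bin width is an arbitrary choice for illustration):

```python
from statistics import mode

# Categorical data: the mode is simply the most frequent item
favourites = ["red", "blue", "red", "green", "red", "blue"]
print(mode(favourites))  # red

# Continuous data: bin into ranges first (as a histogram does),
# then take the most common bin
weights = [61.2, 64.8, 65.5, 70.1, 66.0, 64.2, 72.9]
bins = [int(w // 5) * 5 for w in weights]  # 5-unit bins: 60, 65, 70, ...
print(mode(bins))  # 60
```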
When the histogram has multiple peaks, it is called multi-modal. Where there are only two peaks, this is called bimodal. Multiple peaks can signify multiple processes or situations being identified within one measure.
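Python's `statistics.multimode` (available from Python 3.8) reports all of the equally common peaks, which makes a bimodal data set easy to spot (the scores below are an invented example):

```python
from statistics import multimode

# Two equally common scores -> a bimodal data set
scores = [2, 3, 3, 3, 7, 8, 8, 8, 9]
print(multimode(scores))  # [3, 8]
```

Two clusters like this might suggest, for example, that two different groups or processes have been mixed in one measure.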
As well as centrality, the way the distribution is spread is often important to understand.
Skew is a measure of how asymmetric the distribution is: how much the bars are bunched up at one end, with a long tail at the other.
Skew is zero in a Normal distribution. It is positive when the bars are higher on the left (with a long right tail) and negative when the scores are higher on the right (with a long left tail).
Kurtosis is a measure of how 'pointed' the distribution is. It is usually reported as excess kurtosis, which is zero for a Normal distribution. A positive value indicates a more pointed distribution, whilst a negative value indicates a flatter distribution.
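Both measures can be computed from the standardized moments of the data. As a sketch (these are the simple population formulas; statistics packages often apply small-sample corrections, and the data sets below are invented):

```python
from statistics import mean

def skewness(xs):
    # Third standardized moment: positive -> longer right tail
    m = mean(xs)
    m2 = mean((x - m) ** 2 for x in xs)
    m3 = mean((x - m) ** 3 for x in xs)
    return m3 / m2 ** 1.5

def excess_kurtosis(xs):
    # Fourth standardized moment minus 3, so a Normal distribution scores 0
    m = mean(xs)
    m2 = mean((x - m) ** 2 for x in xs)
    m4 = mean((x - m) ** 4 for x in xs)
    return m4 / m2 ** 2 - 3

print(skewness([1, 2, 3, 4, 5]))    # 0.0 (symmetrical)
print(skewness([1, 1, 2, 2, 10]))   # positive (long right tail)
print(excess_kurtosis([1, 2, 3, 4, 5]))  # negative (flatter than Normal)
```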
When the data set is not normally distributed, it may be possible to apply a mathematical transformation that makes the data approximately Normal. This may seem like a fudge, but the principle is statistically quite valid.
The following methods can be used to reduce positive skew. To transform negative skew, first do a reversal by subtracting each score from the highest score (or a convenient higher number).
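The reversal step can be sketched as follows (the scores are invented; using max + 1 rather than the maximum itself is an arbitrary choice that keeps every reflected value positive, which helps if a log transformation follows):

```python
# Reflect negatively-skewed data so its long tail points right:
scores = [2, 7, 8, 8, 9, 9, 10]                 # long left tail (negative skew)
reflected = [max(scores) + 1 - x for x in scores]
print(reflected)  # [9, 4, 3, 3, 2, 2, 1] -- the long tail is now on the right
```

After applying one of the transformations below, remember that the scores are still reversed: the largest transformed value corresponds to the smallest original score.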
Taking the logarithm of a set of numbers pulls in the right tail of the distribution and is hence useful for reducing positive skew.
Logs cannot be taken of zero or negative values, so all data for log transformation must be positive. A way around this is to add a fixed number to every data item, effectively shifting the whole distribution to the right.
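As a sketch of the log transformation with a shift (the scores and the shift of 1 are illustrative choices):

```python
import math

scores = [0, 1, 3, 9, 27, 81]   # strong positive skew, and includes a zero
shift = 1                        # add a constant so every value is > 0
logged = [math.log(x + shift) for x in scores]
print(logged)
```

The widely-spaced large values are pulled much closer together, while the small values barely move.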
The square root of a set of numbers reduces big numbers more than small numbers. This makes it useful for correcting positively-skewed data.
Square roots cannot be taken of negative numbers, so the same shifting approach as with logarithms may be used.
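A minimal sketch of the square-root transformation (illustrative numbers):

```python
import math

scores = [1, 4, 9, 100]
roots = [math.sqrt(x) for x in scores]
print(roots)  # [1.0, 2.0, 3.0, 10.0] -- the gap from 9 to 100 shrinks to 3 to 10
```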
Inverting scores (1/x) balances around the number 1 -- numbers greater than 1 turn into a fraction, whilst fractions turn into numbers greater than 1. Very small fractions become very large numbers and vice versa.
This method is thus best when all of the numbers are below 1 or all are above 1. Note that inversion also reverses the order of the scores (the largest score becomes the smallest).
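As a sketch of the reciprocal transformation on an all-above-1 data set (illustrative numbers):

```python
scores = [2, 4, 10, 50]          # all above 1, with a long right tail
inverted = [1 / x for x in scores]
print(inverted)  # [0.5, 0.25, 0.1, 0.02]
```

The distant value 50 is compressed to 0.02, close to the rest of the transformed data, though the rank order of the scores has been reversed.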