We started by identifying data types as either categorical or quantitative. We then learned that quantitative variables can be either continuous or discrete, and that categorical variables can be either ordinal or nominal.
When analysing categorical variables, we commonly just look at the count or percent of a group that falls into each level of a category. For example, if a dog category had two levels, lab and not lab, we might say that 32% of the dogs were labs (percent), or that 32 of the 100 dogs we saw were labs (count). However, the four aspects used to describe quantitative variables are not used to describe categorical variables.
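As a minimal sketch of the count and percent summaries described above, the dog example can be tallied with Python's standard library. The list `dogs` is hypothetical data built to match the 32-of-100 example; it is not a dataset from the lesson.

```python
from collections import Counter

# Hypothetical observations: 32 labs and 68 other dogs, matching the example above.
dogs = ["lab"] * 32 + ["not lab"] * 68

counts = Counter(dogs)            # count of observations in each level
total = sum(counts.values())      # total number of dogs observed

# Percent of the group that falls into each level of the category.
percents = {level: 100 * n / total for level, n in counts.items()}

print(f"{counts['lab']} of the {total} dogs were labs (count)")
print(f"{percents['lab']:.0f}% of the dogs were labs (percent)")
```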
Then we learned there are four main aspects used to describe quantitative variables:
- Measures of Center.
- Measures of Spread.
- Shape of the Distribution.
- Outliers.
We looked at calculating measures of Center, such as the mean and the median.
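A minimal sketch of how such measures can be computed, using Python's built-in `statistics` module. The sample `data` is hypothetical and chosen only for illustration; the mode is included as one more common measure of center.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of the values divided by how many there are
median = statistics.median(data)  # middle value (average of the two middle values here)
mode = statistics.mode(data)      # most frequently occurring value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```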
We also looked at calculating measures of Spread:
- Interquartile Range.
- Standard Deviation.
Standard Deviation vs. Variance.
The standard deviation is the square root of the variance. In practice, we usually report the standard deviation rather than the variance, because the standard deviation has the same units as the original data, while the variance has squared units.
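The relationship between the two can be checked with a short sketch. The heights below are hypothetical, and the population versions of both statistics (`pvariance`, `pstdev`) are used as an assumption; the sample versions (`variance`, `stdev`) are related in the same way.

```python
import math
import statistics

# Hypothetical heights in centimetres.
heights_cm = [150, 160, 170, 180, 190]

variance = statistics.pvariance(heights_cm)  # population variance, in cm squared
std_dev = statistics.pstdev(heights_cm)      # population standard deviation, in cm

# The standard deviation is the square root of the variance.
same = math.isclose(std_dev, math.sqrt(variance))

print(f"variance: {variance} cm^2")
print(f"standard deviation: {std_dev:.2f} cm")
print("std_dev equals sqrt(variance):", same)
```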
We learned that the distribution of our data is frequently associated with one of three shapes:
- Right-skewed.
- Left-skewed.
- Symmetric (frequently normally distributed).
Depending on the shape associated with our dataset, certain measures of center or spread may be better for summarizing our dataset.
When our data follows a normal distribution, the mean and standard deviation are enough to describe the distribution fully.
However, if our dataset is skewed, the 5-number summary (and the median, the measure of center that is part of it) might summarize our dataset better.
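A sketch of a 5-number summary for a skewed sample follows. The helper `five_number_summary` and the data are hypothetical. Quartile conventions differ between textbooks and tools; this sketch assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), so other methods may give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 splits the data into quartiles; "inclusive" is one of several conventions.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical right-skewed sample: most values are small, a few are large.
data = [1, 2, 2, 3, 4, 5, 7, 12, 30]

minimum, q1, median, q3, maximum = five_number_summary(data)
iqr = q3 - q1                        # interquartile range
mean = statistics.mean(data)         # pulled toward the long right tail

print("5-number summary:", (minimum, q1, median, q3, maximum))
print("IQR:", iqr)
print(f"mean: {mean:.2f}, median: {median}")
```

In this sample the mean sits well above the median, which is the pattern we expect when the data are right-skewed.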
We learned that outliers have a larger influence on measures like the mean than on measures like the median. We learned that we should handle outliers on a case-by-case basis. Common techniques include:
- At a minimum, note that they exist and how they affect summary statistics.
- If an outlier is a typo, remove or fix it.
- Understand why they exist, and how they affect the questions we are trying to answer about our data.
- When we have outliers, the 5-number summary is often a better description of the data than measures like the mean and standard deviation.
- Be careful in reporting. Know how to ask the right questions.
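The different influence of an outlier on the mean and the median can be seen in a small sketch. Both lists are hypothetical; the value 250 stands in for an outlier, for example a typo for 25.

```python
import statistics

# Hypothetical values; 250 stands in for an outlier (for example, a typo for 25).
clean = [21, 23, 24, 25, 27]
with_outlier = [21, 23, 24, 250, 27]

# How far each measure of center moves when the outlier is present.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print("shift in mean:", mean_shift)
print("shift in median:", median_shift)
```

A single extreme value moves the mean substantially while leaving the median unchanged in this example, which is why noting the outlier and its impact on summary statistics matters.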
Histograms and Box Plots.
We also looked at histograms and box plots to visualize our quantitative data. Identifying outliers and the shape of the distribution is easier with a visual than with summary statistics alone.
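As a rough illustration of why a visual helps, the sketch below draws a text-only histogram with the standard library. The helper `text_histogram`, the bin width, and the data are all hypothetical, and the helper assumes integer values; a plotting library would normally be used for real histograms and box plots.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Return one line per bin, with one '#' per observation in that bin."""
    # Map each integer value to the lower edge of its bin, then count per bin.
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    # Walk every bin from the lowest to the highest, including empty ones.
    for start in range(min(bins), max(bins) + bin_width, bin_width):
        label = f"{start:>3}-{start + bin_width - 1:<3}"
        lines.append(f"{label} {'#' * bins[start]}")
    return lines

# Hypothetical right-skewed sample with one value far from the rest.
data = [1, 2, 2, 3, 4, 5, 7, 12, 30]

lines = text_histogram(data, bin_width=5)
for line in lines:
    print(line)
```

The long run of empty bins before the last bar makes both the right skew and the isolated value at 30 easy to spot, which is harder to see from a mean or standard deviation alone.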