Creating data visualisation beyond default
In this take-home exercise, we are going to reveal the demographic of the city of Engagement, Ohio USA by using appropriate static statistical graphics methods.
The data is processed by using appropriate tidyverse family of packages, whereas the statistical graphics are prepared using ggplot2 and its extensions.
Engagement city does not physically exist in Google Maps. It often refers to Urbana which is located halfway between Dayton and Marion (dating and marrying). Source: Barry Popik
Photo: Greenwich Mean Time
Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.
The packages required for this exercise are tidyverse, ggdist and hrbrthemes.
packages = c('tidyverse','ggdist','hrbrthemes')
for(p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The original dataset was obtained from VAST Challenge 2022 in csv format. It consists of basic information about the residents of Engagement, OH that have agreed to participate in this study.
The code chunk below imports dataset into R by using
read_csv() of readr and saves it as a
tibble data frame called demographics. It consists of 1011
records as shown below.
demographics <- read_csv("data/Participants.csv")
glimpse(demographics)
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
For easier understanding, the age distribution of the population is further split into 4 broad age groups of similar distribution, namely 30 and below, 31-40, 41-50, and 51 and above.
To do this, mutate() of dplyr adds a
new variable called age_category while still preserves the
existing variables - age.
demographics <- demographics %>%
mutate(age_category=cut(age,
breaks=c(17, 30, 40, 50, 60),
labels=c('30 and below','31-40','41-50','51 and above')))
Basic statistical graphs are derived from ggplot2,
such as geom_bar() to create a bar chart,
geom_histogram() to create a histogram,
geom_text() to display a text, geom_hline() to
create a horizontal line, and coord_flip() to rotate
vertical plots into horizontal plots.
Next, we also explore and go beyond basic statistical graphs by using
ggdist
package, such as stat_halfeye() to create a half violin
plot, as well as stat_dots().
The label on x-axis, label on y-axis and title are set by
xlab(), ylab() and ggtitle()
respectively, whereas the base theme used below is
theme_ipsum() of hrbrthemes
which is an Arial Narrow-based theme.
See Appendix section for more details.
Out of 1011 Engagement residents participated in this study, about 70% of them do not have kids. Furthermore, the second graph shows that among these 710 residents without kids, 337 of them are currently single.
Two figures below show the happiness level across different age groups and education levels.
It is observed that the happiness level is almost equally distributed among young adults aged 30 and below, while on the other hand, most of the elderlies seem to be unhappy. In addition, unhappiness is also associated with low-educated residents.
https://isss608-ay2021-22april.netlify.app/hands-on_ex/hands-on_ex01/hands-on_ex01#1
https://isss608-ay2021-22april.netlify.app/hands-on_ex/hands-on_ex02/hands-on_ex02#1
https://rstudio.github.io/distill/basics.html
https://www.r-bloggers.com/2021/09/adding-text-labels-to-ggplot2-bar-chart/
https://www.statology.org/r-create-categorical-variable-from-continuous/
The code chunk below displays the number of Engagement residents that have kids.
ggplot(data = demographics,
aes(x = haveKids)) +
geom_bar() +
geom_text(aes(label=..count..),
stat="count",
vjust=1.5,
color="white") +
xlab("Residents Have Kids") +
ylab("No. of\nResidents") +
ggtitle("Only 30% of Engagement Residents Have Kids",
subtitle = "A survey of 1011 Residents Taken in 2022") +
theme_ipsum(plot_title_size = 14,
axis_title_face = 24,
base_size=10,
grid="Y")
The code chunk below displays the household distribution among Engagement residents.
ggplot(data = demographics,
aes(x = householdSize)) +
geom_bar() +
geom_text(aes(label=..count..),
stat="count",
vjust=1.5,
color="white") +
xlab("Household Size (Persons)") +
ylab("No. of\nResidents") +
ggtitle("Most of Households Consist of 2 Person",
subtitle = "A survey of 1011 Residents Taken in 2022") +
theme_ipsum(plot_title_size = 14,
axis_title_face = 24,
base_size=10,
grid="Y")
The code chunk below displays happiness level percentage across different age groups.
ggplot(data = demographics,
aes(x = joviality*100))+
geom_histogram(bins=20) +
facet_grid(~ age_category) +
xlab("Happiness Level Percentage") +
ylab("No. of\nResidents") +
scale_x_continuous(breaks = c(0,50,100)) +
ggtitle("Elderlies are Getting Unhappy",
subtitle="Happiness Level Across Different Age Groups") +
theme_ipsum(plot_title_size = 14,
axis_title_face = 24,
strip_text_size = 10,
base_size=10,
plot_margin = margin(5,5,5,5))
The code chunk below displays happiness level percentage across different education levels.
ggplot(data=demographics,
aes(x = educationLevel, y = joviality*100)) +
scale_y_continuous(breaks = seq(0, 100, 50),
limits = c(0, 100)) +
stat_halfeye(adjust = .33,
width = .67,
color = NA,
justification = -.01,
position = position_nudge(
x = .15)) +
stat_dots(side = "left",
justification = 1.1,
binwidth = .25,
dotsize = 5) +
xlab("Education Level") +
ylab("Happiness Level\nPercentage") +
ggtitle("Low-educated Residents Tend to be Unhappy",
subtitle="Happiness Level Across Different Education Levels") +
theme_ipsum(plot_title_size = 14,
axis_title_face = 24,
strip_text_size = 10,
base_size=10) +
geom_hline(aes(yintercept=mean(joviality,
na.rm=T)*100),
color="red",
linetype="dashed",
size=0.5) +
geom_text(aes(5,mean(joviality)*100,
label="Average", vjust=-0.5), size=3)+
coord_flip()