Take-home Exercise 2

Creating data visualisation beyond default

Antonius Handy (https://www.linkedin.com/in/antoniushandy), Singapore Management University, Master of IT in Business (https://scis.smu.edu.sg/master-it-business)
2022-05-01

1. Overview

As in take-home exercise 1, we are still interested in the demographics of the city of Engagement, Ohio, USA, but this time we will critique and make over the data visualisations made by one of our classmates.

The data are processed using appropriate packages from the tidyverse family, and the statistical graphics are prepared using ggplot2 and its extensions.

2. Getting Started

Before we get started, we need to ensure that the required R packages are installed. The code chunk below checks each package: if it is already installed, it is loaded; if not, it is installed first and then loaded into the R environment.

The packages required for this exercise are tidyverse and ggridges.

packages <- c('tidyverse', 'ggridges')

for (p in packages) {
  # require() returns FALSE when the package cannot be loaded, so install it first
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

3. Data

3.1 Data Source

The original datasets were obtained from VAST Challenge 2022 in CSV format. They contain basic information about the residents of Engagement, OH who have agreed to participate in this study.

3.2 Importing Data

The code chunk below imports two datasets, Participants.csv and Jobs.csv, into R using read_csv() of readr and saves them as tibble data frames called demographics and jobs respectively. The demographics dataset consists of 1,011 records, whereas the jobs dataset consists of 1,328 records, as shown below.

demographics <- read_csv("data/Participants.csv")
glimpse(demographics)
Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
jobs <- read_csv("data/Jobs.csv")
glimpse(jobs)
Rows: 1,328
Columns: 7
$ jobId                <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1~
$ employerId           <dbl> 379, 379, 380, 380, 381, 381, 381, 381,~
$ hourlyRate           <dbl> 10.00000, 22.21763, 10.00000, 15.31207,~
$ startTime            <time> 07:46:00, 07:31:00, 08:00:00, 07:39:00~
$ endTime              <time> 15:46:00, 15:31:00, 16:00:00, 15:39:00~
$ daysToWork           <chr> "[Monday,Tuesday,Wednesday,Thursday,Fri~
$ educationRequirement <chr> "HighSchoolOrCollege", "Bachelors", "Ba~

3.3 Data Wrangling

Column names are renamed so that the first letter of each word is capitalised. To do this, we use rename() of dplyr.

demographics <- demographics %>%
  rename('ParticipantID' = 'participantId', 
         'HouseholdSize' = 'householdSize', 
         'HaveKids' = 'haveKids', 
         'Age' = 'age', 
         'EducationLevel' = 'educationLevel', 
         'InterestGroup' = 'interestGroup', 
         'Joviality' = 'joviality')
jobs <- jobs %>%
  rename('JobID' = 'jobId', 
         'EmployerID' = 'employerId', 
         'HourlyRate' = 'hourlyRate', 
         'StartTime' = 'startTime', 
         'EndTime' = 'endTime', 
         'DaysToWork' = 'daysToWork', 
         'EducationRequirement' = 'educationRequirement')
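
When many columns follow the same naming pattern, the renaming can also be done programmatically. The following is a minimal sketch, not the approach used above: it assumes a made-up toy tibble and a hypothetical helper called capitalise_first(), and applies the helper to every column name with rename_with() of dplyr. Note that it only capitalises the first character, so participantId becomes ParticipantId rather than ParticipantID.

```r
library(dplyr)

# Toy tibble in the same lower-camel-case style as Participants.csv
# (the values are made up for illustration).
toy <- tibble(participantId = 0:2,
              householdSize = c(3, 2, 1),
              haveKids      = c(TRUE, FALSE, FALSE))

# Hypothetical helper: upper-case the first character of each name.
capitalise_first <- function(x) {
  paste0(toupper(substr(x, 1, 1)), substring(x, 2))
}

# rename_with() applies the function to all column names at once.
toy_renamed <- toy %>% rename_with(capitalise_first)
names(toy_renamed)
```

Names that need special handling, such as an all-capitals ID suffix, would still need an explicit rename() afterwards.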

4. Demographic Visualisation

4.1 Age Distribution and Having Kids Status

ORIGINAL DATA VISUALISATION

CLARITY

AESTHETIC

MAKEOVER DESIGN

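In this makeover design, we set the color argument of geom_histogram() to outline each bar. We also use theme() to rotate the y-axis title so that it reads horizontally, geom_vline() to draw a dashed line at the mean age, and annotate() to label that line ‘Average’. A fixed label is drawn with annotate() rather than geom_text(), because geom_text() with aes() would redraw the same label once for every row of the data.

ggplot(data = demographics, 
       aes(x = Age, fill = HaveKids)) +
  geom_histogram(bins = 20,
                 color = "grey20") +
  labs(x = "Age",
       y = "No. of\n People",
       title = "Most People Do Not Have Kids",
       fill = "Have Kids") +
  theme(axis.title.y = element_text(angle = 0)) +
  # Dashed line at the mean age (linewidth needs ggplot2 >= 3.4.0; use size on older versions)
  geom_vline(xintercept = mean(demographics$Age, na.rm = TRUE),
             color = "black", 
             linetype = "dashed", 
             linewidth = 0.5) +
  # Single text label placed next to the mean line
  annotate("text", x = 42, y = 85,
           label = "Average", size = 3.5)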
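
The label position in the chart above is hard-coded at x = 42 and y = 85, which only suits this particular dataset and bin count. As a hedged alternative, the sketch below computes the mean once and derives the label position from it. The toy data frame and the variable names mean_age and p are assumptions made for illustration only.

```r
library(ggplot2)

# Made-up ages and flags standing in for the demographics tibble.
toy <- data.frame(
  Age      = c(21, 25, 32, 36, 43, 48, 52, 60),
  HaveKids = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Compute the mean once so the line and the label always agree.
mean_age <- mean(toy$Age, na.rm = TRUE)

p <- ggplot(toy, aes(x = Age, fill = HaveKids)) +
  geom_histogram(bins = 5, color = "grey20") +
  geom_vline(xintercept = mean_age, linetype = "dashed") +
  # y = Inf pins the label to the top of the panel; vjust pulls it back inside,
  # and hjust = 0 left-aligns it just to the right of the line.
  annotate("text", x = mean_age + 1, y = Inf, vjust = 1.5, hjust = 0,
           label = paste("Average:", round(mean_age, 1)), size = 3.5)
p
```

Because the position depends on mean_age rather than fixed coordinates, the label follows the line if the data change.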

4.2 Education Level Distribution

ORIGINAL DATA VISUALISATION

CLARITY

AESTHETIC

MAKEOVER DESIGN

In this makeover design, we use theme() to rotate the y-axis title so that it reads horizontally and geom_text() to display the frequency on top of each bar. The bars are ordered from the most to the least frequent education level.

ggplot(data = demographics, 
       # Order the education levels by descending frequency
       aes(x = reorder(EducationLevel, EducationLevel, function(x) -length(x)))) +
  geom_bar(fill = "lightblue") +
  labs(x = "Education Level",
       y = "No. of\n People",
       title = "Most People Have High School or College Degree") +
  theme(axis.title.y = element_text(angle = 0)) +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(aes(label = after_stat(count)),
            stat = "count",
            vjust = -0.3)
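
The numbers printed on top of the bars are simply the row counts per education level. To check them without drawing the chart, the same table can be produced with count() of dplyr. This is a minimal sketch on made-up data; the name education_counts is an assumption.

```r
library(dplyr)

# Made-up education levels standing in for the EducationLevel column.
toy <- tibble(EducationLevel = c("Bachelors", "HighSchoolOrCollege", "Graduate",
                                 "HighSchoolOrCollege", "Bachelors",
                                 "HighSchoolOrCollege"))

# sort = TRUE lists the most frequent level first,
# matching the left-to-right order of the bars.
education_counts <- toy %>% count(EducationLevel, sort = TRUE)
education_counts
```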

4.3 How Does Education Level Affect Joviality?

ORIGINAL DATA VISUALISATION

CLARITY

AESTHETIC

MAKEOVER DESIGN

In this makeover design, we multiply joviality by 100 to express it as a percentage. We then use theme() to rotate the y-axis title so that it reads horizontally and stat_summary() to mark and label the mean value on each boxplot.

ggplot(data = demographics, 
       aes(x = reorder(EducationLevel, -Joviality), y = Joviality * 100)) +
  geom_boxplot() +
  # Red point at the mean of each education level
  stat_summary(geom = "point",
               fun = "mean",
               colour = "red",
               size = 1.5) +
  # Mean value printed above the point (fun, show.legend and after_stat()
  # replace the deprecated fun.y, show_guide and ..y..)
  stat_summary(fun = mean, colour = "darkred", geom = "text", show.legend = FALSE, 
               vjust = -0.7, aes(label = round(after_stat(y), digits = 3))) +
  labs(x = "Education Level",
       y = "Joviality\n Percentage",
       title = "Graduates are the Most Jovial on Average") +
  theme(axis.title.y = element_text(angle = 0))
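
The means that stat_summary() prints on the chart can be cross-checked with a grouped summary. The sketch below uses made-up joviality values, and the column name MeanJovialityPct is an assumption; it is meant only to show the calculation behind the labels.

```r
library(dplyr)

# Made-up joviality scores (0 to 1) for three education levels.
toy <- tibble(
  EducationLevel = c("Bachelors", "Bachelors", "Graduate", "Graduate",
                     "HighSchoolOrCollege", "HighSchoolOrCollege"),
  Joviality      = c(0.40, 0.60, 0.70, 0.90, 0.20, 0.40)
)

# Mean joviality percentage per level, highest first.
mean_joviality <- toy %>%
  group_by(EducationLevel) %>%
  summarise(MeanJovialityPct = mean(Joviality * 100), .groups = "drop") %>%
  arrange(desc(MeanJovialityPct))
mean_joviality
```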

4.4 How Does Education Level Affect Hourly Wage?

ORIGINAL DATA VISUALISATION

CLARITY

AESTHETIC

MAKEOVER DESIGN

As with the first chart, this ridge plot makeover uses theme() to rotate the y-axis title so that it reads horizontally, geom_vline() to draw a dashed line at the mean hourly rate, and annotate() to label that line ‘Average’.

ggplot(data = jobs,
       aes(x = HourlyRate, y = reorder(EducationRequirement, -HourlyRate))) +
  geom_density_ridges(rel_min_height = 0.01) +
  labs(x = "Hourly Rate",
       y = "Education\n Requirement", 
       title = "Graduates and Bachelors Earn Higher Hourly Wages") +
  theme(axis.title.y = element_text(angle = 0)) +
  # Dashed line at the overall mean hourly rate
  # (linewidth needs ggplot2 >= 3.4.0; use size on older versions)
  geom_vline(xintercept = mean(jobs$HourlyRate, na.rm = TRUE),
             color = "black", 
             linetype = "dashed", 
             linewidth = 0.5) +
  # Single text label placed next to the mean line
  annotate("text", x = 25, y = 5.5, label = "Average", size = 3)
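
The order of the ridges is controlled by reorder(), which sorts the factor levels by the mean of its second argument in ascending order. Negating the value therefore puts the level with the highest mean first, and ggplot2 draws the first level at the bottom of a discrete y-axis. The sketch below illustrates this with made-up hourly rates; the vector names are assumptions.

```r
# reorder() lives in the stats package, which is attached by default.

# Made-up education requirements and hourly rates.
education   <- c("Bachelors", "Graduate", "HighSchoolOrCollege",
                 "Bachelors", "Graduate", "HighSchoolOrCollege")
hourly_rate <- c(20, 30, 10, 24, 34, 12)

# Levels are sorted by the mean of -hourly_rate, lowest first,
# so the level with the highest mean hourly rate comes first.
ordered_levels <- levels(reorder(education, -hourly_rate))
ordered_levels
```

Dropping the minus sign reverses the order, which would place the highest-paid group at the top of the ridge plot instead.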
