1. Overview

Similar to take-home exercise 1, we are still interested in the demographic of the city of Engagement, Ohio USA but this time we will evaluate and make over the data visualisation made by one of our classmates.

The data is processed by using appropriate tidyverse family of packages, whereas the statistical graphics are prepared using ggplot2 and its extensions.

2. Getting Started

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

The packages required for this exercise are tidyverse and ggridges.

packages = c('tidyverse','ggridges')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3. Data

3.1 Data Source

The original datasets were obtained from VAST Challenge 2022 in csv format. It consists of basic information about the residents of Engagement, OH that have agreed to participate in this study.

3.2 Importing Data

The code chunk below imports 2 datasets, namely Participants.csv and Jobs.csv into R by using read_csv() of readr and saves it as tibble data frames called demographics and jobs respectively. Demographic dataset consists of 1011 records, whereas Jobs dataset consists of 1328 records as shown below.

demographics <- read_csv("data/Participants.csv")
glimpse(demographics)

Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~

jobs <- read_csv("data/Jobs.csv")
glimpse(jobs)

Rows: 1,328
Columns: 7
$ jobId                <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1~
$ employerId           <dbl> 379, 379, 380, 380, 381, 381, 381, 381,~
$ hourlyRate           <dbl> 10.00000, 22.21763, 10.00000, 15.31207,~
$ startTime            <time> 07:46:00, 07:31:00, 08:00:00, 07:39:00~
$ endTime              <time> 15:46:00, 15:31:00, 16:00:00, 15:39:00~
$ daysToWork           <chr> "[Monday,Tuesday,Wednesday,Thursday,Fri~
$ educationRequirement <chr> "HighSchoolOrCollege", "Bachelors", "Ba~

3.3 Data Wrangling

Column names are renamed such that the first letter of each word is capitalised. To do this, we will use a function called rename() of dplyr.

demographics <- demographics %>%
  rename('ParticipantID' = 'participantId', 
         'HouseholdSize' = 'householdSize', 
         'HaveKids' = 'haveKids', 
         'Age' = 'age', 
         'EducationLevel' = 'educationLevel', 
         'InterestGroup' = 'interestGroup', 
         'Joviality' = 'joviality')

jobs <- jobs %>%
  rename('JobID' = 'jobId', 
         'EmployerID' = 'employerId', 
         'HourlyRate' = 'hourlyRate', 
         'StartTime' = 'startTime', 
         'EndTime' = 'endTime', 
         'DaystoWork' = 'daysToWork', 
         'EducationRequirement' = 'educationRequirement')

4. Demographic Visualisation

4.1 Age Distribution and Having Kids Status

ORIGINAL DATA VISUALISATION

CLARITY

y-axis Range: Good visualisation as it starts from zero.
Legend: Legend is clearly shown next to the histogram.
Misleading Title: The word ‘versus’ is misleading since the y-axis only shows number of people. Instead, it would be better to write the key point of the bar chart.
Missing Mean/Median Line: Having the average and/or median line would be better for visualisation.

AESTHETIC

Bar Colour: Colour is appropriately selected to easily distinguish whether the person has kids or not.
Bar Outline: The outline colour should have been selected to show clearer border lines between each bar.
Legend Title: The legend title should have space in between words and the first letter of the word should be capitalised.
y-axis Label: It is more reader-friendly to write the label horizontally.

MAKEOVER DESIGN

In this makeover design, we add color in geom_histogram() to depict the outline of the bar. In addition, we also add some other functions such as theme() to rotate the y-axis label, geom_vline() to create a dashed mean line, and geom_text() to display the text ‘Average’.

ggplot(data = demographics, 
       aes(x = Age, fill = HaveKids)) +
  geom_histogram(bins = 20,
                 color = "grey20") +
  labs(x = "Age",
       y = "No. of\n People",
       title = "Most People Do Not Have Kids",
       fill = "Have Kids") +
  theme(axis.title.y= element_text(angle=0)) +
  geom_vline(aes(xintercept=mean(Age,
                                 na.rm=T)),
             color="black", 
             linetype="dashed", 
             size=0.5) +
  geom_text(aes(42,85,
                label="Average"), size=3.5)

4.2 Education Level Distribution

ORIGINAL DATA VISUALISATION

CLARITY

Title: The key point of the bar chart is clearly stated.
y-axis Range: Good visualisation as it starts from zero.
Bar Order: It shows an effective comparison as the bars are sorted by their respective frequency values in descending order.
Missing Frequency Values: Frequency value for each bar should be added to provide additional information.

AESTHETIC

y-axis Label: The label title is not consistent as in the previous chart, i.e. Number of People instead of No. of People. In addition, it is more reader-friendly to write the label horizontally.

MAKEOVER DESIGN

In this makeover design, we add some functions such as theme() to rotate the y-axis label and geom_text() to display the frequency on top of each bar.

ggplot(data = demographics, 
       aes(x = reorder(EducationLevel, EducationLevel, function(x)-length(x)))) +
  geom_bar(fill = "lightblue")+
  labs(x = "Education Level",
       y = "No. of\n People",
       title = "Most People Have High School or College Degree") +
  theme(axis.title.y= element_text(angle=0)) +
  geom_text(aes(label=..count..),
            stat="count",
            vjust=-0.3)

4.3 How Education Level Affects Joviality?

ORIGINAL DATA VISUALISATION

CLARITY

Title: The key point of the boxplot is clearly stated.
Boxplot Order: Similar to the previous bar chart, the boxplots are sorted in descending order for effective comparison.
Unclear Average Value: At a glance, the four red dots lie on the same horizontal line. It would be clearer if the average value was shown as well.
y-axis Range: The joviality index from 0 to 1 is quite difficult to understand. It would be better if the joviality index was expressed in terms of percentage, for instance, 100% joviality level.

AESTHETIC

Mean Point: Colour and size are appropriately selected to show where the average lies on each boxplot.
Title: The first letter of the word should be capitalised.
y-axis Label: It is more reader-friendly to write the label horizontally.

MAKEOVER DESIGN

In this makeover design, we multiply joviality by 100 to get joviality percentage. Next, we add some functions such as theme() to rotate the y-axis label and stat_summary() to display the average value on each boxplot.

ggplot(data = demographics, 
       aes( x =reorder(EducationLevel, -Joviality), y = Joviality*100)) +
  geom_boxplot()+
  stat_summary(geom = "point",
               fun = "mean",
               colour = "red",
               size = 1.5) +
  stat_summary(fun.y=mean, colour="darkred", geom="text", show_guide = FALSE, 
               vjust=-0.7, aes( label=round(..y.., digits=3))) +
  labs( x = "Education Level",
        y = "Joviality\n Percentage",
        title = "Graduates are the Most Jovial on Average") +
  theme(axis.title.y= element_text(angle=0))

4.4 How Education Level Affects Hourly Wage?

ORIGINAL DATA VISUALISATION

CLARITY

Title: The key point of the ridge plot is clearly stated.
x-axis Range: Good visualisation as it starts from zero.
Missing Mean/Median Line: Having the average and/or median line would be better for visualisation.

AESTHETIC

Title: The first letter of the word should be capitalised.
y-axis Label: It is more reader-friendly to write the label horizontally.

MAKEOVER DESIGN

Similar to the first chart, in this ridge plot makeover design, we add some functions such as theme() to rotate the y-axis label, geom_vline() to create a dashed mean line, and geom_text() to display the text ‘Average’.

ggplot(data = jobs,
       aes(x = HourlyRate, y = reorder( EducationRequirement, -HourlyRate))) +
  geom_density_ridges(rel_min_height = 0.01)+
  labs(x = "Hourly Rate",
       y = "Education\n Requirement", 
       title = "Graduates and Bachelors Earn Higher Hourly Wage") +
  theme(axis.title.y= element_text(angle=0)) +
  geom_vline(aes(xintercept=mean(HourlyRate,
                                 na.rm=T)),
             color="black", 
             linetype="dashed", 
             size=0.5) +
  geom_text(aes(25,5.5,label="Average"), size=3)

5. References

University of New Mexico. (n.d.). Colors in HTML. https://www.unm.edu/~tbeach/IT145/color.html

Xie, Y.H., et.al. (2022, April 14). Font Color. R Markdown Cookbook. https://bookdown.org/yihui/rmarkdown-cookbook/font-color.html

Take-home Exercise 2

1. Overview

2. Getting Started

3. Data

3.1 Data Source

3.2 Importing Data

3.3 Data Wrangling

4. Demographic Visualisation

4.1 Age Distribution and Having Kids Status

4.2 Education Level Distribution

4.3 How Education Level Affects Joviality?

4.4 How Education Level Affects Hourly Wage?

5. References