1. Overview

In this take-home exercise 6, we are going to reveal the patterns of community interactions of the city of Engagement, Ohio USA by using appropriate social network analysis approach.

The data is processed by using appropriate tidyverse family of packages, whereas the statistical graphics are prepared using ggraph/visNetwork and its extensions.

2. Getting Started

Before we get started, it is important for us to ensure that the required R packages have been installed. If yes, we will load the R packages. If they have yet to be installed, we will install the R packages and load them onto R environment.

The packages required for this exercise are tidyverse, tidygraph, ggraph, igraph, graphlayouts, visNetwork, lubridate and clock.

packages = c('tidyverse','tidygraph','ggraph','igraph','graphlayouts','visNetwork','lubridate','clock')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3. Data

3.1 Data Source

The original datasets were obtained from VAST Challenge 2022 in csv format. They show the social interaction among 1011 residents of Engagement, OH that have agreed to participate in this study.

3.2 Importing Data

The code chunk below imports the original datasets called SocialNetwork.csv and Participants.csv into R by using read_csv() of readr and saves them as tibble data frame called network and participant respectively.

However, since the size of SocialNetwork.csv is big, we will save the output tibble into an output file in rds format and read the saved rds data file into R by using readRDS().

network <- read_csv("data/SocialNetwork.csv")
saveRDS(network, 'data/SocialNetwork.rds')

participant <- read_csv("data/Participants.csv")
network <- readRDS('data/SocialNetwork.rds')

It is observed that there are 7,482,488 interactions in total among 1011 participants.

glimpse(network)

Rows: 7,482,488
Columns: 3
$ timestamp         <dttm> 2022-03-01, 2022-03-01, 2022-03-01, 2022-~
$ participantIdFrom <dbl> 173, 178, 178, 180, 183, 183, 185, 185, 18~
$ participantIdTo   <dbl> 180, 183, 185, 173, 178, 185, 178, 183, 18~

glimpse(participant)

Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~

3.3 Data Wrangling

It is important in network data analysis to have two datasets that describe the edges and nodes. For edges dataset, the interaction source and target must be placed in the first two columns. Therefore, we will rearrange the columns from network dataset by using select() of dplyr and create a new data frame called network_edges. We will then use year(), month() and week() function of lubridate to return the year, month of the year and week of the year respectively. In this exercise, we will only focus on the interactions occurred in the second week of March 2022 (week 10 of the year).

Next, we will aggregate the interaction by senders and receivers.

network_edges <- network %>%
  select(participantIdFrom,participantIdTo,timestamp) %>%
  mutate(Year = year(timestamp),
         Month = month(timestamp),
         Week = lubridate::week(timestamp)) %>%
  filter(Year == 2022) %>%
  filter(Month == 3) %>%
  filter(Week == 10) %>%
  group_by(participantIdFrom,participantIdTo) %>%
    summarise(Weight = n()) %>%
  filter(participantIdFrom!=participantIdTo) %>%
  ungroup()

As the nodes dataset requires the participant ID, we will select participantIdFrom and their unique rows from network dataset, and save it as network_nodes.

network_nodes <- network %>%
  select(participantIdFrom) %>%
  distinct(participantIdFrom, .keep_all = TRUE) %>%
  rename(participantId = participantIdFrom)

However, this network_edges dataset is still incomplete as it only shows the participant ID. Thus, we will use left join function (indicated by all.x = TRUE) to join this dataset with participant dataset to obtain the information of each node and create a new data frame called network_nodes_all. There are 963 rows in total, therefore we can say that only 963 out of 1011 participants interact with each other throughout this study.

network_nodes_all <- merge(x = network_nodes, y = participant, by = "participantId", all.x = TRUE)
glimpse(network_nodes_all)

Rows: 963
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~

We will start building the graph model by using graph_from_data_frame() of igraph and as_tbl_graph() of tidygraph. The output reveals that there are 963 nodes and 4508 edges.

network_graph <- graph_from_data_frame(network_edges,
                                vertices = network_nodes_all) %>%
  as_tbl_graph()
network_graph

# A tbl_graph: 963 nodes and 4508 edges
#
# A directed simple graph with 50 components
#
# Node Data: 963 x 7 (active)
  name  householdSize haveKids   age educationLevel interestGroup
  <chr>         <dbl> <lgl>    <dbl> <chr>          <chr>        
1 0                 3 TRUE        36 HighSchoolOrC~ H            
2 1                 3 TRUE        25 HighSchoolOrC~ B            
3 2                 3 TRUE        35 HighSchoolOrC~ A            
4 3                 3 TRUE        21 HighSchoolOrC~ I            
5 4                 3 TRUE        43 Bachelors      H            
6 5                 3 TRUE        32 HighSchoolOrC~ D            
# ... with 957 more rows, and 1 more variable: joviality <dbl>
#
# Edge Data: 4,508 x 3
   from    to Weight
  <int> <int>  <int>
1     1   227      6
2     1   629      6
3     2    60      5
# ... with 4,505 more rows

It is observed that in the second week of March there are some people that have zero interaction as they are not linked to each other.

set.seed(1234)
ggraph(network_graph,
       layout = "fr") + 
  geom_edge_link() +
  geom_node_point() +
  labs(title = "Not Everyone Interacts with Each Other") +
  theme_graph()

While group B looks to have a greater level of connectivity, we can also see that there are still people that do not interact with each other even though they are in the same interest groups.

set_graph_style()
ggraph(network_graph,
       layout = "stress") + 
  geom_edge_link(aes(width=Weight), 
                 alpha=0.1) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = interestGroup), 
                  size = 2) + 
  facet_nodes(~interestGroup)+
  th_foreground(foreground = "grey80",  
                border = TRUE) +
  labs(title = "Interest Group B is More Well-connected than Others") +
  theme(legend.position = 'bottom')

Interactive graph below allows us to select participant ID from the drop-down list to see how that person interacts with other people. At a glance, those that do not interact with each other are mostly residents graduated from high school or college.

network_edges_aggregated <- network_edges %>%
  rename(from = participantIdFrom) %>%
  rename(to = participantIdTo) %>%
  filter(from!=to) %>%
  ungroup()

network_nodes_all <- network_nodes_all %>%
  rename(group = educationLevel,
         id = participantId)

visNetwork(network_nodes_all,
           network_edges_aggregated,
           main = "Passive People Mostly Have High School or College Degree") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 1234) %>%
  visEdges(arrows = "from")

5. References

DataScience Made Simple (n.d.). Rearrange or Reorder the Rows and Columns in R Using dplyr. https://www.datasciencemadesimple.com/re-arrange-re-order-column-dataframe-r-using-dplyr/

Data Cornering (2021, October 15). How Use dplyr Distinct with Exceptions, Select Unique Rows in R. https://datacornering.com/how-use-dplyr-distinct-with-exceptions-select-unique-rows-in-r/

Take-home Exercise 6

1. Overview

2. Getting Started

3. Data

3.1 Data Source

3.2 Importing Data

3.3 Data Wrangling

4. Social Network Visualisation

Overall Social Network

Social Network by Interest Group

Social Network by Education Level

5. References