In-class Exercise 5

1. Install R packages

lubridate is part of tidyverse now, so there is no need to import lubridate separately.

Show the code
pacman::p_load(igraph, tidyverse, tidygraph, ggraph, visNetwork, clock, graphlayouts)

2. Data preparation

2.1 Import nodes and edges data

Show the code
GAStech_nodes <- read_csv("data/GAStech_email_node.csv")
GAStech_edges <- read_csv("data/GAStech_email_edge-v2.csv")
Tip

For the edges data frame, the first column should always be source and second column be target.

This is import.

For missing values, use “unknown” or “no value” instead of build-in missing value function to avoid losing the data entry.

2.2 Review the imported data

Use glimpse() of dplyr to examine the structure of the data frame.

Show the code
glimpse(GAStech_edges)
Rows: 9,063
Columns: 8
$ source      <dbl> 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 26, 26, 26…
$ target      <dbl> 41, 40, 51, 52, 53, 45, 44, 46, 48, 49, 47, 54, 27, 28, 29…
$ SentDate    <chr> "6/1/2014", "6/1/2014", "6/1/2014", "6/1/2014", "6/1/2014"…
$ SentTime    <time> 08:39:00, 08:39:00, 08:58:00, 08:58:00, 08:58:00, 08:58:0…
$ Subject     <chr> "GT-SeismicProcessorPro Bug Report", "GT-SeismicProcessorP…
$ MainSubject <chr> "Work related", "Work related", "Work related", "Work rela…
$ sourceLabel <chr> "Sven.Flecha", "Sven.Flecha", "Kanon.Herrero", "Kanon.Herr…
$ targetLabel <chr> "Isak.Baza", "Lucas.Alcazar", "Felix.Resumir", "Hideki.Coc…

Notice that the SentDate is categorized as “Character” instead of “Date” data type, and we need to change it back to “Date”data type before further analysis.

2.3 Data wrangling

2.3.1 Wrangling Time

The code chunk below will change the data type of SentDate to “Date”.

Show the code
GAStech_edges <- GAStech_edges %>%
  mutate(SendDate = dmy(SentDate)) %>%
  mutate(Weekday = wday(SentDate, label = TRUE, abbr = FALSE))

Review the data again after the change.

Show the code
glimpse(GAStech_edges)
Rows: 9,063
Columns: 10
$ source      <dbl> 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 26, 26, 26…
$ target      <dbl> 41, 40, 51, 52, 53, 45, 44, 46, 48, 49, 47, 54, 27, 28, 29…
$ SentDate    <chr> "6/1/2014", "6/1/2014", "6/1/2014", "6/1/2014", "6/1/2014"…
$ SentTime    <time> 08:39:00, 08:39:00, 08:58:00, 08:58:00, 08:58:00, 08:58:0…
$ Subject     <chr> "GT-SeismicProcessorPro Bug Report", "GT-SeismicProcessorP…
$ MainSubject <chr> "Work related", "Work related", "Work related", "Work rela…
$ sourceLabel <chr> "Sven.Flecha", "Sven.Flecha", "Kanon.Herrero", "Kanon.Herr…
$ targetLabel <chr> "Isak.Baza", "Lucas.Alcazar", "Felix.Resumir", "Hideki.Coc…
$ SendDate    <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-0…
$ Weekday     <ord> Friday, Friday, Friday, Friday, Friday, Friday, Friday, Fr…

2.3.2 Wrangling attributes

With a close examination, GAStech_edges data.frame is at email records level, which is not very useful for visualization. Therefore, we will aggregate individual by date, senders, receivers, main subject and day of the week.

Show the code
GAStech_edges_aggregated <- GAStech_edges %>%
  filter(MainSubject == "Work related") %>%
  group_by(source, target, Weekday) %>%
    summarise(Weight = n()) %>%
  filter(source != target) %>%
  filter(Weight > 1) %>%
  ungroup()
Tip

Always check dplyr to use the latest way to wrangle data. Google result may show old ways to solve a problem

Note

ungroup() at the end of the code chunk is for future calculation. If it’s not done, when we do calculation later it will be based on the grouped data.

Review the data again after the change.

Show the code
glimpse(GAStech_edges_aggregated)
Rows: 1,372
Columns: 4
$ source  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ target  <dbl> 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6,…
$ Weekday <ord> Sunday, Monday, Tuesday, Wednesday, Friday, Sunday, Monday, Tu…
$ Weight  <int> 5, 2, 3, 4, 6, 5, 2, 3, 4, 6, 5, 2, 3, 4, 6, 5, 2, 3, 4, 6, 5,…

3. Creating network objects

Use tbl_graph() from tidygraph to build a network graph data.frame.

Show the code
GAStech_graph <- tbl_graph(nodes = GAStech_nodes,
                           edges = GAStech_edges_aggregated,
                           directed = TRUE)

Review the output tidygraph’s graph object.

Show the code
GAStech_graph
# A tbl_graph: 54 nodes and 1372 edges
#
# A directed multigraph with 1 component
#
# A tibble: 54 × 4
     id label               Department     Title                                
  <dbl> <chr>               <chr>          <chr>                                
1     1 Mat.Bramar          Administration Assistant to CEO                     
2     2 Anda.Ribera         Administration Assistant to CFO                     
3     3 Rachel.Pantanal     Administration Assistant to CIO                     
4     4 Linda.Lagos         Administration Assistant to COO                     
5     5 Ruscella.Mies.Haber Administration Assistant to Engineering Group Manag…
6     6 Carla.Forluniau     Administration Assistant to IT Group Manager        
# ℹ 48 more rows
#
# A tibble: 1,372 × 4
   from    to Weekday Weight
  <int> <int> <ord>    <int>
1     1     2 Sunday       5
2     1     2 Monday       2
3     1     2 Tuesday      3
# ℹ 1,369 more rows