進階資料框處理

Data frames combine the behaviour of lists and matrices to make a structure ideally suited for the needs of statistical data.

Hadley Wickham

基礎資料框處理我們已經對基礎資料框處理技巧駕輕就熟,在這個章節中我們將要討論一些進階資料框處理技巧,這些技巧包含如何將文字與數值變數進行重新分類、如何運用 tidyr 套件對資料框進行寬格式(wide-format)與長格式(long-format)之間的轉換以及如何將多個資料框進行垂直和水平方向的合併。

變數重新分類

對資料框中的變數重新分類是基礎資料框處理技巧 mutate 的延伸,針對既有的文字(類別、離散)或數值(連續)變數進行重新的編碼和歸類。以文字變數的重新歸類來舉例,像是將記錄有 1995 至 1996 年球季美國職籃(NBA)芝加哥公牛隊球員名單與基本資訊的資料框中 Pos 變數,從原有的五類(PG、SG、SF、PF、C)重新歸類為兩類(後場、前場),其中 PG 與 SG 會對應為後場,SF、PF 與 C 會對應為前場,這時我們將運用先前在函數型程式設計所習得的 apply() 系列函數來完成這個任務。

recoding_pos <- function(x) {
  if (x %in% c("PG", "SG")) {
    pos <- "Back Court"
  } else {
    pos <- "Front Court"
  }
  return(pos)
}

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls$pos_recoded <- sapply(FUN = recoding_pos, X = chicago_bulls$Pos)
chicago_bulls[, c("Player", "Pos", "pos_recoded")]
1
2
3
4
5
6
7
8
9
10
11
12
13
## > recoding_pos <- function(x) {
## +   if (x %in% c("PG", "SG")) {
## +     pos <- "Back Court"
## +   } else {
## +     pos <- "Front Court"
## +   }
## +   return(pos)
## + }
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls$pos_recoded <- sapply(FUN = recoding_pos, X = chicago_bulls$Pos)
## > chicago_bulls[, c("Player", "Pos", "pos_recoded")]
##             Player Pos pos_recoded
## 1      Randy Brown  PG  Back Court
## 2     Jud Buechler  SF Front Court
## 3     Jason Caffey  PF Front Court
## 4    James Edwards   C Front Court
## 5       Jack Haley   C Front Court
## 6       Ron Harper  PG  Back Court
## 7   Michael Jordan  SG  Back Court
## 8       Steve Kerr  PG  Back Court
## 9       Toni Kukoc  SF Front Court
## 10     Luc Longley   C Front Court
## 11  Scottie Pippen  SF Front Court
## 12   Dennis Rodman  PF Front Court
## 13     John Salley  PF Front Court
## 14 Dickey Simpkins  PF Front Court
## 15 Bill Wennington   C Front Court
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

再以數值變數重新編碼作為文字類別來舉例,將與前例相同的資料框中 Wt 變數,將超過 210 磅的球員歸類為重(Heavy)、小於等於 210 磅的球員歸類為輕(Thin)兩個類別,同樣可以使用 apply() 系列函數協助。

recoding_wt <- function(x) {
  if (x > 210) {
    wt_group <- "Heavy"
  } else {
    wt_group <- "Thin"
  }
  return(wt_group)
}

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls$wt_recoded <- sapply(FUN = recoding_wt, X = chicago_bulls$Wt)
chicago_bulls[, c("Player", "Wt", "wt_recoded")]
1
2
3
4
5
6
7
8
9
10
11
12
13
## > recoding_wt <- function(x) {
## +   if (x > 210) {
## +     wt_group <- "Heavy"
## +   } else {
## +     wt_group <- "Thin"
## +   }
## +   return(wt_group)
## + }
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls$wt_recoded <- sapply(FUN = recoding_wt, X = chicago_bulls$Wt)
## > chicago_bulls[, c("Player", "Wt", "wt_recoded")]
##             Player  Wt wt_recoded
## 1      Randy Brown 190       Thin
## 2     Jud Buechler 220      Heavy
## 3     Jason Caffey 255      Heavy
## 4    James Edwards 225      Heavy
## 5       Jack Haley 240      Heavy
## 6       Ron Harper 185       Thin
## 7   Michael Jordan 195       Thin
## 8       Steve Kerr 175       Thin
## 9       Toni Kukoc 192       Thin
## 10     Luc Longley 265      Heavy
## 11  Scottie Pippen 210       Thin
## 12   Dennis Rodman 210       Thin
## 13     John Salley 230      Heavy
## 14 Dickey Simpkins 248      Heavy
## 15 Bill Wennington 245      Heavy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

除了自己撰寫將數值分箱(bin)為文字類別的函數,也能夠透過內建函數 cut() 來完成這項任務,cut() 要傳入兩個參數來告知重新編碼的規範,參數設定稍微複雜一點,使用者必須要針對不同的類別指定切點 breaks,要將體重區分為兩類,必須給定三個體重切點(就像是植樹或者路燈問題!)breaks = c(0, 210, Inf),其中 Inf 是 R 語言內建的無限大數值;以及給予兩類型標籤 labels=c("Thin", "Heavy")

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls$wt_recoded <- cut(chicago_bulls$Wt, breaks = c(0, 210, Inf), labels = c("Thin", "Heavy"))
chicago_bulls[, c("Player", "Wt", "wt_recoded")]
1
2
3
4
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls$wt_recoded <- cut(chicago_bulls$Wt, breaks = c(0, 210, Inf), labels = c("Thin", "Heavy"))
## > chicago_bulls[, c("Player", "Wt", "wt_recoded")]
##             Player  Wt wt_recoded
## 1      Randy Brown 190       Thin
## 2     Jud Buechler 220      Heavy
## 3     Jason Caffey 255      Heavy
## 4    James Edwards 225      Heavy
## 5       Jack Haley 240      Heavy
## 6       Ron Harper 185       Thin
## 7   Michael Jordan 195       Thin
## 8       Steve Kerr 175       Thin
## 9       Toni Kukoc 192       Thin
## 10     Luc Longley 265      Heavy
## 11  Scottie Pippen 210       Thin
## 12   Dennis Rodman 210       Thin
## 13     John Salley 230      Heavy
## 14 Dickey Simpkins 248      Heavy
## 15 Bill Wennington 245      Heavy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

寬格式與長格式的互相轉換

在不同資料設定情況下,我們需要為分析應用中的不同需求設置相對應的資料形式,常見的資料框外觀有寬格式(wide-format)和長格式(long-format);在寬格式的資料框中,每個觀測值所對應的資訊會重複出現在單獨的一列、而每個資訊會被記錄在單獨的欄位中。像是利用兩個變數 winslosses 記錄 1995–96 球季芝加哥公牛隊與 2015–16 球季金州勇士隊的勝場數與敗場數。

teams <- c("Chicago Bulls", "Golden State Warriors")
wins <- c(72, 73)
losses <- c(10, 9)
great_nba_teams <- data.frame(team = teams, win = wins, loss = losses)
great_nba_teams
1
2
3
4
5
## > teams <- c("Chicago Bulls", "Golden State Warriors")
## > wins <- c(72, 73)
## > losses <- c(10, 9)
## > great_nba_teams <- data.frame(team = teams, win = wins, loss = losses)
## > great_nba_teams
##                    team win loss
## 1         Chicago Bulls  72   10
## 2 Golden State Warriors  73    9
1
2
3
4
5
6
7
8

一如範例所呈現的資料框外觀我們稱為寬格式,在寬格式的外觀結構下,如果希望增加觀測值的資訊,例如勝率,會以增加欄位數的方式來更新。

teams <- c("Chicago Bulls", "Golden State Warriors")
wins <- c(72, 73)
losses <- c(10, 9)
great_nba_teams <- data.frame(team = teams, win = wins, loss = losses)
great_nba_teams$winning_percentage <- great_nba_teams$win / (great_nba_teams$win + great_nba_teams$loss)
great_nba_teams
1
2
3
4
5
6
## > teams <- c("Chicago Bulls", "Golden State Warriors")
## > wins <- c(72, 73)
## > losses <- c(10, 9)
## > great_nba_teams <- data.frame(team = teams, win = wins, loss = losses)
## > great_nba_teams$winning_percentage <- great_nba_teams$win / (great_nba_teams$win + great_nba_teams$loss)
## > great_nba_teams
##                    team win loss winning_percentage
## 1         Chicago Bulls  72   10          0.8780488
## 2 Golden State Warriors  73    9          0.8902439
1
2
3
4
5
6
7
8
9

而同樣的資訊內容,若是以長格式儲存,每個數值(value)都會對應一個類別標籤(key),而每個觀測值主體(Identifier)將有多列觀測值、而不隨類別標籤改變。像是利用兩個變數 game_resultgames 記錄 1995–96 球季芝加哥公牛隊與 2015–16 球季金州勇士隊的勝場數與敗場數。

teams <- c("Chicago Bulls", "Golden State Warriors", "Chicago Bulls", "Golden State Warriors")
game_results <- c("win", "win", "loss", "loss")
games <- c(72, 73, 10, 9)
great_nba_teams <- data.frame(team = teams, game_result = game_results, games = games)
great_nba_teams
1
2
3
4
5
## > teams <- c("Chicago Bulls", "Golden State Warriors", "Chicago Bulls", "Golden State Warriors")
## > game_results <- c("win", "win", "loss", "loss")
## > games <- c(72, 73, 10, 9)
## > great_nba_teams <- data.frame(team = teams, game_result = game_results, games = games)
## > great_nba_teams
##                    team game_result games
## 1         Chicago Bulls         win    72
## 2 Golden State Warriors         win    73
## 3         Chicago Bulls        loss    10
## 4 Golden State Warriors        loss     9
1
2
3
4
5
6
7
8
9
10

這樣外觀的資料框我們稱之為長格式,在長格式的外觀結構下,如果希望增加觀測值的資訊,會以增加觀測值數的方式來更新。

在 R 語言應用中透過 tidyr 套件中的 spread() 函數與 gather() 函數可以實現資料框長格式與寬格式之間的轉換。

  • spread() 函數:指定類別標籤(key)與數值(value)的變數名稱將長格式分散為不同變數欄位
  • gather() 函數:為堆疊後類別標籤(key)與數值(value)的變數命名後再指定有哪些數值變數要進行分類堆疊

例如將前例中的長格式以 spread() 函數轉換為寬格式,類別標籤參數指定為 key = "game_result" 以及數值標籤參數指定為 value = "games"

# install.packages("tidyr")
library(tidyr)

teams <- c("Chicago Bulls", "Golden State Warriors", "Chicago Bulls", "Golden State Warriors")
game_results <- c("win", "win", "loss", "loss")
games <- c(72, 73, 10, 9)
great_nba_teams_long <- data.frame(team = teams, game_result = game_results, games = games)
great_nba_teams_wide <- spread(great_nba_teams_long, key = "game_result", value = "games")
great_nba_teams_wide
1
2
3
4
5
6
7
8
9
## > # install.packages("tidyr")
## > library(tidyr)
## > 
## > teams <- c("Chicago Bulls", "Golden State Warriors", "Chicago Bulls", "Golden State Warriors")
## > game_results <- c("win", "win", "loss", "loss")
## > games <- c(72, 73, 10, 9)
## > great_nba_teams_long <- data.frame(team = teams, game_result = game_results, games = games)
## > great_nba_teams_wide <- spread(great_nba_teams_long, key = "game_result", value = "games")
## > great_nba_teams_wide
##                    team loss win
## 1         Chicago Bulls   10  72
## 2 Golden State Warriors    9  73
1
2
3
4
5
6
7
8
9
10
11
12

或者將寬格式以 gather() 函數轉換為長格式,命名堆疊後的類別標籤變數為 key = "game_result" 以及數值標籤變數為 value = "games",再告知有 losswin 兩個數值變數需要進行堆疊。

# install.packages("tidyr")
library(tidyr)

teams <- c("Chicago Bulls", "Golden State Warriors")
wins <- c(72, 73)
losses <- c(10, 9)
great_nba_teams_wide <- data.frame(team = teams, win = wins, loss = losses)
great_nba_teams_long <- gather(great_nba_teams_wide, key = "game_result", value = "games", win, loss)
great_nba_teams_long
1
2
3
4
5
6
7
8
9
## > # install.packages("tidyr")
## > library(tidyr)
## > 
## > teams <- c("Chicago Bulls", "Golden State Warriors")
## > wins <- c(72, 73)
## > losses <- c(10, 9)
## > great_nba_teams_wide <- data.frame(team = teams, win = wins, loss = losses)
## > great_nba_teams_long <- gather(great_nba_teams_wide, key = "game_result", value = "games", win, loss)
## > great_nba_teams_long
##                    team game_result games
## 1         Chicago Bulls         win    72
## 2 Golden State Warriors         win    73
## 3         Chicago Bulls        loss    10
## 4 Golden State Warriors        loss     9
1
2
3
4
5
6
7
8
9
10
11
12
13
14

垂直合併

使用 rbind() 函數能夠以垂直方向將兩個資料框合併起來,其中 rbind() 函數命名是 Row Bind 的縮寫,意即結合觀測值,這樣的操作前提是上下來兩資料框必須具有相同的變數欄位,例如在前例中所使用長格式的 1995–96 球季芝加哥公牛隊與 2015–16 球季金州勇士隊的勝場數與敗場數,我們可以用兩支球隊個別的資料框來做垂直合併。

teams <- c("Chicago Bulls", "Chicago Bulls")
game_results <- c("win", "loss")
games <- c(72, 10)
bulls <- data.frame(team = teams, game_result = game_results, game = games)
teams <- c("Golden State Warriors", "Golden State Warriors")
game_results <- c("win", "loss")
games <- c(73, 9)
warriors <- data.frame(team = teams, game_result = game_results, game = games)
# upper dataframe
bulls
# lower dataframe
warriors
# dataframe after rbind
rbind(bulls, warriors)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
## > teams <- c("Chicago Bulls", "Chicago Bulls")
## > game_results <- c("win", "loss")
## > games <- c(72, 10)
## > bulls <- data.frame(team = teams, game_result = game_results, game = games)
## > teams <- c("Golden State Warriors", "Golden State Warriors")
## > game_results <- c("win", "loss")
## > games <- c(73, 9)
## > warriors <- data.frame(team = teams, game_result = game_results, game = games)
## > # upper dataframe
## > bulls
##            team game_result game
## 1 Chicago Bulls         win   72
## 2 Chicago Bulls        loss   10
## > # lower dataframe
## > warriors
##                    team game_result game
## 1 Golden State Warriors         win   73
## 2 Golden State Warriors        loss    9
## > # dataframe after rbind
## > rbind(bulls, warriors)
##                    team game_result game
## 1         Chicago Bulls         win   72
## 2         Chicago Bulls        loss   10
## 3 Golden State Warriors         win   73
## 4 Golden State Warriors        loss    9
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

或者我們亦可以透過在使用 dplyr 處理資料框章節中所介紹 dplyr 套件中的 bind_rows() 函數來完成垂直合併。

# install.packages("dplyr")
library(dplyr)

teams <- c("Chicago Bulls", "Chicago Bulls")
game_results <- c("win", "loss")
games <- c(72, 10)
bulls <- data.frame(team = teams, game_result = game_results, game = games, stringsAsFactors = FALSE)
teams <- c("Golden State Warriors", "Golden State Warriors")
game_results <- c("win", "loss")
games <- c(73, 9)
warriors <- data.frame(team = teams, game_result = game_results, game = games, stringsAsFactors = FALSE)
# upper dataframe
bulls
# lower dataframe
warriors
# dataframe after bind_rows
bind_rows(bulls, warriors)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
## > # install.packages("dplyr")
## > library(dplyr)
## > 
## > teams <- c("Chicago Bulls", "Chicago Bulls")
## > game_results <- c("win", "loss")
## > games <- c(72, 10)
## > bulls <- data.frame(team = teams, game_result = game_results, game = games, stringsAsFactors = FALSE)
## > teams <- c("Golden State Warriors", "Golden State Warriors")
## > game_results <- c("win", "loss")
## > games <- c(73, 9)
## > warriors <- data.frame(team = teams, game_result = game_results, game = games, stringsAsFactors = FALSE)
## > # upper dataframe
## > bulls
##            team game_result game
## 1 Chicago Bulls         win   72
## 2 Chicago Bulls        loss   10
## > # lower dataframe
## > warriors
##                    team game_result game
## 1 Golden State Warriors         win   73
## 2 Golden State Warriors        loss    9
## > # dataframe after bind_rows
## > bind_rows(bulls, warriors)
##                    team game_result game
## 1         Chicago Bulls         win   72
## 2         Chicago Bulls        loss   10
## 3 Golden State Warriors         win   73
## 4 Golden State Warriors        loss    9
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28

水平合併

單純的水平合併使用 cbind() 函數能夠以水平方向將兩個資料框合併起來,其中 cbind() 函數命名是 Column Bind 的縮寫,意即結合變數欄位,例如在前文中所使用的 1995 至 1996 年球季美國職籃(NBA)芝加哥公牛隊球員名單與基本資訊的資料框,可以將原本只有背號與球員姓名的資料框,水平合併加入身高與體重的資訊。

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls_players <- chicago_bulls[, c("No.", "Player")]
chicago_bulls_ht_wt <- chicago_bulls[, c("Wt", "Ht")]
# left dataframe
chicago_bulls_players
# right dataframe
chicago_bulls_ht_wt
# dataframe after cbind
cbind(chicago_bulls_players, chicago_bulls_ht_wt)
1
2
3
4
5
6
7
8
9
10
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls_players <- chicago_bulls[, c("No.", "Player")]
## > chicago_bulls_ht_wt <- chicago_bulls[, c("Wt", "Ht")]
## > # left dataframe
## > chicago_bulls_players
##    No.          Player
## 1    0     Randy Brown
## 2   30    Jud Buechler
## 3   35    Jason Caffey
## 4   53   James Edwards
## 5   54      Jack Haley
## 6    9      Ron Harper
## 7   23  Michael Jordan
## 8   25      Steve Kerr
## 9    7      Toni Kukoc
## 10  13     Luc Longley
## 11  33  Scottie Pippen
## 12  91   Dennis Rodman
## 13  22     John Salley
## 14   8 Dickey Simpkins
## 15  34 Bill Wennington
## > # right dataframe
## > chicago_bulls_ht_wt
##     Wt   Ht
## 1  190  6-2
## 2  220  6-6
## 3  255  6-8
## 4  225  7-0
## 5  240 6-10
## 6  185  6-6
## 7  195  6-6
## 8  175  6-3
## 9  192 6-10
## 10 265  7-2
## 11 210  6-8
## 12 210  6-7
## 13 230 6-11
## 14 248  6-9
## 15 245  7-0
## > # dataframe after cbind
## > cbind(chicago_bulls_players, chicago_bulls_ht_wt)
##    No.          Player  Wt   Ht
## 1    0     Randy Brown 190  6-2
## 2   30    Jud Buechler 220  6-6
## 3   35    Jason Caffey 255  6-8
## 4   53   James Edwards 225  7-0
## 5   54      Jack Haley 240 6-10
## 6    9      Ron Harper 185  6-6
## 7   23  Michael Jordan 195  6-6
## 8   25      Steve Kerr 175  6-3
## 9    7      Toni Kukoc 192 6-10
## 10  13     Luc Longley 265  7-2
## 11  33  Scottie Pippen 210  6-8
## 12  91   Dennis Rodman 210  6-7
## 13  22     John Salley 230 6-11
## 14   8 Dickey Simpkins 248  6-9
## 15  34 Bill Wennington 245  7-0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58

或者亦可以使用 dplyr 套件中的 bind_cols() 函數來完成水平合併。

# install.packages("dplyr")
library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls_players <- chicago_bulls[, c("No.", "Player")]
chicago_bulls_ht_wt <- chicago_bulls[, c("Wt", "Ht")]
# left dataframe
chicago_bulls_players
# right dataframe
chicago_bulls_ht_wt
# dataframe after bind_cols
bind_cols(chicago_bulls_players, chicago_bulls_ht_wt)
1
2
3
4
5
6
7
8
9
10
11
12
13
> # install.packages("dplyr")
> library(dplyr)
> 
> csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
> chicago_bulls <- read.csv(csv_url)
> chicago_bulls_players <- chicago_bulls[, c("No.", "Player")]
> chicago_bulls_ht_wt <- chicago_bulls[, c("Wt", "Ht")]
> # left dataframe
> chicago_bulls_players
   No.          Player
1    0     Randy Brown
2   30    Jud Buechler
3   35    Jason Caffey
4   53   James Edwards
5   54      Jack Haley
6    9      Ron Harper
7   23  Michael Jordan
8   25      Steve Kerr
9    7      Toni Kukoc
10  13     Luc Longley
11  33  Scottie Pippen
12  91   Dennis Rodman
13  22     John Salley
14   8 Dickey Simpkins
15  34 Bill Wennington
> # right dataframe
> chicago_bulls_ht_wt
    Wt   Ht
1  190  6-2
2  220  6-6
3  255  6-8
4  225  7-0
5  240 6-10
6  185  6-6
7  195  6-6
8  175  6-3
9  192 6-10
10 265  7-2
11 210  6-8
12 210  6-7
13 230 6-11
14 248  6-9
15 245  7-0
> # dataframe after bind_cols
> bind_cols(chicago_bulls_players, chicago_bulls_ht_wt)
   No.          Player  Wt   Ht
1    0     Randy Brown 190  6-2
2   30    Jud Buechler 220  6-6
3   35    Jason Caffey 255  6-8
4   53   James Edwards 225  7-0
5   54      Jack Haley 240 6-10
6    9      Ron Harper 185  6-6
7   23  Michael Jordan 195  6-6
8   25      Steve Kerr 175  6-3
9    7      Toni Kukoc 192 6-10
10  13     Luc Longley 265  7-2
11  33  Scottie Pippen 210  6-8
12  91   Dennis Rodman 210  6-7
13  22     John Salley 230 6-11
14   8 Dickey Simpkins 248  6-9
15  34 Bill Wennington 245  7-0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61

進階的資料框水平合併會運用類似 Excel 軟體中的 vlookup()hlookup() 函數、或者說關聯式資料庫的 JOIN 語法,這樣的操作是利用兩個資料框中共有的變數欄位作為對照依據來進行水平合併。我們使用日本漫畫家尾田榮一郎創作的著名少年熱血漫畫「海賊王」中草帽海賊團船員、年紀與惡魔果實來作範例。

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit)
left_df
right_df
1
2
3
4
5
6
7
8
9
10
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit)
## > left_df
##            name age
## 1   蒙其·D·魯夫  19
## 2   羅羅亞·索隆  21
## 3          娜美  20
## 4 多尼多尼·喬巴  17
## > right_df
##            name devil_fruit
## 1   蒙其·D·魯夫    橡膠果實
## 2 多尼多尼·喬巴    人人果實
## 3     妮可·羅賓    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

使用 merge() 函數能夠讓左邊記錄姓名及年齡資訊的資料框、右邊記錄姓名及惡魔果實的資料框,以姓名作為對照依據來合併。

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit)
# 合併後的資料框
merge(left_df, right_df)
1
2
3
4
5
6
7
8
9
10
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit)
## > # 合併後的資料框
## > merge(left_df, right_df)
##            name age devil_fruit
## 1 多尼多尼·喬巴  17    人人果實
## 2   蒙其·D·魯夫  19    橡膠果實
1
2
3
4
5
6
7
8
9
10
11
12
13

觀察合併後的資料框,我們可以發現只有左邊與右邊都有的兩個觀測值保留在最後的輸出,這是因為 merge() 函數預設是保留左右兩個資料框交集,這在關聯式資料庫中稱之為 INNER JOIN 語法。

或使用 dplyr 套件中的 inner_join() 函數來完成保留左右兩個資料框交集的水平合併。

# install.packages("dplyr")
library(dplyr)

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age, stringsAsFactors = FALSE)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
# 合併後的資料框
inner_join(left_df, right_df)
1
2
3
4
5
6
7
8
9
10
11
12
13
## > # install.packages("dplyr")
## > library(dplyr)
## > 
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age, stringsAsFactors = FALSE)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
## > # 合併後的資料框
## > inner_join(left_df, right_df)
## Joining, by = "name"
##            name age devil_fruit
## 1   蒙其·D·魯夫  19    橡膠果實
## 2 多尼多尼·喬巴  17    人人果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

假如我們希望保留所有左邊資料框的觀測值,可以在 merge() 函數中指定參數 all.x = TRUE,那麼右邊資料框對應不到的部分,就會以遺漏值呈現,這在關聯式資料庫中稱之為 LEFT JOIN 語法。

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit)
# 合併後的資料框
merge(left_df, right_df, all.x = TRUE)
1
2
3
4
5
6
7
8
9
10
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit)
## > # 合併後的資料框
## > merge(left_df, right_df, all.x = TRUE)
##            name age devil_fruit
## 1 多尼多尼·喬巴  17    人人果實
## 2          娜美  20        <NA>
## 3   羅羅亞·索隆  21        <NA>
## 4   蒙其·D·魯夫  19    橡膠果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

或使用 dplyr 套件中的 left_join() 函數來完成保留所有左邊資料框觀測值的水平合併。

# install.packages("dplyr")
library(dplyr)

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age, stringsAsFactors = FALSE)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
# 合併後的資料框
left_join(left_df, right_df)
1
2
3
4
5
6
7
8
9
10
11
12
13
## > # install.packages("dplyr")
## > library(dplyr)
## > 
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age, stringsAsFactors = FALSE)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
## > # 合併後的資料框
## > left_join(left_df, right_df)
## Joining, by = "name"
##            name age devil_fruit
## 1   蒙其·D·魯夫  19    橡膠果實
## 2   羅羅亞·索隆  21        <NA>
## 3          娜美  20        <NA>
## 4 多尼多尼·喬巴  17    人人果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

假如我們希望保留所有右邊資料框的觀測值,可以在 merge() 函數中指定參數 all.y = TRUE,那麼左邊資料框對應不到的部分,就會以遺漏值呈現,這在關聯式資料庫中稱之為 RIGHT JOIN 語法。

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit)
# 合併後的資料框
merge(left_df, right_df, all.y = TRUE)
1
2
3
4
5
6
7
8
9
10
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit)
## > # 合併後的資料框
## > merge(left_df, right_df, all.y = TRUE)
##            name age devil_fruit
## 1 多尼多尼·喬巴  17    人人果實
## 2   蒙其·D·魯夫  19    橡膠果實
## 3     妮可·羅賓  NA    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14

或使用 dplyr 套件中的 right_join() 函數來完成保留所有右邊資料框觀測值的水平合併。

# install.packages("dplyr")
library(dplyr)

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age, stringsAsFactors = FALSE)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
# 合併後的資料框
right_join(left_df, right_df)
1
2
3
4
5
6
7
8
9
10
11
12
13
## > # install.packages("dplyr")
## > library(dplyr)
## > 
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age, stringsAsFactors = FALSE)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
## > # 合併後的資料框
## > right_join(left_df, right_df)
## Joining, by = "name"
##            name age devil_fruit
## 1   蒙其·D·魯夫  19    橡膠果實
## 2 多尼多尼·喬巴  17    人人果實
## 3     妮可·羅賓  NA    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

假如我們希望保留兩邊資料框的觀測值,可以在 merge() 函數中指定參數 all.x = TRUEall.y = TRUE,兩邊資料框對應不到的部分,就會以遺漏值呈現,這在關聯式資料庫中稱之為 FULL JOIN 語法。

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit)
# 合併後的資料框
merge(left_df, right_df, all.x = TRUE, all.y = TRUE)
1
2
3
4
5
6
7
8
9
10
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit)
## > # 合併後的資料框
## > merge(left_df, right_df, all.x = TRUE, all.y = TRUE)
##            name age devil_fruit
## 1 多尼多尼·喬巴  17    人人果實
## 2          娜美  20        <NA>
## 3   羅羅亞·索隆  21        <NA>
## 4   蒙其·D·魯夫  19    橡膠果實
## 5     妮可·羅賓  NA    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

或使用 dplyr 套件中的 full_join() 函數來完成保留所有兩邊資料框觀測值的水平合併。

# install.packages("dplyr")
library(dplyr)

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age, stringsAsFactors = FALSE)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
# 合併後的資料框
full_join(left_df, right_df)
1
2
3
4
5
6
7
8
9
10
11
12
13
## > # install.packages("dplyr")
## > library(dplyr)
## > 
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age, stringsAsFactors = FALSE)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(name, devil_fruit, stringsAsFactors = FALSE)
## > # 合併後的資料框
## > full_join(left_df, right_df)
## Joining, by = "name"
##            name age devil_fruit
## 1   蒙其·D·魯夫  19    橡膠果實
## 2   羅羅亞·索隆  21        <NA>
## 3          娜美  20        <NA>
## 4 多尼多尼·喬巴  17    人人果實
## 5     妮可·羅賓  NA    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

在前面的範例中由於左邊與右邊資料框用來作為對照依據的變數都取名為 name,所以不論是內建 merge() 函數或者 dplyr 套件中的 join 系列函數都會自動辨認用 name 變數,但是當取名不同的時候,就會出現所謂的笛卡爾連接(Cartesian Join)

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(crew = name, devil_fruit)
# 合併後的資料框
merge(left_df, right_df, all.x = TRUE, all.y = TRUE)
1
2
3
4
5
6
7
8
9
10
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(crew = name, devil_fruit)
## > # 合併後的資料框
## > merge(left_df, right_df, all.x = TRUE, all.y = TRUE)
##             name age          crew devil_fruit
## 1    蒙其·D·魯夫  19   蒙其·D·魯夫    橡膠果實
## 2    羅羅亞·索隆  21   蒙其·D·魯夫    橡膠果實
## 3           娜美  20   蒙其·D·魯夫    橡膠果實
## 4  多尼多尼·喬巴  17   蒙其·D·魯夫    橡膠果實
## 5    蒙其·D·魯夫  19 多尼多尼·喬巴    人人果實
## 6    羅羅亞·索隆  21 多尼多尼·喬巴    人人果實
## 7           娜美  20 多尼多尼·喬巴    人人果實
## 8  多尼多尼·喬巴  17 多尼多尼·喬巴    人人果實
## 9    蒙其·D·魯夫  19     妮可·羅賓    花花果實
## 10   羅羅亞·索隆  21     妮可·羅賓    花花果實
## 11          娜美  20     妮可·羅賓    花花果實
## 12 多尼多尼·喬巴  17     妮可·羅賓    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

這樣的結果不是我們想要的水平合併結果,這時只要在 merge() 函數中加入 by.x = “name"by.y = "crew" 就可以解決。

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(crew = name, devil_fruit)
# 合併後的資料框
merge(left_df, right_df, all.x = TRUE, all.y = TRUE, by.x = "name", by.y = "crew")
1
2
3
4
5
6
7
8
9
10
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(crew = name, devil_fruit)
## > # 合併後的資料框
## > merge(left_df, right_df, all.x = TRUE, all.y = TRUE, by.x = "name", by.y = "crew")
##            name age devil_fruit
## 1 多尼多尼·喬巴  17    人人果實
## 2          娜美  20        <NA>
## 3   羅羅亞·索隆  21        <NA>
## 4   蒙其·D·魯夫  19    橡膠果實
## 5     妮可·羅賓  NA    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

相同的,我們也能夠在 dplyr 中的 join 系列函數中加入 by = c("name" = "crew") 就可以解決。

# install.packages("dplyr")
library(dplyr)

# 左邊的資料框
name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
age <- c(19, 21, 20, 17)
left_df <- data.frame(name, age, stringsAsFactors = FALSE)
# 右邊的資料框
name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
right_df <- data.frame(crew = name, devil_fruit, stringsAsFactors = FALSE)
# 合併後的資料框
full_join(left_df, right_df, by = c("name" = "crew"))
1
2
3
4
5
6
7
8
9
10
11
12
13
## > # install.packages("dplyr")
## > library(dplyr)
## > 
## > # 左邊的資料框
## > name <- c("蒙其·D·魯夫", "羅羅亞·索隆", "娜美", "多尼多尼·喬巴")
## > age <- c(19, 21, 20, 17)
## > left_df <- data.frame(name, age, stringsAsFactors = FALSE)
## > # 右邊的資料框
## > name <- c("蒙其·D·魯夫", "多尼多尼·喬巴", "妮可·羅賓")
## > devil_fruit <- c("橡膠果實", "人人果實", "花花果實")
## > right_df <- data.frame(crew = name, devil_fruit, stringsAsFactors = FALSE)
## > # 合併後的資料框
## > full_join(left_df, right_df, by = c("name" = "crew"))
##            name age devil_fruit
## 1   蒙其·D·魯夫  19    橡膠果實
## 2   羅羅亞·索隆  21        <NA>
## 3          娜美  20        <NA>
## 4 多尼多尼·喬巴  17    人人果實
## 5     妮可·羅賓  NA    花花果實
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

小結

在這個小節中我們簡介進階資料框處理技巧,這些技巧包含如何運用函數型編程或內建函數將變數重新分類、運用 tidyr 套件對資料框進行寬格式(wide-format)與長格式(long-format)的互相轉換以及將超過一個的資料框進行垂直或者水平方向的合併。

延伸閱讀

Join Data in R