Working with data frames using dplyr

dplyr is a grammar of data manipulation.

dplyr.tidyverse.org

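In "Basic data frame handling" we covered the fundamentals of working with data frames: inspecting their dimensions and appearance, querying their details, taking them apart, adjusting variables, and sorting. This section turns to the dplyr package, covering the %>% operator (the pipe operator) and the basic dplyr functions. Two packages support these techniques:

  • Using the %>% operator: the magrittr package, developed by Stefan Milton Bache
  • Handling data frames more efficiently: the dplyr package, developed by Hadley Wickham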

Installing and loading the packages

We can install and load packages with the install.packages() and library() functions.

pkgs <- c("magrittr", "dplyr")
install.packages(pkgs)
library(magrittr)
library(dplyr)
## > pkgs <- c("magrittr", "dplyr")
## > install.packages(pkgs)
## trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/magrittr_1.5.tgz'
## Content type 'application/x-gzip' length 152395 bytes (148 KB)
## ==================================================
## downloaded 148 KB
## 
## trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/dplyr_0.7.8.tgz'
## Content type 'application/x-gzip' length 5720340 bytes (5.5 MB)
## ==================================================
## downloaded 5.5 MB
## 
## 
## The downloaded binary packages are in
##  /var/folders/0b/r__z5mpn6ldgb_w2j7_y_ntr0000gn/T//Rtmp6V3aA9/downloaded_packages
## > library(magrittr)
## > library(dplyr)
## 
## Attaching package: ‘dplyr’
## 
## The following objects are masked from ‘package:stats’:
## 
##     filter, lag
## 
## The following objects are masked from ‘package:base’:
## 
##     intersect, setdiff, setequal, union
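
Running install.packages() every time a script starts downloads the packages again. The following is a minimal sketch, not part of the original workflow, that installs only the packages that are missing; the names is_installed and missing_pkgs are illustrative.

```r
pkgs <- c("magrittr", "dplyr")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not installed.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing, then attach both packages.
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

library(magrittr)
library(dplyr)
```

On a machine where both packages are already present, the if block is skipped and nothing is downloaded.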

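Alternatively, use the Packages pane in the lower-right corner of RStudio to install and load packages through its user interface.

Click Install in the Packages pane in the lower-right corner

Enter the names of the packages to install

After installation, tick the checkbox to load magrittr

After installation, tick the checkbox to load dplyr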

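The %>% operator

The %>% operator is called the pipe operator. It slightly changes the way of calling functions used in earlier chapters. For example, to pass 1:5 to sum(), the usual habit is to place the inputs inside the parentheses after the function name.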

one_to_five <- 1:5
sum(one_to_five)
## > one_to_five <- 1:5
## > sum(one_to_five)
## [1] 15

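With the %>% operator, the input and the function swap places: the input goes to the left of %>% and the function name goes to the right, so that the input appears to be dropped into the function.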

library(magrittr)

one_to_five <- 1:5
one_to_five %>% 
  sum()
## > library(magrittr)
## > 
## > one_to_five <- 1:5
## > one_to_five %>% 
## +   sum()
## [1] 15
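
When the function on the right takes more than one argument, the left-hand side becomes its first argument. The sketch below illustrates this, and also the dot placeholder that magrittr provides for sending the input to a different argument position; head() is used only as a convenient two-argument function.

```r
library(magrittr)

one_to_five <- 1:5

# x %>% f(y) is evaluated as f(x, y):
# the left-hand side becomes the first argument.
piped <- one_to_five %>% head(2)
nested <- head(one_to_five, 2)
identical(piped, nested)

# The dot placeholder sends the left-hand side to another position.
# 2 %>% head(one_to_five, .) is evaluated as head(one_to_five, 2).
placed <- 2 %>% head(one_to_five, .)
identical(placed, nested)
```

Both identical() calls return TRUE, because all three expressions are the same call written in different ways.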

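In RStudio, the keyboard shortcut Ctrl + Shift + M (Cmd + Shift + M on macOS) inserts the %>% operator.

When to use the %>% operator

%>% is the operator that more experienced R users switch to, in place of conventional function calls, when they need to chain several function calls. Chaining functions means that the inputs and outputs of several functions depend on one another. For example, suppose we want the year of the system date as a numeric vector of length 1. With conventional function calls we might write: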

sys_date <- Sys.Date()
sys_yr <- format(sys_date, format = "%Y")
sys_yr_num <- as.numeric(sys_yr)
sys_yr_num
## > sys_date <- Sys.Date()
## > sys_yr <- format(sys_date, format = "%Y")
## > sys_yr_num <- as.numeric(sys_yr)
## > sys_yr_num
## [1] 2019

This is highly readable, but to reach the answer it creates two intermediate objects, sys_date and sys_yr, that are not needed afterwards, which seems wasteful. Let's rewrite the code more compactly:

sys_yr_num <- as.numeric(format(Sys.Date(), format = "%Y"))
sys_yr_num
## > sys_yr_num <- as.numeric(format(Sys.Date(), format = "%Y"))
## > sys_yr_num
## [1] 2019

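This version is compact but less readable, especially with so many parentheses; nobody enjoys checking which opening parenthesis matches which closing one. Whenever several function calls are chained so that the output of one becomes the input of the next, the %>% operator should come to mind, because it gives us both readability and brevity.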

library(magrittr)

sys_yr_num <- Sys.Date() %>% 
  format("%Y") %>% 
  as.numeric()
sys_yr_num
## > library(magrittr)
## > 
## > sys_yr_num <- Sys.Date() %>% 
## +   format("%Y") %>% 
## +   as.numeric()
## > sys_yr_num
## [1] 2019
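
Because Sys.Date() changes from day to day, the three styles are easier to compare on a fixed date. The sketch below assumes an arbitrary date, 2019-01-15, chosen only for illustration, and checks that the nested and piped versions agree.

```r
library(magrittr)

# A fixed, arbitrary date keeps the result reproducible.
fixed_date <- as.Date("2019-01-15")

# Nested calls: read from the inside out.
nested <- as.numeric(format(fixed_date, format = "%Y"))

# Piped calls: read from top to bottom.
piped <- fixed_date %>%
  format("%Y") %>%
  as.numeric()

identical(nested, piped)
```

The pipe changes only how the calls are written, not what they compute, so identical() returns TRUE.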

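Combining %>% with other operators

Within a chain of function calls joined by %>%, you can also use arithmetic operators (+, -, *, /, **, %%, %/% and so on), subsetting operators ([, [[, $ and so on), and comparison or logical operators (|, &, >, <, ==, !=, %in% and so on). Wrap the operator in backticks (`) so that it can be called like a function, then put its operands in the parentheses that follow; the backtick key is at the top-left of the keyboard, above Tab and to the left of the 1 key. Take printing jerseys for superstar players as an example: besides the number, a jersey also carries the player's family name, so LeBron James's jersey shows "JAMES" along with the number 23.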

NBA Store

Suppose the character vector super_nba_stars stores several NBA superstars (retired or active). We need to pull LeBron James out of super_nba_stars and convert every letter of his family name to upper case.

# NBA superstars
super_nba_stars <- c("Steve Nash", "Michael Jordan", "LeBron James", "Dirk Nowitzki", "Hakeem Olajuwon")
lbj <- super_nba_stars %>% 
  strsplit(split = " ") %>% 
  `[[` (3) %>% 
  `[` (2) %>% 
  toupper()
lbj
## > # NBA superstars
## > super_nba_stars <- c("Steve Nash", "Michael Jordan", "LeBron James", "Dirk Nowitzki", "Hakeem Olajuwon")
## > lbj <- super_nba_stars %>% 
## +   strsplit(split = " ") %>% 
## +   `[[` (3) %>% 
## +   `[` (2) %>% 
## +   toupper()
## > lbj
## [1] "JAMES"
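
The backtick form works because operators in R are ordinary functions. As a sketch of an alternative, magrittr also exports named aliases for the subsetting operators, extract2() for `[[` and extract() for `[`, which some readers find easier to scan. The variable names below are illustrative.

```r
library(magrittr)

# An operator wrapped in backticks can be called like any function.
three <- `+`(1, 2)

player <- "LeBron James"

# Backtick form, as in the pipeline above.
with_backticks <- player %>%
  strsplit(split = " ") %>%
  `[[`(1) %>%
  `[`(2) %>%
  toupper()

# magrittr aliases: extract2() is `[[` and extract() is `[`.
with_aliases <- player %>%
  strsplit(split = " ") %>%
  extract2(1) %>%
  extract(2) %>%
  toupper()

identical(with_backticks, with_aliases)
```

Note that other packages, tidyr for example, also export a function named extract(), so the backtick form is the safer choice when many packages are attached.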

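Basic functions in the dplyr package

Next we introduce the dplyr package. Compared with base R's data handling syntax (built mainly around []), dplyr offers many functions whose concepts resemble Structured Query Language (SQL). Used together with the %>% operator, they make data wrangling considerably easier. Let's get to know the basic functions in dplyr.

  • filter(): keep the observations that satisfy a condition
  • select(): choose variables
  • mutate(): add variables
  • arrange(): sort observations by variables
  • summarise(): aggregate variables
  • group_by(): group by a categorical variable, usually together with summarise()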

The filter() function in dplyr

filter() takes the data frame to filter and the condition to filter by. For example, we can pick Michael Jordan out of the chicago_bulls data frame, which gives a 1x7 data frame.

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
mj <- chicago_bulls %>% 
  filter(Player == "Michael Jordan")
mj
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > mj <- chicago_bulls %>% 
## +   filter(Player == "Michael Jordan")
## > mj
##   No.         Player Pos  Ht  Wt        Birth.Date                      College
## 1  23 Michael Jordan  SG 6-6 195 February 17, 1963 University of North Carolina

This is also a good moment to compare with the base R syntax we learned earlier.

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
mj <- chicago_bulls[chicago_bulls$Player == "Michael Jordan", ]
mj
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > mj <- chicago_bulls[chicago_bulls$Player == "Michael Jordan", ]
## > mj
##   No.         Player Pos  Ht  Wt        Birth.Date                      College
## 7  23 Michael Jordan  SG 6-6 195 February 17, 1963 University of North Carolina
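
The two outputs differ in one detail: the base R result keeps the original row label (7), while the filter() result is relabelled from 1. The sketch below reproduces this on a hypothetical five-row data frame named bulls, typed in by hand from the values printed in this article so that it runs without downloading the CSV.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  No. = c(23, 33, 91, 25, 13),
  Player = c("Michael Jordan", "Scottie Pippen", "Dennis Rodman",
             "Steve Kerr", "Luc Longley"),
  Pos = c("SG", "SF", "PF", "PG", "C"),
  Wt = c(195, 210, 210, 175, 265),
  stringsAsFactors = FALSE
)

# Base R subsetting keeps the row label of the matching row.
base_result <- bulls[bulls$Player == "Dennis Rodman", ]
rownames(base_result)

# filter() returns the same row, relabelled from 1.
dplyr_result <- bulls %>%
  filter(Player == "Dennis Rodman")
rownames(dplyr_result)
```

This matters mainly when later code relies on row labels to refer back to the original data frame.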

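Inside filter() we can combine conditions as a union with |, or as an intersection with & (or by chaining several filter() calls with %>%). For example, we can pick the trio of Michael Jordan, Scottie Pippen, and Dennis Rodman out of chicago_bulls, which gives a 3x7 data frame.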

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
trio <- chicago_bulls %>% 
  filter(Player == "Michael Jordan" | Player == "Scottie Pippen" | Player == "Dennis Rodman")
trio
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > trio <- chicago_bulls %>% 
## +   filter(Player == "Michael Jordan" | Player == "Scottie Pippen" | Player == "Dennis Rodman")
## > trio
##   No.         Player Pos  Ht  Wt         Birth.Date                                College
## 1  23 Michael Jordan  SG 6-6 195  February 17, 1963           University of North Carolina
## 2  33 Scottie Pippen  SF 6-8 210 September 25, 1965         University of Central Arkansas
## 3  91  Dennis Rodman  PF 6-7 210       May 13, 1961 Southeastern Oklahoma State University
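
A long chain of | comparisons against the same variable can usually be shortened with %in%, and an intersection can be written in three equivalent ways. The sketch below shows both on a hypothetical hand-typed subset named bulls (values copied from the output in this article); the weight threshold of 200 is arbitrary.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  No. = c(23, 33, 91, 25, 13),
  Player = c("Michael Jordan", "Scottie Pippen", "Dennis Rodman",
             "Steve Kerr", "Luc Longley"),
  Pos = c("SG", "SF", "PF", "PG", "C"),
  Wt = c(195, 210, 210, 175, 265),
  stringsAsFactors = FALSE
)

trio_names <- c("Michael Jordan", "Scottie Pippen", "Dennis Rodman")

# Union: %in% replaces a chain of `|` comparisons.
trio <- bulls %>%
  filter(Player %in% trio_names)

# Intersection: `&`, a comma, and chained filter() calls are equivalent.
with_and <- bulls %>% filter(Wt >= 200 & Pos == "SF")
with_comma <- bulls %>% filter(Wt >= 200, Pos == "SF")
chained <- bulls %>% filter(Wt >= 200) %>% filter(Pos == "SF")
```

In this subset, all three intersection forms return the single row for Scottie Pippen.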

The select() function in dplyr

select() takes the data frame to select from and the names of the variables to keep. For example, we can select the Player variable from chicago_bulls, which gives a 15x1 data frame.

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
players <- chicago_bulls %>% 
  select(Player)
players
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > players <- chicago_bulls %>% 
## +   select(Player)
## > players
##             Player
## 1      Randy Brown
## 2     Jud Buechler
## 3     Jason Caffey
## 4    James Edwards
## 5       Jack Haley
## 6       Ron Harper
## 7   Michael Jordan
## 8       Steve Kerr
## 9       Toni Kukoc
## 10     Luc Longley
## 11  Scottie Pippen
## 12   Dennis Rodman
## 13     John Salley
## 14 Dickey Simpkins
## 15 Bill Wennington

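Unlike base R syntax, select() does not convert the result to a vector when a single variable is selected; it stays a data frame. In base syntax, specifying drop = FALSE inside the square brackets achieves the same effect.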

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
players <- chicago_bulls[, "Player", drop = FALSE]
players
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > players <- chicago_bulls[, "Player", drop = FALSE]
## > players
##             Player
## 1      Randy Brown
## 2     Jud Buechler
## 3     Jason Caffey
## 4    James Edwards
## 5       Jack Haley
## 6       Ron Harper
## 7   Michael Jordan
## 8       Steve Kerr
## 9       Toni Kukoc
## 10     Luc Longley
## 11  Scottie Pippen
## 12   Dennis Rodman
## 13     John Salley
## 14 Dickey Simpkins
## 15 Bill Wennington
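
The difference in return type is easy to check directly. The sketch below uses a hypothetical hand-typed subset named bulls, and also shows dplyr's pull(), which is the function to reach for when a plain vector is what you want.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  No. = c(23, 33, 91, 25, 13),
  Player = c("Michael Jordan", "Scottie Pippen", "Dennis Rodman",
             "Steve Kerr", "Luc Longley"),
  Pos = c("SG", "SF", "PF", "PG", "C"),
  Wt = c(195, 210, 210, 175, 265),
  stringsAsFactors = FALSE
)

# Base R: a single column is dropped to a vector by default.
as_vector <- bulls[, "Player"]
class(as_vector)

# Base R with drop = FALSE, and select(): both stay data frames.
as_df_base <- bulls[, "Player", drop = FALSE]
as_df_dplyr <- bulls %>% select(Player)
class(as_df_dplyr)

# pull() extracts the column as a vector.
as_vector_dplyr <- bulls %>% pull(Player)
```

Keeping the result a data frame means the next function in a pipeline can treat it the same way regardless of how many variables were selected.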

select() can also rename variables while selecting them; here the number variable becomes jersey_number and the player variable becomes player_name.

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls %>% 
  select(jersey_number = No., player_name = Player)
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls %>% 
## +   select(jersey_number = No., player_name = Player)
##    jersey_number     player_name
## 1              0     Randy Brown
## 2             30    Jud Buechler
## 3             35    Jason Caffey
## 4             53   James Edwards
## 5             54      Jack Haley
## 6              9      Ron Harper
## 7             23  Michael Jordan
## 8             25      Steve Kerr
## 9              7      Toni Kukoc
## 10            13     Luc Longley
## 11            33  Scottie Pippen
## 12            91   Dennis Rodman
## 13            22     John Salley
## 14             8 Dickey Simpkins
## 15            34 Bill Wennington
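
Renaming inside select() also drops every variable that is not listed. When the goal is only to rename, dplyr's rename() keeps the remaining variables. The sketch below contrasts the two on a hypothetical hand-typed subset named bulls.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  No. = c(23, 33, 91, 25, 13),
  Player = c("Michael Jordan", "Scottie Pippen", "Dennis Rodman",
             "Steve Kerr", "Luc Longley"),
  Pos = c("SG", "SF", "PF", "PG", "C"),
  Wt = c(195, 210, 210, 175, 265),
  stringsAsFactors = FALSE
)

# select() keeps only the listed variables, under their new names.
selected <- bulls %>%
  select(jersey_number = No., player_name = Player)
names(selected)

# rename() changes the names and keeps all other variables.
renamed <- bulls %>%
  rename(jersey_number = No., player_name = Player)
names(renamed)
```

Both functions use the new_name = old_name order.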

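The mutate() function in dplyr

mutate() takes the data frame to which we want to add derived or non-derived variables, together with the rules for creating them. For example, we add to chicago_bulls a derived variable wt_kg (weight in kilograms) and a non-derived variable season (the season).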

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls %>% 
  mutate(
    wt_kg = round(Wt * 0.45359),
    season = "1995-96"
  )
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls %>% 
## +   mutate(
## +     wt_kg = round(Wt * 0.45359),
## +     season = "1995-96"
## +   )
##    No.          Player Pos   Ht  Wt         Birth.Date                                            College wt_kg  season
## 1    0     Randy Brown  PG  6-2 190       May 22, 1968 University of Houston, New Mexico State University    86 1995-96
## 2   30    Jud Buechler  SF  6-6 220      June 19, 1968                              University of Arizona   100 1995-96
## 3   35    Jason Caffey  PF  6-8 255      June 12, 1973                              University of Alabama   116 1995-96
## 4   53   James Edwards   C  7-0 225  November 22, 1955                           University of Washington   102 1995-96
## 5   54      Jack Haley   C 6-10 240   January 27, 1964              University of California, Los Angeles   109 1995-96
## 6    9      Ron Harper  PG  6-6 185   January 20, 1964                                   Miami University    84 1995-96
## 7   23  Michael Jordan  SG  6-6 195  February 17, 1963                       University of North Carolina    88 1995-96
## 8   25      Steve Kerr  PG  6-3 175 September 27, 1965                              University of Arizona    79 1995-96
## 9    7      Toni Kukoc  SF 6-10 192 September 18, 1968                                                       87 1995-96
## 10  13     Luc Longley   C  7-2 265   January 19, 1969                           University of New Mexico   120 1995-96
## 11  33  Scottie Pippen  SF  6-8 210 September 25, 1965                     University of Central Arkansas    95 1995-96
## 12  91   Dennis Rodman  PF  6-7 210       May 13, 1961             Southeastern Oklahoma State University    95 1995-96
## 13  22     John Salley  PF 6-11 230       May 16, 1964                    Georgia Institute of Technology   104 1995-96
## 14   8 Dickey Simpkins  PF  6-9 248      April 6, 1972                                 Providence College   112 1995-96
## 15  34 Bill Wennington   C  7-0 245     April 26, 1963                              St. John's University   111 1995-96
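
Within a single mutate() call, a variable can refer to one created earlier in the same call. The sketch below illustrates this on a hypothetical hand-typed subset named bulls; the variable is_heavy and its 100 kg threshold are invented for the example.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  No. = c(23, 33, 91, 25, 13),
  Player = c("Michael Jordan", "Scottie Pippen", "Dennis Rodman",
             "Steve Kerr", "Luc Longley"),
  Pos = c("SG", "SF", "PF", "PG", "C"),
  Wt = c(195, 210, 210, 175, 265),
  stringsAsFactors = FALSE
)

converted <- bulls %>%
  mutate(
    wt_kg = Wt * 0.45359,                # pounds to kilograms
    wt_kg_rounded = round(wt_kg),        # uses wt_kg from the line above
    is_heavy = wt_kg_rounded >= 100      # arbitrary illustrative threshold
  )
converted
```

The variables are created in the order they are written, so wt_kg must come before anything that depends on it.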

The arrange() function in dplyr

arrange() takes the data frame to sort and the variable names to sort the observations by. For example, sort the chicago_bulls data frame by the Wt variable.

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls %>% 
  arrange(Wt)
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls %>% 
## +   arrange(Wt)
##    No.          Player Pos   Ht  Wt         Birth.Date                                            College
## 1   25      Steve Kerr  PG  6-3 175 September 27, 1965                              University of Arizona
## 2    9      Ron Harper  PG  6-6 185   January 20, 1964                                   Miami University
## 3    0     Randy Brown  PG  6-2 190       May 22, 1968 University of Houston, New Mexico State University
## 4    7      Toni Kukoc  SF 6-10 192 September 18, 1968                                                   
## 5   23  Michael Jordan  SG  6-6 195  February 17, 1963                       University of North Carolina
## 6   33  Scottie Pippen  SF  6-8 210 September 25, 1965                     University of Central Arkansas
## 7   91   Dennis Rodman  PF  6-7 210       May 13, 1961             Southeastern Oklahoma State University
## 8   30    Jud Buechler  SF  6-6 220      June 19, 1968                              University of Arizona
## 9   53   James Edwards   C  7-0 225  November 22, 1955                           University of Washington
## 10  22     John Salley  PF 6-11 230       May 16, 1964                    Georgia Institute of Technology
## 11  54      Jack Haley   C 6-10 240   January 27, 1964              University of California, Los Angeles
## 12  34 Bill Wennington   C  7-0 245     April 26, 1963                              St. John's University
## 13   8 Dickey Simpkins  PF  6-9 248      April 6, 1972                                 Providence College
## 14  35    Jason Caffey  PF  6-8 255      June 12, 1973                              University of Alabama
## 15  13     Luc Longley   C  7-2 265   January 19, 1969                           University of New Mexico

Sorting is ascending by default (numbers from small to large, letters from A to Z). To sort in descending order, wrap the variable name in desc().

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls %>% 
  arrange(desc(Wt))
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls %>% 
## +   arrange(desc(Wt))
##    No.          Player Pos   Ht  Wt         Birth.Date                                            College
## 1   13     Luc Longley   C  7-2 265   January 19, 1969                           University of New Mexico
## 2   35    Jason Caffey  PF  6-8 255      June 12, 1973                              University of Alabama
## 3    8 Dickey Simpkins  PF  6-9 248      April 6, 1972                                 Providence College
## 4   34 Bill Wennington   C  7-0 245     April 26, 1963                              St. John's University
## 5   54      Jack Haley   C 6-10 240   January 27, 1964              University of California, Los Angeles
## 6   22     John Salley  PF 6-11 230       May 16, 1964                    Georgia Institute of Technology
## 7   53   James Edwards   C  7-0 225  November 22, 1955                           University of Washington
## 8   30    Jud Buechler  SF  6-6 220      June 19, 1968                              University of Arizona
## 9   33  Scottie Pippen  SF  6-8 210 September 25, 1965                     University of Central Arkansas
## 10  91   Dennis Rodman  PF  6-7 210       May 13, 1961             Southeastern Oklahoma State University
## 11  23  Michael Jordan  SG  6-6 195  February 17, 1963                       University of North Carolina
## 12   7      Toni Kukoc  SF 6-10 192 September 18, 1968                                                   
## 13   0     Randy Brown  PG  6-2 190       May 22, 1968 University of Houston, New Mexico State University
## 14   9      Ron Harper  PG  6-6 185   January 20, 1964                                   Miami University
## 15  25      Steve Kerr  PG  6-3 175 September 27, 1965                              University of Arizona
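
In the output above, Scottie Pippen and Dennis Rodman both weigh 210 and simply keep their original relative order. arrange() accepts further variables to break such ties. The sketch below, on a hypothetical hand-typed subset named bulls, sorts by weight in descending order and then by player name in ascending order.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  No. = c(23, 33, 91, 25, 13),
  Player = c("Michael Jordan", "Scottie Pippen", "Dennis Rodman",
             "Steve Kerr", "Luc Longley"),
  Pos = c("SG", "SF", "PF", "PG", "C"),
  Wt = c(195, 210, 210, 175, 265),
  stringsAsFactors = FALSE
)

# Heaviest first; players with equal weight are ordered by name.
sorted <- bulls %>%
  arrange(desc(Wt), Player)
sorted$Player
```

With the tie-breaker in place, Dennis Rodman is listed ahead of Scottie Pippen.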

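The summarise() function in dplyr

summarise() takes the data frame to summarise and the variables to summarise (or aggregate). The result is usually a numeric vector of length 1 (or a vector shorter than the number of original observations) that summarises a variable of the data frame; sums, means, and standard deviations are all results of summary operations. For example, we can compute the average weight of all players in chicago_bulls.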

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls %>% 
  summarise(avg_wt = mean(Wt))
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls %>% 
## +   summarise(avg_wt = mean(Wt))
##   avg_wt
## 1    219
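
A single summarise() call can compute several summaries at once, each becoming one column of the result. The sketch below uses a hypothetical hand-typed subset named bulls, so its average differs from the 219 computed on the full roster; n() is dplyr's row-count helper.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  No. = c(23, 33, 91, 25, 13),
  Player = c("Michael Jordan", "Scottie Pippen", "Dennis Rodman",
             "Steve Kerr", "Luc Longley"),
  Pos = c("SG", "SF", "PF", "PG", "C"),
  Wt = c(195, 210, 210, 175, 265),
  stringsAsFactors = FALSE
)

wt_summary <- bulls %>%
  summarise(
    n_players = n(),     # number of rows
    avg_wt = mean(Wt),   # average weight
    max_wt = max(Wt),    # heaviest player
    sd_wt = sd(Wt)       # standard deviation of weight
  )
wt_summary
```

The result is a data frame with one row and four columns.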

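The group_by() function in dplyr

Summaries are often combined with group_by(), and this is where the %>% operator fits in naturally. For example, we can compute the average weight for each position in chicago_bulls.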

library(dplyr)

csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
chicago_bulls <- read.csv(csv_url)
chicago_bulls %>% 
  group_by(Pos) %>% 
  summarise(avg_wt = mean(Wt))
## > library(dplyr)
## > 
## > csv_url <- "https://s3-ap-northeast-1.amazonaws.com/r-essentials/chicago_bulls_1995_1996.csv"
## > chicago_bulls <- read.csv(csv_url)
## > chicago_bulls %>% 
## +   group_by(Pos) %>% 
## +   summarise(avg_wt = mean(Wt))
## # A tibble: 5 x 2
##   Pos   avg_wt
##   <fct>  <dbl>
## 1 C       244.
## 2 PF      236.
## 3 PG      183.
## 4 SF      207.
## 5 SG      195
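
For comparison with base R, the same grouped average can be computed with tapply(). The sketch below uses a hypothetical six-row subset named bulls, hand-typed from the roster above so that two positions have more than one player, and also counts the players per position with n().

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  Player = c("Randy Brown", "Steve Kerr", "Michael Jordan",
             "Jud Buechler", "Scottie Pippen", "Luc Longley"),
  Pos = c("PG", "PG", "SG", "SF", "SF", "C"),
  Wt = c(190, 175, 195, 220, 210, 265),
  stringsAsFactors = FALSE
)

# dplyr: group first, then summarise each group.
by_pos <- bulls %>%
  group_by(Pos) %>%
  summarise(n_players = n(), avg_wt = mean(Wt))
by_pos

# Base R: tapply() returns a named array of group means.
base_avg <- tapply(bulls$Wt, bulls$Pos, mean)
base_avg
```

The dplyr version returns a data frame with one row per group, which is easier to pass on to further steps than the named array from tapply().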

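Remember to call group_by() first and then chain summarise(). This is the reverse of the order in which a SQL query is written, where the aggregate appears in the SELECT clause ahead of GROUP BY, so users fluent in SQL should watch for the difference.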

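Once group_by() is involved, the output is a tibble, an enhanced kind of data frame (the other functions above return an ordinary data frame when given one, as their printed output shows). To avoid burdening beginners, we do not go into how a tibble differs from a base data frame; a tibble can be converted to a base R data frame with as.data.frame().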
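
A quick way to see which kind of object a pipeline returned is class(). The sketch below, on a hypothetical hand-typed subset named bulls, checks the class of a grouped summary, converts it with as.data.frame(), and compares it with the result of a plain filter() call.

```r
library(dplyr)

# Hypothetical hand-typed subset of chicago_bulls (illustration only).
bulls <- data.frame(
  Player = c("Randy Brown", "Steve Kerr", "Michael Jordan",
             "Jud Buechler", "Scottie Pippen", "Luc Longley"),
  Pos = c("PG", "PG", "SG", "SF", "SF", "C"),
  Wt = c(190, 175, 195, 220, 210, 265),
  stringsAsFactors = FALSE
)

# A grouped summary is a tibble: its class vector includes "tbl_df".
grouped <- bulls %>%
  group_by(Pos) %>%
  summarise(avg_wt = mean(Wt))
class(grouped)

# as.data.frame() converts it back to a base data frame.
plain <- as.data.frame(grouped)
class(plain)

# filter() on a base data frame returns a base data frame.
heavy <- bulls %>% filter(Wt > 200)
class(heavy)
```

A tibble still inherits from data.frame, so most code written for base data frames accepts it without conversion.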

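Summary

This section introduced how to use dplyr to work with data frames: the %>% operator for chaining functions, when to use it, how to combine it with other operators, and the six basic functions of the dplyr package.

Further reading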