進階資料框操作技巧

Tidy datasets are all alike, but every messy dataset is messy in its own way.

Hadley Wickham

基礎資料框操作技巧我們簡介了資料科學團隊常在面對佔據主流地位的表格式資料(Tabular Data)時,於 Python 與 R 語言中以資料框(Data Frame)來進行基礎操作,包含像是建立、檢視與篩選等,這個小節將繼續專注資料框(Data Frame)的進階操作技巧。

調整變數的型別

Python

一個 pandas 資料框的組成有四個層次,如果從下往上(bottom-up)檢視:先是資料,接著同樣型別的資料可以結合成為一個陣列(ndarray),然後陣列加入索引標籤可以成為一個 Series,最後多個 Series 能夠合併為一個資料框。因此如果有變數型別想要調整,只需要從上往下(top-down)檢視:將變數單獨取出成為一個 Series,調整型別之後再指派回資料框的原變數位置即可。以 1995 至 1996 年球季芝加哥公牛隊先發陣容的資料框為例,背號原本是整數型別(int),可以利用 .astype() 方法將其調整成文字(object)。

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
print(df["number"].dtype)
print(df["number"].values)
df["number"] = df["number"].astype(str)
print(df["number"].dtype)
print(df["number"].values)
1
2
3
4
5
6
7
8
9
10
11
12
## int64
## [ 9 23 33 91 13]
## object
## ['9' '23' '33' '91' '13']
1
2
3
4

R 語言

R 語言具有一系列調整變數型別的變數可以轉換一整個向量的型別:

  • as.numeric() 轉換為數值向量
  • as.character() 轉換為文字向量
  • as.logical() 轉換為邏輯值向量

在 R 語言中資料框亦是由多個向量組合而成,我們能夠將變數單獨取出成為一個向量,調整型別之後再指派回資料框的原變數位置即可。以 1995 至 1996 年球季芝加哥公牛隊先發陣容的資料框為例,背號原本是數值型別(numeric),可以利用 as.character() 將其調整成文字(character)。

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
df <- data.frame(number = numbers,
                 player = players,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
class(df$number)
df$number
df$number <- as.character(df$number)
class(df$number)
df$number
1
2
3
4
5
6
7
8
9
10
## [1] "numeric"
## [1]  9 23 33 91 13
## [1] "character"
## [1] "9"  "23" "33" "91" "13"
1
2
3
4

對文字變數重新編碼

Python

利用 .map() 方法並輸入一個 dict,在這個 dict 中利用 key-value 的對應關係實現重新編碼,像是輸入球員姓名與鋒衛位置的對應 dict:

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
position_dict = {
    "Ron Harper": "PG",
    "Michael Jordan": "SG",
    "Scottie Pippen": "SF",
    "Dennis Rodman": "PF",
    "Luc Longley": "C"
}
df["position"] = df["player"].map(position_dict)
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

輸入球員姓名與鋒衛位置的對應

輸入鋒衛位置與前場後場的對應:

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
position_dict = {
    "Ron Harper": "PG",
    "Michael Jordan": "SG",
    "Scottie Pippen": "SF",
    "Dennis Rodman": "PF",
    "Luc Longley": "C"
}
df["position"] = df["player"].map(position_dict)
court_dict = {
    "PG": "Back",
    "SG": "Back",
    "SF": "Front",
    "PF": "Front",
    "C": "Front"
}
df["court"] = df["position"].map(court_dict)
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

輸入鋒衛位置與前場後場的對應

前一個例子由於前場後場是一個二元分類,我們也可以寫一個簡潔的 lambda 表示來完成對應,相等於 PG 或 SG 就是後場,否則就是前場:

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
position_dict = {
    "Ron Harper": "PG",
    "Michael Jordan": "SG",
    "Scottie Pippen": "SF",
    "Dennis Rodman": "PF",
    "Luc Longley": "C"
}
df["position"] = df["player"].map(position_dict)
df["court"] = df["position"].map(lambda x: "Back" if x in ['PG', 'SG'] else "Front")
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

使用 lambda 表示完成鋒衛位置與前場後場的對應

R 語言

利用流程控制定義一個函數再使用 sapply() 應用至資料框變數上,最後指派至資料框中為一個新變數,像是球員姓名與鋒衛位置的對應:

get_position <- function(player) {
  if (player == "Ron Harper") {
    return("PG")
  } else if (player == "Michael Jordan") {
    return("SG")
  } else if (player == "Scottie Pippen") {
    return("SF")
  } else if (player == "Dennis Rodman") {
    return("PF")
  } else {
    return("C")
  }
}

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
df <- data.frame(number = numbers,
                 player = players,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df$position <- sapply(df$player, FUN = get_position)
View(df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

球員姓名與鋒衛位置的對應

鋒衛位置與前場後場的對應是一個二元分類,我們利用內建的 ifelse() 函數寫一個簡潔對應,相等於 PG 或 SG 就是後場,否則就是前場:

get_position <- function(player) {
  if (player == "Ron Harper") {
    return("PG")
  } else if (player == "Michael Jordan") {
    return("SG")
  } else if (player == "Scottie Pippen") {
    return("SF")
  } else if (player == "Dennis Rodman") {
    return("PF")
  } else {
    return("C")
  }
}

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
df <- data.frame(number = numbers,
                 player = players,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df$position <- sapply(df$player, FUN = get_position)
df$court <- ifelse(df$position %in% c("PG", "SG"), "Back", "Front")
View(df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

鋒衛位置與前場後場的對應

對數字重新歸類分組為文字變數

Python

同樣利用 .map() 方法並輸入一個數字歸類分組為文字的函數,在這個函數中我們定義數字的區間與組別,像是低於 200 磅的球員,歸類為輕(Light);介於 200 至 250 磅的球員,歸類為中等(Medium);大於 250 磅的球員,歸類為重(Heavy):

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
weights = [185, 195, 210, 210, 265]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["weight"] = weights

def get_weight_category(wt):
  if wt < 200:
    return "Light"
  elif 200 <= wt < 250:
    return "Medium"
  else:
    return "Heavy"
  
df["weight_category"] = df["weight"].map(get_weight_category)
df
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

依照球員體重歸類為輕、中等與重

R 語言

利用流程控制定義一個函數再使用 sapply() 應用至資料框變數上,在這個函數中我們定義數字的區間與組別,像是低於 200 磅的球員,歸類為輕(Light);介於 200 至 250 磅的球員,歸類為中等(Medium);大於 250 磅的球員,歸類為重(Heavy):

get_weight_cat <- function(weight) {
  if (weight < 200) {
    return("Light")
  } else if (weight >= 200 & weight < 250) {
    return("Medium")
  } else {
    return("Heavy")
  }
}

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
weights <- c(185, 195, 210, 210, 265)
df <- data.frame(number = numbers,
                 player = players,
                 weight = weights,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df$weight_cat <- sapply(df$weight, FUN = get_weight_cat)
View(df)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

低於 200 磅的球員,歸類為輕(Light);介於 200 至 250 磅的球員,歸類為中等(Medium);大於 250 磅的球員,歸類為重(Heavy)

或者使用內建函數 cut() 利用四個數線上的位置 -Inf、200、250 與 Inf 作為切點(當然,這裡使用負無限大與無限大處理體重這樣的數值是顯得小題大作),分出三個標籤 Light、Medium 與 Heavy:

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
weights <- c(185, 195, 210, 210, 265)
df <- data.frame(number = numbers,
                 player = players,
                 weight = weights,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df$weight_cat <- cut(df$weight, breaks = c(-Inf, 200, 250, Inf), labels = c("Light", "Medium", "Heavy"))
View(df)
1
2
3
4
5
6
7
8
9

使用內建函數 cut()

處理遺漏值

Python

在 Python pandas 的資料框中,內建型別 None 以及 NumPy 模組中的 nan 會被視作為遺漏值(NA,Not Available),或者稱作未知原因而不存在的資料(not present for whatever reason)。像是 1995 至 1996 年球季芝加哥公牛隊先發陣容以及最佳第六人 Toni Kukoc 的美國大學學歷資料,由於 Toni Kuoc 是歐洲球員沒有美國大學學歷,在記錄時可以利用 None 或 np.nan。

import pandas as pd
import numpy as np

numbers = [9, 23, 33, 91, 13, 7]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Toni Kukoc"]
colleges = ["Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", None] # None 替換為 np.nan 亦可
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["college"] = colleges
df
1
2
3
4
5
6
7
8
9
10
11

1995 至 1996 年球季芝加哥公牛隊先發陣容以及最佳第六人 Toni Kukoc 的美國大學學歷資料

我們可以利用 .isna() 方法來判斷資料中是否存在遺漏值並加以篩選。

import pandas as pd
import numpy as np

numbers = [9, 23, 33, 91, 13, 7]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Toni Kukoc"]
colleges = ["Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", None] # None 替換為 np.nan 亦可
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["college"] = colleges
print(df["college"].isna()) # 判斷大學是否有遺漏值
df[df["college"].isna()]    # 篩選出大學為遺漏值的列數
1
2
3
4
5
6
7
8
9
10
11
12

利用 .isna() 方法來判斷資料中是否存在遺漏值並加以篩選

或者反過來,利用 .notna() 方法並篩選資料框中完整的列數。

import pandas as pd
import numpy as np

numbers = [9, 23, 33, 91, 13, 7]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Toni Kukoc"]
colleges = ["Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", np.nan] # np.nan 替換為 None 亦可
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["college"] = colleges
print(df["college"].notna()) # 判斷大學是否無遺漏值
df[df["college"].notna()]    # 篩選出大學非遺漏值的列數
1
2
3
4
5
6
7
8
9
10
11
12

利用 .notna() 方法並篩選資料框中完整的列數

如果希望填補遺漏值,可以利用 .fillna() 方法輸入欲填補的值,像是將 Toni Kukoc 的大學學歷填入克羅埃西亞:

import pandas as pd
import numpy as np

numbers = [9, 23, 33, 91, 13, 7]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Toni Kukoc"]
colleges = ["Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", None] # None 替換為 np.nan 亦可
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["college"] = colleges
df["college"] = df["college"].fillna("Croatia")
df
1
2
3
4
5
6
7
8
9
10
11
12

將 Toni Kukoc 的大學學歷填入克羅埃西亞

R 語言

在 R 語言的資料框中,內建型別 NA 會被視作為遺漏值(NA,Not Available),或者稱作未知原因而不存在的資料(not present for whatever reason)。像是 1995 至 1996 年球季芝加哥公牛隊先發陣容以及最佳第六人 Toni Kukoc 的美國大學學歷資料,由於 Toni Kuoc 是歐洲球員沒有美國大學學歷,在記錄時可以利用 NA。

numbers <- c(9, 23, 33, 91, 13, 7)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Tony Kukoc")
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", NA)
df <- data.frame(number = numbers,
                 player = players,
                 college = colleges,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
View(df)
1
2
3
4
5
6
7
8

1995 至 1996 年球季芝加哥公牛隊先發陣容以及最佳第六人 Toni Kukoc 的美國大學學歷資料

我們可以利用 is.na() 函數來判斷資料中是否存在遺漏值並加以篩選:

numbers <- c(9, 23, 33, 91, 13, 7)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Tony Kukoc")
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", NA)
df <- data.frame(number = numbers,
                 player = players,
                 college = colleges,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
is.na(df$college)
df[is.na(df$college), ]
1
2
3
4
5
6
7
8
9
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE
##   number     player college
## 6      7 Tony Kukoc    <NA>
1
2
3

使用 ! 運算符號可以將 is.na() 函數的判斷結果反轉,改為篩選有美國大學學歷的觀測值:

numbers <- c(9, 23, 33, 91, 13, 7)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Tony Kukoc")
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", NA)
df <- data.frame(number = numbers,
                 player = players,
                 college = colleges,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
!(is.na(df$college))
View(df[!(is.na(df$college)), ])
1
2
3
4
5
6
7
8
9
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
1

篩選有美國大學學歷的觀測值

R 語言另外具備一個 complete.cases() 函數,可以回傳判斷資料框的每一列觀測值是否完整(完全不含遺漏值 NA):

numbers <- c(9, 23, 33, 91, 13, 7)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Tony Kukoc")
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", NA)
df <- data.frame(number = numbers,
                 player = players,
                 college = colleges,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
complete.cases(df)
View(df[complete.cases(df), ])
1
2
3
4
5
6
7
8
9
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
1

篩選有美國大學學歷的觀測值

如果希望填補遺漏值,可以篩選出遺漏的元素直接指派欲填補的值,像是將 Toni Kukoc 的大學學歷填入克羅埃西亞:

numbers <- c(9, 23, 33, 91, 13, 7)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley", "Tony Kukoc")
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico", NA)
df <- data.frame(number = numbers,
                 player = players,
                 college = colleges,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df$college[is.na(df$college)] <- "Croatia"
View(df)
1
2
3
4
5
6
7
8
9

將 Toni Kukoc 的大學學歷填入克羅埃西亞

處理時間序列

Python

如果資料框中記錄的是與日期或日期時間相關的資訊,可以利用 pandas 的 to_datetime() 函數將原本的字元轉換為具備更多功能的日期時間型別,像是如果將 1995 至 1996 年球季芝加哥公牛隊先發陣容的出生年月日轉換為日期時間型別,就可以實踐依照年紀大小排序,而這是在本來以文字記錄出生年月日時候並不具備的功能。

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
birth_dates = ["January 20, 1964", "February 17, 1963", "September 25, 1965", "May 13, 1961", "January 19, 1969"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["birth_date"] = birth_dates
print(df["birth_date"].dtype) # 字元型別
df
1
2
3
4
5
6
7
8
9
10
11

birth_date 為字元型別

pandas 的 to_datetime() 函數有預設辨別的日期時間格式,例子中的 %B %d, %Y 是能夠被解析的格式,因此不需另外指定:

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
birth_dates = ["January 20, 1964", "February 17, 1963", "September 25, 1965", "May 13, 1961", "January 19, 1969"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["birth_date"] = birth_dates
df["birth_date"] = pd.to_datetime(df["birth_date"]) # 轉換字元為日期時間型別
print(df["birth_date"].dtype)
df
1
2
3
4
5
6
7
8
9
10
11
12

birth_date 為日期時間型別

轉換成為日期時間型別之後就能夠依照年紀排序,這是在本來以文字記錄時候並不具備的功能:

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
birth_dates = ["January 20, 1964", "February 17, 1963", "September 25, 1965", "May 13, 1961", "January 19, 1969"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["birth_date"] = birth_dates
df["birth_date"] = pd.to_datetime(df["birth_date"]) # 轉換字元為日期時間型別
df.sort_values("birth_date")
1
2
3
4
5
6
7
8
9
10
11

轉換成為日期時間型別之後就能夠依照年紀排序

處理常見時間序列資料(例如股價、匯率或利率等)更普遍的作法會將日期時間擺放至列索引值,只要使用 .set_index() 方法就能完成:

import pandas as pd

numbers = [9, 23, 33, 91, 13]
players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
birth_dates = ["January 20, 1964", "February 17, 1963", "September 25, 1965", "May 13, 1961", "January 19, 1969"]
df = pd.DataFrame()
df["number"] = numbers
df["player"] = players
df["birth_date"] = birth_dates
df["birth_date"] = pd.to_datetime(df["birth_date"]) # 轉換字元為日期時間型別
df = df.set_index("birth_date", drop=True)          # 將日期時間擺放至列索引值
df
1
2
3
4
5
6
7
8
9
10
11
12

將日期時間擺放至列索引值

R 語言

如果資料框中記錄的是與日期或日期時間相關的資訊,可以利用 as.Date() 或者 as.POSIXct() 函數將原本的字元轉換為具備更多功能的日期或日期時間型別,像是如果將 1995 至 1996 年球季芝加哥公牛隊先發陣容的出生年月日轉換為日期型別,就可以實踐依照年紀大小排序,而這是在本來以文字記錄出生年月日時候並不具備的功能。

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
birth_dates <- c("January 20, 1964", "February 17, 1963", "September 25, 1965", "May 13, 1961", "January 19, 1969")
df <- data.frame(number = numbers,
                 player = players,
                 birth_date = birth_dates,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
class(df$birth_date)
View(df)
1
2
3
4
5
6
7
8
9
## [1] "character"
1

birth_date 原本為字元型別

as.Date() 函數預設僅能辨別 %Y-%m-%d%Y/%m/%d 的格式,像例子中的 %B %d, %Y 就需要另外指定 format:

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
birth_dates <- c("January 20, 1964", "February 17, 1963", "September 25, 1965", "May 13, 1961", "January 19, 1969")
df <- data.frame(number = numbers,
                 player = players,
                 birth_date = birth_dates,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df$birth_date <- as.Date(df$birth_date, format = "%B %d, %Y")
class(df$birth_date)
View(df)
1
2
3
4
5
6
7
8
9
10
## [1] "Date"
1

birth_date 為日期型別

轉換成為日期型別之後就能夠依照年紀排序,這是在本來以文字記錄時候並不具備的功能:

library(dplyr)

numbers <- c(9, 23, 33, 91, 13)
players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
birth_dates <- c("January 20, 1964", "February 17, 1963", "September 25, 1965", "May 13, 1961", "January 19, 1969")
df <- data.frame(number = numbers,
                 player = players,
                 birth_date = birth_dates,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df$birth_date <- as.Date(df$birth_date, format = "%B %d, %Y")
df %>% 
  arrange(birth_date) %>% 
  View()
1
2
3
4
5
6
7
8
9
10
11
12
13

轉換成為日期型別之後就能夠依照年紀排序

轉置資料框

轉置常見的應用是寬表格(Wide Format)與長表格(Long Format)之間的互相轉換,寬表格是比較熟悉的資料框樣式,一列是獨立的觀測值,加入資訊是以增添欄位方式實踐,故得其名為寬表格;長表格是比較陌生的資料框樣式,具有以一欄 key 搭配一欄 value 來紀錄資料的項目與值,加入資訊是以增添列數方式實踐,故得其名為長表格。

Python

像是 1995 至 1996 年球季芝加哥公牛隊先發球員的身高與體重資訊就能以一個寬表格記錄:

import pandas as pd

players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
heights = ["6-6", "6-6", "6-8", "6-7", "7-2"]
weights = [185, 195, 210, 210, 265]
df = pd.DataFrame()
df["player"] = players
df["height"] = heights
df["weight"] = weights
df = df.set_index("player", drop=True) # 將球員姓名設定為列索引
df                                     # 原始外觀為寬表格
1
2
3
4
5
6
7
8
9
10
11

原始外觀為寬表格

利用 .stack() 方法可以將寬表格轉換為長表格,成為一個多重索引值(multi-index)的 Series:

import pandas as pd

players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
heights = ["6-6", "6-6", "6-8", "6-7", "7-2"]
weights = [185, 195, 210, 210, 265]
df = pd.DataFrame()
df["player"] = players
df["height"] = heights
df["weight"] = weights
df = df.set_index("player", drop=True) # 將球員姓名設定為列索引
long_format = df.stack()               # 寬表格轉長表格
long_format
1
2
3
4
5
6
7
8
9
10
11
12

利用 .stack() 方法可以將寬表格轉換為長表格

利用 .unstack() 方法可以將長表格轉換回原始的寬表格:

import pandas as pd

players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
heights = ["6-6", "6-6", "6-8", "6-7", "7-2"]
weights = [185, 195, 210, 210, 265]
df = pd.DataFrame()
df["player"] = players
df["height"] = heights
df["weight"] = weights
df = df.set_index("player", drop=True) # 將球員姓名設定為列索引
long_format = df.stack()               # 寬表格轉長表格
wide_format = long_format.unstack()    # 長表格轉寬表格
wide_format
1
2
3
4
5
6
7
8
9
10
11
12
13

利用 .unstack() 方法可以將長表格轉換回原始的寬表格

R 語言

像是 1995 至 1996 年球季芝加哥公牛隊先發球員的身高與體重資訊就能以一個寬表格記錄:

players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
heights <- c("6-6", "6-6", "6-8", "6-7", "7-2")
weights <- c(185, 195, 210, 210, 265)
df <- data.frame(player = players,
                 height = heights,
                 weight = weights,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
View(df)
1
2
3
4
5
6
7
8

原始外觀為寬表格

利用 tidyr 套件中的 gather() 函數可以將寬表格轉換為長表格,參數 key 要給轉換後的項目名稱、參數 value 要給轉換後的值名稱,然後再指定有哪些變數要轉換為以 key 跟 value 對照的欄位:

library(tidyr)

players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
heights <- c("6-6", "6-6", "6-8", "6-7", "7-2")
weights <- c(185, 195, 210, 210, 265)
df <- data.frame(player = players,
                 height = heights,
                 weight = weights,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
df %>% 
  gather(key = "key", value = "value", height, weight) %>% # 轉換為長表格
  View()
1
2
3
4
5
6
7
8
9
10
11
12

利用 tidyr 套件中的 gather() 函數可以將寬表格轉換為長表格

利用 tidyr 套件中的 spread() 函數可以將長表格轉換為寬表格,參數 key 要給項目的變數名稱、參數 value 要給值的變數名稱:

library(tidyr)

players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
heights <- c("6-6", "6-6", "6-8", "6-7", "7-2")
weights <- c(185, 195, 210, 210, 265)
df <- data.frame(player = players,
                 height = heights,
                 weight = weights,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
long_format <- df %>% 
  gather(key = "key", value = "value", height, weight) # 轉換為長表格
long_format %>% 
  spread(key = "key", value = "value") %>% # 轉換為寬表格
  View()
1
2
3
4
5
6
7
8
9
10
11
12
13
14

利用 tidyr 套件中的 spread() 函數可以將長表格轉換為寬表格

聯結資料框

當資訊分開儲存在兩個以上的資料框,可以使用資料框之間彼此關聯的變數作為對照,進而將資訊集中至同一個資料框之中。

Python

像是1995 至 1996 年球季芝加哥公牛隊先發陣容若是將球員的背號與就讀大學分別儲存在不同資料框中:

import pandas as pd

players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
numbers = [9, 23, 33, 91, 13]
colleges = ["Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico"]
number_df = pd.DataFrame()
number_df["player"] = players
number_df["number"] = numbers
college_df = pd.DataFrame()
college_df["player"] = players
college_df["college"] = colleges
print(number_df)
print(college_df)
1
2
3
4
5
6
7
8
9
10
11
12
13

球員的背號與就讀大學分別儲存在不同資料框中

我們可以使用 pd.merge() 函數,輸入對照的變數欄位 player 合併兩個資料框:

import pandas as pd

players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
numbers = [9, 23, 33, 91, 13]
colleges = ["Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico"]
number_df = pd.DataFrame()
number_df["player"] = players
number_df["number"] = numbers
college_df = pd.DataFrame()
college_df["player"] = players
college_df["college"] = colleges
pd.merge(number_df, college_df, on='player')
1
2
3
4
5
6
7
8
9
10
11
12

輸入對照的變數欄位 player 合併兩個資料框

我們也可以使用 .join() 方法,根據兩個資料框的列索引(row index)來做合併,注意由於兩個資料框都有同樣名稱的 player 變數,在使用 .join() 方法之前可以去除其中一個資料框的 player 變數。

import pandas as pd

players = ["Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley"]
numbers = [9, 23, 33, 91, 13]
colleges = ["Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico"]
number_df = pd.DataFrame()
number_df["player"] = players
number_df["number"] = numbers
college_df = pd.DataFrame()
college_df["player"] = players
college_df["college"] = colleges
number_df.join(college_df[["college"]])
1
2
3
4
5
6
7
8
9
10
11
12

使用 .join() 方法,根據兩個資料框的列索引(row index)來做合併

前述兩個聯結資料框的作法不同之處在於 pd.merge() 函數依據兩個資料框關聯的變數,而 .join() 方法依據兩個資料框的列索引值。

R 語言

1995 至 1996 年球季芝加哥公牛隊先發陣容若是將球員的背號與就讀大學分別儲存在不同資料框中:

players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
numbers <- c(9, 23, 33, 91, 13)
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico")
number_df <- data.frame(player = players,
                 number = numbers,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
college_df <- data.frame(player = players,
                         college = colleges,
                         stringsAsFactors = FALSE) # 避免處理 factor 型別
View(number_df)
View(college_df)
1
2
3
4
5
6
7
8
9
10
11

球員的背號

球員就讀的大學

我們可以使用 merge() 函數,輸入對照的變數欄位 player 合併兩個資料框:

players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
numbers <- c(9, 23, 33, 91, 13)
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico")
number_df <- data.frame(player = players,
                 number = numbers,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
college_df <- data.frame(player = players,
                         college = colleges,
                         stringsAsFactors = FALSE) # 避免處理 factor 型別
View(merge(number_df, college_df, by = "player"))
1
2
3
4
5
6
7
8
9
10

使用 merge() 函數,輸入對照的變數欄位 player 合併兩個資料框

或者使用 dplyr 套件中的 inner_join() 函數合併(在我們舉的栗子中,不論使用內部連結、左外部連結或右外部連結結果都是相同的):

library(dplyr)

players <- c("Ron Harper", "Michael Jordan", "Scottie Pippen", "Dennis Rodman", "Luc Longley")
numbers <- c(9, 23, 33, 91, 13)
colleges <- c("Miami University", "University of North Carolina", "University of Central Arkansas", "Southeastern Oklahoma State University", "University of New Mexico")
number_df <- data.frame(player = players,
                 number = numbers,
                 stringsAsFactors = FALSE) # 避免處理 factor 型別
college_df <- data.frame(player = players,
                         college = colleges,
                         stringsAsFactors = FALSE) # 避免處理 factor 型別
number_df %>% 
  inner_join(college_df) %>% 
  View()
1
2
3
4
5
6
7
8
9
10
11
12
13
14

使用 dplyr 套件中的 inner_join() 函數合併

小結

在這個小節中我們簡介 Python pandas 與 R 語言中的進階資料框操作技巧,包含調整變數的型別、對文字變數重新編碼、對數字重新歸類分組為文字變數、處理遺漏值、處理時間序列、轉置資料框與聯結資料框。