Basic Visualization

The simple graph has brought more information to the data analyst’s mind than any other device.

John Tukey

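Once we know how to acquire and wrangle data, we can use visualization to explore and inspect it in more depth. Data science teams call this skill Exploratory Data Analysis (EDA). EDA greatly deepens our understanding of how the data is distributed, how variables relate to one another, and how the data is composed, which in turn helps the team surface valuable information such as:

  • Ideas for optimizing the design of the Extract, Transform, and Load (ETL) pipeline
  • Data patterns that answer business questions intuitively (clear upward or downward trends, gaps in composition ratios, or differences in absolute values)
  • Statistical hypotheses to test and prediction targets for machine learning models

EDA includes, but is not limited to, visualization. Sometimes the simple summarizing, grouping, or sorting introduced in the Basic Data Frame Manipulation chapter of the How to Wrangle Data part can also provide high-value information for the business.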
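As a minimal sketch of that point, the snippet below summarizes, groups, and sorts a small table without drawing anything. The toy DataFrame and its column names (pos, pts) are made up for illustration and are not part of the data sets used in this chapter.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "pos": ["G", "G", "F", "F", "C"],
    "pts": [30.0, 10.0, 20.0, 10.0, 5.0],
})

# Summary statistics of a numeric column
summary = toy["pts"].describe()

# Group by category, average, then sort from highest to lowest
mean_pts = toy.groupby("pos")["pts"].mean().sort_values(ascending=False)

print(summary)
print(mean_pts)
```

Even without a chart, the sorted group means already rank the categories, which is often enough to answer a simple business question.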

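A quick note on the basic unit of visualization

The visualization packages that data science teams commonly use for exploratory data analysis in Python and R include the pyplot module in matplotlib, seaborn, pandas, the base plotting system, and ggplot2. These packages differ in the unit of data from which a plot is built, and fall into two main groups:

  • One-dimensional arrays as the basic unit, such as the pyplot module in Python's matplotlib and R's base plotting system
  • Data frames (DataFrames) as the basic unit, such as Python's pandas and R's ggplot2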
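To make the two groups concrete, here is a small sketch that draws the same bar chart both ways. The positions and counts are hypothetical toy values, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: head counts per position (not the real roster)
positions = ["C", "PF", "PG", "SF", "SG"]
counts = [3, 2, 3, 4, 3]

# 1) Array-based: pyplot takes plain sequences for the x positions and heights
fig1, ax1 = plt.subplots()
ax1.bar(range(len(counts)), counts)
ax1.set_xticks(range(len(counts)))
ax1.set_xticklabels(positions)

# 2) DataFrame-based: pandas plots straight from the columns of a data frame
toy_df = pd.DataFrame({"pos": positions, "n": counts})
ax2 = toy_df.plot.bar(x="pos", y="n", legend=False)
```

With arrays we place and label the bars ourselves; with a data frame the column names carry that information.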

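Counting the distinct values in a set of text data

The bar chart is the plot data science teams routinely use to explore the composition and count ranking of the distinct values in a set of text data. For example, if we want to know how many players filled each position on the 1995-96 Chicago Bulls roster, a bar chart lets us explore that.

Python

The Chicago Bulls roster for the 1995-96 season

Before plotting with pyplot in Python, we first compute the count of each distinct value using the grouping and summarizing covered in Basic Data Frame Manipulation. We then pass plt.bar() the X positions of the bars and their corresponding heights.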

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()
x = range(1, len(pos) + 1)  # one bar per position instead of a hardcoded 1..5
plt.bar(x, pos)
plt.xticks(x, pos.index)
plt.yticks([1, 2, 3, 4], [1, 2, 3, 4])
plt.show()

Number of players at each position on the 1995-96 Chicago Bulls roster

Plotting through pandas is more concise: we can call .plot.bar() directly on the grouped summary object.

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()
pos.plot.bar()
plt.yticks([1, 2, 3, 4], [1, 2, 3, 4])
plt.show()

Number of players at each position on the 1995-96 Chicago Bulls roster
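As a side note, the grouped count can also be obtained more directly with value_counts(). The sketch below compares the two approaches on a hypothetical toy roster (not the real CSV), so it runs offline.

```python
import pandas as pd

# Hypothetical toy roster, for illustration only
toy = pd.DataFrame({"Pos": ["PG", "SG", "SG", "SF", "PF", "C", "C", "C"]})

# Approach used in this chapter: group, then count
by_group = toy.groupby("Pos")["Pos"].count()

# Alternative: count distinct values directly, sorted by label to match
by_value = toy["Pos"].value_counts().sort_index()

# Both approaches give the same count per position
print(by_group.to_dict())
print(by_value.to_dict())
```

Calling by_value.plot.bar() would then draw the same chart as the grouped version.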

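R

The Chicago Bulls roster for the 1995-96 season

Plotting with R's base plotting system likewise requires computing the count of each distinct value first. We then call barplot(), passing the bar heights followed by the X-axis labels for the bars.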

library(dplyr)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
pos <- df %>% 
  group_by(Pos) %>%
  summarise(freq = n())
barplot(pos$freq, names.arg = pos$Pos)

Number of players at each position on the 1995-96 Chicago Bulls roster

With ggplot2, geom_bar() counts the distinct values by default, so there is no need to group and summarize before plotting.

library(dplyr)
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
df %>% 
  ggplot(aes(x = Pos)) +
    geom_bar()

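Number of players at each position on the 1995-96 Chicago Bulls roster

Summarizing and ranking numeric data by category

Bar charts are also commonly used to explore a set of numeric data summarized and ranked by category. For example, if we want to know the average points per game at each position on the 1995-96 Chicago Bulls roster, a bar chart works here as well. The data used earlier lacks per-game scoring statistics, so we have to obtain them from a second data set and join the two on player name.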

Python

import pandas as pd
import matplotlib.pyplot as plt

per_game_url = "https://storage.googleapis.com/ds_data_import/stats_per_game_chicago_bulls_1995_1996.csv"
player_info_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
per_game = pd.read_csv(per_game_url)
player_info = pd.read_csv(player_info_url)
df = pd.merge(player_info, per_game[["Name", "PTS/G"]], left_on="Player", right_on="Name")
grouped = df.groupby("Pos")
points_per_game = grouped["PTS/G"].mean()
x = range(1, len(points_per_game) + 1)  # one bar per position instead of a hardcoded 1..5
plt.bar(x, points_per_game)
plt.xticks(x, points_per_game.index)
plt.show()

Average points per game at each position on the 1995-96 Chicago Bulls roster

With pandas, the approach is the same as before.

import pandas as pd
import matplotlib.pyplot as plt

per_game_url = "https://storage.googleapis.com/ds_data_import/stats_per_game_chicago_bulls_1995_1996.csv"
player_info_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
per_game = pd.read_csv(per_game_url)
player_info = pd.read_csv(player_info_url)
df = pd.merge(player_info, per_game[["Name", "PTS/G"]], left_on="Player", right_on="Name")
grouped = df.groupby("Pos")
points_per_game = grouped["PTS/G"].mean()
points_per_game.plot.bar()
plt.show()

Average points per game at each position on the 1995-96 Chicago Bulls roster
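The examples above plot the groups in alphabetical order. If a ranking is wanted, one option is to sort the summary before plotting. The sketch below also shows what the name-based join does to unmatched rows. Both tables are hypothetical stand-ins for the two CSV files, with made-up values.

```python
import pandas as pd

# Hypothetical toy tables standing in for the two CSV files
player_info = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Pos": ["PG", "SG", "SG", "C"],
})
per_game = pd.DataFrame({
    "Name": ["A", "B", "C", "E"],  # "D" has no stats row, "E" has no roster row
    "PTS/G": [8.0, 30.0, 10.0, 5.0],
})

# Inner join (the pd.merge default): only names present in both tables survive
merged = pd.merge(player_info, per_game, left_on="Player", right_on="Name")

# Average per position, sorted so the tallest bar would come first
ranked = merged.groupby("Pos")["PTS/G"].mean().sort_values(ascending=False)
print(ranked)
```

Calling ranked.plot.bar() would draw the bars in ranked order. Note that players whose names do not match across the two tables silently drop out of an inner join, so it is worth checking the row count after merging.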

R

With R's base plotting system, the approach is the same as before.

library(dplyr)

per_game_url <- "https://storage.googleapis.com/ds_data_import/stats_per_game_chicago_bulls_1995_1996.csv"
player_info_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
per_game <- read.csv(per_game_url)
player_info <- read.csv(player_info_url)
df <- merge(player_info, per_game[, c("Name", "PTS.G")], by.x = "Player", by.y = "Name")
points_per_game <- df %>% 
  group_by(Pos) %>% 
  summarise(mean_pts = mean(PTS.G))
barplot(points_per_game$mean_pts, names.arg = points_per_game$Pos)

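Average points per game at each position on the 1995-96 Chicago Bulls roster

With ggplot2, geom_bar() counts the distinct values by default. To map position to the X axis and average points per game to the Y axis in aes(), we must set stat = "identity"; otherwise the call fails with an error, as the following code shows: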


# Error: stat_count() must not be used with a y aesthetic.
library(dplyr)
library(ggplot2)

per_game_url <- "https://storage.googleapis.com/ds_data_import/stats_per_game_chicago_bulls_1995_1996.csv"
player_info_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
per_game <- read.csv(per_game_url)
player_info <- read.csv(player_info_url)
df <- merge(player_info, per_game[, c("Name", "PTS.G")], by.x = "Player", by.y = "Name")
df %>% 
  group_by(Pos) %>% 
  summarise(mean_pts = mean(PTS.G)) %>% 
  ggplot(aes(x = Pos, y = mean_pts)) +
    geom_bar()
## Error: stat_count() must not be used with a y aesthetic.

Setting stat = "identity" resolves the error:

library(dplyr)
library(ggplot2)

per_game_url <- "https://storage.googleapis.com/ds_data_import/stats_per_game_chicago_bulls_1995_1996.csv"
player_info_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
per_game <- read.csv(per_game_url)
player_info <- read.csv(player_info_url)
df <- merge(player_info, per_game[, c("Name", "PTS.G")], by.x = "Player", by.y = "Name")
df %>% 
  group_by(Pos) %>% 
  summarise(mean_pts = mean(PTS.G)) %>% 
  ggplot(aes(x = Pos, y = mean_pts)) +
    geom_bar(stat = "identity")

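Average points per game at each position on the 1995-96 Chicago Bulls roster

Distribution of a set of numeric data

The histogram is the plot data science teams routinely use to explore how a set of numeric data is distributed; its shape reveals the kurtosis and skewness of the data. For example, if we want an overview of how annual salaries are distributed among NBA players, a histogram lets us explore that. We wrote a web scraper to collect NBA player salaries from spotrac.com; for scraping techniques, see the Scraping Static Web Content chapter.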

Python

NBA player salaries, scraped from spotrac.com

We pass plt.hist() the numeric data and the number of bins (bins).

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salaries from spotrac.com
  """
  nba_salary_ranking_url = "https://www.spotrac.com/nba/rankings/"
  html_doc = pq(nba_salary_ranking_url)
  player_css = ".team-name"
  pos_css = ".rank-position"
  salary_css = ".info"
  players = [p.text for p in html_doc(player_css)]
  positions = [p.text for p in html_doc(pos_css)]
  salaries = [s.text.replace("$", "") for s in html_doc(salary_css)]
  salaries = [int(s.replace(",", "")) for s in salaries]
  df = pd.DataFrame()
  df["player"] = players
  df["pos"] = positions
  df["salary"] = salaries
  return df

nba_salary = get_nba_salary()
plt.hist(nba_salary["salary"], bins=15)
plt.show()

Distribution of NBA player annual salaries
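The choice of bins changes how the same data looks. As a rough sketch, np.histogram computes the counts and bin edges that a histogram with a given bins value is built from. The "salaries" here are randomly generated, right-skewed toy values, not the scraped data.

```python
import numpy as np

# Hypothetical right-skewed "salaries", generated for illustration only
rng = np.random.default_rng(42)
salaries = rng.lognormal(mean=15, sigma=1, size=500)

# Counts and bin edges for two different bin settings
counts_5, edges_5 = np.histogram(salaries, bins=5)
counts_15, edges_15 = np.histogram(salaries, bins=15)

print(counts_5)   # few wide bins: a coarse view of the distribution
print(counts_15)  # more, narrower bins: finer detail, noisier counts
```

Every observation falls into exactly one bin, so both sets of counts sum to the sample size; only the resolution differs.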

With pandas, we call .plot.hist() directly on the salary Series.

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salaries from spotrac.com
  """
  nba_salary_ranking_url = "https://www.spotrac.com/nba/rankings/"
  html_doc = pq(nba_salary_ranking_url)
  player_css = ".team-name"
  pos_css = ".rank-position"
  salary_css = ".info"
  players = [p.text for p in html_doc(player_css)]
  positions = [p.text for p in html_doc(pos_css)]
  salaries = [s.text.replace("$", "") for s in html_doc(salary_css)]
  salaries = [int(s.replace(",", "")) for s in salaries]
  df = pd.DataFrame()
  df["player"] = players
  df["pos"] = positions
  df["salary"] = salaries
  return df

nba_salary = get_nba_salary()
nba_salary["salary"].plot.hist(bins=15)
plt.show()

Distribution of NBA player annual salaries
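The skewness and kurtosis that a histogram suggests visually can also be estimated numerically. Below is a minimal sketch using the pandas Series methods .skew() and .kurt() on randomly generated, right-skewed toy values (an assumption standing in for the scraped salaries).

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample, generated for illustration only
rng = np.random.default_rng(0)
toy_salary = pd.Series(rng.lognormal(mean=15, sigma=1, size=1000))

skewness = toy_salary.skew()         # > 0 means a longer right tail
excess_kurtosis = toy_salary.kurt()  # Fisher definition: a normal distribution gives 0

print(skewness, excess_kurtosis)
```

A positive skewness matches a histogram whose bulk sits on the left with a long tail to the right, which is the shape one would typically expect from salary data.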

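R

Calling R's hist() function draws a histogram from the numeric data and the number of bins, given through breaks. Note that R treats a single number passed to breaks as a suggestion and may adjust it to get round breakpoints.

NBA player salaries, scraped from spotrac.com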

library(rvest)

get_nba_salary <- function() {
  nba_salary_ranking_url <- "https://www.spotrac.com/nba/rankings/"
  html_doc <- nba_salary_ranking_url %>% 
    read_html()
  player_css <- ".team-name"
  pos_css <- ".rank-position"
  salary_css <- ".info"
  players <- html_doc %>% 
    html_nodes(css = player_css) %>% 
    html_text()
  positions <- html_doc %>% 
    html_nodes(css = pos_css) %>% 
    html_text()
  salaries <- html_doc %>% 
    html_nodes(css = salary_css) %>% 
    html_text() %>% 
    gsub(pattern = "\\$", replacement = "", .) %>% 
    gsub(pattern = ",", replacement = "", .) %>% 
    as.numeric()
  df <- data.frame(player = players,
                   pos = positions,
                   salary = salaries,
                   stringsAsFactors = FALSE)
  return(df)
}
nba_salary <- get_nba_salary()
hist(nba_salary$salary, breaks = 15)

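Distribution of NBA player annual salaries

ggplot2 draws histograms with geom_histogram(). The default number of bins is 30; if none is specified, the console prints a reminder:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here we set bins = 15 explicitly:
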
library(rvest)
library(ggplot2)

get_nba_salary <- function() {
  nba_salary_ranking_url <- "https://www.spotrac.com/nba/rankings/"
  html_doc <- nba_salary_ranking_url %>% 
    read_html()
  player_css <- ".team-name"
  pos_css <- ".rank-position"
  salary_css <- ".info"
  players <- html_doc %>% 
    html_nodes(css = player_css) %>% 
    html_text()
  positions <- html_doc %>% 
    html_nodes(css = pos_css) %>% 
    html_text()
  salaries <- html_doc %>% 
    html_nodes(css = salary_css) %>% 
    html_text() %>% 
    gsub(pattern = "\\$", replacement = "", .) %>% 
    gsub(pattern = ",", replacement = "", .) %>% 
    as.numeric()
  df <- data.frame(player = players,
                   pos = positions,
                   salary = salaries,
                   stringsAsFactors = FALSE)
  return(df)
}
nba_salary <- get_nba_salary()
ggplot(nba_salary, aes(x = salary)) +
  geom_histogram(bins = 15)

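Distribution of NBA player annual salaries

Distribution of numeric data grouped by category

The box-and-whisker plot is the plot data science teams routinely use to explore how a set of numeric data is distributed within each category. It lets us compare the spread and skewness of each group and, through the outliers, get a rough sense of tail weight (kurtosis). For example, if we want to know how NBA player salaries are distributed by position, a box plot lets us explore that.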

Python

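Before plotting with plt.boxplot(), we must arrange the data in the format the function expects:

## Make a box and whisker plot for each column of ``x`` or each vector in sequence ``x``.

We use the .pivot() method to reshape the data into a wide table:

The data as a wide table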
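To see what the wide layout looks like without running the scraper, here is a small sketch on a hypothetical toy table that uses the same column names (player, pos, salary) with made-up values.

```python
import pandas as pd

# Hypothetical long-format toy table: one row per player
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "pos": ["C", "PG", "PG", "SF"],
    "salary": [10, 20, 30, 40],
})

# Wide format: one row per player, one column per position
wide = toy.pivot(index="player", columns="pos", values="salary")
print(wide)

# Each player has a value in exactly one position column; the rest are NaN
n_missing = int(wide.isna().sum().sum())
print(n_missing)
```

Because each player appears under a single position, most cells of the wide table are NaN, which is why the missing values have to be dealt with before plotting.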

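We then select each column, drop its NaN values, and put the results into a list, which is a format the plotting function accepts. Finally, we pass that list to plt.boxplot() and adjust the X-axis tick labels.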

from pyquery import PyQuery as pq
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salaries from spotrac.com
  """
  nba_salary_ranking_url = "https://www.spotrac.com/nba/rankings/"
  html_doc = pq(nba_salary_ranking_url)
  player_css = ".team-name"
  pos_css = ".rank-position"
  salary_css = ".info"
  players = [p.text for p in html_doc(player_css)]
  positions = [p.text for p in html_doc(pos_css)]
  salaries = [s.text.replace("$", "") for s in html_doc(salary_css)]
  salaries = [int(s.replace(",", "")) for s in salaries]
  df = pd.DataFrame()
  df["player"] = players
  df["pos"] = positions
  df["salary"] = salaries
  return df

nba_salary = get_nba_salary()
box_df = nba_salary.pivot(index='player', columns='pos', values='salary')
data_to_plot = [box_df[col].values[~np.isnan(box_df[col].values)] for col in box_df.columns]
plt.boxplot(data_to_plot)
plt.xticks(range(1, len(box_df.columns) + 1), box_df.columns)  # boxes sit at positions 1..n
plt.show()

Distribution of NBA player annual salaries by position
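One caveat with the pivot-based preparation: .pivot() raises an error when the same index and column pair appears twice, for instance if two rows share a player name and position. A possible alternative, sketched here on a hypothetical toy table, is to build the list of arrays with groupby instead.

```python
import pandas as pd

# Hypothetical toy table with a duplicated player name on purpose
toy = pd.DataFrame({
    "player": ["A", "A", "B", "C"],
    "pos": ["PG", "PG", "C", "C"],
    "salary": [10, 20, 30, 40],
})

# One array of salaries per position, no wide table and no NaN cleanup needed
labels = []
data_to_plot = []
for pos, group in toy.groupby("pos"):
    labels.append(pos)
    data_to_plot.append(group["salary"].values)

print(labels)
print([len(values) for values in data_to_plot])
```

The resulting list can be passed to plt.boxplot(data_to_plot), with plt.xticks(range(1, len(labels) + 1), labels) labeling the boxes.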

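With pandas, we can plot directly once .pivot() has reshaped the data into a wide table, which saves the effort of removing missing values and adjusting the input format.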

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salaries from spotrac.com
  """
  nba_salary_ranking_url = "https://www.spotrac.com/nba/rankings/"
  html_doc = pq(nba_salary_ranking_url)
  player_css = ".team-name"
  pos_css = ".rank-position"
  salary_css = ".info"
  players = [p.text for p in html_doc(player_css)]
  positions = [p.text for p in html_doc(pos_css)]
  salaries = [s.text.replace("$", "") for s in html_doc(salary_css)]
  salaries = [int(s.replace(",", "")) for s in salaries]
  df = pd.DataFrame()
  df["player"] = players
  df["pos"] = positions
  df["salary"] = salaries
  return df

nba_salary = get_nba_salary()
box_df = nba_salary.pivot(index='player', columns='pos', values='salary')
box_df.plot.box()
plt.show()

Distribution of NBA player annual salaries by position
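pandas also offers DataFrame.boxplot(), which accepts long-format data through its column and by arguments, so the pivot step can be skipped. The following is a minimal sketch on hypothetical toy values; the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs headless
import pandas as pd

# Hypothetical long-format toy table: one row per player
toy = pd.DataFrame({
    "pos": ["C", "C", "PG", "PG", "SF", "SF"],
    "salary": [10, 30, 20, 40, 15, 25],
})

# One box per distinct value of "pos", drawn from the "salary" column
ax = toy.boxplot(column="salary", by="pos")
```

This keeps the data in the shape it was scraped in, at the cost of a slightly different default title and layout than .plot.box().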

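R

R's boxplot() function accepts long-format input, so we can draw the box plot directly from the data frame we stored after scraping spotrac.com; passing salary ~ pos as the formula argument is enough.

NBA player salaries, scraped from spotrac.com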

library(rvest)

get_nba_salary <- function() {
  nba_salary_ranking_url <- "https://www.spotrac.com/nba/rankings/"
  html_doc <- nba_salary_ranking_url %>% 
    read_html()
  player_css <- ".team-name"
  pos_css <- ".rank-position"
  salary_css <- ".info"
  players <- html_doc %>% 
    html_nodes(css = player_css) %>% 
    html_text()
  positions <- html_doc %>% 
    html_nodes(css = pos_css) %>% 
    html_text()
  salaries <- html_doc %>% 
    html_nodes(css = salary_css) %>% 
    html_text() %>% 
    gsub(pattern = "\\$", replacement = "", .) %>% 
    gsub(pattern = ",", replacement = "", .) %>% 
    as.numeric()
  #salaries <- gsub(pattern = "\\$", replacement = "", salaries)
  df <- data.frame(player = players,
                   pos = positions,
                   salary = salaries,
                   stringsAsFactors = FALSE)
  return(df)
}
nba_salary <- get_nba_salary()
boxplot(salary ~ pos, data = nba_salary)

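Distribution of NBA player annual salaries by position

ggplot2 draws box plots with geom_boxplot(), which also accepts long-format input, so we can again plot directly from the data frame scraped from spotrac.com, mapping position (pos) to the X axis and salary to the Y axis.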

library(rvest)
library(ggplot2)

get_nba_salary <- function() {
  nba_salary_ranking_url <- "https://www.spotrac.com/nba/rankings/"
  html_doc <- nba_salary_ranking_url %>% 
    read_html()
  player_css <- ".team-name"
  pos_css <- ".rank-position"
  salary_css <- ".info"
  players <- html_doc %>% 
    html_nodes(css = player_css) %>% 
    html_text()
  positions <- html_doc %>% 
    html_nodes(css = pos_css) %>% 
    html_text()
  salaries <- html_doc %>% 
    html_nodes(css = salary_css) %>% 
    html_text() %>% 
    gsub(pattern = "\\$", replacement = "", .) %>% 
    gsub(pattern = ",", replacement = "", .) %>% 
    as.numeric()
  df <- data.frame(player = players,
                   pos = positions,
                   salary = salaries,
                   stringsAsFactors = FALSE)
  return(df)
}
nba_salary <- get_nba_salary()
ggplot(nba_salary, aes(x = pos, y = salary)) +
  geom_boxplot()

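Distribution of NBA player annual salaries by position

Correlation between two sets of numeric data

The scatter plot is the plot data science teams routinely use to explore how two sets of numeric data relate; it shows whether they are negatively correlated, positively correlated, or uncorrelated. For example, if we want an overview of how NBA players' annual salary relates to their average points per game, a scatter plot lets us explore that. We already scraped the salary data from spotrac.com; average points per game can be retrieved from nba.com, and the two are then inner-joined on player name.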

Python

NBA player annual salary and average points per game

With plt.scatter(), we pass average points per game as the X data and annual salary as the Y data.

from pyquery import PyQuery as pq
from requests import get
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salaries from spotrac.com
  """
  nba_salary_ranking_url = "https://www.spotrac.com/nba/rankings/"
  html_doc = pq(nba_salary_ranking_url)
  player_css = ".team-name"
  pos_css = ".rank-position"
  salary_css = ".info"
  players = [p.text for p in html_doc(player_css)]
  positions = [p.text for p in html_doc(pos_css)]
  salaries = [s.text.replace("$", "") for s in html_doc(salary_css)]
  salaries = [int(s.replace(",", "")) for s in salaries]
  df = pd.DataFrame()
  df["player"] = players
  df["pos"] = positions
  df["salary"] = salaries
  return df

def get_pts_game():
  """
  Get NBA players' PTS/G from NBA.com
  """
  nba_stats_url = "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2017-18&SeasonType=Regular+Season&StatCategory=PTS"
  pts_game_dict = get(nba_stats_url).json()
  players = [pts_game_dict["resultSet"]["rowSet"][i][2] for i in range(len(pts_game_dict["resultSet"]["rowSet"]))]
  pts_game = [pts_game_dict["resultSet"]["rowSet"][i][22] for i in range(len(pts_game_dict["resultSet"]["rowSet"]))]
  df = pd.DataFrame()
  df["player"] = players
  df["pts_game"] = pts_game
  return df

nba_salary = get_nba_salary()
pts_game = get_pts_game()
df = pd.merge(nba_salary, pts_game, on="player")  # inner join on player name
plt.scatter(df["pts_game"], df["salary"])
plt.show()

Correlation between NBA player annual salary and average points per game
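Since the two sources are joined on player name, any name spelled differently on the two sites is dropped by the inner join. One way to check this, sketched below on hypothetical toy tables, is an outer merge with indicator=True, which adds a _merge column showing where each row came from.

```python
import pandas as pd

# Hypothetical toy tables standing in for the two scraped sources
salary = pd.DataFrame({"player": ["A", "B", "C"], "salary": [1, 2, 3]})
pts = pd.DataFrame({"player": ["A", "B", "D"], "pts_game": [10.0, 20.0, 30.0]})

# Inner join: only players found in both tables
inner = pd.merge(salary, pts, on="player")

# Outer join with an indicator column to find the players that did not match
check = pd.merge(salary, pts, on="player", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]

print(len(inner))
print(unmatched[["player", "_merge"]])
```

Inspecting unmatched before plotting shows how many points the scatter plot is missing and whether the names need cleaning.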

With pandas, we use the .plot.scatter() method, again passing average points per game as the X data and annual salary as the Y data.

from pyquery import PyQuery as pq
from requests import get
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salaries from spotrac.com
  """
  nba_salary_ranking_url = "https://www.spotrac.com/nba/rankings/"
  html_doc = pq(nba_salary_ranking_url)
  player_css = ".team-name"
  pos_css = ".rank-position"
  salary_css = ".info"
  players = [p.text for p in html_doc(player_css)]
  positions = [p.text for p in html_doc(pos_css)]
  salaries = [s.text.replace("$", "") for s in html_doc(salary_css)]
  salaries = [int(s.replace(",", "")) for s in salaries]
  df = pd.DataFrame()
  df["player"] = players
  df["pos"] = positions
  df["salary"] = salaries
  return df

def get_pts_game():
  """
  Get NBA players' PTS/G from NBA.com
  """
  nba_stats_url = "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2017-18&SeasonType=Regular+Season&StatCategory=PTS"
  pts_game_dict = get(nba_stats_url).json()
  players = [pts_game_dict["resultSet"]["rowSet"][i][2] for i in range(len(pts_game_dict["resultSet"]["rowSet"]))]
  pts_game = [pts_game_dict["resultSet"]["rowSet"][i][22] for i in range(len(pts_game_dict["resultSet"]["rowSet"]))]
  df = pd.DataFrame()
  df["player"] = players
  df["pts_game"] = pts_game
  return df

nba_salary = get_nba_salary()
pts_game = get_pts_game()
df = pd.merge(nba_salary, pts_game, on="player")  # inner join on player name
df.plot.scatter("pts_game", "salary")
plt.show()

Correlation between NBA player annual salary and average points per game
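The direction and strength that a scatter plot suggests can be backed by a number. As a minimal sketch, Series.corr() computes the Pearson correlation coefficient by default; the values below are hypothetical and only mimic the column names used in this chapter.

```python
import pandas as pd

# Hypothetical toy values, for illustration only
toy = pd.DataFrame({
    "pts_game": [5.0, 10.0, 15.0, 20.0, 25.0],
    "salary": [2e6, 5e6, 9e6, 20e6, 30e6],
})

# Pearson correlation: close to 1 is strongly positive, close to -1 strongly negative
r = toy["pts_game"].corr(toy["salary"])
print(round(r, 3))
```

A coefficient near zero would correspond to a shapeless cloud of points, though it only measures linear association, so the plot is still worth looking at.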

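R

R's plot() function draws a scatter plot: we pass average points per game as the X data and annual salary as the Y data, and set the type argument to 'p', meaning points.

NBA player annual salary and average points per game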

library(rvest)
library(jsonlite)
library(dplyr)

get_nba_salary <- function() {
  nba_salary_ranking_url <- "https://www.spotrac.com/nba/rankings/"
  html_doc <- nba_salary_ranking_url %>% 
    read_html()
  player_css <- ".team-name"
  pos_css <- ".rank-position"
  salary_css <- ".info"
  players <- html_doc %>% 
    html_nodes(css = player_css) %>% 
    html_text()
  positions <- html_doc %>% 
    html_nodes(css = pos_css) %>% 
    html_text()
  salaries <- html_doc %>% 
    html_nodes(css = salary_css) %>% 
    html_text() %>% 
    gsub(pattern = "\\$", replacement = "", .) %>% 
    gsub(pattern = ",", replacement = "", .) %>% 
    as.numeric()
  #salaries <- gsub(pattern = "\\$", replacement = "", salaries)
  df <- data.frame(player = players,
                   pos = positions,
                   salary = salaries,
                   stringsAsFactors = FALSE)
  return(df)
}

get_pts_game <- function() {
  nba_stats_url <- "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2017-18&SeasonType=Regular+Season&StatCategory=PTS"
  res <- fromJSON(nba_stats_url)
  players <- res$resultSet$rowSet[, 3]
  pts_game <- as.numeric(res$resultSet$rowSet[, 23])
  df <- data.frame(player = players,
                   pts_game = pts_game,
                   stringsAsFactors = FALSE)
  return(df)
}

nba_salary <- get_nba_salary()
pts_game <- get_pts_game()
df <- merge(nba_salary, pts_game) %>% 
  arrange(desc(pts_game))
plot(df$pts_game, df$salary, type = "p")

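Correlation between NBA player annual salary and average points per game

ggplot2 draws scatter plots with geom_point(), with average points per game mapped to the X axis and annual salary to the Y axis.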

library(rvest)
library(jsonlite)
library(ggplot2)
library(dplyr)

get_nba_salary <- function() {
  nba_salary_ranking_url <- "https://www.spotrac.com/nba/rankings/"
  html_doc <- nba_salary_ranking_url %>% 
    read_html()
  player_css <- ".team-name"
  pos_css <- ".rank-position"
  salary_css <- ".info"
  players <- html_doc %>% 
    html_nodes(css = player_css) %>% 
    html_text()
  positions <- html_doc %>% 
    html_nodes(css = pos_css) %>% 
    html_text()
  salaries <- html_doc %>% 
    html_nodes(css = salary_css) %>% 
    html_text() %>% 
    gsub(pattern = "\\$", replacement = "", .) %>% 
    gsub(pattern = ",", replacement = "", .) %>% 
    as.numeric()
  #salaries <- gsub(pattern = "\\$", replacement = "", salaries)
  df <- data.frame(player = players,
                   pos = positions,
                   salary = salaries,
                   stringsAsFactors = FALSE)
  return(df)
}

get_pts_game <- function() {
  nba_stats_url <- "https://stats.nba.com/stats/leagueLeaders?LeagueID=00&PerMode=PerGame&Scope=S&Season=2017-18&SeasonType=Regular+Season&StatCategory=PTS"
  res <- fromJSON(nba_stats_url)
  players <- res$resultSet$rowSet[, 3]
  pts_game <- as.numeric(res$resultSet$rowSet[, 23])
  df <- data.frame(player = players,
                   pts_game = pts_game,
                   stringsAsFactors = FALSE)
  return(df)
}

nba_salary <- get_nba_salary()
pts_game <- get_pts_game()
df <- merge(nba_salary, pts_game) %>% 
  arrange(desc(pts_game))
ggplot(df, aes(x = pts_game, y = salary)) +
  geom_point()

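Correlation between NBA player annual salary and average points per game

Trends in numeric data over time

The line graph is the plot data science teams routinely use to explore how numeric data changes over dates and times; it shows whether the data is rising, falling, flat, seasonal, or cyclical. For example, if I want to see how the per-game points, assists, and rebounds of my favorite NBA player, Paul Pierce (The Truth), changed across his regular seasons, a line graph lets us explore that. We wrote a web scraper to collect the data from basketball-reference.com.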

Python

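Paul Pierce (The Truth): average points, assists, and rebounds per game in each regular season

Because points, rebounds, and assists are three columns of the data frame, we call plt.plot() three times to add the three Series to the plot one by one.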

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_pp_stats():
  """
  Get Paul Pierce stats from basketball-reference.com
  """
  stats_url = "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc = pq(stats_url)
  pts_css = "#per_game .full_table .right:nth-child(30)"
  ast_css = "#per_game .full_table .right:nth-child(25)"
  reb_css = "#per_game .full_table .right:nth-child(24)"
  year = [str(i)+"-01-01" for i in range(1999, 2018)]
  pts = [float(p.text) for p in html_doc(pts_css)]
  ast = [float(a.text) for a in html_doc(ast_css)]
  reb = [float(r.text) for r in html_doc(reb_css)]
  df = pd.DataFrame()
  df["year"] = year
  df["pts"] = pts
  df["ast"] = ast
  df["reb"] = reb
  return df

pp_stats = get_pp_stats()
pp_stats["year"] = pd.to_datetime(pp_stats["year"])
pp_stats = pp_stats.set_index("year")
plt.plot(pp_stats["pts"])
plt.plot(pp_stats["reb"])
plt.plot(pp_stats["ast"])
plt.legend(['PTS', 'REB', 'AST'], loc='upper right')
plt.show()

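Paul Pierce (The Truth): trends in average points, assists, and rebounds per game across regular seasons

With pandas, the .plot.line() method expects wide-format input by default, so no arguments are needed and the legend is drawn automatically.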

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_pp_stats():
  """
  Get Paul Pierce stats from basketball-reference.com
  """
  stats_url = "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc = pq(stats_url)
  pts_css = "#per_game .full_table .right:nth-child(30)"
  ast_css = "#per_game .full_table .right:nth-child(25)"
  reb_css = "#per_game .full_table .right:nth-child(24)"
  year = [str(i)+"-01-01" for i in range(1999, 2018)]
  pts = [float(p.text) for p in html_doc(pts_css)]
  ast = [float(a.text) for a in html_doc(ast_css)]
  reb = [float(r.text) for r in html_doc(reb_css)]
  df = pd.DataFrame()
  df["year"] = year
  df["pts"] = pts
  df["ast"] = ast
  df["reb"] = reb
  return df

pp_stats = get_pp_stats()
pp_stats["year"] = pd.to_datetime(pp_stats["year"])
pp_stats = pp_stats.set_index("year")
pp_stats.plot.line()
plt.show()

Paul Pierce (The Truth): trends in average points, assists, and rebounds per game across regular seasons
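The two table shapes involved in line plots can be sketched without the scraper. The wide layout, with a datetime index and one column per statistic, is what .plot.line() consumes; melt() produces the long layout with one row per observation, should a plotting tool need that instead. All values below are hypothetical round numbers, not Paul Pierce's actual statistics.

```python
import pandas as pd

# Hypothetical toy values, for illustration only
toy = pd.DataFrame({
    "year": ["1999-01-01", "2000-01-01", "2001-01-01"],
    "pts": [10.0, 20.0, 30.0],
    "ast": [2.0, 3.0, 4.0],
    "reb": [5.0, 6.0, 7.0],
})
toy["year"] = pd.to_datetime(toy["year"])

# Wide format: datetime index, one column per statistic
wide = toy.set_index("year")

# Long format: one row per (year, statistic) pair
long_df = toy.melt(id_vars="year", var_name="stats", value_name="value")

print(wide)
print(long_df)
```

Calling wide.plot.line() would draw one line per column, while the long table carries the series name in its stats column.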

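R

Paul Pierce (The Truth): average points, assists, and rebounds per game in each regular season

Because points, rebounds, and assists are three columns of the data frame, we call R's plot() function once with the type argument set to 'l', meaning lines, to draw the points line first. We then call lines() twice to add rebounds and assists to the plot.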

library(rvest)

get_pp_stats <- function() {
  stats_url <- "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc <- stats_url %>% 
    read_html()
  pts_css <- "#per_game .full_table .right:nth-child(30)"
  ast_css <- "#per_game .full_table .right:nth-child(25)"
  reb_css <- "#per_game .full_table .right:nth-child(24)"
  pts <- html_doc %>% 
    html_nodes(pts_css) %>% 
    html_text() %>% 
    as.numeric()
  ast <- html_doc %>% 
    html_nodes(ast_css) %>% 
    html_text() %>% 
    as.numeric()
  reb <- html_doc %>% 
    html_nodes(reb_css) %>% 
    html_text() %>% 
    as.numeric()
  year <- paste(1999:2017, "01", "01", sep = "-") %>% 
    as.Date()
  df <- data.frame(year = year,
                   pts = pts,
                   ast = ast,
                   reb = reb,
                   stringsAsFactors = FALSE)
  return(df)
}

pp_stats <- get_pp_stats()
plot(pp_stats$year, pp_stats$pts, type = "l", lwd = 3, col = rgb(1, 0, 0, 0.5),
     ylim = range(c(pp_stats$pts, pp_stats$ast, pp_stats$reb)))  # span all three series
lines(pp_stats$year, pp_stats$reb, lwd = 3, col = rgb(0, 1, 0, 0.5))
lines(pp_stats$year, pp_stats$ast, lwd = 3, col = rgb(0, 0, 1, 0.5))
legend("topright", legend=c("PTS", "REB", "AST"), cex = 0.5, bty = "n",
       col = c("red", "green", "blue"), lty = c(1, 1, 1))

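Paul Pierce (The Truth): trends in average points, assists, and rebounds per game across regular seasons

ggplot2 draws line graphs with geom_line(). To make it easy to map line color and build the legend, we reshape the wide table into long format.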

library(rvest)
library(tidyr)
library(ggplot2)

get_pp_stats <- function() {
  stats_url <- "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc <- stats_url %>% 
    read_html()
  pts_css <- "#per_game .full_table .right:nth-child(30)"
  ast_css <- "#per_game .full_table .right:nth-child(25)"
  reb_css <- "#per_game .full_table .right:nth-child(24)"
  pts <- html_doc %>% 
    html_nodes(pts_css) %>% 
    html_text() %>% 
    as.numeric()
  ast <- html_doc %>% 
    html_nodes(ast_css) %>% 
    html_text() %>% 
    as.numeric()
  reb <- html_doc %>% 
    html_nodes(reb_css) %>% 
    html_text() %>% 
    as.numeric()
  year <- paste(1999:2017, "01", "01", sep = "-") %>% 
    as.Date()
  df <- data.frame(year = year,
                   pts = pts,
                   ast = ast,
                   reb = reb,
                   stringsAsFactors = FALSE)
  return(df)
}

pp_stats <- get_pp_stats()
pp_stats_long <- gather(pp_stats, key = "stats", value = "value", pts, ast, reb)
ggplot(pp_stats_long, aes(x = year, y = value, color = stats)) +
  geom_line()

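Paul Pierce (The Truth): trends in average points, assists, and rebounds per game across regular seasons

Summary

In this section we briefly covered how to use visualization packages in Python and R to explore the characteristics of different kinds of data: a quick note on the basic unit of visualization, counting the distinct values in a set of text data, summarizing and ranking numeric data by category, the distribution of a set of numeric data, the distribution of numeric data grouped by category, the correlation between two sets of numeric data, and trends in numeric data over time.

Further reading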