Components of a Visualization

The simple graph has brought more information to the data analyst’s mind than any other device.

John Tukey

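In the basic visualization article we learned how to depict the features of different kinds of data and, from there, how to mine valuable information from the data. That article, however, focused on processing the data and mapping it to the appropriate chart type according to the data type and the exploration need. Adjustments unrelated to the data, such as chart titles, color schemes, or tick labels, were not considered. This means there is still room for improvement toward the goal of exploratory data analysis: letting the data science team grasp the features of the data at a glance.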

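Adjusting the Canvas Theme

Adjusting the canvas theme is a shortcut to giving a visualization an instant makeover. A theme covers the overall look of a figure, including background color, font size, and line style.

Python

In Python we can inspect the style.available attribute of pyplot to see which themes can be used.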

import matplotlib.pyplot as plt

style_available = plt.style.available
print("{} themes are available.".format(len(style_available)))
print(style_available)

## 25 themes are available.
## ['seaborn-dark', 'seaborn-colorblind', 'seaborn-muted', 'seaborn-pastel', 'grayscale', 'seaborn-dark-palette', 'seaborn-white', '_classic_test', 'seaborn-poster', 'seaborn-whitegrid', 'fast', 'seaborn-bright', 'seaborn-talk', 'seaborn-paper', 'Solarize_Light2', 'seaborn-notebook', 'ggplot', 'dark_background', 'seaborn-darkgrid', 'bmh', 'classic', 'seaborn-ticks', 'seaborn-deep', 'fivethirtyeight', 'seaborn']

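Among them, the seaborn family, ggplot, dark_background, bmh, and fivethirtyeight are the more distinctive themes. A theme is selected with plt.style.use(). Let's draw a bar chart under each of these five themes to explore how many players the 1995-96 Chicago Bulls roster had at each position: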

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()
plt_themes = ["seaborn-darkgrid", "ggplot", "dark_background", "bmh", "fivethirtyeight"]

for i in range(5):
  plt.style.use(plt_themes[i])
  plt.bar(range(1, 6), pos)
  plt.xticks(range(1, 6), pos.index)
  plt.title(plt_themes[i])
  plt.show()
  print("\n")

seaborn-darkgrid theme

ggplot theme

dark_background theme

bmh theme

fivethirtyeight theme
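plt.style.use() changes the global style, and each call layers on top of the settings left by the previous one. The following is a minimal sketch of an alternative, assuming Matplotlib is installed: plt.style.context() applies a theme only inside a with block. The helper pick_style() is a hypothetical name introduced for this illustration, the counts are made-up placeholders rather than the real roster numbers, and the seaborn-v0_8- prefix reflects how newer Matplotlib releases name the seaborn styles (to my knowledge from version 3.6 onward), so check plt.style.available on your own installation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt


def pick_style(candidates, available=None):
    """Return the first candidate style that is available, else "default"."""
    if available is None:
        available = plt.style.available
    for name in candidates:
        if name in available:
            return name
    return "default"


# Made-up placeholder counts, not the real roster numbers.
positions = ["C", "PF", "PG", "SF", "SG"]
counts = [2, 3, 3, 4, 3]

# Try the newer style name first, then the older one, then ggplot.
style = pick_style(["seaborn-v0_8-darkgrid", "seaborn-darkgrid", "ggplot"])
with plt.style.context(style):  # the theme applies only inside this block
    fig, ax = plt.subplots()
    ax.bar(positions, counts)
    ax.set_title(style)
plt.close(fig)
```

Because the theme is scoped to the with block, figures created afterwards fall back to whatever style was active before.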

R

The ggplot2 package in R provides theme_...() functions for changing the theme. Besides the default theme_gray(), the other available choices are:

library(dplyr) # provides the %>% pipe used below
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
plt <- df %>% 
  ggplot(aes(x = Pos)) +
  geom_bar(fill = "red", alpha = 0.5)

plt + theme_bw()
plt + theme_linedraw()
plt + theme_light()
plt + theme_dark()
plt + theme_minimal()
plt + theme_classic()
plt + theme_void()

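theme_bw() theme

theme_linedraw() theme

theme_light() theme

theme_dark() theme

theme_minimal() theme

theme_classic() theme

theme_void() theme

Adding Titles and Axis Labels

A well-worded chart title is the finishing touch of an exploratory analysis.

Python

In Python, plt.title() adds a regular title and plt.suptitle() adds a centered title higher up on the canvas, which gives the chart two levels of heading: a headline and a subtitle. plt.xlabel() and plt.ylabel() describe the variable name and unit of the X axis and Y axis respectively.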

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()

plt.bar(range(1, 6), pos)
plt.xticks(range(1, 6), pos.index)
plt.suptitle("Front court players are the majorities.")
plt.title("Chicago Bulls is relatively weak in the paint.")
plt.xlabel("Positions")
plt.ylabel("Number of Players")
plt.show()

Python: adding titles and axis labels
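The same two-level heading can also be written with Matplotlib's object-oriented interface, which keeps the figure and the axes as explicit objects. This is a minimal sketch, assuming Matplotlib is installed; the counts and the subtitle wording are made-up placeholders for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up placeholder counts, not the real roster numbers.
positions = ["C", "PF", "PG", "SF", "SG"]
counts = [2, 3, 3, 4, 3]

fig, ax = plt.subplots()
ax.bar(positions, counts)
fig.suptitle("Headline for the whole figure")   # figure-level title (like plt.suptitle)
ax.set_title("Subtitle for this axes")          # axes-level title (like plt.title)
ax.set_xlabel("Positions")                      # like plt.xlabel
ax.set_ylabel("Number of Players")              # like plt.ylabel
plt.close(fig)
```

Holding on to fig and ax becomes more useful once a canvas contains several subplots, since each axes object then carries its own title and labels.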

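Users who work in Chinese will find at this point that Chinese titles and axis labels fail to display. Matplotlib's default font (DejaVuSans.ttf in my installation) has no Chinese glyphs, so the characters are rendered as empty boxes: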

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()

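# Chinese text fails to display
plt.bar(range(1, 6), pos)
# groupby() sorts the positions alphabetically (C, PF, PG, SF, SG), so the labels follow that order
plt.xticks(range(1, 6), ["中鋒", "大前鋒", "控球後衛", "小前鋒", "得分後衛"])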
plt.suptitle("前場球員為芝加哥公牛隊的大宗")
plt.title("反映當時為了抗衡其他具有主宰力中前鋒的隊伍之現象")
plt.xlabel("鋒衛位置")
plt.ylabel("球員人數")
plt.show()

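Python: Chinese titles and axis labels fail to display

The fix is to specify a font that supports Chinese, such as the Traditional Chinese Heiti TC Light that I use next, by passing the path of the font file (on my machine, /System/Library/Fonts/STHeiti Light.ttc) to the FontProperties() function from the matplotlib.font_manager module.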

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()

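# Chinese text displays correctly
myfont = FontProperties(fname="/System/Library/Fonts/STHeiti Light.ttc")
plt.bar(range(1, 6), pos)
# groupby() sorts the positions alphabetically (C, PF, PG, SF, SG), so the labels follow that order
plt.xticks(range(1, 6), ["中鋒", "大前鋒", "控球後衛", "小前鋒", "得分後衛"], fontproperties=myfont)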
plt.suptitle("前場球員為芝加哥公牛隊的大宗", fontproperties=myfont)
plt.title("反映當時為了抗衡其他具有主宰力中前鋒的隊伍之現象", fontproperties=myfont)
plt.xlabel("鋒衛位置", fontproperties=myfont)
plt.ylabel("球員人數", fontproperties=myfont)
plt.show()

Python: Chinese titles and axis labels display correctly
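Passing fontproperties to every call works, but the font path is specific to one machine. As a hedged alternative, the sketch below looks through the fonts Matplotlib has already registered and, if one of a few candidate names is present, makes it the default sans-serif font for all text. The helper find_installed_font() is a hypothetical name introduced here, and the candidate font names are only examples that may or may not exist on your system.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
from matplotlib import font_manager, rcParams


def find_installed_font(candidates):
    """Return the first candidate font name that Matplotlib knows about, or None."""
    installed = {entry.name for entry in font_manager.fontManager.ttflist}
    for name in candidates:
        if name in installed:
            return name
    return None


# Example names only; which of these exist depends on the operating system.
cjk_candidates = ["Heiti TC", "PingFang TC", "Microsoft JhengHei", "Noto Sans CJK TC"]
cjk_font = find_installed_font(cjk_candidates)

if cjk_font is not None:
    # Put the Chinese-capable font first in the sans-serif fallback list.
    rcParams["font.sans-serif"] = [cjk_font] + list(rcParams["font.sans-serif"])
    # Draw the minus sign with an ASCII hyphen, which CJK fonts usually contain.
    rcParams["axes.unicode_minus"] = False
```

With the default changed through rcParams, plt.title(), plt.xlabel() and the other text functions no longer need a fontproperties argument. If no candidate is found, the settings are left untouched and the explicit FontProperties approach remains the fallback.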

R

In R, ggtitle() adds a title, and labs(subtitle = , caption = ) adds a subtitle and a data-source note in the lower-right corner. xlab() and ylab() describe the variable name and unit of the X axis and Y axis respectively.

library(dplyr) # provides the %>% pipe used below
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
df %>% 
  ggplot(aes(x = Pos)) +
  geom_bar(fill = "red", alpha = 0.5) +
  ggtitle("Front court players are the majorities.") +
  labs(subtitle = "Chicago Bulls is relatively weak in the paint.",
       caption = "Source: basketball-reference.com") +
  xlab("Positions") +
  ylab("Number of Players")

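R: adding titles and axis labels

Likewise, because the default ggplot2 font (sans) does not support Chinese, Chinese text in titles and axis labels fails to display and is rendered as empty boxes: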

library(dplyr) # provides the %>% pipe used below
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
# Chinese text fails to display
df %>% 
  ggplot(aes(x = Pos)) +
  geom_bar(fill = "red", alpha = 0.5) +
  ggtitle("前場球員為芝加哥公牛隊的大宗") +
  labs(subtitle = "反映當時為了抗衡其他具有主宰力中前鋒的隊伍之現象",
       caption = "資料來源: basketball-reference.com") +
  xlab("鋒衛位置") +
  ylab("球員人數") +
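  # the discrete X scale orders the positions alphabetically (C, PF, PG, SF, SG), so the labels follow that order
  scale_x_discrete(labels = c("中鋒", "大前鋒", "控球後衛", "小前鋒", "得分後衛"))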

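R: Chinese titles and axis labels fail to display

The fix is to specify a font that supports Chinese through theme(text = element_text(family = )), for example the Traditional Chinese Heiti TC Light that I use here.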

library(dplyr) # provides the %>% pipe used below
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
# Chinese text displays correctly
df %>% 
  ggplot(aes(x = Pos)) +
  geom_bar(fill = "red", alpha = 0.5) +
  ggtitle("前場球員為芝加哥公牛隊的大宗") +
  labs(subtitle = "反映當時為了抗衡其他具有主宰力中前鋒的隊伍之現象",
       caption = "資料來源: basketball-reference.com") +
  xlab("鋒衛位置") +
  ylab("球員人數") +
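  # the discrete X scale orders the positions alphabetically (C, PF, PG, SF, SG), so the labels follow that order
  scale_x_discrete(labels = c("中鋒", "大前鋒", "控球後衛", "小前鋒", "得分後衛")) +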
  theme(text = element_text(family = "Heiti TC Light"))

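R: Chinese titles and axis labels display correctly

Adding Annotations

Beyond standard titles and axis labels, which help the data science team read an exploratory analysis, we can add components that highlight information: text that annotates descriptive details, horizontal or vertical lines that mark important values, shading that emphasizes a region, or arrows that point at specific data points.

Python

In Python, plt.text() takes the text content and the coordinates at which to place it. For example, we can print the scoring value above each bar in a bar chart of average points per game by position for the 1995-96 Chicago Bulls. Note that the text position needs a small offset; otherwise the annotation sits flush against the bar or the axis.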

import pandas as pd
import matplotlib.pyplot as plt

per_game_url = "https://storage.googleapis.com/ds_data_import/stats_per_game_chicago_bulls_1995_1996.csv"
player_info_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
per_game = pd.read_csv(per_game_url)
player_info = pd.read_csv(player_info_url)
df = pd.merge(player_info, per_game[["Name", "PTS/G"]], left_on="Player", right_on="Name")
grouped = df.groupby("Pos")
points_per_game = grouped["PTS/G"].mean()

plt.bar([1, 2, 3, 4, 5], points_per_game)
plt.xticks([1, 2, 3, 4, 5], points_per_game.index)
plt.ylim(0, points_per_game.max() + 3)
plt.title("Points per game by positions")
plt.xlabel("Positions")
plt.ylabel("PPG")
for i, v in enumerate(points_per_game):
  plt.text(i + 0.9, v + 0.5, "{:.1f}".format(v))
plt.show()

Python: annotating with descriptive text
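The manual offsets (i + 0.9 and v + 0.5) depend on the bar width and the data scale. A minimal sketch of one way to avoid tuning the horizontal offset, assuming Matplotlib is installed: take the center of each bar from the rectangle that ax.bar() returns and let the alignment arguments ha and va place the text. The scoring values are made-up placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up placeholder averages, not the real 1995-96 numbers.
positions = ["C", "PF", "PG", "SF", "SG"]
ppg = [6.2, 7.4, 5.1, 12.8, 14.3]

fig, ax = plt.subplots()
bars = ax.bar(positions, ppg)
for rect, value in zip(bars, ppg):
    x_center = rect.get_x() + rect.get_width() / 2  # horizontal center of the bar
    # ha="center" centers the text on the bar; va="bottom" sits it on top of the bar
    ax.text(x_center, value, "{:.1f}".format(value), ha="center", va="bottom")
ax.set_ylim(0, max(ppg) + 3)  # head room so the top label is not clipped
plt.close(fig)
```

A small vertical gap can still be added by passing value plus a constant as the y coordinate, as the original example does.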

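plt.axhline() adds a horizontal line to a chart and plt.fill_between() shades a specified region, both of which create emphasis. For example, I want to mark the career average on a line chart of Paul Pierce's points per game by year and highlight the stretches above that average.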

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_pp_stats():
  """
  Get Paul Pierce stats from basketball-reference.com
  """
  stats_url = "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc = pq(stats_url)
  pts_css = "#per_game .full_table .right:nth-child(30)"
  ast_css = "#per_game .full_table .right:nth-child(25)"
  reb_css = "#per_game .full_table .right:nth-child(24)"
  year = [str(i)+"-01-01" for i in range(1999, 2018)]
  pts = [float(p.text) for p in html_doc(pts_css)]
  ast = [float(a.text) for a in html_doc(ast_css)]
  reb = [float(r.text) for r in html_doc(reb_css)]
  df = pd.DataFrame()
  df["year"] = year
  df["pts"] = pts
  df["ast"] = ast
  df["reb"] = reb
  return df

pp_stats = get_pp_stats()
pp_stats["year"] = pd.to_datetime(pp_stats["year"])
pp_stats = pp_stats.set_index("year")
plt.plot(pp_stats["pts"]) # line chart
avg_pts = pp_stats["pts"].mean() # career average
plt.axhline(y = avg_pts, color="g", ls="--", alpha = 0.5) # horizontal line
plt.fill_between(pp_stats.index, avg_pts, pp_stats["pts"],
                 where=pp_stats["pts"] >= avg_pts, color="gray",
                 alpha=0.5, interpolate=True) # shading
plt.title("Points per game: Paul Pierce")
plt.xlabel("Year")
plt.ylabel("PPG")
plt.show()

Python: adding a horizontal line and shading
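The example above depends on scraping a live web page, so it stops working whenever the page layout changes. To try axhline() and fill_between() offline, here is a minimal sketch on a small hard-coded series; the years and values are made up for illustration and are not Paul Pierce's statistics.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up series for illustration only.
years = list(range(2010, 2018))
ppg = [18.3, 19.1, 21.4, 22.0, 17.5, 13.2, 11.9, 6.1]

avg = sum(ppg) / len(ppg)              # the value the horizontal line marks
above = [v >= avg for v in ppg]        # True where the series is at or above the average

fig, ax = plt.subplots()
ax.plot(years, ppg)                                   # line chart
ax.axhline(y=avg, color="g", ls="--", alpha=0.5)      # horizontal line at the average
ax.fill_between(years, avg, ppg, where=above,         # shade only the above-average stretch
                color="gray", alpha=0.5, interpolate=True)
ax.set_xlabel("Year")
ax.set_ylabel("PPG")
plt.close(fig)
```

The where argument is a boolean mask with one entry per x value, and interpolate=True extends the shading to the exact point where the line crosses the average instead of stopping at the nearest data point.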

plt.annotate() adds an arrow and annotation text to a chart. For example, I want to mark 2013, the year Paul Pierce left the Boston Celtics.

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_pp_stats():
  """
  Get Paul Pierce stats from basketball-reference.com
  """
  stats_url = "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc = pq(stats_url)
  pts_css = "#per_game .full_table .right:nth-child(30)"
  ast_css = "#per_game .full_table .right:nth-child(25)"
  reb_css = "#per_game .full_table .right:nth-child(24)"
  year = [str(i)+"-01-01" for i in range(1999, 2018)]
  pts = [float(p.text) for p in html_doc(pts_css)]
  ast = [float(a.text) for a in html_doc(ast_css)]
  reb = [float(r.text) for r in html_doc(reb_css)]
  df = pd.DataFrame()
  df["year"] = year
  df["pts"] = pts
  df["ast"] = ast
  df["reb"] = reb
  return df

pp_stats = get_pp_stats()
pp_stats["year"] = pd.to_datetime(pp_stats["year"])
pp_stats = pp_stats.set_index("year")
plt.plot(pp_stats["pts"]) # line chart
avg_pts = pp_stats["pts"].mean() # career average
plt.axhline(y = avg_pts, color="g", ls="--", alpha = 0.5) # horizontal line
plt.fill_between(pp_stats.index, avg_pts, pp_stats["pts"],
                 where=pp_stats["pts"] >= avg_pts, color="gray",
                 alpha=0.5, interpolate=True) # shading

year_2013 = pp_stats.index[-5] # index entry for 2013
# add the arrow and annotation text
plt.annotate(
    'Left Boston Celtics',
    xy=(year_2013, 19),
    xycoords='data',
    xytext=(year_2013, 25),
    textcoords='data',
    horizontalalignment='center',
    arrowprops=dict(facecolor='black', arrowstyle="fancy")
)
plt.title("Points per game: Paul Pierce")
plt.xlabel("Year")
plt.ylabel("PPG")
plt.show()

Python: adding an arrow and annotation text
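In the example above the arrow tip (19) and the text position (25) are typed in by hand. As a hedged variation, the sketch below derives both from the data, so the annotation follows the point it describes. It uses a made-up series rather than scraped statistics, and the label text is a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up series for illustration only.
years = list(range(2010, 2018))
ppg = [18.3, 19.1, 21.4, 22.0, 17.5, 13.2, 11.9, 6.1]

peak_index = ppg.index(max(ppg))   # position of the highest value
peak_year = years[peak_index]
peak_value = ppg[peak_index]

fig, ax = plt.subplots()
ax.plot(years, ppg)
ann = ax.annotate(
    "Peak season",                       # annotation text (placeholder wording)
    xy=(peak_year, peak_value),          # the point the arrow tip touches
    xytext=(peak_year, peak_value + 3),  # where the text is placed, in data coordinates
    ha="center",
    arrowprops=dict(arrowstyle="->"),
)
ax.set_ylim(0, peak_value + 5)           # head room for the text
plt.close(fig)
```

xy is the annotated point and xytext is the text location; when arrowprops is given, Matplotlib draws the arrow from the text toward xy.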

R

In R, geom_text() adds value labels that are automatically aligned with the X-axis positions; the vjust argument adjusts the distance between a label and the top of its bar.

library(dplyr)
library(ggplot2)

per_game_url <- "https://storage.googleapis.com/ds_data_import/stats_per_game_chicago_bulls_1995_1996.csv"
player_info_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
per_game <- read.csv(per_game_url)
player_info <- read.csv(player_info_url)
df <- merge(player_info, per_game[, c("Name", "PTS.G")], by.x = "Player", by.y = "Name")
df %>% 
  group_by(Pos) %>% 
  summarise(mean_pts = mean(PTS.G)) %>% 
  ggplot(aes(x = Pos, y = mean_pts)) +
  geom_bar(stat = "identity", fill = "red", alpha = 0.5) +
  geom_text(aes(label = sprintf("%.1f", mean_pts), y= mean_pts),  vjust = -1) +
  scale_y_continuous(limits = c(0, max(df$PTS.G) + 3)) +
  ggtitle("Points per game by positions") +
  xlab("Positions") +
  ylab("PPG")

R: annotating with descriptive text

geom_hline() adds a horizontal line to a chart and geom_ribbon() shades a specified region, both of which create emphasis.

library(rvest)
library(ggplot2)

get_pp_stats <- function() {
  stats_url <- "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc <- stats_url %>% 
    read_html()
  pts_css <- "#per_game .full_table .right:nth-child(30)"
  ast_css <- "#per_game .full_table .right:nth-child(25)"
  reb_css <- "#per_game .full_table .right:nth-child(24)"
  pts <- html_doc %>% 
    html_nodes(pts_css) %>% 
    html_text() %>% 
    as.numeric()
  ast <- html_doc %>% 
    html_nodes(ast_css) %>% 
    html_text() %>% 
    as.numeric()
  reb <- html_doc %>% 
    html_nodes(reb_css) %>% 
    html_text() %>% 
    as.numeric()
  year <- paste(1999:2017, "01", "01", sep = "-") %>% 
    as.Date()
  df <- data.frame(year = year,
                   pts = pts,
                   ast = ast,
                   reb = reb,
                   stringsAsFactors = FALSE)
  return(df)
}

pp_stats <- get_pp_stats()
avg_pts <- mean(pp_stats$pts) # career average
pp_stats %>% 
  ggplot(aes(x = year, y = pts)) +
    # line chart
    geom_line() +
    # horizontal line
    geom_hline(yintercept = avg_pts, lty = 2, col = "green") +
    # shading
    geom_ribbon(aes(ymin = avg_pts, ymax = pts, 
                fill = ifelse(pts >= avg_pts, TRUE, NA)),
                alpha = 0.5) +
    scale_fill_manual(values=c("gray"), name="fill") +
    theme(legend.position="none")

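R: adding a horizontal line and shading

The annotate() function adds annotation text to a chart, and geom_segment() with an arrow argument draws the arrow.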

library(rvest)
library(ggplot2)

get_pp_stats <- function() {
  stats_url <- "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc <- stats_url %>% 
    read_html()
  pts_css <- "#per_game .full_table .right:nth-child(30)"
  ast_css <- "#per_game .full_table .right:nth-child(25)"
  reb_css <- "#per_game .full_table .right:nth-child(24)"
  pts <- html_doc %>% 
    html_nodes(pts_css) %>% 
    html_text() %>% 
    as.numeric()
  ast <- html_doc %>% 
    html_nodes(ast_css) %>% 
    html_text() %>% 
    as.numeric()
  reb <- html_doc %>% 
    html_nodes(reb_css) %>% 
    html_text() %>% 
    as.numeric()
  year <- paste(1999:2017, "01", "01", sep = "-") %>% 
    as.Date()
  df <- data.frame(year = year,
                   pts = pts,
                   ast = ast,
                   reb = reb,
                   stringsAsFactors = FALSE)
  return(df)
}

pp_stats <- get_pp_stats()
avg_pts <- mean(pp_stats$pts)
pp_stats %>% 
  ggplot(aes(x = year, y = pts)) +
    geom_line() +
    geom_hline(yintercept = avg_pts, lty = 2, col = "green") +
    geom_ribbon(aes(ymin = avg_pts, ymax = pts, 
                fill = ifelse(pts >= avg_pts, TRUE, NA)),
                alpha = 0.5) +
    # add the annotation text
    annotate("text", x = as.Date("2013-01-01"), y = 25, label = "Left Boston Celtics") +
    # add the arrow
    geom_segment(aes(x = as.Date("2013-01-01"), y = 23, xend = as.Date("2013-01-01"), yend = 20),
                 arrow = arrow(length = unit(0.2, "cm"))) +
    scale_fill_manual(values=c("gray"), name="fill") +
    theme(legend.position="none") +
    ggtitle("Points per game: Paul Pierce") +
    xlab("Year") +
    ylab("PPG")

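R: adding an arrow and annotation text

Adjusting the Axes

The X and Y axes describe how the data is mapped. In most cases the default axis settings of Matplotlib in Python and ggplot2 in R are good enough, but sometimes the data science team wants to control the axis range, the tick marks, or the tick labels.

Python

In Python, plt.xlim() and plt.ylim() take a minimum and a maximum to set the axis range, which makes it possible to show only part of a chart. For example, restricting the X-axis range shows only Paul Pierce's points per game during his years with the Boston Celtics.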

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_pp_stats():
  """
  Get Paul Pierce stats from basketball-reference.com
  """
  stats_url = "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc = pq(stats_url)
  pts_css = "#per_game .full_table .right:nth-child(30)"
  ast_css = "#per_game .full_table .right:nth-child(25)"
  reb_css = "#per_game .full_table .right:nth-child(24)"
  year = [str(i)+"-01-01" for i in range(1999, 2018)]
  pts = [float(p.text) for p in html_doc(pts_css)]
  ast = [float(a.text) for a in html_doc(ast_css)]
  reb = [float(r.text) for r in html_doc(reb_css)]
  df = pd.DataFrame()
  df["year"] = year
  df["pts"] = pts
  df["ast"] = ast
  df["reb"] = reb
  return df

pp_stats = get_pp_stats()
pp_stats["year"] = pd.to_datetime(pp_stats["year"])
pp_stats = pp_stats.set_index("year")
plt.plot(pp_stats["pts"]) # line chart
plt.xlim(pp_stats.index.min(), pp_stats.index[14]) # set the X-axis range
plt.ylim(15, 30) # set the Y-axis range
plt.title("Points per game: Paul Pierce in Boston Celtics")
plt.xlabel("Year")
plt.ylabel("PPG")
plt.show()

Python: adjusting the axis range
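Setting axis limits changes only the visible window; the plotted data is left untouched. A minimal sketch that makes this explicit, assuming Matplotlib is installed and using a made-up series instead of scraped statistics:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up series for illustration only.
years = list(range(2010, 2018))
ppg = [18.3, 19.1, 21.4, 22.0, 17.5, 13.2, 11.9, 6.1]

fig, ax = plt.subplots()
line, = ax.plot(years, ppg)   # all eight points are plotted
ax.set_xlim(2010, 2013)       # show only the first four seasons (like plt.xlim)
ax.set_ylim(15, 30)           # like plt.ylim

visible_window = (ax.get_xlim(), ax.get_ylim())
points_kept = len(line.get_xdata())   # still 8: the limits hide data, they do not drop it
plt.close(fig)
```

If the hidden observations should also be excluded from later calculations, filter the data itself before plotting rather than relying on the axis limits.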

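In Python, plt.xticks() and plt.yticks() adjust the tick marks and tick labels. Before any adjustment, the bar chart of the number of players at each position on the 1995-96 Chicago Bulls roster does not look very good: the X axis runs from 1 to 5 with no position labels, and the Y axis uses a tick interval of 0.5 and shows floating-point numbers.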

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()

plt.bar(range(1, 6), pos)
plt.suptitle("Front court players are the majorities.")
plt.title("Chicago Bulls is relatively weak in the paint.")
plt.xlabel("Positions")
plt.ylabel("Number of Players")
plt.show()

Python: before adjusting tick marks and tick labels

In plt.xticks() we map 1 through 5 to ["C", "PF", "PG", "SF", "SG"], the alphabetical order in which groupby() returns the positions, and in plt.yticks() we pass 1 through 4 to replace the default style.

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped["Pos"].count()

plt.bar(range(1, 6), pos)
plt.xticks(range(1, 6), ["C", "PF", "PG", "SF", "SG"]) # X-axis ticks and labels, in groupby() (alphabetical) order
plt.yticks(range(1, 5)) # Y-axis ticks and labels
plt.suptitle("Front court players are the majorities.")
plt.title("Chicago Bulls is relatively weak in the paint.")
plt.xlabel("Positions")
plt.ylabel("Number of Players")
plt.show()

Python: after adjusting tick marks and tick labels
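Hard-coded tick values and labels have to be kept in sync with the data by hand. A hedged sketch of two ways to reduce that risk, assuming Matplotlib is installed: build the X labels from the data itself, and let MaxNLocator(integer=True) from matplotlib.ticker choose whole-number Y ticks. The counts are made-up placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Made-up placeholder counts keyed by position, in no particular order.
counts = {"SG": 3, "C": 2, "PF": 3, "SF": 4, "PG": 3}

labels = sorted(counts)                  # alphabetical order, as groupby() would give
heights = [counts[k] for k in labels]    # heights in the same order as the labels
xs = list(range(1, len(labels) + 1))

fig, ax = plt.subplots()
ax.bar(xs, heights)
ax.set_xticks(xs)
ax.set_xticklabels(labels)               # labels come from the data, so they cannot drift
ax.yaxis.set_major_locator(MaxNLocator(integer=True))  # whole-number Y ticks only
y_ticks = list(ax.get_yticks())
plt.close(fig)
```

Because labels and heights are built from the same sorted keys, a bar can never end up under the wrong position name, and the integer locator keeps working if the largest count changes.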

R

In R, the limits argument of the scale_x_...() and scale_y_...() functions sets the axis range; the ... part depends on the data type of the X axis and Y axis.

library(rvest)
library(ggplot2)

get_pp_stats <- function() {
  stats_url <- "https://www.basketball-reference.com/players/p/piercpa01.html"
  html_doc <- stats_url %>% 
    read_html()
  pts_css <- "#per_game .full_table .right:nth-child(30)"
  ast_css <- "#per_game .full_table .right:nth-child(25)"
  reb_css <- "#per_game .full_table .right:nth-child(24)"
  pts <- html_doc %>% 
    html_nodes(pts_css) %>% 
    html_text() %>% 
    as.numeric()
  ast <- html_doc %>% 
    html_nodes(ast_css) %>% 
    html_text() %>% 
    as.numeric()
  reb <- html_doc %>% 
    html_nodes(reb_css) %>% 
    html_text() %>% 
    as.numeric()
  year <- paste(1999:2017, "01", "01", sep = "-") %>% 
    as.Date()
  df <- data.frame(year = year,
                   pts = pts,
                   ast = ast,
                   reb = reb,
                   stringsAsFactors = FALSE)
  return(df)
}

pp_stats <- get_pp_stats()
avg_pts <- mean(pp_stats$pts)
pp_stats %>% 
  ggplot(aes(x = year, y = pts)) +
    geom_line() +
    ggtitle("Points per game: Paul Pierce in Boston") +
    # set the X-axis range and style
    scale_x_date(date_breaks = "1 year", date_labels = "%Y", limits = c(as.Date("1999-01-01"), as.Date("2013-01-01"))) +
    # set the Y-axis range
    scale_y_continuous(limits = c(15, 30)) +
    xlab("Year") +
    ylab("PPG")

R: adjusting the axis range

Tick marks and tick labels are likewise adjusted in the scale_x_...() and scale_y_...() functions, through the breaks and labels arguments.

library(dplyr)
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
labs <- c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")
df %>% 
  ggplot(aes(x = Pos)) +
  geom_bar(fill = "red", alpha = 0.5) +
  ggtitle("Front court players are the majorities.") +
  labs(subtitle = "Chicago Bulls is relatively weak in the paint.",
       caption = "Source: basketball-reference.com") +
  xlab("Positions") +
  ylab("Number of Players") +
  scale_x_discrete(labels = labs) +
  scale_y_continuous(breaks = 1:4)

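R: after adjusting tick marks and tick labels

Adding and Adjusting Legends

A legend shows how data is mapped to colors or styles.

Python

In Python, plt.legend() adds and adjusts a legend. For example, we can split the bars for the number of players at each position on the 1995-96 Chicago Bulls roster into two colors, one for the back court and one for the front court.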

import pandas as pd
import matplotlib.pyplot as plt

csv_url = "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df = pd.read_csv(csv_url)
grouped = df.groupby("Pos")
pos = grouped.count()
bar_1 = pos["Player"].loc[["SG", "PG"]].values
bar_2 = pos["Player"].loc[["SF", "PF", "C"]].values
plt.bar(range(1, 3), bar_1, label="Back Court", alpha=0.6, color="red")
plt.bar(range(3, 6), bar_2, label="Front Court", alpha=0.6, color="green")
plt.legend(title = "Court") # add the legend
plt.xticks(range(1, 6), ["SG", "PG", "SF", "PF", "C"]) # X-axis ticks and labels
plt.yticks(range(1, 5)) # Y-axis ticks and labels
plt.suptitle("Front court players are the majorities.")
plt.title("Chicago Bulls is relatively weak in the paint.")
plt.xlabel("Positions")
plt.ylabel("Number of Players")
plt.show()

Python: adding and adjusting the legend
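The legend entries come from the label argument of each plotting call, and the legend itself accepts placement and styling options. The following minimal sketch, assuming Matplotlib is installed and using made-up counts, places the legend in the upper-right corner without a frame, roughly mirroring the R example that moves the legend to the upper right.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Made-up placeholder counts; each bar group gets its own legend entry via label=.
ax.bar([1, 2], [3, 3], label="Back Court", color="red", alpha=0.6)
ax.bar([3, 4, 5], [4, 3, 2], label="Front Court", color="green", alpha=0.6)
ax.set_xticks([1, 2, 3, 4, 5])
ax.set_xticklabels(["SG", "PG", "SF", "PF", "C"])

legend = ax.legend(
    title="Court",       # heading shown above the entries
    loc="upper right",   # corner of the axes in which to place the legend
    frameon=False,       # no box around the legend
)
entries = [text.get_text() for text in legend.get_texts()]
plt.close(fig)
```

Artists created without a label are left out of the legend, which is a simple way to keep helper lines or shading from cluttering it.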

R

If we map data to line color, fill color, or line type inside aes(), ggplot2 in R adds a legend automatically.

library(dplyr) # provides the %>% pipe used below
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
labs <- c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")
df %>% 
  ggplot(aes(x = Pos, fill = Pos)) +
  geom_bar(alpha = 0.7) +
  ggtitle("Front court players are the majorities.") +
  labs(subtitle = "Chicago Bulls is relatively weak in the paint.",
       caption = "Source: basketball-reference.com") +
  xlab("Positions") +
  ylab("Number of Players") +
  scale_x_discrete(labels = labs) +
  scale_y_continuous(breaks = 1:4)

R: mapping position to fill color

To remove the legend, add theme(legend.position = "none").

library(dplyr) # provides the %>% pipe used below
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
labs <- c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")
df %>% 
  ggplot(aes(x = Pos, fill = Pos)) +
  geom_bar(alpha = 0.7) +
  ggtitle("Front court players are the majorities.") +
  labs(subtitle = "Chicago Bulls is relatively weak in the paint.",
       caption = "Source: basketball-reference.com") +
  xlab("Positions") +
  ylab("Number of Players") +
  scale_x_discrete(labels = labs) +
  scale_y_continuous(breaks = 1:4) +
  theme(legend.position = "none") # remove the legend

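R: legend removed

theme(legend.position = ) can also set where the legend is placed; for example, passing the coordinate pair (1, 1) puts it in the upper-right corner of the chart.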

library(dplyr) # provides the %>% pipe used below
library(ggplot2)

csv_url <- "https://storage.googleapis.com/ds_data_import/chicago_bulls_1995_1996.csv"
df <- read.csv(csv_url)
df$Court <- ifelse(df$Pos %in% c("SG", "PG"), "Back Court", "Front Court")
labs <- c("Center", "Power Forward", "Point Guard", "Small Forward", "Shooting Guard")
df %>% 
  ggplot(aes(x = Pos, fill = Court)) +
  geom_bar(alpha = 0.7) +
  ggtitle("Front court players are the majorities.") +
  labs(subtitle = "Chicago Bulls is relatively weak in the paint.",
       caption = "Source: basketball-reference.com") +
  xlab("Positions") +
  ylab("Number of Players") +
  scale_x_discrete(labels = labs) +
  scale_y_continuous(breaks = 1:4) +
  theme_minimal() +
  theme(legend.position=c(1,1), legend.justification = c(1, 1)) + # position the legend
  theme(legend.background=element_blank()) + # blank out the legend background
  theme(legend.key=element_blank()) # blank out the legend key background

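R: moving the legend to the upper right

Drawing Multiple Subplots on One Canvas

A practical technique that data science teams often use is to split the data into groups and show the groups side by side, which makes differences between groups easy to compare. Matplotlib in Python usually does this with subplots; ggplot2 in R usually does it with facets, which map a categorical variable to the grouping condition to achieve a similar result.

Python

In Python, plt.subplots(m, n) splits the canvas into an m x n grid, and the subplots are then filled in one by one. The common practice is to assign the canvas and the subplots to separate objects, usually named fig for the canvas and axes for the subplots, and then draw the subplots on the canvas in a loop.

If we want to know how NBA player salaries are distributed by position, besides exploring with a box plot we can also try overlapping histograms that use color to separate the positions.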

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salary from ESPN.COM
  """
  player_css = "td:nth-child(2) a"
  pos_css = ".evenrow td:nth-child(2) , .oddrow td:nth-child(2)"
  salary_css = ".evenrow td:nth-child(4) , .oddrow td:nth-child(4)"
  
  nba_salary_ranking_url = "http://www.espn.com/nba/salaries/_/page/{}/seasontype/4"
  nba_salary_ranking_urls = [nba_salary_ranking_url.format(i) for i in range(1, 10)]
  players = []
  positions = []
  salaries = []
  for nba_salary_ranking_url in nba_salary_ranking_urls:
    html_doc = pq(nba_salary_ranking_url)
    player = [p.text for p in html_doc(player_css)]
    player_pos = html_doc(".evenrow td:nth-child(2) , .oddrow td:nth-child(2)").text()
    player_pos = player_pos.split(" ")
    position = []
    for pp in player_pos:
      if pp in ["C", "PF", "SF", "SG", "PG", "G"]:
        position.append(pp)
    salary = [s.text for s in html_doc(salary_css)]
    salary = [s.replace(",", "") for s in salary]
    salary = [int(s.replace("$", "")) for s in salary]
    players = players + player
    positions = positions + position
    salaries = salaries + salary
    
  df = pd.DataFrame()
  df["player"] = players
  df["position"] = positions
  df["salary"] = salaries
  
  return df

nba_salary = get_nba_salary()
nba_salary[nba_salary["position"] == "PG"]["salary"].plot.hist(bins = 15, label = "PG")
nba_salary[nba_salary["position"] == "SG"]["salary"].plot.hist(bins = 15, label = "SG")
nba_salary[nba_salary["position"] == "G"]["salary"].plot.hist(bins = 15, label = "G")
nba_salary[nba_salary["position"] == "SF"]["salary"].plot.hist(bins = 15, label = "SF")
nba_salary[nba_salary["position"] == "PF"]["salary"].plot.hist(bins = 15, label = "PF")
nba_salary[nba_salary["position"] == "C"]["salary"].plot.hist(bins = 15, label = "C")
plt.legend()
plt.title("Player salary by positions")
plt.xlabel("Salary")
plt.show()

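Python overlapping histograms: NBA player salary distribution by position

With six positions in the data, however, a single chart of overlapping histograms is not clear enough, and six separate charts are inconvenient to compare. plt.subplots() can instead split the canvas into a 2x3 grid to which the subplots are added.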

from pyquery import PyQuery as pq
import pandas as pd
import matplotlib.pyplot as plt

def get_nba_salary():
  """
  Get NBA players' salary from ESPN.COM
  """
  player_css = "td:nth-child(2) a"
  pos_css = ".evenrow td:nth-child(2) , .oddrow td:nth-child(2)"
  salary_css = ".evenrow td:nth-child(4) , .oddrow td:nth-child(4)"
  
  nba_salary_ranking_url = "http://www.espn.com/nba/salaries/_/page/{}/seasontype/4"
  nba_salary_ranking_urls = [nba_salary_ranking_url.format(i) for i in range(1, 10)]
  players = []
  positions = []
  salaries = []
  for nba_salary_ranking_url in nba_salary_ranking_urls:
    html_doc = pq(nba_salary_ranking_url)
    player = [p.text for p in html_doc(player_css)]
    player_pos = html_doc(".evenrow td:nth-child(2) , .oddrow td:nth-child(2)").text()
    player_pos = player_pos.split(" ")
    position = []
    for pp in player_pos:
      if pp in ["C", "PF", "SF", "SG", "PG", "G"]:
        position.append(pp)
    salary = [s.text for s in html_doc(salary_css)]
    salary = [s.replace(",", "") for s in salary]
    salary = [int(s.replace("$", "")) for s in salary]
    players = players + player
    positions = positions + position
    salaries = salaries + salary
    
  df = pd.DataFrame()
  df["player"] = players
  df["position"] = positions
  df["salary"] = salaries
  
  return df

nba_salary = get_nba_salary()
fig, axes = plt.subplots(2, 3, figsize=(14, 4))

positions = nba_salary["position"].unique()
# draw the subplots
for (ax, pos) in zip(axes.ravel(), positions):
  ax.hist(nba_salary[nba_salary["position"] == pos]["salary"], bins=15)
  ax.set_xticks([])
  ax.set_title(pos)

plt.suptitle("Player salary by positions")
plt.subplots_adjust(top=0.8)
plt.show()

Python 2x3 histograms: NBA player salary distribution by position
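The 2x3 grid fits exactly six groups. As a hedged generalization, the sketch below computes the number of rows from the number of groups, shares the axes across panels so the histograms are directly comparable, and hides any unused panel. It assumes Matplotlib is installed and uses randomly generated salaries, not the scraped ESPN data; the five-group example is chosen only to show an empty panel being hidden.

```python
import math
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Randomly generated salaries for illustration only (fixed seed for repeatability).
rng = random.Random(0)
groups = {
    pos: [rng.uniform(1e6, 3e7) for _ in range(40)]
    for pos in ["C", "PF", "PG", "SF", "SG"]
}

n_cols = 3
n_rows = math.ceil(len(groups) / n_cols)   # enough rows to hold every group

fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 4),
                         sharex=True, sharey=True,   # same scales in every panel
                         squeeze=False)              # always return a 2-D array of axes
panels = axes.ravel()
for ax, (pos, salaries) in zip(panels, groups.items()):
    ax.hist(salaries, bins=15)
    ax.set_title(pos)
for ax in panels[len(groups):]:
    ax.set_visible(False)                  # hide panels that have no group to show

fig.suptitle("Player salary by positions")
visible_panels = sum(ax.get_visible() for ax in panels)
plt.close(fig)
```

Sharing the X and Y axes trades per-panel detail for comparability; drop sharex and sharey if each group should use its own scale, as in the original example.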

R

Overlapping histograms in R are not clear enough either.

library(rvest)
library(ggplot2)

# Get NBA players' salary data from ESPN.com
get_nba_salary_data <- function() {
  salary_urls <- sprintf("http://www.espn.com/nba/salaries/_/page/%s/seasontype/4", 1:9)
  players <- c()
  positions <- c()
  salaries <- c()
  for (salary_url in salary_urls) {
    player_pos_css <- ".evenrow td:nth-child(2) , .oddrow td:nth-child(2)"
    salary_css <- ".evenrow td:nth-child(4) , .oddrow td:nth-child(4)"
    player_pos <- salary_url %>% 
      read_html() %>% 
      html_nodes(player_pos_css) %>% 
      html_text()
    player_pos_split <- player_pos %>% 
      strsplit(split = ", ")
    player <- c()
    position <- c()
    for (i in 1:length(player_pos_split)) {
      player <- c(player, player_pos_split[[i]][1])
      position <- c(position, player_pos_split[[i]][2])
    }
    salary <- salary_url %>% 
      read_html() %>% 
      html_nodes(salary_css) %>% 
      html_text() %>% 
      gsub(pattern = "\\$", replacement = "") %>% 
      gsub(pattern = ",", replacement = "") %>% 
      as.numeric()
    positions <- c(positions, position)
    players <- c(players, player)
    salaries <- c(salaries, salary)
  }
  df <- data.frame(player = players, position = positions, salary = salaries, stringsAsFactors = FALSE)
  return(df)
}

df <- get_nba_salary_data()
ggplot(df, aes(x = salary, color = position)) +
  geom_histogram(alpha = 0.7, bins = 15)

ggplot(df, aes(x = salary, color = position)) +
  geom_freqpoly(alpha = 0.7, bins = 15)

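R overlapping histograms: NBA player salary distribution by position

R overlapping frequency polygons: NBA player salary distribution by position

In R, facet_wrap(~categorical_variable) draws the histogram for each level of the categorical variable in its own panel.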

library(rvest)
library(ggplot2)

# Get NBA players' salary data from ESPN.com
get_nba_salary_data <- function() {
  salary_urls <- sprintf("http://www.espn.com/nba/salaries/_/page/%s/seasontype/4", 1:9)
  players <- c()
  positions <- c()
  salaries <- c()
  for (salary_url in salary_urls) {
    player_pos_css <- ".evenrow td:nth-child(2) , .oddrow td:nth-child(2)"
    salary_css <- ".evenrow td:nth-child(4) , .oddrow td:nth-child(4)"
    player_pos <- salary_url %>% 
      read_html() %>% 
      html_nodes(player_pos_css) %>% 
      html_text()
    player_pos_split <- player_pos %>% 
      strsplit(split = ", ")
    player <- c()
    position <- c()
    for (i in 1:length(player_pos_split)) {
      player <- c(player, player_pos_split[[i]][1])
      position <- c(position, player_pos_split[[i]][2])
    }
    salary <- salary_url %>% 
      read_html() %>% 
      html_nodes(salary_css) %>% 
      html_text() %>% 
      gsub(pattern = "\\$", replacement = "") %>% 
      gsub(pattern = ",", replacement = "") %>% 
      as.numeric()
    positions <- c(positions, position)
    players <- c(players, player)
    salaries <- c(salaries, salary)
  }
  df <- data.frame(player = players, position = positions, salary = salaries, stringsAsFactors = FALSE)
  return(df)
}

df <- get_nba_salary_data()
ggplot(df, aes(x = salary)) +
  geom_histogram(alpha = 0.7, bins = 15) +
  ggtitle("Player salary by positions") +
  xlab("Salary") +
  facet_wrap(~position)

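R 2x3 histograms: NBA player salary distribution by position

Summary

In this section we briefly introduced how to use the different components of the visualization packages in Python and R to adjust a chart's appearance and add information, so that a visualization can serve the purpose of exploratory data analysis: adjusting the canvas theme, adding titles and axis labels, adding annotations, adjusting the axes, adding and adjusting the legend, and drawing multiple subplots on one canvas.

It is worth noting that the components and extension plug-ins a data science team can adjust in a visualization are too numerous to count, so what can be adjusted includes, but is by no means limited to, what this article covers.

Further Reading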