Finding the Coefficients of Logistic Regression

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

Tom Mitchell

Having worked through finding the coefficients of a regression model and evaluating it, we now have a fair grasp of the machine learning problems a data science team faces: splitting data into training, validation, and test sets, standardizing units and scales, finding regression coefficients, evaluating model performance with cross-validation, and using regularization to avoid overfitting. Next we explore the other branch of supervised learning: classification models.

About Classification Models

Machine learning is a method that internalizes the ability to predict or to extract features into a computer program through input data. A model involves three elements: data (Experience), task (Task), and evaluation (Performance). Taking a shipwreck passenger survival prediction model as an example, its three elements are:

  • Data (Experience): a number of passenger records with variables such as age, sex, socioeconomic status, and whether the passenger survived
  • Task (Task): use the model to classify the observations in the test data that lack a survival label
  • Evaluation (Performance): the classification accuracy of the model's predictions

Plus one proviso: as the amount of data grows, the classification accuracy should improve.

The Learning Dataset

In this article we use a subset of the observations and variables from the Titanic dataset, in both Python and R, to get to know machine learning and classification models. After obtaining the Titanic data you will notice missing values in both the labeled and the unlabeled data. Note that when entering the competition you should apply the same imputation rules (Imputations) to the training, validation, and test data; at the learning stage, simply dropping the labeled observations with missing values is the more convenient approach.
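For instance, a shared fill-in rule can be learned from the training split only and then reused on the other splits. Here is a minimal sketch with scikit-learn, assuming hypothetical train and validation DataFrames that both contain an Age column (the rest of this article simply drops the missing observations instead):

from sklearn.impute import SimpleImputer

# Learn the median Age from the training split only ...
imputer = SimpleImputer(strategy="median")
train["Age"] = imputer.fit_transform(train[["Age"]]).ravel()
# ... and apply the very same median to the validation (and test) split.
validation["Age"] = imputer.transform(validation[["Age"]]).ravel()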

Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Kaggle

Draw a scatter plot with fare (Fare) on the horizontal axis and age (Age) on the vertical axis, and mark survival and death with different colors. If the goal of a classification model had to be summed up in one sentence, it is to find a decision boundary that separates the survived markers from the dead markers on the scatter plot. Under that decision boundary, the more data points that are classified correctly (blue points falling in the blue region, red points in the red region) and the fewer that are misclassified (blue points in the red region, red points in the blue region), the better the classification model performs.

Finding a decision boundary that separates the survived and the dead markers on the scatter plot

Python

import pandas as pd
import matplotlib.pyplot as plt

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
survived = labeled[labeled["Survived"] == 1]
dead = labeled[labeled["Survived"] == 0]
plt.scatter(survived["Fare"], survived["Age"], label="Survived", color="blue", marker="o", alpha=0.5)
plt.scatter(dead["Fare"], dead["Age"], label="Dead", color="red", marker="x", alpha=0.5)
plt.xlabel("Fare")
plt.ylabel("Age")
plt.legend()
plt.show()

Survival and death marked with different colors

R

library(ggplot2)

labeled <- read.csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled <- labeled[!is.na(labeled$Age), ]
labeled$Survived <- as.character(labeled$Survived)
ggplot(labeled, aes(x = Fare, y = Age, color = Survived)) +
  geom_point() +
  xlab("Fare") +
  ylab("Age") +
  scale_color_hue(labels = c("Dead", "Survived")) +
  theme(legend.title = element_blank())

Survival and death marked with different colors

The Sigmoid Function

The stepping stone from regression models to a basic classification model is the logistic regression model. Although its name contains the word regression, it is a classification model through and through: the numeric output of a regression model is passed through the Sigmoid function (also called the Logistic function) and mapped into the interval between 0 and 1, giving a continuous probability output.

Mapping the numeric output of the regression model into the interval between 0 and 1
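In symbols, the Sigmoid function that the code below implements is

$$g(z) = \frac{1}{1 + e^{-z}}$$

so feeding the regression output $z = X\theta$ into $g$ always yields a value strictly between 0 and 1 that can be read as a probability.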

Python

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
  return 1/(1 + np.exp(-z))

x_arr = np.linspace(-10, 10)
y_arr = np.array(list(map(sigmoid, x_arr)))
plt.plot(x_arr, y_arr)
plt.xticks([-10, 0, 10], [r"$-\infty$", "$0$", r"$\infty$"])
plt.yticks([0, 0.5, 1], ["0", "0.5", "1"])
plt.xlabel("$X\\theta$")
plt.ylabel("$g(X\\theta)$")
plt.title("Sigmoid Function")
plt.show()

The Sigmoid function

R

library(ggplot2)

sigmoid <- function(z) {
  return(1/(1 + exp(-z)))
}

ggplot(data.frame(x = c(-10, 10)), aes(x = x)) +
  stat_function(fun = sigmoid, geom = "line") +
  xlab("z") +
  ylab("Sigmoid(z)") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  ggtitle("Sigmoid Function")

The Sigmoid function

The Cost Function of Logistic Regression

In regression models the cost function we faced was the mean squared error. For a classification model we instead need to design a cost function that approaches zero when the prediction is correct (a passenger classified as survived actually survived, a passenger classified as dead actually died) and approaches infinity when the prediction is wrong (a passenger classified as survived actually died, a passenger classified as dead actually survived). The logarithm function lets us design a cost function with exactly these properties.
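Written out per observation (this is the standard form that the plots below visualize), the cost is

$$\mathrm{Cost}\big(g(\hat{y}),\, y\big) = \begin{cases} -\log\big(g(\hat{y})\big) & \text{if } y = 1 \\ -\log\big(1 - g(\hat{y})\big) & \text{if } y = 0 \end{cases}$$

so a confident correct prediction costs almost nothing, while a confident wrong prediction is punished with a cost that grows without bound.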

Python

import numpy as np
import matplotlib.pyplot as plt

x_arr = np.linspace(0, 1)
y_arr_0 = -np.log(x_arr)
y_arr_1 = -np.log(1 - x_arr)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(x_arr, y_arr_0)
axes[0].set_title(r"$y=1, -\log(g(\hat{y}))$")
axes[0].set_xlabel(r"$g(\hat{y})$")
axes[1].plot(x_arr, y_arr_1)
axes[1].set_title(r"$y=0, -\log(1 - g(\hat{y}))$")
axes[1].set_xlabel(r"$g(\hat{y})$")
plt.show()

Cost functions with the desired behavior, designed with the logarithm function

R

library(ggplot2)
library(gridExtra)

log_function_1 <- function(x) {
  return(-log(x))
}

log_function_0 <- function(x) {
  return(-log(1 - x))
}

x_vec <- seq(0, 1, length.out = 50)
gg1 <- ggplot(data.frame(x = x_vec), aes(x = x)) +
  stat_function(fun = log_function_1, geom = "line")
gg2 <- ggplot(data.frame(x = x_vec), aes(x = x)) +
  stat_function(fun = log_function_0, geom = "line")
grid.arrange(gg1, gg2, nrow = 1, ncol = 2)

Cost functions with the desired behavior, designed with the logarithm function

Combining the two cost functions that meet these requirements into a single cost function:

$$\mathrm{Cost}\big(g(\hat{y}),\, y\big) = -\Big[\, y \log\big(g(\hat{y})\big) + (1 - y)\log\big(1 - g(\hat{y})\big)\Big]$$

With this combined design, when the actual value is 1 the first half of the cost function comes into play:

$$y = 1:\qquad \mathrm{Cost} = -\log\big(g(\hat{y})\big)$$

and when the actual value is 0 the second half of the cost function comes into play:

$$y = 0:\qquad \mathrm{Cost} = -\log\big(1 - g(\hat{y})\big)$$

Writing the combined cost function in matrix form (vectorized form):

$$J(\theta) = -\frac{1}{m}\Big[\, y^{T}\log\big(g(X\theta)\big) + (1 - y)^{T}\log\big(1 - g(X\theta)\big)\Big]$$

Expressing the combined cost function in code:

import numpy as np

def sigmoid(z):
  return 1/(1 + np.exp(-z))

def cost_function(X, y, thetas):
  m = X.shape[0]
  h = sigmoid(X.dot(thetas))
  J = -1*(1/m)*(np.log(h).T.dot(y)+np.log(1-h).T.dot(1-y))
  # log(0) approached -Inf results in NaN
  if np.isnan(J[0]):
    return(np.inf)
  else:
    return(J[0])

Finding the Coefficients of Logistic Regression

Next we take the partial derivative of the cost function with respect to the coefficient vector to obtain the gradient (Gradient) that gradient descent updates at every iteration. Before differentiating, we first use a property of the logarithm (the log of a quotient can be written as the log of the numerator minus the log of the denominator) to rearrange the vectorized cost function:

Rearranging the vectorized cost function with the property of the logarithm

Substituting the two expressions above back into the cost function and simplifying:

The two expressions substituted back into the cost function and simplified

Then we take the partial derivative of the cost function with respect to the coefficient vector, applying the derivative of the logarithm and the chain rule:

Taking the partial derivative of the cost function with respect to the coefficient vector
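However the intermediate algebra is laid out, the result of the differentiation is the familiar gradient of the logistic cost, which is exactly what the gradient() function below computes:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}\, X^{T}\big(g(X\theta) - y\big)$$

and each gradient-descent iteration then updates the coefficients as $\theta := \theta - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta}$ for some learning rate $\alpha$.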

After differentiating we have the formula for the gradient that gradient descent needs at every update iteration, which can be expressed in code:

import numpy as np

def sigmoid(z):
  return 1/(1 + np.exp(-z))

def gradient(X, y, thetas):
  m = y.size
  h = sigmoid(X.dot(thetas.reshape(-1,1)))
  grad =(1/m)*X.T.dot(h-y)
  return(grad.ravel())

We now have the cost function of the logistic regression model and a gradient function that computes the gradient for any set of coefficients. All that remains is to initialize a coefficient vector (usually all zeros) and rely on gradient descent to find the coefficients that minimize the cost function. Using the minimize() function from the scipy.optimize module, we find the coefficients for fare (Fare) and age (Age) (plus, of course, the intercept term).

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.optimize import minimize

def sigmoid(z):
  return 1/(1 + np.exp(-z))

# Must put thetas as the first arg
def cost_function(thetas, X, y):
  m = X.shape[0]
  h = sigmoid(X.dot(thetas))
  J = -1*(1/m)*(np.log(h).T.dot(y)+np.log(1-h).T.dot(1-y))
  # log(0) approached -Inf results in NaN
  if np.isnan(J[0]):
    return(np.inf)
  else:
    return(J[0])

# Must put thetas as the first arg
def gradient(thetas, X, y):
  m = y.size
  h = sigmoid(X.dot(thetas.reshape(-1,1)))
  grad =(1/m)*X.T.dot(h-y)
  return(grad.ravel())

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Age", "Fare"]].values
ones = np.ones(X_train.shape[0]).reshape(-1, 1)
X_train = np.concatenate([ones, X_train], axis=1)
y_train = train.loc[:, "Survived"].values.reshape(-1, 1)
initial_thetas = np.zeros(X_train.shape[1])
res = minimize(cost_function, initial_thetas, args=(X_train, y_train), method=None, jac=gradient, options={'maxiter':400})
thetas = res.x.reshape(-1, 1)
print(thetas)
## [[-0.62176132]
##  [-0.01014703]
##  [ 0.01692017]]

Once we have obtained this hard-won coefficient vector, multiplying it with the validation data and passing the result through the Sigmoid function gives the predicted values.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.optimize import minimize

def sigmoid(z):
  return 1/(1 + np.exp(-z))

# Must put thetas as the first arg
def cost_function(thetas, X, y):
  m = X.shape[0]
  h = sigmoid(X.dot(thetas))
  J = -1*(1/m)*(np.log(h).T.dot(y)+np.log(1-h).T.dot(1-y))
  # log(0) approached -Inf results in NaN
  if np.isnan(J[0]):
    return(np.inf)
  else:
    return(J[0])

# Must put thetas as the first arg
def gradient(thetas, X, y):
  m = y.size
  h = sigmoid(X.dot(thetas.reshape(-1,1)))
  grad =(1/m)*X.T.dot(h-y)
  return(grad.ravel())

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Age", "Fare"]].values
ones = np.ones(X_train.shape[0]).reshape(-1, 1)
X_train = np.concatenate([ones, X_train], axis=1)
y_train = train.loc[:, "Survived"].values.reshape(-1, 1)
initial_thetas = np.zeros(X_train.shape[1])
res = minimize(cost_function, initial_thetas, args=(X_train, y_train), method=None, jac=gradient, options={'maxiter':400})
thetas = res.x.reshape(-1, 1)
X_validation = validation.loc[:, ["Age", "Fare"]].values
ones = np.ones(X_validation.shape[0]).reshape(-1, 1)
X_validation = np.concatenate([ones, X_validation], axis=1)
y_hat = np.dot(X_validation, thetas)
print("Before Sigmoid transform:")
print(y_hat[:5])
g_y_hat = sigmoid(y_hat)
print("After Sigmoid transform:")
print(g_y_hat[:5])
## Before Sigmoid transform:
## [[-0.41524398]
##  [-0.78135263]
##  [-0.78939634]
##  [-0.65171855]
##  [ 1.03819849]]
## After Sigmoid transform:
## [[0.39765538]
##  [0.31402844]
##  [0.3122983 ]
##  [0.34260237]
##  [0.73850225]]

The Step Function

After the Sigmoid function has mapped the regression output into the interval between 0 and 1, the final step is to use a Step function to map the probability onto the two discrete values 0 and 1. The commonly used threshold is 0.5: if the probability is greater than or equal to 0.5 the prediction is 1, otherwise it is 0.
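As a formula, with the conventional threshold of 0.5 (the threshold itself can be tuned), the Step function is

$$\mathrm{step}(p) = \begin{cases} 1 & \text{if } p \geq 0.5 \\ 0 & \text{if } p < 0.5 \end{cases}$$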

The Step function

import numpy as np
import matplotlib.pyplot as plt
  
x_arr_0 = np.arange(0, 0.5, 0.01)
x_arr_1 = np.arange(0.5, 1, 0.01)
y_arr_0 = np.where(x_arr_0 >= 0.5, 1, 0)
y_arr_1 = np.where(x_arr_1 >= 0.5, 1, 0)
plt.plot(x_arr_0, y_arr_0, color="b")
plt.plot(x_arr_1, y_arr_1, color="b")
plt.scatter([0.5], [0], facecolors='none', edgecolors='b')
plt.scatter([0.5], [1], color='b')
plt.xticks([0, 0.5, 1])
plt.yticks([0, 1])
plt.xlabel("$g(X\\theta)$")
plt.title("Step Function")
plt.show()

The Step function

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.optimize import minimize

def sigmoid(z):
  return 1/(1 + np.exp(-z))

# Must put thetas as the first arg
def cost_function(thetas, X, y):
  m = X.shape[0]
  h = sigmoid(X.dot(thetas))
  J = -1*(1/m)*(np.log(h).T.dot(y)+np.log(1-h).T.dot(1-y))
  # log(0) approached -Inf results in NaN
  if np.isnan(J[0]):
    return(np.inf)
  else:
    return(J[0])

# Must put thetas as the first arg
def gradient(thetas, X, y):
  m = y.size
  h = sigmoid(X.dot(thetas.reshape(-1,1)))
  grad =(1/m)*X.T.dot(h-y)
  return(grad.ravel())

def step(g_y_hat, threshold=0.5):
  return np.where(g_y_hat >= threshold, 1, 0).reshape(-1, 1)

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Age", "Fare"]].values
ones = np.ones(X_train.shape[0]).reshape(-1, 1)
X_train = np.concatenate([ones, X_train], axis=1)
y_train = train.loc[:, "Survived"].values.reshape(-1, 1)
initial_thetas = np.zeros(X_train.shape[1])
res = minimize(cost_function, initial_thetas, args=(X_train, y_train), method=None, jac=gradient, options={'maxiter':400})
thetas = res.x.reshape(-1, 1)
X_validation = validation.loc[:, ["Age", "Fare"]].values
ones = np.ones(X_validation.shape[0]).reshape(-1, 1)
X_validation = np.concatenate([ones, X_validation], axis=1)
y_hat = np.dot(X_validation, thetas)
print("Before Sigmoid transform:")
print(y_hat[:5])
g_y_hat = sigmoid(y_hat)
print("After Sigmoid transform:")
print(g_y_hat[:5])
y_pred = step(g_y_hat)
print("After Step transform:")
print(y_pred[:5])
y_true = validation.loc[:, "Survived"].values.reshape(-1, 1)
print("True condition:")
print(y_true[:5])
## Before Sigmoid transform:
## [[-0.41524398]
##  [-0.78135263]
##  [-0.78939634]
##  [-0.65171855]
##  [ 1.03819849]]
## After Sigmoid transform:
## [[0.39765538]
##  [0.31402844]
##  [0.3122983 ]
##  [0.34260237]
##  [0.73850225]]
## After Step transform:
## [[0]
##  [0]
##  [0]
##  [0]
##  [1]]
## True condition:
## [[1]
##  [0]
##  [1]
##  [1]
##  [0]]

Comparing the predictions with the first five actual values in the validation data, we can observe that only the second observation is correctly classified (predicted dead, actually dead, formally a True Negative); the first, third, and fourth observations are misclassified (predicted dead, actually survived, False Negatives); and the fifth observation is misclassified (predicted survived, actually dead, a False Positive).
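To tally these four outcomes (True Positive, False Positive, True Negative, False Negative) over the whole validation set rather than only its first five observations, a confusion matrix is convenient. Here is a minimal sketch with scikit-learn, assuming the y_pred and y_true arrays produced in the example above:

from sklearn.metrics import confusion_matrix

# Rows are the actual classes (0 = dead, 1 = survived),
# columns are the predicted classes:
# [[TN, FP],
#  [FN, TP]]
conf_mat = confusion_matrix(y_true.ravel(), y_pred.ravel())
print(conf_mat)
tn, fp, fn, tp = conf_mat.ravel()
print("Accuracy:", (tn + tp) / (tn + fp + fn + tp))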

Finding Coefficients with Modules and Packages

We now know how to find the coefficients of logistic regression with gradient descent (Gradient Descent); next we can lean on the rich set of functions in Python and R to help with this task.

Python

In Python we use LogisticRegression() from sklearn.linear_model to find the coefficients. Scikit-Learn estimators are called in a consistent way: first initialize the object, then feed in the training data with the .fit() method.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
  return 1/(1 + np.exp(-z))

def step(g_y_hat, threshold=0.5):
  return np.where(g_y_hat >= threshold, 1, 0).reshape(-1, 1)

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Age", "Fare"]].values
y_train = train.loc[:, "Survived"].values
logistic_clf = LogisticRegression()
logistic_clf.fit(X_train, y_train)
fit_intercept = logistic_clf.intercept_.reshape(-1, 1)
fit_coef = logistic_clf.coef_.reshape(-1, 1)
thetas = np.concatenate([fit_intercept, fit_coef])
X_validation = validation.loc[:, ["Age", "Fare"]].values
ones = np.ones(X_validation.shape[0]).reshape(-1, 1)
X_validation = np.concatenate([ones, X_validation], axis=1)
y_hat = np.dot(X_validation, thetas)
g_y_hat = sigmoid(y_hat)
y_pred = step(g_y_hat)
y_true = validation.loc[:, "Survived"].values.reshape(-1, 1)
print("Thetas from sklearn:")
print(thetas)
print("Before Sigmoid transform:")
print(y_hat[:5])
print("After Sigmoid transform:")
print(g_y_hat[:5])
print("After Step transform:")
print(y_pred[:5])
print("True condition:")
print(y_true[:5])
## Thetas from sklearn:
## [[-0.58965696]
##  [-0.01094766]
##  [ 0.01680808]]
## Before Sigmoid transform:
## [[-0.40012884]
##  [-0.80026166]
##  [-0.78308464]
##  [-0.63243339]
##  [ 1.0436855 ]]
## After Sigmoid transform:
## [[0.40128138]
##  [0.30996955]
##  [0.31365546]
##  [0.34695898]
##  [0.7395605 ]]
## After Step transform:
## [[0]
##  [0]
##  [0]
##  [0]
##  [1]]
## True condition:
## [[1]
##  [0]
##  [1]
##  [1]
##  [0]]

R

In R we use the built-in glm() function to find the coefficients.

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

sigmoid <- function(z) {
  return(1/(1 + exp(-z)))
}

step <- function(g_y_hat, threshold = 0.5) {
  return(ifelse(g_y_hat >= threshold, yes = 1, no = 0))
}

labeled <- read.csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled <- labeled[!(is.na(labeled$Age)), ]
split_data <- get_train_validation(labeled)
train <- split_data$train
validation <- split_data$validation
logistic_clf <- glm(Survived ~ Age + Fare, data = train, family = "binomial")
thetas <- as.matrix(logistic_clf$coefficients)
X_validation <- as.matrix(validation[, c("Age", "Fare")])
ones <- as.matrix(rep(1, times = nrow(X_validation)))
X_validation <- cbind(ones, X_validation)
y_hat <- X_validation %*% thetas
g_y_hat <- sigmoid(y_hat)
y_pred <- step(g_y_hat)
y_true <- validation$Survived
print("Thetas from glm():")
print(thetas)
print("Before Sigmoid transform:")
print(y_hat[1:5])
print("After Sigmoid transform:")
print(g_y_hat[1:5])
print("After Step transform:")
print(y_pred[1:5])
print("True condition:")
print(y_true[1:5])
## [1] "Thetas from glm():"
##                    [,1]
## (Intercept) -0.49961899
## Age         -0.01364591
## Fare         0.01398677
## [1] "Before Sigmoid transform:"
## [1] -0.7763935 -0.7050698 -0.8786190 -0.2014215  0.1313069
## [1] "After Sigmoid transform:"
## [1] 0.3150977 0.3306892 0.2934640 0.4498142 0.5327797
## [1] "After Step transform:"
## [1] 0 0 0 0 1
## [1] "True condition:"
## [1] 0 1 0 0 1

Drawing the Decision Boundary on the Scatter Plot

Having found the coefficients, we can draw the decision boundary (Decision Boundary) of the logistic regression model on the scatter plot. The recipe is to lay a dense, even grid of points over the plane of the scatter plot, treat those grid points as validation data and feed them to the model to obtain classification results, and then use a contour plot (Contour Plot) to color each grid point by its predicted class on top of the scatter plot.
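One detail worth noting before plotting: with the 0.5 threshold, the decision boundary is the set of points where the predicted probability equals exactly 0.5, and because $g(0) = 0.5$ that is precisely where the linear combination is zero,

$$g(X\theta) = 0.5 \iff X\theta = 0 \iff \theta_0 + \theta_1 \cdot \mathrm{Fare} + \theta_2 \cdot \mathrm{Age} = 0$$

which is why the boundary that appears in the plots below is a straight line in the Fare-Age plane.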

Python

In Python we use the meshgrid() function from the NumPy module to lay a dense, even grid over the plane of the scatter plot.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
  return 1/(1 + np.exp(-z))

def step(g_y_hat, threshold=0.5):
  return np.where(g_y_hat >= threshold, 1, 0).reshape(-1, 1)

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
survived = labeled[labeled["Survived"] == 1]
dead = labeled[labeled["Survived"] == 0]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Fare", "Age"]].values
y_train = train.loc[:, "Survived"].values
logistic_clf = LogisticRegression()
logistic_clf.fit(X_train, y_train)
fit_intercept = logistic_clf.intercept_.reshape(-1, 1)
fit_coef = logistic_clf.coef_.reshape(-1, 1)
thetas = np.concatenate([fit_intercept, fit_coef])
# Decision boundary plot
fare_min, fare_max = labeled["Fare"].min(), labeled["Fare"].max()
age_min, age_max = labeled["Age"].min(), labeled["Age"].max()
fare_arr = np.linspace(fare_min - 5, fare_max + 5, 1000)
age_arr = np.linspace(age_min - 5, age_max + 5, 1000)
xx, yy = np.meshgrid(fare_arr, age_arr)
ones = np.ones(xx.size).reshape(-1, 1)
X_grid = np.concatenate([ones, xx.reshape(-1, 1), yy.reshape(-1, 1)], axis=1)
y_grid = np.dot(X_grid, thetas)
g_y_grid = sigmoid(y_grid)
y_grid_pred = step(g_y_grid)
Z = y_grid_pred.reshape(xx.shape)
plt.scatter(survived["Fare"], survived["Age"], label="Survived", marker="o", color="blue")
plt.scatter(dead["Fare"], dead["Age"], label="Dead", marker="x", color="red")
plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.coolwarm_r)
plt.legend(loc="upper right")
plt.xlabel("Fare")
plt.ylabel("Age")
plt.show()

The decision boundary drawn on the scatter plot

R

In R we use the expand.grid() function to lay a dense, even grid over the plane of the scatter plot.

library(scales)  # provides alpha(), used below for the translucent grid points

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

sigmoid <- function(z) {
  return(1/(1 + exp(-z)))
}

step <- function(g_y_hat, threshold = 0.5) {
  return(ifelse(g_y_hat >= threshold, yes = 1, no = 0))
}

labeled <- read.csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled <- labeled[!(is.na(labeled$Age)), ]
#survived <- labeled[labeled$Survived == 1, ]
#dead <- labeled[labeled$Survived == 0, ]
split_data <- get_train_validation(labeled)
train <- split_data$train
logistic_clf <- glm(Survived ~ Fare + Age, data = train, family = "binomial")
thetas <- as.matrix(logistic_clf$coefficients)
# Decision boundary plot
fare_min <- min(labeled$Fare)
fare_max <- max(labeled$Fare)
age_min <- min(labeled$Age)
age_max <- max(labeled$Age)
res <- 200
fare_vec <- seq(fare_min - 5, fare_max + 5, length.out = res)
age_vec <- seq(age_min - 5, age_max + 5, length.out = res)
gd <- expand.grid(fare_vec, age_vec)
X_grid <- as.matrix(gd)
ones <- rep(1, times = nrow(X_grid))
X_grid <- cbind(ones, X_grid)
y_grid <- X_grid %*% thetas
g_y_grid <- sigmoid(y_grid)
y_grid_pred <- step(g_y_grid)
Z <- matrix(y_grid_pred, nrow = res)
contour(fare_vec, age_vec, Z, labels = "", xlab = "", ylab = "", axes=FALSE)
points(labeled$Fare, labeled$Age, col = ifelse(labeled$Survived == 1, rgb(86, 180, 233, maxColorValue = 255), rgb(213, 94, 0, maxColorValue = 255)), pch = ifelse(labeled$Survived == 1, 16, 4), lwd = 2)
points(gd, pch = "." , cex = 1.2, col = alpha(ifelse(Z == 1, "cornflowerblue", "coral"), 0.4))
box()

The decision boundary drawn on the scatter plot

Summary

In this section we used observations and variables from the Titanic dataset to introduce classification models, the learning dataset, the Sigmoid function, the cost function of logistic regression, finding the coefficients of logistic regression, the Step function, finding coefficients with modules and packages, and drawing the decision boundary on the scatter plot.

Further Reading