Evaluating Classification Models

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

Tom Mitchell

In "Finding the Coefficients of Logistic Regression" we learned how to find the coefficients of a logistic regression model in Python and R, and used scatter plots and contour plots to draw decision boundary plots, giving us a visual way to inspect correctly and incorrectly classified data points and a better understanding of classification models. Next we will use accuracy to evaluate how a classification model performs on the validation data: the higher the accuracy, the more effectively the decision boundary separates the data. By comparing the accuracy of different models, a data science team can choose the classification model best suited for deployment to production. Beyond adding new variables and polynomial terms to the training data, more advanced ways to improve a model's performance include applying different cost functions (changing the classification algorithm) and ensemble learning (combining multiple classification algorithms).

In the process of trying to raise accuracy we will also see the logistic regression model run into new challenges, such as building nonlinear decision boundaries, the over-fitting that comes with them, and multi-class classification. The data science team brings in techniques such as higher-order polynomial terms, regularization, and One-vs.-all to deal with these.

Learning from Data

Machine learning is a way of using input data to internalize the ability to predict or to extract features into a computer program. A model involves three elements: data (Experience), task (Task), and evaluation (Performance), plus one proviso.

Take a model that predicts whether a shipwreck passenger survives as an example; its three elements are:

  • Data (Experience): a certain number of passenger records with variables such as age, sex, socio-economic status, and survival status
  • Task: use the model to classify the observations in the test data that have no survival label
  • Evaluation (Performance): the classification accuracy of the model's predictions

A shipwreck passenger survival prediction model

Take a classification model that recognizes images of the handwritten digits 0 to 9 as another example; its three elements are:

  • Data (Experience): a certain number of handwritten digit images, each with 784 variables recording the grayscale intensity of every pixel in the 28 x 28 image, plus a label from 0 to 9
  • Task: use the model to classify the observations in the test data that have no 0-to-9 label
  • Evaluation (Performance): the classification accuracy of the model's predictions

A handwritten digit image classification model

The proviso in both learning problems is the same: as more data becomes available, the classification accuracy should go up.

Confusion Matrix

There are many metrics for evaluating classification results, and they all derive from the confusion matrix. For a binary classification model this is a 2 x 2 table: the first quadrant counts the passengers predicted to survive who actually died (False Positive, FP); the second quadrant counts those predicted to survive who actually survived (True Positive, TP); the third quadrant counts those predicted to die who actually survived (False Negative, FN); and the fourth quadrant counts those predicted to die who actually died (True Negative, TN).

Confusion matrix; image source: Wikipedia
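As a minimal sketch of how these four cells are counted, consider the following toy example (the y_true and y_pred arrays are made up for illustration and are not taken from the Titanic data; 1 marks survival and 0 marks death):

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # actual labels (hypothetical)
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])  # predicted labels (hypothetical)
tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted survived, actually survived
fp = np.sum((y_pred == 1) & (y_true == 0))   # predicted survived, actually died
fn = np.sum((y_pred == 0) & (y_true == 1))   # predicted died, actually survived
tn = np.sum((y_pred == 0) & (y_true == 0))   # predicted died, actually died
print(tp, fp, fn, tn)  # 2 1 2 3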

The simplest and most intuitive classification metric is accuracy, computed by dividing the number of correctly classified observations by the total number of observations in the validation data:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Continuing from "Finding the Coefficients of Logistic Regression", we can compute by hand the confusion matrix and accuracy that the logistic regression model obtains on the Titanic data.

Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Age", "Fare"]].values
y_train = train.loc[:, "Survived"].values
# Fit Logistic regression classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)
X_validation = validation.loc[:, ["Age", "Fare"]].values
y_validation = validation.loc[:, "Survived"].values
y_hat = clf.predict(X_validation)
# Calculating confusion matrix
nunique_labels = len(set(y_train))
conf_mat_shape = (nunique_labels, nunique_labels)
conf_mat = np.zeros(conf_mat_shape, dtype=int)
for actual, predict in zip(y_validation, y_hat):  # rows: actual labels, columns: predicted labels
  conf_mat[actual, predict] += 1
# Calculating accuracy
accuracy = (conf_mat[0, 0] + conf_mat[1, 1])/conf_mat.sum()
print(conf_mat)
print("Accuracy: {:.2f}%".format(accuracy*100))
## [[121  10]
##  [ 63  21]]
## Accuracy: 66.05%
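Reading off this confusion matrix, the accuracy works out to (121 + 21) / (121 + 10 + 63 + 21) = 142 / 215 ≈ 66.05%, which matches the printed result.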

In Python we can simply call the confusion_matrix() and accuracy_score() functions from the sklearn.metrics module to obtain the classification model's confusion matrix and accuracy, saving us the extra computation.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Age", "Fare"]].values
y_train = train.loc[:, "Survived"].values
# Fit Logistic regression classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)
X_validation = validation.loc[:, ["Age", "Fare"]].values
y_validation = validation.loc[:, "Survived"].values
y_hat = clf.predict(X_validation)
# Calculating confusion matrix
conf_mat = confusion_matrix(y_validation, y_hat)
# Calculating accuracy
accuracy = accuracy_score(y_validation, y_hat)
print(conf_mat)
print("Accuracy: {:.2f}%".format(accuracy*100))
## [[121  10]
##  [ 63  21]]
## Accuracy: 66.05%

R

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

sigmoid <- function(z) {
  return(1/(1 + exp(-z)))
}

step <- function(g_y_hat, threshold = 0.5) {
  return(ifelse(g_y_hat >= threshold, yes = 1, no = 0))
}

labeled <- read.csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled <- labeled[!(is.na(labeled$Age)), ]
split_data <- get_train_validation(labeled)
train <- split_data$train
validation <- split_data$validation
logistic_clf <- glm(Survived ~ Fare + Age, data = train, family = "binomial")
thetas <- as.matrix(logistic_clf$coefficients)
Fare_validation <- as.matrix(validation$Fare)
Age_validation <- as.matrix(validation$Age)
ones <- rep(1, times = nrow(Fare_validation))
X_validation <- cbind(ones, Fare_validation, Age_validation)
y_hat <- X_validation %*% thetas
g_y_hat <- sigmoid(y_hat)
y_pred <- step(g_y_hat)
# Calculating confusion matrix
conf_mat <- table(validation$Survived, y_pred)
# Calculating accuracy
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
sprintf("Accuracy: %.2f%%", accuracy*100)
conf_mat

Building a Nonlinear Decision Boundary

Suppose the data science team decides that a straight-line decision boundary cannot effectively separate the two classes of data points.

Python

Without changing the classification algorithm, in Python we can use PolynomialFeatures() from the sklearn.preprocessing module to add higher-order terms to the training data and keep using logistic regression as the classifier.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import PolynomialFeatures

def plot_decision_boundary(xlab, ylab, clf, labeled, pos_label="Survived", neg_label="Dead", clf_target="Survived", degree=1):
  xx_min, xx_max = labeled[xlab].min(), labeled[xlab].max()
  yy_min, yy_max = labeled[ylab].min(), labeled[ylab].max()
  xx_arr = np.linspace(xx_min - 5, xx_max + 5, 1000)
  yy_arr = np.linspace(yy_min - 5, yy_max + 5, 1000)
  xx, yy = np.meshgrid(xx_arr, yy_arr)
  X_grid = np.concatenate([xx.reshape(-1, 1), yy.reshape(-1, 1)], axis=1)
  X_grid_poly = PolynomialFeatures(degree).fit_transform(X_grid)
  Z = clf.predict(X_grid_poly).reshape(xx.shape)
  pos = labeled[labeled[clf_target] == 1]
  neg = labeled[labeled[clf_target] == 0]
  plt.scatter(pos[xlab], pos[ylab], label=pos_label, marker="o", color="blue")
  plt.scatter(neg[xlab], neg[ylab], label=neg_label, marker="x", color="red")
  plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.coolwarm_r)
  plt.legend()
  plt.xlabel(xlab)
  plt.ylabel(ylab)

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Fare", "Age"]].values
# Polynomial features
d = 6
X_train_poly = PolynomialFeatures(d).fit_transform(X_train)
y_train = train.loc[:, "Survived"].values
# Fit Logistic regression classifier
clf = LogisticRegression()
clf.fit(X_train_poly, y_train)
# Decision boundary plot
plot_decision_boundary("Fare", "Age", clf, labeled, degree=d)
plt.show()

Nonlinear decision boundary
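If we also want to know how this degree-6 model scores on the validation split, a minimal follow-up sketch (reusing clf, validation, and d from the block above) could be:

from sklearn.metrics import accuracy_score

X_validation = validation.loc[:, ["Fare", "Age"]].values
X_validation_poly = PolynomialFeatures(d).fit_transform(X_validation)
y_validation = validation.loc[:, "Survived"].values
accuracy = accuracy_score(y_validation, clf.predict(X_validation_poly))
print("Validation accuracy: {:.2f}%".format(accuracy*100))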

R

In R, the polym() function (with raw = TRUE) adds the higher-order terms to the training data inside the model formula.

library(scales)

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

sigmoid <- function(z) {
  return(1/(1 + exp(-z)))
}

step <- function(g_y_hat, threshold = 0.5) {
  return(ifelse(g_y_hat >= threshold, yes = 1, no = 0))
}

decision_boundary_plot <- function(xlab, ylab, clf, labeled, clf_target = "Survived") {
  fare_min <- min(labeled[, xlab])
  fare_max <- max(labeled[, xlab])
  age_min <- min(labeled[, ylab])
  age_max <- max(labeled[, ylab])
  res <- 200
  fare_vec <- seq(fare_min - 5, fare_max + 5, length.out = res)
  age_vec <- seq(age_min - 5, age_max + 5, length.out = res)
  gd <- expand.grid(fare_vec, age_vec)
  names(gd) <- c(xlab, ylab)
  y_prob <- predict.glm(clf, newdata = gd, type = "response")
  y_pred <- ifelse(y_prob >= 0.5, 1, 0)
  Z <- matrix(y_pred, nrow = res)
  contour(fare_vec, age_vec, Z, labels = "", xlab = "", ylab = "",
          axes=FALSE)
  points(labeled[, xlab], labeled[, ylab], 
         col = ifelse(labeled[, clf_target] == 1,
                      rgb(86, 180, 233, maxColorValue = 255),
                      rgb(213, 94, 0, maxColorValue = 255)),
         pch = ifelse(labeled[, clf_target] == 1, 16, 4), lwd = 2)
  points(gd, pch = "." , cex = 1.2,
         col = alpha(ifelse(Z == 1, "cornflowerblue", "coral"), 0.4))
  box()
}

labeled <- read.csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled <- labeled[!(is.na(labeled$Age)), ]
split_data <- get_train_validation(labeled)
train <- split_data$train
d <- 6
logistic_clf <- glm(Survived ~ polym(Fare, Age, degree = d, raw = TRUE), data = train, family = "binomial")
# Decision boundary plot
decision_boundary_plot("Fare", "Age", logistic_clf, labeled)

Nonlinear decision boundary

Regularization

Once the classification model starts to include higher-order terms and additional variables, we find that the cross-validated accuracy begins to decline steadily beyond a certain degree. As with regression models, we are once again facing over-fitting: the h function has become too familiar with the training data, so the decision boundary separates the training classes extremely well but loses its ability to separate the classes in the validation and test data. Data science teams describe an over-fitted model as having low bias and high variance.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
X = labeled.loc[:, ["Fare", "Age"]].values
y = labeled.loc[:, "Survived"].values
d = 10
poly_degrees = list(range(1, d+1))
cv_accuracies = []
for poly_d in poly_degrees:
  X_poly = PolynomialFeatures(poly_d).fit_transform(X)
  # Get cross validated train/valid accuracy
  clf = LogisticRegression()
  cv_acc = np.array(cross_val_score(clf, X_poly, y)).mean()
  cv_accuracies.append(cv_acc)

plt.plot(cv_accuracies, marker="o")
plt.xticks(range(d), poly_degrees)
plt.title("Cross-validated accuracies")
plt.xlabel("Degrees")
plt.ylabel("CV Accuracy")
plt.show()

Cross-validated accuracy starts to decline steadily beyond a certain polynomial degree

Just as with regression models, the data science team uses regularization to keep the model from over-fitting. The core idea is to add a penalty term to the cost function, so that while the cost is being minimized the optimization also discourages any single coefficient from growing too large. One common form of regularization is Ridge. The penalty coefficient is usually a non-negative number supplied by the user: the larger the value, the stronger the regularization; setting it to zero applies no regularization, leaving the original classification model unchanged.

Ridge regularization
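Matching the cost_function_regularized() implementation below, the Ridge-regularized logistic regression cost (presumably what the figure above shows) can be written as:

$$
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_{\theta}(x^{(i)}) + (1 - y^{(i)})\log\big(1 - h_{\theta}(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_{j}^{2}
$$

Note that the bias term \(\theta_0\) is excluded from the penalty, which is why the code only sums the squares of thetas[1:].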

import numpy as np

def sigmoid(z):
  return 1/(1 + np.exp(-z))

def cost_function_regularized(X, y, thetas, Lambda):
  m = y.size
  h = sigmoid(X.dot(thetas))
  J = -1*(1/m)*(np.log(h).T.dot(y)+np.log(1-h).T.dot(1-y)) + (Lambda/(2*m))*np.sum(np.square(thetas[1:]))
  if np.isnan(J):
    return np.inf
  else:
    return J
  
def get_gradient(X, y, thetas, Lambda):
  m = y.size
  h = sigmoid(X.dot(thetas))
  zero_arr = np.array([0]).reshape(-1, 1)
  regularized_thetas = np.concatenate([zero_arr, thetas[1:]])  # theta_0 is not penalized
  grad = (1/m)*X.T.dot(h-y) + (Lambda/m)*regularized_thetas
  return grad.reshape(-1, 1)

In Python, the RidgeClassifier() class from the sklearn.linear_model module gives us a regularized linear classifier (Ridge regression applied to classification) that we can use in place of the plain logistic regression model. Comparing the decision boundaries obtained under different degrees of regularization, we can observe that the larger the penalty coefficient (alpha in scikit-learn, the Lambda of our notation), the less pronounced the nonlinear character of the decision boundary becomes, even with the degree-6 polynomial features.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import RidgeClassifier

def sigmoid(z):
  return 1/(1 + np.exp(-z))

def step(g_y_hat, threshold=0.5):
  return np.where(g_y_hat >= threshold, 1, 0).reshape(-1, 1)

def plot_decision_boundary(xlab, ylab, thetas, labeled, pos_label="Survived", neg_label="Dead", clf_target="Survived", poly_d=6, axes=None):
  xx_min, xx_max = labeled[xlab].min(), labeled[xlab].max()
  yy_min, yy_max = labeled[ylab].min(), labeled[ylab].max()
  xx_arr = np.linspace(xx_min - 5, xx_max + 5, 1000)
  yy_arr = np.linspace(yy_min - 5, yy_max + 5, 1000)
  xx, yy = np.meshgrid(xx_arr, yy_arr)
  X_grid = np.concatenate([xx.reshape(-1, 1), yy.reshape(-1, 1)], axis=1)
  X_grid_poly = PolynomialFeatures(poly_d).fit_transform(X_grid)
  Z = step(sigmoid(np.dot(X_grid_poly, thetas))).reshape(xx.shape)
  pos = labeled[labeled[clf_target] == 1]
  neg = labeled[labeled[clf_target] == 0]
  if axes is None:
    axes = plt.gca()
  axes.scatter(pos[xlab], pos[ylab], label=pos_label, marker="o", color="blue")
  axes.scatter(neg[xlab], neg[ylab], label=neg_label, marker="x", color="red")
  axes.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.coolwarm_r)
  axes.legend()
  axes.set_xlabel(xlab)
  axes.set_ylabel(ylab)

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
# Removed observations without Age
labeled = labeled[~labeled["Age"].isna()]
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, ["Fare", "Age"]].values
X_train_poly = PolynomialFeatures(6).fit_transform(X_train)
y_train = train.loc[:, "Survived"].values
ridge_clf = RidgeClassifier()
ridge_clf.fit(X_train_poly, y_train)
thetas = np.concatenate([ridge_clf.intercept_.reshape(-1, 1), ridge_clf.coef_[0, 1:].reshape(-1, 1)])

# Decision boundary plots
fig, axes = plt.subplots(2, 3, sharey=True, figsize=(17, 10))
for i, alpha in enumerate([0, 1, 1e3, 1e6, 1e9, 1e12]):
  ridge_clf = RidgeClassifier(alpha=alpha)
  ridge_clf.fit(X_train_poly, y_train)
  thetas = np.concatenate([ridge_clf.intercept_.reshape(-1, 1), ridge_clf.coef_[0, 1:].reshape(-1, 1)])
  plot_decision_boundary("Fare", "Age", thetas, labeled, axes=axes.ravel()[i])
  axes.ravel()[i].set_title("Lambda: {:.0e}".format(ridge_clf.alpha))
plt.tight_layout()

Comparing decision boundary shapes under different degrees of regularization

Multi-class Classification

The label we predict from the Titanic data is the Survived variable, which takes only the two values 0 and 1: binary classification. In fact everything in the logistic regression model, from the Sigmoid and step functions to the log-based cost function, is designed to solve binary classification problems. When we turn to the handwritten digit data, however, the label to predict becomes the label variable, which takes ten values from 0 to 9, i.e. multi-class classification.

import pandas as pd

titanic = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Titanic-Machine-Learning-from-Disaster/train.csv")
digit_recognizer = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Digit-Recognizer/train.csv")
unique_digits = digit_recognizer["label"].unique()
unique_digits.sort()
print("Binaray classification:")
print(titanic["Survived"].unique())
print("Multi-class classification:")
print(unique_digits)
## Binary classification:
## [0 1]
## Multi-class classification:
## [0 1 2 3 4 5 6 7 8 9]

Each handwritten digit image has a 28 x 28 resolution. We can reshape the 784 features of the first one hundred observations back into (28, 28) arrays, display them, and print each image's digit label in the lower-left corner.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

digit_recognizer = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Digit-Recognizer/train.csv")
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
  digit = digit_recognizer.iloc[i, 1:].values.reshape(28, 28)
  ax.imshow(digit, cmap='binary', interpolation='nearest')
  ax.text(0.05, 0.05, str(digit_recognizer["label"][i]),
          transform=ax.transAxes, color='green')
  ax.set_xticks([])
  ax.set_yticks([])

The first one hundred observations displayed as images

Does this mean we cannot apply the logistic regression model, designed for binary classification, to handwritten digit recognition? Not at all. After applying the Sigmoid function to the regression output, the data science team obtains a value between 0 and 1, interpreted as the probability of belonging to class 1. In the first round we treat digit 0 as class 1 (the positive class) and digits 1 through 9 as class 0 (the negative class); in the second round digit 1 becomes class 1 and digits 0 and 2 through 9 become class 0; in the third round digit 2 becomes class 1 and digits 0, 1, and 3 through 9 become class 0, and so on. Repeating this binary classification for 10 rounds gives every observation an individual probability of being each of the 10 digits; comparing these probabilities and taking the highest one as the prediction is a technique called One-vs.-all, which lets the data science team extend binary classification to multi-class problems.

The One-vs.-all technique

Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
  return 1/(1 + np.exp(-z))

digits = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/Digit-Recognizer/train.csv")
unique_digits = digits["label"].unique()
unique_digits.sort()
train, validation = train_test_split(digits, test_size=0.3, random_state=123)
X_train = train.loc[:, "pixel0":"pixel783"].values
X_valid = validation.loc[:, "pixel0":"pixel783"].values
ones = np.ones(X_valid.shape[0]).reshape(-1, 1)
X_valid = np.concatenate([ones, X_valid], axis=1)
y_train = train.loc[:, "label"].values
# One vs. all
all_probs = np.zeros((X_valid.shape[0], unique_digits.size))
for digit_label in unique_digits:
  y_train_recoded = np.where(y_train == digit_label, 1, 0)
  clf = LogisticRegression()
  clf.fit(X_train, y_train_recoded)
  thetas = np.concatenate([clf.intercept_.reshape(-1, 1), clf.coef_.reshape(-1, 1)])
  y_prob = sigmoid(np.dot(X_valid, thetas))
  all_probs[:, digit_label] = y_prob.ravel()
print(all_probs.argmax(axis=1))
## [2 2 0 ... 9 6 3]

R

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

sigmoid <- function(z) {
  return(1/(1 + exp(-z)))
}

digits <- read.csv("https://storage.googleapis.com/kaggle_datasets/Digit-Recognizer/train.csv")
unique_digits <- unique(digits$label)
unique_digits <- sort(unique_digits)
split_data <- get_train_validation(digits)
train <- split_data$train
validation <- split_data$validation
# One vs. all
all_probs <- matrix(0, nrow(validation), length(unique_digits))
for (unique_digit in unique_digits) {
  train$label_encoded <- ifelse(train$label == unique_digit, 1, 0)
  logistic_clf <- glm(label_encoded ~ .-label, data = train, family = "binomial")
  y_prob <- matrix(predict(logistic_clf, newdata = validation, type = "response"))
  all_probs[, unique_digit + 1] <- y_prob
}
head(max.col(all_probs), n = 10)  # column indices run from 1 to 10, i.e. predicted digit + 1
## [1]  3  3 10  3  9  8  3  2  1 10

Both Python and R already build One-vs.-all into their logistic regression functions, so a data science team can apply them to multi-class classification problems directly, without implementing the One-vs.-all technique by hand.
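As a quick illustration in Python, the following minimal sketch fits scikit-learn's LogisticRegression directly on the 0-to-9 labels without any manual relabeling, reusing the train and validation data frames from the Python block above (depending on the scikit-learn version and solver, the multi-class case is handled internally via one-vs.-rest or a multinomial formulation):

from sklearn.linear_model import LogisticRegression

# Reuse the train/validation split from the Python block above
X_train = train.loc[:, "pixel0":"pixel783"].values
y_train = train.loc[:, "label"].values
X_valid = validation.loc[:, "pixel0":"pixel783"].values
# Fit directly on all ten labels; no per-digit recoding is needed
multi_clf = LogisticRegression()
multi_clf.fit(X_train, y_train)
print(multi_clf.predict(X_valid)[:10])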

Summary

In this section we used subsets of the Titanic data and the handwritten digit data to introduce learning from data, showed how to obtain a classification model's confusion matrix in Python and R either by computing it ourselves or with built-in functions, added higher-order terms to build nonlinear decision boundaries, added a penalty term to the cost function to keep the classification model from over-fitting, and used One-vs.-all to extend a binary classification model to multi-class applications.

Further Reading