Evaluating Regression Models

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

Tom Mitchell

In the previous section, Finding the Coefficients of a Regression Model, we learned how to use the Normal Equation and Gradient Descent to find regression coefficients in Python and R, and how to use scatter plots, line plots, and surface plots to explore how the cost function changes during gradient descent and what the regression model looks like on a scatter plot of the data, giving us a better feel for regression models.

Next we apply the Mean Squared Error to evaluate how a regression model performs on the validation set; the smaller the mean squared error, the more similar the h function is to the assumed underlying f function. By comparing the mean squared error of different models, the data science team can pick the regression model best suited for deployment to production. Ways to refine the evaluation include adding new variables to the training data and adding polynomial terms of existing variables. While trying to drive the mean squared error down, the regression model also faces new challenges: differences in the scale of the variables make gradient descent converge unevenly, randomly splitting the training and validation samples introduces variance, and high-degree polynomial terms bring over-fitting. To deal with these, the data science team brings in techniques such as standardization, cross validation, and regularization.

Standardization (Normalization)

In the Ames housing dataset, the sale price variable (SalePrice) is on the order of tens of thousands to hundreds of thousands of dollars, while the living area variable (GrLivArea) is on the order of a few thousand square feet. Under a single learning rate this can make the two coefficients converge at very different speeds: one coefficient has already converged while the other is still crawling slowly toward the minimum. Even though the cost function's decline has flattened out, the coefficients obtained after the iterations, especially the intercept, remain far from those returned by the built-in functions.

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def compute_cost(X, y, thetas):
  m = X.shape[0]
  y_hat = np.dot(X, thetas)
  J = 1/(2*m)*np.sum(np.square(y_hat - y))
  return J

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
  m = X.shape[0]
  J_history = np.zeros(num_iters)
  thetas = np.array([0, 0], dtype=float).reshape(-1, 1)
  for num_iter in range(num_iters):
    y_hat = np.dot(X, thetas)
    loss = y_hat - y
    gradient = np.dot(X.T, loss)
    thetas -= alpha*gradient/m
    J_history[num_iter] = compute_cost(X, y, thetas)
  return thetas, J_history

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, "GrLivArea"].values.reshape(-1, 1)
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X_train, y_train)
print("Thetas from Scikit-Learn:")
print(reg.intercept_)
print(reg.coef_)
ones = np.ones(X_train.shape[0], dtype=int).reshape(-1, 1)
X_train = np.concatenate([ones, X_train], axis=1)
thetas, J_history = gradient_descent(X_train, y_train, alpha=0.0000000001, num_iters=20000)
print("Thetas from manual gradient descent:")
print(thetas)
plt.plot(J_history)
plt.xlabel("Iterations")
plt.ylabel("Cost")
plt.show()
## Thetas from Scikit-Learn: 
## [21905.13153846] 
## [[104.0984541]]
## Thetas from manual gradient descent:
## [[7.28896918e-02]
##  [1.16260785e+02]]

Cost function over the 20,000 gradient descent iterations

R

compute_cost <- function(X, y, thetas) {
  m <- nrow(X)
  y_hat <- X %*% thetas
  J <- (1/(2*m))*sum((y_hat - y)**2)
  return(J)
}

get_thetas_gd <- function(X, y, alpha=0.01, num_iters=1500) {
  m <- nrow(X)
  thetas <- as.matrix(c(0, 0))
  J_history <- vector(length = num_iters)
  for (num_iter in 1:num_iters) {
    y_hat <- X %*% thetas
    loss <- y_hat - y
    gradient <- (t(X) %*% loss)/m
    thetas <- thetas - alpha * gradient
    J_history[num_iter] <- compute_cost(X, y, thetas)
  }
  result <- list(
    thetas = thetas,
    J_history = J_history
  )
  return(result)
}

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
lm_fit <- lm(SalePrice ~ GrLivArea, data = train)
print("Thetas from lm():")
print(lm_fit$coefficients)
X_train <- as.matrix(split_result$train$GrLivArea)
ones <- as.matrix(rep(1, times = nrow(X_train)))
X_train <- cbind(ones, X_train)
y_train <- as.matrix(split_result$train$SalePrice)
lm_fit <- get_thetas_gd(X_train, y_train, alpha=0.0000000001, num_iters=20000)
print("Thetas from manual gradient descent:")
print(lm_fit$thetas)
plot(lm_fit$J_history, xlab = "Iterations", ylab="Cost", type="l")
## [1] "Thetas from lm():"
## (Intercept)   GrLivArea 
##   7298.3633    114.8567 
## [1] "Thetas from manual gradient descent:"
##              [,1]
## [1,]   0.07110339
## [2,] 118.48548260

Cost function over the 20,000 gradient descent iterations

When the variables differ in scale by orders of magnitude, data science teams commonly apply one of two normalization techniques to bring them onto the same scale: one rescales by the minimum and maximum, known as the Min Max Scaler, and the other rescales by the standard deviation, known as the Standard Scaler. Here we use the Standard Scaler to rescale SalePrice and GrLivArea to a scale with mean 0 and standard deviation 1. Both Python and R provide built-in functions for standardization, and we can also write our own.
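As a quick aside (a minimal sketch, not part of the original walkthrough), both scalers mentioned above are available in scikit-learn's sklearn.preprocessing module; the URL and column names below are the same ones used throughout this section.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
X = labeled[["GrLivArea", "SalePrice"]].values
# Min Max Scaler: rescale each column to the [0, 1] range
print(MinMaxScaler().fit_transform(X)[:3])
# Standard Scaler: rescale each column to mean 0 and standard deviation 1
print(StandardScaler().fit_transform(X)[:3])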

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def compute_cost(X, y, thetas):
  m = X.shape[0]
  y_hat = np.dot(X, thetas)
  J = 1/(2*m)*np.sum(np.square(y_hat - y))
  return J

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
  m = X.shape[0]
  J_history = np.zeros(num_iters)
  thetas = np.array([0, 0], dtype=float).reshape(-1, 1)
  for num_iter in range(num_iters):
    y_hat = np.dot(X, thetas)
    loss = y_hat - y
    gradient = np.dot(X.T, loss)
    thetas -= alpha*gradient/m
    J_history[num_iter] = compute_cost(X, y, thetas)
  return thetas, J_history

def standard_scaler(x):
  mean_x = x.mean()
  std_x = x.std()
  standard_x = (x - mean_x)/std_x
  return standard_x

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, "GrLivArea"].values.reshape(-1, 1)
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X_train, y_train)
print("Thetas from Scikit-Learn:")
print(reg.intercept_)
print(reg.coef_)
X_train_ss = standard_scaler(X_train)
y_train_ss = standard_scaler(y_train)
ones = np.ones(X_train.shape[0], dtype=int).reshape(-1, 1)
X_train_ss = np.concatenate([ones, X_train_ss], axis=1)
thetas, J_history = gradient_descent(X_train_ss, y_train_ss, alpha=0.001, num_iters=5000)
print("Thetas from manual gradient descent:")
print(thetas)
plt.plot(J_history)
plt.xlabel("Iterations")
plt.ylabel("Cost")
plt.show()
## Thetas from Scikit-Learn:
## [21905.13153846]
## [[104.0984541]]
## Thetas from manual gradient descent:
## [[-2.77625824e-17]
##  [ 6.88445467e-01]]

Cost function over the 5,000 gradient descent iterations

R

compute_cost <- function(X, y, thetas) {
  m <- nrow(X)
  y_hat <- X %*% thetas
  J <- (1/(2*m))*sum((y_hat - y)**2)
  return(J)
}

get_thetas_gd <- function(X, y, alpha=0.01, num_iters=1500) {
  m <- nrow(X)
  thetas <- as.matrix(c(0, 0))
  J_history <- vector(length = num_iters)
  for (num_iter in 1:num_iters) {
    y_hat <- X %*% thetas
    loss <- y_hat - y
    gradient <- (t(X) %*% loss)/m
    thetas <- thetas - alpha * gradient
    J_history[num_iter] <- compute_cost(X, y, thetas)
  }
  result <- list(
    thetas = thetas,
    J_history = J_history
  )
  return(result)
}

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

standard_scaler <- function(x) {
  sd_x <- sd(x)
  mean_x <- mean(x)
  standard_x <- (x - mean_x)/sd_x
  return(standard_x)
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
lm_fit <- lm(SalePrice ~ GrLivArea, data = train)
print("Thetas from lm():")
print(lm_fit$coefficients)
X_train <- as.matrix(standard_scaler(split_result$train$GrLivArea))
ones <- as.matrix(rep(1, times = nrow(X_train)))
X_train <- cbind(ones, X_train)
y_train <- as.matrix(standard_scaler(split_result$train$SalePrice))
lm_fit <- get_thetas_gd(X_train, y_train, alpha=0.001, num_iters=5000)
print("Thetas from manual gradient descent:")
print(lm_fit$thetas)
plot(lm_fit$J_history, xlab = "Iterations", ylab="Cost", type="l")
## [1] "Thetas from lm():"
## (Intercept)   GrLivArea 
##   7298.3633    114.8567 
## [1] "Thetas from manual gradient descent:"
##              [,1]
## [1,] 2.124162e-16
## [2,] 7.226880e-01

Cost function over the 5,000 gradient descent iterations

After standardization we can use a larger learning rate and have the coefficients converge in fewer iterations. However, the coefficients we obtain are for the rescaled training data; if we want the coefficients on the original scale, we have to recover them using the derivation below.

Derivation

Comparing the form of the two equations, the original coefficients are:

Recovering the coefficients on the original scale
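The derivation itself appeared as an image; it is reconstructed here so that it agrees with the rescaling code below. Writing the model fitted on the standardized data with coefficients \theta_0' and \theta_1':

\frac{y - \mu_y}{\sigma_y} = \theta_0' + \theta_1' \frac{x - \mu_X}{\sigma_X}

Solving for y gives

y = \left( \mu_y + \sigma_y \theta_0' - \frac{\sigma_y \mu_X}{\sigma_X} \theta_1' \right) + \frac{\sigma_y}{\sigma_X} \theta_1' x

so the coefficients on the original scale are

\theta_0 = \mu_y + \sigma_y \theta_0' - \frac{\sigma_y \mu_X}{\sigma_X} \theta_1', \qquad \theta_1 = \frac{\sigma_y}{\sigma_X} \theta_1'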

Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def compute_cost(X, y, thetas):
  m = X.shape[0]
  y_hat = np.dot(X, thetas)
  J = 1/(2*m)*np.sum(np.square(y_hat - y))
  return J

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
  m = X.shape[0]
  J_history = np.zeros(num_iters)
  thetas = np.array([0, 0], dtype=float).reshape(-1, 1)
  for num_iter in range(num_iters):
    y_hat = np.dot(X, thetas)
    loss = y_hat - y
    gradient = np.dot(X.T, loss)
    thetas -= alpha*gradient/m
    J_history[num_iter] = compute_cost(X, y, thetas)
  return thetas, J_history

def standard_scaler(x):
  mean_x = x.mean()
  std_x = x.std()
  standard_x = (x - mean_x)/std_x
  return standard_x

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, "GrLivArea"].values.reshape(-1, 1)
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X_train, y_train)
print("Thetas from Scikit-Learn:")
print(reg.intercept_)
print(reg.coef_)
# Standardization
X_train_ss = standard_scaler(X_train)
y_train_ss = standard_scaler(y_train)
ones = np.ones(X_train.shape[0], dtype=int).reshape(-1, 1)
X_train_ss = np.concatenate([ones, X_train_ss], axis=1)
thetas, J_history = gradient_descent(X_train_ss, y_train_ss, alpha=0.001, num_iters=5000)
print("Standardized thetas from manual gradient descent:")
print(thetas)
# Rescaling
theta_0_pron = thetas[0, 0]
theta_1_pron = thetas[1, 0]
mu_y = y_train.mean()
sigma_y = y_train.std()
mu_X = X_train.mean()
sigma_X = X_train.std()
theta_0 = mu_y + sigma_y*theta_0_pron - sigma_y*mu_X*theta_1_pron/sigma_X
theta_1 = sigma_y*theta_1_pron/sigma_X
print("Rescaled thetas from manual gradient descent:")
print(theta_0)
print(theta_1)
## Thetas from Scikit-Learn:
## [21905.13153846]
## [[104.0984541]]
## Standardized thetas from manual gradient descent:
## [[-2.77625824e-17]
##  [ 6.88445467e-01]] 
## Rescaled thetas from manual gradient descent:
## 22967.755672321568
## 103.39879673499419

R

compute_cost <- function(X, y, thetas) {
  m <- nrow(X)
  y_hat <- X %*% thetas
  J <- (1/(2*m))*sum((y_hat - y)**2)
  return(J)
}

get_thetas_gd <- function(X, y, alpha=0.01, num_iters=1500) {
  m <- nrow(X)
  thetas <- as.matrix(c(0, 0))
  J_history <- vector(length = num_iters)
  for (num_iter in 1:num_iters) {
    y_hat <- X %*% thetas
    loss <- y_hat - y
    gradient <- (t(X) %*% loss)/m
    thetas <- thetas - alpha * gradient
    J_history[num_iter] <- compute_cost(X, y, thetas)
  }
  result <- list(
    thetas = thetas,
    J_history = J_history
  )
  return(result)
}

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

standard_scaler <- function(x) {
  sd_x <- sd(x)
  mean_x <- mean(x)
  standard_x <- (x - mean_x)/sd_x
  return(standard_x)
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
lm_fit <- lm(SalePrice ~ GrLivArea, data = train)
print("Thetas from lm():")
print(lm_fit$coefficients)
X_train <- train$GrLivArea
y_train <- train$SalePrice
# Standardization
X_train_ss <- as.matrix(standard_scaler(X_train))
ones <- as.matrix(rep(1, times = nrow(X_train_ss)))
X_train_ss <- cbind(ones, X_train_ss)
y_train_ss <- as.matrix(standard_scaler(y_train))
lm_fit <- get_thetas_gd(X_train_ss, y_train_ss, alpha=0.001, num_iters=5000)
print("Thetas from manual gradient descent:")
print(lm_fit$thetas)
# Rescaling
theta_0_pron <- lm_fit$thetas[1, 1]
theta_1_pron <- lm_fit$thetas[2, 1]
mu_y <- mean(y_train)
sigma_y <- sd(y_train)
mu_X <- mean(X_train)
sigma_X <- sd(X_train)
theta_0 <- mu_y + sigma_y*theta_0_pron - sigma_y*mu_X*theta_1_pron/sigma_X
theta_1 <- sigma_y*theta_1_pron/sigma_X
print("Rescaled thetas from manual gradient descent:")
print(theta_0)
print(theta_1)
## [1] "Thetas from lm():"
## (Intercept)   GrLivArea 
##   7298.3633    114.8567 
## [1] "Thetas from manual gradient descent:"
##              [,1]
## [1,] 2.124162e-16
## [2,] 7.226880e-01
## [1] "Rescaled thetas from manual gradient descent:"
## [1] 8482.633
## [1] 114.081

Comparing these with the coefficients returned by the built-in Python and R functions, we find that the coefficients obtained by running gradient descent on standardized data are close to them, a clear improvement over the earlier attempt that ran gradient descent without standardization.

Evaluating the Performance of a Regression Model

The Mean Squared Error is at the heart of regression modeling: it is the criterion used to find the coefficients and also the number used to evaluate model performance. Suppose there is an unknowable regression model f relating GrLivArea to SalePrice; all we can do is build a regression model h. The smaller the error between the predicted and actual values, the more confident we can be that h is similar to f.

Mean squared error written in scalar and vectorized form
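The formula appeared as an image; reconstructed from the get_mse() and get_mse_vectorized() functions below, the scalar and vectorized forms are

MSE = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2, \qquad \hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}

MSE = \frac{1}{m} (X\theta - y)^T (X\theta - y)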

The data science team builds h with the training data, applies h to the validation data to produce predictions, and finally compares the predictions with the actual values in the validation data. Both the predictions and the mean squared error can be computed by hand.

Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

def get_mse(X_arr, y_arr, thetas):
  m = X_arr.size
  theta_0 = thetas[0, 0]
  theta_1 = thetas[1, 0]
  y_hat = theta_0 + theta_1*X_arr
  err = y_hat - y_arr
  se = np.sum(err**2)
  return se/m

def get_mse_vectorized(X, y, thetas):
  m = X.shape[0]
  y_hat = np.dot(X, thetas)
  err = y_hat - y
  se = np.dot(err.T, err)
  return se[0, 0]/m

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, "GrLivArea"].values.reshape(-1, 1)
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
X_validation = validation.loc[:, "GrLivArea"].values.reshape(-1, 1)
ones = np.ones(X_validation.shape[0]).reshape(-1, 1)
X_validation = np.concatenate([ones, X_validation], axis=1)
y_validation = validation.loc[:, "SalePrice"].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X_train, y_train)
theta_0 = reg.intercept_[0]
theta_1 = reg.coef_[0, 0]
thetas = np.array([theta_0, theta_1]).reshape(-1, 1)
mse = get_mse(validation.loc[:, "GrLivArea"].values, validation.loc[:, "SalePrice"].values, thetas)
vectorized_mse = get_mse_vectorized(X_validation, y_validation, thetas)
print("MSE: {:.0f}".format(mse))
print("MSE(vectorized): {:.0f}".format(vectorized_mse))
## MSE: 2541490406
## MSE(vectorized): 2541490406

R

get_mse <- function(X_vec, y_vec, thetas) {
  m <- length(X_vec)
  theta_0 <- thetas[1, 1]
  theta_1 <- thetas[2, 1]
  y_hat <- theta_0 + theta_1*X_vec
  err <- y_hat - y_vec
  se <- sum(err**2)
  return(se/m)
}

get_mse_vectorized <- function(X, y, thetas) {
  m <- nrow(X)
  y_hat <- X %*% thetas
  err <- y_hat - y
  se <- t(err) %*% err
  return(se[1, 1]/m)
}

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
validation <- split_result$validation
X_validation <- as.matrix(validation$GrLivArea)
ones <- as.matrix(rep(1, times = nrow(X_validation)))
X_validation <- cbind(ones, X_validation)
y_validation <- as.matrix(validation$SalePrice)
lm_fit <- lm(SalePrice ~ GrLivArea, data = train)
theta_0 <- lm_fit$coefficients[1]
theta_1 <- lm_fit$coefficients[2]
thetas <- as.matrix(c(theta_0, theta_1))
mse <- get_mse(validation$GrLivArea, validation$SalePrice, thetas)
mse_vectorized <- get_mse_vectorized(X_validation, y_validation, thetas)
sprintf("MSE: %.0f", mse)
sprintf("MSE(vectorized): %.0f", mse_vectorized)
## [1] "MSE: 3174682821"
## [1] "MSE(vectorized): 3174682821"

Most of the time the data science team simply calls the built-in Python and R functions to obtain the predictions and the mean squared error.

Python

In Python we call the .predict() method to obtain predictions and the mean_squared_error() function from sklearn.metrics to obtain the mean squared error.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, "GrLivArea"].values.reshape(-1, 1)
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
X_validation = validation.loc[:, "GrLivArea"].values.reshape(-1, 1)
y_validation = validation.loc[:, "SalePrice"].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_hat = reg.predict(X_validation)
mse = mean_squared_error(y_validation, y_hat)
print("MSE: {:.0f}".format(mse))
## MSE: 2541490406

R

In R we call the predict() function to obtain predictions.

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
validation <- split_result$validation
lm_fit <- lm(SalePrice ~ GrLivArea, data = train)
y_hat <- predict(lm_fit, newdata = validation)
mse <- mean((y_hat - validation$SalePrice)**2)
sprintf("MSE: %.0f", mse)
## [1] "MSE: 3174682821"

A quick comparison shows that whether we compute it ourselves with scalars, with vectors, or by calling the built-in functions, we obtain the same mean squared error without trouble.

Refining the Evaluation

We now know how to obtain the mean squared error as the basis for evaluating a regression model, and that a smaller mean squared error means a smaller gap between predictions and actual values, from which we infer that h is more similar to f. However, this number only makes sense in comparison (unlike a proportion, it is hard to interpret on its own), so it has to be compared against other hypotheses h; if another h produces a smaller mean squared error, that h is more similar to f. Ways to build more regression models include adding other variables to form a multiple regression model, or creating polynomial terms of existing variables. Adding more variables is quite intuitive; for example, besides GrLivArea we can add GarageArea, which is also strongly positively correlated with SalePrice.

Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train_simple = train.loc[:, "GrLivArea"].values.reshape(-1, 1)
X_train_multiple = train.loc[:, ["GrLivArea", "GarageArea"]].values
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
X_validation_simple = validation.loc[:, "GrLivArea"].values.reshape(-1, 1)
X_validation_multiple = validation.loc[:, ["GrLivArea", "GarageArea"]].values
y_validation = validation.loc[:, "SalePrice"].values.reshape(-1, 1)
reg_simple = LinearRegression()
reg_multiple = LinearRegression()
reg_simple.fit(X_train_simple, y_train)
reg_multiple.fit(X_train_multiple, y_train)
y_hat_simple = reg_simple.predict(X_validation_simple)
y_hat_multiple = reg_multiple.predict(X_validation_multiple)
mse_simple = mean_squared_error(y_validation, y_hat_simple)
mse_multiple = mean_squared_error(y_validation, y_hat_multiple)
print("MSE of simple linear regression: {:.0f}".format(mse_simple))
print("MSE of multiple linear regression: {:.0f}".format(mse_multiple))
## MSE of simple linear regression: 2541490406
## MSE of multiple linear regression: 1930533820

R

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
validation <- split_result$validation
lm_fit_simple <- lm(SalePrice ~ GrLivArea, data = train)
lm_fit_multiple <- lm(SalePrice ~ GrLivArea + GarageArea, data = train)
y_hat_simple <- predict(lm_fit_simple, newdata = validation)
y_hat_multiple <- predict(lm_fit_multiple, newdata = validation)
mse_simple <- mean((y_hat_simple - validation$SalePrice)**2)
mse_multiple <- mean((y_hat_multiple - validation$SalePrice)**2)
sprintf("MSE of simple linear regression: %.0f", mse_simple)
sprintf("MSE of multiple linear regression: %.0f", mse_multiple)
## [1] "MSE of simple linear regression: 3174682821"
## [1] "MSE of multiple linear regression: 2535888809"

The drop in mean squared error shows that adding GarageArea improves the regression model's predictions on the validation data.

Polynomial terms can be created with built-in Python and R functions, which extends the linear regression model h to data with nonlinear relationships. First we draw a scatter plot to look at the relationship between YearBuilt and SalePrice.

Python

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
plt.scatter(train["YearBuilt"], train["SalePrice"], s=5, c="b", label="Train")
plt.scatter(validation["YearBuilt"], validation["SalePrice"], s=5, c="r", label="Validation")
plt.xlabel("Year Built")
plt.ylabel("Sale Price")
plt.legend(loc="upper left")
plt.show()

Scatter plot of YearBuilt versus SalePrice

R

library(ggplot2)

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}
labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
train$source <- "train"
validation <- split_result$validation
validation$source <- "validation"
df_to_plot <- rbind(train, validation)
ggplot(df_to_plot, aes(x = YearBuilt, y = SalePrice, color = source)) +
  geom_point()

Scatter plot of YearBuilt versus SalePrice

Next we create the square of the YearBuilt variable and add it to the regression model, so that h can be written as:

Regression model with the squared YearBuilt term added
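The formula appeared as an image; based on the description above, the hypothesis with the squared term is

h(x) = \theta_0 + \theta_1 \cdot \text{YearBuilt} + \theta_2 \cdot \text{YearBuilt}^2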

Python

In Python we can use PolynomialFeatures() from sklearn.preprocessing to create the polynomial terms. Note that the validation data must be transformed as well, so that it can be fed into the model later when computing the mean squared error.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, "YearBuilt"].values.reshape(-1, 1)
X_validation = validation.loc[:, "YearBuilt"].values.reshape(-1, 1)
X_train_d2 = PolynomialFeatures(2).fit_transform(X_train)
print("X_train with degree=2:")
print(X_train_d2)
X_validation_d2 = PolynomialFeatures(2).fit_transform(X_validation)
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
y_validation = validation.loc[:, "SalePrice"].values.reshape(-1, 1)
reg_d1 = LinearRegression()
reg_d1.fit(X_train, y_train)
y_hat = reg_d1.predict(X_validation)
mse_d1 = mean_squared_error(y_validation, y_hat)
print("MSE with degree=1: {:.0f}".format(mse_d1))
reg_d2 = LinearRegression()
reg_d2.fit(X_train_d2, y_train)
y_hat = reg_d2.predict(X_validation_d2)
mse_d2 = mean_squared_error(y_validation, y_hat)
print("MSE with degree=2: {:.0f}".format(mse_d2))
X_arr = np.linspace(labeled["YearBuilt"].min(), labeled["YearBuilt"].max()).reshape(-1, 1)
X_arr_d2 = PolynomialFeatures(2).fit_transform(X_arr)
y_arr_d1 = reg_d1.predict(X_arr)
y_arr_d2 = reg_d2.predict(X_arr_d2)
plt.scatter(train["YearBuilt"], train["SalePrice"], s=5, c="b", label="Train")
plt.scatter(validation["YearBuilt"], validation["SalePrice"], s=5, c="r", label="Validation")
plt.plot(X_arr, y_arr_d1, c="c", linewidth=3, label="d=1")
plt.plot(X_arr, y_arr_d2, c="g", linewidth=3, label="d=2")
plt.xlabel("Year Built")
plt.ylabel("Sale Price")
plt.legend(loc="upper left")
plt.show()
## X_train with degree=2:
## [[1.000000e+00 1.996000e+03 3.984016e+06]
##  [1.000000e+00 1.940000e+03 3.763600e+06]
##  [1.000000e+00 1.967000e+03 3.869089e+06]
##  ...
##  [1.000000e+00 1.968000e+03 3.873024e+06]
##  [1.000000e+00 1.972000e+03 3.888784e+06]
##  [1.000000e+00 1.941000e+03 3.767481e+06]]
## MSE with degree=1: 4223063920
## MSE with degree=2: 3769969724

Regression models with degree-1 and degree-2 terms drawn over the scatter plot

R

In R we only need to wrap the squared term with I() in the formula argument of lm().

library(ggplot2)

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}
labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
train$source <- "train"
validation <- split_result$validation
validation$source <- "validation"
reg_d1 <- lm(SalePrice ~ YearBuilt, data = train)
reg_d2 <- lm(SalePrice ~ YearBuilt + I(YearBuilt**2), data = train)
y_hat <- predict(reg_d1, newdata = validation)
mse_d1 <- mean((y_hat - validation$SalePrice)**2)
y_hat <- predict(reg_d2, newdata = validation)
mse_d2 <- mean((y_hat - validation$SalePrice)**2)
sprintf("MSE with degree=1: %.0f", mse_d1)
sprintf("MSE with degree=2: %.0f", mse_d2)
x_arr <- seq(from = min(labeled_df$YearBuilt), to = max(labeled_df$YearBuilt), length.out = 50)
new_data <- data.frame(YearBuilt = x_arr)
y_arr_d1 <- predict(reg_d1, newdata = new_data)
y_arr_d2 <- predict(reg_d2, newdata = new_data)
new_data$SalePrice_d1 <- y_arr_d1
new_data$SalePrice_d2 <- y_arr_d2
df_to_plot <- rbind(train, validation)
ggplot(df_to_plot, aes(x = YearBuilt, y = SalePrice)) +
  geom_point(aes(color = source), size = 0.5) +
  geom_line(data = new_data, aes(x = YearBuilt, y = SalePrice_d1), color = "#ff00ff", size = 1.4) +
  geom_line(data = new_data, aes(x = YearBuilt, y = SalePrice_d2), color = "#006600", size = 1.4)
## [1] "MSE with degree=1: 3931896041"
## [1] "MSE with degree=2: 3727047693"

Regression models with degree-1 and degree-2 terms drawn over the scatter plot

After adding the squared YearBuilt term the model's mean squared error drops, so we can reasonably claim that the h with the squared term is likely more similar to f than the h without it. But this raises a question: would adding a cubic term, a fourth-power term, or even higher-degree terms give better results? What is the right degree to include?

To check the effect of different degrees, we can loop over models that include degree-1 up to degree-10 terms, compute the mean squared error of each, and pick the setting with the lowest error.

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def get_best_degree(X_train, y_train, X_validation, y_validation, d=10):
  degrees = range(1, d+1)
  mse_train_arr = np.zeros(d)
  mse_validation_arr = np.zeros(d)
  for degree in degrees:
    X_train_poly = PolynomialFeatures(degree).fit_transform(X_train)
    X_validation_poly = PolynomialFeatures(degree).fit_transform(X_validation)
    reg = LinearRegression()
    reg.fit(X_train_poly, y_train)
    y_hat = reg.predict(X_train_poly)
    mse_train = mean_squared_error(y_train, y_hat)
    y_hat = reg.predict(X_validation_poly)
    mse_validation = mean_squared_error(y_validation, y_hat)
    mse_train_arr[degree - 1] = mse_train
    mse_validation_arr[degree - 1] = mse_validation
  best_degree = mse_validation_arr.argmin() + 1
  return mse_train_arr, mse_validation_arr, best_degree

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
train, validation = train_test_split(labeled, test_size=0.3, random_state=123)
X_train = train.loc[:, "YearBuilt"].values.reshape(-1, 1)
X_validation = validation.loc[:, "YearBuilt"].values.reshape(-1, 1)
y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
y_validation = validation.loc[:, "SalePrice"].values.reshape(-1, 1)
d = 10
mse_train_arr, mse_validation_arr, best_degree = get_best_degree(X_train, y_train, X_validation, y_validation, d=10)
print("Best degree: {}".format(best_degree))
degrees = range(1, d+1)
plt.plot(degrees, mse_train_arr, c="red", marker="s", label="Train")
plt.plot(degrees, mse_validation_arr, c="g", marker="^", label="Validation")
plt.xticks(degrees)
plt.axvline(best_degree, label="Best Degree: {}".format(best_degree))
plt.legend(loc="upper right")
plt.xlabel("Degree")
plt.ylabel("MSE")
plt.show()
## Best degree: 2

Comparing the MSE of models with degree 1 through 10; degree 2 has the smallest validation error

R

library(tidyr)
library(ggplot2)

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

get_best_degree <- function(train, validation, d = 10) {
  degrees <- seq(1, d)
  mse_train_vec <- vector(length = d)
  mse_validation_vec <- vector(length = d)
  for (degree in degrees) {
    reg <- lm(SalePrice ~ poly(YearBuilt, degree = degree, raw = TRUE), data = train)
    y_hat <- predict(reg, newdata = train)
    mse_train <- mean((y_hat - train$SalePrice)**2)
    y_hat <- predict(reg, newdata = validation)
    mse_validation <- mean((y_hat - validation$SalePrice)**2)
    mse_train_vec[degree] = mse_train
    mse_validation_vec[degree] = mse_validation
  }
  best_degree <- which.min(mse_validation_vec)
  return(list(
    mse_train_vec = mse_train_vec,
    mse_validation_vec = mse_validation_vec,
    best_degree = best_degree
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
train$source <- "train"
validation <- split_result$validation
validation$source <- "validation"
results <- get_best_degree(train, validation)
mse_train_vec <- results$mse_train_vec
mse_validation_vec <- results$mse_validation_vec
best_degree <- results$best_degree
sprintf("Best degree: %s", best_degree)
d <- 10
degrees <- seq(1, d)
df_to_plot <- data.frame(
  degree = degrees,
  mse_train = mse_train_vec,
  mse_validation = mse_validation_vec
)
df_to_plot_long <- gather(df_to_plot, key = "source", value = "mse", mse_train, mse_validation)

ggplot(df_to_plot_long, aes(x = degree, y = mse, color = source)) +
  geom_line() +
  geom_point() +
  geom_vline(aes(xintercept = best_degree), color="#007f00") +
  scale_x_continuous(breaks = degrees) +
  xlab("Degree") +
  ylab("MSE")
## [1] "Best degree: 2"

Comparing the MSE of models with degree 1 through 10; degree 2 has the smallest validation error

The degree-2 model has the lowest validation error, but that is under a train/validation split made with the random seed random_state set to 123. If we remove the seed and repeat the split many times, we find that the degree with the smallest validation error is not always 2.

Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def get_best_degree(X_train, y_train, X_validation, y_validation, d=10):
  degrees = range(1, d+1)
  mse_train_arr = np.zeros(d)
  mse_validation_arr = np.zeros(d)
  for degree in degrees:
    X_train_poly = PolynomialFeatures(degree).fit_transform(X_train)
    X_validation_poly = PolynomialFeatures(degree).fit_transform(X_validation)
    reg = LinearRegression()
    reg.fit(X_train_poly, y_train)
    y_hat = reg.predict(X_train_poly)
    mse_train = mean_squared_error(y_train, y_hat)
    y_hat = reg.predict(X_validation_poly)
    mse_validation = mean_squared_error(y_validation, y_hat)
    mse_train_arr[degree - 1] = mse_train
    mse_validation_arr[degree - 1] = mse_validation
  best_degree = mse_validation_arr.argmin() + 1
  return mse_train_arr, mse_validation_arr, best_degree

def get_bd_history(labeled, num_iters=100):
  bd_history = np.zeros(num_iters)
  for num_iter in range(num_iters):
    train, validation = train_test_split(labeled, test_size=0.3)
    X_train = train.loc[:, "YearBuilt"].values.reshape(-1, 1)
    X_validation = validation.loc[:, "YearBuilt"].values.reshape(-1, 1)
    y_train = train.loc[:, "SalePrice"].values.reshape(-1, 1)
    y_validation = validation.loc[:, "SalePrice"].values.reshape(-1, 1)
    best_degree = get_best_degree(X_train, y_train, X_validation, y_validation)[2]
    bd_history[num_iter] = best_degree
  df = pd.DataFrame()
  df["bd_history"] = bd_history
  grouped = df.groupby("bd_history")
  return grouped.size().sort_values(ascending=False)

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
get_bd_history(labeled)
## bd_history
## 10.0    48
## 2.0     30
## 3.0     19
## 8.0      1
## 7.0      1
## 6.0      1
## dtype: int64

R

get_train_validation <- function(labeled_df, validation_size=0.3) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

get_best_degree <- function(train, validation, d = 10) {
  degrees <- seq(1, d)
  mse_train_vec <- vector(length = d)
  mse_validation_vec <- vector(length = d)
  for (degree in degrees) {
    reg <- lm(SalePrice ~ poly(YearBuilt, degree = degree), data = train)
    y_hat <- predict(reg, newdata = train)
    mse_train <- mean((y_hat - train$SalePrice)**2)
    y_hat <- predict(reg, newdata = validation)
    mse_validation <- mean((y_hat - validation$SalePrice)**2)
    mse_train_vec[degree] = mse_train
    mse_validation_vec[degree] = mse_validation
  }
  best_degree <- which.min(mse_validation_vec)
  return(list(
    mse_train_vec = mse_train_vec,
    mse_validation_vec = mse_validation_vec,
    best_degree = best_degree
  ))
}

get_bd_history <- function(labeled_df, num_iters = 100) {
  bd_history <- vector(length = num_iters)
  for (num_iter in 1:num_iters) {
    split_result <- get_train_validation(labeled_df)
    train <- split_result$train
    train$source <- "train"
    validation <- split_result$validation
    validation$source <- "validation"
    results <- get_best_degree(train, validation)
    best_degree <- results$best_degree
    bd_history[num_iter] <- best_degree
  }
  return(sort(table(bd_history), decreasing = TRUE))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
get_bd_history(labeled_df)
## bd_history
## 10  9  2  7  3  8  4 
## 53 17  9  9  7  4  1

After one hundred iterations of recording the best degree, we see that the randomness of the train/validation split does affect the validation MSE and introduces variance into the choice of parameter. How can we reduce it?

Cross Validation

To reduce the variance introduced by randomly splitting the labeled data into training and validation samples, data science teams use cross validation. Concretely, the labeled data is divided into K equally sized folds; each fold takes a turn as the validation data while the remaining K-1 folds serve as the training data, and the average of the K MSEs is used as the final mean squared error, smoothing out the variance that a single random split might produce.

Illustration of cross validation; image source: CS109a
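As a side note (a minimal sketch, not from the original article), scikit-learn also offers cross_val_score() as a shortcut that fits the model on each of the K folds and returns the K validation scores; with scoring="neg_mean_squared_error" the sign just needs to be flipped before averaging. The degree-2 polynomial below is only an illustrative choice.

import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
X = PolynomialFeatures(2).fit_transform(labeled[["YearBuilt"]].values)
y = labeled["SalePrice"].values
kf = KFold(n_splits=5, shuffle=True, random_state=123)
# scoring returns the negative MSE, so flip the sign before averaging over the folds
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error")
print("Average validation MSE over 5 folds: {:.0f}".format(-scores.mean()))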

Python

In Python we can use KFold(shuffle=True) from the sklearn.model_selection module to obtain the row indices of the training and validation data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

def get_best_degree(X_labeled, y_labeled, k=5, d=10):
  kf = KFold(n_splits=k, shuffle=True)
  degrees = range(1, d+1)
  mse_train_avg_arr = np.zeros(d)
  mse_validation_avg_arr = np.zeros(d)
  for degree in degrees:
    mse_train_arr = []
    mse_validation_arr = []
    for train_idx, valid_idx in kf.split(X_labeled):
      X_train, X_validation = X_labeled[train_idx], X_labeled[valid_idx]
      y_train, y_validation = y_labeled[train_idx], y_labeled[valid_idx]
      X_train_poly = PolynomialFeatures(degree).fit_transform(X_train)
      X_validation_poly = PolynomialFeatures(degree).fit_transform(X_validation)
      reg = LinearRegression()
      reg.fit(X_train_poly, y_train)
      y_hat = reg.predict(X_train_poly)
      mse_train = mean_squared_error(y_train, y_hat)
      y_hat = reg.predict(X_validation_poly)
      mse_validation = mean_squared_error(y_validation, y_hat)
      mse_train_arr.append(mse_train)
      mse_validation_arr.append(mse_validation)
    mse_train_avg_arr[degree - 1] = np.array(mse_train_arr).mean()
    mse_validation_avg_arr[degree - 1] = np.array(mse_validation_arr).mean()
  best_degree = mse_validation_avg_arr.argmin() + 1
  return mse_train_avg_arr, mse_validation_avg_arr, best_degree

labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
X_labeled = labeled.loc[:, "YearBuilt"].values.reshape(-1, 1)
y_labeled = labeled.loc[:, "SalePrice"].values.reshape(-1, 1)
mse_train_avg_arr, mse_validation_avg_arr, best_degree = get_best_degree(X_labeled, y_labeled)
print("Best degree: {}".format(best_degree))
d = 10
degrees = range(1, d+1)
plt.plot(degrees, mse_train_avg_arr, c="r", marker="s", label="Train")
plt.plot(degrees, mse_validation_avg_arr, c="g", marker="^", label="Validation")
plt.xticks(degrees)
plt.axvline(best_degree, label="Best Degree: {}".format(best_degree))
plt.legend(loc="upper left")
plt.xlabel("Degree")
plt.ylabel("MSE")
plt.show()
## Best degree: 3

Using cross validation to compare the average MSE of models with degree 1 through 10

R

In R we can use the createFolds() function from the caret package to obtain the row indices of the training and validation data.

library(caret)
library(tidyr)
library(ggplot2)

get_best_degree <- function(labeled_df, d = 10, k = 5) {
  degrees <- seq(1, d)
  mse_train_avg_vec <- vector(length = d)
  mse_validation_avg_vec <- vector(length = d)
  for (degree in degrees) {
    validation_indice <- createFolds(labeled_df$SalePrice, k = k)
    validation_indice_len <- length(validation_indice)
    mse_train_vec <- vector(length = validation_indice_len)
    mse_validation_vec <- vector(length = validation_indice_len)
    for (i in 1:validation_indice_len) {
      train <- labeled_df[-validation_indice[[i]], ]
      validation <- labeled_df[validation_indice[[i]], ]
      reg <- lm(SalePrice ~ poly(YearBuilt, degree = degree), data = train)
      y_hat <- predict(reg, newdata = train)
      mse_train <- mean((y_hat - train$SalePrice)**2)
      y_hat <- predict(reg, newdata = validation)
      mse_validation <- mean((y_hat - validation$SalePrice)**2)
      mse_train_vec[i] <- mse_train
      mse_validation_vec[i] <- mse_validation
    }
    mse_train_avg_vec[degree] <- mean(mse_train_vec)
    mse_validation_avg_vec[degree] <- mean(mse_validation_vec)
  }
  best_degree <- which.min(mse_validation_avg_vec)
  return(list(
    mse_train_avg_vec = mse_train_avg_vec,
    mse_validation_avg_vec = mse_validation_avg_vec,
    best_degree = best_degree
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
X_labeled <- labeled_df$YearBuilt
y_labeled <- labeled_df$SalePrice
results <- get_best_degree(labeled_df)
mse_train_vec <- results$mse_train_avg_vec
mse_validation_vec <- results$mse_validation_avg_vec
best_degree <- results$best_degree
sprintf("Best degree: %s", best_degree)
d <- 10
degrees <- seq(1, d)
df_to_plot <- data.frame(
  degree = degrees,
  mse_train = mse_train_vec,
  mse_validation = mse_validation_vec
)
df_to_plot_long <- gather(df_to_plot, key = "source", value = "mse", mse_train, mse_validation)

ggplot(df_to_plot_long, aes(x = degree, y = mse, color = source)) +
  geom_line() +
  geom_point() +
  geom_vline(aes(xintercept = best_degree), color="#007f00") +
  scale_x_continuous(breaks = degrees) +
  xlab("Degree") +
  ylab("MSE")
## [1] "Best degree: 8"

Using cross validation to compare the average MSE of models with degree 1 through 10

Regularization

When we first used a single-variable regression model, adding polynomial terms or extra variables effectively lowered the mean squared error and improved prediction. This stage is called under-fitting: the h function is still unfamiliar with the features of the training data, so its predictions are uniformly poor on both the training data and unseen data (including validation and test data). Data science teams describe an under-fitted model as having high bias and low variance.

When the regression model starts to include high-degree terms or many variables, comparing how the training and validation MSE change shows that the training MSE keeps falling steadily while the validation MSE starts to rise sharply at some point. This stage is called over-fitting: the h function has become too familiar with the features of the training data, so its predictions on the training data are excellent, but it has lost the ability to predict unseen data. Data science teams describe an over-fitted model as having low bias and high variance.

Regularization is a technique data science teams commonly use to prevent over-fitting. The core idea is to add a penalty term to the cost function, so that while minimizing the cost the optimization also keeps any single coefficient from growing too large. A common regularized regression model is Ridge; the penalty coefficient is usually a non-negative number supplied by the user, and the larger the value, the stronger the regularization. Setting it to zero means no regularization, just like the original regression model.

Ridge regularization
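The formula appeared as an image. As a reference point (the exact scaling of the penalty varies between texts and implementations), the objective minimized by scikit-learn's Ridge used below has the form

\min_{\theta} \; \lVert y - X\theta \rVert_2^2 + \alpha \lVert \theta \rVert_2^2

where \alpha is the regularization strength: the larger \alpha, the more the squared size of the coefficients is penalized, and \alpha = 0 recovers ordinary least squares.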

Python

In Python, switching to Ridge() from the sklearn.linear_model module implements a regularized regression model. We compare the size of the h function's coefficients and the shape of the regression model under several regularization strengths.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

def plot_functions(labeled, regressor, d, ax):
  X_arr = np.linspace(labeled["GarageArea"].min(), labeled["GarageArea"].max()).reshape(-1, 1)
  X_arr_poly = PolynomialFeatures(d).fit_transform(X_arr)
  y_arr = regressor.predict(X_arr_poly)
  ax.scatter(labeled["GarageArea"], labeled["SalePrice"], s=5, label="data")
  ax.plot(X_arr, y_arr, c="g", linewidth=2, label="h")
  ax.set_ylim(labeled["SalePrice"].min(), labeled["SalePrice"].max())
  ax.legend(loc="upper left")
  
def plot_coefficients(regressor, ax, alpha):
  coef = regressor.coef_.ravel()
  ax.semilogy(np.abs(coef), marker='^', label="alpha = {:.1E}".format(alpha))
  ax.set_ylim((1e-10, 1e2))
  ax.axhline(y=1, ls="--", color="r")
  ax.legend(loc='upper right')
  
labeled = pd.read_csv("https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv")
X = labeled.loc[:, "GarageArea"].values.reshape(-1, 1)
y = labeled.loc[:, "SalePrice"].values.reshape(-1, 1)
degree = 10
X_poly = PolynomialFeatures(degree).fit_transform(X)
X_train, X_validation, y_train, y_validation = train_test_split(X_poly, y, test_size=0.3, random_state=123)
# Regularization strength alpha
alphas = [0, 1e3, 1e6, 1e9, 1e12]
# Visualization
fig, axes = plt.subplots(5, 2, figsize=(12, 16))
for i, alpha in enumerate(alphas):
  ridge = Ridge(alpha=alpha)
  ridge.fit(X_train, y_train)
  ax = axes[i, 0]
  plot_functions(labeled, ridge, d=degree, ax=ax)
  ax = axes[i, 1]
  plot_coefficients(ridge, ax=ax, alpha=alpha)
plt.show()

As the regularization strength increases, the h function becomes smoother and the coefficients shrink

R

In R, the lm.ridge() function from the MASS package implements a regularized regression model.

library(ggplot2)
library(gridExtra)
library(MASS)

get_train_validation <- function(labeled_df, validation_size=0.3) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
X_vec <- seq(min(train$GarageArea), max(train$GarageArea), length.out = 50)

lambdas <- c(0, 1e2, 1e4, 1e6)
function_plots <- list()
coef_plots <- list()
d <- 10
for (i in 1:length(lambdas)) {
  ridge <- lm.ridge(SalePrice ~ poly(GarageArea, degree = d), lambda = lambdas[i], data = train)
  new_data <- poly(as.matrix(X_vec), degree = d)
  #ones <- as.matrix(rep(1, times = nrow(new_data)))
  y_vec <- cbind(1, new_data) %*% coef(ridge)
  function_df <- data.frame(GarageArea = X_vec, SalePrice = y_vec)
  gg <- ggplot(labeled_df, aes(x = GarageArea, y = SalePrice)) +
    geom_point(size = 0.5) +
    geom_line(data = function_df, aes(x = GarageArea, y = SalePrice), color = "#ff00ff") +
    xlab("") +
    ylab("") +
    theme(axis.ticks.x = element_blank(),
          axis.ticks.y = element_blank())
  function_plots[[i]] <- gg
  coefs <- abs(ridge$coef)
  thetas <- 1:d
  coef_df <- data.frame(thetas = thetas, coefs = coefs)
  gg <- ggplot(coef_df, aes(x = thetas, y = coefs)) +
    geom_line() +
    geom_point() +
    scale_y_continuous(trans = "log") +
    xlab("") +
    ylab("") +
    scale_x_continuous(breaks = 1:10) +
    geom_hline(yintercept = 5000, color = "red", lty = 2) +
    theme(axis.ticks.x = element_blank(),
          axis.ticks.y = element_blank())
  coef_plots[[i]] <- gg
}
grid.arrange(function_plots[[1]], coef_plots[[1]],
             function_plots[[2]], coef_plots[[2]],
             function_plots[[3]], coef_plots[[3]],
             function_plots[[4]], coef_plots[[4]],
             ncol = 2)

As the regularization strength increases, the h function becomes smoother and the coefficients shrink

Summary

In this section we used the Ames housing dataset to show how, in Python and R, we can write our own functions or call built-ins to standardize the scale of variables, evaluate and compare regression models with the mean squared error, refine the evaluation by adding more variables and polynomial terms, use cross validation to smooth out the variance from data splitting, and regularize regression models to avoid over-fitting.

Further Reading