Finding the Coefficients of a Regression Model

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

Tom Mitchell

Having come through the series on how to acquire, wrangle, and explore data, we can now write Python and R code for the tasks that precede building a machine learning model in a data science project: scraping web data automatically, cleaning and merging data at scale, and quickly grasping a dataset's characteristics. Many top data science teams build their own data pipelines to standardize and scale these upstream tasks (for example, Airbnb's Airflow). It is time to train machine learning models on data to help the team make predictions or uncover hidden features: using a model to predict is called supervised learning, while uncovering features is called unsupervised learning.

About Regression Models

Machine learning is a way of internalizing the ability to predict or to uncover features into a computer program through input data. A model involves three elements and one caveat. The three elements are data (Experience), task (Task), and evaluation (Performance); for a house price prediction model, they are:

  • Data (Experience): a sufficient number of observations with variables such as floor area, building age, housing type, and sale price
  • Task: use the model to predict the prices of houses not yet sold on the market
  • Evaluation (Performance): the size of the error between the model's predicted prices and the prices actually realized later

The caveat is that as the data grows, the prediction error should shrink. And because the house price model outputs a number, it belongs to a branch of supervised learning: the regression model (Regression Model).

Toy Datasets

Many data science columns earnestly warn job seekers not to list experience with toy datasets on their résumés, but that does not make toy datasets poor teaching material. Datasets commonly used to learn regression modeling include the Boston (Massachusetts) and Ames (Iowa) house prices. In this article we will use a subset of the observations and variables of the Ames housing dataset in both Python and R to get acquainted with machine learning and regression models.

What Are Training, Validation, and Test Data

After downloading the Ames housing data you will find two .csv files: one with the actual values (in a variable named SalePrice) and one without. The data with actual values is usually split 70/30 or 80/20 into a training set and a validation set; the data without actual values is called the test set.

  • Train: 70% or 80% of the observations, drawn after randomly shuffling the data with actual values, serving as the foundation on which the regression model is built
  • Validation: the remaining 30% or 20% of the shuffled observations; they are fed into the regression model built from the training data to obtain predictions, and the error between those predictions and the actual values is used to assess the model's quality (a minimal sketch of this comparison follows the list). Data science teams usually train and ensemble several models and pick the one with the best validation performance before moving on to the test data
  • Test: the data without actual values; the predictions obtained by feeding these observations into the regression model go into production or become a competition submission. Because there are no actual values, the model's fit on the test data can only be examined afterwards, by designing experiments (such as an A/B test) or through the evaluation published by the competition host.
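
As a minimal sketch of the evaluation step on validation data (assuming some fitted model has already produced predictions; the numbers below are hypothetical, chosen only for illustration):

import numpy as np

def validation_mse(y_true, y_pred):
  """
  Mean squared error between actual and predicted values
  """
  y_true = np.asarray(y_true, dtype=float)
  y_pred = np.asarray(y_pred, dtype=float)
  return np.mean((y_true - y_pred)**2)

# Hypothetical actual and predicted sale prices for three validation observations
actuals = [208500, 181500, 223500]
predictions = [205000, 190000, 215000]
print(validation_mse(actuals, predictions))
## 52250000.0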

We can write our own function to split the training and validation data: first shuffle the rows of the data frame, the way a deck of cards is shuffled before dealing, then assign the first 30% (or 20%) of observations to validation and the remaining 70% (or 80%) to training. The essential shuffling step uses the shuffle() function from Python's random module, or the sample() function in R.

Python

In Python we can use the shuffle() function from the random module.

import pandas as pd
import random

def get_train_validation(labeled_df, validation_size=0.3, random_state=123):
  """
  Getting train/validation data from labeled dataframe
  """
  m = labeled_df.shape[0]
  row_indice = list(labeled_df.index)
  random.Random(random_state).shuffle(row_indice)
  shuffled = labeled_df.loc[row_indice, :]
  validation_threshold = int(validation_size * m)
  validation = shuffled.iloc[0:validation_threshold, :]
  train = shuffled.iloc[validation_threshold:, :]
  return train, validation
  
labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = get_train_validation(labeled_df)
print(train_df.shape)
print(validation_df.shape)
## (1022, 81)
## (438, 81)

Alternatively, we can use a ready-made function: train_test_split() from sklearn.model_selection.

import pandas as pd
from sklearn.model_selection import train_test_split

labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
print(train_df.shape)
print(validation_df.shape)
## (1022, 81)
## (438, 81)

R

In R we can use the sample() function.

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train <- split_result$train
validation <- split_result$validation
dim(train)
dim(validation)
## [1] 1022   81
## [1] 438  81

Or use the createDataPartition() function from the caret package.

library(caret)

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
train_index <- createDataPartition(labeled_df$Id, p = 0.7, 
                                  list = FALSE, 
                                  times = 1)
train <- labeled_df[train_index, ]
validation <- labeled_df[-train_index, ]
dim(train)
dim(validation)
## [1] 1024   81
## [1] 436  81

A Single-Variable Regression Model

Besides SalePrice, the labeled data has 80 more variables; from these we look for one that is highly correlated with SalePrice to serve as the basis for the regression model.

Python

In Python we can use the .corr() method to obtain the correlation matrix, then take the absolute values of the correlations with SalePrice, sort them, and look at the variables whose correlation is at least 0.6.

import pandas as pd

labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
df_corr = labeled_df.corr(numeric_only=True)  # numeric_only is required in pandas >= 2.0, which no longer drops non-numeric columns silently
sale_price_corr = df_corr["SalePrice"].abs().sort_values(ascending=False)
sale_price_corr[sale_price_corr >= 0.6]

(Output: variables with correlation of at least 0.6 with SalePrice)

R

In R we can use the cor() function to obtain the correlation matrix.

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
is_numeric_col <- unlist(lapply(labeled_df, FUN = is.numeric))
corr_mat <- cor(labeled_df[, is_numeric_col], use = "complete.obs")
sorted_corr_mat <- sort(corr_mat[, "SalePrice"], decreasing = TRUE)
sorted_corr_mat[sorted_corr_mat >= 0.6]

(Output: variables with correlation of at least 0.6 with SalePrice)

GrLivArea (Ground Living Area) is the day-to-day living space of a house excluding the basement, so it may serve as a basis for predicting the sale price. Besides inspecting correlation coefficients, data science teams usually also plot the highly correlated variables against the actual values as scatter plots, which gives us a chance to review the plotting techniques from the exploring-data series.

Python

import pandas as pd
import matplotlib.pyplot as plt

labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
labeled_df.plot.scatter("GrLivArea", "SalePrice")
plt.show()

(Figure: GrLivArea versus SalePrice)

R

library(ggplot2)

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
ggplot(labeled_df, aes(x = GrLivArea, y = SalePrice)) +
  geom_point()

(Figure: GrLivArea versus SalePrice)

Suppose a true regression model f relates GrLivArea to SalePrice. Since f is unknowable, we posit another regression model h into which we can feed GrLivArea to obtain a predicted value. The smaller the error between the predictions and the actual values, the more confident we are that h is close to f. Mean squared error (Mean Square Error) is the evaluation measure, and the task reduces to finding the pair of coefficients (the intercept and the GrLivArea coefficient) that minimizes it.
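
Written out (restating the setup; the cost function in the code below divides by 2m rather than m purely for convenience when differentiating, which does not change the minimizing coefficients):

$$h(x) = \theta_0 + \theta_1 x$$

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\big(h(x^{(i)}) - y^{(i)}\big)^2$$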

import pandas as pd
from sklearn.model_selection import train_test_split

labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
print(train_df[["GrLivArea", "SalePrice"]].head(2))
print(train_df[["GrLivArea", "SalePrice"]].tail(2))

(Output: the first two and last two rows of the training data)

The 1,022 training observations give a system of equations, and we look for the coefficient pair that minimizes the mean squared error over it:

$$y^{(i)} \approx \theta_0 + \theta_1 x^{(i)}, \quad i = 1, 2, \dots, 1022$$

For computational convenience, we rewrite the system in matrix form.

$$y \approx X\theta, \quad X = \begin{bmatrix} 1 & x^{(1)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}, \quad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Here X has shape m × 2, the coefficient matrix 2 × 1, and y m × 1, where m is the number of training observations, 1,022. The mean squared error (also called the cost function J) can likewise be rewritten from the system-of-equations form into matrix form.

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{2m}(X\theta - y)^T(X\theta - y)$$

The Task of Finding the Coefficients

Our task is now to find the coefficients that minimize the mean squared error and thereby build the regression model h. Data science teams usually take one of two routes, the normal equation (Normal Equation) or gradient descent (Gradient Descent), and both revolve around minimizing the cost function J.

Normal Equation

While deriving the coefficients, keep the shapes of m, X, the coefficient vector, and y in mind: m is the number of training observations (1,022), X is 1022 × 2, the coefficient vector is 2 × 1, and y is 1022 × 1. When the matrix form of the mean squared error is expanded, the two cross terms can be merged into one: because Xθ has the same 1022 × 1 shape as y, both cross terms are the same scalar.

$$J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y) = \frac{1}{2m}\big(\theta^T X^T X \theta - 2\,\theta^T X^T y + y^T y\big)$$

Next, take the partial derivatives of the cost function J with respect to the coefficient vector and set them to zero, which yields the coefficient vector that minimizes the mean squared error.

$$\nabla_\theta J = \frac{1}{m}\big(X^T X \theta - X^T y\big) = 0 \quad\Rightarrow\quad \theta = (X^T X)^{-1} X^T y$$

With the coefficient vector derived, we can define a function get_thetas_lm() that takes X and y and returns the coefficient vector.

Python

In Python we can use the NumPy module for matrix operations.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def get_thetas_lm(X, y):
  m = X.size
  X = X.reshape(-1, 1)
  y = y.reshape(-1, 1)
  ones = np.ones(m, dtype=int).reshape(-1, 1)
  X = np.concatenate([ones, X], axis=1)
  LHS = np.dot(X.T, X)
  RHS = np.dot(X.T, y)
  LHS_inv = np.linalg.inv(LHS)
  thetas = np.dot(LHS_inv, RHS)
  return tuple(thetas.ravel())

labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
X_train = train_df["GrLivArea"].values
y_train = train_df["SalePrice"].values
theta_0, theta_1 = get_thetas_lm(X_train, y_train)
print("Theta 0: {:.4f}".format(theta_0))
print("Theta 1: {:.4f}".format(theta_1))
## Theta 0: 21905.1315
## Theta 1: 104.0985

R

In R, matrix types and operations are built in; solve() can invert a matrix or, as here, solve the linear system XᵀXθ = Xᵀy directly.

get_thetas_lm <- function(X, y) {
  m <- length(X)
  X <- as.matrix(X)
  y <- as.matrix(y)
  ones <- as.matrix(rep(1, times=m))
  X <- cbind(ones, X)
  LHS <- t(X) %*% X
  RHS <- t(X) %*% y
  return(solve(LHS, RHS))
}

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
X_train <- split_result$train$GrLivArea
y_train <- split_result$train$SalePrice
thetas_lm <- get_thetas_lm(X_train, y_train)
theta_0 <- thetas_lm[1, 1]
theta_1 <- thetas_lm[2, 1]
sprintf("Theta 0: %.4f", theta_0)
sprintf("Theta 1: %.4f", theta_1)
## [1] "Theta 0: 23102.0244"
## [1] "Theta 1: 103.9400"

The training sets drawn at random in Python and in R are not identical, so the estimated coefficients differ slightly.

Gradient Descent

Give the coefficient vector a starting value (typically zeros), compute the slope of the cost function J at that point (its partial derivatives with respect to the coefficients), and use the sign of the slope together with the learning rate to decide whether each coefficient should decrease or increase a little on the next step. After enough iterative updates, this yields a coefficient vector approaching a local minimum.

Each coefficient moves against the slope, scaled by the learning rate α:

$$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$

Expressed in matrix form:

$$\theta := \theta - \frac{\alpha}{m} X^T (X\theta - y)$$

Python

Finding the coefficients with gradient descent in Python:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

def compute_cost(X, y, thetas):
  m = X.shape[0]
  h = np.dot(X, thetas)
  J = 1/(2*m)*np.sum(np.square(h - y))
  return J
  
def get_thetas_lm(X, y, alpha=0.001, num_iters=10000):
  thetas = np.array([0, 0]).reshape(-1, 1)
  X = X.reshape(-1, 1)
  y = y.reshape(-1, 1)
  m = X.shape[0]
  ones = np.ones(m, dtype=int).reshape(-1, 1)
  X = np.concatenate([ones, X], axis=1)
  J_history = np.zeros(num_iters)
  for num_iter in range(num_iters):
    h = np.dot(X, thetas)
    loss = h - y
    gradient = np.dot(X.T, loss)
    thetas = thetas - (alpha * gradient)/m
    J_history[num_iter] = compute_cost(X, y, thetas=thetas)
  return thetas, J_history

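# Note on rescaling: gradient descent above runs on standardized X and y.
# If y_s = (y - mean_y)/std_y and x_s = (x - mean_x)/std_x, then
# y_s = theta_0 + theta_1 * x_s implies, back in the original units,
#   slope     = theta_1 * std_y / std_x
#   intercept = mean_y + std_y * theta_0 - slope * mean_x
# which is what get_rescaled_thetas() computes.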
def get_rescaled_thetas(thetas, scaler_X, scaler_y, X_train, y_train):
  theta_0 = thetas[0, 0]
  theta_1 = thetas[1, 0]
  scale_X = scaler_X.scale_[0]
  scale_y = scaler_y.scale_[0]
  theta_0_rescaled = y_train.mean() + y_train.std()*theta_0 - theta_1*X_train.mean()*(y_train.std()/X_train.std())
  theta_1_rescaled = theta_1 / scale_X * scale_y
  return theta_0_rescaled, theta_1_rescaled
  
labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
X_train = train_df["GrLivArea"].values.reshape(-1, 1).astype(float)
y_train = train_df["SalePrice"].values.reshape(-1, 1).astype(float)
# Normalization
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
y_train_scaled = scaler_y.fit_transform(y_train)
thetas, J_history = get_thetas_lm(X_train_scaled, y_train_scaled)
# Rescaling
theta_0, theta_1 = get_rescaled_thetas(thetas, scaler_X, scaler_y, X_train, y_train)
print("Theta 0: {:.4f}".format(theta_0))
print("Theta 1: {:.4f}".format(theta_1))
# Plotting J_history
plt.plot(J_history)
plt.title("Cost function during gradient descent")
plt.xlabel("Iterations")
plt.ylabel(r"$J(\theta)$")
plt.show()
## Theta 0: 21910.8619
## Theta 1: 104.0947

(Figure: the cost function over the 10,000 iterations of gradient descent)

R

Finding the coefficients with gradient descent in R:

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

compute_cost <- function(X, y, thetas) {
  m <- nrow(X)
  h <- X %*% thetas
  J <- sum((h - y)**2) / (2*m)
  return(J)
}

standardize <- function(X, y) {
  X_sd <- sd(X)
  mean_sd <- mean(X)
  y_sd <- sd(y)
  mean_y <- mean(y)
  X_scaled <- (X - mean_sd)/X_sd
  y_scaled <- (y - mean_y)/y_sd
  return(list(
    X_scaled = X_scaled,
    y_scaled = y_scaled
  ))
}

get_thetas_lm <- function(X, y, alpha=0.001, num_iters=10000) {
  thetas <- as.matrix(c(0, 0))
  m <- length(X)
  X <- as.matrix(X)
  y <- as.matrix(y)
  ones <- as.matrix(rep(1, times = m))
  X <- cbind(ones, X)
  J_history <- vector(length = num_iters)
  for (num_iter in 1:num_iters) {
    h <- X %*% thetas
    loss <- h - y
    gradient <- t(X) %*% loss
    thetas <- thetas - (alpha * gradient)/m
    J_history[num_iter] <- compute_cost(X, y, thetas = thetas)
  }
  return(list(
    thetas = thetas,
    J_history = J_history
  ))
}

get_rescaled_thetas <- function(thetas, X_train, y_train) {
  theta_0 <- thetas[1, 1]
  theta_1 <- thetas[2, 1]
  X_train_mean <- mean(X_train)
  y_train_mean <- mean(y_train)
  X_train_sd <- sd(X_train)
  y_train_sd <- sd(y_train)
  theta_1_rescaled <- theta_1 / X_train_sd * y_train_sd
  theta_0_rescaled <- y_train_mean + y_train_sd * theta_0 - y_train_sd/X_train_sd * theta_1 * X_train_mean
  return(list(
    theta_0_rescaled = theta_0_rescaled,
    theta_1_rescaled = theta_1_rescaled
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
X_train <- split_result$train$GrLivArea
y_train <- split_result$train$SalePrice
# Normalization
standard_scaler <- standardize(X_train, y_train)
X_train_scaled <- standard_scaler$X_scaled
y_train_scaled <- standard_scaler$y_scaled
thetas_lm <- get_thetas_lm(X_train_scaled, y_train_scaled)
thetas <- thetas_lm$thetas
J_history <- thetas_lm$J_history
# Rescaling
thetas_rescaled <- get_rescaled_thetas(thetas, X_train, y_train)
theta_0 <- thetas_rescaled$theta_0_rescaled
theta_1 <- thetas_rescaled$theta_1_rescaled
sprintf("Theta 0: %.4f", theta_0)
sprintf("Theta 1: %.4f", theta_1)
plot(J_history, type = "l", main = "Cost function during gradient descent", xlab = "Iterations", ylab = expression(J(theta)))
## [1] "Theta 0: 7306.3620"
## [1] "Theta 1: 114.8515"

(Figure: the cost function over the 10,000 iterations of gradient descent)

Data science teams often use a 3-D surface plot (Surface Plot) to depict the size of the cost function at different coefficient combinations during gradient descent.

Python

In Python we import Axes3D from mpl_toolkits.mplot3d to draw the surface plot.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def compute_cost(X, y, thetas):
  m = X.shape[0]
  h = np.dot(X, thetas)
  J = 1/(2*m)*np.sum(np.square(h - y))
  return J
  
def get_thetas_lm(X, y, alpha=0.001, num_iters=10000):
  thetas = np.array([0, 0]).reshape(-1, 1)
  X = X.reshape(-1, 1)
  y = y.reshape(-1, 1)
  m = X.shape[0]
  ones = np.ones(m, dtype=int).reshape(-1, 1)
  X = np.concatenate([ones, X], axis=1)
  J_history = np.zeros(num_iters)
  for num_iter in range(num_iters):
    h = np.dot(X, thetas)
    loss = h - y
    gradient = np.dot(X.T, loss)
    thetas = thetas - (alpha * gradient)/m
    J_history[num_iter] = compute_cost(X, y, thetas=thetas)
  return thetas, J_history

def get_rescaled_thetas(thetas, scaler_X, scaler_y, X_train, y_train):
  theta_0 = thetas[0, 0]
  theta_1 = thetas[1, 0]
  scale_X = scaler_X.scale_[0]
  scale_y = scaler_y.scale_[0]
  theta_0_rescaled = y_train.mean() + y_train.std()*theta_0 - theta_1*X_train.mean()*(y_train.std()/X_train.std())
  theta_1_rescaled = theta_1 / scale_X * scale_y
  return theta_0_rescaled, theta_1_rescaled

def surface_plot(theta0_range, theta1_range, X, y):
  theta0_start, theta0_end = theta0_range
  theta1_start, theta1_end = theta1_range
  length = 50
  theta0_arr = np.linspace(theta0_start, theta0_end, length)
  theta1_arr = np.linspace(theta1_start, theta1_end, length)
  Z = np.zeros((length, length))
  for i in range(length):
    for j in range(length):
      theta_0 = theta0_arr[i]
      theta_1 = theta1_arr[j]
      thetas_arr = np.array([theta_0, theta_1]).reshape(-1, 1)
      Z[i, j] = compute_cost(X, y, thetas=thetas_arr)
  xx, yy = np.meshgrid(theta0_arr, theta1_arr, indexing='xy')

  fig = plt.figure()
  ax = fig.add_subplot(111, projection='3d')
  ax.plot_surface(xx, yy, Z, rstride=1, cstride=1, alpha=0.6, cmap=plt.cm.jet)
  ax.set_zlabel('Cost')
  ax.set_zlim(Z.min(),Z.max())
  ax.view_init(elev=15, azim=230)
  ax.set_xticks(np.linspace(theta0_start, theta0_end, 5, dtype=int))
  ax.set_yticks(np.linspace(theta1_start, theta1_end, 5, dtype=int))
  ax.set_xlabel(r'$\theta_0$')
  ax.set_ylabel(r'$\theta_1$')
  ax.set_title("Cost function during gradient descent")
  plt.show()
  
labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
X_train = train_df["GrLivArea"].values.reshape(-1, 1).astype(float)
y_train = train_df["SalePrice"].values.reshape(-1, 1).astype(float)
# Surface plot
X_train_reshaped = X_train.reshape(-1, 1)
y_train_reshaped = y_train.reshape(-1, 1)
m = X_train_reshaped.shape[0]
ones = np.ones(m, dtype=int).reshape(-1, 1)
X_train_reshaped = np.concatenate([ones, X_train_reshaped], axis=1)
surface_plot((10000, 30000), (0, 200), X_train_reshaped, y_train_reshaped)

(Figure: the cost function surface during gradient descent)

Alternatively, the plotly package, available in both Python and R, can draw interactive surface plots that support rotating, hovering for coordinates, zooming, and so on.

Python

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import plotly.plotly as py  # plotly's legacy online mode; moved to the chart_studio package in plotly >= 4
import plotly.graph_objs as go

def compute_cost(X, y, thetas):
  m = X.shape[0]
  h = np.dot(X, thetas)
  J = 1/(2*m)*np.sum(np.square(h - y))
  return J
  
def get_thetas_lm(X, y, alpha=0.001, num_iters=10000):
  thetas = np.array([0, 0]).reshape(-1, 1)
  X = X.reshape(-1, 1)
  y = y.reshape(-1, 1)
  m = X.shape[0]
  ones = np.ones(m, dtype=int).reshape(-1, 1)
  X = np.concatenate([ones, X], axis=1)
  J_history = np.zeros(num_iters)
  for num_iter in range(num_iters):
    h = np.dot(X, thetas)
    loss = h - y
    gradient = np.dot(X.T, loss)
    thetas = thetas - (alpha * gradient)/m
    J_history[num_iter] = compute_cost(X, y, thetas=thetas)
  return thetas, J_history

def get_rescaled_thetas(thetas, scaler_X, scaler_y, X_train, y_train):
  theta_0 = thetas[0, 0]
  theta_1 = thetas[1, 0]
  scale_X = scaler_X.scale_[0]
  scale_y = scaler_y.scale_[0]
  theta_0_rescaled = y_train.mean() + y_train.std()*theta_0 - theta_1*X_train.mean()*(y_train.std()/X_train.std())
  theta_1_rescaled = theta_1 / scale_X * scale_y
  return theta_0_rescaled, theta_1_rescaled

def get_surface_Z(theta0_range, theta1_range, X, y):
  theta0_start, theta0_end = theta0_range
  theta1_start, theta1_end = theta1_range
  length = 50
  theta0_arr = np.linspace(theta0_start, theta0_end, length)
  theta1_arr = np.linspace(theta1_start, theta1_end, length)
  Z = np.zeros((length, length))
  for i in range(length):
    for j in range(length):
      theta_0 = theta0_arr[i]
      theta_1 = theta1_arr[j]
      thetas_arr = np.array([theta_0, theta_1]).reshape(-1, 1)
      Z[i, j] = compute_cost(X, y, thetas=thetas_arr)
  return Z
  
labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
X_train = train_df["GrLivArea"].values.reshape(-1, 1).astype(float)
y_train = train_df["SalePrice"].values.reshape(-1, 1).astype(float)
# Normalization
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
y_train_scaled = scaler_y.fit_transform(y_train)
thetas, J_history = get_thetas_lm(X_train_scaled, y_train_scaled)
# Rescaling
theta_0, theta_1 = get_rescaled_thetas(thetas, scaler_X, scaler_y, X_train, y_train)
# Surface plot
X_train_reshaped = X_train.reshape(-1, 1)
y_train_reshaped = y_train.reshape(-1, 1)
m = X_train_reshaped.shape[0]
ones = np.ones(m, dtype=int).reshape(-1, 1)
X_train_reshaped = np.concatenate([ones, X_train_reshaped], axis=1)
Z = get_surface_Z((10000, 30000), (0, 200), X_train_reshaped, y_train_reshaped)

py.sign_in('USERNAME', 'APIKEY') # Use your own plotly Username / API Key
data = [go.Surface(z=Z)]
layout = go.Layout(
  title='Cost function during gradient descent',
  scene=dict(
      xaxis = dict(title='theta_0'),
      yaxis = dict(title="theta_1"),
      zaxis = dict(title="J(theta)")
  )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='gd-3d-surface')

(Figure: the cost function surface during gradient descent, drawn with plotly)

R

library(plotly)

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

compute_cost <- function(X, y, thetas) {
  m <- nrow(X)
  h <- X %*% thetas
  J <- sum((h - y)**2) / (2*m)
  return(J)
}

standardize <- function(X, y) {
  X_sd <- sd(X)
  mean_sd <- mean(X)
  y_sd <- sd(y)
  mean_y <- mean(y)
  X_scaled <- (X - mean_sd)/X_sd
  y_scaled <- (y - mean_y)/y_sd
  return(list(
    X_scaled = X_scaled,
    y_scaled = y_scaled
  ))
}

get_thetas_lm <- function(X, y, alpha=0.001, num_iters=10000) {
  thetas <- as.matrix(c(0, 0))
  m <- length(X)
  X <- as.matrix(X)
  y <- as.matrix(y)
  ones <- as.matrix(rep(1, times = m))
  X <- cbind(ones, X)
  J_history <- vector(length = num_iters)
  for (num_iter in 1:num_iters) {
    h <- X %*% thetas
    loss <- h - y
    gradient <- t(X) %*% loss
    thetas <- thetas - (alpha * gradient)/m
    J_history[num_iter] <- compute_cost(X, y, thetas = thetas)
  }
  return(list(
    thetas = thetas,
    J_history = J_history
  ))
}

get_rescaled_thetas <- function(thetas, X_train, y_train) {
  theta_0 <- thetas[1, 1]
  theta_1 <- thetas[2, 1]
  X_train_mean <- mean(X_train)
  y_train_mean <- mean(y_train)
  X_train_sd <- sd(X_train)
  y_train_sd <- sd(y_train)
  theta_1_rescaled <- theta_1 / X_train_sd * y_train_sd
  theta_0_rescaled <- y_train_mean + y_train_sd * theta_0 - y_train_sd/X_train_sd * theta_1 * X_train_mean
  return(list(
    theta_0_rescaled = theta_0_rescaled,
    theta_1_rescaled = theta_1_rescaled
  ))
}

surface_plot <- function(theta0_range, theta1_range, X, y) {
  theta0_start <- theta0_range[1]
  theta0_end <- theta0_range[2]
  theta1_start <- theta1_range[1]
  theta1_end <- theta1_range[2]
  length_out <- 50
  theta0_arr <- seq(theta0_start, theta0_end, length.out = length_out)
  theta1_arr <- seq(theta1_start, theta1_end, length.out = length_out)
  Z <- matrix(nrow = length_out, ncol = length_out)
  for (i in 1:length_out) {
    for (j in 1:length_out) {
      theta_0 <- theta0_arr[i]
      theta_1 <- theta1_arr[j]
      thetas_mat <- as.matrix(c(theta_0, theta_1))
      Z[i, j] <- compute_cost(X, y, thetas = thetas_mat)
    }
  }
  list_for_surface <- list(
    x = theta0_arr,
    y = theta1_arr,
    z = Z
  )
  plot_ly(x = list_for_surface$x, y = list_for_surface$y, z = list_for_surface$z, colorscale = "Jet") %>%
    add_surface() %>% 
    layout(
      title = "Cost function during gradient descent",
      scene = list(
        xaxis = list(title = "theta_0"),
        yaxis = list(title = "theta_1"),
        zaxis = list(title = "J")
      )
    )
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
X_train <- split_result$train$GrLivArea
y_train <- split_result$train$SalePrice
# Normalization
standard_scaler <- standardize(X_train, y_train)
X_train_scaled <- standard_scaler$X_scaled
y_train_scaled <- standard_scaler$y_scaled
thetas_lm <- get_thetas_lm(X_train_scaled, y_train_scaled)
thetas <- thetas_lm$thetas
J_history <- thetas_lm$J_history
# Rescaling
thetas_rescaled <- get_rescaled_thetas(thetas, X_train, y_train)
theta_0 <- thetas_rescaled$theta_0_rescaled
theta_1 <- thetas_rescaled$theta_1_rescaled
sprintf("Theta 0: %.4f", theta_0)
sprintf("Theta 1: %.4f", theta_1)
m <- length(X_train)
X_train_reshaped <- as.matrix(X_train)
y_train_reshaped <- as.matrix(y_train)
ones <- as.matrix(rep(1, times = m))
X_train_reshaped <- cbind(ones, X_train_reshaped)
surface_plot(c(10000, 30000), c(0, 200), X_train_reshaped, y_train_reshaped)

(Figure: the cost function surface during gradient descent, drawn with plotly)

Finding the Coefficients with Modules and Packages

Having hand-written code for both routes to the coefficients of the regression model h, the normal equation and gradient descent, we can now rely on the rich set of functions that Python and R provide.

Python

In Python we use LinearRegression() from sklearn.linear_model to find the coefficients. Scikit-Learn estimators share a consistent calling convention: initialize the object, then feed the training data to its .fit() method. LinearRegression solves the same least-squares problem, so its coefficients match those from the normal equation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
  
labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
X_train = train_df["GrLivArea"].values.reshape(-1, 1)
y_train = train_df["SalePrice"].values.reshape(-1, 1)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
theta_0 = regressor.intercept_[0]
theta_1 = regressor.coef_[0, 0]
print("Theta 0: {:.4f}".format(theta_0))
print("Theta 1: {:.4f}".format(theta_1))
## Theta 0: 21905.1315
## Theta 1: 104.0985

R

In R we use the built-in lm() function to find the coefficients; it likewise reproduces the normal-equation results.

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train_df <- split_result$train
regressor <- lm(SalePrice ~ GrLivArea, data = train_df)
theta_0 <- regressor$coefficients[1]
theta_1 <- regressor$coefficients[2]
sprintf("Theta 0: %.4f", theta_0)
sprintf("Theta 1: %.4f", theta_1)
## [1] "Theta 0: 23102.0244"
## [1] "Theta 1: 103.9400"

Drawing the Model on the Scatter Plot

With the coefficients found, we can draw the regression model h on the scatter plot: lay out evenly spaced points between the minimum and maximum of the X-axis variable, feed those points into h to obtain a set of predictions, and connect the predictions into a straight line.

Python

In Python we usually use the np.linspace() function to lay out points over an interval.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
  
labeled_url = "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df = pd.read_csv(labeled_url)
train_df, validation_df = train_test_split(labeled_df, test_size=0.3, random_state=123)
X_train = train_df["GrLivArea"].values.reshape(-1, 1)
y_train = train_df["SalePrice"].values.reshape(-1, 1)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
theta_0 = regressor.intercept_[0]
theta_1 = regressor.coef_[0, 0]
X_min = labeled_df["GrLivArea"].min()
X_max = labeled_df["GrLivArea"].max()
X_arr = np.linspace(X_min, X_max)
y_hats = theta_0 + theta_1 * X_arr

plt.scatter(train_df["GrLivArea"], train_df["SalePrice"], color="r", s=5, label="train")
plt.scatter(validation_df["GrLivArea"], validation_df["SalePrice"], color="g", s=5, label="validation")
plt.plot(X_arr, y_hats, color="c", linewidth=3, label="h")
plt.legend(loc="upper left")
plt.xlabel("Ground Living Area")
plt.ylabel("Sale Price")
plt.show()

(Figure: the regression model h drawn on the scatter plot)

R

In R we can use the seq() function to lay out points over a numeric interval.

library(ggplot2)

get_train_validation <- function(labeled_df, validation_size=0.3, random_state=123) {
  m <- nrow(labeled_df)
  row_indice <- 1:m
  set.seed(random_state)
  shuffled_row_indice <- sample(row_indice)
  labeled_df <- labeled_df[shuffled_row_indice, ]
  validation_threshold <- as.integer(validation_size * m)
  validation <- labeled_df[1:validation_threshold, ]
  train <- labeled_df[(validation_threshold+1):m, ]
  return(list(
    validation = validation,
    train = train
  ))
}

labeled_url <- "https://storage.googleapis.com/kaggle_datasets/House-Prices-Advanced-Regression-Techniques/train.csv"
labeled_df <- read.csv(labeled_url)
split_result <- get_train_validation(labeled_df)
train_df <- split_result$train
train_df$Split <- "Train"
validation_df <- split_result$validation
validation_df$Split <- "Validation"
df_for_scatter <- rbind(train_df, validation_df)
regressor <- lm(SalePrice ~ GrLivArea, data = train_df)
theta_0 <- regressor$coefficients[1]
theta_1 <- regressor$coefficients[2]
X_min <- min(labeled_df$GrLivArea)
X_max <- max(labeled_df$GrLivArea)
X_arr <- seq(from = X_min, to = X_max, length.out = 50)
y_hats <- theta_0 + theta_1 * X_arr
df_for_line <- data.frame(X_arr, y_hats)

ggplot(df_for_scatter, aes(x = GrLivArea, y = SalePrice, color = Split)) +
  geom_point(size = 0.7) +
  geom_line(data = df_for_line, aes(x = X_arr, y = y_hats), color = "#009E73", size = 1.2) +
  xlab("Ground Living Area") +
  ylab("Sale Price")

(Figure: the regression model h drawn on the scatter plot)

Summary

In this section we used observations and variables from the Ames housing dataset to introduce regression models, toy datasets, training/validation/test data, the single-variable regression model, hand-written normal-equation and gradient-descent code for finding the coefficients, surface plots of the cost function during gradient descent, finding the coefficients with modules and packages, and drawing the regression model on the scatter plot.

Further Reading