AUC

详细看之前博客:模型评估与选择
受试者工作特征(Receiver Operating Characteristic, ROC)
ROC曲线下面积(Area Under ROC Curve)
很多学习器是微测试样本产生一个实值或者概率预测,然后将这个预测值与一个分类阈值(threshold)进行比较,若大于阈值则为正例,小于阈值则为反例。例如logistic回归中使用sigmod函数将输出限制在0-1,大于0.5为True,反之则为False。
根据实值或者概率预测结果,我们将测试样本按照可能性排序,分类过程相当于在这个序列中选取一个间断点来将样本分为不同的两个部分,前一部分为“正例”,后一部分为“反例”。
不同任务选取不同点,若重视P,则靠前,若重视R,则靠后。
因此排序的质量体现了“一般情况下”泛化性能的好坏,ROC曲线则是从这个角度出发研究学习器的泛化性能。
根据学习器对样例的排序,按此顺序逐个把样本作为正例预测,每次计算两个值绘制ROC曲线:

  • 纵轴——“真正例率”(True Positive Rate, TPR)
  • 横轴——“假正例率”(False positive Rate, FPR)
    两者分别定义为:
    截屏2020-02-28下午9.15.10
    截屏2020-02-28下午9.15.33
    对曲线的解释:
    先解释两种特殊情形,即“对角线对应于‘随机猜测’模型,而点(0,1)则对应于将所有正例排在所有反例之前的‘理想模型’ ”。
    看一下 ROC 绘图过程:
    给定m+m^+个正例和mm^-个反例,根据学习器预测结果对样例进行排序,然后把分类阈值设为最大,即把所有样例均预测为反例,此时真正例率和假正例率均为 0(无样例被预测为正例,因此真正例 TP 和假正例 FP 均为 0,根据公式可知真正例率 TPR 和假正例率 FPR 均为 0,在坐标(0,0)处标记一个点,然后将分类阙值依次设为每个样例的预测值,依次将每个样例划分为正例,设前一个标记点坐标为(x,y):
    若当前为真正例,坐标为(x,y+1m+)(x,y+\frac{1}{m^+})
    若当前为假正例,坐标为(x+1m,y)(x+\frac{1}{m^-},y)
    学习器比较时,若一个包住另一个,则可说前者优于后者,若有交叉,则比较AUC大小。

AUC=12i=1m1(xi+1xi)(yi+yi+1)AUC=\frac{1}{2}\sum_{i=1}^{m-1}(x_{i+1}-x_i)(y_i+y_{i+1})

为了更好理解,我们将式子变形为:

AUG=i=1m1(xi+1xi)(yi+yi+1)2AUG = \sum_{i=1}^{m-1}(x_{i+1}-x_i)\frac{(y_i+y_{i+1})}{2}

这样可以看出(xi+1xi)(x_{i+1}-x_i)是矩阵的底,(yi+yi+1)2\frac{(y_i+y_{i+1})}{2}是矩阵的高.
排序“损失”(loss)定义为:

lrank=1m+mx+D+xD((f(x+)<f(x))+12(f(x+)=f(x)))l_{rank} = \frac{1}{m^+m^-}\sum_{x^+ \in D^+}\sum_{x^- \in D^-}(Ⅱ(f(x^+)<f(x^-))+\frac{1}{2}Ⅱ(f(x^+)=f(x^-)))

且:
AUC=1lrankAUC = 1-l_{rank}

线性回归的方式

不难,直接上代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from pprint import pprint
if __name__ == '__main__':
path = './Advertising.csv'
data = pd.read_csv(path)
# print(data)
x = data[['TV', 'Radio']]
# print(x)
y = data[['Sales']]
# print(y)
plt.figure(facecolor='w', figsize=(9, 10))
plt.subplot(311)
plt.plot(data['TV'], y, 'ro', mec='k')
plt.title('TV')
plt.grid(b=True, ls=':')
plt.subplot(312)
plt.plot(data['Radio'], y, 'g^', mec='k')
plt.title('Radio')
plt.grid(b=True, ls=':')
plt.subplot(313)
plt.plot(data['Newspaper'], y, 'b*', mec='k')
plt.title('Newspaper')
plt.grid(b=True, ls=':')
plt.tight_layout(pad=2)
# plt.savefig('three_graph.png')
plt.show()
# 从图中可以看出,Newspaper对结果影响较小,所以我们只需要使用前两个数据即可
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
model = LinearRegression()
model.fit(x_train, y_train)
print(model.coef_, model.intercept_)
order = y_test.argsort_value(axis=0)
y_test = y_test.values[order]
x_test = x_test.values[order, :]
y_test_pred = model.predict(x_test)
mse = np.mean((y_test_pred - np.array(y_test)) ** 2) # Mean Squared Error
rmse = np.sqrt(mse) # Root Mean Squared Error
mse_sys = mean_squared_error(y_test, y_test_pred)
print('MSE = ', mse, end=' ')
print('MSE(System Function) = ', mse_sys, end=' ')
print('MAE = ', mean_absolute_error(y_test, y_test_pred))
print('RMSE = ', rmse)
print('Training R2 = ', model.score(x_train, y_train))
print('Training R2(System) = ', r2_score(y_train, model.predict(x_train)))
print('Test R2 = ', model.score(x_test, y_test))
error = y_test - y_test_pred
np.set_printoptions(suppress=True)
print('error = ', error)
plt.hist(error, bins=20, color='g', alpha=0.6, edgecolor='k')
plt.show()
plt.figure(facecolor='w')
t = np.arange(len(x_test))
plt.plot(t, y_test, 'r-', linewidth=2, label='真实数据')
plt.plot(t, y_test_pred, 'g-', linewidth=2, label='预测数据')
plt.legend(loc='upper left')
plt.title('线性回归预测销量', fontsize=18)
plt.grid(b=True, ls=':')
plt.show()

Ridge回归方式

废话不多直接代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
if __name__ == "__main__":
# pandas读入
data = pd.read_csv('.\advertising.csv') # TV、Radio、Newspaper、Sales
print(data)
# x = data[['TV', 'Radio', 'Newspaper']]
x = data[['TV', 'Radio']]
y = data['Sales']
print(x)
print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.2)
model = Ridge()
alpha_can = np.logspace(-3, 2, 10)
np.set_printoptions(suppress=True)
print('alpha_can = ', alpha_can)
lasso_model = GridSearchCV(model, param_grid={'alpha': alpha_can}, cv=5)
lasso_model.fit(x_train, y_train)
print('超参数:\n', lasso_model.best_params_)
order = y_test.argsort(axis=0)
y_test = y_test.values[order]
x_test = x_test.values[order, :]
y_hat = lasso_model.predict(x_test)
print(lasso_model.score(x_test, y_test))
mse = np.average((y_hat - np.array(y_test)) ** 2) # Mean Squared Error
rmse = np.sqrt(mse) # Root Mean Squared Error
print(mse, rmse)
t = np.arange(len(x_test))
mpl.rcParams['font.sans-serif'] = ['simHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(facecolor='w')
plt.plot(t, y_test, 'r-', linewidth=2, label='真实数据')
plt.plot(t, y_hat, 'g-', linewidth=2, label='预测数据')
plt.title('线性回归预测销量', fontsize=18)
plt.legend(loc='upper left')
plt.grid(b=True, ls=':')
plt.show()

Lasso(引入L2正则化)方式

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
if __name__ == "__main__":
# pandas读入
data = pd.read_csv('.\advertising.csv') # TV、Radio、Newspaper、Sales
print(data)
# x = data[['TV', 'Radio', 'Newspaper']]
x = data[['TV', 'Radio']]
y = data['Sales']
print(x)
print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.2)
model = Lasso()
alpha_can = np.logspace(-3, 2, 10)
np.set_printoptions(suppress=True)
print('alpha_can = ', alpha_can)
lasso_model = GridSearchCV(model, param_grid={'alpha': alpha_can}, cv=5)
lasso_model.fit(x_train, y_train)
print('超参数:\n', lasso_model.best_params_)
order = y_test.argsort(axis=0)
y_test = y_test.values[order]
x_test = x_test.values[order, :]
y_hat = lasso_model.predict(x_test)
print(lasso_model.score(x_test, y_test))
mse = np.average((y_hat - np.array(y_test)) ** 2) # Mean Squared Error
rmse = np.sqrt(mse) # Root Mean Squared Error
print(mse, rmse)
t = np.arange(len(x_test))
mpl.rcParams['font.sans-serif'] = ['simHei']
mpl.rcParams['axes.unicode_minus'] = False
plt.figure(facecolor='w')
plt.plot(t, y_test, 'r-', linewidth=2, label='真实数据')
plt.plot(t, y_hat, 'g-', linewidth=2, label='预测数据')
plt.title('线性回归预测销量', fontsize=18)
plt.legend(loc='upper left')
plt.grid(b=True, ls=':')
plt.show()