07wk-32: 아이스크림(교호작용) / 의사결정나무

Author

최규빈

Published

October 17, 2023

1. 강의영상

2. Imports

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import sklearn.linear_model 
import sklearn.tree

3. Data

np.random.seed(43052)
temp = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/temp.csv').iloc[:,3].to_numpy()[:100]
temp.sort()
choco = 40 + temp * 2.0 + np.random.randn(100)*3
vanilla = 60 + temp * 5.0 + np.random.randn(100)*3
df1 = pd.DataFrame({'temp':temp,'sales':choco}).assign(type='choco')
df2 = pd.DataFrame({'temp':temp,'sales':vanilla}).assign(type='vanilla')
df_train = pd.concat([df1,df2])
df_train
temp sales type
0 -4.1 32.950261 choco
1 -3.7 35.852524 choco
2 -3.0 37.428335 choco
3 -1.3 38.323681 choco
4 -0.5 39.713362 choco
... ... ... ...
95 12.4 119.708075 vanilla
96 13.4 129.300464 vanilla
97 14.7 136.596568 vanilla
98 15.0 136.213140 vanilla
99 15.2 135.595252 vanilla

200 rows × 3 columns

plt.plot(df_train.temp[df_train.type=='choco'],df_train.sales[df_train.type=='choco'],'o',label='choco')
plt.plot(df_train.temp[df_train.type=='vanilla'],df_train.sales[df_train.type=='vanilla'],'o',label='vanilla')
plt.legend()

5. 분석

- 분석1: 선형회귀

# step1
X = pd.get_dummies(df_train[['temp','type']],drop_first=True)
y = df_train['sales']
# step2 
predictr = sklearn.linear_model.LinearRegression()
# step3
predictr.fit(X,y)
# step4 
df_train['sales_hat'] = predictr.predict(X)
#---#
f'train score = {predictr.score(X,y):.4f}'
'train score = 0.9250'
plt.plot(df_train.temp[df_train.type=='choco'],df_train.sales[df_train.type=='choco'],'o',alpha=0.2,label='choco')
plt.plot(df_train.temp[df_train.type=='choco'],df_train.sales_hat[df_train.type=='choco'],'--',color='C0')
plt.plot(df_train.temp[df_train.type=='vanilla'],df_train.sales[df_train.type=='vanilla'],'o',alpha=0.2,label='vanilla')
plt.plot(df_train.temp[df_train.type=='vanilla'],df_train.sales_hat[df_train.type=='vanilla'],'--',color='C1')
plt.legend()

- 분석2

# step1
X = pd.get_dummies(df_train[['temp','type']],drop_first=True)
y = df_train['sales']
# step2 
predictr = sklearn.tree.DecisionTreeRegressor()
# step3
predictr.fit(X,y)
# step4 
df_train['sales_hat'] = predictr.predict(X)
#---#
f'train score = {predictr.score(X,y):.4f}'
'train score = 0.9964'
plt.plot(df_train.temp[df_train.type=='choco'],df_train.sales[df_train.type=='choco'],'o',alpha=0.2,label='choco')
plt.plot(df_train.temp[df_train.type=='choco'],df_train.sales_hat[df_train.type=='choco'],'--',color='C0')
plt.plot(df_train.temp[df_train.type=='vanilla'],df_train.sales[df_train.type=='vanilla'],'o',alpha=0.2,label='vanilla')
plt.plot(df_train.temp[df_train.type=='vanilla'],df_train.sales_hat[df_train.type=='vanilla'],'--',color='C1')
plt.legend()

* 오버피팅에 대한 제 개념: 통계에서 “관측치 = 언더라잉 + 랜덤” 으로 볼 수 있다. 모형이 설명해야할 영역은 “언더라잉” 이다. 만약에 모형이 언더라잉을 잘 설명하지 못한다면 언더피팅이고, 주어진 모형이 언더라잉을 넘어 오차항까지 설명하고 있다면 오버피팅이다.

  1. 마음속의 underlying 을 간직한다 – 애매하죠?
  2. 그 underlying 보다 잘 맞추면 오버피팅이다.
  3. 내 마음속의 underlying 제대로 학습못하고 있다고 판단되면 모형미스 혹은 언더피팅이다.

이러한 논리로 인하면 위의 의사결정나무로 적합된 결과는 오버피팅이다. (그렇지만 언더피팅보단 나을지도?)