03wk-11: Medical Cost / Regression Analysis

Author

최규빈

Published

September 21, 2023

1. Lecture video

2. Import

import numpy as np
import pandas as pd 
import sklearn.linear_model

3. Loading the data

- Download the Medical Cost Personal Datasets from Kaggle

https://www.kaggle.com/datasets/mirichoi0218/insurance

- Load the data

df_train = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/insurance.csv')
df_train

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520
...	...	...	...	...	...	...	...
1333	50	male	30.970	3	no	northwest	10600.54830
1334	18	female	31.920	0	no	northeast	2205.98080
1335	18	female	36.850	0	no	southeast	1629.83350
1336	21	female	25.800	0	no	southwest	2007.94500
1337	61	female	29.070	0	yes	northwest	29141.36030

1338 rows × 7 columns

4. Analysis

A. Preparing the data

df_train.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

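# Features: one-hot encode sex, smoker, and region; numeric columns pass through unchanged
X = pd.get_dummies(df_train.drop(['charges'],axis=1))
# Target: kept as a one-column DataFrame
y = df_train[['charges']]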

	age	bmi	children	sex_female	sex_male	smoker_no	smoker_yes	region_northeast	region_northwest	region_southeast	region_southwest
0	19	27.900	0	True	False	False	True	False	False	False	True
1	18	33.770	1	False	True	True	False	False	False	True	False
2	28	33.000	3	False	True	True	False	False	False	True	False
3	33	22.705	0	False	True	True	False	False	True	False	False
4	32	28.880	0	False	True	True	False	False	True	False	False
...	...	...	...	...	...	...	...	...	...	...	...
1333	50	30.970	3	False	True	True	False	False	True	False	False
1334	18	31.920	0	True	False	True	False	True	False	False	False
1335	18	36.850	0	True	False	True	False	False	False	True	False
1336	21	25.800	0	True	False	True	False	False	False	False	True
1337	61	29.070	0	True	False	False	True	False	True	False	False

1338 rows × 11 columns

	charges
0	16884.92400
1	1725.55230
2	4449.46200
3	21984.47061
4	3866.85520
...	...
1333	10600.54830
1334	2205.98080
1335	1629.83350
1336	2007.94500
1337	29141.36030

1338 rows × 1 columns
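
To make the encoding step concrete, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not part of the insurance dataset). `pd.get_dummies` leaves numeric columns alone and expands each string column into one indicator column per category. Within each group the indicators always sum to 1, so one column per group is redundant. `drop_first=True` is one optional way to drop it, although this post keeps all columns.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'smoker': ['yes', 'no', 'no'],
})

full = pd.get_dummies(toy)                       # one indicator column per category
reduced = pd.get_dummies(toy, drop_first=True)   # drops the first category of each group

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced encoding, the dropped category (here `female` and `no`) acts as the reference level that the remaining indicators are compared against.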

B. Creating the predictor

predictr = sklearn.linear_model.LinearRegression()

C. Training

predictr.fit(X,y)

LinearRegression()


D. Prediction

df_train.assign(yhat = predictr.predict(X))

	age	sex	bmi	children	smoker	region	charges	yhat
0	19	female	27.900	0	yes	southwest	16884.92400	25293.713028
1	18	male	33.770	1	no	southeast	1725.55230	3448.602834
2	28	male	33.000	3	no	southeast	4449.46200	6706.988491
3	33	male	22.705	0	no	northwest	21984.47061	3754.830163
4	32	male	28.880	0	no	northwest	3866.85520	5592.493386
...	...	...	...	...	...	...	...	...
1333	50	male	30.970	3	no	northwest	10600.54830	12351.323686
1334	18	female	31.920	0	no	northeast	2205.98080	3511.930809
1335	18	female	36.850	0	no	southeast	1629.83350	4149.132486
1336	21	female	25.800	0	no	southwest	2007.94500	1246.584939
1337	61	female	29.070	0	yes	northwest	29141.36030	37085.623268

1338 rows × 8 columns

E. Evaluation

predictr.score(X,y) # R^2

0.7509130345985207

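As a rule of thumb, an R^2 of 0.7 or higher means the model is not a total failure: it may be unsuitable for a competition, but it is roughly usable. Note that this score is computed on the same data the model was trained on.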
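
For reference, the R^2 that `score` reports for a regressor is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that arithmetic, using hypothetical values (`y_true` and `y_pred` are made up for illustration and are not the insurance data):

```python
import numpy as np

# Hypothetical observed values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot
print(r2)
```

A value of 1 means the predictions match exactly, and a value of 0 means the model does no better than always predicting the mean of y.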

5. Interpreting the coefficients

- Interpreting the intercept

predictr.intercept_

array([-666.93771994])

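Taken literally, this means the baseline insurance charge is about -667. The intercept is the prediction when every column of X is 0, which no row in this data satisfies (each row has exactly one sex, smoker, and region indicator set), so the value has no direct real-world meaning.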
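
The reading of the intercept as "the prediction at an all-zero input" can be checked on a small example. This is a sketch on hypothetical data generated from y = 1 + 2x (`X_toy`, `y_toy`, and `model` are illustrative names, not part of the original analysis):

```python
import numpy as np
import sklearn.linear_model

# Hypothetical data following y = 1 + 2*x exactly, for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

model = sklearn.linear_model.LinearRegression()
model.fit(X_toy, y_toy)

# The intercept equals the prediction for an all-zero input row
pred_at_zero = model.predict(np.zeros((1, 1)))[0]
print(model.intercept_, pred_at_zero)
```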

- Interpreting the coefficients

pd.DataFrame({'name':list(X.columns), 'coef':predictr.coef_.reshape(-1)})

	name	coef
0	age	256.856353
1	bmi	339.193454
2	children	475.500545
3	sex_female	65.657180
4	sex_male	-65.657180
5	smoker_no	-11924.267271
6	smoker_yes	11924.267271
7	region_northeast	587.009235
8	region_northwest	234.045336
9	region_southeast	-448.012814
10	region_southwest	-373.041756

The region coefficients are hard to interpret, but the rest look quite plausible.
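
As a sanity check on how these numbers combine, the sketch below rebuilds the prediction for row 0 (age 19, bmi 27.9, no children, female, smoker, southwest) by hand from the intercept and coefficients printed above. The values are copied as displayed, so the result can differ from `yhat` in the last few decimal places. It also computes the smoker gap: because the indicator columns in each group are redundant with the intercept, the difference between paired coefficients is likely the more meaningful quantity than either coefficient alone.

```python
# Intercept and coefficients copied from the outputs above (rounded as displayed)
intercept = -666.93771994
coef = {
    'age': 256.856353, 'bmi': 339.193454, 'children': 475.500545,
    'sex_female': 65.657180, 'sex_male': -65.657180,
    'smoker_no': -11924.267271, 'smoker_yes': 11924.267271,
    'region_northeast': 587.009235, 'region_northwest': 234.045336,
    'region_southeast': -448.012814, 'region_southwest': -373.041756,
}

# Row 0 of X; indicator columns that are False contribute nothing and are omitted
row0 = {'age': 19, 'bmi': 27.9, 'children': 0,
        'sex_female': 1, 'smoker_yes': 1, 'region_southwest': 1}

# Prediction = intercept + sum of (coefficient * feature value)
yhat0 = intercept + sum(coef[k] * v for k, v in row0.items())

# Predicted difference in charges between a smoker and an otherwise identical non-smoker
smoker_gap = coef['smoker_yes'] - coef['smoker_no']

print(round(yhat0, 2), round(smoker_gap, 2))
```

The hand-computed value agrees with the `yhat` of about 25293.71 shown for row 0 in the prediction table.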