07wk-29: Weight Loss (Interaction) / Regression Analysis – Additional Notes

Author

최규빈

Published

October 17, 2023

1. Lecture Video

2. Imports

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import sklearn.linear_model 
import sklearn.tree
import sklearn.model_selection

3. Data

# n = 10000
# Supplement = np.random.choice([True, False], n)
# Exercise = np.random.choice([False, True], n)
# Weight_Loss = np.where(
#     (~Supplement & (~Exercise)),
#     np.random.normal(loc=0, scale=1, size=n),  
#     np.where(
#         (Supplement & (Exercise)),
#         np.random.normal(loc=15.00, scale=1, size=n),
#         np.where(
#             (~Supplement & (Exercise)),
#             np.random.normal(loc=5.00, scale=1, size=n),
#             np.random.normal(loc=0.5, scale=1, size=n)
#         )
#     )
# )
# df = pd.DataFrame({
#     'Supplement': Supplement,
#     'Exercise': Exercise,
#     'Weight_Loss': Weight_Loss
# })
df_train = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/weightloss.csv')

df_train

	Supplement	Exercise	Weight_Loss
0	False	False	-0.877103
1	True	False	1.604542
2	True	True	13.824148
3	True	True	13.004505
4	True	True	13.701128
...	...	...	...
9995	True	False	1.558841
9996	False	False	-0.217816
9997	False	True	4.072701
9998	True	False	-0.253796
9999	False	False	-1.399092

10000 rows × 3 columns

df_train.pivot_table(index='Supplement',columns='Exercise',values='Weight_Loss')

Exercise	False	True
Supplement
False	0.021673	4.991314
True	0.497573	14.966363

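- Combining exercise with the weight-loss supplement appears to produce a synergy: the group doing both loses about 15 on average, far more than the roughly 5 (exercise only) and 0.5 (supplement only) would add up to.

4. Analysis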

- Analysis 1: specify the model as follows (underfitting).

\({\bf X}\): Supplement, Exercise
\({\bf y}\): Weight_Loss

# step1
X = df_train[['Supplement','Exercise']]
y = df_train['Weight_Loss']
# step2 
predictr = sklearn.linear_model.LinearRegression()
# step3
predictr.fit(X,y)
# step4 
df_train['Weight_Loss_hat'] = predictr.predict(X)
#---#
print(f'train score = {predictr.score(X,y):.4f}')

train score = 0.8208

df_train.pivot_table(index='Supplement',columns='Exercise',values='Weight_Loss')

Exercise	False	True
Supplement
False	0.021673	4.991314
True	0.497573	14.966363

df_train.pivot_table(index='Supplement',columns='Exercise',values='Weight_Loss_hat')

Exercise	False	True
Supplement
False	-2.373106	7.374557
True	2.845934	12.593598

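The model estimates that exercising yields a weight loss of roughly 10 kg (about 9.7 according to the fitted values).
It estimates that taking the supplement yields a weight loss of roughly 5 kg (about 5.2).
In other words, of all the ways to choose these two numbers, roughly (10, 5) is the best fit an additive model can achieve, even though it matches none of the four group means.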
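The claim that roughly (10, 5) is the best an additive model can do can be checked with a minimal sketch. It assumes the four groups are equally sized and uses the population means from the commented-out generator in the Data section (0, 0.5, 5, 15) rather than the actual CSV; the names `cell_means` and `A` are introduced here for illustration only.

```python
import numpy as np

# Population cell means from the commented-out generator, ordered as
# (Supplement, Exercise) = (0,0), (1,0), (0,1), (1,1).
cell_means = np.array([0.0, 0.5, 5.0, 15.0])

# Design matrix of the additive model: intercept, Supplement, Exercise.
# Assumption: equally sized groups, so each cell contributes one row.
A = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
], dtype=float)

# Least-squares fit of the additive model to the four cell means
coef, *_ = np.linalg.lstsq(A, cell_means, rcond=None)
fitted = A @ coef
residuals = cell_means - fitted

print("intercept, Supplement, Exercise:", coef.round(3))
print("fitted cell means:", fitted.round(3))
print("residuals:", residuals.round(3))
```

Under these assumptions the least-squares solution is an intercept of -2.375, a supplement effect of 5.25, and an exercise effect of 9.75, which is close to the fitted table above. Every group is missed by 2.375 in absolute value: the additive model spreads the unmodeled synergy evenly over the four cells.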

- Analysis 2: specify the model as follows (fits exactly).

\({\bf X}\): Supplement, Exercise, Supplement \(\times\) Exercise
\({\bf y}\): Weight_Loss

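Note: the baseline effects of the supplement and of exercise are captured by Supplement and Exercise respectively, and the synergy between exercise and the supplement is captured by Supplement\(\times\)Exercise.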
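Written out as an equation, the model in Analysis 2 can be sketched as follows. The \(\beta\) symbols, the error term \(\epsilon\), and the 0/1 coding of the two variables are notation assumed here for illustration; they do not appear in the lecture code.

\[
\text{Weight\_Loss} = \beta_0 + \beta_1\,\text{Supplement} + \beta_2\,\text{Exercise} + \beta_3\,(\text{Supplement}\times\text{Exercise}) + \epsilon
\]

With this coding, the expected weight loss is \(\beta_0\) with neither, \(\beta_0+\beta_1\) with the supplement only, \(\beta_0+\beta_2\) with exercise only, and \(\beta_0+\beta_1+\beta_2+\beta_3\) with both, so \(\beta_3\) plays the role of the synergy.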

# step1 
X = df_train.eval('Interaction = Supplement * Exercise')[['Supplement','Exercise','Interaction']]
y = df_train['Weight_Loss']
# step2 
predictr = sklearn.linear_model.LinearRegression()
# step3 
predictr.fit(X,y)
# step4
df_train['Weight_Loss_hat'] = predictr.predict(X)
#---#
print(f'train score = {predictr.score(X,y):.4f}')

train score = 0.9728

df_train.pivot_table(index='Supplement',columns='Exercise',values='Weight_Loss')

Exercise	False	True
Supplement
False	0.021673	4.991314
True	0.497573	14.966363

df_train.pivot_table(index='Supplement',columns='Exercise',values='Weight_Loss_hat')

Exercise	False	True
Supplement
False	0.021673	4.991314
True	0.497573	14.966363

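The model estimates that exercise alone yields a weight loss of about 5 kg.
It estimates that the supplement alone yields a weight loss of about 0.5 kg.
On top of that, it estimates an additional synergy effect of about 9.5 kg when exercise and the supplement are combined.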
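As a rough check of these three numbers, the sketch below solves for the four coefficients directly from the group means in the pivot table. It relies on the fact that a model with an intercept, two main effects, and their interaction has exactly as many parameters as there are groups, so it reproduces the four means exactly. `cell_means` and `A` are illustrative names, not part of the lecture code.

```python
import numpy as np

# Observed group means from the pivot table, ordered as
# (Supplement, Exercise) = (0,0), (1,0), (0,1), (1,1).
cell_means = np.array([0.021673, 0.497573, 4.991314, 14.966363])

# Columns: intercept, Supplement, Exercise, Supplement*Exercise
A = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=float)

# Four equations, four unknowns: the system has a unique exact solution
coef = np.linalg.solve(A, cell_means)
intercept, b_supplement, b_exercise, b_interaction = coef

print(f"intercept   = {intercept:.4f}")
print(f"Supplement  = {b_supplement:.4f}")
print(f"Exercise    = {b_exercise:.4f}")
print(f"Interaction = {b_interaction:.4f}")
```

The solution is approximately 0.02 (intercept), 0.48 (Supplement), 4.97 (Exercise), and 9.50 (Interaction), matching the interpretation above.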

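2023-10-24 Additional Notes

What if taking the supplement without exercising causes a side effect? (This is also a kind of interaction.)

In that case it is difficult to handle the situation by simply fitting the model above. (That model is designed to account only for the extra effect that arises in the “exercise O / supplement O” case.)
So in this situation it is better to combine (exercise, supplement) into a single new categorical variable and one-hot encode that variable. (Dropping the last dummy variable is preferable, but in Python nothing serious happens if you keep it.)
Strictly speaking, the analysis should also include the step of building every combined category from (exercise, supplement) and then removing the categories that turn out to be unnecessary. (One can drop them by looking at \(p\)-values, or use some other method.)
But what if there are three categorical variables? Honestly, working through all of these considerations is tedious, so in that case it is easier to use a tree-based model. (Or else pray that there are no interactions.)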
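The combined-category approach can be sketched as follows. The data here is simulated, not the lecture's CSV: the group means, including the side effect of -3 for "supplement without exercise", are hypothetical values chosen for illustration, and the column name `Group` is likewise an assumption. Note that `drop_first=True` in `pd.get_dummies` drops the first level rather than the last, which serves the same purpose.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model

rng = np.random.default_rng(0)
n = 4000
s = rng.choice([True, False], n)  # Supplement
e = rng.choice([True, False], n)  # Exercise

# Hypothetical group means: supplement without exercise has a side effect (-3)
mu = np.select([~s & ~e, s & ~e, ~s & e, s & e], [0.0, -3.0, 5.0, 15.0])
df = pd.DataFrame({'Supplement': s, 'Exercise': e,
                   'Weight_Loss': mu + rng.normal(0, 1, n)})

# Combine the two variables into one categorical variable with four levels
df['Group'] = ('S' + df['Supplement'].astype(int).astype(str)
               + '_E' + df['Exercise'].astype(int).astype(str))

# One-hot encode the combined variable, dropping one dummy as the baseline
X = pd.get_dummies(df['Group'], drop_first=True)
y = df['Weight_Loss']

predictr = sklearn.linear_model.LinearRegression()
predictr.fit(X, y)
df['Weight_Loss_hat'] = predictr.predict(X)

# Observed and fitted means per group should coincide
means = df.groupby('Group')[['Weight_Loss', 'Weight_Loss_hat']].mean()
print(means)
```

Because each group gets its own coefficient, the fitted mean of every group equals its observed mean, whatever pattern the four groups follow.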
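To illustrate why a tree-based model is convenient here, the sketch below compares a plain additive linear regression with a decision tree on simulated data with three binary variables. The third variable `Diet` and all effect sizes are hypothetical, chosen only to create two-way and three-way interactions; none of this comes from the lecture's dataset.

```python
import numpy as np
import pandas as pd
import sklearn.linear_model
import sklearn.tree

rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    'Supplement': rng.choice([True, False], n),
    'Exercise': rng.choice([True, False], n),
    'Diet': rng.choice([True, False], n),  # hypothetical third variable
})
s = df['Supplement'].to_numpy().astype(float)
e = df['Exercise'].to_numpy().astype(float)
d = df['Diet'].to_numpy().astype(float)

# Hypothetical effects, including a two-way and a three-way interaction
mu = 0.5 * s + 5 * e + 2 * d + 9.5 * s * e - 4 * s * (1 - e) * d
df['Weight_Loss'] = mu + rng.normal(0, 1, n)

# Neither model is given any hand-made interaction columns
X = df[['Supplement', 'Exercise', 'Diet']]
y = df['Weight_Loss']

linear = sklearn.linear_model.LinearRegression().fit(X, y)
tree = sklearn.tree.DecisionTreeRegressor().fit(X, y)

linear_score = linear.score(X, y)
tree_score = tree.score(X, y)
print(f'linear train score = {linear_score:.4f}')
print(f'tree   train score = {tree_score:.4f}')
print(f'number of leaves   = {tree.get_n_leaves()}')
```

With three binary inputs the tree can have at most eight leaves, one per combination, so it picks up every interaction without any extra feature engineering. The scores here are training scores only; a held-out split would be needed to judge generalization.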