13wk-54: Weight Loss (Interaction) / Data Analysis (AutoGluon)

Author

최규빈

Published

December 1, 2023

1. Lecture video

2. Imports

#!pip install autogluon.eda
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')

3. Data

df_train = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/weightloss.csv')
df_train
Supplement Exercise Weight_Loss
0 False False -0.877103
1 True False 1.604542
2 True True 13.824148
3 True True 13.004505
4 True True 13.701128
... ... ... ...
9995 True False 1.558841
9996 False False -0.217816
9997 False True 4.072701
9998 True False -0.253796
9999 False False -1.399092

10000 rows × 3 columns
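Since the title of this post refers to an interaction, a quick look at the mean of Weight_Loss in each Supplement x Exercise cell can be helpful before fitting. The sketch below uses only the ten rows visible in the preview above, not the full file, so the numbers are purely illustrative; the name `preview` is introduced here for the example.

```python
import pandas as pd

# Rows copied from the df_train preview above (NOT the full 10,000-row file).
rows = [
    (False, False, -0.877103),
    (True, False, 1.604542),
    (True, True, 13.824148),
    (True, True, 13.004505),
    (True, True, 13.701128),
    (True, False, 1.558841),
    (False, False, -0.217816),
    (False, True, 4.072701),
    (True, False, -0.253796),
    (False, False, -1.399092),
]
preview = pd.DataFrame(rows, columns=["Supplement", "Exercise", "Weight_Loss"])

# Mean weight loss in each of the four (Supplement, Exercise) cells.
grouped = preview.groupby(["Supplement", "Exercise"])["Weight_Loss"].mean()
print(grouped.unstack("Exercise"))

# Interaction contrast: the Supplement effect among exercisers
# minus the Supplement effect among non-exercisers.
m = grouped.to_dict()  # keys are (Supplement, Exercise) tuples
effect_with_exercise = m[(True, True)] - m[(False, True)]
effect_without_exercise = m[(True, False)] - m[(False, False)]
interaction = effect_with_exercise - effect_without_exercise
print(round(interaction, 3))
```

If the two effects were simply additive, the contrast would be close to zero; a value far from zero suggests that the supplement helps much more when combined with exercise. On the full data the same code should work with `df_train` in place of `preview`.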

4. Fitting

# step1 -- pass
# step2
predictr = TabularPredictor(label='Weight_Loss')
# step3
predictr.fit(df_train)
# step4
yhat = predictr.predict(df_train)
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_113248/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231201_113248/"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail:   248.33 GB / 490.57 GB (50.6%)
Train Data Rows:    10000
Train Data Columns: 2
Label Column: Weight_Loss
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
    Label info (max, min, mean, stddev): (18.725299456466026, -3.4848875790233675, 5.11908, 6.09267)
    If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    125414.99 MB
    Train Data (Original)  Memory Usage: 0.02 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('bool', []) : 2 | ['Supplement', 'Exercise']
    Types of features in processed data (raw dtype, special dtypes):
        ('int', ['bool']) : 2 | ['Supplement', 'Exercise']
    0.0s = Fit runtime
    2 features in original data used to generate 2 features in processed data.
    Train Data (Processed) Memory Usage: 0.02 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 9000, Val Rows: 1000
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
    No valid features to train KNeighborsUnif... Skipping this model.
Fitting model: KNeighborsDist ...
    No valid features to train KNeighborsDist... Skipping this model.
Fitting model: LightGBMXT ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.49s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.24s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: RandomForestMSE ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.33s    = Training   runtime
    0.12s    = Validation runtime
Fitting model: CatBoost ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.36s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: ExtraTreesMSE ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.33s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: NeuralNetFastAI ...
    -1.0087  = Validation score   (-root_mean_squared_error)
    10.49s   = Training   runtime
    0.02s    = Validation runtime
Fitting model: XGBoost ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.14s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: NeuralNetTorch ...
    -1.01    = Validation score   (-root_mean_squared_error)
    26.12s   = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBMLarge ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.26s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    -1.0085  = Validation score   (-root_mean_squared_error)
    0.19s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 39.3s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231201_113248/")
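One plausible reading of the log above: with only two boolean features there are just four distinct inputs, so the best any model can do is predict one number per cell, and the validation RMSE settles near the noise level. The sketch below illustrates this on synthetic data. The cell means and the unit-variance noise are assumptions made for the example, not the actual data-generating process, and the cell-mean lookup is a stand-in, not an AutoGluon model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process (an assumption for illustration):
# one mean per Supplement x Exercise cell plus Gaussian noise with std 1.
supplement = rng.random(n) < 0.5
exercise = rng.random(n) < 0.5
cell_mean = np.where(
    supplement & exercise, 15.0,
    np.where(exercise, 5.0, np.where(supplement, 0.5, 0.0)),
)
toy = pd.DataFrame({
    "Supplement": supplement,
    "Exercise": exercise,
    "Weight_Loss": cell_mean + rng.normal(0.0, 1.0, n),
})

# Hold out 10% for validation, mirroring holdout_frac=0.1 in the log.
val = toy.sample(frac=0.1, random_state=0)
train = toy.drop(val.index)

# Stand-in model: predict the training mean of the matching cell.
means = (
    train.groupby(["Supplement", "Exercise"])["Weight_Loss"]
    .mean()
    .rename("pred")
    .reset_index()
)
val_pred = val.merge(means, on=["Supplement", "Exercise"], how="left")

rmse = float(np.sqrt(np.mean((val_pred["Weight_Loss"] - val_pred["pred"]) ** 2)))
score = -rmse  # sign-flipped so that higher is better, as in the log
print(rmse, score)
```

Under these assumptions the RMSE lands close to the noise standard deviation, which would explain why most models in the log report nearly the same validation score.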

5. Interpretation and visualization

A. Distribution of y and visualization of the (X, y) relationship

auto.target_analysis(
    train_data=df_train,
    label='Weight_Loss',
    fit_distributions=False
)

Target variable analysis

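Weight_Loss summary (cells left empty in the original output are marked "(blank)"):
count: 10000, mean: 5.11908, std: 6.092669, min: -3.484888, 25%: 0.277797, 50%: 2.683447, 75%: 12.075458, max: 18.725299
dtypes: float64, unique: 10000, missing_count: (blank), missing_ratio: (blank), raw_type: float, special_types: (blank)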

Target variable correlations

train_data - spearman correlation matrix; focus: absolute correlation for Weight_Loss >= 0.5

Feature interaction between Exercise/Weight_Loss in train_data
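The Spearman correlation reported above is the Pearson correlation of the ranks, so it can be reproduced with plain pandas. The sketch below uses the ten rows from the df_train preview rather than the full data, so the values will differ from those AutoGluon computes; `preview` and `strong` are names introduced for this example.

```python
import pandas as pd

# Rows copied from the df_train preview (NOT the full 10,000-row file).
preview = pd.DataFrame(
    {
        "Supplement": [False, True, True, True, True, True, False, False, True, False],
        "Exercise": [False, False, True, True, True, False, False, True, False, False],
        "Weight_Loss": [-0.877103, 1.604542, 13.824148, 13.004505, 13.701128,
                        1.558841, -0.217816, 4.072701, -0.253796, -1.399092],
    }
)

# Spearman correlation = Pearson correlation computed on (average) ranks.
ranks = preview.astype(float).rank()
spearman = ranks.corr(method="pearson")

# Keep only features whose absolute correlation with the label is >= 0.5,
# the same focus threshold mentioned in the output above.
target_corr = spearman["Weight_Loss"].drop("Weight_Loss")
strong = target_corr[target_corr.abs() >= 0.5]
print(strong)
```

Because both features are boolean, the rank correlation here mostly reflects how cleanly each feature separates high from low values of Weight_Loss.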

B. Important explanatory variables

auto.quick_fit(
    train_data=df_train,
    label='Weight_Loss',
    show_feature_importance_barplots=True
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_113551/"

Model Prediction for Weight_Loss

Using validation data for Test points

Model Leaderboard

model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBMXT -1.013205 -0.973093 0.011518 0.002616 0.246242 0.011518 0.002616 0.246242 1 True 1

Feature Importance for Trained Model

importance stddev p_value n p99_high p99_low
Exercise 6.735479 0.105037 7.094759e-09 5 6.951752 6.519206
Supplement 4.018616 0.073537 1.344948e-08 5 4.170030 3.867202
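Importance scores of this kind are commonly obtained by permutation: shuffle one column, re-score the model, and record how much the error grows. The sketch below shows the idea on synthetic data with a cell-mean lookup as a stand-in model. The cell means, the noise level, and the helper names (`fit_cell_means`, `predict`, `rmse`) are all assumptions for illustration, and the five repetitions are chosen to mirror what the n column above appears to report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data (assumption): four cell means plus unit-variance noise.
X = pd.DataFrame({
    "Supplement": rng.random(n) < 0.5,
    "Exercise": rng.random(n) < 0.5,
})
mean = np.where(
    X["Supplement"] & X["Exercise"], 15.0,
    np.where(X["Exercise"], 5.0, np.where(X["Supplement"], 0.5, 0.0)),
)
y = mean + rng.normal(0.0, 1.0, n)


def fit_cell_means(X, y):
    """Stand-in model: mean of y per (Supplement, Exercise) cell."""
    return X.assign(y=y).groupby(["Supplement", "Exercise"])["y"].mean().to_dict()


def predict(table, X):
    """Look up the cell mean for every row."""
    return np.array([table[(s, e)] for s, e in zip(X["Supplement"], X["Exercise"])])


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


table = fit_cell_means(X, y)
base = rmse(y, predict(table, X))

importance = {}
for col in X.columns:
    increases = []
    for _ in range(5):  # five shuffles per feature
        shuffled = X.copy()
        shuffled[col] = rng.permutation(shuffled[col].to_numpy())
        increases.append(rmse(y, predict(table, shuffled)) - base)
    importance[col] = float(np.mean(increases))

print(importance)
```

Under these assumed cell means, shuffling Exercise hurts more than shuffling Supplement, which matches the ordering in the table above, although the magnitudes are not comparable.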

Rows with the highest prediction error

Rows in this category worth inspecting for the causes of the error

Supplement Exercise Weight_Loss Weight_Loss_pred error
4639 True False 4.358143 0.484748 3.873395
1683 True True 18.432093 14.966088 3.466005
1275 False True 1.693150 5.012056 3.318906
2631 False True 8.328789 5.012056 3.316733
5334 True False 3.769029 0.484748 3.284281
3437 False False -3.205225 0.039978 3.245204
2761 False True 1.853100 5.012056 3.158956
9675 True True 18.070419 14.966088 3.104331
3161 False False -3.020254 0.039978 3.060232
4637 False True 2.077349 5.012056 2.934706
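The error column above is the absolute difference between the observed and predicted Weight_Loss, and the table is sorted by it in descending order. A minimal sketch with four rows copied from that table (the frame name `rows` is introduced here):

```python
import pandas as pd

# Four rows copied from the table above; the index is the original row label.
rows = pd.DataFrame(
    {
        "Weight_Loss": [1.693150, 4.358143, -3.205225, 18.432093],
        "Weight_Loss_pred": [5.012056, 0.484748, 0.039978, 14.966088],
    },
    index=[1275, 4639, 3437, 1683],
)

# Absolute prediction error, then sort so the worst rows come first.
rows["error"] = (rows["Weight_Loss"] - rows["Weight_Loss_pred"]).abs()
worst = rows.sort_values("error", ascending=False)
print(worst)
```

Note that only four distinct predicted values appear in the table, one per combination of Supplement and Exercise, so the largest errors are simply the observations furthest from their cell's prediction.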

C. Per-observation interpretation

- Observation 2

df_train.iloc[[2]]
Supplement Exercise Weight_Loss
2 True True 13.824148
predictr.predict(df_train.iloc[[2]])
2    15.005399
Name: Weight_Loss, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[2]]*1,  # *1 casts the boolean columns to 0/1 integers
    display_rows=True,
    plot='waterfall'
)
Supplement Exercise Weight_Loss
2 1.0 1.0 13.824148

- Observation 5

df_train.iloc[[5]]
Supplement Exercise Weight_Loss
5 True False -0.379401
predictr.predict(df_train.iloc[[5]])
5    0.583611
Name: Weight_Loss, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[5]]*1,
    display_rows=True,
    plot='waterfall'
)
Supplement Exercise Weight_Loss
5 1.0 0.0 -0.379401

- Observation 10

df_train.iloc[[10]]
Supplement Exercise Weight_Loss
10 False True 4.356397
predictr.predict(df_train.iloc[[10]])
10    4.998917
Name: Weight_Loss, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[10]]*1,
    display_rows=True,
    plot='waterfall'
)
Supplement Exercise Weight_Loss
10 0.0 1.0 4.356397
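Waterfall plots like the ones above are typically built from Shapley values, which split the gap between a baseline prediction and the prediction for one row into per-feature contributions. With only two features these can be computed exactly by hand. The sketch below uses the four cell predictions that appear in the quick_fit error table (from that LightGBMXT model, not from `predictr`) and assumes, purely for illustration, that the four cells are equally likely in the background data; the helper names `value` and `shapley` are hypothetical.

```python
from itertools import permutations

# Predictions per (Supplement, Exercise) cell, read off the quick_fit table.
f = {
    (0, 0): 0.039978,
    (1, 0): 0.484748,
    (0, 1): 5.012056,
    (1, 1): 14.966088,
}
# Assumption: all four cells are equally likely in the background data.
background = list(f.keys())


def value(x, known):
    """Mean prediction with features in `known` fixed to x, others from the background."""
    total = 0.0
    for b in background:
        z = tuple(x[i] if i in known else b[i] for i in range(2))
        total += f[z]
    return total / len(background)


def shapley(x):
    """Exact Shapley values: average marginal contribution over feature orderings."""
    phi = [0.0, 0.0]
    orders = list(permutations(range(2)))
    for order in orders:
        known = set()
        for i in order:
            before = value(x, known)
            known.add(i)
            phi[i] += (value(x, known) - before) / len(orders)
    return phi


x = (1, 1)  # Supplement=True, Exercise=True, as in observation 2
base = value(x, set())
phi_supplement, phi_exercise = shapley(x)
print(base, phi_supplement, phi_exercise)
```

By construction the baseline plus the two contributions adds up to the prediction for the row, which is the property a waterfall plot visualizes. The actual plots depend on the background sample and explainer that AutoGluon uses, so the numbers here will not match them exactly.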