13wk-54: Weight Loss (Interaction) / Data Analysis (AutoGluon)

Author

최규빈

Published

December 1, 2023

1. Lecture video

2. Imports

#!pip install autogluon.eda

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')

3. Data

df_train = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/weightloss.csv')
df_train

	Supplement	Exercise	Weight_Loss
0	False	False	-0.877103
1	True	False	1.604542
2	True	True	13.824148
3	True	True	13.004505
4	True	True	13.701128
...	...	...	...
9995	True	False	1.558841
9996	False	False	-0.217816
9997	False	True	4.072701
9998	True	False	-0.253796
9999	False	False	-1.399092

10000 rows × 3 columns
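The title mentions an interaction between the two treatments. As a rough check, one can compare the mean of Weight_Loss in each (Supplement, Exercise) cell. The sketch below uses only the ten rows visible in the printout above (not the full 10,000-row file), so the numbers are illustrative only; the names `cell_means` and `interaction` are introduced here for illustration.

```python
import pandas as pd

# Only the ten rows visible in the printed table above (rows 0-4 and 9995-9999).
visible_rows = [
    (False, False, -0.877103),
    (True,  False,  1.604542),
    (True,  True,  13.824148),
    (True,  True,  13.004505),
    (True,  True,  13.701128),
    (True,  False,  1.558841),
    (False, False, -0.217816),
    (False, True,   4.072701),
    (True,  False, -0.253796),
    (False, False, -1.399092),
]
df_small = pd.DataFrame(visible_rows, columns=['Supplement', 'Exercise', 'Weight_Loss'])

# Mean weight loss per (Supplement, Exercise) cell, keyed by a tuple of booleans.
cell_means = df_small.groupby(['Supplement', 'Exercise'])['Weight_Loss'].mean().to_dict()

# Interaction contrast: the supplement effect with exercise
# minus the supplement effect without exercise.
interaction = (
    (cell_means[(True, True)] - cell_means[(False, True)])
    - (cell_means[(True, False)] - cell_means[(False, False)])
)
print(cell_means)
print(interaction)
```

If the two effects were purely additive, this contrast would be close to zero; a clearly positive value suggests that the supplement helps much more when combined with exercise. The same `groupby` call can be applied to `df_train` to use all rows.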

4. Fitting

# step1 -- pass
# step2
predictr = TabularPredictor(label='Weight_Loss')
# step3
predictr.fit(df_train)
# step4
yhat = predictr.predict(df_train)

No path specified. Models will be saved in: "AutogluonModels/ag-20231201_113248/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231201_113248/"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail:   248.33 GB / 490.57 GB (50.6%)
Train Data Rows:    10000
Train Data Columns: 2
Label Column: Weight_Loss
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
    Label info (max, min, mean, stddev): (18.725299456466026, -3.4848875790233675, 5.11908, 6.09267)
    If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    125414.99 MB
    Train Data (Original)  Memory Usage: 0.02 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('bool', []) : 2 | ['Supplement', 'Exercise']
    Types of features in processed data (raw dtype, special dtypes):
        ('int', ['bool']) : 2 | ['Supplement', 'Exercise']
    0.0s = Fit runtime
    2 features in original data used to generate 2 features in processed data.
    Train Data (Processed) Memory Usage: 0.02 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 9000, Val Rows: 1000
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
    No valid features to train KNeighborsUnif... Skipping this model.
Fitting model: KNeighborsDist ...
    No valid features to train KNeighborsDist... Skipping this model.
Fitting model: LightGBMXT ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.49s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.24s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: RandomForestMSE ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.33s    = Training   runtime
    0.12s    = Validation runtime
Fitting model: CatBoost ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.36s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: ExtraTreesMSE ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.33s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: NeuralNetFastAI ...
    -1.0087  = Validation score   (-root_mean_squared_error)
    10.49s   = Training   runtime
    0.02s    = Validation runtime
Fitting model: XGBoost ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.14s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: NeuralNetTorch ...
    -1.01    = Validation score   (-root_mean_squared_error)
    26.12s   = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBMLarge ...
    -1.0098  = Validation score   (-root_mean_squared_error)
    0.26s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    -1.0085  = Validation score   (-root_mean_squared_error)
    0.19s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 39.3s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231201_113248/")
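The log above reports each validation score as `-root_mean_squared_error`, because the metric's sign is flipped so that higher is better. A minimal sketch of that convention, using the actual and predicted values for rows 2, 5, and 10 that are printed later in this post (three rows only, so the result is not comparable to the validation scores in the log):

```python
import numpy as np

# Actual Weight_Loss and predictr's predictions for rows 2, 5, 10,
# copied from the per-observation outputs printed later in this post.
y_true = np.array([13.824148, -0.379401, 4.356397])
y_pred = np.array([15.005399, 0.583611, 4.998917])

# Root mean squared error: lower is better.
rmse_value = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Sign-flipped score as shown in the log: higher is better.
score = -rmse_value
print(rmse_value, score)
```

Multiplying a logged score by -1 recovers the RMSE, so a score of -1.0085 corresponds to an RMSE of 1.0085.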

5. Interpretation and visualization

A. Distribution of y and visualization of the (X, y) relationship

auto.target_analysis(
    train_data=df_train,
    label='Weight_Loss',
    fit_distributions=False
)

Target variable analysis

	count	mean	std	min	25%	50%	75%	max	dtypes	unique	missing_count	missing_ratio	raw_type	special_types
Weight_Loss	10000	5.11908	6.092669	-3.484888	0.277797	2.683447	12.075458	18.725299	float64	10000			float

Target variable correlations

train_data - spearman correlation matrix; focus: absolute correlation for Weight_Loss >= 0.5

Feature interaction between Exercise/Weight_Loss in train_data

B. Important explanatory variables

auto.quick_fit(
    train_data=df_train,
    label='Weight_Loss',
    show_feature_importance_barplots=True
)

No path specified. Models will be saved in: "AutogluonModels/ag-20231201_113551/"

Model Prediction for Weight_Loss

Using validation data for Test points

Model Leaderboard

	model	score_test	score_val	pred_time_test	pred_time_val	fit_time	pred_time_test_marginal	pred_time_val_marginal	fit_time_marginal	stack_level	can_infer	fit_order
0	LightGBMXT	-1.013205	-0.973093	0.011518	0.002616	0.246242	0.011518	0.002616	0.246242	1	True	1

Feature Importance for Trained Model

	importance	stddev	p_value	n	p99_high	p99_low
Exercise	6.735479	0.105037	7.094759e-09	5	6.951752	6.519206
Supplement	4.018616	0.073537	1.344948e-08	5	4.170030	3.867202
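The importance column can be read as the increase in RMSE when one feature's values are randomly shuffled (permutation importance). The sketch below illustrates that idea on synthetic data; it is not AutoGluon's implementation. The effect sizes (0.5, 5.0, 9.5), the noise level, the cell-mean model, and all helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Synthetic data (assumed effect sizes, not estimated from weightloss.csv).
S = rng.integers(0, 2, n).astype(bool)   # Supplement
E = rng.integers(0, 2, n).astype(bool)   # Exercise
y = 0.5 * S + 5.0 * E + 9.5 * (S & E) + rng.normal(0, 1, n)
X = np.column_stack([S, E])

def fit_cell_means(X, y):
    """Toy model: predict the mean of y within each (Supplement, Exercise) cell."""
    table = {}
    for s in (False, True):
        for e in (False, True):
            mask = (X[:, 0] == s) & (X[:, 1] == e)
            table[(s, e)] = y[mask].mean()
    return table

def predict(table, X):
    return np.array([table[(bool(s), bool(e))] for s, e in X])

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

table = fit_cell_means(X, y)
baseline = rmse(y, predict(table, X))

# Permutation importance: how much RMSE worsens when one column is shuffled.
importance = {}
for j, name in enumerate(['Supplement', 'Exercise']):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importance[name] = rmse(y, predict(table, X_perm)) - baseline

print(importance)
```

Under these assumptions, shuffling Exercise hurts more than shuffling Supplement, which is the same ordering as in the table above.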

Rows with the highest prediction error

Rows in this category worth inspecting for the causes of the error

	Supplement	Exercise	Weight_Loss	Weight_Loss_pred	error
4639	True	False	4.358143	0.484748	3.873395
1683	True	True	18.432093	14.966088	3.466005
1275	False	True	1.693150	5.012056	3.318906
2631	False	True	8.328789	5.012056	3.316733
5334	True	False	3.769029	0.484748	3.284281
3437	False	False	-3.205225	0.039978	3.245204
2761	False	True	1.853100	5.012056	3.158956
9675	True	True	18.070419	14.966088	3.104331
3161	False	False	-3.020254	0.039978	3.060232
4637	False	True	2.077349	5.012056	2.934706

C. Per-observation interpretation

- Observation 2

df_train.iloc[[2]]

	Supplement	Exercise	Weight_Loss
2	True	True	13.824148

predictr.predict(df_train.iloc[[2]])

2    15.005399
Name: Weight_Loss, dtype: float32

auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[2]]*1,
    display_rows=True,
    plot='waterfall'
)

	Supplement	Exercise	Weight_Loss
2	1.0	1.0	13.824148

- Observation 5

df_train.iloc[[5]]

	Supplement	Exercise	Weight_Loss
5	True	False	-0.379401

predictr.predict(df_train.iloc[[5]])

5    0.583611
Name: Weight_Loss, dtype: float32

auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[5]]*1,
    display_rows=True,
    plot='waterfall'
)

	Supplement	Exercise	Weight_Loss
5	1.0	0.0	-0.379401

- Observation 10

df_train.iloc[[10]]

	Supplement	Exercise	Weight_Loss
10	False	True	4.356397

predictr.predict(df_train.iloc[[10]])

10    4.998917
Name: Weight_Loss, dtype: float32

auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[10]]*1,
    display_rows=True,
    plot='waterfall'
)

	Supplement	Exercise	Weight_Loss
10	0.0	1.0	4.356397
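A waterfall plot splits one prediction into a baseline plus one contribution per feature. Assuming those contributions are Shapley-style values, the two-feature case can be worked out by hand. The sketch below is not the code behind `auto.explain_rows`. It takes the four cell predictions from the quick_fit error table in section B (not from `predictr`) and assumes a background in which the four (Supplement, Exercise) combinations are equally likely; the helper names are hypothetical.

```python
# Cell predictions copied from the Weight_Loss_pred column in section B,
# keyed by (Supplement, Exercise) coded as 0/1.
f = {
    (0, 0): 0.039978,
    (1, 0): 0.484748,
    (0, 1): 5.012056,
    (1, 1): 14.966088,
}

# Assumption: the four combinations are equally likely in the background data.
background = list(f.keys())

def value(x, known):
    """Average prediction when the features in `known` are fixed to x
    and the remaining features are drawn from the background."""
    total = 0.0
    for b in background:
        z = tuple(x[i] if i in known else b[i] for i in range(2))
        total += f[z]
    return total / len(background)

def shapley(x):
    """Exact Shapley values for two features: average over both orderings."""
    base = value(x, set())
    phi_supplement = 0.5 * ((value(x, {0}) - base) + (value(x, {0, 1}) - value(x, {1})))
    phi_exercise = 0.5 * ((value(x, {1}) - base) + (value(x, {0, 1}) - value(x, {0})))
    return base, phi_supplement, phi_exercise

for x in background:
    base, phi_s, phi_e = shapley(x)
    print(x, round(base, 4), round(phi_s, 4), round(phi_e, 4))
```

For every row, the baseline plus the two contributions adds up to that row's prediction. Because of the interaction, the contribution credited to Supplement is larger when Exercise is also True.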