13wk-52: Employment (Overfitting) / Data Analysis (AutoGluon)

Author

최규빈

Published

December 1, 2023

1. Lecture video

2. Imports

#!pip install autogluon.eda

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import sklearn.model_selection
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')

3. Data

np.random.seed(43052)  # fix the seed so the simulated data are reproducible
n_balance = 10 
toeic = np.random.randint(0,199,size=5000)*5
gpa = np.random.randint(100,450,size=5000)/100
u = toeic * 8/995 + gpa * 10/4.5
u = u - np.mean(u)
v = np.exp(u)/(1+np.exp(u))
employment = np.random.binomial(n=1,p=v)
df = pd.DataFrame({
'toiec':toeic,
'gpa':gpa,
'employment':employment
})
df_balance = pd.DataFrame(np.random.randn(5000,n_balance),columns = ['balance'+str(i) for i in range(n_balance)]) > 0
df = pd.concat([df,df_balance],axis=1).assign(employment = lambda df: df.employment.map({0:'No',1:'Yes'}))
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size=0.7, random_state=42)
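The data-generating process above is a logistic model in `toeic` and `gpa`; the ten `balance` columns are drawn independently and carry no information about `employment`. The following is a minimal sketch of that probability for a single applicant, using only the standard library. The function names are hypothetical, and the score is centered with the theoretical mean of the two uniform draws rather than the sample mean `np.mean(u)`, so the values only approximate the `v` computed above.

```python
import math

# Theoretical means of the two uniform integer draws in the data section:
# toeic = randint(0, 199) * 5     -> 0, 5, ..., 990
# gpa   = randint(100, 450) / 100 -> 1.00, ..., 4.49
MEAN_TOEIC = 5 * (0 + 198) / 2      # 495.0
MEAN_GPA = (100 + 449) / 2 / 100    # 2.745

def latent_score(toeic, gpa):
    """Uncentered score u = toeic*8/995 + gpa*10/4.5, as in the data section."""
    return toeic * 8 / 995 + gpa * 10 / 4.5

# Assumption: center with the theoretical mean instead of np.mean(u).
U_CENTER = latent_score(MEAN_TOEIC, MEAN_GPA)

def employment_probability(toeic, gpa):
    """Approximate P(employment = 'Yes') for one applicant."""
    u = latent_score(toeic, gpa) - U_CENTER
    return 1 / (1 + math.exp(-u))   # algebraically equal to exp(u) / (1 + exp(u))

# A few (toeic, gpa) pairs taken from the df_train preview.
for toeic, gpa in [(195, 2.50), (410, 3.74), (965, 4.46)]:
    print(toeic, gpa, round(employment_probability(toeic, gpa), 3))
```

An applicant exactly at the mean of both variables gets probability 0.5, and the probability rises with either score.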

df_train

	toiec	gpa	employment	balance0	balance1	balance2	balance3	balance4	balance5	balance6	balance7	balance8	balance9
4431	195	2.50	No	True	False	False	False	True	False	True	False	False	False
2162	410	3.74	Yes	True	False	True	True	False	True	False	False	False	False
2396	940	1.07	No	True	False	False	False	False	False	True	True	True	True
4768	785	4.21	Yes	False	True	False	True	False	False	True	True	True	True
2271	965	3.32	Yes	True	False	True	False	False	True	True	False	False	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...
4426	745	3.19	Yes	True	True	False	True	True	False	True	False	True	True
466	790	3.82	Yes	True	True	False	True	False	False	True	False	True	False
3092	800	1.41	No	False	True	False	False	False	False	False	True	False	True
3772	290	1.35	No	False	False	True	True	False	False	True	True	False	True
860	965	4.46	Yes	False	False	True	True	False	False	False	True	False	True

1500 rows × 13 columns

4. Fitting

# step1 -- pass
# step2
predictr = TabularPredictor(label='employment')
# step3
predictr.fit(df_train)
# step4
yhat = predictr.predict(df_train)

No path specified. Models will be saved in: "AutogluonModels/ag-20231201_110823/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231201_110823/"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail:   248.41 GB / 490.57 GB (50.6%)
Train Data Rows:    1500
Train Data Columns: 12
Label Column: employment
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  ['No', 'Yes']
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = Yes, class 0 = No
    Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (Yes) vs negative (No) class.
    To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    126338.85 MB
    Train Data (Original)  Memory Usage: 0.04 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 10 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('bool', [])  : 10 | ['balance0', 'balance1', 'balance2', 'balance3', 'balance4', ...]
        ('float', []) :  1 | ['gpa']
        ('int', [])   :  1 | ['toiec']
    Types of features in processed data (raw dtype, special dtypes):
        ('float', [])     :  1 | ['gpa']
        ('int', [])       :  1 | ['toiec']
        ('int', ['bool']) : 10 | ['balance0', 'balance1', 'balance2', 'balance3', 'balance4', ...]
    0.0s = Fit runtime
    12 features in original data used to generate 12 features in processed data.
    Train Data (Processed) Memory Usage: 0.04 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.04s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1200, Val Rows: 300
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
    0.72     = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: KNeighborsDist ...
    0.7333   = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT ...
    0.8733   = Validation score   (accuracy)
    0.21s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    0.8533   = Validation score   (accuracy)
    0.21s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: RandomForestGini ...
    0.8567   = Validation score   (accuracy)
    0.45s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: RandomForestEntr ...
    0.8633   = Validation score   (accuracy)
    0.59s    = Training   runtime
    0.19s    = Validation runtime
Fitting model: CatBoost ...
    0.8667   = Validation score   (accuracy)
    0.31s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: ExtraTreesGini ...
    0.8567   = Validation score   (accuracy)
    0.89s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: ExtraTreesEntr ...
    0.85     = Validation score   (accuracy)
    0.72s    = Training   runtime
    0.1s     = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 2: early stopping
    0.8667   = Validation score   (accuracy)
    1.23s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: XGBoost ...
    0.8433   = Validation score   (accuracy)
    0.16s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: NeuralNetTorch ...
    0.8667   = Validation score   (accuracy)
    1.83s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBMLarge ...
    0.8367   = Validation score   (accuracy)
    0.47s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8933   = Validation score   (accuracy)
    0.47s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 8.22s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231201_110823/")
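Step 4 above predicts on `df_train` only, so by itself it cannot reveal overfitting. One way to check, sketched here under assumptions, is to compare accuracy on the training rows with accuracy on the held-out `df_test`. The helper `accuracy` and the toy labels are hypothetical; the commented lines show how the helper might be applied to the notebook's objects. `TabularPredictor.evaluate` is a built-in alternative.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for illustration only (not results from the notebook).
toy_true = ['No', 'Yes', 'No', 'Yes']
toy_pred = ['No', 'Yes', 'Yes', 'Yes']
print(accuracy(toy_true, toy_pred))

# In the notebook's context (assumes predictr, df_train and df_test exist):
# train_acc = accuracy(df_train['employment'], predictr.predict(df_train))
# test_acc  = accuracy(df_test['employment'],  predictr.predict(df_test))
# A large gap between train_acc and test_acc suggests overfitting.
```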

5. Interpretation and visualization

A. Distribution of y and visualization of the (X, y) relationship

auto.target_analysis(
    train_data=df_train,
    label='employment',
    fit_distributions=False
)

Target variable analysis

	count	unique	top	freq	dtypes	missing_count	missing_ratio	raw_type	special_types
employment	1500	2	No	756	object			object

Target variable correlations

train_data - spearman correlation matrix; focus: absolute correlation for employment >= 0.5

Feature interaction between toiec/employment in train_data

B. Important explanatory variables

auto.quick_fit(
    train_data= df_train,
    label = 'employment',
    show_feature_importance_barplots=True
)

No path specified. Models will be saved in: "AutogluonModels/ag-20231201_111117/"

Model Prediction for employment

Using validation data for Test points

Model Leaderboard

	model	score_test	score_val	pred_time_test	pred_time_val	fit_time	pred_time_test_marginal	pred_time_val_marginal	fit_time_marginal	stack_level	can_infer	fit_order
0	LightGBMXT	0.833333	0.87619	0.00083	0.001549	0.218086	0.00083	0.001549	0.218086	1	True	1

Feature Importance for Trained Model

	importance	stddev	p_value	n	p99_high	p99_low
gpa	0.188889	0.015947	0.000006	5	0.221725	0.156053
toiec	0.185778	0.020882	0.000019	5	0.228774	0.142782
balance7	0.002667	0.003651	0.088904	5	0.010185	-0.004852
balance6	0.001778	0.009081	0.342087	5	0.020476	-0.016921
balance4	0.000889	0.005116	0.358714	5	0.011423	-0.009645
balance8	-0.001333	0.002534	0.847721	5	0.003884	-0.006550
balance0	-0.004444	0.003143	0.982945	5	0.002026	-0.010915
balance5	-0.005778	0.004332	0.979679	5	0.003142	-0.014697
balance1	-0.006222	0.003296	0.993267	5	0.000564	-0.013009
balance3	-0.006667	0.004157	0.988475	5	0.001893	-0.015227
balance9	-0.007556	0.001217	0.999922	5	-0.005049	-0.010062
balance2	-0.011111	0.003514	0.998945	5	-0.003876	-0.018346
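The importance scores above are close to 0.19 for `gpa` and `toiec` and close to zero (some slightly negative) for every `balance` column, which matches how the data were generated. AutoGluon documents this score as a permutation importance: the drop in the evaluation metric after one column is shuffled. The sketch below illustrates that idea on toy data with the standard library only. The model, the feature names `signal` and `noise`, and the data are hypothetical stand-ins, not the fitted LightGBMXT model.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model's label matches the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, rng):
    """Accuracy drop after shuffling one feature column across rows."""
    baseline = accuracy(model, rows, labels)
    shuffled_values = [r[feature] for r in rows]
    rng.shuffle(shuffled_values)
    shuffled_rows = [{**r, feature: v} for r, v in zip(rows, shuffled_values)]
    return baseline - accuracy(model, shuffled_rows, labels)

rng = random.Random(0)
# Toy data: 'signal' determines the label, 'noise' is unrelated to it.
rows = [{'signal': rng.random(), 'noise': rng.random() > 0.5} for _ in range(200)]
labels = ['Yes' if r['signal'] > 0.5 else 'No' for r in rows]

def toy_model(row):
    # Hypothetical fitted model that uses only the informative feature.
    return 'Yes' if row['signal'] > 0.5 else 'No'

for name in ('signal', 'noise'):
    print(name, permutation_importance(toy_model, rows, labels, name, rng))
```

Shuffling a column the model ignores leaves its predictions unchanged, so that column's importance is zero, while shuffling the informative column costs a large share of the accuracy.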

Rows with the highest prediction error

Rows in this category worth inspecting for the causes of the error

	toiec	gpa	balance0	balance1	balance2	balance3	balance4	balance5	balance6	balance7	balance8	balance9	employment	No	Yes	error
3031	410	2.34	True	True	False	False	False	True	True	True	True	True	Yes	0.670908	0.329092	0.341816
3562	420	2.23	False	True	True	True	True	True	False	False	False	True	Yes	0.629780	0.370220	0.259560
3521	560	3.82	False	False	False	True	False	False	True	True	True	False	No	0.381926	0.618074	0.236148
186	470	2.20	True	True	False	False	False	False	True	True	False	False	Yes	0.615211	0.384789	0.230421
137	615	1.45	True	False	True	False	True	True	False	False	False	False	Yes	0.610440	0.389560	0.220880
3149	855	2.23	False	True	True	True	True	False	True	True	True	True	No	0.390826	0.609174	0.218348
4637	940	1.85	False	False	False	True	True	False	True	False	False	True	No	0.419480	0.580520	0.161039
4517	155	3.73	False	False	False	True	True	False	False	True	True	True	Yes	0.577100	0.422900	0.154201
1449	965	2.01	False	True	True	True	True	True	True	False	False	False	No	0.434479	0.565521	0.131042
2106	230	2.65	True	False	False	False	True	False	False	True	True	False	Yes	0.564785	0.435215	0.129570

Rows with the least distance vs other class

Rows in this category are the closest to the decision boundary vs the other class and are good candidates for additional labeling

	toiec	gpa	balance0	balance1	balance2	balance3	balance4	balance5	balance6	balance7	balance8	balance9	employment	No	Yes	error
3776	990	1.30	True	True	True	False	False	False	False	False	True	True	Yes	0.500001	0.499999	0.000001
3507	855	1.86	False	True	False	False	False	False	False	True	False	True	No	0.499844	0.500156	0.000311
2553	80	4.36	True	True	True	False	False	True	True	True	True	True	No	0.499721	0.500279	0.000558
249	840	1.95	True	True	False	True	False	False	True	True	False	False	Yes	0.501248	0.498752	0.002496
2987	140	4.08	True	False	True	True	False	True	False	False	True	True	No	0.497140	0.502860	0.005720
4232	15	3.78	False	True	True	True	True	True	True	False	False	False	Yes	0.502909	0.497091	0.005819
872	875	1.29	False	False	False	True	True	False	True	True	False	False	Yes	0.503659	0.496341	0.007318
3696	30	4.31	False	False	True	True	True	False	False	False	True	False	No	0.495588	0.504412	0.008823
2568	565	2.36	False	True	False	False	False	False	False	False	True	True	Yes	0.505832	0.494168	0.011664
2681	660	2.28	False	True	False	False	False	True	False	True	False	True	No	0.494135	0.505865	0.011729
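The `error` column in the two tables above appears to equal the probability assigned to the predicted class minus the probability assigned to the true class. This is inferred from the displayed values (for row 3031, 0.670908 - 0.329092 = 0.341816), not taken from AutoGluon's source. A minimal sketch of that calculation, with a hypothetical function name:

```python
def prediction_error(proba, true_label):
    """Gap between the highest class probability and the true class's probability."""
    return max(proba.values()) - proba[true_label]

# Probabilities copied from rows 3031 and 3521 of the first table above.
print(round(prediction_error({'No': 0.670908, 'Yes': 0.329092}, 'Yes'), 6))
print(round(prediction_error({'No': 0.381926, 'Yes': 0.618074}, 'No'), 6))
```

Under this reading, a correctly classified row has error 0, and the second table lists misclassified rows whose two class probabilities are almost equal.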

C. Per-observation interpretation

- Observation 0

df_train.iloc[[0]]

	toiec	gpa	employment	balance0	balance1	balance2	balance3	balance4	balance5	balance6	balance7	balance8	balance9
4431	195	2.5	No	True	False	False	False	True	False	True	False	False	False

predictr.predict(df_train.iloc[[0]])

4431    No
Name: employment, dtype: object

predictr.predict_proba(df_train.iloc[[0]])

	No	Yes
4431	0.814248	0.185752

auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[0]]*1,
    display_rows=True,
    plot='waterfall'
)

	toiec	gpa	employment	balance0	balance1	balance2	balance3	balance4	balance5	balance6	balance7	balance8	balance9
4431	195	2.5	No	1	0	0	0	1	0	1	0	0	0

# Why this applicant was not hired

- Observation 1

df_train.iloc[[1]]

	toiec	gpa	employment	balance0	balance1	balance2	balance3	balance4	balance5	balance6	balance7	balance8	balance9
2162	410	3.74	Yes	True	False	True	True	False	True	False	False	False	False

predictr.predict(df_train.iloc[[1]])

2162    Yes
Name: employment, dtype: object

predictr.predict_proba(df_train.iloc[[1]])

	No	Yes
2162	0.347349	0.652651

auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[1]]*1,
    display_rows=True,
    plot='waterfall'
)

	toiec	gpa	employment	balance0	balance1	balance2	balance3	balance4	balance5	balance6	balance7	balance8	balance9
2162	410	3.74	Yes	1	0	1	1	0	1	0	0	0	0

# Why this applicant was hired
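The waterfall plots from `auto.explain_rows` are SHAP-style explanations: a base value plus one signed contribution per feature, which together add up to the model's output for that row. The sketch below computes exact Shapley values for a hypothetical three-feature toy model (not the fitted `predictr`) by averaging marginal contributions over every feature order. The toy model, its offset of 10.0, and the baseline row are assumptions for illustration. To my understanding, AutoGluon approximates these values with the `shap` package rather than enumerating orders.

```python
import math
from itertools import permutations

def toy_model(x):
    # Hypothetical stand-in for a fitted model: P(Yes) from toeic and gpa only.
    u = x['toeic'] * 8 / 995 + x['gpa'] * 10 / 4.5 - 10.0
    return 1 / (1 + math.exp(-u))

def shapley_values(model, row, baseline):
    """Exact Shapley values: average marginal contribution over all feature orders."""
    features = list(row)
    orders = list(permutations(features))
    contrib = {f: 0.0 for f in features}
    for order in orders:
        current = dict(baseline)
        prev = model(current)
        for f in order:
            current[f] = row[f]      # switch feature f from baseline to the row's value
            new = model(current)
            contrib[f] += new - prev
            prev = new
    return {f: c / len(orders) for f, c in contrib.items()}

row = {'toeic': 195, 'gpa': 2.50, 'balance0': 1}        # values of observation 0
baseline = {'toeic': 495, 'gpa': 2.745, 'balance0': 0}  # assumed reference point
phi = shapley_values(toy_model, row, baseline)
print(phi)
# Waterfall identity: baseline prediction + sum of contributions = row prediction.
print(toy_model(baseline) + sum(phi.values()), toy_model(row))
```

In this toy setting the below-average `toeic` and `gpa` both receive negative contributions, and `balance0` receives exactly zero because the model never uses it.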