13wk-52: 취업(오버피팅) / 자료분석(Autogluon)

Author

최규빈

Published

December 1, 2023

1. 강의영상

2. Imports

#!pip install autogluon.eda
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import sklearn.model_selection
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')

3. Data

np.random.randn(43052)
n_balance = 10 
toeic = np.random.randint(0,199,size=5000)*5
gpa = np.random.randint(100,450,size=5000)/100
u = toeic * 8/995 + gpa * 10/4.5
u = u - np.mean(u)
v = np.exp(u)/(1+np.exp(u))
employment = np.random.binomial(n=1,p=v)
df = pd.DataFrame({
'toiec':toeic,
'gpa':gpa,
'employment':employment
})
df_balance = pd.DataFrame((np.random.randn(5000,n_balance)).reshape(5000,n_balance)*1,columns = ['balance'+str(i) for i in range(n_balance)]) > 0
df = pd.concat([df,df_balance],axis=1).assign(employment = lambda df: df.employment.map({0:'No',1:'Yes'}))
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size=0.7, random_state=42)
df_train
toiec gpa employment balance0 balance1 balance2 balance3 balance4 balance5 balance6 balance7 balance8 balance9
4431 195 2.50 No True False False False True False True False False False
2162 410 3.74 Yes True False True True False True False False False False
2396 940 1.07 No True False False False False False True True True True
4768 785 4.21 Yes False True False True False False True True True True
2271 965 3.32 Yes True False True False False True True False False True
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4426 745 3.19 Yes True True False True True False True False True True
466 790 3.82 Yes True True False True False False True False True False
3092 800 1.41 No False True False False False False False True False True
3772 290 1.35 No False False True True False False True True False True
860 965 4.46 Yes False False True True False False False True False True

1500 rows × 13 columns

4. 적합

# step1 -- pass
# step2
predictr = TabularPredictor(label='employment')
# step3
predictr.fit(df_train)
# step4
yhat = predictr.predict(df_train)
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_110823/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231201_110823/"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail:   248.41 GB / 490.57 GB (50.6%)
Train Data Rows:    1500
Train Data Columns: 12
Label Column: employment
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  ['No', 'Yes']
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = Yes, class 0 = No
    Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (Yes) vs negative (No) class.
    To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    126338.85 MB
    Train Data (Original)  Memory Usage: 0.04 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 10 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('bool', [])  : 10 | ['balance0', 'balance1', 'balance2', 'balance3', 'balance4', ...]
        ('float', []) :  1 | ['gpa']
        ('int', [])   :  1 | ['toiec']
    Types of features in processed data (raw dtype, special dtypes):
        ('float', [])     :  1 | ['gpa']
        ('int', [])       :  1 | ['toiec']
        ('int', ['bool']) : 10 | ['balance0', 'balance1', 'balance2', 'balance3', 'balance4', ...]
    0.0s = Fit runtime
    12 features in original data used to generate 12 features in processed data.
    Train Data (Processed) Memory Usage: 0.04 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.04s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1200, Val Rows: 300
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
    0.72     = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: KNeighborsDist ...
    0.7333   = Validation score   (accuracy)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT ...
    0.8733   = Validation score   (accuracy)
    0.21s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    0.8533   = Validation score   (accuracy)
    0.21s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: RandomForestGini ...
    0.8567   = Validation score   (accuracy)
    0.45s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: RandomForestEntr ...
    0.8633   = Validation score   (accuracy)
    0.59s    = Training   runtime
    0.19s    = Validation runtime
Fitting model: CatBoost ...
    0.8667   = Validation score   (accuracy)
    0.31s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: ExtraTreesGini ...
    0.8567   = Validation score   (accuracy)
    0.89s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: ExtraTreesEntr ...
    0.85     = Validation score   (accuracy)
    0.72s    = Training   runtime
    0.1s     = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 2: early stopping
    0.8667   = Validation score   (accuracy)
    1.23s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: XGBoost ...
    0.8433   = Validation score   (accuracy)
    0.16s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: NeuralNetTorch ...
    0.8667   = Validation score   (accuracy)
    1.83s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBMLarge ...
    0.8367   = Validation score   (accuracy)
    0.47s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8933   = Validation score   (accuracy)
    0.47s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 8.22s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231201_110823/")

5. 해석 및 시각화

A. y의 분포, (X,y)의 관계 시각화

auto.target_analysis(
    train_data=df_train,
    label='employment',
    fit_distributions=False
)

Target variable analysis

count unique top freq dtypes missing_count missing_ratio raw_type special_types
employment 1500 2 No 756 object object

Target variable correlations

train_data - spearman correlation matrix; focus: absolute correlation for employment >= 0.5

Feature interaction between toiec/employment in train_data

B. 중요한 설명변수

auto.quick_fit(
    train_data= df_train,
    label = 'employment',
    show_feature_importance_barplots=True
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_111117/"

Model Prediction for employment

Using validation data for Test points

Model Leaderboard

model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBMXT 0.833333 0.87619 0.00083 0.001549 0.218086 0.00083 0.001549 0.218086 1 True 1

Feature Importance for Trained Model

importance stddev p_value n p99_high p99_low
gpa 0.188889 0.015947 0.000006 5 0.221725 0.156053
toiec 0.185778 0.020882 0.000019 5 0.228774 0.142782
balance7 0.002667 0.003651 0.088904 5 0.010185 -0.004852
balance6 0.001778 0.009081 0.342087 5 0.020476 -0.016921
balance4 0.000889 0.005116 0.358714 5 0.011423 -0.009645
balance8 -0.001333 0.002534 0.847721 5 0.003884 -0.006550
balance0 -0.004444 0.003143 0.982945 5 0.002026 -0.010915
balance5 -0.005778 0.004332 0.979679 5 0.003142 -0.014697
balance1 -0.006222 0.003296 0.993267 5 0.000564 -0.013009
balance3 -0.006667 0.004157 0.988475 5 0.001893 -0.015227
balance9 -0.007556 0.001217 0.999922 5 -0.005049 -0.010062
balance2 -0.011111 0.003514 0.998945 5 -0.003876 -0.018346

Rows with the highest prediction error

Rows in this category worth inspecting for the causes of the error

toiec gpa balance0 balance1 balance2 balance3 balance4 balance5 balance6 balance7 balance8 balance9 employment No Yes error
3031 410 2.34 True True False False False True True True True True Yes 0.670908 0.329092 0.341816
3562 420 2.23 False True True True True True False False False True Yes 0.629780 0.370220 0.259560
3521 560 3.82 False False False True False False True True True False No 0.381926 0.618074 0.236148
186 470 2.20 True True False False False False True True False False Yes 0.615211 0.384789 0.230421
137 615 1.45 True False True False True True False False False False Yes 0.610440 0.389560 0.220880
3149 855 2.23 False True True True True False True True True True No 0.390826 0.609174 0.218348
4637 940 1.85 False False False True True False True False False True No 0.419480 0.580520 0.161039
4517 155 3.73 False False False True True False False True True True Yes 0.577100 0.422900 0.154201
1449 965 2.01 False True True True True True True False False False No 0.434479 0.565521 0.131042
2106 230 2.65 True False False False True False False True True False Yes 0.564785 0.435215 0.129570

Rows with the least distance vs other class

Rows in this category are the closest to the decision boundary vs the other class and are good candidates for additional labeling

toiec gpa balance0 balance1 balance2 balance3 balance4 balance5 balance6 balance7 balance8 balance9 employment No Yes error
3776 990 1.30 True True True False False False False False True True Yes 0.500001 0.499999 0.000001
3507 855 1.86 False True False False False False False True False True No 0.499844 0.500156 0.000311
2553 80 4.36 True True True False False True True True True True No 0.499721 0.500279 0.000558
249 840 1.95 True True False True False False True True False False Yes 0.501248 0.498752 0.002496
2987 140 4.08 True False True True False True False False True True No 0.497140 0.502860 0.005720
4232 15 3.78 False True True True True True True False False False Yes 0.502909 0.497091 0.005819
872 875 1.29 False False False True True False True True False False Yes 0.503659 0.496341 0.007318
3696 30 4.31 False False True True True False False False True False No 0.495588 0.504412 0.008823
2568 565 2.36 False True False False False False False False True True Yes 0.505832 0.494168 0.011664
2681 660 2.28 False True False False False True False True False True No 0.494135 0.505865 0.011729

C. 관측치별 해석

- 0번관측치

df_train.iloc[[0]]
toiec gpa employment balance0 balance1 balance2 balance3 balance4 balance5 balance6 balance7 balance8 balance9
4431 195 2.5 No True False False False True False True False False False
predictr.predict(df_train.iloc[[0]])
4431    No
Name: employment, dtype: object
predictr.predict_proba(df_train.iloc[[0]])
No Yes
4431 0.814248 0.185752
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[0]]*1,
    display_rows=True,
    plot='waterfall'
)
toiec gpa employment balance0 balance1 balance2 balance3 balance4 balance5 balance6 balance7 balance8 balance9
4431 195 2.5 No 1 0 0 0 1 0 1 0 0 0

# 떨어진 이유

- 1번관측치

df_train.iloc[[1]]
toiec gpa employment balance0 balance1 balance2 balance3 balance4 balance5 balance6 balance7 balance8 balance9
2162 410 3.74 Yes True False True True False True False False False False
predictr.predict(df_train.iloc[[1]])
2162    Yes
Name: employment, dtype: object
predictr.predict_proba(df_train.iloc[[1]])
No Yes
2162 0.347349 0.652651
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[1]]*1,
    display_rows=True,
    plot='waterfall'
)
toiec gpa employment balance0 balance1 balance2 balance3 balance4 balance5 balance6 balance7 balance8 balance9
2162 410 3.74 Yes 1 0 1 1 0 1 0 0 0 0

# 합격한이유