13wk-53: Employment (Multicollinearity) / Data Analysis (AutoGluon)

Author

최규빈

Published

December 1, 2023

1. Lecture video

2. Imports

#!pip install autogluon.eda
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')

3. Data

np.random.seed(43052)
df_train = pd.read_csv("https://raw.githubusercontent.com/guebin/MP2023/main/posts/employment_multicollinearity.csv")
# label = gpa + toeic/100 + standard normal noise
df_train['employment_score'] = df_train.gpa * 1.0 + df_train.toeic * 1/100 + np.random.randn(500)
# keep only the first 8 columns
df_train = df_train.iloc[:,:8]
df_train
employment_score gpa toeic toeic0 toeic1 toeic2 toeic3 toeic4
0 1.784955 0.051535 135 129.566309 133.078481 121.678398 113.457366 133.564200
1 10.789671 0.355496 935 940.563187 935.723570 939.190519 938.995672 945.376482
2 8.221213 2.228435 485 493.671390 493.909118 475.500970 480.363752 478.868942
3 2.137594 1.179701 65 62.272565 55.957257 68.521468 76.866765 51.436321
4 8.650144 3.962356 445 449.280637 438.895582 433.598274 444.081141 437.005100
... ... ... ... ... ... ... ... ...
495 9.057243 4.288465 280 276.680902 274.502675 277.868536 292.283300 277.476630
496 4.108020 2.601212 310 296.940263 301.545000 306.725610 314.811407 311.935810
497 2.430590 0.042323 225 206.793217 228.335345 222.115146 216.479498 227.469560
498 5.343171 1.041416 320 327.461442 323.019899 329.589337 313.312233 315.645050
499 6.505106 3.626883 375 370.966595 364.668477 371.853566 373.574930 376.701708

500 rows × 8 columns
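
The columns toeic0 through toeic4 look like noisy copies of toeic, which is the multicollinearity this post is about. A minimal sketch of how one might quantify that follows. It does not load the lecture CSV: the `toy` frame is a synthetic stand-in (an assumption), and the `vif` helper is a hypothetical name that reads variance inflation factors off the diagonal of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lecture data (an assumption, not the real CSV):
# toeic0..toeic4 are noisy copies of toeic, so they are nearly collinear.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "gpa": rng.uniform(0, 4.5, n),
    "toeic": rng.integers(0, 200, n) * 5,
})
for i in range(5):
    toy[f"toeic{i}"] = toy["toeic"] + rng.normal(0, 10, n)

def vif(frame: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column (diagonal of the inverse correlation matrix)."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

corr_matrix = toy.corr()   # pairwise Pearson correlations
vif_scores = vif(toy)      # values far above 10 are a common warning sign

print(corr_matrix.round(3))
print(vif_scores.round(1))
```

On the real data the same two calls could be applied to `df_train.drop(columns='employment_score')`; a VIF near 1 for gpa and very large values for the toeic family would match what the table above suggests.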

4. Fitting

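# step1 -- pass (nothing to do for this step here)
# step2 -- create the predictor
predictr = TabularPredictor(label='employment_score')
# step3 -- fit
predictr.fit(df_train)
# step4 -- predict (on the training data itself)
yhat = predictr.predict(df_train)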
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_112421/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231201_112421/"
AutoGluon Version:  0.8.2
Python Version:     3.10.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail:   248.36 GB / 490.57 GB (50.6%)
Train Data Rows:    500
Train Data Columns: 7
Label Column: employment_score
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
    Label info (max, min, mean, stddev): (15.12090627137731, -0.6447161480491369, 7.2271, 3.11598)
    If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    125968.81 MB
    Train Data (Original)  Memory Usage: 0.03 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', []) : 6 | ['gpa', 'toeic0', 'toeic1', 'toeic2', 'toeic3', ...]
        ('int', [])   : 1 | ['toeic']
    Types of features in processed data (raw dtype, special dtypes):
        ('float', []) : 6 | ['gpa', 'toeic0', 'toeic1', 'toeic2', 'toeic3', ...]
        ('int', [])   : 1 | ['toeic']
    0.0s = Fit runtime
    7 features in original data used to generate 7 features in processed data.
    Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 400, Val Rows: 100
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
    -1.5764  = Validation score   (-root_mean_squared_error)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: KNeighborsDist ...
    -1.5604  = Validation score   (-root_mean_squared_error)
    0.01s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT ...
    -0.9448  = Validation score   (-root_mean_squared_error)
    0.33s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    -0.9643  = Validation score   (-root_mean_squared_error)
    0.18s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: RandomForestMSE ...
    -0.9278  = Validation score   (-root_mean_squared_error)
    0.3s     = Training   runtime
    0.04s    = Validation runtime
Fitting model: CatBoost ...
    -0.9398  = Validation score   (-root_mean_squared_error)
    0.3s     = Training   runtime
    0.0s     = Validation runtime
Fitting model: ExtraTreesMSE ...
    -0.9428  = Validation score   (-root_mean_squared_error)
    0.3s     = Training   runtime
    0.02s    = Validation runtime
Fitting model: NeuralNetFastAI ...
    -0.9448  = Validation score   (-root_mean_squared_error)
    0.7s     = Training   runtime
    0.01s    = Validation runtime
Fitting model: XGBoost ...
    -0.9637  = Validation score   (-root_mean_squared_error)
    0.2s     = Training   runtime
    0.0s     = Validation runtime
Fitting model: NeuralNetTorch ...
    -0.9001  = Validation score   (-root_mean_squared_error)
    2.4s     = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMLarge ...
    -0.981   = Validation score   (-root_mean_squared_error)
    0.31s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    -0.8742  = Validation score   (-root_mean_squared_error)
    0.19s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 5.53s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231201_112421/")
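
The log reports every validation score as -root_mean_squared_error, because AutoGluon flips the sign so that higher is always better. A minimal sketch of that relationship, using made-up numbers rather than the lecture data (the arrays and the `rmse` helper are illustrative assumptions):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error (lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, chosen so the arithmetic is easy to follow.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

error = rmse(y_true, y_pred)   # sqrt((0 + 0 + 0 + 4) / 4) = 1.0
score = -error                 # sign-flipped, "higher is better" form

print(error, score)
```

Read this way, the WeightedEnsemble_L2 score of -0.8742 above is a validation RMSE of 0.8742. Note that `yhat` in step4 is computed on the training rows, so an RMSE computed from it would likely look better than the validation figure; `predictr.leaderboard()` should list the per-model validation scores.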

5. Interpretation and visualization

A. Distribution of y and visualization of the (X, y) relationship

auto.target_analysis(
    train_data=df_train,
    label='employment_score',
    fit_distributions=False
)

Target variable analysis

                  count  mean      std       min        25%       50%       75%       max        dtypes   unique  missing_count  missing_ratio  raw_type  special_types
employment_score  500    7.227104  3.115979  -0.644716  4.695513  7.281178  9.548811  15.120906  float64  500                                    float

Target variable correlations

[Figure: train_data Spearman correlation matrix; focus: absolute correlation with employment_score >= 0.5]

[Figures: feature interaction plots in train_data between employment_score and each of toeic, toeic2, toeic4, toeic3, toeic1, toeic0]

B. Important explanatory variables

auto.quick_fit(
    train_data=df_train,
    label='employment_score',
    show_feature_importance_barplots=True
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_112654/"

Model Prediction for employment_score

Using validation data for Test points

Model Leaderboard

model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBMXT -1.012202 -0.979194 0.001505 0.000941 0.187162 0.001505 0.000941 0.187162 1 True 1

Feature Importance for Trained Model

importance stddev p_value n p99_high p99_low
gpa 1.020679 0.095581 0.000009 5 1.217483 0.823876
toeic0 0.266908 0.043618 0.000083 5 0.356718 0.177098
toeic2 0.252605 0.038863 0.000065 5 0.332626 0.172585
toeic 0.241941 0.043093 0.000116 5 0.330669 0.153212
toeic3 0.161979 0.035012 0.000246 5 0.234070 0.089888
toeic1 0.158796 0.040579 0.000470 5 0.242350 0.075242
toeic4 0.136791 0.037064 0.000588 5 0.213106 0.060476
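
In the table above gpa looks far more important than any single toeic column, even though toeic/100 contributes more variance to employment_score than gpa does. One plausible reading is that permutation-based importance (which is how AutoGluon's documentation describes its feature importance) gets diluted across near-duplicate columns: shuffling one copy hurts little when the others still carry the signal. The sketch below illustrates that effect under loud assumptions: the data are synthetic, the model is a closed-form ridge regression rather than LightGBMXT, and `fit_ridge` and `permutation_importance` are hypothetical helper names.

```python
import numpy as np

# Synthetic stand-in for the lecture data (an assumption).
rng = np.random.default_rng(0)
n = 500
gpa = rng.uniform(0, 4.5, n)
toeic = rng.integers(0, 200, n) * 5.0
copies = [toeic + rng.normal(0, 10, n) for _ in range(5)]   # toeic0..toeic4
y = gpa + toeic / 100 + rng.normal(0, 1, n)

def fit_ridge(X, y, lam=10.0):
    """Closed-form ridge on standardized features; returns a predict function."""
    mu, sd, y_mean = X.mean(axis=0), X.std(axis=0), y.mean()
    Z = (X - mu) / sd
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(X.shape[1]), Z.T @ (y - y_mean))
    return lambda X_new: y_mean + ((X_new - mu) / sd) @ beta

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def permutation_importance(predict, X, y, names, n_repeats=5, seed=0):
    """Mean increase in RMSE when one column at a time is shuffled."""
    shuffle_rng = np.random.default_rng(seed)
    baseline = rmse(y, predict(X))
    result = {}
    for j, name in enumerate(names):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = shuffle_rng.permutation(X_perm[:, j])
            increases.append(rmse(y, predict(X_perm)) - baseline)
        result[name] = float(np.mean(increases))
    return result

# 400 rows for fitting, 100 held out, mirroring holdout_frac=0.2.
train, val = slice(0, 400), slice(400, 500)

# Model A: gpa plus the single toeic column.
names_a = ["gpa", "toeic"]
X_a = np.column_stack([gpa, toeic])
imp_a = permutation_importance(fit_ridge(X_a[train], y[train]), X_a[val], y[val], names_a)

# Model B: the same, plus the five noisy copies of toeic.
names_b = ["gpa", "toeic"] + [f"toeic{i}" for i in range(5)]
X_b = np.column_stack([gpa, toeic] + copies)
imp_b = permutation_importance(fit_ridge(X_b[train], y[train]), X_b[val], y[val], names_b)

print(imp_a)
print(imp_b)
```

In this toy setting toeic outranks gpa when it stands alone, while in the model with all six toeic columns each of them falls below gpa. That is the same ordering as in the AutoGluon table, which suggests the low individual toeic scores there need not mean that TOEIC matters little.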

Rows with the highest prediction error

Rows in this category worth inspecting for the causes of the error

gpa toeic toeic0 toeic1 toeic2 toeic3 toeic4 employment_score employment_score_pred error
55 0.200267 450 450.310311 464.340472 458.213429 456.215452 448.932120 2.234912 4.809618 2.574706
8 4.191552 25 29.000939 22.725391 19.529454 35.896321 24.151228 2.514707 5.069639 2.554931
491 1.754276 425 428.686989 439.377437 446.630603 439.109681 423.056878 8.008441 5.771637 2.236804
144 2.013480 520 519.674312 521.390587 531.847782 511.375625 525.305439 8.755093 6.543661 2.211432
118 3.585276 675 679.425199 680.429579 677.878530 674.812300 672.177564 12.785551 10.599931 2.185620
469 1.969145 5 3.785864 4.575646 -8.358037 1.071854 7.253616 4.552622 2.385054 2.167567
403 1.678080 70 53.993037 65.691879 65.135837 58.510651 51.307683 4.136691 2.040128 2.096563
75 0.564461 795 799.270794 791.212118 803.426181 805.178974 803.979166 10.702375 8.615296 2.087079
293 4.364023 990 973.878219 966.687506 991.332887 987.137768 989.321286 15.120906 13.106976 2.013931
137 4.248511 115 106.422018 114.653052 106.830406 124.361062 118.754414 7.183716 5.199379 1.984337

C. Per-observation interpretation

df_train.iloc[[1]]
employment_score gpa toeic toeic0 toeic1 toeic2 toeic3 toeic4
1 10.789671 0.355496 935 940.563187 935.72357 939.190519 938.995672 945.376482
predictr.predict(df_train.iloc[[1]])
1    10.539183
Name: employment_score, dtype: float32
auto.explain_rows(
    train_data = df_train,
    model = predictr,
    rows = df_train.iloc[[1]],
    display_rows = False,
    plot='waterfall'
)