#!pip install autogluon.eda
13wk-52: Employment (Overfitting) / Data Analysis (AutoGluon)
1. Lecture video
2. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')
3. Data
np.random.seed(43052)  # fix the seed so the simulated data is reproducible
n_balance = 10
toeic = np.random.randint(0,199,size=5000)*5
gpa = np.random.randint(100,450,size=5000)/100
u = toeic * 8/995 + gpa * 10/4.5
u = u - np.mean(u)
v = np.exp(u)/(1+np.exp(u))
employment = np.random.binomial(n=1,p=v)
df = pd.DataFrame({
    'toiec':toeic,
    'gpa':gpa,
    'employment':employment
})
df_balance = pd.DataFrame((np.random.randn(5000,n_balance)).reshape(5000,n_balance)*1,columns = ['balance'+str(i) for i in range(n_balance)]) > 0
df = pd.concat([df,df_balance],axis=1).assign(employment = lambda df: df.employment.map({0:'No',1:'Yes'}))
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size=0.7, random_state=42)
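In the data-generating code of this section, employment depends only on `toiec` and `gpa` through a logistic link, while the `balance*` columns are independent coin flips. The following is a minimal sketch of that structure, assuming a smaller sample size and a hypothetical helper named `simulate` (neither is part of the original notebook):

```python
import numpy as np

def simulate(n=1000, n_balance=10, seed=0):
    """Hypothetical helper mirroring the data-generating process of this section."""
    rng = np.random.default_rng(seed)
    toeic = rng.integers(0, 199, size=n) * 5            # scores 0, 5, ..., 990
    gpa = rng.integers(100, 450, size=n) / 100          # grades 1.00, ..., 4.49
    u = toeic * 8 / 995 + gpa * 10 / 4.5                # linear score
    u = u - u.mean()                                    # center the score
    p = 1 / (1 + np.exp(-u))                            # same as exp(u) / (1 + exp(u))
    employment = rng.binomial(n=1, p=p)                 # 0/1 label
    balance = rng.standard_normal((n, n_balance)) > 0   # pure-noise boolean columns
    return toeic, gpa, p, employment, balance

toeic, gpa, p, employment, balance = simulate()

# The noise columns should be almost uncorrelated with the label.
corr = [abs(np.corrcoef(balance[:, j].astype(float), employment)[0, 1])
        for j in range(balance.shape[1])]
print(max(corr))
```

Any predictive signal a model finds in the `balance*` columns is therefore noise picked up from the training sample.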
df_train
 | toiec | gpa | employment | balance0 | balance1 | balance2 | balance3 | balance4 | balance5 | balance6 | balance7 | balance8 | balance9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4431 | 195 | 2.50 | No | True | False | False | False | True | False | True | False | False | False |
2162 | 410 | 3.74 | Yes | True | False | True | True | False | True | False | False | False | False |
2396 | 940 | 1.07 | No | True | False | False | False | False | False | True | True | True | True |
4768 | 785 | 4.21 | Yes | False | True | False | True | False | False | True | True | True | True |
2271 | 965 | 3.32 | Yes | True | False | True | False | False | True | True | False | False | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4426 | 745 | 3.19 | Yes | True | True | False | True | True | False | True | False | True | True |
466 | 790 | 3.82 | Yes | True | True | False | True | False | False | True | False | True | False |
3092 | 800 | 1.41 | No | False | True | False | False | False | False | False | True | False | True |
3772 | 290 | 1.35 | No | False | False | True | True | False | False | True | True | False | True |
860 | 965 | 4.46 | Yes | False | False | True | True | False | False | False | True | False | True |
1500 rows × 13 columns
4. Fitting
# step1 -- pass
# step2
predictr = TabularPredictor(label='employment')
# step3
predictr.fit(df_train)
# step4
yhat = predictr.predict(df_train)
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_110823/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231201_110823/"
AutoGluon Version: 0.8.2
Python Version: 3.10.13
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 248.41 GB / 490.57 GB (50.6%)
Train Data Rows: 1500
Train Data Columns: 12
Label Column: employment
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: ['No', 'Yes']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = Yes, class 0 = No
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (Yes) vs negative (No) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 126338.85 MB
Train Data (Original) Memory Usage: 0.04 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 10 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('bool', []) : 10 | ['balance0', 'balance1', 'balance2', 'balance3', 'balance4', ...]
('float', []) : 1 | ['gpa']
('int', []) : 1 | ['toiec']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1 | ['gpa']
('int', []) : 1 | ['toiec']
('int', ['bool']) : 10 | ['balance0', 'balance1', 'balance2', 'balance3', 'balance4', ...]
0.0s = Fit runtime
12 features in original data used to generate 12 features in processed data.
Train Data (Processed) Memory Usage: 0.04 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.04s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1200, Val Rows: 300
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
0.72 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
0.7333 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.8733 = Validation score (accuracy)
0.21s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
0.8533 = Validation score (accuracy)
0.21s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestGini ...
0.8567 = Validation score (accuracy)
0.45s = Training runtime
0.05s = Validation runtime
Fitting model: RandomForestEntr ...
0.8633 = Validation score (accuracy)
0.59s = Training runtime
0.19s = Validation runtime
Fitting model: CatBoost ...
0.8667 = Validation score (accuracy)
0.31s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesGini ...
0.8567 = Validation score (accuracy)
0.89s = Training runtime
0.05s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.85 = Validation score (accuracy)
0.72s = Training runtime
0.1s = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 2: early stopping
0.8667 = Validation score (accuracy)
1.23s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
0.8433 = Validation score (accuracy)
0.16s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
0.8667 = Validation score (accuracy)
1.83s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMLarge ...
0.8367 = Validation score (accuracy)
0.47s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8933 = Validation score (accuracy)
0.47s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 8.22s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231201_110823/")
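Step 4 predicts on the training rows, so accuracy computed from `yhat` alone says little about overfitting. Below is a minimal sketch of the usual train-versus-test comparison, using a hypothetical `accuracy` helper and made-up toy labels rather than the fitted predictor. With the objects from this section, the inputs would be `df_train.employment` against `predictr.predict(df_train)` and `df_test.employment` against `predictr.predict(df_test)`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy labels, made up for illustration (not results from this section).
y_train    = ['No', 'Yes', 'Yes', 'No', 'Yes']
yhat_train = ['No', 'Yes', 'Yes', 'No', 'Yes']
y_test     = ['No', 'Yes', 'No', 'Yes', 'Yes']
yhat_test  = ['No', 'Yes', 'Yes', 'Yes', 'No']

train_acc = accuracy(y_train, yhat_train)
test_acc = accuracy(y_test, yhat_test)
gap = train_acc - test_acc   # a large positive gap is a sign of overfitting
print(train_acc, test_acc, gap)
```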
5. Interpretation and visualization
A. Distribution of y and visualization of the relationship between (X, y)
auto.target_analysis(
    train_data=df_train,
    label='employment',
    fit_distributions=False
)
Target variable analysis
 | count | unique | top | freq | dtypes | missing_count | missing_ratio | raw_type | special_types |
---|---|---|---|---|---|---|---|---|---|
employment | 1500 | 2 | No | 756 | object | | | object | |
Target variable correlations
train_data - spearman correlation matrix; focus: absolute correlation for employment >= 0.5
Feature interaction between toiec/employment in train_data
B. Important explanatory variables
auto.quick_fit(
    train_data=df_train,
    label='employment',
    show_feature_importance_barplots=True
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231201_111117/"
Model Prediction for employment
Using validation data for Test points
Model Leaderboard
 | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LightGBMXT | 0.833333 | 0.87619 | 0.00083 | 0.001549 | 0.218086 | 0.00083 | 0.001549 | 0.218086 | 1 | True | 1 |
Feature Importance for Trained Model
 | importance | stddev | p_value | n | p99_high | p99_low |
---|---|---|---|---|---|---|
gpa | 0.188889 | 0.015947 | 0.000006 | 5 | 0.221725 | 0.156053 |
toiec | 0.185778 | 0.020882 | 0.000019 | 5 | 0.228774 | 0.142782 |
balance7 | 0.002667 | 0.003651 | 0.088904 | 5 | 0.010185 | -0.004852 |
balance6 | 0.001778 | 0.009081 | 0.342087 | 5 | 0.020476 | -0.016921 |
balance4 | 0.000889 | 0.005116 | 0.358714 | 5 | 0.011423 | -0.009645 |
balance8 | -0.001333 | 0.002534 | 0.847721 | 5 | 0.003884 | -0.006550 |
balance0 | -0.004444 | 0.003143 | 0.982945 | 5 | 0.002026 | -0.010915 |
balance5 | -0.005778 | 0.004332 | 0.979679 | 5 | 0.003142 | -0.014697 |
balance1 | -0.006222 | 0.003296 | 0.993267 | 5 | 0.000564 | -0.013009 |
balance3 | -0.006667 | 0.004157 | 0.988475 | 5 | 0.001893 | -0.015227 |
balance9 | -0.007556 | 0.001217 | 0.999922 | 5 | -0.005049 | -0.010062 |
balance2 | -0.011111 | 0.003514 | 0.998945 | 5 | -0.003876 | -0.018346 |
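The importance values in the table are of the permutation type: a feature's importance is the drop in score when that column is shuffled, averaged over several shuffles (the `n` column). The following is a minimal sketch of the idea on toy data, assuming a hypothetical `permutation_importance` helper and a fixed toy "model"; it is not AutoGluon's implementation.

```python
import numpy as np

def permutation_importance(predict, X, y, n_shuffles=5, seed=0):
    """Mean accuracy drop per column after shuffling that column."""
    rng = np.random.default_rng(seed)
    base = np.mean(predict(X) == y)              # score on the intact data
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_shuffles):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j]) # break the link between column j and y
            drops.append(base - np.mean(predict(Xp) == y))
        importances.append(np.mean(drops))
    return np.array(importances)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
y = (X[:, 0] > 0).astype(int)                    # label depends on column 0 only

def predict(X):
    """Toy model that looks at column 0 only."""
    return (X[:, 0] > 0).astype(int)

imp = permutation_importance(predict, X, y)
print(imp)
```

Read this way, the near-zero or negative values for the `balance*` columns mean that shuffling them did not hurt the score, which is consistent with those columns being pure noise.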
Rows with the highest prediction error
Rows in this category worth inspecting for the causes of the error
 | toiec | gpa | balance0 | balance1 | balance2 | balance3 | balance4 | balance5 | balance6 | balance7 | balance8 | balance9 | employment | No | Yes | error |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3031 | 410 | 2.34 | True | True | False | False | False | True | True | True | True | True | Yes | 0.670908 | 0.329092 | 0.341816 |
3562 | 420 | 2.23 | False | True | True | True | True | True | False | False | False | True | Yes | 0.629780 | 0.370220 | 0.259560 |
3521 | 560 | 3.82 | False | False | False | True | False | False | True | True | True | False | No | 0.381926 | 0.618074 | 0.236148 |
186 | 470 | 2.20 | True | True | False | False | False | False | True | True | False | False | Yes | 0.615211 | 0.384789 | 0.230421 |
137 | 615 | 1.45 | True | False | True | False | True | True | False | False | False | False | Yes | 0.610440 | 0.389560 | 0.220880 |
3149 | 855 | 2.23 | False | True | True | True | True | False | True | True | True | True | No | 0.390826 | 0.609174 | 0.218348 |
4637 | 940 | 1.85 | False | False | False | True | True | False | True | False | False | True | No | 0.419480 | 0.580520 | 0.161039 |
4517 | 155 | 3.73 | False | False | False | True | True | False | False | True | True | True | Yes | 0.577100 | 0.422900 | 0.154201 |
1449 | 965 | 2.01 | False | True | True | True | True | True | True | False | False | False | No | 0.434479 | 0.565521 | 0.131042 |
2106 | 230 | 2.65 | True | False | False | False | True | False | False | True | True | False | Yes | 0.564785 | 0.435215 | 0.129570 |
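The `error` column appears to equal the gap between the probability assigned to the wrong class and the probability assigned to the true class; for row 3031, 0.670908 - 0.329092 = 0.341816. This reading is inferred from the displayed values, not taken from the AutoGluon documentation. A minimal sketch under that assumption, with a hypothetical helper `prediction_error`:

```python
def prediction_error(p_no, p_yes, truth):
    """Probability of the other class minus probability of the true class."""
    p_true = p_yes if truth == 'Yes' else p_no
    p_other = p_no if truth == 'Yes' else p_yes
    return p_other - p_true

# (index, truth, P(No), P(Yes), error as displayed) copied from the table above.
rows = [
    (3031, 'Yes', 0.670908, 0.329092, 0.341816),
    (3521, 'No',  0.381926, 0.618074, 0.236148),
]
for idx, truth, p_no, p_yes, shown in rows:
    print(idx, round(prediction_error(p_no, p_yes, truth), 6), shown)
```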
Rows with the least distance vs other class
Rows in this category are the closest to the decision boundary vs the other class and are good candidates for additional labeling
 | toiec | gpa | balance0 | balance1 | balance2 | balance3 | balance4 | balance5 | balance6 | balance7 | balance8 | balance9 | employment | No | Yes | error |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3776 | 990 | 1.30 | True | True | True | False | False | False | False | False | True | True | Yes | 0.500001 | 0.499999 | 0.000001 |
3507 | 855 | 1.86 | False | True | False | False | False | False | False | True | False | True | No | 0.499844 | 0.500156 | 0.000311 |
2553 | 80 | 4.36 | True | True | True | False | False | True | True | True | True | True | No | 0.499721 | 0.500279 | 0.000558 |
249 | 840 | 1.95 | True | True | False | True | False | False | True | True | False | False | Yes | 0.501248 | 0.498752 | 0.002496 |
2987 | 140 | 4.08 | True | False | True | True | False | True | False | False | True | True | No | 0.497140 | 0.502860 | 0.005720 |
4232 | 15 | 3.78 | False | True | True | True | True | True | True | False | False | False | Yes | 0.502909 | 0.497091 | 0.005819 |
872 | 875 | 1.29 | False | False | False | True | True | False | True | True | False | False | Yes | 0.503659 | 0.496341 | 0.007318 |
3696 | 30 | 4.31 | False | False | True | True | True | False | False | False | True | False | No | 0.495588 | 0.504412 | 0.008823 |
2568 | 565 | 2.36 | False | True | False | False | False | False | False | False | True | True | Yes | 0.505832 | 0.494168 | 0.011664 |
2681 | 660 | 2.28 | False | True | False | False | False | True | False | True | False | True | No | 0.494135 | 0.505865 | 0.011729 |
C. Interpretation by observation
- Observation 0
df_train.iloc[[0]]
 | toiec | gpa | employment | balance0 | balance1 | balance2 | balance3 | balance4 | balance5 | balance6 | balance7 | balance8 | balance9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4431 | 195 | 2.5 | No | True | False | False | False | True | False | True | False | False | False |
predictr.predict(df_train.iloc[[0]])
4431 No
Name: employment, dtype: object
predictr.predict_proba(df_train.iloc[[0]])
 | No | Yes |
---|---|---|
4431 | 0.814248 | 0.185752 |
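For a binary problem, the label returned by `predict` is normally the class with the larger value in `predict_proba`. A minimal sketch using the probabilities shown for row 4431; the helper name `label_from_proba` is hypothetical:

```python
def label_from_proba(proba):
    """Return the class name with the highest probability."""
    return max(proba, key=proba.get)

proba_4431 = {'No': 0.814248, 'Yes': 0.185752}   # values from the table above
print(label_from_proba(proba_4431))              # → No
```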
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[0]]*1,
    display_rows=True,
    plot='waterfall'
)
 | toiec | gpa | employment | balance0 | balance1 | balance2 | balance3 | balance4 | balance5 | balance6 | balance7 | balance8 | balance9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4431 | 195 | 2.5 | No | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
# Why this applicant was not hired
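A waterfall plot splits one prediction into per-feature contributions relative to an average case. As a rough analogue only, and not the computation `auto.explain_rows` performs, the sketch below applies the same idea to the known data-generating model from the Data section: each feature contributes its coefficient times its deviation from the mean, on the log-odds scale. The population means used here are assumptions derived from the sampling ranges.

```python
import math

# Coefficients of the data-generating process in the Data section.
coef = {'toiec': 8 / 995, 'gpa': 10 / 4.5}
# Assumed population means: toiec uniform on 0, 5, ..., 990; gpa on 1.00, ..., 4.49.
mean = {'toiec': 495.0, 'gpa': 2.745}

def contributions(row):
    """Log-odds contribution of each feature relative to the average applicant."""
    return {k: coef[k] * (row[k] - mean[k]) for k in coef}

row0 = {'toiec': 195, 'gpa': 2.5}          # observation 0 (index 4431)
contrib = contributions(row0)
log_odds = sum(contrib.values())           # total shift away from the average
prob_yes = 1 / (1 + math.exp(-log_odds))   # implied P(employment = Yes)
print(contrib, prob_yes)
```

Both contributions are negative for this row, and the `toiec` term is the larger of the two in magnitude.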
- Observation 1
df_train.iloc[[1]]
 | toiec | gpa | employment | balance0 | balance1 | balance2 | balance3 | balance4 | balance5 | balance6 | balance7 | balance8 | balance9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2162 | 410 | 3.74 | Yes | True | False | True | True | False | True | False | False | False | False |
predictr.predict(df_train.iloc[[1]])
2162 Yes
Name: employment, dtype: object
predictr.predict_proba(df_train.iloc[[1]])
 | No | Yes |
---|---|---|
2162 | 0.347349 | 0.652651 |
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[1]]*1,
    display_rows=True,
    plot='waterfall'
)
 | toiec | gpa | employment | balance0 | balance1 | balance2 | balance3 | balance4 | balance5 | balance6 | balance7 | balance8 | balance9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2162 | 410 | 3.74 | Yes | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
# Why this applicant was hired