#pip install autogluon
02wk-06: Titanic / Autogluon (Fsize)
1. Lecture Video
2. Import
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv
from autogluon.tabular import TabularDataset, TabularPredictor
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
3. Analysis Procedure
A. Data
- Analogy: this step can be likened to obtaining the problems (the exam questions).
!kaggle competitions download -c titanic
!unzip titanic.zip -d ./titanic
df_train = TabularDataset('titanic/train.csv')
df_test = TabularDataset('titanic/test.csv')
!rm titanic.zip
!rm -rf titanic/
Downloading titanic.zip to /home/cgb2/Dropbox/07_lectures/2023-09-MP2023/posts
0%| | 0.00/34.1k [00:00<?, ?B/s]
100%|███████████████████████████████████████| 34.1k/34.1k [00:00<00:00, 395kB/s]
Archive: titanic.zip
inflating: ./titanic/gender_submission.csv
inflating: ./titanic/test.csv
inflating: ./titanic/train.csv
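If this notebook is run as a Kaggle kernel instead, the competition files are already mounted read-only under `/kaggle/input/titanic/` (see the directory listing in the Import section), so the download/unzip/cleanup steps can be skipped. A minimal sketch, assuming that environment:

```python
# Kaggle kernels only: read the competition files straight from the read-only input directory.
df_train = TabularDataset('/kaggle/input/titanic/train.csv')
df_test = TabularDataset('/kaggle/input/titanic/test.csv')
```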
- Feature engineering
df_train.eval('Fsize = SibSp + Parch')
df_test.eval('Fsize = SibSp + Parch')
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fsize | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 0 |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 1 |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 0 |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 0 |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S | 0 |
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | 0 |
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S | 0 |
416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S | 0 |
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C | 2 |
418 rows × 12 columns
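Note that `eval` returns a new DataFrame and leaves `df_train`/`df_test` unchanged, which is why the same `'Fsize = SibSp + Parch'` expression is repeated at fit and predict time below. A small sketch that keeps the engineered copies instead (the `_fe` names are hypothetical and not used elsewhere in this post):

```python
# eval() does not modify the frame in place; store the engineered copies under new names.
df_train_fe = df_train.eval('Fsize = SibSp + Parch')
df_test_fe = df_test.eval('Fsize = SibSp + Parch')
```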
B. Creating the Predictor
- Analogy: this step can be likened to creating the student who will solve the problems.
predictr = TabularPredictor("Survived")
No path specified. Models will be saved in: "AutogluonModels/ag-20231024_084218"
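For reference, the same predictor can be created with the keyword arguments written out; the `eval_metric` and `path` values below are illustrative assumptions, not settings used in this run:

```python
# Equivalent construction with explicit keywords (eval_metric/path are illustrative).
predictr = TabularPredictor(label="Survived", eval_metric="accuracy",
                            path="AutogluonModels/titanic_fsize")
```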
C. Fitting (fit)
- Analogy: this step can be likened to the student studying.
- Training
predictr.fit(df_train.eval('Fsize = SibSp + Parch'))
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/core/utils/utils.py:564: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context("mode.use_inf_as_na", True): # treat None, NaN, INF, NINF as NA
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231024_084218"
AutoGluon Version: 0.8.2
Python Version: 3.10.13
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 265.27 GB / 490.57 GB (54.1%)
Train Data Rows: 891
Train Data Columns: 12
Label Column: Survived
Preprocessing data ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/core/utils/utils.py:564: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context("mode.use_inf_as_na", True): # treat None, NaN, INF, NINF as NA
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/tabular/learner/default_learner.py:215: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context("mode.use_inf_as_na", True): # treat None, NaN, INF, NINF as NA
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 124116.32 MB
Train Data (Original) Memory Usage: 0.32 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/features/generators/fillna.py:58: FutureWarning: The 'downcast' keyword in fillna is deprecated and will be removed in a future version. Use res.infer_objects(copy=False) to infer non-object dtype, or pd.to_numeric with the 'downcast' keyword to downcast numeric results.
X.fillna(self._fillna_feature_map, inplace=True, downcast=False)
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['Name']
CountVectorizer fit with vocabulary size = 8
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 5 | ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fsize']
('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
('object', ['text']) : 1 | ['Name']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked']
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 5 | ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fsize']
('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
('int', ['bool']) : 1 | ['Sex']
('int', ['text_ngram']) : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...]
0.2s = Fit runtime
12 features in original data used to generate 29 features in processed data.
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2417.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
Train Data (Processed) Memory Usage: 0.08 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.22s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
0.648 = Validation score (accuracy)
0.01s = Training runtime
0.02s = Validation runtime
Fitting model: KNeighborsDist ...
0.648 = Validation score (accuracy)
0.01s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMXT ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2059.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
0.8156 = Validation score (accuracy)
0.4s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2059.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
0.8212 = Validation score (accuracy)
0.19s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestGini ...
0.8156 = Validation score (accuracy)
0.42s = Training runtime
0.14s = Validation runtime
Fitting model: RandomForestEntr ...
0.8212 = Validation score (accuracy)
0.54s = Training runtime
0.11s = Validation runtime
Fitting model: CatBoost ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2059.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
0.8268 = Validation score (accuracy)
0.54s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesGini ...
0.8045 = Validation score (accuracy)
0.84s = Training runtime
0.24s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.8101 = Validation score (accuracy)
0.77s = Training runtime
0.18s = Validation runtime
Fitting model: NeuralNetFastAI ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2059.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/tabular/models/fastainn/tabular_nn_fastai.py:192: FutureWarning: The 'downcast' keyword in fillna is deprecated and will be removed in a future version. Use res.infer_objects(copy=False) to infer non-object dtype, or pd.to_numeric with the 'downcast' keyword to downcast numeric results.
df = df.fillna(column_fills, inplace=False, downcast=False)
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/tabular/models/fastainn/tabular_nn_fastai.py:192: FutureWarning: The 'downcast' keyword in fillna is deprecated and will be removed in a future version. Use res.infer_objects(copy=False) to infer non-object dtype, or pd.to_numeric with the 'downcast' keyword to downcast numeric results.
df = df.fillna(column_fills, inplace=False, downcast=False)
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/data/transforms.py:225: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if is_categorical_dtype(col):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/tabular/core.py:233: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if not is_categorical_dtype(c):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/tabular/core.py:233: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if not is_categorical_dtype(c):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/tabular/core.py:233: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if not is_categorical_dtype(c):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/data/transforms.py:225: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if is_categorical_dtype(col):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/tabular/core.py:233: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if not is_categorical_dtype(c):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/tabular/models/fastainn/tabular_nn_fastai.py:192: FutureWarning: The 'downcast' keyword in fillna is deprecated and will be removed in a future version. Use res.infer_objects(copy=False) to infer non-object dtype, or pd.to_numeric with the 'downcast' keyword to downcast numeric results.
df = df.fillna(column_fills, inplace=False, downcast=False)
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/tabular/core.py:233: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if not is_categorical_dtype(c):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/tabular/core.py:233: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if not is_categorical_dtype(c):
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/fastai/tabular/core.py:233: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
if not is_categorical_dtype(c):
0.8324 = Validation score (accuracy)
1.82s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2059.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/xgboost/data.py:440: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.
if is_sparse(data):
0.8268 = Validation score (accuracy)
0.2s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2059.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
0.8324 = Validation score (accuracy)
3.7s = Training runtime
0.02s = Validation runtime
Fitting model: LightGBMLarge ...
/home/cgb2/anaconda3/envs/ag/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '2059.259259259259' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
memory_usage[column] = (
0.8324 = Validation score (accuracy)
0.4s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8547 = Validation score (accuracy)
0.48s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 11.61s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231024_084218")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f29d8593b80>
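Called with no extra arguments, `fit` trains the default set of models shown in the log above. To bound the runtime or trade time for accuracy, `fit` also accepts `time_limit` (in seconds) and `presets`; the values below are illustrative assumptions, not what was used here:

```python
# Hedged sketch: cap total training time and request stronger ensembling.
predictr.fit(df_train.eval('Fsize = SibSp + Parch'),
             time_limit=120, presets='best_quality')
```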
- Check the leaderboard (grading the practice exam)
predictr.leaderboard()
model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order | |
---|---|---|---|---|---|---|---|---|---|
0 | WeightedEnsemble_L2 | 0.860335 | 0.016608 | 2.934600 | 0.000400 | 0.236214 | 2 | True | 14 |
1 | NeuralNetTorch | 0.837989 | 0.008671 | 1.382263 | 0.008671 | 1.382263 | 1 | True | 12 |
2 | LightGBMLarge | 0.832402 | 0.003077 | 0.377827 | 0.003077 | 0.377827 | 1 | True | 13 |
3 | NeuralNetFastAI | 0.832402 | 0.007537 | 1.316123 | 0.007537 | 1.316123 | 1 | True | 10 |
4 | CatBoost | 0.826816 | 0.003649 | 0.528946 | 0.003649 | 0.528946 | 1 | True | 7 |
5 | XGBoost | 0.826816 | 0.004545 | 0.150251 | 0.004545 | 0.150251 | 1 | True | 11 |
6 | LightGBM | 0.821229 | 0.003294 | 0.180805 | 0.003294 | 0.180805 | 1 | True | 4 |
7 | RandomForestEntr | 0.821229 | 0.194986 | 0.478185 | 0.194986 | 0.478185 | 1 | True | 6 |
8 | LightGBMXT | 0.815642 | 0.003372 | 0.198229 | 0.003372 | 0.198229 | 1 | True | 3 |
9 | RandomForestGini | 0.815642 | 0.097445 | 0.306731 | 0.097445 | 0.306731 | 1 | True | 5 |
10 | ExtraTreesEntr | 0.810056 | 0.076015 | 0.512563 | 0.076015 | 0.512563 | 1 | True | 9 |
11 | ExtraTreesGini | 0.804469 | 0.118582 | 0.881634 | 0.118582 | 0.881634 | 1 | True | 8 |
12 | KNeighborsDist | 0.648045 | 0.001869 | 0.007722 | 0.001869 | 0.007722 | 1 | True | 2 |
13 | KNeighborsUnif | 0.648045 | 0.004271 | 0.092099 | 0.004271 | 0.092099 | 1 | True | 1 |
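`leaderboard()` without arguments reports each model's internal validation score (`score_val`). Passing a labelled dataset re-scores every model on that data; a sketch using the training frame (any labelled frame with the same columns would do):

```python
# Re-score every trained model on an explicitly supplied labelled dataset.
predictr.leaderboard(df_train.eval('Fsize = SibSp + Parch'), silent=True)
```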
D. Prediction (predict)
- Analogy: this step can be likened to solving problems after studying.
- Predict on the training set \(\to\) check the score
(df_train.Survived == predictr.predict(df_train.eval('Fsize = SibSp + Parch'))).mean()
0.9169472502805837
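The manual comparison above reproduces the default metric by hand; `evaluate` should give the same accuracy (plus a few extra metrics) in one call. A hedged equivalent:

```python
# Returns a dict of metrics, e.g. {'accuracy': ..., 'balanced_accuracy': ..., ...}
predictr.evaluate(df_train.eval('Fsize = SibSp + Parch'))
```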
- Predict on the test set \(\to\) submit the results to Kaggle to check the score
# df_test.assign(Survived = predictr.predict(df_test.eval('Fsize = SibSp + Parch'))).loc[:,['PassengerId','Survived']]\
# .to_csv("autogluon(Fsize)_submission.csv",index=False)
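Once the CSV has been written, it can be submitted through the Kaggle web UI or with the same CLI used for the download; a sketch (the submission message is arbitrary):

```python
# Submit the prediction file to the Titanic competition via the Kaggle CLI.
!kaggle competitions submit -c titanic -f "autogluon(Fsize)_submission.csv" -m "autogluon + Fsize"
```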