#!pip install autogluon.eda
14wk-61: NLP with Disaster Tweets / Data Analysis (AutoGluon)
1. Lecture Video
2. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#---#
from autogluon.tabular import TabularPredictor
#---#
import warnings
warnings.filterwarnings('ignore')
3. Data
!kaggle competitions download -c nlp-getting-started
Downloading nlp-getting-started.zip to /home/cgb2/Dropbox/07_lectures/2023-09-MP2023/posts
100%|████████████████████████████████████████| 593k/593k [00:00<00:00, 2.47MB/s]
100%|████████████████████████████████████████| 593k/593k [00:00<00:00, 2.46MB/s]
!unzip nlp-getting-started.zip -d data
Archive: nlp-getting-started.zip
inflating: data/sample_submission.csv
inflating: data/test.csv
inflating: data/train.csv
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')
!rm -rf data
!rm nlp-getting-started.zip
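Before modeling, a quick sanity check on the loaded frames is useful: it confirms the train/test split sizes and shows how sparse `keyword` and `location` really are (the `head()` views below suggest they are mostly `NaN`). A minimal sketch using standard pandas, not part of the original notebook:

```python
# train should have exactly one more column (the target) than test
print(df_train.shape, df_test.shape)

# count missing values per column; keyword/location are expected to be sparse
print(df_train.isna().sum())
```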
4. Analysis
df_train.head()
|   | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
df_test.head()
|   | id | keyword | location | text |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | Just happened a terrible car crash |
| 1 | 2 | NaN | NaN | Heard about #earthquake is different cities, s... |
| 2 | 3 | NaN | NaN | there is a forest fire at spot pond, geese are... |
| 3 | 9 | NaN | NaN | Apocalypse lighting. #Spokane #wildfires |
| 4 | 11 | NaN | NaN | Typhoon Soudelor kills 28 in China and Taiwan |
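As the log below shows, AutoGluon infers a binary problem from `target` and evaluates with accuracy by default. Accuracy is easy to misread on skewed labels, so it is worth checking the class balance by hand first. A minimal sketch:

```python
# relative frequency of each label in the training data
print(df_train['target'].value_counts(normalize=True))
```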
# step1 -- pass (no separate preprocessing; AutoGluon's feature generators handle it)
# step2 -- create the predictor
predictr = TabularPredictor(label='target')
# step3 -- fit
predictr.fit(df_train, num_gpus=1)
# step4 -- predict
yhat = predictr.predict(df_test)
No path specified. Models will be saved in: "AutogluonModels/ag-20231207_031213"
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
presets='best_quality' : Maximize accuracy. Default time_limit=3600.
presets='high_quality' : Strong accuracy with fast inference speed. Default time_limit=3600.
presets='good_quality' : Good accuracy with very fast inference speed. Default time_limit=3600.
presets='medium_quality' : Fast training time, ideal for initial prototyping.
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231207_031213"
=================== System Info ===================
AutoGluon Version: 1.0.0
Python Version: 3.10.13
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
CPU Count: 16
Memory Avail: 117.71 GB / 125.71 GB (93.6%)
Disk Space Avail: 217.68 GB / 456.88 GB (47.6%)
===================================================
Train Data Rows: 7613
Train Data Columns: 4
Label Column: target
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [1, 0]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Problem Type: binary
Preprocessing data ...
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 120538.91 MB
Train Data (Original) Memory Usage: 2.19 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['text']
CountVectorizer fit with vocabulary size = 641
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 1 | ['id']
('object', []) : 2 | ['keyword', 'location']
('object', ['text']) : 1 | ['text']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['keyword', 'location']
('category', ['text_as_category']) : 1 | ['text']
('int', []) : 1 | ['id']
('int', ['binned', 'text_special']) : 28 | ['text.char_count', 'text.word_count', 'text.capital_ratio', 'text.lower_ratio', 'text.digit_ratio', ...]
('int', ['text_ngram']) : 631 | ['__nlp__.05', '__nlp__.08', '__nlp__.10', '__nlp__.11', '__nlp__.15', ...]
3.3s = Fit runtime
4 features in original data used to generate 663 features in processed data.
Train Data (Processed) Memory Usage: 9.46 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 3.3s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 6851, Val Rows: 762
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
0.7283 = Validation score (accuracy)
0.07s = Training runtime
0.04s = Validation runtime
Fitting model: KNeighborsDist ...
0.7402 = Validation score (accuracy)
0.07s = Training runtime
0.04s = Validation runtime
Fitting model: LightGBMXT ...
Training LightGBMXT with GPU, note that this may negatively impact model quality compared to CPU training.
[LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
Warning: GPU mode might not be installed for LightGBM, GPU training raised an exception. Falling back to CPU training...Refer to LightGBM GPU documentation: https://github.com/Microsoft/LightGBM/tree/master/python-package#build-gpu-versionOne possible method is: pip uninstall lightgbm -y pip install lightgbm --install-option=--gpu
0.7953 = Validation score (accuracy)
0.58s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ...
Training LightGBM with GPU, note that this may negatively impact model quality compared to CPU training.
Warning: GPU mode might not be installed for LightGBM, GPU training raised an exception. Falling back to CPU training...Refer to LightGBM GPU documentation: https://github.com/Microsoft/LightGBM/tree/master/python-package#build-gpu-versionOne possible method is: pip uninstall lightgbm -y pip install lightgbm --install-option=--gpu
[LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
0.7992 = Validation score (accuracy)
0.75s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ...
0.7808 = Validation score (accuracy)
0.99s = Training runtime
0.05s = Validation runtime
Fitting model: RandomForestEntr ...
0.7703 = Validation score (accuracy)
1.0s = Training runtime
0.3s = Validation runtime
Fitting model: CatBoost ...
Training CatBoost with GPU, note that this may negatively impact model quality compared to CPU training.
Warning: CatBoost on GPU is experimental. If you encounter issues, use CPU for training CatBoost instead.
0.8018 = Validation score (accuracy)
4.63s = Training runtime
0.02s = Validation runtime
Fitting model: ExtraTreesGini ...
0.7874 = Validation score (accuracy)
1.13s = Training runtime
0.05s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.7835 = Validation score (accuracy)
1.15s = Training runtime
0.05s = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 2: early stopping
0.7808 = Validation score (accuracy)
7.44s = Training runtime
0.02s = Validation runtime
Fitting model: XGBoost ...
0.7953 = Validation score (accuracy)
0.99s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetTorch ...
0.727 = Validation score (accuracy)
17.46s = Training runtime
0.02s = Validation runtime
Fitting model: LightGBMLarge ...
Training LightGBMLarge with GPU, note that this may negatively impact model quality compared to CPU training.
[LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
Warning: GPU mode might not be installed for LightGBM, GPU training raised an exception. Falling back to CPU training...Refer to LightGBM GPU documentation: https://github.com/Microsoft/LightGBM/tree/master/python-package#build-gpu-versionOne possible method is: pip uninstall lightgbm -y pip install lightgbm --install-option=--gpu
0.8071 = Validation score (accuracy)
2.71s = Training runtime
0.02s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
Ensemble Weights: {'LightGBMLarge': 0.265, 'XGBoost': 0.163, 'CatBoost': 0.143, 'NeuralNetFastAI': 0.143, 'RandomForestEntr': 0.102, 'LightGBMXT': 0.082, 'LightGBM': 0.041, 'ExtraTreesGini': 0.041, 'KNeighborsDist': 0.02}
0.8268 = Validation score (accuracy)
0.59s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 44.0s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231207_031213")
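The per-model validation scores scattered through the log above can also be collected into a single DataFrame with the predictor's `leaderboard()` method, which makes comparing models easier than rereading the log:

```python
# one row per trained model: validation score, fit time, inference time, etc.
lb = predictr.leaderboard()
print(lb[['model', 'score_val', 'fit_time']])
```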
5. Submission
sample_submission
|   | id | target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 0 |
| 3 | 9 | 0 |
| 4 | 11 | 0 |
| ... | ... | ... |
| 3258 | 10861 | 0 |
| 3259 | 10865 | 0 |
| 3260 | 10868 | 0 |
| 3261 | 10874 | 0 |
| 3262 | 10875 | 0 |

3263 rows × 2 columns
sample_submission['target'] = yhat
sample_submission.to_csv("submission.csv", index=False)
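A quick guard before submitting (a hypothetical check, not in the original notebook): the predictions should line up with the submission template row for row.

```python
# yhat was produced from df_test, so its length must match the template
assert len(yhat) == len(sample_submission)
# distribution of the submitted predictions
print(sample_submission['target'].value_counts())
```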
!kaggle competitions submit -c nlp-getting-started -f submission.csv -m "AutoGluon, TabularPredictor"
100%|██████████████████████████████████████| 22.2k/22.2k [00:02<00:00, 10.4kB/s]
Successfully submitted to Natural Language Processing with Disaster Tweets
Leaderboard rank: 761 / 1094
Public score (F1): 0.6956124314442413
Not great..
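One detail worth revisiting from the training log: the `id` column was kept as an integer feature (`('int', []) : 1 | ['id']`), and a row identifier carries no real signal for this task. A hedged follow-up experiment (not run here; `predictr2` and `yhat2` are hypothetical names) would drop `id` before fitting and use one of the presets the log recommends:

```python
# assumption: id is pure noise for this problem, so exclude it from training
predictr2 = TabularPredictor(label='target')
predictr2.fit(df_train.drop(columns=['id']), presets='good_quality', num_gpus=1)
yhat2 = predictr2.predict(df_test.drop(columns=['id']))
```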