14wk-61: NLP with Disaster Tweets / Data Analysis (AutoGluon)

Author

최규빈

Published

December 1, 2023

1. Lecture Video


2. Imports

      #!pip install autogluon.eda
      import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt 
      #---#
      from autogluon.tabular import TabularPredictor
      #---#
      import warnings
      warnings.filterwarnings('ignore')

3. Data

      !kaggle competitions download -c nlp-getting-started
      Downloading nlp-getting-started.zip to /home/cgb2/Dropbox/07_lectures/2023-09-MP2023/posts
      100%|████████████████████████████████████████| 593k/593k [00:00<00:00, 2.47MB/s]
      !unzip nlp-getting-started.zip -d data 
      Archive:  nlp-getting-started.zip
        inflating: data/sample_submission.csv  
        inflating: data/test.csv           
        inflating: data/train.csv          
      df_train = pd.read_csv('data/train.csv')
      df_test = pd.read_csv('data/test.csv')
      sample_submission = pd.read_csv('data/sample_submission.csv')
      !rm -rf data
      !rm nlp-getting-started.zip
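
      A quick sanity check on what was loaded (a minimal sketch; the row count is confirmed later by the AutoGluon log, which reports 7613 training rows):

      # shapes and columns of the loaded frames
      print(df_train.shape)             # (7613, 5): 4 feature columns + the label
      print(df_test.shape)
      print(df_train.columns.tolist())  # ['id', 'keyword', 'location', 'text', 'target']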

4. Analysis

      df_train.head()
         id keyword location                                               text  target
      0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...       1
      1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada       1
      2   5     NaN      NaN  All residents asked to 'shelter in place' are ...       1
      3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...       1
      4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...       1
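
      Since the problem is binary, the label balance is worth a glance before fitting (a sketch; the resulting counts are not shown in this post):

      # distribution of the target (1 = disaster tweet, 0 = not)
      df_train['target'].value_counts(normalize=True)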
      df_test.head()
         id keyword location                                               text
      0   0     NaN      NaN                 Just happened a terrible car crash
      1   2     NaN      NaN  Heard about #earthquake is different cities, s...
      2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
      3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
      4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan
      # step1 -- pass (the data is already tabular; no extra preprocessing here)
      # step2 -- create the predictor
      predictr = TabularPredictor(label='target')
      # step3 -- fit
      predictr.fit(df_train, num_gpus=1)
      # step4 -- predict
      yhat = predictr.predict(df_test)
      No path specified. Models will be saved in: "AutogluonModels/ag-20231207_031213"
      No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
          Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
          presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
          presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
          presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
          presets='medium_quality' : Fast training time, ideal for initial prototyping.
      Beginning AutoGluon training ...
      AutoGluon will save models to "AutogluonModels/ag-20231207_031213"
      =================== System Info ===================
      AutoGluon Version:  1.0.0
      Python Version:     3.10.13
      Operating System:   Linux
      Platform Machine:   x86_64
      Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
      CPU Count:          16
      Memory Avail:       117.71 GB / 125.71 GB (93.6%)
      Disk Space Avail:   217.68 GB / 456.88 GB (47.6%)
      ===================================================
      Train Data Rows:    7613
      Train Data Columns: 4
      Label Column:       target
      AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
          2 unique label values:  [1, 0]
          If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
      Problem Type:       binary
      Preprocessing data ...
      Selected class <--> label mapping:  class 1 = 1, class 0 = 0
      Using Feature Generators to preprocess the data ...
      Fitting AutoMLPipelineFeatureGenerator...
          Available Memory:                    120538.91 MB
          Train Data (Original)  Memory Usage: 2.19 MB (0.0% of available memory)
          Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
          Stage 1 Generators:
              Fitting AsTypeFeatureGenerator...
          Stage 2 Generators:
              Fitting FillNaFeatureGenerator...
          Stage 3 Generators:
              Fitting IdentityFeatureGenerator...
              Fitting CategoryFeatureGenerator...
                  Fitting CategoryMemoryMinimizeFeatureGenerator...
              Fitting TextSpecialFeatureGenerator...
                  Fitting BinnedFeatureGenerator...
                  Fitting DropDuplicatesFeatureGenerator...
              Fitting TextNgramFeatureGenerator...
                  Fitting CountVectorizer for text features: ['text']
                  CountVectorizer fit with vocabulary size = 641
          Stage 4 Generators:
              Fitting DropUniqueFeatureGenerator...
          Stage 5 Generators:
              Fitting DropDuplicatesFeatureGenerator...
          Types of features in original data (raw dtype, special dtypes):
              ('int', [])          : 1 | ['id']
              ('object', [])       : 2 | ['keyword', 'location']
              ('object', ['text']) : 1 | ['text']
          Types of features in processed data (raw dtype, special dtypes):
              ('category', [])                    :   2 | ['keyword', 'location']
              ('category', ['text_as_category'])  :   1 | ['text']
              ('int', [])                         :   1 | ['id']
              ('int', ['binned', 'text_special']) :  28 | ['text.char_count', 'text.word_count', 'text.capital_ratio', 'text.lower_ratio', 'text.digit_ratio', ...]
              ('int', ['text_ngram'])             : 631 | ['__nlp__.05', '__nlp__.08', '__nlp__.10', '__nlp__.11', '__nlp__.15', ...]
          3.3s = Fit runtime
          4 features in original data used to generate 663 features in processed data.
          Train Data (Processed) Memory Usage: 9.46 MB (0.0% of available memory)
      Data preprocessing and feature engineering runtime = 3.3s ...
      AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
          To change this, specify the eval_metric parameter of Predictor()
      Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 6851, Val Rows: 762
      User-specified model hyperparameters to be fit:
      {
          'NN_TORCH': {},
          'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
          'CAT': {},
          'XGB': {},
          'FASTAI': {},
          'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
          'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
          'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
      }
      Fitting 13 L1 models ...
      Fitting model: KNeighborsUnif ...
          0.7283   = Validation score   (accuracy)
          0.07s    = Training   runtime
          0.04s    = Validation runtime
      Fitting model: KNeighborsDist ...
          0.7402   = Validation score   (accuracy)
          0.07s    = Training   runtime
          0.04s    = Validation runtime
      Fitting model: LightGBMXT ...
          Training LightGBMXT with GPU, note that this may negatively impact model quality compared to CPU training.
      [LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
      Please recompile with CMake option -DUSE_GPU=1
      Warning: GPU mode might not be installed for LightGBM, GPU training raised an exception. Falling back to CPU training...Refer to LightGBM GPU documentation: https://github.com/Microsoft/LightGBM/tree/master/python-package#build-gpu-versionOne possible method is:  pip uninstall lightgbm -y   pip install lightgbm --install-option=--gpu
          0.7953   = Validation score   (accuracy)
          0.58s    = Training   runtime
          0.01s    = Validation runtime
      Fitting model: LightGBM ...
          Training LightGBM with GPU, note that this may negatively impact model quality compared to CPU training.
      Warning: GPU mode might not be installed for LightGBM, GPU training raised an exception. Falling back to CPU training...Refer to LightGBM GPU documentation: https://github.com/Microsoft/LightGBM/tree/master/python-package#build-gpu-versionOne possible method is:  pip uninstall lightgbm -y   pip install lightgbm --install-option=--gpu
      [LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
      Please recompile with CMake option -DUSE_GPU=1
          0.7992   = Validation score   (accuracy)
          0.75s    = Training   runtime
          0.01s    = Validation runtime
      Fitting model: RandomForestGini ...
          0.7808   = Validation score   (accuracy)
          0.99s    = Training   runtime
          0.05s    = Validation runtime
      Fitting model: RandomForestEntr ...
          0.7703   = Validation score   (accuracy)
          1.0s     = Training   runtime
          0.3s     = Validation runtime
      Fitting model: CatBoost ...
          Training CatBoost with GPU, note that this may negatively impact model quality compared to CPU training.
          Warning: CatBoost on GPU is experimental. If you encounter issues, use CPU for training CatBoost instead.
          0.8018   = Validation score   (accuracy)
          4.63s    = Training   runtime
          0.02s    = Validation runtime
      Fitting model: ExtraTreesGini ...
          0.7874   = Validation score   (accuracy)
          1.13s    = Training   runtime
          0.05s    = Validation runtime
      Fitting model: ExtraTreesEntr ...
          0.7835   = Validation score   (accuracy)
          1.15s    = Training   runtime
          0.05s    = Validation runtime
      Fitting model: NeuralNetFastAI ...
      No improvement since epoch 2: early stopping
          0.7808   = Validation score   (accuracy)
          7.44s    = Training   runtime
          0.02s    = Validation runtime
      Fitting model: XGBoost ...
          0.7953   = Validation score   (accuracy)
          0.99s    = Training   runtime
          0.01s    = Validation runtime
      Fitting model: NeuralNetTorch ...
          0.727    = Validation score   (accuracy)
          17.46s   = Training   runtime
          0.02s    = Validation runtime
      Fitting model: LightGBMLarge ...
          Training LightGBMLarge with GPU, note that this may negatively impact model quality compared to CPU training.
      [LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
      Please recompile with CMake option -DUSE_GPU=1
      Warning: GPU mode might not be installed for LightGBM, GPU training raised an exception. Falling back to CPU training...Refer to LightGBM GPU documentation: https://github.com/Microsoft/LightGBM/tree/master/python-package#build-gpu-versionOne possible method is:  pip uninstall lightgbm -y   pip install lightgbm --install-option=--gpu
          0.8071   = Validation score   (accuracy)
          2.71s    = Training   runtime
          0.02s    = Validation runtime
      Fitting model: WeightedEnsemble_L2 ...
          Ensemble Weights: {'LightGBMLarge': 0.265, 'XGBoost': 0.163, 'CatBoost': 0.143, 'NeuralNetFastAI': 0.143, 'RandomForestEntr': 0.102, 'LightGBMXT': 0.082, 'LightGBM': 0.041, 'ExtraTreesGini': 0.041, 'KNeighborsDist': 0.02}
          0.8268   = Validation score   (accuracy)
          0.59s    = Training   runtime
          0.0s     = Validation runtime
      AutoGluon training complete, total runtime = 44.0s ... Best model: "WeightedEnsemble_L2"
      TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231207_031213")
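
      Two things stand out in the log above: the id column was passed through as an integer feature (it carries no signal), and no presets were specified. A hedged variant using the presets option the log itself recommends (a sketch, not what was run above):

      # drop the uninformative id column and ask for a stronger preset
      predictr2 = TabularPredictor(label='target')
      predictr2.fit(df_train.drop(columns=['id']), presets='best_quality', num_gpus=1)
      predictr2.leaderboard()  # per-model validation scores as a DataFrame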

5. Submission

      sample_submission
               id  target
      0         0       0
      1         2       0
      2         3       0
      3         9       0
      4        11       0
      ...     ...     ...
      3258  10861       0
      3259  10865       0
      3260  10868       0
      3261  10874       0
      3262  10875       0

      [3263 rows x 2 columns]

      sample_submission['target'] = yhat
      sample_submission.to_csv("submission.csv", index=False)
      !kaggle competitions submit -c nlp-getting-started -f submission.csv -m "AutoGluon, TabularPredictor"
      100%|██████████████████████████████████████| 22.2k/22.2k [00:02<00:00, 10.4kB/s]
      Successfully submitted to Natural Language Processing with Disaster Tweets

      Leaderboard rank: 761/1094
      Public score: 0.6956124314442413

      Not great..
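
      One likely culprit: this competition is scored with F1, while the log shows AutoGluon defaulted to accuracy. Aligning the internal metric with the competition's is a one-line change (a sketch; eval_metric is a standard TabularPredictor option, but its effect on the leaderboard score here is untested):

      # optimize for F1, the competition's evaluation metric
      predictr_f1 = TabularPredictor(label='target', eval_metric='f1')
      predictr_f1.fit(df_train.drop(columns=['id']), num_gpus=1)
      yhat_f1 = predictr_f1.predict(df_test)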