import os
os.environ["WANDB_MODE"] = "offline"
Quiz-11 (2024.12.10) // Scope: up to 12wk-1
| Item | Allowed? | Notes |
|---|---|---|
| Lecture notes | Allowed | You may consult the lecture notes provided in class or your own notes |
| Google search | Allowed | You may search the internet for materials and to verify information |
| Generative models | Allowed | AI-based tools (GPT, etc.) may be used |
The solution video will be skipped this time. Thank you for all your hard work this semester.
import pandas as pd
import transformers
import datasets
import evaluate
import torch
import numpy as np
1. COVID19 tweets – 100 points
covid19_tweets_df = pd.read_csv('https://raw.githubusercontent.com/guebin/STML2022/main/posts/Corona_NLP_train.csv',encoding="ISO-8859-1")
covid19_tweets_df_train = covid19_tweets_df[::2].reset_index(drop=True)  # even-indexed rows -> train
covid19_tweets_df_test = covid19_tweets_df[1::2].reset_index(drop=True).loc[:,'UserName':'OriginalTweet'].assign(Sentiment = '???')  # odd-indexed rows -> test, with Sentiment hidden
del covid19_tweets_df
covid19_tweets_df_train.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 2 | 3803 | 48755 | NaN | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |
| 3 | 3805 | 48757 | 35.926541,-78.753267 | 16-03-2020 | Cashier at grocery store was sharing his insig... | Positive |
| 4 | 3807 | 48759 | Atlanta, GA USA | 16-03-2020 | Due to COVID-19 our retail store and classroom... | Positive |
covid19_tweets_df_test.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | ??? |
| 1 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | ??? |
| 2 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the region’s first confirmed COVID... | ??? |
| 3 | 3806 | 48758 | Austria | 16-03-2020 | Was at the supermarket today. Didn't buy toile... | ??? |
| 4 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | For corona prevention,we should stop to buy th... | ??? |
This dataset contains COVID-19-related tweets, and each tweet carries a Sentiment label describing the author's sentiment.
Key columns
1. OriginalTweet: the text of the tweet written by the user, covering various opinions, experiences, and situations related to COVID-19. Examples:
- “Coronavirus Australia: Woolworths to give elders priority shopping amid coronavirus pandemic.”
- “Due to COVID-19 our retail store and classroom…”
2. Sentiment: the label describing the tweet's sentiment. The possible values are:
- Extremely Positive: tweets conveying strong positive emotion, hope, and a sense of solidarity.
- Positive: tweets containing positive solutions or hopeful messages.
- Neutral: neutral tweets with no emotional expression (focused on relaying facts).
- Negative: negative tweets expressing problems or complaints.
- Extremely Negative: tweets about extremely negative situations.
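Before training, it can also help to check how these five classes are distributed in the training data. The one-liner below is a supplementary sketch, not part of the original quiz prompt or solution; it only assumes covid19_tweets_df_train has been loaded as above.
covid19_tweets_df_train['Sentiment'].value_counts()  # per-class tweet counts, useful for spotting class imbalance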
(1) Using the transformers library's "distilbert/distilbert-base-uncased" model, train a text classifier that takes OriginalTweet as input and predicts Sentiment.
Requirements
1. Data split: split the given dataset covid19_tweets_df_train at a 7:3 ratio, using 70% for training and the remaining 30% for validation.
2. Evaluation metrics: use the function below to evaluate the model's performance.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # predicted classes
    # load the evaluation metrics
    acc = evaluate.load("accuracy")
    f1 = evaluate.load("f1")
    # compute the metrics
    dct1 = acc.compute(predictions=predictions, references=labels)
    dct2 = f1.compute(predictions=predictions, references=labels, average='weighted')
    # merge the two metric dictionaries
    return dct1 | dct2
(Solution)
set(covid19_tweets_df_train['Sentiment'])
{'Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive'}
id2label = {i:label for i,label in enumerate(set(covid19_tweets_df_train['Sentiment']))}
label2id = {label:i for i,label in enumerate(set(covid19_tweets_df_train['Sentiment']))}
print(id2label)
print(label2id)
{0: 'Positive', 1: 'Neutral', 2: 'Extremely Negative', 3: 'Extremely Positive', 4: 'Negative'}
{'Positive': 0, 'Neutral': 1, 'Extremely Negative': 2, 'Extremely Positive': 3, 'Negative': 4}
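One aside (not part of the original solution): iterating over a Python set does not guarantee the same ordering from session to session, so the integer ids assigned above may differ between runs. A minimal sketch that makes the mapping reproducible by sorting the labels first:
labels = sorted(set(covid19_tweets_df_train['Sentiment']))  # fixed alphabetical ordering
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}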
train_text = list(covid19_tweets_df_train['OriginalTweet'])
test_text = list(covid19_tweets_df_test['OriginalTweet'])
train_label = list(covid19_tweets_df_train.Sentiment.map(label2id))
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
train_dct = tokenizer(train_text,truncation=True) | {'labels': train_label}
test_dct = tokenizer(test_text,truncation=True)
train = datasets.Dataset.from_dict(train_dct)
test = datasets.Dataset.from_dict(test_dct)
d = train.train_test_split(test_size=0.3)
covid19_tweets = datasets.DatasetDict({'train':d['train'], 'eval':d['test'], 'test':test})
covid19_tweets
DatasetDict({
train: Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 14405
})
eval: Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 6174
})
test: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 20578
})
})
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased",
num_labels=5,
id2label=id2label,
label2id=label2id
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
data_collator = transformers.DataCollatorWithPadding(tokenizer)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # predicted classes
    # load the evaluation metrics
    acc = evaluate.load("accuracy")
    f1 = evaluate.load("f1")
    # compute the metrics
    dct1 = acc.compute(predictions=predictions, references=labels)
    dct2 = f1.compute(predictions=predictions, references=labels, average='weighted')
    # merge the two metric dictionaries
    return dct1 | dct2
# Set up the TrainingArguments
training_args = transformers.TrainingArguments(
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
logging_dir='./logs',
logging_steps=10,
)
# Set up the Trainer
trainer = transformers.Trainer(
model=model,
args=training_args,
train_dataset=covid19_tweets['train'],
eval_dataset=covid19_tweets['eval'],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
# Train the model
trainer.train()
/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/transformers/training_args.py:1568: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
/tmp/ipykernel_113605/2082248890.py:15: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = transformers.Trainer(
***** Running training *****
Num examples = 14,405
Num Epochs = 3
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 2,703
Number of trainable parameters = 66,957,317
Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/model.safetensors
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
Num examples = 6174
Batch size = 16
Downloading builder script: 100%|██████████| 6.77k/6.77k [00:00<00:00, 14.6MB/s]
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/model.safetensors
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/model.safetensors
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_tokens_map.json
***** Running Evaluation *****
Num examples = 6174
Batch size = 16
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/model.safetensors
tokenizer config file saved in ./results/checkpoint-2000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json
Model weights saved in ./results/checkpoint-2500/model.safetensors
tokenizer config file saved in ./results/checkpoint-2500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-2703
Configuration saved in ./results/checkpoint-2703/config.json
Model weights saved in ./results/checkpoint-2703/model.safetensors
tokenizer config file saved in ./results/checkpoint-2703/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2703/special_tokens_map.json
***** Running Evaluation *****
Num examples = 6174
Batch size = 16
Training completed. Do not forget to share your model on huggingface.co/models =)
| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.413400 | 0.959950 | 0.785876 | 0.786625 |
| 2 | 0.285800 | 0.793307 | 0.788468 | 0.788806 |
| 3 | 0.148400 | 0.852429 | 0.798348 | 0.798777 |
TrainOutput(global_step=2703, training_loss=0.21161614542002155, metrics={'train_runtime': 114.3206, 'train_samples_per_second': 378.016, 'train_steps_per_second': 23.644, 'total_flos': 1019315616068040.0, 'train_loss': 0.21161614542002155, 'epoch': 3.0})
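Although the per-epoch table above already reports validation metrics, the final model can be re-evaluated explicitly on the held-out 30% split. A minimal sketch, not part of the original solution:
# Re-run evaluation of the final model on the validation split.
eval_metrics = trainer.evaluate(covid19_tweets['eval'])
print(eval_metrics)  # should include eval_loss, eval_accuracy, and eval_f1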
(2) Using the model trained in (1), predict the Sentiment for the OriginalTweet column of covid19_tweets_df_test.
Example answer
covid19_tweets_df_test.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 1 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | Extremely Positive |
| 2 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the region’s first confirmed COVID... | Positive |
| 3 | 3806 | 48758 | Austria | 16-03-2020 | Was at the supermarket today. Didn't buy toile... | Neutral |
| 4 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | For corona prevention,we should stop to buy th... | Negative |
(Solution)
out = trainer.predict(covid19_tweets['test'])
out
***** Running Prediction *****
Num examples = 20578
Batch size = 16
PredictionOutput(predictions=array([[ 6.052478 , -0.25911158, -4.757802 , -1.4280325 , -1.1493124 ],
[ 2.3945487 , -1.8138888 , -4.4416656 , 2.885043 , -1.0756998 ],
[ 5.9334354 , -1.897063 , -5.115348 , 0.6702355 , -1.7291662 ],
...,
[-1.9611454 , -3.7045465 , -3.4850702 , 6.3820515 , -2.5484905 ],
[-2.0311882 , -1.2512956 , -0.4017841 , -3.9326744 , 5.3791475 ],
[ 1.1640915 , 2.671375 , -3.3171444 , -3.6160667 , 1.4595261 ]],
dtype=float32), label_ids=None, metrics={'test_runtime': 12.0254, 'test_samples_per_second': 1711.21, 'test_steps_per_second': 107.023})
covid19_tweets_df_test['Sentiment'] = [id2label[i] for i in out.predictions.argmax(axis=1)]
covid19_tweets_df_test.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 1 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | Extremely Positive |
| 2 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the region’s first confirmed COVID... | Positive |
| 3 | 3806 | 48758 | Austria | 16-03-2020 | Was at the supermarket today. Didn't buy toile... | Neutral |
| 4 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | For corona prevention,we should stop to buy th... | Negative |
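As a supplementary spot-check (not part of the original answer), the fine-tuned model can also be wrapped in a text-classification pipeline to score individual tweets; the example sentence below is made up.
# Wrap the fine-tuned model and tokenizer for one-off predictions.
clf = transformers.pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("Grocery stores are finally restocking, which is great news."))  # hypothetical tweet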
2. IMDB – Bonus 50 points
The following is a cleaned-up version of the imdb analysis code covered in week 2. For convenience, only 1,000 samples were used for training and evaluation.
## Step1
imdb = datasets.load_dataset('imdb')
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
imdb_transformed = imdb.map(lambda dct: tokenizer(dct['text'],truncation=True), batched=True)
## Step2
model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2)
## Step3
def accuracy(eval_pred):
predictions, labels = eval_pred
predictions = predictions.argmax(axis=1)
accuracy = evaluate.load("accuracy")
return accuracy.compute(predictions=predictions, references=labels)
trainer = transformers.Trainer(
model=model,
args=transformers.TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
report_to="none"
),
train_dataset=imdb_transformed['train'].select(range(1000)), # use only 1,000 samples
eval_dataset=imdb_transformed['test'].select(range(1000)), # use only 1,000 samples
tokenizer=tokenizer,
data_collator=transformers.DataCollatorWithPadding(tokenizer=tokenizer),
compute_metrics=accuracy,
)
trainer.train()
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/tmp/ipykernel_116097/1501501835.py:13: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = transformers.Trainer(
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.003095 | 1.000000 |
| 2 | No log | 0.001992 | 1.000000 |
TrainOutput(global_step=126, training_loss=0.042853578688606384, metrics={'train_runtime': 30.9969, 'train_samples_per_second': 64.523, 'train_steps_per_second': 4.065, 'total_flos': 260867634212640.0, 'train_loss': 0.042853578688606384, 'epoch': 2.0})
As the table shows, training looks perfect (100% accuracy). However, evaluating on the full imdb_transformed['test'] set gives poor results (50% accuracy).
## Step4
out = trainer.predict(imdb_transformed['test'])
out
PredictionOutput(predictions=array([[ 3.2670784, -2.9905088],
[ 3.2648833, -3.0288398],
[ 3.224249 , -2.9409087],
...,
[ 3.2539847, -3.0032833],
[ 3.2596397, -3.0228188],
[ 3.2206337, -3.0072105]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 1]), metrics={'test_loss': 3.1024723052978516, 'test_accuracy': 0.5, 'test_runtime': 81.9806, 'test_samples_per_second': 304.95, 'test_steps_per_second': 19.065})
out.metrics['test_accuracy']
0.5
What went wrong? Is there a way to modify the code below so that training actually works? (The number of samples must stay the same.)
trainer = transformers.Trainer(
model=model,
args=transformers.TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
report_to="none"
),
train_dataset=imdb_transformed['train'].select(range(1000)), # use only 1,000 samples
eval_dataset=imdb_transformed['test'].select(range(1000)), # use only 1,000 samples
tokenizer=tokenizer,
data_collator=transformers.DataCollatorWithPadding(tokenizer=tokenizer),
compute_metrics=accuracy,
)
(Cause of the problem)
- Let's look at the data that was used for training.
set(imdb_transformed['train'].select(range(1000))['label'])
{0}
imdb_transformed['train'].select(range(1000)) contains only the label for negative movie reviews.
Note that in imdb_transformed['train'], the first 12,500 samples are all negative reviews and the next 12,500 are all positive reviews. In other words, imdb_transformed['train'] is "sorted" by label.
Labels of the first 12,500 samples
set(imdb_transformed['train'].select(range(12500))['label'])
{0}
Labels of the next 12,500 samples
set(imdb_transformed['train'].select(range(12500,25000))['label'])
{1}
- Therefore, if the model is trained only on imdb_transformed['train'].select(range(1000)), from the model's point of view the takeaway is
"Whatever the input, I can just answer 0, right?"
And indeed, the model learned to always predict 0 for every example in imdb_transformed['test']. (After all, doing so had always produced 100% accuracy during training...)
set(out.predictions.argmax(axis=1)) # every prediction is 0
{0}
- In reality, imdb_transformed['test'] contains exactly half 0s and half 1s, so the accuracy comes out to 50% (the model always predicts 0).
{i:imdb_transformed['test']['label'].count(i) for i in set(imdb_transformed['test']['label'])}
{0: 12500, 1: 12500}
- Come to think of it, it is also obvious why the accuracy during training was always 100%:
the eval set, imdb_transformed['test'].select(range(1000)), will likewise contain only samples with label = 0 (see the quick check below).
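A quick check (not run in the original write-up) would confirm this; if the test split is sorted by label the same way as the train split, it should return {0}:
# The eval set above was the first 1,000 rows of the (sorted) test split,
# so it should contain only the negative label.
set(imdb_transformed['test'].select(range(1000))['label'])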
(Solution)
Shuffle the data.
set(imdb_transformed['train'].shuffle().select(range(1000))['label'])
{0, 1}
## Step1
imdb = datasets.load_dataset('imdb')
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
imdb_transformed = imdb.map(lambda dct: tokenizer(dct['text'],truncation=True), batched=True)
## Step2
model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2)
## Step3
def accuracy(eval_pred):
predictions, labels = eval_pred
predictions = predictions.argmax(axis=1)
accuracy = evaluate.load("accuracy")
return accuracy.compute(predictions=predictions, references=labels)
trainer = transformers.Trainer(
model=model,
args=transformers.TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
report_to="none"
),
train_dataset=imdb_transformed['train'].shuffle().select(range(1000)), # use only 1,000 samples
eval_dataset=imdb_transformed['test'].shuffle().select(range(1000)), # use only 1,000 samples
tokenizer=tokenizer,
data_collator=transformers.DataCollatorWithPadding(tokenizer=tokenizer),
compute_metrics=accuracy,
)
trainer.train()
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/tmp/ipykernel_116097/3188304965.py:13: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = transformers.Trainer(
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.441965 | 0.826000 |
| 2 | No log | 0.345760 | 0.853000 |
TrainOutput(global_step=126, training_loss=0.48426180037241134, metrics={'train_runtime': 31.3854, 'train_samples_per_second': 63.724, 'train_steps_per_second': 4.015, 'total_flos': 262388939494080.0, 'train_loss': 0.48426180037241134, 'epoch': 2.0})
## Step4
out = trainer.predict(imdb_transformed['test'])
out.metrics['test_accuracy']
0.88408
This is what proper training looks like.