import os
os.environ["WANDB_MODE"] = "offline"
Quiz-11 (2024.12.10) // Scope: up to 12wk-1
| Item | Allowed? | Notes |
|---|---|---|
| Lecture notes | Allowed | You may consult the lecture notes provided in class or your own notes |
| Google search | Allowed | You may search the internet for materials and to verify information |
| Generative models | Allowed | AI-based tools (GPT, etc.) may be used |
The solution video will be skipped this time. Thank you for all your hard work this semester.
import pandas as pd
import transformers
import datasets
import evaluate
import torch
import numpy as np
1. COVID19 tweets – 100 points
covid19_tweets_df = pd.read_csv('https://raw.githubusercontent.com/guebin/STML2022/main/posts/Corona_NLP_train.csv',encoding="ISO-8859-1")
covid19_tweets_df_train = covid19_tweets_df[::2].reset_index(drop=True)  # even-indexed rows -> train
covid19_tweets_df_test = covid19_tweets_df[1::2].reset_index(drop=True).loc[:,'UserName':'OriginalTweet'].assign(Sentiment = '???')  # odd-indexed rows -> test, with Sentiment hidden
del covid19_tweets_df
covid19_tweets_df_train.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 2 | 3803 | 48755 | NaN | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |
| 3 | 3805 | 48757 | 35.926541,-78.753267 | 16-03-2020 | Cashier at grocery store was sharing his insig... | Positive |
| 4 | 3807 | 48759 | Atlanta, GA USA | 16-03-2020 | Due to COVID-19 our retail store and classroom... | Positive |
covid19_tweets_df_test.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | ??? |
| 1 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | ??? |
| 2 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the region’s first confirmed COVID... | ??? |
| 3 | 3806 | 48758 | Austria | 16-03-2020 | Was at the supermarket today. Didn't buy toile... | ??? |
| 4 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | For corona prevention,we should stop to buy th... | ??? |
This dataset contains COVID-19-related tweets, and each tweet carries a Sentiment label describing the author's sentiment.
Key columns
1. OriginalTweet: the text of the tweet written by the user, covering various opinions, experiences, and situations related to COVID-19. Examples:
- “Coronavirus Australia: Woolworths to give elders priority shopping amid coronavirus pandemic.”
- “Due to COVID-19 our retail store and classroom…”
2. Sentiment: the label describing the tweet's sentiment. The possible values are:
- Extremely Positive: tweets conveying strong positive emotion, hope, and a sense of solidarity.
- Positive: tweets containing positive solutions or hopeful messages.
- Neutral: neutral tweets with no emotional expression (focused on relaying facts).
- Negative: negative tweets expressing problems or complaints.
- Extremely Negative: tweets about extremely negative situations.
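Before training, it can also help to check how these five classes are distributed in the training data. The one-liner below is a supplementary sketch, not part of the original quiz prompt or solution; it only assumes covid19_tweets_df_train has been loaded as above.
covid19_tweets_df_train['Sentiment'].value_counts()  # per-class tweet counts, useful for spotting class imbalance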
(1) Using the transformers library's "distilbert/distilbert-base-uncased" model, train a text classifier that takes OriginalTweet as input and predicts Sentiment.
Requirements
1. Data split: split the given dataset covid19_tweets_df_train at a 7:3 ratio, using 70% for training and the remaining 30% for validation.
2. Evaluation metrics: use the function below to evaluate the model's performance.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # predicted classes
    # load the evaluation metrics
    acc = evaluate.load("accuracy")
    f1 = evaluate.load("f1")
    # compute the metrics
    dct1 = acc.compute(predictions=predictions, references=labels)
    dct2 = f1.compute(predictions=predictions, references=labels, average='weighted')
    # merge the two metric dictionaries
    return dct1 | dct2
(Solution)
set(covid19_tweets_df_train['Sentiment'])
{'Extremely Negative', 'Extremely Positive', 'Negative', 'Neutral', 'Positive'}
id2label = {i:label for i,label in enumerate(set(covid19_tweets_df_train['Sentiment']))}
label2id = {label:i for i,label in enumerate(set(covid19_tweets_df_train['Sentiment']))}
print(id2label)
print(label2id)
{0: 'Positive', 1: 'Neutral', 2: 'Extremely Negative', 3: 'Extremely Positive', 4: 'Negative'}
{'Positive': 0, 'Neutral': 1, 'Extremely Negative': 2, 'Extremely Positive': 3, 'Negative': 4}
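One aside (not part of the original solution): iterating over a Python set does not guarantee the same ordering from session to session, so the integer ids assigned above may differ between runs. A minimal sketch that makes the mapping reproducible by sorting the labels first:
labels = sorted(set(covid19_tweets_df_train['Sentiment']))  # fixed alphabetical ordering
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}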
train_text = list(covid19_tweets_df_train['OriginalTweet'])
test_text = list(covid19_tweets_df_test['OriginalTweet'])
train_label = list(covid19_tweets_df_train.Sentiment.map(label2id))
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
train_dct = tokenizer(train_text,truncation=True) | {'labels': train_label}
test_dct = tokenizer(test_text,truncation=True)
train = datasets.Dataset.from_dict(train_dct)
test = datasets.Dataset.from_dict(test_dct)
d = train.train_test_split(test_size=0.3)
covid19_tweets = datasets.DatasetDict({'train':d['train'], 'eval':d['test'], 'test':test})
covid19_tweets
DatasetDict({
train: Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 14405
})
eval: Dataset({
features: ['input_ids', 'attention_mask', 'labels'],
num_rows: 6174
})
test: Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 20578
})
})
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased",
num_labels=5,
id2label=id2label,
label2id=label2id
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
data_collator = transformers.DataCollatorWithPadding(tokenizer)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # predicted classes
    # load the evaluation metrics
    acc = evaluate.load("accuracy")
    f1 = evaluate.load("f1")
    # compute the metrics
    dct1 = acc.compute(predictions=predictions, references=labels)
    dct2 = f1.compute(predictions=predictions, references=labels, average='weighted')
    # merge the two metric dictionaries
    return dct1 | dct2
# Set up the TrainingArguments
training_args = transformers.TrainingArguments(
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
logging_dir='./logs',
logging_steps=10,
)
# Set up the Trainer
trainer = transformers.Trainer(
model=model,
args=training_args,
train_dataset=covid19_tweets['train'],
eval_dataset=covid19_tweets['eval'],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
# Train the model
trainer.train()
/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/transformers/training_args.py:1568: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
warnings.warn(
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
/tmp/ipykernel_113605/2082248890.py:15: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = transformers.Trainer(
***** Running training *****
Num examples = 14,405
Num Epochs = 3
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 2,703
Number of trainable parameters = 66,957,317
Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/model.safetensors
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
***** Running Evaluation *****
Num examples = 6174
Batch size = 16
Downloading builder script: 100%|██████████| 6.77k/6.77k [00:00<00:00, 14.6MB/s]
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/model.safetensors
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/model.safetensors
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_tokens_map.json
***** Running Evaluation *****
Num examples = 6174
Batch size = 16
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/model.safetensors
tokenizer config file saved in ./results/checkpoint-2000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json
Model weights saved in ./results/checkpoint-2500/model.safetensors
tokenizer config file saved in ./results/checkpoint-2500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-2703
Configuration saved in ./results/checkpoint-2703/config.json
Model weights saved in ./results/checkpoint-2703/model.safetensors
tokenizer config file saved in ./results/checkpoint-2703/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-2703/special_tokens_map.json
***** Running Evaluation *****
Num examples = 6174
Batch size = 16
Training completed. Do not forget to share your model on huggingface.co/models =)
| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.413400 | 0.959950 | 0.785876 | 0.786625 |
| 2 | 0.285800 | 0.793307 | 0.788468 | 0.788806 |
| 3 | 0.148400 | 0.852429 | 0.798348 | 0.798777 |
TrainOutput(global_step=2703, training_loss=0.21161614542002155, metrics={'train_runtime': 114.3206, 'train_samples_per_second': 378.016, 'train_steps_per_second': 23.644, 'total_flos': 1019315616068040.0, 'train_loss': 0.21161614542002155, 'epoch': 3.0})
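Although the per-epoch table above already reports validation metrics, the final model can be re-evaluated explicitly on the held-out 30% split. A minimal sketch, not part of the original solution:
# Re-run evaluation of the final model on the validation split.
eval_metrics = trainer.evaluate(covid19_tweets['eval'])
print(eval_metrics)  # should include eval_loss, eval_accuracy, and eval_f1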
(2) Using the model trained in (1), predict the Sentiment for the OriginalTweet column of covid19_tweets_df_test.
Example answer
covid19_tweets_df_test.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 1 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | Extremely Positive |
| 2 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the region’s first confirmed COVID... | Positive |
| 3 | 3806 | 48758 | Austria | 16-03-2020 | Was at the supermarket today. Didn't buy toile... | Neutral |
| 4 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | For corona prevention,we should stop to buy th... | Negative |
(Solution)
out = trainer.predict(covid19_tweets['test'])
out
***** Running Prediction *****
Num examples = 20578
Batch size = 16
PredictionOutput(predictions=array([[ 6.052478 , -0.25911158, -4.757802 , -1.4280325 , -1.1493124 ],
[ 2.3945487 , -1.8138888 , -4.4416656 , 2.885043 , -1.0756998 ],
[ 5.9334354 , -1.897063 , -5.115348 , 0.6702355 , -1.7291662 ],
...,
[-1.9611454 , -3.7045465 , -3.4850702 , 6.3820515 , -2.5484905 ],
[-2.0311882 , -1.2512956 , -0.4017841 , -3.9326744 , 5.3791475 ],
[ 1.1640915 , 2.671375 , -3.3171444 , -3.6160667 , 1.4595261 ]],
dtype=float32), label_ids=None, metrics={'test_runtime': 12.0254, 'test_samples_per_second': 1711.21, 'test_steps_per_second': 107.023})
covid19_tweets_df_test['Sentiment'] = [id2label[i] for i in out.predictions.argmax(axis=1)]
covid19_tweets_df_test.head()
| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 1 | 3802 | 48754 | NaN | 16-03-2020 | My food stock is not the only one which is emp... | Extremely Positive |
| 2 | 3804 | 48756 | ÜT: 36.319708,-82.363649 | 16-03-2020 | As news of the region’s first confirmed COVID... | Positive |
| 3 | 3806 | 48758 | Austria | 16-03-2020 | Was at the supermarket today. Didn't buy toile... | Neutral |
| 4 | 3808 | 48760 | BHAVNAGAR,GUJRAT | 16-03-2020 | For corona prevention,we should stop to buy th... | Negative |
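As a supplementary spot-check (not part of the original answer), the fine-tuned model can also be wrapped in a text-classification pipeline to score individual tweets; the example sentence below is made up.
# Wrap the fine-tuned model and tokenizer for one-off predictions.
clf = transformers.pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("Grocery stores are finally restocking, which is great news."))  # hypothetical tweet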
2. IMDB – Bonus 50 points
The following is a cleaned-up version of the imdb analysis code covered in week 2. For convenience, only 1,000 samples were used for training and evaluation.
## Step1
imdb = datasets.load_dataset('imdb')
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
imdb_transformed = imdb.map(lambda dct: tokenizer(dct['text'],truncation=True), batched=True)
## Step2
model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2)
## Step3
def accuracy(eval_pred):
predictions, labels = eval_pred
predictions = predictions.argmax(axis=1)
accuracy = evaluate.load("accuracy")
return accuracy.compute(predictions=predictions, references=labels)
trainer = transformers.Trainer(
model=model,
args=transformers.TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
report_to="none"
),
train_dataset=imdb_transformed['train'].select(range(1000)), # use only 1,000 samples
eval_dataset=imdb_transformed['test'].select(range(1000)), # use only 1,000 samples
tokenizer=tokenizer,
data_collator=transformers.DataCollatorWithPadding(tokenizer=tokenizer),
compute_metrics=accuracy,
)
trainer.train()
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/tmp/ipykernel_116097/1501501835.py:13: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = transformers.Trainer(
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.003095 | 1.000000 |
| 2 | No log | 0.001992 | 1.000000 |
TrainOutput(global_step=126, training_loss=0.042853578688606384, metrics={'train_runtime': 30.9969, 'train_samples_per_second': 64.523, 'train_steps_per_second': 4.065, 'total_flos': 260867634212640.0, 'train_loss': 0.042853578688606384, 'epoch': 2.0})
As the table shows, training looks perfect (100% accuracy). However, evaluating on the full imdb_transformed['test'] set gives poor results (50% accuracy).
## Step4
out = trainer.predict(imdb_transformed['test'])
out
PredictionOutput(predictions=array([[ 3.2670784, -2.9905088],
[ 3.2648833, -3.0288398],
[ 3.224249 , -2.9409087],
...,
[ 3.2539847, -3.0032833],
[ 3.2596397, -3.0228188],
[ 3.2206337, -3.0072105]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 1]), metrics={'test_loss': 3.1024723052978516, 'test_accuracy': 0.5, 'test_runtime': 81.9806, 'test_samples_per_second': 304.95, 'test_steps_per_second': 19.065})
out.metrics['test_accuracy']
0.5
What went wrong? Is there a way to modify the code below so that training actually works? (The number of samples must stay the same.)
trainer = transformers.Trainer(
model=model,
args=transformers.TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
report_to="none"
),
train_dataset=imdb_transformed['train'].select(range(1000)), # use only 1,000 samples
eval_dataset=imdb_transformed['test'].select(range(1000)), # use only 1,000 samples
tokenizer=tokenizer,
data_collator=transformers.DataCollatorWithPadding(tokenizer=tokenizer),
compute_metrics=accuracy,
)
(Cause of the problem)
- Let's look at the data that was used for training.
set(imdb_transformed['train'].select(range(1000))['label'])
{0}
imdb_transformed['train'].select(range(1000)) contains only the label for negative movie reviews.
Note that in imdb_transformed['train'], the first 12,500 samples are all negative reviews and the next 12,500 are all positive reviews. In other words, imdb_transformed['train'] is "sorted" by label.
Labels of the first 12,500 samples
set(imdb_transformed['train'].select(range(12500))['label'])
{0}
Labels of the next 12,500 samples
set(imdb_transformed['train'].select(range(12500,25000))['label'])
{1}
- Therefore, if the model is trained only on imdb_transformed['train'].select(range(1000)), from the model's point of view the takeaway is
"Whatever the input, I can just answer 0, right?"
And indeed, the model learned to always predict 0 for every example in imdb_transformed['test']. (After all, doing so had always produced 100% accuracy during training...)
set(out.predictions.argmax(axis=1)) # every prediction is 0
{0}
- In reality, imdb_transformed['test'] contains exactly half 0s and half 1s, so the accuracy comes out to 50% (the model always predicts 0).
{i:imdb_transformed['test']['label'].count(i) for i in set(imdb_transformed['test']['label'])}
{0: 12500, 1: 12500}
- Come to think of it, it is also obvious why the accuracy during training was always 100%:
the eval set, imdb_transformed['test'].select(range(1000)), will likewise contain only samples with label = 0 (see the quick check below).
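A quick check (not run in the original write-up) would confirm this; if the test split is sorted by label the same way as the train split, it should return {0}:
# The eval set above was the first 1,000 rows of the (sorted) test split,
# so it should contain only the negative label.
set(imdb_transformed['test'].select(range(1000))['label'])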
(Solution)
Shuffle the data.
set(imdb_transformed['train'].shuffle().select(range(1000))['label'])
{0, 1}
## Step1
imdb = datasets.load_dataset('imdb')
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
imdb_transformed = imdb.map(lambda dct: tokenizer(dct['text'],truncation=True), batched=True)
## Step2
model = transformers.AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2)
## Step3
def accuracy(eval_pred):
predictions, labels = eval_pred
predictions = predictions.argmax(axis=1)
accuracy = evaluate.load("accuracy")
return accuracy.compute(predictions=predictions, references=labels)
trainer = transformers.Trainer(
model=model,
args=transformers.TrainingArguments(
output_dir="my_awesome_model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=False,
report_to="none"
),
train_dataset=imdb_transformed['train'].shuffle().select(range(1000)), # use only 1,000 samples
eval_dataset=imdb_transformed['test'].shuffle().select(range(1000)), # use only 1,000 samples
tokenizer=tokenizer,
data_collator=transformers.DataCollatorWithPadding(tokenizer=tokenizer),
compute_metrics=accuracy,
)
trainer.train()
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/tmp/ipykernel_116097/3188304965.py:13: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = transformers.Trainer(
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.441965 | 0.826000 |
| 2 | No log | 0.345760 | 0.853000 |
TrainOutput(global_step=126, training_loss=0.48426180037241134, metrics={'train_runtime': 31.3854, 'train_samples_per_second': 63.724, 'train_steps_per_second': 4.015, 'total_flos': 262388939494080.0, 'train_loss': 0.48426180037241134, 'epoch': 2.0})
## Step4
out = trainer.predict(imdb_transformed['test'])
out.metrics['test_accuracy']
0.88408
This is what proper training looks like.