11wk-1: data_collator
import os
os.environ["WANDB_MODE"] = "offline"
1. Lecture video
2. Imports
import pandas as pd
import numpy as np
import datasets
import transformers
import torch
import torchvision
import torch.utils
import evaluate
3. Understanding data_collator
A. Memorize this \((\star\star\star)\)
- How to design a good data_collator: given a trainer_input and a model, design the data_collator so that the code below runs.
trainer_input = ~~~
model = ~~~~
#---#
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x
) # this step moves the model to cuda
_batched_data = batch_maker.get_test_dataloader(trainer_input) # this step moves trainer_input to cuda
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model.to("cpu") # may have to be omitted in some cases
model(**data_collator(single_batch))
- If the code above ran without errors, the code below can be used.
trainer = transformers.Trainer(
    model = model,
    data_collator = data_collator
)
trainer.predict(trainer_input)
How do I know this? I dug through the source code.. \(\to\) homework
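If you want to try that homework, one possible starting point (my own sketch, not part of the lecture) is to print the relevant Trainer methods with inspect.getsource:
import inspect
# sketch: read the source of the Trainer methods involved in prediction and batching
print(inspect.getsource(transformers.Trainer.predict))
print(inspect.getsource(transformers.Trainer.get_test_dataloader))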
For Colab users, there is an issue where wandb (Weights & Biases) asks you to log in, as shown below.
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
To fix this, run the following code at the very beginning of the Colab notebook.
import os
os.environ["WANDB_MODE"] = "offline"주의: trainer_input의 type이 꼭 Dataset 일 필요는 없다..
B. IMDB – review
ref: https://huggingface.co/docs/transformers/tasks/sequence_classification
1. 데이터준비: "guebin/imdb-tiny" \(\to\) trainer_input
imdb = datasets.load_dataset("guebin/imdb-tiny")
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_imdb = imdb.map(preprocess_function,batched=True)
trainer_input = tokenized_imdb['train']
trainer_input
Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 10
})
2. Model preparation: "distilbert/distilbert-base-uncased" \(\to\) model
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased", num_labels=2
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: DataCollatorWithPadding() \(\to\) data_collator
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
data_collator
DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
Check that the data collator is set up correctly, then build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution)
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x
) # this step moves the model to cuda
_batched_data = batch_maker.get_test_dataloader(trainer_input) # this step moves trainer_input to cuda
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model.to("cpu") # may have to be omitted in some cases
model(**data_collator(single_batch))
SequenceClassifierOutput(loss=tensor(0.7085, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0543, 0.0011],
[-0.0405, -0.0351]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
- It ran fine. (= the data collator used here is a well-designed data_collator.)
trainer = transformers.Trainer(
model = model,
data_collator = data_collator
)
out = trainer.predict(trainer_input)
out
PredictionOutput(predictions=array([[-0.24347985, 0.03874021],
[-0.26586303, 0.06817057],
[-0.2564777 , 0.04826375],
[-0.2534306 , 0.06623521],
[-0.23762025, 0.05738585],
[-0.25557715, 0.07033838],
[-0.19689777, 0.07268588],
[-0.20918864, 0.05981901],
[-0.2526626 , 0.10021226],
[-0.24273753, 0.05700814]], dtype=float32), label_ids=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), metrics={'test_loss': 0.8574765920639038, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.038, 'test_samples_per_second': 263.415, 'test_steps_per_second': 52.683})
#
- Observation 1: batched_data[-1] is one batch (single_batch), but its format is not suitable as a model input.
# batched_data[-1] -- looks like an unsuitable model input..
model(**batched_data[-1])
TypeError: DistilBertForSequenceClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0-5): 6 x TransformerBlock(
(attention): DistilBertSdpaAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False)
) argument after ** must be a mapping, not list
- Observation 2: data_collator(batched_data[-1]) is also one batch (single_batch), but this one is in a format that is also suitable as a model input.
data_collator(batched_data[-1]) # a format that looks very desirable as a model input
{'input_ids': tensor([[ 101, 2040, 2024, ..., 22132, 7847, 102],
[ 101, 2023, 2003, ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]]), 'labels': tensor([0, 0])}
model.to("cpu")
model(**data_collator(batched_data[1]))
SequenceClassifierOutput(loss=tensor(0.8696, grad_fn=<NllLossBackward0>), logits=tensor([[-0.2527, 0.1002],
[-0.2427, 0.0570]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
data_collator – a deeper look
Suppose we have batched data organized in the form below. (Note: batched_data must always be a list-like object.)
batched_data = [batch_1, batch_2, ..., batch_n]
The data_collator's role is to adjust the "format" of each single_batch, i.e. batch_1, batch_2, and so on, into a form the model can process. In other words, its role is to make the following run:
model(**data_collator(batch_1))
Comparing how trainer and model process data
#. How the model processes data
- Code: model.forward(model_input)
- Process: the model.forward() function simply processes the input organized in model_input.
#. How the trainer processes data
- Code: trainer.predict(trainer_input)
- Process: three stages: batching \(\to\) data collating \(\to\) inference.
  - Split trainer_input into batches.
  - Adjust the format of each batch (= single_batch) via the data_collator.
  - Pass the format-adjusted data to model.forward as input.
- Pseudocode:
## this code..
trainer.predict(trainer_input)
## ..can roughly be read like the following (not identical: there is additional detailed logic for collecting results, GPU handling, etc.)
batched_data = some_function(trainer_input)
for single_batch in batched_data:
    collated_data = data_collator(single_batch)
    model(**collated_data)
Decomposing trainer.predict()
The behavior of trainer.predict() can conceptually be decomposed into (1) batching (2) data collating (3) inference, but splitting the actual code exactly along those lines is difficult. (And there are many other small steps in between..) Still, for the sake of understanding, if we force the code apart, it can be split into the three snippets below.
1. Batching: trainer_input \(\to\) batched_data
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
2. Data collating: single_batch \(\to\) collated_data
#for single_batch in batched_data:
collated_data = data_collator(single_batch)
3. Inference: collated_data \(\to\) model_out
#for single_batch in batched_data:
#collated_data = data_collator(single_batch)
model_out = model(**collated_data)
C. FOOD101 – review
ref: https://huggingface.co/docs/transformers/tasks/image_classification
1. 데이터준비: "guebin/food101-tiny" \(\to\) trainer_input
food = datasets.load_dataset("guebin/food101-tiny")
image_processor = transformers.AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
normalize = torchvision.transforms.Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
size = (
image_processor.size["shortest_edge"]
if "shortest_edge" in image_processor.size
else (image_processor.size["height"], image_processor.size["width"])
)
_transforms = torchvision.transforms.Compose([
torchvision.transforms.RandomResizedCrop(size),
torchvision.transforms.ToTensor(),
normalize
])
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples
trainer_input = food['train'].with_transform(transforms)
trainer_input
Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Dataset({
features: ['image', 'label'],
num_rows: 10
})
2. Model preparation: "google/vit-base-patch16-224-in21k" \(\to\) model
labels = food["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
model = transformers.AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224-in21k",
num_labels=len(labels),
id2label=id2label,
label2id=label2id,
)
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: DefaultDataCollator() \(\to\) data_collator
data_collator = transformers.DefaultDataCollator()
data_collator
DefaultDataCollator(return_tensors='pt')
Check that the data collator is set up correctly, then build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution 1) – failure
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
KeyError: 'image'
- Why did it fail?? (I could swear this used to work..)
<Interpreting the error message>
- The following does not run:
batched_data = list(_batched_data)
- because the following does not run:
next(dataloader_iter)
- …(omitted)…
- Ultimately, the problem was that the following does not run. (But wait, this is the code inside .with_transform()?)
examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
- In other words, at the moment [_transforms(img.convert("RGB")) for img in examples["image"]] was executed, examples["image"] did not exist.
Realization:
so .with_transform() only actually runs at this point?
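A quick way to see this lazy behavior (a sketch I added; it assumes the food and transforms objects defined above):
# with_transform does not change the schema; it only registers a transform to run on access
lazy = food["train"].with_transform(transforms)
print(lazy.column_names)   # still ['image', 'label'] -- nothing has run yet
print(lazy[0].keys())      # the transform fires here: 'image' is gone, 'pixel_values' appears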
- Why does this happen?
- In the batching code
_batched_data = batch_maker.get_test_dataloader(trainer_input), the trainer (batch_maker = trainer) has logic that forcibly removes every column except the following column_names:1
pixel_values, head_mask, labels, output_attentions, output_hidden_states, interpolate_pos_encoding, return_dict
1 Why does this logic exist? Because without it, you would have to memorize the model's forward arguments yourself..
- The column_name image is not in that list, so it is removed.
- Then, after the image column has been removed, with_transform executes later (lazy execution) and the problem occurs.
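To see which columns survive without reading the Trainer source, one option (my own sketch, not from the lecture) is to inspect the signature of model.forward, since the kept column_names essentially mirror its parameter names:
import inspect
# the Trainer keeps (roughly) the parameters of model.forward plus the label columns;
# everything else is dropped when remove_unused_columns=True (the default)
print(list(inspect.signature(model.forward).parameters.keys()))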
How do I know this? I dug through the source code.. \(\to\) homework
trainer.predict() goes through (1) batching (2) data collating (3) inference, and between batching and data collating there is a step that builds the "single batch".
- Detail 1: during the "batching" stage, there is internal logic that deletes columns that are not used as inputs to model.forward().
- Detail 2: the .with_transform() attached to trainer_input is executed after "batching", in the step that builds the single batch.
Therefore, if .with_transform() is set up to transform a particular column, there is a risk that this column is auto-removed at the batching stage and the code no longer runs.
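The failure mode can be reproduced in a few lines (a toy reproduction I added, not the Trainer's exact internal call): drop the image column first, then let the lazy transform fire.
# toy reproduction: remove 'image' first, then access an example
broken = food["train"].remove_columns(["image"]).with_transform(transforms)
try:
    broken[0]   # the lazy transform runs here and cannot find examples["image"]
except KeyError as e:
    print("KeyError:", e)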
(Solution 2) – disguise image as return_dict.. // a purely technical hack
- Current situation: trainer_input was created by attaching .with_transform(transforms) to food['train'].
- Problem: when .with_transform(transforms) is realized inside the internals of trainer.predict(),
transforms??
Signature: transforms(examples)
Docstring: <no docstring>
Source:
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples
File: /tmp/ipykernel_706133/1515420127.py
Type: function
this body has to run, but image is not a valid key for the model's input, so the trainer has already removed it.
- Strategy: prevent it from being removed..
#model.forward?
#trainer_input = food['train'].with_transform(transforms)
trainer_input2 = trainer_input.rename_columns({'image':'return_dict'})
trainer_input2
Dataset({
features: ['return_dict', 'label'],
num_rows: 10
})
def transforms2(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["return_dict"]]
    del examples["return_dict"]
    return examples
trainer_input3 = trainer_input2.with_transform(transforms2)
trainer_input3
Dataset({
features: ['return_dict', 'label'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input3)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
ImageClassifierOutput(loss=tensor(4.5805, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.0258, 0.0340, 0.0493, 0.0371, 0.0371, 0.0636, -0.0353, -0.0416,
0.0061, -0.0304, 0.0127, -0.0633, 0.0532, -0.1117, -0.1657, -0.0810,
0.0022, 0.0071, -0.0947, -0.0831, -0.1189, 0.0783, -0.2383, -0.0486,
0.1039, 0.0115, 0.0054, -0.0113, 0.0740, 0.0783, 0.0188, 0.0618,
0.2759, 0.1308, -0.1028, 0.0198, 0.0032, 0.2006, -0.1247, -0.0512,
-0.0331, -0.0608, -0.1030, 0.0307, 0.2115, 0.1275, -0.1836, -0.2429,
-0.1090, -0.0293, 0.1010, 0.0847, -0.0655, 0.0416, -0.1167, -0.0598,
0.1333, 0.1627, -0.1722, 0.0046, -0.0842, 0.0161, 0.1583, -0.0403,
-0.0190, -0.1496, 0.0723, -0.0647, -0.1083, -0.1299, 0.0851, -0.1810,
0.0214, 0.2340, -0.0186, -0.1256, 0.0582, 0.1798, 0.1589, -0.0982,
0.0066, 0.0177, 0.0315, 0.0404, 0.1300, -0.0198, 0.0468, -0.0595,
0.2014, 0.0155, -0.0009, 0.1910, -0.0110, 0.1809, 0.0187, 0.0010,
0.0691, 0.2024, 0.1041, -0.1182, 0.0577],
[-0.0791, 0.0483, -0.0684, 0.0205, 0.0634, 0.0355, 0.1256, 0.0242,
0.0795, -0.1158, 0.1004, -0.0554, 0.1398, 0.0703, -0.0372, -0.0903,
0.0322, -0.1763, -0.0331, 0.0778, 0.0345, 0.0899, 0.0006, -0.1170,
-0.0303, 0.0620, -0.1490, -0.0589, -0.0060, 0.0266, -0.0812, -0.0497,
-0.0114, 0.0981, -0.0686, 0.0337, 0.0196, 0.0132, -0.1738, -0.0574,
-0.0434, 0.0773, 0.0020, 0.1212, 0.1227, -0.0150, -0.0698, -0.1568,
0.0644, -0.1053, 0.0420, -0.1292, -0.1032, -0.1744, -0.1242, -0.0229,
0.1295, 0.0844, -0.1660, -0.0132, -0.0407, 0.1438, -0.0115, -0.0879,
-0.1188, -0.1644, -0.0454, -0.0449, -0.0555, -0.2129, -0.0220, -0.1480,
-0.0191, 0.2003, 0.0107, 0.1169, 0.0108, 0.0526, 0.1320, -0.2591,
0.0240, -0.0215, 0.2772, 0.0699, 0.0940, 0.0377, 0.0715, 0.1504,
0.0094, -0.0027, 0.1345, 0.2739, 0.0965, 0.1069, -0.0843, 0.0841,
0.0078, 0.1318, 0.1355, 0.0620, -0.0478]], device='cuda:0',
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator= data_collator
)
trainer.predict(trainer_input3)
PredictionOutput(predictions=array([[-0.05419738, -0.05692905, 0.02577981, ..., 0.06238552,
-0.08741985, 0.00835681],
[ 0.06003767, 0.03531971, -0.01702251, ..., 0.10187533,
-0.04111148, -0.11816782],
[-0.06362653, -0.06895374, 0.04998193, ..., 0.04436018,
0.09370279, -0.10635335],
...,
[ 0.01029072, 0.00109556, -0.0853666 , ..., 0.117467 ,
-0.07630866, 0.04534987],
[-0.02582262, 0.03399263, 0.04932407, ..., 0.10409873,
-0.11815406, 0.05774596],
[-0.07906114, 0.04832995, -0.06836515, ..., 0.1355295 ,
0.06195256, -0.04780686]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.615988254547119, 'test_model_preparation_time': 0.0021, 'test_runtime': 0.122, 'test_samples_per_second': 81.994, 'test_steps_per_second': 16.399})
(Solution 3) – execute the with_transform registered on trainer_input immediately, instead of lazily
trainer_input
Dataset({
features: ['image', 'label'],
num_rows: 10
})
trainer_input2 = [l for l in trainer_input]
#trainer_input2
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input2)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
ImageClassifierOutput(loss=tensor(4.5605, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.0081, 0.0617, 0.0638, 0.0713, 0.0144, 0.0612, -0.0130, 0.0154,
-0.0081, -0.0375, 0.0250, -0.0233, 0.0434, -0.1051, -0.1325, -0.0450,
-0.0049, -0.0313, -0.0842, -0.0833, -0.0892, 0.0594, -0.2713, -0.0347,
0.1534, 0.0343, 0.0183, 0.0157, 0.0553, 0.1003, 0.0007, 0.0441,
0.2778, 0.1277, -0.1301, 0.0467, -0.0503, 0.2478, -0.1140, -0.1092,
-0.0189, -0.0305, -0.1160, 0.0112, 0.2403, 0.1366, -0.1775, -0.2425,
-0.1163, -0.0243, 0.0992, 0.0648, -0.0584, 0.0718, -0.1058, -0.0473,
0.1545, 0.1715, -0.2551, 0.0352, -0.0359, 0.0221, 0.1607, -0.0603,
-0.0414, -0.1300, 0.1734, -0.0703, -0.1057, -0.1081, 0.0777, -0.1908,
0.0017, 0.3012, -0.0455, -0.1913, 0.0702, 0.1233, 0.1578, -0.0738,
-0.0173, 0.0552, 0.0420, 0.0655, 0.1074, -0.0273, 0.0485, -0.0461,
0.1798, 0.0381, 0.0032, 0.1604, -0.0975, 0.1537, 0.0042, -0.0461,
0.0601, 0.2107, 0.1335, -0.1295, 0.0352],
[-0.1044, 0.0104, -0.0422, -0.1469, -0.0117, 0.0846, 0.1661, -0.0103,
0.0525, -0.0917, 0.1212, -0.0444, 0.1618, 0.1138, -0.0373, 0.0542,
0.0429, -0.2012, -0.0207, 0.0457, 0.0667, 0.0972, -0.0717, -0.0703,
0.0701, 0.0540, -0.0171, -0.0794, 0.0547, 0.2083, -0.0065, 0.0393,
0.0592, 0.2466, 0.0027, 0.0328, -0.0566, 0.0978, -0.1787, 0.0818,
-0.0550, 0.0916, 0.0148, 0.1101, 0.1682, 0.0056, -0.0835, -0.2765,
-0.0238, -0.1956, 0.0127, -0.0766, -0.0920, -0.1452, -0.0421, -0.0560,
0.1438, 0.1189, -0.1660, 0.0936, -0.0736, 0.1523, 0.0853, -0.0591,
-0.0346, -0.1171, 0.0096, -0.0056, 0.0095, -0.2420, 0.0185, -0.0991,
0.1547, 0.2323, 0.0378, 0.0578, 0.0714, 0.1055, 0.1090, -0.2153,
0.1281, 0.0639, 0.1533, -0.0397, 0.2495, 0.0217, 0.0576, 0.1019,
0.2074, 0.0387, 0.1036, 0.3094, 0.1219, 0.0817, -0.0584, 0.0388,
0.0619, 0.0701, 0.0913, 0.0300, -0.0823]], device='cuda:0',
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator= data_collator
)
trainer.predict(trainer_input2)
PredictionOutput(predictions=array([[-0.11918398, -0.16509736, 0.0360051 , ..., -0.0015837 ,
-0.1649007 , 0.06934457],
[ 0.02522344, -0.03897335, 0.14615731, ..., 0.14916745,
-0.08329906, 0.00363915],
[-0.03516109, -0.03787031, 0.06803481, ..., 0.0288963 ,
0.03937618, -0.04595642],
...,
[ 0.05573866, -0.03177875, -0.12821546, ..., 0.08249602,
-0.12046868, 0.02415534],
[-0.00807494, 0.06173132, 0.06380235, ..., 0.13346131,
-0.1294754 , 0.03517982],
[-0.10440043, 0.01043405, -0.04224908, ..., 0.09133518,
0.03001446, -0.08225137]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.626803398132324, 'test_model_preparation_time': 0.0022, 'test_runtime': 0.0629, 'test_samples_per_second': 159.085, 'test_steps_per_second': 31.817})
(Solution 4) – set the trainer's "remove unused columns" feature to False..
trainer_input
Dataset({
features: ['image', 'label'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
ImageClassifierOutput(loss=tensor(4.5805, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.0258, 0.0340, 0.0493, 0.0371, 0.0371, 0.0636, -0.0353, -0.0416,
0.0061, -0.0304, 0.0127, -0.0633, 0.0532, -0.1117, -0.1657, -0.0810,
0.0022, 0.0071, -0.0947, -0.0831, -0.1189, 0.0783, -0.2383, -0.0486,
0.1039, 0.0115, 0.0054, -0.0113, 0.0740, 0.0783, 0.0188, 0.0618,
0.2759, 0.1308, -0.1028, 0.0198, 0.0032, 0.2006, -0.1247, -0.0512,
-0.0331, -0.0608, -0.1030, 0.0307, 0.2115, 0.1275, -0.1836, -0.2429,
-0.1090, -0.0293, 0.1010, 0.0847, -0.0655, 0.0416, -0.1167, -0.0598,
0.1333, 0.1627, -0.1722, 0.0046, -0.0842, 0.0161, 0.1583, -0.0403,
-0.0190, -0.1496, 0.0723, -0.0647, -0.1083, -0.1299, 0.0851, -0.1810,
0.0214, 0.2340, -0.0186, -0.1256, 0.0582, 0.1798, 0.1589, -0.0982,
0.0066, 0.0177, 0.0315, 0.0404, 0.1300, -0.0198, 0.0468, -0.0595,
0.2014, 0.0155, -0.0009, 0.1910, -0.0110, 0.1809, 0.0187, 0.0010,
0.0691, 0.2024, 0.1041, -0.1182, 0.0577],
[-0.0791, 0.0483, -0.0684, 0.0205, 0.0634, 0.0355, 0.1256, 0.0242,
0.0795, -0.1158, 0.1004, -0.0554, 0.1398, 0.0703, -0.0372, -0.0903,
0.0322, -0.1763, -0.0331, 0.0778, 0.0345, 0.0899, 0.0006, -0.1170,
-0.0303, 0.0620, -0.1490, -0.0589, -0.0060, 0.0266, -0.0812, -0.0497,
-0.0114, 0.0981, -0.0686, 0.0337, 0.0196, 0.0132, -0.1738, -0.0574,
-0.0434, 0.0773, 0.0020, 0.1212, 0.1227, -0.0150, -0.0698, -0.1568,
0.0644, -0.1053, 0.0420, -0.1292, -0.1032, -0.1744, -0.1242, -0.0229,
0.1295, 0.0844, -0.1660, -0.0132, -0.0407, 0.1438, -0.0115, -0.0879,
-0.1188, -0.1644, -0.0454, -0.0449, -0.0555, -0.2129, -0.0220, -0.1480,
-0.0191, 0.2003, 0.0107, 0.1169, 0.0108, 0.0526, 0.1320, -0.2591,
0.0240, -0.0215, 0.2772, 0.0699, 0.0940, 0.0377, 0.0715, 0.1504,
0.0094, -0.0027, 0.1345, 0.2739, 0.0965, 0.1069, -0.0843, 0.0841,
0.0078, 0.1318, 0.1355, 0.0620, -0.0478]], device='cuda:0',
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator= data_collator,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[-0.05419738, -0.05692905, 0.02577981, ..., 0.06238552,
-0.08741985, 0.00835681],
[ 0.06003767, 0.03531971, -0.01702251, ..., 0.10187533,
-0.04111148, -0.11816782],
[-0.06362653, -0.06895374, 0.04998193, ..., 0.04436018,
0.09370279, -0.10635335],
...,
[ 0.01029072, 0.00109556, -0.0853666 , ..., 0.117467 ,
-0.07630866, 0.04534987],
[-0.02582262, 0.03399263, 0.04932407, ..., 0.10409873,
-0.11815406, 0.05774596],
[-0.07906114, 0.04832995, -0.06836515, ..., 0.1355295 ,
0.06195256, -0.04780686]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.615988254547119, 'test_model_preparation_time': 0.0021, 'test_runtime': 0.0794, 'test_samples_per_second': 125.968, 'test_steps_per_second': 25.194})
#
(Solution 5) – if you are going to set the trainer's "remove unused columns" feature to False anyway, you can obtain a single_batch directly, as below, without even needing a batch_maker.
Solution 4 style: obtaining the single batch in a way similar to how the trainer actually obtains it
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x,
    args = transformers.TrainingArguments(
        output_dir= "asdf", # anything works, but something must be given
        remove_unused_columns= False # this is the key point!!
    )
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[0]
# single_batch = [
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# ]
Format observation:
note that single_batch has the form [Dict, Dict, Dict, ...., Dict].
Solution 5 style: a single batch obtained the brute-force way, taking a hint from the format observation
single_batch = [
trainer_input[0],
trainer_input[1],
trainer_input[2],
trainer_input[3],
trainer_input[4],
trainer_input[5],
trainer_input[6],
trainer_input[7],
]
Anyway, once the single batch has been obtained in the Solution-5 style, the rest of the code is the same.
model.to("cpu")
model(**data_collator(single_batch));
trainer = transformers.Trainer(
model = model,
data_collator= data_collator,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[-0.05419738, -0.05692905, 0.02577981, ..., 0.06238552,
-0.08741985, 0.00835681],
[ 0.06003767, 0.03531971, -0.01702251, ..., 0.10187533,
-0.04111148, -0.11816782],
[-0.06362653, -0.06895374, 0.04998193, ..., 0.04436018,
0.09370279, -0.10635335],
...,
[ 0.01029072, 0.00109556, -0.0853666 , ..., 0.117467 ,
-0.07630866, 0.04534987],
[-0.02582262, 0.03399263, 0.04932407, ..., 0.10409873,
-0.11815406, 0.05774596],
[-0.07906114, 0.04832995, -0.06836515, ..., 0.1355295 ,
0.06195256, -0.04780686]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.615988254547119, 'test_model_preparation_time': 0.0015, 'test_runtime': 0.0567, 'test_samples_per_second': 176.35, 'test_steps_per_second': 35.27})
Note 1: the single batch cannot be obtained in the way below. – Why? Because of lazy execution..
#single_batch = trainer_input.to_list()[:8]
Note 2: the single batch cannot be obtained in the way below either.
#single_batch = trainer_input[:8]
Why?
trainer_input[:2] == [trainer_input[0],trainer_input[1]]
False
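The underlying reason is that slicing a datasets.Dataset returns a dict of columns rather than a list of per-example dicts. A quick check (a sketch I added):
# slicing gives one dict whose values are column lists, not a list of row dicts
print(type(trainer_input[:2]))        # <class 'dict'>
print(list(trainer_input[:2].keys())) # e.g. ['label', 'pixel_values'] once the lazy transform fires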
D. FOOD101 – implementing DefaultDataCollator
1. Data preparation: "guebin/food101-tiny" \(\to\) trainer_input
food = datasets.load_dataset("guebin/food101-tiny")
image_processor = transformers.AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
normalize = torchvision.transforms.Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
size = (
image_processor.size["shortest_edge"]
if "shortest_edge" in image_processor.size
else (image_processor.size["height"], image_processor.size["width"])
)
_transforms = torchvision.transforms.Compose([
torchvision.transforms.RandomResizedCrop(size),
torchvision.transforms.ToTensor(),
normalize
])
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples
trainer_input = food['train'].with_transform(transforms)
trainer_input
Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Dataset({
features: ['image', 'label'],
num_rows: 10
})
2. Model preparation: "google/vit-base-patch16-224-in21k" \(\to\) model
labels = food["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
model = transformers.AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224-in21k",
num_labels=len(labels),
id2label=id2label,
label2id=label2id,
)
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: designing collate_fn yourself
# data_collator = transformers.DefaultDataCollator()
# data_collator
def collate_fn(single_batch):
    pass
Design a collate_fn that plays the same role as DefaultDataCollator(). Using it, build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution)
trainer_input
Dataset({
features: ['image', 'label'],
num_rows: 10
})
# batch_maker = transformers.Trainer(
# model= model,
# data_collator= lambda x: x,
# args = transformers.TrainingArguments(
# output_dir="asdf",
# remove_unused_columns=False
# )
# )
# _batched_data = batch_maker.get_eval_dataloader(trainer_input)
# batched_data = list(_batched_data)
# single_batch = batched_data[-1]
#---#
single_batch = [trainer_input[-2],trainer_input[-1]]
single_batch
[{'label': 6,
'pixel_values': tensor([[[ 0.9294, 0.9137, 0.9137, ..., -0.0902, -0.1373, -0.1451],
[ 0.9216, 0.8902, 0.8824, ..., -0.1059, -0.1451, -0.1216],
[ 0.9137, 0.8745, 0.8588, ..., -0.1294, -0.1608, -0.1216],
...,
[ 0.8902, 0.8824, 0.8588, ..., 0.6471, 0.6941, 0.7490],
[ 0.8980, 0.9608, 0.9216, ..., 0.6471, 0.6863, 0.7412],
[ 0.7961, 0.9294, 0.8980, ..., 0.6863, 0.7569, 0.8275]],
[[ 0.7882, 0.7725, 0.7725, ..., -0.7020, -0.7490, -0.7569],
[ 0.7804, 0.7412, 0.7333, ..., -0.7176, -0.7569, -0.7490],
[ 0.7725, 0.7255, 0.7098, ..., -0.7412, -0.7725, -0.7490],
...,
[ 0.6000, 0.6000, 0.5765, ..., 0.3569, 0.4196, 0.4824],
[ 0.6078, 0.6784, 0.6549, ..., 0.3647, 0.4275, 0.4824],
[ 0.5059, 0.6471, 0.6314, ..., 0.4118, 0.5137, 0.5843]],
[[ 0.3020, 0.2863, 0.2863, ..., -0.8824, -0.9294, -0.9373],
[ 0.2941, 0.2784, 0.2627, ..., -0.8980, -0.9373, -0.9294],
[ 0.3020, 0.2627, 0.2471, ..., -0.9294, -0.9608, -0.9373],
...,
[-0.0588, -0.0431, -0.0275, ..., -0.2000, -0.1294, -0.0588],
[-0.0196, 0.0745, 0.0824, ..., -0.1843, -0.1059, -0.0353],
[-0.1059, 0.0667, 0.0745, ..., -0.1216, -0.0118, 0.0745]]])},
{'label': 6,
'pixel_values': tensor([[[ 0.2471, 0.2392, 0.2235, ..., 0.5529, 0.5608, 0.5686],
[ 0.2784, 0.2706, 0.2549, ..., 0.5137, 0.5216, 0.5294],
[ 0.2863, 0.2863, 0.2706, ..., 0.5373, 0.5373, 0.5373],
...,
[ 0.1843, 0.1843, 0.1922, ..., 0.3961, 0.4039, 0.4039],
[ 0.1765, 0.1765, 0.1765, ..., 0.3725, 0.3804, 0.3804],
[ 0.1843, 0.1843, 0.1843, ..., 0.3412, 0.3490, 0.3490]],
[[ 0.0196, 0.0118, -0.0039, ..., 0.2392, 0.2471, 0.2549],
[ 0.0510, 0.0431, 0.0275, ..., 0.2000, 0.2078, 0.2157],
[ 0.0431, 0.0353, 0.0275, ..., 0.2235, 0.2235, 0.2235],
...,
[ 0.0275, 0.0275, 0.0353, ..., 0.2314, 0.2392, 0.2392],
[ 0.0196, 0.0196, 0.0275, ..., 0.2078, 0.2157, 0.2157],
[ 0.0353, 0.0353, 0.0353, ..., 0.1765, 0.1843, 0.1843]],
[[-0.0275, -0.0353, -0.0510, ..., 0.3020, 0.3098, 0.3176],
[ 0.0039, -0.0039, -0.0196, ..., 0.2627, 0.2706, 0.2784],
[-0.0196, -0.0196, -0.0275, ..., 0.2784, 0.2784, 0.2784],
...,
[-0.0275, -0.0275, -0.0196, ..., 0.2000, 0.2078, 0.2078],
[-0.0353, -0.0353, -0.0275, ..., 0.1843, 0.1922, 0.1922],
[-0.0118, -0.0118, -0.0118, ..., 0.1529, 0.1608, 0.1608]]])}]
def collate_fn(single_batch):
    # single_batch = [Dict, Dict]
    # Dict = {'label': 6, 'pixel_values': [3, 224, 224]-tensor}
    collated_data = dict()
    collated_data['labels'] = torch.tensor([dct['label'] for dct in single_batch])
    collated_data['pixel_values'] = torch.stack([dct['pixel_values'] for dct in single_batch])
    return collated_data
model.to("cpu")
model(**collate_fn(single_batch))
ImageClassifierOutput(loss=tensor(4.6879, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0804, 0.0968, 0.0104, 0.0587, 0.0753, -0.1459, -0.0490, 0.0943,
-0.1302, 0.0035, 0.0278, 0.0814, -0.0322, -0.0997, 0.0074, 0.0590,
0.1447, -0.0570, 0.0402, -0.1111, 0.0828, -0.0466, -0.0744, -0.0126,
-0.0425, 0.1688, -0.0974, -0.0623, 0.0361, 0.0408, 0.0729, -0.0884,
-0.1466, -0.0140, -0.0014, 0.0648, 0.1264, -0.0280, 0.1474, -0.0689,
-0.1422, 0.0655, 0.0284, -0.0079, -0.0690, -0.0004, 0.1554, 0.2469,
-0.0823, -0.1235, 0.1127, 0.0328, -0.0263, -0.1717, -0.0735, -0.0631,
-0.0033, 0.0384, 0.0394, -0.0366, -0.0721, -0.1715, -0.1646, 0.1292,
-0.0584, 0.1022, 0.1657, -0.0345, -0.0113, 0.0878, 0.0139, 0.0916,
0.0486, 0.1362, -0.1265, -0.0859, 0.1684, 0.0747, -0.0101, 0.0710,
0.1240, 0.0428, 0.0963, -0.0619, 0.0882, -0.1248, -0.0710, -0.0345,
-0.0587, 0.0099, -0.0551, 0.0146, 0.0188, -0.0608, -0.0025, -0.0860,
0.0773, -0.0181, 0.0626, -0.0063, 0.1354],
[-0.1139, 0.1333, 0.0288, 0.1036, 0.0013, -0.1305, -0.0787, -0.0498,
-0.0068, -0.0083, -0.1182, 0.1121, -0.0138, 0.1194, -0.0208, 0.0663,
0.1040, 0.0072, 0.0234, -0.0689, -0.0308, -0.1163, 0.0537, -0.0286,
-0.0101, 0.0307, -0.0585, -0.0954, 0.0320, -0.0579, 0.0325, -0.0295,
-0.1303, -0.0086, 0.0865, -0.0150, 0.1053, -0.0445, 0.1173, 0.0385,
-0.0747, -0.0407, 0.0267, 0.0213, -0.0670, -0.0072, 0.1436, 0.2285,
-0.0249, -0.0071, 0.2300, 0.0438, -0.0619, -0.1296, -0.0915, -0.1184,
-0.0810, -0.0472, 0.0674, -0.0898, -0.2508, 0.0194, -0.1328, 0.0603,
-0.0573, 0.2025, 0.1324, -0.0234, 0.1049, -0.0662, -0.0686, 0.1090,
0.0918, -0.0061, 0.0338, -0.0134, 0.2654, 0.0237, -0.0282, 0.0598,
0.0974, 0.0358, 0.0761, 0.0540, -0.0830, 0.0899, -0.0992, -0.1714,
-0.1553, 0.0348, 0.0597, -0.0319, 0.0956, 0.0430, -0.0128, 0.0559,
0.1588, -0.0096, 0.0150, -0.0084, -0.0050]],
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
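As a cross-check (a sketch I added; it assumes the library collator produces exactly the same stacked tensors for this batch), the hand-written collate_fn can be compared with DefaultDataCollator directly:
# compare our collate_fn against the library collator on the same single_batch
ref = transformers.DefaultDataCollator()(single_batch)
ours = collate_fn(single_batch)
assert torch.equal(ref["labels"], ours["labels"])
assert torch.equal(ref["pixel_values"], ours["pixel_values"])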
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
args=transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Prediction *****
Num examples = 10
Batch size = 8
PredictionOutput(predictions=array([[-0.11636792, -0.0263294 , 0.1064104 , ..., -0.04388852,
-0.07757819, 0.01414965],
[-0.07075333, 0.05525547, 0.0611947 , ..., -0.03989508,
-0.0375449 , 0.12472424],
[-0.07043052, 0.12611614, 0.00971566, ..., 0.00761356,
-0.00533151, 0.02300033],
...,
[-0.20117757, 0.09828947, 0.00724527, ..., -0.04101294,
-0.02915922, 0.21293962],
[-0.05230844, 0.08269425, 0.02642585, ..., 0.03440103,
-0.00195974, 0.12479743],
[-0.0799979 , 0.02366202, 0.03927091, ..., 0.05246412,
0.07132983, -0.06526391]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.681787967681885, 'test_model_preparation_time': 0.0015, 'test_runtime': 0.0592, 'test_samples_per_second': 168.828, 'test_steps_per_second': 33.766})
E. IMDB – implementing DataCollatorWithPadding
ref: https://huggingface.co/docs/transformers/tasks/sequence_classification
1. 데이터준비: "guebin/imdb-tiny" \(\to\) trainer_input
imdb = datasets.load_dataset("guebin/imdb-tiny")
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_imdb = imdb.map(preprocess_function,batched=True)
trainer_input = tokenized_imdb['train']
2. Model preparation: "distilbert/distilbert-base-uncased" \(\to\) model
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased", num_labels=2
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: designing collate_fn yourself
# data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
# data_collator
def collate_fn(single_batch):
    pass
Design a collate_fn that plays the same role as DataCollatorWithPadding(). Using it, build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution)
trainer_input
Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model= model,
data_collator= lambda x: x,
)
_batched_data = batch_maker.get_eval_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
labels = torch.tensor([dct['label'] for dct in single_batch])
labels
tensor([0, 0])
input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['input_ids']) for dct in single_batch]).t()
input_ids
tensor([[ 101, 2040, 2024, ..., 22132, 7847, 102],
[ 101, 2023, 2003, ..., 0, 0, 0]])
attention_mask = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['attention_mask']) for dct in single_batch]).t()
attention_mask
tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]])
# single_batch = [Dict, Dict]
# Dict = {
#     'label': int
#     'input_ids': 1d-list
#     'attention_mask': 1d-list
# }
def collate_fn(single_batch):
    collated_data = dict()
    collated_data['input_ids'] = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['input_ids']) for dct in single_batch]).t()
    collated_data['attention_mask'] = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['attention_mask']) for dct in single_batch]).t()
    collated_data['labels'] = torch.tensor([dct['label'] for dct in single_batch])
    return collated_data
collate_fn(single_batch)
{'input_ids': tensor([[ 101, 2040, 2024, ..., 22132, 7847, 102],
[ 101, 2023, 2003, ..., 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]]),
'labels': tensor([0, 0])}
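Before feeding the result to the model, it can also be compared against the library collator (a sketch I added; it assumes both pad on the right with 0, which holds for this tokenizer):
# compare the hand-written collator against DataCollatorWithPadding on the same batch
ref = transformers.DataCollatorWithPadding(tokenizer=tokenizer)(single_batch)
ours = collate_fn(single_batch)
for key in ["input_ids", "attention_mask", "labels"]:
    assert torch.equal(ref[key], ours[key])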
model.to("cpu")
model(**collate_fn(single_batch))
SequenceClassifierOutput(loss=tensor(0.6132, grad_fn=<NllLossBackward0>), logits=tensor([[0.1724, 0.0153],
[0.1978, 0.0212]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.18233198, 0.0185583 ],
[ 0.19031762, 0.02762305],
[ 0.19021928, 0.03987525],
[ 0.15878916, -0.00159456],
[ 0.18261112, 0.02069864],
[ 0.14113042, -0.00186965],
[ 0.17083615, 0.03911189],
[ 0.16111258, 0.01503472],
[ 0.17235444, 0.0153002 ],
[ 0.19777855, 0.02123523]], dtype=float32), label_ids=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), metrics={'test_loss': 0.6185036897659302, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0378, 'test_samples_per_second': 264.573, 'test_steps_per_second': 52.915})
4. Practice – sms_spam
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased", num_labels=2
)
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
spam = datasets.load_dataset('guebin/spam-tiny')
spam
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
DatasetDict({
train: Dataset({
features: ['sms', 'label'],
num_rows: 10
})
})
spam
DatasetDict({
train: Dataset({
features: ['sms', 'label'],
num_rows: 10
})
})
A. Method 1: fixed padding, collate_fn
def m_trans(example_batch):
    # example_batch = {'sms':[xxx,xxxx,...], 'label':[yyy,yyyy,...]}
    # example_batch = spam['train'][:8]
    out = tokenizer(example_batch['sms'],padding=True,truncation=True)
    return out
spam2 = spam.map(m_trans,batched=True,batch_size=8)
spam2
DatasetDict({
train: Dataset({
features: ['sms', 'label', 'input_ids', 'attention_mask'],
num_rows: 10
})
})
spam2.set_format("pt")
#spam2['train']['input_ids'] -- list of tensors with length 10
spam2['train'][8:]['input_ids'] # 2d-tensor (examples 8 and 9 were padded together in the same map batch, so they share one length)
tensor([[ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102],
[ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0]])
spam2['train'][7:]['input_ids'] # list of 1d-tensors (example 7 has length 56 while examples 8-9 have length 46, so they cannot form one tensor)
[tensor([ 101, 2004, 2566, 2115, 5227, 1005, 11463, 2571, 11463, 2571,
1006, 2030, 2226, 8117, 28987, 11231, 3070, 18447, 2063, 27617,
5575, 2226, 29525, 15464, 1007, 1005, 2038, 2042, 2275, 2004,
2115, 20587, 8525, 2638, 2005, 2035, 20587, 2015, 1012, 2811,
1008, 1023, 2000, 6100, 2115, 2814, 20587, 8525, 2638, 102,
0, 0, 0, 0, 0, 0]),
tensor([ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102]),
tensor([ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0])]
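Tensors of unequal length can still be merged by re-padding them, which is essentially what dynamic padding does below. A small sketch (my addition, reusing pad_sequence as in Section E):
# re-pad the ragged list of 1-D tensors into a single 2-D tensor
ragged = spam2['train'][7:]['input_ids']                    # lengths 56, 46, 46
merged = torch.nn.utils.rnn.pad_sequence(ragged, batch_first=True)
print(merged.shape)                                         # torch.Size([3, 56])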
trainer_input = spam2['train'].remove_columns(['sms']).rename_columns({'label':'labels'})
trainer_input
Dataset({
features: ['labels', 'input_ids', 'attention_mask'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model= model,
data_collator=lambda x:x
)
_batched_data = batch_maker.get_eval_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
single_batch
[{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
device='cuda:0')},
{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
device='cuda:0')}]
torch.stack([single_batch[0]['labels'],single_batch[1]['labels']])
tensor([1, 1], device='cuda:0')
def collate_fn(single_batch):
    out = dict()
    out['labels'] = torch.stack([dct['labels'] for dct in single_batch])
    # note: torch.stack requires every sequence in the batch to have the same length (fixed padding)
    out['input_ids'] = torch.stack([dct['input_ids'] for dct in single_batch])
    out['attention_mask'] = torch.stack([dct['attention_mask'] for dct in single_batch])
    return out
model(**collate_fn(single_batch))
SequenceClassifierOutput(loss=tensor(0.2875, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.4793, 0.6985],
[-0.4598, 0.5654]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model= model,
data_collator=collate_fn
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.70305747, -0.7353085 ],
[ 0.7872642 , -0.77549946],
[-0.6230489 , 0.72870666],
[ 0.7890144 , -0.74234533],
[ 0.6112454 , -0.5727863 ],
[-0.56530714, 0.636417 ],
[ 0.39937705, -0.28327113],
[ 0.45833465, -0.43777147],
[-0.6101986 , 0.7738755 ],
[-0.48634416, 0.7041703 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.2599327564239502, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0119, 'test_samples_per_second': 842.754, 'test_steps_per_second': 168.551})
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [46] at entry 3
(Training fails because padding was applied per map batch of 8: the first eight examples were padded to length 56 and the last two to length 46, and a shuffled training batch mixes the two lengths, which torch.stack cannot handle.)
B. Method 2: fixed padding, DefaultDataCollator
def m_trans(example_batch):
    # example_batch = {'sms':[xxx,xxxx,...], 'label':[yyy,yyyy,...]}
    # example_batch = spam['train'][:8]
    out = tokenizer(example_batch['sms'],padding=True,truncation=True)
    return out
spam2 = spam.map(m_trans,batched=True,batch_size=8)
spam2.set_format("pt")
trainer_input = spam2['train'].remove_columns(['sms']).rename_columns({'label':'labels'})
batch_maker = transformers.Trainer(
model= model,
data_collator=lambda x:x
)
_batched_data = batch_maker.get_eval_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
single_batch
[{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
device='cuda:0')},
{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
device='cuda:0')}]
# def collate_fn(single_batch):
# out = dict()
# out['labels'] = torch.stack([dct['labels'] for dct in single_batch])
# out['input_ids'] = torch.stack([dct['input_ids'] for dct in single_batch])
# out['attention_mask'] = torch.stack([dct['attention_mask'] for dct in single_batch])
# return out
data_collator = transformers.DefaultDataCollator()
model(**data_collator(single_batch))
SequenceClassifierOutput(loss=tensor(0.2875, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.4793, 0.6985],
[-0.4598, 0.5654]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model= model,
data_collator=data_collator
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.70305747, -0.7353085 ],
[ 0.7872642 , -0.77549946],
[-0.6230489 , 0.72870666],
[ 0.7890144 , -0.74234533],
[ 0.6112454 , -0.5727863 ],
[-0.56530714, 0.636417 ],
[ 0.39937705, -0.28327113],
[ 0.45833465, -0.43777147],
[-0.6101986 , 0.7738755 ],
[-0.48634416, 0.7041703 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.2599327564239502, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0118, 'test_samples_per_second': 844.621, 'test_steps_per_second': 168.924})
trainer = transformers.Trainer(
model=model,
data_collator=data_collator,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [46] at entry 3
(Same failure as in Method 1: fixed padding per map batch produces two different sequence lengths.)
C. Method 3: dynamic padding, DataCollatorWithPadding
spam
DatasetDict({
train: Dataset({
features: ['sms', 'label'],
num_rows: 10
})
})
def w_trans(examples):
    # examples = spam['train'][:8] = {'sms': [xxx,xxxx,...], 'label':[yyy,yyyy,...]}
    out = tokenizer(examples['sms'],truncation=True)
    out['labels'] = torch.tensor(examples['label'])
    return out
trainer_input = spam.with_transform(w_trans)['train']
trainer_input
Dataset({
features: ['sms', 'label'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
single_batch = next(iter(batch_maker.get_eval_dataloader(trainer_input)))
#single_batch
data_collator = transformers.DataCollatorWithPadding(tokenizer)
model.to("cpu")
model(**data_collator(single_batch))
SequenceClassifierOutput(loss=tensor(0.5369, grad_fn=<NllLossBackward0>), logits=tensor([[ 0.2340, -0.2688],
[ 0.2608, -0.2633],
[-0.1423, 0.2838],
[ 0.2734, -0.3063],
[ 0.3347, -0.1394],
[-0.0950, 0.1350],
[ 0.0552, 0.0188],
[ 0.1153, 0.0608]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator = data_collator,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.2797705 , -0.19910407],
[ 0.30945048, -0.2513666 ],
[-0.14997171, 0.28633246],
[ 0.30314386, -0.24964799],
[ 0.2884021 , -0.17398489],
[-0.07598098, 0.12895201],
[ 0.11931977, -0.05026204],
[ 0.08751589, -0.07571842],
[-0.13582245, 0.29102388],
[-0.06882622, 0.2479064 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.5247495770454407, 'test_model_preparation_time': 0.0007, 'test_runtime': 0.0093, 'test_samples_per_second': 1075.104, 'test_steps_per_second': 215.021})
trainer = transformers.Trainer(
model=model,
data_collator=data_collator,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
| Step | Training Loss |
|---|
TrainOutput(global_step=6, training_loss=0.38783260186513263, metrics={'train_runtime': 1.0559, 'train_samples_per_second': 28.412, 'train_steps_per_second': 5.682, 'total_flos': 421204931664.0, 'train_loss': 0.38783260186513263, 'epoch': 3.0})
D. Method 4: dynamic padding, no preprocessing \((\star)\)
Here all preprocessing (tokenization and dynamic padding) happens inside the collate_fn, so the raw spam['train'] can be passed to the trainer as-is.
trainer_input = spam['train']
trainer_input
Dataset({
features: ['sms', 'label'],
num_rows: 10
})
single_batch = [trainer_input[-2],trainer_input[-1]]
single_batch
[{'sms': 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\n',
'label': 1},
{'sms': 'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030\n',
'label': 1}]
def collate_fn(single_batch):
    # tokenize and dynamically pad the raw sms texts inside the collator
    out = tokenizer(
        [dct['sms'] for dct in single_batch],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    out['labels'] = torch.tensor([dct['label'] for dct in single_batch])
    return out
model.to("cpu")
model(**collate_fn(single_batch))
SequenceClassifierOutput(loss=tensor(0.6672, grad_fn=<NllLossBackward0>), logits=tensor([[0.0171, 0.1000],
[0.0605, 0.0832]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[-0.02218767, 0.10636629],
[-0.01826159, 0.08857261],
[-0.01180449, 0.07579152],
[-0.03820946, 0.06749745],
[ 0.04095571, 0.06443821],
[ 0.0097203 , 0.05986086],
[-0.01054696, 0.09217122],
[-0.02597055, 0.07729876],
[ 0.01710123, 0.09998252],
[ 0.06050469, 0.08315243]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.7104923725128174, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0109, 'test_samples_per_second': 919.985, 'test_steps_per_second': 183.997})
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
| Step | Training Loss |
|---|
TrainOutput(global_step=6, training_loss=0.6373028755187988, metrics={'train_runtime': 1.0552, 'train_samples_per_second': 28.431, 'train_steps_per_second': 5.686, 'total_flos': 421204931664.0, 'train_loss': 0.6373028755187988, 'epoch': 3.0})
A1. Announcements
Hello. Reviewing the recording, the lecture ran about 20 minutes over the scheduled time (the 3-hour lecture took 3 hours 20 minutes). I apologize. I will take this into account and make the upcoming lectures a bit shorter.
The code below
lst = [1,2,3]
lst2 = lst
lst2.append(4)
When you run this, lst and lst2 end up holding the same value; an explanation of this phenomenon can be found at
https://guebin.github.io/PP2023/posts/2023-06-21-13wk-1.html
for any students who are interested. (You do not need to know this material to get your grade in this course.)
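In short (a small sketch I added), both names refer to one and the same list object:
lst = [1, 2, 3]
lst2 = lst            # lst2 is a second name for the same list object, not a copy
lst2.append(4)
print(lst)            # [1, 2, 3, 4]
print(lst is lst2)    # True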