11wk-1: data_collator
import os
os.environ["WANDB_MODE"] = "offline"
1. Lecture video
2. Imports
import pandas as pd
import numpy as np
import datasets
import transformers
import torch
import torchvision
import torch.utils
import evaluate
3. Understanding data_collator
A. Memorize this \((\star\star\star)\)
- How to design a good data_collator: given a trainer_input and a model, design the data_collator so that the code below runs.
trainer_input = ~~~
model = ~~~~
#---#
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x
) # this step moves the model to cuda
_batched_data = batch_maker.get_test_dataloader(trainer_input) # this step moves trainer_input to cuda
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model.to("cpu") # may have to be omitted in some cases
model(**data_collator(single_batch))
- If the code above ran without errors, the code below can be used.
trainer = transformers.Trainer(
    model = model,
    data_collator = data_collator
)
trainer.predict(trainer_input)
How do I know this? I dug through the source code.. \(\to\) homework
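If you want to try that homework, one possible starting point (my own sketch, not part of the lecture) is to print the relevant Trainer methods with inspect.getsource:
import inspect
# sketch: read the source of the Trainer methods involved in prediction and batching
print(inspect.getsource(transformers.Trainer.predict))
print(inspect.getsource(transformers.Trainer.get_test_dataloader))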
For Colab users, there is an issue where wandb (Weights & Biases) asks you to log in, as shown below.
wandb: WARNING The `run_name` is currently set to the same value as `TrainingArguments.output_dir`. If this was not intended, please specify a different run name by setting the `TrainingArguments.run_name` parameter.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
To fix this, run the following code at the very beginning of the Colab notebook.
import os
os.environ["WANDB_MODE"] = "offline"주의: trainer_input의 type이 꼭 Dataset 일 필요는 없다..
B. IMDB – review
ref: https://huggingface.co/docs/transformers/tasks/sequence_classification
1. 데이터준비: "guebin/imdb-tiny" \(\to\) trainer_input
imdb = datasets.load_dataset("guebin/imdb-tiny")
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_imdb = imdb.map(preprocess_function,batched=True)
trainer_input = tokenized_imdb['train']
trainer_input
Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 10
})
2. Model preparation: "distilbert/distilbert-base-uncased" \(\to\) model
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased", num_labels=2
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: DataCollatorWithPadding() \(\to\) data_collator
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
data_collator
DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, padding=True, max_length=None, pad_to_multiple_of=None, return_tensors='pt')
Check that the data collator is set up correctly, then build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution)
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x
) # this step moves the model to cuda
_batched_data = batch_maker.get_test_dataloader(trainer_input) # this step moves trainer_input to cuda
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model.to("cpu") # may have to be omitted in some cases
model(**data_collator(single_batch))
SequenceClassifierOutput(loss=tensor(0.7085, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0543, 0.0011],
[-0.0405, -0.0351]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
- It ran fine. (= the data collator used here is a well-designed data_collator.)
trainer = transformers.Trainer(
model = model,
data_collator = data_collator
)
out = trainer.predict(trainer_input)
out
PredictionOutput(predictions=array([[-0.24347985, 0.03874021],
[-0.26586303, 0.06817057],
[-0.2564777 , 0.04826375],
[-0.2534306 , 0.06623521],
[-0.23762025, 0.05738585],
[-0.25557715, 0.07033838],
[-0.19689777, 0.07268588],
[-0.20918864, 0.05981901],
[-0.2526626 , 0.10021226],
[-0.24273753, 0.05700814]], dtype=float32), label_ids=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), metrics={'test_loss': 0.8574765920639038, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.038, 'test_samples_per_second': 263.415, 'test_steps_per_second': 52.683})
#
- Observation 1: batched_data[-1] is one batch (single_batch), but its format is not suitable as a model input.
# batched_data[-1] -- looks like an unsuitable model input..
model(**batched_data[-1])
TypeError: DistilBertForSequenceClassification(
(distilbert): DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0-5): 6 x TransformerBlock(
(attention): DistilBertSdpaAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
(activation): GELUActivation()
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
(pre_classifier): Linear(in_features=768, out_features=768, bias=True)
(classifier): Linear(in_features=768, out_features=2, bias=True)
(dropout): Dropout(p=0.2, inplace=False)
) argument after ** must be a mapping, not list
- Observation 2: data_collator(batched_data[-1]) is also one batch (single_batch), but this one is in a format that is also suitable as a model input.
data_collator(batched_data[-1]) # a format that looks very desirable as a model input
{'input_ids': tensor([[ 101, 2040, 2024, ..., 22132, 7847, 102],
[ 101, 2023, 2003, ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]]), 'labels': tensor([0, 0])}
model.to("cpu")
model(**data_collator(batched_data[1]))
SequenceClassifierOutput(loss=tensor(0.8696, grad_fn=<NllLossBackward0>), logits=tensor([[-0.2527, 0.1002],
[-0.2427, 0.0570]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
data_collator – a deeper look
Suppose we have batched data organized in the form below. (Note: batched_data must always be a list-like object.)
batched_data = [batch_1, batch_2, ..., batch_n]
The data_collator's role is to adjust the "format" of each single_batch, i.e. batch_1, batch_2, and so on, into a form the model can process. In other words, its role is to make the following run:
model(**data_collator(batch_1))
Comparing how trainer and model process data
#. How the model processes data
- Code: model.forward(model_input)
- Process: the model.forward() function simply processes the input organized in model_input.
#. How the trainer processes data
- Code: trainer.predict(trainer_input)
- Process: three stages: batching \(\to\) data collating \(\to\) inference.
  - Split trainer_input into batches.
  - Adjust the format of each batch (= single_batch) via the data_collator.
  - Pass the format-adjusted data to model.forward as input.
- Pseudocode:
## this code..
trainer.predict(trainer_input)
## ..can roughly be read like the following (not identical: there is additional detailed logic for collecting results, GPU handling, etc.)
batched_data = some_function(trainer_input)
for single_batch in batched_data:
    collated_data = data_collator(single_batch)
    model(**collated_data)
Decomposing trainer.predict()
The behavior of trainer.predict() can conceptually be decomposed into (1) batching (2) data collating (3) inference, but splitting the actual code exactly along those lines is difficult. (And there are many other small steps in between..) Still, for the sake of understanding, if we force the code apart, it can be split into the three snippets below.
1. Batching: trainer_input \(\to\) batched_data
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
2. Data collating: single_batch \(\to\) collated_data
#for single_batch in batched_data:
collated_data = data_collator(single_batch)
3. Inference: collated_data \(\to\) model_out
#for single_batch in batched_data:
#collated_data = data_collator(single_batch)
model_out = model(**collated_data)
C. FOOD101 – review
ref: https://huggingface.co/docs/transformers/tasks/image_classification
1. 데이터준비: "guebin/food101-tiny" \(\to\) trainer_input
food = datasets.load_dataset("guebin/food101-tiny")
image_processor = transformers.AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
normalize = torchvision.transforms.Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
size = (
image_processor.size["shortest_edge"]
if "shortest_edge" in image_processor.size
else (image_processor.size["height"], image_processor.size["width"])
)
_transforms = torchvision.transforms.Compose([
torchvision.transforms.RandomResizedCrop(size),
torchvision.transforms.ToTensor(),
normalize
])
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples
trainer_input = food['train'].with_transform(transforms)
trainer_input
Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Dataset({
features: ['image', 'label'],
num_rows: 10
})
2. Model preparation: "google/vit-base-patch16-224-in21k" \(\to\) model
labels = food["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
model = transformers.AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224-in21k",
num_labels=len(labels),
id2label=id2label,
label2id=label2id,
)
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: DefaultDataCollator() \(\to\) data_collator
data_collator = transformers.DefaultDataCollator()
data_collator
DefaultDataCollator(return_tensors='pt')
Check that the data collator is set up correctly, then build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution 1) – failure
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
KeyError: 'image'
- Why did it fail?? (I could swear this used to work..)
<Interpreting the error message>
- The following does not run:
batched_data = list(_batched_data)
- because the following does not run:
next(dataloader_iter)
- …(omitted)…
- Ultimately, the problem was that the following does not run. (But wait, this is the code inside .with_transform()?)
examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
- In other words, at the moment [_transforms(img.convert("RGB")) for img in examples["image"]] was executed, examples["image"] did not exist.
Realization:
so .with_transform() only actually runs at this point?
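A quick way to see this lazy behavior (a sketch I added; it assumes the food and transforms objects defined above):
# with_transform does not change the schema; it only registers a transform to run on access
lazy = food["train"].with_transform(transforms)
print(lazy.column_names)   # still ['image', 'label'] -- nothing has run yet
print(lazy[0].keys())      # the transform fires here: 'image' is gone, 'pixel_values' appears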
- Why does this happen?
- In the batching code
_batched_data = batch_maker.get_test_dataloader(trainer_input), the trainer (batch_maker = trainer) has logic that forcibly removes every column except the following column_names:1
pixel_values, head_mask, labels, output_attentions, output_hidden_states, interpolate_pos_encoding, return_dict
1 Why does this logic exist? Because without it, you would have to memorize the model's forward arguments yourself..
- The column_name image is not in that list, so it is removed.
- Then, after the image column has been removed, with_transform executes later (lazy execution) and the problem occurs.
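To see which columns survive without reading the Trainer source, one option (my own sketch, not from the lecture) is to inspect the signature of model.forward, since the kept column_names essentially mirror its parameter names:
import inspect
# the Trainer keeps (roughly) the parameters of model.forward plus the label columns;
# everything else is dropped when remove_unused_columns=True (the default)
print(list(inspect.signature(model.forward).parameters.keys()))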
How do I know this? I dug through the source code.. \(\to\) homework
trainer.predict() goes through (1) batching (2) data collating (3) inference, and between batching and data collating there is a step that builds the "single batch".
- Detail 1: during the "batching" stage, there is internal logic that deletes columns that are not used as inputs to model.forward().
- Detail 2: the .with_transform() attached to trainer_input is executed after "batching", in the step that builds the single batch.
Therefore, if .with_transform() is set up to transform a particular column, there is a risk that this column is auto-removed at the batching stage and the code no longer runs.
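The failure mode can be reproduced in a few lines (a toy reproduction I added, not the Trainer's exact internal call): drop the image column first, then let the lazy transform fire.
# toy reproduction: remove 'image' first, then access an example
broken = food["train"].remove_columns(["image"]).with_transform(transforms)
try:
    broken[0]   # the lazy transform runs here and cannot find examples["image"]
except KeyError as e:
    print("KeyError:", e)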
(Solution 2) – disguise image as return_dict.. // a purely technical hack
- Current situation: trainer_input was created by attaching .with_transform(transforms) to food['train'].
- Problem: when .with_transform(transforms) is realized inside the internals of trainer.predict(),
transforms??
Signature: transforms(examples)
Docstring: <no docstring>
Source:
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples
File: /tmp/ipykernel_706133/1515420127.py
Type: function
this body has to run, but image is not a valid key for the model's input, so the trainer has already removed it.
- Strategy: prevent it from being removed..
#model.forward?
#trainer_input = food['train'].with_transform(transforms)
trainer_input2 = trainer_input.rename_columns({'image':'return_dict'})
trainer_input2
Dataset({
features: ['return_dict', 'label'],
num_rows: 10
})
def transforms2(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["return_dict"]]
    del examples["return_dict"]
    return examples
trainer_input3 = trainer_input2.with_transform(transforms2)
trainer_input3
Dataset({
features: ['return_dict', 'label'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input3)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
ImageClassifierOutput(loss=tensor(4.5805, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.0258, 0.0340, 0.0493, 0.0371, 0.0371, 0.0636, -0.0353, -0.0416,
0.0061, -0.0304, 0.0127, -0.0633, 0.0532, -0.1117, -0.1657, -0.0810,
0.0022, 0.0071, -0.0947, -0.0831, -0.1189, 0.0783, -0.2383, -0.0486,
0.1039, 0.0115, 0.0054, -0.0113, 0.0740, 0.0783, 0.0188, 0.0618,
0.2759, 0.1308, -0.1028, 0.0198, 0.0032, 0.2006, -0.1247, -0.0512,
-0.0331, -0.0608, -0.1030, 0.0307, 0.2115, 0.1275, -0.1836, -0.2429,
-0.1090, -0.0293, 0.1010, 0.0847, -0.0655, 0.0416, -0.1167, -0.0598,
0.1333, 0.1627, -0.1722, 0.0046, -0.0842, 0.0161, 0.1583, -0.0403,
-0.0190, -0.1496, 0.0723, -0.0647, -0.1083, -0.1299, 0.0851, -0.1810,
0.0214, 0.2340, -0.0186, -0.1256, 0.0582, 0.1798, 0.1589, -0.0982,
0.0066, 0.0177, 0.0315, 0.0404, 0.1300, -0.0198, 0.0468, -0.0595,
0.2014, 0.0155, -0.0009, 0.1910, -0.0110, 0.1809, 0.0187, 0.0010,
0.0691, 0.2024, 0.1041, -0.1182, 0.0577],
[-0.0791, 0.0483, -0.0684, 0.0205, 0.0634, 0.0355, 0.1256, 0.0242,
0.0795, -0.1158, 0.1004, -0.0554, 0.1398, 0.0703, -0.0372, -0.0903,
0.0322, -0.1763, -0.0331, 0.0778, 0.0345, 0.0899, 0.0006, -0.1170,
-0.0303, 0.0620, -0.1490, -0.0589, -0.0060, 0.0266, -0.0812, -0.0497,
-0.0114, 0.0981, -0.0686, 0.0337, 0.0196, 0.0132, -0.1738, -0.0574,
-0.0434, 0.0773, 0.0020, 0.1212, 0.1227, -0.0150, -0.0698, -0.1568,
0.0644, -0.1053, 0.0420, -0.1292, -0.1032, -0.1744, -0.1242, -0.0229,
0.1295, 0.0844, -0.1660, -0.0132, -0.0407, 0.1438, -0.0115, -0.0879,
-0.1188, -0.1644, -0.0454, -0.0449, -0.0555, -0.2129, -0.0220, -0.1480,
-0.0191, 0.2003, 0.0107, 0.1169, 0.0108, 0.0526, 0.1320, -0.2591,
0.0240, -0.0215, 0.2772, 0.0699, 0.0940, 0.0377, 0.0715, 0.1504,
0.0094, -0.0027, 0.1345, 0.2739, 0.0965, 0.1069, -0.0843, 0.0841,
0.0078, 0.1318, 0.1355, 0.0620, -0.0478]], device='cuda:0',
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator= data_collator
)
trainer.predict(trainer_input3)
PredictionOutput(predictions=array([[-0.05419738, -0.05692905, 0.02577981, ..., 0.06238552,
-0.08741985, 0.00835681],
[ 0.06003767, 0.03531971, -0.01702251, ..., 0.10187533,
-0.04111148, -0.11816782],
[-0.06362653, -0.06895374, 0.04998193, ..., 0.04436018,
0.09370279, -0.10635335],
...,
[ 0.01029072, 0.00109556, -0.0853666 , ..., 0.117467 ,
-0.07630866, 0.04534987],
[-0.02582262, 0.03399263, 0.04932407, ..., 0.10409873,
-0.11815406, 0.05774596],
[-0.07906114, 0.04832995, -0.06836515, ..., 0.1355295 ,
0.06195256, -0.04780686]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.615988254547119, 'test_model_preparation_time': 0.0021, 'test_runtime': 0.122, 'test_samples_per_second': 81.994, 'test_steps_per_second': 16.399})
(Solution 3) – execute the with_transform registered on trainer_input immediately, instead of lazily
trainer_input
Dataset({
features: ['image', 'label'],
num_rows: 10
})
trainer_input2 = [l for l in trainer_input]
#trainer_input2
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x
)
_batched_data = batch_maker.get_test_dataloader(trainer_input2)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
ImageClassifierOutput(loss=tensor(4.5605, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.0081, 0.0617, 0.0638, 0.0713, 0.0144, 0.0612, -0.0130, 0.0154,
-0.0081, -0.0375, 0.0250, -0.0233, 0.0434, -0.1051, -0.1325, -0.0450,
-0.0049, -0.0313, -0.0842, -0.0833, -0.0892, 0.0594, -0.2713, -0.0347,
0.1534, 0.0343, 0.0183, 0.0157, 0.0553, 0.1003, 0.0007, 0.0441,
0.2778, 0.1277, -0.1301, 0.0467, -0.0503, 0.2478, -0.1140, -0.1092,
-0.0189, -0.0305, -0.1160, 0.0112, 0.2403, 0.1366, -0.1775, -0.2425,
-0.1163, -0.0243, 0.0992, 0.0648, -0.0584, 0.0718, -0.1058, -0.0473,
0.1545, 0.1715, -0.2551, 0.0352, -0.0359, 0.0221, 0.1607, -0.0603,
-0.0414, -0.1300, 0.1734, -0.0703, -0.1057, -0.1081, 0.0777, -0.1908,
0.0017, 0.3012, -0.0455, -0.1913, 0.0702, 0.1233, 0.1578, -0.0738,
-0.0173, 0.0552, 0.0420, 0.0655, 0.1074, -0.0273, 0.0485, -0.0461,
0.1798, 0.0381, 0.0032, 0.1604, -0.0975, 0.1537, 0.0042, -0.0461,
0.0601, 0.2107, 0.1335, -0.1295, 0.0352],
[-0.1044, 0.0104, -0.0422, -0.1469, -0.0117, 0.0846, 0.1661, -0.0103,
0.0525, -0.0917, 0.1212, -0.0444, 0.1618, 0.1138, -0.0373, 0.0542,
0.0429, -0.2012, -0.0207, 0.0457, 0.0667, 0.0972, -0.0717, -0.0703,
0.0701, 0.0540, -0.0171, -0.0794, 0.0547, 0.2083, -0.0065, 0.0393,
0.0592, 0.2466, 0.0027, 0.0328, -0.0566, 0.0978, -0.1787, 0.0818,
-0.0550, 0.0916, 0.0148, 0.1101, 0.1682, 0.0056, -0.0835, -0.2765,
-0.0238, -0.1956, 0.0127, -0.0766, -0.0920, -0.1452, -0.0421, -0.0560,
0.1438, 0.1189, -0.1660, 0.0936, -0.0736, 0.1523, 0.0853, -0.0591,
-0.0346, -0.1171, 0.0096, -0.0056, 0.0095, -0.2420, 0.0185, -0.0991,
0.1547, 0.2323, 0.0378, 0.0578, 0.0714, 0.1055, 0.1090, -0.2153,
0.1281, 0.0639, 0.1533, -0.0397, 0.2495, 0.0217, 0.0576, 0.1019,
0.2074, 0.0387, 0.1036, 0.3094, 0.1219, 0.0817, -0.0584, 0.0388,
0.0619, 0.0701, 0.0913, 0.0300, -0.0823]], device='cuda:0',
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator= data_collator
)
trainer.predict(trainer_input2)
PredictionOutput(predictions=array([[-0.11918398, -0.16509736, 0.0360051 , ..., -0.0015837 ,
-0.1649007 , 0.06934457],
[ 0.02522344, -0.03897335, 0.14615731, ..., 0.14916745,
-0.08329906, 0.00363915],
[-0.03516109, -0.03787031, 0.06803481, ..., 0.0288963 ,
0.03937618, -0.04595642],
...,
[ 0.05573866, -0.03177875, -0.12821546, ..., 0.08249602,
-0.12046868, 0.02415534],
[-0.00807494, 0.06173132, 0.06380235, ..., 0.13346131,
-0.1294754 , 0.03517982],
[-0.10440043, 0.01043405, -0.04224908, ..., 0.09133518,
0.03001446, -0.08225137]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.626803398132324, 'test_model_preparation_time': 0.0022, 'test_runtime': 0.0629, 'test_samples_per_second': 159.085, 'test_steps_per_second': 31.817})
(Solution 4) – set the trainer's "remove unused columns" feature to False..
trainer_input
Dataset({
features: ['image', 'label'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
model(**data_collator(single_batch))
ImageClassifierOutput(loss=tensor(4.5805, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.0258, 0.0340, 0.0493, 0.0371, 0.0371, 0.0636, -0.0353, -0.0416,
0.0061, -0.0304, 0.0127, -0.0633, 0.0532, -0.1117, -0.1657, -0.0810,
0.0022, 0.0071, -0.0947, -0.0831, -0.1189, 0.0783, -0.2383, -0.0486,
0.1039, 0.0115, 0.0054, -0.0113, 0.0740, 0.0783, 0.0188, 0.0618,
0.2759, 0.1308, -0.1028, 0.0198, 0.0032, 0.2006, -0.1247, -0.0512,
-0.0331, -0.0608, -0.1030, 0.0307, 0.2115, 0.1275, -0.1836, -0.2429,
-0.1090, -0.0293, 0.1010, 0.0847, -0.0655, 0.0416, -0.1167, -0.0598,
0.1333, 0.1627, -0.1722, 0.0046, -0.0842, 0.0161, 0.1583, -0.0403,
-0.0190, -0.1496, 0.0723, -0.0647, -0.1083, -0.1299, 0.0851, -0.1810,
0.0214, 0.2340, -0.0186, -0.1256, 0.0582, 0.1798, 0.1589, -0.0982,
0.0066, 0.0177, 0.0315, 0.0404, 0.1300, -0.0198, 0.0468, -0.0595,
0.2014, 0.0155, -0.0009, 0.1910, -0.0110, 0.1809, 0.0187, 0.0010,
0.0691, 0.2024, 0.1041, -0.1182, 0.0577],
[-0.0791, 0.0483, -0.0684, 0.0205, 0.0634, 0.0355, 0.1256, 0.0242,
0.0795, -0.1158, 0.1004, -0.0554, 0.1398, 0.0703, -0.0372, -0.0903,
0.0322, -0.1763, -0.0331, 0.0778, 0.0345, 0.0899, 0.0006, -0.1170,
-0.0303, 0.0620, -0.1490, -0.0589, -0.0060, 0.0266, -0.0812, -0.0497,
-0.0114, 0.0981, -0.0686, 0.0337, 0.0196, 0.0132, -0.1738, -0.0574,
-0.0434, 0.0773, 0.0020, 0.1212, 0.1227, -0.0150, -0.0698, -0.1568,
0.0644, -0.1053, 0.0420, -0.1292, -0.1032, -0.1744, -0.1242, -0.0229,
0.1295, 0.0844, -0.1660, -0.0132, -0.0407, 0.1438, -0.0115, -0.0879,
-0.1188, -0.1644, -0.0454, -0.0449, -0.0555, -0.2129, -0.0220, -0.1480,
-0.0191, 0.2003, 0.0107, 0.1169, 0.0108, 0.0526, 0.1320, -0.2591,
0.0240, -0.0215, 0.2772, 0.0699, 0.0940, 0.0377, 0.0715, 0.1504,
0.0094, -0.0027, 0.1345, 0.2739, 0.0965, 0.1069, -0.0843, 0.0841,
0.0078, 0.1318, 0.1355, 0.0620, -0.0478]], device='cuda:0',
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator= data_collator,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[-0.05419738, -0.05692905, 0.02577981, ..., 0.06238552,
-0.08741985, 0.00835681],
[ 0.06003767, 0.03531971, -0.01702251, ..., 0.10187533,
-0.04111148, -0.11816782],
[-0.06362653, -0.06895374, 0.04998193, ..., 0.04436018,
0.09370279, -0.10635335],
...,
[ 0.01029072, 0.00109556, -0.0853666 , ..., 0.117467 ,
-0.07630866, 0.04534987],
[-0.02582262, 0.03399263, 0.04932407, ..., 0.10409873,
-0.11815406, 0.05774596],
[-0.07906114, 0.04832995, -0.06836515, ..., 0.1355295 ,
0.06195256, -0.04780686]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.615988254547119, 'test_model_preparation_time': 0.0021, 'test_runtime': 0.0794, 'test_samples_per_second': 125.968, 'test_steps_per_second': 25.194})
#
(Solution 5) – if you are going to set the trainer's "remove unused columns" feature to False anyway, you can obtain a single_batch directly, as below, without even needing a batch_maker.
Solution 4 style: obtaining the single batch in a way similar to how the trainer actually obtains it
batch_maker = transformers.Trainer(
    model = model,
    data_collator = lambda x: x,
    args = transformers.TrainingArguments(
        output_dir= "asdf", # anything works, but something must be given
        remove_unused_columns= False # this is the key point!!
    )
)
_batched_data = batch_maker.get_test_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[0]
# single_batch = [
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# {'label':int, 'pixel_values': 3d-tsr},
# ]
Format observation:
note that single_batch has the form [Dict, Dict, Dict, ...., Dict].
Solution 5 style: a single batch obtained the brute-force way, taking a hint from the format observation
single_batch = [
trainer_input[0],
trainer_input[1],
trainer_input[2],
trainer_input[3],
trainer_input[4],
trainer_input[5],
trainer_input[6],
trainer_input[7],
]
Anyway, once the single batch has been obtained in the Solution-5 style, the rest of the code is the same.
model.to("cpu")
model(**data_collator(single_batch));
trainer = transformers.Trainer(
model = model,
data_collator= data_collator,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[-0.05419738, -0.05692905, 0.02577981, ..., 0.06238552,
-0.08741985, 0.00835681],
[ 0.06003767, 0.03531971, -0.01702251, ..., 0.10187533,
-0.04111148, -0.11816782],
[-0.06362653, -0.06895374, 0.04998193, ..., 0.04436018,
0.09370279, -0.10635335],
...,
[ 0.01029072, 0.00109556, -0.0853666 , ..., 0.117467 ,
-0.07630866, 0.04534987],
[-0.02582262, 0.03399263, 0.04932407, ..., 0.10409873,
-0.11815406, 0.05774596],
[-0.07906114, 0.04832995, -0.06836515, ..., 0.1355295 ,
0.06195256, -0.04780686]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.615988254547119, 'test_model_preparation_time': 0.0015, 'test_runtime': 0.0567, 'test_samples_per_second': 176.35, 'test_steps_per_second': 35.27})
Note 1: the single batch cannot be obtained in the way below. – Why? Because of lazy execution..
#single_batch = trainer_input.to_list()[:8]
Note 2: the single batch cannot be obtained in the way below either.
#single_batch = trainer_input[:8]
Why?
trainer_input[:2] == [trainer_input[0],trainer_input[1]]
False
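The underlying reason is that slicing a datasets.Dataset returns a dict of columns rather than a list of per-example dicts. A quick check (a sketch I added):
# slicing gives one dict whose values are column lists, not a list of row dicts
print(type(trainer_input[:2]))        # <class 'dict'>
print(list(trainer_input[:2].keys())) # e.g. ['label', 'pixel_values'] once the lazy transform fires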
D. FOOD101 – implementing DefaultDataCollator
1. Data preparation: "guebin/food101-tiny" \(\to\) trainer_input
food = datasets.load_dataset("guebin/food101-tiny")
image_processor = transformers.AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
normalize = torchvision.transforms.Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
size = (
image_processor.size["shortest_edge"]
if "shortest_edge" in image_processor.size
else (image_processor.size["height"], image_processor.size["width"])
)
_transforms = torchvision.transforms.Compose([
torchvision.transforms.RandomResizedCrop(size),
torchvision.transforms.ToTensor(),
normalize
])
def transforms(examples):
    examples["pixel_values"] = [_transforms(img.convert("RGB")) for img in examples["image"]]
    del examples["image"]
    return examples
trainer_input = food['train'].with_transform(transforms)
trainer_input
Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Dataset({
features: ['image', 'label'],
num_rows: 10
})
2. Model preparation: "google/vit-base-patch16-224-in21k" \(\to\) model
labels = food["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label
model = transformers.AutoModelForImageClassification.from_pretrained(
"google/vit-base-patch16-224-in21k",
num_labels=len(labels),
id2label=id2label,
label2id=label2id,
)
Some weights of ViTForImageClassification were not initialized from the model checkpoint at google/vit-base-patch16-224-in21k and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: designing collate_fn yourself
# data_collator = transformers.DefaultDataCollator()
# data_collator
def collate_fn(single_batch):
    pass
Design a collate_fn that plays the same role as DefaultDataCollator(). Using it, build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution)
trainer_input
Dataset({
features: ['image', 'label'],
num_rows: 10
})
# batch_maker = transformers.Trainer(
# model= model,
# data_collator= lambda x: x,
# args = transformers.TrainingArguments(
# output_dir="asdf",
# remove_unused_columns=False
# )
# )
# _batched_data = batch_maker.get_eval_dataloader(trainer_input)
# batched_data = list(_batched_data)
# single_batch = batched_data[-1]
#---#
single_batch = [trainer_input[-2],trainer_input[-1]]
single_batch
[{'label': 6,
'pixel_values': tensor([[[ 0.9294, 0.9137, 0.9137, ..., -0.0902, -0.1373, -0.1451],
[ 0.9216, 0.8902, 0.8824, ..., -0.1059, -0.1451, -0.1216],
[ 0.9137, 0.8745, 0.8588, ..., -0.1294, -0.1608, -0.1216],
...,
[ 0.8902, 0.8824, 0.8588, ..., 0.6471, 0.6941, 0.7490],
[ 0.8980, 0.9608, 0.9216, ..., 0.6471, 0.6863, 0.7412],
[ 0.7961, 0.9294, 0.8980, ..., 0.6863, 0.7569, 0.8275]],
[[ 0.7882, 0.7725, 0.7725, ..., -0.7020, -0.7490, -0.7569],
[ 0.7804, 0.7412, 0.7333, ..., -0.7176, -0.7569, -0.7490],
[ 0.7725, 0.7255, 0.7098, ..., -0.7412, -0.7725, -0.7490],
...,
[ 0.6000, 0.6000, 0.5765, ..., 0.3569, 0.4196, 0.4824],
[ 0.6078, 0.6784, 0.6549, ..., 0.3647, 0.4275, 0.4824],
[ 0.5059, 0.6471, 0.6314, ..., 0.4118, 0.5137, 0.5843]],
[[ 0.3020, 0.2863, 0.2863, ..., -0.8824, -0.9294, -0.9373],
[ 0.2941, 0.2784, 0.2627, ..., -0.8980, -0.9373, -0.9294],
[ 0.3020, 0.2627, 0.2471, ..., -0.9294, -0.9608, -0.9373],
...,
[-0.0588, -0.0431, -0.0275, ..., -0.2000, -0.1294, -0.0588],
[-0.0196, 0.0745, 0.0824, ..., -0.1843, -0.1059, -0.0353],
[-0.1059, 0.0667, 0.0745, ..., -0.1216, -0.0118, 0.0745]]])},
{'label': 6,
'pixel_values': tensor([[[ 0.2471, 0.2392, 0.2235, ..., 0.5529, 0.5608, 0.5686],
[ 0.2784, 0.2706, 0.2549, ..., 0.5137, 0.5216, 0.5294],
[ 0.2863, 0.2863, 0.2706, ..., 0.5373, 0.5373, 0.5373],
...,
[ 0.1843, 0.1843, 0.1922, ..., 0.3961, 0.4039, 0.4039],
[ 0.1765, 0.1765, 0.1765, ..., 0.3725, 0.3804, 0.3804],
[ 0.1843, 0.1843, 0.1843, ..., 0.3412, 0.3490, 0.3490]],
[[ 0.0196, 0.0118, -0.0039, ..., 0.2392, 0.2471, 0.2549],
[ 0.0510, 0.0431, 0.0275, ..., 0.2000, 0.2078, 0.2157],
[ 0.0431, 0.0353, 0.0275, ..., 0.2235, 0.2235, 0.2235],
...,
[ 0.0275, 0.0275, 0.0353, ..., 0.2314, 0.2392, 0.2392],
[ 0.0196, 0.0196, 0.0275, ..., 0.2078, 0.2157, 0.2157],
[ 0.0353, 0.0353, 0.0353, ..., 0.1765, 0.1843, 0.1843]],
[[-0.0275, -0.0353, -0.0510, ..., 0.3020, 0.3098, 0.3176],
[ 0.0039, -0.0039, -0.0196, ..., 0.2627, 0.2706, 0.2784],
[-0.0196, -0.0196, -0.0275, ..., 0.2784, 0.2784, 0.2784],
...,
[-0.0275, -0.0275, -0.0196, ..., 0.2000, 0.2078, 0.2078],
[-0.0353, -0.0353, -0.0275, ..., 0.1843, 0.1922, 0.1922],
[-0.0118, -0.0118, -0.0118, ..., 0.1529, 0.1608, 0.1608]]])}]
def collate_fn(single_batch):
    # single_batch = [Dict, Dict]
    # Dict = {'label': 6, 'pixel_values': [3, 224, 224]-tensor}
    collated_data = dict()
    collated_data['labels'] = torch.tensor([dct['label'] for dct in single_batch])
    collated_data['pixel_values'] = torch.stack([dct['pixel_values'] for dct in single_batch])
    return collated_data
model.to("cpu")
model(**collate_fn(single_batch))
ImageClassifierOutput(loss=tensor(4.6879, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0804, 0.0968, 0.0104, 0.0587, 0.0753, -0.1459, -0.0490, 0.0943,
-0.1302, 0.0035, 0.0278, 0.0814, -0.0322, -0.0997, 0.0074, 0.0590,
0.1447, -0.0570, 0.0402, -0.1111, 0.0828, -0.0466, -0.0744, -0.0126,
-0.0425, 0.1688, -0.0974, -0.0623, 0.0361, 0.0408, 0.0729, -0.0884,
-0.1466, -0.0140, -0.0014, 0.0648, 0.1264, -0.0280, 0.1474, -0.0689,
-0.1422, 0.0655, 0.0284, -0.0079, -0.0690, -0.0004, 0.1554, 0.2469,
-0.0823, -0.1235, 0.1127, 0.0328, -0.0263, -0.1717, -0.0735, -0.0631,
-0.0033, 0.0384, 0.0394, -0.0366, -0.0721, -0.1715, -0.1646, 0.1292,
-0.0584, 0.1022, 0.1657, -0.0345, -0.0113, 0.0878, 0.0139, 0.0916,
0.0486, 0.1362, -0.1265, -0.0859, 0.1684, 0.0747, -0.0101, 0.0710,
0.1240, 0.0428, 0.0963, -0.0619, 0.0882, -0.1248, -0.0710, -0.0345,
-0.0587, 0.0099, -0.0551, 0.0146, 0.0188, -0.0608, -0.0025, -0.0860,
0.0773, -0.0181, 0.0626, -0.0063, 0.1354],
[-0.1139, 0.1333, 0.0288, 0.1036, 0.0013, -0.1305, -0.0787, -0.0498,
-0.0068, -0.0083, -0.1182, 0.1121, -0.0138, 0.1194, -0.0208, 0.0663,
0.1040, 0.0072, 0.0234, -0.0689, -0.0308, -0.1163, 0.0537, -0.0286,
-0.0101, 0.0307, -0.0585, -0.0954, 0.0320, -0.0579, 0.0325, -0.0295,
-0.1303, -0.0086, 0.0865, -0.0150, 0.1053, -0.0445, 0.1173, 0.0385,
-0.0747, -0.0407, 0.0267, 0.0213, -0.0670, -0.0072, 0.1436, 0.2285,
-0.0249, -0.0071, 0.2300, 0.0438, -0.0619, -0.1296, -0.0915, -0.1184,
-0.0810, -0.0472, 0.0674, -0.0898, -0.2508, 0.0194, -0.1328, 0.0603,
-0.0573, 0.2025, 0.1324, -0.0234, 0.1049, -0.0662, -0.0686, 0.1090,
0.0918, -0.0061, 0.0338, -0.0134, 0.2654, 0.0237, -0.0282, 0.0598,
0.0974, 0.0358, 0.0761, 0.0540, -0.0830, 0.0899, -0.0992, -0.1714,
-0.1553, 0.0348, 0.0597, -0.0319, 0.0956, 0.0430, -0.0128, 0.0559,
0.1588, -0.0096, 0.0150, -0.0084, -0.0050]],
grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
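As a cross-check (a sketch I added; it assumes the library collator produces exactly the same stacked tensors for this batch), the hand-written collate_fn can be compared with DefaultDataCollator directly:
# compare our collate_fn against the library collator on the same single_batch
ref = transformers.DefaultDataCollator()(single_batch)
ours = collate_fn(single_batch)
assert torch.equal(ref["labels"], ours["labels"])
assert torch.equal(ref["pixel_values"], ours["pixel_values"])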
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
args=transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Prediction *****
Num examples = 10
Batch size = 8
PredictionOutput(predictions=array([[-0.11636792, -0.0263294 , 0.1064104 , ..., -0.04388852,
-0.07757819, 0.01414965],
[-0.07075333, 0.05525547, 0.0611947 , ..., -0.03989508,
-0.0375449 , 0.12472424],
[-0.07043052, 0.12611614, 0.00971566, ..., 0.00761356,
-0.00533151, 0.02300033],
...,
[-0.20117757, 0.09828947, 0.00724527, ..., -0.04101294,
-0.02915922, 0.21293962],
[-0.05230844, 0.08269425, 0.02642585, ..., 0.03440103,
-0.00195974, 0.12479743],
[-0.0799979 , 0.02366202, 0.03927091, ..., 0.05246412,
0.07132983, -0.06526391]], dtype=float32), label_ids=array([6, 6, 6, 6, 6, 6, 6, 6, 6, 6]), metrics={'test_loss': 4.681787967681885, 'test_model_preparation_time': 0.0015, 'test_runtime': 0.0592, 'test_samples_per_second': 168.828, 'test_steps_per_second': 33.766})
E. IMDB – implementing DataCollatorWithPadding
ref: https://huggingface.co/docs/transformers/tasks/sequence_classification
1. 데이터준비: "guebin/imdb-tiny" \(\to\) trainer_input
imdb = datasets.load_dataset("guebin/imdb-tiny")
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_imdb = imdb.map(preprocess_function,batched=True)
trainer_input = tokenized_imdb['train']
2. Model preparation: "distilbert/distilbert-base-uncased" \(\to\) model
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased", num_labels=2
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
3. Data collator: designing collate_fn yourself
# data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)
# data_collator
def collate_fn(single_batch):
    pass
Design a collate_fn that plays the same role as DataCollatorWithPadding(). Using it, build a suitable trainer and verify that
trainer.predict(trainer_input) runs correctly.
(Solution)
trainer_input
Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model= model,
data_collator= lambda x: x,
)
_batched_data = batch_maker.get_eval_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
labels = torch.tensor([dct['label'] for dct in single_batch])
labels
tensor([0, 0])
input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['input_ids']) for dct in single_batch]).t()
input_ids
tensor([[ 101, 2040, 2024, ..., 22132, 7847, 102],
[ 101, 2023, 2003, ..., 0, 0, 0]])
attention_mask = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['attention_mask']) for dct in single_batch]).t()
attention_mask
tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]])
# single_batch = [Dict, Dict]
# Dict = {
#     'label': int
#     'input_ids': 1d-list
#     'attention_mask': 1d-list
# }
def collate_fn(single_batch):
    collated_data = dict()
    collated_data['input_ids'] = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['input_ids']) for dct in single_batch]).t()
    collated_data['attention_mask'] = torch.nn.utils.rnn.pad_sequence([torch.tensor(dct['attention_mask']) for dct in single_batch]).t()
    collated_data['labels'] = torch.tensor([dct['label'] for dct in single_batch])
    return collated_data
collate_fn(single_batch)
{'input_ids': tensor([[ 101, 2040, 2024, ..., 22132, 7847, 102],
[ 101, 2023, 2003, ..., 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]]),
'labels': tensor([0, 0])}
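Before feeding the result to the model, it can also be compared against the library collator (a sketch I added; it assumes both pad on the right with 0, which holds for this tokenizer):
# compare the hand-written collator against DataCollatorWithPadding on the same batch
ref = transformers.DataCollatorWithPadding(tokenizer=tokenizer)(single_batch)
ours = collate_fn(single_batch)
for key in ["input_ids", "attention_mask", "labels"]:
    assert torch.equal(ref[key], ours[key])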
model.to("cpu")
model(**collate_fn(single_batch))
SequenceClassifierOutput(loss=tensor(0.6132, grad_fn=<NllLossBackward0>), logits=tensor([[0.1724, 0.0153],
[0.1978, 0.0212]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.18233198, 0.0185583 ],
[ 0.19031762, 0.02762305],
[ 0.19021928, 0.03987525],
[ 0.15878916, -0.00159456],
[ 0.18261112, 0.02069864],
[ 0.14113042, -0.00186965],
[ 0.17083615, 0.03911189],
[ 0.16111258, 0.01503472],
[ 0.17235444, 0.0153002 ],
[ 0.19777855, 0.02123523]], dtype=float32), label_ids=array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), metrics={'test_loss': 0.6185036897659302, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0378, 'test_samples_per_second': 264.573, 'test_steps_per_second': 52.915})
4. Practice – sms_spam
model = transformers.AutoModelForSequenceClassification.from_pretrained(
"distilbert/distilbert-base-uncased", num_labels=2
)
tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
spam = datasets.load_dataset('guebin/spam-tiny')
spam
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
DatasetDict({
train: Dataset({
features: ['sms', 'label'],
num_rows: 10
})
})
spam
DatasetDict({
train: Dataset({
features: ['sms', 'label'],
num_rows: 10
})
})
A. Method 1: fixed padding, collate_fn
def m_trans(example_batch):
    # example_batch = {'sms':[xxx,xxxx,...], 'label':[yyy,yyyy,...]}
    # example_batch = spam['train'][:8]
    out = tokenizer(example_batch['sms'],padding=True,truncation=True)
    return out
spam2 = spam.map(m_trans,batched=True,batch_size=8)
spam2
DatasetDict({
train: Dataset({
features: ['sms', 'label', 'input_ids', 'attention_mask'],
num_rows: 10
})
})
spam2.set_format("pt")
#spam2['train']['input_ids'] -- list of tensors with length 10
spam2['train'][8:]['input_ids'] # 2d-tensor (examples 8 and 9 were padded together in the same map batch, so they share one length)
tensor([[ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102],
[ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0]])
spam2['train'][7:]['input_ids'] # list of 1d-tensors (example 7 has length 56 while examples 8-9 have length 46, so they cannot form one tensor)
[tensor([ 101, 2004, 2566, 2115, 5227, 1005, 11463, 2571, 11463, 2571,
1006, 2030, 2226, 8117, 28987, 11231, 3070, 18447, 2063, 27617,
5575, 2226, 29525, 15464, 1007, 1005, 2038, 2042, 2275, 2004,
2115, 20587, 8525, 2638, 2005, 2035, 20587, 2015, 1012, 2811,
1008, 1023, 2000, 6100, 2115, 2814, 20587, 8525, 2638, 102,
0, 0, 0, 0, 0, 0]),
tensor([ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102]),
tensor([ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0])]
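Tensors of unequal length can still be merged by re-padding them, which is essentially what dynamic padding does below. A small sketch (my addition, reusing pad_sequence as in Section E):
# re-pad the ragged list of 1-D tensors into a single 2-D tensor
ragged = spam2['train'][7:]['input_ids']                    # lengths 56, 46, 46
merged = torch.nn.utils.rnn.pad_sequence(ragged, batch_first=True)
print(merged.shape)                                         # torch.Size([3, 56])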
trainer_input = spam2['train'].remove_columns(['sms']).rename_columns({'label':'labels'})
trainer_input
Dataset({
features: ['labels', 'input_ids', 'attention_mask'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model= model,
data_collator=lambda x:x
)
_batched_data = batch_maker.get_eval_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
single_batch
[{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
device='cuda:0')},
{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
device='cuda:0')}]
torch.stack([single_batch[0]['labels'],single_batch[1]['labels']])
tensor([1, 1], device='cuda:0')
def collate_fn(single_batch):
    out = dict()
    out['labels'] = torch.stack([dct['labels'] for dct in single_batch])
    # note: torch.stack requires every sequence in the batch to have the same length (fixed padding)
    out['input_ids'] = torch.stack([dct['input_ids'] for dct in single_batch])
    out['attention_mask'] = torch.stack([dct['attention_mask'] for dct in single_batch])
    return out
model(**collate_fn(single_batch))
SequenceClassifierOutput(loss=tensor(0.2875, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.4793, 0.6985],
[-0.4598, 0.5654]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model= model,
data_collator=collate_fn
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.70305747, -0.7353085 ],
[ 0.7872642 , -0.77549946],
[-0.6230489 , 0.72870666],
[ 0.7890144 , -0.74234533],
[ 0.6112454 , -0.5727863 ],
[-0.56530714, 0.636417 ],
[ 0.39937705, -0.28327113],
[ 0.45833465, -0.43777147],
[-0.6101986 , 0.7738755 ],
[-0.48634416, 0.7041703 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.2599327564239502, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0119, 'test_samples_per_second': 842.754, 'test_steps_per_second': 168.551})
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [46] at entry 3
(Training fails because padding was applied per map batch of 8: the first eight examples were padded to length 56 and the last two to length 46, and a shuffled training batch mixes the two lengths, which torch.stack cannot handle.)
B. Method 2: fixed padding, DefaultDataCollator
def m_trans(example_batch):
    # example_batch = {'sms':[xxx,xxxx,...], 'label':[yyy,yyyy,...]}
    # example_batch = spam['train'][:8]
    out = tokenizer(example_batch['sms'],padding=True,truncation=True)
    return out
spam2 = spam.map(m_trans,batched=True,batch_size=8)
spam2.set_format("pt")
trainer_input = spam2['train'].remove_columns(['sms']).rename_columns({'label':'labels'})
batch_maker = transformers.Trainer(
model= model,
data_collator=lambda x:x
)
_batched_data = batch_maker.get_eval_dataloader(trainer_input)
batched_data = list(_batched_data)
single_batch = batched_data[-1]
single_batch
[{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 3453, 999, 999, 2004, 1037, 11126, 2897, 8013, 2017,
2031, 2042, 3479, 2000, 4374, 2050, 1069, 21057, 2692, 3396,
10377, 999, 2000, 4366, 2655, 5641, 2692, 2575, 16576, 24096,
21472, 2487, 1012, 4366, 3642, 1047, 2140, 22022, 2487, 1012,
9398, 2260, 2847, 2069, 1012, 102], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
device='cuda:0')},
{'labels': tensor(1, device='cuda:0'),
'input_ids': tensor([ 101, 2018, 2115, 4684, 2340, 2706, 2030, 2062, 1029, 1057,
1054, 4709, 2000, 10651, 2000, 1996, 6745, 6120, 4684, 2015,
2007, 4950, 2005, 2489, 999, 2655, 1996, 4684, 10651, 2522,
2489, 2006, 5511, 8889, 24594, 20842, 2692, 14142, 102, 0,
0, 0, 0, 0, 0, 0], device='cuda:0'),
'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
device='cuda:0')}]
# def collate_fn(single_batch):
# out = dict()
# out['labels'] = torch.stack([dct['labels'] for dct in single_batch])
# out['input_ids'] = torch.stack([dct['input_ids'] for dct in single_batch])
# out['attention_mask'] = torch.stack([dct['attention_mask'] for dct in single_batch])
# return out
data_collator = transformers.DefaultDataCollator()
model(**data_collator(single_batch))
SequenceClassifierOutput(loss=tensor(0.2875, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[-0.4793, 0.6985],
[-0.4598, 0.5654]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model= model,
data_collator=data_collator
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.70305747, -0.7353085 ],
[ 0.7872642 , -0.77549946],
[-0.6230489 , 0.72870666],
[ 0.7890144 , -0.74234533],
[ 0.6112454 , -0.5727863 ],
[-0.56530714, 0.636417 ],
[ 0.39937705, -0.28327113],
[ 0.45833465, -0.43777147],
[-0.6101986 , 0.7738755 ],
[-0.48634416, 0.7041703 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.2599327564239502, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0118, 'test_samples_per_second': 844.621, 'test_steps_per_second': 168.924})
trainer = transformers.Trainer(
model=model,
data_collator=data_collator,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
RuntimeError: stack expects each tensor to be equal size, but got [56] at entry 0 and [46] at entry 3
(Same failure as in Method 1: fixed padding per map batch produces two different sequence lengths.)
C. Method 3: dynamic padding, DataCollatorWithPadding
spam
DatasetDict({
train: Dataset({
features: ['sms', 'label'],
num_rows: 10
})
})
def w_trans(examples):
    # examples = spam['train'][:8] = {'sms': [xxx,xxxx,...], 'label':[yyy,yyyy,...]}
    out = tokenizer(examples['sms'],truncation=True)
    out['labels'] = torch.tensor(examples['label'])
    return out
trainer_input = spam.with_transform(w_trans)['train']
trainer_input
Dataset({
features: ['sms', 'label'],
num_rows: 10
})
batch_maker = transformers.Trainer(
model = model,
data_collator = lambda x: x,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
single_batch = next(iter(batch_maker.get_eval_dataloader(trainer_input)))
#single_batch
data_collator = transformers.DataCollatorWithPadding(tokenizer)
model.to("cpu")
model(**data_collator(single_batch))
SequenceClassifierOutput(loss=tensor(0.5369, grad_fn=<NllLossBackward0>), logits=tensor([[ 0.2340, -0.2688],
[ 0.2608, -0.2633],
[-0.1423, 0.2838],
[ 0.2734, -0.3063],
[ 0.3347, -0.1394],
[-0.0950, 0.1350],
[ 0.0552, 0.0188],
[ 0.1153, 0.0608]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model = model,
data_collator = data_collator,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[ 0.2797705 , -0.19910407],
[ 0.30945048, -0.2513666 ],
[-0.14997171, 0.28633246],
[ 0.30314386, -0.24964799],
[ 0.2884021 , -0.17398489],
[-0.07598098, 0.12895201],
[ 0.11931977, -0.05026204],
[ 0.08751589, -0.07571842],
[-0.13582245, 0.29102388],
[-0.06882622, 0.2479064 ]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.5247495770454407, 'test_model_preparation_time': 0.0007, 'test_runtime': 0.0093, 'test_samples_per_second': 1075.104, 'test_steps_per_second': 215.021})
trainer = transformers.Trainer(
model=model,
data_collator=data_collator,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
| Step | Training Loss |
|---|
TrainOutput(global_step=6, training_loss=0.38783260186513263, metrics={'train_runtime': 1.0559, 'train_samples_per_second': 28.412, 'train_steps_per_second': 5.682, 'total_flos': 421204931664.0, 'train_loss': 0.38783260186513263, 'epoch': 3.0})
D. Method 4: dynamic padding, no preprocessing \((\star)\)
Here all preprocessing (tokenization and dynamic padding) happens inside the collate_fn, so the raw spam['train'] can be passed to the trainer as-is.
trainer_input = spam['train']
trainer_input
Dataset({
features: ['sms', 'label'],
num_rows: 10
})
single_batch = [trainer_input[-2],trainer_input[-1]]
single_batch
[{'sms': 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\n',
'label': 1},
{'sms': 'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030\n',
'label': 1}]
def collate_fn(single_batch):
    # tokenize and dynamically pad the raw sms texts inside the collator
    out = tokenizer(
        [dct['sms'] for dct in single_batch],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    out['labels'] = torch.tensor([dct['label'] for dct in single_batch])
    return out
model.to("cpu")
model(**collate_fn(single_batch))
SequenceClassifierOutput(loss=tensor(0.6672, grad_fn=<NllLossBackward0>), logits=tensor([[0.0171, 0.1000],
[0.0605, 0.0832]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.predict(trainer_input)
PredictionOutput(predictions=array([[-0.02218767, 0.10636629],
[-0.01826159, 0.08857261],
[-0.01180449, 0.07579152],
[-0.03820946, 0.06749745],
[ 0.04095571, 0.06443821],
[ 0.0097203 , 0.05986086],
[-0.01054696, 0.09217122],
[-0.02597055, 0.07729876],
[ 0.01710123, 0.09998252],
[ 0.06050469, 0.08315243]], dtype=float32), label_ids=array([0, 0, 1, 0, 0, 1, 0, 0, 1, 1]), metrics={'test_loss': 0.7104923725128174, 'test_model_preparation_time': 0.0011, 'test_runtime': 0.0109, 'test_samples_per_second': 919.985, 'test_steps_per_second': 183.997})
trainer = transformers.Trainer(
model=model,
data_collator=collate_fn,
train_dataset=trainer_input,
args = transformers.TrainingArguments(
output_dir="asdf",
remove_unused_columns=False
)
)
trainer.train()
| Step | Training Loss |
|---|
TrainOutput(global_step=6, training_loss=0.6373028755187988, metrics={'train_runtime': 1.0552, 'train_samples_per_second': 28.431, 'train_steps_per_second': 5.686, 'total_flos': 421204931664.0, 'train_loss': 0.6373028755187988, 'epoch': 3.0})
A1. Announcements
Hello. Reviewing the recording, the lecture ran about 20 minutes over the scheduled time (the 3-hour lecture took 3 hours 20 minutes). I apologize. I will take this into account and make the upcoming lectures a bit shorter.
The code below
lst = [1,2,3]
lst2 = lst
lst2.append(4)
When you run this, lst and lst2 end up holding the same value; an explanation of this phenomenon can be found at
https://guebin.github.io/PP2023/posts/2023-06-21-13wk-1.html
for any students who are interested. (You do not need to know this material to get your grade in this course.)
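In short (a small sketch I added), both names refer to one and the same list object:
lst = [1, 2, 3]
lst2 = lst            # lst2 is a second name for the same list object, not a copy
lst2.append(4)
print(lst)            # [1, 2, 3, 4]
print(lst is lst2)    # True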