12wk-1: Cactus Image Classification

Author

최규빈

Published

December 3, 2024

1. Lecture Video

2. imports

import numpy as np
import pandas as pd 
import zipfile
import os
import PIL.Image
import matplotlib.pyplot as plt
#---#
import datasets
import transformers
import torchvision.transforms
import evaluate
import torch

3. Kaggle

A. ref

ref: https://www.kaggle.com/c/aerial-cactus-identification

B. Unzipping the data

with zipfile.ZipFile('aerial-cactus-identification.zip', 'r') as z:
    z.extractall('./data')
with zipfile.ZipFile('./data/test.zip', 'r') as z_test:
    z_test.extractall('./data')    
with zipfile.ZipFile('./data/train.zip', 'r') as z_train:
    z_train.extractall('./data')

C. A first look at the data

train_csv = pd.read_csv("./data/train.csv")
train_csv
id has_cactus
0 0004be2cfeaba1c0361d39e2b000257b.jpg 1
1 000c8a36845c0208e833c79c1bffedd1.jpg 1
2 000d1e9a533f62e55c289303b072733d.jpg 1
3 0011485b40695e9138e92d0b3fb55128.jpg 1
4 0014d7a11e90b62848904c1418fc8cf2.jpg 1
... ... ...
17495 ffede47a74e47a5930f81c0b6896479e.jpg 0
17496 ffef6382a50d23251d4bc05519c91037.jpg 1
17497 fff059ecc91b30be5745e8b81111dc7b.jpg 1
17498 fff43acb3b7a23edcc4ae937be2b7522.jpg 0
17499 fffd9e9b990eba07c836745d8aef1a3a.jpg 1

17500 rows × 2 columns

  • train contains a label for each image.
test_csv = pd.read_csv("./data/sample_submission.csv")
test_csv
id has_cactus
0 000940378805c44108d287872b2f04ce.jpg 0.5
1 0017242f54ececa4512b4d7937d1e21e.jpg 0.5
2 001ee6d8564003107853118ab87df407.jpg 0.5
3 002e175c3c1e060769475f52182583d0.jpg 0.5
4 0036e44a7e8f7218e9bc7bf8137e4943.jpg 0.5
... ... ...
3995 ffaafd0c9f2f0e73172848463bc2e523.jpg 0.5
3996 ffae37344310a1549162493237d25d3f.jpg 0.5
3997 ffbd469c56873d064326204aac546e0d.jpg 0.5
3998 ffcb76b7d47f29ece11c751e5f763f52.jpg 0.5
3999 fffed17d1a8e0433a934db518d7f532c.jpg 0.5

4000 rows × 2 columns

  • test has no labels for its images.
  • Our goal: estimate the probabilities well, write them into the has_cactus column of sample_submission, and submit the result to Kaggle.

4. Understanding Logits

A. What logits mean

- Understanding logits: suppose we are solving a classification problem with two classes, and the logits for 8 observations/examples are given as below.

logits = np.array(
    [[ 2.7346244, -3.1177292],
     [ 2.7103324, -3.1362345],
     [ 2.7464483, -3.0521457],
     [ 2.7195318, -3.122628 ],
     [ 2.7138977, -3.1041346],
     [ 2.7398622, -3.1098123],
     [ 0.0657177, -0.0930362],
     [-2.7668718,  3.0918367]]
)
logits
array([[ 2.7346244, -3.1177292],
       [ 2.7103324, -3.1362345],
       [ 2.7464483, -3.0521457],
       [ 2.7195318, -3.122628 ],
       [ 2.7138977, -3.1041346],
       [ 2.7398622, -3.1098123],
       [ 0.0657177, -0.0930362],
       [-2.7668718,  3.0918367]])

Logits generally have shape \((n,k)\), where \(n\) is the number of observations and \(k\) is the number of classes. This example has \(n=8\) and \(k=2\).

The meaning of each observation's logits is as follows.

  1. First observation ([2.7346244, -3.1177292]):
  • confidence for the first class: 2.7346244
  • confidence for the second class: -3.1177292
  2. Second observation ([2.7103324, -3.1362345]):
  • confidence for the first class: 2.7103324
  • confidence for the second class: -3.1362345

  3. Second-to-last observation ([0.0657177, -0.0930362]):
  • confidence for the first class: 0.0657177
  • confidence for the second class: -0.0930362
  4. Last observation ([-2.7668718, 3.0918367]):
  • confidence for the first class: -2.7668718
  • confidence for the second class: 3.0918367

B. Logits \(\to\) predicted class

- Let's walk through the logits \(\to\) predicted-class step.

  1. First observation: \(2.7346244 > -3.1177292\) \(\Rightarrow\) predict the first class

  2. Second observation: \(2.7103324 > -3.1362345\) \(\Rightarrow\) predict the first class

  3. Second-to-last observation: \(0.0657177 > -0.0930362\) \(\Rightarrow\) predict the first class

  4. Last observation: \(-2.7668718 < 3.0918367\) \(\Rightarrow\) predict the second class

logits
array([[ 2.7346244, -3.1177292],
       [ 2.7103324, -3.1362345],
       [ 2.7464483, -3.0521457],
       [ 2.7195318, -3.122628 ],
       [ 2.7138977, -3.1041346],
       [ 2.7398622, -3.1098123],
       [ 0.0657177, -0.0930362],
       [-2.7668718,  3.0918367]])
for u in logits:
    u1, u2 = u 
    if u1 > u2: 
        prediction = 0 
    else: 
        prediction = 1 
    print(prediction)
0
0
0
0
0
0
0
1

- The same result can also be obtained with the following.

logits.argmax(axis=1)
array([0, 0, 0, 0, 0, 0, 0, 1])

C. Logits \(\to\) predicted probabilities

- Let's walk through the logits \(\to\) predicted-probability step.

Let \({\boldsymbol u}=\begin{bmatrix} u_1 & \dots & u_k\end{bmatrix}\) denote the logits for a fixed observation. The probability of each class is then computed as follows.

\[\text{prob} =\left[\frac{\exp(u_1)}{\exp(u_1)+\dots+\exp(u_k)}, \cdots, \frac{\exp(u_k)}{\exp(u_1)+\dots+\exp(u_k)}\right]\]

logits
array([[ 2.7346244, -3.1177292],
       [ 2.7103324, -3.1362345],
       [ 2.7464483, -3.0521457],
       [ 2.7195318, -3.122628 ],
       [ 2.7138977, -3.1041346],
       [ 2.7398622, -3.1098123],
       [ 0.0657177, -0.0930362],
       [-2.7668718,  3.0918367]])
for u in logits:
    u1,u2 = u
    p1 = np.exp(u1) / (np.exp(u1)+np.exp(u2))
    p2 = np.exp(u2) / (np.exp(u1)+np.exp(u2))
    prediction_scores = [p1,p2]
    print(prediction_scores)
[0.9971351022231982, 0.0028648977768018545]
[0.9971185237682113, 0.0028814762317887605]
[0.9969773496339113, 0.0030226503660887583]
[0.9971058336243687, 0.0028941663756314024]
[0.997035364944602, 0.0029646350553980375]
[0.9971274386623321, 0.002872561337667959]
[0.5396053294830294, 0.4603946705169706]
[0.0028468010295870897, 0.997153198970413]

- These probabilities can also be obtained with the following.

torch.tensor(logits).softmax(dim=1)
tensor([[0.9971, 0.0029],
        [0.9971, 0.0029],
        [0.9970, 0.0030],
        [0.9971, 0.0029],
        [0.9970, 0.0030],
        [0.9971, 0.0029],
        [0.5396, 0.4604],
        [0.0028, 0.9972]], dtype=torch.float64)
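
The same values can be reproduced in plain numpy. A minimal sketch (not from the lecture): subtracting each row's maximum before exponentiating leaves the probabilities unchanged but avoids overflow in exp for large logits.

def np_softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilization; cancels in the ratio
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
np_softmax(logits)  # should match torch.tensor(logits).softmax(dim=1) above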

5. Evaluation Metrics

A. Computing accuracy

- Computing accuracy: suppose logits and labels are given as below.

logits = np.array(
    [[ 2.7346244, -3.1177292],
     [ 2.7103324, -3.1362345],
     [ 2.7464483, -3.0521457],
     [ 2.7195318, -3.122628 ],
     [ 2.7138977, -3.1041346],
     [ 2.7398622, -3.1098123],
     [ 0.0657177, -0.0930362],
     [-2.7668718,  3.0918367]]
)
references = labels = np.array([0,0,0,0,0,0,1,1])
predictions = logits.argmax(axis=1)
predictions
array([0, 0, 0, 0, 0, 0, 0, 1])
references
array([0, 0, 0, 0, 0, 0, 1, 1])

Accuracy can be computed as follows.

7/8
0.875
(predictions == references).sum() / 8
0.875
(predictions == references).mean()
0.875

It can also be computed like this.

acc = evaluate.load("accuracy")
acc.compute(predictions = predictions, references= references)
{'accuracy': 0.875}

B. Computing recall

- Sometimes we want to know specifically how well the model catches the 1s.

\[\text{recall}= \frac{\text{number of observations with true label 1 that are correctly predicted}}{\text{number of observations with true label 1}}\]

logits = np.array(
    [[ 2.7346244, -3.1177292],
     [ 2.7103324, -3.1362345],
     [ 2.7464483, -3.0521457],
     [ 2.7195318, -3.122628 ],
     [ 2.7138977, -3.1041346],
     [ 2.7398622, -3.1098123],
     [ 0.0657177, -0.0930362],
     [-2.7668718,  3.0918367]]
)
references = labels = np.array([0,0,0,0,0,0,1,1])
predictions = logits.argmax(axis=1)
predictions
array([0, 0, 0, 0, 0, 0, 0, 1])
(predictions[references == 1]==1).mean() # recall 
0.5

The same value can be obtained as follows.

rec = evaluate.load("recall")
rec.compute(predictions = predictions, references = references)
{'recall': 0.5}

C. Computing AUC

- Evaluation metrics other than accuracy

- AUC: a meaningful metric when the classes are imbalanced

logits = np.array(
    [[ 2.7346244, -3.1177292],
     [ 2.7103324, -3.1362345],
     [ 2.7464483, -3.0521457],
     [ 2.7195318, -3.122628 ],
     [ 2.7138977, -3.1041346],
     [ 2.7398622, -3.1098123],
     [ 0.0657177, -0.0930362],
     [-2.7668718,  3.0918367]]
)
references = labels = np.array([0,0,0,0,0,0,1,1])
probabilities = torch.tensor(logits).softmax(dim=1).numpy()
prediction_scores = probabilities[:,1]
prediction_scores
array([0.0028649 , 0.00288148, 0.00302265, 0.00289417, 0.00296464,
       0.00287256, 0.46039467, 0.9971532 ])
  • What if we predicted 1 whenever the probability is at least 0.4? \(\to\) Wouldn't that classify everything correctly?
roc_auc = evaluate.load("roc_auc")
roc_auc.compute(prediction_scores=prediction_scores, references=references)
{'roc_auc': 1.0}
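
As a sanity check (a sketch, not part of the original notes), this AUC can be reproduced from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score. There are no tied scores here, so a strict comparison suffices.

pos = prediction_scores[references == 1]   # scores of the true-1 observations
neg = prediction_scores[references == 0]   # scores of the true-0 observations
(pos[:, None] > neg[None, :]).mean()       # 1.0 -- every positive outranks every negative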

Example 1 – Visualization

logits = np.array(
    [[ 2.7346244, -3.1177292],
     [ 2.7103324, -3.1362345],
     [ 2.7464483, -3.0521457],
     [ 2.7195318, -3.122628 ],
     [ 2.7138977, -3.1041346],
     [ 2.7398622, -3.1098123],
     [ 0.0657177, -0.0930362],
     [-2.7668718,  3.0918367]]
)
references = labels = np.array([0,0,0,0,0,0,1,1])
predictions = logits.argmax(axis=1)  # recompute here; the metrics below depend on it
probabilities = torch.tensor(logits).softmax(dim=1).numpy()
prediction_scores = probabilities[:,1]
plt.plot(prediction_scores,'--o')
plt.plot(labels,'x')
plt.axhline(y=0.5,color='red',linestyle='--')
acc = evaluate.load("accuracy")
rec = evaluate.load("recall")
roc_auc = evaluate.load("roc_auc")
print(acc.compute(predictions=predictions,references=references))
print(rec.compute(predictions=predictions,references=references))
print(roc_auc.compute(prediction_scores=prediction_scores,references=references))
{'accuracy': 0.875}
{'recall': 0.5}
{'roc_auc': 1.0}

Accuracy and recall are imperfect at the 0.5 threshold, yet the AUC is 1: every positive example scores above every negative one, so some threshold (e.g. 0.4) would classify everything correctly.

Example 2 – Visualization

logits = np.array(
    [[ 2.7346244, -3.1177292],
     [ 2.7103324, -3.1362345],
     [ 2.7464483, -3.0521457],
     [ 2.7195318, -3.122628 ],
     [ 2.7138977, -3.1041346],
     [ 0.0657177, -0.0930362],
     [ 2.7398622, -3.1098123],
     [-2.7668718,  3.0918367]]
)
references = labels = np.array([0,0,0,0,0,0,1,1])
predictions = logits.argmax(axis=1)  # recompute here; the metrics below depend on it
probabilities = torch.tensor(logits).softmax(dim=1).numpy()
prediction_scores = probabilities[:,1]
plt.plot(prediction_scores,'--o')
plt.plot(labels,'x')
plt.axhline(y=0.5,color='red',linestyle='--')
acc = evaluate.load("accuracy")
rec = evaluate.load("recall")
roc_auc = evaluate.load("roc_auc")
print(acc.compute(predictions=predictions,references=references))
print(rec.compute(predictions=predictions,references=references))
print(roc_auc.compute(prediction_scores=prediction_scores,references=references))
{'accuracy': 0.875}
{'recall': 0.5}
{'roc_auc': 0.5833333333333333}

Compared with Example 1, the hard predictions (and hence accuracy and recall) are unchanged, but the AUC drops: one positive example now scores below most of the negatives, so no single threshold separates the two classes.

6. Analysis

- Load train.csv with pandas

train_csv = pd.read_csv("./data/train.csv")
train_csv
id has_cactus
0 0004be2cfeaba1c0361d39e2b000257b.jpg 1
1 000c8a36845c0208e833c79c1bffedd1.jpg 1
2 000d1e9a533f62e55c289303b072733d.jpg 1
3 0011485b40695e9138e92d0b3fb55128.jpg 1
4 0014d7a11e90b62848904c1418fc8cf2.jpg 1
... ... ...
17495 ffede47a74e47a5930f81c0b6896479e.jpg 0
17496 ffef6382a50d23251d4bc05519c91037.jpg 1
17497 fff059ecc91b30be5745e8b81111dc7b.jpg 1
17498 fff43acb3b7a23edcc4ae937be2b7522.jpg 0
17499 fffd9e9b990eba07c836745d8aef1a3a.jpg 1

17500 rows × 2 columns

test_csv = pd.read_csv("./data/sample_submission.csv")
test_csv
id has_cactus
0 000940378805c44108d287872b2f04ce.jpg 0.5
1 0017242f54ececa4512b4d7937d1e21e.jpg 0.5
2 001ee6d8564003107853118ab87df407.jpg 0.5
3 002e175c3c1e060769475f52182583d0.jpg 0.5
4 0036e44a7e8f7218e9bc7bf8137e4943.jpg 0.5
... ... ...
3995 ffaafd0c9f2f0e73172848463bc2e523.jpg 0.5
3996 ffae37344310a1549162493237d25d3f.jpg 0.5
3997 ffbd469c56873d064326204aac546e0d.jpg 0.5
3998 ffcb76b7d47f29ece11c751e5f763f52.jpg 0.5
3999 fffed17d1a8e0433a934db518d7f532c.jpg 0.5

4000 rows × 2 columns

A. The tidy(?) textbook code

Step1: Data

ctx_train = datasets.Dataset.from_pandas(train_csv)
ctx_test = datasets.Dataset.from_pandas(test_csv).remove_columns(['has_cactus'])
ctx_train = ctx_train.map(lambda example: {'path': "./data/train/" + example['id']})
ctx_test = ctx_test.map(lambda example: {'path': "./data/test/" + example['id']})
Map: 100%|██████████| 17500/17500 [00:00<00:00, 59071.71 examples/s]
Map: 100%|██████████| 4000/4000 [00:00<00:00, 86647.98 examples/s]
ctx = datasets.DatasetDict({
    'train':ctx_train,
    'test':ctx_test
})
ctx
DatasetDict({
    train: Dataset({
        features: ['id', 'has_cactus', 'path'],
        num_rows: 17500
    })
    test: Dataset({
        features: ['id', 'path'],
        num_rows: 4000
    })
})
compose = torchvision.transforms.Compose([
    lambda path: PIL.Image.open(path),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Resize((224,224))
])
def w_trans(examples):
    # train: examples = {'id':[xx,xxx,....], 'has_cactus':[yy,yyy,...], 'path':[zz,zzz,...]}
    # test:  examples = {'id':[xx,xxx,....], 'path':[zz,zzz,...]}
    dct = dict()
    dct['pixel_values'] = torch.stack(list(map(compose, examples['path'])))
    try: 
        dct['labels'] = torch.tensor(examples['has_cactus'])
    except KeyError:
        pass  # the test split has no 'has_cactus' column, so no labels
    return dct 
ctx = ctx.with_transform(w_trans)
ctx 
DatasetDict({
    train: Dataset({
        features: ['id', 'has_cactus', 'path'],
        num_rows: 17500
    })
    test: Dataset({
        features: ['id', 'path'],
        num_rows: 4000
    })
})
ctx['train'][:2]
#ctx['test'][:2]
{'pixel_values': tensor([[[[0.5333, 0.5333, 0.5333,  ..., 0.6157, 0.6157, 0.6157],
           [0.5333, 0.5333, 0.5333,  ..., 0.6157, 0.6157, 0.6157],
           [0.5333, 0.5333, 0.5333,  ..., 0.6157, 0.6157, 0.6157],
           ...,
           [0.7176, 0.7176, 0.7176,  ..., 0.5451, 0.5451, 0.5451],
           [0.7176, 0.7176, 0.7176,  ..., 0.5451, 0.5451, 0.5451],
           [0.7176, 0.7176, 0.7176,  ..., 0.5451, 0.5451, 0.5451]],
 
          [[0.5412, 0.5412, 0.5412,  ..., 0.5255, 0.5255, 0.5255],
           [0.5412, 0.5412, 0.5412,  ..., 0.5255, 0.5255, 0.5255],
           [0.5412, 0.5412, 0.5412,  ..., 0.5255, 0.5255, 0.5255],
           ...,
           [0.6157, 0.6157, 0.6157,  ..., 0.4314, 0.4314, 0.4314],
           [0.6157, 0.6157, 0.6157,  ..., 0.4314, 0.4314, 0.4314],
           [0.6157, 0.6157, 0.6157,  ..., 0.4314, 0.4314, 0.4314]],
 
          [[0.4902, 0.4902, 0.4902,  ..., 0.5490, 0.5490, 0.5490],
           [0.4902, 0.4902, 0.4902,  ..., 0.5490, 0.5490, 0.5490],
           [0.4902, 0.4902, 0.4902,  ..., 0.5490, 0.5490, 0.5490],
           ...,
           [0.6588, 0.6588, 0.6588,  ..., 0.5098, 0.5098, 0.5098],
           [0.6588, 0.6588, 0.6588,  ..., 0.5098, 0.5098, 0.5098],
           [0.6588, 0.6588, 0.6588,  ..., 0.5098, 0.5098, 0.5098]]],
 
 
         [[[0.4627, 0.4627, 0.4627,  ..., 0.4824, 0.4824, 0.4824],
           [0.4627, 0.4627, 0.4627,  ..., 0.4824, 0.4824, 0.4824],
           [0.4627, 0.4627, 0.4627,  ..., 0.4824, 0.4824, 0.4824],
           ...,
           [0.3647, 0.3647, 0.3647,  ..., 0.4941, 0.4941, 0.4941],
           [0.3647, 0.3647, 0.3647,  ..., 0.4941, 0.4941, 0.4941],
           [0.3647, 0.3647, 0.3647,  ..., 0.4941, 0.4941, 0.4941]],
 
          [[0.4275, 0.4275, 0.4275,  ..., 0.3804, 0.3804, 0.3804],
           [0.4275, 0.4275, 0.4275,  ..., 0.3804, 0.3804, 0.3804],
           [0.4275, 0.4275, 0.4275,  ..., 0.3804, 0.3804, 0.3804],
           ...,
           [0.3059, 0.3059, 0.3059,  ..., 0.4314, 0.4314, 0.4314],
           [0.3059, 0.3059, 0.3059,  ..., 0.4314, 0.4314, 0.4314],
           [0.3059, 0.3059, 0.3059,  ..., 0.4314, 0.4314, 0.4314]],
 
          [[0.4471, 0.4471, 0.4471,  ..., 0.4235, 0.4235, 0.4235],
           [0.4471, 0.4471, 0.4471,  ..., 0.4235, 0.4235, 0.4235],
           [0.4471, 0.4471, 0.4471,  ..., 0.4235, 0.4235, 0.4235],
           ...,
           [0.3255, 0.3255, 0.3255,  ..., 0.4745, 0.4745, 0.4745],
           [0.3255, 0.3255, 0.3255,  ..., 0.4745, 0.4745, 0.4745],
           [0.3255, 0.3255, 0.3255,  ..., 0.4745, 0.4745, 0.4745]]]]),
 'labels': tensor([1, 1])}
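
The transform is applied lazily: building ctx above was instant because images are only opened and resized when rows are accessed. A quick shape check (a sketch, not in the original notes):

batch = ctx['train'][:2]
batch['pixel_values'].shape, batch['labels'].shape
# expected: (torch.Size([2, 3, 224, 224]), torch.Size([2]))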

Step2: Model

model = transformers.AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
Some weights of ResNetForImageClassification were not initialized from the model checkpoint at microsoft/resnet-50 and are newly initialized because the shapes did not match:
- classifier.1.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.1.weight: found shape torch.Size([1000, 2048]) in the checkpoint and torch.Size([2, 2048]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
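
The warning is expected: the 1000-class ImageNet head was discarded. A quick check (a sketch) confirms the newly initialized 2-unit classification head:

model.classifier
# expected: Sequential(Flatten, Linear(in_features=2048, out_features=2, bias=True))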

Step3: Train

# single_batch = [ctx['train'][0],ctx['train'][1]]
# single_batch # [{'pixel_values':xx, 'labels':yy},{'pixel_values':xxx, 'labels':yyy}]
# data_collator = transformers.DefaultDataCollator()
# data_collator(single_batch) # [{'pixel_values':xx, 'labels':yy},{'pixel_values':xxx, 'labels':yyy}] --> {'pixel_values':[xx,xxx], 'labels':[yy,yyy]}
# model.to("cpu")
# model(**data_collator(single_batch))
ImageClassifierOutputWithNoAttention(loss=tensor(0.6874, grad_fn=<NllLossBackward0>), logits=tensor([[0.0648, 0.0802],
        [0.0668, 0.0745]], grad_fn=<AddmmBackward0>), hidden_states=None)
data_collator = transformers.DefaultDataCollator()
data_collator
DefaultDataCollator(return_tensors='pt')
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits,axis=1)
    predictions_scores = torch.tensor(logits).softmax(dim=1).numpy()[:,1]
    acc = evaluate.load("accuracy")
    rec = evaluate.load("recall")
    roc_auc = evaluate.load("roc_auc")
    dct1 = acc.compute(predictions = predictions, references = labels) # {'accuracy':???}
    dct2 = rec.compute(predictions = predictions, references = labels) # {'recall':???}
    dct3 = roc_auc.compute(prediction_scores = predictions_scores, references = labels) # {'roc_auc':???}
    return dct1 | dct2 | dct3  # {'accuracy':???, 'recall':???, 'roc_auc':???}
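compute_metrics can be smoke-tested before handing it to the Trainer, since eval_pred is just a (logits, labels) pair. A minimal sketch with hypothetical toy values:

toy_logits = np.array([[ 2.7, -3.1],
                       [-2.7,  3.1]])
toy_labels = np.array([0, 1])
compute_metrics((toy_logits, toy_labels))
# expected: {'accuracy': 1.0, 'recall': 1.0, 'roc_auc': 1.0}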
training_args = transformers.TrainingArguments(
    output_dir="asdf",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
    report_to="none"
)
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=ctx["train"].select(range(1000)),
    eval_dataset=ctx["train"].select(range(1000,1500)),
    compute_metrics=compute_metrics,
)
trainer.train()
[60/60 00:28, Epoch 3/4]
Epoch Training Loss Validation Loss Accuracy Recall Roc Auc
0 0.671200 0.669483 0.738000 0.989247 0.346690
1 0.611900 0.612094 0.744000 1.000000 0.634871
2 0.604300 0.583503 0.748000 1.000000 0.862273
3 0.577600 0.572888 0.750000 1.000000 0.885774

TrainOutput(global_step=60, training_loss=0.6141141653060913, metrics={'train_runtime': 29.0371, 'train_samples_per_second': 137.755, 'train_steps_per_second': 2.066, 'total_flos': 8.103429948063744e+16, 'train_loss': 0.6141141653060913, 'epoch': 3.8095238095238093})

The run ends at epoch ≈ 3.81 rather than exactly 4; this appears to come from gradient accumulation: the 63 mini-batches per epoch collapse into 15 optimizer steps, so 4 epochs are capped at 60 global steps.

Step4: Prediction

out = trainer.predict(ctx['test'])
out
PredictionOutput(predictions=array([[-0.14711462,  0.28146246],
       [-0.0999002 ,  0.31734663],
       [-0.13772886,  0.21717918],
       ...,
       [-0.10952485,  0.2599906 ],
       [-0.08912332,  0.41566697],
       [-0.13212559,  0.30518785]], dtype=float32), label_ids=None, metrics={'test_runtime': 4.4363, 'test_samples_per_second': 901.655, 'test_steps_per_second': 56.353})

Note that label_ids is None and no accuracy or recall is reported: the test split has no has_cactus column, so w_trans produces no labels.
logits = out.predictions
has_cactus = torch.tensor(logits).softmax(dim=1).numpy()[:,1]
has_cactus
array([0.60553384, 0.6028243 , 0.58780724, ..., 0.59134185, 0.6235844 ,
       0.6076187 ], dtype=float32)
test_csv['has_cactus']= has_cactus
test_csv
id has_cactus
0 000940378805c44108d287872b2f04ce.jpg 0.605534
1 0017242f54ececa4512b4d7937d1e21e.jpg 0.602824
2 001ee6d8564003107853118ab87df407.jpg 0.587807
3 002e175c3c1e060769475f52182583d0.jpg 0.561108
4 0036e44a7e8f7218e9bc7bf8137e4943.jpg 0.578930
... ... ...
3995 ffaafd0c9f2f0e73172848463bc2e523.jpg 0.612775
3996 ffae37344310a1549162493237d25d3f.jpg 0.650841
3997 ffbd469c56873d064326204aac546e0d.jpg 0.591342
3998 ffcb76b7d47f29ece11c751e5f763f52.jpg 0.623584
3999 fffed17d1a8e0433a934db518d7f532c.jpg 0.607619

4000 rows × 2 columns
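
To actually submit to Kaggle, the filled-in frame just needs to be written back out (a sketch; the file name is arbitrary):

test_csv.to_csv("submission.csv", index=False)  # id and has_cactus columns only, no index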

Step1 ~ Step4

train_csv = pd.read_csv("./data/train.csv")
test_csv = pd.read_csv("./data/sample_submission.csv")
#---#
# Step1: Data
ctx_train = datasets.Dataset.from_pandas(train_csv)
ctx_test = datasets.Dataset.from_pandas(test_csv).remove_columns(['has_cactus'])
ctx_train = ctx_train.map(lambda example: {'path': "./data/train/" + example['id']})
ctx_test = ctx_test.map(lambda example: {'path': "./data/test/" + example['id']})
ctx = datasets.DatasetDict({
    'train':ctx_train,
    'test':ctx_test
})
compose = torchvision.transforms.Compose([
    lambda path: PIL.Image.open(path),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Resize((224,224))
])
def w_trans(examples):
    # train: examples = {'id':[xx,xxx,....], 'has_cactus':[yy,yyy,...], 'path':[zz,zzz,...]}
    # test:  examples = {'id':[xx,xxx,....], 'path':[zz,zzz,...]}
    dct = dict()
    dct['pixel_values'] = torch.stack(list(map(compose, examples['path'])))
    try: 
        dct['labels'] = torch.tensor(examples['has_cactus'])
    except KeyError:
        pass  # the test split has no 'has_cactus' column, so no labels
    return dct 
ctx = ctx.with_transform(w_trans)
# Step2: Model
model = transformers.AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
# Step3: Train
data_collator = transformers.DefaultDataCollator()
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits,axis=1)
    predictions_scores = torch.tensor(logits).softmax(dim=1).numpy()[:,1]
    acc = evaluate.load("accuracy")
    rec = evaluate.load("recall")
    roc_auc = evaluate.load("roc_auc")
    dct1 = acc.compute(predictions = predictions, references = labels) # {'accuracy':???}
    dct2 = rec.compute(predictions = predictions, references = labels) # {'recall':???}
    dct3 = roc_auc.compute(prediction_scores = predictions_scores, references = labels) # {'roc_auc':???}
    return dct1 | dct2 | dct3  # {'accuracy':???, 'recall':???, 'roc_auc':???}
training_args = transformers.TrainingArguments(
    output_dir="asdf",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",
    push_to_hub=False,
    report_to="none"
)
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=ctx["train"].select(range(1000)),
    eval_dataset=ctx["train"].select(range(1000,1500)),
    compute_metrics=compute_metrics,
)
trainer.train()
# Step4: Prediction
out = trainer.predict(ctx['test'])
logits = out.predictions
has_cactus = torch.tensor(logits).softmax(dim=1).numpy()[:,1]
test_csv['has_cactus']= has_cactus
Map: 100%|██████████| 17500/17500 [00:00<00:00, 63327.18 examples/s]
Map: 100%|██████████| 4000/4000 [00:00<00:00, 87028.23 examples/s]
Some weights of ResNetForImageClassification were not initialized from the model checkpoint at microsoft/resnet-50 and are newly initialized because the shapes did not match:
- classifier.1.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.1.weight: found shape torch.Size([1000, 2048]) in the checkpoint and torch.Size([2, 2048]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[60/60 00:27, Epoch 3/4]
Epoch Training Loss Validation Loss Accuracy Recall Roc Auc
0 0.688900 0.662976 0.762000 0.916667 0.753990
1 0.627400 0.621370 0.752000 0.997312 0.866389
2 0.615200 0.598107 0.758000 1.000000 0.913243
3 0.592600 0.589067 0.760000 1.000000 0.929183

B. Free-style code

Step1: Datasets

train_csv = pd.read_csv("./data/train.csv")
test_csv = pd.read_csv("./data/sample_submission.csv")
train_csv2 = pd.read_csv("./data/train.csv")
test_csv2 = pd.read_csv("./data/sample_submission.csv")
train_csv2['path'] = ['./data/train/'+l for l in train_csv.id]
test_csv2['path'] = ['./data/test/'+l for l in test_csv.id]
train_csv2 = train_csv2.loc[:,['has_cactus','path']]
test_csv2 = test_csv2.loc[:,['path']]
ctx = datasets.DatasetDict(
    {
        'train': datasets.Dataset.from_pandas(train_csv2),
        'test':datasets.Dataset.from_pandas(test_csv2)
    }
)
ctx
DatasetDict({
    train: Dataset({
        features: ['has_cactus', 'path'],
        num_rows: 17500
    })
    test: Dataset({
        features: ['path'],
        num_rows: 4000
    })
})

A quick sanity check of Steps 2~3

model = transformers.AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
Some weights of ResNetForImageClassification were not initialized from the model checkpoint at microsoft/resnet-50 and are newly initialized because the shapes did not match:
- classifier.1.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.1.weight: found shape torch.Size([1000, 2048]) in the checkpoint and torch.Size([2, 2048]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
single_batch =  [ctx['train'][0], ctx['train'][1]]
single_batch
[{'has_cactus': 1,
  'path': './data/train/0004be2cfeaba1c0361d39e2b000257b.jpg'},
 {'has_cactus': 1,
  'path': './data/train/000c8a36845c0208e833c79c1bffedd1.jpg'}]
compose = torchvision.transforms.Compose([
    lambda path: PIL.Image.open(path),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Resize((224,224))
])
def collate_fn(single_batch):
    # single_batch: a list of example dicts -> one dict of batched tensors
    dct = dict()
    dct['pixel_values'] = torch.stack([compose(o['path']) for o in single_batch])
    try: 
        dct['labels'] = torch.tensor([o['has_cactus'] for o in single_batch])
    except KeyError:
        pass  # test examples have no 'has_cactus' key, so no labels
    return dct 
model(**collate_fn(single_batch))
ImageClassifierOutputWithNoAttention(loss=tensor(0.6771, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0403,  0.0007],
        [-0.0529, -0.0292]], grad_fn=<AddmmBackward0>), hidden_states=None)

Since the model consumes the collated batch directly, this collate_fn can be passed to the Trainer as data_collator, replacing both with_transform and the DefaultDataCollator.

Step1~4

train_csv = pd.read_csv("./data/train.csv")
test_csv = pd.read_csv("./data/sample_submission.csv")
train_csv2 = pd.read_csv("./data/train.csv")
test_csv2 = pd.read_csv("./data/sample_submission.csv")
#---#
# Step1: Data
train_csv2['path'] = ['./data/train/'+l for l in train_csv.id]
test_csv2['path'] = ['./data/test/'+l for l in test_csv.id]
train_csv2 = train_csv2.loc[:,['has_cactus','path']]
test_csv2 = test_csv2.loc[:,['path']]
ctx = datasets.DatasetDict(
    {
        'train': datasets.Dataset.from_pandas(train_csv2),
        'test':datasets.Dataset.from_pandas(test_csv2)
    }
)
# Step2: Model
model = transformers.AutoModelForImageClassification.from_pretrained(
    "microsoft/resnet-50",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
# Step3: Train
compose = torchvision.transforms.Compose([  # repeated here so the script is self-contained
    lambda path: PIL.Image.open(path),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Resize((224,224))
])
def collate_fn(single_batch):
    # single_batch: a list of example dicts -> one dict of batched tensors
    dct = dict()
    dct['pixel_values'] = torch.stack([compose(o['path']) for o in single_batch])
    try: 
        dct['labels'] = torch.tensor([o['has_cactus'] for o in single_batch])
    except KeyError:
        pass  # test examples have no 'has_cactus' key, so no labels
    return dct 
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits,axis=1)
    predictions_scores = torch.tensor(logits).softmax(dim=1).numpy()[:,1]
    acc = evaluate.load("accuracy")
    rec = evaluate.load("recall")
    roc_auc = evaluate.load("roc_auc")
    dct1 = acc.compute(predictions = predictions, references = labels) # {'accuracy':???}
    dct2 = rec.compute(predictions = predictions, references = labels) # {'recall':???}
    dct3 = roc_auc.compute(prediction_scores = predictions_scores, references = labels) # {'roc_auc':???}
    return dct1 | dct2 | dct3  # {'accuracy':???, 'recall':???, 'roc_auc':???}
training_args = transformers.TrainingArguments(
    output_dir="asdf",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",
    push_to_hub=False,
    report_to="none"
)
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=ctx["train"].select(range(1000)),
    eval_dataset=ctx["train"].select(range(1000,1500)),
    compute_metrics=compute_metrics,
)
trainer.train()
# Step4: Prediction
out = trainer.predict(ctx['test'])
logits = out.predictions
has_cactus = torch.tensor(logits).softmax(dim=1).numpy()[:,1]
test_csv['has_cactus']= has_cactus
Some weights of ResNetForImageClassification were not initialized from the model checkpoint at microsoft/resnet-50 and are newly initialized because the shapes did not match:
- classifier.1.bias: found shape torch.Size([1000]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.1.weight: found shape torch.Size([1000, 2048]) in the checkpoint and torch.Size([2, 2048]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[60/60 00:27, Epoch 3/4]
Epoch Training Loss Validation Loss Accuracy Recall Roc Auc
0 0.698900 0.686095 0.656000 0.873656 0.345493
1 0.639900 0.631235 0.746000 1.000000 0.702390
2 0.627000 0.609499 0.744000 1.000000 0.841104
3 0.607400 0.598909 0.744000 1.000000 0.885753

Up to random initialization the two approaches are equivalent; only the data plumbing differs (with_transform plus DefaultDataCollator versus a single custom collate_fn), which is why the metric trajectories look similar.

A1. Last year's lecture notes

https://guebin.github.io/MP2023/ – a course focused on analyzing tabular data