04wk-1: Digging into Sentiment Analysis (2)

Author

최규빈

Published

September 27, 2024

1. Lecture Video

2. Imports

import datasets
import transformers
import evaluate
import numpy as np
import torch  # PyTorch


3. Recap of the Previous Session

- A summary of what to remember from the previous session for today's material.

- DatasetDict: building a DatasetDict object from arbitrary data.

train_dict = {
    'text': [
        "I prefer making decisions based on logic and objective facts.",
        "I always consider how others might feel when making a decision.",
        "Data and analysis drive most of my decisions.",
        "I rely on my empathy and personal values to guide my choices."
    ],
    'label': [0, 1, 0, 1]  # 0 = T (Thinking), 1 = F (Feeling)
}

test_dict = {
    'text': [
        "I find it important to weigh all the pros and cons logically.",
        "When making decisions, I prioritize harmony and people's emotions."
    ],
    'label': [0, 1]  # 0 = T (Thinking), 1 = F (Feeling)
}

train_data = datasets.Dataset.from_dict(train_dict)
test_data = datasets.Dataset.from_dict(test_dict)
나의데이터 = datasets.dataset_dict.DatasetDict({'train':train_data, 'test':test_data})

- Tokenizer: the tokenizer is a function of type “Str \(\to\) Dict”.

데이터전처리하기1 = 토크나이저 = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(

토크나이저(나의데이터['train']['text'][0])

{'input_ids': [101, 1045, 9544, 2437, 6567, 2241, 2006, 7961, 1998, 7863, 8866, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

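- Tokenizer: the tokenizer's “Str \(\to\) Dict” function also supports batch processing: given a list of strings, it returns a Dict whose values are lists of lists.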

토크나이저(나의데이터['train']['text'])

{'input_ids': [[101, 1045, 9544, 2437, 6567, 2241, 2006, 7961, 1998, 7863, 8866, 1012, 102], [101, 1045, 2467, 5136, 2129, 2500, 2453, 2514, 2043, 2437, 1037, 3247, 1012, 102], [101, 2951, 1998, 4106, 3298, 2087, 1997, 2026, 6567, 1012, 102], [101, 1045, 11160, 2006, 2026, 26452, 1998, 3167, 5300, 2000, 5009, 2026, 9804, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

- Model (인공지능): the model is an object that contains a large number of parameters.

인공지능 = model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=2
)
인공지능.classifier.weight

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Parameter containing:
tensor([[-0.0124, -0.0124, -0.0205,  ...,  0.0244, -0.0019,  0.0045],
        [ 0.0073, -0.0218, -0.0173,  ...,  0.0125,  0.0105,  0.0278]],
       requires_grad=True)

- Model: the model's parameters can change, and the process of moving them in the direction that lowers the loss is called “training”.

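- Model: the model is a function of type “**Dict \(\to\) transformers.modeling_outputs.SequenceClassifierOutput”, but it is picky about its input.

1. The Dict must contain specific keys (input_ids, attention_mask).
2. The values for those keys must be PyTorch tensors.
3. Consequently, input_ids and attention_mask must have a two-dimensional (m,n) shape.
4. If the Dict also contains labels (as a tensor), the loss is computed. (The loss has to be computed for training to be possible.)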

토크나이저(나의데이터['train']['text'][0])

{'input_ids': [101, 1045, 9544, 2437, 6567, 2241, 2006, 7961, 1998, 7863, 8866, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

# Model input example 1 – only the tokenized text, arranged as tensors

인공지능입력1 = {
        'input_ids': torch.tensor([[ 101, 1045, 102],[101, 9544, 102]]), 
        'attention_mask': torch.tensor([[1, 1, 1],[1, 1, 1]]) # optional
}

인공지능(**인공지능입력1)

SequenceClassifierOutput(loss=None, logits=tensor([[0.0177, 0.0286],
        [0.0490, 0.0451]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

#

# Model input example 2 – tokenized text as tensors, together with labels

인공지능입력2 = {
        'input_ids': torch.tensor([[ 101, 1045, 102],[101, 9544, 102]]), 
        'attention_mask': torch.tensor([[1, 1, 1],[1, 1, 1]]), # optional
        'labels': torch.tensor([1, 0]) # optional
}

인공지능(**인공지능입력2)

SequenceClassifierOutput(loss=tensor(0.6895, grad_fn=<NllLossBackward0>), logits=tensor([[0.0177, 0.0286],
        [0.0490, 0.0451]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
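The loss value in the output above can be checked by hand. The following is a minimal plain-Python sketch, assuming the loss is the mean cross-entropy over the batch (consistent with the `NllLossBackward0` tag in the output). The helper name `cross_entropy` is hypothetical, and the logits are the rounded values printed above, so only about four digits can be expected to agree.

```python
import math

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch, computed from raw logits (hypothetical helper)."""
    losses = []
    for row, label in zip(logits, labels):
        # log-sum-exp of the row, shifted by the max for numerical stability
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        # negative log-probability assigned to the correct class
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

# Logits and labels copied from the example above (logits rounded to 4 decimals).
logits = [[0.0177, 0.0286], [0.0490, 0.0451]]
labels = [1, 0]
loss = cross_entropy(logits, labels)
print(loss)
```

The result is close to the 0.6895 reported by the model, which is what makes `labels` useful: once they are supplied, there is a single number that training can push down.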

#

- Model: how to extract probabilities and predictions from the model output¹

¹ It has type transformers.modeling_outputs.SequenceClassifierOutput.

model output \(\to\) logits \(\to\) probabilities \(\to\) prediction
model output \(\to\) logits \(\to\) prediction

Example 1: compute the model's prediction from its output. – logits \(\to\) prediction

인공지능출력 = 인공지능(**인공지능입력2)
로짓 = 인공지능출력.logits
로짓

tensor([[0.0177, 0.0286],
        [0.0490, 0.0451]], grad_fn=<AddmmBackward0>)

로짓.argmax(axis=1)

tensor([1, 0])

Example 2: compute the model's predicted probabilities from its output. – logits \(\to\) probabilities

인공지능출력 = 인공지능(**인공지능입력2)
로짓 = 인공지능출력.logits
로짓

tensor([[0.0177, 0.0286],
        [0.0490, 0.0451]], grad_fn=<AddmmBackward0>)

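torch.exp(로짓) / torch.exp(로짓).sum(axis=1, keepdim=True)  # keepdim=True divides each row by its own sum

tensor([[0.4973, 0.5027],
        [0.5010, 0.4990]], grad_fn=<DivBackward0>)

Simplify the steps above by defining a function called 확률계산하기.

def 확률계산하기(인공지능출력):
    로짓 = 인공지능출력.logits
    # keepdim=True keeps the row sums as a column, so every row of the result sums to 1
    return torch.exp(로짓) / torch.exp(로짓).sum(axis=1, keepdim=True)

확률계산하기(인공지능(**인공지능입력2))

tensor([[0.4973, 0.5027],
        [0.5010, 0.4990]], grad_fn=<DivBackward0>)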
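The two routes (logits \(\to\) probabilities \(\to\) prediction, and logits \(\to\) prediction) can be compared in a small plain-Python sketch. It is only an illustration of row-wise softmax and argmax, not what torch does internally; the helper names `softmax_rows` and `argmax_rows` are hypothetical, and the logits are the rounded values from the examples above.

```python
import math

def softmax_rows(logits):
    """Row-wise softmax: every row of the result sums to 1."""
    probs = []
    for row in logits:
        m = max(row)                           # shift by the max for numerical stability
        exps = [math.exp(z - m) for z in row]
        total = sum(exps)                      # one normalizer per row
        probs.append([e / total for e in exps])
    return probs

def argmax_rows(rows):
    """Index of the largest entry in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

logits = [[0.0177, 0.0286], [0.0490, 0.0451]]
probs = softmax_rows(logits)
print(probs)
print(argmax_rows(logits), argmax_rows(probs))
```

Because softmax preserves the ordering within each row, taking the argmax of the logits gives the same prediction as taking the argmax of the probabilities, which is why the shorter route is enough when only the predicted class is needed.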

#

4. Data Preprocessing 2 (데이터전처리하기2)

- Let's dig into the code below.

def 데이터전처리하기2(examples):
    return 데이터전처리하기1(examples["text"], truncation=True)
전처리된나의데이터 = 나의데이터.map(데이터전처리하기2, batched=True)

- Two kinds of input the model accepts: (1) the tokenizer output, (2) the tokenizer output + label.

To prepare input of form (2), data organized as {'text':[...], 'label':[...]} has to be turned into {'text':[...], 'label':[...], 'input_ids':[...], 'attention_mask':[...]}, and 나의데이터.map() is the method that makes this easy.
According to the help for 나의데이터.map(), map (1) takes a function itself as its argument, and (2) that function must take a Dict as input and return a Dict.

- Concept 1: datasets.dataset_dict.DatasetDict, provided by Hugging Face's datasets library, has a built-in method (= function) called map that is specialized for transforming its elements.
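As a rough mental model of the non-batched case, map calls the given function on each example (a Dict) and merges the returned Dict into that example. The plain-Python sketch below imitates only this behavior on a list of row dicts; `map_rows` is a hypothetical helper, not part of the datasets library, and the real method also handles batching, caching, and columnar storage.

```python
def map_rows(rows, function):
    """Apply `function` to every row and merge the returned keys into a copy of the row."""
    return [{**row, **function(row)} for row in rows]

rows = [
    {"text": "I prefer making decisions based on logic and objective facts.", "label": 0},
    {"text": "Data and analysis drive most of my decisions.", "label": 0},
]

# The function receives one Dict and returns one Dict, as the help for map requires.
mapped = map_rows(rows, lambda dct: {"length": len(dct["text"])})
print(mapped[0])
```

The original keys (`text`, `label`) survive and the new key is added next to them, which is the pattern the examples that follow rely on.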

# Example 1 – adding a dummy column ('메롱')

전처리된나의데이터 = 나의데이터.map(lambda dct: {'dummy':'메롱'})

Map: 100%|██████████| 4/4 [00:00<00:00, 1570.02 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 1216.45 examples/s]

전처리된나의데이터['train'][0]

{'text': 'I prefer making decisions based on logic and objective facts.',
 'label': 0,
 'dummy': '메롱'}

#

# Example 2 – computing the length of the text

전처리된나의데이터 = 나의데이터.map(lambda dct: {'dummy':'메롱', 'length':len(dct['text'])})

Map: 100%|██████████| 4/4 [00:00<00:00, 2154.79 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 1365.33 examples/s]

#

# Example 3 – the tokenizer output

전처리된나의데이터 = 나의데이터.map(lambda dct: 토크나이저(dct["text"], truncation=True))

Map: 100%|██████████| 4/4 [00:00<00:00, 1274.67 examples/s]
Map: 100%|██████████| 2/2 [00:00<00:00, 966.10 examples/s]

전처리된나의데이터['train'][0]

{'text': 'I prefer making decisions based on logic and objective facts.',
 'label': 0,
 'input_ids': [101,
  1045,
  9544,
  2437,
  6567,
  2241,
  2006,
  7961,
  1998,
  7863,
  8866,
  1012,
  102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

#

5. Data Collator (데이터콜렉터)

- The model is still not happy with the preprocessed data.

전처리된나의데이터['train'][2]

{'text': 'Data and analysis drive most of my decisions.',
 'label': 0,
 'input_ids': [101, 2951, 1998, 4106, 3298, 2087, 1997, 2026, 6567, 1012, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Reason: the model expects data of type torch.tensor, arranged as a “batch” in an (n,m) matrix.
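The problem is easy to see by looking at the lengths. The sketch below uses the input_ids of the four training sentences, copied from the tokenizer output in the recap section, and checks whether they could be stacked into one (n,m) matrix as they are; the variable names are only illustrative.

```python
# input_ids of the four training sentences, copied from the tokenizer output shown earlier
input_ids = [
    [101, 1045, 9544, 2437, 6567, 2241, 2006, 7961, 1998, 7863, 8866, 1012, 102],
    [101, 1045, 2467, 5136, 2129, 2500, 2453, 2514, 2043, 2437, 1037, 3247, 1012, 102],
    [101, 2951, 1998, 4106, 3298, 2087, 1997, 2026, 6567, 1012, 102],
    [101, 1045, 11160, 2006, 2026, 26452, 1998, 3167, 5300, 2000, 5009, 2026, 9804, 1012, 102],
]

lengths = [len(ids) for ids in input_ids]
# A matrix needs every row to have the same number of columns.
is_rectangular = len(set(lengths)) == 1
print(lengths, is_rectangular)
```

The rows have different lengths, so they cannot form a single matrix until the shorter ones are padded.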

- Summary of the data processing steps

	Raw data	\(\overset{tokenizer,map}{\Longrightarrow}\) Preprocessed data	\(\overset{data\ collator}{\Longrightarrow}\) Further preprocessed data
Dict keys	`text`, `label`	`input_ids`, `attention_mask`, `label`	`input_ids`, `attention_mask`, `labels`
Form of the data	text, label	numeric: O, matrix: X	numeric: O, matrix: O
`torch.tensor`	-	X	O
Mini-batch	-	X	O
Padding / dynamic padding	-	X	O
Used at prediction time as	input to the strong AI (pipeline)	input to the trainer	input to the model

- What we expect from the data collator:

When the data has the form [Dict,Dict,Dict,Dict]², it is converted into an input_ids tensor of shape (4,??), an attention_mask tensor of shape (4,??), and a labels tensor of shape (4,).

² 전처리된나의데이터['train'] has this form.

- How to use the data collator

데이터콜렉터 = transformers.DataCollatorWithPadding(tokenizer=토크나이저, return_tensors='pt')
# 데이터콜렉터?
# What we need from the help right now: the input has the form [Dict, Dict, Dict, ... ]

데이터콜렉터(
    [
        dict(label=0, input_ids=[101,1045,102],attention_mask=[1,1,1]),
        dict(label=1, input_ids=[101,1045,9544,102],attention_mask=[1,1,1,1]),
        dict(label=0, input_ids=[101,1045,9544,9544,102],attention_mask=[1,1,1,1,1])
        
    ]
)

{'input_ids': tensor([[ 101, 1045,  102,    0,    0],
        [ 101, 1045, 9544,  102,    0],
        [ 101, 1045, 9544, 9544,  102]]), 'attention_mask': tensor([[1, 1, 1, 0, 0],
        [1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1]]), 'labels': tensor([0, 1, 0])}
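What the collator did in the output above can be imitated in plain Python. This is only a sketch of dynamic padding under stated assumptions: padding is added on the right, the padding id is 0 (which matches the zeros in the output above), and nested lists stand in for tensors. `collate_with_padding` is a hypothetical name, not the library's implementation.

```python
def collate_with_padding(features, pad_token_id=0):
    """Pad every example to the longest one in this batch and rename label -> labels."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # padded positions get attention 0 so the model ignores them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["label"])
    return batch

features = [
    dict(label=0, input_ids=[101, 1045, 102], attention_mask=[1, 1, 1]),
    dict(label=1, input_ids=[101, 1045, 9544, 102], attention_mask=[1, 1, 1, 1]),
    dict(label=0, input_ids=[101, 1045, 9544, 9544, 102], attention_mask=[1, 1, 1, 1, 1]),
]
batch = collate_with_padding(features)
print(batch)
```

The padding length is decided per batch rather than once for the whole dataset, which is why it is called dynamic padding: a batch of short sentences stays short.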

인공지능(**데이터콜렉터(
    [
        dict(label=0, input_ids=[101,1045,102],attention_mask=[1,1,1]),
        dict(label=1, input_ids=[101,1045,9544,102],attention_mask=[1,1,1,1]),
        dict(label=0, input_ids=[101,1045,9544,9544,102],attention_mask=[1,1,1,1,1])
        
    ]
))

SequenceClassifierOutput(loss=tensor(0.6952, grad_fn=<NllLossBackward0>), logits=tensor([[0.0177, 0.0286],
        [0.0384, 0.0379],
        [0.0415, 0.0425]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

6. Evaluation (평가하기)

- What accuracy.compute does

accuracy = evaluate.load("accuracy")

accuracy.compute(predictions=[0,0,0], references=[0,0,0])

{'accuracy': 1.0}

accuracy.compute(predictions=[1,1,1], references=[0,0,0])

{'accuracy': 0.0}

accuracy.compute(predictions=[0,1,0], references=[0,0,0])

{'accuracy': 0.6666666666666666}

- What the 평가하기 function does

# def 평가하기(eval_pred):
#     predictions, labels = eval_pred
#     predictions = np.argmax(predictions, axis=1)
#     accuracy = evaluate.load("accuracy")
#     return accuracy.compute(predictions=predictions, references=labels)

def 평가하기(eval_pred):
    로짓, 실제정답 = eval_pred
    인공지능의예측 = np.argmax(로짓, axis=1)
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=인공지능의예측, references=실제정답)
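To see what the function computes without loading any library, here is a plain-Python sketch of the same two steps (logits \(\to\) prediction by row-wise argmax, then the fraction of correct predictions). `evaluate_logits` is a hypothetical stand-in for 평가하기, and the logits and labels are toy values chosen for illustration.

```python
def evaluate_logits(eval_pred):
    """Argmax each row of logits, then report the share of predictions that match the labels."""
    logits, labels = eval_pred
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy values: the third example is misclassified (predicted 0, actual 1).
toy_logits = [[1.6, -1.4], [-2.7, 2.3], [0.3, -0.1]]
toy_labels = [0, 1, 1]
result = evaluate_logits((toy_logits, toy_labels))
print(result)
```

The input is a pair (logits, labels), which is the shape of argument that the `logits, labels = eval_pred` unpacking in the function above assumes.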

7. Trainer

- Code from the previous session

## Step1 
데이터불러오기 = datasets.load_dataset
토크나이저 = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") 
## Step2 
인공지능생성하기 = transformers.AutoModelForSequenceClassification.from_pretrained
## Step3 
데이터콜렉터 = transformers.DataCollatorWithPadding(tokenizer=토크나이저)
def 평가하기(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=predictions, references=labels)
트레이너세부지침생성기 = transformers.TrainingArguments
트레이너생성기 = transformers.Trainer
## Step4 
강인공지능생성하기 = transformers.pipeline
#---#
## Step1 
데이터 = 데이터불러오기('imdb')
전처리된데이터 = 데이터.map(lambda dct: 토크나이저(dct["text"], truncation=True),batched=True)
전처리된훈련자료, 전처리된검증자료 = 전처리된데이터['train'], 전처리된데이터['test']
## Step2 
torch.manual_seed(43052)
인공지능 = 인공지능생성하기("distilbert/distilbert-base-uncased", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

A. The trainer's first role – from CPU to GPU

`#` Before creating the trainer

- Checking the model's parameters, 1

인공지능.classifier.weight

Parameter containing:
tensor([[-0.0234,  0.0279,  0.0242,  ...,  0.0091, -0.0063, -0.0133],
        [ 0.0087,  0.0007, -0.0099,  ...,  0.0183, -0.0007,  0.0295]],
       requires_grad=True)

Key point 1: the numbers = the initial values
Key point 2: no device is shown, which means the numbers live on the CPU

- Prediction with the model, 1

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This movie was a huge disappointment.")])))

tensor([[0.4642, 0.5358]], grad_fn=<DivBackward0>)

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This was a masterpiece.")])))

tensor([[0.4664, 0.5336]], grad_fn=<DivBackward0>)

`#` After creating the trainer

- Creating the trainer

## Step3 
트레이너세부지침 = 트레이너세부지침생성기(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2, # go through the whole problem set twice
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)
트레이너 = 트레이너생성기(
    model=인공지능,
    args=트레이너세부지침,
    train_dataset=전처리된훈련자료,
    eval_dataset=전처리된검증자료,
    tokenizer=토크나이저,
    data_collator=데이터콜렉터,
    compute_metrics=평가하기,
)

- Checking the model's parameters, 2

인공지능.classifier.weight

Parameter containing:
tensor([[-0.0234,  0.0279,  0.0242,  ...,  0.0091, -0.0063, -0.0133],
        [ 0.0087,  0.0007, -0.0099,  ...,  0.0183, -0.0007,  0.0295]],
       device='cuda:0', requires_grad=True)

Key point 1: the numbers = the initial values
Key point 2: device=“cuda:0” // the numbers are now on the GPU

- Prediction with the model, 2

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This movie was a huge disappointment.")]).to("cuda:0")))

tensor([[0.4642, 0.5358]], device='cuda:0', grad_fn=<DivBackward0>)

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This was a masterpiece.")]).to("cuda:0")))

tensor([[0.4664, 0.5336]], device='cuda:0', grad_fn=<DivBackward0>)

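The trainer's first role: the trainer does one job as soon as it is created, namely moving the model's parameters to the GPU. This is why the inputs in the predictions above also have to be moved with .to("cuda:0").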

B. The trainer's second role – prediction

The trainer's second role: 트레이너.predict() becomes available. Its input is a Dataset that contains input_ids, attention_mask, and label.

# Example 1 – prediction with the trainer

sample_dict = {
    'text': ["This movie was a huge disappointment."],
    'label': [0],
    'input_ids': [[101, 2023, 3185, 2001, 1037, 4121, 10520, 1012, 102]],
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]
}
sample_dataset = datasets.Dataset.from_dict(sample_dict)
sample_dataset

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1
})

트레이너.predict(sample_dataset)

PredictionOutput(predictions=array([[-0.11731032,  0.02610314]], dtype=float32), label_ids=array([0]), metrics={'test_loss': 0.7674226760864258, 'test_model_preparation_time': 0.0009, 'test_accuracy': 0.0, 'test_runtime': 1.1868, 'test_samples_per_second': 0.843, 'test_steps_per_second': 0.843})

logits = np.array([[-0.11731032,  0.02610314]])
np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

array([[0.46420796, 0.53579204]])

#

# Example 2 – use the trainer to compute predictions for train_data and test_data.

트레이너.predict(전처리된데이터['train'])

PredictionOutput(predictions=array([[-0.08470809,  0.0023939 ],
       [-0.08299972,  0.02850237],
       [-0.06004279,  0.01801764],
       ...,
       [-0.02317078, -0.01451463],
       [-0.00802051, -0.02698467],
       [-0.03900156, -0.02573229]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 1]), metrics={'test_loss': 0.6949957609176636, 'test_model_preparation_time': 0.0009, 'test_accuracy': 0.4916, 'test_runtime': 89.9353, 'test_samples_per_second': 277.978, 'test_steps_per_second': 17.379})

트레이너.predict(전처리된데이터['test'])

PredictionOutput(predictions=array([[-0.03815563,  0.00212397],
       [-0.08166712, -0.00432102],
       [-0.10371644,  0.03148611],
       ...,
       [-0.08171435, -0.00681646],
       [-0.09139054,  0.01050513],
       [-0.06704493,  0.0221498 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 1]), metrics={'test_loss': 0.6949934363365173, 'test_model_preparation_time': 0.0009, 'test_accuracy': 0.4912, 'test_runtime': 89.8578, 'test_samples_per_second': 278.217, 'test_steps_per_second': 17.394})
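The numbers above (accuracy near 0.49, loss near 0.695) are what an untrained two-class model is expected to give. As a small sanity check, the sketch below computes the cross-entropy of a classifier that assigns probability 0.5 to the correct class on every example; the variable name is only illustrative.

```python
import math

# Cross-entropy when the correct class always receives probability 0.5
chance_loss = -math.log(0.5)
print(round(chance_loss, 4))  # 0.6931
```

A test_loss around 0.695 with accuracy around 0.49 therefore means the model is essentially guessing before training, which is the baseline to keep in mind for the results after training.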

#

C. The trainer's third role – training and saving the results

`#` Training

트레이너.train()

[3126/3126 11:57, Epoch 2/2]

Epoch	Training Loss	Validation Loss	Model Preparation Time	Accuracy
1	0.221900	0.199376	0.001900	0.923000
2	0.145700	0.233206	0.001900	0.931000

TrainOutput(global_step=3126, training_loss=0.20330691871472223, metrics={'train_runtime': 718.1146, 'train_samples_per_second': 69.627, 'train_steps_per_second': 4.353, 'total_flos': 6556904415524352.0, 'train_loss': 0.20330691871472223, 'epoch': 2.0})

25000 / 16

1562.5

1563 * 2

3126
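The step count in the training log follows from the dataset size and the batch size. The sketch below spells out the arithmetic, assuming a single device and no gradient accumulation, with the last, smaller batch of each epoch still counting as one step; the variable names are only illustrative.

```python
import math

n_train = 25000   # number of training examples (the 25000 in the division above)
batch_size = 16   # per_device_train_batch_size
epochs = 2        # num_train_epochs

# 25000 / 16 = 1562.5, so the last partial batch rounds the count up to 1563
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This matches global_step=3126 in the TrainOutput, and with save_strategy="epoch" a checkpoint is written at the end of each epoch, that is, at steps 1563 and 3126.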

`#` After training

- Did the model get smarter?

- Checking the model's parameters, 3

인공지능.classifier.weight

Parameter containing:
tensor([[-0.0230,  0.0279,  0.0239,  ...,  0.0085, -0.0062, -0.0143],
        [ 0.0084,  0.0007, -0.0097,  ...,  0.0189, -0.0008,  0.0304]],
       device='cuda:0', requires_grad=True)

Compare with “Checking the model's parameters, 2”:

Parameter containing:
tensor([[-0.0234,  0.0279,  0.0242,  ...,  0.0091, -0.0063, -0.0133],
        [ 0.0087,  0.0007, -0.0099,  ...,  0.0183, -0.0007,  0.0295]],
       device='cuda:0', requires_grad=True)

The numbers have changed \(\to\) so the model should now produce different results, right? \(\to\) Let's check whether that is really the case.

- Prediction with the model, 3

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This movie was a huge disappointment.")]).to("cuda:0")))

tensor([[0.9885, 0.0115]], device='cuda:0', grad_fn=<DivBackward0>)

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This was a masterpiece.")]).to("cuda:0")))

tensor([[0.0219, 0.9781]], device='cuda:0', grad_fn=<DivBackward0>)

- Let's use the trainer again to compute predictions for train_data and test_data.

트레이너.predict(전처리된데이터['train'])

PredictionOutput(predictions=array([[ 1.5927906 , -1.3617778 ],
       [ 2.36449   , -2.1244614 ],
       [ 2.1150742 , -1.9663404 ],
       ...,
       [-2.690494  ,  2.2624812 ],
       [ 0.32332823, -0.05149931],
       [-1.96404   ,  1.7741865 ]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 1]), metrics={'test_loss': 0.1358911097049713, 'test_model_preparation_time': 0.0008, 'test_accuracy': 0.95244, 'test_runtime': 89.3077, 'test_samples_per_second': 279.931, 'test_steps_per_second': 17.501})

트레이너.predict(전처리된데이터['test'])

PredictionOutput(predictions=array([[ 2.5778208 , -2.250055  ],
       [ 1.396875  , -1.2021751 ],
       [ 2.1249173 , -1.9882215 ],
       ...,
       [-0.3729212 ,  0.34651938],
       [ 0.19383232,  0.01419903],
       [-1.2160491 ,  1.007748  ]], dtype=float32), label_ids=array([0, 0, 0, ..., 1, 1, 1]), metrics={'test_loss': 0.19937647879123688, 'test_model_preparation_time': 0.0008, 'test_accuracy': 0.923, 'test_runtime': 90.2744, 'test_samples_per_second': 276.933, 'test_steps_per_second': 17.314})

- The takeaway: not “how magical” (X), but “that was a lot of grunt work” (O).

8. Pipeline

- Strong AI (강인공지능)?

ref: https://zdnet.co.kr/view/?no=20160622145838

강인공지능 = transformers.pipeline("sentiment-analysis", model="my_awesome_model/checkpoint-1563")
print(강인공지능("This movie was a huge disappointment."))
print(강인공지능("This was a masterpiece."))

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

[{'label': 'LABEL_0', 'score': 0.9885253310203552}]
[{'label': 'LABEL_1', 'score': 0.978060781955719}]

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This movie was a huge disappointment.")]).to("cuda:0")))

tensor([[0.9885, 0.0115]], device='cuda:0', grad_fn=<DivBackward0>)

확률계산하기(인공지능(**데이터콜렉터([토크나이저("This was a masterpiece.")]).to("cuda:0")))

tensor([[0.0219, 0.9781]], device='cuda:0', grad_fn=<DivBackward0>)
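Comparing the two sets of outputs above suggests that the pipeline reports the most probable class and its probability. The sketch below imitates that final step in plain Python; `to_pipeline_style` is a hypothetical helper written for illustration, not the pipeline's actual code, and the LABEL_0/LABEL_1 naming is taken from the output above.

```python
def to_pipeline_style(probabilities):
    """Pick the most probable class and report it as a label/score pair."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {"label": f"LABEL_{best}", "score": probabilities[best]}

# Probabilities copied from the 확률계산하기 outputs above (rounded to 4 decimals)
print(to_pipeline_style([0.9885, 0.0115]))
print(to_pipeline_style([0.0219, 0.9781]))
```

In this sense the pipeline bundles everything from the earlier sections (tokenizing, batching, running the model, converting logits to probabilities) behind a single call that takes raw text.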