02wk-1: IMDB Movie Review Sentiment Analysis

Author

최규빈

Published

September 12, 2024

1. Lecture Video

2. Install

#pip install transformers datasets evaluate accelerate
  • This code installs something.
  • pip: think of pip here as the name of an app store (Google Play Store, T-store).
  • install: think of this as the "install button".
  • transformers, datasets, evaluate, accelerate: the names of the packages we want to install.

Even if you run this in Colab, nothing gets installed on your own computer, so run it with peace of mind.

3. Coding Patterns

- In Python, supervised learning code is usually written roughly in the flow below (a concrete scikit-learn sketch follows after the pattern and its bullets).

## Step1: data
데이터 = 데이터읽기()
전처리된데이터 = 데이터전처리하기(데이터)
전처리된자료1, 전처리된자료2 = 데이터분리하기(전처리된데이터)

## Step2: create the AI
인공지능 = 인공지능생성하기() # a dumb AI, for now

## Step3: train the AI
인공지능.학습하기(전처리된자료1) # after this command runs, the AI has become smart

## Step4: predict
인공지능.예측하기(전처리된자료1) # solve the train_data
인공지능.예측하기(전처리된자료2) # solve the val_data
인공지능.예측하기(전처리된자료3) # unexpected new data: test_data
  • Characteristic: the code is object-oriented.
  • In R, the code above would be rewritten in a function-oriented style.
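
- To make the pattern concrete: below is a minimal sketch of the same Step1–Step4 flow written with scikit-learn (purely illustrative; this library is not used in the rest of this note).

## Step1: data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)                          # read the data
X1, X2, y1, y2 = train_test_split(X, y, random_state=42)   # split the data

## Step2: create the AI
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)  # a dumb AI, for now

## Step3: train the AI
model.fit(X1, y1)  # after this line runs, the model has become smart

## Step4: predict
model.predict(X1)  # solve the train data
model.predict(X2)  # solve the val data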

- Points mentioned last time

  • Strategy: (1) run code from a blog, (2) understand the code, (3) swap only the data part of the code for your own data, (4) run it and get predictions.
  • Mindset: "I can write any code." X // "I can understand any code." X // "I can copy any code." O
  • Here, "understanding the code" means: even without knowing the details, quickly sorting the code into the Step1–Step4 categories.

- Variant pattern 1: there is of course never just one pattern; a pattern can change freely depending on the preferences of whoever writes the code.

## Step1: data
데이터 = 데이터읽기()
전처리된훈련자료 = 데이터전처리하기(데이터['train'])
전처리된검증자료 = 데이터전처리하기(데이터['val'])

## Step2: create the AI
인공지능 = 인공지능생성하기() # a dumb AI, for now

## Step3: train the AI
인공지능.학습하기(전처리된훈련자료) # after this command runs, the AI has become smart

## Step4: predict
인공지능.예측하기(전처리된훈련자료) # solve the train_data
인공지능.예측하기(전처리된검증자료) # solve the val_data
인공지능.예측하기(전처리된돌발자료) # unexpected new data: test_data

- Variant pattern 2: when the data object itself has a "transform"-style method (a tiny sketch of the passing-a-function-as-an-argument idea follows after the pattern).

## Step1: data
데이터 = 데이터읽기()
전처리된데이터 = 데이터.변환하기(데이터전처리하기) # the function 데이터전처리하기 itself is passed as the input to 데이터.변환하기()
전처리된훈련자료, 전처리된검증자료 = 전처리된데이터['train'], 전처리된데이터['val'] 

## Step2: create the AI
인공지능 = 인공지능생성하기() # a dumb AI, for now

## Step3: train the AI
인공지능.학습하기(전처리된훈련자료) # after this command runs, the AI has become smart

## Step4: predict
인공지능.예측하기(전처리된훈련자료) # solve the train_data
인공지능.예측하기(전처리된검증자료) # solve the val_data
인공지능.예측하기(전처리된돌발자료) # unexpected new data: test_data
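
- The key idea in variant pattern 2 is that a function itself can be passed as an argument to another function. A tiny illustrative sketch (all names here are hypothetical):

def 데이터전처리하기(text):   # hypothetical preprocessing: just lowercase the text
    return text.lower()

데이터 = ["GOOD Movie", "BAD Movie"]
전처리된데이터 = list(map(데이터전처리하기, 데이터))  # the function object itself is the first argument
print(전처리된데이터)  # ['good movie', 'bad movie']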

- Variant pattern 3: when a trainer exists.

## Step1: data
데이터 = 데이터읽기()
전처리된데이터 = 데이터.변환하기(데이터전처리하기) # the function 데이터전처리하기 itself is passed as the input to 데이터.변환하기()
전처리된훈련자료, 전처리된검증자료 = 전처리된데이터['train'], 전처리된데이터['val'] 

## Step2: create the AI
인공지능 = 인공지능생성하기() # a dumb AI, for now

## Step3: train the AI
트레이너 = 트레이너생성하기(인공지능,전처리된훈련자료,전처리된검증자료) 
트레이너.train() # the AI becomes smart..

## Step4: predict
인공지능.예측하기(전처리된훈련자료) # solve the train_data
인공지능.예측하기(전처리된검증자료) # solve the val_data
인공지능.예측하기(전처리된돌발자료) # unexpected new data: test_data

- Variant pattern 4: when a strong AI (a ready-to-use pipeline) exists.

## Step1: data
데이터 = 데이터읽기()
전처리된데이터 = 데이터.변환하기(데이터전처리하기) # the function 데이터전처리하기 itself is passed as the input to 데이터.변환하기()
전처리된훈련자료, 전처리된검증자료 = 전처리된데이터['train'], 전처리된데이터['val'] 

## Step2: create the AI
인공지능 = 인공지능생성하기() # a dumb AI, for now

## Step3: train the AI
트레이너 = 트레이너생성하기(인공지능,전처리된훈련자료,전처리된검증자료) 
트레이너.train() # the AI becomes smart..

## Step4: predict
강인공지능 = 강인공지능생성기(인공지능)
강인공지능.예측하기(전처리되지않은자료) 

- Other variant patterns... (a PyTorch sketch of the train-mode/eval-mode idea follows after the pattern)

## Step1: data
데이터 = 데이터읽기()
전처리된데이터 = 데이터.변환하기(데이터전처리하기) # the function 데이터전처리하기 itself is passed as the input to 데이터.변환하기()
전처리된훈련자료, 전처리된검증자료 = 전처리된데이터['train'], 전처리된데이터['val'] 

## Step2: create the AI
인공지능 = 인공지능생성하기() # a dumb AI, for now

## Step3: train the AI
인공지능.훈련모드() # study mode
인공지능.학습하기(전처리된훈련자료) # after this command runs, the AI has become smart
인공지능.평가모드() # problem-solving mode

## Step4: predict
인공지능.예측하기(전처리된훈련자료) # solve the train_data
인공지능.예측하기(전처리된검증자료) # solve the val_data
인공지능.예측하기(전처리된돌발자료) # unexpected new data: test_data
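
- The train-mode/eval-mode switch in the last pattern is exactly how PyTorch models behave. A minimal sketch (PyTorch is used here only for illustration):

import torch

model = torch.nn.Sequential(torch.nn.Linear(3, 2), torch.nn.Dropout(0.5))  # a toy model
model.train()   # study mode: dropout is active
# ... a training loop would go here ...
model.eval()    # problem-solving mode: dropout is disabled, predictions are deterministic
with torch.no_grad():
    out = model(torch.randn(1, 3))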

4. Code Walkthrough 1

ref: https://huggingface.co/docs/transformers/tasks/sequence_classification

A. Step1 – Data

- Loading the data

from datasets import load_dataset
imdb = load_dataset("imdb")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

What does it do?

tokenizer("Transformers are really useful for natural language processing.") # 텍스트 -> 숫자들 
{'input_ids': [101, 19081, 2024, 2428, 6179, 2005, 3019, 2653, 6364, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
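
You can also go in the reverse direction to see the text ↔ number correspondence (a quick check using the tokenizer loaded above):

ids = tokenizer("Transformers are really useful for natural language processing.")['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))  # the subword token behind each id: [CLS], transformers, are, ...
print(tokenizer.decode(ids))                 # back to text, wrapped in the special tokens [CLS] ... [SEP]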

- Declaring preprocess_function

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

What does this function do?

tokenizer("Transformers are really useful for natural language processing.",truncation=True) # 텍스트 -> 숫자들 
{'input_ids': [101, 19081, 2024, 2428, 6179, 2005, 3019, 2653, 6364, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
examples = imdb['train'][1]
#tokenizer(examples["text"], truncation=True)
preprocess_function(examples)
{'input_ids': [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
examples = imdb['train'][:2]
#tokenizer(examples["text"], truncation=True)
preprocess_function(examples)
{'input_ids': [[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102], [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 
2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
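
Why truncation=True? This checkpoint can take at most 512 tokens, so longer reviews have to be cut off. A quick check (long_text is an artificial example):

long_text = "great movie " * 1000
print(len(tokenizer(long_text)['input_ids']))                   # about 2000 tokens, with a too-long warning
print(len(tokenizer(long_text, truncation=True)['input_ids']))  # 512: cut at the model's maximum length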

- Using the map method

tokenized_imdb = imdb.map(preprocess_function, batched=True)
#imdb['train'][0], tokenized_imdb['train'][0]
for i in range(5):
    print(preprocess_function(imdb['train'][i]))
{'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2065, 2069, 2000, 4468, 2437, 2023, 2828, 1997, 2143, 1999, 1996, 2925, 1012, 2023, 2143, 2003, 5875, 2004, 2019, 7551, 2021, 4136, 2053, 2522, 11461, 2466, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2514, 6819, 5339, 8918, 2005, 3564, 27046, 2009, 2138, 2009, 12817, 2006, 2061, 2116, 2590, 3314, 2021, 2009, 2515, 2061, 2302, 2151, 5860, 11795, 3085, 15793, 1012, 1996, 13972, 3310, 2185, 2007, 2053, 2047, 15251, 1006, 4983, 2028, 3310, 2039, 2007, 2028, 2096, 2028, 1005, 1055, 2568, 17677, 2015, 1010, 2004, 2009, 2097, 26597, 2079, 2076, 2023, 23100, 2143, 1007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2488, 5247, 2028, 1005, 1055, 2051, 4582, 2041, 1037, 3332, 2012, 1037, 3392, 3652, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2143, 2001, 2763, 4427, 2011, 2643, 4232, 1005, 1055, 16137, 10841, 4115, 1010, 10768, 25300, 2078, 1998, 1045, 9075, 2017, 2000, 2156, 2008, 2143, 2612, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2143, 2038, 2048, 2844, 3787, 1998, 2216, 2024, 1010, 1006, 1015, 1007, 1996, 12689, 3772, 1006, 1016, 1007, 1996, 8052, 1010, 6151, 6810, 2099, 7178, 2135, 2204, 1010, 6302, 1012, 4237, 2013, 2008, 1010, 2054, 9326, 2033, 2087, 2003, 1996, 10866, 5460, 1997, 9033, 21202, 7971, 1012, 14229, 6396, 2386, 2038, 2000, 2022, 2087, 15703, 3883, 1999, 1996, 2088, 1012, 2016, 4490, 2061, 5236, 1998, 2007, 2035, 1996, 16371, 25469, 1999, 2023, 2143, 1010, 1012, 1012, 1012, 2009, 1005, 1055, 14477, 4779, 26884, 1012, 13599, 2000, 2643, 4232, 1005, 1055, 2143, 1010, 7789, 3012, 2038, 2042, 2999, 2007, 28072, 1012, 2302, 2183, 2205, 2521, 2006, 2023, 3395, 1010, 1045, 2052, 2360, 2008, 4076, 2013, 1996, 4489, 1999, 15084, 2090, 1996, 2413, 1998, 1996, 4467, 2554, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1037, 3185, 1997, 2049, 2051, 1010, 1998, 2173, 1012, 1016, 1013, 2184, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2821, 1010, 2567, 1012, 1012, 1012, 2044, 4994, 2055, 2023, 9951, 2143, 2005, 8529, 13876, 12129, 2086, 2035, 1045, 2064, 2228, 1997, 2003, 2008, 2214, 14911, 3389, 2299, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1000, 2003, 2008, 2035, 2045, 2003, 1029, 1029, 1000, 1012, 1012, 1012, 1045, 2001, 2074, 2019, 2220, 9458, 2043, 2023, 20482, 3869, 2718, 1996, 1057, 1012, 1055, 1012, 1045, 2001, 2205, 2402, 2000, 2131, 1999, 1996, 4258, 1006, 2348, 1045, 2106, 6133, 2000, 13583, 2046, 1000, 9119, 8912, 1000, 1007, 1012, 2059, 1037, 11326, 2012, 1037, 2334, 2143, 2688, 10272, 17799, 1011, 2633, 1045, 2071, 2156, 2023, 2143, 1010, 3272, 2085, 1045, 2001, 2004, 2214, 2004, 2026, 3008, 2020, 2043, 2027, 8040, 7317, 13699, 5669, 2000, 2156, 2009, 999, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2069, 3114, 2023, 2143, 2001, 2025, 10033, 2000, 1996, 10812, 13457, 1997, 2051, 2001, 2138, 1997, 1996, 27885, 11020, 20693, 2553, 13977, 2011, 2049, 1057, 1012, 1055, 1012, 2713, 1012, 8817, 1997, 2111, 19311, 2098, 2000, 2023, 27136, 2121, 1010, 3241, 2027, 2020, 2183, 2000, 2156, 1037, 3348, 2143, 1012, 1012, 1012, 2612, 1010, 2027, 2288, 7167, 1997, 2485, 22264, 1997, 1043, 11802, 2135, 1010, 16360, 23004, 25430, 18352, 1010, 2006, 1011, 2395, 7636, 1999, 20857, 6023, 25943, 1010, 2004, 5498, 2063, 2576, 3653, 29048, 1012, 1012, 1012, 1998, 7408, 3468, 2040, 1011, 14977, 23599, 3348, 5019, 2007, 7842, 22772, 1010, 5122, 5889, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 3451, 12696, 1010, 4151, 24665, 12502, 1010, 3181, 20785, 1012, 1012, 3649, 2023, 2518, 2001, 1010, 14021, 5596, 2009, 1010, 6402, 2009, 1010, 2059, 4933, 1996, 11289, 1999, 1037, 2599, 3482, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 7069, 9765, 27065, 2229, 2145, 26988, 2000, 2424, 3643, 1999, 2049, 11771, 18404, 6208, 2576, 11867, 7974, 8613, 1012, 1012, 2021, 2065, 2009, 4694, 1005, 1056, 2005, 1996, 15657, 9446, 1010, 2009, 2052, 2031, 2042, 6439, 1010, 2059, 6404, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2612, 1010, 1996, 1000, 1045, 2572, 8744, 1010, 8744, 1000, 1054, 10536, 16921, 7583, 2516, 2001, 5567, 10866, 2135, 2005, 2086, 2004, 1037, 14841, 26065, 3508, 2005, 22555, 2080, 3152, 1006, 1045, 2572, 8025, 1010, 20920, 1011, 2005, 5637, 3152, 1010, 1045, 2572, 8025, 1010, 2304, 1011, 2005, 1038, 2721, 2595, 24759, 28100, 3370, 3152, 1010, 4385, 1012, 1012, 1007, 1998, 2296, 2702, 2086, 2030, 2061, 1996, 2518, 9466, 2013, 1996, 2757, 1010, 2000, 2022, 7021, 2011, 1037, 2047, 4245, 1997, 26476, 2015, 2040, 2215, 2000, 2156, 2008, 1000, 20355, 3348, 2143, 1000, 2008, 1000, 4329, 3550, 1996, 2143, 3068, 1000, 1012, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 6300, 9953, 1010, 4468, 2066, 1996, 11629, 1012, 1012, 2030, 2065, 2017, 2442, 2156, 2009, 1011, 9278, 1996, 2678, 1998, 3435, 2830, 2000, 1996, 1000, 6530, 1000, 3033, 1010, 2074, 2000, 2131, 2009, 2058, 2007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
print(tokenized_imdb['train'][0]['input_ids'])
print(tokenized_imdb['train'][1]['input_ids'])
print(tokenized_imdb['train'][2]['input_ids'])
print(tokenized_imdb['train'][3]['input_ids'])
print(tokenized_imdb['train'][4]['input_ids'])
[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102]
[101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102]
[101, 2065, 2069, 2000, 4468, 2437, 2023, 2828, 1997, 2143, 1999, 1996, 2925, 1012, 2023, 2143, 2003, 5875, 2004, 2019, 7551, 2021, 4136, 2053, 2522, 11461, 2466, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2514, 6819, 5339, 8918, 2005, 3564, 27046, 2009, 2138, 2009, 12817, 2006, 2061, 2116, 2590, 3314, 2021, 2009, 2515, 2061, 2302, 2151, 5860, 11795, 3085, 15793, 1012, 1996, 13972, 3310, 2185, 2007, 2053, 2047, 15251, 1006, 4983, 2028, 3310, 2039, 2007, 2028, 2096, 2028, 1005, 1055, 2568, 17677, 2015, 1010, 2004, 2009, 2097, 26597, 2079, 2076, 2023, 23100, 2143, 1007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2488, 5247, 2028, 1005, 1055, 2051, 4582, 2041, 1037, 3332, 2012, 1037, 3392, 3652, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102]
[101, 2023, 2143, 2001, 2763, 4427, 2011, 2643, 4232, 1005, 1055, 16137, 10841, 4115, 1010, 10768, 25300, 2078, 1998, 1045, 9075, 2017, 2000, 2156, 2008, 2143, 2612, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2143, 2038, 2048, 2844, 3787, 1998, 2216, 2024, 1010, 1006, 1015, 1007, 1996, 12689, 3772, 1006, 1016, 1007, 1996, 8052, 1010, 6151, 6810, 2099, 7178, 2135, 2204, 1010, 6302, 1012, 4237, 2013, 2008, 1010, 2054, 9326, 2033, 2087, 2003, 1996, 10866, 5460, 1997, 9033, 21202, 7971, 1012, 14229, 6396, 2386, 2038, 2000, 2022, 2087, 15703, 3883, 1999, 1996, 2088, 1012, 2016, 4490, 2061, 5236, 1998, 2007, 2035, 1996, 16371, 25469, 1999, 2023, 2143, 1010, 1012, 1012, 1012, 2009, 1005, 1055, 14477, 4779, 26884, 1012, 13599, 2000, 2643, 4232, 1005, 1055, 2143, 1010, 7789, 3012, 2038, 2042, 2999, 2007, 28072, 1012, 2302, 2183, 2205, 2521, 2006, 2023, 3395, 1010, 1045, 2052, 2360, 2008, 4076, 2013, 1996, 4489, 1999, 15084, 2090, 1996, 2413, 1998, 1996, 4467, 2554, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1037, 3185, 1997, 2049, 2051, 1010, 1998, 2173, 1012, 1016, 1013, 2184, 1012, 102]
[101, 2821, 1010, 2567, 1012, 1012, 1012, 2044, 4994, 2055, 2023, 9951, 2143, 2005, 8529, 13876, 12129, 2086, 2035, 1045, 2064, 2228, 1997, 2003, 2008, 2214, 14911, 3389, 2299, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1000, 2003, 2008, 2035, 2045, 2003, 1029, 1029, 1000, 1012, 1012, 1012, 1045, 2001, 2074, 2019, 2220, 9458, 2043, 2023, 20482, 3869, 2718, 1996, 1057, 1012, 1055, 1012, 1045, 2001, 2205, 2402, 2000, 2131, 1999, 1996, 4258, 1006, 2348, 1045, 2106, 6133, 2000, 13583, 2046, 1000, 9119, 8912, 1000, 1007, 1012, 2059, 1037, 11326, 2012, 1037, 2334, 2143, 2688, 10272, 17799, 1011, 2633, 1045, 2071, 2156, 2023, 2143, 1010, 3272, 2085, 1045, 2001, 2004, 2214, 2004, 2026, 3008, 2020, 2043, 2027, 8040, 7317, 13699, 5669, 2000, 2156, 2009, 999, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2069, 3114, 2023, 2143, 2001, 2025, 10033, 2000, 1996, 10812, 13457, 1997, 2051, 2001, 2138, 1997, 1996, 27885, 11020, 20693, 2553, 13977, 2011, 2049, 1057, 1012, 1055, 1012, 2713, 1012, 8817, 1997, 2111, 19311, 2098, 2000, 2023, 27136, 2121, 1010, 3241, 2027, 2020, 2183, 2000, 2156, 1037, 3348, 2143, 1012, 1012, 1012, 2612, 1010, 2027, 2288, 7167, 1997, 2485, 22264, 1997, 1043, 11802, 2135, 1010, 16360, 23004, 25430, 18352, 1010, 2006, 1011, 2395, 7636, 1999, 20857, 6023, 25943, 1010, 2004, 5498, 2063, 2576, 3653, 29048, 1012, 1012, 1012, 1998, 7408, 3468, 2040, 1011, 14977, 23599, 3348, 5019, 2007, 7842, 22772, 1010, 5122, 5889, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 3451, 12696, 1010, 4151, 24665, 12502, 1010, 3181, 20785, 1012, 1012, 3649, 2023, 2518, 2001, 1010, 14021, 5596, 2009, 1010, 6402, 2009, 1010, 2059, 4933, 1996, 11289, 1999, 1037, 2599, 3482, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 7069, 9765, 27065, 2229, 2145, 26988, 2000, 2424, 3643, 1999, 2049, 11771, 18404, 6208, 2576, 11867, 7974, 8613, 1012, 1012, 2021, 2065, 2009, 4694, 1005, 1056, 2005, 1996, 15657, 9446, 1010, 2009, 2052, 2031, 2042, 6439, 1010, 2059, 6404, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2612, 1010, 1996, 1000, 1045, 2572, 8744, 1010, 8744, 1000, 1054, 10536, 16921, 7583, 2516, 2001, 5567, 10866, 2135, 2005, 2086, 2004, 1037, 14841, 26065, 3508, 2005, 22555, 2080, 3152, 1006, 1045, 2572, 8025, 1010, 20920, 1011, 2005, 5637, 3152, 1010, 1045, 2572, 8025, 1010, 2304, 1011, 2005, 1038, 2721, 2595, 24759, 28100, 3370, 3152, 1010, 4385, 1012, 1012, 1007, 1998, 2296, 2702, 2086, 2030, 2061, 1996, 2518, 9466, 2013, 1996, 2757, 1010, 2000, 2022, 7021, 2011, 1037, 2047, 4245, 1997, 26476, 2015, 2040, 2215, 2000, 2156, 2008, 1000, 20355, 3348, 2143, 1000, 2008, 1000, 4329, 3550, 1996, 2143, 3068, 1000, 1012, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 6300, 9953, 1010, 4468, 2066, 1996, 11629, 1012, 1012, 2030, 2065, 2017, 2442, 2156, 2009, 1011, 9278, 1996, 2678, 1998, 3435, 2830, 2000, 1996, 1000, 6530, 1000, 3033, 1010, 2074, 2000, 2131, 2009, 2058, 2007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102]

- Conclusion: imdb stores the data as "text", while tokenized_imdb stores it as "text \(\cup\) numbers". In other words, tokenized_imdb holds both the original (raw) data and the preprocessed data together.

B. Step2 – Creating the AI

- Code that creates the AI

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
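
A quick check of what this still-dumb AI does (assuming the tokenizer from Step1): it maps any text to two logits, one score per label, but the scores are meaningless before training.

import torch
inputs = tokenizer("What a wonderful movie!", return_tensors="pt")
with torch.no_grad():
    print(model(**inputs).logits)  # tensor of shape (1, 2); essentially random values until trained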

C. Step3 – Training the AI

- Creating the data_collator, an ingredient needed to build the trainer

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
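
What does the data_collator do? Reviews have different lengths, so it pads each batch on the fly until every sequence in the batch is equally long. A quick check (assuming the tokenizer above):

features = [tokenizer("short text"), tokenizer("a somewhat longer piece of text")]
batch = data_collator(features)
print(batch['input_ids'].shape)  # (2, length of the longest example); shorter ones are padded with 0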

- Creating compute_metrics, another ingredient needed to build the trainer

import evaluate
import numpy as np
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
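
compute_metrics receives a (logits, labels) pair, turns the logits into predicted classes with argmax, and returns the accuracy. A toy check with made-up numbers:

fake_logits = np.array([[2.0, 1.0], [0.1, 3.0], [1.5, 0.5]])  # argmax -> predictions [0, 1, 0]
fake_labels = np.array([0, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 0.666...} -- 2 out of 3 correct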

- Creating training_args (the training instructions), another ingredient needed to build the trainer

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2, # study the whole problem set twice..
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

- Creating the trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

- Code that trains the AI using the trainer

trainer.train()
[3126/3126 12:33, Epoch 2/2]
Epoch    Training Loss    Validation Loss    Accuracy
1        0.222000         0.202918           0.921160
2        0.147300         0.226498           0.932000

TrainOutput(global_step=3126, training_loss=0.20518110291132619, metrics={'train_runtime': 753.5002, 'train_samples_per_second': 66.357, 'train_steps_per_second': 4.149, 'total_flos': 6556904415524352.0, 'train_loss': 0.20518110291132619, 'epoch': 2.0})

D. Step4 – Prediction

text0 = "The movie was a major disappointment. The plot was weak, and the pacing felt disjointed. The characters lacked depth, making it hard to connect with them. Predictable twists and a cliché resolution left no sense of excitement. Visually, it was unimpressive, and the soundtrack didn’t fit the scenes, pulling me out of the experience. Overall, it failed to deliver anything memorable."
text1 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
from transformers import pipeline
classifier1 = pipeline("sentiment-analysis", model="my_awesome_model/checkpoint-1563")
print(classifier1(text0))
print(classifier1(text1))
[{'label': 'LABEL_0', 'score': 0.9910238981246948}]
[{'label': 'LABEL_1', 'score': 0.993097722530365}]
classifier2 = pipeline("sentiment-analysis", model="my_awesome_model/checkpoint-3126")
print(classifier2(text0))
print(classifier2(text1))
[{'label': 'LABEL_0', 'score': 0.9966901540756226}]
[{'label': 'LABEL_1', 'score': 0.9965715408325195}]
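
LABEL_0 and LABEL_1 are just the default names for class ids 0 and 1; in imdb, label 0 means a negative review and label 1 a positive one, so text0 is classified negative and text1 positive, both with high confidence. If you want readable label names, you can optionally pass an id2label mapping when creating the model (a hedged sketch, not required for the code above):

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},   # imdb: 0 = negative, 1 = positive
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)  # the pipeline would then print 'NEGATIVE'/'POSITIVE' instead of 'LABEL_0'/'LABEL_1'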

5. Code Walkthrough 2

- Packages

import datasets
import transformers
import evaluate
import numpy as np

- Preparing Step1–Step4

## Step1 
데이터불러오기 = datasets.load_dataset
데이터전처리하기1 = 토크나이저 = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") 
# str -> {'input_ids': , 'attention_mask': }
def 데이터전처리하기2(examples):
    return 데이터전처리하기1(examples["text"], truncation=True)
# tokenized_imdb['train'][0] ->  {'input_ids': , 'attention_mask': }
## Step2 
인공지능생성하기 = transformers.AutoModelForSequenceClassification.from_pretrained
## Step3 
데이터콜렉터 = transformers.DataCollatorWithPadding(tokenizer=토크나이저)
def 평가하기(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = evaluate.load("accuracy")
    return accuracy.compute(predictions=predictions, references=labels)
트레이너세부지침생성기 = transformers.TrainingArguments
트레이너생성기 = transformers.Trainer
## Step4 
강인공지능생성하기 = transformers.pipeline

- Running Step1–Step4

## Step1 
데이터 = 데이터불러오기('imdb')
전처리된데이터 = 데이터.map(데이터전처리하기2,batched=True)
전처리된훈련자료, 전처리된검증자료 = 전처리된데이터['train'], 전처리된데이터['test']
## Step2 
인공지능 = 인공지능생성하기("distilbert/distilbert-base-uncased", num_labels=2)
## Step3 
트레이너세부지침 = 트레이너세부지침생성기(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2, # study the whole problem set twice..
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)
트레이너 = 트레이너생성기(
    model=인공지능,
    args=트레이너세부지침,
    train_dataset=전처리된훈련자료,
    eval_dataset=전처리된검증자료,
    tokenizer=토크나이저,
    data_collator=데이터콜렉터,
    compute_metrics=평가하기,
)
트레이너.train()
## Step4 
강인공지능 = 강인공지능생성하기("sentiment-analysis", model="my_awesome_model/checkpoint-1563")
print(강인공지능(text0))
print(강인공지능(text1))
[{'label': 'LABEL_0', 'score': 0.9910238981246948}]
[{'label': 'LABEL_1', 'score': 0.993097722530365}]