#pip install transformers datasets evaluate accelerate
02wk-1: IMDB 영화평 감성분석
1. 강의영상
2. Install
- 이 코드는 뭔가를 설치하는 코드이다.
pip
: 여기에서pip
은 앱스토어 이름이라고 생각하자. (구글스토어, T-store)install
: 이것은 “설치버튼” 이라고 생각하자.transformer
,datasets
,evaluate
,accelerate
: 설치하고 싶은 프로그램 이름
이거 코랩에서 실행해도 여러분의 컴퓨터에 뭐가 깔리는건 아닙니다. 마음편하게 실행하세요.
3. 코딩패턴
-
파이썬에서 보통 지도학습은 적당히 아래와 같은 흐름으로 코드가 작성된다.
## Step1: 데이터
= 데이터읽기()
데이터 = 데이터전처리하기(데이터)
전처리된데이터 1, 전처리된자료2 = 데이터분리하기(전처리된데이터)
전처리된자료
## Step2: 인공지능 생성
= 인공지능생성하기() # 바보인공지능
인공지능
## Step3: 인공지능 학습
1) # 이 명령어가 실행된 이후는 똑똑히진 인공지능
인공지능.학습하기(전처리된자료
## Step4: 예측
1) # train_data를 풀어봄
인공지능.예측하기(전처리된자료2) # val_data를 풀어봄
인공지능.예측하기(전처리된자료3) # 돌발적으로 발생한 자료, test_data 인공지능.예측하기(전처리된자료
- 특징: 오브젝트 중심적인 코드
- R의 경우는 위의코드가 함수중심적으로 다시 써져있을것임.
-
이전시간에 언급한 내용
- 전략: (1) 블로그 코드 돌려봄 (2) 코드를 이해한다 (3) 코드의 데이터부분만 나의 데이터로 바꿔서 돌린다 (4) 돌려서 결과를 예측한다.
- 마음가짐: 나는 어떠한 코드도 짤 수 있다. X // 나는 어떠한 코드도 이해할 수 있다. X // 나는 어떠한 코드도 베낄 수 있다. O
- 여기에서 코드를 이해한다는 의미 = 어떠한 코드 내용을 자세하게는 모르겠지만 Step1~Step4에 해당하는 카테고리로 빠르게 정리한다.
-
변형패턴1: 사실 패턴이 하나뿐일수는 없음. 코드짜는 사람의 마음에 따라서 얼마든지 바뀔수 있는게 패턴임.
## Step1: 데이터
= 데이터읽기()
데이터 = 데이터전처리하기(데이터['train'])
전처리된훈련자료 = 데이터전처리하기(데이터['val'])
전처리된검증자료
## Step2: 인공지능 생성
= 인공지능생성하기() # 바보인공지능
인공지능
## Step3: 인공지능 학습
# 이 명령어가 실행된 이후는 똑똑히진 인공지능
인공지능.학습하기(전처리된훈련자료)
## Step4: 예측
# train_data를 풀어봄
인공지능.예측하기(전처리된훈련자료) # val_data를 풀어봄
인공지능.예측하기(전처리된검증자료) # 돌발적으로 발생한 자료, test_data 인공지능.예측하기(전처리된돌발자료)
-
변형패턴2: 데이터자체에 “변환하기” 관련된 method 가 있는 경우.
## Step1: 데이터
= 데이터읽기()
데이터 = 데이터.변환하기(데이터전처리하기) # 데이터.변환하기()의 입력으로 데이터전처리하기라는 함수자체를 입력하는 경우
전처리된데이터 = 전처리된데이터['train'], 전처리된데이터['val']
전처리된훈련자료, 전처리된검증자료
## Step2: 인공지능 생성
= 인공지능생성하기() # 바보인공지능
인공지능
## Step3: 인공지능 학습
# 이 명령어가 실행된 이후는 똑똑히진 인공지능
인공지능.학습하기(전처리된훈련자료)
## Step4: 예측
# train_data를 풀어봄
인공지능.예측하기(전처리된훈련자료) # val_data를 풀어봄
인공지능.예측하기(전처리된검증자료) # 돌발적으로 발생한 자료, test_data 인공지능.예측하기(전처리된돌발자료)
-
변형패턴3: 트레이너가 존재하는 경우
## Step1: 데이터
= 데이터읽기()
데이터 = 데이터.변환하기(데이터전처리하기) # 데이터.변환하기()의 입력으로 데이터전처리하기라는 함수자체를 입력하는 경우
전처리된데이터 = 전처리된데이터['train'], 전처리된데이터['val']
전처리된훈련자료, 전처리된검증자료
## Step2: 인공지능 생성
= 인공지능생성하기() # 바보인공지능
인공지능
## Step3: 인공지능 학습
= 트레이너생성하기(인공지능,전처리된훈련자료,전처리된검증자료)
트레이너 # 인공지능은 똑똑해짐..
트러이너.train()
## Step4: 예측
# train_data를 풀어봄
인공지능.예측하기(전처리된훈련자료) # val_data를 풀어봄
인공지능.예측하기(전처리된검증자료) # 돌발적으로 발생한 자료, test_data 인공지능.예측하기(전처리된돌발자료)
-
변형패턴4: 강인공지능이 존재하는 경우
## Step1: 데이터
= 데이터읽기()
데이터 = 데이터.변환하기(데이터전처리하기) # 데이터.변환하기()의 입력으로 데이터전처리하기라는 함수자체를 입력하는 경우
전처리된데이터 = 전처리된데이터['train'], 전처리된데이터['val']
전처리된훈련자료, 전처리된검증자료
## Step2: 인공지능 생성
= 인공지능생성하기() # 바보인공지능
인공지능
## Step3: 인공지능 학습
= 트레이너생성하기(인공지능,전처리된훈련자료,전처리된검증자료)
트레이너 # 인공지능은 똑똑해짐..
트러이너.train()
## Step4: 예측
= 강인공지능생성기(인공지능)
강인공지능 강인공지능.예측하기(전처리되지않은자료)
-
기타변형패턴들..
## Step1: 데이터
= 데이터읽기()
데이터 = 데이터.변환하기(데이터전처리하기) # 데이터.변환하기()의 입력으로 데이터전처리하기라는 함수자체를 입력하는 경우
전처리된데이터 = 전처리된데이터['train'], 전처리된데이터['val']
전처리된훈련자료, 전처리된검증자료
## Step2: 인공지능 생성
= 인공지능생성하기() # 바보인공지능
인공지능
## Step3: 인공지능 학습
# 공부하는모드
인공지능.훈련모드() # 이 명령어가 실행된 이후는 똑똑히진 인공지능
인공지능.학습하기(전처리된훈련자료) # 문제풀이모드
인공지능.평가모드()
## Step4: 예측
# train_data를 풀어봄
인공지능.예측하기(전처리된훈련자료) # val_data를 풀어봄
인공지능.예측하기(전처리된검증자료) # 돌발적으로 발생한 자료, test_data 인공지능.예측하기(전처리된돌발자료)
4. 코드정리1
ref: https://huggingface.co/docs/transformers/tasks/sequence_classification
A. Step1 – 데이터
-
데이터불러오기
from datasets import load_dataset
= load_dataset("imdb") imdb
/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
from transformers import AutoTokenizer
= AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased") tokenizer
/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
뭐하는거?
"Transformers are really useful for natural language processing.") # 텍스트 -> 숫자들 tokenizer(
{'input_ids': [101, 19081, 2024, 2428, 6179, 2005, 3019, 2653, 6364, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
-
preprocess_function
를 선언
def preprocess_function(examples):
return tokenizer(examples["text"], truncation=True)
뭐하는 함수?
"Transformers are really useful for natural language processing.",truncation=True) # 텍스트 -> 숫자들 tokenizer(
{'input_ids': [101, 19081, 2024, 2428, 6179, 2005, 3019, 2653, 6364, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
= imdb['train'][1]
examples #tokenizer(examples["text"], truncation=True)
preprocess_function(examples)
{'input_ids': [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
= imdb['train'][:2]
examples #tokenizer(examples["text"], truncation=True)
preprocess_function(examples)
{'input_ids': [[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102], [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
-
map 메소드 사용
= imdb.map(preprocess_function, batched=True) tokenized_imdb
#imdb['train'][0], tokenized_imdb['train'][0]
for i in range(5):
print(preprocess_function(imdb['train'][i]))
{'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2065, 2069, 2000, 4468, 2437, 2023, 2828, 1997, 2143, 1999, 1996, 2925, 1012, 2023, 2143, 2003, 5875, 2004, 2019, 7551, 2021, 4136, 2053, 2522, 11461, 2466, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2514, 6819, 5339, 8918, 2005, 3564, 27046, 2009, 2138, 2009, 12817, 2006, 2061, 2116, 2590, 3314, 2021, 2009, 2515, 2061, 2302, 2151, 5860, 11795, 3085, 15793, 1012, 1996, 13972, 3310, 2185, 2007, 2053, 2047, 15251, 1006, 4983, 2028, 3310, 2039, 2007, 2028, 2096, 2028, 1005, 1055, 2568, 17677, 2015, 1010, 2004, 2009, 2097, 26597, 2079, 2076, 2023, 23100, 2143, 1007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2488, 5247, 2028, 1005, 1055, 2051, 4582, 2041, 1037, 3332, 2012, 1037, 3392, 3652, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2023, 2143, 2001, 2763, 4427, 2011, 2643, 4232, 1005, 1055, 16137, 10841, 4115, 1010, 10768, 25300, 2078, 1998, 1045, 9075, 2017, 2000, 2156, 2008, 2143, 2612, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2143, 2038, 2048, 2844, 3787, 1998, 2216, 2024, 1010, 1006, 1015, 1007, 1996, 12689, 3772, 1006, 1016, 1007, 1996, 8052, 1010, 6151, 6810, 2099, 7178, 2135, 2204, 1010, 6302, 1012, 4237, 2013, 2008, 1010, 2054, 9326, 2033, 2087, 2003, 1996, 10866, 5460, 1997, 9033, 21202, 7971, 1012, 14229, 6396, 2386, 2038, 2000, 2022, 2087, 15703, 3883, 1999, 1996, 2088, 1012, 2016, 4490, 2061, 5236, 1998, 2007, 2035, 1996, 16371, 25469, 1999, 2023, 2143, 1010, 1012, 1012, 1012, 2009, 1005, 1055, 14477, 4779, 26884, 1012, 13599, 2000, 2643, 4232, 1005, 1055, 2143, 1010, 7789, 3012, 2038, 2042, 2999, 2007, 28072, 1012, 2302, 2183, 2205, 2521, 2006, 2023, 3395, 1010, 1045, 2052, 2360, 2008, 4076, 2013, 1996, 4489, 1999, 15084, 2090, 1996, 2413, 1998, 1996, 4467, 2554, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1037, 3185, 1997, 2049, 2051, 1010, 1998, 2173, 1012, 1016, 1013, 2184, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 2821, 1010, 2567, 1012, 1012, 1012, 2044, 4994, 2055, 2023, 9951, 2143, 2005, 8529, 13876, 12129, 2086, 2035, 1045, 2064, 2228, 1997, 2003, 2008, 2214, 14911, 3389, 2299, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1000, 2003, 2008, 2035, 2045, 2003, 1029, 1029, 1000, 1012, 1012, 1012, 1045, 2001, 2074, 2019, 2220, 9458, 2043, 2023, 20482, 3869, 2718, 1996, 1057, 1012, 1055, 1012, 1045, 2001, 2205, 2402, 2000, 2131, 1999, 1996, 4258, 1006, 2348, 1045, 2106, 6133, 2000, 13583, 2046, 1000, 9119, 8912, 1000, 1007, 1012, 2059, 1037, 11326, 2012, 1037, 2334, 2143, 2688, 10272, 17799, 1011, 2633, 1045, 2071, 2156, 2023, 2143, 1010, 3272, 2085, 1045, 2001, 2004, 2214, 2004, 2026, 3008, 2020, 2043, 2027, 8040, 7317, 13699, 5669, 2000, 2156, 2009, 999, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2069, 3114, 2023, 2143, 2001, 2025, 10033, 2000, 1996, 10812, 13457, 1997, 2051, 2001, 2138, 1997, 1996, 27885, 11020, 20693, 2553, 13977, 2011, 2049, 1057, 1012, 1055, 1012, 2713, 1012, 8817, 1997, 2111, 19311, 2098, 2000, 2023, 27136, 2121, 1010, 3241, 2027, 2020, 2183, 2000, 2156, 1037, 3348, 2143, 1012, 1012, 1012, 2612, 1010, 2027, 2288, 7167, 1997, 2485, 22264, 1997, 1043, 11802, 2135, 1010, 16360, 23004, 25430, 18352, 1010, 2006, 1011, 2395, 7636, 1999, 20857, 6023, 25943, 1010, 2004, 5498, 2063, 2576, 3653, 29048, 1012, 1012, 1012, 1998, 7408, 3468, 2040, 1011, 14977, 23599, 3348, 5019, 2007, 7842, 22772, 1010, 5122, 5889, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 3451, 12696, 1010, 4151, 24665, 12502, 1010, 3181, 20785, 1012, 1012, 3649, 2023, 2518, 2001, 1010, 14021, 5596, 2009, 1010, 6402, 2009, 1010, 2059, 4933, 1996, 11289, 1999, 1037, 2599, 3482, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 7069, 9765, 27065, 2229, 2145, 26988, 2000, 2424, 3643, 1999, 2049, 11771, 18404, 6208, 2576, 11867, 7974, 8613, 1012, 1012, 2021, 2065, 2009, 4694, 1005, 1056, 2005, 1996, 15657, 9446, 1010, 2009, 2052, 2031, 2042, 6439, 1010, 2059, 6404, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2612, 1010, 1996, 1000, 1045, 2572, 8744, 1010, 8744, 1000, 1054, 10536, 16921, 7583, 2516, 2001, 5567, 10866, 2135, 2005, 2086, 2004, 1037, 14841, 26065, 3508, 2005, 22555, 2080, 3152, 1006, 1045, 2572, 8025, 1010, 20920, 1011, 2005, 5637, 3152, 1010, 1045, 2572, 8025, 1010, 2304, 1011, 2005, 1038, 2721, 2595, 24759, 28100, 3370, 3152, 1010, 4385, 1012, 1012, 1007, 1998, 2296, 2702, 2086, 2030, 2061, 1996, 2518, 9466, 2013, 1996, 2757, 1010, 2000, 2022, 7021, 2011, 1037, 2047, 4245, 1997, 26476, 2015, 2040, 2215, 2000, 2156, 2008, 1000, 20355, 3348, 2143, 1000, 2008, 1000, 4329, 3550, 1996, 2143, 3068, 1000, 1012, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 6300, 9953, 1010, 4468, 2066, 1996, 11629, 1012, 1012, 2030, 2065, 2017, 2442, 2156, 2009, 1011, 9278, 1996, 2678, 1998, 3435, 2830, 2000, 1996, 1000, 6530, 1000, 3033, 1010, 2074, 2000, 2131, 2009, 2058, 2007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
print(tokenized_imdb['train'][0]['input_ids'])
print(tokenized_imdb['train'][1]['input_ids'])
print(tokenized_imdb['train'][2]['input_ids'])
print(tokenized_imdb['train'][3]['input_ids'])
print(tokenized_imdb['train'][4]['input_ids'])
[101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 2014, 3689, 3836, 1010, 19846, 1010, 1998, 2496, 2273, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2054, 8563, 2033, 2055, 1045, 2572, 8025, 1011, 3756, 2003, 2008, 2871, 2086, 3283, 1010, 2023, 2001, 2641, 26932, 1012, 2428, 1010, 1996, 3348, 1998, 16371, 25469, 5019, 2024, 2261, 1998, 2521, 2090, 1010, 2130, 2059, 2009, 1005, 1055, 2025, 2915, 2066, 2070, 10036, 2135, 2081, 22555, 2080, 1012, 2096, 2026, 2406, 3549, 2568, 2424, 2009, 16880, 1010, 1999, 4507, 3348, 1998, 16371, 25469, 2024, 1037, 2350, 18785, 1999, 4467, 5988, 1012, 2130, 13749, 7849, 24544, 1010, 15835, 2037, 3437, 2000, 2204, 2214, 2879, 2198, 4811, 1010, 2018, 3348, 5019, 1999, 2010, 3152, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1045, 2079, 4012, 3549, 2094, 1996, 16587, 2005, 1996, 2755, 2008, 2151, 3348, 3491, 1999, 1996, 2143, 2003, 3491, 2005, 6018, 5682, 2738, 2084, 2074, 2000, 5213, 2111, 1998, 2191, 2769, 2000, 2022, 3491, 1999, 26932, 12370, 1999, 2637, 1012, 1045, 2572, 8025, 1011, 3756, 2003, 1037, 2204, 2143, 2005, 3087, 5782, 2000, 2817, 1996, 6240, 1998, 14629, 1006, 2053, 26136, 3832, 1007, 1997, 4467, 5988, 1012, 2021, 2428, 1010, 2023, 2143, 2987, 1005, 1056, 2031, 2172, 1997, 1037, 5436, 1012, 102]
[101, 1000, 1045, 2572, 8025, 1024, 3756, 1000, 2003, 1037, 15544, 19307, 1998, 3653, 6528, 20771, 19986, 8632, 1012, 2009, 2987, 1005, 1056, 3043, 2054, 2028, 1005, 1055, 2576, 5328, 2024, 2138, 2023, 2143, 2064, 6684, 2022, 2579, 5667, 2006, 2151, 2504, 1012, 2004, 2005, 1996, 4366, 2008, 19124, 3287, 16371, 25469, 2003, 2019, 6882, 13316, 1011, 2459, 1010, 2008, 3475, 1005, 1056, 2995, 1012, 1045, 1005, 2310, 2464, 1054, 1011, 6758, 3152, 2007, 3287, 16371, 25469, 1012, 4379, 1010, 2027, 2069, 3749, 2070, 25085, 5328, 1010, 2021, 2073, 2024, 1996, 1054, 1011, 6758, 3152, 2007, 21226, 24728, 22144, 2015, 1998, 20916, 4691, 6845, 2401, 1029, 7880, 1010, 2138, 2027, 2123, 1005, 1056, 4839, 1012, 1996, 2168, 3632, 2005, 2216, 10231, 7685, 5830, 3065, 1024, 8040, 7317, 5063, 2015, 11820, 1999, 1996, 9478, 2021, 2025, 1037, 17962, 21239, 1999, 4356, 1012, 1998, 2216, 3653, 6528, 20771, 10271, 5691, 2066, 1996, 2829, 16291, 1010, 1999, 2029, 2057, 1005, 2128, 5845, 2000, 1996, 2609, 1997, 6320, 25624, 1005, 1055, 17061, 3779, 1010, 2021, 2025, 1037, 7637, 1997, 5061, 5710, 2006, 9318, 7367, 5737, 19393, 1012, 2077, 6933, 1006, 2030, 20242, 1007, 1000, 3313, 1011, 3115, 1000, 1999, 5609, 1997, 16371, 25469, 1010, 1996, 10597, 27885, 5809, 2063, 2323, 2202, 2046, 4070, 2028, 14477, 6767, 8524, 6321, 5793, 28141, 4489, 2090, 2273, 1998, 2308, 1024, 2045, 2024, 2053, 8991, 18400, 2015, 2006, 4653, 2043, 19910, 3544, 15287, 1010, 1998, 1996, 2168, 3685, 2022, 2056, 2005, 1037, 2158, 1012, 1999, 2755, 1010, 2017, 3227, 2180, 1005, 1056, 2156, 2931, 8991, 18400, 2015, 1999, 2019, 2137, 2143, 1999, 2505, 2460, 1997, 22555, 2030, 13216, 14253, 2050, 1012, 2023, 6884, 3313, 1011, 3115, 2003, 2625, 1037, 3313, 3115, 2084, 2019, 4914, 2135, 2139, 24128, 3754, 2000, 2272, 2000, 3408, 20547, 2007, 1996, 19008, 1997, 2308, 1005, 1055, 4230, 1012, 102]
[101, 2065, 2069, 2000, 4468, 2437, 2023, 2828, 1997, 2143, 1999, 1996, 2925, 1012, 2023, 2143, 2003, 5875, 2004, 2019, 7551, 2021, 4136, 2053, 2522, 11461, 2466, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2514, 6819, 5339, 8918, 2005, 3564, 27046, 2009, 2138, 2009, 12817, 2006, 2061, 2116, 2590, 3314, 2021, 2009, 2515, 2061, 2302, 2151, 5860, 11795, 3085, 15793, 1012, 1996, 13972, 3310, 2185, 2007, 2053, 2047, 15251, 1006, 4983, 2028, 3310, 2039, 2007, 2028, 2096, 2028, 1005, 1055, 2568, 17677, 2015, 1010, 2004, 2009, 2097, 26597, 2079, 2076, 2023, 23100, 2143, 1007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2028, 2453, 2488, 5247, 2028, 1005, 1055, 2051, 4582, 2041, 1037, 3332, 2012, 1037, 3392, 3652, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102]
[101, 2023, 2143, 2001, 2763, 4427, 2011, 2643, 4232, 1005, 1055, 16137, 10841, 4115, 1010, 10768, 25300, 2078, 1998, 1045, 9075, 2017, 2000, 2156, 2008, 2143, 2612, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2143, 2038, 2048, 2844, 3787, 1998, 2216, 2024, 1010, 1006, 1015, 1007, 1996, 12689, 3772, 1006, 1016, 1007, 1996, 8052, 1010, 6151, 6810, 2099, 7178, 2135, 2204, 1010, 6302, 1012, 4237, 2013, 2008, 1010, 2054, 9326, 2033, 2087, 2003, 1996, 10866, 5460, 1997, 9033, 21202, 7971, 1012, 14229, 6396, 2386, 2038, 2000, 2022, 2087, 15703, 3883, 1999, 1996, 2088, 1012, 2016, 4490, 2061, 5236, 1998, 2007, 2035, 1996, 16371, 25469, 1999, 2023, 2143, 1010, 1012, 1012, 1012, 2009, 1005, 1055, 14477, 4779, 26884, 1012, 13599, 2000, 2643, 4232, 1005, 1055, 2143, 1010, 7789, 3012, 2038, 2042, 2999, 2007, 28072, 1012, 2302, 2183, 2205, 2521, 2006, 2023, 3395, 1010, 1045, 2052, 2360, 2008, 4076, 2013, 1996, 4489, 1999, 15084, 2090, 1996, 2413, 1998, 1996, 4467, 2554, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1037, 3185, 1997, 2049, 2051, 1010, 1998, 2173, 1012, 1016, 1013, 2184, 1012, 102]
[101, 2821, 1010, 2567, 1012, 1012, 1012, 2044, 4994, 2055, 2023, 9951, 2143, 2005, 8529, 13876, 12129, 2086, 2035, 1045, 2064, 2228, 1997, 2003, 2008, 2214, 14911, 3389, 2299, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1000, 2003, 2008, 2035, 2045, 2003, 1029, 1029, 1000, 1012, 1012, 1012, 1045, 2001, 2074, 2019, 2220, 9458, 2043, 2023, 20482, 3869, 2718, 1996, 1057, 1012, 1055, 1012, 1045, 2001, 2205, 2402, 2000, 2131, 1999, 1996, 4258, 1006, 2348, 1045, 2106, 6133, 2000, 13583, 2046, 1000, 9119, 8912, 1000, 1007, 1012, 2059, 1037, 11326, 2012, 1037, 2334, 2143, 2688, 10272, 17799, 1011, 2633, 1045, 2071, 2156, 2023, 2143, 1010, 3272, 2085, 1045, 2001, 2004, 2214, 2004, 2026, 3008, 2020, 2043, 2027, 8040, 7317, 13699, 5669, 2000, 2156, 2009, 999, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 2069, 3114, 2023, 2143, 2001, 2025, 10033, 2000, 1996, 10812, 13457, 1997, 2051, 2001, 2138, 1997, 1996, 27885, 11020, 20693, 2553, 13977, 2011, 2049, 1057, 1012, 1055, 1012, 2713, 1012, 8817, 1997, 2111, 19311, 2098, 2000, 2023, 27136, 2121, 1010, 3241, 2027, 2020, 2183, 2000, 2156, 1037, 3348, 2143, 1012, 1012, 1012, 2612, 1010, 2027, 2288, 7167, 1997, 2485, 22264, 1997, 1043, 11802, 2135, 1010, 16360, 23004, 25430, 18352, 1010, 2006, 1011, 2395, 7636, 1999, 20857, 6023, 25943, 1010, 2004, 5498, 2063, 2576, 3653, 29048, 1012, 1012, 1012, 1998, 7408, 3468, 2040, 1011, 14977, 23599, 3348, 5019, 2007, 7842, 22772, 1010, 5122, 5889, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 3451, 12696, 1010, 4151, 24665, 12502, 1010, 3181, 20785, 1012, 1012, 3649, 2023, 2518, 2001, 1010, 14021, 5596, 2009, 1010, 6402, 2009, 1010, 2059, 4933, 1996, 11289, 1999, 1037, 2599, 3482, 999, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 7069, 9765, 27065, 2229, 2145, 26988, 2000, 2424, 3643, 1999, 2049, 11771, 18404, 6208, 2576, 11867, 7974, 8613, 1012, 1012, 2021, 2065, 2009, 4694, 1005, 1056, 2005, 1996, 15657, 9446, 1010, 2009, 2052, 2031, 2042, 6439, 1010, 2059, 6404, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2612, 1010, 1996, 1000, 1045, 2572, 8744, 1010, 8744, 1000, 1054, 10536, 16921, 7583, 2516, 2001, 5567, 10866, 2135, 2005, 2086, 2004, 1037, 14841, 26065, 3508, 2005, 22555, 2080, 3152, 1006, 1045, 2572, 8025, 1010, 20920, 1011, 2005, 5637, 3152, 1010, 1045, 2572, 8025, 1010, 2304, 1011, 2005, 1038, 2721, 2595, 24759, 28100, 3370, 3152, 1010, 4385, 1012, 1012, 1007, 1998, 2296, 2702, 2086, 2030, 2061, 1996, 2518, 9466, 2013, 1996, 2757, 1010, 2000, 2022, 7021, 2011, 1037, 2047, 4245, 1997, 26476, 2015, 2040, 2215, 2000, 2156, 2008, 1000, 20355, 3348, 2143, 1000, 2008, 1000, 4329, 3550, 1996, 2143, 3068, 1000, 1012, 1012, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 6300, 9953, 1010, 4468, 2066, 1996, 11629, 1012, 1012, 2030, 2065, 2017, 2442, 2156, 2009, 1011, 9278, 1996, 2678, 1998, 3435, 2830, 2000, 1996, 1000, 6530, 1000, 3033, 1010, 2074, 2000, 2131, 2009, 2058, 2007, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 102]
-
결론: imdb
에는 자료가 “텍스트” 형태로 저장되어있고, tokenized_imdb
는 자료가 “텍스트 \(\cup\) 숫자” 형태로 저장되어 있음. 즉 tokenized_imdb
는 원래데이터 (raw data) 와 전처리된 데이터 (preprocessed data) 가 같이 있음.
B. Step2 – 인공지능 생성
-
인공지능을 생성하는 코드
from transformers import AutoModelForSequenceClassification
= AutoModelForSequenceClassification.from_pretrained(
model "distilbert/distilbert-base-uncased", num_labels=2
)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
C. Step3 – 인공지능 학습
-
트레이너를 만들때 필요한 재료 data_collator
생성
from transformers import DataCollatorWithPadding
= DataCollatorWithPadding(tokenizer=tokenizer) data_collator
-
트레이너를 만들때 필요한 재료 compute_metrics
생성
import evaluate
import numpy as np
= evaluate.load("accuracy")
accuracy def compute_metrics(eval_pred):
= eval_pred
predictions, labels = np.argmax(predictions, axis=1)
predictions return accuracy.compute(predictions=predictions, references=labels)
-
트레이너를 만들때 필요한 재료 training_args
(훈련지침?) 생성
from transformers import TrainingArguments, Trainer
= TrainingArguments(
training_args ="my_awesome_model",
output_dir=2e-5,
learning_rate=16,
per_device_train_batch_size=16,
per_device_eval_batch_size=2, # 전체문제세트를 2번 공부하라..
num_train_epochs=0.01,
weight_decay="epoch",
eval_strategy="epoch",
save_strategy=True,
load_best_model_at_end=False,
push_to_hub )
-
트레이너 생성
= Trainer(
trainer =model,
model=training_args,
args=tokenized_imdb["train"],
train_dataset=tokenized_imdb["test"],
eval_dataset=tokenizer,
tokenizer=data_collator,
data_collator=compute_metrics,
compute_metrics )
-
트레이너를 이용하여 인공지능을 학습시키는 코드
trainer.train()
Epoch | Training Loss | Validation Loss | Accuracy |
---|---|---|---|
1 | 0.222000 | 0.202918 | 0.921160 |
2 | 0.147300 | 0.226498 | 0.932000 |
TrainOutput(global_step=3126, training_loss=0.20518110291132619, metrics={'train_runtime': 753.5002, 'train_samples_per_second': 66.357, 'train_steps_per_second': 4.149, 'total_flos': 6556904415524352.0, 'train_loss': 0.20518110291132619, 'epoch': 2.0})
D. Step4 – 예측
= "The movie was a major disappointment. The plot was weak, and the pacing felt disjointed. The characters lacked depth, making it hard to connect with them. Predictable twists and a cliché resolution left no sense of excitement. Visually, it was unimpressive, and the soundtrack didn’t fit the scenes, pulling me out of the experience. Overall, it failed to deliver anything memorable."
text0 = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three." text1
from transformers import pipeline
= pipeline("sentiment-analysis", model="my_awesome_model/checkpoint-1563")
classifier1 print(classifier1(text0))
print(classifier1(text1))
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'LABEL_0', 'score': 0.9910238981246948}]
[{'label': 'LABEL_1', 'score': 0.993097722530365}]
= pipeline("sentiment-analysis", model="my_awesome_model/checkpoint-3126")
classifier2 print(classifier2(text0))
print(classifier2(text1))
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'LABEL_0', 'score': 0.9966901540756226}]
[{'label': 'LABEL_1', 'score': 0.9965715408325195}]
5. 코드정리2
-
패키지
import datasets
import transformers
import evaluate
import numpy as np
-
Step1~4 준비하기
## Step1
= datasets.load_dataset
데이터불러오기 1 = 토크나이저 = transformers.AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
데이터전처리하기# str -> {'input_ids': , 'attention_mask': }
def 데이터전처리하기2(examples):
return 데이터전처리하기1(examples["text"], truncation=True)
# tokenized_imdb['train'][0] -> {'input_ids': , 'attention_mask': }
## Step2
= transformers.AutoModelForSequenceClassification.from_pretrained
인공지능생성하기 ## Step3
= transformers.DataCollatorWithPadding(tokenizer=토크나이저)
데이터콜렉터 def 평가하기(eval_pred):
= eval_pred
predictions, labels = np.argmax(predictions, axis=1)
predictions = evaluate.load("accuracy")
accuracy return accuracy.compute(predictions=predictions, references=labels)
= transformers.TrainingArguments
트레이너세부지침생성기 = transformers.Trainer
트레이너생성기 ## Step4
= transformers.pipeline 강인공지능생성하기
/home/cgb3/anaconda3/envs/hf/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
-
Step1 ~ 4
## Step1
= 데이터불러오기('imdb')
데이터 = 데이터.map(데이터전처리하기2,batched=True)
전처리된데이터 = 전처리된데이터['train'], 전처리된데이터['test']
전처리된훈련자료, 전처리된검증자료 ## Step2
= 인공지능생성하기("distilbert/distilbert-base-uncased", num_labels=2)
인공지능 ## Step3
= 트레이너세부지침생성기(
트레이너세부지침 ="my_awesome_model",
output_dir=2e-5,
learning_rate=16,
per_device_train_batch_size=16,
per_device_eval_batch_size=2, # 전체문제세트를 2번 공부하라..
num_train_epochs=0.01,
weight_decay="epoch",
eval_strategy="epoch",
save_strategy=True,
load_best_model_at_end=False,
push_to_hub
)= 트레이너생성기(
트레이너 =인공지능,
model=트레이너세부지침,
args=전처리된훈련자료,
train_dataset=전처리된검증자료,
eval_dataset=토크나이저,
tokenizer=데이터콜렉터,
data_collator=평가하기,
compute_metrics
) 트레이너.train()
Map: 100%|██████████| 25000/25000 [00:02<00:00, 9912.28 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
## Step4
= 강인공지능생성하기("sentiment-analysis", model="my_awesome_model/checkpoint-1563")
강인공지능 print(강인공지능(text0))
print(강인공지능(text1))
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
[{'label': 'LABEL_0', 'score': 0.9910238981246948}]
[{'label': 'LABEL_1', 'score': 0.993097722530365}]