Lecture videos

- (1/4) Introducing the recommender-system data

- (2/4) Building the dls and training

- (3/4) bias

- (4/4) Interpretation

import

import torch 
from fastai.collab import * 
from fastai.tabular.all import * 

data

path = untar_data(URLs.ML_100k) 

- The first dataframe

ratings=pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user','movie','rating','timestamp'])
ratings
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
... ... ... ... ...
99995 880 476 3 880175444
99996 716 204 5 879795543
99997 276 1090 1 874795795
99998 13 225 2 882399156
99999 12 203 3 879959583

100000 rows × 4 columns

  • The last column (timestamp) is not meaningful here.

- The second dataframe

movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie','title'), header=None)
movies
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)
... ... ...
1677 1678 Mat' i syn (1997)
1678 1679 B. Monkey (1998)
1679 1680 Sliding Doors (1998)
1680 1681 You So Crazy (1994)
1681 1682 Scream of Stone (Schrei aus Stein) (1991)

1682 rows × 2 columns

- Merge the two dataframes.

df = ratings.merge(movies)
df
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)
... ... ... ... ... ...
99995 840 1674 4 891211682 Mamma Roma (1962)
99996 655 1640 3 888474646 Eighth Day, The (1996)
99997 655 1637 3 888984255 Girls Town (1996)
99998 655 1630 3 887428735 Silence of the Palace, The (Saimt el Qusur) (1994)
99999 655 1641 3 887427810 Dadetown (1995)

100000 rows × 5 columns
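
As a minimal sketch of what `ratings.merge(movies)` does (toy values, not the real data): with no arguments, pandas performs an inner join on the column name the two frames share, `movie`, and attaches the matching title to each rating row.

```python
import pandas as pd

# Toy sketch of ratings.merge(movies): merge joins on the shared column
# name ('movie') and keeps only matching rows (inner join by default).
ratings = pd.DataFrame({'user': [196, 63], 'movie': [242, 242], 'rating': [3, 3]})
movies = pd.DataFrame({'movie': [242], 'title': ['Kolya (1996)']})
df = ratings.merge(movies)  # adds the 'title' column to every rating row
```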

dls

dls = CollabDataLoaders.from_df(df,bs=64,item_name='title') 
dls.show_batch()
user title rating
0 853 Hoodlum (1997) 4
1 384 Jackal, The (1997) 4
2 721 Robert A. Heinlein's The Puppet Masters (1994) 3
3 840 Rear Window (1954) 5
4 429 Pink Floyd - The Wall (1982) 3
5 536 Age of Innocence, The (1993) 3
6 763 Amadeus (1984) 4
7 913 Rear Window (1954) 4
8 276 Eraser (1996) 3
9 645 Cook the Thief His Wife & Her Lover, The (1989) 4

learn

lrnr = collab_learner(dls, n_factors=10, y_range=(0,5)) 
lrnr.fit(13) 
epoch train_loss valid_loss time
0 1.142334 1.111295 00:04
1 0.919599 0.928836 00:03
2 0.865876 0.896597 00:03
3 0.853443 0.881308 00:03
4 0.860030 0.872683 00:03
5 0.849131 0.864368 00:03
6 0.827771 0.854535 00:03
7 0.797815 0.844754 00:03
8 0.810200 0.836310 00:03
9 0.756194 0.830930 00:03
10 0.766496 0.826815 00:03
11 0.734313 0.823833 00:03
12 0.726935 0.822590 00:03
  • The loss in the textbook is also around 0.82.
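
The losses reported here are (assuming fastai's default loss for `collab_learner`) mean squared error between predicted and true ratings, so a valid_loss of 0.82 corresponds to a typical error of about sqrt(0.82) ≈ 0.9 stars. A toy computation with made-up values:

```python
# Toy MSE computation (values are illustrative, not from the run above).
preds = [3.37, 3.02, 2.71]
truth = [3, 2, 3]
mse = sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(preds)
rmse = mse ** 0.5  # typical prediction error in rating units
```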

- Let's examine the results.

lrnr.show_results()
user title rating rating_pred
0 922 320 3 3.366130
1 75 245 2 3.019973
2 82 885 3 2.705256
3 25 42 4 4.255459
4 16 390 5 4.548031
5 488 33 4 3.670551
6 796 1477 3 3.682889
7 887 686 3 4.129278
8 297 442 1 3.848483
  • Honestly, it does not feel like the model gets every prediction right.

learn2

lrnr2 = collab_learner(dls, use_nn=True, y_range=(0,5), layers=[20,10]) 
lrnr2.fit(8)
epoch train_loss valid_loss time
0 0.943745 0.911532 00:05
1 0.886727 0.887183 00:05
2 0.851722 0.876992 00:04
3 0.866142 0.875833 00:04
4 0.804943 0.872449 00:04
5 0.810429 0.877015 00:04
6 0.753599 0.881696 00:04
7 0.722334 0.891261 00:04
lrnr2.show_results()
user title rating rating_pred
0 446 861 1 2.672008
1 311 1078 4 4.112122
2 234 909 4 3.065944
3 342 314 3 3.664986
4 823 95 3 3.768519
5 804 188 4 2.877244
6 234 497 4 3.379913
7 533 225 4 3.330431
8 848 927 5 4.673620
  • The predictions are reasonable at a modest level.
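
A hand-rolled sketch of what `use_nn=True` builds, under the assumption that fastai concatenates the user and item embeddings and passes them through an MLP whose hidden widths come from `layers=[20,10]`. Dimensions are shrunk for readability, only one hidden layer is shown, and all weights are made up:

```python
import math

def relu(x):
    return max(0.0, x)

# Tiny MLP over concatenated user/item embeddings (one hidden layer here,
# versus the two hidden layers that layers=[20,10] requests).
def tiny_mlp(user_emb, item_emb, w_hidden, w_out, y_range=(0, 5)):
    x = user_emb + item_emb  # list '+' is concatenation, not vector addition
    h = [relu(sum(xi * wi for xi, wi in zip(x, row))) for row in w_hidden]
    raw = sum(hi * wi for hi, wi in zip(h, w_out))
    lo, hi = y_range
    return lo + (hi - lo) / (1 + math.exp(-raw))  # squash into y_range

score = tiny_mlp([0.1, 0.2], [0.3, -0.1],
                 [[0.5, 0.5, 0.5, 0.5], [0.2, -0.2, 0.2, -0.2]],
                 [1.0, 1.0])
```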

bias

lrnr.model
EmbeddingDotBias(
  (u_weight): Embedding(944, 10)
  (i_weight): Embedding(1665, 10)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)
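
Per its name (and my reading of fastai's source, hedged as an assumption), `EmbeddingDotBias` predicts a rating as the dot product of the user and item factor vectors plus both scalar biases, squashed into `y_range` with a scaled sigmoid. A minimal numerical sketch:

```python
import math

# Minimal sketch of an EmbeddingDotBias-style prediction: dot(user, item)
# plus both biases, passed through a sigmoid scaled to y_range=(0, 5).
def predict(u_vec, i_vec, u_bias, i_bias, y_range=(0, 5)):
    raw = sum(a * b for a, b in zip(u_vec, i_vec)) + u_bias + i_bias
    lo, hi = y_range
    return lo + (hi - lo) / (1 + math.exp(-raw))

r = predict([0.2, -0.1], [0.5, 0.3], u_bias=0.1, i_bias=0.25)
```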
lrnr.model.i_bias.weight.detach().to('cpu').squeeze()
tensor([ 0.0009, -0.1644,  0.0729,  ...,  0.0088,  0.2506,  0.0984])
  • Meaning? This is the item bias: some movies tend to receive higher or lower ratings on average regardless of who rates them, and this number expresses that tendency.
lst1=lrnr.model.i_bias.weight.detach().to('cpu').squeeze().argsort()[:20].tolist()
lst2=lrnr.model.i_bias.weight.detach().to('cpu').squeeze().argsort(descending=True)[:20].tolist()
list(dls.classes['title'][lst1])
['Children of the Corn: The Gathering (1996)',
 'Body Parts (1991)',
 '3 Ninjas: High Noon At Mega Mountain (1998)',
 'Jury Duty (1995)',
 'Amityville II: The Possession (1982)',
 'Theodore Rex (1995)',
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Dunston Checks In (1996)',
 'Crow: City of Angels, The (1996)',
 'Barb Wire (1996)',
 'Robocop 3 (1993)',
 'Amityville 3-D (1983)',
 'Amityville: A New Generation (1993)',
 'Bloodsport 2 (1995)',
 'Island of Dr. Moreau, The (1996)',
 'Solo (1996)',
 'Bushwhacked (1995)',
 'Big Bully (1996)',
 'Gordy (1995)',
 'Amityville Curse, The (1990)']
  • Unpopular movies (lowest biases)
list(dls.classes['title'][lst2])
['Close Shave, A (1995)',
 'As Good As It Gets (1997)',
 'L.A. Confidential (1997)',
 "Schindler's List (1993)",
 'Silence of the Lambs, The (1991)',
 'Rear Window (1954)',
 'Titanic (1997)',
 'Apt Pupil (1998)',
 'Wrong Trousers, The (1993)',
 'Good Will Hunting (1997)',
 'Henry V (1989)',
 'North by Northwest (1959)',
 'Vertigo (1958)',
 'Sunset Blvd. (1950)',
 'Shawshank Redemption, The (1994)',
 'To Kill a Mockingbird (1962)',
 'Fugitive, The (1993)',
 'Full Monty, The (1997)',
 'Blade Runner (1982)',
 'Treasure of the Sierra Madre, The (1948)']
  • Popular movies (highest biases)
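
The two `argsort` calls above simply rank items by their learned bias. A pure-Python equivalent with made-up bias values (the titles appear in the lists above; the numbers are illustrative):

```python
# Ranking items by bias, mirroring argsort()[:20] (ascending) and
# argsort(descending=True)[:20]; bias values here are invented.
biases = {'Gordy (1995)': -0.35, 'Solo (1996)': -0.28,
          'Rear Window (1954)': 0.55, 'Titanic (1997)': 0.61}
ranked = sorted(biases, key=biases.get)          # ascending: lowest bias first
lowest, highest = ranked[:2], ranked[::-1][:2]   # unpopular vs popular
```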

- The model appears to have trained well.

Prediction

- Let's look at Titanic (index 1501) and Robocop 3 (index 1251).

x,y = dls.one_batch()
x[:5]
tensor([[ 782, 1315],
        [ 145, 1207],
        [ 823, 1508],
        [ 452, 1157],
        [ 794, 1524]])
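
Each batch row is a `(user_index, title_index)` pair. The next cells query one fixed title across users 1 through 30; the plain-Python shape of that query is:

```python
# Pair every user index 1..30 with one fixed title index
# (1501 is the encoded index used below for Titanic).
movie_idx = 1501
xx = [[u, movie_idx] for u in range(1, 31)]
```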

- What would users 1 through 30 think of Titanic (1501)? Both models predict they would enjoy it.

xx = torch.tensor([[i,1501] for i in range(1,31)])
lrnr.model(xx.to("cuda:0"))
tensor([4.4531, 4.4386, 3.3761, 4.8889, 3.8445, 3.5618, 4.3848, 4.6649, 4.6996,
        4.4931, 4.1034, 4.7730, 4.0311, 4.2925, 3.6165, 4.8764, 3.8198, 4.0804,
        4.0261, 3.7349, 4.0564, 4.6282, 3.8605, 4.6444, 4.5684, 3.8845, 4.0185,
        4.4556, 4.2731, 4.4253], device='cuda:0', grad_fn=<AddBackward0>)
lrnr2.model(xx.to("cuda:0")).reshape(-1)
tensor([4.3266, 4.6268, 3.6982, 4.7522, 4.2184, 3.9160, 4.6478, 4.6014, 4.5140,
        4.2847, 3.8976, 4.5324, 3.9723, 4.0141, 3.6544, 4.7591, 4.0607, 4.2835,
        3.7749, 3.3524, 4.0209, 4.6510, 3.8351, 4.6803, 4.2459, 3.8215, 3.9208,
        4.4372, 4.2947, 4.3690], device='cuda:0',
       grad_fn=<ReshapeAliasBackward0>)

- What would users 1 through 30 think of Robocop 3 (1251)? Both models predict they would not enjoy it.

xx = torch.tensor([[i,1251] for i in range(1,31)])
lrnr.model(xx.to("cuda:0"))
tensor([1.2492, 1.5321, 1.7723, 2.0212, 1.5710, 1.4888, 2.3363, 1.6964, 1.9715,
        2.4178, 2.2392, 2.1472, 1.6472, 2.1953, 1.7494, 1.3403, 1.7519, 2.2229,
        2.3043, 2.3619, 1.3508, 1.1981, 1.9056, 2.0056, 2.2205, 1.5279, 1.7869,
        1.7508, 2.0724, 2.4507], device='cuda:0', grad_fn=<AddBackward0>)
lrnr2.model(xx.to("cuda:0")).reshape(-1)
tensor([1.3524, 1.9509, 1.3900, 2.1394, 1.6396, 0.9705, 2.4360, 1.8458, 2.0537,
        2.2466, 1.8744, 2.5842, 1.7164, 2.3071, 1.2227, 1.8126, 1.0431, 1.8193,
        2.2985, 2.4256, 1.1279, 1.1289, 1.4594, 2.2158, 2.8599, 1.6962, 1.2229,
        1.8232, 1.7359, 1.6326], device='cuda:0',
       grad_fn=<ReshapeAliasBackward0>)
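
Averaging the two prediction tensors makes the contrast explicit; a quick check using the first five values copied from `lrnr`'s outputs above:

```python
# First five predictions for each movie, copied from the outputs above.
titanic = [4.4531, 4.4386, 3.3761, 4.8889, 3.8445]
robocop3 = [1.2492, 1.5321, 1.7723, 2.0212, 1.5710]

def mean(xs):
    return sum(xs) / len(xs)

gap = mean(titanic) - mean(robocop3)  # how much higher Titanic is rated
```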