Lesson 07: pandas I

Author

최규빈

Published

July 24, 2023

imports

import numpy as np
import pandas as pd 

Pandas: 선언, 열의이름확인

dct = {'mid':[80,60,80,20],'fin':[40,100,0,70]}
df = pd.DataFrame(dct)
df 
mid fin
0 80 40
1 60 100
2 80 0
3 20 70

Pandas: 행과 열의 선택

- 데이터

np.random.seed(43052)
att = np.random.choice(np.arange(10,21)*5,20)
rep = np.random.choice(np.arange(5,21)*5,20)
mid = np.random.choice(np.arange(0,21)*5,20)
fin = np.random.choice(np.arange(0,21)*5,20)
student_id = ['2022-12'+str(s) for s in np.random.choice(np.arange(300,501),20,replace=False)]
df = pd.DataFrame({'att':att,'rep':rep,'mid':mid,'fin':fin})
df.head()
att rep mid fin
0 65 55 50 40
1 95 100 50 80
2 65 90 60 30
3 55 80 75 80
4 80 30 30 100
df2 = pd.DataFrame({'id':student_id,'att':att,'rep':rep,'mid':mid,'fin':fin})
df2.head()
id att rep mid fin
0 2022-12380 65 55 50 40
1 2022-12370 95 100 50 80
2 2022-12363 65 90 60 30
3 2022-12488 55 80 75 80
4 2022-12312 80 30 30 100

열의 선택

- 방법1: df.? + str

- 방법2: df[?] + str, [str,str]

# df['att'] # str 
# df[['att']] # [str]
# df[['att','rep']] # [str,str]

- 방법3: df.iloc[:,?] + int, int:int, [int,int], [bool,bool], range

# df.iloc[:,0] # int
# df.iloc[:,-2:] # int:int - 슬라이싱
# df.iloc[:,1::2] # int:int - 스트라이딩
# df.iloc[:,[0]] # [int]
# df.iloc[:,[0,1]] # [int,int]
# df.iloc[:,[True,True,False,False]] # bool의 list 
# df.iloc[:,range(2)] # range

- 방법4: df.loc[:,?] + str, ‘str:str’, [str,str], [bool,bool]

# df.loc[:,'att'] # str
# df.loc[:,'rep':'mid'] # 'str':'str' -- 칼럼이름으로 슬라이싱 **
# df.loc[:,'rep'::2] # 'str':'str' -- 칼럼이름으로 스트라이딩 ** 
# df.loc[:,['att']] # [str]
# df.loc[:,['att','rep']] # [str,str]
# df.loc[:,[True,False,False,True]] # bool의 list

행의 선택

- 방법1: df[] + int:int, str:str, [bool,bool], pd.Series([bool,bool]) – \((\star\star\star\star\star)\)

# df[:2] # int:int -- 슬라이싱 // df.iloc[:2,:], df.iloc[:2] 와 같음
# df[0:5:2] # int:int -- 스트라이딩 
# df[:'2022-12312'] # str:str -- 슬라이싱 // df.loc[:'2022-12312'], df.loc[:'2022-12312',:] 와 같음
# df[:'2022-12312':2] # str:str -- 스트라이딩
# df[list(df.att < 70)] # [bool,bool]
# df[df.att < 70] # pd.Series([bool,bool])

- 방법2: df.iloc[], df.iloc[,:] + int, int:int, [int,int], [bool,bool], range

# df.iloc[0] # int 
# df.iloc[-2:] # int:int -- 슬라이싱
# df.iloc[1::2] # int:int -- 스트라이딩
# df.iloc[[0]] # [int]
# df.iloc[[0,1]] # [int,int]
# df.iloc[[True]+[False]*19] # [bool,bool]
# df.iloc[range(2)] # range
# df.iloc[0,:] # int 
# df.iloc[-2:,:] # int:int -- 슬라이싱
# df.iloc[1::2,:] # int:int -- 스트라이딩
# df.iloc[[0],:] # [int]
# df.iloc[[0,1],:] # [int,int]
# df.iloc[[True]+[False]*19,:] # [bool,bool]
# df.iloc[range(2),:] # range

- 방법3: df.loc[], df.loc[,:] + int, str, int:int, str:str, [int,int], [str,str], [bool,bool], pd.Series([bool,bool])

# df2.loc[0] # int 
# df.loc['2022-12380'] # str 
# df2.loc[:2] # int:int 
# df.loc[:'2022-12363'] # str:str 
# df2.loc[[0,1]] # [int,int]
# df.loc[['2022-12380','2022-12370']] # [str,str]
# df.loc[[True]+[False]*19] # [bool,bool]
# df.loc[df.att>70] # pd.Series([bool,bool]) 
# df2.loc[0,:] # int 
# df.loc['2022-12380',:] # str 
# df2.loc[:2,:] # int:int 
# df.loc[:'2022-12363',:] # str:str 
# df2.loc[[0,1],:] # [int,int]
# df.loc[['2022-12380','2022-12370'],:] # [str,str]
# df.loc[[True]+[False]*19,:] # [bool,bool]
# df.loc[df.att>70,:] # pd.Series([bool,bool]) 

요약

- 알아두면 좋은 규칙

  • .iloc[].iloc[,:]는 완전히 동등하다.
  • .loc[].loc[,:]는 완전히 동등하다.

- 정리

type of indexer target . [] .iloc .loc comment
int row X X O \(\Delta\)
int col X X O X
str row X X X O
str col O O X O
int:int row X O O \(\Delta\)
int:int col X X O X
str:str row X O X O
str:str col X X X O
[int,int] row X X O \(\Delta\)
[int,int] col X X O X
[str,str] row X X X O
[str,str] col X O X O
[bool,bool] row X O O O
[bool,bool] col X X O O
pd.Series([bool,bool]) row X O X O

Pandas: 제가 선호하는 행과 열의 선택

제 스타일

- 가장 안전한 코드

# df.loc[:,:]

- 상황1: 하나의 col을 뽑으려 할때 좋은 코드

# df.att # 최애 
# df['att'] # 차애 
# df[['att']] # 차애의 대안

- 상황2: row 슬라이싱을 할때 좋은 코드 \((\star\star\star)\)

# df[:5] # 최애, df.iloc[:5]와 같다. 
# df[:'2022-12312'] # 차애, df.loc[:'2022-12312'] 와 같다

- 상황3: 조건에 맞는 row를 뽑을때 좋은 코드

# df[df.att<60] # 최애
# df[[att<60 for att in df.att]] # 차애

- 상황4: 하나의 row를 뽑으려 할때 좋은 코드

# df.loc[0] # 최애
# df.iloc[0] # 최애 
# df.loc[[0]] # 최애의 대안
# df.iloc[[0]] # 최애의 대안

- 상황5: (row,col)을 뽑으려 할때 좋은 코드

# 방법1: pd.Series를 뽑고 -> 인덱스로접근
# df.att[0]
# df['att'][0]

# 방법2: iloc, loc 으로 한번에 뽑기
# df.iloc[0,0]
# df.loc[0,'att']

위의 상황이외에는 df.loc[:,:]를 사용하는것이 유리하다

- 상황6: column 슬라이싱을 할때

# df.loc[:,'att':'mid'] # 끝점포함

- 상황7: row + column 슬라이싱을 하는 가장 좋은 코드

# df.loc[0:5,'att':'mid']

- 상황8: 조건에 맞는 col을 뽑기에 가장 좋은 코드

# df.loc[:,[len(col_name)>2 for col_name in df.columns]]

- 상황9: 조건에 맞는 row, col을 뽑기에 가장 좋은 코드

# df.loc[df.att<60,[len(col_name)>2 for col_name in df.columns]]

제 스타일 X

- 제가 안쓰는 코드1

df[:1]
att rep mid fin
0 65 55 50 40

이러면 내 입장에서는 마치 아래가 동작할 것 같잖아..

df[0]
KeyError: 0

- 제가 안쓰는 코드2: bool의 list를 사용할때 iloc은 가급적 쓰지마세요

df.iloc[list(df['att']<80),:]
att rep mid fin
0 65 55 50 40
2 65 90 60 30
3 55 80 75 80
5 75 40 100 15
6 65 45 45 90
7 60 60 25 0
10 55 75 35 25
13 50 80 40 30
14 50 55 15 85
16 50 50 45 10
17 65 55 15 45
18 70 70 40 35

이러면 마치 아래도 동작할 것 같잖아..

df.iloc[df['att']<80,:]
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Quiz

(1)-(3) 아래와 같은 자료를 고려하자.

df = pd.DataFrame(np.random.normal(size=(100,5)),columns=list('ABCDE'))
df
A B C D E
0 0.106173 0.723759 0.217990 0.194022 -0.688990
1 -0.351670 0.990933 1.212147 -0.608965 0.032549
2 -1.884089 -0.860866 0.552755 0.104294 -1.661620
3 -0.965230 1.898428 2.625531 -0.381609 -0.948021
4 0.453758 -0.716270 -0.167750 0.089546 0.874006
... ... ... ... ... ...
95 0.568935 -0.901952 0.781831 0.045797 -0.102197
96 0.111096 0.685877 -0.472303 -0.043043 -0.952074
97 2.206012 -0.375628 0.549381 0.576883 -1.280959
98 -1.125719 -0.021064 -0.018777 1.000606 0.772153
99 -0.158276 0.569409 -0.209664 0.326520 1.251615

100 rows × 5 columns

(1) B,D열을 선택하라.

df.loc[:,['B','C']]
B C
0 0.723759 0.217990
1 0.990933 1.212147
2 -0.860866 0.552755
3 1.898428 2.625531
4 -0.716270 -0.167750
... ... ...
95 -0.901952 0.781831
96 0.685877 -0.472303
97 -0.375628 0.549381
98 -0.021064 -0.018777
99 0.569409 -0.209664

100 rows × 2 columns

(2) 마지막 10개의 row를 출력하라.

df.iloc[-10:,:]
A B C D E
90 -0.515706 -1.345645 1.310011 -2.072882 -0.200044
91 -2.488585 -1.760744 -0.790638 -2.568387 0.884457
92 -0.178529 -0.672677 0.811988 -1.062689 0.183336
93 -1.410621 0.876799 0.119850 0.295357 -0.918690
94 0.632102 -0.777506 -1.732188 0.016164 0.842943
95 0.568935 -0.901952 0.781831 0.045797 -0.102197
96 0.111096 0.685877 -0.472303 -0.043043 -0.952074
97 2.206012 -0.375628 0.549381 0.576883 -1.280959
98 -1.125719 -0.021064 -0.018777 1.000606 0.772153
99 -0.158276 0.569409 -0.209664 0.326520 1.251615

(3) A,B 열의 처음 10개의 row를 출력하라.

df.loc[:9,'A':'B']
A B
0 0.106173 0.723759
1 -0.351670 0.990933
2 -1.884089 -0.860866
3 -0.965230 1.898428
4 0.453758 -0.716270
5 0.401627 0.204911
6 -0.434388 -1.436772
7 0.937827 1.079232
8 0.117038 -0.350165
9 1.032189 0.404380

(4)-(8) 아래와 같은 자료를 고려하자.

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')
df
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4911 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama ... 6.0 English Canada NaN NaN 2013.0 470.0 7.7 NaN 84
4912 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller ... 359.0 English USA TV-14 NaN NaN 593.0 7.5 16.00 32000
4913 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller ... 3.0 English USA NaN 1400.0 2013.0 0.0 6.3 NaN 16
4914 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance ... 9.0 English USA PG-13 NaN 2012.0 719.0 6.3 2.35 660
4915 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary ... 84.0 English USA PG 1100.0 2004.0 23.0 6.6 1.85 456

4916 rows × 28 columns

(4) 이 데이터프레임에는 몇개의 컬럼이 있는지 count하라.

len(list(df.keys()))
28

(5) 데이터프레임의 컬럼이름이 c혹은 d로 시작하는 열은 몇개 있는지 세어보라.

sum([(l[0] == 'c') or (l[0] == 'd') for l in list(df.keys())])
7

(6) c 혹은 d로 시작하는 열을 출력하라.

df.loc[:,[(l[0] == 'c') or (l[0] == 'd') for l in list(df.keys())]]
color director_name duration director_facebook_likes cast_total_facebook_likes country content_rating
0 Color James Cameron 178.0 0.0 4834 USA PG-13
1 Color Gore Verbinski 169.0 563.0 48350 USA PG-13
2 Color Sam Mendes 148.0 0.0 11700 UK PG-13
3 Color Christopher Nolan 164.0 22000.0 106759 USA PG-13
4 NaN Doug Walker NaN 131.0 143 NaN NaN
... ... ... ... ... ... ... ...
4911 Color Scott Smith 87.0 2.0 2283 Canada NaN
4912 Color NaN 43.0 NaN 1753 USA TV-14
4913 Color Benjamin Roberds 76.0 0.0 0 USA NaN
4914 Color Daniel Hsia 100.0 0.0 2386 USA PG-13
4915 Color Jon Gunn 90.0 16.0 163 USA PG

4916 rows × 7 columns

(7) 이 데이터프레임에서 ’actor’라는 단어가 포함된 열의 출력하라.

df.loc[:,['actor' in l for l in df.columns]]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

(8) 이 데이터프레임에서 ’_’ 가 포함된 열을 출력하라.

df.loc[:,['_' in l for l in df.columns]]
director_name num_critic_for_reviews director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name movie_title num_voted_users cast_total_facebook_likes ... facenumber_in_poster plot_keywords movie_imdb_link num_user_for_reviews content_rating title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 James Cameron 723.0 0.0 855.0 Joel David Moore 1000.0 CCH Pounder Avatar 886204 4834 ... 0.0 avatar|future|marine|native|paraplegic http://www.imdb.com/title/tt0499549/?ref_=fn_t... 3054.0 PG-13 2009.0 936.0 7.9 1.78 33000
1 Gore Verbinski 302.0 563.0 1000.0 Orlando Bloom 40000.0 Johnny Depp Pirates of the Caribbean: At World's End 471220 48350 ... 0.0 goddess|marriage ceremony|marriage proposal|pi... http://www.imdb.com/title/tt0449088/?ref_=fn_t... 1238.0 PG-13 2007.0 5000.0 7.1 2.35 0
2 Sam Mendes 602.0 0.0 161.0 Rory Kinnear 11000.0 Christoph Waltz Spectre 275868 11700 ... 1.0 bomb|espionage|sequel|spy|terrorist http://www.imdb.com/title/tt2379713/?ref_=fn_t... 994.0 PG-13 2015.0 393.0 6.8 2.35 85000
3 Christopher Nolan 813.0 22000.0 23000.0 Christian Bale 27000.0 Tom Hardy The Dark Knight Rises 1144337 106759 ... 0.0 deception|imprisonment|lawlessness|police offi... http://www.imdb.com/title/tt1345836/?ref_=fn_t... 2701.0 PG-13 2012.0 23000.0 8.5 2.35 164000
4 Doug Walker NaN 131.0 NaN Rob Walker 131.0 Doug Walker Star Wars: Episode VII - The Force Awakens 8 143 ... 0.0 NaN http://www.imdb.com/title/tt5289954/?ref_=fn_t... NaN NaN NaN 12.0 7.1 NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4911 Scott Smith 1.0 2.0 318.0 Daphne Zuniga 637.0 Eric Mabius Signed Sealed Delivered 629 2283 ... 2.0 fraud|postal worker|prison|theft|trial http://www.imdb.com/title/tt3000844/?ref_=fn_t... 6.0 NaN 2013.0 470.0 7.7 NaN 84
4912 NaN 43.0 NaN 319.0 Valorie Curry 841.0 Natalie Zea The Following 73839 1753 ... 1.0 cult|fbi|hideout|prison escape|serial killer http://www.imdb.com/title/tt2071645/?ref_=fn_t... 359.0 TV-14 NaN 593.0 7.5 16.00 32000
4913 Benjamin Roberds 13.0 0.0 0.0 Maxwell Moody 0.0 Eva Boehnke A Plague So Pleasant 38 0 ... 0.0 NaN http://www.imdb.com/title/tt2107644/?ref_=fn_t... 3.0 NaN 2013.0 0.0 6.3 NaN 16
4914 Daniel Hsia 14.0 0.0 489.0 Daniel Henney 946.0 Alan Ruck Shanghai Calling 1255 2386 ... 5.0 NaN http://www.imdb.com/title/tt2070597/?ref_=fn_t... 9.0 PG-13 2012.0 719.0 6.3 2.35 660
4915 Jon Gunn 43.0 16.0 16.0 Brian Herzlinger 86.0 John August My Date with Drew 4285 163 ... 0.0 actress name in title|crush|date|four word tit... http://www.imdb.com/title/tt0378407/?ref_=fn_t... 84.0 PG 2004.0 23.0 6.6 1.85 456

4916 rows × 21 columns