Lesson 07: pandas I

Author

최규빈

Published

July 24, 2023

imports

import numpy as np
import pandas as pd

Pandas: 선언, 열의이름확인

dct = {'mid':[80,60,80,20],'fin':[40,100,0,70]}
df = pd.DataFrame(dct)
df

	mid	fin
0	80	40
1	60	100
2	80	0
3	20	70

Pandas: 행과 열의 선택

- 데이터

np.random.seed(43052)
att = np.random.choice(np.arange(10,21)*5,20)
rep = np.random.choice(np.arange(5,21)*5,20)
mid = np.random.choice(np.arange(0,21)*5,20)
fin = np.random.choice(np.arange(0,21)*5,20)
student_id = ['2022-12'+str(s) for s in np.random.choice(np.arange(300,501),20,replace=False)]

df = pd.DataFrame({'att':att,'rep':rep,'mid':mid,'fin':fin})
df.head()

	att	rep	mid	fin
0	65	55	50	40
1	95	100	50	80
2	65	90	60	30
3	55	80	75	80
4	80	30	30	100

df2 = pd.DataFrame({'id':student_id,'att':att,'rep':rep,'mid':mid,'fin':fin})
df2.head()

	id	att	rep	mid	fin
0	2022-12380	65	55	50	40
1	2022-12370	95	100	50	80
2	2022-12363	65	90	60	30
3	2022-12488	55	80	75	80
4	2022-12312	80	30	30	100

열의 선택

- 방법1: df.? + str

- 방법2: df[?] + str, [str,str]

# df['att'] # str 
# df[['att']] # [str]
# df[['att','rep']] # [str,str]

- 방법3: df.iloc[:,?] + int, int:int, [int,int], [bool,bool], range

# df.iloc[:,0] # int
# df.iloc[:,-2:] # int:int - 슬라이싱
# df.iloc[:,1::2] # int:int - 스트라이딩
# df.iloc[:,[0]] # [int]
# df.iloc[:,[0,1]] # [int,int]
# df.iloc[:,[True,True,False,False]] # bool의 list 
# df.iloc[:,range(2)] # range

- 방법4: df.loc[:,?] + str, ‘str:str’, [str,str], [bool,bool]

# df.loc[:,'att'] # str
# df.loc[:,'rep':'mid'] # 'str':'str' -- 칼럼이름으로 슬라이싱 **
# df.loc[:,'rep'::2] # 'str':'str' -- 칼럼이름으로 스트라이딩 ** 
# df.loc[:,['att']] # [str]
# df.loc[:,['att','rep']] # [str,str]
# df.loc[:,[True,False,False,True]] # bool의 list

행의 선택

- 방법1: df[] + int:int, str:str, [bool,bool], pd.Series([bool,bool]) – \((\star\star\star\star\star)\)

# df[:2] # int:int -- 슬라이싱 // df.iloc[:2,:], df.iloc[:2] 와 같음
# df[0:5:2] # int:int -- 스트라이딩 
# df[:'2022-12312'] # str:str -- 슬라이싱 // df.loc[:'2022-12312'], df.loc[:'2022-12312',:] 와 같음
# df[:'2022-12312':2] # str:str -- 스트라이딩
# df[list(df.att < 70)] # [bool,bool]
# df[df.att < 70] # pd.Series([bool,bool])

- 방법2: df.iloc[], df.iloc[,:] + int, int:int, [int,int], [bool,bool], range

# df.iloc[0] # int 
# df.iloc[-2:] # int:int -- 슬라이싱
# df.iloc[1::2] # int:int -- 스트라이딩
# df.iloc[[0]] # [int]
# df.iloc[[0,1]] # [int,int]
# df.iloc[[True]+[False]*19] # [bool,bool]
# df.iloc[range(2)] # range

# df.iloc[0,:] # int 
# df.iloc[-2:,:] # int:int -- 슬라이싱
# df.iloc[1::2,:] # int:int -- 스트라이딩
# df.iloc[[0],:] # [int]
# df.iloc[[0,1],:] # [int,int]
# df.iloc[[True]+[False]*19,:] # [bool,bool]
# df.iloc[range(2),:] # range

- 방법3: df.loc[], df.loc[,:] + int, str, int:int, str:str, [int,int], [str,str], [bool,bool], pd.Series([bool,bool])

# df2.loc[0] # int 
# df.loc['2022-12380'] # str 
# df2.loc[:2] # int:int 
# df.loc[:'2022-12363'] # str:str 
# df2.loc[[0,1]] # [int,int]
# df.loc[['2022-12380','2022-12370']] # [str,str]
# df.loc[[True]+[False]*19] # [bool,bool]
# df.loc[df.att>70] # pd.Series([bool,bool])

# df2.loc[0,:] # int 
# df.loc['2022-12380',:] # str 
# df2.loc[:2,:] # int:int 
# df.loc[:'2022-12363',:] # str:str 
# df2.loc[[0,1],:] # [int,int]
# df.loc[['2022-12380','2022-12370'],:] # [str,str]
# df.loc[[True]+[False]*19,:] # [bool,bool]
# df.loc[df.att>70,:] # pd.Series([bool,bool])

요약

- 알아두면 좋은 규칙

.iloc[] 와 .iloc[,:]는 완전히 동등하다.
.loc[] 와 .loc[,:]는 완전히 동등하다.

- 정리

type of indexer	target	`.`	`[]`	`.iloc`	`.loc`
int	row	X	X	O	\(\Delta\)
int	col	X	X	O	X
str	row	X	X	X	O
str	col	O	O	X	O
int:int	row	X	O	O	\(\Delta\)
int:int	col	X	X	O	X
str:str	row	X	O	X	O
str:str	col	X	X	X	O
[int,int]	row	X	X	O	\(\Delta\)
[int,int]	col	X	X	O	X
[str,str]	row	X	X	X	O
[str,str]	col	X	O	X	O
[bool,bool]	row	X	O	O	O
[bool,bool]	col	X	X	O	O
pd.Series([bool,bool])	row	X	O	X	O

Pandas: 제가 선호하는 행과 열의 선택

제 스타일

- 가장 안전한 코드

# df.loc[:,:]

- 상황1: 하나의 col을 뽑으려 할때 좋은 코드

# df.att # 최애 
# df['att'] # 차애 
# df[['att']] # 차애의 대안

- 상황2: row 슬라이싱을 할때 좋은 코드 \((\star\star\star)\)

# df[:5] # 최애, df.iloc[:5]와 같다. 
# df[:'2022-12312'] # 차애, df.loc[:'2022-12312'] 와 같다

- 상황3: 조건에 맞는 row를 뽑을때 좋은 코드

# df[df.att<60] # 최애
# df[[att<60 for att in df.att]] # 차애

- 상황4: 하나의 row를 뽑으려 할때 좋은 코드

# df.loc[0] # 최애
# df.iloc[0] # 최애 
# df.loc[[0]] # 최애의 대안
# df.iloc[[0]] # 최애의 대안

- 상황5: (row,col)을 뽑으려 할때 좋은 코드

# 방법1: pd.Series를 뽑고 -> 인덱스로접근
# df.att[0]
# df['att'][0]

# 방법2: iloc, loc 으로 한번에 뽑기
# df.iloc[0,0]
# df.loc[0,'att']

위의 상황이외에는 df.loc[:,:]를 사용하는것이 유리하다

- 상황6: column 슬라이싱을 할때

# df.loc[:,'att':'mid'] # 끝점포함

- 상황7: row + column 슬라이싱을 하는 가장 좋은 코드

# df.loc[0:5,'att':'mid']

- 상황8: 조건에 맞는 col을 뽑기에 가장 좋은 코드

# df.loc[:,[len(col_name)>2 for col_name in df.columns]]

- 상황9: 조건에 맞는 row, col을 뽑기에 가장 좋은 코드

# df.loc[df.att<60,[len(col_name)>2 for col_name in df.columns]]

제 스타일 X

- 제가 안쓰는 코드1

df[:1]

	att	rep	mid	fin
0	65	55	50	40

이러면 내 입장에서는 마치 아래가 동작할 것 같잖아..

df[0]

KeyError: 0

- 제가 안쓰는 코드2: bool의 list를 사용할때 iloc은 가급적 쓰지마세요

df.iloc[list(df['att']<80),:]

	att	rep	mid	fin
0	65	55	50	40
2	65	90	60	30
3	55	80	75	80
5	75	40	100	15
6	65	45	45	90
7	60	60	25	0
10	55	75	35	25
13	50	80	40	30
14	50	55	15	85
16	50	50	45	10
17	65	55	15	45
18	70	70	40	35

이러면 마치 아래도 동작할 것 같잖아..

df.iloc[df['att']<80,:]

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

Quiz

(1)-(3) 아래와 같은 자료를 고려하자.

df = pd.DataFrame(np.random.normal(size=(100,5)),columns=list('ABCDE'))
df

	A	B	C	D	E
0	0.106173	0.723759	0.217990	0.194022	-0.688990
1	-0.351670	0.990933	1.212147	-0.608965	0.032549
2	-1.884089	-0.860866	0.552755	0.104294	-1.661620
3	-0.965230	1.898428	2.625531	-0.381609	-0.948021
4	0.453758	-0.716270	-0.167750	0.089546	0.874006
...	...	...	...	...	...
95	0.568935	-0.901952	0.781831	0.045797	-0.102197
96	0.111096	0.685877	-0.472303	-0.043043	-0.952074
97	2.206012	-0.375628	0.549381	0.576883	-1.280959
98	-1.125719	-0.021064	-0.018777	1.000606	0.772153
99	-0.158276	0.569409	-0.209664	0.326520	1.251615

100 rows × 5 columns

(1) B,D열을 선택하라.

df.loc[:,['B','C']]

	B	C
0	0.723759	0.217990
1	0.990933	1.212147
2	-0.860866	0.552755
3	1.898428	2.625531
4	-0.716270	-0.167750
...	...	...
95	-0.901952	0.781831
96	0.685877	-0.472303
97	-0.375628	0.549381
98	-0.021064	-0.018777
99	0.569409	-0.209664

100 rows × 2 columns

(2) 마지막 10개의 row를 출력하라.

df.iloc[-10:,:]

	A	B	C	D	E
90	-0.515706	-1.345645	1.310011	-2.072882	-0.200044
91	-2.488585	-1.760744	-0.790638	-2.568387	0.884457
92	-0.178529	-0.672677	0.811988	-1.062689	0.183336
93	-1.410621	0.876799	0.119850	0.295357	-0.918690
94	0.632102	-0.777506	-1.732188	0.016164	0.842943
95	0.568935	-0.901952	0.781831	0.045797	-0.102197
96	0.111096	0.685877	-0.472303	-0.043043	-0.952074
97	2.206012	-0.375628	0.549381	0.576883	-1.280959
98	-1.125719	-0.021064	-0.018777	1.000606	0.772153
99	-0.158276	0.569409	-0.209664	0.326520	1.251615

(3) A,B 열의 처음 10개의 row를 출력하라.

df.loc[:9,'A':'B']

	A	B
0	0.106173	0.723759
1	-0.351670	0.990933
2	-1.884089	-0.860866
3	-0.965230	1.898428
4	0.453758	-0.716270
5	0.401627	0.204911
6	-0.434388	-1.436772
7	0.937827	1.079232
8	0.117038	-0.350165
9	1.032189	0.404380

(4)-(8) 아래와 같은 자료를 고려하자.

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')
df

	color	director_name	num_critic_for_reviews	duration	director_facebook_likes	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	gross	genres	...	num_user_for_reviews	language	country	content_rating	budget	title_year	actor_2_facebook_likes	imdb_score	aspect_ratio	movie_facebook_likes
0	Color	James Cameron	723.0	178.0	0.0	855.0	Joel David Moore	1000.0	760505847.0	Action\|Adventure\|Fantasy\|Sci-Fi	...	3054.0	English	USA	PG-13	237000000.0	2009.0	936.0	7.9	1.78	33000
1	Color	Gore Verbinski	302.0	169.0	563.0	1000.0	Orlando Bloom	40000.0	309404152.0	Action\|Adventure\|Fantasy	...	1238.0	English	USA	PG-13	300000000.0	2007.0	5000.0	7.1	2.35	0
2	Color	Sam Mendes	602.0	148.0	0.0	161.0	Rory Kinnear	11000.0	200074175.0	Action\|Adventure\|Thriller	...	994.0	English	UK	PG-13	245000000.0	2015.0	393.0	6.8	2.35	85000
3	Color	Christopher Nolan	813.0	164.0	22000.0	23000.0	Christian Bale	27000.0	448130642.0	Action\|Thriller	...	2701.0	English	USA	PG-13	250000000.0	2012.0	23000.0	8.5	2.35	164000
4	NaN	Doug Walker	NaN	NaN	131.0	NaN	Rob Walker	131.0	NaN	Documentary	...	NaN	NaN	NaN	NaN	NaN	NaN	12.0	7.1	NaN	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4911	Color	Scott Smith	1.0	87.0	2.0	318.0	Daphne Zuniga	637.0	NaN	Comedy\|Drama	...	6.0	English	Canada	NaN	NaN	2013.0	470.0	7.7	NaN	84
4912	Color	NaN	43.0	43.0	NaN	319.0	Valorie Curry	841.0	NaN	Crime\|Drama\|Mystery\|Thriller	...	359.0	English	USA	TV-14	NaN	NaN	593.0	7.5	16.00	32000
4913	Color	Benjamin Roberds	13.0	76.0	0.0	0.0	Maxwell Moody	0.0	NaN	Drama\|Horror\|Thriller	...	3.0	English	USA	NaN	1400.0	2013.0	0.0	6.3	NaN	16
4914	Color	Daniel Hsia	14.0	100.0	0.0	489.0	Daniel Henney	946.0	10443.0	Comedy\|Drama\|Romance	...	9.0	English	USA	PG-13	NaN	2012.0	719.0	6.3	2.35	660
4915	Color	Jon Gunn	43.0	90.0	16.0	16.0	Brian Herzlinger	86.0	85222.0	Documentary	...	84.0	English	USA	PG	1100.0	2004.0	23.0	6.6	1.85	456

4916 rows × 28 columns

(4) 이 데이터프레임에는 몇개의 컬럼이 있는지 count하라.

len(list(df.keys()))

(5) 데이터프레임의 컬럼이름이 c혹은 d로 시작하는 열은 몇개 있는지 세어보라.

sum([(l[0] == 'c') or (l[0] == 'd') for l in list(df.keys())])

(6) c 혹은 d로 시작하는 열을 출력하라.

df.loc[:,[(l[0] == 'c') or (l[0] == 'd') for l in list(df.keys())]]

	color	director_name	duration	director_facebook_likes	cast_total_facebook_likes	country	content_rating
0	Color	James Cameron	178.0	0.0	4834	USA	PG-13
1	Color	Gore Verbinski	169.0	563.0	48350	USA	PG-13
2	Color	Sam Mendes	148.0	0.0	11700	UK	PG-13
3	Color	Christopher Nolan	164.0	22000.0	106759	USA	PG-13
4	NaN	Doug Walker	NaN	131.0	143	NaN	NaN
...	...	...	...	...	...	...	...
4911	Color	Scott Smith	87.0	2.0	2283	Canada	NaN
4912	Color	NaN	43.0	NaN	1753	USA	TV-14
4913	Color	Benjamin Roberds	76.0	0.0	0	USA	NaN
4914	Color	Daniel Hsia	100.0	0.0	2386	USA	PG-13
4915	Color	Jon Gunn	90.0	16.0	163	USA	PG

4916 rows × 7 columns

(7) 이 데이터프레임에서 ’actor’라는 단어가 포함된 열의 출력하라.

df.loc[:,['actor' in l for l in df.columns]]

	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	actor_1_name	actor_3_name	actor_2_facebook_likes
0	855.0	Joel David Moore	1000.0	CCH Pounder	Wes Studi	936.0
1	1000.0	Orlando Bloom	40000.0	Johnny Depp	Jack Davenport	5000.0
2	161.0	Rory Kinnear	11000.0	Christoph Waltz	Stephanie Sigman	393.0
3	23000.0	Christian Bale	27000.0	Tom Hardy	Joseph Gordon-Levitt	23000.0
4	NaN	Rob Walker	131.0	Doug Walker	NaN	12.0
...	...	...	...	...	...	...
4911	318.0	Daphne Zuniga	637.0	Eric Mabius	Crystal Lowe	470.0
4912	319.0	Valorie Curry	841.0	Natalie Zea	Sam Underwood	593.0
4913	0.0	Maxwell Moody	0.0	Eva Boehnke	David Chandler	0.0
4914	489.0	Daniel Henney	946.0	Alan Ruck	Eliza Coupe	719.0
4915	16.0	Brian Herzlinger	86.0	John August	Jon Gunn	23.0

4916 rows × 6 columns

(8) 이 데이터프레임에서 ’_’ 가 포함된 열을 출력하라.

df.loc[:,['_' in l for l in df.columns]]

	director_name	num_critic_for_reviews	director_facebook_likes	actor_3_facebook_likes	actor_2_name	actor_1_facebook_likes	actor_1_name	movie_title	num_voted_users	cast_total_facebook_likes	...	facenumber_in_poster	plot_keywords	movie_imdb_link	num_user_for_reviews	content_rating	title_year	actor_2_facebook_likes	imdb_score	aspect_ratio	movie_facebook_likes
0	James Cameron	723.0	0.0	855.0	Joel David Moore	1000.0	CCH Pounder	Avatar	886204	4834	...	0.0	avatar\|future\|marine\|native\|paraplegic	http://www.imdb.com/title/tt0499549/?ref_=fn_t...	3054.0	PG-13	2009.0	936.0	7.9	1.78	33000
1	Gore Verbinski	302.0	563.0	1000.0	Orlando Bloom	40000.0	Johnny Depp	Pirates of the Caribbean: At World's End	471220	48350	...	0.0	goddess\|marriage ceremony\|marriage proposal\|pi...	http://www.imdb.com/title/tt0449088/?ref_=fn_t...	1238.0	PG-13	2007.0	5000.0	7.1	2.35	0
2	Sam Mendes	602.0	0.0	161.0	Rory Kinnear	11000.0	Christoph Waltz	Spectre	275868	11700	...	1.0	bomb\|espionage\|sequel\|spy\|terrorist	http://www.imdb.com/title/tt2379713/?ref_=fn_t...	994.0	PG-13	2015.0	393.0	6.8	2.35	85000
3	Christopher Nolan	813.0	22000.0	23000.0	Christian Bale	27000.0	Tom Hardy	The Dark Knight Rises	1144337	106759	...	0.0	deception\|imprisonment\|lawlessness\|police offi...	http://www.imdb.com/title/tt1345836/?ref_=fn_t...	2701.0	PG-13	2012.0	23000.0	8.5	2.35	164000
4	Doug Walker	NaN	131.0	NaN	Rob Walker	131.0	Doug Walker	Star Wars: Episode VII - The Force Awakens	8	143	...	0.0	NaN	http://www.imdb.com/title/tt5289954/?ref_=fn_t...	NaN	NaN	NaN	12.0	7.1	NaN	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4911	Scott Smith	1.0	2.0	318.0	Daphne Zuniga	637.0	Eric Mabius	Signed Sealed Delivered	629	2283	...	2.0	fraud\|postal worker\|prison\|theft\|trial	http://www.imdb.com/title/tt3000844/?ref_=fn_t...	6.0	NaN	2013.0	470.0	7.7	NaN	84
4912	NaN	43.0	NaN	319.0	Valorie Curry	841.0	Natalie Zea	The Following	73839	1753	...	1.0	cult\|fbi\|hideout\|prison escape\|serial killer	http://www.imdb.com/title/tt2071645/?ref_=fn_t...	359.0	TV-14	NaN	593.0	7.5	16.00	32000
4913	Benjamin Roberds	13.0	0.0	0.0	Maxwell Moody	0.0	Eva Boehnke	A Plague So Pleasant	38	0	...	0.0	NaN	http://www.imdb.com/title/tt2107644/?ref_=fn_t...	3.0	NaN	2013.0	0.0	6.3	NaN	16
4914	Daniel Hsia	14.0	0.0	489.0	Daniel Henney	946.0	Alan Ruck	Shanghai Calling	1255	2386	...	5.0	NaN	http://www.imdb.com/title/tt2070597/?ref_=fn_t...	9.0	PG-13	2012.0	719.0	6.3	2.35	660
4915	Jon Gunn	43.0	16.0	16.0	Brian Herzlinger	86.0	John August	My Date with Drew	4285	163	...	0.0	actress name in title\|crush\|date\|four word tit...	http://www.imdb.com/title/tt0378407/?ref_=fn_t...	84.0	PG	2004.0	23.0	6.6	1.85	456

4916 rows × 21 columns