강의영상

- (1/1) rpy2

- (2/7) 데이터를 읽어오는법, 기본산점도

- (3/7) 산점도응용 (3차원)

- (4/7) 산점도응용 (4차원)

- (5/7) 판다스에서 열을 선택하는 방법

- (6/7) 판다스에서 lambda와 map을 이용하여 열을 선택하는 방법

- (7/7) 판다스에서 lambda와 map을 이용하여 열을 선택하는 방법 (2)

해들리위컴 그래프레이어

import

import pandas as pd 
from plotnine import * 

data

rpy2

import rpy2 
%load_ext rpy2.ipython 
%%R 
### 여기는 R처럼 쓸 수 있다. 
a<-c(1,2,3) 
a+1
[1] 2 3 4
a
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_2170811/2167009006.py in <module>
----> 1 a

NameError: name 'a' is not defined
%%R 
library(tidyverse)
mpg
R[write to console]: ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

R[write to console]: ✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.3     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

R[write to console]: ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# … with 224 more rows
mpg
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_2170811/602803287.py in <module>
----> 1 mpg

NameError: name 'mpg' is not defined
%R -o mpg # R에 있던 자료가 파이썬으로 넘어옴 
mpg
manufacturer model displ year cyl trans drv cty hwy fl class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
230 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
231 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
232 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
233 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
234 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 11 columns

저장된 파일을 통하여 데이터를 확보

mpg.to_csv("mpg.csv")
pd.read_csv("mpg.csv")
# mpg = pd.read_csv("mpg.csv")
Unnamed: 0 manufacturer model displ year cyl trans drv cty hwy fl class
0 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ... ...
229 230 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 231 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 232 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 233 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 234 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 12 columns

- 무언가 잘못되었다?

- 다시 저장하자.

mpg.to_csv("mpg.csv",index=False)
pd.read_csv("mpg.csv")
# mpg=pd.read_csv("mpg.csv")
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 11 columns

  • 제대로 불러졌음

github등에 공개된 csv를 읽어오기

pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/mpg.csv')
# mpg=pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/mpg.csv')
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 11 columns

- 깃허브 저장소에 아예 데이터만 따로 모아서 관리하는 것도 좋은 방법입니다.

data 설명

- displ: 자동차의 엔진크기

- hwy: 연료의 효율, 동일한 연료로 얼마나 멀리 가느냐?

- 자세한 설명은 R에서 ?mpg로 알아볼것

기본산점도 (2차원)

ggplot(data = mpg) + geom_point(mapping = aes(x = "displ", y = "hwy")) ## plotnine 
<ggplot: (8784362403904)>
  • 산점도: 엔진크기와 연료효율은 반비례. (엔진이 큰 차일수록 연비가 좋지 않다)

- ggplot2를 이용한 산점도

%%R -w 800
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) 
## 진짜 ggplot에서 그릴때에는 변수이름에 "" 를 제거함

- 객체지향적인 느낌으로 산점도 그리기

step1: 도화지를 준비한다

fig=ggplot(data=mpg) 
fig
<ggplot: (8784356495638)>

step2: 변수와 에스테틱사이의 맵핑을 설정한다.

a1=aes(x='displ',y='hwy') 
a1
{'x': 'displ', 'y': 'hwy'}

step3: 점들의 집합을 만든다. 즉 포인트지옴을 만든다.

point1=geom_point(mapping=a1) 
  • geom_point(): 점들을 그려! 어떻게?
  • a1에서 설정된 표를 보고

step4: 도화지와 지옴을 합친다.

fig+point1
<ggplot: (8784356403231)>

- 빠르게 그리기: mapping =data=는 생략가능함

ggplot(mpg) + geom_point(aes(x = "displ", y = "hwy")) ## plotnine 
<ggplot: (8784356355518)>

산점도응용 (3차원)

- 데이터를 다시관찰

mpg.head()
manufacturer model displ year cyl trans drv cty hwy fl class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact

- class도 함께 plot에 표시하면 데이터를 탐색할때 좀 더 좋을것 같다.

산점도 + 점크기변경

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',size= 'class'))
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
<ggplot: (8784356327762)>

산점도 + 투명도변경

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',alpha= 'class'))
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_alpha.py:68: PlotnineWarning: Using alpha for a discrete variable is not advised.
<ggplot: (8784356297638)>
ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',size= 'class',alpha='class'))
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_alpha.py:68: PlotnineWarning: Using alpha for a discrete variable is not advised.
<ggplot: (8784356241740)>

산점도 + 형태

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',shape='class'))
<ggplot: (8784356201093)>

산점도 + 색깔

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',color='class'))
<ggplot: (8784356360406)>

- 객체지향적으로?

a2=aes(x='displ',y='hwy',color='class')
a1,a2
({'x': 'displ', 'y': 'hwy'}, {'x': 'displ', 'y': 'hwy', 'color': 'class'})
point2=geom_point(a2)
fig+point2
<ggplot: (8784356129871)>

지옴을 더 추가 (적합선)

fig+point1
<ggplot: (8784181694823)>
sline1=geom_smooth(a1)
fig+point1+sline1
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784181645499)>
fig+point2+sline1
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784181617902)>

- 명령어로 한번에 그리기

ggplot(data=mpg)+geom_point(mapping=aes(x='displ',y='hwy',color='class'))+geom_smooth(mapping=aes(x='displ',y='hwy'))
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784181506021)>

- 공통적인 맵핑규칙은 ggplot()쪽으로 빼기도 한다. (figure를 선언하는 곳에서 공통으로 선언함)

ggplot(data=mpg,mapping=aes(x='displ',y='hwy'))+geom_point(mapping=aes(color='class'))+geom_smooth()
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784181519761)>

- R에서는 confidence interval도 geom_smooth()를 이용하여 확인할 수 있다.

%%R -w 800
ggplot(data=mpg,mapping=aes(x=displ,y=hwy))+geom_point(mapping=aes(color=class))+geom_smooth()
R[write to console]: `geom_smooth()` using method = 'loess' and formula 'y ~ x'

산점도응용2 (4차원)

- 데이터를 살펴보자.

mpg.head()
manufacturer model displ year cyl trans drv cty hwy fl class
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
3 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
4 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact

- drv (전륜, 후륜, 4륜 구동)에 따라서 데이터를 시각화 하고 싶다.

ggplot(data=mpg,mapping=aes(x='displ',y='hwy'))+geom_point(mapping=aes(size='class',color='drv'),alpha=0.2)
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
<ggplot: (8784181695124)>
  • 모든 $x$에 대하여 붉은색 점들이 대부분 초록선과 보라색 점들에 비하여 아래쪽에 위치하여 있음 $\to$ 4륜구동방식이 연비가 좋지 않음

- 객체지향적

a1,a2
({'x': 'displ', 'y': 'hwy'}, {'x': 'displ', 'y': 'hwy', 'color': 'class'})
a3=a2.copy()
id(a1),id(a2),id(a3)
(140549703310128, 140549698707184, 140549699928368)
a1,a2,a3
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'})
a3['color']='drv'
a3['size']='class'
a1,a2,a3
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'})
  • 아래와 같이 선언해도 괜찮음
    a3=aes(x='displ',y='hwy',color='drv',size='class')
    
point3=geom_point(a3)
fig+point3
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
<ggplot: (8784181514132)>
  • 앗 투명도 조절
point3=geom_point(a3,alpha=0.2)
fig+point3
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
<ggplot: (8784356384317)>

- 여기에 선을 추가하여 보자.

fig+point3+sline1
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784356334797)>

- 각 그룹별로 선을 따로 그릴수도 있을까?

a1,a2,a3
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'})
a4=a2.copy()
a4['color']='drv'
sline2=geom_smooth(a4)
fig+sline2+point3
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784356384500)>

- 선의 색깔을 동일하게 하고 선의 타입을 변경하여 그룹을 표시할수도 있지 않을까?

a1,a2,a3,a4
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv'})
a5=a1.copy()
a5['linetype']='drv'
a5
{'x': 'displ', 'y': 'hwy', 'linetype': 'drv'}
sline3=geom_smooth(a5,size=0.5,color='gray')
fig+point3+sline3
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784176884764)>
fig+point3+sline3+sline1
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784176769018)>

- 그래도 색깔로 구분하는것이 나은것 같다.

sline2=geom_smooth(a4,size=0.5,linetype='dashed')
fig+point3+sline2+sline1
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/scales/scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8784176718963)>

- 고차원의 변수를 표현할 수 있는 무기는 다양하다.

  • 산점도(포인트지옴): 점의크기, 점의형태, 점의색깔, 점의투명도
  • 라인플랏(스무스지옴, 라인지옴): 선의형태, 선의색깔, 선의굵기

결론

- 잘 훈련한다면 여러가지 형태의 고차원 그래프를 우리도 그릴 수 있다. (마치 미나드처럼)

- 해들리위컴은 이러한 방법을 체계적으로 정리했다고 보여진다.

- 해들리위컴: 그래프는 데이터 + 지옴 + 맵핑(변수와 에스테틱간의 맵핑) + 스탯(통계) + 포지션 + 축 + 패싯그리드 7개의 조합으로 그릴수 있다.

  • 내생각: 지옴과 맵핑만 잘 이용해도 아주 다양한 그래프를 그릴 수 있음.

판다스에서 column을 선택하는 방법

import

예제1

import numpy as np
dic={'X1':np.random.normal(0,1,5),
     'X2':np.random.normal(0,1,5),
     'X3':np.random.normal(0,1,5)}
df=pd.DataFrame(dic)
df
X1 X2 X3
0 1.364717 0.775949 1.685477
1 0.853549 -0.401712 -1.285741
2 -0.383503 -0.950388 0.028700
3 -0.639761 -0.384636 2.567487
4 -0.742307 -1.738164 1.022246

- 방법1

df.X1
0    1.364717
1    0.853549
2   -0.383503
3   -0.639761
4   -0.742307
Name: X1, dtype: float64

- 방법2

df['X1']
0    1.364717
1    0.853549
2   -0.383503
3   -0.639761
4   -0.742307
Name: X1, dtype: float64

- 방법3

df[['X1']]
X1
0 1.364717
1 0.853549
2 -0.383503
3 -0.639761
4 -0.742307
  • df['X1']는 series를 리턴하고 df[['X1']]는 dataframe을 리턴한다.

- 방법4

df.loc[:,'X1']
0    1.364717
1    0.853549
2   -0.383503
3   -0.639761
4   -0.742307
Name: X1, dtype: float64

- 방법5

df.loc[:,['X1']]
X1
0 1.364717
1 0.853549
2 -0.383503
3 -0.639761
4 -0.742307
  • df.loc[:,'X1']는 series를 리턴하고 df.loc[:,['X1']]는 dataframe을 리턴한다.

- 방법6

df.loc[:,[True,False,False]] 
X1
0 1.364717
1 0.853549
2 -0.383503
3 -0.639761
4 -0.742307
  • 불인덱싱가능

- 방법7

df.iloc[:,0]
0    1.364717
1    0.853549
2   -0.383503
3   -0.639761
4   -0.742307
Name: X1, dtype: float64

- 방법8

df.iloc[:,[0]]
X1
0 1.364717
1 0.853549
2 -0.383503
3 -0.639761
4 -0.742307

- 방법9

df.iloc[:,[True,False,False]]
X1
0 1.364717
1 0.853549
2 -0.383503
3 -0.639761
4 -0.742307

참고사항: 열이름이 interger일 경우

import numpy as np 
_df = pd.DataFrame(np.array([[1,2,3],[3,4,5],[5,6,7]])) 
_df
0 1 2
0 1 2 3
1 3 4 5
2 5 6 7

- 아래가 모두 가능하다.

_df[0]
0    1
1    3
2    5
Name: 0, dtype: int64
_df[[0]]
0
0 1
1 3
2 5
_df.loc[:,0]
0    1
1    3
2    5
Name: 0, dtype: int64
_df.loc[:,[0]]
0
0 1
1 3
2 5
_df.iloc[:,0]
0    1
1    3
2    5
Name: 0, dtype: int64

data

_df.iloc[:,[0]]
0
0 1
1 3
2 5

방법1~9의 요약 (제 생각)

- df.X1로 열을 선택하는게 간단하고 편리함.

  • 단점1: 변수이름을 알고 있어야 한다는 단점이 있음.
  • 단점2: 변수이름에 .이 있거나 변수이름에서 공백이 있을경우 사용할 수 없음.

- 언급한 단점의 예시

dic={'X.1':np.random.normal(0,1,5),
     'X.2':np.random.normal(0,1,5),
     'X.3':np.random.normal(0,1,5)}
_df=pd.DataFrame(dic)
_df
X.1 X.2 X.3
0 -0.674886 0.248970 -0.168482
1 0.217904 -0.633454 0.261744
2 -0.338874 0.876894 -2.196765
3 2.476408 0.901102 0.199516
4 0.710193 -1.691488 -0.907115
_df['X.1']
0   -0.674886
1    0.217904
2   -0.338874
3    2.476408
4    0.710193
Name: X.1, dtype: float64
df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')
_df.X.1
  File "/tmp/ipykernel_2170811/58401248.py", line 1
    _df.X.1
         ^
SyntaxError: invalid syntax

예제2: 여러개의 열을 선택

- 데이터

dic={'X1':np.random.normal(0,1,5),
     'X2':np.random.normal(0,1,5),
     'X3':np.random.normal(0,1,5),
     'X4':np.random.normal(0,1,5)}
df=pd.DataFrame(dic)
df
X1 X2 X3 X4
0 1.408453 -0.170602 -1.454650 -0.047064
1 1.093401 1.108376 -0.746791 -0.037699
2 -0.320964 -1.183577 1.228422 0.415494
3 -1.460182 -0.912314 -0.262836 -1.569006
4 0.160380 0.824893 1.442562 -0.450621

- 목표: 1,2,3열을 선택

- 방법1

df[['X1','X2','X3']]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

- 방법2

df.loc[:,['X1','X2','X3']]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

- 방법3

df.loc[:,'X1':'X3'] 
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

- 방법4

df.loc[:,[True,True,True,False]]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

- 방법5

df.iloc[:,[0,1,2]]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

- 방법6

df.iloc[:,:3]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562
df.iloc[:,0:3]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562
df.iloc[:,range(3)]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

- 방법7

df.iloc[:,[True,True,True,False]]
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

(주의) loc에서의 슬라이싱은 마지막변수를 포함하지만 iloc에서는 포함하지 않음

- 아래를 비교하라.

df.iloc[:,0:3] ## 0,1,2,3중 3은 포함되지 않는다.
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562
df.loc[:,'X1':'X3'] ## 'X3'도 포함된다. 
X1 X2 X3
0 1.408453 -0.170602 -1.454650
1 1.093401 1.108376 -0.746791
2 -0.320964 -1.183577 1.228422
3 -1.460182 -0.912314 -0.262836
4 0.160380 0.824893 1.442562

- 그래서 column의 이름이 integer일 경우는 종종 매우 헷갈리는 일이 일어남

_df = pd.DataFrame(np.array([[1,2,3,4],[3,4,5,6],[5,6,7,8]]))
_df
0 1 2 3
0 1 2 3 4
1 3 4 5 6
2 5 6 7 8
_df.loc[:,0:2]
0 1 2
0 1 2 3
1 3 4 5
2 5 6 7
_df.iloc[:,0:2]
0 1
0 1 2
1 3 4
2 5 6

Note: 사실 이것은 일부러 헷갈리게 예제를 구성한 것이다. 실제로는 헷갈리는 상황이 그렇게 자주 발생하지 않는다. 왜냐하면 보통 위와 같은 형태의 자료는 ndarray로 처리하고 colname이 있는 경우만 데이터프레임으로 처리하기 때문.

예제3: movie data - 특정조건에 맞는 열을 선택

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')
df
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4911 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama ... 6.0 English Canada NaN NaN 2013.0 470.0 7.7 NaN 84
4912 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller ... 359.0 English USA TV-14 NaN NaN 593.0 7.5 16.00 32000
4913 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller ... 3.0 English USA NaN 1400.0 2013.0 0.0 6.3 NaN 16
4914 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance ... 9.0 English USA PG-13 NaN 2012.0 719.0 6.3 2.35 660
4915 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary ... 84.0 English USA PG 1100.0 2004.0 23.0 6.6 1.85 456

4916 rows × 28 columns

- 열의 이름을 출력하여 보자.

df.columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

- color ~ num_voted_user 를 뽑고 + aspect_ratio 도 추가적으로 뽑고싶다.

df.loc[:,['color':'num_voted_users','aspect_ratio']]
  File "/tmp/ipykernel_2170811/1210972629.py", line 1
    df.loc[:,['color':'num_voted_users','aspect_ratio']]
                     ^
SyntaxError: invalid syntax

- (팁) 복잡한 조건은 iloc으로 쓰는게 편할때가 있다. $\to$ 그런데 df.columns 변수들이 몇번인지 알아보기 힘듬 $\to$ 아래와 같이 하면 열의 이름을 인덱스와 함께 출력할 수 있음

pd.Series(df.columns)
0                         color
1                 director_name
2        num_critic_for_reviews
3                      duration
4       director_facebook_likes
5        actor_3_facebook_likes
6                  actor_2_name
7        actor_1_facebook_likes
8                         gross
9                        genres
10                 actor_1_name
11                  movie_title
12              num_voted_users
13    cast_total_facebook_likes
14                 actor_3_name
15         facenumber_in_poster
16                plot_keywords
17              movie_imdb_link
18         num_user_for_reviews
19                     language
20                      country
21               content_rating
22                       budget
23                   title_year
24       actor_2_facebook_likes
25                   imdb_score
26                 aspect_ratio
27         movie_facebook_likes
dtype: object
list(range(13))+[26]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 26]
df.iloc[:,list(range(13))+[26]] 
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres actor_1_name movie_title num_voted_users aspect_ratio
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 1.78
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 2.35
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller Christoph Waltz Spectre 275868 2.35
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 2.35
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary Doug Walker Star Wars: Episode VII - The Force Awakens 8 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4911 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama Eric Mabius Signed Sealed Delivered 629 NaN
4912 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller Natalie Zea The Following 73839 16.00
4913 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller Eva Boehnke A Plague So Pleasant 38 NaN
4914 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance Alan Ruck Shanghai Calling 1255 2.35
4915 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary John August My Date with Drew 4285 1.85

4916 rows × 14 columns

- 다시열의 이름들을 확인

df.columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

actor라는 단어가 포함된 변수들만 뽑고싶다.

- 방법1

df.iloc[:,list(map(lambda x : 'actor' in x, df.columns) )]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

- 방법2

df.loc[:,list(map(lambda x : 'actor' in x, df.columns) )]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

- 방법3

df.iloc[:,map(lambda x : 'actor' in x, df.columns)]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

- 방법4

df.loc[:,map(lambda x : 'actor' in x, df.columns)]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

- 방법5

df.loc[:,filter(lambda x : 'actor' in x, df.columns)]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

변수이름이 s로 끝나는 변수들만 뽑고싶다.

df.iloc[:,map(lambda x: 's' == x[-1],df.columns )]
num_critic_for_reviews director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross genres num_voted_users cast_total_facebook_likes plot_keywords num_user_for_reviews actor_2_facebook_likes movie_facebook_likes
0 723.0 0.0 855.0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi 886204 4834 avatar|future|marine|native|paraplegic 3054.0 936.0 33000
1 302.0 563.0 1000.0 40000.0 309404152.0 Action|Adventure|Fantasy 471220 48350 goddess|marriage ceremony|marriage proposal|pi... 1238.0 5000.0 0
2 602.0 0.0 161.0 11000.0 200074175.0 Action|Adventure|Thriller 275868 11700 bomb|espionage|sequel|spy|terrorist 994.0 393.0 85000
3 813.0 22000.0 23000.0 27000.0 448130642.0 Action|Thriller 1144337 106759 deception|imprisonment|lawlessness|police offi... 2701.0 23000.0 164000
4 NaN 131.0 NaN 131.0 NaN Documentary 8 143 NaN NaN 12.0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
4911 1.0 2.0 318.0 637.0 NaN Comedy|Drama 629 2283 fraud|postal worker|prison|theft|trial 6.0 470.0 84
4912 43.0 NaN 319.0 841.0 NaN Crime|Drama|Mystery|Thriller 73839 1753 cult|fbi|hideout|prison escape|serial killer 359.0 593.0 32000
4913 13.0 0.0 0.0 0.0 NaN Drama|Horror|Thriller 38 0 NaN 3.0 0.0 16
4914 14.0 0.0 489.0 946.0 10443.0 Comedy|Drama|Romance 1255 2386 NaN 9.0 719.0 660
4915 43.0 16.0 16.0 86.0 85222.0 Documentary 4285 163 actress name in title|crush|date|four word tit... 84.0 23.0 456

4916 rows × 12 columns

df.loc[:,map(lambda x: 's' == x[-1],df.columns )]
num_critic_for_reviews director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross genres num_voted_users cast_total_facebook_likes plot_keywords num_user_for_reviews actor_2_facebook_likes movie_facebook_likes
0 723.0 0.0 855.0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi 886204 4834 avatar|future|marine|native|paraplegic 3054.0 936.0 33000
1 302.0 563.0 1000.0 40000.0 309404152.0 Action|Adventure|Fantasy 471220 48350 goddess|marriage ceremony|marriage proposal|pi... 1238.0 5000.0 0
2 602.0 0.0 161.0 11000.0 200074175.0 Action|Adventure|Thriller 275868 11700 bomb|espionage|sequel|spy|terrorist 994.0 393.0 85000
3 813.0 22000.0 23000.0 27000.0 448130642.0 Action|Thriller 1144337 106759 deception|imprisonment|lawlessness|police offi... 2701.0 23000.0 164000
4 NaN 131.0 NaN 131.0 NaN Documentary 8 143 NaN NaN 12.0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
4911 1.0 2.0 318.0 637.0 NaN Comedy|Drama 629 2283 fraud|postal worker|prison|theft|trial 6.0 470.0 84
4912 43.0 NaN 319.0 841.0 NaN Crime|Drama|Mystery|Thriller 73839 1753 cult|fbi|hideout|prison escape|serial killer 359.0 593.0 32000
4913 13.0 0.0 0.0 0.0 NaN Drama|Horror|Thriller 38 0 NaN 3.0 0.0 16
4914 14.0 0.0 489.0 946.0 10443.0 Comedy|Drama|Romance 1255 2386 NaN 9.0 719.0 660
4915 43.0 16.0 16.0 86.0 85222.0 Documentary 4285 163 actress name in title|crush|date|four word tit... 84.0 23.0 456

4916 rows × 12 columns

변수이름이 c 혹은 d로 시작하는 변수들만 뽑고싶다.

df.iloc[:,map(lambda x: 'c' == x[0] or 'd' == x[0] ,df.columns )]
color director_name duration director_facebook_likes cast_total_facebook_likes country content_rating
0 Color James Cameron 178.0 0.0 4834 USA PG-13
1 Color Gore Verbinski 169.0 563.0 48350 USA PG-13
2 Color Sam Mendes 148.0 0.0 11700 UK PG-13
3 Color Christopher Nolan 164.0 22000.0 106759 USA PG-13
4 NaN Doug Walker NaN 131.0 143 NaN NaN
... ... ... ... ... ... ... ...
4911 Color Scott Smith 87.0 2.0 2283 Canada NaN
4912 Color NaN 43.0 NaN 1753 USA TV-14
4913 Color Benjamin Roberds 76.0 0.0 0 USA NaN
4914 Color Daniel Hsia 100.0 0.0 2386 USA PG-13
4915 Color Jon Gunn 90.0 16.0 163 USA PG

4916 rows × 7 columns

숙제

movie data frame에서 'face'라는 단어가 포함된 변수열을 선택하라.