강의영상

- (1/11) Partial Correlation: 아이스크림을 많이 먹으면 걸리는 병

- (2/11) Partial Correlation: 기상청기온자료를 활용하여 가짜자료 구축 및 해석 (1)

- (3/11) Partial Correlation: 기상청기온자료를 활용하여 가짜자료 구축 및 해석 (2)

- (4/11) Partial Correlation: plotnine을 활용한 시각화

- (5/11) 행을선택하는 방법 (기본)

- (6/11) 행을선택하는 방법 (람다)

- (7/11) 행을선택하는 방법 (쿼리)

- (8/11) fifa22자료 설명

- (9/11) fifa22자료 데이터변형 및 시각화 (1)

- (10/11) fifa22자료 데이터변형 및 시각화 (2)

- (11/11) 과제설명

Partial Correlation

아이스크림을 많이 먹으면 걸리는 병

- 내용요약

  • 여름 $\to$ 수영장 $\to$ 소아마비
  • 여름 $\to$ 아이스크림
  • 아이스크림과 소아마비는 상관관계가 높다: 아이스크림 성분중에서 소아마비를 유발하는 유해물질이 있을 것이다 (?)

- 아래와 같이 모형을 간단하게 하자.

  • 온도 $\to$ 소아마비
  • 온도 $\to$ 아이스크림

Toy exam

import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd

- 교재의 예제상황은 예를들면 아래와 같다.

- 아이스크림 판매량 = 20 + 온도 $\times$ 2 + $\epsilon$

np.random.seed(1) 
temp= np.array([-10.2, -5.2, 0.1, 10.1, 12.2, 14.7, 
                25.4, 26.8, 28.9, 35.1, 32.2, 34.6])
ϵ1= np.random.normal(size=12,scale=5)
icecream= 20 + temp * 2 + ϵ1 
plt.plot(temp,icecream,'.')
[<matplotlib.lines.Line2D at 0x7fc550768e80>]
  • 온도와 아이스크림 판매량의 산점도

- 소아마비 = 30 + 온도 $\times$ 0.5 + $\epsilon^*$

np.random.seed(2) 
ϵ2= np.random.normal(size=12,scale=5) 
disease = 30+ temp* 0.5 + ϵ2
plt.plot(temp,disease,'.')
[<matplotlib.lines.Line2D at 0x7fc550480c40>]
  • 온도와 소아마비의 산점도

- 아이스크림과 질병의 산점도를 그려보자.

plt.plot(icecream,disease,'.')
[<matplotlib.lines.Line2D at 0x7fc55071b5e0>]
  • 양의 상관관계에 있다.

- 아이스크림 중 어떠한 물질이 소아마비를 일으키는것이 분명하므로 (인과성이 분명해보이니까) 아래와 같은 모형을 세우자. <-- 여기서부터 틀렸음

$${\tt disease}_i =\beta_0 +\beta_1 {\tt icecream}_i +\epsilon_i,\quad \textbf{for} ~~ i=1,2,\dots, 12$$

- 적절한 $\beta_0$와 $\beta_1$을 추정하면 우리는 아이스크림과 소아마비의 관계를 알 수 있다. <-- 틀린주장

  • 틀린 모형
  • 도데체 우리가 뭘 잘못했는가?

- 두 변수 사이에 상관관계가 있어도 실제 원인은 다른 변수에 숨겨져 있는 경우가 많다.

  • 예제의상황: 아이스크림과 익사자도 양의 상관관계에 있을것이다.
  • 아이스크림을 먹이면 물에 빠져 죽는다 $\to$ 틀린주장
  • 사실 기온이 숨겨진 원인이다. 기온이 증가하면 아이스크림 판매량도 증가하고 폭염때문에 익사사고율도 높아지는 구조이다.

- 아래와 같은 예제를 생각하자.

  • 인구수 $\to$ 교회
  • 인구수 $\to$ 범죄건수
  • 지역별 교회와 범죄건수를 살펴보면 상관관계가 높게 나올것임

- 교회를 많이 지으면 범죄건수도 증가한다?

  • 사실 그렇지 않다.
  • 인구수가 비슷한 도시끼리 묶어서 비교해보면 교회와 범죄건수는 양의 상관관계에 있지 않을것임

- 올바른 분석: 온도가 비슷한 그룹끼리 묶어서 그려보자. $\to$ 상관계수가 줄어들 것이다.

plt.plot(icecream[:6],disease[:6],'.')
[<matplotlib.lines.Line2D at 0x7fc54dd5aac0>]
plt.plot(icecream[6:],disease[6:],'.')
[<matplotlib.lines.Line2D at 0x7fc54db7ee20>]

- 진짜로 선형관계가 약해졌다..

좀 더 그럴듯한 자료

- 위의 toy example은 데이터가 너무 작아서 억지스러움

df=pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/extremum.csv')
df
지점번호 지점명 일시 평균기온(℃) 최고기온(℃) 최고기온시각 최저기온(℃) 최저기온시각일교차 Unnamed: 8
0 146 전주 2020-01-01 -0.5 4.3 15:09 -6.4 1:42 10.7
1 146 전주 2020-01-02 1.4 6.5 14:12 -3.0 7:55 9.5
2 146 전주 2020-01-03 2.6 7.6 13:32 -0.5 23:53 8.1
3 146 전주 2020-01-04 2.0 7.7 13:51 -2.6 5:30 10.3
4 146 전주 2020-01-05 2.5 8.6 14:05 -3.2 7:36 11.8
... ... ... ... ... ... ... ... ... ...
651 146 전주 2021-10-13 19.9 25.5 14:29 15.6 3:49 9.9
652 146 전주 2021-10-14 20.4 25.5 13:36 17.0 6:15 8.5
653 146 전주 2021-10-15 18.3 22.0 13:47 15.7 4:48 6.3
654 146 전주 2021-10-16 12.8 17.4 0:01 6.5 23:31 10.9
655 146 전주 2021-10-17 6.7 12.4 15:18 2.2 6:43 10.2

656 rows × 9 columns

- 평균기온만 선택하여 뽑자.

pd.Series(df.columns)
0          지점번호
1           지점명
2            일시
3       평균기온(℃)
4       최고기온(℃)
5        최고기온시각
6       최저기온(℃)
7     최저기온시각일교차
8    Unnamed: 8
dtype: object
temp=np.array(df.iloc[:,3])
len(temp)
656

- 아이스크림 판매량

np.random.seed(1)
ϵ1=np.random.normal(size=656, scale=10) 
icecream=temp*2 + 30 + ϵ1 
plt.plot(temp,icecream,'.')
[<matplotlib.lines.Line2D at 0x7fc54b3e69d0>]

- 소아마비

np.random.seed(2) 
ϵ2=np.random.normal(size=656,scale=1)
disease=temp*0.5 + 40 +ϵ2
plt.plot(temp,disease,'.')
[<matplotlib.lines.Line2D at 0x7fc54b3c6be0>]

- 아이스크림과 소아마비

plt.plot(icecream,disease,'.')
[<matplotlib.lines.Line2D at 0x7fc54b32c3d0>]
np.corrcoef(icecream,disease)
array([[1.        , 0.86298975],
       [0.86298975, 1.        ]])
  • 0.86정도..

- 여름만 뽑아서 그러보면?

plt.plot(icecream[temp>25],disease[temp>25], '.') ## 평균기온이 25도가 넘어가면 여름
[<matplotlib.lines.Line2D at 0x7fc54b30c9a0>]

- 산점도

fig , ((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2) 
ax1.plot(temp,icecream,'.')
ax2.plot(temp,disease,'.')
ax3.plot(icecream,disease,'.')
ax4.plot(icecream[temp>25],disease[temp>25],'.')
[<matplotlib.lines.Line2D at 0x7fc54b1b6220>]
fig , ((ax1,ax2), (ax3,ax4)) = plt.subplots(2,2) 
ax1.plot(temp,icecream,'.')
ax2.plot(temp,disease,'.')
ax3.plot(icecream,disease,'.')
ax4.plot(icecream,disease,'.')
ax4.plot(icecream[temp>25],disease[temp>25],'.')
[<matplotlib.lines.Line2D at 0x7fc54b06ffa0>]

온도구간을 세분화 하여 시각화

- 목표: 모든 온도구간에 대하여 각각 색을 다르게 하여 그려보자.

  • 사실 지금 변수는 온도, 아이스크림판매량, 소아마비
  • 지금까지는 기본산점도만 사용하였기에 2차원플랏만 그렸음 $\to$ 그래서 각각의 산점도를 정신없이 그려왔음
  • 온도가 유사한 지역을 색으로 묶으면 3차원 플랏이 가능함

- 일단 데이터 프레임을 정리하자.

df1=pd.DataFrame({'temp':temp, 'icecream':icecream, 'disease':disease})
df1
temp icecream disease
0 -0.5 45.243454 39.333242
1 1.4 26.682436 40.643733
2 2.6 29.918282 39.163804
3 2.0 23.270314 42.640271
4 2.5 43.654076 39.456564
... ... ... ...
651 19.9 78.839992 49.633906
652 20.4 86.554679 48.920443
653 18.3 78.666079 49.882650
654 12.8 52.771364 46.613159
655 6.7 40.736731 44.902513

656 rows × 3 columns

- 온도를 카테고리화 하자 $\to$ 적당한 구긴을 설정하기 위해서 히스토그램을 그려보자.

df1.temp.hist()
<AxesSubplot:>
plt.hist(df1.temp)
(array([  3.,   9.,  29.,  60.,  92.,  86.,  65.,  93., 139.,  80.]),
 array([-12.4 ,  -8.16,  -3.92,   0.32,   4.56,   8.8 ,  13.04,  17.28,
         21.52,  25.76,  30.  ]),
 <BarContainer object of 10 artists>)

- 구간은 5정도로 하면 적당할것 같다.

def f(x): 
    if x<0: 
        y='group0'
    elif x<5: 
        y='group5'
    elif x<10: 
        y='group10'
    elif x<15: 
        y='group15'
    elif x<20:
        y='group20'
    elif x<25: 
        y='group25'
    else: 
        y='group30'
    return y 
df1['temp2']=list(map(f,df1.temp))
df1
temp icecream disease temp2
0 -0.5 45.243454 39.333242 group0
1 1.4 26.682436 40.643733 group5
2 2.6 29.918282 39.163804 group5
3 2.0 23.270314 42.640271 group5
4 2.5 43.654076 39.456564 group5
... ... ... ... ...
651 19.9 78.839992 49.633906 group20
652 20.4 86.554679 48.920443 group25
653 18.3 78.666079 49.882650 group20
654 12.8 52.771364 46.613159 group15
655 6.7 40.736731 44.902513 group10

656 rows × 4 columns

from plotnine import * 
ggplot(data=df1)+geom_point(aes(x='icecream',y='disease',colour='temp2'),alpha=0.5)
<ggplot: (8780268732441)>
ggplot(data=df1)+geom_point(aes(x='icecream',y='disease',colour='temp2'),alpha=0.2)+geom_smooth(aes(x='icecream',y='disease',colour='temp2'),size=2,linetype='dashed')
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8780268015601)>

- 온도를 통제하니까 아이스크림과 질병은 관련이 없어보인다.

진짜 만약에 아이스크림과 소아마비가 관련있는 경우라면?

np.random.seed(1)
ϵ1=np.random.normal(size=656, scale=10) 
icecream=temp*2 + 30 + ϵ1 
np.random.seed(2) 
ϵ2=np.random.normal(size=656,scale=1)
disease= 30+ temp*0.0 + icecream*0.15 +ϵ2*2
df2=pd.DataFrame({'temp':temp,'icecream':icecream,'disease':disease})
df2['temp2']=list(map(f,df2.temp))
df2
temp icecream disease temp2
0 -0.5 45.243454 35.953002 group0
1 1.4 26.682436 33.889832 group5
2 2.6 29.918282 30.215350 group5
3 2.0 23.270314 36.771089 group5
4 2.5 43.654076 32.961240 group5
... ... ... ... ...
651 19.9 78.839992 41.193811 group20
652 20.4 86.554679 40.424088 group25
653 18.3 78.666079 43.265212 group20
654 12.8 52.771364 38.342022 group15
655 6.7 40.736731 39.215537 group10

656 rows × 4 columns

ggplot(data=df2)+geom_point(aes(x='icecream',y='disease',colour='temp2'),alpha=0.2)+geom_smooth(aes(x='icecream',y='disease',colour='temp2'),size=2,linetype='dashed')
/home/cgb3/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (8780335021228)>
df1.corr()
temp icecream disease
temp 1.000000 0.884366 0.975609
icecream 0.884366 1.000000 0.862990
disease 0.975609 0.862990 1.000000
df2.corr()
temp icecream disease
temp 1.000000 0.884366 0.725505
icecream 0.884366 1.000000 0.830539
disease 0.725505 0.830539 1.000000

숙제

- 온도구간을 10으로 변경하고 df1, df2에서 아이스크림과 소아마비의 산점도를 시각화한뒤 스크린샷 제출

행을 선택하는 방법

첫번째 행을 선택하는 법

np.random.seed(1)
dic= {'X1':np.random.normal(0,1,5), 
      'X2':np.random.normal(0,1,5), 
      'X3':np.random.normal(0,1,5), 
      'X4':np.random.normal(0,1,5), 
      'X5':np.random.normal(0,1,5), 
      'X6':np.random.normal(0,1,5)}
df1=pd.DataFrame(dic)
df1
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
1 -0.611756 1.744812 -2.060141 -0.172428 1.144724 -0.122890
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769
3 -1.072969 0.319039 -0.384054 0.042214 0.502494 -0.267888
4 0.865408 -0.249370 1.133769 0.582815 0.900856 0.530355

- 방법1

df1.iloc[0] # 이상해 
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법2

df1.iloc[[0]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법3

df1.iloc[0,:]
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법4

df1.iloc[[0],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법5

df1.loc[0] # 이상해 
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법6

df1.loc[[0]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법7

df1.loc[0,:]
X1    1.624345
X2   -2.301539
X3    1.462108
X4   -1.099891
X5   -1.100619
X6   -0.683728
Name: 0, dtype: float64

- 방법8

df1.loc[[0],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법9

df1.iloc[[True,False,False,False,False]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법10

df1.iloc[[True,False,False,False,False],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법11

df1.loc[[True,False,False,False,False]]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

- 방법12

df1.loc[[True,False,False,False,False],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728

1,3행을 선택하는 방법

df1
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
1 -0.611756 1.744812 -2.060141 -0.172428 1.144724 -0.122890
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769
3 -1.072969 0.319039 -0.384054 0.042214 0.502494 -0.267888
4 0.865408 -0.249370 1.133769 0.582815 0.900856 0.530355

- 방법1

df1.iloc[[0,2],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769

- 방법2

df1.loc[[0,2],:]
X1 X2 X3 X4 X5 X6
0 1.624345 -2.301539 1.462108 -1.099891 -1.100619 -0.683728
2 -0.528172 -0.761207 -0.322417 -0.877858 0.901591 -0.935769

- 그외에 여러행을 뽑는 방법이 있음; 슬라이싱, 불인덱싱

loc vs iloc ??

- 인덱스가 정수가 아닌경우

_df= pd.DataFrame({'A':[1,2,3,4],'B':[4,5,6,7]},index=list('abcd'))
_df
A B
a 1 4
b 2 5
c 3 6
d 4 7
_df.loc['a':'c',:]
A B
a 1 4
b 2 5
c 3 6
_df.iloc[0:3,:]
A B
a 1 4
b 2 5
c 3 6
_df.loc[['a','b','c'],:]
A B
a 1 4
b 2 5
c 3 6
_df.iloc[[0,1,2],:]
A B
a 1 4
b 2 5
c 3 6

- 대부분의 경우 observation에 특정한 이름이 있는 경우는 없으므로 loc이 그다지 쓸모 없음

- 그렇지만 특정경우에는 쓸모가 있음

np.random.normal(size=(20,4))
array([[ 0.83600472,  1.54335911,  0.75880566,  0.88490881],
       [-0.87728152, -0.86778722, -1.44087602,  1.23225307],
       [-0.25417987,  1.39984394, -0.78191168, -0.43750898],
       [ 0.09542509,  0.92145007,  0.0607502 ,  0.21112476],
       [ 0.01652757,  0.17718772, -1.11647002,  0.0809271 ],
       [-0.18657899, -0.05682448,  0.49233656, -0.68067814],
       [-0.08450803, -0.29736188,  0.417302  ,  0.78477065],
       [-0.95542526,  0.58591043,  2.06578332, -1.47115693],
       [-0.8301719 , -0.8805776 , -0.27909772,  1.62284909],
       [ 0.01335268, -0.6946936 ,  0.6218035 , -0.59980453],
       [ 1.12341216,  0.30526704,  1.3887794 , -0.66134424],
       [ 3.03085711,  0.82458463,  0.65458015, -0.05118845],
       [-0.72559712, -0.86776868, -0.13597733, -0.79726979],
       [ 0.28267571, -0.82609743,  0.6210827 ,  0.9561217 ],
       [-0.70584051,  1.19268607, -0.23794194,  1.15528789],
       [ 0.43816635,  1.12232832, -0.9970198 , -0.10679399],
       [ 1.45142926, -0.61803685, -2.03720123, -1.94258918],
       [-2.50644065, -2.11416392, -0.41163916,  1.27852808],
       [-0.44222928,  0.32352735, -0.10999149,  0.00854895],
       [-0.16819884, -0.17418034,  0.4611641 , -1.17598267]])
np.random.seed(1)
_df= pd.DataFrame(np.random.normal(size=(20,4)), columns=list('ABCD'), index=pd.date_range('20201225',periods=20))
_df
A B C D
2020-12-25 1.624345 -0.611756 -0.528172 -1.072969
2020-12-26 0.865408 -2.301539 1.744812 -0.761207
2020-12-27 0.319039 -0.249370 1.462108 -2.060141
2020-12-28 -0.322417 -0.384054 1.133769 -1.099891
2020-12-29 -0.172428 -0.877858 0.042214 0.582815
2020-12-30 -1.100619 1.144724 0.901591 0.502494
2020-12-31 0.900856 -0.683728 -0.122890 -0.935769
2021-01-01 -0.267888 0.530355 -0.691661 -0.396754
2021-01-02 -0.687173 -0.845206 -0.671246 -0.012665
2021-01-03 -1.117310 0.234416 1.659802 0.742044
2021-01-04 -0.191836 -0.887629 -0.747158 1.692455
2021-01-05 0.050808 -0.636996 0.190915 2.100255
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-07 -1.142518 -0.349343 -0.208894 0.586623
2021-01-08 0.838983 0.931102 0.285587 0.885141
2021-01-09 -0.754398 1.252868 0.512930 -0.298093
2021-01-10 0.488518 -0.075572 1.131629 1.519817
2021-01-11 2.185575 -1.396496 -1.444114 -0.504466
2021-01-12 0.160037 0.876169 0.315635 -2.022201
2021-01-13 -0.306204 0.827975 0.230095 0.762011

- 1월5일부터 1월8일까지의 자료만 보고싶다.

_df.loc['20210105':'20210108']
A B C D
2021-01-05 0.050808 -0.636996 0.190915 2.100255
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-07 -1.142518 -0.349343 -0.208894 0.586623
2021-01-08 0.838983 0.931102 0.285587 0.885141

- iloc으로 하려면 힘들다.

_df.index
DatetimeIndex(['2020-12-25', '2020-12-26', '2020-12-27', '2020-12-28',
               '2020-12-29', '2020-12-30', '2020-12-31', '2021-01-01',
               '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05',
               '2021-01-06', '2021-01-07', '2021-01-08', '2021-01-09',
               '2021-01-10', '2021-01-11', '2021-01-12', '2021-01-13'],
              dtype='datetime64[ns]', freq='D')
pd.Series(_df.index)
0    2020-12-25
1    2020-12-26
2    2020-12-27
3    2020-12-28
4    2020-12-29
5    2020-12-30
6    2020-12-31
7    2021-01-01
8    2021-01-02
9    2021-01-03
10   2021-01-04
11   2021-01-05
12   2021-01-06
13   2021-01-07
14   2021-01-08
15   2021-01-09
16   2021-01-10
17   2021-01-11
18   2021-01-12
19   2021-01-13
dtype: datetime64[ns]
_df.iloc[11:15]
A B C D
2021-01-05 0.050808 -0.636996 0.190915 2.100255
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-07 -1.142518 -0.349343 -0.208894 0.586623
2021-01-08 0.838983 0.931102 0.285587 0.885141

자주하는 실수

- 저는 아래와 같은 실수를 자주해요

_df.loc['A']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/_libs/tslibs/parsing.pyx in pandas._libs.tslibs.parsing.parse_datetime_string()

ValueError: Given date string not likely a datetime.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    680             try:
--> 681                 key = self._maybe_cast_for_get_loc(key)
    682             except ValueError as err:

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in _maybe_cast_for_get_loc(self, key)
    708         # needed to localize naive datetimes or dates (GH 35690)
--> 709         key = Timestamp(key)
    710         if key.tzinfo is None:

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion._convert_str_to_tsobject()

ValueError: could not convert string to Timestamp

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_38055/4015539971.py in <module>
----> 1 _df.loc['A']

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    929 
    930             maybe_callable = com.apply_if_callable(key, self.obj)
--> 931             return self._getitem_axis(maybe_callable, axis=axis)
    932 
    933     def _is_scalar_access(self, key: tuple):

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1162         # fall thru to straight lookup
   1163         self._validate_key(key, axis)
-> 1164         return self._get_label(key, axis=axis)
   1165 
   1166     def _get_slice_axis(self, slice_obj: slice, axis: int):

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in _get_label(self, label, axis)
   1111     def _get_label(self, label, axis: int):
   1112         # GH#5667 this will fail if the label is not present in the axis.
-> 1113         return self.obj.xs(label, axis=axis)
   1114 
   1115     def _handle_lowerdim_multi_index_axis0(self, tup: tuple):

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/generic.py in xs(self, key, axis, level, drop_level)
   3774                 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3775         else:
-> 3776             loc = index.get_loc(key)
   3777 
   3778             if isinstance(loc, np.ndarray):

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexes/datetimes.py in get_loc(self, key, method, tolerance)
    681                 key = self._maybe_cast_for_get_loc(key)
    682             except ValueError as err:
--> 683                 raise KeyError(key) from err
    684 
    685         elif isinstance(key, timedelta):

KeyError: 'A'

- 올바른 방법

_df['A']
2020-12-25    1.624345
2020-12-26    0.865408
2020-12-27    0.319039
2020-12-28   -0.322417
2020-12-29   -0.172428
2020-12-30   -1.100619
2020-12-31    0.900856
2021-01-01   -0.267888
2021-01-02   -0.687173
2021-01-03   -1.117310
2021-01-04   -0.191836
2021-01-05    0.050808
2021-01-06    0.120159
2021-01-07   -1.142518
2021-01-08    0.838983
2021-01-09   -0.754398
2021-01-10    0.488518
2021-01-11    2.185575
2021-01-12    0.160037
2021-01-13   -0.306204
Freq: D, Name: A, dtype: float64

- 아래의 사실을 기억하자.

  • 기본적으로는 iloc, loc은 [2], [2:] 처럼 1차원으로 원소를 인덱싱할수도 있고, [2,3], [:,2] 와 같이 2차원으로 인덱싱할 수도 있다.

  • 1차원으로 인덱싱하는 경우는 기본적으로 행을 인덱싱한다 $\to$ iloc, loc은 행과 더 친하고 열과 친하지 않다.

  • 따라서 열을 선택하는 방법에 있어서 loc, iloc이 그렇제 좋은 방법은 아니다.

  • 그렇지만 열을 선택하는 방법은 iloc이나 loc이 제일 편리하다. (이외의 다른 방법이 마땅하게 없음) 그래서 열을 선택할때도 iloc이나 loc을 선호한다.

슬라이싱에 대한 응용

_df.iloc[::2]
A B C D
2020-12-25 1.624345 -0.611756 -0.528172 -1.072969
2020-12-27 0.319039 -0.249370 1.462108 -2.060141
2020-12-29 -0.172428 -0.877858 0.042214 0.582815
2020-12-31 0.900856 -0.683728 -0.122890 -0.935769
2021-01-02 -0.687173 -0.845206 -0.671246 -0.012665
2021-01-04 -0.191836 -0.887629 -0.747158 1.692455
2021-01-06 0.120159 0.617203 0.300170 -0.352250
2021-01-08 0.838983 0.931102 0.285587 0.885141
2021-01-10 0.488518 -0.075572 1.131629 1.519817
2021-01-12 0.160037 0.876169 0.315635 -2.022201

- 이 방법은 칼럼에도 적용가능

_df.iloc[:,::2]
A C
2020-12-25 1.624345 -0.528172
2020-12-26 0.865408 1.744812
2020-12-27 0.319039 1.462108
2020-12-28 -0.322417 1.133769
2020-12-29 -0.172428 0.042214
2020-12-30 -1.100619 0.901591
2020-12-31 0.900856 -0.122890
2021-01-01 -0.267888 -0.691661
2021-01-02 -0.687173 -0.671246
2021-01-03 -1.117310 1.659802
2021-01-04 -0.191836 -0.747158
2021-01-05 0.050808 0.190915
2021-01-06 0.120159 0.300170
2021-01-07 -1.142518 -0.208894
2021-01-08 0.838983 0.285587
2021-01-09 -0.754398 0.512930
2021-01-10 0.488518 1.131629
2021-01-11 2.185575 -1.444114
2021-01-12 0.160037 0.315635
2021-01-13 -0.306204 0.230095

- 칼럼에는 잘 쓰지 않는이유?

  • row는 특정간격으로 뽑는 일이 빈번함. (예를들어 일별데이터를 주별데이터로 바꾸고싶을때, 바꾸고 싶을 경우?)

  • col을 특정간격으로 뽑아야 하는 일은 없음

lambda + map

np.random.seed(1)
df2= pd.DataFrame(np.random.normal(size=(10,4)),columns=list('ABCD'))
df2
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044

칼럼A의 값이 0보다 큰 경우만

- 방법1

df2.loc[map(lambda x: x>0,df2['A']),:]
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법2

df2.loc[lambda df: df['A']>0,:]
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
  • ??
  • map의 기능은 (1) 리스트를 원소별로 분해하여 (2) 어떠한 함수를 적용하여 아웃풋을 구한뒤 (3) 각각의 아웃풋을 다시 하나의 리스트로 묶음
  • 우리는 이중에서 (1),(3)에만 집중했음
  • 하지만 생각해보면 일단 (2) 일단 함수를 적용하는 기능이 있었음
  • 그런데 위의 코든느 함수를 적용한 결과가 아니라 함수 오브젝트 자체를 전달하여도 동작함

요약!!

  • True, False로 이루어진 벡터를 리스트의 형태로 전달하여 인덱상했음 (원래 우리가 알고 있는 개념)
  • True, False로 이루어진 벡터를 리턴할 수 있는 함수오브젝트 자체를 전달해도 인덱싱이 가능

- 방법3

df2.iloc[map(lambda x: x>0,df2['A']),:]
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법4: 실패

df2.iloc[lambda df: df['A']>0,:]
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
/tmp/ipykernel_38055/235064012.py in <module>
----> 1 df2.iloc[lambda df: df['A']>0,:]

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1504     def _getitem_tuple(self, tup: tuple):
   1505 
-> 1506         self._has_valid_tuple(tup)
   1507         with suppress(IndexingError):
   1508             return self._getitem_lowerdim(tup)

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    752         for i, k in enumerate(key):
    753             try:
--> 754                 self._validate_key(k, i)
    755             except ValueError as err:
    756                 raise ValueError(

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1393             if hasattr(key, "index") and isinstance(key.index, Index):
   1394                 if key.index.inferred_type == "integer":
-> 1395                     raise NotImplementedError(
   1396                         "iLocation based boolean "
   1397                         "indexing on an integer type "

NotImplementedError: iLocation based boolean indexing on an integer type is not available
  • 위에서 iloc을 loc으로 바꾸면 되는데..
  • iloc입장에서는 조금 서운함

칼럼A>0 이고 칼럼C<0 인 경우

df2
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044

- 방법1

df2.loc[map(lambda x,y: x>0 and y<0, df2['A'],df2['C']),:] 
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법2

df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:] 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_38055/232901037.py in <module>
----> 1 df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:]

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    918     def __getitem__(self, key):
    919         if type(key) is tuple:
--> 920             key = tuple(list(x) if is_iterator(x) else x for x in key)
    921             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    922             if self._is_scalar_access(key):

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in <genexpr>(.0)
    918     def __getitem__(self, key):
    919         if type(key) is tuple:
--> 920             key = tuple(list(x) if is_iterator(x) else x for x in key)
    921             key = tuple(com.apply_if_callable(x, self.obj) for x in key)
    922             if self._is_scalar_access(key):

/tmp/ipykernel_38055/232901037.py in <lambda>(x, y)
----> 1 df2.loc[map(lambda x,y: x>0 & y<0, df2['A'],df2['C']),:]

TypeError: unsupported operand type(s) for &: 'int' and 'float'
  • ??? 아래를 관찰

보충학습

0<3.2 &  0<2.2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_38055/993001292.py in <module>
----> 1 0<3.2 &  0<2.2

TypeError: unsupported operand type(s) for &: 'float' and 'int'
(0<3.2) &  (0<2.2)
True

보충학습끝

위의코드도 괄호로 묶어주면 잘 동작한다.

df2.loc[map(lambda x,y: (x>0) & (y<0), df2['A'],df2['C']),:] 
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
6 0.900856 -0.683728 -0.122890 -0.935769

- 방법3

df2.loc[lambda df: (df['A'] >0) & (df['C']<0)] 
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
6 0.900856 -0.683728 -0.122890 -0.935769

아래는 실행되지 않는다

df2.loc[lambda df: (df['A'] >0) and (df['C']<0)] 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_38055/2807753716.py in <module>
----> 1 df2.loc[lambda df: (df['A'] >0) and (df['C']<0)]

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    928             axis = self.axis or 0
    929 
--> 930             maybe_callable = com.apply_if_callable(key, self.obj)
    931             return self._getitem_axis(maybe_callable, axis=axis)
    932 

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/common.py in apply_if_callable(maybe_callable, obj, **kwargs)
    356     """
    357     if callable(maybe_callable):
--> 358         return maybe_callable(obj, **kwargs)
    359 
    360     return maybe_callable

/tmp/ipykernel_38055/2807753716.py in <lambda>(df)
----> 1 df2.loc[lambda df: (df['A'] >0) and (df['C']<0)]

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
   1535     @final
   1536     def __nonzero__(self):
-> 1537         raise ValueError(
   1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

실행되지 않는이유

np.array([True, False]) & np.array([True, True])
array([ True, False])
np.array([True, False]) and np.array([True, True])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_38055/1293310712.py in <module>
----> 1 np.array([True, False]) and np.array([True, True])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

- iloc을 이용한 방법은 생략

query

import numpy as np 
import pandas as pd
np.random.seed(1)
df=pd.DataFrame(np.random.normal(size=(15,4)),columns=list('ABCD'))
df
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
13 -1.142518 -0.349343 -0.208894 0.586623
14 0.838983 0.931102 0.285587 0.885141

A>0 and B<0 인 행을 선택

- 방법1

df.query('A>0 & B<0')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255

- 방법2

df.query('A>0 and B<0')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255

A<B<C 인 행을 선택

df.query('A<B<C')
A B C D
9 -1.117310 0.234416 1.659802 0.742044
13 -1.142518 -0.349343 -0.208894 0.586623

A>mean(A) 인 행을 선택

- 방법1

df.A.mean()
-0.018839420539994597
df.query('A>-0.018839420539994597')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
14 0.838983 0.931102 0.285587 0.885141

- 방법2

meanA=df.A.mean()
meanA
-0.018839420539994597
df.query('A> @meanA')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
6 0.900856 -0.683728 -0.122890 -0.935769
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
14 0.838983 0.931102 0.285587 0.885141

A>mean(A) 이고, A<0.8 인 것을 선택

- 방법1

df.query(' A> @meanA and A<0.8')
A B C D
2 0.319039 -0.249370 1.462108 -2.060141
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250

- 방법2

df.query(' A> @meanA'
         ' and A<0.8')
A B C D
2 0.319039 -0.249370 1.462108 -2.060141
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250

- 참고사항: 아래는 에러가 발생한다.

df.query('A> @meanA'
         'and A<0.8')
Traceback (most recent call last):

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "/tmp/ipykernel_1369917/821519992.py", line 1, in <module>
    df.query('A> @meanA'

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/pandas/core/frame.py", line 4060, in query
    res = self.eval(expr, **kwargs)

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/pandas/core/frame.py", line 4191, in eval
    return _eval(expr, inplace=inplace, **kwargs)

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 348, in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 806, in __init__
    self.terms = self.parse()

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 825, in parse
    return self._visitor.visit(self.expr)

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 407, in visit
    raise e

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 403, in visit
    node = ast.fix_missing_locations(ast.parse(clean))

  File "/home/cgb4/anaconda3/envs/bda2021/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,

  File "<unknown>", line 1
    A >__pd_eval_local_meanAand A <0.8
                                ^
SyntaxError: invalid syntax

단순인덱싱

df
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
6 0.900856 -0.683728 -0.122890 -0.935769
7 -0.267888 0.530355 -0.691661 -0.396754
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455
11 0.050808 -0.636996 0.190915 2.100255
12 0.120159 0.617203 0.300170 -0.352250
13 -1.142518 -0.349343 -0.208894 0.586623
14 0.838983 0.931102 0.285587 0.885141

- 0, 3:5, 9:11 에 해당하는 row를 뽑고싶다. $\to$ 칼럼이름을 index로 받아서 사용한다.

df.query('index==0 or 3<=index <=5  or 9<=index <=11')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
3 -0.322417 -0.384054 1.133769 -1.099891
4 -0.172428 -0.877858 0.042214 0.582815
5 -1.100619 1.144724 0.901591 0.502494
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455
11 0.050808 -0.636996 0.190915 2.100255

- 응용사례1

df.query('index==0 or index ==[8,9,10]')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
8 -0.687173 -0.845206 -0.671246 -0.012665
9 -1.117310 0.234416 1.659802 0.742044
10 -0.191836 -0.887629 -0.747158 1.692455

- 응용사례2

i1= np.arange(3)
i1
array([0, 1, 2])
df.query('index in @i1  or index==5')
A B C D
0 1.624345 -0.611756 -0.528172 -1.072969
1 0.865408 -2.301539 1.744812 -0.761207
2 0.319039 -0.249370 1.462108 -2.060141
5 -1.100619 1.144724 0.901591 0.502494

- 시계열자료에서 특히 유용함

df2=pd.DataFrame(np.random.normal(size=(10,4)), columns=list('ABCD'), index=pd.date_range('20201226',periods=10))
df2
A B C D
2020-12-26 -0.754398 1.252868 0.512930 -0.298093
2020-12-27 0.488518 -0.075572 1.131629 1.519817
2020-12-28 2.185575 -1.396496 -1.444114 -0.504466
2020-12-29 0.160037 0.876169 0.315635 -2.022201
2020-12-30 -0.306204 0.827975 0.230095 0.762011
2020-12-31 -0.222328 -0.200758 0.186561 0.410052
2021-01-01 0.198300 0.119009 -0.670662 0.377564
2021-01-02 0.121821 1.129484 1.198918 0.185156
2021-01-03 -0.375285 -0.638730 0.423494 0.077340
2021-01-04 -0.343854 0.043597 -0.620001 0.698032
df2.query(
    ' "2020-12-27"<= index <= "2021-01-03" ')
A B C D
2020-12-27 0.488518 -0.075572 1.131629 1.519817
2020-12-28 2.185575 -1.396496 -1.444114 -0.504466
2020-12-29 0.160037 0.876169 0.315635 -2.022201
2020-12-30 -0.306204 0.827975 0.230095 0.762011
2020-12-31 -0.222328 -0.200758 0.186561 0.410052
2021-01-01 0.198300 0.119009 -0.670662 0.377564
2021-01-02 0.121821 1.129484 1.198918 0.185156
2021-01-03 -0.375285 -0.638730 0.423494 0.077340
df2.query(
    ' "2020-12-27"<= index <= "2021-01-03" '
    ' and A+B < C')
A B C D
2020-12-27 0.488518 -0.075572 1.131629 1.519817
2020-12-31 -0.222328 -0.200758 0.186561 0.410052
2021-01-03 -0.375285 -0.638730 0.423494 0.077340

FIFA 선수들 시각화

FIFA data

- FIFA22라는 축구게임이 있음 (굉장히 인기있음)

- 게임에 실제 선수들이 나오면서 선수들의 능력치가 세밀하게 구현되어 있음

- 이 능력치에 대한 데이터셋은 캐글에 공개되어 있음

Data

fifa22=pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/2021-10-25-FIFA22_official_data.csv')

- Overall을 기준으로 정렬하여 보자.

fifa22=fifa22.sort_values(by='Overall',ascending=False).reset_index().rename(columns={'index':'index_old'})
fifa22.head()
index_old ID Name Age Photo Nationality Flag Overall Potential Club ... SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Best Position Best Overall Rating Release Clause DefensiveAwareness
0 29 158023 L. Messi 34 https://cdn.sofifa.com/players/158/023/22_60.png Argentina https://cdn.sofifa.com/flags/ar.png 93 93 Paris Saint-Germain ... 24.0 6.0 11.0 15.0 14.0 8.0 RW 93.0 €144.3M 20.0
1 33 188545 R. Lewandowski 32 https://cdn.sofifa.com/players/188/545/22_60.png Poland https://cdn.sofifa.com/flags/pl.png 92 92 FC Bayern München ... 19.0 15.0 6.0 12.0 8.0 10.0 ST 92.0 €197.2M 35.0
2 14244 200389 J. Oblak 28 https://cdn.sofifa.com/players/200/389/22_60.png Slovenia https://cdn.sofifa.com/flags/si.png 91 93 Atlético de Madrid ... 18.0 87.0 92.0 78.0 90.0 90.0 GK 91.0 €238M 27.0
3 3 192985 K. De Bruyne 30 https://cdn.sofifa.com/players/192/985/22_60.png Belgium https://cdn.sofifa.com/flags/be.png 91 91 Manchester City ... 53.0 15.0 13.0 5.0 10.0 13.0 CM 91.0 €232.2M 68.0
4 64 190871 Neymar Jr 29 https://cdn.sofifa.com/players/190/871/22_60.png Brazil https://cdn.sofifa.com/flags/br.png 91 91 Paris Saint-Germain ... 29.0 9.0 9.0 15.0 15.0 11.0 LW 91.0 €238.7M 35.0

5 rows × 66 columns

Overall vs Potential

from plotnine import *
ggplot(data=fifa22)+geom_point(aes(x='Overall', y='Potential'))
<ggplot: (8766943203419)>

- 뭔가 Potential > Overall 인 관계가 성립하는것 같다. $\to$ Potetial2= Potential - Overall 인 변수를 새로 만들고 시각화해보자.

  • 판다스: 새로운열 추가
fifa22['Potential2'] = fifa22['Potential'] - fifa22['Overall']
ggplot(data=fifa22)+geom_point(aes(x='Overall', y='Potential2'),alpha=0.1)
<ggplot: (8766943123941)>
ggplot(data=fifa22)+geom_point(aes(x='Overall', y='Potential2'),alpha=0.1,position='jitter')
<ggplot: (8766942753591)>

- 포텐셜2가 너무 0근처인 선수들이 있다. (아마 은퇴한 선수가 아닐까?) $\to$ 제외하고 그리자.

ggplot(data=fifa22.query('Potential2>0.1'))+geom_point(aes(x='Overall', y='Potential2'),alpha=0.1,position='jitter')
<ggplot: (8766942938091)>

- 해석

  • 음의 상관관계가 있다.
  • 오버올이 클수록 포텐셜2의 분산이 작아진다. (오버올이 클수록 더 성장할 부분이 없으니까)

- Overall을 구간별로 나누자: 어느정도가 적당한 구간일까?

fifa22.Overall.describe()
count    16710.000000
mean        67.646320
std          6.457695
min         28.000000
25%         63.000000
50%         68.000000
75%         72.000000
max         93.000000
Name: Overall, dtype: float64
import matplotlib.pyplot as plt 
fifa22.Overall.hist()
<AxesSubplot:>
def f(x): 
    if x>72: y='Q1' 
    elif x>68: y='Q2'
    elif x>63: y='Q3'
    else: y='Q4'
    return y 
fifa22['Q']=list(map(f,fifa22.Overall))
fifa22[['Q','Overall']]
Q Overall
0 Q1 93
1 Q1 92
2 Q1 91
3 Q1 91
4 Q1 91
... ... ...
16705 Q4 46
16706 Q4 46
16707 Q4 44
16708 Q4 44
16709 Q4 28

16710 rows × 2 columns

ggplot(data=fifa22.query('Potential2>0.1'))\
+geom_boxplot(aes(x='Q',y='Potential2'))
<ggplot: (8766938801267)>

- Q1으로 갈수록 분산이 작아짐! $\to$ 헷갈린다...

- 산점도와 박스플랏을 겹쳐서 그린다면 좀더 이해가 쉬울것 같다.

- x축의 위치를 조정하면 될것 같다 $\to$ Q1, Q2, Q3, Q4 각 그룹별로 x축의 위치를 구하자.

fifa22.query('Q=="Q1"').Overall.mean()
76.3506528835691
  • 이런식으로 해도 되지만
fifa22.groupby(by='Q').mean().Overall
Q
Q1    76.350653
Q2    70.411781
Q3    66.074449
Q4    59.602691
Name: Overall, dtype: float64
l=fifa22.groupby(by='Q').mean().Overall.to_list()
l
[76.3506528835691, 70.4117807472048, 66.07444942506334, 59.60269121813031]

- 이제 박스플랏이 들어갈 x축의 위치를 저장할 컬럼을 추가하고 그 이름을 Qx 라고 하자.

def g(x): 
    if x=='Q1': y=l[0]
    elif x=='Q2': y=l[1]
    elif x=='Q3': y=l[2]
    else: y=l[3]
    return y
fifa22['Qx']=list(map(g,fifa22.Q))
fifa22
index_old ID Name Age Photo Nationality Flag Overall Potential Club ... GKKicking GKPositioning GKReflexes Best Position Best Overall Rating Release Clause DefensiveAwareness Potential2 Q Qx
0 29 158023 L. Messi 34 https://cdn.sofifa.com/players/158/023/22_60.png Argentina https://cdn.sofifa.com/flags/ar.png 93 93 Paris Saint-Germain ... 15.0 14.0 8.0 RW 93.0 €144.3M 20.0 0 Q1 76.350653
1 33 188545 R. Lewandowski 32 https://cdn.sofifa.com/players/188/545/22_60.png Poland https://cdn.sofifa.com/flags/pl.png 92 92 FC Bayern München ... 12.0 8.0 10.0 ST 92.0 €197.2M 35.0 0 Q1 76.350653
2 14244 200389 J. Oblak 28 https://cdn.sofifa.com/players/200/389/22_60.png Slovenia https://cdn.sofifa.com/flags/si.png 91 93 Atlético de Madrid ... 78.0 90.0 90.0 GK 91.0 €238M 27.0 2 Q1 76.350653
3 3 192985 K. De Bruyne 30 https://cdn.sofifa.com/players/192/985/22_60.png Belgium https://cdn.sofifa.com/flags/be.png 91 91 Manchester City ... 5.0 10.0 13.0 CM 91.0 €232.2M 68.0 0 Q1 76.350653
4 64 190871 Neymar Jr 29 https://cdn.sofifa.com/players/190/871/22_60.png Brazil https://cdn.sofifa.com/flags/br.png 91 91 Paris Saint-Germain ... 15.0 15.0 11.0 LW 91.0 €238.7M 35.0 0 Q1 76.350653
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16705 15593 235352 18 T. Käßemodel 28 https://cdn.sofifa.com/players/235/352/18_60.png Germany https://cdn.sofifa.com/flags/de.png 46 46 FC Erzgebirge Aue ... 6.0 13.0 6.0 CM 45.0 €47K NaN 0 Q4 59.602691
16706 15685 219735 15 T. Fletcher 19 https://cdn.sofifa.com/players/219/735/15_60.png England https://cdn.sofifa.com/flags/gb-eng.png 46 52 Wycombe Wanderers ... 8.0 14.0 11.0 CB 46.0 NaN NaN 6 Q4 59.602691
16707 16572 19334 10 I. Baraclough 38 https://cdn.sofifa.com/players/019/334/10_60.png England https://cdn.sofifa.com/flags/gb-eng.png 44 65 NaN ... 46.0 20.0 20.0 CM 46.0 NaN NaN 21 Q4 59.602691
16708 15999 220806 16 E. Redman 18 https://cdn.sofifa.com/players/220/806/16_60.png Wales https://cdn.sofifa.com/flags/gb-wls.png 44 57 Newport County ... 16.0 9.0 7.0 CB 44.0 NaN NaN 13 Q4 59.602691
16709 16709 178453 07 A. Censori 17 https://cdn.sofifa.com/players/178/453/07_60.png Italy https://cdn.sofifa.com/flags/it.png 28 38 Arezzo ... 36.0 6.0 9.0 ST 36.0 NaN NaN 10 Q4 59.602691

16710 rows × 69 columns

ggplot(data=fifa22.query('Potential2>0.1'))\
+geom_point(aes(x='Overall', y='Potential2',color='Q'),alpha=0.1,size=0.1,position='jitter')\
+geom_boxplot(aes(x='Qx', y='Potential2',color='Q'))
<ggplot: (8766942887677)>

숙제

fifa22 데이터셋에서 Q==Q1이고, Potentail2>20 인 선수들의 이름을 출력하라.