강의영상

- (1/8) tidydata (1)

- (2/8) tidydata (2)

- (3/8) tidydata (3)

- (4/8) Barplot + 해들리위컴 그래프레이어 (1)

- (5/8) Barplot + 해들리위컴 그래프레이어 (2)

- (6/8) Barplot + 해들리위컴 그래프레이어 (2)

- (7/8) 심슨의 역설 (1)

- (8/8) 심슨의 역설 (2)

import pandas as pd 
import numpy as np 
from plotnine import * 
import matplotlib.pyplot as plt 

tidy data

- 느낌: ggplot으로 그림 그리기 좋은 데이터 + pandas로 query, group by 등을 쓰기 좋은 자료

- 정의: https://r4ds.had.co.nz/tidy-data.html

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

예시1

obs x y shape color
0 0 0 'star' 'F'
1 0 1 'circ' 'F'
2 1 0 'star' 'M'
3 1 1 'circ' 'M'

예시2

shape=star shape=circ
color=F (0,0) (0,1)
color=M (1,0) (1,1)

예제

풀이1: stack + reset_index

- 문제의 깃헙주소로 들어가서 데이터를 관찰 $\to$ 좌측상단이 비워져있음 $\to$ index_col=0 옵션을 사용

url = 'https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/state_fruit.csv'
df=pd.read_csv(url,index_col=0)
df
Apple Orange Banana
Texas 12 10 40
Arizona 9 7 12
Florida 0 14 190

- 데이터변형

df.stack()
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64
df.stack().reset_index()
level_0 level_1 0
0 Texas Apple 12
1 Texas Orange 10
2 Texas Banana 40
3 Arizona Apple 9
4 Arizona Orange 7
5 Arizona Banana 12
6 Florida Apple 0
7 Florida Orange 14
8 Florida Banana 190
df.stack().reset_index().rename(columns={'level_0':'group1','level_1':'group2',0:'X'})
group1 group2 X
0 Texas Apple 12
1 Texas Orange 10
2 Texas Banana 40
3 Arizona Apple 9
4 Arizona Orange 7
5 Arizona Banana 12
6 Florida Apple 0
7 Florida Orange 14
8 Florida Banana 190

풀이2: melt(id_vars=??)

- index_col=0 옵션을 사용하지않음

url = 'https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/state_fruit.csv'
df2=pd.read_csv(url)
df2
Unnamed: 0 Apple Orange Banana
0 Texas 12 10 40
1 Arizona 9 7 12
2 Florida 0 14 190
df2.rename(columns={'Unnamed: 0':'group1'})
group1 Apple Orange Banana
0 Texas 12 10 40
1 Arizona 9 7 12
2 Florida 0 14 190
df2.rename(columns={'Unnamed: 0':'group1'}).melt(id_vars='group1')
group1 variable value
0 Texas Apple 12
1 Arizona Apple 9
2 Florida Apple 0
3 Texas Orange 10
4 Arizona Orange 7
5 Florida Orange 14
6 Texas Banana 40
7 Arizona Banana 12
8 Florida Banana 190
df2.rename(columns={'Unnamed: 0':'group1'}).melt(id_vars='group1')\
.rename(columns={'variable':'group2','value':'X'})
group1 group2 X
0 Texas Apple 12
1 Arizona Apple 9
2 Florida Apple 0
3 Texas Orange 10
4 Arizona Orange 7
5 Florida Orange 14
6 Texas Banana 40
7 Arizona Banana 12
8 Florida Banana 190

틀린풀이1

df
Apple Orange Banana
Texas 12 10 40
Arizona 9 7 12
Florida 0 14 190
df.melt()
variable value
0 Apple 12
1 Apple 9
2 Apple 0
3 Orange 10
4 Orange 7
5 Orange 14
6 Banana 40
7 Banana 12
8 Banana 190

틀린풀이2

df2
Unnamed: 0 Apple Orange Banana
0 Texas 12 10 40
1 Arizona 9 7 12
2 Florida 0 14 190
df2.stack()
0  Unnamed: 0      Texas
   Apple              12
   Orange             10
   Banana             40
1  Unnamed: 0    Arizona
   Apple               9
   Orange              7
   Banana             12
2  Unnamed: 0    Florida
   Apple               0
   Orange             14
   Banana            190
dtype: object

풀이3

df
Apple Orange Banana
Texas 12 10 40
Arizona 9 7 12
Florida 0 14 190
df.reset_index()
index Apple Orange Banana
0 Texas 12 10 40
1 Arizona 9 7 12
2 Florida 0 14 190
df.reset_index().melt(id_vars='index')
index variable value
0 Texas Apple 12
1 Arizona Apple 9
2 Florida Apple 0
3 Texas Orange 10
4 Arizona Orange 7
5 Florida Orange 14
6 Texas Banana 40
7 Arizona Banana 12
8 Florida Banana 190
df.reset_index().melt(id_vars='index')\
.rename(columns={'index':'group1','variable':'group2','value':'X'})
group1 group2 X
0 Texas Apple 12
1 Arizona Apple 9
2 Florida Apple 0
3 Texas Orange 10
4 Arizona Orange 7
5 Florida Orange 14
6 Texas Banana 40
7 Arizona Banana 12
8 Florida Banana 190

풀이4

df2.set_index('Unnamed: 0')
Apple Orange Banana
Unnamed: 0
Texas 12 10 40
Arizona 9 7 12
Florida 0 14 190
df2.set_index('Unnamed: 0').stack()
Unnamed: 0        
Texas       Apple      12
            Orange     10
            Banana     40
Arizona     Apple       9
            Orange      7
            Banana     12
Florida     Apple       0
            Orange     14
            Banana    190
dtype: int64
df2.set_index('Unnamed: 0').stack().reset_index()
Unnamed: 0 level_1 0
0 Texas Apple 12
1 Texas Orange 10
2 Texas Banana 40
3 Arizona Apple 9
4 Arizona Orange 7
5 Arizona Banana 12
6 Florida Apple 0
7 Florida Orange 14
8 Florida Banana 190
df2.set_index('Unnamed: 0').stack().reset_index()\
.rename(columns={'Unnamed: 0':'group1','level_1':'group2',0:'X'})
group1 group2 X
0 Texas Apple 12
1 Texas Orange 10
2 Texas Banana 40
3 Arizona Apple 9
4 Arizona Orange 7
5 Arizona Banana 12
6 Florida Apple 0
7 Florida Orange 14
8 Florida Banana 190

Barplot + 해들리위컴의 그래프레이어

기본사용법

g=['A']*100+['B']*200 
y=list(np.random.randn(100)*2+2)+list(np.random.randn(200)+3)
df=pd.DataFrame({'g':g,'y':y})
df
g y
0 A -1.594055
1 A 1.225490
2 A 2.223234
3 A 1.842460
4 A 1.624541
... ... ...
295 B 3.011681
296 B 3.558141
297 B 4.348230
298 B 3.966407
299 B 2.694083

300 rows × 2 columns

ggplot(df)+geom_bar(aes(x='g',fill='g')) ## 디폴트로 카운트를 수행해줌
<ggplot: (8726962443840)>

- 이것은 아래의 코드와 같다.

df.groupby(by='g').count()
y
g
A 100
B 200
fig=ggplot(df.groupby(by='g').count().reset_index())
fig+geom_bar(aes(x='g',y='y',fill='g'),stat='identity')
<ggplot: (8726962432681)>

- barplot은 기본적으로 groupby+count()가 내장되어 있다. 따라서 아래의 코드

ggplot(df)+geom_bar(aes(x='g',fill='g')) ## 디폴트로 카운트를 수행해줌

를 좀더 엄밀하게 쓰면

ggplot(df)+geom_bar(aes(x='g',fill='g'),stat='count') 
<ggplot: (8726962477975)>

- 이것은 때때로 불편하다. 왜냐하면 데이터프레임을 변환하는 것은 판다스를 이용하는게 더 쉽고 자유로움

barplot의 불편한점1

td=df.groupby(by='g').count().reset_index()
td
g y
0 A 100
1 B 200

- 그냥 'x=g, y=y'를 맵핑하여 그리면 안되나?

plt.bar(td.g,td.y)
<BarContainer object of 2 artists>
td.plot(kind='bar',x='g',y='y')
<AxesSubplot:xlabel='g'>

- 그런데 ggplot을 쓰려고 하면?

ggplot(td)+geom_bar(aes(x='g',y='y',fill='g')) 
---------------------------------------------------------------------------
PlotnineError                             Traceback (most recent call last)
~/anaconda3/envs/dv2021/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/ggplot.py in __repr__(self)
     95         Print/show the plot
     96         """
---> 97         self.__str__()
     98         return '<ggplot: (%d)>' % self.__hash__()
     99 

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/ggplot.py in __str__(self)
     86         Print/show the plot
     87         """
---> 88         self.draw(show=True)
     89 
     90         # Return and empty string so that print(p) is "pretty"

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/ggplot.py in draw(self, return_ggplot, show)
    203         self = deepcopy(self)
    204         with plot_context(self, show=show):
--> 205             self._build()
    206 
    207             # setup

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/ggplot.py in _build(self)
    298 
    299         # Apply and map statistics
--> 300         layers.compute_statistic(layout)
    301         layers.map_statistic(self)
    302 

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/layer.py in compute_statistic(self, layout)
     71     def compute_statistic(self, layout):
     72         for l in self:
---> 73             l.compute_statistic(layout)
     74 
     75     def map_statistic(self, plot):

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/layer.py in compute_statistic(self, layout)
    322         data = self.stat.use_defaults(data)
    323         data = self.stat.setup_data(data)
--> 324         data = self.stat.compute_layer(data, params, layout)
    325         self.data = data
    326 

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/stat.py in compute_layer(cls, data, params, layout)
    274             return cls.compute_panel(pdata, pscales, **params)
    275 
--> 276         return groupby_apply(data, 'PANEL', fn)
    277 
    278     @classmethod

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/utils.py in groupby_apply(df, cols, func, *args, **kwargs)
    632         # function fn should be free to modify dataframe d, therefore
    633         # do not mark d as a slice of df i.e no SettingWithCopyWarning
--> 634         lst.append(func(d, *args, **kwargs))
    635     return pd.concat(lst, axis=axis, ignore_index=True)
    636 

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/stat.py in fn(pdata)
    272                 return pdata
    273             pscales = layout.get_scales(pdata['PANEL'].iat[0])
--> 274             return cls.compute_panel(pdata, pscales, **params)
    275 
    276         return groupby_apply(data, 'PANEL', fn)

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/stat.py in compute_panel(cls, data, scales, **params)
    305         stats = []
    306         for _, old in data.groupby('group'):
--> 307             new = cls.compute_group(old, scales, **params)
    308             unique = uniquecols(old)
    309             missing = unique.columns.difference(new.columns)

~/anaconda3/envs/dv2021/lib/python3.8/site-packages/plotnine/stats/stat_count.py in compute_group(cls, data, scales, **params)
     51         if ('y' in data) or ('y' in params):
     52             msg = 'stat_count() must not be used with a y aesthetic'
---> 53             raise PlotnineError(msg)
     54 
     55         weight = data.get('weight', np.ones(len(x), dtype=int))

PlotnineError: 'stat_count() must not be used with a y aesthetic'
  • 너무 불편해요.. stat='identity' 를 항상 써야하는것이!

barplot의 불편한점2

- groupby 를 자동으로 해주므로 익숙해지면 ggplot2 방식이 더 편하지 않을까? $\to$ groupby 하는게 더 편해요..

df.groupby('g').agg({'y':[np.mean,np.median,np.std,lambda x: np.max(x)-np.min(x)]})
y
mean median std <lambda_0>
g
A 1.805694 1.960777 2.060555 10.166683
B 2.900715 2.906418 1.004063 5.174041
df.groupby('g')\
.agg({'y':[np.mean,np.median,np.std,lambda x: np.max(x)-np.min(x)]})\
.rename(columns={'<lambda_0>':'range'}).stack().reset_index()
g level_1 y
0 A mean 1.805694
1 A median 1.960777
2 A range 10.166683
3 A std 2.060555
4 B mean 2.900715
5 B median 2.906418
6 B range 5.174041
7 B std 1.004063
td=df.groupby('g')\
.agg({'y':[np.mean,np.median,np.std,lambda x: np.max(x)-np.min(x)]})\
.rename(columns={'<lambda_0>':'range'}).stack().reset_index()
ggplot(td)+geom_bar(aes(x='level_1',y='y',fill='g'),stat='identity')
<ggplot: (8726959187327)>
  • 쌓인상태로 보이는것이 불편함. $\to$ position='dodge' 로!

position

ggplot(td)+geom_bar(aes(x='level_1',y='y',fill='g'),stat='identity',position='dodge')
<ggplot: (8726958675716)>

coord_flip()

- 때때로 아래와 같이 보는 것이 더 좋은 경우도 있음

ggplot(td)\
+geom_bar(aes(x='level_1',y='y',fill='g'),stat='identity',position='dodge')\
+coord_flip()
<ggplot: (8726962202191)>

facet_wrap()

ggplot(td)\
+geom_bar(aes(x='level_1',y='y',fill='g'),stat='identity',position='dodge')\
+coord_flip()+facet_wrap('level_1')
<ggplot: (8726962596675)>
ggplot(td)\
+geom_bar(aes(x='g',y='y',fill='g'),stat='identity',position='dodge')\
+coord_flip()+facet_wrap('level_1')
<ggplot: (8726958748290)>
ggplot(td)+facet_grid('level_1~g')\
+geom_bar(aes(x='g',y='y',fill='g'),stat='identity',position='dodge')+coord_flip()
<ggplot: (8726962231423)>

해들리위컴의 그래프레이어

- 데이터셋 + 맵핑 + 지옴 + 포지션 + 스탯 + 축 + 면분할

  • 데이터셋: 판다스
  • 맵핑: x축, y축, 색깔, 크기, 투명도
  • 지옴: 포인트지옴, 바지옴, 라인지옴, 스무스지옴
  • 포지션: jitter, dodge, intentity
  • 스탯: identity, count
  • 축: coord_flip()
  • 면분할: facet_wrap(), facet_grid()

예제: 심슨의 역설

DEP=(['A1']*2+['A2']*2+['B1']*2+['B2']*2)*2 
GEN=['M']*8+['F']*8
STATE=['PASS','FAIL']*8
COUNT=[1,9,2,8,80,20,85,15,5,5,5,5,9,1,9,1]
df=pd.DataFrame({'DEP':DEP,'STATE':STATE,'GEN':GEN,'COUNT':COUNT})
df
DEP STATE GEN COUNT
0 A1 PASS M 1
1 A1 FAIL M 9
2 A2 PASS M 2
3 A2 FAIL M 8
4 B1 PASS M 80
5 B1 FAIL M 20
6 B2 PASS M 85
7 B2 FAIL M 15
8 A1 PASS F 5
9 A1 FAIL F 5
10 A2 PASS F 5
11 A2 FAIL F 5
12 B1 PASS F 9
13 B1 FAIL F 1
14 B2 PASS F 9
15 B2 FAIL F 1
df.groupby(['GEN','STATE']).agg({'COUNT':np.sum})
COUNT
GEN STATE
F FAIL 12
PASS 28
M FAIL 52
PASS 168

시각화1: 전체합격률

df.groupby(['GEN','STATE']).agg({'COUNT':np.sum})
COUNT
GEN STATE
F FAIL 12
PASS 28
M FAIL 52
PASS 168
df.groupby(['GEN','STATE']).agg({'COUNT':np.sum}).reset_index()
GEN STATE COUNT
0 F FAIL 12
1 F PASS 28
2 M FAIL 52
3 M PASS 168
df.groupby(['GEN']).agg({'COUNT':np.sum}).reset_index()
GEN COUNT
0 F 40
1 M 220

- 두개의 데이터프레임을 합쳐야 한다.

_df1=df.groupby(['GEN','STATE']).agg({'COUNT':np.sum}).reset_index()
_df2=df.groupby(['GEN']).agg({'COUNT':np.sum}).reset_index().rename(columns={'COUNT':'SUM'})
display(_df1)
display(_df2)
GEN STATE COUNT
0 F FAIL 12
1 F PASS 28
2 M FAIL 52
3 M PASS 168
GEN SUM
0 F 40
1 M 220

- 단순한 방법

def f(x): 
    if x=='F':
        return 40 
    if x=='M':
        return 220 
_df1['SUM']=list(map(f,_df1.GEN))
_df1
GEN STATE COUNT SUM
0 F FAIL 12 40
1 F PASS 28 40
2 M FAIL 52 220
3 M PASS 168 220

- 좀 더 좋은 방법

_df1=df.groupby(['GEN','STATE']).agg({'COUNT':np.sum}).reset_index()
  • _df1를 다시 롤백
def f(_df2): 
    return lambda x: _df2.query('GEN == @x').SUM.item()
_df1.GEN
0    F
1    F
2    M
3    M
Name: GEN, dtype: object
_df1['SUM']=list(map(f(_df2),_df1.GEN))
_df1
GEN STATE COUNT SUM
0 F FAIL 12 40
1 F PASS 28 40
2 M FAIL 52 220
3 M PASS 168 220

- 더 좋은 방법

_df1=df.groupby(['GEN','STATE']).agg({'COUNT':np.sum}).reset_index()
  • _df1을 다시 롤백
_df1
GEN STATE COUNT
0 F FAIL 12
1 F PASS 28
2 M FAIL 52
3 M PASS 168
_df2
GEN SUM
0 F 40
1 M 220
pd.merge(_df1,_df2)
GEN STATE COUNT SUM
0 F FAIL 12 40
1 F PASS 28 40
2 M FAIL 52 220
3 M PASS 168 220
_df1.merge(_df2)
GEN STATE COUNT SUM
0 F FAIL 12 40
1 F PASS 28 40
2 M FAIL 52 220
3 M PASS 168 220
_df2.merge(_df1)
GEN SUM STATE COUNT
0 F 40 FAIL 12
1 F 40 PASS 28
2 M 220 FAIL 52
3 M 220 PASS 168
td=_df2.merge(_df1)
td
GEN SUM STATE COUNT
0 F 40 FAIL 12
1 F 40 PASS 28
2 M 220 FAIL 52
3 M 220 PASS 168
td['PROP']=td.COUNT/td.SUM
td
GEN SUM STATE COUNT PROP
0 F 40 FAIL 12 0.300000
1 F 40 PASS 28 0.700000
2 M 220 FAIL 52 0.236364
3 M 220 PASS 168 0.763636
ggplot(td.query('STATE=="PASS"'))+geom_bar(aes(x='GEN',y='PROP',fill='GEN'),stat='identity')
<ggplot: (8726958443650)>

- 남자의 합격률이 더 높다. $\to$ 성차별이 있어보인다(?)

시각화2: 학과별 합격률

- 학과별 합격률

df
DEP STATE GEN COUNT
0 A1 PASS M 1
1 A1 FAIL M 9
2 A2 PASS M 2
3 A2 FAIL M 8
4 B1 PASS M 80
5 B1 FAIL M 20
6 B2 PASS M 85
7 B2 FAIL M 15
8 A1 PASS F 5
9 A1 FAIL F 5
10 A2 PASS F 5
11 A2 FAIL F 5
12 B1 PASS F 9
13 B1 FAIL F 1
14 B2 PASS F 9
15 B2 FAIL F 1
td=df.groupby(['DEP','GEN']).agg({'COUNT':sum}).reset_index()\
.rename(columns={'COUNT':'SUM'}).merge(df)
td['PROP']=td.COUNT/td.SUM
td
DEP GEN SUM STATE COUNT PROP
0 A1 F 10 PASS 5 0.50
1 A1 F 10 FAIL 5 0.50
2 A1 M 10 PASS 1 0.10
3 A1 M 10 FAIL 9 0.90
4 A2 F 10 PASS 5 0.50
5 A2 F 10 FAIL 5 0.50
6 A2 M 10 PASS 2 0.20
7 A2 M 10 FAIL 8 0.80
8 B1 F 10 PASS 9 0.90
9 B1 F 10 FAIL 1 0.10
10 B1 M 100 PASS 80 0.80
11 B1 M 100 FAIL 20 0.20
12 B2 F 10 PASS 9 0.90
13 B2 F 10 FAIL 1 0.10
14 B2 M 100 PASS 85 0.85
15 B2 M 100 FAIL 15 0.15
td.query('STATE=="PASS"')
DEP GEN SUM STATE COUNT PROP
0 A1 F 10 PASS 5 0.50
2 A1 M 10 PASS 1 0.10
4 A2 F 10 PASS 5 0.50
6 A2 M 10 PASS 2 0.20
8 B1 F 10 PASS 9 0.90
10 B1 M 100 PASS 80 0.80
12 B2 F 10 PASS 9 0.90
14 B2 M 100 PASS 85 0.85
ggplot(td.query('STATE=="PASS"'))\
+geom_bar(aes(x='GEN',y='PROP',fill='GEN'),stat='identity')\
+facet_wrap('DEP')
<ggplot: (8726962449285)>