Lesson 12: Simpson's Paradox
Lecture video
Imports
import pandas as pd
from plotnine import ggplot, aes, geom_col, facet_wrap
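Simpson's Paradox
- UC Berkeley admissions data
- Claim: there is gender bias at Berkeley.
- According to the admission statistics for the fall 1973 term, male applicants were admitted at a much higher rate than female applicants, and the gap is too large to be attributed to chance.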
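# Read the wide table (two-level column header), then stack both header levels
# into rows so that each row is one (department, result, gender) count.
df=pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1])\
.stack().stack().reset_index()\
.rename({'level_0':'department','level_1':'result','level_2':'gender',0:'count'},axis=1)
df
| | department | result | gender | count |
|---|---|---|---|---|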
| 0 | A | fail | male | 314 |
| 1 | A | fail | female | 19 |
| 2 | A | pass | male | 511 |
| 3 | A | pass | female | 89 |
| 4 | B | fail | male | 208 |
| 5 | B | fail | female | 7 |
| 6 | B | pass | male | 352 |
| 7 | B | pass | female | 18 |
| 8 | C | fail | male | 204 |
| 9 | C | fail | female | 391 |
| 10 | C | pass | male | 121 |
| 11 | C | pass | female | 202 |
| 12 | D | fail | male | 279 |
| 13 | D | fail | female | 244 |
| 14 | D | pass | male | 138 |
| 15 | D | pass | female | 131 |
| 16 | E | fail | male | 137 |
| 17 | E | fail | female | 299 |
| 18 | E | pass | male | 54 |
| 19 | E | pass | female | 94 |
| 20 | F | fail | male | 149 |
| 21 | F | fail | female | 103 |
| 22 | F | pass | male | 224 |
| 23 | F | pass | female | 238 |
Visualizing the overall admission rate
A. Visualization 1: overall admission rate – pandas beginner
- Admission rate of female applicants
df.pivot_table(index='gender',columns='result',values='count',aggfunc='sum')\
.assign(total = lambda df: df['fail']+df['pass'])\
.assign(rate = lambda df: df['pass']/df['total'])
| gender | fail | pass | total | rate |
|---|---|---|---|---|
| female | 1063 | 772 | 1835 | 0.420708 |
| male | 1291 | 1400 | 2691 | 0.520253 |
df.query('gender == "female" and result == "pass"')['count'].sum() / df.query('gender == "female"')['count'].sum()
0.420708446866485
- Admission rate of male applicants
df.query('gender == "male" and result == "pass"')['count'].sum() / df.query('gender == "male"')['count'].sum()
0.5202526941657376
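The claim that the gap is too large to be a coincidence can be checked roughly with a pooled two-proportion z-test. This is only a sketch: the test is not part of the original lecture, it assumes applicants can be treated as independent draws, and the variable names are illustrative. The counts are taken from the pivot table above.

```python
import math

# (pass, total) counts per gender, copied from the pivot table above.
male_pass, male_total = 1400, 2691
female_pass, female_total = 772, 1835

p_male = male_pass / male_total
p_female = female_pass / female_total

# Pooled admission rate under the null hypothesis of equal rates.
p_pool = (male_pass + female_pass) / (male_total + female_total)

# Standard error of the difference in proportions under the null.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / male_total + 1 / female_total))

z = (p_male - p_female) / se
print(f"difference = {p_male - p_female:.4f}, z = {z:.2f}")
```

The z statistic comes out above 6, which would be extremely unlikely if the two rates were equal. This is why the aggregate numbers look like evidence of bias before the departments are examined.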
- Visualization
B. Visualization 1: overall admission rate – pandas expert
# Aggregate counts by gender, compute the admission rate, and flatten the index for plotting.
# aggfunc is passed as the string 'sum' to avoid the pandas FutureWarning raised by the built-in sum.
tidydata = df.pivot_table(index='gender', columns='result', values='count', aggfunc='sum')\
.assign(rate = lambda df: df['pass'] / (df['fail'] + df['pass']))\
.reset_index()
fig = ggplot(tidydata)
col = geom_col(aes(x='gender',y='rate',fill='gender'))
fig + col

Visualizing admission rates by department
# Same computation as before, but grouped by gender and department; one facet per department.
# aggfunc is passed as the string 'sum' to avoid the pandas FutureWarning raised by the built-in sum.
tidydata = df.pivot_table(index=['gender','department'], columns='result',values='count',aggfunc='sum')\
.assign(rate = lambda df: df['pass']/(df['fail']+df['pass']))\
.reset_index()
fig = ggplot(tidydata)
facet = facet_wrap('department')
col = geom_col(aes(x='gender',y='rate',fill='gender'))
fig + facet + col
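Since the faceted plot is not reproduced here, the same comparison can be made numerically. The sketch below recomputes each department's admission rate by gender from the counts in the first table. The dictionary layout and the names `counts`, `rate`, and `female_higher` are illustrative and do not come from the lecture.

```python
# (pass, fail) counts per department and gender, copied from the first table.
counts = {
    "A": {"male": (511, 314), "female": (89, 19)},
    "B": {"male": (352, 208), "female": (18, 7)},
    "C": {"male": (121, 204), "female": (202, 391)},
    "D": {"male": (138, 279), "female": (131, 244)},
    "E": {"male": (54, 137), "female": (94, 299)},
    "F": {"male": (224, 149), "female": (238, 103)},
}


def rate(passed, failed):
    """Admission rate = pass / (pass + fail)."""
    return passed / (passed + failed)


rates = {
    dept: {gender: rate(*pf) for gender, pf in by_gender.items()}
    for dept, by_gender in counts.items()
}

# Departments in which women are admitted at a higher rate than men.
female_higher = [d for d, r in rates.items() if r["female"] > r["male"]]

for dept, r in rates.items():
    print(f"{dept}: male={r['male']:.3f}  female={r['female']:.3f}")
print("female rate higher in:", female_higher)
```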

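Interpretation
- Overall admission rate: men are admitted at a higher rate. \(\to\) There appears to be gender discrimination(?)
- Admission rate by department: broken down by department, women are in fact admitted at a higher rate in A, B, D, and F.
- Reason given in the textbook: women applied disproportionately to departments with low admission rates.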
df.pivot_table(index='department', columns='gender', values='count',aggfunc='sum')\
.stack().reset_index().rename({0:'count'},axis=1)
| | department | gender | count |
|---|---|---|---|
| 0 | A | female | 108 |
| 1 | A | male | 825 |
| 2 | B | female | 25 |
| 3 | B | male | 560 |
| 4 | C | female | 593 |
| 5 | C | male | 325 |
| 6 | D | female | 375 |
| 7 | D | male | 417 |
| 8 | E | female | 393 |
| 9 | E | male | 191 |
| 10 | F | female | 341 |
| 11 | F | male | 373 |
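The textbook's explanation can be made concrete by writing each gender's overall admission rate as a weighted average of the department rates, where the weights are the shares of that gender's applications going to each department. A minimal sketch using the counts from the tables above (the names `applicants`, `admitted`, and `overall` are illustrative):

```python
# Applicant and admitted counts per department, copied from the tables above.
applicants = {
    "female": {"A": 108, "B": 25, "C": 593, "D": 375, "E": 393, "F": 341},
    "male": {"A": 825, "B": 560, "C": 325, "D": 417, "E": 191, "F": 373},
}
admitted = {
    "female": {"A": 89, "B": 18, "C": 202, "D": 131, "E": 94, "F": 238},
    "male": {"A": 511, "B": 352, "C": 121, "D": 138, "E": 54, "F": 224},
}

overall = {}
for gender in applicants:
    total = sum(applicants[gender].values())
    weighted = 0.0
    for dept, n in applicants[gender].items():
        share = n / total                        # weight: share of this gender's applications
        dept_rate = admitted[gender][dept] / n   # admission rate within the department
        weighted += share * dept_rate
        print(f"{gender:6s} {dept}: share={share:.3f}  rate={dept_rate:.3f}")
    overall[gender] = weighted                   # equals total admitted / total applicants

print(overall)
```

Women send about 74% of their applications to C, D, and E, the three departments with the lowest admission rates for both genders, while men send about 51% of theirs to A and B. The weights, not the within-department rates, drive the overall gap.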
A more extreme example
| | department | result | gender | count |
|---|---|---|---|---|
| 0 | A | fail | female | 0 |
| 1 | A | fail | male | 100 |
| 2 | A | pass | female | 1 |
| 3 | A | pass | male | 900 |
| 4 | B | fail | female | 400 |
| 5 | B | fail | male | 1 |
| 6 | B | pass | female | 600 |
| 7 | B | pass | male | 1 |
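The code that builds this table is not shown. One way it could be constructed is sketched below, assuming the same column names as the Berkeley data and assuming that `df` is simply overwritten; the `rows` list is a hypothetical helper.

```python
import pandas as pd

# Hypothetical reconstruction of the extreme example; values copied from the table above.
rows = [
    ("A", "fail", "female", 0),
    ("A", "fail", "male", 100),
    ("A", "pass", "female", 1),
    ("A", "pass", "male", 900),
    ("B", "fail", "female", 400),
    ("B", "fail", "male", 1),
    ("B", "pass", "female", 600),
    ("B", "pass", "male", 1),
]
df = pd.DataFrame(rows, columns=["department", "result", "gender", "count"])
```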
- Examining the overall admission rate
display(
    f"Admission rate (female): {601/1001:.4f}",
    df[df.gender=='female'],
    f"Admission rate (male): {901/1002:.4f}",
    df[df.gender=='male'],
)
'Admission rate (female): 0.6004'
| | department | result | gender | count |
|---|---|---|---|---|
| 0 | A | fail | female | 0 |
| 2 | A | pass | female | 1 |
| 4 | B | fail | female | 400 |
| 6 | B | pass | female | 600 |
'Admission rate (male): 0.8992'
| | department | result | gender | count |
|---|---|---|---|---|
| 1 | A | fail | male | 100 |
| 3 | A | pass | male | 900 |
| 5 | B | fail | male | 1 |
| 7 | B | pass | male | 1 |
- Examining admission rates by department
| | department | result | gender | count |
|---|---|---|---|---|
| 0 | A | fail | female | 0 |
| 1 | A | fail | male | 100 |
| 2 | A | pass | female | 1 |
| 3 | A | pass | male | 900 |
| 4 | B | fail | female | 400 |
| 5 | B | fail | male | 1 |
| 6 | B | pass | female | 600 |
| 7 | B | pass | male | 1 |
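The by-department check for this example can be sketched in plain Python so that the arithmetic stays visible. The counts come from the table above; the names `counts`, `dept_rates`, and `overall` are illustrative.

```python
# (pass, fail) counts per department and gender, copied from the extreme example.
counts = {
    "A": {"female": (1, 0), "male": (900, 100)},
    "B": {"female": (600, 400), "male": (1, 1)},
}

# Admission rate within each department.
dept_rates = {
    dept: {g: p / (p + f) for g, (p, f) in by_gender.items()}
    for dept, by_gender in counts.items()
}

# Admission rate after pooling both departments.
overall = {}
for g in ("female", "male"):
    passed = sum(counts[d][g][0] for d in counts)
    total = sum(sum(counts[d][g]) for d in counts)
    overall[g] = passed / total

print(dept_rates)
print(overall)
```

Women have the higher rate in both departments (1.0 vs 0.9 in A, 0.6 vs 0.5 in B), yet the pooled rate favors men (0.8992 vs 0.6004), because nearly all women applied to B and nearly all men applied to A.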

