07wk-2: Pandas – `pivot_table`, `groupby`+`agg`

Author

최규빈

Published

October 16, 2023

1. 강의영상

2. Imports

import pandas as pd 
import numpy as np

3. pivot_table, groupby+agg

A. intro

- 개념: 그룹화 $\to$ 집계

# 예제1: 아래의 예제에서 (학과,성별)로 count의 합계를 구하라.

df=pd.read_csv("https://raw.githubusercontent.com/guebin/DV2022/master/posts/Simpson.csv",index_col=0,header=[0,1])\
.stack().stack().reset_index()\
.rename({'level_0':'department','level_1':'result','level_2':'gender',0:'count'},axis=1)
df

	department	result	gender	count
0	A	fail	male	314
1	A	fail	female	19
2	A	pass	male	511
3	A	pass	female	89
4	B	fail	male	208
5	B	fail	female	7
6	B	pass	male	352
7	B	pass	female	18
8	C	fail	male	204
9	C	fail	female	391
10	C	pass	male	121
11	C	pass	female	202
12	D	fail	male	279
13	D	fail	female	244
14	D	pass	male	138
15	D	pass	female	131
16	E	fail	male	137
17	E	fail	female	299
18	E	pass	male	54
19	E	pass	female	94
20	F	fail	male	149
21	F	fail	female	103
22	F	pass	male	224
23	F	pass	female	238

df.pivot_table(index=['department','gender'],values='count',aggfunc='sum')

		count
department	gender
A	female	108
A	male	825
B	female	25
B	male	560
C	female	593
C	male	325
D	female	375
D	male	417
E	female	393
E	male	191
F	female	341
F	male	373

#

- 예시에서 본 작업은 아래의 작업들로 세분화 할 수 있다.

그룹화(쿼리): 하나의 dataframe을 sub-dataframe으로 나누는 과정 – 전체자료를 (학과,성별)로 묶어 총 10개의 sub-dataframe을 만듦
각각집계(각각계산): 나누어진 sub-dataframe에서 어떠한 계산을 각각 수행함. – 나누어진 sub-dataframe에서 지원자수의 합계를 각각 구함

- 위의 같은 작업을 하려면 아래와 같은 요소들이 필요하다.

그룹변수¹ – 그룹화를 위해 필요한 변수, dataframe을 sub-dataframe으로 나누는 역할.
집계변수² – 집계함수의 대상이 되는 변수.
집계변수 – 그룹화된 데이터프레임에 수행하는 계산을 정의하는 함수.

¹ 이건 없는 용어에요

² 이것도 없는 용어에요

B. `pivot_table`의 문법

- pivot_table의 문법

df.pivot_table(
    index = 그룹변수
    colums = 그룹변수 
    values = 집계변수
    aggfunc = 집계함수
)

- 그룹변수: string, 혹은 list of string 으로 전달한다.

예시: ‘department’, [‘department’], [‘department’,‘gender’]

- 집계변수: string, 혹은 list of string 으로 전달한다.

예시: ‘CANCELLED’, [‘CANCELLED’], [‘CANCELLED’,‘AIR_TIME’]

- 집계함수: 함수자체³를 전달하거나, 함수를 의미하는 문자열⁴, 혹은 그것들의 리스트형태로 전달한다.

³ np.mean,sum

⁴ ‘count’, ‘sum’, ‘mean’, ‘median’, ‘min’, ‘max’, ‘std’, ‘var’

# 예시: 집계합수를 전달하는 방법

data

df = pd.DataFrame({'category':['A']*5+['B']*5, 'value':np.concatenate([np.random.randn(5), np.random.randn(5)+10])})
df

	category	value
0	A	-1.522629
1	A	0.525883
2	A	0.285563
3	A	-1.219968
4	A	1.523162
5	B	9.942545
6	B	10.811401
7	B	9.817500
8	B	10.062780
9	B	11.450590

방법1 – 함수자체를 전달

df.pivot_table(index=['category'],values='value',aggfunc=np.sum) # 함수자체

/tmp/ipykernel_3437616/3923531937.py:1: FutureWarning: The provided callable <function sum at 0x7f2bb810b1c0> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  df.pivot_table(index=['category'],values='value',aggfunc=np.sum) # 함수자체

	value
category
A	-0.407990
B	52.084817

방법2 – 함수를 의미하는 문자열을 전달

df.pivot_table(index=['category'],values='value',aggfunc='sum')  # 리스트

	value
category
A	-0.407990
B	52.084817

방법3 – 리스트를 전달

df.pivot_table(
    index=['category'],
    values='value',
    aggfunc=['sum','min',np.mean,np.max,'count']
)

/tmp/ipykernel_3437616/1880687379.py:1: FutureWarning: The provided callable <function mean at 0x7f2bb8110670> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  df.pivot_table(
/tmp/ipykernel_3437616/1880687379.py:1: FutureWarning: The provided callable <function amax at 0x7f2bb810ba30> is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
  df.pivot_table(

	sum	min	mean	amax	count
	value	value	value	value	value
category
A	-0.407990	-1.522629	-0.081598	1.523162	5
B	52.084817	9.817500	10.416963	11.450590	5

C. `groupby` + `aggregate` 의 문법

- groupby + aggregate 의 문법

df.groupby(그룹변수).aggregate({집계변수:집계함수})

4. AIRLINE 자료로 연습

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58492 entries, 0 to 58491
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   MONTH      58492 non-null  int64  
 1   DAY        58492 non-null  int64  
 2   WEEKDAY    58492 non-null  int64  
 3   AIRLINE    58492 non-null  object 
 4   ORG_AIR    58492 non-null  object 
 5   DEST_AIR   58492 non-null  object 
 6   SCHED_DEP  58492 non-null  int64  
 7   DEP_DELAY  57659 non-null  float64
 8   AIR_TIME   57474 non-null  float64
 9   DIST       58492 non-null  int64  
 10  SCHED_ARR  58492 non-null  int64  
 11  ARR_DELAY  57474 non-null  float64
 12  DIVERTED   58492 non-null  int64  
 13  CANCELLED  58492 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 6.2+ MB

- 각 변수들에 대한 설명은 아래와 같다. (ChatGPT의 도움을 받아 정리함)

MONTH: 비행이 이루어진 월을 나타냄. 1에서 12 사이의 값을 갖음.
DAY: 비행이 이루어진 일자를 나타냄. 월에 따라 1~28/29/30/31 사이의 값을 가질 수 있음.
WEEKDAY: 비행이 이루어진 요일을 나타냄. 일반적으로 1(일요일)부터 7(토요일)까지의 값을 갖음.
AIRLINE: 해당 항공편을 운영하는 항공사의 약어나 코드를 나타냄.
ORG_AIR: 비행기가 출발하는 공항의 약어나 코드를 나타냄.
DEST_AIR: 비행기가 도착하는 공항의 약어나 코드를 나타냄.
SCHED_DEP: 원래의 예정된 출발 시간을 나타냄. 시간은 일반적으로 HHMM 형식으로 표시될 수 있음.
DEP_DELAY: 출발 지연 시간을 나타냄. 음수 값은 조기 출발, 양수 값은 지연을 의미함.
AIR_TIME: 실제 공중에서 비행한 시간을 분 단위로 나타냄.
DIST: 비행 거리를 나타냄. 일반적으로 마일 또는 킬로미터로 표시됨.
SCHED_ARR: 원래의 예정된 도착 시간을 나타냄. SCHED_DEP와 같은 형식으로 표시될 수 있음.
ARR_DELAY: 도착 지연 시간을 나타냄. 음수는 조기 도착, 양수는 지연을 의미함.
DIVERTED: 항공편이 다른 곳으로 우회되었는지를 나타냄. 1은 우회, 0은 정상 경로를 의미함.
CANCELLED: 항공편이 취소되었는지 여부를 나타냄. 1은 취소, 0은 취소되지 않음을 의미함.

# 예제1: 항공사별로 도착지연시간의 평균을 구하라.

- 풀이1

df.pivot_table(index='AIRLINE',values='ARR_DELAY')

	ARR_DELAY
AIRLINE
AA	5.542661
AS	-0.833333
B6	8.692593
DL	0.339691
EV	7.034580
F9	13.630651
HA	4.972973
MQ	6.860591
NK	18.436070
OO	7.593463
UA	7.765755
US	1.681105
VX	5.348884
WN	6.397353

- 풀이2

df.groupby(by="AIRLINE").aggregate({'ARR_DELAY':np.mean})

/tmp/ipykernel_3437616/4188396604.py:1: FutureWarning: The provided callable <function mean at 0x7f2bb8110670> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  df.groupby(by="AIRLINE").aggregate({'ARR_DELAY':np.mean})

	ARR_DELAY
AIRLINE
AA	5.542661
AS	-0.833333
B6	8.692593
DL	0.339691
EV	7.034580
F9	13.630651
HA	4.972973
MQ	6.860591
NK	18.436070
OO	7.593463
UA	7.765755
US	1.681105
VX	5.348884
WN	6.397353

#

# 예제2: 항공사별로 비행취소건수의 합계를 구하라. 취소건수가 높은 항공사순으로 정렬하라.

- 풀이1: .pivot_table()을 이용

df.pivot_table(index='AIRLINE',values='CANCELLED',aggfunc='sum').sort_values('CANCELLED',ascending=False)

	CANCELLED
AIRLINE
AA	154
MQ	152
EV	146
OO	142
UA	93
WN	93
DL	38
NK	25
US	21
F9	10
VX	6
B6	1
AS	0
HA	0

- 풀이2: .groupby()+.aggregate()를 이용

df.groupby('AIRLINE').aggregate({'CANCELLED':'sum'}).sort_values('CANCELLED',ascending=False)

	CANCELLED
AIRLINE
AA	154
MQ	152
EV	146
OO	142
UA	93
WN	93
DL	38
NK	25
US	21
F9	10
VX	6
B6	1
AS	0
HA	0

# 예제3: 항공사별로 비행취소율을 구하라. 비행취소율이 가장 높은 항공사 순으로 정렬하라.

- 풀이1: .pivot_table()을 이용

df.pivot_table(index='AIRLINE',values='CANCELLED',aggfunc='mean').sort_values('CANCELLED',ascending=False)

	CANCELLED
AIRLINE
MQ	0.043791
EV	0.024923
OO	0.021554
AA	0.017303
NK	0.016491
US	0.013003
UA	0.011935
WN	0.011048
F9	0.007593
VX	0.006042
DL	0.003585
B6	0.001842
AS	0.000000
HA	0.000000

- 풀이2: .groupby()+.aggregate()를 이용

df.groupby('AIRLINE').aggregate({'CANCELLED':'mean'}).sort_values('CANCELLED',ascending=False)

	CANCELLED
AIRLINE
MQ	0.043791
EV	0.024923
OO	0.021554
AA	0.017303
NK	0.016491
US	0.013003
UA	0.011935
WN	0.011048
F9	0.007593
VX	0.006042
DL	0.003585
B6	0.001842
AS	0.000000
HA	0.000000

# 예제4 (항공사,요일)별 비행취소건수와 비행취소율을 조사하라.

- 풀이1: .pivot_table()을 이용

df.pivot_table(index=['AIRLINE','WEEKDAY'],values='CANCELLED',aggfunc=[np.mean,sum])

/tmp/ipykernel_3437616/757677508.py:1: FutureWarning: The provided callable <function mean at 0x7f2bb8110670> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  df.pivot_table(index=['AIRLINE','WEEKDAY'],values='CANCELLED',aggfunc=[np.mean,sum])
/tmp/ipykernel_3437616/757677508.py:1: FutureWarning: The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  df.pivot_table(index=['AIRLINE','WEEKDAY'],values='CANCELLED',aggfunc=[np.mean,sum])

		mean	sum
		CANCELLED	CANCELLED
AIRLINE	WEEKDAY
AA	1	0.032106	41
	2	0.007341	9
	3	0.011949	16
	4	0.015004	20
	5	0.014151	18
...	...	...	...
WN	3	0.014118	18
	4	0.007911	10
	5	0.005828	7
	6	0.010132	10
	7	0.006066	7

98 rows × 2 columns

- 풀이2: .groupby()+.aggregate()를 이용

df.groupby(['AIRLINE','WEEKDAY']).aggregate({'CANCELLED':[np.mean,sum]})

/tmp/ipykernel_3437616/1354178761.py:1: FutureWarning: The provided callable <function mean at 0x7f2bb8110670> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  df.groupby(['AIRLINE','WEEKDAY']).aggregate({'CANCELLED':[np.mean,sum]})
/tmp/ipykernel_3437616/1354178761.py:1: FutureWarning: The provided callable <built-in function sum> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  df.groupby(['AIRLINE','WEEKDAY']).aggregate({'CANCELLED':[np.mean,sum]})

		CANCELLED
		mean	sum
AIRLINE	WEEKDAY
AA	1	0.032106	41
	2	0.007341	9
	3	0.011949	16
	4	0.015004	20
	5	0.014151	18
...	...	...	...
WN	3	0.014118	18
	4	0.007911	10
	5	0.005828	7
	6	0.010132	10
	7	0.006066	7

98 rows × 2 columns

# 예제4: (항공사,요일)별로 CANCELLED는 평균과 합계를 구하고 (즉 비행취소건수와 취소율을 구하고), AIR_TIME은 평균과 표준편차를 구하여라.

- 풀이1: .pivot_table() –> 이거 제가 수업할때는 못한다고 했는데 찾아보니까 아래처럼 할 수 있습니다.

df.pivot_table(
    index=['AIRLINE', 'WEEKDAY'],
    aggfunc={'CANCELLED': [np.mean, sum], 'AIR_TIME': ['mean','std']}
)

/tmp/ipykernel_3437616/2860571331.py:1: FutureWarning: The provided callable <function mean at 0x7f2bb8110670> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  df.pivot_table(
/tmp/ipykernel_3437616/2860571331.py:1: FutureWarning: The provided callable <built-in function sum> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  df.pivot_table(

		AIR_TIME		CANCELLED
		mean	std	mean	sum
AIRLINE	WEEKDAY
AA	1	147.610569	73.442540	0.032106	41
	2	143.851852	73.211275	0.007341	9
	3	144.514005	73.340675	0.011949	16
	4	141.124618	69.220840	0.015004	20
	5	145.430966	76.711095	0.014151	18
...	...	...	...	...	...
WN	3	104.219920	53.869040	0.014118	18
	4	107.200800	54.466218	0.007911	10
	5	107.893635	57.172695	0.005828	7
	6	109.247433	56.149388	0.010132	10
	7	107.602273	56.419207	0.006066	7

98 rows × 4 columns

- 풀이2: .groupby()+.aggregate()를 이용

df.groupby(['AIRLINE','WEEKDAY'])\
.aggregate({'CANCELLED': [np.mean,sum],'AIR_TIME': ['mean','std']})

/tmp/ipykernel_3437616/2787668343.py:2: FutureWarning: The provided callable <function mean at 0x7f2bb8110670> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
  .aggregate({'CANCELLED': [np.mean,sum],'AIR_TIME': ['mean','std']})
/tmp/ipykernel_3437616/2787668343.py:2: FutureWarning: The provided callable <built-in function sum> is currently using SeriesGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
  .aggregate({'CANCELLED': [np.mean,sum],'AIR_TIME': ['mean','std']})

		CANCELLED		AIR_TIME
		mean	sum	mean	std
AIRLINE	WEEKDAY
AA	1	0.032106	41	147.610569	73.442540
	2	0.007341	9	143.851852	73.211275
	3	0.011949	16	144.514005	73.340675
	4	0.015004	20	141.124618	69.220840
	5	0.014151	18	145.430966	76.711095
...	...	...	...	...	...
WN	3	0.014118	18	104.219920	53.869040
	4	0.007911	10	107.200800	54.466218
	5	0.005828	7	107.893635	57.172695
	6	0.010132	10	109.247433	56.149388
	7	0.006066	7	107.602273	56.419207

98 rows × 4 columns

#

# 예제5: 운행구간을 그룹화하고, 운행구간별 비행취소건수와 취소율을 구하여라.

- 풀이1

df.assign(DIST_CUT= pd.qcut(df.DIST,q=4)).pivot_table(
    index= ['DIST_CUT'],
    values= 'CANCELLED',
    aggfunc= ['mean','sum']
)

	mean	sum
	CANCELLED	CANCELLED
DIST_CUT
(66.999, 391.0]	0.022659	334
(391.0, 690.0]	0.013503	196
(690.0, 1199.0]	0.013637	203
(1199.0, 4502.0]	0.010313	148

- 풀이2

df.assign(DIST_CUT= pd.qcut(df.DIST,q=4))\
.groupby('DIST_CUT')\
.aggregate({'CANCELLED':['mean','sum']})

/tmp/ipykernel_3437616/402182245.py:2: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  .groupby('DIST_CUT')\

	CANCELLED
	mean	sum
DIST_CUT
(66.999, 391.0]	0.022659	334
(391.0, 690.0]	0.013503	196
(690.0, 1199.0]	0.013637	203
(1199.0, 4502.0]	0.010313	148

#

5. HW

np.random.seed(43052)
df = pd.DataFrame({'X1':['A']*5+['B']*5, 'X2':np.concatenate([np.random.randn(5), np.random.randn(5)+10])})
df

	X1	X2
0	A	0.383420
1	A	1.084175
2	A	1.142778
3	A	0.307894
4	A	0.237787
5	B	10.355951
6	B	8.336925
7	B	8.617227
8	B	8.073155
9	B	8.513784

위의 자료에서 X1을 기준으로 그룹화한뒤 X2에 대하여 아래의 함수를 적용하라.

$max (X2) - min (X2)$

출력결과는 아래와 같아야 한다 .

	X2
X1
A	0.904991
B	2.282796

1. 강의영상

2. Imports

3. pivot_table, groupby+agg

A. intro

B. pivot_table의 문법

C. groupby + aggregate 의 문법

4. AIRLINE 자료로 연습

5. HW

B. `pivot_table`의 문법

C. `groupby` + `aggregate` 의 문법