Lesson 15: tidydata

Author

최규빈

Published

February 1, 2023

ref

https://guebin.github.io/DV2022/posts/2022-11-07-10wk-1.html#판다스-melt-stack

imports

import pandas as pd

판다스: melt, stack

melt

- 설명:

- 데이터

df=pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/phone.csv')
df
Date Samsung Apple Huawei Xiaomi Oppo Mobicel Motorola LG Others Realme Google Nokia Lenovo OnePlus Sony Asus
0 2019-10 461 324 136 109 76 81 43 37 135 28 39 14 22 17 20 17
1 2019-11 461 358 167 141 86 61 29 36 141 27 29 20 23 10 19 27
2 2019-12 426 383 143 105 53 45 51 48 129 30 20 26 28 18 18 19
3 2020-01 677 494 212 187 110 79 65 49 158 23 13 19 19 22 27 22
4 2020-02 593 520 217 195 112 67 62 71 157 25 18 16 24 18 23 20
5 2020-03 637 537 246 187 92 66 59 67 145 21 16 24 18 31 22 14
6 2020-04 647 583 222 154 98 59 48 64 113 20 23 25 19 19 23 21
7 2020-05 629 518 192 176 91 87 50 66 150 43 27 15 18 19 19 13
8 2020-06 663 552 209 185 93 69 54 60 140 39 16 16 17 29 25 16
9 2020-07 599 471 214 193 89 78 65 59 130 40 27 25 21 18 18 12
10 2020-08 615 567 204 182 105 82 62 42 129 47 16 23 21 27 23 20
11 2020-09 621 481 230 220 102 88 56 49 143 54 14 15 17 15 19 15
12 2020-10 637 555 232 203 90 52 63 49 140 33 17 20 22 9 22 21

- 사용예시1: 기본사용

df.melt()
variable value
0 Date 2019-10
1 Date 2019-11
2 Date 2019-12
3 Date 2020-01
4 Date 2020-02
... ... ...
216 Asus 16
217 Asus 12
218 Asus 20
219 Asus 15
220 Asus 21

221 rows × 2 columns

  • variable: column name들이 들어간다.
  • value: column name에 대응하는 값들이 들어간다.

- 사용예시2: id_vars –> tidy data

df.melt(id_vars='Date')
Date variable value
0 2019-10 Samsung 461
1 2019-11 Samsung 461
2 2019-12 Samsung 426
3 2020-01 Samsung 677
4 2020-02 Samsung 593
... ... ... ...
203 2020-06 Asus 16
204 2020-07 Asus 12
205 2020-08 Asus 20
206 2020-09 Asus 15
207 2020-10 Asus 21

208 rows × 3 columns

- 사용예시3:

df.set_index('Date').melt()
variable value
0 Samsung 461
1 Samsung 461
2 Samsung 426
3 Samsung 677
4 Samsung 593
... ... ...
203 Asus 16
204 Asus 12
205 Asus 20
206 Asus 15
207 Asus 21

208 rows × 2 columns

- 사용예시4: ignore_index=False

df.melt(id_vars='Date')
Date variable value
0 2019-10 Samsung 461
1 2019-11 Samsung 461
2 2019-12 Samsung 426
3 2020-01 Samsung 677
4 2020-02 Samsung 593
... ... ... ...
203 2020-06 Asus 16
204 2020-07 Asus 12
205 2020-08 Asus 20
206 2020-09 Asus 15
207 2020-10 Asus 21

208 rows × 3 columns

df.set_index('Date').melt(ignore_index=False).reset_index()
Date variable value
0 2019-10 Samsung 461
1 2019-11 Samsung 461
2 2019-12 Samsung 426
3 2020-01 Samsung 677
4 2020-02 Samsung 593
... ... ... ...
203 2020-06 Asus 16
204 2020-07 Asus 12
205 2020-08 Asus 20
206 2020-09 Asus 15
207 2020-10 Asus 21

208 rows × 3 columns

stack

- 설명:

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')\
.groupby(["AIRLINE","WEEKDAY"]).agg({"CANCELLED":[np.mean,"count"],"DIVERTED":[np.mean,"count"]})
df
CANCELLED DIVERTED
mean count mean count
AIRLINE WEEKDAY
AA 1 0.032106 1277 0.004699 1277
2 0.007341 1226 0.001631 1226
3 0.011949 1339 0.001494 1339
4 0.015004 1333 0.003751 1333
5 0.014151 1272 0.000786 1272
... ... ... ... ... ...
WN 3 0.014118 1275 0.001569 1275
4 0.007911 1264 0.003165 1264
5 0.005828 1201 0.000000 1201
6 0.010132 987 0.003040 987
7 0.006066 1154 0.002600 1154

98 rows × 4 columns

- 사용예시1

df.stack()
CANCELLED DIVERTED
AIRLINE WEEKDAY
AA 1 mean 0.032106 0.004699
count 1277.000000 1277.000000
2 mean 0.007341 0.001631
count 1226.000000 1226.000000
3 mean 0.011949 0.001494
... ... ... ... ...
WN 5 count 1201.000000 1201.000000
6 mean 0.010132 0.003040
count 987.000000 987.000000
7 mean 0.006066 0.002600
count 1154.000000 1154.000000

196 rows × 2 columns

- 사용예시2

df.stack().stack().reset_index().rename({0:'value'},axis=1)
#df.stack().stack().reset_index().rename(columns={'level_2':'aggtype'})
AIRLINE WEEKDAY level_2 level_3 value
0 AA 1 mean CANCELLED 0.032106
1 AA 1 mean DIVERTED 0.004699
2 AA 1 count CANCELLED 1277.000000
3 AA 1 count DIVERTED 1277.000000
4 AA 2 mean CANCELLED 0.007341
... ... ... ... ... ...
387 WN 6 count DIVERTED 987.000000
388 WN 7 mean CANCELLED 0.006066
389 WN 7 mean DIVERTED 0.002600
390 WN 7 count CANCELLED 1154.000000
391 WN 7 count DIVERTED 1154.000000

392 rows × 5 columns

- 사용예시3 (unstack)

df.stack().unstack()
CANCELLED DIVERTED
mean count mean count
AIRLINE WEEKDAY
AA 1 0.032106 1277.0 0.004699 1277.0
2 0.007341 1226.0 0.001631 1226.0
3 0.011949 1339.0 0.001494 1339.0
4 0.015004 1333.0 0.003751 1333.0
5 0.014151 1272.0 0.000786 1272.0
... ... ... ... ... ...
WN 3 0.014118 1275.0 0.001569 1275.0
4 0.007911 1264.0 0.003165 1264.0
5 0.005828 1201.0 0.000000 1201.0
6 0.010132 987.0 0.003040 987.0
7 0.006066 1154.0 0.002600 1154.0

98 rows × 4 columns

melt, stack 을 set_index와 reset_index와 함께 사용하면 tidydata를 만들기 용이하다.

tidydata

tidydata의 정의

- 느낌: ggplot으로 그림 그리기 좋은 데이터 + pandas로 query, group by 등을 쓰기 좋은 자료

- 정의: https://r4ds.had.co.nz/tidy-data.html

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

예시1 (tidy data)

obs x y shape color
0 0 0 ‘star’ ‘F’
1 0 1 ‘circ’ ‘F’
2 1 0 ‘star’ ‘M’
3 1 1 ‘circ’ ‘M’

예시2 (tidy data x)

shape=star shape=circ
color=F (0,0) (0,1)
color=M (1,0) (1,1)

예제1: wide df

- data

df=pd.read_csv('https://raw.githubusercontent.com/guebin/2021DV/master/_notebooks/phone.csv')
df
Date Samsung Apple Huawei Xiaomi Oppo Mobicel Motorola LG Others Realme Google Nokia Lenovo OnePlus Sony Asus
0 2019-10 461 324 136 109 76 81 43 37 135 28 39 14 22 17 20 17
1 2019-11 461 358 167 141 86 61 29 36 141 27 29 20 23 10 19 27
2 2019-12 426 383 143 105 53 45 51 48 129 30 20 26 28 18 18 19
3 2020-01 677 494 212 187 110 79 65 49 158 23 13 19 19 22 27 22
4 2020-02 593 520 217 195 112 67 62 71 157 25 18 16 24 18 23 20
5 2020-03 637 537 246 187 92 66 59 67 145 21 16 24 18 31 22 14
6 2020-04 647 583 222 154 98 59 48 64 113 20 23 25 19 19 23 21
7 2020-05 629 518 192 176 91 87 50 66 150 43 27 15 18 19 19 13
8 2020-06 663 552 209 185 93 69 54 60 140 39 16 16 17 29 25 16
9 2020-07 599 471 214 193 89 78 65 59 130 40 27 25 21 18 18 12
10 2020-08 615 567 204 182 105 82 62 42 129 47 16 23 21 27 23 20
11 2020-09 621 481 230 220 102 88 56 49 143 54 14 15 17 15 19 15
12 2020-10 637 555 232 203 90 52 63 49 140 33 17 20 22 9 22 21
  • tidy data 아님
  • 정의에 의한 판단: 하나의 observation이 하나의 행을 차지하고 있지 않음.
  • 직관적인 판단: 회사별로 색을 다르게 하여 x:‘Date’, y:’판매량’을 하고 싶다면?

- tidydata로 변환 (melt는 너무 쉬우니까 stack으로 해보자)

df.set_index('Date').stack().reset_index().rename({'level_1':'Company',0:'Sales'},axis=1)
Date Company Sales
0 2019-10 Samsung 461
1 2019-10 Apple 324
2 2019-10 Huawei 136
3 2019-10 Xiaomi 109
4 2019-10 Oppo 76
... ... ... ...
203 2020-10 Nokia 20
204 2020-10 Lenovo 22
205 2020-10 OnePlus 9
206 2020-10 Sony 22
207 2020-10 Asus 21

208 rows × 3 columns

예제2: multi-indexed data

- 데이터

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')\
.groupby(["AIRLINE","WEEKDAY"]).agg({"CANCELLED":[np.mean,"count"],"DIVERTED":[np.mean,"count"]})
df
CANCELLED DIVERTED
mean count mean count
AIRLINE WEEKDAY
AA 1 0.032106 1277 0.004699 1277
2 0.007341 1226 0.001631 1226
3 0.011949 1339 0.001494 1339
4 0.015004 1333 0.003751 1333
5 0.014151 1272 0.000786 1272
... ... ... ... ... ...
WN 3 0.014118 1275 0.001569 1275
4 0.007911 1264 0.003165 1264
5 0.005828 1201 0.000000 1201
6 0.010132 987 0.003040 987
7 0.006066 1154 0.002600 1154

98 rows × 4 columns

  • tidy data 아님
  • 정의에 의한 판단: 하나의 셀에 여러 관측치가 있음 (표안의 표 느낌)
  • 직관적인 판단: WEEKDAY == 4 and mean(CANCELLED) > 0.001 인 자료를 뽑고 싶다면?

- tidydata로 변환 (stack으로 풀면 너무 쉬우니까 melt로 해보자)

df.melt(ignore_index=False).reset_index()
AIRLINE WEEKDAY variable_0 variable_1 value
0 AA 1 CANCELLED mean 0.032106
1 AA 2 CANCELLED mean 0.007341
2 AA 3 CANCELLED mean 0.011949
3 AA 4 CANCELLED mean 0.015004
4 AA 5 CANCELLED mean 0.014151
... ... ... ... ... ...
387 WN 3 DIVERTED count 1275.000000
388 WN 4 DIVERTED count 1264.000000
389 WN 5 DIVERTED count 1201.000000
390 WN 6 DIVERTED count 987.000000
391 WN 7 DIVERTED count 1154.000000

392 rows × 5 columns