Lesson 14: groupby

Author

최규빈

Published

February 1, 2023

ref

https://guebin.github.io/DV2022/posts/2022-10-24-8wk-12.html#판다스-groupby

imports

import pandas as pd

판다스: Groupby

flights data

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58492 entries, 0 to 58491
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   MONTH      58492 non-null  int64  
 1   DAY        58492 non-null  int64  
 2   WEEKDAY    58492 non-null  int64  
 3   AIRLINE    58492 non-null  object 
 4   ORG_AIR    58492 non-null  object 
 5   DEST_AIR   58492 non-null  object 
 6   SCHED_DEP  58492 non-null  int64  
 7   DEP_DELAY  57659 non-null  float64
 8   AIR_TIME   57474 non-null  float64
 9   DIST       58492 non-null  int64  
 10  SCHED_ARR  58492 non-null  int64  
 11  ARR_DELAY  57474 non-null  float64
 12  DIVERTED   58492 non-null  int64  
 13  CANCELLED  58492 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 6.2+ MB

get_groups

- groupby - 데이터프레임을 여러개의 서브데이터프레임으로 나누는 기술 - 단독으로 쓸 이유는 별로 없다. \(\to\) 그룹을 나누고 each 그룹마다 어떠한 “변수”에 “연산”을 하기 위함.

df.groupby(by="AIRLINE")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f2682a76410>
  • 지금 이것이 항공사별로 데이터프레임이 나누어진 상태임

- 진짜 sub dataframe 으로 나누어져 있는지 확인

grouped = df.groupby(by="AIRLINE")
grouped.groups.keys()
dict_keys(['AA', 'AS', 'B6', 'DL', 'EV', 'F9', 'HA', 'MQ', 'NK', 'OO', 'UA', 'US', 'VX', 'WN'])
display(grouped.get_group('AS'))
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
38 1 1 4 AS PHX SEA 1505 -2.0 155.0 1107 1702 -3.0 0 0
198 1 2 5 AS LAX SEA 2110 5.0 145.0 954 2352 8.0 0 0
241 1 2 5 AS LAS PDX 650 -5.0 117.0 763 906 -3.0 0 0
277 1 2 5 AS ORD ANC 935 -1.0 402.0 2846 1339 -6.0 0 0
397 1 3 6 AS LAS SEA 1300 48.0 137.0 867 1535 47.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58305 12 30 3 AS LAX SEA 1325 -2.0 134.0 954 1608 -7.0 0 0
58355 12 31 4 AS PHX SEA 1200 -5.0 145.0 1107 1407 -24.0 0 0
58404 12 31 4 AS SFO SLC 2110 -2.0 80.0 599 2358 -4.0 0 0
58407 12 31 4 AS SFO PDX 645 -2.0 81.0 550 832 -3.0 0 0
58428 12 31 4 AS LAX SEA 1420 -8.0 127.0 954 1709 -25.0 0 0

768 rows × 14 columns

grouped.groups.keys()
dict_keys(['AA', 'AS', 'B6', 'DL', 'EV', 'F9', 'HA', 'MQ', 'NK', 'OO', 'UA', 'US', 'VX', 'WN'])
#collapse_output
for key in grouped.groups.keys():
    display(grouped.get_group(key))
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
15 1 1 4 AA DEN DFW 1445 -6.0 93.0 641 1745 4.0 0 0
26 1 1 4 AA LAX AUS 1430 33.0 157.0 1242 1925 41.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58470 12 31 4 AA DFW FAT 1020 -3.0 196.0 1313 1156 -2.0 0 0
58475 12 31 4 AA IAH CLT 710 1.0 113.0 912 1037 -12.0 0 0
58476 12 31 4 AA DFW TPA 1020 -3.0 121.0 929 1340 -6.0 0 0
58479 12 31 4 AA DFW ELP 1200 3.0 94.0 551 1250 13.0 0 0
58487 12 31 4 AA SFO DFW 515 5.0 166.0 1464 1045 -19.0 0 0

8900 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
38 1 1 4 AS PHX SEA 1505 -2.0 155.0 1107 1702 -3.0 0 0
198 1 2 5 AS LAX SEA 2110 5.0 145.0 954 2352 8.0 0 0
241 1 2 5 AS LAS PDX 650 -5.0 117.0 763 906 -3.0 0 0
277 1 2 5 AS ORD ANC 935 -1.0 402.0 2846 1339 -6.0 0 0
397 1 3 6 AS LAS SEA 1300 48.0 137.0 867 1535 47.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58305 12 30 3 AS LAX SEA 1325 -2.0 134.0 954 1608 -7.0 0 0
58355 12 31 4 AS PHX SEA 1200 -5.0 145.0 1107 1407 -24.0 0 0
58404 12 31 4 AS SFO SLC 2110 -2.0 80.0 599 2358 -4.0 0 0
58407 12 31 4 AS SFO PDX 645 -2.0 81.0 550 832 -3.0 0 0
58428 12 31 4 AS LAX SEA 1420 -8.0 127.0 954 1709 -25.0 0 0

768 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
123 1 1 4 B6 LAS BOS 1230 0.0 246.0 2381 2026 -27.0 0 0
127 1 1 4 B6 LAS BOS 2359 68.0 247.0 2381 749 46.0 0 0
239 1 2 5 B6 ORD BOS 540 -8.0 96.0 867 856 -22.0 0 0
333 1 3 6 B6 LAX FLL 2237 32.0 270.0 2342 619 42.0 0 0
548 1 4 7 B6 SFO FLL 2307 -4.0 298.0 2583 724 -1.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58262 12 30 3 B6 SFO LGB 1921 -6.0 57.0 354 2038 -14.0 0 0
58301 12 30 3 B6 LAX JFK 630 4.0 285.0 2475 1445 -6.0 0 0
58425 12 31 4 B6 ORD SJU 700 239.0 250.0 2072 1335 239.0 0 0
58477 12 31 4 B6 DFW BOS 1145 12.0 161.0 1562 1608 -14.0 0 0
58483 12 31 4 B6 PHX BOS 2236 -12.0 231.0 2300 515 -45.0 0 0

543 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
53 1 1 4 DL LAS MSP 713 -5.0 156.0 1299 1220 -18.0 0 0
57 1 1 4 DL MSP RSW 700 -1.0 169.0 1416 1130 -20.0 0 0
77 1 1 4 DL LAX ATL 1130 24.0 217.0 1947 1840 16.0 0 0
79 1 1 4 DL LAX CMH 2146 -3.0 223.0 1995 459 -13.0 0 0
85 1 1 4 DL ATL OKC 2059 -4.0 116.0 761 2227 -12.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58440 12 31 4 DL ATL CVG 1611 -4.0 61.0 373 1736 -6.0 0 0
58448 12 31 4 DL ATL SRQ 1610 0.0 61.0 444 1740 -13.0 0 0
58464 12 31 4 DL LAX SFO 700 108.0 54.0 337 825 105.0 0 0
58467 12 31 4 DL ATL IND 1235 -3.0 63.0 432 1407 -13.0 0 0
58485 12 31 4 DL ATL CMH 2206 2.0 64.0 447 2338 -8.0 0 0

10601 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
11 1 1 4 EV ORD JAN 1155 6.0 113.0 677 1403 5.0 0 0
13 1 1 4 EV ORD CMH 1010 -2.0 46.0 296 1228 -9.0 0 0
29 1 1 4 EV ORD IND 1025 -6.0 29.0 177 1228 -19.0 0 0
40 1 1 4 EV IAH CLE 1038 -3.0 126.0 1091 1425 -18.0 0 0
69 1 1 4 EV ATL RAP 1930 -5.0 181.0 1230 2104 -15.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58445 12 31 4 EV DFW TXK 850 -5.0 30.0 181 948 -17.0 0 0
58452 12 31 4 EV DFW SHV 1650 -4.0 32.0 190 1746 -12.0 0 0
58459 12 31 4 EV MSP ORD 1435 18.0 61.0 334 1609 3.0 0 0
58463 12 31 4 EV ORD MSN 1220 18.0 32.0 108 1319 27.0 0 0
58486 12 31 4 EV DFW LFT 850 21.0 52.0 351 1012 14.0 0 0

5858 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
93 1 1 4 F9 ATL DEN 859 16.0 181.0 1199 1026 10.0 0 0
209 1 2 5 F9 MSP DEN 1025 -6.0 97.0 680 1134 -13.0 0 0
232 1 2 5 F9 DEN PHX 2040 -7.0 83.0 602 2228 -18.0 0 0
247 1 2 5 F9 ORD ATL 730 10.0 86.0 606 1020 23.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58288 12 30 3 F9 DEN ORD 625 -4.0 136.0 888 1000 14.0 0 0
58331 12 30 3 F9 ORD PHX 825 18.0 207.0 1440 1127 14.0 0 0
58447 12 31 4 F9 DEN LAS 1245 13.0 94.0 628 1340 13.0 0 0
58449 12 31 4 F9 DEN MCO 645 11.0 169.0 1546 1224 -11.0 0 0
58488 12 31 4 F9 LAS SFO 1910 13.0 71.0 414 2050 4.0 0 0

1317 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
582 1 4 7 HA LAX OGG 1115 -11.0 310.0 2486 1500 -27.0 0 0
712 1 5 1 HA LAS HNL 900 -5.0 357.0 2762 1315 5.0 0 0
878 1 6 2 HA PHX HNL 800 1.0 374.0 2917 1140 3.0 0 0
1053 1 7 3 HA LAX HNL 1705 0.0 332.0 2556 2055 -2.0 0 0
1269 1 8 4 HA LAX HNL 1000 -1.0 335.0 2556 1350 0.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
55883 12 16 3 HA LAX HNL 835 1.0 314.0 2556 1235 -18.0 0 0
56174 12 18 5 HA LAX HNL 835 -5.0 342.0 2556 1235 -4.0 0 0
56350 12 19 6 HA PHX HNL 800 -5.0 363.0 2917 1155 -34.0 0 0
56816 12 21 1 HA LAX LIH 740 20.0 303.0 2615 1145 -11.0 0 0
58391 12 31 4 HA LAX HNL 1000 0.0 324.0 2556 1350 -9.0 0 0

112 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
10 1 1 4 MQ DFW DRO 1335 28.0 104.0 674 1438 28.0 0 0
18 1 1 4 MQ ORD DAY 2220 19.0 37.0 240 23 20.0 0 0
24 1 1 4 MQ DFW BTR 730 NaN NaN 383 853 NaN 0 1
50 1 1 4 MQ ORD CID 1135 -7.0 37.0 196 1238 -15.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58415 12 31 4 MQ ORD FWA 845 -2.0 37.0 157 1045 -4.0 0 0
58426 12 31 4 MQ DFW FAR 1154 4.0 124.0 968 1437 -13.0 0 0
58468 12 31 4 MQ DFW OKC 1720 -3.0 31.0 175 1819 -10.0 0 0
58474 12 31 4 MQ ORD FNT 829 4.0 40.0 223 1034 -4.0 0 0
58484 12 31 4 MQ ORD DSM 1333 1.0 57.0 299 1455 -7.0 0 0

3471 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
17 1 1 4 NK DEN DTW 1952 37.0 124.0 1123 31 54.0 0 0
74 1 1 4 NK PHX DFW 159 -1.0 103.0 868 502 1.0 0 0
95 1 1 4 NK LAS OAK 1115 22.0 62.0 407 1246 10.0 0 0
109 1 1 4 NK MSP ORD 616 2.0 49.0 334 745 -19.0 0 0
166 1 2 5 NK LAS PDX 1535 -8.0 123.0 763 1754 -4.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58160 12 29 2 NK MSP MCO 740 0.0 171.0 1310 1158 33.0 0 0
58197 12 30 3 NK IAH ORD 755 -8.0 136.0 925 1030 -2.0 0 0
58437 12 31 4 NK ORD DFW 1952 15.0 135.0 802 2225 23.0 0 0
58461 12 31 4 NK ORD LGA 1801 -5.0 84.0 733 2109 -26.0 0 0
58469 12 31 4 NK LAS MSY 1950 124.0 163.0 1500 112 101.0 0 0

1516 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
12 1 1 4 OO ORD MSP 1510 2.0 65.0 334 1646 4.0 0 0
16 1 1 4 OO DEN SGU 1105 21.0 66.0 517 1249 20.0 0 0
22 1 1 4 OO LAS LAX 1544 -4.0 39.0 236 1655 -12.0 0 0
25 1 1 4 OO ORD SPI 2110 -4.0 31.0 174 2205 5.0 0 0
27 1 1 4 OO IAH JAC 1104 -1.0 161.0 1265 1316 -1.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58451 12 31 4 OO ATL FWA 1905 -3.0 72.0 508 2051 -14.0 0 0
58480 12 31 4 OO MSP BIS 1310 -2.0 65.0 386 1449 -9.0 0 0
58482 12 31 4 OO DEN CPR 1850 -2.0 38.0 230 1956 1.0 0 0
58489 12 31 4 OO SFO SBA 1846 -6.0 46.0 262 1956 -5.0 0 0
58491 12 31 4 OO SFO BOI 859 5.0 73.0 522 1146 -1.0 0 0

6588 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
14 1 1 4 UA IAH IND 1426 -1.0 102.0 844 1742 -20.0 0 0
21 1 1 4 UA ORD CLE 2102 48.0 47.0 315 2320 41.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58422 12 31 4 UA DEN SAN 1535 0.0 124.0 853 1704 -13.0 0 0
58432 12 31 4 UA ORD SAN 1915 7.0 238.0 1723 2143 -3.0 0 0
58457 12 31 4 UA ORD LAX 659 -1.0 241.0 1744 946 0.0 0 0
58460 12 31 4 UA SFO PHL 2235 -6.0 265.0 2521 700 -42.0 0 0
58481 12 31 4 UA IAH LAX 1433 1.0 197.0 1379 1625 -13.0 0 0

7792 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
31 1 1 4 US PHX DEN 1810 29.0 94.0 602 1954 49.0 0 0
35 1 1 4 US ORD PHL 1600 -2.0 80.0 678 1857 -9.0 0 0
49 1 1 4 US IAH PHX 1445 -1.0 147.0 1009 1638 -7.0 0 0
96 1 1 4 US ATL PHL 1445 -4.0 90.0 666 1644 -11.0 0 0
104 1 1 4 US MSP PHX 730 -3.0 174.0 1276 1010 -20.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
31514 6 30 2 US DEN PHL 705 -4.0 188.0 1558 1240 1.0 0 0
31523 6 30 2 US PHX DEN 1451 6.0 85.0 602 1738 7.0 0 0
31535 6 30 2 US PHX AUS 840 -3.0 116.0 872 1304 -11.0 0 0
31561 6 30 2 US ORD PHX 710 -5.0 170.0 1440 901 -50.0 0 0
31582 6 30 2 US PHX OGG 800 -4.0 356.0 2845 1127 -13.0 0 0

1615 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
56 1 1 4 VX LAS SFO 900 23.0 65.0 414 1035 11.0 0 0
227 1 2 5 VX SFO LAS 1220 -5.0 68.0 414 1350 -5.0 0 0
243 1 2 5 VX SFO SEA 700 -4.0 104.0 679 905 -1.0 0 0
417 1 3 6 VX SFO LAS 900 -2.0 62.0 414 1030 -11.0 0 0
432 1 3 6 VX SFO SEA 2035 -2.0 106.0 679 2240 -2.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58332 12 30 3 VX SFO LAS 1950 -3.0 58.0 414 2120 -4.0 0 0
58383 12 31 4 VX SFO PSP 1630 -7.0 65.0 421 1755 -12.0 0 0
58400 12 31 4 VX SFO LAX 1125 -4.0 54.0 337 1245 -10.0 0 0
58471 12 31 4 VX SFO LAX 700 6.0 51.0 337 820 3.0 0 0
58478 12 31 4 VX SFO LAX 1530 29.0 52.0 337 1650 22.0 0 0

993 rows × 14 columns

MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
19 1 1 4 WN PHX LAX 1640 51.0 58.0 370 1700 59.0 0 0
20 1 1 4 WN ATL BWI 1115 1.0 76.0 577 1305 -15.0 0 0
23 1 1 4 WN ATL HOU 1555 30.0 113.0 696 1720 18.0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58455 12 31 4 WN LAX SMF 1420 -2.0 64.0 373 1540 -7.0 0 0
58458 12 31 4 WN LAS SFO 1825 25.0 67.0 414 1955 17.0 0 0
58472 12 31 4 WN PHX HOU 845 5.0 119.0 1020 1210 7.0 0 0
58473 12 31 4 WN DEN PDX 1205 4.0 130.0 991 1400 -13.0 0 0
58490 12 31 4 WN MSP ATL 525 39.0 124.0 907 855 34.0 0 0

8418 rows × 14 columns

범주형변수를 기준으로 groupby -> agg

df.columns
Index(['MONTH', 'DAY', 'WEEKDAY', 'AIRLINE', 'ORG_AIR', 'DEST_AIR',
       'SCHED_DEP', 'DEP_DELAY', 'AIR_TIME', 'DIST', 'SCHED_ARR', 'ARR_DELAY',
       'DIVERTED', 'CANCELLED'],
      dtype='object')
grouped.get_group('AA').ARR_DELAY.mean(),grouped.get_group('WN').ARR_DELAY.mean()
(5.542660550458716, 6.397352587244284)

# EX1: [AIRLINE] \(\to\) {ARR_DELAY:mean}

- 방법1: grouby() \(\to\) .agg({colname: function})

(예시1)

df.groupby(by="AIRLINE").agg({'ARR_DELAY':np.mean})
ARR_DELAY
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
F9 13.630651
HA 4.972973
MQ 6.860591
NK 18.436070
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353

(예시2)

df.groupby(by="AIRLINE").agg({'ARR_DELAY':'mean'})
ARR_DELAY
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
F9 13.630651
HA 4.972973
MQ 6.860591
NK 18.436070
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353

- 방법2: grouby() \(\to\) key로 column선택 \(\to\) .agg(f) or .f()

(예시1)

df.groupby(by='AIRLINE')['ARR_DELAY'].agg(np.mean)
AIRLINE
AA     5.542661
AS    -0.833333
B6     8.692593
DL     0.339691
EV     7.034580
F9    13.630651
HA     4.972973
MQ     6.860591
NK    18.436070
OO     7.593463
UA     7.765755
US     1.681105
VX     5.348884
WN     6.397353
Name: ARR_DELAY, dtype: float64

(예시2)

df.groupby(by='AIRLINE')['ARR_DELAY'].agg("mean")
AIRLINE
AA     5.542661
AS    -0.833333
B6     8.692593
DL     0.339691
EV     7.034580
F9    13.630651
HA     4.972973
MQ     6.860591
NK    18.436070
OO     7.593463
UA     7.765755
US     1.681105
VX     5.348884
WN     6.397353
Name: ARR_DELAY, dtype: float64

(예시3)

df.groupby(by='AIRLINE')['ARR_DELAY'].mean()
AIRLINE
AA     5.542661
AS    -0.833333
B6     8.692593
DL     0.339691
EV     7.034580
F9    13.630651
HA     4.972973
MQ     6.860591
NK    18.436070
OO     7.593463
UA     7.765755
US     1.681105
VX     5.348884
WN     6.397353
Name: ARR_DELAY, dtype: float64

# EX2: [AIRLINE,WEEKDAY] \(\to\) {CANCELLED:sum}

- 방법1

(예시1)

df.groupby(by=["AIRLINE","WEEKDAY"]).agg({"CANCELLED":np.sum})
CANCELLED
AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
... ... ...
WN 3 18
4 10
5 7
6 10
7 7

98 rows × 1 columns

(예시2)

df.groupby(by=["AIRLINE","WEEKDAY"]).agg({"CANCELLED":"sum"})
CANCELLED
AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
... ... ...
WN 3 18
4 10
5 7
6 10
7 7

98 rows × 1 columns

- 방법2

(예시1)

df.groupby(by=["AIRLINE","WEEKDAY"])[["CANCELLED"]].agg(np.sum)
CANCELLED
AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
... ... ...
WN 3 18
4 10
5 7
6 10
7 7

98 rows × 1 columns

(예시2)

df.groupby(by=["AIRLINE","WEEKDAY"])["CANCELLED"].agg("sum")
AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
                    ..
WN       3          18
         4          10
         5           7
         6          10
         7           7
Name: CANCELLED, Length: 98, dtype: int64

(예시3)

df.groupby(by=["AIRLINE","WEEKDAY"])["CANCELLED"].sum()
AIRLINE  WEEKDAY
AA       1          41
         2           9
         3          16
         4          20
         5          18
                    ..
WN       3          18
         4          10
         5           7
         6          10
         7           7
Name: CANCELLED, Length: 98, dtype: int64
df.DIVERTED
0        0
1        0
2        0
3        0
4        0
        ..
58487    0
58488    0
58489    0
58490    0
58491    0
Name: DIVERTED, Length: 58492, dtype: int64

# EX3: [AIRLINE,WEEKDAY] \(\to\) {CANCELLED:sum,mean}, {DIVERTED: sum,mean}

- 방법1

(예시1)

df.groupby(["AIRLINE","WEEKDAY"])\
.agg({"CANCELLED":[np.sum,np.mean],"DIVERTED":[np.sum,np.mean]})
CANCELLED DIVERTED
sum mean sum mean
AIRLINE WEEKDAY
AA 1 41 0.032106 6 0.004699
2 9 0.007341 2 0.001631
3 16 0.011949 2 0.001494
4 20 0.015004 5 0.003751
5 18 0.014151 1 0.000786
... ... ... ... ... ...
WN 3 18 0.014118 2 0.001569
4 10 0.007911 4 0.003165
5 7 0.005828 0 0.000000
6 10 0.010132 3 0.003040
7 7 0.006066 3 0.002600

98 rows × 4 columns

(예시2)

df.groupby(["AIRLINE","WEEKDAY"])\
.agg({"CANCELLED":["sum","mean"],"DIVERTED":["sum","mean"]})
CANCELLED DIVERTED
sum mean sum mean
AIRLINE WEEKDAY
AA 1 41 0.032106 6 0.004699
2 9 0.007341 2 0.001631
3 16 0.011949 2 0.001494
4 20 0.015004 5 0.003751
5 18 0.014151 1 0.000786
... ... ... ... ... ...
WN 3 18 0.014118 2 0.001569
4 10 0.007911 4 0.003165
5 7 0.005828 0 0.000000
6 10 0.010132 3 0.003040
7 7 0.006066 3 0.002600

98 rows × 4 columns

- 방법2

(예시1)

df.groupby(["AIRLINE","WEEKDAY"])[["CANCELLED","DIVERTED"]]\
.agg([np.sum,np.mean])
CANCELLED DIVERTED
sum mean sum mean
AIRLINE WEEKDAY
AA 1 41 0.032106 6 0.004699
2 9 0.007341 2 0.001631
3 16 0.011949 2 0.001494
4 20 0.015004 5 0.003751
5 18 0.014151 1 0.000786
... ... ... ... ... ...
WN 3 18 0.014118 2 0.001569
4 10 0.007911 4 0.003165
5 7 0.005828 0 0.000000
6 10 0.010132 3 0.003040
7 7 0.006066 3 0.002600

98 rows × 4 columns

(예시2)

df.groupby(["AIRLINE","WEEKDAY"])[["CANCELLED","DIVERTED"]]\
.agg(["sum","mean"])
CANCELLED DIVERTED
sum mean sum mean
AIRLINE WEEKDAY
AA 1 41 0.032106 6 0.004699
2 9 0.007341 2 0.001631
3 16 0.011949 2 0.001494
4 20 0.015004 5 0.003751
5 18 0.014151 1 0.000786
... ... ... ... ... ...
WN 3 18 0.014118 2 0.001569
4 10 0.007911 4 0.003165
5 7 0.005828 0 0.000000
6 10 0.010132 3 0.003040
7 7 0.006066 3 0.002600

98 rows × 4 columns

(예시3) – 사용불가능

# EX4: [AIRLINE,WEEKDAY] \(\to\) {CANCELLED:sum,mean,count}, {AIR_TIME: mean,var}

- 방법1

(예시1)

df.groupby(["AIRLINE","WEEKDAY"])\
.agg({'CANCELLED':[np.sum,np.mean,len],'AIR_TIME':[np.mean,np.var]})
CANCELLED AIR_TIME
sum mean len mean var
AIRLINE WEEKDAY
AA 1 41 0.032106 1277 147.610569 5393.806723
2 9 0.007341 1226 143.851852 5359.890719
3 16 0.011949 1339 144.514005 5378.854539
4 20 0.015004 1333 141.124618 4791.524627
5 18 0.014151 1272 145.430966 5884.592076
... ... ... ... ... ... ...
WN 3 18 0.014118 1275 104.219920 2901.873447
4 10 0.007911 1264 107.200800 2966.568935
5 7 0.005828 1201 107.893635 3268.717093
6 10 0.010132 987 109.247433 3152.753719
7 7 0.006066 1154 107.602273 3183.126889

98 rows × 5 columns

(예시2)

df.groupby(["AIRLINE","WEEKDAY"])\
.agg({'CANCELLED':["sum","mean","count"],'AIR_TIME':["mean","var"]})
CANCELLED AIR_TIME
sum mean count mean var
AIRLINE WEEKDAY
AA 1 41 0.032106 1277 147.610569 5393.806723
2 9 0.007341 1226 143.851852 5359.890719
3 16 0.011949 1339 144.514005 5378.854539
4 20 0.015004 1333 141.124618 4791.524627
5 18 0.014151 1272 145.430966 5884.592076
... ... ... ... ... ... ...
WN 3 18 0.014118 1275 104.219920 2901.873447
4 10 0.007911 1264 107.200800 2966.568935
5 7 0.005828 1201 107.893635 3268.717093
6 10 0.010132 987 109.247433 3152.753719
7 7 0.006066 1154 107.602273 3183.126889

98 rows × 5 columns

(사용자정의함수)

df.groupby(["AIRLINE","WEEKDAY"])\
.agg({'CANCELLED':[np.sum,np.mean,len],
      'AIR_TIME':[np.mean,lambda x: np.std(x,ddof=1)**2]})
CANCELLED AIR_TIME
sum mean len mean <lambda_0>
AIRLINE WEEKDAY
AA 1 41 0.032106 1277 147.610569 5393.806723
2 9 0.007341 1226 143.851852 5359.890719
3 16 0.011949 1339 144.514005 5378.854539
4 20 0.015004 1333 141.124618 4791.524627
5 18 0.014151 1272 145.430966 5884.592076
... ... ... ... ... ... ...
WN 3 18 0.014118 1275 104.219920 2901.873447
4 10 0.007911 1264 107.200800 2966.568935
5 7 0.005828 1201 107.893635 3268.717093
6 10 0.010132 987 109.247433 3152.753719
7 7 0.006066 1154 107.602273 3183.126889

98 rows × 5 columns

연속형변수를 기준으로 groupby -> agg

df.T
0 1 2 3 4 5 6 7 8 9 ... 58482 58483 58484 58485 58486 58487 58488 58489 58490 58491
MONTH 1 1 1 1 1 1 1 1 1 1 ... 12 12 12 12 12 12 12 12 12 12
DAY 1 1 1 1 1 1 1 1 1 1 ... 31 31 31 31 31 31 31 31 31 31
WEEKDAY 4 4 4 4 4 4 4 4 4 4 ... 4 4 4 4 4 4 4 4 4 4
AIRLINE WN UA MQ AA WN UA AA F9 AA UA ... OO B6 MQ DL EV AA F9 OO WN OO
ORG_AIR LAX DEN DFW DFW LAX IAH DFW SFO ORD IAH ... DEN PHX ORD ATL DFW SFO LAS SFO MSP SFO
DEST_AIR SLC IAD VPS DCA MCI SAN MSY PHX STL SJC ... CPR BOS DSM CMH LFT DFW SFO SBA ATL BOI
SCHED_DEP 1625 823 1305 1555 1720 1450 1250 1020 1845 925 ... 1850 2236 1333 2206 850 515 1910 1846 525 859
DEP_DELAY 58.0 7.0 36.0 7.0 48.0 1.0 84.0 -7.0 -5.0 3.0 ... -2.0 -12.0 1.0 2.0 21.0 5.0 13.0 -6.0 39.0 5.0
AIR_TIME 94.0 154.0 85.0 126.0 166.0 178.0 64.0 91.0 44.0 215.0 ... 38.0 231.0 57.0 64.0 52.0 166.0 71.0 46.0 124.0 73.0
DIST 590 1452 641 1192 1363 1303 447 651 258 1608 ... 230 2300 299 447 351 1464 414 262 907 522
SCHED_ARR 1905 1333 1453 1935 2225 1620 1410 1315 1950 1136 ... 1956 515 1455 2338 1012 1045 2050 1956 855 1146
ARR_DELAY 65.0 -13.0 35.0 -7.0 39.0 -14.0 83.0 -6.0 -5.0 -14.0 ... 1.0 -45.0 -7.0 -8.0 14.0 -19.0 4.0 -5.0 34.0 -1.0
DIVERTED 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
CANCELLED 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

14 rows × 58492 columns

df.DIST.describe()
count    58492.000000
mean       872.900072
std        624.996805
min         67.000000
25%        391.000000
50%        690.000000
75%       1199.000000
max       4502.000000
Name: DIST, dtype: float64
df.assign(DIST2 = pd.cut(df.DIST,[-np.inf,391,690,1199,np.inf]))\
.groupby(["AIRLINE","DIST2"]).agg({'CANCELLED':["sum","mean","count"]})
CANCELLED
sum mean count
AIRLINE DIST2
AA (-inf, 391.0] 18 0.015986 1126
(391.0, 690.0] 17 0.013589 1251
(690.0, 1199.0] 69 0.022066 3127
(1199.0, inf] 50 0.014723 3396
AS (-inf, 391.0] 0 NaN 0
(391.0, 690.0] 0 0.000000 145
(690.0, 1199.0] 0 0.000000 462
(1199.0, inf] 0 0.000000 161
B6 (-inf, 391.0] 0 0.000000 71
(391.0, 690.0] 0 0.000000 38
(690.0, 1199.0] 0 0.000000 61
(1199.0, inf] 1 0.002681 373
DL (-inf, 391.0] 7 0.003086 2268
(391.0, 690.0] 8 0.002421 3304
(690.0, 1199.0] 16 0.006405 2498
(1199.0, inf] 7 0.002766 2531
EV (-inf, 391.0] 77 0.028785 2675
(391.0, 690.0] 47 0.022793 2062
(690.0, 1199.0] 22 0.019982 1101
(1199.0, inf] 0 0.000000 20
F9 (-inf, 391.0] 0 0.000000 27
(391.0, 690.0] 6 0.013825 434
(690.0, 1199.0] 4 0.007105 563
(1199.0, inf] 0 0.000000 293
HA (-inf, 391.0] 0 NaN 0
(391.0, 690.0] 0 NaN 0
(690.0, 1199.0] 0 NaN 0
(1199.0, inf] 0 0.000000 112
MQ (-inf, 391.0] 90 0.047120 1910
(391.0, 690.0] 39 0.037356 1044
(690.0, 1199.0] 22 0.044266 497
(1199.0, inf] 1 0.050000 20
NK (-inf, 391.0] 5 0.036496 137
(391.0, 690.0] 4 0.013201 303
(690.0, 1199.0] 6 0.011029 544
(1199.0, inf] 10 0.018797 532
OO (-inf, 391.0] 75 0.024826 3021
(391.0, 690.0] 39 0.019364 2014
(690.0, 1199.0] 19 0.016351 1162
(1199.0, inf] 9 0.023018 391
UA (-inf, 391.0] 5 0.007143 700
(391.0, 690.0] 14 0.011824 1184
(690.0, 1199.0] 26 0.010924 2380
(1199.0, inf] 48 0.013605 3528
US (-inf, 391.0] 0 0.000000 254
(391.0, 690.0] 7 0.021944 319
(690.0, 1199.0] 2 0.006329 316
(1199.0, inf] 12 0.016529 726
VX (-inf, 391.0] 2 0.008299 241
(391.0, 690.0] 1 0.003861 259
(690.0, 1199.0] 0 0.000000 22
(1199.0, inf] 3 0.006369 471
WN (-inf, 391.0] 55 0.023810 2310
(391.0, 690.0] 14 0.006487 2158
(690.0, 1199.0] 17 0.007896 2153
(1199.0, inf] 7 0.003895 1797
pd.cut(df.DIST,[-np.inf,400,700,1200,np.inf],labels=['~400','400~700','700~1200','1200~'])
0         400~700
1           1200~
2         400~700
3        700~1200
4           1200~
           ...   
58487       1200~
58488     400~700
58489        ~400
58490    700~1200
58491     400~700
Name: DIST, Length: 58492, dtype: category
Categories (4, object): ['~400' < '400~700' < '700~1200' < '1200~']
df.assign(DIST2 = pd.cut(df.DIST,[-np.inf,400,700,1200,np.inf],labels=['~400','400~700','700~1200','1200~']))\
.groupby(["AIRLINE","DIST2"]).agg({'CANCELLED':["sum","mean","count"]})
CANCELLED
sum mean count
AIRLINE DIST2
AA ~400 18 0.015986 1126
400~700 17 0.013589 1251
700~1200 69 0.022066 3127
1200~ 50 0.014723 3396
AS ~400 0 NaN 0
400~700 0 0.000000 145
700~1200 0 0.000000 462
1200~ 0 0.000000 161
B6 ~400 0 0.000000 71
400~700 0 0.000000 38
700~1200 0 0.000000 61
1200~ 1 0.002681 373
DL ~400 7 0.003040 2303
400~700 8 0.002352 3402
700~1200 16 0.006765 2365
1200~ 7 0.002766 2531
EV ~400 77 0.027838 2766
400~700 48 0.023312 2059
700~1200 21 0.020731 1013
1200~ 0 0.000000 20
F9 ~400 0 0.000000 27
400~700 7 0.015837 442
700~1200 3 0.005405 555
1200~ 0 0.000000 293
HA ~400 0 NaN 0
400~700 0 NaN 0
700~1200 0 NaN 0
1200~ 0 0.000000 112
MQ ~400 92 0.047472 1938
400~700 39 0.035682 1093
700~1200 20 0.047619 420
1200~ 1 0.050000 20
NK ~400 5 0.036496 137
400~700 4 0.013201 303
700~1200 6 0.011029 544
1200~ 10 0.018797 532
OO ~400 76 0.024837 3060
400~700 38 0.018673 2035
700~1200 19 0.017241 1102
1200~ 9 0.023018 391
UA ~400 5 0.006993 715
400~700 14 0.011966 1170
700~1200 26 0.010929 2379
1200~ 48 0.013605 3528
US ~400 0 0.000000 254
400~700 7 0.021944 319
700~1200 2 0.006329 316
1200~ 12 0.016529 726
VX ~400 2 0.008299 241
400~700 1 0.003861 259
700~1200 0 0.000000 22
1200~ 3 0.006369 471
WN ~400 55 0.023022 2389
400~700 17 0.007795 2181
700~1200 14 0.006826 2051
1200~ 7 0.003895 1797

HRDataset_v14 자료분석

아래의 코드를 활용하여 Kaggle의 HRdataset을 불러오라.

df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/HRDataset_v14.csv')
df
Employee_Name EmpID MarriedID MaritalStatusID GenderID EmpStatusID DeptID PerfScoreID FromDiversityJobFairID Salary ... ManagerName ManagerID RecruitmentSource PerformanceScore EngagementSurvey EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date DaysLateLast30 Absences
0 Adinolfi, Wilson K 10026 0 0 1 1 5 4 0 62506 ... Michael Albert 22.0 LinkedIn Exceeds 4.60 5 0 1/17/2019 0 1
1 Ait Sidi, Karthikeyan 10084 1 1 1 5 3 3 0 104437 ... Simon Roup 4.0 Indeed Fully Meets 4.96 3 6 2/24/2016 0 17
2 Akinkuolie, Sarah 10196 1 1 0 5 5 3 0 64955 ... Kissy Sullivan 20.0 LinkedIn Fully Meets 3.02 3 0 5/15/2012 0 3
3 Alagbe,Trina 10088 1 1 0 1 5 3 0 64991 ... Elijiah Gray 16.0 Indeed Fully Meets 4.84 5 0 1/3/2019 0 15
4 Anderson, Carol 10069 0 2 0 5 5 3 0 50825 ... Webster Butler 39.0 Google Search Fully Meets 5.00 4 0 2/1/2016 0 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
306 Woodson, Jason 10135 0 0 1 1 5 3 0 65893 ... Kissy Sullivan 20.0 LinkedIn Fully Meets 4.07 4 0 2/28/2019 0 13
307 Ybarra, Catherine 10301 0 0 0 5 5 1 0 48513 ... Brannon Miller 12.0 Google Search PIP 3.20 2 0 9/2/2015 5 4
308 Zamora, Jennifer 10010 0 0 0 1 3 4 0 220450 ... Janet King 2.0 Employee Referral Exceeds 4.60 5 6 2/21/2019 0 16
309 Zhou, Julia 10043 0 0 0 1 3 3 0 89292 ... Simon Roup 4.0 Employee Referral Fully Meets 5.00 3 5 2/1/2019 0 11
310 Zima, Colleen 10271 0 4 0 1 5 3 0 45046 ... David Stanley 14.0 LinkedIn Fully Meets 4.50 5 0 1/30/2019 0 2

311 rows × 36 columns

df.groupby('RaceDesc')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f268209c890>

데이터를 조사하고 올바르게 분석한 사람을 모두 고르라. (모두 맞칠경우만 정답으로 인정)

  • 소윤: 근무인원수가 가장 많은 인종(RaceDesc)은 ’White’이며 이는 ’Asian’인종과 ’Black or African American’의 합보다 많다.
  • 다호: ’RaceDesc==White’의 성별(Sex)임금차이는 2000이상이다.
  • 하니: 퇴직한사람(Termd==1)은 모두 104명이며 백인여성의 퇴직자수가 가장 많다.
  • 도한: 퇴직한사람중 아시아인의 비율은 10%가 넘지 않는다.

(풀이)

모두 참이다.

데이터조사

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Employee_Name               311 non-null    object 
 1   EmpID                       311 non-null    int64  
 2   MarriedID                   311 non-null    int64  
 3   MaritalStatusID             311 non-null    int64  
 4   GenderID                    311 non-null    int64  
 5   EmpStatusID                 311 non-null    int64  
 6   DeptID                      311 non-null    int64  
 7   PerfScoreID                 311 non-null    int64  
 8   FromDiversityJobFairID      311 non-null    int64  
 9   Salary                      311 non-null    int64  
 10  Termd                       311 non-null    int64  
 11  PositionID                  311 non-null    int64  
 12  Position                    311 non-null    object 
 13  State                       311 non-null    object 
 14  Zip                         311 non-null    int64  
 15  DOB                         311 non-null    object 
 16  Sex                         311 non-null    object 
 17  MaritalDesc                 311 non-null    object 
 18  CitizenDesc                 311 non-null    object 
 19  HispanicLatino              311 non-null    object 
 20  RaceDesc                    311 non-null    object 
 21  DateofHire                  311 non-null    object 
 22  DateofTermination           104 non-null    object 
 23  TermReason                  311 non-null    object 
 24  EmploymentStatus            311 non-null    object 
 25  Department                  311 non-null    object 
 26  ManagerName                 311 non-null    object 
 27  ManagerID                   303 non-null    float64
 28  RecruitmentSource           311 non-null    object 
 29  PerformanceScore            311 non-null    object 
 30  EngagementSurvey            311 non-null    float64
 31  EmpSatisfaction             311 non-null    int64  
 32  SpecialProjectsCount        311 non-null    int64  
 33  LastPerformanceReview_Date  311 non-null    object 
 34  DaysLateLast30              311 non-null    int64  
 35  Absences                    311 non-null    int64  
dtypes: float64(2), int64(16), object(18)
memory usage: 87.6+ KB
  • EmpID가 missing이 없는 열임

소윤: 근무인원수가 가장 많은 인종(RaceDesc)은 ’White’이며 이는 ’Asian’인종과 ’Black or African American’의 합보다 많다. — 참

df.groupby(by='RaceDesc').agg({'EmpID':len})
EmpID
RaceDesc
American Indian or Alaska Native 3
Asian 29
Black or African American 80
Hispanic 1
Two or more races 11
White 187
29+80
109

다호: ’RaceDesc==White’의 성별(Sex)임금차이는 2000이상이다. — 참

df.groupby(by=['RaceDesc','Sex']).agg({'Salary':np.mean})
Salary
RaceDesc Sex
American Indian or Alaska Native F 63436.500000
M 70545.000000
Asian F 67520.117647
M 69939.416667
Black or African American F 66963.829787
M 85066.121212
Hispanic M 83667.000000
Two or more races F 58068.500000
M 62313.800000
White F 68846.519231
M 65334.132530
68846.519231 - 65334.132530
3512.386700999996

하니: 퇴직한사람(Termd==1)은 모두 104명이며 백인여성의 퇴직자수가 가장 많다. — 참

(df.Termd==1).sum()
104
df.groupby(by=['RaceDesc','Sex']).agg({'Termd':np.sum})
Termd
RaceDesc Sex
American Indian or Alaska Native F 0
M 0
Asian F 6
M 3
Black or African American F 15
M 14
Hispanic M 0
Two or more races F 2
M 1
White F 37
M 26

도한: 퇴직한사람중 아시아인의 비율은 10%가 넘지 않는다. — 참

(df.Termd==1).sum()
104
df.groupby(by=['RaceDesc']).agg({'Termd':np.sum})
Termd
RaceDesc
American Indian or Alaska Native 0
Asian 9
Black or African American 29
Hispanic 0
Two or more races 3
White 63