ref
https://guebin.github.io/DV2022/posts/2022-10-24-8wk-12.html#판다스-groupby
imports
판다스: Groupby
flights data
df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/flights.csv')
df.info()<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58492 entries, 0 to 58491
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MONTH 58492 non-null int64
1 DAY 58492 non-null int64
2 WEEKDAY 58492 non-null int64
3 AIRLINE 58492 non-null object
4 ORG_AIR 58492 non-null object
5 DEST_AIR 58492 non-null object
6 SCHED_DEP 58492 non-null int64
7 DEP_DELAY 57659 non-null float64
8 AIR_TIME 57474 non-null float64
9 DIST 58492 non-null int64
10 SCHED_ARR 58492 non-null int64
11 ARR_DELAY 57474 non-null float64
12 DIVERTED 58492 non-null int64
13 CANCELLED 58492 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 6.2+ MB
get_groups
- groupby - 데이터프레임을 여러개의 서브데이터프레임으로 나누는 기술 - 단독으로 쓸 이유는 별로 없다. \(\to\) 그룹을 나누고 each 그룹마다 어떠한 “변수”에 “연산”을 하기 위함.
- 지금 이것이 항공사별로 데이터프레임이 나누어진 상태임
- 진짜 sub dataframe 으로 나누어져 있는지 확인
dict_keys(['AA', 'AS', 'B6', 'DL', 'EV', 'F9', 'HA', 'MQ', 'NK', 'OO', 'UA', 'US', 'VX', 'WN'])
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 1 | 1 | 4 | AS | PHX | SEA | 1505 | -2.0 | 155.0 | 1107 | 1702 | -3.0 | 0 | 0 |
| 198 | 1 | 2 | 5 | AS | LAX | SEA | 2110 | 5.0 | 145.0 | 954 | 2352 | 8.0 | 0 | 0 |
| 241 | 1 | 2 | 5 | AS | LAS | PDX | 650 | -5.0 | 117.0 | 763 | 906 | -3.0 | 0 | 0 |
| 277 | 1 | 2 | 5 | AS | ORD | ANC | 935 | -1.0 | 402.0 | 2846 | 1339 | -6.0 | 0 | 0 |
| 397 | 1 | 3 | 6 | AS | LAS | SEA | 1300 | 48.0 | 137.0 | 867 | 1535 | 47.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58305 | 12 | 30 | 3 | AS | LAX | SEA | 1325 | -2.0 | 134.0 | 954 | 1608 | -7.0 | 0 | 0 |
| 58355 | 12 | 31 | 4 | AS | PHX | SEA | 1200 | -5.0 | 145.0 | 1107 | 1407 | -24.0 | 0 | 0 |
| 58404 | 12 | 31 | 4 | AS | SFO | SLC | 2110 | -2.0 | 80.0 | 599 | 2358 | -4.0 | 0 | 0 |
| 58407 | 12 | 31 | 4 | AS | SFO | PDX | 645 | -2.0 | 81.0 | 550 | 832 | -3.0 | 0 | 0 |
| 58428 | 12 | 31 | 4 | AS | LAX | SEA | 1420 | -8.0 | 127.0 | 954 | 1709 | -25.0 | 0 | 0 |
768 rows × 14 columns
dict_keys(['AA', 'AS', 'B6', 'DL', 'EV', 'F9', 'HA', 'MQ', 'NK', 'OO', 'UA', 'US', 'VX', 'WN'])
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1 | 1 | 4 | AA | DFW | DCA | 1555 | 7.0 | 126.0 | 1192 | 1935 | -7.0 | 0 | 0 |
| 6 | 1 | 1 | 4 | AA | DFW | MSY | 1250 | 84.0 | 64.0 | 447 | 1410 | 83.0 | 0 | 0 |
| 8 | 1 | 1 | 4 | AA | ORD | STL | 1845 | -5.0 | 44.0 | 258 | 1950 | -5.0 | 0 | 0 |
| 15 | 1 | 1 | 4 | AA | DEN | DFW | 1445 | -6.0 | 93.0 | 641 | 1745 | 4.0 | 0 | 0 |
| 26 | 1 | 1 | 4 | AA | LAX | AUS | 1430 | 33.0 | 157.0 | 1242 | 1925 | 41.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58470 | 12 | 31 | 4 | AA | DFW | FAT | 1020 | -3.0 | 196.0 | 1313 | 1156 | -2.0 | 0 | 0 |
| 58475 | 12 | 31 | 4 | AA | IAH | CLT | 710 | 1.0 | 113.0 | 912 | 1037 | -12.0 | 0 | 0 |
| 58476 | 12 | 31 | 4 | AA | DFW | TPA | 1020 | -3.0 | 121.0 | 929 | 1340 | -6.0 | 0 | 0 |
| 58479 | 12 | 31 | 4 | AA | DFW | ELP | 1200 | 3.0 | 94.0 | 551 | 1250 | 13.0 | 0 | 0 |
| 58487 | 12 | 31 | 4 | AA | SFO | DFW | 515 | 5.0 | 166.0 | 1464 | 1045 | -19.0 | 0 | 0 |
8900 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 1 | 1 | 4 | AS | PHX | SEA | 1505 | -2.0 | 155.0 | 1107 | 1702 | -3.0 | 0 | 0 |
| 198 | 1 | 2 | 5 | AS | LAX | SEA | 2110 | 5.0 | 145.0 | 954 | 2352 | 8.0 | 0 | 0 |
| 241 | 1 | 2 | 5 | AS | LAS | PDX | 650 | -5.0 | 117.0 | 763 | 906 | -3.0 | 0 | 0 |
| 277 | 1 | 2 | 5 | AS | ORD | ANC | 935 | -1.0 | 402.0 | 2846 | 1339 | -6.0 | 0 | 0 |
| 397 | 1 | 3 | 6 | AS | LAS | SEA | 1300 | 48.0 | 137.0 | 867 | 1535 | 47.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58305 | 12 | 30 | 3 | AS | LAX | SEA | 1325 | -2.0 | 134.0 | 954 | 1608 | -7.0 | 0 | 0 |
| 58355 | 12 | 31 | 4 | AS | PHX | SEA | 1200 | -5.0 | 145.0 | 1107 | 1407 | -24.0 | 0 | 0 |
| 58404 | 12 | 31 | 4 | AS | SFO | SLC | 2110 | -2.0 | 80.0 | 599 | 2358 | -4.0 | 0 | 0 |
| 58407 | 12 | 31 | 4 | AS | SFO | PDX | 645 | -2.0 | 81.0 | 550 | 832 | -3.0 | 0 | 0 |
| 58428 | 12 | 31 | 4 | AS | LAX | SEA | 1420 | -8.0 | 127.0 | 954 | 1709 | -25.0 | 0 | 0 |
768 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 123 | 1 | 1 | 4 | B6 | LAS | BOS | 1230 | 0.0 | 246.0 | 2381 | 2026 | -27.0 | 0 | 0 |
| 127 | 1 | 1 | 4 | B6 | LAS | BOS | 2359 | 68.0 | 247.0 | 2381 | 749 | 46.0 | 0 | 0 |
| 239 | 1 | 2 | 5 | B6 | ORD | BOS | 540 | -8.0 | 96.0 | 867 | 856 | -22.0 | 0 | 0 |
| 333 | 1 | 3 | 6 | B6 | LAX | FLL | 2237 | 32.0 | 270.0 | 2342 | 619 | 42.0 | 0 | 0 |
| 548 | 1 | 4 | 7 | B6 | SFO | FLL | 2307 | -4.0 | 298.0 | 2583 | 724 | -1.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58262 | 12 | 30 | 3 | B6 | SFO | LGB | 1921 | -6.0 | 57.0 | 354 | 2038 | -14.0 | 0 | 0 |
| 58301 | 12 | 30 | 3 | B6 | LAX | JFK | 630 | 4.0 | 285.0 | 2475 | 1445 | -6.0 | 0 | 0 |
| 58425 | 12 | 31 | 4 | B6 | ORD | SJU | 700 | 239.0 | 250.0 | 2072 | 1335 | 239.0 | 0 | 0 |
| 58477 | 12 | 31 | 4 | B6 | DFW | BOS | 1145 | 12.0 | 161.0 | 1562 | 1608 | -14.0 | 0 | 0 |
| 58483 | 12 | 31 | 4 | B6 | PHX | BOS | 2236 | -12.0 | 231.0 | 2300 | 515 | -45.0 | 0 | 0 |
543 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | 1 | 1 | 4 | DL | LAS | MSP | 713 | -5.0 | 156.0 | 1299 | 1220 | -18.0 | 0 | 0 |
| 57 | 1 | 1 | 4 | DL | MSP | RSW | 700 | -1.0 | 169.0 | 1416 | 1130 | -20.0 | 0 | 0 |
| 77 | 1 | 1 | 4 | DL | LAX | ATL | 1130 | 24.0 | 217.0 | 1947 | 1840 | 16.0 | 0 | 0 |
| 79 | 1 | 1 | 4 | DL | LAX | CMH | 2146 | -3.0 | 223.0 | 1995 | 459 | -13.0 | 0 | 0 |
| 85 | 1 | 1 | 4 | DL | ATL | OKC | 2059 | -4.0 | 116.0 | 761 | 2227 | -12.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58440 | 12 | 31 | 4 | DL | ATL | CVG | 1611 | -4.0 | 61.0 | 373 | 1736 | -6.0 | 0 | 0 |
| 58448 | 12 | 31 | 4 | DL | ATL | SRQ | 1610 | 0.0 | 61.0 | 444 | 1740 | -13.0 | 0 | 0 |
| 58464 | 12 | 31 | 4 | DL | LAX | SFO | 700 | 108.0 | 54.0 | 337 | 825 | 105.0 | 0 | 0 |
| 58467 | 12 | 31 | 4 | DL | ATL | IND | 1235 | -3.0 | 63.0 | 432 | 1407 | -13.0 | 0 | 0 |
| 58485 | 12 | 31 | 4 | DL | ATL | CMH | 2206 | 2.0 | 64.0 | 447 | 2338 | -8.0 | 0 | 0 |
10601 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 1 | 1 | 4 | EV | ORD | JAN | 1155 | 6.0 | 113.0 | 677 | 1403 | 5.0 | 0 | 0 |
| 13 | 1 | 1 | 4 | EV | ORD | CMH | 1010 | -2.0 | 46.0 | 296 | 1228 | -9.0 | 0 | 0 |
| 29 | 1 | 1 | 4 | EV | ORD | IND | 1025 | -6.0 | 29.0 | 177 | 1228 | -19.0 | 0 | 0 |
| 40 | 1 | 1 | 4 | EV | IAH | CLE | 1038 | -3.0 | 126.0 | 1091 | 1425 | -18.0 | 0 | 0 |
| 69 | 1 | 1 | 4 | EV | ATL | RAP | 1930 | -5.0 | 181.0 | 1230 | 2104 | -15.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58445 | 12 | 31 | 4 | EV | DFW | TXK | 850 | -5.0 | 30.0 | 181 | 948 | -17.0 | 0 | 0 |
| 58452 | 12 | 31 | 4 | EV | DFW | SHV | 1650 | -4.0 | 32.0 | 190 | 1746 | -12.0 | 0 | 0 |
| 58459 | 12 | 31 | 4 | EV | MSP | ORD | 1435 | 18.0 | 61.0 | 334 | 1609 | 3.0 | 0 | 0 |
| 58463 | 12 | 31 | 4 | EV | ORD | MSN | 1220 | 18.0 | 32.0 | 108 | 1319 | 27.0 | 0 | 0 |
| 58486 | 12 | 31 | 4 | EV | DFW | LFT | 850 | 21.0 | 52.0 | 351 | 1012 | 14.0 | 0 | 0 |
5858 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1 | 1 | 4 | F9 | SFO | PHX | 1020 | -7.0 | 91.0 | 651 | 1315 | -6.0 | 0 | 0 |
| 93 | 1 | 1 | 4 | F9 | ATL | DEN | 859 | 16.0 | 181.0 | 1199 | 1026 | 10.0 | 0 | 0 |
| 209 | 1 | 2 | 5 | F9 | MSP | DEN | 1025 | -6.0 | 97.0 | 680 | 1134 | -13.0 | 0 | 0 |
| 232 | 1 | 2 | 5 | F9 | DEN | PHX | 2040 | -7.0 | 83.0 | 602 | 2228 | -18.0 | 0 | 0 |
| 247 | 1 | 2 | 5 | F9 | ORD | ATL | 730 | 10.0 | 86.0 | 606 | 1020 | 23.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58288 | 12 | 30 | 3 | F9 | DEN | ORD | 625 | -4.0 | 136.0 | 888 | 1000 | 14.0 | 0 | 0 |
| 58331 | 12 | 30 | 3 | F9 | ORD | PHX | 825 | 18.0 | 207.0 | 1440 | 1127 | 14.0 | 0 | 0 |
| 58447 | 12 | 31 | 4 | F9 | DEN | LAS | 1245 | 13.0 | 94.0 | 628 | 1340 | 13.0 | 0 | 0 |
| 58449 | 12 | 31 | 4 | F9 | DEN | MCO | 645 | 11.0 | 169.0 | 1546 | 1224 | -11.0 | 0 | 0 |
| 58488 | 12 | 31 | 4 | F9 | LAS | SFO | 1910 | 13.0 | 71.0 | 414 | 2050 | 4.0 | 0 | 0 |
1317 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 582 | 1 | 4 | 7 | HA | LAX | OGG | 1115 | -11.0 | 310.0 | 2486 | 1500 | -27.0 | 0 | 0 |
| 712 | 1 | 5 | 1 | HA | LAS | HNL | 900 | -5.0 | 357.0 | 2762 | 1315 | 5.0 | 0 | 0 |
| 878 | 1 | 6 | 2 | HA | PHX | HNL | 800 | 1.0 | 374.0 | 2917 | 1140 | 3.0 | 0 | 0 |
| 1053 | 1 | 7 | 3 | HA | LAX | HNL | 1705 | 0.0 | 332.0 | 2556 | 2055 | -2.0 | 0 | 0 |
| 1269 | 1 | 8 | 4 | HA | LAX | HNL | 1000 | -1.0 | 335.0 | 2556 | 1350 | 0.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55883 | 12 | 16 | 3 | HA | LAX | HNL | 835 | 1.0 | 314.0 | 2556 | 1235 | -18.0 | 0 | 0 |
| 56174 | 12 | 18 | 5 | HA | LAX | HNL | 835 | -5.0 | 342.0 | 2556 | 1235 | -4.0 | 0 | 0 |
| 56350 | 12 | 19 | 6 | HA | PHX | HNL | 800 | -5.0 | 363.0 | 2917 | 1155 | -34.0 | 0 | 0 |
| 56816 | 12 | 21 | 1 | HA | LAX | LIH | 740 | 20.0 | 303.0 | 2615 | 1145 | -11.0 | 0 | 0 |
| 58391 | 12 | 31 | 4 | HA | LAX | HNL | 1000 | 0.0 | 324.0 | 2556 | 1350 | -9.0 | 0 | 0 |
112 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | 4 | MQ | DFW | VPS | 1305 | 36.0 | 85.0 | 641 | 1453 | 35.0 | 0 | 0 |
| 10 | 1 | 1 | 4 | MQ | DFW | DRO | 1335 | 28.0 | 104.0 | 674 | 1438 | 28.0 | 0 | 0 |
| 18 | 1 | 1 | 4 | MQ | ORD | DAY | 2220 | 19.0 | 37.0 | 240 | 23 | 20.0 | 0 | 0 |
| 24 | 1 | 1 | 4 | MQ | DFW | BTR | 730 | NaN | NaN | 383 | 853 | NaN | 0 | 1 |
| 50 | 1 | 1 | 4 | MQ | ORD | CID | 1135 | -7.0 | 37.0 | 196 | 1238 | -15.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58415 | 12 | 31 | 4 | MQ | ORD | FWA | 845 | -2.0 | 37.0 | 157 | 1045 | -4.0 | 0 | 0 |
| 58426 | 12 | 31 | 4 | MQ | DFW | FAR | 1154 | 4.0 | 124.0 | 968 | 1437 | -13.0 | 0 | 0 |
| 58468 | 12 | 31 | 4 | MQ | DFW | OKC | 1720 | -3.0 | 31.0 | 175 | 1819 | -10.0 | 0 | 0 |
| 58474 | 12 | 31 | 4 | MQ | ORD | FNT | 829 | 4.0 | 40.0 | 223 | 1034 | -4.0 | 0 | 0 |
| 58484 | 12 | 31 | 4 | MQ | ORD | DSM | 1333 | 1.0 | 57.0 | 299 | 1455 | -7.0 | 0 | 0 |
3471 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 1 | 1 | 4 | NK | DEN | DTW | 1952 | 37.0 | 124.0 | 1123 | 31 | 54.0 | 0 | 0 |
| 74 | 1 | 1 | 4 | NK | PHX | DFW | 159 | -1.0 | 103.0 | 868 | 502 | 1.0 | 0 | 0 |
| 95 | 1 | 1 | 4 | NK | LAS | OAK | 1115 | 22.0 | 62.0 | 407 | 1246 | 10.0 | 0 | 0 |
| 109 | 1 | 1 | 4 | NK | MSP | ORD | 616 | 2.0 | 49.0 | 334 | 745 | -19.0 | 0 | 0 |
| 166 | 1 | 2 | 5 | NK | LAS | PDX | 1535 | -8.0 | 123.0 | 763 | 1754 | -4.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58160 | 12 | 29 | 2 | NK | MSP | MCO | 740 | 0.0 | 171.0 | 1310 | 1158 | 33.0 | 0 | 0 |
| 58197 | 12 | 30 | 3 | NK | IAH | ORD | 755 | -8.0 | 136.0 | 925 | 1030 | -2.0 | 0 | 0 |
| 58437 | 12 | 31 | 4 | NK | ORD | DFW | 1952 | 15.0 | 135.0 | 802 | 2225 | 23.0 | 0 | 0 |
| 58461 | 12 | 31 | 4 | NK | ORD | LGA | 1801 | -5.0 | 84.0 | 733 | 2109 | -26.0 | 0 | 0 |
| 58469 | 12 | 31 | 4 | NK | LAS | MSY | 1950 | 124.0 | 163.0 | 1500 | 112 | 101.0 | 0 | 0 |
1516 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 1 | 1 | 4 | OO | ORD | MSP | 1510 | 2.0 | 65.0 | 334 | 1646 | 4.0 | 0 | 0 |
| 16 | 1 | 1 | 4 | OO | DEN | SGU | 1105 | 21.0 | 66.0 | 517 | 1249 | 20.0 | 0 | 0 |
| 22 | 1 | 1 | 4 | OO | LAS | LAX | 1544 | -4.0 | 39.0 | 236 | 1655 | -12.0 | 0 | 0 |
| 25 | 1 | 1 | 4 | OO | ORD | SPI | 2110 | -4.0 | 31.0 | 174 | 2205 | 5.0 | 0 | 0 |
| 27 | 1 | 1 | 4 | OO | IAH | JAC | 1104 | -1.0 | 161.0 | 1265 | 1316 | -1.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58451 | 12 | 31 | 4 | OO | ATL | FWA | 1905 | -3.0 | 72.0 | 508 | 2051 | -14.0 | 0 | 0 |
| 58480 | 12 | 31 | 4 | OO | MSP | BIS | 1310 | -2.0 | 65.0 | 386 | 1449 | -9.0 | 0 | 0 |
| 58482 | 12 | 31 | 4 | OO | DEN | CPR | 1850 | -2.0 | 38.0 | 230 | 1956 | 1.0 | 0 | 0 |
| 58489 | 12 | 31 | 4 | OO | SFO | SBA | 1846 | -6.0 | 46.0 | 262 | 1956 | -5.0 | 0 | 0 |
| 58491 | 12 | 31 | 4 | OO | SFO | BOI | 859 | 5.0 | 73.0 | 522 | 1146 | -1.0 | 0 | 0 |
6588 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 4 | UA | DEN | IAD | 823 | 7.0 | 154.0 | 1452 | 1333 | -13.0 | 0 | 0 |
| 5 | 1 | 1 | 4 | UA | IAH | SAN | 1450 | 1.0 | 178.0 | 1303 | 1620 | -14.0 | 0 | 0 |
| 9 | 1 | 1 | 4 | UA | IAH | SJC | 925 | 3.0 | 215.0 | 1608 | 1136 | -14.0 | 0 | 0 |
| 14 | 1 | 1 | 4 | UA | IAH | IND | 1426 | -1.0 | 102.0 | 844 | 1742 | -20.0 | 0 | 0 |
| 21 | 1 | 1 | 4 | UA | ORD | CLE | 2102 | 48.0 | 47.0 | 315 | 2320 | 41.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58422 | 12 | 31 | 4 | UA | DEN | SAN | 1535 | 0.0 | 124.0 | 853 | 1704 | -13.0 | 0 | 0 |
| 58432 | 12 | 31 | 4 | UA | ORD | SAN | 1915 | 7.0 | 238.0 | 1723 | 2143 | -3.0 | 0 | 0 |
| 58457 | 12 | 31 | 4 | UA | ORD | LAX | 659 | -1.0 | 241.0 | 1744 | 946 | 0.0 | 0 | 0 |
| 58460 | 12 | 31 | 4 | UA | SFO | PHL | 2235 | -6.0 | 265.0 | 2521 | 700 | -42.0 | 0 | 0 |
| 58481 | 12 | 31 | 4 | UA | IAH | LAX | 1433 | 1.0 | 197.0 | 1379 | 1625 | -13.0 | 0 | 0 |
7792 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 1 | 1 | 4 | US | PHX | DEN | 1810 | 29.0 | 94.0 | 602 | 1954 | 49.0 | 0 | 0 |
| 35 | 1 | 1 | 4 | US | ORD | PHL | 1600 | -2.0 | 80.0 | 678 | 1857 | -9.0 | 0 | 0 |
| 49 | 1 | 1 | 4 | US | IAH | PHX | 1445 | -1.0 | 147.0 | 1009 | 1638 | -7.0 | 0 | 0 |
| 96 | 1 | 1 | 4 | US | ATL | PHL | 1445 | -4.0 | 90.0 | 666 | 1644 | -11.0 | 0 | 0 |
| 104 | 1 | 1 | 4 | US | MSP | PHX | 730 | -3.0 | 174.0 | 1276 | 1010 | -20.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31514 | 6 | 30 | 2 | US | DEN | PHL | 705 | -4.0 | 188.0 | 1558 | 1240 | 1.0 | 0 | 0 |
| 31523 | 6 | 30 | 2 | US | PHX | DEN | 1451 | 6.0 | 85.0 | 602 | 1738 | 7.0 | 0 | 0 |
| 31535 | 6 | 30 | 2 | US | PHX | AUS | 840 | -3.0 | 116.0 | 872 | 1304 | -11.0 | 0 | 0 |
| 31561 | 6 | 30 | 2 | US | ORD | PHX | 710 | -5.0 | 170.0 | 1440 | 901 | -50.0 | 0 | 0 |
| 31582 | 6 | 30 | 2 | US | PHX | OGG | 800 | -4.0 | 356.0 | 2845 | 1127 | -13.0 | 0 | 0 |
1615 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | 1 | 1 | 4 | VX | LAS | SFO | 900 | 23.0 | 65.0 | 414 | 1035 | 11.0 | 0 | 0 |
| 227 | 1 | 2 | 5 | VX | SFO | LAS | 1220 | -5.0 | 68.0 | 414 | 1350 | -5.0 | 0 | 0 |
| 243 | 1 | 2 | 5 | VX | SFO | SEA | 700 | -4.0 | 104.0 | 679 | 905 | -1.0 | 0 | 0 |
| 417 | 1 | 3 | 6 | VX | SFO | LAS | 900 | -2.0 | 62.0 | 414 | 1030 | -11.0 | 0 | 0 |
| 432 | 1 | 3 | 6 | VX | SFO | SEA | 2035 | -2.0 | 106.0 | 679 | 2240 | -2.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58332 | 12 | 30 | 3 | VX | SFO | LAS | 1950 | -3.0 | 58.0 | 414 | 2120 | -4.0 | 0 | 0 |
| 58383 | 12 | 31 | 4 | VX | SFO | PSP | 1630 | -7.0 | 65.0 | 421 | 1755 | -12.0 | 0 | 0 |
| 58400 | 12 | 31 | 4 | VX | SFO | LAX | 1125 | -4.0 | 54.0 | 337 | 1245 | -10.0 | 0 | 0 |
| 58471 | 12 | 31 | 4 | VX | SFO | LAX | 700 | 6.0 | 51.0 | 337 | 820 | 3.0 | 0 | 0 |
| 58478 | 12 | 31 | 4 | VX | SFO | LAX | 1530 | 29.0 | 52.0 | 337 | 1650 | 22.0 | 0 | 0 |
993 rows × 14 columns
| MONTH | DAY | WEEKDAY | AIRLINE | ORG_AIR | DEST_AIR | SCHED_DEP | DEP_DELAY | AIR_TIME | DIST | SCHED_ARR | ARR_DELAY | DIVERTED | CANCELLED | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4 | WN | LAX | SLC | 1625 | 58.0 | 94.0 | 590 | 1905 | 65.0 | 0 | 0 |
| 4 | 1 | 1 | 4 | WN | LAX | MCI | 1720 | 48.0 | 166.0 | 1363 | 2225 | 39.0 | 0 | 0 |
| 19 | 1 | 1 | 4 | WN | PHX | LAX | 1640 | 51.0 | 58.0 | 370 | 1700 | 59.0 | 0 | 0 |
| 20 | 1 | 1 | 4 | WN | ATL | BWI | 1115 | 1.0 | 76.0 | 577 | 1305 | -15.0 | 0 | 0 |
| 23 | 1 | 1 | 4 | WN | ATL | HOU | 1555 | 30.0 | 113.0 | 696 | 1720 | 18.0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58455 | 12 | 31 | 4 | WN | LAX | SMF | 1420 | -2.0 | 64.0 | 373 | 1540 | -7.0 | 0 | 0 |
| 58458 | 12 | 31 | 4 | WN | LAS | SFO | 1825 | 25.0 | 67.0 | 414 | 1955 | 17.0 | 0 | 0 |
| 58472 | 12 | 31 | 4 | WN | PHX | HOU | 845 | 5.0 | 119.0 | 1020 | 1210 | 7.0 | 0 | 0 |
| 58473 | 12 | 31 | 4 | WN | DEN | PDX | 1205 | 4.0 | 130.0 | 991 | 1400 | -13.0 | 0 | 0 |
| 58490 | 12 | 31 | 4 | WN | MSP | ATL | 525 | 39.0 | 124.0 | 907 | 855 | 34.0 | 0 | 0 |
8418 rows × 14 columns
범주형변수를 기준으로 groupby -> agg
Index(['MONTH', 'DAY', 'WEEKDAY', 'AIRLINE', 'ORG_AIR', 'DEST_AIR',
'SCHED_DEP', 'DEP_DELAY', 'AIR_TIME', 'DIST', 'SCHED_ARR', 'ARR_DELAY',
'DIVERTED', 'CANCELLED'],
dtype='object')
(5.542660550458716, 6.397352587244284)
# EX1: [AIRLINE] \(\to\) {ARR_DELAY:mean}
- 방법1: grouby() \(\to\) .agg({colname: function})
(예시1)
| ARR_DELAY | |
|---|---|
| AIRLINE | |
| AA | 5.542661 |
| AS | -0.833333 |
| B6 | 8.692593 |
| DL | 0.339691 |
| EV | 7.034580 |
| F9 | 13.630651 |
| HA | 4.972973 |
| MQ | 6.860591 |
| NK | 18.436070 |
| OO | 7.593463 |
| UA | 7.765755 |
| US | 1.681105 |
| VX | 5.348884 |
| WN | 6.397353 |
(예시2)
| ARR_DELAY | |
|---|---|
| AIRLINE | |
| AA | 5.542661 |
| AS | -0.833333 |
| B6 | 8.692593 |
| DL | 0.339691 |
| EV | 7.034580 |
| F9 | 13.630651 |
| HA | 4.972973 |
| MQ | 6.860591 |
| NK | 18.436070 |
| OO | 7.593463 |
| UA | 7.765755 |
| US | 1.681105 |
| VX | 5.348884 |
| WN | 6.397353 |
- 방법2: grouby() \(\to\) key로 column선택 \(\to\) .agg(f) or .f()
(예시1)
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
F9 13.630651
HA 4.972973
MQ 6.860591
NK 18.436070
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353
Name: ARR_DELAY, dtype: float64
(예시2)
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
F9 13.630651
HA 4.972973
MQ 6.860591
NK 18.436070
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353
Name: ARR_DELAY, dtype: float64
(예시3)
# EX2: [AIRLINE,WEEKDAY] \(\to\) {CANCELLED:sum}
- 방법1
(예시1)
| CANCELLED | ||
|---|---|---|
| AIRLINE | WEEKDAY | |
| AA | 1 | 41 |
| 2 | 9 | |
| 3 | 16 | |
| 4 | 20 | |
| 5 | 18 | |
| ... | ... | ... |
| WN | 3 | 18 |
| 4 | 10 | |
| 5 | 7 | |
| 6 | 10 | |
| 7 | 7 |
98 rows × 1 columns
(예시2)
| CANCELLED | ||
|---|---|---|
| AIRLINE | WEEKDAY | |
| AA | 1 | 41 |
| 2 | 9 | |
| 3 | 16 | |
| 4 | 20 | |
| 5 | 18 | |
| ... | ... | ... |
| WN | 3 | 18 |
| 4 | 10 | |
| 5 | 7 | |
| 6 | 10 | |
| 7 | 7 |
98 rows × 1 columns
- 방법2
(예시1)
| CANCELLED | ||
|---|---|---|
| AIRLINE | WEEKDAY | |
| AA | 1 | 41 |
| 2 | 9 | |
| 3 | 16 | |
| 4 | 20 | |
| 5 | 18 | |
| ... | ... | ... |
| WN | 3 | 18 |
| 4 | 10 | |
| 5 | 7 | |
| 6 | 10 | |
| 7 | 7 |
98 rows × 1 columns
(예시2)
AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
..
WN 3 18
4 10
5 7
6 10
7 7
Name: CANCELLED, Length: 98, dtype: int64
(예시3)
AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
..
WN 3 18
4 10
5 7
6 10
7 7
Name: CANCELLED, Length: 98, dtype: int64
# EX3: [AIRLINE,WEEKDAY] \(\to\) {CANCELLED:sum,mean}, {DIVERTED: sum,mean}
- 방법1
(예시1)
| CANCELLED | DIVERTED | ||||
|---|---|---|---|---|---|
| sum | mean | sum | mean | ||
| AIRLINE | WEEKDAY | ||||
| AA | 1 | 41 | 0.032106 | 6 | 0.004699 |
| 2 | 9 | 0.007341 | 2 | 0.001631 | |
| 3 | 16 | 0.011949 | 2 | 0.001494 | |
| 4 | 20 | 0.015004 | 5 | 0.003751 | |
| 5 | 18 | 0.014151 | 1 | 0.000786 | |
| ... | ... | ... | ... | ... | ... |
| WN | 3 | 18 | 0.014118 | 2 | 0.001569 |
| 4 | 10 | 0.007911 | 4 | 0.003165 | |
| 5 | 7 | 0.005828 | 0 | 0.000000 | |
| 6 | 10 | 0.010132 | 3 | 0.003040 | |
| 7 | 7 | 0.006066 | 3 | 0.002600 | |
98 rows × 4 columns
(예시2)
| CANCELLED | DIVERTED | ||||
|---|---|---|---|---|---|
| sum | mean | sum | mean | ||
| AIRLINE | WEEKDAY | ||||
| AA | 1 | 41 | 0.032106 | 6 | 0.004699 |
| 2 | 9 | 0.007341 | 2 | 0.001631 | |
| 3 | 16 | 0.011949 | 2 | 0.001494 | |
| 4 | 20 | 0.015004 | 5 | 0.003751 | |
| 5 | 18 | 0.014151 | 1 | 0.000786 | |
| ... | ... | ... | ... | ... | ... |
| WN | 3 | 18 | 0.014118 | 2 | 0.001569 |
| 4 | 10 | 0.007911 | 4 | 0.003165 | |
| 5 | 7 | 0.005828 | 0 | 0.000000 | |
| 6 | 10 | 0.010132 | 3 | 0.003040 | |
| 7 | 7 | 0.006066 | 3 | 0.002600 | |
98 rows × 4 columns
- 방법2
(예시1)
| CANCELLED | DIVERTED | ||||
|---|---|---|---|---|---|
| sum | mean | sum | mean | ||
| AIRLINE | WEEKDAY | ||||
| AA | 1 | 41 | 0.032106 | 6 | 0.004699 |
| 2 | 9 | 0.007341 | 2 | 0.001631 | |
| 3 | 16 | 0.011949 | 2 | 0.001494 | |
| 4 | 20 | 0.015004 | 5 | 0.003751 | |
| 5 | 18 | 0.014151 | 1 | 0.000786 | |
| ... | ... | ... | ... | ... | ... |
| WN | 3 | 18 | 0.014118 | 2 | 0.001569 |
| 4 | 10 | 0.007911 | 4 | 0.003165 | |
| 5 | 7 | 0.005828 | 0 | 0.000000 | |
| 6 | 10 | 0.010132 | 3 | 0.003040 | |
| 7 | 7 | 0.006066 | 3 | 0.002600 | |
98 rows × 4 columns
(예시2)
| CANCELLED | DIVERTED | ||||
|---|---|---|---|---|---|
| sum | mean | sum | mean | ||
| AIRLINE | WEEKDAY | ||||
| AA | 1 | 41 | 0.032106 | 6 | 0.004699 |
| 2 | 9 | 0.007341 | 2 | 0.001631 | |
| 3 | 16 | 0.011949 | 2 | 0.001494 | |
| 4 | 20 | 0.015004 | 5 | 0.003751 | |
| 5 | 18 | 0.014151 | 1 | 0.000786 | |
| ... | ... | ... | ... | ... | ... |
| WN | 3 | 18 | 0.014118 | 2 | 0.001569 |
| 4 | 10 | 0.007911 | 4 | 0.003165 | |
| 5 | 7 | 0.005828 | 0 | 0.000000 | |
| 6 | 10 | 0.010132 | 3 | 0.003040 | |
| 7 | 7 | 0.006066 | 3 | 0.002600 | |
98 rows × 4 columns
(예시3) – 사용불가능
# EX4: [AIRLINE,WEEKDAY] \(\to\) {CANCELLED:sum,mean,count}, {AIR_TIME: mean,var}
- 방법1
(예시1)
df.groupby(["AIRLINE","WEEKDAY"])\
.agg({'CANCELLED':[np.sum,np.mean,len],'AIR_TIME':[np.mean,np.var]})| CANCELLED | AIR_TIME | |||||
|---|---|---|---|---|---|---|
| sum | mean | len | mean | var | ||
| AIRLINE | WEEKDAY | |||||
| AA | 1 | 41 | 0.032106 | 1277 | 147.610569 | 5393.806723 |
| 2 | 9 | 0.007341 | 1226 | 143.851852 | 5359.890719 | |
| 3 | 16 | 0.011949 | 1339 | 144.514005 | 5378.854539 | |
| 4 | 20 | 0.015004 | 1333 | 141.124618 | 4791.524627 | |
| 5 | 18 | 0.014151 | 1272 | 145.430966 | 5884.592076 | |
| ... | ... | ... | ... | ... | ... | ... |
| WN | 3 | 18 | 0.014118 | 1275 | 104.219920 | 2901.873447 |
| 4 | 10 | 0.007911 | 1264 | 107.200800 | 2966.568935 | |
| 5 | 7 | 0.005828 | 1201 | 107.893635 | 3268.717093 | |
| 6 | 10 | 0.010132 | 987 | 109.247433 | 3152.753719 | |
| 7 | 7 | 0.006066 | 1154 | 107.602273 | 3183.126889 | |
98 rows × 5 columns
(예시2)
df.groupby(["AIRLINE","WEEKDAY"])\
.agg({'CANCELLED':["sum","mean","count"],'AIR_TIME':["mean","var"]})| CANCELLED | AIR_TIME | |||||
|---|---|---|---|---|---|---|
| sum | mean | count | mean | var | ||
| AIRLINE | WEEKDAY | |||||
| AA | 1 | 41 | 0.032106 | 1277 | 147.610569 | 5393.806723 |
| 2 | 9 | 0.007341 | 1226 | 143.851852 | 5359.890719 | |
| 3 | 16 | 0.011949 | 1339 | 144.514005 | 5378.854539 | |
| 4 | 20 | 0.015004 | 1333 | 141.124618 | 4791.524627 | |
| 5 | 18 | 0.014151 | 1272 | 145.430966 | 5884.592076 | |
| ... | ... | ... | ... | ... | ... | ... |
| WN | 3 | 18 | 0.014118 | 1275 | 104.219920 | 2901.873447 |
| 4 | 10 | 0.007911 | 1264 | 107.200800 | 2966.568935 | |
| 5 | 7 | 0.005828 | 1201 | 107.893635 | 3268.717093 | |
| 6 | 10 | 0.010132 | 987 | 109.247433 | 3152.753719 | |
| 7 | 7 | 0.006066 | 1154 | 107.602273 | 3183.126889 | |
98 rows × 5 columns
(사용자정의함수)
df.groupby(["AIRLINE","WEEKDAY"])\
.agg({'CANCELLED':[np.sum,np.mean,len],
'AIR_TIME':[np.mean,lambda x: np.std(x,ddof=1)**2]})| CANCELLED | AIR_TIME | |||||
|---|---|---|---|---|---|---|
| sum | mean | len | mean | <lambda_0> | ||
| AIRLINE | WEEKDAY | |||||
| AA | 1 | 41 | 0.032106 | 1277 | 147.610569 | 5393.806723 |
| 2 | 9 | 0.007341 | 1226 | 143.851852 | 5359.890719 | |
| 3 | 16 | 0.011949 | 1339 | 144.514005 | 5378.854539 | |
| 4 | 20 | 0.015004 | 1333 | 141.124618 | 4791.524627 | |
| 5 | 18 | 0.014151 | 1272 | 145.430966 | 5884.592076 | |
| ... | ... | ... | ... | ... | ... | ... |
| WN | 3 | 18 | 0.014118 | 1275 | 104.219920 | 2901.873447 |
| 4 | 10 | 0.007911 | 1264 | 107.200800 | 2966.568935 | |
| 5 | 7 | 0.005828 | 1201 | 107.893635 | 3268.717093 | |
| 6 | 10 | 0.010132 | 987 | 109.247433 | 3152.753719 | |
| 7 | 7 | 0.006066 | 1154 | 107.602273 | 3183.126889 | |
98 rows × 5 columns
연속형변수를 기준으로 groupby -> agg
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 58482 | 58483 | 58484 | 58485 | 58486 | 58487 | 58488 | 58489 | 58490 | 58491 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MONTH | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| DAY | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 |
| WEEKDAY | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | ... | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
| AIRLINE | WN | UA | MQ | AA | WN | UA | AA | F9 | AA | UA | ... | OO | B6 | MQ | DL | EV | AA | F9 | OO | WN | OO |
| ORG_AIR | LAX | DEN | DFW | DFW | LAX | IAH | DFW | SFO | ORD | IAH | ... | DEN | PHX | ORD | ATL | DFW | SFO | LAS | SFO | MSP | SFO |
| DEST_AIR | SLC | IAD | VPS | DCA | MCI | SAN | MSY | PHX | STL | SJC | ... | CPR | BOS | DSM | CMH | LFT | DFW | SFO | SBA | ATL | BOI |
| SCHED_DEP | 1625 | 823 | 1305 | 1555 | 1720 | 1450 | 1250 | 1020 | 1845 | 925 | ... | 1850 | 2236 | 1333 | 2206 | 850 | 515 | 1910 | 1846 | 525 | 859 |
| DEP_DELAY | 58.0 | 7.0 | 36.0 | 7.0 | 48.0 | 1.0 | 84.0 | -7.0 | -5.0 | 3.0 | ... | -2.0 | -12.0 | 1.0 | 2.0 | 21.0 | 5.0 | 13.0 | -6.0 | 39.0 | 5.0 |
| AIR_TIME | 94.0 | 154.0 | 85.0 | 126.0 | 166.0 | 178.0 | 64.0 | 91.0 | 44.0 | 215.0 | ... | 38.0 | 231.0 | 57.0 | 64.0 | 52.0 | 166.0 | 71.0 | 46.0 | 124.0 | 73.0 |
| DIST | 590 | 1452 | 641 | 1192 | 1363 | 1303 | 447 | 651 | 258 | 1608 | ... | 230 | 2300 | 299 | 447 | 351 | 1464 | 414 | 262 | 907 | 522 |
| SCHED_ARR | 1905 | 1333 | 1453 | 1935 | 2225 | 1620 | 1410 | 1315 | 1950 | 1136 | ... | 1956 | 515 | 1455 | 2338 | 1012 | 1045 | 2050 | 1956 | 855 | 1146 |
| ARR_DELAY | 65.0 | -13.0 | 35.0 | -7.0 | 39.0 | -14.0 | 83.0 | -6.0 | -5.0 | -14.0 | ... | 1.0 | -45.0 | -7.0 | -8.0 | 14.0 | -19.0 | 4.0 | -5.0 | 34.0 | -1.0 |
| DIVERTED | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CANCELLED | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 rows × 58492 columns
count 58492.000000
mean 872.900072
std 624.996805
min 67.000000
25% 391.000000
50% 690.000000
75% 1199.000000
max 4502.000000
Name: DIST, dtype: float64
df.assign(DIST2 = pd.cut(df.DIST,[-np.inf,391,690,1199,np.inf]))\
.groupby(["AIRLINE","DIST2"]).agg({'CANCELLED':["sum","mean","count"]})| CANCELLED | ||||
|---|---|---|---|---|
| sum | mean | count | ||
| AIRLINE | DIST2 | |||
| AA | (-inf, 391.0] | 18 | 0.015986 | 1126 |
| (391.0, 690.0] | 17 | 0.013589 | 1251 | |
| (690.0, 1199.0] | 69 | 0.022066 | 3127 | |
| (1199.0, inf] | 50 | 0.014723 | 3396 | |
| AS | (-inf, 391.0] | 0 | NaN | 0 |
| (391.0, 690.0] | 0 | 0.000000 | 145 | |
| (690.0, 1199.0] | 0 | 0.000000 | 462 | |
| (1199.0, inf] | 0 | 0.000000 | 161 | |
| B6 | (-inf, 391.0] | 0 | 0.000000 | 71 |
| (391.0, 690.0] | 0 | 0.000000 | 38 | |
| (690.0, 1199.0] | 0 | 0.000000 | 61 | |
| (1199.0, inf] | 1 | 0.002681 | 373 | |
| DL | (-inf, 391.0] | 7 | 0.003086 | 2268 |
| (391.0, 690.0] | 8 | 0.002421 | 3304 | |
| (690.0, 1199.0] | 16 | 0.006405 | 2498 | |
| (1199.0, inf] | 7 | 0.002766 | 2531 | |
| EV | (-inf, 391.0] | 77 | 0.028785 | 2675 |
| (391.0, 690.0] | 47 | 0.022793 | 2062 | |
| (690.0, 1199.0] | 22 | 0.019982 | 1101 | |
| (1199.0, inf] | 0 | 0.000000 | 20 | |
| F9 | (-inf, 391.0] | 0 | 0.000000 | 27 |
| (391.0, 690.0] | 6 | 0.013825 | 434 | |
| (690.0, 1199.0] | 4 | 0.007105 | 563 | |
| (1199.0, inf] | 0 | 0.000000 | 293 | |
| HA | (-inf, 391.0] | 0 | NaN | 0 |
| (391.0, 690.0] | 0 | NaN | 0 | |
| (690.0, 1199.0] | 0 | NaN | 0 | |
| (1199.0, inf] | 0 | 0.000000 | 112 | |
| MQ | (-inf, 391.0] | 90 | 0.047120 | 1910 |
| (391.0, 690.0] | 39 | 0.037356 | 1044 | |
| (690.0, 1199.0] | 22 | 0.044266 | 497 | |
| (1199.0, inf] | 1 | 0.050000 | 20 | |
| NK | (-inf, 391.0] | 5 | 0.036496 | 137 |
| (391.0, 690.0] | 4 | 0.013201 | 303 | |
| (690.0, 1199.0] | 6 | 0.011029 | 544 | |
| (1199.0, inf] | 10 | 0.018797 | 532 | |
| OO | (-inf, 391.0] | 75 | 0.024826 | 3021 |
| (391.0, 690.0] | 39 | 0.019364 | 2014 | |
| (690.0, 1199.0] | 19 | 0.016351 | 1162 | |
| (1199.0, inf] | 9 | 0.023018 | 391 | |
| UA | (-inf, 391.0] | 5 | 0.007143 | 700 |
| (391.0, 690.0] | 14 | 0.011824 | 1184 | |
| (690.0, 1199.0] | 26 | 0.010924 | 2380 | |
| (1199.0, inf] | 48 | 0.013605 | 3528 | |
| US | (-inf, 391.0] | 0 | 0.000000 | 254 |
| (391.0, 690.0] | 7 | 0.021944 | 319 | |
| (690.0, 1199.0] | 2 | 0.006329 | 316 | |
| (1199.0, inf] | 12 | 0.016529 | 726 | |
| VX | (-inf, 391.0] | 2 | 0.008299 | 241 |
| (391.0, 690.0] | 1 | 0.003861 | 259 | |
| (690.0, 1199.0] | 0 | 0.000000 | 22 | |
| (1199.0, inf] | 3 | 0.006369 | 471 | |
| WN | (-inf, 391.0] | 55 | 0.023810 | 2310 |
| (391.0, 690.0] | 14 | 0.006487 | 2158 | |
| (690.0, 1199.0] | 17 | 0.007896 | 2153 | |
| (1199.0, inf] | 7 | 0.003895 | 1797 | |
0 400~700
1 1200~
2 400~700
3 700~1200
4 1200~
...
58487 1200~
58488 400~700
58489 ~400
58490 700~1200
58491 400~700
Name: DIST, Length: 58492, dtype: category
Categories (4, object): ['~400' < '400~700' < '700~1200' < '1200~']
df.assign(DIST2 = pd.cut(df.DIST,[-np.inf,400,700,1200,np.inf],labels=['~400','400~700','700~1200','1200~']))\
.groupby(["AIRLINE","DIST2"]).agg({'CANCELLED':["sum","mean","count"]})| CANCELLED | ||||
|---|---|---|---|---|
| sum | mean | count | ||
| AIRLINE | DIST2 | |||
| AA | ~400 | 18 | 0.015986 | 1126 |
| 400~700 | 17 | 0.013589 | 1251 | |
| 700~1200 | 69 | 0.022066 | 3127 | |
| 1200~ | 50 | 0.014723 | 3396 | |
| AS | ~400 | 0 | NaN | 0 |
| 400~700 | 0 | 0.000000 | 145 | |
| 700~1200 | 0 | 0.000000 | 462 | |
| 1200~ | 0 | 0.000000 | 161 | |
| B6 | ~400 | 0 | 0.000000 | 71 |
| 400~700 | 0 | 0.000000 | 38 | |
| 700~1200 | 0 | 0.000000 | 61 | |
| 1200~ | 1 | 0.002681 | 373 | |
| DL | ~400 | 7 | 0.003040 | 2303 |
| 400~700 | 8 | 0.002352 | 3402 | |
| 700~1200 | 16 | 0.006765 | 2365 | |
| 1200~ | 7 | 0.002766 | 2531 | |
| EV | ~400 | 77 | 0.027838 | 2766 |
| 400~700 | 48 | 0.023312 | 2059 | |
| 700~1200 | 21 | 0.020731 | 1013 | |
| 1200~ | 0 | 0.000000 | 20 | |
| F9 | ~400 | 0 | 0.000000 | 27 |
| 400~700 | 7 | 0.015837 | 442 | |
| 700~1200 | 3 | 0.005405 | 555 | |
| 1200~ | 0 | 0.000000 | 293 | |
| HA | ~400 | 0 | NaN | 0 |
| 400~700 | 0 | NaN | 0 | |
| 700~1200 | 0 | NaN | 0 | |
| 1200~ | 0 | 0.000000 | 112 | |
| MQ | ~400 | 92 | 0.047472 | 1938 |
| 400~700 | 39 | 0.035682 | 1093 | |
| 700~1200 | 20 | 0.047619 | 420 | |
| 1200~ | 1 | 0.050000 | 20 | |
| NK | ~400 | 5 | 0.036496 | 137 |
| 400~700 | 4 | 0.013201 | 303 | |
| 700~1200 | 6 | 0.011029 | 544 | |
| 1200~ | 10 | 0.018797 | 532 | |
| OO | ~400 | 76 | 0.024837 | 3060 |
| 400~700 | 38 | 0.018673 | 2035 | |
| 700~1200 | 19 | 0.017241 | 1102 | |
| 1200~ | 9 | 0.023018 | 391 | |
| UA | ~400 | 5 | 0.006993 | 715 |
| 400~700 | 14 | 0.011966 | 1170 | |
| 700~1200 | 26 | 0.010929 | 2379 | |
| 1200~ | 48 | 0.013605 | 3528 | |
| US | ~400 | 0 | 0.000000 | 254 |
| 400~700 | 7 | 0.021944 | 319 | |
| 700~1200 | 2 | 0.006329 | 316 | |
| 1200~ | 12 | 0.016529 | 726 | |
| VX | ~400 | 2 | 0.008299 | 241 |
| 400~700 | 1 | 0.003861 | 259 | |
| 700~1200 | 0 | 0.000000 | 22 | |
| 1200~ | 3 | 0.006369 | 471 | |
| WN | ~400 | 55 | 0.023022 | 2389 |
| 400~700 | 17 | 0.007795 | 2181 | |
| 700~1200 | 14 | 0.006826 | 2051 | |
| 1200~ | 7 | 0.003895 | 1797 | |
HRDataset_v14 자료분석
아래의 코드를 활용하여 Kaggle의 HRdataset을 불러오라.
df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/HRDataset_v14.csv')
df| Employee_Name | EmpID | MarriedID | MaritalStatusID | GenderID | EmpStatusID | DeptID | PerfScoreID | FromDiversityJobFairID | Salary | ... | ManagerName | ManagerID | RecruitmentSource | PerformanceScore | EngagementSurvey | EmpSatisfaction | SpecialProjectsCount | LastPerformanceReview_Date | DaysLateLast30 | Absences | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adinolfi, Wilson K | 10026 | 0 | 0 | 1 | 1 | 5 | 4 | 0 | 62506 | ... | Michael Albert | 22.0 | Exceeds | 4.60 | 5 | 0 | 1/17/2019 | 0 | 1 | |
| 1 | Ait Sidi, Karthikeyan | 10084 | 1 | 1 | 1 | 5 | 3 | 3 | 0 | 104437 | ... | Simon Roup | 4.0 | Indeed | Fully Meets | 4.96 | 3 | 6 | 2/24/2016 | 0 | 17 |
| 2 | Akinkuolie, Sarah | 10196 | 1 | 1 | 0 | 5 | 5 | 3 | 0 | 64955 | ... | Kissy Sullivan | 20.0 | Fully Meets | 3.02 | 3 | 0 | 5/15/2012 | 0 | 3 | |
| 3 | Alagbe,Trina | 10088 | 1 | 1 | 0 | 1 | 5 | 3 | 0 | 64991 | ... | Elijiah Gray | 16.0 | Indeed | Fully Meets | 4.84 | 5 | 0 | 1/3/2019 | 0 | 15 |
| 4 | Anderson, Carol | 10069 | 0 | 2 | 0 | 5 | 5 | 3 | 0 | 50825 | ... | Webster Butler | 39.0 | Google Search | Fully Meets | 5.00 | 4 | 0 | 2/1/2016 | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 306 | Woodson, Jason | 10135 | 0 | 0 | 1 | 1 | 5 | 3 | 0 | 65893 | ... | Kissy Sullivan | 20.0 | Fully Meets | 4.07 | 4 | 0 | 2/28/2019 | 0 | 13 | |
| 307 | Ybarra, Catherine | 10301 | 0 | 0 | 0 | 5 | 5 | 1 | 0 | 48513 | ... | Brannon Miller | 12.0 | Google Search | PIP | 3.20 | 2 | 0 | 9/2/2015 | 5 | 4 |
| 308 | Zamora, Jennifer | 10010 | 0 | 0 | 0 | 1 | 3 | 4 | 0 | 220450 | ... | Janet King | 2.0 | Employee Referral | Exceeds | 4.60 | 5 | 6 | 2/21/2019 | 0 | 16 |
| 309 | Zhou, Julia | 10043 | 0 | 0 | 0 | 1 | 3 | 3 | 0 | 89292 | ... | Simon Roup | 4.0 | Employee Referral | Fully Meets | 5.00 | 3 | 5 | 2/1/2019 | 0 | 11 |
| 310 | Zima, Colleen | 10271 | 0 | 4 | 0 | 1 | 5 | 3 | 0 | 45046 | ... | David Stanley | 14.0 | Fully Meets | 4.50 | 5 | 0 | 1/30/2019 | 0 | 2 |
311 rows × 36 columns
데이터를 조사하고 올바르게 분석한 사람을 모두 고르라. (모두 맞칠경우만 정답으로 인정)
- 소윤: 근무인원수가 가장 많은 인종(RaceDesc)은 ’White’이며 이는 ’Asian’인종과 ’Black or African American’의 합보다 많다.
- 다호: ’RaceDesc==White’의 성별(Sex)임금차이는 2000이상이다.
- 하니: 퇴직한사람(Termd==1)은 모두 104명이며 백인여성의 퇴직자수가 가장 많다.
- 도한: 퇴직한사람중 아시아인의 비율은 10%가 넘지 않는다.
(풀이)
모두 참이다.
데이터조사
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 36 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Employee_Name 311 non-null object
1 EmpID 311 non-null int64
2 MarriedID 311 non-null int64
3 MaritalStatusID 311 non-null int64
4 GenderID 311 non-null int64
5 EmpStatusID 311 non-null int64
6 DeptID 311 non-null int64
7 PerfScoreID 311 non-null int64
8 FromDiversityJobFairID 311 non-null int64
9 Salary 311 non-null int64
10 Termd 311 non-null int64
11 PositionID 311 non-null int64
12 Position 311 non-null object
13 State 311 non-null object
14 Zip 311 non-null int64
15 DOB 311 non-null object
16 Sex 311 non-null object
17 MaritalDesc 311 non-null object
18 CitizenDesc 311 non-null object
19 HispanicLatino 311 non-null object
20 RaceDesc 311 non-null object
21 DateofHire 311 non-null object
22 DateofTermination 104 non-null object
23 TermReason 311 non-null object
24 EmploymentStatus 311 non-null object
25 Department 311 non-null object
26 ManagerName 311 non-null object
27 ManagerID 303 non-null float64
28 RecruitmentSource 311 non-null object
29 PerformanceScore 311 non-null object
30 EngagementSurvey 311 non-null float64
31 EmpSatisfaction 311 non-null int64
32 SpecialProjectsCount 311 non-null int64
33 LastPerformanceReview_Date 311 non-null object
34 DaysLateLast30 311 non-null int64
35 Absences 311 non-null int64
dtypes: float64(2), int64(16), object(18)
memory usage: 87.6+ KB
- EmpID가 missing이 없는 열임
소윤: 근무인원수가 가장 많은 인종(RaceDesc)은 ’White’이며 이는 ’Asian’인종과 ’Black or African American’의 합보다 많다. — 참
| EmpID | |
|---|---|
| RaceDesc | |
| American Indian or Alaska Native | 3 |
| Asian | 29 |
| Black or African American | 80 |
| Hispanic | 1 |
| Two or more races | 11 |
| White | 187 |
다호: ’RaceDesc==White’의 성별(Sex)임금차이는 2000이상이다. — 참
| Salary | ||
|---|---|---|
| RaceDesc | Sex | |
| American Indian or Alaska Native | F | 63436.500000 |
| M | 70545.000000 | |
| Asian | F | 67520.117647 |
| M | 69939.416667 | |
| Black or African American | F | 66963.829787 |
| M | 85066.121212 | |
| Hispanic | M | 83667.000000 |
| Two or more races | F | 58068.500000 |
| M | 62313.800000 | |
| White | F | 68846.519231 |
| M | 65334.132530 |
하니: 퇴직한사람(Termd==1)은 모두 104명이며 백인여성의 퇴직자수가 가장 많다. — 참
| Termd | ||
|---|---|---|
| RaceDesc | Sex | |
| American Indian or Alaska Native | F | 0 |
| M | 0 | |
| Asian | F | 6 |
| M | 3 | |
| Black or African American | F | 15 |
| M | 14 | |
| Hispanic | M | 0 |
| Two or more races | F | 2 |
| M | 1 | |
| White | F | 37 |
| M | 26 |
도한: 퇴직한사람중 아시아인의 비율은 10%가 넘지 않는다. — 참