강의영상

- (1/3) 데이터프레임을 합치는 방법 (1)

- (2/3) 데이터프레임을 합치는 방법 (2)

- (3/3) 데이터프레임을 합치는 방법 (3), zip 활용방법

import

import pandas as pd 

data

df2016=pd.read_csv("https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/stocks_2016.csv")
df2017=pd.read_csv("https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/stocks_2017.csv")
df2018=pd.read_csv("https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/stocks_2018.csv")

pd.concat([...])

예제1

df2016
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
df2017
Symbol Shares Low High
0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
df2018
Symbol Shares Low High
0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400
pd.concat([df2016,df2017,df2018]) 
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400
pd.concat([df2016,df2017,df2018]).reset_index()
index Symbol Shares Low High
0 0 AAPL 80 95 110
1 1 TSLA 50 80 130
2 2 WMT 40 55 70
3 0 AAPL 50 120 140
4 1 GE 100 30 40
5 2 IBM 87 75 95
6 3 SLB 20 55 85
7 4 TXN 500 15 23
8 5 TSLA 100 100 300
9 0 AAPL 40 135 170
10 1 AMZN 8 900 1125
11 2 TSLA 50 220 400
pd.concat([df2016,df2017,df2018]).reset_index(drop=True)
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
3 AAPL 50 120 140
4 GE 100 30 40
5 IBM 87 75 95
6 SLB 20 55 85
7 TXN 500 15 23
8 TSLA 100 100 300
9 AAPL 40 135 170
10 AMZN 8 900 1125
11 TSLA 50 220 400

- R에서 rbind와 유사하다

예제2

- rbind와의 차이점: 합치려는 데이터 프레임의 columns이 꼭 동일할 필요는 없다.

df2016
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
df2017.iloc[:,1:]
Shares Low High
0 50 120 140
1 100 30 40
2 87 75 95
3 20 55 85
4 500 15 23
5 100 100 300
pd.concat([df2016,df2017.iloc[:,1:]])
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
0 NaN 50 120 140
1 NaN 100 30 40
2 NaN 87 75 95
3 NaN 20 55 85
4 NaN 500 15 23
5 NaN 100 100 300

pd.concat([...],axis='columns')

- R에서 cbind와 비슷한 느낌이다. (그런데 row의 숫자가 서로 달라도 괜찮음)

예제3

df2016
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
df2017
Symbol Shares Low High
0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
df2018
Symbol Shares Low High
0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400
pd.concat([df2016,df2017,df2018],axis='columns')
Symbol Shares Low High Symbol Shares Low High Symbol Shares Low High
0 AAPL 80.0 95.0 110.0 AAPL 50 120 140 AAPL 40.0 135.0 170.0
1 TSLA 50.0 80.0 130.0 GE 100 30 40 AMZN 8.0 900.0 1125.0
2 WMT 40.0 55.0 70.0 IBM 87 75 95 TSLA 50.0 220.0 400.0
3 NaN NaN NaN NaN SLB 20 55 85 NaN NaN NaN NaN
4 NaN NaN NaN NaN TXN 500 15 23 NaN NaN NaN NaN
5 NaN NaN NaN NaN TSLA 100 100 300 NaN NaN NaN NaN
pd.concat([df2016,df2017,df2018],axis=1)
Symbol Shares Low High Symbol Shares Low High Symbol Shares Low High
0 AAPL 80.0 95.0 110.0 AAPL 50 120 140 AAPL 40.0 135.0 170.0
1 TSLA 50.0 80.0 130.0 GE 100 30 40 AMZN 8.0 900.0 1125.0
2 WMT 40.0 55.0 70.0 IBM 87 75 95 TSLA 50.0 220.0 400.0
3 NaN NaN NaN NaN SLB 20 55 85 NaN NaN NaN NaN
4 NaN NaN NaN NaN TXN 500 15 23 NaN NaN NaN NaN
5 NaN NaN NaN NaN TSLA 100 100 300 NaN NaN NaN NaN
  • 칼럼이 길잖아요.. 그래서 1이에요..

pd.concat([...],keys=[...])

예제4

df2016
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
df2017
Symbol Shares Low High
0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
df2018
Symbol Shares Low High
0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400
pd.concat([df2016,df2017,df2018],keys=[2016,2017,2018])
Symbol Shares Low High
2016 0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
2017 0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
2018 0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400

- 위의 코드는 아래와 같다.

pd.concat({2016:df2016,2017:df2017,2018:df2018})
Symbol Shares Low High
2016 0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
2017 0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
2018 0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400

- 그런데 {2016:df2016,2017:df2017,2018:df2018}dict(zip([2016,2017,2018], [df2016,df2017,df2018])) 는 같으므로 위의코드는 다시 아래와 같다.

pd.concat(dict(zip([2016,2017,2018],[df2016,df2017,df2018])))
Symbol Shares Low High
2016 0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
2017 0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
2018 0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400

- 의도하지 않은 코드들

pd.concat([df2016,df2017,df2018],keys=[2016,2017])
Symbol Shares Low High
2016 0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
2017 0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
pd.concat([df2016,df2017,df2018],keys=[2016,2018])
Symbol Shares Low High
2016 0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
2018 0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
pd.concat([df2016,df2017,df2018],keys=[2016,2017,2018,2019])
Symbol Shares Low High
2016 0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
2017 0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
2018 0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400

pd.concat([...],keys=[...],axis='columns')

예제5

df2016
Symbol Shares Low High
0 AAPL 80 95 110
1 TSLA 50 80 130
2 WMT 40 55 70
df2017
Symbol Shares Low High
0 AAPL 50 120 140
1 GE 100 30 40
2 IBM 87 75 95
3 SLB 20 55 85
4 TXN 500 15 23
5 TSLA 100 100 300
df2018
Symbol Shares Low High
0 AAPL 40 135 170
1 AMZN 8 900 1125
2 TSLA 50 220 400
pd.concat([df2016,df2017,df2018],axis=1)
Symbol Shares Low High Symbol Shares Low High Symbol Shares Low High
0 AAPL 80.0 95.0 110.0 AAPL 50 120 140 AAPL 40.0 135.0 170.0
1 TSLA 50.0 80.0 130.0 GE 100 30 40 AMZN 8.0 900.0 1125.0
2 WMT 40.0 55.0 70.0 IBM 87 75 95 TSLA 50.0 220.0 400.0
3 NaN NaN NaN NaN SLB 20 55 85 NaN NaN NaN NaN
4 NaN NaN NaN NaN TXN 500 15 23 NaN NaN NaN NaN
5 NaN NaN NaN NaN TSLA 100 100 300 NaN NaN NaN NaN
pd.concat([df2016,df2017,df2018],axis=1,keys=[2016,2017,2018])
2016 2017 2018
Symbol Shares Low High Symbol Shares Low High Symbol Shares Low High
0 AAPL 80.0 95.0 110.0 AAPL 50 120 140 AAPL 40.0 135.0 170.0
1 TSLA 50.0 80.0 130.0 GE 100 30 40 AMZN 8.0 900.0 1125.0
2 WMT 40.0 55.0 70.0 IBM 87 75 95 TSLA 50.0 220.0 400.0
3 NaN NaN NaN NaN SLB 20 55 85 NaN NaN NaN NaN
4 NaN NaN NaN NaN TXN 500 15 23 NaN NaN NaN NaN
5 NaN NaN NaN NaN TSLA 100 100 300 NaN NaN NaN NaN
pd.concat({2016:df2016,2017:df2017,2018:df2018},axis=1)
2016 2017 2018
Symbol Shares Low High Symbol Shares Low High Symbol Shares Low High
0 AAPL 80.0 95.0 110.0 AAPL 50 120 140 AAPL 40.0 135.0 170.0
1 TSLA 50.0 80.0 130.0 GE 100 30 40 AMZN 8.0 900.0 1125.0
2 WMT 40.0 55.0 70.0 IBM 87 75 95 TSLA 50.0 220.0 400.0
3 NaN NaN NaN NaN SLB 20 55 85 NaN NaN NaN NaN
4 NaN NaN NaN NaN TXN 500 15 23 NaN NaN NaN NaN
5 NaN NaN NaN NaN TSLA 100 100 300 NaN NaN NaN NaN
pd.concat(dict(zip([2016,2017,2018],[df2016,df2017,df2018])),axis=1)
2016 2017 2018
Symbol Shares Low High Symbol Shares Low High Symbol Shares Low High
0 AAPL 80.0 95.0 110.0 AAPL 50 120 140 AAPL 40.0 135.0 170.0
1 TSLA 50.0 80.0 130.0 GE 100 30 40 AMZN 8.0 900.0 1125.0
2 WMT 40.0 55.0 70.0 IBM 87 75 95 TSLA 50.0 220.0 400.0
3 NaN NaN NaN NaN SLB 20 55 85 NaN NaN NaN NaN
4 NaN NaN NaN NaN TXN 500 15 23 NaN NaN NaN NaN
5 NaN NaN NaN NaN TSLA 100 100 300 NaN NaN NaN NaN

숙제

- 예제1~5의 결과를 타이디한 데이터로 바꿔볼 것 (제출의무는 없음)