강의영상

- (1/8) 부분 데이터 꺼내기: 판다스를 왜 써야할까?

- (2/8) pandas 개발동기

- (3/8) 열의 이름 부여

- (4/8) 행의 이름 부여, 자료형, len, shape, for문의 반복변수, pd.Series

- (5/8) 첫번째 칼럼을 선택, 여러개의 칼럼을 선택

- (6/8) 첫번째 행의 선택, 여러개의 행을 선택

- (7/8) query (1)

- (8/8) query (2), 판다스공부 3단계

import

import numpy as np
import pandas as pd

부분 데이터 꺼내기: 판다스를 왜 써야할까?

기본 인덱싱

- 예제1: 기본인덱싱

a='asdf'
a[2]
'd'
a[-1]
'f'

- 예제2: 슬라이싱

a='asdf'
a[1:3]
'sd'
a[-2:]
'df'

- 예제3: 스트라이딩

a='asdfg'
a[::2]
'adg'

- 예제4: 불가능한것

a='asdf'
a[[1,2]] # 정수인덱스를 리스트화 시켜서 인덱싱하는 것은 불가능 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [9], in <cell line: 2>()
      1 a='asdf'
----> 2 a[[1,2]]

TypeError: string indices must be integers
a='asdf'
a[[True,True,False,False]] 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in <cell line: 2>()
      1 a='asdf'
----> 2 a[[True,True,False,False]]

TypeError: string indices must be integers

팬시 인덱싱

- 예제1: 인덱스의 리스트 (혹은 ndarray)를 전달

a=np.arange(55,61)
a
array([55, 56, 57, 58, 59, 60])
a[[1,2,-1]]
array([56, 57, 60])
a[np.array([1,2,-1])]
array([56, 57, 60])

- 예제2: bool로 이루어진 리스트 (혹은 ndarray)를 전달

a=np.arange(55,61)
a
array([55, 56, 57, 58, 59, 60])
a[[True,True,False,False,False,False]]
array([55, 56])
a[np.array([True,True,False,False,False,False])]
array([55, 56])
a[a<58]
array([55, 56, 57])

2차원자료형의 인덱싱

- 예제1

a = np.arange(4*3).reshape(4,3)
a
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
a[0:2,1]
array([1, 4])

- 예제2: 차원을 유지하면서 인덱싱을 하고 싶으면?

a = np.arange(4*3).reshape(4,3)
a[0:2,[1]]
array([[1],
       [4]])

Hash

- 예제1: (key,value)

d={'att':65, 'rep':45, 'mid':30, 'fin':100}
d
{'att': 65, 'rep': 45, 'mid': 30, 'fin': 100}
d['att'] # key를 넣으면 value가 리턴
65

- 예제2: numpy와 비교

np.random.seed(43052)
att = np.random.choice(np.arange(10,21)*5,200)
rep = np.random.choice(np.arange(5,21)*5,200)
mid = np.random.choice(np.arange(0,21)*5,200)
fin = np.random.choice(np.arange(0,21)*5,200)
key = ['202212'+str(s) for s in np.random.choice(np.arange(300,501),200,replace=False)]
test_dic = {key[i] : {'att':att[i], 'rep':rep[i], 'mid':mid[i], 'fin':fin[i]} for i in range(200)}
test_ndarray = np.array([key,att,rep,mid,fin],dtype=np.int64).T
del(att);del(rep);del(mid);del(fin);del(key)

학번 '202212460'에 해당하는 학생의 출석점수를 알고 싶다면?

(풀이1)

test_dic['202212460']['att']
55

(풀이2)

test_ndarray[test_ndarray[:,0] == 202212460, 1] ## 이게 무슨코드야 도데체!
array([55])

(풀이2)가 (풀이1)에 비하여 불편한 점

  • test_ndarray의 첫칼럼은 student id 이고 두번째 칼럼은 att라는 사실을 암기하고 있어야 한다.
  • student id가 아니고 만약에 학생이름을 써서 데이터를 정리한다면 모든 자료형은 문자형이 되어야 한다.
  • 작성한 코드의 가독성이 없다. (위치로 접근하기 때문)

- 요약: hash 스타일로 정보를 추출하는 것이 유용할 때가 있다. 그리고 보통 hash 스타일로 정보를 뽑는 것이 유리하다. (사실 numpy는 정보추출을 위해 개발된 자료형이 아니라 행렬 및 벡터의 수학연산을 지원하기 위해 개발된 자료형이다)

- 소망: 정보를 추출할때는 hash 스타일도 유용하다는 것은 이해함 $\to$ 하지만 나는 가끔 넘파이스타일로 정보를 뽑고 싶은걸? 그리고 딕셔너리 형태가 아니고 엑셀처럼(행렬처럼) 데이터를 보고 싶은걸? $\to$ pandas의 개발

pandas 개발동기

엑셀처럼 데이터를 테이블 형태로 정리하고 싶다

np.random.seed(43052)
att = np.random.choice(np.arange(10,21)*5,20)
rep = np.random.choice(np.arange(5,21)*5,20)
mid = np.random.choice(np.arange(0,21)*5,20)
fin = np.random.choice(np.arange(0,21)*5,20)
key = ['202212'+str(s) for s in np.random.choice(np.arange(300,501),20,replace=False)]
test_dic = {key[i] : {'att':att[i], 'rep':rep[i], 'mid':mid[i], 'fin':fin[i]} for i in range(20)}
test_dic
{'202212380': {'att': 65, 'rep': 55, 'mid': 50, 'fin': 40},
 '202212370': {'att': 95, 'rep': 100, 'mid': 50, 'fin': 80},
 '202212363': {'att': 65, 'rep': 90, 'mid': 60, 'fin': 30},
 '202212488': {'att': 55, 'rep': 80, 'mid': 75, 'fin': 80},
 '202212312': {'att': 80, 'rep': 30, 'mid': 30, 'fin': 100},
 '202212377': {'att': 75, 'rep': 40, 'mid': 100, 'fin': 15},
 '202212463': {'att': 65, 'rep': 45, 'mid': 45, 'fin': 90},
 '202212471': {'att': 60, 'rep': 60, 'mid': 25, 'fin': 0},
 '202212400': {'att': 95, 'rep': 65, 'mid': 20, 'fin': 10},
 '202212469': {'att': 90, 'rep': 80, 'mid': 80, 'fin': 20},
 '202212318': {'att': 55, 'rep': 75, 'mid': 35, 'fin': 25},
 '202212432': {'att': 95, 'rep': 95, 'mid': 45, 'fin': 0},
 '202212443': {'att': 95, 'rep': 55, 'mid': 15, 'fin': 35},
 '202212367': {'att': 50, 'rep': 80, 'mid': 40, 'fin': 30},
 '202212458': {'att': 50, 'rep': 55, 'mid': 15, 'fin': 85},
 '202212396': {'att': 95, 'rep': 30, 'mid': 30, 'fin': 95},
 '202212482': {'att': 50, 'rep': 50, 'mid': 45, 'fin': 10},
 '202212452': {'att': 65, 'rep': 55, 'mid': 15, 'fin': 45},
 '202212387': {'att': 70, 'rep': 70, 'mid': 40, 'fin': 35},
 '202212354': {'att': 90, 'rep': 90, 'mid': 80, 'fin': 90}}
  • 테이블형태로 보고 싶다.

(방법1) -- 행렬이기는 하지만 방법 2,3,4,5 에 비하여 우리가 원하는 만큼 가독성을 주는 형태는 아님..

test_ndarray = np.array([key,att,rep,mid,fin],dtype=np.int64).T
test_ndarray
array([[202212380,        65,        55,        50,        40],
       [202212370,        95,       100,        50,        80],
       [202212363,        65,        90,        60,        30],
       [202212488,        55,        80,        75,        80],
       [202212312,        80,        30,        30,       100],
       [202212377,        75,        40,       100,        15],
       [202212463,        65,        45,        45,        90],
       [202212471,        60,        60,        25,         0],
       [202212400,        95,        65,        20,        10],
       [202212469,        90,        80,        80,        20],
       [202212318,        55,        75,        35,        25],
       [202212432,        95,        95,        45,         0],
       [202212443,        95,        55,        15,        35],
       [202212367,        50,        80,        40,        30],
       [202212458,        50,        55,        15,        85],
       [202212396,        95,        30,        30,        95],
       [202212482,        50,        50,        45,        10],
       [202212452,        65,        55,        15,        45],
       [202212387,        70,        70,        40,        35],
       [202212354,        90,        90,        80,        90]])

(방법2)

pd.DataFrame(test_dic).T
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
202212377 75 40 100 15
202212463 65 45 45 90
202212471 60 60 25 0
202212400 95 65 20 10
202212469 90 80 80 20
202212318 55 75 35 25
202212432 95 95 45 0
202212443 95 55 15 35
202212367 50 80 40 30
202212458 50 55 15 85
202212396 95 30 30 95
202212482 50 50 45 10
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90

(방법3)

test_dic2 = {'att':{key[i]:att[i] for i in range(20)}, 
             'rep':{key[i]:rep[i] for i in range(20)},
             'mid':{key[i]:mid[i] for i in range(20)},
             'fin':{key[i]:fin[i] for i in range(20)}}
pd.DataFrame(test_dic2)
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
202212377 75 40 100 15
202212463 65 45 45 90
202212471 60 60 25 0
202212400 95 65 20 10
202212469 90 80 80 20
202212318 55 75 35 25
202212432 95 95 45 0
202212443 95 55 15 35
202212367 50 80 40 30
202212458 50 55 15 85
202212396 95 30 30 95
202212482 50 50 45 10
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90

(방법4)

df = pd.DataFrame({'att':att, 'rep':rep, 'mid':mid, 'fin':fin},index=key)
df
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
202212377 75 40 100 15
202212463 65 45 45 90
202212471 60 60 25 0
202212400 95 65 20 10
202212469 90 80 80 20
202212318 55 75 35 25
202212432 95 95 45 0
202212443 95 55 15 35
202212367 50 80 40 30
202212458 50 55 15 85
202212396 95 30 30 95
202212482 50 50 45 10
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90

(방법5)

df = pd.DataFrame({'att':att, 'rep':rep, 'mid':mid, 'fin':fin})
df
att rep mid fin
0 65 55 50 40
1 95 100 50 80
2 65 90 60 30
3 55 80 75 80
4 80 30 30 100
5 75 40 100 15
6 65 45 45 90
7 60 60 25 0
8 95 65 20 10
9 90 80 80 20
10 55 75 35 25
11 95 95 45 0
12 95 55 15 35
13 50 80 40 30
14 50 55 15 85
15 95 30 30 95
16 50 50 45 10
17 65 55 15 45
18 70 70 40 35
19 90 90 80 90
df=df.set_index([key])
df
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
202212377 75 40 100 15
202212463 65 45 45 90
202212471 60 60 25 0
202212400 95 65 20 10
202212469 90 80 80 20
202212318 55 75 35 25
202212432 95 95 45 0
202212443 95 55 15 35
202212367 50 80 40 30
202212458 50 55 15 85
202212396 95 30 30 95
202212482 50 50 45 10
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90

해싱으로 원하는 정보를 뽑으면 좋겠다 (마치 딕셔너리처럼)

- 예제1: 출석점수를 출력

test_dic2['att']
{'202212380': 65,
 '202212370': 95,
 '202212363': 65,
 '202212488': 55,
 '202212312': 80,
 '202212377': 75,
 '202212463': 65,
 '202212471': 60,
 '202212400': 95,
 '202212469': 90,
 '202212318': 55,
 '202212432': 95,
 '202212443': 95,
 '202212367': 50,
 '202212458': 50,
 '202212396': 95,
 '202212482': 50,
 '202212452': 65,
 '202212387': 70,
 '202212354': 90}
df['att']
202212380    65
202212370    95
202212363    65
202212488    55
202212312    80
202212377    75
202212463    65
202212471    60
202212400    95
202212469    90
202212318    55
202212432    95
202212443    95
202212367    50
202212458    50
202212396    95
202212482    50
202212452    65
202212387    70
202212354    90
Name: att, dtype: int64

- 예제2: 학번 202212380'의 출석점수 출력

test_dic2['att']['202212380']
65
df['att']['202212380']
65

인덱싱으로 정보를 뽑는 기능도 지원을 하면 좋겠다 (마치 리스트나 넘파이처럼)

- 예제1: 첫번째 학생의 기말고사 성적을 출력하고 싶다.

test_ndarray[0,-1]
40
df.iloc[0,-1]
40
  • 벼락치기: df에서 iloc이라는 특수기능을 이용하면 넘파이 인덱싱처럼 원소출력이 가능하다.

- 예제2: 홀수번째 학생의 점수를 뽑고 싶다.

test_ndarray[::2]
array([[202212380,        65,        55,        50,        40],
       [202212363,        65,        90,        60,        30],
       [202212312,        80,        30,        30,       100],
       [202212463,        65,        45,        45,        90],
       [202212400,        95,        65,        20,        10],
       [202212318,        55,        75,        35,        25],
       [202212443,        95,        55,        15,        35],
       [202212458,        50,        55,        15,        85],
       [202212482,        50,        50,        45,        10],
       [202212387,        70,        70,        40,        35]])
df.iloc[::2]
att rep mid fin
202212380 65 55 50 40
202212363 65 90 60 30
202212312 80 30 30 100
202212463 65 45 45 90
202212400 95 65 20 10
202212318 55 75 35 25
202212443 95 55 15 35
202212458 50 55 15 85
202212482 50 50 45 10
202212387 70 70 40 35

- 예제3: 맨 끝에서 3명의 점수를 출력하고 싶다.

test_ndarray[-3:]
array([[202212452,        65,        55,        15,        45],
       [202212387,        70,        70,        40,        35],
       [202212354,        90,        90,        80,        90]])
df.iloc[-3:]
att rep mid fin
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90
df[-3:]
att rep mid fin
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90

- 예제4: 맨 끝에서 3명의 점수중 마지막 2개의 칼럼만 출력하고 싶다.

test_ndarray[-3:,-2:]
array([[15, 45],
       [40, 35],
       [80, 90]])
df.iloc[-3:,-2:]
mid fin
202212452 15 45
202212387 40 35
202212354 80 90

궁극: 해싱과 인덱싱을 모두 지원하는 아주 우수한 자료형을 만들고 싶음

- 예제1: 중간고사 점수가 20점 이상이면서 동시에 출석점수가 60점미만인 학생들의 기말고사 점수를 출력

(방법1) 데이터베이스 스타일

df.query("mid >= 20 and att <60")
att rep mid fin
202212488 55 80 75 80
202212318 55 75 35 25
202212367 50 80 40 30
202212482 50 50 45 10
df.query("mid >= 20 and att <60")['fin']
202212488    80
202212318    25
202212367    30
202212482    10
Name: fin, dtype: int64

(방법2) 넘파이 스타일이라면?

test_ndarray
array([[202212380,        65,        55,        50,        40],
       [202212370,        95,       100,        50,        80],
       [202212363,        65,        90,        60,        30],
       [202212488,        55,        80,        75,        80],
       [202212312,        80,        30,        30,       100],
       [202212377,        75,        40,       100,        15],
       [202212463,        65,        45,        45,        90],
       [202212471,        60,        60,        25,         0],
       [202212400,        95,        65,        20,        10],
       [202212469,        90,        80,        80,        20],
       [202212318,        55,        75,        35,        25],
       [202212432,        95,        95,        45,         0],
       [202212443,        95,        55,        15,        35],
       [202212367,        50,        80,        40,        30],
       [202212458,        50,        55,        15,        85],
       [202212396,        95,        30,        30,        95],
       [202212482,        50,        50,        45,        10],
       [202212452,        65,        55,        15,        45],
       [202212387,        70,        70,        40,        35],
       [202212354,        90,        90,        80,        90]])
test_ndarray[:,3] >= 20  ## 중간고사가 20점이상
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True, False,
        True,  True])
test_ndarray[:,1] < 60  ## 출석이 60미만
array([False, False, False,  True, False, False, False, False, False,
       False,  True, False, False,  True,  True, False,  True, False,
       False, False])
(test_ndarray[:,3] >= 20) & (test_ndarray[:,1] < 60)
array([False, False, False,  True, False, False, False, False, False,
       False,  True, False, False,  True, False, False,  True, False,
       False, False])

Note: test_ndarray[:,3] >= 20 & test_ndarray[:,1] < 60 와 같이 하면 에러가 난다. 조심하자!
test_ndarray[(test_ndarray[:,3] >= 20) & (test_ndarray[:,1] < 60),-1] # 이게 무슨코드야
array([80, 25, 30, 10])
  • 구현난이도 어려움, 가독성 꽝

- 예제2: '중간고사점수<기말고사점수'인 학생들의 출석점수 평균을 구하자.

df.query('mid<fin')['att'].mean()
76.66666666666667

pandas 사용법

pandas 공부 1단계

데이터프레임 선언

- 방법1: dictionary에서 만든다.

pd.DataFrame({'att':[30,40,50],'mid':[50,60,70]})
att mid
0 30 50
1 40 60
2 50 70
pd.DataFrame({'att':(30,40,50),'mid':(50,60,70)})
att mid
0 30 50
1 40 60
2 50 70
pd.DataFrame({'att':np.array([30,40,50]),'mid':np.array([50,60,70])})
att mid
0 30 50
1 40 60
2 50 70

- 방법: 2차원 ndarray에서 만든다.

np.arange(2*3).reshape(2,3)
array([[0, 1, 2],
       [3, 4, 5]])
pd.DataFrame(np.arange(2*3).reshape(2,3))
0 1 2
0 0 1 2
1 3 4 5

열의 이름 부여

- 방법1: 딕셔너리를 통하여 만들면 딕셔너리의 key가 자동으로 열의 이름이 된다.

pd.DataFrame({'att':np.array([30,40,50]),'mid':np.array([50,60,70])})
att mid
0 30 50
1 40 60
2 50 70

- 방법2: pd.DataFrame()의 옵션에 columns를 이용

pd.DataFrame(np.arange(2*3).reshape(2,3),columns=['X1','X2','X3'])
X1 X2 X3
0 0 1 2
1 3 4 5

- 방법3: df.columns에 원하는 열이름을 덮어씀 (1)

df=pd.DataFrame(np.arange(2*3).reshape(2,3))
df
0 1 2
0 0 1 2
1 3 4 5
df.columns = ['X1','X2','X3']
df
X1 X2 X3
0 0 1 2
1 3 4 5
df.columns
Index(['X1', 'X2', 'X3'], dtype='object')

- 방법4: df.columns에 원하는 열이름을 덮어씀 (2)

df=pd.DataFrame(np.arange(2*3).reshape(2,3))
df
0 1 2
0 0 1 2
1 3 4 5
df.columns = pd.Index(['X1','X2','X3'])
df
X1 X2 X3
0 0 1 2
1 3 4 5

방법4가 방법3의 방식보다 컴퓨터가 이해하기 좋다. (= 불필요한 에러를 방지할 수 있다)

df.columns, type(df.columns)
(Index(['X1', 'X2', 'X3'], dtype='object'), pandas.core.indexes.base.Index)
['X1','X2','X3'], type(['X1','X2','X3'])
(['X1', 'X2', 'X3'], list)
pd.Index(['X1','X2','X3']), type(pd.Index(['X1','X2','X3']))
(Index(['X1', 'X2', 'X3'], dtype='object'), pandas.core.indexes.base.Index)

행의 이름 부여

- 방법1: 중첩 dict이면 nested dic의 key가 알아서 행의 이름으로 된다.

pd.DataFrame({'att':{'guebin':30, 'iu':40, 'hynn':50} , 'mid':{'guebin':5, 'iu':45, 'hynn':90}})
att mid
guebin 30 5
iu 40 45
hynn 50 90

- 방법2: pd.DataFrame()의 index 옵션 이용

pd.DataFrame({'att':[30,40,50] , 'mid':[5,45,90]}, index=['guebin','iu','hynn'])
att mid
guebin 30 5
iu 40 45
hynn 50 90

- 방법3: df.index에 덮어씌움

df=pd.DataFrame({'att':[30,40,50] , 'mid':[5,45,90]})
df
att mid
0 30 5
1 40 45
2 50 90
df.index = pd.Index(['guebin','iu','hynn'])
#df.index = ['guebin','iu','hynn'] <- 이것도 실행 되기는 된다.
df
att mid
guebin 30 5
iu 40 45
hynn 50 90

- 방법4: df.set_index() 를 이용하여 덮어씌운다

df=pd.DataFrame({'att':[30,40,50] , 'mid':[5,45,90]})
df
att mid
0 30 5
1 40 45
2 50 90
df.set_index(pd.Index(['guebin','iu','hynn']))
att mid
guebin 30 5
iu 40 45
hynn 50 90

(주의) 아래는 에러가 난다.

df.set_index(['guebin','iu','hynn'])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [230], in <cell line: 1>()
----> 1 df.set_index(['guebin','iu','hynn'])

File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/frame.py:5488, in DataFrame.set_index(self, keys, drop, append, inplace, verify_integrity)
   5485                 missing.append(col)
   5487 if missing:
-> 5488     raise KeyError(f"None of {missing} are in the columns")
   5490 if inplace:
   5491     frame = self

KeyError: "None of ['guebin', 'iu', 'hynn'] are in the columns"
df.set_index([['guebin','iu','hynn']]) # 꺽쇠를 한번 더 넣어주면 에러를 피할수 있다. 
att mid
guebin 30 5
iu 40 45
hynn 50 90

자료형, len, shape, for문의 반복변수

df = pd.DataFrame({'att':[30,40,50],'mid':[5,45,90]})
df
att mid
0 30 5
1 40 45
2 50 90

- type

type(df)
pandas.core.frame.DataFrame

- len

len(df) # row의 갯수 
3

- shape

df.shape 
(3, 2)

- for문의 반복변수

for k in df:
    print(k) # 딕셔너리같죠
att
mid
for k in {'att':[30,40,50],'mid':[5,45,90]}: 
    print(k)
att
mid

pd.Series

- 2차원 ndarray가 pd.DataFrame에 대응한다면 1차원 ndarray는 pd.Series에 대응한다.

a=pd.Series(np.random.randn(10))
a
0    0.453758
1   -0.716270
2   -0.167750
3    0.089546
4    0.874006
5    0.401627
6    0.204911
7    1.042722
8   -0.352356
9   -1.241658
dtype: float64
type(a)
pandas.core.series.Series
len(a)
10
a.shape
(10,)
for value in a: 
    print(value)
0.45375797778887145
-0.7162702753778524
-0.1677503294060748
0.08954572073306402
0.8740060519863648
0.40162700372357163
0.20491068603376858
1.0427224662413124
-0.3523562215445016
-1.2416580554177619

pandas 공부 2단계

- 데이터

np.random.seed(43052)
att = np.random.choice(np.arange(10,21)*5,20)
rep = np.random.choice(np.arange(5,21)*5,20)
mid = np.random.choice(np.arange(0,21)*5,20)
fin = np.random.choice(np.arange(0,21)*5,20)
key = ['202212'+str(s) for s in np.random.choice(np.arange(300,501),20,replace=False)]
df=pd.DataFrame({'att':att,'rep':rep,'mid':mid,'fin':fin},index=key)
df
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
202212377 75 40 100 15
202212463 65 45 45 90
202212471 60 60 25 0
202212400 95 65 20 10
202212469 90 80 80 20
202212318 55 75 35 25
202212432 95 95 45 0
202212443 95 55 15 35
202212367 50 80 40 30
202212458 50 55 15 85
202212396 95 30 30 95
202212482 50 50 45 10
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90

첫번째 칼럼을 선택

- 방법1

df.att
202212380    65
202212370    95
202212363    65
202212488    55
202212312    80
202212377    75
202212463    65
202212471    60
202212400    95
202212469    90
202212318    55
202212432    95
202212443    95
202212367    50
202212458    50
202212396    95
202212482    50
202212452    65
202212387    70
202212354    90
Name: att, dtype: int64

- 방법2: dict스타일

df['att']
202212380    65
202212370    95
202212363    65
202212488    55
202212312    80
202212377    75
202212463    65
202212471    60
202212400    95
202212469    90
202212318    55
202212432    95
202212443    95
202212367    50
202212458    50
202212396    95
202212482    50
202212452    65
202212387    70
202212354    90
Name: att, dtype: int64

- 방법3: dict스타일

df[['att']]
att
202212380 65
202212370 95
202212363 65
202212488 55
202212312 80
202212377 75
202212463 65
202212471 60
202212400 95
202212469 90
202212318 55
202212432 95
202212443 95
202212367 50
202212458 50
202212396 95
202212482 50
202212452 65
202212387 70
202212354 90
  • df.att 나 df['att']는 series를 리턴하고 df[['att']]는 dataframe을 리턴한다.

- 방법4: ndarray스타일

df.iloc[:,0] 
202212380    65
202212370    95
202212363    65
202212488    55
202212312    80
202212377    75
202212463    65
202212471    60
202212400    95
202212469    90
202212318    55
202212432    95
202212443    95
202212367    50
202212458    50
202212396    95
202212482    50
202212452    65
202212387    70
202212354    90
Name: att, dtype: int64

- 방법5: ndarray스타일

df.iloc[:,[0]]
att
202212380 65
202212370 95
202212363 65
202212488 55
202212312 80
202212377 75
202212463 65
202212471 60
202212400 95
202212469 90
202212318 55
202212432 95
202212443 95
202212367 50
202212458 50
202212396 95
202212482 50
202212452 65
202212387 70
202212354 90
  • df.iloc[:,0]은 series를 리턴하고 df.iloc[:,[0]]은 dataframe을 리턴한다.

- 방법6: ndarray 스타일과 dict 스타일의 혼합

df.loc[:,'att'] 
202212380    65
202212370    95
202212363    65
202212488    55
202212312    80
202212377    75
202212463    65
202212471    60
202212400    95
202212469    90
202212318    55
202212432    95
202212443    95
202212367    50
202212458    50
202212396    95
202212482    50
202212452    65
202212387    70
202212354    90
Name: att, dtype: int64

- 방법7: ndarray 스타일과 dict 스타일의 혼합

df.loc[:,['att']] 
att
202212380 65
202212370 95
202212363 65
202212488 55
202212312 80
202212377 75
202212463 65
202212471 60
202212400 95
202212469 90
202212318 55
202212432 95
202212443 95
202212367 50
202212458 50
202212396 95
202212482 50
202212452 65
202212387 70
202212354 90
  • df.loc[:,'att']은 series를 리턴하고 df.loc[:,['att']]은 dataframe을 리턴한다.

- 방법7: nparray 스타일 + bool 인덱싱

df.iloc[:,[True,False,False,False]]
att
202212380 65
202212370 95
202212363 65
202212488 55
202212312 80
202212377 75
202212463 65
202212471 60
202212400 95
202212469 90
202212318 55
202212432 95
202212443 95
202212367 50
202212458 50
202212396 95
202212482 50
202212452 65
202212387 70
202212354 90

- 방법8: ndarray와 dict의 홉합형 + bool 인덱싱

df.loc[:,[True,False,False,False]]
att
202212380 65
202212370 95
202212363 65
202212488 55
202212312 80
202212377 75
202212463 65
202212471 60
202212400 95
202212469 90
202212318 55
202212432 95
202212443 95
202212367 50
202212458 50
202212396 95
202212482 50
202212452 65
202212387 70
202212354 90

여러개의 칼럼을 선택

- 방법1: dict 스타일

df[['att','fin']]
att fin
202212380 65 40
202212370 95 80
202212363 65 30
202212488 55 80
202212312 80 100
202212377 75 15
202212463 65 90
202212471 60 0
202212400 95 10
202212469 90 20
202212318 55 25
202212432 95 0
202212443 95 35
202212367 50 30
202212458 50 85
202212396 95 95
202212482 50 10
202212452 65 45
202212387 70 35
202212354 90 90

- 방법2: ndarray 스타일 (정수리스트로 인덱싱, 슬라이싱, 스트라이딩)

df.iloc[:,[0,1]] # 정수의 리스트를 전달하여 컬럼추출
att rep
202212380 65 55
202212370 95 100
202212363 65 90
202212488 55 80
202212312 80 30
202212377 75 40
202212463 65 45
202212471 60 60
202212400 95 65
202212469 90 80
202212318 55 75
202212432 95 95
202212443 95 55
202212367 50 80
202212458 50 55
202212396 95 30
202212482 50 50
202212452 65 55
202212387 70 70
202212354 90 90
df.iloc[:,range(2)] 
att rep
202212380 65 55
202212370 95 100
202212363 65 90
202212488 55 80
202212312 80 30
202212377 75 40
202212463 65 45
202212471 60 60
202212400 95 65
202212469 90 80
202212318 55 75
202212432 95 95
202212443 95 55
202212367 50 80
202212458 50 55
202212396 95 30
202212482 50 50
202212452 65 55
202212387 70 70
202212354 90 90
df.iloc[:,:2]  # 슬라이싱 , 0,1,2에서 마지막 2는 제외되고 0,1에 해당하는 것만 추출
att rep
202212380 65 55
202212370 95 100
202212363 65 90
202212488 55 80
202212312 80 30
202212377 75 40
202212463 65 45
202212471 60 60
202212400 95 65
202212469 90 80
202212318 55 75
202212432 95 95
202212443 95 55
202212367 50 80
202212458 50 55
202212396 95 30
202212482 50 50
202212452 65 55
202212387 70 70
202212354 90 90
df.iloc[:,::2]  # 스트라이딩
att mid
202212380 65 50
202212370 95 50
202212363 65 60
202212488 55 75
202212312 80 30
202212377 75 100
202212463 65 45
202212471 60 25
202212400 95 20
202212469 90 80
202212318 55 35
202212432 95 45
202212443 95 15
202212367 50 40
202212458 50 15
202212396 95 30
202212482 50 45
202212452 65 15
202212387 70 40
202212354 90 80

- 방법3: ndarray 와 dict의 혼합형

df.loc[:,['att','mid']] 
att mid
202212380 65 50
202212370 95 50
202212363 65 60
202212488 55 75
202212312 80 30
202212377 75 100
202212463 65 45
202212471 60 25
202212400 95 20
202212469 90 80
202212318 55 35
202212432 95 45
202212443 95 15
202212367 50 40
202212458 50 15
202212396 95 30
202212482 50 45
202212452 65 15
202212387 70 40
202212354 90 80
df.loc[:,'att':'mid']  # 마지막의 mid도 포함된다. 
att rep mid
202212380 65 55 50
202212370 95 100 50
202212363 65 90 60
202212488 55 80 75
202212312 80 30 30
202212377 75 40 100
202212463 65 45 45
202212471 60 60 25
202212400 95 65 20
202212469 90 80 80
202212318 55 75 35
202212432 95 95 45
202212443 95 55 15
202212367 50 80 40
202212458 50 55 15
202212396 95 30 30
202212482 50 50 45
202212452 65 55 15
202212387 70 70 40
202212354 90 90 80
df.loc[:,'rep':] 
rep mid fin
202212380 55 50 40
202212370 100 50 80
202212363 90 60 30
202212488 80 75 80
202212312 30 30 100
202212377 40 100 15
202212463 45 45 90
202212471 60 25 0
202212400 65 20 10
202212469 80 80 20
202212318 75 35 25
202212432 95 45 0
202212443 55 15 35
202212367 80 40 30
202212458 55 15 85
202212396 30 30 95
202212482 50 45 10
202212452 55 15 45
202212387 70 40 35
202212354 90 80 90

- 방법4: bool을 이용한 인덱싱

df.iloc[:,[True,False,True,False]]
att mid
202212380 65 50
202212370 95 50
202212363 65 60
202212488 55 75
202212312 80 30
202212377 75 100
202212463 65 45
202212471 60 25
202212400 95 20
202212469 90 80
202212318 55 35
202212432 95 45
202212443 95 15
202212367 50 40
202212458 50 15
202212396 95 30
202212482 50 45
202212452 65 15
202212387 70 40
202212354 90 80
df.loc[:,[True,False,True,False]]
att mid
202212380 65 50
202212370 95 50
202212363 65 60
202212488 55 75
202212312 80 30
202212377 75 100
202212463 65 45
202212471 60 25
202212400 95 20
202212469 90 80
202212318 55 35
202212432 95 45
202212443 95 15
202212367 50 40
202212458 50 15
202212396 95 30
202212482 50 45
202212452 65 15
202212387 70 40
202212354 90 80

첫번째 행을 선택

- 방법1

df.iloc[0]
att    65
rep    55
mid    50
fin    40
Name: 202212380, dtype: int64

- 방법2

df.iloc[[0]]
att rep mid fin
202212380 65 55 50 40

- 방법3

df.iloc[0,:]
att    65
rep    55
mid    50
fin    40
Name: 202212380, dtype: int64

- 방법4

df.iloc[[0],:]
att rep mid fin
202212380 65 55 50 40

- 방법5

df.loc['202212380']
att    65
rep    55
mid    50
fin    40
Name: 202212380, dtype: int64

- 방법6

df.loc[['202212380']]
att rep mid fin
202212380 65 55 50 40

- 방법7

df.loc['202212380',:]
att    65
rep    55
mid    50
fin    40
Name: 202212380, dtype: int64

- 방법8

df.loc[['202212380'],:]
att rep mid fin
202212380 65 55 50 40

- 방법9

len(df)
20
_lst = [True]+[False]*19
df.iloc[_lst] 
att rep mid fin
202212380 65 55 50 40
df.iloc[_lst,:] 
att rep mid fin
202212380 65 55 50 40
df.loc[_lst] 
att rep mid fin
202212380 65 55 50 40
df.loc[_lst,:] 
att rep mid fin
202212380 65 55 50 40

여러개의 행을 선택

- 방법1

df.iloc[[0,2]] 
att rep mid fin
202212380 65 55 50 40
202212363 65 90 60 30
df.iloc[[0,2],:] 
att rep mid fin
202212380 65 55 50 40
202212363 65 90 60 30

- 방법2

df.loc[['202212380','202212363']] 
att rep mid fin
202212380 65 55 50 40
202212363 65 90 60 30
df.loc[['202212380','202212363'],:] 
att rep mid fin
202212380 65 55 50 40
202212363 65 90 60 30

- 그 밖의 방법들

df.iloc[::3] # 스트라이딩
att rep mid fin
202212380 65 55 50 40
202212488 55 80 75 80
202212463 65 45 45 90
202212469 90 80 80 20
202212443 95 55 15 35
202212396 95 30 30 95
202212387 70 70 40 35
df.iloc[:5]
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
df.loc[:'202212312']
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
df.loc[list(df.att<80),'rep':]
rep mid fin
202212380 55 50 40
202212363 90 60 30
202212488 80 75 80
202212377 40 100 15
202212463 45 45 90
202212471 60 25 0
202212318 75 35 25
202212367 80 40 30
202212458 55 15 85
202212482 50 45 10
202212452 55 15 45
202212387 70 40 35
df.loc[df.att<80,'rep':]
rep mid fin
202212380 55 50 40
202212363 90 60 30
202212488 80 75 80
202212377 40 100 15
202212463 45 45 90
202212471 60 25 0
202212318 75 35 25
202212367 80 40 30
202212458 55 15 85
202212482 50 45 10
202212452 55 15 45
202212387 70 40 35
df.iloc[list(df.att<80),1:]
rep mid fin
202212380 55 50 40
202212363 90 60 30
202212488 80 75 80
202212377 40 100 15
202212463 45 45 90
202212471 60 25 0
202212318 75 35 25
202212367 80 40 30
202212458 55 15 85
202212482 50 45 10
202212452 55 15 45
202212387 70 40 35

- 아래는 에러가 난다 주의!

df.iloc[df.att<80,1:]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py:769, in _LocationIndexer._validate_tuple_indexer(self, key)
    768 try:
--> 769     self._validate_key(k, i)
    770 except ValueError as err:

File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py:1352, in _iLocIndexer._validate_key(self, key, axis)
   1347         raise NotImplementedError(
   1348             "iLocation based boolean "
   1349             "indexing on an integer type "
   1350             "is not available"
   1351         )
-> 1352     raise ValueError(
   1353         "iLocation based boolean indexing cannot use "
   1354         "an indexable as a mask"
   1355     )
   1356 return

ValueError: iLocation based boolean indexing cannot use an indexable as a mask

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Input In [383], in <cell line: 1>()
----> 1 df.iloc[df.att<80,1:]

File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py:961, in _LocationIndexer.__getitem__(self, key)
    959     if self._is_scalar_access(key):
    960         return self.obj._get_value(*key, takeable=self._takeable)
--> 961     return self._getitem_tuple(key)
    962 else:
    963     # we by definition only have the 0th axis
    964     axis = self.axis or 0

File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py:1458, in _iLocIndexer._getitem_tuple(self, tup)
   1456 def _getitem_tuple(self, tup: tuple):
-> 1458     tup = self._validate_tuple_indexer(tup)
   1459     with suppress(IndexingError):
   1460         return self._getitem_lowerdim(tup)

File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py:771, in _LocationIndexer._validate_tuple_indexer(self, key)
    769         self._validate_key(k, i)
    770     except ValueError as err:
--> 771         raise ValueError(
    772             "Location based indexing can only have "
    773             f"[{self._valid_types}] types"
    774         ) from err
    775 return key

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

query ($\star$)

- 예제1

df.query('att==90 and mid>30')
att rep mid fin
202212469 90 80 80 20
202212354 90 90 80 90

- 예제2

df.query('att<rep and mid<fin')
att rep mid fin
202212370 95 100 50 80
202212488 55 80 75 80
202212458 50 55 15 85

- 예제3

df.query('att < rep < 80')
att rep mid fin
202212318 55 75 35 25
202212458 50 55 15 85

- 예제4

df.query('50 < att <= 90 and mid < fin')
att rep mid fin
202212488 55 80 75 80
202212312 80 30 30 100
202212463 65 45 45 90
202212452 65 55 15 45
202212354 90 90 80 90

- 예제5

df.query(' (mid+fin)/2 >=60')
att rep mid fin
202212370 95 100 50 80
202212488 55 80 75 80
202212312 80 30 30 100
202212463 65 45 45 90
202212396 95 30 30 95
202212354 90 90 80 90

- 예제6

_mean  = df.att.mean()
_mean 
73.0
df.query('att >= 73')
att rep mid fin
202212370 95 100 50 80
202212312 80 30 30 100
202212377 75 40 100 15
202212400 95 65 20 10
202212469 90 80 80 20
202212432 95 95 45 0
202212443 95 55 15 35
202212396 95 30 30 95
202212354 90 90 80 90
df.query('att >= @_mean')
# df.query('att>= _mean')은 실행 안된다.
att rep mid fin
202212370 95 100 50 80
202212312 80 30 30 100
202212377 75 40 100 15
202212400 95 65 20 10
202212469 90 80 80 20
202212432 95 95 45 0
202212443 95 55 15 35
202212396 95 30 30 95
202212354 90 90 80 90

- 예제7

df
att rep mid fin
202212380 65 55 50 40
202212370 95 100 50 80
202212363 65 90 60 30
202212488 55 80 75 80
202212312 80 30 30 100
202212377 75 40 100 15
202212463 65 45 45 90
202212471 60 60 25 0
202212400 95 65 20 10
202212469 90 80 80 20
202212318 55 75 35 25
202212432 95 95 45 0
202212443 95 55 15 35
202212367 50 80 40 30
202212458 50 55 15 85
202212396 95 30 30 95
202212482 50 50 45 10
202212452 65 55 15 45
202212387 70 70 40 35
202212354 90 90 80 90
df.query("index <= '202212354' or index=='202212387'")
att rep mid fin
202212312 80 30 30 100
202212318 55 75 35 25
202212387 70 70 40 35
202212354 90 90 80 90

사실 이 기능은 시계열자료에서 꽃핀다.

- 예제8

pd.date_range('20211226',periods=10)
DatetimeIndex(['2021-12-26', '2021-12-27', '2021-12-28', '2021-12-29',
               '2021-12-30', '2021-12-31', '2022-01-01', '2022-01-02',
               '2022-01-03', '2022-01-04'],
              dtype='datetime64[ns]', freq='D')
_df=pd.DataFrame(np.random.normal(size=(10,4)),columns=list('ABCD'),index=pd.date_range('20211226',periods=10))
_df
A B C D
2021-12-26 -0.367399 0.925727 -0.378494 -1.597765
2021-12-27 -0.016530 1.041946 1.511319 -0.970646
2021-12-28 -1.556636 1.629024 -0.965807 0.309184
2021-12-29 0.554734 0.712753 0.769650 0.277420
2021-12-30 -0.897457 0.110547 1.026816 0.294377
2021-12-31 0.904037 -0.809864 0.473510 -0.047875
2022-01-01 -0.850946 -0.915568 -1.549329 -0.040507
2022-01-02 0.054391 -0.732693 0.409205 -1.165549
2022-01-03 0.053472 0.800553 -0.040499 -0.447199
2022-01-04 -0.304462 1.257485 0.888741 0.615561
_df.query(" '2021-12-29' <=  index <= '2022-01-03' ")
A B C D
2021-12-29 0.554734 0.712753 0.769650 0.277420
2021-12-30 -0.897457 0.110547 1.026816 0.294377
2021-12-31 0.904037 -0.809864 0.473510 -0.047875
2022-01-01 -0.850946 -0.915568 -1.549329 -0.040507
2022-01-02 0.054391 -0.732693 0.409205 -1.165549
2022-01-03 0.053472 0.800553 -0.040499 -0.447199
_df.query(" '2021-12-29' <=  index <= '2022-01-03' and A+B<C  ")
A B C D
2021-12-30 -0.897457 0.110547 1.026816 0.294377
2021-12-31 0.904037 -0.809864 0.473510 -0.047875
2022-01-01 -0.850946 -0.915568 -1.549329 -0.040507
2022-01-02 0.054391 -0.732693 0.409205 -1.165549

- query가 만능은 아니다.

df.columns = pd.Index(['att score', 'rep score', 'mid score', 'fin score'])
df.query(" att score < 90 ")
# df.loc[df['att score'] < 90,:] <-- 이렇게 구현하면 된다! (아니면 변수이름을 바꿔주는 코드를 짜든가)
Traceback (most recent call last):

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3361 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  Input In [455] in <cell line: 1>
    df.query(" att score < 90 ")

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/frame.py:4105 in query
    res = self.eval(expr, **kwargs)

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/frame.py:4234 in eval
    return _eval(expr, inplace=inplace, **kwargs)

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/computation/eval.py:350 in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/computation/expr.py:811 in __init__
    self.terms = self.parse()

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/computation/expr.py:830 in parse
    return self._visitor.visit(self.expr)

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/computation/expr.py:411 in visit
    raise e

  File ~/anaconda3/envs/py39/lib/python3.9/site-packages/pandas/core/computation/expr.py:407 in visit
    node = ast.fix_missing_locations(ast.parse(clean))

  File ~/anaconda3/envs/py39/lib/python3.9/ast.py:50 in parse
    return compile(source, filename, mode, flags,

  File <unknown>:1
    att score <90
        ^
SyntaxError: invalid syntax

pandas 공부 3단계

전치

ndarray = np.arange(2*3).reshape(2,3)
df = pd.DataFrame(ndarray)
df
0 1 2
0 0 1 2
1 3 4 5
ndarray.T
array([[0, 3],
       [1, 4],
       [2, 5]])
df.T
0 1
0 0 3
1 1 4
2 2 5

ndarray.sum(axis=0)
array([3, 5, 7])
df.sum(axis=0)
0    3
1    5
2    7
dtype: int64
ndarray.sum(axis=1)
array([ 3, 12])
df.sum(axis=1)
0     3
1    12
dtype: int64

cumsum

df
0 1 2
0 0 1 2
1 3 4 5
ndarray.cumsum(axis=0)
array([[0, 1, 2],
       [3, 5, 7]])
df.cumsum(axis=0)
0 1 2
0 0 1 2
1 3 5 7
ndarray.cumsum(axis=1)
array([[ 0,  1,  3],
       [ 3,  7, 12]])
df.cumsum(axis=1)
0 1 2
0 0 1 3
1 3 7 12

형태변환

ndarray.tolist()
[[0, 1, 2], [3, 4, 5]]
df.to_dict()
{0: {0: 0, 1: 3}, 1: {0: 1, 1: 4}, 2: {0: 2, 1: 5}}
df.to_numpy()
array([[0, 1, 2],
       [3, 4, 5]])
df.to_numpy().tolist()
[[0, 1, 2], [3, 4, 5]]

생략...

pandas 공부 4단계~

생략

숙제

- 아래의 데이터프레임에서 1,3번째 열을 추출하라.

df= pd.DataFrame({'att':[90,90,95],'rep':[80,90,90],'mid':[50,60,70], 'fin':[70,80,50]})
df
att rep mid fin
0 90 80 50 70
1 90 90 60 80
2 95 90 70 50