04wk-15: Handling Missing Values with sklearn.impute

Author

최규빈

Published

September 26, 2023

1. Lecture video

2. Import

import numpy as np
import pandas as pd 
import sklearn.impute

3. sklearn.impute

A. Imputing numeric data

- The given data

df = pd.DataFrame({'A':[2.1,1.9,2.2,np.nan,1.9], 'B':[0,0,np.nan,0,0]})
df
     A    B
0  2.1  0.0
1  1.9  0.0
2  2.2  NaN
3  NaN  0.0
4  1.9  0.0

- Couldn't we roughly estimate the blanks as follows, using each column's mean?

df.loc[3,'A'] = df.A.mean()
df.loc[2,'B'] = df.B.mean()
df
       A    B
0  2.100  0.0
1  1.900  0.0
2  2.200  0.0
3  2.025  0.0
4  1.900  0.0
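The manual fill above targets one cell at a time. As a minimal pandas-only sketch of the same idea (not part of the original lecture code), `fillna` can take the Series of column means and fill every column at once. The name `df_filled` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [2.1, 1.9, 2.2, np.nan, 1.9], 'B': [0, 0, np.nan, 0, 0]})

# df.mean() skips NaN by default, so each column mean uses only observed values.
col_means = df.mean()

# fillna with a Series fills each column with the entry matching its column name.
df_filled = df.fillna(col_means)
print(df_filled)
```

This avoids hard-coding the row labels of the missing cells, which is what the automatic approach in the next step also achieves.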

- How can we do this automatically?

df = pd.DataFrame({'A':[2.1,1.9,2.2,np.nan,1.9], 'B':[0,0,np.nan,0,0]})
df
     A    B
0  2.1  0.0
1  1.9  0.0
2  2.2  NaN
3  NaN  0.0
4  1.9  0.0

(Method 1) – fit, then transform

imputer = sklearn.impute.SimpleImputer()
imputer.fit(df)
imputer.transform(df)
array([[2.1  , 0.   ],
       [1.9  , 0.   ],
       [2.2  , 0.   ],
       [2.025, 0.   ],
       [1.9  , 0.   ]])

(Method 2) – fit_transform in one step

imputer = sklearn.impute.SimpleImputer()
imputer.fit_transform(df)
array([[2.1  , 0.   ],
       [1.9  , 0.   ],
       [2.2  , 0.   ],
       [2.025, 0.   ],
       [1.9  , 0.   ]])
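The two methods give the same array here, but keeping `fit` and `transform` separate matters when the fitted imputer is reused on other data. The sketch below inspects the learned fill values through the `statistics_` attribute and applies them to `df_new`, a hypothetical second DataFrame that is not in the lecture.

```python
import numpy as np
import pandas as pd
import sklearn.impute

df = pd.DataFrame({'A': [2.1, 1.9, 2.2, np.nan, 1.9], 'B': [0, 0, np.nan, 0, 0]})

imputer = sklearn.impute.SimpleImputer()  # default strategy is 'mean'
imputer.fit(df)                           # learns one fill value per column

# The learned fill values are stored in the statistics_ attribute.
print(imputer.statistics_)

# Hypothetical new data (not from the lecture) with the same columns.
df_new = pd.DataFrame({'A': [np.nan, 3.0], 'B': [1.0, np.nan]})

# transform reuses the means learned from df, not the means of df_new.
filled_new = imputer.transform(df_new)
print(filled_new)
```

The missing `A` in `df_new` is filled with the mean of `df.A`, not with 3.0, because the statistics come from the data passed to `fit`.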

- Replacing missing values in other ways

(Method 1) – replace with the mean

imputer = sklearn.impute.SimpleImputer(strategy='mean')
imputer.fit_transform(df)
array([[2.1  , 0.   ],
       [1.9  , 0.   ],
       [2.2  , 0.   ],
       [2.025, 0.   ],
       [1.9  , 0.   ]])

(Method 2) – replace with the median

imputer = sklearn.impute.SimpleImputer(strategy='median')
imputer.fit_transform(df)
array([[2.1, 0. ],
       [1.9, 0. ],
       [2.2, 0. ],
       [2. , 0. ],
       [1.9, 0. ]])

(Method 3) – replace with the most frequent value (mode)

imputer = sklearn.impute.SimpleImputer(strategy='most_frequent')
imputer.fit_transform(df)
array([[2.1, 0. ],
       [1.9, 0. ],
       [2.2, 0. ],
       [1.9, 0. ],
       [1.9, 0. ]])

(Method 4) – replace with a constant

imputer = sklearn.impute.SimpleImputer(strategy='constant',fill_value=-999)
imputer.fit_transform(df)
array([[   2.1,    0. ],
       [   1.9,    0. ],
       [   2.2, -999. ],
       [-999. ,    0. ],
       [   1.9,    0. ]])
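To see where the fill values 2.025, 2.0, and 1.9 in the outputs above come from, the following sketch fits one imputer per data-driven strategy and tabulates the per-column `statistics_`. The names `fills` and `summary` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.impute

df = pd.DataFrame({'A': [2.1, 1.9, 2.2, np.nan, 1.9], 'B': [0, 0, np.nan, 0, 0]})

# Fit one imputer per strategy and collect the per-column fill values.
fills = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = sklearn.impute.SimpleImputer(strategy=strategy)
    imputer.fit(df)
    fills[strategy] = imputer.statistics_

# Rows are the columns of df; each table column is one strategy.
summary = pd.DataFrame(fills, index=df.columns)
print(summary)
```

For column A the observed values are 1.9, 1.9, 2.1, 2.2, so the mean is 8.1/4 = 2.025, the median is (1.9 + 2.1)/2 = 2.0, and the most frequent value is 1.9.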

B. Imputing categorical data

- Data

df = pd.DataFrame({'A':['Y','N','Y','Y',np.nan], 'B':['stat','math',np.nan,'stat','bio']})
df
     A     B
0    Y  stat
1    N  math
2    Y   NaN
3    Y  stat
4  NaN   bio

- Only most-frequent or constant replacement is possible (a mean or median is undefined for string values)
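This restriction can be checked directly. The sketch below tries to fit a mean imputer on the string columns and catches the error. It assumes that scikit-learn signals the failure with a `ValueError`, and the flag `mean_ok` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.impute

df = pd.DataFrame({'A': ['Y', 'N', 'Y', 'Y', np.nan],
                   'B': ['stat', 'math', np.nan, 'stat', 'bio']})

# A mean cannot be computed from strings, so fitting is expected to fail.
try:
    sklearn.impute.SimpleImputer(strategy='mean').fit(df)
    mean_ok = True
except ValueError as err:
    mean_ok = False
    print('mean strategy failed:', err)
```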

(Method 1) – use the most frequent value

imptr = sklearn.impute.SimpleImputer(strategy='most_frequent')
imptr.fit_transform(df)
array([['Y', 'stat'],
       ['N', 'math'],
       ['Y', 'stat'],
       ['Y', 'stat'],
       ['Y', 'bio']], dtype=object)

(Method 2) – replace with a constant, using one imputer per column

imptr1 = sklearn.impute.SimpleImputer(strategy='constant',fill_value='Y')
imptr1.fit_transform(df[['A']])
imptr2 = sklearn.impute.SimpleImputer(strategy='constant',fill_value='math')
imptr2.fit_transform(df[['B']])
array([['stat'],
       ['math'],
       ['math'],
       ['stat'],
       ['bio']], dtype=object)
np.concatenate([imptr1.fit_transform(df[['A']]),imptr2.fit_transform(df[['B']])],axis=1)
array([['Y', 'stat'],
       ['N', 'math'],
       ['Y', 'math'],
       ['Y', 'stat'],
       ['Y', 'bio']], dtype=object)
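The code above needs two imputers and a concatenation because each column gets a different constant. As a pandas-only alternative sketch (not part of the original lecture code), `fillna` accepts a dict mapping column names to fill values. The name `df_filled` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Y', 'N', 'Y', 'Y', np.nan],
                   'B': ['stat', 'math', np.nan, 'stat', 'bio']})

# A dict passed to fillna maps each column name to its own fill value.
df_filled = df.fillna({'A': 'Y', 'B': 'math'})
print(df_filled)
```

Unlike the array returned by `fit_transform`, the result stays a DataFrame with its column names. The trade-off is that nothing is fitted, so there is no imputer object to reuse on other data.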

C. Imputing mixed-type data – (1) impute everything with the most frequent value

# Example: Impute all missing values in the df below with the most frequent value.

df = pd.DataFrame(
    {'A':[2.1,1.9,2.2,np.nan,1.9],
     'B':[0,0,np.nan,0,0],
     'C':['Y','N','Y','Y',np.nan], 
     'D':['stat','math',np.nan,'stat','bio']}
)
df
     A    B    C     D
0  2.1  0.0    Y  stat
1  1.9  0.0    N  math
2  2.2  NaN    Y   NaN
3  NaN  0.0    Y  stat
4  1.9  0.0  NaN   bio

(Solution)

imptr = sklearn.impute.SimpleImputer(strategy='most_frequent')
imptr.fit_transform(df)
array([[2.1, 0.0, 'Y', 'stat'],
       [1.9, 0.0, 'N', 'math'],
       [2.2, 0.0, 'Y', 'stat'],
       [1.9, 0.0, 'Y', 'stat'],
       [1.9, 0.0, 'Y', 'bio']], dtype=object)
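The result above is a NumPy array with `dtype=object`, so the column names and the numeric dtypes of A and B are lost. One possible way to get a DataFrame back is sketched below. The name `df_imputed` is introduced here, and restoring dtypes with `astype` is a choice made for this sketch, not the lecture's method.

```python
import numpy as np
import pandas as pd
import sklearn.impute

df = pd.DataFrame(
    {'A': [2.1, 1.9, 2.2, np.nan, 1.9],
     'B': [0, 0, np.nan, 0, 0],
     'C': ['Y', 'N', 'Y', 'Y', np.nan],
     'D': ['stat', 'math', np.nan, 'stat', 'bio']}
)

imptr = sklearn.impute.SimpleImputer(strategy='most_frequent')
arr = imptr.fit_transform(df)  # object array: column names and dtypes are lost

# Rebuild a DataFrame with the original labels, then restore each column's dtype.
df_imputed = pd.DataFrame(arr, columns=df.columns, index=df.index)
df_imputed = df_imputed.astype(df.dtypes.to_dict())
print(df_imputed.dtypes)
```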

#

D. Imputing mixed-type data – (2) mean for numeric columns, most frequent value for categorical columns

# Example: In the df below, impute numeric columns with the mean and string columns with the most frequent value.

df = pd.DataFrame(
    {'A':[2.1,1.9,2.2,np.nan,1.9],
     'B':[0,0,np.nan,0,0],
     'C':['Y','N','Y','Y',np.nan], 
     'D':['stat','math',np.nan,'stat','bio']}
)
df
     A    B    C     D
0  2.1  0.0    Y  stat
1  1.9  0.0    N  math
2  2.2  NaN    Y   NaN
3  NaN  0.0    Y  stat
4  1.9  0.0  NaN   bio

(Solution)

- step1: create a copy

df_imputed = df.copy()
df_imputed
     A    B    C     D
0  2.1  0.0    Y  stat
1  1.9  0.0    N  math
2  2.2  NaN    Y   NaN
3  NaN  0.0    Y  stat
4  1.9  0.0  NaN   bio

- step2: split the DataFrame into numeric and non-numeric columns

df_num = df.select_dtypes(include="number")
df_num
     A    B
0  2.1  0.0
1  1.9  0.0
2  2.2  NaN
3  NaN  0.0
4  1.9  0.0
df_cat = df.select_dtypes(exclude="number")
df_cat 
     C     D
0    Y  stat
1    N  math
2    Y   NaN
3    Y  stat
4  NaN   bio

- step3: impute

df_imputed[df_num.columns] = sklearn.impute.SimpleImputer(strategy='mean').fit_transform(df_num)
df_imputed[df_cat.columns] = sklearn.impute.SimpleImputer(strategy='most_frequent').fit_transform(df_cat)
df_imputed
       A    B  C     D
0  2.100  0.0  Y  stat
1  1.900  0.0  N  math
2  2.200  0.0  Y  stat
3  2.025  0.0  Y  stat
4  1.900  0.0  Y   bio
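The split, impute, and reassign steps can also be expressed with scikit-learn's `ColumnTransformer`, which routes column groups to different transformers. This is an alternative sketch, not the lecture's solution. The step names `'num'` and `'cat'` and the variable `ct` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.compose
import sklearn.impute

df = pd.DataFrame(
    {'A': [2.1, 1.9, 2.2, np.nan, 1.9],
     'B': [0, 0, np.nan, 0, 0],
     'C': ['Y', 'N', 'Y', 'Y', np.nan],
     'D': ['stat', 'math', np.nan, 'stat', 'bio']}
)

# Route numeric columns to a mean imputer and all other columns to a mode imputer.
ct = sklearn.compose.ColumnTransformer([
    ('num', sklearn.impute.SimpleImputer(strategy='mean'),
     sklearn.compose.make_column_selector(dtype_include=np.number)),
    ('cat', sklearn.impute.SimpleImputer(strategy='most_frequent'),
     sklearn.compose.make_column_selector(dtype_exclude=np.number)),
])

arr = ct.fit_transform(df)

# Output columns follow the transformer order (numeric first, then the rest),
# which happens to match the original order A, B, C, D for this df.
df_imputed = pd.DataFrame(arr, columns=list(df.columns), index=df.index)
print(df_imputed)
```

Because the fitted `ct` stores both imputers, the same object could later be applied to other data with `ct.transform`. If the numeric and non-numeric columns were interleaved in the original df, the column labels would need to be reordered to match the output.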