05wk-2: Pandas – transform column (꿀팁)

Author

최규빈

Published

October 5, 2023

1. 강의영상

2. Imports

import pandas as pd
import numpy as np

3. Pandas: transform column (꿀팁)

A. `lambda`

저번시간에 했음

B. `map`

저번시간에 했음

C. `s.apply(변환함수)`

- 예시1: 원소별로 처음3개의 숫자만 출력

df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
s = df.Height

s.apply(lambda x: x[:3])

0        189
1        179
2        172
3        181
4        172
        ... 
17655    190
17656    195
17657    190
17658    187
17659    186
Name: Height, Length: 17660, dtype: object

- 예시2: 원소별로 처음3개의 문자만 출력 $\to$ str자료형을 int자료형으로 변환

df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
s = df.Height

(방법1) – 한번에

s.apply(lambda x: int(x[:3]))

0        189
1        179
2        172
3        181
4        172
        ... 
17655    190
17656    195
17657    190
17658    187
17659    186
Name: Height, Length: 17660, dtype: int64

(방법2) – 연쇄적으로..

s.apply(lambda x: x[:3]).apply(int)

0        189
1        179
2        172
3        181
4        172
        ... 
17655    190
17656    195
17657    190
17658    187
17659    186
Name: Height, Length: 17660, dtype: int64

D. `s.str`, `idx.str`

- 예시1: 원소별로 처음 3개의 숫자만 출력

df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
s = df.Height

s.str[:3]

0        189
1        179
2        172
3        181
4        172
        ... 
17655    190
17656    195
17657    190
17658    187
17659    186
Name: Height, Length: 17660, dtype: object

- 예시2: 원소별로 isupper를 수행

s = pd.Series(['A','B','C','d','e','F'])
s

0    A
1    B
2    C
3    d
4    e
5    F
dtype: object

s.str.isupper()

0     True
1     True
2     True
3    False
4    False
5     True
dtype: bool

- 예시3: 원소별로 공백제거 (pd.Series 뿐만 아니라 pd.Index 자료형에도 사용가능)

df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
idx = df.columns

idx.str.replace(' ','')

Index(['ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall',
       'Potential', 'Club', 'ClubLogo', 'Value', 'Wage', 'Special',
       'PreferredFoot', 'InternationalReputation', 'WeakFoot', 'SkillMoves',
       'WorkRate', 'BodyType', 'RealFace', 'Position', 'Joined', 'LoanedFrom',
       'ContractValidUntil', 'Height', 'Weight', 'ReleaseClause', 'KitNumber',
       'BestOverallRating'],
      dtype='object')

E. `s.astype()`

- 예시1: 원소의 타입을 모두 int형으로 변경

s = pd.Series(list('12345'))
s

0    1
1    2
2    3
3    4
4    5
dtype: object

s.astype(int)

0    1
1    2
2    3
3    4
4    5
dtype: int64

- 예시2: 원소의 타입을 변환한 이후 브로드캐스팅 (int)

s1 = pd.Series(list('12345'))
s2 = pd.Series([-1,-2,-3,-4,-5])

s1.astype(int)+s2

0    0
1    0
2    0
3    0
4    0
dtype: int64

# 예시3: 원소의 타입을 변환한 이후 브로드캐스팅 (str)

df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2023/main/posts/titanic.csv")[:5]
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	logFare
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	1.981001
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	4.266662
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	2.070022
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	3.972177
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	2.085672

위의 자료에서 Embarked 열과 Pclass열을 이용하여 아래와 같은 New Feature를 만들어라.

Embarked	Pclass	New Feature
‘S’	3	‘S3’
‘C’	1	‘C1’
‘S’	3	‘S3’
‘S’	1	‘S1’
‘S’	3	‘S3’

(풀이)

df.Embarked + df.Pclass.astype(str)

0    S3
1    C1
2    S3
3    S1
4    S3
dtype: object

#

F. 컴프리헨션, `lambda+map`을 무시하지 말 것

# 예시1

df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2023/main/posts/titanic.csv")[:5]
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	logFare
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	1.981001
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	4.266662
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	2.070022
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	3.972177
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	2.085672

위의 자료에서 아래와 같은 변환을 하고 싶다면?

$f (sex, sibsp) = {\begin{cases} 0.7 + 0.25 \times sibsp & if sex = ’female’ \\ 0.1 + 0.15 \times sibsp & otherwise \end{cases}$

df.assign(Prob= list(map(lambda sex,sibsp: 0.7 + sibsp*0.25 if sex=='female' else 0.2 + sibsp*0.15, df.Sex, df.SibSp)))

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	logFare	Prob
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	1.981001	0.35
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	4.266662	0.95
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	2.070022	0.70
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	3.972177	0.95
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	2.085672	0.20

#

# 예시2

df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2023/main/posts/titanic.csv")[:5]
df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	logFare
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S	1.981001
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	4.266662
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S	2.070022
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	3.972177
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S	2.085672

위의 자료에서 Name열을 아래와 같이 분리하는 작업을 수행하라.

	title	Name
0	Mr	Owen Harris Braund
1	Mrs	John Bradley (Florence Briggs Thayer) Cumings
2	Miss	Laina Heikkinen
3	Mrs	Jacques Heath (Lily May Peel) Futrelle
4	Mr	William Henry Allen

(풀이)

_lst = [[title,f'{first_name} {last_name}'] for last_name, title, first_name in df.Name.str.replace(', ','/').str.replace('. ','/').str.split('/')]
pd.DataFrame(_lst, columns=['title','Name2'])

	title	Name2
0	Mr	Owen Harris Braund
1	Mrs	John Bradley (Florence Briggs Thayer) Cumings
2	Miss	Laina Heikkinen
3	Mrs	Jacques Heath (Lily May Peel) Futrelle
4	Mr	William Henry Allen

참고: “Mr,Mrs,Miss” 만 뽑아내는 코드 (대면수업에서 하려다가 실패한 코드)

df.Name.str.split(', ').str[-1].str.split('. ').str[0]

0      Mr
1     Mrs
2    Miss
3     Mrs
4      Mr
Name: Name, dtype: object

#

4. 숙제

아래와 같은 데이터프레임이 있다고 하자.

df = pd.DataFrame({'A':[1,2,3,4]})
df

	A
0	1
1	2
2	3
3	4

아래의 결과를 관찰하고 이러한 결과가 나오는 이유를 정리하여 제출하라.

df[['A']].apply(np.mean)

A    2.5
dtype: float64

df['A'].apply(np.mean)

0    1.0
1    2.0
2    3.0
3    4.0
Name: A, dtype: float64

1. 강의영상

2. Imports

3. Pandas: transform column (꿀팁)

A. lambda

B. map

C. s.apply(변환함수)

D. s.str, idx.str

E. s.astype()

F. 컴프리헨션, lambda+map을 무시하지 말 것