import pandas as pd
import numpy as np05wk-2: Pandas – transform column (꿀팁)
1. 강의영상
2. Imports
3. Pandas: transform column (꿀팁)
A. lambda
저번시간에 했음
B. map
저번시간에 했음
C. s.apply(변환함수)
- 예시1: 원소별로 처음3개의 숫자만 출력
df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
s = df.Heights.apply(lambda x: x[:3])0 189
1 179
2 172
3 181
4 172
...
17655 190
17656 195
17657 190
17658 187
17659 186
Name: Height, Length: 17660, dtype: object
- 예시2: 원소별로 처음3개의 문자만 출력 \(\to\) str자료형을 int자료형으로 변환
df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
s = df.Height(방법1) – 한번에
s.apply(lambda x: int(x[:3]))0 189
1 179
2 172
3 181
4 172
...
17655 190
17656 195
17657 190
17658 187
17659 186
Name: Height, Length: 17660, dtype: int64
(방법2) – 연쇄적으로..
s.apply(lambda x: x[:3]).apply(int)0 189
1 179
2 172
3 181
4 172
...
17655 190
17656 195
17657 190
17658 187
17659 186
Name: Height, Length: 17660, dtype: int64
D. s.str, idx.str
- 예시1: 원소별로 처음 3개의 숫자만 출력
df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
s = df.Heights.str[:3]0 189
1 179
2 172
3 181
4 172
...
17655 190
17656 195
17657 190
17658 187
17659 186
Name: Height, Length: 17660, dtype: object
- 예시2: 원소별로 isupper를 수행
s = pd.Series(['A','B','C','d','e','F'])
s0 A
1 B
2 C
3 d
4 e
5 F
dtype: object
s.str.isupper()0 True
1 True
2 True
3 False
4 False
5 True
dtype: bool
- 예시3: 원소별로 공백제거 (pd.Series 뿐만 아니라 pd.Index 자료형에도 사용가능)
df = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/FIFA23_official_data.csv')
idx = df.columnsidx.str.replace(' ','')Index(['ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall',
'Potential', 'Club', 'ClubLogo', 'Value', 'Wage', 'Special',
'PreferredFoot', 'InternationalReputation', 'WeakFoot', 'SkillMoves',
'WorkRate', 'BodyType', 'RealFace', 'Position', 'Joined', 'LoanedFrom',
'ContractValidUntil', 'Height', 'Weight', 'ReleaseClause', 'KitNumber',
'BestOverallRating'],
dtype='object')
E. s.astype()
- 예시1: 원소의 타입을 모두 int형으로 변경
s = pd.Series(list('12345'))
s0 1
1 2
2 3
3 4
4 5
dtype: object
s.astype(int)0 1
1 2
2 3
3 4
4 5
dtype: int64
- 예시2: 원소의 타입을 변환한 이후 브로드캐스팅 (int)
s1 = pd.Series(list('12345'))
s2 = pd.Series([-1,-2,-3,-4,-5])s1.astype(int)+s20 0
1 0
2 0
3 0
4 0
dtype: int64
# 예시3: 원소의 타입을 변환한 이후 브로드캐스팅 (str)
df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2023/main/posts/titanic.csv")[:5]
df| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | logFare | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1.981001 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4.266662 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2.070022 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3.972177 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 2.085672 |
위의 자료에서 Embarked 열과 Pclass열을 이용하여 아래와 같은 New Feature를 만들어라.
| Embarked | Pclass | New Feature |
|---|---|---|
| ‘S’ | 3 | ‘S3’ |
| ‘C’ | 1 | ‘C1’ |
| ‘S’ | 3 | ‘S3’ |
| ‘S’ | 1 | ‘S1’ |
| ‘S’ | 3 | ‘S3’ |
(풀이)
df.Embarked + df.Pclass.astype(str)0 S3
1 C1
2 S3
3 S1
4 S3
dtype: object
#
F. 컴프리헨션, lambda+map을 무시하지 말 것
# 예시1
df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2023/main/posts/titanic.csv")[:5]
df| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | logFare | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1.981001 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4.266662 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2.070022 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3.972177 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 2.085672 |
위의 자료에서 아래와 같은 변환을 하고 싶다면?
\[ f(\text{sex}, \text{sibsp}) = \begin{cases} 0.7 + 0.25 \times \text{sibsp} & \text{if } \text{sex} = \text{'female'} \\ 0.1 + 0.15 \times \text{sibsp} & \text{otherwise} \end{cases} \]
df.assign(Prob= list(map(lambda sex,sibsp: 0.7 + sibsp*0.25 if sex=='female' else 0.2 + sibsp*0.15, df.Sex, df.SibSp)))| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | logFare | Prob | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1.981001 | 0.35 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4.266662 | 0.95 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2.070022 | 0.70 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3.972177 | 0.95 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 2.085672 | 0.20 |
#
# 예시2
df = pd.read_csv("https://raw.githubusercontent.com/guebin/DV2023/main/posts/titanic.csv")[:5]
df| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | logFare | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1.981001 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4.266662 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2.070022 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3.972177 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 2.085672 |
위의 자료에서 Name열을 아래와 같이 분리하는 작업을 수행하라.
| title | Name | |
|---|---|---|
| 0 | Mr | Owen Harris Braund |
| 1 | Mrs | John Bradley (Florence Briggs Thayer) Cumings |
| 2 | Miss | Laina Heikkinen |
| 3 | Mrs | Jacques Heath (Lily May Peel) Futrelle |
| 4 | Mr | William Henry Allen |
(풀이)
_lst = [[title,f'{first_name} {last_name}'] for last_name, title, first_name in df.Name.str.replace(', ','/').str.replace('. ','/').str.split('/')]
pd.DataFrame(_lst, columns=['title','Name2'])| title | Name2 | |
|---|---|---|
| 0 | Mr | Owen Harris Braund |
| 1 | Mrs | John Bradley (Florence Briggs Thayer) Cumings |
| 2 | Miss | Laina Heikkinen |
| 3 | Mrs | Jacques Heath (Lily May Peel) Futrelle |
| 4 | Mr | William Henry Allen |
참고: “Mr,Mrs,Miss” 만 뽑아내는 코드 (대면수업에서 하려다가 실패한 코드)
df.Name.str.split(', ').str[-1].str.split('. ').str[0]0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
Name: Name, dtype: object
#
4. 숙제
아래와 같은 데이터프레임이 있다고 하자.
df = pd.DataFrame({'A':[1,2,3,4]})
df| A | |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
아래의 결과를 관찰하고 이러한 결과가 나오는 이유를 정리하여 제출하라.
df[['A']].apply(np.mean)A 2.5
dtype: float64
df['A'].apply(np.mean)0 1.0
1 2.0
2 3.0
3 4.0
Name: A, dtype: float64