Good visualization, an introduction to the mpg data, and high-dimensional scatter plots with plotnine (p9)

Lecture videos

https://youtube.com/playlist?list=PLQqh36zP38-yAhneToXbka1K31FT21q4X

imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *

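What makes a good visualization?

Edward Tufte

- A leading figure in data visualization

- The highlight of Tufte's theory: strict minimalism

If a graph conveys a lot of information with a minimum of ink, it is a good graph.
Within a small space and with as little ink as possible, it should give the reader many insights in a short time.

- Data-ink ratio: (ink used to represent the data) / (total ink used to print the graphic)

- Chartjunk (Nigel Holmes's graphs)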

“Lurking behind chartjunk is contempt both for information and for the audience. Chartjunk promoters imagine that numbers and details are boring, dull, and tedious, requiring ornament to enliven. Cosmetic decoration, which frequently distorts the data, will never salvage an underlying lack of content. If the numbers are boring, then you’ve got the wrong numbers (…) Worse is contempt for our audience, designing as if readers were obtuse and uncaring. In fact, consumers of graphics are often more intelligent about the information at hand than those who fabricate the data decoration (…) The operating moral premise of information design should be that our readers are alert and caring; they may be busy, eager to get on with it, but they are not stupid.”

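Chartjunk = contempt for the audience + an insult to the data
Chartjunk promoters seem to think that numbers and data are boring and need livening up..

- A poor graph (left) / an excellent graph (right)

- My take: hmm, I am not so sure…

Charles Minard's chart

The greatest visualization in human history

- Tufte's assessment

It may well be the best statistical graphic ever drawn.
It shows the size of the army, its location on a two-dimensional plane, the direction of its movement, several dates during the retreat from Moscow, and the temperature \(\to\) six variables in all
One might draw a picture like this once in a million tries, but there are no principles for how to make such a wonderful graphic. \(\to\) minimalism..

- Why is it an excellent graph?

Until recently, techniques for understanding data relied on scatter plots, bar charts, and line plots
The drawback of these plots is that high-dimensional data is hard to analyze with them
Instead of drawing several figures, Minard chose to add more panels to a single figure.

Why is it hard to draw like Minard?

- Weight, height, sex, nationality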

df1=pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/male1.csv')
df2=pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/male2.csv')  
df3=pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/female.csv') 
df4=pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/foreign.csv')

- Minard's approach

_df = pd.concat([pd.concat([df1,df2],axis=1).assign(g='m'),df3.assign(g='f')])
df = pd.concat([_df.assign(g2='korea'),df4.assign(g2='foreign')]).reset_index(drop=True)
df

	w	h	g	g2
0	72.788217	183.486773	m	korea
1	66.606430	173.599877	m	korea
2	69.806324	173.237903	m	korea
3	67.449439	173.223805	m	korea
4	70.463183	174.931946	m	korea
...	...	...	...	...
1525	78.154632	188.324350	m	foreign
1526	74.754308	183.017979	f	foreign
1527	91.196208	190.100456	m	foreign
1528	87.770394	187.987255	m	foreign
1529	88.021995	193.456798	m	foreign

1530 rows × 4 columns

sns.scatterplot(data=df,x='w',y='h',hue='g',style='g2')

<AxesSubplot:xlabel='w', ylabel='h'>

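- Difficulties: (1) without the design sense, one does not think of using hue/style to separate the groups (2) one does not think of arranging the data as a long df (= tidy data) (3) one does not know the code for reshaping the data into a long df

1. Lack of planning sense -> look at many excellent visualizations
2. Insufficient understanding of data frames -> learn the concept of tidy data
3. Programming skill -> study coding hard (you need to be very good at pandas)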
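
Point (3) above, reshaping the pieces into one long data frame, can be sketched with pandas alone. The small frames below are hypothetical stand-ins with made-up numbers; they are not the course CSV files. The sketch assumes, as the concat with axis=1 above suggests, that the two male files each hold one column. The column names w, h, g, g2 follow the data frame printed earlier.

```python
import pandas as pd

# Hypothetical stand-ins for the course files, with made-up numbers
# (w = weight, h = height).
male_w = pd.DataFrame({'w': [72.8, 66.6]})      # plays the role of df1
male_h = pd.DataFrame({'h': [183.5, 173.6]})    # plays the role of df2
female = pd.DataFrame({'w': [55.0, 60.2], 'h': [160.1, 165.3]})
foreign = pd.DataFrame({'w': [78.2, 74.8], 'h': [188.3, 183.0], 'g': ['m', 'f']})

# 1) glue the two male columns side by side, then label the group
male = pd.concat([male_w, male_h], axis=1).assign(g='m')
# 2) stack males and females vertically; every row carries its own label
korea = pd.concat([male, female.assign(g='f')]).assign(g2='korea')
# 3) stack the foreign rows and rebuild a clean 0..n-1 index
tidy = pd.concat([korea, foreign.assign(g2='foreign')]).reset_index(drop=True)
print(tidy)
```

Each row of tidy is one person and each column is one variable, which is exactly the shape that hue= and style= in sns.scatterplot expect.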

read mpg data

- ref: https://r4ds.had.co.nz/index.html

Method 1: rpy2 (do not try this unless you are on Colab)

import rpy2
%load_ext rpy2.ipython

%%R 
### Inside this cell you can write code just as in R.
a <- c(1,2,3) 
a+1

[1] 2 3 4

a

NameError: name 'a' is not defined

%%R 
library(tidyverse)
mpg

# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# … with 224 more rows
# ℹ Use `print(n = ...)` to see more rows

mpg

NameError: name 'mpg' is not defined

%R -o mpg # the data held in R is passed over to Python

mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
230	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
231	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
232	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
233	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
234	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

Method 2: get the data from a saved csv file

mpg.to_csv("mpg.csv",index=False)

pd.read_csv("mpg.csv")

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

Method 3: read a csv published on GitHub or a similar site

pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/mpg.csv')

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

- Keeping a separate GitHub repository just for data is also a good approach.
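
Methods 2 and 3 combine naturally: read the local copy when it exists, otherwise download once and save it. The following is a minimal sketch; load_csv_cached is a hypothetical helper name, not something pandas provides.

```python
import os
import pandas as pd

def load_csv_cached(url, path):
    """Read `path` if it exists; otherwise read `url` and save a copy to `path`.

    Hypothetical helper for illustration, not part of pandas.
    """
    if os.path.exists(path):
        return pd.read_csv(path)      # method 2: local csv
    df = pd.read_csv(url)             # method 3: csv published online
    df.to_csv(path, index=False)      # keep a copy for next time
    return df

# Example call (needs network access the first time it runs):
# mpg = load_csv_cached(
#     'https://raw.githubusercontent.com/guebin/DV2022/master/posts/mpg.csv',
#     'mpg.csv')
```

The index=False argument matters here for the same reason as in method 2: without it, the saved file gains an extra unnamed index column when it is read back.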

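Data description

- displ: engine displacement (engine size) of the car, in litres

- hwy: fuel efficiency on the highway, in miles per gallon; how far the car goes on the same amount of fuel

- For details, look up ?mpg in R yourself.

Scatter plots with p9 (2 dimensions)

In Python: scatter plot with plotnine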

ggplot(data=mpg) + geom_point(mapping=aes(x='displ',y='hwy')) ## plotnine

<ggplot: (8726736046009)>

Reading the scatter plot: the larger the engine, the lower the fuel efficiency.
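
The direction of this relationship can also be checked numerically. The sketch below uses only the ten rows of mpg printed earlier in this document (the first five and the last five), so it is a sanity check of the sign, not a summary of the full data. With the full data frame the same call would be mpg['displ'].corr(mpg['hwy']).

```python
import pandas as pd

# The ten rows of mpg printed earlier in this document (first five, last five).
sample = pd.DataFrame({
    'displ': [1.8, 1.8, 2.0, 2.0, 2.8, 2.0, 2.0, 2.8, 2.8, 3.6],
    'hwy':   [29, 29, 31, 30, 26, 28, 29, 26, 26, 26],
})

# Pearson correlation: a negative value means hwy tends to fall as displ rises.
r = sample['displ'].corr(sample['hwy'])
print(round(r, 2))
```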

- Quick version: data= and mapping= can be omitted

ggplot(mpg) + geom_point(aes(x='displ',y='hwy')) ## plotnine

<ggplot: (8726735544581)>

In R: scatter plot with ggplot2

- R uses almost the same syntax (just drop the quotes around the column names of the data frame or tibble!)

%%R -w 800
ggplot(mpg) + geom_point(aes(x=displ,y=hwy)) ## ggplot2

In Python: building the scatter plot in an object-oriented style

step1: Prepare the canvas.

fig = ggplot(data=mpg)
fig

<ggplot: (8726735085529)>

step2: Set up the mapping between variables and aesthetics.

a1= aes(x='displ',y='hwy')
a1

{'x': 'displ', 'y': 'hwy'}

step3: Create the set of points, that is, the point geom.

point1=geom_point(mapping=a1)

geom_point(): draw the points! How?
By looking at the table set up in a1

step4: Combine the canvas and the geom.
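
The idea that a geom "looks at the table" can be made concrete with plain pandas. This is only a conceptual sketch: resolve is a hypothetical helper, and plotnine's real internals are more involved. The three rows are copied from the mpg printout above.

```python
import pandas as pd

# Three rows copied from the mpg printout above.
df = pd.DataFrame({'displ': [1.8, 2.0, 2.8], 'hwy': [29, 31, 26]})

# An aesthetic mapping is essentially a lookup table: aesthetic -> column name.
mapping = {'x': 'displ', 'y': 'hwy'}

def resolve(mapping, data):
    """Replace each column name in `mapping` by that column's values.

    Conceptual helper (hypothetical name), not plotnine's actual code.
    """
    return {aesthetic: data[column].tolist()
            for aesthetic, column in mapping.items()}

points = resolve(mapping, df)
print(points)  # {'x': [1.8, 2.0, 2.8], 'y': [29, 31, 26]}
```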

fig+point1

<ggplot: (8726775447877)>

Scatter plots with p9 (3 dimensions)

- Look at the data again

mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact

- Showing class on the plot as well should make the data easier to explore.

Scatter plot + point size

ggplot(data=mpg) + geom_point(mapping = aes(x='displ',y='hwy',size='class'))

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.

<ggplot: (8726734563561)>

Scatter plot + transparency

ggplot(data=mpg) + geom_point(mapping = aes(x='displ',y='hwy',alpha='class'))

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_alpha.py:70: PlotnineWarning: Using alpha for a discrete variable is not advised.

<ggplot: (8726734989121)>

Scatter plot + transparency and point size together

ggplot(data=mpg) + geom_point(mapping = aes(x='displ',y='hwy',alpha='class',size='class'))

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_alpha.py:70: PlotnineWarning: Using alpha for a discrete variable is not advised.
/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.

<ggplot: (8726734522405)>

Scatter plot + shape

ggplot(data=mpg) + geom_point(mapping = aes(x='displ',y='hwy',shape='class'))

<ggplot: (8726734265229)>

Scatter plot + color

ggplot(data=mpg) + geom_point(mapping = aes(x='displ',y='hwy',color='class'))

<ggplot: (8726734017473)>

In an object-oriented style?

a2 = aes(x='displ', y='hwy', color='class')

a1,a2

({'x': 'displ', 'y': 'hwy'}, {'x': 'displ', 'y': 'hwy', 'color': 'class'})

point2=geom_point(a2)

fig+point2

<ggplot: (8726733712885)>

Scatter plot + color + fitted line

- First, practice with the point geom that has no color

fig+point1

<ggplot: (8726733452617)>

line1 = geom_smooth(a1)

fig+point1+line1

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726732994973)>

- point1 (the point geom without color) can be swapped for point2 (the point geom with color) at any time!

fig+point2+line1

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726732661565)>

- Drawing everything in a single command

ggplot(data=mpg) + \
geom_point(mapping=aes(x='displ',y='hwy',color='class')) + \
geom_smooth(mapping=aes(x='displ',y='hwy'))

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726732727485)>

- Mapping rules shared by all geoms can be moved into ggplot(). (They are declared once, where the figure is declared.)

ggplot(data=mpg,mapping=aes(x='displ',y='hwy')) + \
geom_point(mapping=aes(color='class')) + \
geom_smooth()

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726733489953)>
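
One way to picture what happens here: the mapping given to ggplot() acts as a set of defaults, and each geom adds its own entries on top. The sketch below models this with plain dictionaries. It is a conceptual model only; effective_mapping is a hypothetical name and plotnine's internal code may differ.

```python
# Plot-level mapping, declared once in ggplot(...)
plot_mapping = {'x': 'displ', 'y': 'hwy'}
# Layer-level mapping, declared inside geom_point(...)
point_mapping = {'color': 'class'}

def effective_mapping(plot_mapping, layer_mapping=None):
    """Plot-level entries act as defaults; layer-level entries are added on top."""
    return {**plot_mapping, **(layer_mapping or {})}

print(effective_mapping(plot_mapping, point_mapping))  # used by the point layer
print(effective_mapping(plot_mapping))                 # used by the smooth layer
```

This is why geom_smooth() with no arguments still knows which columns to use, and why only the points are colored by class.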

- In R, geom_smooth() also shows the confidence interval.

%%R -w 800
ggplot(data=mpg,mapping=aes(x=displ,y=hwy)) + geom_point(mapping=aes(color=class)) + geom_smooth()

R[write to console]: `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Scatter plots with p9 (4 dimensions)

- Let's look at the data.

mpg.head()

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
2	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
3	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
4	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
5	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact

Scatter plot + point size + color

- We want to visualize the data by drv (front-wheel, rear-wheel, four-wheel drive).

ggplot(data=mpg, mapping=aes(x='displ',y='hwy')) + geom_point(mapping=aes(size='class',color='drv'),alpha=0.3)

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.

<ggplot: (8726731152845)>

At almost every \(x\), most of the red points lie below the green and purple points \(\to\) four-wheel drive has worse fuel economy
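
A group mean gives a rough numerical counterpart to this reading of the plot. The sketch below uses only the ten rows shown in the tibble printout earlier (all audi models), and it ignores displ, so it is a crude check and not evidence for the full data set. With the full data frame the same call would be mpg.groupby('drv')['hwy'].mean().

```python
import pandas as pd

# The ten rows shown in the tibble printout earlier (all audi models).
sample = pd.DataFrame({
    'drv': ['f', 'f', 'f', 'f', 'f', 'f', 'f', '4', '4', '4'],
    'hwy': [29, 29, 31, 30, 26, 26, 27, 26, 25, 28],
})

# Mean highway mileage per drive type.
mean_hwy = sample.groupby('drv')['hwy'].mean()
print(mean_hwy.round(2))
```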

Scatter plot + point size + color (object-oriented version)

- Mapping rules

a1,a2

({'x': 'displ', 'y': 'hwy'}, {'x': 'displ', 'y': 'hwy', 'color': 'class'})

a3 = a2.copy()

a3['color'] = 'drv'
a3['size'] = 'class'
a3

{'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'}

Declaring it as below also works

a3= aes(x='displ',y='hwy',color='drv',size='class')

point3=geom_point(a3)

fig+point3

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.

<ggplot: (8726731065581)>

It would be nice to adjust the overall transparency of the figure

point3=geom_point(a3,alpha=0.2)
fig+point3

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.

<ggplot: (8726730819657)>

Scatter plot + point size + color + a line

fig+point3+line1

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726730575253)>

Scatter plot + point size + color + one line per drv

- Mapping rules

a1,a2,a3

({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'})

a4 = a2.copy() 
a4['color']='drv'
a4

{'x': 'displ', 'y': 'hwy', 'color': 'drv'}

line2 = geom_smooth(a4)

fig + point3 +line2

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726729919385)>

- What if we want to keep the line color the same and show drv through the line type instead?

a1,a2,a3,a4

({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv'})

a5=a1.copy()
a5['linetype']='drv' 
a5

{'x': 'displ', 'y': 'hwy', 'linetype': 'drv'}

line3 = geom_smooth(a5,size=0.5,color='gray')

fig+point3+line3

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726732637457)>

- What if we also want to add an overall trend line?

fig+point3+line3+line1

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726732939513)>

- Having drawn it, the per-drv trend lines are better distinguished by color after all.

line2 = geom_smooth(a4,size=0.5,linetype='dashed')
fig+point3+line2+line1

/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/scales/scale_size.py:50: PlotnineWarning: Using size for a discrete variable is not advised.
/home/cgb4/anaconda3/envs/py37/lib/python3.7/site-packages/plotnine/stats/smoothers.py:311: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.

<ggplot: (8726733678229)>

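- There are many tools for representing high-dimensional variables.

Scatter plot (point geom): point size, point shape, point color, point transparency
Line plot (smooth geom, line geom): line type, line color, line width

Conclusion

- With enough practice, we too can draw many kinds of high-dimensional graphs. (Just like Minard)

- Hadley Wickham appears to have organized these methods systematically.

- Hadley Wickham: a graph can be drawn as a combination of seven components: data + geom + mapping (the link between variables and aesthetics) + stat (statistics) + position + axes (coordinate system) + facet grid.

My take: handling geoms and mappings well is already enough to draw a very wide variety of graphs.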